Host falling back to CPU processing running v6.08 cuda and ATI device 1 taking far more time
Author | Message |
---|---|
Fred J. Verster Send message Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
This MB WU. And 1 (NVIDIA GPU) wrongly labeled as Anonymous Platform NVIDIA GPU, Result ID 993030463. Also, Device 1 of my ATI 5870 GPUs is slower and has a lower load than device 0. Both are in PCIe 2.0 16x; PCIe 2.0 x8 mode. I can't find an explanation for why it's slower and has a lower load. |
LadyL Send message Joined: 14 Sep 11 Posts: 1679 Credit: 5,230,097 RAC: 0 |
Fred J. Verster wrote: "This MB WU." It happens - he's running a 295.x driver, so it will be the monitor sleep bug. Fred J. Verster wrote: "And 1 (NVIDIA GPU) wrongly mentioned as Anonymous Platform NVIDIA GPU," Bingo. You've found me another example of a bug I've been chasing: showing NV but has run as CPU. For some reason tasks are getting the wrong label on the website list. So it's not limited to one host but is something general going on. If anybody else sees wrongly attributed tasks, please link the host. It still needs figuring out whether it's a general server-side bug or limited to something like BOINC 7 clients or anonymous platform. Fred J. Verster wrote: "And Device 1 of my ATI 5870s GPUs is slower and has a lower load as" No idea. One for the ATI gurus or Raistmer. I'm not the Pope. I don't speak Ex Cathedra! |
Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
Is it possible that this is a lost GPU task that was resent to the CPU but not correctly relabeled? (just thinking out loud...) |
LadyL Send message Joined: 14 Sep 11 Posts: 1679 Credit: 5,230,097 RAC: 0 |
Yes, it might be another side effect of the scheduler change/bug that is causing tasks to be 'resent' even though they are there. But it needs somebody who is seeing tasks being mislabeled on their host to run with the <sched_op_debug> log flag, so we know what the client has requested and received, and can then compare that to what the server thinks it did. I'm not the Pope. I don't speak Ex Cathedra! |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Could this be a possible mechanism? We all know that when a task is genuinely lost, and resent, it can be scheduled to a different resource from the one originally allocated - like the perennial classic of the VLAR assigned to CPU, lost, then reallocated to GPU, which keeps tripping people up. But that's for a genuine resend, where the client receives and acts upon the second allocation (not the vlar example, obviously). But as jravin posted in Unannounced Server-Side Change?, there's an active bug which causes tasks to be resent when they are not lost. According to jravin's log, the second assignment is rejected as an error, because the host already has the task. Presumably, it'll get processed as originally allocated, the first time round - but possibly the website has been updated in the meantime to reflect the attempted second assignment. |
Fred J. Verster Send message Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
|
tbret Send message Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40 |
Thanx all for your explanations, I'll check on my ATI host almost daily and I've had several of these lately, two different computers, two different manifestations. A work unit marked "ATI" has been completed on an nVidia card and the CPU and I just found one marked for the CPU that was done on an nVidia card (and this was a second computer). In the first case I thought it might be because of a mixed environment, like you have, both ATI and nVidia in the computer. In the second case, there are only nVidia cards and the CPU. In one case I'm running 7.0.x and in the other 6.10.60. I've also been getting odd strings of identical completion times in clusters of work units. The CPU times are very different, so the work obviously isn't identical. (i.e., something seems to be assigning work-times, rather than measuring them) So, it isn't the result of the mixed GPU environment and it isn't a consequence of updating to version 7.x.x of BOINC. Oh, and each of these two machines is running a (slightly) different Lunatics version and different nVidia driver version. Sounds like something server-side to me. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
"Could this be a possible mechanism..." Well, it seemed to work. Task 2456405740, work unit 998156827, sent 28 May 2012, 9:06:51 UTC, reported 28 May 2012, 9:12:01 UTC, completed and validated, run time 190.09 s, CPU time 11.03 s, credit 5.87, application SETI@home Enhanced (from Valid tasks for computer 4292666). Those tasks were actually issued on 25 May, and I had already long since computed them on NVidia GPU before I allowed reporting to take place. I had around 140 tasks to report, so they were taken as two sets of 64 and then the remainder. Each set of 64 generated a 'resend lost results' event, and I made sure that one of them was a CPU-only request. Another clue, if any were needed: the Lunatics CPU apps are good, but even they can't complete a task in 190 seconds elapsed / 12 seconds CPU. In short, there was absolutely nothing wrong with the processing of these WUs on my machine: the only problems are the 'Sent' datestamp and the 'Application' name shown on the website. In the long term, that might mess up runtime estimation and hence credit - I'll report it again. |
Fred J. Verster Send message Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
"Could this be a possible mechanism..." Well, you're right about runtime estimation and thus credit. A look at elapsed and CPU time makes clear it wasn't computed by the CPU! (This hot weather forces me to downclock both CPU (Q6600) and GPU (GTX470); yesterday I found the host CPU at 109C and the GPU at 100C. It'll throttle down at 110C, the CPU that is.) |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
I found another set showing the 'identical runtime' syndrome: task 2456388857, work unit 998148573, sent 28 May 2012, 9:06:51 UTC, reported 28 May 2012, 9:12:01 UTC, completed, waiting for validation, run time 310.00 s, CPU time 29.52 s, credit pending, application SETI@home Enhanced - all showing 310 seconds exactly. According to the starting/finished entries in my message log, they ran for 724, 704, and 373 seconds respectively. Edit: on second thoughts, cancel that - panic over. I've just realised what it might be. Look at the 'Sent' and 'Time reported' columns - 9:06:51 and 9:12:01 respectively. What's the difference between them? Yup, 310 seconds exactly. I think there's an anti-cheat mechanism in place which means you can't claim a runtime which is greater than the length of time the task was out in the field. That one's definitely going to hurt credit. |
Wedge009 Send message Joined: 3 Apr 99 Posts: 451 Credit: 431,396,357 RAC: 553 |
Fred J. Verster wrote: "And Device 1 of my ATI 5870s GPUs is slower and has a lower load as" All I can think of is that their respective WUs may have different blanking percentages. As I understand it, blanking has a substantial impact on GPU load and overall WU processing time. Of course, if you've already considered that, I can't think of anything else right now. LadyL wrote: "For some reason tasks are having the wrong label on the website list." I often reschedule VLAR WUs from ATI GPU to CPU (I know the slowdown is not as severe on ATI GPU as it is for NV GPU). Those tasks are still listed as ATI WUs on the site, and having them processed by the CPU seems to adversely affect the DCF for my ATI WUs, too. Don't know if you've considered this already, but it's a thought. Soli Deo Gloria |
tbret Send message Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40 |
I found another set showing the 'identical runtime' syndrome: Maybe my mind is only a very small thing to waste, but I don't understand how that can happen. How can it take longer to crunch than the amount of time you've had the work unit in your "possession?" If the answer is, "It can't," then I understand that much. So we've got a "sent" or "time reported" problem; is that what I understand you to be saying? I'm still getting those "streaks." |
LadyL Send message Joined: 14 Sep 11 Posts: 1679 Credit: 5,230,097 RAC: 0 |
"I found another set showing the 'identical runtime' syndrome:" The webpage gives the time it thinks it sent the task out, i.e. the time of the false resend, at which point you might already have crunched the unit, because it wasn't really a ghost. The time it sticks into runtime is then the time between send and report, if that is smaller than the time reported by the task - that gives the string of identical runtimes. On BOINC 6.12.34 and BOINC 7 this can be mitigated by setting <max_tasks_reported>64</max_tasks_reported> in cc_config.xml. I'm not the Pope. I don't speak Ex Cathedra! |
Fred J. Verster Send message Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
|
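
The 'identical runtime' effect discussed in the thread can be reproduced with a few lines of arithmetic. The following is only a sketch of the behaviour as Richard Haselgrove and LadyL describe it, not the actual server code: the function name `displayed_runtime` is hypothetical, and the clamp rule (show the smaller of the claimed runtime and the sent-to-reported interval) is the thread's inference. The timestamps and runtimes are the ones quoted in the posts.

```python
from datetime import datetime


def displayed_runtime(sent: datetime, reported: datetime, claimed_seconds: float) -> float:
    """Hypothetical clamp: never show more runtime than the task was 'out in the field'."""
    window = (reported - sent).total_seconds()
    return min(claimed_seconds, window)


# 'Sent' and 'Time reported' values quoted in the thread (28 May 2012, UTC).
# 'Sent' is the time of the false resend, not the original issue on 25 May.
sent = datetime(2012, 5, 28, 9, 6, 51)
reported = datetime(2012, 5, 28, 9, 12, 1)

# Runtimes from the message log (724, 704, 373 s) plus the 190.09 s GPU task.
claimed = [724.0, 704.0, 373.0, 190.09]
shown = [displayed_runtime(sent, reported, c) for c in claimed]

print((reported - sent).total_seconds())  # 310.0
print(shown)  # [310.0, 310.0, 310.0, 190.09]
```

Under this assumption, every task that really ran longer than the 310-second window collapses to exactly 310 seconds, while the 190.09-second task is shown unchanged, which matches both rows quoted in the thread.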