Host falling back to CPU processing running v6.08 CUDA and ATI device 1 taking far more time

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1236563 - Posted: 25 May 2012, 15:45:02 UTC

This MB WU.

And one (NVIDIA GPU) task wrongly listed as Anonymous Platform NVIDIA GPU,
Result ID 993030463.


And device 1 of my ATI 5870 GPUs is slower and has a lower load than
device 0.
Both are in PCIe 2.0 x16 slots, in PCIe 2.0 x8 mode.
I can't find an explanation for why it's slower and has a lower load.

ID: 1236563

LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1236604 - Posted: 25 May 2012, 16:45:20 UTC - in response to Message 1236563.  

Fred J. Verster wrote:
This MB WU.

That happens - he's running a 295.x driver, so it will be the monitor sleep bug.

Fred J. Verster wrote:
And one (NVIDIA GPU) task wrongly listed as Anonymous Platform NVIDIA GPU,
Result ID 993030463.


Bingo. You've found me another example of a bug I've been chasing.
It shows as NV but has run as CPU.
For some reason tasks are getting the wrong label on the website list.
So it's not limited to one host; something general is going on.
If anybody else sees wrongly attributed tasks, please link the host.
It still needs figuring out whether it's a general server-side bug or limited to something like BOINC 7 clients or anonymous platform.

Fred J. Verster wrote:
And device 1 of my ATI 5870 GPUs is slower and has a lower load than
device 0.
Both are in PCIe 2.0 x16 slots, in PCIe 2.0 x8 mode.
I can't find an explanation for why it's slower and has a lower load.


No idea. One for the ATI gurus or Raistmer.
I'm not the Pope. I don't speak Ex Cathedra!
ID: 1236604

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1236615 - Posted: 25 May 2012, 16:57:00 UTC - in response to Message 1236604.  


LadyL wrote:
Bingo. You've found me another example of a bug I've been chasing.
It shows as NV but has run as CPU.
For some reason tasks are getting the wrong label on the website list.
So it's not limited to one host; something general is going on.
If anybody else sees wrongly attributed tasks, please link the host.
It still needs figuring out whether it's a general server-side bug or limited to something like BOINC 7 clients or anonymous platform.


Is it possible that this is a lost GPU task that was resent to the CPU but not correctly relabeled?
(just thinking out loud...)
ID: 1236615

LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1236632 - Posted: 25 May 2012, 17:12:09 UTC - in response to Message 1236615.  


LadyL wrote:
Bingo. You've found me another example of a bug I've been chasing.
It shows as NV but has run as CPU.
For some reason tasks are getting the wrong label on the website list.
So it's not limited to one host; something general is going on.
If anybody else sees wrongly attributed tasks, please link the host.
It still needs figuring out whether it's a general server-side bug or limited to something like BOINC 7 clients or anonymous platform.


Horacio wrote:
Is it possible that this is a lost GPU task that was resent to the CPU but not correctly relabeled?
(just thinking out loud...)


Yes, it might be another side effect of the scheduler change/bug that is causing tasks to be 'resent' even though they are still there.

But it needs somebody who is seeing tasks being mislabeled on their host to run with the <sched_op_debug> log flag, so we know what the client has requested and received, and can then compare that to what the server thinks it did.
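
For anyone wanting to try that, a minimal cc_config.xml sketch that turns the flag on would look something like this (put the file in the BOINC data directory, then have the client re-read its config files or restart it):

    <cc_config>
      <log_flags>
        <sched_op_debug>1</sched_op_debug>
      </log_flags>
    </cc_config>
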
I'm not the Pope. I don't speak Ex Cathedra!
ID: 1236632

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1236637 - Posted: 25 May 2012, 17:31:51 UTC

Could this be a possible mechanism?

We all know that when a task is genuinely lost, and resent, it can be scheduled to a different resource from the one originally allocated - like the perennial classic of the VLAR assigned to CPU, lost, then reallocated to GPU, which keeps tripping people up.

But that's for a genuine resend, where the client receives and acts upon the second allocation (not the VLAR example, obviously).

But as jravin posted in the 'Unannounced Server-Side Change?' thread, there's an active bug which causes tasks to be resent when they are not lost.

According to jravin's log, the second assignment is rejected as an error, because the host already has the task. Presumably, it'll get processed as originally allocated, the first time round - but possibly the website has been updated in the meantime to reflect the attempted second assignment.
ID: 1236637

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1238130 - Posted: 27 May 2012, 17:30:48 UTC - in response to Message 1236637.  

Thanks, all, for your explanations. I'll check on my ATI host almost daily and
will watch out for wrong platform names as well ;-)


ID: 1238130

tbret
Volunteer tester
Joined: 28 May 99
Posts: 3380
Credit: 296,162,071
RAC: 40
United States
Message 1238355 - Posted: 28 May 2012, 4:02:12 UTC - in response to Message 1238130.  

Fred J. Verster wrote:
Thanks, all, for your explanations. I'll check on my ATI host almost daily and
will watch out for wrong platform names as well ;-)



I've had several of these lately, two different computers, two different manifestations.

A work unit marked "ATI" has been completed on an nVidia card and the CPU, and I just found one marked for the CPU that was done on an nVidia card (and this was on a second computer).

In the first case I thought it might be because of a mixed environment, like you have, both ATI and nVidia in the computer.

In the second case, there are only nVidia cards and the CPU.

In one case I'm running 7.0.x and in the other 6.10.60. I've also been getting odd strings of identical completion times in clusters of work units. The CPU times are very different, so the work obviously isn't identical. (i.e., something seems to be assigning run times rather than measuring them.)

So, it isn't the result of the mixed GPU environment and it isn't a consequence of updating to version 7.x.x of BOINC.

Oh, and each of these two machines is running a (slightly) different Lunatics version and different nVidia driver version.

Sounds like something server-side to me.
ID: 1238355

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1238441 - Posted: 28 May 2012, 9:49:52 UTC - in response to Message 1236637.  

Could this be a possible mechanism...

Well, it seemed to work.

2456405740 998156827 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed and validated | Run time 190.09 s | CPU time 11.03 s | Credit 5.87 | SETI@home Enhanced, Anonymous platform (CPU)
2456405738 998156842 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed and validated | Run time 190.94 s | CPU time 12.03 s | Credit 39.69 | SETI@home Enhanced, Anonymous platform (CPU)
2456405735 998156831 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed and validated | Run time 191.66 s | CPU time 11.66 s | Credit 24.38 | SETI@home Enhanced, Anonymous platform (CPU)

(from Valid tasks for computer 4292666)

Those tasks were actually issued on 25 May, and I had already long since computed them on NVidia GPU before I allowed reporting to take place. I had around 140 tasks to report, so they were taken as two sets of 64 and then the remainder. Each set of 64 generated a 'resend lost results' event, and I made sure that one of them was a CPU-only request. Another clue, if any were needed: the Lunatics CPU apps are good, but even they can't complete a task in 190 seconds elapsed / 12 seconds CPU.

In short, there was absolutely nothing wrong with the processing of these WUs on my machine: the only problems are the 'Sent' datestamp and the 'Application' name shown on the website.

In the long term, that might mess up runtime estimation and hence credit - I'll report it again.
ID: 1238441

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1238449 - Posted: 28 May 2012, 10:44:36 UTC - in response to Message 1238441.  

Richard Haselgrove wrote:
Could this be a possible mechanism...

Well, it seemed to work.

2456405740 998156827 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed and validated | Run time 190.09 s | CPU time 11.03 s | Credit 5.87 | SETI@home Enhanced, Anonymous platform (CPU)
2456405738 998156842 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed and validated | Run time 190.94 s | CPU time 12.03 s | Credit 39.69 | SETI@home Enhanced, Anonymous platform (CPU)
2456405735 998156831 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed and validated | Run time 191.66 s | CPU time 11.66 s | Credit 24.38 | SETI@home Enhanced, Anonymous platform (CPU)

(from Valid tasks for computer 4292666)

Those tasks were actually issued on 25 May, and I had already long since computed them on NVidia GPU before I allowed reporting to take place. I had around 140 tasks to report, so they were taken as two sets of 64 and then the remainder. Each set of 64 generated a 'resend lost results' event, and I made sure that one of them was a CPU-only request. Another clue, if any were needed: the Lunatics CPU apps are good, but even they can't complete a task in 190 seconds elapsed / 12 seconds CPU.

In short, there was absolutely nothing wrong with the processing of these WUs on my machine: the only problems are the 'Sent' datestamp and the 'Application' name shown on the website.

In the long term, that might mess up runtime estimation and hence credit - I'll report it again.


Well, you're right about runtime estimation and thus credit. A look at elapsed
and CPU time makes it clear it wasn't computed by the CPU!
(This hot weather forces me to downclock both the CPU (Q6600) and the GPU (GTX 470); yesterday I found the host CPU at 109°C! and the GPU at 100°C. It'll throttle down
at 110°C, the CPU that is.)

ID: 1238449

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1238482 - Posted: 28 May 2012, 14:13:22 UTC
Last modified: 28 May 2012, 14:25:57 UTC

I found another set showing the 'identical runtime' syndrome:

2456388857 998148573 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed, waiting for validation | Run time 310.00 s | CPU time 29.52 s | Credit pending | SETI@home Enhanced, Anonymous platform (CPU)
2456383224 998145684 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed, waiting for validation | Run time 310.00 s | CPU time 19.86 s | Credit pending | SETI@home Enhanced, Anonymous platform (CPU)
2456353853 998131663 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed, waiting for validation | Run time 310.00 s | CPU time 17.67 s | Credit pending | SETI@home Enhanced, Anonymous platform (CPU)

- all showing 310 seconds exactly.

According to the starting/finished entries in my message log, they ran for 724, 704, and 373 seconds respectively.

Edit, on second thoughts, cancel that - panic over. I've just realised what it might be.

Look at the 'Sent' and 'Time reported' columns - 9:06:51 and 9:12:01 respectively. What's the difference between them? Yup, 310 seconds exactly. I think there's an anti-cheat mechanism in place which means you can't claim a runtime which is greater than the length of time the task was out in the field. That one's definitely going to hurt credit.
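
A rough sketch of the cap being described here (an illustration of the suspected rule only, not the actual SETI@home server code):

    from datetime import datetime

    def recorded_run_time(claimed_elapsed_s, sent, time_reported):
        # Suspected anti-cheat rule: the recorded run time cannot exceed the
        # interval the task was "out in the field" (Time reported minus Sent).
        out_in_field_s = (time_reported - sent).total_seconds()
        return min(claimed_elapsed_s, out_in_field_s)

    sent = datetime(2012, 5, 28, 9, 6, 51)
    reported = datetime(2012, 5, 28, 9, 12, 1)
    # The task actually ran for 724 s, but only 310 s elapsed between the
    # (false) resend and the report, so 310.00 is what the website records.
    print(recorded_run_time(724.0, sent, reported))  # -> 310.0
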
ID: 1238482

Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 451
Credit: 431,396,357
RAC: 553
Australia
Message 1239238 - Posted: 1 Jun 2012, 2:05:22 UTC - in response to Message 1236604.  
Last modified: 1 Jun 2012, 2:05:56 UTC

Fred J. Verster wrote:
And device 1 of my ATI 5870 GPUs is slower and has a lower load than
device 0.
Both are in PCIe 2.0 x16 slots, in PCIe 2.0 x8 mode.
I can't find an explanation for why it's slower and has a lower load.

All I can think of is that their respective WUs may have different blanking percentages. As I understand it, blanking has a substantial impact on GPU load and overall WU processing time.

Of course, if you've already considered that, I can't think of anything else right now.

LadyL wrote:
For some reason tasks are getting the wrong label on the website list.
So it's not limited to one host; something general is going on.
If anybody else sees wrongly attributed tasks, please link the host.
It still needs figuring out whether it's a general server-side bug or limited to something like BOINC 7 clients or anonymous platform.

I often reschedule VLAR WUs from ATI GPU to CPU (I know the slow-down is not as severe on ATI GPUs as it is on NV GPUs). Those tasks are still listed as ATI WUs on the site, and having them processed by the CPU seems to adversely affect the DCF for my ATI WUs, too.

Don't know if you've considered this already, but it's a thought.
Soli Deo Gloria
ID: 1239238

tbret
Volunteer tester
Joined: 28 May 99
Posts: 3380
Credit: 296,162,071
RAC: 40
United States
Message 1239259 - Posted: 1 Jun 2012, 2:39:56 UTC - in response to Message 1238482.  

Richard Haselgrove wrote:
I found another set showing the 'identical runtime' syndrome:

2456388857 998148573 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed, waiting for validation | Run time 310.00 s | CPU time 29.52 s | Credit pending | SETI@home Enhanced, Anonymous platform (CPU)
2456383224 998145684 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed, waiting for validation | Run time 310.00 s | CPU time 19.86 s | Credit pending | SETI@home Enhanced, Anonymous platform (CPU)
2456353853 998131663 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed, waiting for validation | Run time 310.00 s | CPU time 17.67 s | Credit pending | SETI@home Enhanced, Anonymous platform (CPU)

- all showing 310 seconds exactly.

According to the starting/finished entries in my message log, they ran for 724, 704, and 373 seconds respectively.

Edit, on second thoughts, cancel that - panic over. I've just realised what it might be.

Look at the 'Sent' and 'Time reported' columns - 9:06:51 and 9:12:01 respectively. What's the difference between them? Yup, 310 seconds exactly. I think there's an anti-cheat mechanism in place which means you can't claim a runtime which is greater than the length of time the task was out in the field. That one's definitely going to hurt credit.


Maybe my mind is only a very small thing to waste, but I don't understand how that can happen.

How can it take longer to crunch than the amount of time you've had the work unit in your "possession?"

If the answer is, "It can't," then I understand that much.

So we've got a "sent" or "time reported" problem; is that what I understand you to be saying?

I'm still getting those "streaks."
ID: 1239259

LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1239390 - Posted: 1 Jun 2012, 11:33:08 UTC - in response to Message 1239259.  

Richard Haselgrove wrote:
I found another set showing the 'identical runtime' syndrome:

2456388857 998148573 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed, waiting for validation | Run time 310.00 s | CPU time 29.52 s | Credit pending | SETI@home Enhanced, Anonymous platform (CPU)
2456383224 998145684 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed, waiting for validation | Run time 310.00 s | CPU time 19.86 s | Credit pending | SETI@home Enhanced, Anonymous platform (CPU)
2456353853 998131663 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed, waiting for validation | Run time 310.00 s | CPU time 17.67 s | Credit pending | SETI@home Enhanced, Anonymous platform (CPU)

- all showing 310 seconds exactly.

According to the starting/finished entries in my message log, they ran for 724, 704, and 373 seconds respectively.

Edit, on second thoughts, cancel that - panic over. I've just realised what it might be.

Look at the 'Sent' and 'Time reported' columns - 9:06:51 and 9:12:01 respectively. What's the difference between them? Yup, 310 seconds exactly. I think there's an anti-cheat mechanism in place which means you can't claim a runtime which is greater than the length of time the task was out in the field. That one's definitely going to hurt credit.


tbret wrote:
Maybe my mind is only a very small thing to waste, but I don't understand how that can happen.

How can it take longer to crunch than the amount of time you've had the work unit in your "possession?"

If the answer is, "It can't," then I understand that much.

So we've got a "sent" or "time reported" problem; is that what I understand you to be saying?

I'm still getting those "streaks."


The webpage gives the time it thinks it sent the task out, i.e. the time of the false resend, at which point you might already have crunched the unit because it wasn't really a ghost.
The time it then sticks into the runtime field is the time between send and report; if that is smaller than the time reported by the task, that gives the string of identical runtimes.

On BOINC 6.12.34 and BOINC 7 this can be mitigated by setting <max_tasks_reported>64</max_tasks_reported>
in cc_config.xml.
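
That setting goes in the <options> section of cc_config.xml; a minimal sketch, combined here with the <sched_op_debug> log flag mentioned earlier in the thread:

    <cc_config>
      <log_flags>
        <sched_op_debug>1</sched_op_debug>
      </log_flags>
      <options>
        <max_tasks_reported>64</max_tasks_reported>
      </options>
    </cc_config>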

I'm not the Pope. I don't speak Ex Cathedra!
ID: 1239390

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1239395 - Posted: 1 Jun 2012, 11:56:16 UTC - in response to Message 1239390.  

Thanks for all the answers. I tried MW WUs to see whether both GPUs get the same
load, and they do; SETI MB and AstroPulse WUs are all different.
(What differs is the AR on MB work and the blanking on AstroPulse.)


ID: 1239395