Host falling back to CPU processing running v6.08 CUDA and ATI device 1 taking far more time

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1236563 - Posted: 25 May 2012, 15:45:02 UTC

This MB WU.

And one (NVIDIA GPU) task wrongly listed as Anonymous Platform NVIDIA GPU,
Result ID 993030463.


And device 1 of my ATI 5870 GPUs is slower and has a lower load than
device 0.
Both are in PCIe 2.0 x16 slots, in PCIe 2.0 x8 mode.
I can't find an explanation for why it's slower and has a lower load.

ID: 1236563

LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1236604 - Posted: 25 May 2012, 16:45:20 UTC - in response to Message 1236563.  

Fred J. Verster wrote:
This MB WU.

That happens - he's running a 295.x driver, so it will be the monitor sleep bug.

Fred J. Verster wrote:
And one (NVIDIA GPU) task wrongly listed as Anonymous Platform NVIDIA GPU,
Result ID 993030463.


Bingo. You've found me another example of a bug I've been chasing.
It shows as NV but has run as CPU.
For some reason tasks are getting the wrong label on the website list.
So it's not limited to one host; something general is going on.
If anybody else sees wrongly attributed tasks, please link the host.
It still needs figuring out whether it's a general server-side bug or limited to something like BOINC 7 clients or anonymous platform.

Fred J. Verster wrote:
And device 1 of my ATI 5870 GPUs is slower and has a lower load than
device 0.
Both are in PCIe 2.0 x16 slots, in PCIe 2.0 x8 mode.
I can't find an explanation for why it's slower and has a lower load.


No idea. One for the ATI gurus or Raistmer.
I'm not the Pope. I don't speak Ex Cathedra!
ID: 1236604

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1236615 - Posted: 25 May 2012, 16:57:00 UTC - in response to Message 1236604.  


LadyL wrote:
Bingo. You've found me another example of a bug I've been chasing.
It shows as NV but has run as CPU.
For some reason tasks are getting the wrong label on the website list.
So it's not limited to one host; something general is going on.
If anybody else sees wrongly attributed tasks, please link the host.
It still needs figuring out whether it's a general server-side bug or limited to something like BOINC 7 clients or anonymous platform.


Is it possible that this is a lost GPU task that was resent to the CPU but not correctly relabeled?
(just thinking out loud...)
ID: 1236615

LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1236632 - Posted: 25 May 2012, 17:12:09 UTC - in response to Message 1236615.  


LadyL wrote:
Bingo. You've found me another example of a bug I've been chasing.
It shows as NV but has run as CPU.
For some reason tasks are getting the wrong label on the website list.
So it's not limited to one host; something general is going on.
If anybody else sees wrongly attributed tasks, please link the host.
It still needs figuring out whether it's a general server-side bug or limited to something like BOINC 7 clients or anonymous platform.


Horacio wrote:
Is it possible that this is a lost GPU task that was resent to the CPU but not correctly relabeled?
(just thinking out loud...)


Yes, it might be another side effect of the scheduler change/bug that is causing tasks to be 'resent' even though they are still there.

But it needs somebody who is seeing tasks being mislabeled on their host to run with the <sched_op_debug> log flag, so we know what the client has requested and received, and can then compare that to what the server thinks it did.
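
For anyone wanting to try that, a minimal cc_config.xml sketch that turns the flag on would look something like this (put the file in the BOINC data directory, then have the client re-read its config files or restart it):

    <cc_config>
      <log_flags>
        <sched_op_debug>1</sched_op_debug>
      </log_flags>
    </cc_config>
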
I'm not the Pope. I don't speak Ex Cathedra!
ID: 1236632

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1236637 - Posted: 25 May 2012, 17:31:51 UTC

Could this be a possible mechanism?

We all know that when a task is genuinely lost, and resent, it can be scheduled to a different resource from the one originally allocated - like the perennial classic of the VLAR assigned to CPU, lost, then reallocated to GPU, which keeps tripping people up.

But that's for a genuine resend, where the client receives and acts upon the second allocation (not the VLAR example, obviously).

But as jravin posted in the 'Unannounced Server-Side Change?' thread, there's an active bug which causes tasks to be resent when they are not lost.

According to jravin's log, the second assignment is rejected as an error, because the host already has the task. Presumably, it'll get processed as originally allocated, the first time round - but possibly the website has been updated in the meantime to reflect the attempted second assignment.
ID: 1236637

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1238130 - Posted: 27 May 2012, 17:30:48 UTC - in response to Message 1236637.  

Thanks, all, for your explanations. I'll check on my ATI host almost daily and
will watch out for wrong platform names as well ;-)


ID: 1238130

tbret
Volunteer tester
Joined: 28 May 99
Posts: 3380
Credit: 296,162,071
RAC: 40
United States
Message 1238355 - Posted: 28 May 2012, 4:02:12 UTC - in response to Message 1238130.  

Fred J. Verster wrote:
Thanks, all, for your explanations. I'll check on my ATI host almost daily and
will watch out for wrong platform names as well ;-)



I've had several of these lately, two different computers, two different manifestations.

A work unit marked "ATI" has been completed on an nVidia card and the CPU, and I just found one marked for the CPU that was done on an nVidia card (and this was on a second computer).

In the first case I thought it might be because of a mixed environment, like you have, both ATI and nVidia in the computer.

In the second case, there are only nVidia cards and the CPU.

In one case I'm running 7.0.x and in the other 6.10.60. I've also been getting odd strings of identical completion times in clusters of work units. The CPU times are very different, so the work obviously isn't identical. (i.e., something seems to be assigning run times rather than measuring them.)

So, it isn't the result of the mixed GPU environment and it isn't a consequence of updating to version 7.x.x of BOINC.

Oh, and each of these two machines is running a (slightly) different Lunatics version and different nVidia driver version.

Sounds like something server-side to me.
ID: 1238355

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1238441 - Posted: 28 May 2012, 9:49:52 UTC - in response to Message 1236637.  

Could this be a possible mechanism...

Well, it seemed to work.

2456405740 998156827 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed and validated | Run time 190.09 s | CPU time 11.03 s | Credit 5.87 | SETI@home Enhanced, Anonymous platform (CPU)
2456405738 998156842 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed and validated | Run time 190.94 s | CPU time 12.03 s | Credit 39.69 | SETI@home Enhanced, Anonymous platform (CPU)
2456405735 998156831 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed and validated | Run time 191.66 s | CPU time 11.66 s | Credit 24.38 | SETI@home Enhanced, Anonymous platform (CPU)

(from Valid tasks for computer 4292666)

Those tasks were actually issued on 25 May, and I had already long since computed them on NVidia GPU before I allowed reporting to take place. I had around 140 tasks to report, so they were taken as two sets of 64 and then the remainder. Each set of 64 generated a 'resend lost results' event, and I made sure that one of them was a CPU-only request. Another clue, if any were needed: the Lunatics CPU apps are good, but even they can't complete a task in 190 seconds elapsed / 12 seconds CPU.

In short, there was absolutely nothing wrong with the processing of these WUs on my machine: the only problems are the 'Sent' datestamp and the 'Application' name shown on the website.

In the long term, that might mess up runtime estimation and hence credit - I'll report it again.
ID: 1238441

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1238449 - Posted: 28 May 2012, 10:44:36 UTC - in response to Message 1238441.  

Richard Haselgrove wrote:
Could this be a possible mechanism...

Well, it seemed to work.

2456405740 998156827 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed and validated | Run time 190.09 s | CPU time 11.03 s | Credit 5.87 | SETI@home Enhanced, Anonymous platform (CPU)
2456405738 998156842 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed and validated | Run time 190.94 s | CPU time 12.03 s | Credit 39.69 | SETI@home Enhanced, Anonymous platform (CPU)
2456405735 998156831 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed and validated | Run time 191.66 s | CPU time 11.66 s | Credit 24.38 | SETI@home Enhanced, Anonymous platform (CPU)

(from Valid tasks for computer 4292666)

Those tasks were actually issued on 25 May, and I had already long since computed them on NVidia GPU before I allowed reporting to take place. I had around 140 tasks to report, so they were taken as two sets of 64 and then the remainder. Each set of 64 generated a 'resend lost results' event, and I made sure that one of them was a CPU-only request. Another clue, if any were needed: the Lunatics CPU apps are good, but even they can't complete a task in 190 seconds elapsed / 12 seconds CPU.

In short, there was absolutely nothing wrong with the processing of these WUs on my machine: the only problems are the 'Sent' datestamp and the 'Application' name shown on the website.

In the long term, that might mess up runtime estimation and hence credit - I'll report it again.


Well, you're right about runtime estimation and thus credit. A look at elapsed
and CPU time makes it clear it wasn't computed by the CPU!
(This hot weather forces me to downclock both the CPU (Q6600) and the GPU (GTX 470); yesterday I found the host CPU at 109°C! and the GPU at 100°C. It'll throttle down
at 110°C, the CPU that is.)

ID: 1238449

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1238482 - Posted: 28 May 2012, 14:13:22 UTC
Last modified: 28 May 2012, 14:25:57 UTC

I found another set showing the 'identical runtime' syndrome:

2456388857 998148573 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed, waiting for validation | Run time 310.00 s | CPU time 29.52 s | Credit pending | SETI@home Enhanced, Anonymous platform (CPU)
2456383224 998145684 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed, waiting for validation | Run time 310.00 s | CPU time 19.86 s | Credit pending | SETI@home Enhanced, Anonymous platform (CPU)
2456353853 998131663 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed, waiting for validation | Run time 310.00 s | CPU time 17.67 s | Credit pending | SETI@home Enhanced, Anonymous platform (CPU)

- all showing 310 seconds exactly.

According to the starting/finished entries in my message log, they ran for 724, 704, and 373 seconds respectively.

Edit, on second thoughts, cancel that - panic over. I've just realised what it might be.

Look at the 'Sent' and 'Time reported' columns - 9:06:51 and 9:12:01 respectively. What's the difference between them? Yup, 310 seconds exactly. I think there's an anti-cheat mechanism in place which means you can't claim a runtime which is greater than the length of time the task was out in the field. That one's definitely going to hurt credit.
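
A rough sketch of the cap being described here (an illustration of the suspected rule only, not the actual SETI@home server code):

    from datetime import datetime

    def recorded_run_time(claimed_elapsed_s, sent, time_reported):
        # Suspected anti-cheat rule: the recorded run time cannot exceed the
        # interval the task was "out in the field" (Time reported minus Sent).
        out_in_field_s = (time_reported - sent).total_seconds()
        return min(claimed_elapsed_s, out_in_field_s)

    sent = datetime(2012, 5, 28, 9, 6, 51)
    reported = datetime(2012, 5, 28, 9, 12, 1)
    # The task actually ran for 724 s, but only 310 s elapsed between the
    # (false) resend and the report, so 310.00 is what the website records.
    print(recorded_run_time(724.0, sent, reported))  # -> 310.0
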
ID: 1238482

Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 451
Credit: 431,396,357
RAC: 553
Australia
Message 1239238 - Posted: 1 Jun 2012, 2:05:22 UTC - in response to Message 1236604.  
Last modified: 1 Jun 2012, 2:05:56 UTC

Fred J. Verster wrote:
And device 1 of my ATI 5870 GPUs is slower and has a lower load than
device 0.
Both are in PCIe 2.0 x16 slots, in PCIe 2.0 x8 mode.
I can't find an explanation for why it's slower and has a lower load.

All I can think of is that their respective WUs may have different blanking percentages. As I understand it, blanking has a substantial impact on GPU load and overall WU processing time.

Of course, if you've already considered that, I can't think of anything else right now.

LadyL wrote:
For some reason tasks are getting the wrong label on the website list.
So it's not limited to one host; something general is going on.
If anybody else sees wrongly attributed tasks, please link the host.
It still needs figuring out whether it's a general server-side bug or limited to something like BOINC 7 clients or anonymous platform.

I often reschedule VLAR WUs from ATI GPU to CPU (I know the slow-down is not as severe on ATI GPUs as it is on NV GPUs). Those tasks are still listed as ATI WUs on the site, and having them processed by the CPU seems to adversely affect the DCF for my ATI WUs, too.

Don't know if you've considered this already, but it's a thought.
Soli Deo Gloria
ID: 1239238

tbret
Volunteer tester
Joined: 28 May 99
Posts: 3380
Credit: 296,162,071
RAC: 40
United States
Message 1239259 - Posted: 1 Jun 2012, 2:39:56 UTC - in response to Message 1238482.  

Richard Haselgrove wrote:
I found another set showing the 'identical runtime' syndrome:

2456388857 998148573 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed, waiting for validation | Run time 310.00 s | CPU time 29.52 s | Credit pending | SETI@home Enhanced, Anonymous platform (CPU)
2456383224 998145684 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed, waiting for validation | Run time 310.00 s | CPU time 19.86 s | Credit pending | SETI@home Enhanced, Anonymous platform (CPU)
2456353853 998131663 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed, waiting for validation | Run time 310.00 s | CPU time 17.67 s | Credit pending | SETI@home Enhanced, Anonymous platform (CPU)

- all showing 310 seconds exactly.

According to the starting/finished entries in my message log, they ran for 724, 704, and 373 seconds respectively.

Edit, on second thoughts, cancel that - panic over. I've just realised what it might be.

Look at the 'Sent' and 'Time reported' columns - 9:06:51 and 9:12:01 respectively. What's the difference between them? Yup, 310 seconds exactly. I think there's an anti-cheat mechanism in place which means you can't claim a runtime which is greater than the length of time the task was out in the field. That one's definitely going to hurt credit.


Maybe my mind is only a very small thing to waste, but I don't understand how that can happen.

How can it take longer to crunch than the amount of time you've had the work unit in your "possession?"

If the answer is, "It can't," then I understand that much.

So we've got a "sent" or "time reported" problem; is that what I understand you to be saying?

I'm still getting those "streaks."
ID: 1239259

LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1239390 - Posted: 1 Jun 2012, 11:33:08 UTC - in response to Message 1239259.  

Richard Haselgrove wrote:
I found another set showing the 'identical runtime' syndrome:

2456388857 998148573 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed, waiting for validation | Run time 310.00 s | CPU time 29.52 s | Credit pending | SETI@home Enhanced, Anonymous platform (CPU)
2456383224 998145684 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed, waiting for validation | Run time 310.00 s | CPU time 19.86 s | Credit pending | SETI@home Enhanced, Anonymous platform (CPU)
2456353853 998131663 | Sent 28 May 2012 9:06:51 UTC | Time reported 28 May 2012 9:12:01 UTC | Completed, waiting for validation | Run time 310.00 s | CPU time 17.67 s | Credit pending | SETI@home Enhanced, Anonymous platform (CPU)

- all showing 310 seconds exactly.

According to the starting/finished entries in my message log, they ran for 724, 704, and 373 seconds respectively.

Edit, on second thoughts, cancel that - panic over. I've just realised what it might be.

Look at the 'Sent' and 'Time reported' columns - 9:06:51 and 9:12:01 respectively. What's the difference between them? Yup, 310 seconds exactly. I think there's an anti-cheat mechanism in place which means you can't claim a runtime which is greater than the length of time the task was out in the field. That one's definitely going to hurt credit.


tbret wrote:
Maybe my mind is only a very small thing to waste, but I don't understand how that can happen.

How can it take longer to crunch than the amount of time you've had the work unit in your "possession?"

If the answer is, "It can't," then I understand that much.

So we've got a "sent" or "time reported" problem; is that what I understand you to be saying?

I'm still getting those "streaks."


The webpage gives the time it thinks it sent the task out, i.e. the time of the false resend, at which point you might already have crunched the unit because it wasn't really a ghost.
The time it then sticks into the runtime field is the time between send and report; if that is smaller than the time reported by the task, that gives the string of identical runtimes.

On BOINC 6.12.34 and BOINC 7 this can be mitigated by setting <max_tasks_reported>64</max_tasks_reported>
in cc_config.xml.
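
That setting goes in the <options> section of cc_config.xml; a minimal sketch, combined here with the <sched_op_debug> log flag mentioned earlier in the thread:

    <cc_config>
      <log_flags>
        <sched_op_debug>1</sched_op_debug>
      </log_flags>
      <options>
        <max_tasks_reported>64</max_tasks_reported>
      </options>
    </cc_config>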

I'm not the Pope. I don't speak Ex Cathedra!
ID: 1239390

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1239395 - Posted: 1 Jun 2012, 11:56:16 UTC - in response to Message 1239390.  

Thanks for all the answers. I tried MW WUs to see whether both GPUs get the same
load, and they do; SETI MB and AstroPulse WUs are all different.
(What differs is the AR on MB work and the blanking on AstroPulse.)


ID: 1239395