Task Duration Correction Factor Problem?

Message boards : Number crunching : Task Duration Correction Factor Problem?
Cruncher-American Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor

Send message
Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 954981 - Posted: 15 Dec 2009, 15:28:23 UTC

My secondary machine (4912909: 2 x Opteron 8354 = 8 cores) + GTS 250 has been doing about 120 WUs or so/day for a long time now. My usual work buffer (8 days) is around 1000 WUs. BOINC is 6.6.36.

Suddenly today, my TDCF went from about 0.16 to 0.06 and then to 0.056615, and I started d/l tasks like crazy - now up to over 1700 total; I had to go to "No New Tasks" to stop it. (I did so to keep BOINC from having display problems with long queues (known problem)).

Any idea why this might start to happen? It's not happened before on this machine.
ID: 954981 · Report as offensive
Profile skildude
Avatar

Send message
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 954983 - Posted: 15 Dec 2009, 15:29:49 UTC - in response to Message 954981.  

I'm wondering if you got ahold of some shorties that made your numbers change


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 954983 · Report as offensive
Profile HAL9000
Volunteer tester
Avatar

Send message
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 954985 - Posted: 15 Dec 2009, 15:36:10 UTC - in response to Message 954981.  

My secondary machine (4912909: 2 x Opteron 8354 = 8 cores) + GTS 250 has been doing about 120 WUs or so/day for a long time now. My usual work buffer (8 days) is around 1000 WUs. BOINC is 6.6.36.

Suddenly today, my TDCF went from about 0.16 to 0.06 and then to 0.056615, and I started d/l tasks like crazy - now up to over 1700 total; I had to go to "No New Tasks" to stop it. (I did so to keep BOINC from having display problems with long queues (known problem)).

Any idea why this might start to happen? It's not happened before on this machine.


I've noticed I get this whenever there is a "short storm". Since I just run CPU tasks this doesn't affect me as much, but with GPU tasks, where you are burning through them every few minutes, it scales up a lot.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 954985 · Report as offensive
Cruncher-American Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor

Send message
Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 954986 - Posted: 15 Dec 2009, 15:42:02 UTC - in response to Message 954983.  

I'm wondering if you got ahold of some shorties that made your numbers change


I have been getting shorties for a while, though, as has my other machine, which is "faster" (2x2352 + TWO GTS 250), which has had a TDCF of 0.08. So there seems to be something of an inconsistency? Or am I misreading what TDCF means?
ID: 954986 · Report as offensive
Profile Sutaru Tsureku
Volunteer tester

Send message
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 954991 - Posted: 15 Dec 2009, 15:47:10 UTC - in response to Message 954981.  
Last modified: 15 Dec 2009, 15:48:28 UTC


jravin, I guess you crunched a bunch of shorties, so BOINC thought you had upgraded your PC with faster hardware and downloaded more WUs to fill up your multi-day cache.
Nothing to worry about. That's how BOINC works.

Do you have the flops entries in your app_info.xml?


I would update to BOINC V6.10.18, because V6.6.36 has the CUDA bug.
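
To put rough numbers on the cache effect described at the top of this post, here is a minimal sketch. It assumes every task is identical and that the client's per-task estimate scales linearly with TDCF; both are simplifications, and the function name is made up for illustration. The inputs (8-day buffer, about 120 WUs/day, TDCF 0.16 falling to 0.056615) are the figures from the opening post.

```python
def tasks_to_fill_buffer(buffer_days, tasks_per_day_at_ref, dcf_ref, dcf_now):
    """Tasks the client believes it needs to fill the buffer.

    Hypothetical helper: assumes identical tasks and estimates that
    scale linearly with the duration correction factor.
    """
    # Queue size while the estimates matched reality (DCF at its reference value).
    baseline = buffer_days * tasks_per_day_at_ref
    # A smaller DCF shrinks every estimate, so more tasks appear to fit.
    return baseline * dcf_ref / dcf_now


baseline = tasks_to_fill_buffer(8, 120, 0.16, 0.16)
after_drop = tasks_to_fill_buffer(8, 120, 0.16, 0.056615)
print(round(baseline), round(after_drop))
```

Under those assumptions the same 8-day setting goes from roughly 1,000 tasks to well over 2,500, which matches the direction of the jump reported above, though not its exact size, since downloads were stopped by hand.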




ID: 954991 · Report as offensive
Profile HAL9000
Volunteer tester
Avatar

Send message
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 954994 - Posted: 15 Dec 2009, 15:50:43 UTC - in response to Message 954986.  

I'm wondering if you got ahold of some shorties that made your numbers change


I have been getting shorties for a while, though, as has my other machine, which is "faster" (2x2352 + TWO GTS 250), which has had a TDCF of 0.08. So there seems to be something of an inconsistency? Or am I misreading what TDCF means?


Perhaps the short tasks for the one machine are validating & the other machine is loading up pendings.

This is one of those times where not being able to easily tell what a machine has been doing makes it hard to track issues.
ID: 954994 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 954996 - Posted: 15 Dec 2009, 15:56:37 UTC - in response to Message 954986.  

I'm wondering if you got ahold of some shorties that made your numbers change

I have been getting shorties for a while, though, as has my other machine, which is "faster" (2x2352 + TWO GTS 250), which has had a TDCF of 0.08. So there seems to be something of an inconsistency? Or am I misreading what TDCF means?

If you're using the CUDA 2.3 runtime DLLs, then yes: shorties (VHAR) do crunch particularly efficiently (run time even less than predicted), and that will tend to drive TDCF downwards. If, at the same time, the server happens to be issuing shorties, you will receive a lot of new tasks: the work issued is based on estimated time to run, so the same work request might trigger the receipt of one lengthy task or four/five shorties.

From your initial description, I wondered if your card had encountered an error condition and started returning all tasks as -9 overflow: that can happen, and would have the effect you describe (it can be cured by restarting the computer). But looking through your task list, that doesn't seem to have happened this time.
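
The point above about work being issued by estimated run time can be sketched as follows. This is only an illustration: the real scheduler logic is more involved, `fill_request` is a hypothetical name, and the 2-hour and 25-minute estimates are invented numbers chosen to show the effect.

```python
def fill_request(requested_seconds, estimated_seconds_per_task):
    """Count tasks sent until their summed estimates cover the request."""
    sent = 0
    covered = 0.0
    while covered < requested_seconds:
        covered += estimated_seconds_per_task
        sent += 1
    return sent


# Hypothetical estimates: a lengthy task at 2 hours, a shortie at 25 minutes.
request = 2 * 3600
long_tasks = fill_request(request, 2 * 3600)
shorties = fill_request(request, 25 * 60)
print(long_tasks, shorties)
```

With these made-up figures, the same two-hour request is satisfied by a single lengthy task or by five shorties, so a low TDCF combined with a run of shorties multiplies the task count quickly.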
ID: 954996 · Report as offensive
Profile Gundolf Jahn

Send message
Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 955006 - Posted: 15 Dec 2009, 16:25:34 UTC - in response to Message 954994.  

Perhaps the short tasks for the one machine are validating & the other machine is loading up pendings.

DCF has nothing to do with pendings. The one thing happens on the client and the other on the server, both totally independent of each other.

Regards,
Gundolf
Computers aren't everything in life. (Just kidding)

SETI@home classic workunits 3,758
SETI@home classic CPU time 66,520 hours
ID: 955006 · Report as offensive
1mp0£173
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 955027 - Posted: 15 Dec 2009, 17:39:07 UTC - in response to Message 954981.  

Suddenly today, my TDCF went from about 0.16 to 0.06 and then to 0.056615, and I started d/l tasks like crazy - now up to over 1700 total; I had to go to "No New Tasks" to stop it. (I did so to keep BOINC from having display problems with long queues (known problem)).

Any idea why this might start to happen? It's not happened before on this machine.

The problem is that the amount of "work" in a work unit varies by angle range: not just the number of calculations, but also the types of floating-point instructions.

... and while BOINC considers all of them to be the same, some FLOPs are bigger than others.

That's going to make DCF wiggle a bit.

Lots of noisy work units can do the same thing.

I wouldn't call it a problem as much as an interesting anomaly.
ID: 955027 · Report as offensive
Cruncher-American Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor

Send message
Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 955052 - Posted: 15 Dec 2009, 23:30:35 UTC

@Sutaru - what CUDA bug in 6.6.36? - I haven't noticed any problems with CUDA - except that I rarely get sent any. I use the rescheduler to balance my load.

@Richard - nope, no error condition as you describe. I HAVE had that happen once in a while, but not very recently. Of my 3 GTS 250s (all EVGA), that's the only one that has the problem...

@Ned - thanks for the additional info. But as an anomaly, I'd rather not see it...BOINC becomes unresponsive with > 1500 or so WUs in its queues...perhaps some more sophisticated queue handling might be in order - I'll bet there have been no significant changes for a long time, at least (guessing) since before the Big Crunchers (helloooo, CUDA!) arrived. I'm sure the Boys can do this in their copious free time (<g>).
ID: 955052 · Report as offensive
Profile HAL9000
Volunteer tester
Avatar

Send message
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 955053 - Posted: 15 Dec 2009, 23:30:35 UTC - in response to Message 955006.  

Perhaps the short tasks for the one machine are validating & the other machine is loading up pendings.

DCF has nothing to do with pendings. The one thing happens on the client and the other on the server, both totally independent of each other.

Regards,
Gundolf


*bonks head*

Right, DCF changes on task completion even without connecting to a server.
ID: 955053 · Report as offensive
Claggy
Volunteer tester

Send message
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 955067 - Posted: 15 Dec 2009, 23:52:01 UTC - in response to Message 955052.  

@Sutaru - what CUDA bug in 6.6.36? - I haven't noticed any problems with CUDA - except that I rarely get sent any. I use the rescheduler to balance my load.

With BOINC 6.6.36, if you have a large cache and you get new GPU tasks at the bottom of the cache that are either shorties or otherwise in deadline trouble,
BOINC will start a GPU task that is in deadline trouble, compute it for maybe 10 or 20 seconds,
then switch to the next task in deadline trouble. The problem is that if BOINC 6.6.36 switches GPU tasks before they checkpoint,
BOINC doesn't free up the GPU memory, so all the CUDA tasks then run in CPU fallback mode and take many hours to complete, driving up DCF and making things worse.
The only reliable way to get the GPU to compute again is to restart BOINC (and also lower the cache).
This is fixed in BOINC 6.6.37 and later:

- client: when suspending a GPU job, always remove it from memory, even if it hasn't checkpointed. Otherwise we'll typically run another GPU job right away, and it will bomb out or revert to CPU mode because it can't allocate video RAM
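
The failure mode and the fix quoted above can be illustrated with a toy model. It is not BOINC code: `run_switches` and its parameters are hypothetical, and the memory sizes are invented. It only shows why leaving preempted GPU jobs in memory eventually forces new jobs into CPU mode.

```python
def run_switches(num_switches, vram_mb, task_vram_mb, free_on_suspend):
    """Return the mode ("gpu" or "cpu") each newly started task runs in."""
    in_use = 0
    modes = []
    for _ in range(num_switches):
        if in_use + task_vram_mb <= vram_mb:
            in_use += task_vram_mb
            modes.append("gpu")
            started_on_gpu = True
        else:
            # Cannot allocate video RAM, so the task falls back to the CPU.
            modes.append("cpu")
            started_on_gpu = False
        # The client preempts this task before it has checkpointed.
        if started_on_gpu and free_on_suspend:
            in_use -= task_vram_mb  # behaviour described in the 6.6.37 note
    return modes


# Invented sizes: 512 MB of video RAM, 200 MB per task.
old_behaviour = run_switches(4, 512, 200, free_on_suspend=False)
new_behaviour = run_switches(4, 512, 200, free_on_suspend=True)
print(old_behaviour, new_behaviour)
```

In the model, keeping suspended jobs in memory lets only the first two tasks use the GPU, while freeing memory on every suspend keeps all of them there.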

Claggy
ID: 955067 · Report as offensive
1mp0£173
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 955082 - Posted: 16 Dec 2009, 0:42:15 UTC - in response to Message 955052.  

@Ned - thanks for the additional info. But as an anomaly, I'd rather not see it...

Which is exactly why I chose the word "anomaly" -- because like you, I'd prefer a perfect world.

Unfortunately, after more years than I care to admit working in computers, I've yet to see that perfect world.

But we've also moved to a different problem.

The problem is that computers have evolved to the point where they need to cache more than 1,000 work units, and the BOINC Manager (as I understand the issue) is unhappy over somewhere around 1,000.

I'm pointing that out because even if the DCF issue is fixed, that 1,000 unit problem is still going to sneak up on us.

For now, the only work-around is to cut your cache size in half. That is not a fix.
ID: 955082 · Report as offensive
Cruncher-American Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor

Send message
Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 955099 - Posted: 16 Dec 2009, 1:36:14 UTC

@Claggy - I HAVE noticed something like what you said, but not quite. I do notice that I wind up with a lot of CUDA WUs "Waiting", and the number grows over time (I've seen as many as 40 or so, with one "Running"); but they run for varying amounts of time before being "Waiting", not just a few seconds...
My workaround is to reschedule everything back to the CPU; then the "Waiting" CUDA WUs get finished by BOINC, because there are no other CUDA WUs to start. When they are all done, I reschedule back to my standard mix of 40/60 (CPU/GPU).

Haven't noticed that the crash of CUDA processing I've seen 3 or 4 times is related to this, but I'll try to watch more carefully and see if it looks like it or not when it happens again.

ID: 955099 · Report as offensive
Claggy
Volunteer tester

Send message
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 955101 - Posted: 16 Dec 2009, 1:45:40 UTC - in response to Message 955099.  

@Claggy - I HAVE noticed something like what you said, but not quite. I do notice that I wind up with a lot of CUDA WUs "Waiting", and the number grows over time (I've seen as many as 40 or so, with one "Running"); but they run for varying amounts of time before being "Waiting", not just a few seconds...
My workaround is to reschedule everything back to the CPU; then the "Waiting" CUDA WUs get finished by BOINC, because there are no other CUDA WUs to start. When they are all done, I reschedule back to my standard mix of 40/60 (CPU/GPU).

Haven't noticed that the crash of CUDA processing I've seen 3 or 4 times is related to this, but I'll try to watch more carefully and see if it looks like it or not when it happens again.

What's your DCF at the moment?

I recommend you upgrade Boinc to the latest release candidate, Boinc 6.10.18

Claggy
ID: 955101 · Report as offensive
Cruncher-American Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor

Send message
Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 955107 - Posted: 16 Dec 2009, 2:14:31 UTC - in response to Message 955101.  

What's your DCF at the moment?

I recommend you upgrade Boinc to the latest release candidate, Boinc 6.10.18

Claggy


It's up to 0.2+ now - quite a change. How often does it (I assume BOINC on my machine) recompute the DCF?

Don't think I will upgrade for awhile - I'm actually reasonably happy with it (and also with 6.6.20 on my other machine). Besides, I'm not sure how to upgrade without causing problems for my current queues and the fact that I'm using the Optimized apps. Is there a page that gives complete details for how to upgrade in these circumstances? (I don't want to have to flush or run down my WUs to 0).
ID: 955107 · Report as offensive
1mp0£173
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 955110 - Posted: 16 Dec 2009, 3:10:10 UTC - in response to Message 955107.  

What's your DCF at the moment?

I recommend you upgrade Boinc to the latest release candidate, Boinc 6.10.18

Claggy


It's up to 0.2+ now - quite a change. How often does it (I assume BOINC on my machine) recompute the DCF?

Don't think I will upgrade for awhile - I'm actually reasonably happy with it (and also with 6.6.20 on my other machine). Besides, I'm not sure how to upgrade without causing problems for my current queues and the fact that I'm using the Optimized apps. Is there a page that gives complete details for how to upgrade in these circumstances? (I don't want to have to flush or run down my WUs to 0).

I'm generally running the latest alpha version of BOINC.

I run optimized apps, and I don't dump the cache. I just run the installer.

There is always some risk of losing everything, but I can't remember the last time that happened.
ID: 955110 · Report as offensive
John McLeod VII
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 955136 - Posted: 16 Dec 2009, 5:00:18 UTC - in response to Message 955107.  

What's your DCF at the moment?

I recommend you upgrade Boinc to the latest release candidate, Boinc 6.10.18

Claggy


It's up to 0.2+ now - quite a change. How often does it (I assume BOINC on my machine) recompute the DCF?

Don't think I will upgrade for awhile - I'm actually reasonably happy with it (and also with 6.6.20 on my other machine). Besides, I'm not sure how to upgrade without causing problems for my current queues and the fact that I'm using the Optimized apps. Is there a page that gives complete details for how to upgrade in these circumstances? (I don't want to have to flush or run down my WUs to 0).

DCF is computed at the completion of every task.
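
As a rough picture of what a per-task update could look like, here is a minimal sketch. The rule and its constants (jump up immediately, drift down by 10% per task, clamp to 0.01 to 100) are assumptions made for illustration and are not taken from this thread; `update_dcf` is a hypothetical name, not a BOINC function.

```python
def update_dcf(dcf, actual_seconds, raw_estimate_seconds):
    """One illustrative DCF update after a task finishes."""
    ratio = actual_seconds / raw_estimate_seconds
    if ratio > dcf:
        # Task ran longer than currently predicted: move up at once.
        new = ratio
    else:
        # Task ran shorter than predicted: move only 10% of the way down.
        new = 0.9 * dcf + 0.1 * ratio
    # Keep the factor inside an assumed sane range.
    return min(max(new, 0.01), 100.0)


dcf = 0.16
for _ in range(30):
    # Thirty hypothetical shorties that each finish at 5% of the raw estimate.
    dcf = update_dcf(dcf, 50.0, 1000.0)
print(dcf)
```

Under a rule like this, a run of a few dozen fast tasks is enough to pull the factor well below its old value, and a single slow task pushes it straight back up.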


BOINC WIKI
ID: 955136 · Report as offensive
Claggy
Volunteer tester

Send message
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 955176 - Posted: 16 Dec 2009, 12:34:57 UTC - in response to Message 955107.  

What's your DCF at the moment?

I recommend you upgrade Boinc to the latest release candidate, Boinc 6.10.18

Claggy


It's up to 0.2+ now - quite a change. How often does it (I assume BOINC on my machine) recompute the DCF?

Don't think I will upgrade for awhile - I'm actually reasonably happy with it (and also with 6.6.20 on my other machine). Besides, I'm not sure how to upgrade without causing problems for my current queues and the fact that I'm using the Optimized apps. Is there a page that gives complete details for how to upgrade in these circumstances? (I don't want to have to flush or run down my WUs to 0).

You don't need to worry about your cache or any optimised apps; installing a new version of BOINC doesn't touch them.
Generally it takes a lot less than 5 minutes: I just shut down BOINC, run the installer, then start it up straight away (you don't even need to uninstall the old version).
I've used most of the versions since v5.x.x and rarely have problems upgrading.

Claggy
ID: 955176 · Report as offensive
Cruncher-American Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor

Send message
Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 955188 - Posted: 16 Dec 2009, 13:45:57 UTC

@Claggy - well, I may give it a shot later today (fingers crossed).

Right now, gotta take the family dog for a CAT scan; she may have a cancer in her chest - fingers also crossed.
(Do they do DOG scans on cats?).
ID: 955188 · Report as offensive


 