Task Duration Correction Factor Problem?
Author | Message |
---|---|
Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340 |
My secondary machine (4912909: 2 x Opteron 8354 = 8 cores) + GTS 250 has been doing about 120 WUs or so/day for a long time now. My usual work buffer (8 days) is around 1000 WUs. BOINC is 6.6.36. Suddenly today, my TDCF went from about 0.16 to 0.06 and then to 0.056615, and I started d/l tasks like crazy - now up to over 1700 total; I had to go to "No New Tasks" to stop it. (I did so to keep BOINC from having display problems with long queues (known problem)). Any idea why this might start to happen? It's not happened before on this machine. |
skildude Joined: 4 Oct 00 Posts: 9541 Credit: 50,759,529 RAC: 60 |
I'm wondering if you got ahold of some shorties that made your numbers change. In a rich man's house there is no place to spit but his face. Diogenes Of Sinope |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
"My secondary machine (4912909: 2 x Opteron 8354 = 8 cores) + GTS 250 has been doing about 120 WUs or so/day for a long time now. My usual work buffer (8 days) is around 1000 WUs. BOINC is 6.6.36." I've noticed I get this whenever there is a "short storm". Since I just run CPU tasks this doesn't affect me as much, but with GPU tasks, where you are burning through them every few minutes, it scales up a lot. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu |
Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340 |
"I'm wondering if you got ahold of some shorties that made your numbers change." I have been getting shorties for a while, though, as has my other machine, which is "faster" (2x2352 + TWO GTS 250) and has had a TDCF of 0.08. So there seems to be something of an inconsistency? Or am I misreading what TDCF means? |
Sutaru Tsureku Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
jravin, I guess you crunched a bunch of shorties, so BOINC thought you had upgraded your PC with faster hardware and downloaded more WUs to fill up your cache of several days. Nothing to worry about; that's how BOINC works. Do you have the flops entries in your app_info.xml? I would update to BOINC V6.10.18, because V6.6.36 has the CUDA bug. |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
"I'm wondering if you got ahold of some shorties that made your numbers change." Perhaps the short tasks for the one machine are validating & the other machine is loading up pendings. This is one of those times where not being able to easily tell what a machine has been doing makes it hard to track issues. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
"I'm wondering if you got ahold of some shorties that made your numbers change." If you're using the CUDA 2.3 runtime DLLs, then yes: shorties (VHAR) do crunch particularly efficiently (run time even less than predicted), and that will tend to drive TDCF downwards. If, at the same time, the server happens to be issuing shorties, you will receive a lot of new tasks: the work issued is based on estimated time to run, so the same work request might trigger the receipt of one lengthy task or four/five shorties. From your initial description, I wondered if your card had encountered an error condition and started returning all tasks as -9 overflow: that can happen, and would have the effect you describe (it can be cured by restarting the computer). But looking through your task list, that doesn't seem to have happened this time. |
Gundolf Jahn Joined: 19 Sep 00 Posts: 3184 Credit: 446,358 RAC: 0 |
"Perhaps the short tasks for the one machine are validating & the other machine is loading up pendings." DCF has nothing to do with pendings. The one thing happens on the client and the other on the server, both totally independent of each other. Regards, Gundolf Computers aren't everything in life. (Just kidding) SETI@home classic workunits 3,758 SETI@home classic CPU time 66,520 hours |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
"Suddenly today, my TDCF went from about 0.16 to 0.06 and then to 0.056615, and I started d/l tasks like crazy - now up to over 1700 total; I had to go to "No New Tasks" to stop it. (I did so to keep BOINC from having display problems with long queues (known problem))." The problem is that the amount of "work" in a work unit varies by angle range: not just the number of calculations, but the types of floating point instructions. ... and while BOINC considers all of them to be the same, some FLOPs are bigger than others. That's going to make DCF wiggle a bit. Lots of noisy work units can do the same thing. I wouldn't call it a problem as much as an interesting anomaly. |
Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340 |
@Sutaru - what CUDA bug in 6.6.36? - I haven't noticed any problems with CUDA - except that I rarely get sent any. I use the rescheduler to balance my load. @Richard - nope, no error condition as you describe. I HAVE had that happen once in a while, but not very recently. Of my 3 GTS 250s (all EVGA), that's the only one that has the problem... @Ned - thanks for the additional info. But as an anomaly, I'd rather not see it...BOINC becomes unresponsive with > 1500 or so WUs in its queues...perhaps some more sophisticated queue handling might be in order - I'll bet there have been no significant changes for a long time, at least (guessing) since before the Big Crunchers (helloooo, CUDA!) arrived. I'm sure the Boys can do this in their copious free time (<g>). |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
"Perhaps the short tasks for the one machine are validating & the other machine is loading up pendings." *bonks head* Right, DCF changes on task completion even w/o connecting to a server. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
"@Sutaru - what CUDA bug in 6.6.36? - I haven't noticed any problems with CUDA - except that I rarely get sent any. I use the rescheduler to balance my load." With Boinc 6.6.36, if you have a large cache and you get new GPU tasks at the bottom of the cache that are either shorties or otherwise in deadline trouble, Boinc will start a GPU task in deadline trouble, compute it for maybe 10 or 20 secs, then switch to the next task in deadline trouble. The problem is that if Boinc 6.6.36 switches GPU tasks before they checkpoint, it doesn't free up the GPU memory, so all the Cuda tasks then run in CPU fallback mode and take many hours to complete, driving up DCF and making things worse. The only reliable way to get the GPU to compute again is to restart Boinc (and also lower the cache). This is fixed in Boinc 6.6.37 and later: "- client: when suspending a GPU job, always remove it from memory, even if it hasn't checkpointed. Otherwise we'll typically run another GPU job right away, and it will bomb out or revert to CPU mode because it can't allocate video RAM" Claggy |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
"@Ned - thanks for the additional info. But as an anomaly, I'd rather not see it..." Which is exactly why I chose the word "anomaly" -- because like you, I'd prefer a perfect world. Unfortunately, after more years than I care to admit working in computers, I've yet to see that perfect world. But we've also moved to a different problem. The problem is that computers have evolved to the point where they need to cache more than 1,000 work units, and the BOINC Manager (as I understand the issue) is unhappy over somewhere around 1,000. I'm pointing that out because even if the DCF issue is fixed, that 1,000 unit problem is still going to sneak up on us. For now, the only work-around is to cut your cache size in half. That is not a fix. |
Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340 |
@Claggy - I HAVE noticed something like what you said, but not quite. I do notice that I wind up with a lot of CUDA WUs "Waiting", and the number grows over time (I've seen as many as 40 or so, with one "Running"); but they run for varying amounts of time before being "Waiting", not just a few seconds... My workaround is to reschedule everything back to the CPU; then the "Waiting" CUDA WUs get finished by BOINC, because there are no other CUDA WUs to start. When they are all done, I reschedule back to my standard mix of 40/60 (CPU/GPU). Haven't noticed that the crash of CUDA processing I've seen 3 or 4 times is related to this, but I'll try to watch more carefully and see if it looks like it or not when it happens again. |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
"@Claggy - I HAVE noticed something like what you said, but not quite. I do notice that I wind up with a lot of CUDA WUs "Waiting", and the number grows over time (I've seen as many as 40 or so, with one "Running"); but they run for varying amounts of time before being "Waiting", not just a few seconds..." What's your DCF at the moment? I recommend you upgrade Boinc to the latest release candidate, Boinc 6.10.18. Claggy |
Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340 |
"What's your DCF at the moment?" It's up to 0.2+ now - quite a change. How often does it (I assume BOINC on my machine) recompute the DCF? Don't think I will upgrade for a while - I'm actually reasonably happy with it (and also with 6.6.20 on my other machine). Besides, I'm not sure how to upgrade without causing problems for my current queues, given that I'm using the Optimized apps. Is there a page that gives complete details for how to upgrade in these circumstances? (I don't want to have to flush or run down my WUs to 0). |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
"What's your DCF at the moment?" I'm generally running the latest alpha version of BOINC. I run optimized apps, and I don't dump the cache. I just run the installer. There is always some risk of losing everything, but I can't remember the last time that happened. |
John McLeod VII Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0 |
"What's your DCF at the moment?" DCF is computed at the completion of every task. BOINC WIKI |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
"What's your DCF at the moment?" You don't need to worry about your cache or any optimised apps; installing a new version of Boinc doesn't touch them. Generally it takes a lot less than 5 mins: I just shut down Boinc, run the installer, then start it up straight away (you don't even need to uninstall the old version). I've used most of the versions since v5.x.x, and rarely have problems upgrading. Claggy |
Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340 |
@Claggy - well, I may give it a shot later today (fingers crossed). Right now, gotta take the family dog for a CAT scan; she may have a cancer in her chest - fingers also crossed. (Do they do DOG scans on cats?). |
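
The interaction discussed in this thread (shorties finish faster than predicted and pull TDCF down, a lower TDCF shrinks the client's runtime estimates, and the same buffer setting then holds more tasks) can be sketched in a few lines of Python. This is a simplified model, not BOINC's client code: the smoothing rule in `update_dcf`, its `weight`, the 10-hour raw estimate, and the 9 processing slots (8 cores plus one GPU) are all assumptions chosen for illustration. Only the 8-day buffer and the TDCF values 0.16 and 0.056615 come from the posts above.

```python
SECONDS_PER_DAY = 86_400


def update_dcf(dcf, actual_s, raw_estimate_s, weight=0.1):
    """Blend one finished task's actual/raw-estimate ratio into DCF.

    Called once per completed task. The exponential smoothing and the
    `weight` value are assumptions for this sketch, not BOINC constants.
    """
    ratio = actual_s / raw_estimate_s
    return (1 - weight) * dcf + weight * ratio


def tasks_to_fill_buffer(buffer_days, slots, raw_estimate_s, dcf):
    """Number of tasks that fit in the buffer when each task is expected
    to take raw_estimate_s * dcf seconds on one of `slots` processors."""
    per_task_s = raw_estimate_s * dcf
    return int(buffer_days * SECONDS_PER_DAY * slots / per_task_s)


# Hypothetical inputs: raw estimate of 10 hours per task, 9 slots.
RAW_ESTIMATE_S = 36_000
SLOTS = 9

# A run of shorties that each finish in 5% of the raw estimate
# drags DCF steadily downwards, one task completion at a time.
dcf = 0.16
for _ in range(20):
    dcf = update_dcf(dcf, actual_s=0.05 * RAW_ESTIMATE_S,
                     raw_estimate_s=RAW_ESTIMATE_S)
print(f"DCF after 20 shorties: {dcf:.4f}")

# The same 8-day buffer holds far more tasks once DCF has dropped.
for value in (0.16, 0.056615):
    n = tasks_to_fill_buffer(8, SLOTS, RAW_ESTIMATE_S, value)
    print(f"DCF {value}: about {n} tasks to fill the buffer")
```

With these assumed numbers the buffer holds roughly a thousand tasks at a TDCF of 0.16 and close to three times as many at 0.056615, which matches the direction, though not necessarily the size, of the jump reported in the first post.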
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.