Short estimated runtimes - don't panic
Author | Message |
---|---|
Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
Just out of curiosity, if flops are used then there should be no changes in the estimations? Does it mean that they are playing with the APR? LOL... taking those steps to be able to trust the APR was the intended meaning of "playing with"... AP V6 APR is working very well on my hosts, just a little above the perfect value to keep the DCF at 1 +/-10%, very acceptable anyway. I guess that the filter on the splitters, plus the threshold for the % blanked needed to be used in the APR, did a very good job. But APR for MB is more than 30 times higher than it should be... good to know that is going to be fixed! |
Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
So I was thinking that once one of my already-cached APs finished, the times for the newly-acquired tasks would be adjusted, but they haven't. It appears that I have to complete one of those tasks for it to work. So as I burn through older tasks, the cache is being replenished with more and more low-estimate tasks. I could, and probably should, go with a smaller cache or at least NNT before I start getting close to deadline issues. However, I'm churning through 6 per day, which means with a 25-day deadline, I shouldn't go much past about 140 of them cached. But I think my 1GB disk limit will end up being hit somewhere around 127 or so, so it should work out just fine. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) |
kittyman Send message Joined: 9 Jul 00 Posts: 51474 Credit: 1,018,363,574 RAC: 1,004 |
Great to finally see some movement towards setting things right again! "Time is simply the mechanism that keeps everything from happening all at once." |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Horacio wrote: ... None of your hosts are showing high APRs for MB. In any case, the change to ignore runtimes of result_overflow MB tasks was made last year so there's nothing new to affect those APRs. Joe |
MikeN Send message Joined: 24 Jan 11 Posts: 319 Credit: 64,719,409 RAC: 85 |
Great to finally see some movement towards setting things right again! Agreed, but for my main cruncher the effects today have been rather dramatic. The first GPU WUs it downloaded after the weekly maintenance must have been shorties which (like all shorties on my system) immediately ran in HP mode. As the system corrected their short predicted run times, it also adjusted all the older WUs as well. As a result I now have around 800 MB WUs each with a predicted run time of 10 hours when they will actually take 20 minutes each on my GTX460! The net result has been that the cruncher has been starting and stopping WUs in HP mode all day. At one stage the list of WUs started and suspended because another one was higher priority covered a whole page of the screen. To calm it down I have had to select NNT and suspend all non-started WUs. I will release them a few at a time. Has needed constant manual intervention all day. |
Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
Horacio wrote:... This one, http://setiathome.berkeley.edu/host_app_versions.php?hostid=6187288 S@H enhanced GPU APR is now around 133, the flops value that works is 13 (and still some WUs give a DCF raw_ratio above 2)... Some time ago (GPUs added in October last year) this value was above 300. And in this one http://setiathome.berkeley.edu/host_app_versions.php?hostid=6569691, APR is 46.64 while the working flops value is 10, this host was built this year. OK, right now it's not 30 times greater, but it is still not accurate enough to be able to stop using flops. Anyway, the APR is just one side of this matter. The estimated task sizes of the MB WUs seem to be very accurate (and consistent with the Angle Range) for the GT430, but they fail badly on the 560Ti, in which some tasks estimated as shorties take much more time than expected and vice versa... (But I guess this is somewhat related to the different optimizations that are used with each hardware, and I don't see any way in which the project can help from the server side) |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13822 Credit: 208,696,464 RAC: 304 |
Has needed constant manual intervention all day. I can't see why. It's not going to miss any deadlines so why not just let it go? Grant Darwin NT |
MikeN Send message Joined: 24 Jan 11 Posts: 319 Credit: 64,719,409 RAC: 85 |
Has needed constant manual intervention all day. It had so many suspended GPU WUs that it was running out of GPU memory. I was getting error messages saying WUs were suspended waiting for GPU memory. By manually suspending them and releasing them a few at a time I have managed to get both GPUs in the system running properly again. |
kittyman Send message Joined: 9 Jul 00 Posts: 51474 Credit: 1,018,363,574 RAC: 1,004 |
Has needed constant manual intervention all day. You could try temporarily unchecking the option to leave suspended WUs in memory in either your preferences or in Boinc. I am not positive this works for GPU tasks, but if it does it might allow you to leave Boinc on autopilot. "Time is simply the mechanism that keeps everything from happening all at once." |
HAL9000 Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Has needed constant manual intervention all day. It does work for both CPU & GPU applications. I have had to use that set to disabled as my GT8500 only has 256MB. So no room to run one and hold onto another. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14666 Credit: 200,643,578 RAC: 874 |
Has needed constant manual intervention all day. Really? Which version of BOINC? I thought they'd dropped the (very brief) experiment of keeping GPU apps in video memory when suspended a long time ago. Wait while I look... Edit ... that was easy. The policy for GPU jobs: That's from the changelog for v6.6.12, 4 March 2009. |
Claggy Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Has needed constant manual intervention all day. And Boinc 6.6.37 had (6 July 2009): - client: when suspending a GPU job, always remove it from memory, even if it hasn't checkpointed. Otherwise we'll typically run another GPU job right away, and it will bomb out or revert to CPU mode because it can't allocate video RAM Claggy |
MikeN Send message Joined: 24 Jan 11 Posts: 319 Credit: 64,719,409 RAC: 85 |
Has needed constant manual intervention all day. I had considered that, but like you was not sure if it applied to GPUs or not, and the response from Richard suggests not. I am a bit of a control freak so the NNT and suspend most tasks option worked OK. The system has just finished all the partially started tasks and I have unsuspended another 50 or so GPU tasks. I will leave it on NNT overnight (with enough non-suspended tasks to keep it going) and see what the estimated completion times look like tomorrow. They should return to normal, especially as the CPU is busy crunching APs and so will not interfere. |
HAL9000 Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Has needed constant manual intervention all day. You are in fact correct. I just checked that with 6.12.33 I am using now and the 6.10.48 I was using before. I must have been remembering how it worked in a much older version or had some other issue where there were multiple GPU exes running on my 8500. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Horacio wrote:... 133 GFLOPS corresponds to a run time of around 20 minutes for midrange and 6 minutes for VHAR; your 560ti GPUs are doing tasks at about that speed. With flops at ~1/10 APR, the servers have been able to compensate effectively by scaling rsc_fpops_est down, and the change in allowed ratio to 1/50 doesn't make a difference there. No need to change anything for S@H Enhanced since you're well past the crazy things that happen before the averages are established. But when S@H v7 is released here, getting the <flops> right as early as possible would make it possible to avoid those difficulties. OTOH, those difficulties are typically just uncomfortable rather than damaging, a good excuse for discussions here. And in this one http://setiathome.berkeley.edu/host_app_versions.php?hostid=6569691, APR is 46.64 while the working flops value is 10, this host was built this year. The main problem with a few tasks taking much longer than estimated is that DCF jumps up to predict that all cached tasks are also going to run slowly, and only comes back down slowly. I can just barely understand why "being able to stop using flops" seems like a good thing to some users. My take is that the BOINC core client provides such an extremely conservative flops value that it simply makes sense to provide something better. Those doing only CPU work can get by without <flops> in app_info.xml, but those with CUDA or OpenCL capable GPUs which outperform their CPUs by a factor of 10 or so should be aware that the core client is telling the servers the GPUs are slower than the CPUs if <flops> are not set. After APR is established the servers will compensate, of course, but that period just after release of a new app (or after a host has accidentally been assigned a new hostID) when the estimates are terrible need not be so. Joe |
Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
The main problem with a few tasks taking much longer than estimated is that DCF jumps up to predict that all cached tasks are also going to run slowly, and only comes back down slowly. I was not talking about the estimated times (I don't really pay attention to those numbers as they change due to the DCF variations), I was talking about Estimated Task Sizes. The real crunching time for the GT430 is almost always proportional to the size and also consistent with the Angle Range, but on the 560Ti there are cases in which the real time is so far from the expected time for that size that the raw_ratio (from DCF-debug) boldly goes where no WU went before! I can just barely understand why "being able to stop using flops" seems like a good thing to some users. Just because I'm lazy and updating the optimized apps requires some extra work editing the app_info, but also a lot of work to get the new right value when there are new optimizations and different performances. Of course "being able" doesn't mean "take the option out"... (I like to know that there are lifeboats available, but I prefer not to have to use them... LOL) For hosts running only one project, any error in these estimations has almost no consequences, at most some over/under-fetched cache and just for a while. The "worst" issues come when there are several projects and one of them changes all the estimated times by a factor of 5 (due to DCF raised by just one weird WU), making all the other projects with short deadlines enter panic mode (preventing the offending project from running, which delays the tuning down of the wrong DCF) and then, as in any "panic" situation, everything is messed up until the cops arrive... Anyway, I agree that all this is just an excuse to talk about something here. |
MikeN Send message Joined: 24 Jan 11 Posts: 319 Credit: 64,719,409 RAC: 85 |
Great to finally see some movement towards setting things right again! It now seems that this problem is not related to the predicted short run times. Yesterday my main cruncher downloaded a whole load of updates from Microsoft and has been running strangely ever since. I leave the computer (which is based at work) with the monitor manually turned off but the PC set never to go to sleep and this has not changed. However, when left like this both graphics cards (as of yesterday) hang. They start one WU and after 30 seconds in which no processing is done try a different WU and so on. Overnight last night, they managed to crunch just 4 WUs between them! However, when I remotely monitor the PC using LogMeIn (as I did all day yesterday) then everything works OK. The CPU is not affected by this problem and has been happily crunching APs throughout. One of the Microsoft updates was a new video driver (295.73). I am about to go into work (supposed to be on holiday!) to do a clean reinstall of the Nvidia driver that has worked fine for the last 3 months (270.61). However, if anyone has any other suggestions as to what could be causing this and solutions I would like to hear them. |
Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
Great to finally see some movement towards setting things right again! If I'm not wrong, the drivers installed have a bug which disables the GPUs when the monitor goes to sleep (from the OS). If you set the monitor to never go to sleep you won't need to reinstall drivers (but you will need to turn it off to not waste energy). |
LadyL Send message Joined: 14 Sep 11 Posts: 1679 Credit: 5,230,097 RAC: 0 |
Great to finally see some movement towards setting things right again! setiathome_CUDA: cudaGetDeviceCount() call failed. setiathome_CUDA: No CUDA devices found setiathome_CUDA: Found 0 CUDA device(s): In cudaAcc_initializeDevice(): Boinc passed DevPref 1 setiathome_CUDA: CUDA Device 1 specified, checking... Device cannot be used Cuda device initialisation retry 1 of 6, waiting 5 secs... Cuda error 'Couldn't get cuda device count ' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 146 : no CUDA-capable device is detected. You are on a 295 driver, which has a known bug causing CUDA devices to disappear when the display goes to sleep on DVI connections IIRC. Yes, that driver Microsoft installed is the culprit - either downgrade to 290.x or upgrade to 301.x edit: and as Horacio says, using the display setting to keep them active and turning off the monitor (which you do anyway if I understood correctly) is a workaround. I'm not the Pope. I don't speak Ex Cathedra! |
MikeN Send message Joined: 24 Jan 11 Posts: 319 Credit: 64,719,409 RAC: 85 |
You are on a 295 driver, which has a known bug causing CUDA devices to disappear when the display goes to sleep on DVI connections IIRC. Thanks all for the advice. After a lot of messing around this morning I managed to get the cruncher back to driver 270.61 and now all seems OK again. Just another example of Bill Gates trying to stop us finding ET:)) Now if it will just behave for a few days (been fighting weekend power cuts and overnight GPU crashes for the last few weeks) I should finally make it to 7 million credits, which BoincStats was expecting me to get to yesterday! I have another question. My motherboard has a x16 and a x4 slot for the GPUs. Whilst messing around this morning I noticed that my faster GPU (GTX460) is currently in the x4 slot and the slower GPU (GT430) is in the x16 slot. Would I see a (significant) performance enhancement if I swapped the GPUs over? |
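
Editorial sketch of the DCF behaviour discussed earlier in the thread: Joe notes that a few tasks running much longer than estimated make DCF jump up at once and only come back down slowly, and Horacio describes one odd WU changing all estimates by a factor of 5. The Python below is a minimal illustration of that asymmetry, not the BOINC client's actual code. The 133 GFLOPS figure and the 20-minute run time come from the thread; the function names, the 1.1 and 0.1 thresholds, and the 0.9 and 0.99 weights are assumptions chosen only to show the shape of the behaviour.

```python
# Illustrative sketch only: the 1.1 / 0.1 thresholds and the 0.9 / 0.99
# weights are assumptions, not values quoted in this thread.

FLOPS = 133e9                 # APR from the thread, in floating-point ops/second
FPOPS_EST = 20 * 60 * FLOPS   # hypothetical task size giving a 20-minute estimate


def estimated_runtime(rsc_fpops_est, flops, dcf=1.0):
    """Predicted seconds: task size over device speed, scaled by DCF."""
    return rsc_fpops_est / flops * dcf


def update_dcf(dcf, elapsed, rsc_fpops_est, flops):
    """Return the new DCF after a task that ran for `elapsed` seconds."""
    raw_ratio = elapsed / estimated_runtime(rsc_fpops_est, flops)
    adj_ratio = elapsed / estimated_runtime(rsc_fpops_est, flops, dcf)
    if adj_ratio > 1.1:
        # Task ran clearly longer than predicted: jump straight up.
        return raw_ratio
    if adj_ratio < 0.1:
        # Task finished far earlier than predicted: give it very little weight.
        return 0.99 * dcf + 0.01 * raw_ratio
    # Otherwise drift slowly towards the observed ratio.
    return 0.9 * dcf + 0.1 * raw_ratio


if __name__ == "__main__":
    dcf = update_dcf(1.0, 5 * 20 * 60, FPOPS_EST, FLOPS)  # one task takes 5x
    print(f"after the slow task: DCF = {dcf:.2f}")
    for n in range(1, 6):
        dcf = update_dcf(dcf, 20 * 60, FPOPS_EST, FLOPS)  # normal tasks
        print(f"after {n} normal task(s): DCF = {dcf:.2f}")
```

With these assumed weights, one task at five times its estimate pushes DCF to 5 immediately, and each normal 20-minute task afterwards closes only a tenth of the remaining gap back towards 1, which is why a whole cache can look overdue for a while.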