Message boards :
Number crunching :
Computation errors
| Author | Message |
|---|---|
|
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0
|
A 'brand new' WU, taking much computation time. Split 1 May, gives result_overflow with 1 Gaussian and 30 Triplets on CPU so it should be no surprise CUDA apps bail out with the too many triplets -12 "Unsupported function" exit. The unusual factor is that more than 77% of the processing is done before that Triplet zone is reached. That's not a nice time to throw an error exit. I don't see any -6, and VLARkill certainly wouldn't force one for that AR 0.422886 task. Joe |
Fred J. Verster Send message Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0
|
A 'brand new' WU, taking much computation time. Task 1636695959 Seems to give multiple errors, -6; -12 on my (slowest) CUDA host. Btw, it's the only host with 'enough' tasks for 1 maybe 2 days. Just finished a few AP WU's on the ATI host. (Collatz C. 'has taken over')
|
Spectrum Send message Joined: 14 Jun 99 Posts: 468 Credit: 53,129,336 RAC: 0
|
OK, to save time and confusion I have reset the project on this computer (apologies to my wingmen) and removed and re-installed a clean, latest version of SETI@home. I will see if it behaves for two weeks (the time I am working away from home), then consider installing an optimised app. Something strange was happening, so starting from scratch is the easiest way when you are not so project savvy. Thanks to all for your input, but I just don't have the time to implement the fixes and am not confident enough to start editing the app_info files. Cheers and keep on crunching :)
|
W-K 666 Send message Joined: 18 May 99 Posts: 13797 Credit: 40,757,560 RAC: 151
|
The details for one of the APs d/loaded recently are: `<workunit> <name>ap_31dc09ag_B6_P0_00395_20100624_25036.wu</name> <app_name>astropulse_v505</app_name> <version_num>505</version_num> <rsc_fpops_est>1821052114462310.000000</rsc_fpops_est>` The `<rsc_fpops_est>` is identical for all AP tasks on both computers. So no, I guess the changes for AP have not even been started yet! |
|
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0
|
The one thing that appears not to have been done is "Application Info" for AP tasks. Hmm, I think I see at least 11 validated AP tasks since the new server code was introduced on your Q660 host. If so, the server should have just started adjusting the estimate and bound values. Could you look in the host's client_state.xml and see if the rsc_fpops_est for the most recently downloaded AP tasks is significantly different from 1.821e+15? Probably most of the discrepancy is simply because the host DCF is now much higher, but if server-side has not started to adjust then maybe the whole mechanism is broken for AP rather than just the display of "Application details". Joe |
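
The check Joe asks for can be done by eye, but a short script helps when the cache holds many tasks. The following is a minimal sketch in Python, assuming `client_state.xml` contains `<workunit>` blocks with `<name>`, `<app_name>` and `<rsc_fpops_est>` children as in the excerpt quoted in this thread. The function names and the `astropulse` prefix filter are illustrative choices, not part of BOINC.

```python
import re

# One <workunit>...</workunit> block; DOTALL lets the block span several lines.
WU_RE = re.compile(r"<workunit>(.*?)</workunit>", re.DOTALL)

def _field(block, tag):
    """Return the text of the first <tag>...</tag> in block, or None if absent."""
    m = re.search(rf"<{tag}>([^<]*)</{tag}>", block)
    return m.group(1).strip() if m else None

def fpops_estimates(xml_text, app_prefix="astropulse"):
    """List (workunit name, rsc_fpops_est) for workunits whose app_name starts with app_prefix."""
    out = []
    for block in WU_RE.findall(xml_text):
        app = _field(block, "app_name") or ""
        est = _field(block, "rsc_fpops_est")
        if app.startswith(app_prefix) and est is not None:
            out.append((_field(block, "name"), float(est)))
    return out

def report(path):
    """Print each matching estimate next to the 1.821e+15 baseline mentioned above."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        for name, est in fpops_estimates(fh.read()):
            print(f"{name}: {est:.3e} (ratio to 1.821e+15: {est / 1.821e15:.2f})")
```

Calling `report()` with the path to the host's `client_state.xml` prints one line per AP task; a ratio that stays at 1.00 for freshly downloaded tasks would suggest the server has not adjusted the estimate.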
W-K 666 Send message Joined: 18 May 99 Posts: 13797 Credit: 40,757,560 RAC: 151
|
The one thing that appears not to have been done is "Application Info" for AP tasks. That was meant to be "whilst the GPU has 100's of tasks". |
perryjay Send message Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0
|
I didn't even think of that!! I had to do a detach and when I attached again I got 15 GPU tasks (with their time to completion way out of whack) and one AP. I was wondering why I hadn't been able to get any CPU tasks. PROUD MEMBER OF Team Starfire World BOINC |
W-K 666 Send message Joined: 18 May 99 Posts: 13797 Credit: 40,757,560 RAC: 151
|
The one thing that appears not to have been done is "Application Info" for AP tasks. The AP tasks I have d/loaded in the last few hours have completion estimates of 134 hrs, actual is 12.5 hrs average. Therefore the BOINC client thinks there is enough CPU work and is only requesting GPU tasks. It appears that unless things change soon the CPU will run dry whilst the CPU has 100's of tasks. |
|
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0
|
The set and forget brigade might cause tasks to abort and not know about it. I wish that were fully true. Unfortunately there are people who install optimized applications, apply some tweaks picked up here, then figure they're set for life. It's why I always try to work a mention of the user's obligation to stay in touch into any recommendation to consider going optimized. Joe |
|
Fred W Send message Joined: 13 Jun 99 Posts: 2524 Credit: 11,954,210 RAC: 0
|
The set and forget brigade might cause tasks to abort and not know about it. The "set-and-forget" brigade would not have an app_info file to add a flops entry to, never mind a flops entry. So it shouldn't be a problem. F.
|
|
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0
|
I essentially did the same by reinstalling the lunatics drivers and not editing the app_info.xml, and the computation errors are gone, but how are we going to correct the runtime now ? Under the old system there was just the single Duration Correction Factor (DCF) on each host for all the project applications. Using <flops> in the right ratios allowed the DCF to be reasonably stable, which in turn meant reasonable work fetch, etc. Under the new system the servers keep statistics for each application intended to do the same thing. If the averages kept by the server stabilize, the host DCF should also stabilize (in the vicinity of 1.0). Having <flops> in the app_info.xml will no longer directly affect the server code. On the host, though, <flops> will still affect some things: the Round Robin simulation the host uses to figure out whether it needs to go into High Priority, the estimated run times, and the Maximum Elapsed Time before BOINC kills a task and reports a -177 error. Using a low <flops> for a fast GPU keeps that error from happening (when there is no <flops> the core client uses the CPU Whetstone benchmark instead). But that low <flops> makes the RR sim extremely inaccurate. There may be a compromise setting for GPU <flops> which will work OK; if so, it will be less than the former recommendation but somewhat higher than the CPU flops. Joe |
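
The trade-off Joe describes can be made concrete with a little arithmetic. This is a rough sketch in Python under two assumptions that are not stated in the thread: that the client's run-time estimate is `rsc_fpops_est / flops` scaled by the DCF, and that the Maximum Elapsed Time is `rsc_fpops_bound / flops`. The function names and all numbers in the usage note are hypothetical.

```python
def estimated_runtime_s(rsc_fpops_est, flops, dcf):
    """Assumed client estimate: operations / speed, scaled by the host's DCF."""
    return rsc_fpops_est / flops * dcf

def max_elapsed_s(rsc_fpops_bound, flops):
    """Assumed limit after which the client aborts the task with a -177 error."""
    return rsc_fpops_bound / flops

def would_abort(actual_runtime_s, rsc_fpops_bound, flops):
    """True if a task that really needs actual_runtime_s would exceed the limit."""
    return actual_runtime_s > max_elapsed_s(rsc_fpops_bound, flops)
```

Under these assumptions, a task with a bound of 2e15 operations gets 200000 s at a `<flops>` of 1e10 but only 20000 s at 1e11, so a larger `<flops>` shortens both the estimate and the time allowed. That matches the description above: a low value avoids the abort but stretches the estimates the Round Robin simulation relies on.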
|
FiveHamlet Send message Joined: 5 Oct 99 Posts: 783 Credit: 32,638,578 RAC: 0
|
I had to remove the flops from my app file on 2 rigs; it was causing a CUDA abort error. Both rigs are now working OK again. I only got the info from the threads. The set and forget brigade might cause tasks to abort and not know about it. Dave
|
|
Bernd Noessler Send message Joined: 15 Nov 09 Posts: 99 Credit: 52,635,434 RAC: 0
|
If you have problems with the -177 errors you can try the following. You need an editor which can search for regular expressions; Linux users like me might use the Midnight Commander. Stop BOINC and load client_state.xml into the editor. Search for the regexp `\<rsc_fpops_bound\>.*\<` and replace it with the text `<rsc_fpops_bound>2e15<`. It affects only tasks already in the cache. I've done this on three of my machines and it worked. |
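
For anyone without a regexp-capable editor, the same search-and-replace can be scripted. This is a minimal sketch in Python rather than Bernd's actual procedure: the function names are made up for illustration, the pattern matches the whole element instead of stopping at the next `<`, and `2e15` is simply the value from the post above. As with the manual edit, BOINC should be stopped first and a backup copy of `client_state.xml` kept.

```python
import re

# Matches one complete <rsc_fpops_bound> element, whatever value it holds.
BOUND_RE = re.compile(r"<rsc_fpops_bound>[^<]*</rsc_fpops_bound>")

def raise_fpops_bound(xml_text, new_bound="2e15"):
    """Return (edited text, number of elements replaced)."""
    replacement = f"<rsc_fpops_bound>{new_bound}</rsc_fpops_bound>"
    return BOUND_RE.subn(replacement, xml_text)

def patch_file(path, new_bound="2e15"):
    """Rewrite the file in place; hypothetical helper, run only while BOINC is stopped."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        text = fh.read()
    patched, count = raise_fpops_bound(text, new_bound)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(patched)
    return count
```

Like the manual search-and-replace, this overwrites every bound in the file, including any that were already larger than the new value, and it only touches tasks already in the cache.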
Geek@Play Send message Joined: 31 Jul 01 Posts: 2467 Credit: 86,146,931 RAC: 0
|
Changes were done on the server side software where the servers would provide the clients with the necessary flops calculation. The server is also supposed to get the DCF to a value of 1 or near that. Still some bugs to squash on this server software. This is working fine on 2 of the 4 client computers I have. The other two are throwing -177 fault codes often. Because of the new server code any flops in the app_info file should probably be removed. Boinc....Boinc....Boinc....Boinc.... |
|
Bearcat Send message Joined: 10 Sep 99 Posts: 106 Credit: 10,778,506 RAC: 0
|
I essentially did the same by reinstalling the lunatics drivers and not editing the app_info.xml, and the computation errors are gone, but how are we going to correct the runtime now ? The runtime doesn't correct itself enough, which was the reason why we had to put the flops entry in there to begin with. Either that or increase the cache to twice the necessary days to compensate. Has anything changed lately that causes this not to work anymore ? |
Wiggo "Democratic Socialist" Send message Joined: 24 Jan 00 Posts: 18404 Credit: 261,360,520 RAC: 1,109
|
I essentially did the same by reinstalling the lunatics drivers and not editing the app_info.xml, and the computation errors are gone, but how are we going to correct the runtime now ? The runtime should correct itself soon enough as some of my pc's which suddenly had longer predicted w/u times did so within a few hrs. |
|
Bearcat Send message Joined: 10 Sep 99 Posts: 106 Credit: 10,778,506 RAC: 0
|
I removed the "flops" section from the app_info because it "terminated" all my CUDA WU's with calculation errors after 17m32s. I essentially did the same by reinstalling the lunatics drivers and not editing the app_info.xml, and the computation errors are gone, but how are we going to correct the runtime now ? |
|
TheFreshPrince a.k.a. BlueTooth76 Send message Joined: 4 Jun 99 Posts: 210 Credit: 10,315,944 RAC: 0
|
I removed the "flops" section from the app_info because it "terminated" all my CUDA WU's with calculation errors after 17m32s. Rig name: "x6Crunchy" OS: Win 7 x64 MB: Asus M4N98TD EVO CPU: AMD X6 1055T 2.8(1,2v) GPU: 2x Asus GTX560ti Member of: Dutch Power Cows |
|
Cruncher-American Send message Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 771
|
Hi all keep getting computation errors lately on the standard non Cuda work units and a windows error message saying the AK 8b win has had an error, any ideas? I believe this is an interaction between flops, DCF and the new handling on the server side. Now you may get DCF >>1 (like, say, 7.68) - before, it was almost always < 1.0; this causes time allowed for running a WU to be much less than it used to be (before these changes). Unfortunately, the Power That Is has not told us what the algorithms he is implementing are, so it is impossible to understand exactly what is going on. Maybe we will be enlightened at some point, but I wouldn't hold my breath, if I were you - or me. |
Geek@Play Send message Joined: 31 Jul 01 Posts: 2467 Credit: 86,146,931 RAC: 0
|
Hi all keep getting computation errors lately on the standard non Cuda work units and a windows error message saying the AK 8b win has had an error, any ideas? Yes.........I am having the exact same problems........... See -177 (0xffffffffffffff4f) Faults and my faults are continuing. Fix is going to have to be at the server side. Boinc....Boinc....Boinc....Boinc.... |
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.