Computation errors

Message boards : Number crunching : Computation errors

Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1008240 - Posted: 25 Jun 2010, 18:54:16 UTC - in response to Message 1008123.  

A 'brand new' WU, taking much computation time.
Task 1636695959

Seems to give multiple errors, -6; -12 on my (slowest) CUDA host.
...

Split 1 May, gives result_overflow with 1 Gaussian and 30 Triplets on CPU so it should be no surprise CUDA apps bail out with the too many triplets -12 "Unsupported function" exit. The unusual factor is that more than 77% of the processing is done before that Triplet zone is reached. That's not a nice time to throw an error exit.

I don't see any -6, and VLARkill certainly wouldn't force one for that AR 0.422886 task.
                                                               Joe
ID: 1008240 · Report as offensive
Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1008123 - Posted: 25 Jun 2010, 11:03:31 UTC - in response to Message 1008106.  
Last modified: 25 Jun 2010, 11:15:41 UTC

A 'brand new' WU, taking much computation time.
Task 1636695959

Seems to give multiple errors, -6; -12 on my (slowest) CUDA host.
Btw, it's the only host with 'enough' tasks for 1 maybe 2 days.
Just finished a few AP WU's on the ATI host. (Collatz C. 'has taken over')
ID: 1008123 · Report as offensive
Profile Spectrum
Joined: 14 Jun 99
Posts: 468
Credit: 53,129,336
RAC: 0
Australia
Message 1008106 - Posted: 25 Jun 2010, 9:46:39 UTC

OK, to save time and confusion I have reset the project on this computer (apologies to my wingmen) and removed and re-installed a clean, latest version of SETI@home. I will see if it behaves for two weeks (the time I am working away from home), then consider installing an optimised app. Something strange was happening, and starting from scratch is the easiest way when you are not so project savvy. Thanks to all for your input, but I just don't have the time to implement the fixes and am not confident enough to start editing the app_info files.

Cheers and keep on crunching :)
ID: 1008106 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 13797
Credit: 40,757,560
RAC: 151
United Kingdom
Message 1008066 - Posted: 25 Jun 2010, 5:57:18 UTC
Last modified: 25 Jun 2010, 6:02:00 UTC

The details for one of the APs downloaded recently are:

<workunit>
<name>ap_31dc09ag_B6_P0_00395_20100624_25036.wu</name>
<app_name>astropulse_v505</app_name>
<version_num>505</version_num>
<rsc_fpops_est>1821052114462310.000000</rsc_fpops_est>

The <rsc_fpops_est> is identical for all AP tasks on both computers


So no, I guess the changes for AP have not even been started, yet!
ID: 1008066 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1008057 - Posted: 25 Jun 2010, 5:14:04 UTC - in response to Message 1008042.  

The one thing that appears not to have been done is "Application Info" for AP tasks.

The AP tasks I have d/loaded in the last few hours have completion estimates of 134 hrs, actual is 12.5 hrs average. Therefore the BOINC client thinks there is enough CPU work and is only requesting GPU tasks.

It appears that unless things change soon the CPU will run dry whilst the CPU has 100's of tasks.

That was meant to be "whilst the GPU has 100's of tasks".

Hmm, I think I see at least 11 validated AP tasks since the new server code was introduced on your Q660 host. If so, the server should have just started adjusting the estimate and bound values. Could you look in the host's client_state.xml and see if the rsc_fpops_est for the most recently downloaded AP tasks is significantly different from 1.821e+15? Probably most of the discrepancy is simply because the host DCF is now much higher, but if server-side has not started to adjust then maybe the whole mechanism is broken for AP rather than just the display of "Application details".
                                                              Joe
ID: 1008057 · Report as offensive
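The check Joe asks for can be scripted rather than done by eye. The following is only a sketch: it assumes client_state.xml parses as well-formed XML with `<workunit>` elements shaped like the fragment quoted elsewhere in this thread, and the function name `fpops_estimates` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Reference value mentioned in the post (1.821e+15).
OLD_AP_ESTIMATE = 1.821e15


def fpops_estimates(client_state_xml, app_name="astropulse_v505"):
    """Return {workunit name: rsc_fpops_est} for one application.

    Hypothetical helper: assumes each <workunit> carries <name>,
    <app_name> and <rsc_fpops_est> children, as in the quoted fragment.
    """
    root = ET.fromstring(client_state_xml)
    estimates = {}
    for wu in root.iter("workunit"):
        if wu.findtext("app_name") != app_name:
            continue
        est = wu.findtext("rsc_fpops_est")
        if est is not None:
            estimates[wu.findtext("name")] = float(est)
    return estimates


# Sample input: the first workunit copies the values posted in this thread,
# the second one is invented to show that other applications are skipped.
SAMPLE = """<client_state>
<workunit>
    <name>ap_31dc09ag_B6_P0_00395_20100624_25036.wu</name>
    <app_name>astropulse_v505</app_name>
    <version_num>505</version_num>
    <rsc_fpops_est>1821052114462310.000000</rsc_fpops_est>
</workunit>
<workunit>
    <name>made_up_task.wu</name>
    <app_name>some_other_app</app_name>
    <rsc_fpops_est>30000000000000.000000</rsc_fpops_est>
</workunit>
</client_state>"""

for name, est in fpops_estimates(SAMPLE).items():
    # A ratio close to 1.0 suggests the server has not adjusted the estimate.
    print(name, est, "ratio to old estimate: %.3f" % (est / OLD_AP_ESTIMATE))
```

To try it on a real host, the contents of client_state.xml would be read into a string and passed in place of `SAMPLE`.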
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 13797
Credit: 40,757,560
RAC: 151
United Kingdom
Message 1008042 - Posted: 25 Jun 2010, 4:10:48 UTC - in response to Message 1008024.  

The one thing that appears not to have been done is "Application Info" for AP tasks.

The AP tasks I have d/loaded in the last few hours have completion estimates of 134 hrs, actual is 12.5 hrs average. Therefore the BOINC client thinks there is enough CPU work and is only requesting GPU tasks.

It appears that unless things change soon the CPU will run dry whilst the CPU has 100's of tasks.

That was meant to be "whilst the GPU has 100's of tasks".
ID: 1008042 · Report as offensive
Profile perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1008028 - Posted: 25 Jun 2010, 3:17:17 UTC - in response to Message 1008024.  

I didn't even think of that!! I had to do a detach and when I attached again I got 15 GPU tasks (with their time to completion way out of whack) and one AP. I was wondering why I hadn't been able to get any CPU tasks.


PROUD MEMBER OF Team Starfire World BOINC
ID: 1008028 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 13797
Credit: 40,757,560
RAC: 151
United Kingdom
Message 1008024 - Posted: 25 Jun 2010, 3:07:14 UTC

The one thing that appears not to have been done is "Application Info" for AP tasks.

The AP tasks I have d/loaded in the last few hours have completion estimates of 134 hrs, actual is 12.5 hrs average. Therefore the BOINC client thinks there is enough CPU work and is only requesting GPU tasks.

It appears that unless things change soon the CPU will run dry whilst the CPU has 100's of tasks.
ID: 1008024 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1008000 - Posted: 25 Jun 2010, 1:47:51 UTC - in response to Message 1007910.  

The set and forget brigade might cause tasks to abort and not know about it.

Dave

The "set-and-forget" brigade would not have an app_info file to add a flops entry to, never mind a flops entry, so it shouldn't be a problem.

F.

I wish that were fully true. Unfortunately there are people who install optimized applications, apply some tweaks picked up here, then figure they're set for life. It's why I always try to work a mention of the user's obligation to stay in touch into any recommendation to consider going optimized.
                                                             Joe
ID: 1008000 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 1007910 - Posted: 24 Jun 2010, 22:57:57 UTC - in response to Message 1007821.  

The set and forget brigade might cause tasks to abort and not know about it.

Dave

The "set-and-forget" brigade would not have an app_info file to add a flops entry to, never mind a flops entry, so it shouldn't be a problem.

F.
ID: 1007910 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1007869 - Posted: 24 Jun 2010, 21:21:15 UTC - in response to Message 1007770.  

I essentially did the same by reinstalling the Lunatics drivers and not editing the app_info.xml, and the computation errors are gone, but how are we going to correct the runtime now?

The runtime should correct itself soon enough, as some of my PCs which suddenly had longer predicted WU times did so within a few hours.


The runtime doesn't correct itself enough, which was the reason why we had to put the flops entry in there to begin with. Either that or increase the cache to twice the necessary days to compensate. Has anything changed lately that causes this not to work anymore?

Under the old system there was just the single Duration Correction Factor (DCF) on each host for all the project applications. Using <flops> in the right ratios allowed the DCF to be reasonably stable, which in turn meant reasonable work fetch, etc.

Under the new system the servers keep statistics for each application intended to do the same thing. If the averages kept by the server stabilize, the host DCF should also stabilize (in the vicinity of 1.0). Having <flops> in the app_info.xml will no longer directly affect the server code.

On the host, though, <flops> will still affect some things: the Round Robin simulation the host uses to figure out whether it needs to go into High Priority, the estimated run times, and the Maximum Elapsed Time before BOINC kills a task and reports a -177 error. Using a low <flops> for a fast GPU keeps that error from happening (when there is no <flops> the core client uses the CPU Whetstone benchmark instead). But that low <flops> makes the RR sim extremely inaccurate. There may be a compromise setting for GPU <flops> which will work OK, if so it will be less than the former recommendation but somewhat higher than the CPU flops.
                                                                Joe
ID: 1007869 · Report as offensive
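The trade-off Joe describes for `<flops>` can be put into numbers. This is a deliberately simplified model, not the client's actual code: it assumes the estimated run time is roughly rsc_fpops_est / flops scaled by the DCF, and that the elapsed-time limit behind the -177 error is roughly rsc_fpops_bound / flops. The function names and every figure in the example are hypothetical.

```python
def estimated_runtime_s(rsc_fpops_est, flops, dcf=1.0):
    """Simplified run-time estimate in seconds (assumed model)."""
    return rsc_fpops_est / flops * dcf


def max_elapsed_time_s(rsc_fpops_bound, flops):
    """Simplified elapsed-time limit in seconds before a -177 abort (assumed model)."""
    return rsc_fpops_bound / flops


# Hypothetical task and device figures, chosen only to make the arithmetic easy.
EST = 3e13           # rsc_fpops_est of the task
BOUND = 3e14         # rsc_fpops_bound of the task
ACTUAL_SPEED = 1e10  # what the GPU app really sustains on this task

actual_runtime = EST / ACTUAL_SPEED  # 3000 s

# Case 1: a high <flops> value. The limit drops below the real run time,
# so under this model the task would be killed with -177.
high_flops = 3e11
print("limit with high flops:", max_elapsed_time_s(BOUND, high_flops),
      "s, actual run time:", actual_runtime, "s")

# Case 2: a low, CPU-like <flops> value. The limit is generous, but the
# estimate is far above the real run time, which is the kind of error that
# makes the Round Robin simulation inaccurate.
low_flops = 3e9
print("limit with low flops:", max_elapsed_time_s(BOUND, low_flops),
      "s, estimate:", estimated_runtime_s(EST, low_flops), "s")
```

Under these assumptions a compromise value lies between the two cases: high enough that the estimate is not wildly pessimistic, low enough that the limit stays above the real run time.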
FiveHamlet
Joined: 5 Oct 99
Posts: 783
Credit: 32,638,578
RAC: 0
United Kingdom
Message 1007821 - Posted: 24 Jun 2010, 18:09:53 UTC

I had to remove the flops from my app file on 2 rigs; it was causing a CUDA abort error.
Both rigs are now working OK again.
I only got the info from the threads.
The set and forget brigade might cause tasks to abort and not know about it.

Dave
ID: 1007821 · Report as offensive
Bernd Noessler

Joined: 15 Nov 09
Posts: 99
Credit: 52,635,434
RAC: 0
Germany
Message 1007802 - Posted: 24 Jun 2010, 17:05:35 UTC - in response to Message 1007791.  

If you have problems with the -177 errors you can try the following:

You need an editor which can search for regular expressions. Linux users like me might use Midnight Commander.
Stop BOINC and load client_state.xml into the editor.

search for regexp:
\<rsc_fpops_bound\>.*\<

and replace with text:
<rsc_fpops_bound>2e15<

It affects only tasks already in the cache.
I've done this on three of my machines and it worked.

ID: 1007802 · Report as offensive
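For anyone without a regex-capable editor, the same search-and-replace from Bernd's post can be sketched in a few lines of Python. This is an illustration under assumptions, not a tested tool: the helper names are made up, the pattern uses `[^<]*` instead of the `.*` from the post so it cannot run past the closing tag, and the bound value in the sample text is invented. As in the post, BOINC should be stopped first, and only tasks already in the cache are affected.

```python
import re
import shutil
from pathlib import Path

# Opening tag plus everything up to the next "<", as in the posted search.
BOUND_PATTERN = re.compile(r"<rsc_fpops_bound>[^<]*<")


def raise_fpops_bound(xml_text, new_bound="2e15"):
    """Return xml_text with every rsc_fpops_bound value replaced (hypothetical helper)."""
    return BOUND_PATTERN.sub(lambda m: "<rsc_fpops_bound>" + new_bound + "<", xml_text)


def patch_file(path, new_bound="2e15"):
    """Patch a client_state.xml in place, keeping a .bak copy next to it.

    The backup step is an added precaution, not part of the original post.
    """
    path = Path(path)
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    path.write_text(raise_fpops_bound(path.read_text(), new_bound))


# Sample text: the estimate is the value posted in this thread,
# the bound is a made-up number for the demonstration.
SAMPLE = (
    "<workunit>\n"
    "    <rsc_fpops_est>1821052114462310.000000</rsc_fpops_est>\n"
    "    <rsc_fpops_bound>18210521144623100.000000</rsc_fpops_bound>\n"
    "</workunit>\n"
)

print(raise_fpops_bound(SAMPLE))
```

Only the `<rsc_fpops_bound>` lines change; the `<rsc_fpops_est>` values are left alone, so the run-time estimates are not affected by this edit.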
Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 1007791 - Posted: 24 Jun 2010, 16:30:27 UTC
Last modified: 24 Jun 2010, 16:33:49 UTC

Changes were made to the server-side software so that the servers provide the clients with the necessary flops calculation. The server is also supposed to bring the DCF to a value of 1 or near that. There are still some bugs to squash in this server software.

This is working fine on 2 of the 4 client computers I have. The other two are throwing -177 fault codes often.

Because of the new server code, any flops in the app_info file should probably be removed.
Boinc....Boinc....Boinc....Boinc....
ID: 1007791 · Report as offensive
Bearcat

Joined: 10 Sep 99
Posts: 106
Credit: 10,778,506
RAC: 0
United States
Message 1007770 - Posted: 24 Jun 2010, 15:37:23 UTC - in response to Message 1007691.  

I essentially did the same by reinstalling the Lunatics drivers and not editing the app_info.xml, and the computation errors are gone, but how are we going to correct the runtime now?

The runtime should correct itself soon enough, as some of my PCs which suddenly had longer predicted WU times did so within a few hours.


The runtime doesn't correct itself enough, which was the reason why we had to put the flops entry in there to begin with. Either that or increase the cache to twice the necessary days to compensate. Has anything changed lately that causes this not to work anymore?

ID: 1007770 · Report as offensive
Profile Wiggo "Democratic Socialist"
Joined: 24 Jan 00
Posts: 18404
Credit: 261,360,520
RAC: 1,109
Australia
Message 1007691 - Posted: 24 Jun 2010, 7:03:09 UTC - in response to Message 1007690.  

I essentially did the same by reinstalling the Lunatics drivers and not editing the app_info.xml, and the computation errors are gone, but how are we going to correct the runtime now?

The runtime should correct itself soon enough, as some of my PCs which suddenly had longer predicted WU times did so within a few hours.
ID: 1007691 · Report as offensive
Bearcat

Joined: 10 Sep 99
Posts: 106
Credit: 10,778,506
RAC: 0
United States
Message 1007690 - Posted: 24 Jun 2010, 6:50:09 UTC - in response to Message 1007687.  

I removed the "flops" section from the app_info because it "terminated" all my CUDA WU's with calculation errors after 17m32s.


I essentially did the same by reinstalling the Lunatics drivers and not editing the app_info.xml, and the computation errors are gone, but how are we going to correct the runtime now?

ID: 1007690 · Report as offensive
TheFreshPrince a.k.a. BlueTooth76
Joined: 4 Jun 99
Posts: 210
Credit: 10,315,944
RAC: 0
Netherlands
Message 1007687 - Posted: 24 Jun 2010, 6:16:32 UTC

I removed the "flops" section from the app_info because it "terminated" all my CUDA WU's with calculation errors after 17m32s.
Rig name: "x6Crunchy"
OS: Win 7 x64
MB: Asus M4N98TD EVO
CPU: AMD X6 1055T 2.8(1,2v)
GPU: 2x Asus GTX560ti
Member of: Dutch Power Cows
ID: 1007687 · Report as offensive
Cruncher-American

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 771
United States
Message 1007650 - Posted: 24 Jun 2010, 1:43:54 UTC - in response to Message 1007647.  

Hi all, I keep getting computation errors lately on the standard non-CUDA work units, and a Windows error message saying the AK 8b win has had an error. Any ideas?

Yes.........I am having the exact same problems...........

See -177 (0xffffffffffffff4f) Faults and my faults are continuing. Fix is going to have to be at the server side.



I believe this is an interaction between flops, DCF and the new handling on the server side. Now you may get DCF >>1 (like, say, 7.68) - before, it was almost always < 1.0; this causes time allowed for running a WU to be much less than it used to be (before these changes). Unfortunately, the Power That Is has not told us what the algorithms he is implementing are, so it is impossible to understand exactly what is going on.

Maybe we will be enlightened at some point, but I wouldn't hold my breath, if I were you - or me.
ID: 1007650 · Report as offensive
Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 1007647 - Posted: 24 Jun 2010, 1:29:15 UTC - in response to Message 1007628.  

Hi all, I keep getting computation errors lately on the standard non-CUDA work units, and a Windows error message saying the AK 8b win has had an error. Any ideas?

Yes.........I am having the exact same problems...........

See -177 (0xffffffffffffff4f) Faults and my faults are continuing. Fix is going to have to be at the server side.


Boinc....Boinc....Boinc....Boinc....
ID: 1007647 · Report as offensive
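The long hex number quoted next to -177 is just the same exit code written as an unsigned two's-complement value. A small sketch of the conversion in both directions follows; the function names are made up, and the choice of 32 or 64 bits is an assumption about how the reporting program formats the number.

```python
def exit_code_hex(code, bits=64):
    """Format a signed exit code as unsigned two's-complement hex."""
    return format(code & ((1 << bits) - 1), "#x")


def signed_exit_code(value, bits=64):
    """Turn an unsigned hex exit code back into the signed number."""
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


# -177 shown at two common widths; the 64-bit form should correspond
# to the value quoted in the post above.
print(exit_code_hex(-177, bits=32))
print(exit_code_hex(-177, bits=64))
print(signed_exit_code(0xFFFFFF4F, bits=32))
```

The same helper works for the other codes mentioned in this thread, such as -6 and -12.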


 
©2020 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.