CPU Computation errors

Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1988887 - Posted: 5 Apr 2019, 12:41:54 UTC

I am trying to run my AMD 2700 Box at 3.7GHz.

Apparently the CPU is throwing errors every day: https://setiathome.berkeley.edu/results.php?hostid=8684146&offset=0&show_names=0&state=6&appid=

I have upped the cpu voltage offset again today.

Is there anything else I should be tinkering with?

Tom
A proud member of the OFA (Old Farts Association).
ID: 1988887 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1988890 - Posted: 5 Apr 2019, 13:03:49 UTC - in response to Message 1988887.  

You could tell us what the error messages are :-)

On a sample of three, all had

Exit status	194 (0x000000C2) EXIT_ABORTED_BY_CLIENT
finish file present too long
and in one case, "Restarted at 27.00 percent." three times. I think the poor CPU (and disk) is stressed out servicing all those GPUs.

What you could do is to try building a new client from master, with #3019:

When an app finishes, it writes a "finish file",
which ensures the client that the app really finished.

If the app process is still there N seconds after the finish file appears,
the client assumes that something went wrong, and it aborts the job.

Previously N was 10.
This was too small during periods of heavy paging.
I increased it to 300.
Or you could wait for the official release of client version 7.16: that should start moving any day now, as soon as Keith Myers and I sign off on the 'work fetch with max_concurrent' bug.
ID: 1988890 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1988925 - Posted: 5 Apr 2019, 17:46:24 UTC - in response to Message 1988890.  

I couldn't seem to find the error message. Sorry.

Since I am running TBar's All-in-One, a new client won't help unless the "client" lives outside the GPU processing exe. I can replace most of the executables in the BOINC folder, but replacing the GPU app means someone has to fix and recompile it from the same source TBar was using.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1988925 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1988936 - Posted: 5 Apr 2019, 18:47:32 UTC - in response to Message 1988925.  

Tom M wrote: "I couldn't seem to find the error message. Sorry."
Not to worry. It took a while to load a list that long, but they arrived eventually and I know where to look.

Tom M wrote: "Since I am running TBar's All-in-One, unless the 'client' resides outside of the GPU processing exe it's not going to be helpful. I can replace most of the executables in the BOINC folder. But replacing the GPU task means someone has to fix/recompile from the same source TBar was using."

The 'client', in modern (post-2005) terminology, is the boinc or boinc.exe binary executable program. In older reference works, you'll sometimes see the SETI programs referred to as clients, but that dates from the old 'Classic' days when SETI did its own communications with the server closet, without the BOINC layer in the middle.

TBar has chosen to manage, maintain and document his work himself: I did download his package (all 222 MB) a couple of weeks ago to see if he'd incorporated my improved workround for the Manager scrolling bug, but the docs suggested he'd still taken out the whole patch. Anyway, that was Manager and this is Client, so it would make no difference.

The boinc client program is listed there, as one of the five files you need to install: I suggest you offer to test a build with #3019 (remember only in master, not a numbered branch) if he's prepared to build you one.
ID: 1988936 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1988938 - Posted: 5 Apr 2019, 18:57:07 UTC - in response to Message 1988936.  

Richard,
Thank you for clarifying what exactly "the client" refers to. If the change is in the "boinc.exe" I have less trouble with it.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1988938 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1988943 - Posted: 5 Apr 2019, 19:42:14 UTC

It also occurs to me that if I go ahead and reserve 1 CPU per GPU, the CPU cores that are driving the GPUs would no longer be "struggling" to spend time on the CPU app tasks as well.

The downside is it takes away 6 cpu threads from cpu processing. Maybe there is a compromise like 0.33 cpus per gpu.

Until the upgrade comes out :)
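
To put rough numbers on that trade-off, here is a minimal Python sketch. It assumes a 16-thread host (the Ryzen 2700 has 8 cores / 16 threads) and 6 GPUs (inferred from "takes away 6 cpu threads"). The function name is made up for illustration, and the model simply sums the reservations; the real BOINC client may round fractional reservations differently.

```python
def cpu_threads_left(total_threads: int, gpu_count: int, cpu_per_gpu: float) -> float:
    """Threads remaining for CPU tasks under a simple 'sum the reservations' model.

    Assumption: each running GPU task reserves cpu_per_gpu threads and the
    client subtracts the total from the thread count. This is a simplification,
    not the client's actual scheduling code.
    """
    reserved = gpu_count * cpu_per_gpu
    return total_threads - reserved


if __name__ == "__main__":
    # Compare a full core per GPU with the 0.33 compromise mentioned above.
    for share in (1.0, 0.33):
        left = cpu_threads_left(total_threads=16, gpu_count=6, cpu_per_gpu=share)
        print(f"{share} CPUs per GPU leaves about {left:.2f} threads for CPU tasks")
```

Under this model the 0.33 compromise gives back roughly four threads compared with a full core per GPU, at the cost of the GPU tasks again sharing cores with CPU work.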

Tom
A proud member of the OFA (Old Farts Association).
ID: 1988943 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22158
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1988948 - Posted: 5 Apr 2019, 19:55:36 UTC

What you lose by having one core per GPU is more than made up for by the improved performance of the GPUs.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1988948 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1988993 - Posted: 6 Apr 2019, 0:10:46 UTC - in response to Message 1988948.  
Last modified: 6 Apr 2019, 0:11:11 UTC

The GPUs weren't having a problem. It was the CPU threads, which is why I am mourning the (temporary, I hope) loss of them. At least that is what I think was going on.

I haven't had any computation errors since yesterday when I upped the cpu voltage. But given it also had something to do with writing a file to the HD I figured I wanted at least some days of "no errors" before I try bringing back more cpu threads to a "shared" status.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1988993 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1989016 - Posted: 6 Apr 2019, 9:06:45 UTC - in response to Message 1988993.  

It wasn't really a 'computation error' - the actual processing ran to completion with no sign of errors.

It was more of a 'housekeeping error', involving memory, disk, and the operating system. The sequence should be:

1. The SETI app writes a data file to disk for upload
2. The SETI app writes a 'finished' file to disk
3. The SETI app shuts down and removes itself from memory
[at which point, another SETI app will probably start up and load itself from disk to memory]
4. The BOINC client checks that both 2 and 3 have happened within 10 seconds

If it takes more than 10 seconds for that to happen, then the OS is thrashing. It might be memory paging as David suggests in his comment (in which case, more RAM might help). Or it might be the OS being too busy starting the next task (in which case, fewer concurrent operations requiring disk access might help). Or perhaps an SSD data disk.
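
The check in step 4 can be sketched roughly as follows. This is a simplified Python illustration of the logic described in this thread, not the BOINC client's actual source; the `TaskState` type and `check_task` function are hypothetical names, and only the two timeout values (10 and 300 seconds) come from the quoted #3019 commit message.

```python
from dataclasses import dataclass
from typing import Optional

OLD_TIMEOUT_S = 10.0   # N before #3019, per the quoted commit message
NEW_TIMEOUT_S = 300.0  # N after #3019


@dataclass
class TaskState:
    """Hypothetical snapshot of one task as the client sees it."""
    finish_file_seen_at: Optional[float]  # time the finish file appeared, or None
    process_alive: bool                   # is the app process still in memory?


def check_task(state: TaskState, now: float, timeout_s: float) -> str:
    """Classify a task as 'running', 'finished', 'waiting' or 'abort'."""
    if state.finish_file_seen_at is None:
        return "running"              # step 2 has not happened yet
    if not state.process_alive:
        return "finished"             # steps 2 and 3 both done
    if now - state.finish_file_seen_at <= timeout_s:
        return "waiting"              # app still shutting down, within N seconds
    return "abort"                    # "finish file present too long"


if __name__ == "__main__":
    # An app that needs 25 seconds to exit because the OS is thrashing.
    slow_exit = TaskState(finish_file_seen_at=100.0, process_alive=True)
    print(check_task(slow_exit, now=125.0, timeout_s=OLD_TIMEOUT_S))
    print(check_task(slow_exit, now=125.0, timeout_s=NEW_TIMEOUT_S))
```

The same slow shutdown that gets aborted with the old 10-second limit is simply waited out with the 300-second limit, which is why the patch should help on busy hosts even if nothing else changes.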
ID: 1989016 · Report as offensive
Profile Kissagogo27 Special Project $75 donor
Joined: 6 Nov 99
Posts: 715
Credit: 8,032,827
RAC: 62
France
Message 1989017 - Posted: 6 Apr 2019, 9:24:04 UTC
Last modified: 6 Apr 2019, 9:24:36 UTC

Beware of HIP WUs. I've had some suspicious pulses with this sort of WU, e.g. blc22_2bit_guppi_58406_01923_HIP116971_0034.25185.0.21.44.103.vlar

Take a look at this one: http://setiathome.berkeley.edu/workunit.php?wuid=3421361418

The WU restarted several times on my CPU, and on my wingman's as well ...
ID: 1989017 · Report as offensive
Profile Kissagogo27 Special Project $75 donor
Joined: 6 Nov 99
Posts: 715
Credit: 8,032,827
RAC: 62
France
Message 1989078 - Posted: 7 Apr 2019, 8:41:13 UTC

Messages I got:

[SETI@home] task postponed 300.000000 sec: Impossible Autocorr power, retrying from checkpoint.

or

Task postponed: Suspicious pulse results, host needs reboot or maintenance


Other ones, after shutting down the computer and restarting from cold:

Task postponed: Suspicious pulse results, host needs reboot or maintenance


ID: 1989078 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1990423 - Posted: 17 Apr 2019, 20:06:08 UTC

Looks like I am now having GPU computation errors. Going to switch to a 1-to-1 CPU-to-GPU ratio.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1990423 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1990433 - Posted: 17 Apr 2019, 23:38:38 UTC

I'm glad I visited this thread. Richard, I wasn't aware of the #3019 patch for the "finish file present too long" bug. I am plagued by that on a lot of my hosts because they are too busy to service the task in the short time frame. Very happy to see the timeout increased. Typically this is the only type of error I get. I am not running the kind of GPU count that Tom M. is running. Most of my hosts only run 3 GPUs, with a couple running 4 GPUs. But I get the finish file error on even the 3-card hosts. Maybe . . . . one a week.

@Richard, I wasn't aware you were waiting on me for the work_fetch workaround. I haven't tried out the latest dpa_work_fetch_mc branch lately. I never got any feedback on why the simulator again refuses to work for me. It takes my uploaded files without complaint or error but a scenario never shows up.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1990433 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1990498 - Posted: 18 Apr 2019, 13:20:06 UTC - in response to Message 1990423.  

I haven't had any GPU computation errors (or invalid results) since I switched to 1 CPU to 1 GPU in the app_config.xml file.
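
For anyone wanting to try the same setting, here is a hedged Python sketch that writes out an app_config.xml of the kind described. The `gpu_versions`, `gpu_usage` and `cpu_usage` elements follow BOINC's documented app_config.xml format; the app name `setiathome_v8` and the helper name `build_app_config` are assumptions for illustration, so check the app name in your own client_state.xml before using it.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, gpu_usage: float, cpu_usage: float) -> str:
    """Return app_config.xml text reserving cpu_usage CPUs per GPU task."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu = ET.SubElement(app, "gpu_versions")
    # gpu_usage 1.0 = one task per GPU; cpu_usage 1.0 = one full CPU per GPU task
    ET.SubElement(gpu, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(gpu, "cpu_usage").text = str(cpu_usage)
    ET.indent(root)  # pretty-print; requires Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Assumed app name; the 1-to-1 ratio is the one described in this post.
    print(build_app_config("setiathome_v8", gpu_usage=1.0, cpu_usage=1.0))
```

The resulting file goes in the project's folder under the BOINC data directory, and the client picks it up after a restart or a "Read config files" from the Manager.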

Tom
A proud member of the OFA (Old Farts Association).
ID: 1990498 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1990610 - Posted: 19 Apr 2019, 4:14:31 UTC - in response to Message 1990498.  

Looks like I am getting cpu "invalid results" though.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1990610 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1990622 - Posted: 19 Apr 2019, 8:41:40 UTC - in response to Message 1990433.  

Keith Myers wrote: "@Richard, I wasn't aware you were waiting on me for the work_fetch workaround. I haven't tried out the latest dpa_work_fetch_mc branch lately. I never got any feedback on why the simulator again refuses to work for me. It takes my uploaded files without complaint or error but a scenario never shows up."
No point just at the moment. I found a second bug after the last outage here: David ignored it for a week and then asked for a scenario at midnight Tuesday. He got it Wednesday morning, and for once the simulator actually showed the bug in action. Ball's back in his court.
ID: 1990622 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1990625 - Posted: 19 Apr 2019, 9:28:07 UTC - in response to Message 1990433.  

Keith Myers wrote: "Typically my only errors are for this type of error. I am not running the kind of gpu count that Tom M. is running. Most of my hosts only run 3 gpus with a couple running 4 gpus. But I get the finish file error on even the 3 card hosts. Maybe . . . . one a week."

I would also get the occasional "finish file present too long" when restarting my systems after updates had been installed and were waiting on a restart. I've just gotten into the habit of exiting BOINC at least 30 seconds before restarting the system.
Grant
Darwin NT
ID: 1990625 · Report as offensive
Profile Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1992472 - Posted: 3 May 2019, 19:31:44 UTC - in response to Message 1989016.  

Interesting... I wonder if that will also solve some of the issues some people, including myself, have with Rosetta. I've seen a few tasks go to error exactly when they were about to finish the work and exit.
ID: 1992472 · Report as offensive

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.