Strange behaviour, tasks switch, causing a restart

Message boards : Number crunching : Strange behaviour, tasks switch, causing a restart

Wurgl (speak^Wcrunching for Special: Off-Topic)
Joined: 19 Jun 06
Posts: 5
Credit: 681,649
RAC: 0
Austria
Message 1988136 - Posted: 31 Mar 2019, 13:56:36 UTC
Last modified: 31 Mar 2019, 13:58:49 UTC

I am using BOINC Manager 7.2.42 from openSUSE Leap 42.3 (yes, I have to upgrade, and I will this summer).

Two tasks:
http://setiathome.berkeley.edu/result.php?resultid=7405624489
http://setiathome.berkeley.edu/result.php?resultid=7405624487

One computes for about 1 minute. Then BOINC decides to switch over to the other task. Switching means the task starts again from the beginning. So the second task computes for about a minute, then BOINC decides … see above.

Nice endless loop.

Sorry. I have to kill those jobs.

BTW: Looking at the statistics graph in the BOINC client, this behavior seems to have been happening since March 19th.
ID: 1988136

tullio
Volunteer tester
Joined: 9 Apr 04
Posts: 7674
Credit: 2,724,049
RAC: 1,979
Italy
Message 1988140 - Posted: 31 Mar 2019, 14:37:04 UTC
Last modified: 31 Mar 2019, 14:38:15 UTC

I am using BOINC 7.8.3 on SuSE Leap 15.0, which I recommend. On another CPU, a virtual machine on a Windows 10 host, SuSE insists on sending me Tumbleweed as an update to Leap 15.0. BOINC Manager 7.14.2 does not work in Tumbleweed, so I have connected to Einstein@home by a manual installation.
Tullio
ID: 1988140

Wurgl (speak^Wcrunching for Special: Off-Topic)
Joined: 19 Jun 06
Posts: 5
Credit: 681,649
RAC: 0
Austria
Message 1988187 - Posted: 31 Mar 2019, 19:47:26 UTC - in response to Message 1988140.  

It may be important that these workunits run on the graphics card; they are not CPU-only. And they do not seem to save any checkpoint from which they could continue computing.

Pure CPU-jobs run fine.
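
To make the loop concrete, here is a toy model in Python. It is only a sketch under assumptions: the function name, the 600-second task length and the 60-second time slice are made up for illustration, and this is not how BOINC actually schedules work. It only shows why two tasks that alternate without checkpoints never finish.

```python
def slices_until_done(work_needed, time_slice, checkpoints, max_slices=1000):
    """Alternate between two tasks, one time slice each.

    Returns the number of time slices used until both tasks are finished,
    or None if they are still unfinished after max_slices slices.
    All numbers are hypothetical; this is a toy model, not BOINC's scheduler.
    """
    progress = [0, 0]
    done = [False, False]
    current = 0
    for used in range(max_slices):
        if all(done):
            return used
        if not done[current]:
            progress[current] += time_slice
            if progress[current] >= work_needed:
                done[current] = True
            elif not checkpoints:
                # No checkpoint: being switched out throws away all progress.
                progress[current] = 0
        # Switch to the other task.
        current = 1 - current
    return None


print(slices_until_done(600, 60, checkpoints=True))
print(slices_until_done(600, 60, checkpoints=False))
```

With checkpoints the two tasks finish after a bounded number of slices. Without them, neither task ever gets past its first slice, which matches the symptom described above.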
ID: 1988187

rob smith
Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 17728
Credit: 403,234,821
RAC: 152,377
United Kingdom
Message 1988201 - Posted: 31 Mar 2019, 20:37:11 UTC

A couple of thoughts. First, which source did you get the drivers from? If it was MS, there is a fair possibility that they lack some of the computation support needed.
Second, check the GPU to make sure it isn't full of dust and is seated properly.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1988201

Wurgl (speak^Wcrunching for Special: Off-Topic)
Joined: 19 Jun 06
Posts: 5
Credit: 681,649
RAC: 0
Austria
Message 1988204 - Posted: 31 Mar 2019, 21:04:52 UTC - in response to Message 1988201.  

rob, this is LINUX. MS is not allowed on my machine.

The problem is task switching without the ability to continue, not computation errors.

BTW: Now it runs its main job: Einstein@Home and since the start post it has done 9 workunits without errors, all running on the GPU. So no dust.
ID: 1988204

Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 13133
Credit: 149,097,459
RAC: 174,967
United Kingdom
Message 1988206 - Posted: 31 Mar 2019, 21:14:40 UTC

All tasks have been aborted, up to 7 weeks after issue. All tasks for computer 4990496.

No evidence is retained in stderr, but the event log could still be queried via stdoutdae.txt.
ID: 1988206

Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 9661
Credit: 890,157,877
RAC: 1,704,394
United States
Message 1988209 - Posted: 31 Mar 2019, 21:20:22 UTC - in response to Message 1988204.  

You are trying to run the stock SoG OpenCL GPU application. Do you have the Nvidia OpenCL drivers loaded? Does the Event Log at startup show two detection lines for your GTX 950? You should see one line showing the Nvidia driver level with the CUDA detection, followed by another line showing the Nvidia driver level with the OpenCL detection. If you don't have that second line, that is why all your GPU tasks fail.

You might want to reset the project in case the application entries in the client_state.xml are corrupted or something.
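
The check for the two detection lines could be scripted roughly as follows. This is a sketch only: the function name is made up, and the line layout it matches ("CUDA: NVIDIA GPU 0: ..." and "OpenCL: NVIDIA GPU 0: ...") is an assumption about what the client writes at startup, so adjust the pattern to whatever your own Event Log shows. The sample lines are placeholders, not real log output.

```python
import re

# Assumed layout of the startup detection lines; adjust to your own log.
DETECTION = re.compile(r"\b(CUDA|OpenCL): NVIDIA GPU \d+:")


def detected_gpu_apis(log_lines):
    """Return the set of APIs ('CUDA', 'OpenCL') that have a detection line."""
    found = set()
    for line in log_lines:
        match = DETECTION.search(line)
        if match:
            found.add(match.group(1))
    return found


# Placeholder lines for illustration only (hypothetical, not copied from a log).
sample_log = [
    "[---] CUDA: NVIDIA GPU 0: <model> (driver version <n>)",
    "[---] OpenCL: NVIDIA GPU 0: <model> (driver version <n>)",
]

apis = detected_gpu_apis(sample_log)
if "OpenCL" not in apis:
    print("No OpenCL detection line found")
else:
    print("Detected:", sorted(apis))
```

To run it against a real log, pass the lines of stdoutdae.txt from the BOINC data directory instead of sample_log; where that directory lives depends on the installation.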
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1988209

Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 13133
Credit: 149,097,459
RAC: 174,967
United Kingdom
Message 1988211 - Posted: 31 Mar 2019, 21:48:18 UTC - in response to Message 1988209.  

I think that Raistmer's SoG application goes into 'temporary exit' and retries if certain types of driver error are encountered - lack of OpenCL might be one of them, but I don't have the documentation to hand.

That would explain the symptom in the thread title, but should be checked in both the Event Log and stderr.
ID: 1988211

betreger
Joined: 29 Jun 99
Posts: 9436
Credit: 25,369,628
RAC: 21,435
United States
Message 1988212 - Posted: 31 Mar 2019, 22:04:35 UTC - in response to Message 1988211.  

IIRC the Einstein Nvidia app is OpenCL.
ID: 1988212

TBar
Volunteer tester
Joined: 22 May 99
Posts: 4817
Credit: 554,650,934
RAC: 1,259,281
United States
Message 1988213 - Posted: 31 Mar 2019, 22:10:12 UTC - in response to Message 1988204.  

rob, this is LINUX. MS is not allowed on my machine.

The problem is task switching without the ability to continue, not computation errors.

BTW: Now it runs its main job: Einstein@Home and since the start post it has done 9 workunits without errors, all running on the GPU. So no dust.
It would be much easier to just run the current LTS version of Ubuntu with the BOINC-All-In-One package. Just install Ubuntu, the Repository 390 driver, and the BOINC package to your Home folder. Then you could be like all these people, https://setiathome.berkeley.edu/top_hosts.php, except for that one Mac that seems to be hanging in there...
ID: 1988213

Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 9661
Credit: 890,157,877
RAC: 1,704,394
United States
Message 1988221 - Posted: 31 Mar 2019, 23:12:17 UTC - in response to Message 1988212.  

IIRC the Einstein Nvidia app is OpenCL.

Yes, I see that now in his OP. But I think the Einstein OpenCL app can run on the Mesa OpenCL package drivers. Not sure; I would have to read through the forum threads. There can be instances where two platforms are loaded for graphics drivers: Mesa and Nvidia, Mesa and Intel, or Mesa and AMD. Each would try to load its OpenCL component. I remember a thread where the Mesa drivers unloaded the AMD OpenCL component and vice versa. Supposedly it is impossible for both to co-exist on the same system. That could be the case here. A check with clinfo would clear things up, since it lists all detected platforms and their respective components.
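
If clinfo output is available, the platform check can be reduced to a few lines of Python. This is a sketch under assumptions: the function name is made up, it assumes clinfo prints lines that begin with "Platform Name" followed by the name, and the sample text is a hypothetical stand-in rather than output from any real machine.

```python
def platform_names(clinfo_text):
    """Collect distinct OpenCL platform names from clinfo-style text.

    Assumes each platform is reported on a line that starts with
    'Platform Name' (after indentation), followed by the name.
    """
    names = []
    for line in clinfo_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Platform Name"):
            name = stripped[len("Platform Name"):].strip()
            if name and name not in names:
                names.append(name)
    return names


# Hypothetical sample showing two platforms loaded side by side.
sample = """\
Number of platforms    2
  Platform Name        NVIDIA CUDA
  Platform Name        Clover
"""

names = platform_names(sample)
print(names)
if len(names) > 1:
    print("More than one OpenCL platform is loaded")
```

More than one name in the result would point at the two-platform situation described above; a list without an Nvidia entry would point at a missing Nvidia OpenCL component.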
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1988221

tullio
Volunteer tester
Joined: 9 Apr 04
Posts: 7674
Credit: 2,724,049
RAC: 1,979
Italy
Message 1988265 - Posted: 1 Apr 2019, 5:25:37 UTC

I've learned that a virtual machine cannot use the GPU of its host. In a Linux virtual machine on a Windows 10 PC with an Nvidia board, the hwinfo --gfx command reports that I have a VMware board, which does not exist. This is probably why LHC@home does not use GPUs: all its programs, except the "native" ones, use a virtual machine.
Tullio
ID: 1988265

Wurgl (speak^Wcrunching for Special: Off-Topic)
Joined: 19 Jun 06
Posts: 5
Credit: 681,649
RAC: 0
Austria
Message 1988278 - Posted: 1 Apr 2019, 7:59:55 UTC - in response to Message 1988221.  

I have the Nvidia drivers installed. I cannot install clinfo, since that would cause a conflict.

However, when there is an error, the workunit should stop with that error. It should not try to restart (causing the same problem again and ending up in an endless loop).
ID: 1988278

rob smith
Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 17728
Credit: 403,234,821
RAC: 152,377
United Kingdom
Message 1988281 - Posted: 1 Apr 2019, 9:25:46 UTC

BOINC has a built-in feature that traps tasks that restart too often. I can't remember what the trigger number is, but seeing the real error report of a task that has undergone a number of restarts might actually give a hint as to what the problem is.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1988281

Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 13133
Credit: 149,097,459
RAC: 174,967
United Kingdom
Message 1988284 - Posted: 1 Apr 2019, 9:57:39 UTC - in response to Message 1988281.  

The limit is 100. But stderr is limited to the last 64KB, and with the amount of SoG output, the initial iterations and initial start data will be lost.
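
A quick back-of-the-envelope check of that in Python, as a sketch only: the 64KB limit and the 100 restarts are the figures quoted above, while the 4096 bytes of output per restart and both function names are made-up assumptions for illustration.

```python
STDERR_LIMIT = 64 * 1024  # bytes retained, per the figure quoted above


def kept_tail(stderr_bytes, limit=STDERR_LIMIT):
    """Return only the last `limit` bytes, as a last-64KB cut-off would."""
    return stderr_bytes[-limit:]


def surviving_restarts(bytes_per_restart, restarts, limit=STDERR_LIMIT):
    """How many of the most recent restarts fit completely in the kept tail."""
    return min(restarts, limit // bytes_per_restart)


# Hypothetical: each restart writes 4096 bytes, and the task restarts 100 times.
full_output = b"x" * (4096 * 100)
print(len(kept_tail(full_output)))
print(surviving_restarts(4096, 100))
```

Under those assumed numbers only the most recent restarts survive in the report, and the first one, which carries the start-up data, is long gone.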
ID: 1988284
