CPU jobs needing to be aborted

Message boards : Number crunching : CPU jobs needing to be aborted

Fortran
Joined: 20 Apr 06
Posts: 16
Credit: 13,398,872
RAC: 31
Canada
Message 1687517 - Posted: 4 Jun 2015, 0:42:47 UTC

I've got 3 PCs with dual-core AMD64 CPUs. One of these machines gets jobs which never end and need to be aborted.

The oldest machine is running a 64-bit Gentoo Linux OS; it can use both cores and lots of machine resources, and has lots of disk space and 4 GB of RAM. The newest machine is running a 64-bit Debian oldstable OS; it can use both cores and lots of machine resources, and has lots of disk space and 8 GB of RAM. The machine in the middle has only 3 GB of RAM, but lots of disk space.

It is only SETI jobs on this last machine causing problems.

This machine is different in three respects that stand out to me:
1. it is running a 32 bit OS (686)
2. it has a HD5450 graphics card (GPU) and is running GPU jobs from PrimeGrid
3. it only runs a single core for BOINC CPU projects, leaving the other core to respond to other things (like feeding the GPU).

Of late, jobs are supposed to take 3 hours or so, and after a day they still have an hour or so left. I used to let these jobs run down to 0 remaining time, but they never did end.

All I have been doing is aborting the jobs. Should a person try to do a core dump?

Longer term plans are to take the GPU jobs off this machine, because the newest machine has a HD6450 card in it, which is a little faster on GPU stuff. Nothing like what some people have.

But, if there is something useful to be learned to help SETI out, I can try and help. But, most of my debugging of number crunching jobs is adding print statements and looking at FORTRAN source.

The reason for having the HD5450 and HD6450 cards is to learn how to use them for code I write myself. I was just using PrimeGrid for testing.
ID: 1687517
Brent Norman
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1687528 - Posted: 4 Jun 2015, 1:07:36 UTC - in response to Message 1687517.  

My AMD 4200+ takes ~10 hours for a MB task, 50+ hours for an AP task.
ID: 1687528
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1687545 - Posted: 4 Jun 2015, 1:47:30 UTC - in response to Message 1687517.  

Both Task 4163788896 and Task 4184268020 show evidence of the timing measurements in the "Optimal function choices" having difficulty. That's been a problem on some hosts for a long time, and attempts to pin it down have failed.

Some users have found that restarting the task so that code will be retried from the beginning may get the task running normally. That is, perhaps suspending and resuming the task or exiting from BOINC and restarting it might help. Those are worth trying before aborting the task.

The other approach is to use a build which doesn't do that "Optimal function choices" code, the CPU builds from http://lunatics.kwsn.info for instance. That means using an Anonymous Platform configuration.
                                                                   Joe
ID: 1687545
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65740
Credit: 55,293,173
RAC: 49
United States
Message 1687552 - Posted: 4 Jun 2015, 2:28:49 UTC
Last modified: 4 Jun 2015, 2:30:32 UTC

I had a CPU job running after BOINC was scheduled to stop running at 8 am today. Of course, I wasn't running S@H at the time; this was running on its own.

I don't have any idea when this started, and I did nothing to start this WU. The only way to stop it was to abort it. I'm running BOINC 7.4.42 x64. I thought someone would like to know about this. Oh, and I am running anonymous.

GALACTICA

274	SETI@home	6/3/2015 1:22:55 PM	Computation for task 19se12ac.541.25422.438086664199.12.252_1 finished	
275	SETI@home	6/3/2015 1:22:57 PM	Started upload of 19se12ac.541.25422.438086664199.12.252_1_0	
276	SETI@home	6/3/2015 1:23:00 PM	Finished upload of 19se12ac.541.25422.438086664199.12.252_1_0	

The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1687552
BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1687558 - Posted: 4 Jun 2015, 3:22:55 UTC

A month ago we discussed this in:
"Crunching appears to stop":
http://setiathome.berkeley.edu/forum_thread.php?id=77279
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1687558
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1687561 - Posted: 4 Jun 2015, 3:29:30 UTC - in response to Message 1687552.  

Hmm, http://setiathome.berkeley.edu/result.php?result_name=19se12ac.541.25422.438086664199.12.252_1 shows BOINC only thought it had 14 seconds of Run time. It is interesting, but I don't think there's any way to guess what actually happened.
                                                                  Joe
ID: 1687561
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1687646 - Posted: 4 Jun 2015, 8:32:44 UTC - in response to Message 1687545.  
Last modified: 4 Jun 2015, 8:40:41 UTC

Joe, was pinning affinity to a single core tried before?
If not, I would recommend that such hosts use something like ProcessLasso to pin each app instance to a separate CPU core (and disallow CPU migration).
If the clocks of the CPU cores are de-synchronized, that can lead to weird results in any attempt to measure time...

EDIT: and if that helps, the code change (at least for Windows builds) is trivial: pin to core 0 (or implement round-robin core selection as in the GPU builds) immediately before the inner benchmark, run the benchmark, then allow CPU migration again after the benchmark finishes.

EDIT2: also, it would be good to start each benchmark on a quantum boundary, to avoid or reduce the chance of a context switch during the benchmark. For that, Sleep(0) could be added immediately before each new benchmark function, so that the benchmark starts at the beginning of a new quantum.
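
For Linux hosts, the external pinning idea (one app instance per core, no migration) could be sketched with the util-linux `taskset` command. This is only an illustration under stated assumptions: `pin_plan` is a hypothetical helper name, it assigns cores round-robin, and it only prints the commands instead of running them. Whether pinning actually cures the stuck timing measurements is not established in this thread.

```shell
# pin_plan NCORES PID...
# Print one "taskset -cp CORE PID" command per PID, assigning cores
# 0..NCORES-1 in round-robin order. Dry run: nothing is pinned here.
# (pin_plan is a hypothetical helper, not part of BOINC or SETI@home.)
pin_plan() {
    ncores=$1
    shift
    i=0
    for pid in "$@"; do
        echo "taskset -cp $((i % ncores)) $pid"
        i=$((i + 1))
    done
}
```

To apply the plan, the output can be piped to a shell, for example `pin_plan 2 $(pgrep setiathome) | sh`, assuming the science app's process name contains `setiathome` on your host.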
ID: 1687646
Fortran
Joined: 20 Apr 06
Posts: 16
Credit: 13,398,872
RAC: 31
Canada
Message 1687689 - Posted: 4 Jun 2015, 12:32:51 UTC - in response to Message 1687545.  

Some users have found that restarting the task so that code will be retried from the beginning may get the task running normally. That is, perhaps suspending and resuming the task or exiting from BOINC and restarting it might help. Those are worth trying before aborting the task.

The other approach is to use a build which doesn't do that "Optimal function choices" code, the CPU builds from http://lunatics.kwsn.info for instance. That means using an Anonymous Platform configuration.
                                                                   Joe


The current job that is demonstrating this had an initial time estimate of about 3 hours 20 minutes. It has now been running for almost 20 hours, and the time remaining was down to 2 minutes 38 seconds.

I suspended and resumed this job, and watching for a short while, I still don't see the time remaining behaving properly (it did count down to 2 minutes 37 seconds, just now at 2:32).

I am not compiling code for BOINC projects. I had to use the backports version of some boinc packages, in order to get the OpenCL support that worked for Debian oldstable. I have no idea if it is trying to do optimal function choices.
ID: 1687689
BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1687729 - Posted: 4 Jun 2015, 14:37:09 UTC - in response to Message 1687646.  

Joe, was affinity pinning to single core tried before?

I did my tests on a single-core CPU:
http://setiathome.berkeley.edu/forum_thread.php?id=75617&postid=1577361#1577361

http://setiathome.berkeley.edu/forum_thread.php?id=75617&postid=1575843#1575843

"It happens in 30-40% at start/restart of tasks on my K6-2+ but only when it is booted in Windows 2000
When the same system runs in Windows 98 it never happens"

This K6-2+ has a fixed clock (set by jumpers) at FSB 95 * 5.5 = 524 MHz
No PowerNow! on this Desktop computer (Shuttle VIA MVP3)

http://en.wikipedia.org/wiki/AMD_K6-2#K6-2.2B
http://www.anandtech.com/show/134
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1687729
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1687787 - Posted: 4 Jun 2015, 16:56:00 UTC - in response to Message 1687729.  
Last modified: 4 Jun 2015, 17:01:08 UTC

So, at least in your case, my proposal would not work at all :/
Well, maybe there are several reasons for the freeze, though...

EDIT: and of course ALL modern Windows versions are direct descendants of Win2k, not Win98, so quite possibly this issue was inherited without changes...
EDIT2: the easiest solution, then, is just to switch to the AKv8 Lunatics optimized apps and forget about this issue ;)
ID: 1687787
Fortran
Joined: 20 Apr 06
Posts: 16
Credit: 13,398,872
RAC: 31
Canada
Message 1688387 - Posted: 6 Jun 2015, 4:00:27 UTC - in response to Message 1687787.  


EDIT: and of course ALL modern Windows are direct descendants of Win2k, not Win98 so quite possibly this issue was inherited w/o changes...


I'm overjoyed that some Windows person got his problem solved hijacking a thread.

I just won't run SETI on the LINUX computer that was seeing the problems that I started this thread for.

Which solves the problem, and reduces how many computers are trying to help SETI!

Isn't that what we are all donating computer time for?
ID: 1688387
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1688395 - Posted: 6 Jun 2015, 4:19:24 UTC - in response to Message 1688387.  

...I just won't run SETI on the LINUX computer that was seeing the problems that I started this thread for....

The fix was in the 3rd post above;
The other approach is to use a build which doesn't do that "Optimal function choices" code, the CPU builds from http://lunatics.kwsn.info for instance. That means using an Anonymous Platform configuration.
                                                                   Joe

Just choose the app that works with your machine and use that instead of the stock app that is giving problems. The choices are here: http://lunatics.kwsn.info/index.php?module=Downloads;catd=48.
I'm using this version on my 2 Linux machines, works great; Linux 64bit Multibeam v7 for SSSE3 CPUs (r2549), July 2014
ID: 1688395
Zalster
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1688398 - Posted: 6 Jun 2015, 4:23:51 UTC - in response to Message 1688387.  
Last modified: 6 Jun 2015, 4:24:32 UTC

Edit...

Didn't see TBar's response. Follow his advice.
ID: 1688398
Fortran
Joined: 20 Apr 06
Posts: 16
Credit: 13,398,872
RAC: 31
Canada
Message 1688433 - Posted: 6 Jun 2015, 6:39:13 UTC - in response to Message 1688395.  


Just chose the app that works with your machine and use that instead of the stock app that is giving problems. The choices are here; http://lunatics.kwsn.info/index.php?module=Downloads;catd=48.
I'm using this version on my 2 Linux machines, works great; Linux 64bit Multibeam v7 for SSSE3 CPUs (r2549), July 2014


Okay, I downloaded the 32 bit SSSE2 package. 7z is not a common compression, but I have seen it. I am guessing I need to find an existing binary on that system, and copy the uncompressed application on top of it. And there will be no problems with shared libraries.

I'm not feeling a lot of warm fuzzies with this.

I will try, and I will let you know. If the package has been trojaned and this becomes apparent at some point, I won't run SETI binaries on any of my machines ever again. You want us to run stuff outside of the control of packaging systems with no checksums; we have our own requirements. And it matters not a whit how much I want to see the Artemis Project get mankind to the Moon.
ID: 1688433
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1688440 - Posted: 6 Jun 2015, 7:04:11 UTC - in response to Message 1688387.  


Especially for your needs: read EDIT2 and enjoy even more.
Those who can't read will never find the answer. FYI, AKv8 Lunatics apps exist for Linux too.
If you prefer only strict answers to your questions, there is a Q&A forum for that. Here we discuss topics relevant to the issue. And because this issue was detected on Windows too, I see no reason not to respond to that.
ID: 1688440
Werecow
Joined: 13 Mar 05
Posts: 56
Credit: 4,917,657
RAC: 3
United States
Message 1688486 - Posted: 6 Jun 2015, 12:12:12 UTC - in response to Message 1688433.  

7z is not a common compression

7-Zip is a widely used compression format and has been around since 1999 -- although, granted, not nearly as long as FORTRAN. :-)

If the package has been trojaned [...]

"File description: Optimized Seti@Home v7 (Multibeam) application (aka AKv8c) for 32bit Linux platform. Intended for all Intel and AMD CPUs with ssse3 or higher. MD5 : 062dcd86bb9158172e26215996de4c9a MBv7_7.05r2549_ssse3_linux32_CPU.7z Note : Probably not compatible with forthcoming "large" tasks!"

[bpm@oldbox bpm]$ md5sum MBv7_7.05r2549_ssse3_linux32_CPU.7z
062dcd86bb9158172e26215996de4c9a  MBv7_7.05r2549_ssse3_linux32_CPU.7z

So as of a few minutes ago, you're still safe on that count.
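
The manual comparison above can be turned into a small check that fails loudly on a mismatch. This is a minimal sketch assuming GNU `md5sum` is installed; `verify_md5` is a hypothetical helper name, not something shipped with the package. Note that a checksum published next to the download mainly guards against a corrupted or swapped file, not against a compromised site, and MD5 itself is no longer considered collision-resistant.

```shell
# verify_md5 FILE EXPECTED
# Succeed (exit 0) only if FILE's MD5 digest equals EXPECTED.
# (verify_md5 is a hypothetical helper name used for illustration.)
verify_md5() {
    file=$1
    expected=$2
    # md5sum prints "<digest>  <filename>"; keep only the digest field.
    actual=$(md5sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)" >&2
        return 1
    fi
}
```

With the values quoted above, the call would be `verify_md5 MBv7_7.05r2549_ssse3_linux32_CPU.7z 062dcd86bb9158172e26215996de4c9a`, run before unpacking the archive.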
ID: 1688486
Zalster
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1688491 - Posted: 6 Jun 2015, 12:22:56 UTC - in response to Message 1688486.  

If you are worried about 7-zip or even the program to decompress it, you can also download something called

WinRAR

There is a trial version that you can use for 30 days free and it will decompress 7z files.
I only mention this as an alternate to 7zip decompression programs if you are uncomfortable with them.

Zalster
ID: 1688491
Urs Echternacht
Volunteer tester
Joined: 15 May 99
Posts: 692
Credit: 135,197,781
RAC: 211
Germany
Message 1688919 - Posted: 7 Jun 2015, 14:26:40 UTC - in response to Message 1688387.  
Last modified: 7 Jun 2015, 14:31:45 UTC

Is there really SSSE3 on an older AMD CPU?
Before you download an alternative application for crunching SETI, you should check which level of extensions (see flags) your CPU has.
As you are on a newer Linux, the /proc filesystem should be available. To see the CPU flags, type at a prompt:
grep -m 1 flags /proc/cpuinfo

Another place to look is BOINC's "client_state.xml" file, which lists the CPU features that BOINC has detected. Change into the BOINC directory and type at a prompt:
grep -m 1 p_features client_state.xml

Now go through the flags and see which is the highest available of SSE, SSE2, PNI (= SSE3), SSSE3, SSE4.1, SSE4.2, AVX.

With that knowledge, you can choose your download at Lunatics.
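
Going through the flags by eye can also be scripted. The sketch below assumes the flag spellings used in /proc/cpuinfo (`pni` for SSE3, `sse4_1` and `sse4_2` for SSE4.1 and SSE4.2); `highest_simd` is a hypothetical helper name. It checks the levels from newest to oldest and prints the first one found.

```shell
# highest_simd "FLAGS LINE"
# Print the highest SIMD level named in a flags string, or "none".
# Flag spellings follow /proc/cpuinfo: pni = SSE3, sse4_1 = SSE4.1,
# sse4_2 = SSE4.2. (highest_simd is a hypothetical helper name.)
highest_simd() {
    # Collapse all whitespace to single spaces and pad both ends,
    # so each flag can be matched as a whole word.
    flags=" $(printf '%s' "$1" | tr -s '[:space:]' ' ') "
    for pair in avx:AVX sse4_2:SSE4.2 sse4_1:SSE4.1 ssse3:SSSE3 pni:SSE3 sse2:SSE2 sse:SSE; do
        flag=${pair%%:*}
        label=${pair#*:}
        case "$flags" in
            *" $flag "*) echo "$label"; return 0 ;;
        esac
    done
    echo "none"
    return 1
}
```

On the machine in question this could be run as `highest_simd "$(grep -m 1 flags /proc/cpuinfo)"`; the printed level is then the newest package variant worth trying.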
_\|/_
U r s
ID: 1688919
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.