Odd Result with 03ap18 WU in SoG - had to abort

Message boards : Number crunching : Odd Result with 03ap18 WU in SoG - had to abort
Cruncher-American (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1929272 - Posted: 10 Apr 2018, 21:19:39 UTC

I had an odd result from an Arecibo WU running under the OpenCL app earlier today - it seemed to be "stuck". These WUs normally take up to 30 minutes on my machine(s), but this one had run for nearly 2 hours without being anywhere near complete. It was from the 03ap18 file that was processed by the servers today.

Come to think of it, a few days ago I noticed 4 recent (March or April 2018) Arecibo WUs that had run for an hour or 2 and had estimated completion times of more than 1 DAY (!!). Which I aborted, of course.

Has anyone else seen any such WUs? Any idea what the problem is? It seems like these might be a special case that (nearly) kills the app in some way. But I sure don't know!
ID: 1929272
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1929294 - Posted: 10 Apr 2018, 22:38:25 UTC - in response to Message 1929272.  

I had an odd result from an Arecibo WU running under the OpenCL app earlier today - it seemed to be "stuck". These WUs normally take up to 30 minutes on my machine(s), but this one had run for nearly 2 hours without being anywhere near complete. It was from the 03ap18 file that was processed by the servers today.

Come to think of it, a few days ago I noticed 4 recent (March or April 2018) Arecibo WUs that had run for an hour or 2 and had estimated completion times of more than 1 DAY (!!). Which I aborted, of course.

Has anyone else seen any such WUs? Any idea what the problem is? It seems like these might be a special case that (nearly) kills the app in some way. But I sure don't know!

I just had to abort two 02dc17ad tasks that were not progressing; the countdown timer showed tens of thousands of seconds to finish, in High Priority mode.

I couldn't get them to reset, even after suspending them, exiting BOINC, and deleting all the files in their slots except for the science app. They just came back each time like a bad penny. In the end I just aborted them.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1929294
Cruncher-American (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1929299 - Posted: 10 Apr 2018, 22:53:34 UTC

Thanks, Keith...


The exact WU is (was!):

03ap18ad.5572.22189.16.43.93_0
ID: 1929299
Cruncher-American (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1929335 - Posted: 11 Apr 2018, 2:00:13 UTC

I just noticed this one:

12dc17aa.30534.811722.3.30.237_0

It had run for only a couple of minutes, but the time left was over an hour, and increasing. I aborted it.

Also, the card it was running on showed very little work being done (in SIV) - near 0% GPU utilization, where it normally runs at 90-99%.

It appears that there may be a defect in the SoG app, or the WUs created from the GBT data may be faulty in some way that causes the SoG app to loop or race.

BAD NEWS!
ID: 1929335
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1929346 - Posted: 11 Apr 2018, 2:51:30 UTC - in response to Message 1929335.  

That was another Arecibo VLAR of 10 AR. None of them are running correctly on the SoG or CPU app.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1929346
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1929429 - Posted: 11 Apr 2018, 16:43:10 UTC - in response to Message 1929346.  

Tut posted over in NC about my curious stalled high-angle tasks. It seems they ran across these in testing at Beta.
Message 1929367

It looks like we should have just let them run out and ignored the High Priority status and the ludicrous completion times.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1929429
Cruncher-American (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)

Send message
Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1929781 - Posted: 13 Apr 2018, 14:07:56 UTC

Well, this turned out to be my fault. I run 1080s, so I cribbed the 1080ti command line from the "Best tuning for 1080ti and Process Lasso use" thread, thinking that, well, a 1080 is a LOT like a 1080ti (right?). It is:

-v 0 -tt 1500 -period_iterations_num 1 -high_perf -high_prec_timer -sbs 2048 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 256 -oclfft_tune_cw 256

(by cut and paste, so I didn't add any errors while typing).

Not being smart enough to understand all the params, I used it as is, except for changing -period_iterations_num from 1 to 10.

When I changed back to my simple command line ("-sbs 512 -period_iterations_num 10") the problems with occasional vastly lengthened times for Arecibo MBs on GPUs completely went away.

I still would like to try a similarly enhanced command line, if someone could point out where the 1080ti command line needs to be changed for my poor old 1080s, since wall clock time for most WUs went down about 10% using the one above...
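
For anyone wondering where that command line actually lives: with an anonymous-platform (Lunatics-style) install it goes either in the mb_cmdline*.txt file next to the SoG executable (the exact name matches the app build), or in the <cmdline> element of app_info.xml. A minimal sketch of the app_info.xml route - the app name and plan class are the usual SoG ones, but the version number and executable name below are placeholders, so check them against your own install:

<app_version>
    <app_name>setiathome_v8</app_name>
    <version_num>800</version_num>
    <plan_class>opencl_nvidia_SoG</plan_class>
    <cmdline>-sbs 512 -period_iterations_num 10</cmdline>
    <avg_ncpus>0.04</avg_ncpus>
    <coproc>
        <type>NVIDIA</type>
        <count>1</count>
    </coproc>
    <file_ref>
        <file_name>MB8_win_x86_SSE3_OpenCL_NV_SoG.exe</file_name>
        <main_program/>
    </file_ref>
</app_version>

app_info.xml is only read when the client starts, so restart BOINC after editing it.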
ID: 1929781
Richard Haselgrove (Project Donor)
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1929785 - Posted: 13 Apr 2018, 14:37:27 UTC - in response to Message 1929294.  

I couldn't get them to reset, even after suspending them, exiting BOINC, and deleting all the files in their slots except for the science app. They just came back each time like a bad penny. In the end I just aborted them.
Deleting files is pointless - BOINC is designed to be self-healing when run over poor communication lines, which might result in damaged files.

Instead, you have to remove the meta-information describing the task in client_state.xml - very carefully.
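
For completeness, the meta-information meant here looks roughly like this - a sketch only, using the task name from this thread; the real blocks contain many more fields (left as "..." below). BOINC must be fully stopped before touching the file, and the <workunit> and its matching <result> entry have to be removed together:

<workunit>
    <name>03ap18ad.5572.22189.16.43.93</name>
    ...
</workunit>
<result>
    <name>03ap18ad.5572.22189.16.43.93_0</name>
    <wu_name>03ap18ad.5572.22189.16.43.93</wu_name>
    ...
</result>

Get one element wrong and the client can reject the whole state file, which is why "very carefully" is the operative phrase.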
ID: 1929785
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1929792 - Posted: 13 Apr 2018, 16:52:30 UTC - in response to Message 1929781.  

Well, this turned out to be my fault. I run 1080s, so I cribbed the 1080ti command line from the "Best tuning for 1080ti and Process Lasso use" thread, thinking that, well, a 1080 is a LOT like a 1080ti (right?). It is:

-v 0 -tt 1500 -period_iterations_num 1 -high_perf -high_prec_timer -sbs 2048 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 256 -oclfft_tune_cw 256

(by cut and paste, so I didn't add any errors while typing).

Not being smart enough to understand all the params, I used it as is, except for changing -period_iterations_num from 1 to 10.

When I changed back to my simple command line ("-sbs 512 -period_iterations_num 10") the problems with occasional vastly lengthened times for Arecibo MBs on GPUs completely went away.

I still would like to try a similarly enhanced command line, if someone could point out where the 1080ti command line needs to be changed for my poor old 1080s, since wall clock time for most WUs went down about 10% using the one above...

I run my 1080 with mostly that command line, except for the verbosity, and have no issues. I would just reduce -sbs 2048 to -sbs 1024 and maybe increase -period_iterations_num from 1 to 2 or 4 if necessary.
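
Put together, the adjusted line for a 1080 would look like this (assuming the rest of the 1080ti parameters carry over unchanged - only -sbs and -period_iterations_num differ):

-v 0 -tt 1500 -period_iterations_num 2 -high_perf -high_prec_timer -sbs 1024 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 256 -oclfft_tune_cw 256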
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1929792
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1929793 - Posted: 13 Apr 2018, 16:55:23 UTC - in response to Message 1929785.  

I couldn't get them to reset, even after suspending them, exiting BOINC, and deleting all the files in their slots except for the science app. They just came back each time like a bad penny. In the end I just aborted them.
Deleting files is pointless - BOINC is designed to be self-healing when run over poor communication lines, which might result in damaged files.

Instead, you have to remove the meta-information describing the task in client_state.xml - very carefully.

Thank you very much for your comment, Richard. That explains exactly what I was seeing. I was just following suggestions from others with a similar stuck-task problem, who solved it by either stopping and restarting BOINC or deleting the state file to reset the task. It has never worked for me, and I couldn't understand why when they always said it worked for them.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1929793
Richard Haselgrove (Project Donor)
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1929796 - Posted: 13 Apr 2018, 17:15:00 UTC - in response to Message 1929793.  

I was just following suggestions from others with a similar stuck-task problem, who solved it by either stopping and restarting BOINC or deleting the state file to reset the task. It has never worked for me, and I couldn't understand why when they always said it worked for them.
Just like googling to solve a software problem - unless you design the search very, very carefully, all you find is 100 people asking the same question and three people giving the wrong answer.

And I am very careful to avoid the 'sledgehammer and two short planks' school of computer maintenance - which is why 2901600 is still running the same OEM software, eleven and a half years and two operating system upgrades later.
ID: 1929796
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1929835 - Posted: 13 Apr 2018, 23:39:31 UTC - in response to Message 1929781.  

When I changed back to my simple command line ("-sbs 512 -period_iterations_num 10") the problems with occasional vastly lengthened times for Arecibo MBs on GPUs completely went away.

There are some Arecibo WUs that aren't marked as VLAR but have much longer runtimes than the usual Arecibo WU (closer to Arecibo VLAR runtimes). And we are now crunching Arecibo VLARs on NVidia GPUs - just under double the crunch time of an average Arecibo WU.
Also, values that work best for running 1 WU at a time won't necessarily be best for running 2 WUs at a time.
And keep in mind that if you run 2 WUs at a time, and one is Arecibo & the other is GBT, then the Arecibo WU will take up to twice as long to crunch as it would running alongside another Arecibo WU.

There are plenty of people here with a much higher RAC than mine who run 2 WUs at a time using SoG. But the fact is, on my system I've found I get the most work per hour just running 1 WU at a time. Your Mileage May Vary.
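
To put numbers on "work per hour", here is a minimal sketch of the arithmetic in Python - every runtime below is a made-up placeholder, to be replaced with times measured on your own host:

# Throughput of one GPU: 1 WU at a time vs 2 at a time.
# All runtimes (minutes) are hypothetical placeholders.
def wus_per_hour(n_concurrent, minutes_per_wu):
    # n_concurrent tasks, each finishing in minutes_per_wu minutes
    return n_concurrent * 60.0 / minutes_per_wu

print(wus_per_hour(1, 25.0))  # 1-up Arecibo:          2.40 WUs/hour
print(wus_per_hour(2, 45.0))  # 2-up Arecibo+Arecibo:  2.67 WUs/hour
print(wus_per_hour(2, 55.0))  # 2-up Arecibo+GBT mix:  2.18 WUs/hour

Two-up only wins while each task finishes in less than twice its one-up time; with these placeholder numbers the mixed Arecibo/GBT pair crosses that line, which matches the pattern described above.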
Grant
Darwin NT
ID: 1929835
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1929839 - Posted: 14 Apr 2018, 0:11:07 UTC - in response to Message 1929835.  

About a week ago or thereabouts, I switched the Windows 10 SoG cruncher from 2 tasks per gpu back to 1, primarily just for simplification across the 3 projects I run on it simultaneously. I did see a drop of about 4-5K RAC after the configuration change. But in the end it was probably a fortuitous decision, or a case of clairvoyance, given the recent change to allow VLARs on Nvidia. So I didn't run into the scenario you describe of mixed antenna tasks on the same card, or of sharing a card with another project at the same time.

The RAC has actually improved a bit, by 1-2K I think, because of the higher credit award for the new VLARs. Whether that lasts under CreditNew is still to be decided.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1929839
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1929854 - Posted: 14 Apr 2018, 1:28:35 UTC - in response to Message 1929839.  

About a week ago or thereabouts, I switched the Windows 10 SoG cruncher from 2 tasks per gpu back to 1, primarily just for simplification across the 3 projects I run on it simultaneously. I did see a drop of about 4-5K RAC after the configuration change.

With the weirdness of CreditNew, and the time it takes for RAC to reflect actual processing rates, I just based my decision to go with 1 WU at a time on the number of WUs of a given type it was able to crunch per hour. For some WUs, 2 at a time gave more per hour; for others (such as Arecibo & GBT together), it really killed the number crunched per hour. So overall, for me, 1 at a time gives the greatest number crunched per hour (for SoG).
Whether or not the Credit I get reflects that is an entirely different story.
Grant
Darwin NT
ID: 1929854
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1929855 - Posted: 14 Apr 2018, 1:55:21 UTC - in response to Message 1929854.  

I never really analyzed what happened when there were a GBT and an Arecibo task on the card at the same time. I mostly just look at the macro numbers I get from the BoincTasks daily and weekly reports and at the trend line in the BOINC Manager Statistics plots, and allow with a fudge factor for what was coming through the splitters and RTS buffers.

To test it scientifically would probably mean running the benchmark app with two tasks on board and the four cases of GBT and Arecibo task mixes. I never got that anal about it.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1929855
Zalster (Special Project $250 donor)
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1929864 - Posted: 14 Apr 2018, 4:04:20 UTC - in response to Message 1929855.  

It was tested on Beta. SoG was faster, and any lengthening of the Arecibo VLARs was only 1-2 minutes over their already prolonged runtimes.

If you really want to see a slowdown, run an AP OpenCL task alongside an Arecibo VLAR OpenCL task....
ID: 1929864
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1929892 - Posted: 14 Apr 2018, 10:02:07 UTC - in response to Message 1929864.  

In the ideal case, an app would run only a single instance per device. The more slowdown you get from running 2 tasks simultaneously (on the same hardware), the better optimized the single app instance is.
That means AP OpenCL uses the GPU much better than OpenCL MB does.
Running 2 instances per device always has inherent overhead (flushing all kinds of caches; context switching). So it is preferable only where free computational resources exist - either partially loaded CUs, or idle time intervals big enough to offset the inherent overhead mentioned.
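
Writing t1 for a task's runtime alone and t2 for its runtime with a second instance alongside, that rule of thumb reduces to: two instances pay off only while t2 < 2 * t1. A one-line check with placeholder numbers:

# Break-even test for 2 instances per device (placeholder times, minutes).
t1, t2 = 25.0, 45.0
print("2 instances pay off:", t2 < 2 * t1)  # True here, since 45 < 50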
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1929892
