Random Restarts of Astropulses

Message boards : Number crunching : Random Restarts of Astropulses

Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1589089 - Posted: 20 Oct 2014, 2:53:00 UTC

I just noticed something unusual. I recently changed my settings to run 4 APs per GPU on the GTX 780 SC, with 0.6 CPU per 0.25 GPU. I've been watching it, and every so often the APs start over for no apparent reason. It has never happened before, and I'm not sure whether it's because of the number of tasks running on the card or the command line. This is what I use:

-use sleep -unroll 16 -oclfft_plan 256 16 256 -ffa_block 16384 -ffa_block_fetch 8192 -tune 1 64 4 1 -tune 2 64 4 1 -hp

So far none of the APs have completed and all have restarted. So I closed and reopened BOINC, and all the APs went right back to 0. I changed my settings back to 3 APs per card, 0.8 CPU per 0.33 GPU, and rebooted the computer. Now SIV shows no activity on GPU0. All the APs restarted again; BOINC says they are progressing, so I am going to see if they finish. I wonder if the GPU has gone bad? Hmm.


Suggestions?

Zalster
ID: 1589089
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1589091 - Posted: 20 Oct 2014, 3:10:46 UTC - in response to Message 1589089.  

I suspended the APs and it switched over to the MBs. Once they started, the GPU activity went back up to normal and it appears to be crunching those fine. I'm waiting to see if any of the other GPUs will pick up those APs and crunch them.
ID: 1589091
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1589106 - Posted: 20 Oct 2014, 3:34:11 UTC - in response to Message 1589091.  

OK, I had to abort those 3 APs; they never got beyond 20% before restarting. I moved one each to a different GPU but they still wouldn't progress. Not sure why. If anyone has any ideas, I would like to hear them.


Thanks

Zalster
ID: 1589106
woohoo
Volunteer tester

Joined: 30 Oct 13
Posts: 972
Credit: 165,671,404
RAC: 5
United States
Message 1589109 - Posted: 20 Oct 2014, 3:46:56 UTC

Did you ever try resetting the project? A couple times my tasks were seemingly going in loops and generating more errors than usual but resetting fixed it. I guess the good thing is that you don't lose the queue if you reset. Of course since then I've reformatted a couple times and now I'm just using Windows' built-in video drivers along with optimized apps and the command lines from the readme files. I hope to not have to mess with it for a while.
ID: 1589109
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1589115 - Posted: 20 Oct 2014, 3:50:34 UTC

Just go back to running 3 instances,
if that's the host with three 780s.
12 instances is a bit much for 8 cores.


With each crime and every kindness we birth our future.
ID: 1589115
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1589119 - Posted: 20 Oct 2014, 3:55:21 UTC - in response to Message 1589115.  
Last modified: 20 Oct 2014, 4:21:38 UTC

Thanks Mike,


That's what I just figured out, lol. I was starting to see a lot of inconclusives popping up in the results of that machine. I don't know why I tinker with these things; I should know better, lol.


Zalster
ID: 1589119
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1589125 - Posted: 20 Oct 2014, 4:01:09 UTC - in response to Message 1589119.  

When the App Restarts it's usually because of an Error. Due to BOINC's See Nothing, Report Nothing Event Log you have to look at the stderr.txt in the Slot Folder to find the Error. Select the task and Hit Properties to find which Slot it's in. Look near the bottom of the stderr.txt to see if there are any Errors. My Guess would be out of memory...
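
A minimal sketch of that check in Python, assuming the usual layout where each running task has a numbered folder under `slots` in the BOINC data directory. The function names, the 50-line tail, and the simple case-insensitive match on "error" are illustrative assumptions, not part of BOINC:

```python
from pathlib import Path


def find_error_lines(stderr_text: str, tail: int = 50) -> list[str]:
    """Return lines mentioning 'error' among the last `tail` lines of the text."""
    lines = stderr_text.splitlines()
    return [ln.strip() for ln in lines[-tail:] if "error" in ln.lower()]


def scan_slots(data_dir: Path, tail: int = 50) -> dict[str, list[str]]:
    """Map each slot's stderr.txt path to the error lines found near its end."""
    hits: dict[str, list[str]] = {}
    # Assumed layout: <data_dir>/slots/<N>/stderr.txt, one folder per running task.
    for path in sorted(data_dir.glob("slots/*/stderr.txt")):
        found = find_error_lines(path.read_text(errors="replace"), tail)
        if found:
            hits[str(path)] = found
    return hits
```

Calling `scan_slots` with the path to your own BOINC data directory returns a dictionary keyed by stderr.txt path. An empty result only means nothing matched near the end of each file, so it is still worth opening the file by hand if a task keeps restarting.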
ID: 1589125
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1589143 - Posted: 20 Oct 2014, 4:31:08 UTC - in response to Message 1589125.  

That makes sense, TBar. I had aborted those APs before you posted, so it's hard to tell. I've returned all the values back to what I had before.

Since then, I haven't seen any more looping or inconclusives pop up, but I'm sure there will be some. I noted the time when I changed everything and when the last of the MBs using the modified settings were completed, so I will know whether any new inconclusives that pop up fall within that time frame.

Thanks Mike and TBar. WooHoo, I figured it was something I did when I started to move those APs to different GPUs and they would start. Watching SIV, I could see the GPUs were being utilized.

But as I increased the number of work units on those GPUs, that is when I saw the looping. So once I knew it wasn't the GPU, I changed the #AP/GPU back but didn't do the same for the MBs.

TBar's statement makes sense, as does Mike's. Once I limited my total to 9 work units, everything seemed to settle down. OK, time for bed.

Thanks again everyone.
ID: 1589143
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1589203 - Posted: 20 Oct 2014, 8:24:17 UTC - in response to Message 1589089.  
Last modified: 20 Oct 2014, 8:29:28 UTC

I just noticed an unusual thing. I recently changed my setting to run 4 APs per GPU on the GTX 780 SC. I have 0.6 CPU per 0.25 GPU

Be careful with those values

3 GPUs * 4 tasks = 12 GPU tasks

12 * 0.6 = 7.2 ≈ 7 cores

This means BOINC will need to see 7 cores as free/usable to start GPU tasks.
If you also run CPU tasks (from any project) or have 'On multiprocessors, use at most XX% of the processors' set, this may limit BOINC to fewer than 7 cores available to 'allocate' for GPU tasks.
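
The same arithmetic as a small Python sketch. It assumes 3 GPUs, as in this thread, and uses exact fractions so that 0.6 and 0.8 do not pick up floating-point noise. `reserved_cpu` is a hypothetical helper, and rounding down to whole cores simply follows the figure above rather than any documented BOINC rule:

```python
import math
from fractions import Fraction


def reserved_cpu(num_gpus: int, tasks_per_gpu: int, cpu_per_task: str) -> Fraction:
    """Total CPU budgeted for GPU tasks, computed exactly from a decimal string."""
    return num_gpus * tasks_per_gpu * Fraction(cpu_per_task)


# 4 APs per GPU at 0.6 CPU each (the setting that looped): 12 tasks.
four_per_gpu = reserved_cpu(3, 4, "0.6")
# 3 APs per GPU at 0.8 CPU each (the earlier setting): 9 tasks.
three_per_gpu = reserved_cpu(3, 3, "0.8")

for label, total in (("4 per GPU", four_per_gpu), ("3 per GPU", three_per_gpu)):
    print(label, float(total), "-> whole cores:", math.floor(total))
```

Both settings come out at 7.2, so by this arithmetic the CPU budget is the same whether 12 tasks run at 0.6 CPU each or 9 tasks run at 0.8 CPU each.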


But in this case the problem was "some exception" ;) :
ERROR: some exception inside long FFA, probably video-driver restart, restarting app...
http://setiathome.berkeley.edu/result.php?resultid=3789058611
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1589203
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1589216 - Posted: 20 Oct 2014, 9:40:12 UTC

That's just theory, BilBg.
The rest of the system needs resources too.


With each crime and every kindness we birth our future.
ID: 1589216
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1590656 - Posted: 23 Oct 2014, 11:32:51 UTC
Last modified: 23 Oct 2014, 11:35:47 UTC

FFA thread block override value:16384
FFA thread fetchblock override value:8192


I would not recommend using such big values for -ffa_block until it has been proven, for the particular host, that 16k threads perform markedly faster than, let's say, 8k threads.

Though even 16k threads can underload a modern high-end GPU, there could be another issue, discussed here: http://setiathome.berkeley.edu/forum_thread.php?id=75863&postid=1589263 (and a few posts below that).

Excessive memory usage can lead to a crash with an out-of-memory condition, especially if many GPU tasks are in flight and a few of them have an overflow result.
ID: 1590656
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1590683 - Posted: 23 Oct 2014, 13:05:29 UTC - in response to Message 1590656.  

Since I returned to 3 tasks per GPU with the listed values, I haven't had any more crashes or looping of any APs. I've kept a close eye on that machine and everything has been stable.

I saw Juan's thread but didn't realize it was the same thing, since I never checked to see how much memory each AP was using.

Does SIV allow you to see how much memory is being used? Thanks Raistmer.
ID: 1590683
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1590689 - Posted: 23 Oct 2014, 13:25:59 UTC - in response to Message 1590683.  

Does SIV allow you to see how much memory is being used?

No

Just use Windows Task Manager or Process Explorer to see the current usage.

BoincTasks also shows this, on the Tasks and History tabs (you need to enable the columns).

The History tab is especially useful, as it shows the peak RAM usage during the task run (and you can check what it was for past tasks).
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1590689



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.