Random Restarts of Astropulses
Author | Message |
---|---|
Zalster Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
I just noticed an unusual thing. I recently changed my settings to run 4 APs per GPU on the GTX 780 SC, with 0.6 CPU per 0.25 GPU. I've been watching it, and every so often the APs will start over for no apparent reason. It's never happened before, and I'm not sure if it's because of the number running on the card or the command line. My command line is -use sleep -unroll 16 -oclfft_plan 256 16 256 -ffa_block 16384 -ffa_block_fetch 8192 -tune 1 64 4 1 -tune 2 64 4 1 -hp, and so far none of the APs have completed and all have restarted. So I closed and reopened BOINC, and all the APs went right back to 0. I changed my settings back to 3 APs per card, 0.8 CPU per 0.33 GPU, and rebooted the computer. Now SIV shows no activity on GPU0. All APs restarted again; BOINC says they are progressing, so I am going to see if they finish. Wonder if the GPU has gone bad? Hmm. Suggestions? Zalster |
Zalster Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
I suspended the APs and it switched over to the MBs. Once they started, the GPU activity went back up to normal and it appears to be crunching those fine. I'm waiting to see if any of the other GPUs will pick up those APs and crunch them. |
Zalster Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
OK, I had to abort those 3 APs; they never got beyond 20% before restarting. I moved 1 each to a different GPU but they still wouldn't progress. Not sure why. If anyone has any ideas I would like to hear them. Thanks, Zalster |
woohoo Joined: 30 Oct 13 Posts: 972 Credit: 165,671,404 RAC: 5 |
Did you ever try resetting the project? A couple times my tasks were seemingly going in loops and generating more errors than usual but resetting fixed it. I guess the good thing is that you don't lose the queue if you reset. Of course since then I've reformatted a couple times and now I'm just using Windows' built-in video drivers along with optimized apps and the command lines from the readme files. I hope to not have to mess with it for a while. |
Mike Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80 |
Just go back to running 3 instances, if that's the host with 3 780s. 12 instances is a bit much for 8 cores. With each crime and every kindness we birth our future. |
Zalster Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
Thanks Mike, that's what I just figured out, lol. I was starting to see a lot of inconclusives popping up in the results of that machine. I don't know why I tinker with these things, I should know better, lol. Zalster |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
When the App Restarts it's usually because of an Error. Due to BOINC's See Nothing, Report Nothing Event Log you have to look at the stderr.txt in the Slot Folder to find the Error. Select the task and Hit Properties to find which Slot it's in. Look near the bottom of the stderr.txt to see if there are any Errors. My Guess would be out of memory... |
Zalster Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
That makes sense, TBar. I had aborted those APs before you posted, so it's hard to tell. I've returned all the values back to what I had before. Since then, I haven't seen any more looping or inconclusives pop up, but I'm sure there will be. I noted the time when I changed everything and when the last of the MBs using the modified settings were completed, so I will know whether any new inconclusives that pop up fall within that time frame. Thanks Mike and TBar. WooHoo, I figured it was something I did when I started to move those APs to different GPUs and they would start. Watching SIV I could see the GPUs were being utilized. But as I increased the number of work units on those GPUs, that is when I saw the looping. So once I knew it wasn't the GPU, I changed the #AP/GPU back but didn't do the same for the MBs. TBar's statement makes sense, as does Mike's. Once I limited my total to 9 work units, everything seemed to settle down. OK, time for bed. Thanks again everyone. |
BilBg Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
Zalster wrote: "I just noticed an unusual thing. I recently changed my setting to run 4 APs per GPU on the GTX 780 SC. I have 0.6 CPU per 0.25 GPU" Be careful with those values: 3 GPUs * 4 tasks = 12 GPU tasks; 12 * 0.6 = 7.2, i.e. 7 cores. This means BOINC will need to see 7 cores as free/usable to start GPU tasks. If you also run CPU tasks (from any project) or have 'On multiprocessors, use at most XX% of the processors' set, this may limit BOINC to fewer than 7 cores available to 'allocate' for GPU tasks. But in this case the problem was "some exception" ;) : ERROR: some exception inside long FFA, probably video-driver restart, restarting app... http://setiathome.berkeley.edu/result.php?resultid=3789058611 - ALF - "Find out what you don't do well ..... then don't do it!" :) |
Mike Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80 |
That's just theory, BilBg. The rest of the system needs resources too. With each crime and every kindness we birth our future. |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
"FFA thread block override value:16384" I would not recommend using such big values for -ffa_block until it's proven for the particular host that 16k threads perform markedly faster than, let's say, 8k threads. Though even 16k threads can underload a modern high-end GPU, there could be another issue, discussed here: http://setiathome.berkeley.edu/forum_thread.php?id=75863&postid=1589263 (and a few posts below that). Excessive memory usage can lead to a crash with an out-of-memory condition, especially if many GPU tasks are in flight and a few of them have an overflow result. |
Zalster Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
Since I returned to 3 tasks per GPU with the listed values I haven't had any more crashes or looping of any APs. I've kept a close eye on that machine and everything has been stable. I saw Juan's thread but didn't realize it was the same thing, since I never checked to see how much memory each AP was using. Does SIV allow you to see how much memory is being used? Thanks Raistmer. |
BilBg Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
Zalster wrote: "Does SIV allow you to see how much memory is being used?" No. Just use Windows Task Manager or Process Explorer to see the current use. BoincTasks also shows this, on the Tasks and History tabs (you need to enable the columns). The History tab is especially useful, as it shows the peak RAM usage during the task run (and you can check what it was for past tasks). - ALF - "Find out what you don't do well ..... then don't do it!" :) |
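
BilBg's arithmetic in the thread (3 GPUs * 4 tasks = 12 GPU tasks, 12 * 0.6 = 7.2 cores) can be reproduced for any combination of settings. The following is only a rough Python sketch: the function name `gpu_task_cpu_budget` is made up for illustration, it assumes every GPU runs the same number of tasks, and it does not model how BOINC itself rounds or schedules the reservation.

```python
import math

def gpu_task_cpu_budget(num_gpus, gpu_per_task, cpu_per_task):
    """Return (total GPU tasks, summed CPU reservation) for one host.

    Hypothetical helper: it only repeats the thread's arithmetic.
    """
    # The small epsilon keeps a result such as 1 / 0.25 from flooring one too low.
    tasks_per_gpu = math.floor(1 / gpu_per_task + 1e-9)
    total_tasks = num_gpus * tasks_per_gpu
    return total_tasks, total_tasks * cpu_per_task

# The two configurations mentioned in the thread, on a host with 3 GPUs.
for label, gpu_share, cpu_share in [("4 per GPU", 0.25, 0.6),
                                    ("3 per GPU", 0.33, 0.8)]:
    tasks, cpus = gpu_task_cpu_budget(3, gpu_share, cpu_share)
    print(f"{label}: {tasks} GPU tasks, about {cpus:.1f} CPU cores reserved")
```

Comparing the reserved figure with the number of cores on the host (8 in Mike's post) shows how little headroom is left for the rest of the system, which is the point Mike raises.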
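
TBar's suggestion in the thread, to look near the bottom of stderr.txt in the task's slot folder, can also be done for all slots at once. This is a minimal Python sketch, not a BOINC tool: the function name `find_slot_errors`, the `keyword` and `tail_lines` parameters, and the example data directory path are all assumptions, so adjust the path to wherever your BOINC data directory actually lives.

```python
from pathlib import Path

def find_slot_errors(boinc_data_dir, keyword="error", tail_lines=40):
    """Map each slot's stderr.txt path to the matching lines near its end.

    Hypothetical helper; it only reads files and changes nothing.
    """
    hits = {}
    slots_dir = Path(boinc_data_dir) / "slots"
    for stderr_path in sorted(slots_dir.glob("*/stderr.txt")):
        lines = stderr_path.read_text(errors="replace").splitlines()
        # Only the tail is checked, following the "look near the bottom" advice.
        matches = [ln for ln in lines[-tail_lines:] if keyword.lower() in ln.lower()]
        if matches:
            hits[str(stderr_path)] = matches
    return hits

if __name__ == "__main__":
    # Assumed location of the BOINC data directory on Windows; change as needed.
    for path, lines in find_slot_errors(r"C:\ProgramData\BOINC").items():
        print(path)
        for line in lines:
            print("   ", line)
```

A message like the "some exception inside long FFA" line quoted by BilBg would show up in this listing while the task is still in its slot. Once a task has been aborted, as Zalster did, its slot contents may no longer be available to inspect.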