Panic Mode On (113) Server Problems?

Message boards : Number crunching : Panic Mode On (113) Server Problems?

Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1958894 - Posted: 6 Oct 2018, 20:17:28 UTC - in response to Message 1958891.  

No Rescheduler will move a task that is in the process of being crunched. So this warning is not valid.
The warning would also apply to any computer reboot (for whatever reason), which will demand careful management of the shutdown/restart process.
ID: 1958894
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1958907 - Posted: 6 Oct 2018, 20:57:43 UTC - in response to Message 1958891.  
Last modified: 6 Oct 2018, 21:01:13 UTC

No Rescheduler will move a task that is in the process of being crunched. So this warning is not valid.


I was not talking about that. What I was warning about is what happens when the rescheduler ends the active tasks (the ones being crunched at that moment).
Sometimes (be aware, not always), when the rescheduler ends the process while a WU is being crunched (normally near the end of the run) and the crunching process is restarted after the other tasks have been rescheduled, the task enters some kind of limbo (I don't know a better word to explain it): it stops crunching on the GPU and is sent to the CPU instead. A message about that is written to the task's stderr. I don't remember the exact wording, but it says something like "this PoT will be processed on the CPU". In this case the task is crunched on the CPU and will end in an error because of the large difference in processing time (<2 min on the GPU vs. about an hour on the CPU).
It's hard for me to explain, sorry, but I found an example:

https://setiathome.berkeley.edu/result.php?resultid=7036454876

<core_client_version>7.4.44</core_client_version>
<![CDATA[
<message>
aborted by user
</message>
<stderr_txt>
setiathome_CUDA: Found 4 CUDA device(s):
  Device 1: GeForce GTX 1080 Ti, 11178 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 28 
     pciBusID = 2, pciSlotID = 0
  Device 2: GeForce GTX 1080 Ti, 11178 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 28 
     pciBusID = 3, pciSlotID = 0
  Device 3: GeForce GTX 1070, 8119 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 15 
     pciBusID = 1, pciSlotID = 0
  Device 4: GeForce GTX 1070, 8118 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 15 
     pciBusID = 4, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 2
setiathome_CUDA: CUDA Device 2 specified, checking...
   Device 2: GeForce GTX 1080 Ti is okay
SETI@home using CUDA accelerated device GeForce GTX 1080 Ti
Using pfb = 32 from command line args
Unroll autotune 28. Overriding Pulse find periods per launch. Parameter -pfp set to 28

setiathome v8 enhanced x41p_V0.97b2, Cuda 9.20 special
Compiled with NVCC, using static libraries. Modifications done by petri33 and released to the public by TBar.



Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.009971
Sigma 66
Sigma > GaussTOffsetStop: 66 > -2
Thread call stack limit is: 1k
Pulse: peak=1.618832, time=45.86, period=2.334, d_freq=8507364409.96, score=1.005, chirp=12.084, fft_len=1024 
Pulse: peak=2.241779, time=45.86, period=4.099, d_freq=8507366942.69, score=1.001, chirp=12.236, fft_len=1024 
Triplet: peak=12.25685, time=47.08, period=24.96, d_freq=8507362703.79, chirp=-24.472, fft_len=64 
Pulse: peak=4.555198, time=45.84, period=10.72, d_freq=8507360686.14, score=1.001, chirp=-24.778, fft_len=512 
Triplet: peak=11.89313, time=42.77, period=29.39, d_freq=8507368499.05, chirp=-27.837, fft_len=1024 
Pulse: peak=1.695994, time=45.82, period=2.573, d_freq=8507368463.68, score=1.008, chirp=-32.119, fft_len=256 
Pulse: peak=5.521799, time=45.9, period=11.81, d_freq=8507368132.32, score=1.019, chirp=32.541, fft_len=2k
Triplet: peak=11.37938, time=25.03, period=6.129, d_freq=8507361389.06, chirp=-34.26, fft_len=32 
Pulse: peak=4.671387, time=45.84, period=11.18, d_freq=8507360782.25, score=1.025, chirp=-34.872, fft_len=512 
Pulse: peak=1.670215, time=45.84, period=2.807, d_freq=8507366596.34, score=1.01, chirp=-35.79, fft_len=512 
Triplet: peak=11.14148, time=32.66, period=27.32, d_freq=8507369540.28, chirp=38.543, fft_len=256 
Pulse: peak=9.656481, time=46.17, period=26.49, d_freq=8507370526.14, score=1.051, chirp=-53.924, fft_len=8k
Pulse: peak=3.450325, time=45.84, period=6.297, d_freq=8507368261.54, score=1.039, chirp=56.133, fft_len=512 
Pulse: peak=4.771037, time=45.82, period=9.306, d_freq=8507367791.5, score=1.008, chirp=-62.404, fft_len=128 
Pulse: peak=7.937146, time=45.86, period=18.4, d_freq=8507361460.06, score=1.023, chirp=62.786, fft_len=1024 
Pulse: peak=2.265188, time=45.86, period=3.909, d_freq=8507367626.89, score=1.013, chirp=-63.016, fft_len=1024 
Pulse: peak=4.566975, time=45.84, period=11.22, d_freq=8507371236.36, score=1.002, chirp=68.368, fft_len=512 
Pulse: peak=3.375987, time=45.84, period=7.147, d_freq=8507371715.3, score=1.013, chirp=75.404, fft_len=512 
Pulse: peak=4.062914, time=45.9, period=8.814, d_freq=8507370700.05, score=1.11, chirp=-83.167, fft_len=2k
Pulse: peak=6.64492, time=45.86, period=15.99, d_freq=8507365116.11, score=1.028, chirp=85.728, fft_len=1024 
setiathome_CUDA: Found 4 CUDA device(s):
  Device 1: GeForce GTX 1080 Ti, 11178 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 28 
     pciBusID = 2, pciSlotID = 0
  Device 2: GeForce GTX 1080 Ti, 11178 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 28 
     pciBusID = 3, pciSlotID = 0
  Device 3: GeForce GTX 1070, 8119 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 15 
     pciBusID = 1, pciSlotID = 0
  Device 4: GeForce GTX 1070, 8118 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 15 
     pciBusID = 4, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 2
setiathome_CUDA: CUDA Device 2 specified, checking...
   Device 2: GeForce GTX 1080 Ti is okay
SETI@home using CUDA accelerated device GeForce GTX 1080 Ti
Using pfb = 32 from command line args
Unroll autotune 28. Overriding Pulse find periods per launch. Parameter -pfp set to 28
Restarted at 61.11 percent, with setiathome enhanced x41p_V0.97b2, Cuda 9.20 special
Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Sigma 66
Sigma > GaussTOffsetStop: 66 > -2
Thread call stack limit is: 1k
Find triplets Cuda kernel encountered too many triplets, or bins above threshold, reprocessing this PoT on CPU... err = 1

</stderr_txt>
]]>


As you can see, processing of the WU was interrupted by the rescheduler, and when it resumed it was redirected to the CPU.
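For anyone who wants to spot this pattern without reading each stderr by hand, here is a minimal sketch in Python. It only looks for the two marker strings visible in the excerpt above ("Restarted at ... percent" followed later by "reprocessing this PoT on CPU"). The function name is hypothetical, and other builds of the app may word these messages differently, so treat the markers as assumptions.

```python
import re

# Marker strings copied from the stderr excerpt above (assumption: other
# app builds may word these messages differently).
RESTART_RE = re.compile(r"Restarted at ([0-9.]+) percent")
CPU_FALLBACK = "reprocessing this PoT on CPU"


def find_cpu_fallback(stderr_text):
    """Return the restart percentage if the task was resumed from a
    checkpoint and later reported a CPU fallback, otherwise None."""
    restart_pct = None
    for line in stderr_text.splitlines():
        match = RESTART_RE.search(line)
        if match:
            # Remember the most recent restart point seen so far.
            restart_pct = float(match.group(1))
        elif restart_pct is not None and CPU_FALLBACK in line:
            return restart_pct
    return None
```

Fed the stderr text shown above, this would report the 61.11 percent restart point; for a task that never restarted it returns None.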

In this case, since I know how to identify the error quickly, I manually check for any "limbo" WU after I run the rescheduler and abort the WU if that is happening.

How to identify it? The elapsed time keeps increasing while the percent done stays unchanged for a long time.
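That check (elapsed time growing while the percentage stays put) can be sketched as a small Python function. The sample format, the function name, and the 180-second default are all assumptions for illustration; pick a threshold that suits your own hardware.

```python
def is_stalled(samples, min_stall_seconds=180.0):
    """samples: list of (elapsed_seconds, percent_done) tuples in time order.

    Returns True when percent_done has not changed across the most recent
    samples for at least min_stall_seconds of elapsed time.
    """
    if len(samples) < 2:
        return False
    last_elapsed, last_pct = samples[-1]
    stall_start = last_elapsed
    # Walk backwards while the percentage is identical to the latest one.
    for elapsed, pct in reversed(samples[:-1]):
        if pct != last_pct:
            break
        stall_start = elapsed
    return (last_elapsed - stall_start) >= min_stall_seconds
```

A task sampled at 61.11 percent three times over four minutes would be flagged; a task whose percentage moved between samples would not.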

If you need more examples, go to my error WUs and look at all the manually aborted ones. There are very few, since I have already learned a way
to avoid the trouble: just don't run the rescheduler when any WU is more than 2/3 crunched. Wait until it finishes before starting the rescheduler. It takes some practice in a 4xGPU environment like ours.
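The 2/3 rule of thumb can be written down as a one-line check. This is only a sketch: the function name is hypothetical, and the fractions would have to come from whatever tool you use to watch the running tasks.

```python
def safe_to_reschedule(fractions_done, limit=2.0 / 3.0):
    """fractions_done: progress of each running task as a value in 0..1.

    Returns True only if no running task is more than `limit` complete.
    """
    return all(fraction <= limit for fraction in fractions_done)
```

With four GPU tasks at 10, 30, 50 and 90 percent this returns False, which is the cue to wait for the last one to finish.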
ID: 1958907
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1958909 - Posted: 6 Oct 2018, 21:06:04 UTC
Last modified: 6 Oct 2018, 21:07:41 UTC

What rescheduler are you using that can stop BOINC? I am not aware of any. All of the reschedulers that I have used, and I have used them all at one time or another, state that you have to stop BOINC before rescheduling. So it is up to you to stop processing before running a rescheduler. The normal caveats of the special app apply, of course. Don't suspend them or stop them midway. Try to always stop them before they have written any checkpoints into the slots. And always make sure any finished tasks have fully uploaded and reported before stopping BOINC.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1958909
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1958915 - Posted: 6 Oct 2018, 21:49:15 UTC
Last modified: 6 Oct 2018, 22:12:13 UTC

It's just one of those caveats with rescheduling: you have to pay attention to what's going on. If you have your checkpoints set to longer than any GPU task ever runs, it shouldn't be an issue. Or, if you have it set to less than that, you have to watch that you don't shut down BOINC when a GPU task has had a chance to drop a checkpoint. If you don't watch what is going on, sooner or later you will get burnt.

I also suspend my pending GPU tasks and let it complete what is being worked on before shutting it down, which is pretty easy if you also use BoincTasks. It is extra work, but worth it in the end.
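The checkpoint reasoning above can be sketched in Python. It rests on a loud assumption: that a task cannot have written a checkpoint before it has run for at least one checkpoint interval. The function names are hypothetical, and the elapsed times would come from your own monitoring.

```python
def may_have_checkpoint(elapsed_seconds, checkpoint_interval_seconds):
    """True if the task has run long enough that a checkpoint could exist.

    Assumption: the first checkpoint is written no earlier than one full
    interval after the task starts.
    """
    return elapsed_seconds >= checkpoint_interval_seconds


def safe_to_stop(running_elapsed_seconds, checkpoint_interval_seconds):
    """True if none of the running GPU tasks could have checkpointed yet."""
    return not any(
        may_have_checkpoint(elapsed, checkpoint_interval_seconds)
        for elapsed in running_elapsed_seconds
    )
```

If the interval is set longer than the longest GPU run time, `safe_to_stop` is always True, which matches the "it shouldn't be an issue" case.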

To be honest, the checkpoint function should really be completely disabled now in the 'sauce.' Even for my 750Ti's it would be a great loss to lose what has been processed. And I do see that it would be almost impossible to have a reliable checkpoint with synchronous processing, where there are upwards of dozens of 'in progress' processes, each with their own set of info ... and what if the task restarts on a GPU with a different number of CUs ... the possibilities are endless.

My Ryzen frequently has checkpoint problems with all the crashes it goes through, so I have to pay attention to it on each reboot. If a GPU task starts and doesn't do anything in the first 3 minutes ... a Suspend/ClearSlot/Resume is needed or it will sit there until it times out in 20 or so minutes.
EDIT: I am not going to fix all my typos, just read around them :D
ID: 1958915
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1958916 - Posted: 6 Oct 2018, 21:51:32 UTC - in response to Message 1958909.  

It would be perfectly possible for a rescheduler to stop and restart BOINC - I would be extremely surprised if nobody has nicked the code from knabench by now. It includes the code for retrieving the installation paths from the registry. (I'm talking Windows, of course - Linux users will have to roll their own, as always).
ID: 1958916
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1958917 - Posted: 6 Oct 2018, 21:58:43 UTC - in response to Message 1958916.  

Yes, I was mistaken. I see that it is possible with Jeff's rescheduler. It is in the readme as possible but untested in the early versions. I have always played safe and stopped BOINC on my own terms before rescheduling. Haven't run into any issues that way either. I have used its restart BOINC dialog after it has finished rescheduling at times with no troubles too.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1958917
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1958919 - Posted: 6 Oct 2018, 22:02:03 UTC - in response to Message 1958917.  
Last modified: 6 Oct 2018, 22:02:19 UTC

I have always played safe and stopped BOINC on my own terms before rescheduling.

Just out of curiosity, what happens if you manually stop the WU in the middle of the crunching process?

Could that lead to the same issue? Or is it different?
ID: 1958919
Sleepy
Volunteer tester
Joined: 21 May 99
Posts: 219
Credit: 98,947,784
RAC: 28,360
Italy
Message 1958923 - Posted: 6 Oct 2018, 22:27:26 UTC - in response to Message 1958854.  

Install a 7zip application, it is similar to WinZip.
As I said, of course I tried all this, in both Linux and Windows (with... 7-Zip, which should fit your advice, which I have been using regularly for years, and which usually opens any kind of archive). No joy.
So I would be grateful if anyone could advise me about not just "a" 7Zip application, but "the" 7zip application which works with these files, since I have already tested several in both worlds to no avail before asking.

Thank you in advance.

Sleepy
ID: 1958923
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1958927 - Posted: 6 Oct 2018, 22:41:28 UTC - in response to Message 1958923.  

<Scratching head> So you have the downloads but they won't open? That is strange, maybe they were saved HTML pages by mistake? It is just a thought ...

I just looked at a couple of raw 7zip files in Notepad, and they seem to start with the characters "7z" followed by the compressed 'garbage' text. It might be worth looking at the raw file to see if that is what you have. Again, just a thought ...
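The same check can be done without opening Notepad. The sketch below, in Python, compares the first bytes of a file against the 7z signature (the ASCII characters "7z" followed by the bytes BC AF 27 1C) and also looks for the start of an HTML page. The function names are hypothetical, and the HTML test is deliberately crude.

```python
# 7z archives begin with these six bytes: "7z" followed by BC AF 27 1C.
SEVEN_ZIP_MAGIC = b"7z\xbc\xaf\x27\x1c"


def classify_download(first_bytes):
    """Classify a download from its leading bytes: '7z', 'html' or 'unknown'."""
    if first_bytes.startswith(SEVEN_ZIP_MAGIC):
        return "7z"
    # Crude HTML detection: ignore leading whitespace and letter case.
    head = first_bytes.lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"


def classify_file(path):
    """Read the first 64 bytes of the file at `path` and classify them."""
    with open(path, "rb") as handle:
        return classify_download(handle.read(64))
```

A file that comes back as "html" was most likely a saved web page rather than the archive itself, and no unzip tool will open it.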
ID: 1958927
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1958928 - Posted: 6 Oct 2018, 22:41:56 UTC - in response to Message 1958923.  

The normal tool would be 7-zip, but like all software, it has undergone revisions over the years, and sometimes those revisions add newer, faster and/or more efficient modes of compression. If you haven't done so recently, I would download a fresh copy from https://www.7-zip.org/download.html, and see if that can handle the files.
ID: 1958928
Sleepy
Volunteer tester
Joined: 21 May 99
Posts: 219
Credit: 98,947,784
RAC: 28,360
Italy
Message 1958930 - Posted: 6 Oct 2018, 22:53:50 UTC - in response to Message 1958927.  
Last modified: 6 Oct 2018, 22:57:26 UTC

<Scratching head> So you have the downloads but they won't open? That is strange, maybe they were saved HTML pages by mistake? It is just a thought ...
Dear Brent,
you won the prize.
To shield myself from the messages about Drive's failed attempts to preview the files, after the first downloads I began to save directly from the links. And for some reason, I actually downloaded the HTML code instead of the files. That seemed to happen only for the 7zip files, hence my doubts.

So the simple remedy would be downloading the files again, but as everybody knows, they are presently unavailable.

In any case, I am not the one downloading the files 100,000 times yesterday! ;-)

Sleepy
ID: 1958930
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1958932 - Posted: 6 Oct 2018, 22:58:53 UTC - in response to Message 1958930.  

Dear Brent,
you won the prize.
OMG, I won, finally a toaster all to myself :D
ID: 1958932
Sleepy
Volunteer tester
Joined: 21 May 99
Posts: 219
Credit: 98,947,784
RAC: 28,360
Italy
Message 1958933 - Posted: 6 Oct 2018, 23:02:10 UTC - in response to Message 1958932.  
Last modified: 6 Oct 2018, 23:19:49 UTC

OMG, I won, finally a toaster all to myself :D
It is flying to you as I type! ;-)
https://www.youtube.com/watch?v=0Cm7tv5cM8g

Sleepy
ID: 1958933
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 1958936 - Posted: 6 Oct 2018, 23:14:40 UTC

OK, we're back again.
For now.
Grant
Darwin NT
ID: 1958936
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1958941 - Posted: 6 Oct 2018, 23:28:45 UTC - in response to Message 1958919.  
Last modified: 6 Oct 2018, 23:33:11 UTC

I have always played safe and stopped BOINC on my own terms before rescheduling.

Just out of curiosity, what happens if you manually stop the WU in the middle of the crunching process?

Could that lead to the same issue? Or is it different?

If you have your checkpoint interval set longer than the crunching time of the task, then the task just starts over from zero when you restart BOINC. No harm, no foul.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1958941
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1958942 - Posted: 6 Oct 2018, 23:34:30 UTC - in response to Message 1958936.  

OK, we're back again.
For now.

Yes, the daily glitch was causing the site to hang for minutes and then time out. Looks like that is over now.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1958942
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1958947 - Posted: 6 Oct 2018, 23:59:17 UTC - in response to Message 1958916.  

It would be perfectly possible for a rescheduler to stop and restart BOINC - I would be extremely surprised if nobody has nicked the code from knabench by now. It includes the code for retrieving the installation paths from the registry. (I'm talking Windows, of course - Linux users will have to roll their own, as always).

. . Stubbles' script does that. It stops and restarts BOINC. But for some reason the restart does not work with the later version of BOINC, and I need to do that manually.

Stephen

<shrug>
ID: 1958947
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1958948 - Posted: 7 Oct 2018, 0:00:08 UTC - in response to Message 1958942.  
Last modified: 7 Oct 2018, 0:16:07 UTC

The SSP doesn't appear to be over it. I'm not seeing a thing.
https://setiathome.berkeley.edu/show_server_status.php
Is it generating tasks...or not.

+++++++++++++++++++++++++
That seemed to work.
Now it's back.

-----------------------------

Now gone again....
ID: 1958948
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1958949 - Posted: 7 Oct 2018, 0:01:03 UTC - in response to Message 1958917.  

Yes, I was mistaken. I see that it is possible with Jeff's rescheduler. It is in the readme as possible but untested in the early versions. I have always played safe and stopped BOINC on my own terms before rescheduling. Haven't run into any issues that way either. I have used its restart BOINC dialog after it has finished rescheduling at times with no troubles too.


. . With Linux and the checkpoint issue it is best to do it manually.

Stephen

:)
ID: 1958949
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1958950 - Posted: 7 Oct 2018, 0:03:35 UTC - in response to Message 1958919.  

I have always played safe and stopped BOINC on my own terms before rescheduling.

Just out of curiosity, what happens if you manually stop the WU in the middle of the crunching process?

Could that lead to the same issue? Or is it different?


. . If it has made a checkpoint it will resume and then it is a lottery whether or not it will fail. If it has not made a checkpoint it will restart from scratch and will not fail.

Stephen

:)
ID: 1958950


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.