The Server Issues / Outages Thread - Panic Mode On! (118)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118)
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2024527 - Posted: 23 Dec 2019, 2:15:59 UTC - in response to Message 2024488.  

It might actually be a good plan for those of us running Anonymous to set No New Tasks until this gets figured out, in order to quit hammering the servers for those running stock who could actually get some work.
Personally, Einstein is running fine here until it's squared away and I just can't see breaking what works here to work around a far-end issue ... Just a thought.

Just setting the No New Tasks doesn't seem to stop my hitting the Server. I probably have to increase the "get additional tasks" to something like 0.25 or 0.3
Tom


. . Or if you have the motivation (and time) suspend the network until you have a page or 3 of completed tasks then let them upload before suspending the network again. But that is very fiddly.
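
For anyone who would rather script this than click through the Manager, here is a minimal sketch of the two knobs discussed above (No New Tasks and suspending the network). It assumes `boinccmd` is installed and on the PATH, and that the project URL below matches the one your client is attached to; neither is stated in this thread, and the helper names are made up for illustration.

```python
import subprocess

# Assumption: replace with the project URL your own client reports.
PROJECT_URL = "http://setiathome.berkeley.edu/"

def no_new_tasks_cmd(project_url=PROJECT_URL, allow=False):
    """Build the boinccmd call that sets (or clears) No New Tasks for one project."""
    action = "allowmorework" if allow else "nomorework"
    return ["boinccmd", "--project", project_url, action]

def network_mode_cmd(mode):
    """Build the boinccmd call that switches network activity on or off."""
    if mode not in ("always", "auto", "never"):
        raise ValueError("mode must be 'always', 'auto' or 'never'")
    return ["boinccmd", "--set_network_mode", mode]

def run(cmd):
    """Run one prepared command; raises CalledProcessError if boinccmd fails."""
    subprocess.run(cmd, check=True)

# Example cycle (commented out so nothing runs by accident):
# run(network_mode_cmd("never"))   # crunch with the network suspended
# run(network_mode_cmd("auto"))    # later: let finished tasks upload
```

The commands are built separately from being run, so you can print them first and check that they are what you expect before touching a live client.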

Stephen

. .
ID: 2024527 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1856
Credit: 268,616,081
RAC: 1,349
United States
Message 2024528 - Posted: 23 Dec 2019, 2:21:06 UTC - in response to Message 2024524.  
Last modified: 23 Dec 2019, 2:21:38 UTC

. . Or you could do what I have done with the machines with no work, turn them off and save on power bills.
:)
Ah, but without all the lovely heat of those 980s rising up the basement stairwell, I might actually have to turn the house's heat on :| Though it would be cheaper ...
ID: 2024528 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 36613
Credit: 261,360,520
RAC: 489
Australia
Message 2024529 - Posted: 23 Dec 2019, 2:21:13 UTC - in response to Message 2024524.  

Any oldtimers here remember that one Christmas season that seti was very down... was it a month?? two??
Was that the 1 where it all came crashing down in early December and didn't get fixed until late January, or something like that?

It might actually be a good plan for those of us running Anonymous to set No New Tasks until this gets figured out, in order to quit hammering the servers for those running stock who could actually get some work.
Personally, Einstein is running fine here until it's squared away and I just can't see breaking what works here to work around a far-end issue ... Just a thought.
. . Or you could do what I have done with the machines with no work, turn them off and save on power bills.
Stephen :)
The lack of heat being produced here is very helpful ATM with the very hot and smokey conditions we're experiencing, but I'd imagine that those in the other half of the world would prefer to have the heat on. ;-)

Cheers.
ID: 2024529 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2024530 - Posted: 23 Dec 2019, 2:27:11 UTC - in response to Message 2024502.  

I well remember the 'very extended maintenance' of a few years ago and had no problem with that (I was one of the many who advocated it), but this latest 'episode' is something else. It may well be, make or break, as you suggest. For me, that moment appears to have arrived. Best of luck to you all.


. . Considering the time of year it may be prudent to wait until the new year and see if it is all resolved. Considering the role the BOINC project management appears to have had in this debacle it may not be resolved until then. So relax and have a Merry Christmas in the interim. And a pleasant and Happy New Year!

Stephen

:)
ID: 2024530 · Report as offensive
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 2024531 - Posted: 23 Dec 2019, 2:30:06 UTC - in response to Message 2024529.  

Any oldtimers here remember that one Christmas season that seti was very down... was it a month?? two??
Was that the 1 where it all came crashing down in early December and didn't get fixed until late January, or something like that?

It might actually be a good plan for those of us running Anonymous to set No New Tasks until this gets figured out, in order to quit hammering the servers for those running stock who could actually get some work.
Personally, Einstein is running fine here until it's squared away and I just can't see breaking what works here to work around a far-end issue ... Just a thought.
. . Or you could do what I have done with the machines with no work, turn them off and save on power bills.
Stephen :)
The lack of heat being produced here is very helpful ATM with the very hot and smokey conditions we're experiencing, but I'd imagine that those in the other half of the world would prefer to have the heat on. ;-)

Cheers.


Instead of burning my GPUs with electricity I'll burn some wood in the fireplace and some rubber with spikes on the icy roads with my V60 D6.
Now I have time to sleep and replace my Seasonic 1250-X with a 1600W bla bla and throw out those two 1080's and one 1080Ti to make room for my presents.

Those old ones I'd hope go to East. Not south.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 2024531 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2024532 - Posted: 23 Dec 2019, 2:33:15 UTC - in response to Message 2024518.  
Last modified: 23 Dec 2019, 2:41:38 UTC

Brilliant workaround... applying to every machine stat and thank you! :^)
Edit: I tried it first on a 2x2080ti machine and it was doing two at once fine... watched it complete half a dozen like that. Hrm.

I'm confused. The workaround states:

(this workaround will work only on single GPU hosts)


And you talk about a 2 GPU host, so is it safe to use on a multi GPU build or not?
ID: 2024532 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2024536 - Posted: 23 Dec 2019, 2:42:53 UTC - in response to Message 2024509.  

but this latest 'episode' is something else. It may well be, make or break, as you suggest. For me, that moment appears to have arrived. Best of luck to you all.
Actually Iona your failed tasks are all from "device 1" which should be your 2nd GPU as "device 0" is completing its tasks. If a restart doesn't clear the problem (it may have just had a driver crash and hasn't recovered) then just remove that offending GPU.
Cheers.


. . Hi Wiggo,

. . The machine I see only has the one GTX970, and looking at the stderr info even on the successful tasks it is having massive problems initialising the card with constant restarts before it actually processes the task. The errored tasks are when it gets one too many restarts.

. . I agree that maybe shutting down the machine, lifting the lid and blowing or sucking out the dust bunnies might be a good idea, and checking the PCIe power cables as well. I prefer the vacuum cleaner to the canned air approach as it does not simply move the dust, but actually removes the dust. If that and the restart does not restore normal behaviour then serious hardware issues sound likely.

Stephen

<fingers crossed>

:)
ID: 2024536 · Report as offensive
Profile Retvari Zoltan

Joined: 28 Apr 00
Posts: 35
Credit: 128,746,856
RAC: 230
Hungary
Message 2024537 - Posted: 23 Dec 2019, 2:50:53 UTC - in response to Message 2024532.  

Brilliant workaround... applying to every machine stat and thank you! :^)
Edit: I tried it first on a 2x2080ti machine and it was doing two at once fine... watched it complete half a dozen like that. Hrm.

I'm confused. The workaround states:

(this workaround will work only on single GPU hosts)

And you talk about a 2 GPU host, so is it safe to use on a multi GPU build or not?

This is a warning, as I can't test it on multi GPU systems.
Judging by the stderr output, the BOINC manager assigns the CUDA device number differently from the OpenCL device number, so the CUDA client will not know which device it should use. Perhaps it will use only device 0, even if another instance is already using it.
ID: 2024537 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 36613
Credit: 261,360,520
RAC: 489
Australia
Message 2024539 - Posted: 23 Dec 2019, 2:53:48 UTC - in response to Message 2024536.  
Last modified: 23 Dec 2019, 3:16:03 UTC

but this latest 'episode' is something else. It may well be, make or break, as you suggest. For me, that moment appears to have arrived. Best of luck to you all.
Actually Iona your failed tasks are all from "device 1" which should be your 2nd GPU as "device 0" is completing its tasks. If a restart doesn't clear the problem (it may have just had a driver crash and hasn't recovered) then just remove that offending GPU.
Cheers.
. . Hi Wiggo,
. . The machine I see only has the one GTX970, and looking at the stderr info even on the successful tasks it is having massive problems initialising the card with constant restarts before it actually processes the task. The errored tasks are when it gets one too many restarts.
. . I agree that maybe shutting down the machine, lifting the lid and blowing or sucking out the dust bunnies might be a good idea, and checking the PCIe power cables as well. I prefer the vacuum cleaner to the canned air approach as it does not simply move the dust, but actually removes the dust. If that and the restart does not restore normal behaviour then serious hardware issues sound likely.
Stephen
<fingers crossed>
:)
Yes, it does only show a single card now, so either it didn't reinitialise after a reboot or Iona removed it. But with the replica still a good 11 hrs behind, only Iona can tell us sooner what her solution was and whether it's working (and that I didn't waste my time going through a large random selection of her errored and validated tasks' stderr outputs for nothing). ;-)

Cheers.
ID: 2024539 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3804
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2024540 - Posted: 23 Dec 2019, 2:56:21 UTC - in response to Message 2024537.  

Thanks... I think that explains the poor performance I am having on the 4-GPU system... they are all using the same card! The 2 GPU systems are still an improvement over stock. The only issue I had was that on first launch a resumed task may stall; abort it and all the rest are fine.
ID: 2024540 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 2024541 - Posted: 23 Dec 2019, 2:59:25 UTC - in response to Message 2024540.  

Mr. Kevvy are you using BoincTasks to monitor the work units when they start? Should tell you which GPUs the work units are running on.
ID: 2024541 · Report as offensive
Profile Retvari Zoltan

Joined: 28 Apr 00
Posts: 35
Credit: 128,746,856
RAC: 230
Hungary
Message 2024545 - Posted: 23 Dec 2019, 3:27:21 UTC - in response to Message 2024540.  
Last modified: 23 Dec 2019, 3:27:53 UTC

Thanks... I think that explains the poor performance I am having on the 4-GPU system... they are all using the same card! The 2 GPU systems are still an improvement over stock. The only issue I had was that on first launch a resumed task may stall; abort it and all the rest are fine.
Perhaps you could try to uninstall opencl, forcing your system to ask only for CUDA tasks.
I think the special app can fully replace the original CUDA60 app, so the BOINC manager can correctly set the device id to use.
ID: 2024545 · Report as offensive
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2024546 - Posted: 23 Dec 2019, 3:40:43 UTC - in response to Message 2024540.  

the issue is likely the app trying to resume a task that was previously running under a different app. it was running on an openCL app, and then when you renamed it to get the CUDA app in there, it tries to pick up where it left off, but everything is different and it just hangs.

I just aborted any tasks running on GPUs that showed 0% GPU utilization and it successfully picks up new tasks.

running fine on my 10-GPU host, when it can get tasks.
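
For spotting the stalled GPUs without eyeballing a monitoring tool, a rough sketch follows. It assumes an NVIDIA driver with `nvidia-smi` on the PATH and parses its CSV query output; the function names are hypothetical, and a reading of 0% at one instant may just be a task that has only now started, so treat the result as a hint about which tasks to look at, not as an abort list.

```python
import subprocess

def parse_gpu_utilization(csv_text):
    """Turn 'index, utilization' rows (no header, no units) into {index: percent}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        usage[int(index)] = int(util)
    return usage

def idle_gpus(usage, threshold=0):
    """Return the GPU indices whose utilization is at or below the threshold."""
    return sorted(i for i, percent in usage.items() if percent <= threshold)

def query_nvidia_smi():
    """Ask nvidia-smi for per-GPU utilization; requires the NVIDIA driver tools."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_utilization(result.stdout)

# On a live host: print(idle_gpus(query_nvidia_smi()))
```

Sampling a few times, some seconds apart, before aborting anything would be the cautious way to use it.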
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2024546 · Report as offensive
Profile Retvari Zoltan

Joined: 28 Apr 00
Posts: 35
Credit: 128,746,856
RAC: 230
Hungary
Message 2024554 - Posted: 23 Dec 2019, 3:51:38 UTC

I can't figure out how to pass the -nobs parameter with my workaround.
ID: 2024554 · Report as offensive
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2024556 - Posted: 23 Dec 2019, 3:54:03 UTC - in response to Message 2024554.  

try putting it in the cmdline text file?

mb_cmdline-8.22-opencl_nvidia_SoG.txt
mb_cmdline-8.22-opencl_nvidia_sah.txt

etc
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2024556 · Report as offensive
Profile Retvari Zoltan

Joined: 28 Apr 00
Posts: 35
Credit: 128,746,856
RAC: 230
Hungary
Message 2024557 - Posted: 23 Dec 2019, 3:54:52 UTC - in response to Message 2024556.  

I did it. It has no effect.
ID: 2024557 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 2024561 - Posted: 23 Dec 2019, 4:05:04 UTC

Interesting. Well it looks like I've been out of work for two days now. Just noticed it. It's been a lot of HTTP errors over the past 48 hours when trying to contact scheduler, and the times that it is successful, the response is well over 60 seconds from request to reply... and it comes back with "no tasks available."

But looking here on the website.. I apparently have 19 in progress. Yay ghosts.

What was the procedure for that again? Send a work request, then suspend network before a reply, and then give it a few minutes and allow network again and do another request?
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 2024561 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 36613
Credit: 261,360,520
RAC: 489
Australia
Message 2024563 - Posted: 23 Dec 2019, 4:07:22 UTC - in response to Message 2024561.  

Interesting. Well it looks like I've been out of work for two days now. Just noticed it. It's been a lot of HTTP errors over the past 48 hours when trying to contact scheduler, and the times that it is successful, the response is well over 60 seconds from request to reply... and it comes back with "no tasks available."

But looking here on the website.. I apparently have 19 in progress. Yay ghosts.

What was the procedure for that again? Send a work request, then suspend network before a reply, and then give it a few minutes and allow network again and do another request?
The replica is still 11 hours behind so it's likely that you have no ghosts at all. ;-)

Cheers.
ID: 2024563 · Report as offensive
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2024565 - Posted: 23 Dec 2019, 4:08:58 UTC - in response to Message 2024557.  

it should work. I'm running it in my text file. and I'm seeing the right cpu use % for my setup.

7 GPUs running the special app on a 12 thread system sees ~60% CPU use
10 GPUs running the special app on a 40 thread system sees about 25% CPU use.

you can also add it to your app_config file. add this line:

<cmdline>-nobs</cmdline>
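
In case it helps to see where that line sits, here is a small sketch that generates a complete app_config.xml around it. The app name `setiathome_v8` and the plan class `opencl_nvidia_SoG` are assumptions (the plan class is taken from the cmdline file names mentioned earlier in the thread); check the names your own client reports before using it. The function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

def build_app_config(app_name, plan_class, cmdline):
    """Return app_config.xml text with one <app_version> carrying a <cmdline>."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "cmdline").text = cmdline
    if hasattr(ET, "indent"):  # pretty-printing is only available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")

# Assumed names; substitute the ones shown for your own host.
config_text = build_app_config("setiathome_v8", "opencl_nvidia_SoG", "-nobs")
print(config_text)
```

The resulting file belongs in the project's directory under the BOINC data folder, and the client has to re-read its config files (or be restarted) before the option takes effect.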
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2024565 · Report as offensive
wujj123456

Joined: 5 Sep 04
Posts: 40
Credit: 20,877,975
RAC: 219
China
Message 2024566 - Posted: 23 Dec 2019, 4:09:12 UTC - in response to Message 2024546.  

the issue is likely the app trying to resume a task that was previously running under a different app. it was running on an openCL app, and then when you renamed it to get the CUDA app in there, it tries to pick up where it left off, but everything is different and it just hangs.

I just aborted any tasks running on GPUs that showed 0% GPU utilization and it successfully picks up new tasks.

running fine on my 10-GPU host, when it can get tasks.

How often do you get tasks? Even with stock and reset, I got only like 50 WUs yesterday when I got lucky. I just tried again and I got either no new work or HTTP internal error. I feel it's probably wiser for me to just stop sending requests to server until this is resolved.
ID: 2024566 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.