The Server Issues / Outages Thread - Panic Mode On! (118)

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 2024527 - Posted: 23 Dec 2019, 2:15:59 UTC - in response to Message 2024488. It might actually be a good plan for those of us running Anonymous to set No New Tasks until this gets figured out, in order to quit hammering the servers for those running stock who could actually get some work. Personally, Einstein is running fine here until it's squared away and I just can't see breaking what works here to work around a far-end issue ... Just a thought. Just setting the No New Tasks doesn't seem to stop my hitting the Server. I probably have to increase the "get additional tasks" to something like 0.25 or 0.3 Tom . . Or if you have the motivation (and time) suspend the network until you have a page or 3 of completed tasks then let them upload before suspending the network again. But that is very fiddly. Stephen . . ID: 2024527 ·
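The "suspend the network until a batch is finished" idea above can be sketched roughly as follows. This is only an illustration: the helper names and the batch size are hypothetical, the count of completed tasks has to come from wherever you read it, and the only real tool assumed is the `boinccmd` command-line client with its `--set_network_mode` option.

```python
def network_mode_for(completed, batch_size):
    """Pick a BOINC network mode for the batching idea.

    Keep the network off ("never") until at least batch_size tasks are
    finished, then hand control back to the preferences ("auto") so the
    results can upload and report in one go.
    """
    return "auto" if completed >= batch_size else "never"


def boinccmd_args(mode):
    """Build the argument list for boinccmd (assumes boinccmd is on the PATH)."""
    return ["boinccmd", "--set_network_mode", mode]


# Hypothetical numbers: 5 finished tasks, batch of 20 -> stay offline.
print(boinccmd_args(network_mode_for(5, 20)))
```

The argument list can be passed to `subprocess.run` on a host where the BOINC client is running; after the uploads finish, the same call with `"never"` suspends the network again.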

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 2024528 - Posted: 23 Dec 2019, 2:21:06 UTC - in response to Message 2024524. Last modified: 23 Dec 2019, 2:21:38 UTC . . Or you could do what I have done with the machines with no work, turn them off and save on power bills. :) Ah, but without all the lovely heat of those 980s rising up the basement stairwell, I might actually have to turn the house's heat on :\| Though it would be cheaper ... ID: 2024528 ·

Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489	Message 2024529 - Posted: 23 Dec 2019, 2:21:13 UTC - in response to Message 2024524. Any oldtimers here remember that one Christmas season that seti was very down... was it a month?? two?? Was that the 1 where it all came crashing down in early December and didn't get fixed until late January, or something like that? It might actually be a good plan for those of us running Anonymous to set No New Tasks until this gets figured out, in order to quit hammering the servers for those running stock who could actually get some work. Personally, Einstein is running fine here until it's squared away and I just can't see breaking what works here to work around a far-end issue ... Just a thought. . . Or you could do what I have done with the machines with no work, turn them off and save on power bills. Stephen :) The lack of heat being produced here is very helpful ATM with the very hot and smokey conditions we're experiencing, but I'd imagine that those in the other half of the world would prefer to have the heat on. ;-) Cheers. ID: 2024529 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 2024530 - Posted: 23 Dec 2019, 2:27:11 UTC - in response to Message 2024502. I well remember the 'very extended maintenance' of a few years ago and had no problem with that (I was one of the many who advocated it), but this latest 'episode' is something else. It may well be, make or break, as you suggest. For me, that moment appears to have arrived. Best of luck to you all. . . Considering the time of year it may be circumspect to wait until the new year and see if it is all resolved. Considering the role the BOINC project management appears to have had in this debacle it may not be resolved until then. So relax and have a Merry Christmas for the interim. And a pleasant and Happy New Year! Stephen :) ID: 2024530 ·

petri33 Volunteer tester Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156	Message 2024531 - Posted: 23 Dec 2019, 2:30:06 UTC - in response to Message 2024529. Any oldtimers here remember that one Christmas season that seti was very down... was it a month?? two?? Was that the 1 where it all came crashing down in early December and didn't get fixed until late January, or something like that? It might actually be a good plan for those of us running Anonymous to set No New Tasks until this gets figured out, in order to quit hammering the servers for those running stock who could actually get some work. Personally, Einstein is running fine here until it's squared away and I just can't see breaking what works here to work around a far-end issue ... Just a thought. . . Or you could do what I have done with the machines with no work, turn them off and save on power bills. Stephen :) The lack of heat being produced here is very helpful ATM with the very hot and smokey conditions we're experiencing, but I'd imagine that those in the other half of the world would prefer to have the heat on. ;-) Cheers. Instead of burning my GPUs with electricity I'll burn some wood in the fireplace and some rubber with spikes on the icy roads with my V60 D6. Now I have time to sleep and replace my Seasonic 1250-X with a 1600W bla bla and throw out those two 1080's and one 1080Ti to make room for my presents. Those old ones I'd hope go to East. Not south. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones ID: 2024531 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 2024532 - Posted: 23 Dec 2019, 2:33:15 UTC - in response to Message 2024518. Last modified: 23 Dec 2019, 2:41:38 UTC Brilliant workaround... applying to every machine stat and thank you! :^) Edit: I tried it first on a 2x2080ti machine and it was doing two at once fine... watched it complete half a dozen like that. Hrm. I'm confused. The workaround states: (this workaround will work only on single GPU hosts) But you talk about a 2 GPU host, so is it safe to use on a multi GPU build or not? ID: 2024532 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 2024536 - Posted: 23 Dec 2019, 2:42:53 UTC - in response to Message 2024509. but this latest 'episode' is something else. It may well be, make or break, as you suggest. For me, that moment appears to have arrived. Best of luck to you all. Actually Iona your failed tasks are all from "device 1" which should be your 2nd GPU as "device 0" is completing its tasks. If a restart doesn't clear the problem (it may have just had a driver crash and hasn't recovered) then just remove that offending GPU. Cheers. . . Hi Wiggo, . . The machine I see only has the one GTX970, and looking at the stderr info even on the successful tasks it is having massive problems initialising the card with constant restarts before it actually processes the task. The errored tasks are when it gets one too many restarts. . . I agree that maybe shutting down the machine, lifting the lid and blowing or sucking out the dust bunnies might be a good idea, and checking the PCIe power cables as well. I prefer the vacuum cleaner to the canned air approach as it does not simply move the dust, but actually removes the dust. If that and the restart does not restore normal behaviour then serious hardware issues sound likely. Stephen <fingers crossed> :) ID: 2024536 ·

Retvari Zoltan Send message Joined: 28 Apr 00 Posts: 35 Credit: 128,746,856 RAC: 230	Message 2024537 - Posted: 23 Dec 2019, 2:50:53 UTC - in response to Message 2024532. Brilliant workaround... applying to every machine stat and thank you! :^) Edit: I tried it first on a 2x2080ti machine and it was doing two at once fine... watched it complete half a dozen like that. Hrm. I'm confused is stated on the workaround: (this workaround will work only on single GPU hosts) And you talk about a 2 GPU host, so it's safe to use on a multi GPU build or no? This is a warning, as I can't test it on multi GPU systems. Judging by the stderr output, the BOINC manager assigns the CUDA device number in a different way than the OpenCL device number, so the CUDA client will not know which device it should use. Perhaps it will use only device 0, regardless that another instance is already using it. ID: 2024537 ·

Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489	Message 2024539 - Posted: 23 Dec 2019, 2:53:48 UTC - in response to Message 2024536. Last modified: 23 Dec 2019, 3:16:03 UTC but this latest 'episode' is something else. It may well be, make or break, as you suggest. For me, that moment appears to have arrived. Best of luck to you all. Actually Iona your failed tasks are all from "device 1" which should be your 2nd GPU as "device 0" is completing its tasks. If a restart doesn't clear the problem (it may have just had a driver crash and hasn't recovered) then just remove that offending GPU. Cheers. . . Hi Wiggo, . . The machine I see only has the one GTX970, and looking at the stderr info even on the successful tasks it is having massive problems initialising the card with constant restarts before it actually processes the task. The errored tasks are when it gets one too many restarts. . . I agree that maybe shutting down the machine, lifting the lid and blowing or sucking out the dust bunnies might be a good idea, and checking the PCIe power cables as well. I prefer the vacuum cleaner to the canned air approach as it does not simply move the dust, but actually removes the dust. If that and the restart does not restore normal behaviour then serious hardware issues sound likely. Stephen <fingers crossed> :) Yes it does only show a single card now so it either didn't reinitialise after a reboot and/or Iona removed it, but with the replica still being a good 11hrs behind only Iona can tell us sooner what her solution was and if it's working (and I didn't waste my time going through a large random selection of her errored and validated tasks Stderr outputs for nothing). ;-) Cheers. ID: 2024539 ·

Mr. Kevvy Volunteer moderator Volunteer tester Send message Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319	Message 2024540 - Posted: 23 Dec 2019, 2:56:21 UTC - in response to Message 2024537. Thanks... I think that explains the poor performance I am having on the 4-GPU system... they are all using the same card! The 2 GPU systems are still an improvement over stock. The only issue I had was that on first launch a resumed task may stall; abort it and all the rest are fine. ID: 2024540 ·

Zalster Volunteer tester Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 2024541 - Posted: 23 Dec 2019, 2:59:25 UTC - in response to Message 2024540. Mr. Kevvy are you using BoincTasks to monitor the work units when they start? Should tell you which GPUs the work units are running on. ID: 2024541 ·

Retvari Zoltan Send message Joined: 28 Apr 00 Posts: 35 Credit: 128,746,856 RAC: 230	Message 2024545 - Posted: 23 Dec 2019, 3:27:21 UTC - in response to Message 2024540. Last modified: 23 Dec 2019, 3:27:53 UTC Thanks... I think that explains the poor performance I am having on the 4-GPU system... they are all using the same card! The 2 GPU systems are still an improvement over stock. The only issue I had was that on first launch a resumed task may stall; abort it and all the rest are fine. Perhaps you could try to uninstall opencl, forcing your system to ask only for CUDA tasks. I think the special app can fully replace the original CUDA60 app, so the BOINC manager can correctly set the device id to use. ID: 2024545 ·

Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640	Message 2024546 - Posted: 23 Dec 2019, 3:40:43 UTC - in response to Message 2024540. the issue is likely the task trying to restart a task that was previously running a different app. it was running on an openCL app, and then when you renamed it to get the CUDA app in there, it tries to pick up where it left off, but everything is different and it just hangs. I just aborted any tasks running on GPUs that showed 0% GPU utilization and it successfully picks up new tasks. running fine on my 10-GPU host, when it can get tasks. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours ID: 2024546 ·
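A rough sketch of the check described above, finding GPUs that sit at 0% utilization so their stalled tasks can be aborted by hand. It assumes an NVIDIA driver with `nvidia-smi` available; the helper names `idle_gpus` and `query_nvidia_smi` and the threshold are illustrative and not part of BOINC or the special app.

```python
import subprocess


def idle_gpus(csv_text, threshold=0):
    """Return indices of GPUs whose utilization (percent) is <= threshold.

    csv_text is expected to hold one "index, utilization" pair per line,
    as printed by:
      nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits
    """
    idle = []
    for line in csv_text.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        try:
            if int(util) <= threshold:
                idle.append(int(index))
        except ValueError:
            # Skip rows where the driver reports a non-numeric value.
            continue
    return idle


def query_nvidia_smi():
    """Run nvidia-smi and return its CSV output (assumes it is on the PATH)."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
```

On a live host, `idle_gpus(query_nvidia_smi())` lists the device indices worth a look. A single sample can catch a GPU between tasks, so it is safer to check a few times before aborting anything.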

Retvari Zoltan Send message Joined: 28 Apr 00 Posts: 35 Credit: 128,746,856 RAC: 230	Message 2024554 - Posted: 23 Dec 2019, 3:51:38 UTC I can't figure out how to pass the -nobs parameter with my workaround. ID: 2024554 ·

Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640	Message 2024556 - Posted: 23 Dec 2019, 3:54:03 UTC - in response to Message 2024554. try putting it in the cmdline text file? (mb_cmdline-8.22-opencl_nvidia_SoG.txt, mb_cmdline-8.22-opencl_nvidia_sah.txt, etc.) Seti@Home classic workunits: 29,492 CPU time: 134,419 hours ID: 2024556 ·

Retvari Zoltan Send message Joined: 28 Apr 00 Posts: 35 Credit: 128,746,856 RAC: 230	Message 2024557 - Posted: 23 Dec 2019, 3:54:52 UTC - in response to Message 2024556. I did it. It has no effect. ID: 2024557 ·

Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13	Message 2024561 - Posted: 23 Dec 2019, 4:05:04 UTC Interesting. Well it looks like I've been out of work for two days now. Just noticed it. It's been a lot of HTTP errors over the past 48 hours when trying to contact scheduler, and the times that it is successful, the response is well over 60 seconds from request to reply... and it comes back with "no tasks available." But looking here on the website.. I apparently have 19 in progress. Yay ghosts. What was the procedure for that again? Send a work request, then suspend network before a reply, and then give it a few minutes and allow network again and do another request? Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) ID: 2024561 ·

Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489	Message 2024563 - Posted: 23 Dec 2019, 4:07:22 UTC - in response to Message 2024561. Interesting. Well it looks like I've been out of work for two days now. Just noticed it. It's been a lot of HTTP errors over the past 48 hours when trying to contact scheduler, and the times that it is successful, the response is well over 60 seconds from request to reply... and it comes back with "no tasks available." But looking here on the website.. I apparently have 19 in progress. Yay ghosts. What was the procedure for that again? Send a work request, then suspend network before a reply, and then give it a few minutes and allow network again and do another request? The replica is still 11 hours behind so it's likely that you have no ghosts at all. ;-) Cheers. ID: 2024563 ·

Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640	Message 2024565 - Posted: 23 Dec 2019, 4:08:58 UTC - in response to Message 2024557. it should work. I'm running it in my text file, and I'm seeing the right CPU use % for my setup: 7 GPUs running the special app on a 12 thread system sees ~60% CPU use; 10 GPUs running the special app on a 40 thread system sees about 25% CPU use. you can also add it to your app_config file. add this line: <cmdline>-nobs</cmdline> Seti@Home classic workunits: 29,492 CPU time: 134,419 hours ID: 2024565 ·
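For anyone unsure where that `<cmdline>` line goes, here is a minimal sketch that prints an app_config.xml skeleton with the element in place. The layout (`app_config` > `app_version` with `app_name`, `plan_class` and `cmdline`) follows the BOINC client configuration format. The app name `setiathome_v8` and the plan class `opencl_nvidia_SoG` are assumptions, the latter guessed from the cmdline file name mentioned earlier in the thread, so check them against your own client state before using the output. `build_app_config` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, plan_class, cmdline):
    """Return app_config.xml text carrying a cmdline for one app version."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "cmdline").text = cmdline
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Assumed values for illustration only; adjust to match your host.
xml_text = build_app_config("setiathome_v8", "opencl_nvidia_SoG", "-nobs")
print(xml_text)
```

The printed text would be saved as app_config.xml in the project's folder inside the BOINC data directory, and the client told to re-read its config files (or restarted) before the change can take effect.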

wujj123456 Send message Joined: 5 Sep 04 Posts: 40 Credit: 20,877,975 RAC: 219	Message 2024566 - Posted: 23 Dec 2019, 4:09:12 UTC - in response to Message 2024546. the issue is likely the task trying to restart a task that was previously running a different app. it was running on an openCL app, and then when you renamed it to get the CUDA app in there, it tries to pick up where it left off, but everything is different and it just hangs. I just aborted any tasks running on GPUs that showed 0% GPU utilization and it successfully picks up new tasks. running fine on my 10-GPU host, when it can get tasks. How often do you get tasks? Even with stock and reset, I got only like 50 WUs yesterday when I got lucky. I just tried again and I got either no new work or HTTP internal error. I feel it's probably wiser for me to just stop sending requests to server until this is resolved. ID: 2024566 ·

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.