The Server Issues / Outages Thread - Panic Mode On! (118)
Author | Message |
---|---|
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
"It might actually be a good plan for those of us running Anonymous to set No New Tasks until this gets figured out, in order to quit hammering the servers for those running stock who could actually get some work." . . Or if you have the motivation (and time) suspend the network until you have a page or 3 of completed tasks then let them upload before suspending the network again. But that is very fiddly. Stephen . . |
Jimbocous Joined: 1 Apr 13 Posts: 1856 Credit: 268,616,081 RAC: 1,349 |
". . Or you could do what I have done with the machines with no work, turn them off and save on power bills." Ah, but without all the lovely heat of those 980s rising up the basement stairwell, I might actually have to turn the house's heat on :| Though it would be cheaper ... |
Wiggo Joined: 24 Jan 00 Posts: 36390 Credit: 261,360,520 RAC: 489 |
"Any oldtimers here remember that one Christmas season that seti was very down... was it a month?? two??" Was that the 1 where it all came crashing down in early December and didn't get fixed until late January, or something like that? "It might actually be a good plan for those of us running Anonymous to set No New Tasks until this gets figured out, in order to quit hammering the servers for those running stock who could actually get some work." ". . Or you could do what I have done with the machines with no work, turn them off and save on power bills." The lack of heat being produced here is very helpful ATM with the very hot and smokey conditions we're experiencing, but I'd imagine that those in the other half of the world would prefer to have the heat on. ;-) Cheers. |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
"I well remember the 'very extended maintenance' of a few years ago and had no problem with that (I was one of the many who advocated it), but this latest 'episode' is something else. It may well be, make or break, as you suggest. For me, that moment appears to have arrived. Best of luck to you all." . . Considering the time of year it may be circumspect to wait until the new year and see if it is all resolved. Considering the role the BOINC project management appears to have had in this debacle it may not be resolved until then. So relax and have a Merry Christmas for the interim. And a pleasant and Happy New Year! Stephen :) |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
"Any oldtimers here remember that one Christmas season that seti was very down... was it a month?? two??" "Was that the 1 where it all came crashing down in early December and didn't get fixed until late January, or something like that?" Instead of burning my GPUs with electricity I'll burn some wood in the fireplace and some rubber with spikes on the icy roads with my V60 D6. Now I have time to sleep and replace my Seasonic 1250-X with a 1600W bla bla and throw out those two 1080's and one 1080Ti to make room for my presents. Those old ones I'd hope go to East. Not south. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"Brilliant workaround... applying to every machine stat and thank you! :^)" I'm confused. The workaround states "(this workaround will work only on single GPU hosts)", yet you talk about a 2 GPU host. So is it safe to use on a multi GPU build or not? |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
"but this latest 'episode' is something else. It may well be, make or break, as you suggest. For me, that moment appears to have arrived. Best of luck to you all." "Actually Iona your failed tasks are all from 'device 1' which should be your 2nd GPU as 'device 0' is completing its tasks. If a restart doesn't clear the problem (it may have just had a driver crash and hasn't recovered) then just remove that offending GPU." . . Hi Wiggo, . . The machine I see only has the one GTX970, and looking at the stderr info even on the successful tasks it is having massive problems initialising the card with constant restarts before it actually processes the task. The errored tasks are when it gets one too many restarts. . . I agree that maybe shutting down the machine, lifting the lid and blowing or sucking out the dust bunnies might be a good idea, and checking the PCIe power cables as well. I prefer the vacuum cleaner to the canned air approach as it does not simply move the dust, but actually removes the dust. If that and the restart do not restore normal behaviour then serious hardware issues sound likely. Stephen <fingers crossed> :) |
Retvari Zoltan Joined: 28 Apr 00 Posts: 35 Credit: 128,746,856 RAC: 230 |
"Brilliant workaround... applying to every machine stat and thank you! :^)" This is a warning, as I can't test it on multi GPU systems. Judging by the stderr output, the BOINC manager assigns the CUDA device number in a different way than the OpenCL device number, so the CUDA client will not know which device it should use. Perhaps it will use only device 0, even if another instance is already using it. |
Wiggo Joined: 24 Jan 00 Posts: 36390 Credit: 261,360,520 RAC: 489 |
"but this latest 'episode' is something else. It may well be, make or break, as you suggest. For me, that moment appears to have arrived. Best of luck to you all." "Actually Iona your failed tasks are all from 'device 1' which should be your 2nd GPU as 'device 0' is completing its tasks. If a restart doesn't clear the problem (it may have just had a driver crash and hasn't recovered) then just remove that offending GPU." ". . Hi Wiggo," Yes it does only show a single card now so it either didn't reinitialise after a reboot and/or Iona removed it, but with the replica still being a good 11hrs behind only Iona can tell us sooner what her solution was and if it's working (and I didn't waste my time going through a large random selection of her errored and validated tasks' stderr outputs for nothing). ;-) Cheers. |
Mr. Kevvy Joined: 15 May 99 Posts: 3797 Credit: 1,114,826,392 RAC: 3,319 |
|
Zalster Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
Mr. Kevvy are you using BoincTasks to monitor the work units when they start? Should tell you which GPUs the work units are running on. |
Retvari Zoltan Joined: 28 Apr 00 Posts: 35 Credit: 128,746,856 RAC: 230 |
"Thanks... I think that explains the poor performance I am having on the 4-GPU system... they are all using the same card! The 2 GPU systems are still an improvement over stock. The only issue I had was that on first launch a resumed task may stall; abort it and all the rest are fine." Perhaps you could try uninstalling OpenCL, forcing your system to ask only for CUDA tasks. I think the special app can fully replace the original CUDA60 app, so the BOINC manager can correctly set the device id to use. |
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
the issue is likely the task trying to restart a task that was previously running a different app. it was running on an openCL app, and then when you renamed it to get the CUDA app in there, it tries to pick up where it left off, but everything is different and it just hangs. I just aborted any tasks running on GPUs that showed 0% GPU utilization and it successfully picks up new tasks. running fine on my 10-GPU host, when it can get tasks. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
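For anyone who wants to spot the stalled cards described above without watching each one by hand, here is a minimal sketch that lists GPUs reporting 0% utilization. It assumes NVIDIA cards with `nvidia-smi` on the PATH; the helper names (`parse_utilization`, `idle_gpus`, `query_nvidia_smi`) are made up for illustration and are not part of BOINC or the special app. It only reports idle devices; deciding which task to abort is still up to you.

```python
import subprocess


def parse_utilization(csv_text):
    """Turn 'index, utilization' CSV rows into {gpu_index: percent}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        # Skip rows that are malformed or report a non-numeric value.
        if len(fields) != 2 or not all(f.isdigit() for f in fields):
            continue
        usage[int(fields[0])] = int(fields[1])
    return usage


def idle_gpus(usage, threshold=0):
    """Return the sorted GPU indices at or below the utilization threshold."""
    return sorted(index for index, percent in usage.items() if percent <= threshold)


def query_nvidia_smi():
    """Ask nvidia-smi for per-GPU utilization (assumes it is installed)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_utilization(result.stdout)


if __name__ == "__main__":
    # Made-up sample output so the sketch runs on a machine without a GPU.
    sample = "0, 97\n1, 0\n2, 98\n"
    print("Idle GPUs in sample:", idle_gpus(parse_utilization(sample)))
```

A single reading of 0% can also just mean a task is between kernels, so it is probably worth sampling a few times before aborting anything.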
Retvari Zoltan Joined: 28 Apr 00 Posts: 35 Credit: 128,746,856 RAC: 230 |
I can't figure out how to pass the -nobs parameter with my workaround. |
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
try putting it in the cmdline text file? mb_cmdline-8.22-opencl_nvidia_SoG.txt mb_cmdline-8.22-opencl_nvidia_sah.txt etc Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Retvari Zoltan Joined: 28 Apr 00 Posts: 35 Credit: 128,746,856 RAC: 230 |
I did it. It has no effect. |
Cosmic_Ocean Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
Interesting. Well it looks like I've been out of work for two days now. Just noticed it. It's been a lot of HTTP errors over the past 48 hours when trying to contact scheduler, and the times that it is successful, the response is well over 60 seconds from request to reply... and it comes back with "no tasks available." But looking here on the website.. I apparently have 19 in progress. Yay ghosts. What was the procedure for that again? Send a work request, then suspend network before a reply, and then give it a few minutes and allow network again and do another request? Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) |
Wiggo Joined: 24 Jan 00 Posts: 36390 Credit: 261,360,520 RAC: 489 |
"Interesting. Well it looks like I've been out of work for two days now. Just noticed it. It's been a lot of HTTP errors over the past 48 hours when trying to contact scheduler, and the times that it is successful, the response is well over 60 seconds from request to reply... and it comes back with 'no tasks available.'" The replica is still 11 hours behind so it's likely that you have no ghosts at all. ;-) Cheers. |
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
it should work. I'm running it in my text file. and I'm seeing the right cpu use % for my setup. 7 GPUs running the special app on a 12 thread system sees ~60% CPU use 10 GPUs running the special app on a 40 thread system sees about 25% CPU use. you can also add it to your app_config file. add this line: <cmdline>-nobs</cmdline> Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
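The CPU figures quoted above are roughly what you would expect if `-nobs` makes each GPU task keep one CPU thread fully busy. A quick back-of-the-envelope check, assuming one task per GPU and nothing else loading the CPU (both are assumptions, and `expected_cpu_percent` is just an illustrative name):

```python
def expected_cpu_percent(gpu_tasks, cpu_threads):
    """Share of total CPU used if each GPU task occupies one full thread."""
    return 100.0 * gpu_tasks / cpu_threads


if __name__ == "__main__":
    # The two hosts described in the post: (GPUs, CPU threads).
    for gpus, threads in [(7, 12), (10, 40)]:
        share = expected_cpu_percent(gpus, threads)
        print(f"{gpus} GPUs on {threads} threads: about {share:.0f}% CPU")
```

That works out to about 58% and 25%, in line with the ~60% and ~25% reported. If your host shows far less CPU use than this estimate, that may be a hint the parameter is not being picked up.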
wujj123456 Joined: 5 Sep 04 Posts: 40 Credit: 20,877,975 RAC: 219 |
"the issue is likely the task trying to restart a task that was previously running a different app. it was running on an openCL app, and then when you renamed it to get the CUDA app in there, it tries to pick up where it left off, but everything is different and it just hangs." How often do you get tasks? Even with stock and reset, I got only like 50 WUs yesterday when I got lucky. I just tried again and I got either no new work or an HTTP internal error. I feel it's probably wiser for me to just stop sending requests to the server until this is resolved. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.