The Server Issues / Outages Thread - Panic Mode On! (118)
Unixchick · Joined: 5 Mar 12 · Posts: 815 · Credit: 2,361,516 · RAC: 22

Thanks for the info, Richard. It helps to know even a little bit of what is going on.

Edit: the replica is catching up - less than 4 hours behind.
Ian&Steve C. · Joined: 28 Sep 99 · Posts: 4267 · Credit: 1,282,604,591 · RAC: 6,640

That format seems to have done the trick. GPU utilization is now ~98% on all GPUs and runtimes are improved. Thanks! Here's mine:

```
<app_config>
  <app_version>
    <app_name>astropulse_v7</app_name>
    <plan_class>opencl_nvidia_100</plan_class>
    <avg_ncpus>1.0</avg_ncpus>
    <ngpus>0.5</ngpus>
    <cmdline></cmdline>
  </app_version>
  <app_version>
    <app_name>setiathome_v8</app_name>
    <plan_class>opencl_nvidia_sah</plan_class>
    <avg_ncpus>0.5</avg_ncpus>
    <ngpus>0.5</ngpus>
    <cmdline>-nobs</cmdline>
  </app_version>
  <app_version>
    <app_name>setiathome_v8</app_name>
    <plan_class>opencl_nvidia_SoG</plan_class>
    <avg_ncpus>0.5</avg_ncpus>
    <ngpus>0.5</ngpus>
    <cmdline>-nobs</cmdline>
  </app_version>
  <app_version>
    <app_name>setiathome_v8</app_name>
    <plan_class>cuda90</plan_class>
    <avg_ncpus>0.5</avg_ncpus>
    <ngpus>0.5</ngpus>
    <cmdline>-nobs</cmdline>
  </app_version>
</app_config>
```

If you're not running the mutex build, change the ngpus line to 1.0 to run only one WU per GPU.
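For reference, a minimal sketch of that suggested single-instance variant, derived from the config above with only the ngpus value changed:

```
<app_version>
  <app_name>setiathome_v8</app_name>
  <plan_class>cuda90</plan_class>
  <avg_ncpus>0.5</avg_ncpus>
  <!-- 1.0 = one task per GPU, for non-mutex builds -->
  <ngpus>1.0</ngpus>
  <cmdline>-nobs</cmdline>
</app_version>
```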
Retvari Zoltan · Joined: 28 Apr 00 · Posts: 35 · Credit: 128,746,856 · RAC: 230

The plan class of the last app should be cuda60, as there's no official cuda90 app.
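Applied to the config posted above, that correction would make the last stanza read as follows (a sketch; only the plan_class changes):

```
<app_version>
  <app_name>setiathome_v8</app_name>
  <plan_class>cuda60</plan_class>
  <avg_ncpus>0.5</avg_ncpus>
  <ngpus>0.5</ngpus>
  <cmdline>-nobs</cmdline>
</app_version>
```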
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14680 · Credit: 200,643,578 · RAC: 874

Thanks for the app_config. Which exact special app are you running? It turned out that there were important differences between them. In my case:

setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda101 failed - loaded all work onto one GPU
setiathome_x41p_V0.99b1p3_x86_64-pc-linux-gnu_cuda100 worked - loaded all GPUs
Freewill · Joined: 19 May 99 · Posts: 766 · Credit: 354,398,348 · RAC: 11,693

> Thanks for the app_config. Which exact special app are you running? It turned out that there were important differences between them.

My bad luck. I have 101 and 90, and I am using 101. Will 90 do, or should I get 100? If 100, where can I find it? Thanks, Roger.
Ian&Steve C. · Joined: 28 Sep 99 · Posts: 4267 · Credit: 1,282,604,591 · RAC: 6,640

Have you tried aborting the tasks showing no GPU utilization? I noticed this too. Those tasks are stuck because they started running under an OpenCL app, then you changed apps, and the client doesn't know what to do with them. Try aborting them and letting it pick up new tasks.
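If there are a lot of them, here is a minimal sketch of scripting that abort with BOINC's boinccmd tool (assuming boinccmd is on the PATH and local RPC is allowed; the task names below are placeholders):

```python
# Sketch: abort stuck tasks through the BOINC client's command-line tool.
import subprocess

PROJECT_URL = "https://setiathome.berkeley.edu/"

def abort_task(task_name: str) -> None:
    # "boinccmd --task <project_url> <task_name> abort" is the per-task
    # operation the client exposes for aborting a single result.
    subprocess.run(
        ["boinccmd", "--task", PROJECT_URL, task_name, "abort"],
        check=True,
    )

# Placeholder names - list the real ones with "boinccmd --get_tasks".
for name in ["stuck_task_name_0", "stuck_task_name_1"]:
    abort_task(name)
```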
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14680 · Credit: 200,643,578 · RAC: 874

Ian, it might help if you (or one of the other GPUUG 'special source' contributors) could identify which of the special apps were built on your v7.14.2 build system (and run on all GPUs), and which were built on an earlier system and are stuck on device 0.
Stephen "Heretic" · Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628

> Well, that should be just great after everyone has hacked up their configurations eight ways to Sunday just to try to keep contributing. It would be nice if they could post something official just before or after the maintenance to tell us where we stand. I'm pissed at myself for wasting my time, currently on a train with spotty cell hotspot access.

That's funny, because I was feeling exactly the same way. After a weekend with no work, and finally getting a workaround from very helpful and industrious secret squirrels that was letting me relax and get back to producing work, it is all going to be reversed and I'll have to undo it. But this is why I am a devout believer in Murphy's Law. ... <shrug>

This is the best solution, though, because it helps all the less involved "enhanced contributors" on anonymous platforms, who are probably scratching their heads wondering why they are not getting any work or even able to report their results. And that is important both to maintain productivity and, especially, to maintain enthusiasm among the volunteers.

Stephen :)
Ian&Steve C. · Joined: 28 Sep 99 · Posts: 4267 · Credit: 1,282,604,591 · RAC: 6,640

> Ian, it might help if you (or one of the other GPUUG 'special source' contributors) could identify which of the special apps were built on your v7.14.2 build system (and run on all GPUs), and which were built on an earlier system and are stuck on device 0.

The only builds of mine that are out in the wild are the mutex builds, so the v0.99 apps are from me, built against 7.14.2. I built some v0.98 non-mutex builds for myself but never released them. TBar compiled the other ones that are in his AIO, but I don't know his compile environment, so he will have to answer that.
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768

> I did that when I first made the change so all jobs restarted. I think my problem is using the 101 as you said. It is only loading 1 GPU on each of my 3 machines. So, the problem is quite consistent. I just need to get a 100 exe over to these machines.

The most recent All-In-One has a CUDA 10.2 app, which is working just fine on my 14-GPU machine running the spoofed CUDA60. It is running as stock and doing quite well: http://www.arkayn.us/lunatics/BOINC.7z

You will need a Pascal or higher GPU and at least driver 440.33; the post is here: https://setiathome.berkeley.edu/forum_thread.php?id=84927

Everyone who can should update to this build, as it should produce fewer Inconclusive results than 101.
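As a quick way to check that driver requirement before switching, here is a minimal sketch using nvidia-smi's query interface (assumes nvidia-smi is installed; the 440.33 threshold comes from the post above):

```python
# Sketch: check whether the installed NVIDIA driver meets the 440.33
# minimum mentioned for the CUDA 10.2 build.
import subprocess

def driver_version() -> tuple:
    # nvidia-smi can report the driver version as plain CSV text.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip().splitlines()[0]
    return tuple(int(x) for x in out.split("."))

if __name__ == "__main__":
    ok = driver_version() >= (440, 33)
    print("driver OK for the CUDA 10.2 build" if ok else "driver too old")
```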
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14680 · Credit: 200,643,578 · RAC: 874

> I just need to get a 100 exe over to these machines.

Don't look just at that one figure. I think the V0.98b1 (failed) and V0.99b1p3 (worked) are the more important parts of the file names. The API version number should be embedded in the code, and it should be possible to find it with a hex editor (!!! - yes, that's really the way it's done).

setiathome_x41p_V0.99b1p3_x86_64-pc-linux-gnu_cuda100 has "API_VERSION_7.15.0" embedded inside it, so that's good.
setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda101 has no API_VERSION string at all.

Edit: the full name of the 102 app that TBar has just mentioned is setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda102. It's built with API_VERSION_7.5.0 - which is old, but should be good enough.
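For anyone who would rather not open a hex editor, a minimal sketch of the same embedded-string check (the script name is hypothetical; pass the path of the app binary you want to inspect):

```python
# Sketch: look for the embedded BOINC API_VERSION string in an app binary,
# the same marker described above as findable with a hex editor.
import re
import sys

def find_api_version(path):
    with open(path, "rb") as f:
        data = f.read()
    match = re.search(rb"API_VERSION_[0-9.]+", data)
    return match.group().decode() if match else None

if __name__ == "__main__":
    # e.g. python find_api_version.py setiathome_x41p_V0.99b1p3_x86_64-pc-linux-gnu_cuda100
    print(find_api_version(sys.argv[1]) or "no API_VERSION string found")
```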
Cosmic_Ocean · Joined: 23 Dec 00 · Posts: 3027 · Credit: 13,516,867 · RAC: 13

> I've had a provisional whisper back from the project. As of about an hour ago, the provisional plan is to go through normal maintenance and backup tomorrow (Tuesday), and then revert the database changes so we can go back to running the old server code.

Cool. So I shouldn't need to do anything at all on my end, then. I was content with just keeping things as they were until the problem got found and fixed on the server side.
Ian&Steve C. · Joined: 28 Sep 99 · Posts: 4267 · Credit: 1,282,604,591 · RAC: 6,640

> Don't look just at that one figure. I think the V0.98b1 (failed) and V0.99b1p3 (worked) are the more important parts of the file names.

Oh, 7.15 - that sounds familiar. I know I was messing around a lot, compiling different versions of stuff. The last BOINC client I compiled was 7.14.2, but maybe some other parts of the BOINC code in my compile environment (an Ubuntu 18.04 VM) are from 7.15.0.

Like I said, I don't know TBar's setup. IIRC he uses some old stuff so he can still output an executable, but I found a different way to get an executable created on the newer environment; otherwise you get a "shared object" file instead of an "executable".
zoom3+1=4 · Joined: 30 Nov 03 · Posts: 66387 · Credit: 55,293,173 · RAC: 49

> I've had a provisional whisper back from the project. As of about an hour ago, the provisional plan is to go through normal maintenance and backup tomorrow (Tuesday), and then revert the database changes so we can go back to running the old server code.

Sounds good to me too. I'm going back to watching some clouds. Later.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14680 · Credit: 200,643,578 · RAC: 874

> Oh, 7.15 - that sounds familiar. I know I was messing around a lot, compiling different versions of stuff. The last BOINC client I compiled was 7.14.2, but maybe some other parts of the BOINC code in my compile environment (an Ubuntu 18.04 VM) are from 7.15.0.

Anything downloaded from the GitHub master (development) branch picks up the current generic development version number - which is why we're running on server version 715 now; that's the same number. Development versions are always odd; client releases are given an even number when they're frozen for release. We're testing client v7.16.3 now, so we should be developing v7.17 - but things have got a bit stuck.

See the edit to my last post - TBar's system is now reporting API_VERSION_7.5.0, which is good enough.
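A tiny illustrative sketch of that odd/even numbering convention (my own restatement of the rule described above, not project code):

```python
def boinc_branch(version: str) -> str:
    # Convention described above: odd minor versions are development
    # builds; even minor versions are frozen releases.
    minor = int(version.split(".")[1])
    return "development" if minor % 2 else "release"

assert boinc_branch("7.15.0") == "development"  # current dev / server line
assert boinc_branch("7.16.3") == "release"      # client release under test
```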
Ian&Steve C. · Joined: 28 Sep 99 · Posts: 4267 · Credit: 1,282,604,591 · RAC: 6,640

> You will need a Pascal or higher GPU and at least driver 440.33

Don't you mean Maxwell or higher? I see you are, or have been, running this app on cards as old as 900-series cards and 750 Tis, which are all Maxwell, the generation before Pascal. Unless you have a different, unreleased version of this 10.2 app that you use yourself for your Maxwell cards?
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14680 · Credit: 200,643,578 · RAC: 874

> ...these problems were known on Beta already, and really should have been hashed out before deployment.

I've just had a whisper in from CERN, where the server code is packaged (and where some of the Linux repo packaging work is done, I think). CERN wrote:

> CERN closes for two weeks over the holidays. We are encouraged not to make any changes at least a week before. That time should be used for documentation etc.

That was a private whisper so far, but it'll get passed on to Berkeley when we next speak.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14680 · Credit: 200,643,578 · RAC: 874

> But is it really CERN that installs upgrades on SETI's servers? I really doubt that.

No, but they do have a role to play (as LHC@Home) in testing server code, packaging it, and making it available to new and upgrading projects. The actual server code is mostly written at UC Berkeley, so it travels out and travels back in. Since UCB wrote it, they presumably trust it - but it was "somebody at Berkeley" who decided to update the live server last Friday - I still don't know who - and they missed the warning signs from Beta.