The Server Issues / Outages Thread - Panic Mode On! (118)
Unixchick · Joined: 5 Mar 12 · Posts: 815 · Credit: 2,361,516 · RAC: 22

Thanks for the info, Richard. It helps to know even a little bit of what is going on.

Edit: the replica is catching up - less than 4 hours behind.
Ian&Steve C. · Joined: 28 Sep 99 · Posts: 4267 · Credit: 1,282,604,591 · RAC: 6,640

That format seems to have done the trick. GPU utilization is now ~98% on all GPUs and runtimes are improved. Thanks! Here's mine:

```
<app_config>
  <app_version>
    <app_name>astropulse_v7</app_name>
    <plan_class>opencl_nvidia_100</plan_class>
    <avg_ncpus>1.0</avg_ncpus>
    <ngpus>0.5</ngpus>
    <cmdline></cmdline>
  </app_version>
  <app_version>
    <app_name>setiathome_v8</app_name>
    <plan_class>opencl_nvidia_sah</plan_class>
    <avg_ncpus>0.5</avg_ncpus>
    <ngpus>0.5</ngpus>
    <cmdline>-nobs</cmdline>
  </app_version>
  <app_version>
    <app_name>setiathome_v8</app_name>
    <plan_class>opencl_nvidia_SoG</plan_class>
    <avg_ncpus>0.5</avg_ncpus>
    <ngpus>0.5</ngpus>
    <cmdline>-nobs</cmdline>
  </app_version>
  <app_version>
    <app_name>setiathome_v8</app_name>
    <plan_class>cuda90</plan_class>
    <avg_ncpus>0.5</avg_ncpus>
    <ngpus>0.5</ngpus>
    <cmdline>-nobs</cmdline>
  </app_version>
</app_config>
```

If you're not running the mutex build, change the ngpus line to 1.0 to run only one WU per GPU.
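For reference, a minimal sketch of that suggested single-instance variant, derived from the config above with only the ngpus value changed:

```
<app_version>
  <app_name>setiathome_v8</app_name>
  <plan_class>cuda90</plan_class>
  <avg_ncpus>0.5</avg_ncpus>
  <!-- 1.0 = one task per GPU, for non-mutex builds -->
  <ngpus>1.0</ngpus>
  <cmdline>-nobs</cmdline>
</app_version>
```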
Retvari Zoltan · Joined: 28 Apr 00 · Posts: 35 · Credit: 128,746,856 · RAC: 230

The plan class of the last app should be cuda60, as there's no official cuda90 app.
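Applied to the config posted above, that correction would make the last stanza read as follows (a sketch; only the plan_class changes):

```
<app_version>
  <app_name>setiathome_v8</app_name>
  <plan_class>cuda60</plan_class>
  <avg_ncpus>0.5</avg_ncpus>
  <ngpus>0.5</ngpus>
  <cmdline>-nobs</cmdline>
</app_version>
```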
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14680 · Credit: 200,643,578 · RAC: 874

Thanks for the app_config. Which exact special app are you running? It turned out that there were important differences between them. In my case:

setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda101 failed - loaded all work onto one GPU
setiathome_x41p_V0.99b1p3_x86_64-pc-linux-gnu_cuda100 worked - loaded all GPUs
Freewill · Joined: 19 May 99 · Posts: 766 · Credit: 354,398,348 · RAC: 11,693

> Thanks for the app_config. Which exact special app are you running? It turned out that there were important differences between them.

My bad luck. I have 101 and 90, and I am using 101. Will 90 do, or should I get 100? If 100, where can I find it? Thanks, Roger.
Ian&Steve C. · Joined: 28 Sep 99 · Posts: 4267 · Credit: 1,282,604,591 · RAC: 6,640

Have you tried aborting the tasks showing no GPU utilization? I noticed this too. Those tasks are stuck because they started running under an OpenCL app, then you changed apps, and the client doesn't know what to do with them. Try aborting them and letting it pick up new tasks.
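If there are a lot of them, here is a minimal sketch of scripting that abort with BOINC's boinccmd tool (assuming boinccmd is on the PATH and local RPC is allowed; the task names below are placeholders):

```python
# Sketch: abort stuck tasks through the BOINC client's command-line tool.
import subprocess

PROJECT_URL = "https://setiathome.berkeley.edu/"

def abort_task(task_name: str) -> None:
    # "boinccmd --task <project_url> <task_name> abort" is the per-task
    # operation the client exposes for aborting a single result.
    subprocess.run(
        ["boinccmd", "--task", PROJECT_URL, task_name, "abort"],
        check=True,
    )

# Placeholder names - list the real ones with "boinccmd --get_tasks".
for name in ["stuck_task_name_0", "stuck_task_name_1"]:
    abort_task(name)
```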
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14680 · Credit: 200,643,578 · RAC: 874

Ian, it might help if you (or one of the other GPUUG 'special source' contributors) could identify which of the special apps were built on your v7.14.2 build system (and run on all GPUs), and which were built on an earlier system and are stuck on device 0.
Stephen "Heretic" · Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628

> Well, that should be just great after everyone has hacked up their configurations eight ways to Sunday just to try to keep contributing. It would be nice if they could post something official just before or after the maintenance to tell us where we stand. I'm pissed at myself for wasting my time, currently on a train with spotty cell hotspot access.

That's funny, because I was feeling exactly the same way. After a weekend with no work, and finally getting a workaround from very helpful and industrious secret squirrels that was letting me relax and get back to producing work, it is all going to be reversed and I'll have to undo it. But this is why I am a devout believer in Murphy's Law. ... <shrug>

This is the best solution, though, because it helps all the less involved "enhanced contributors" on anonymous platforms, who are probably scratching their heads wondering why they are not getting any work or even able to report their results. And that is important both to maintain productivity and, especially, to maintain enthusiasm among the volunteers.

Stephen :)
Ian&Steve C. · Joined: 28 Sep 99 · Posts: 4267 · Credit: 1,282,604,591 · RAC: 6,640

> Ian, it might help if you (or one of the other GPUUG 'special source' contributors) could identify which of the special apps were built on your v7.14.2 build system (and run on all GPUs), and which were built on an earlier system and are stuck on device 0.

The only builds of mine that are out in the wild are the mutex builds, so the v0.99 apps are from me, built against 7.14.2. I built some v0.98 non-mutex builds for myself but never released them. TBar compiled the other ones that are in his AIO, but I don't know his compile environment, so he will have to answer that.
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768

> I did that when I first made the change so all jobs restarted. I think my problem is using the 101 as you said. It is only loading 1 GPU on each of my 3 machines. So, the problem is quite consistent. I just need to get a 100 exe over to these machines.

The most recent All-In-One has a CUDA 10.2 app, which is working just fine on my 14-GPU machine running the spoofed CUDA60. It is running as stock and doing quite well: http://www.arkayn.us/lunatics/BOINC.7z

You will need a Pascal or higher GPU and at least driver 440.33; the post is here: https://setiathome.berkeley.edu/forum_thread.php?id=84927

Everyone who can should update to this build, as it should produce fewer Inconclusive results than 101.
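As a quick way to check that driver requirement before switching, here is a minimal sketch using nvidia-smi's query interface (assumes nvidia-smi is installed; the 440.33 threshold comes from the post above):

```python
# Sketch: check whether the installed NVIDIA driver meets the 440.33
# minimum mentioned for the CUDA 10.2 build.
import subprocess

def driver_version() -> tuple:
    # nvidia-smi can report the driver version as plain CSV text.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip().splitlines()[0]
    return tuple(int(x) for x in out.split("."))

if __name__ == "__main__":
    ok = driver_version() >= (440, 33)
    print("driver OK for the CUDA 10.2 build" if ok else "driver too old")
```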
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14680 · Credit: 200,643,578 · RAC: 874

> I just need to get a 100 exe over to these machines.

Don't look just at that one figure. I think the V0.98b1 (failed) and V0.99b1p3 (worked) are the more important parts of the file names. The API version number should be embedded in the code, and it should be possible to find it with a hex editor (!!! - yes, that's really the way it's done).

setiathome_x41p_V0.99b1p3_x86_64-pc-linux-gnu_cuda100 has "API_VERSION_7.15.0" embedded inside it, so that's good.
setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda101 has no API_VERSION string at all.

Edit: the full name of the 102 app that TBar has just mentioned is setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda102. It's built with API_VERSION_7.5.0 - which is old, but should be good enough.
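For anyone who would rather not open a hex editor, a minimal sketch of the same embedded-string check (the script name is hypothetical; pass the path of the app binary you want to inspect):

```python
# Sketch: look for the embedded BOINC API_VERSION string in an app binary,
# the same marker described above as findable with a hex editor.
import re
import sys

def find_api_version(path):
    with open(path, "rb") as f:
        data = f.read()
    match = re.search(rb"API_VERSION_[0-9.]+", data)
    return match.group().decode() if match else None

if __name__ == "__main__":
    # e.g. python find_api_version.py setiathome_x41p_V0.99b1p3_x86_64-pc-linux-gnu_cuda100
    print(find_api_version(sys.argv[1]) or "no API_VERSION string found")
```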
Cosmic_Ocean · Joined: 23 Dec 00 · Posts: 3027 · Credit: 13,516,867 · RAC: 13

> I've had a provisional whisper back from the project. As of about an hour ago, the provisional plan is to go through normal maintenance and backup tomorrow (Tuesday), and then revert the database changes so we can go back to running the old server code.

Cool. So I shouldn't need to do anything at all on my end, then. I was content with just keeping things as they were until the problem got found and fixed on the server side.
Ian&Steve C. · Joined: 28 Sep 99 · Posts: 4267 · Credit: 1,282,604,591 · RAC: 6,640

> Don't look just at that one figure. I think the V0.98b1 (failed) and V0.99b1p3 (worked) are the more important parts of the file names.

Oh, 7.15 - that sounds familiar. I know I was messing around a lot, compiling different versions of stuff. The last BOINC client I compiled was 7.14.2, but maybe some other parts of the BOINC code in my compile environment (an Ubuntu 18.04 VM) are from 7.15.0.

Like I said, I don't know TBar's setup. IIRC he uses some old stuff so he can still output an executable, but I found a different way to get an executable created on the newer environment; otherwise you get a "shared object" file instead of an "executable".
zoom3+1=4 · Joined: 30 Nov 03 · Posts: 66387 · Credit: 55,293,173 · RAC: 49

> I've had a provisional whisper back from the project. As of about an hour ago, the provisional plan is to go through normal maintenance and backup tomorrow (Tuesday), and then revert the database changes so we can go back to running the old server code.

Sounds good to me too. I'm going back to watching some clouds. Later.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14680 · Credit: 200,643,578 · RAC: 874

> Oh, 7.15 - that sounds familiar. I know I was messing around a lot, compiling different versions of stuff. The last BOINC client I compiled was 7.14.2, but maybe some other parts of the BOINC code in my compile environment (an Ubuntu 18.04 VM) are from 7.15.0.

Anything downloaded from the GitHub master (development) branch picks up the current generic development version number - which is why we're running on server version 715 now; that's the same number. Development versions are always odd; client releases are given an even number when they're frozen for release. We're testing client v7.16.3 now, so we should be developing v7.17 - but things have got a bit stuck.

See the edit to my last post - TBar's system is now reporting API_VERSION_7.5.0, which is good enough.
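A tiny illustrative sketch of that odd/even numbering convention (my own restatement of the rule described above, not project code):

```python
def boinc_branch(version: str) -> str:
    # Convention described above: odd minor versions are development
    # builds; even minor versions are frozen releases.
    minor = int(version.split(".")[1])
    return "development" if minor % 2 else "release"

assert boinc_branch("7.15.0") == "development"  # current dev / server line
assert boinc_branch("7.16.3") == "release"      # client release under test
```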
Ian&Steve C. · Joined: 28 Sep 99 · Posts: 4267 · Credit: 1,282,604,591 · RAC: 6,640

> You will need a Pascal or higher GPU and at least driver 440.33

Don't you mean Maxwell or higher? I see you are, or have been, running this app on cards as old as 900-series cards and 750 Tis, which are all Maxwell, the generation before Pascal. Unless you have a different, unreleased version of this 10.2 app that you use yourself for your Maxwell cards?
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14680 · Credit: 200,643,578 · RAC: 874

> ...these problems were known on Beta already, and really should have been hashed out before deployment.

I've just had a whisper in from CERN, where the server code is packaged (and where some of the Linux repo packaging work is done, I think). CERN wrote:

> CERN closes for two weeks over the holidays. We are encouraged not to make any changes at least a week before. That time should be used for documentation etc.

That was a private whisper so far, but it'll get passed on to Berkeley when we next speak.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14680 · Credit: 200,643,578 · RAC: 874

> But is it really CERN that installs upgrades on SETI's servers? I really doubt that.

No, but they do have a role to play (as LHC@Home) in testing server code, packaging it, and making it available to new and upgrading projects. The actual server code is mostly written at UC Berkeley, so it travels out and travels back in. Since UCB wrote it, they presumably trust it - but it was "somebody at Berkeley" who decided to update the live server last Friday - I still don't know who - and they missed the warning signs from Beta.