The Server Issues / Outages Thread - Panic Mode On! (118)

Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2024664 - Posted: 23 Dec 2019, 17:30:19 UTC
Last modified: 23 Dec 2019, 17:31:09 UTC

Thanks for the info, Richard. It helps to know even a little bit of what is going on.

Edit: the replica database is catching up; less than 4 hours behind.
ID: 2024664
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2024665 - Posted: 23 Dec 2019, 17:32:32 UTC - in response to Message 2024662.  

that format seems to have done the trick. GPU utilization now ~98% on all GPUs and runtimes are improved. thanks!


Would you mind posting an updated app_config? I'm still having problems getting 98% utilization on my GPUs. I had renamed the cuda101 to the stock SoG and sah exes. Was that the correct thing to do?

Can't wait to restore my optimized setiathome folders once this nightmare is over.


here's mine:

<app_config>
   <app_version>
       <app_name>astropulse_v7</app_name>
       <plan_class>opencl_nvidia_100</plan_class>
       <avg_ncpus>1.0</avg_ncpus>
       <ngpus>0.5</ngpus>
       <cmdline></cmdline>
   </app_version>
   <app_version>
       <app_name>setiathome_v8</app_name>
       <plan_class>opencl_nvidia_sah</plan_class>
       <avg_ncpus>0.5</avg_ncpus>
       <ngpus>0.5</ngpus>
       <cmdline>-nobs</cmdline>
   </app_version>
   <app_version>
       <app_name>setiathome_v8</app_name>
       <plan_class>opencl_nvidia_SoG</plan_class>
       <avg_ncpus>0.5</avg_ncpus>
       <ngpus>0.5</ngpus>
       <cmdline>-nobs</cmdline>
   </app_version>
   <app_version>
       <app_name>setiathome_v8</app_name>
       <plan_class>cuda90</plan_class>
       <avg_ncpus>0.5</avg_ncpus>
       <ngpus>0.5</ngpus>
       <cmdline>-nobs</cmdline>
   </app_version>
</app_config>


If you're not running the mutex build, change the ngpus line to 1.0 to run only one WU per GPU.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2024665
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2024668 - Posted: 23 Dec 2019, 17:55:49 UTC - in response to Message 2024665.  

Thanks for the app_config.

I'm running that, but on a 3 GPU machine, I'm only getting 1 GPU being loaded. The others show as running in the task list but are very slow. Nvidia-smi shows 0-4% utilization. Any ideas what variable I'm missing and in what file??
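For reference, a quick way to watch per-GPU load while testing is a looping nvidia-smi query (standard options; adjust the interval to taste):

# show index, name, utilization and memory use for every GPU, refreshed every 5 seconds
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used --format=csv -l 5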
ID: 2024668
Profile Retvari Zoltan
Joined: 28 Apr 00
Posts: 35
Credit: 128,746,856
RAC: 230
Hungary
Message 2024669 - Posted: 23 Dec 2019, 17:58:58 UTC - in response to Message 2024665.  

The plan class of the last app should be cuda60, as there's no official cuda90 app.
ID: 2024669
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024670 - Posted: 23 Dec 2019, 18:06:25 UTC - in response to Message 2024668.  

Thanks for the app_config.

I'm running that, but on a 3 GPU machine, I'm only getting 1 GPU being loaded. The others show as running in the task list but are very slow. Nvidia-smi shows 0-4% utilization. Any ideas what variable I'm missing and in what file??

Which exact special app are you running? It turned out that there were important differences between them.

In my case, setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda101 failed - loaded all work onto one GPU
setiathome_x41p_V0.99b1p3_x86_64-pc-linux-gnu_cuda100 worked - loaded all GPUs
ID: 2024670
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2024671 - Posted: 23 Dec 2019, 18:08:46 UTC - in response to Message 2024670.  

Thanks for the app_config.

I'm running that, but on a 3 GPU machine, I'm only getting 1 GPU being loaded. The others show as running in the task list but are very slow. Nvidia-smi shows 0-4% utilization. Any ideas what variable I'm missing and in what file??

Which exact special app are you running? It turned out that there were important differences between them.

In my case, setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda101 failed - loaded all work onto one GPU
setiathome_x41p_V0.99b1p3_x86_64-pc-linux-gnu_cuda100 worked - loaded all GPUs


My bad luck. I have 101 and 90. I am using 101. Will 90 do or should I get 100? If 100, where can I find it?
Thanks, Roger.
ID: 2024671
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2024672 - Posted: 23 Dec 2019, 18:13:32 UTC - in response to Message 2024671.  

Have you tried aborting the tasks showing no GPU utilization?

I noticed this. The tasks are stuck because they started running under an OpenCL app and then you changed it, so the client doesn't know what to do with them. Try aborting them and letting the client pick up new tasks.
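If there are a lot of them, aborting from the command line can be quicker than clicking through the Manager. A rough sketch, assuming boinccmd is installed and run from the BOINC data directory; task names are the ones listed by --get_tasks, the project URL is whatever --get_project_status reports for SETI@home, and <task_name> is a placeholder:

# list tasks to get their exact names
boinccmd --get_tasks
# abort one stuck task by name (repeat for each stuck task)
boinccmd --task http://setiathome.berkeley.edu/ <task_name> abort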
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2024672
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024673 - Posted: 23 Dec 2019, 18:21:20 UTC

Ian, it might help if you (or one of the other GPUUG 'special source' contributors) could help identify which of the special apps were built on your v7.14.2 build system (and run on all GPUs), and which were built on an earlier system and are stuck on device 0.
ID: 2024673
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2024674 - Posted: 23 Dec 2019, 18:22:02 UTC - in response to Message 2024672.  

I did that when I first made the change so all jobs restarted. I think my problem is using the 101, as you said. It is only loading 1 GPU on each of my 3 machines. So the problem is quite consistent. I just need to get a 100 exe over to these machines.
ID: 2024674
Stephen "Heretic" Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2024675 - Posted: 23 Dec 2019, 18:30:03 UTC - in response to Message 2024660.  

Well, that should be just great after everyone has hacked up their configurations 8 ways to Sunday just to try to keep contributing. It would be nice if they could post something official just before or after the maintenance to tell us where we stand. I'm pissed at myself for wasting my time, currently on a train with spotty cell hotspot access.


. . That's funny, because I was feeling exactly the same way. After a weekend with no work, and finally getting a workaround from some very helpful and industrious secret squirrels that was letting me relax and get back to producing work, it is now going to be reversed and I will have to undo it all. But this is why I am a devout believer in Murphy's Law. ... <shrug>

. . This is the best solution though, because it helps all the less involved "enhanced contributors" on anonymous platforms who are probably scratching their heads wondering why they are not getting any work, or even able to report their results. And that is important both to maintain productivity and, especially, to maintain enthusiasm among the volunteers.

Stephen

:)
ID: 2024675
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2024677 - Posted: 23 Dec 2019, 18:39:21 UTC - in response to Message 2024673.  

Ian, it might help if you (or one of the other GPUUG 'special source' contributors) could help identify which of the special apps were built on your v7.14.2 build system (and run on all GPUs), and which were built on an earlier system and are stuck on device 0.


The only builds I built that are out in the wild are the mutex builds, so the v0.99 apps are from me, built against 7.14.2. I built some v0.98 non-mutex builds for myself, but never released them.

TBar compiled the other ones that are in his AIO, but I don’t know his compile environment so he will have to answer.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2024677
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2024678 - Posted: 23 Dec 2019, 18:39:28 UTC - in response to Message 2024674.  
Last modified: 23 Dec 2019, 18:40:22 UTC

I did that when I first made the change so all jobs restarted. I think my problem is using the 101, as you said. It is only loading 1 GPU on each of my 3 machines. So the problem is quite consistent. I just need to get a 100 exe over to these machines.

The most recent All-In-One has a CUDA 10.2 App, which is working just fine on my 14 GPU machine running the Spoofed CUDA60. It is running as Stock, and doing quite well: http://www.arkayn.us/lunatics/BOINC.7z
You will need a Pascal or higher GPU and at least driver 440.33; the post is here: https://setiathome.berkeley.edu/forum_thread.php?id=84927
Everyone who can should update to this build, as it should produce fewer Inconclusive results than 101.
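A quick way to confirm a machine clears that bar (Pascal or newer, driver 440.33 or later) before switching over, using standard nvidia-smi query options:

# card names and driver version; Pascal is the GTX 10xx / P-series generation or newer
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader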
ID: 2024678
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024679 - Posted: 23 Dec 2019, 18:40:24 UTC - in response to Message 2024674.  
Last modified: 23 Dec 2019, 18:50:50 UTC

I did that when I first made the change so all jobs restarted. I think my problem is using the 101, as you said. It is only loading 1 GPU on each of my 3 machines. So the problem is quite consistent. I just need to get a 100 exe over to these machines.

Don't look just at that one figure. I think the V0.98b1 (failed) and V0.99b1p3 (worked) are more important parts of the file name.

The API version number should be embedded in the code, and it should be possible to find it with a hex editor. ( !!! - yes, that's really the way it's done)

setiathome_x41p_V0.99b1p3_x86_64-pc-linux-gnu_cuda100 has "API_VERSION_7.15.0" embedded inside it, so that's good.
setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda101 has no API_VERSION string at all.

Edit - the full name of the 102 app that TBar has just mentioned is setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda102. It's built with API_VERSION_7.5.0 - which is old, but should be good enough.
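For anyone without a hex editor handy: strings plus grep should pull the same marker out of a binary. A minimal sketch, assuming the version marker is stored as plain text (as it appears to be in these builds):

# look for the embedded BOINC API version marker in a special app binary
strings setiathome_x41p_V0.99b1p3_x86_64-pc-linux-gnu_cuda100 | grep API_VERSION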
ID: 2024679
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 2024682 - Posted: 23 Dec 2019, 18:46:59 UTC - in response to Message 2024655.  

I've had a provisional whisper back from the project. As of about an hour ago, the provisional plan is to go through normal maintenance and backup tomorrow (Tuesday), and then revert the database changes so we can go back to running the old server code.

My gloss on that is that it probably means a longer outage, and an even worse than usual recovery, but everyone should be able to enjoy the rest of the holiday in peace. Most especially, the staff can relax. Sounds like a good idea to me.

Cool. So I shouldn't need to do anything at all on my end then.

I was content with just keeping things as they were on my end until the problem got found and fixed on the server side.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 2024682
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2024683 - Posted: 23 Dec 2019, 18:48:01 UTC - in response to Message 2024679.  
Last modified: 23 Dec 2019, 18:52:38 UTC

I did that when I first made the change so all jobs restarted. I think my problem is using the 101, as you said. It is only loading 1 GPU on each of my 3 machines. So the problem is quite consistent. I just need to get a 100 exe over to these machines.

Don't look just at that one figure. I think the V0.98b1 (failed) and V0.99b1p3 (worked) are more important parts of the file name.

The API version number should be embedded in the code, and it should be possible to find it with a hex editor. ( !!! - yes, that's really the way it's done)

setiathome_x41p_V0.99b1p3_x86_64-pc-linux-gnu_cuda100 has "API_VERSION_7.15.0" embedded inside it, so that's good.
setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda101 has no API_VERSION string at all.


Oh, 7.15, that sounds familiar. I know I was messing around a lot compiling different versions of stuff. The last BOINC client I compiled was 7.14.2, but maybe some other parts of the BOINC code in my compile environment (an Ubuntu 18.04 VM) are from 7.15.0.

Like I said, I don't know TBar's setup; IIRC he uses some old stuff so he can still output an executable, but I found a different way to get an executable created in the newer environment. Otherwise you get a "shared object" file instead of an "executable".
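Side note: that "shared object" label is often just the newer toolchain defaulting to position-independent executables; the file utility reports a PIE binary as a shared object even though it runs normally, and linking with -no-pie on newer GCC brings back the classic "executable" label. A quick check on any of the special app binaries:

# a PIE build reports as "ELF 64-bit LSB shared object", a non-PIE build as "ELF 64-bit LSB executable";
# both run fine as long as the execute bit is set
file setiathome_x41p_V0.99b1p3_x86_64-pc-linux-gnu_cuda100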
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2024683
Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 66359
Credit: 55,293,173
RAC: 49
United States
Message 2024685 - Posted: 23 Dec 2019, 18:55:35 UTC - in response to Message 2024682.  
Last modified: 23 Dec 2019, 18:57:19 UTC

I've had a provisional whisper back from the project. As of about an hour ago, the provisional plan is to go through normal maintenance and backup tomorrow (Tuesday), and then revert the database changes so we can go back to running the old server code.

My gloss on that is that it probably means a longer outage, and an even worse than usual recovery, but everyone should be able to enjoy the rest of the holiday in peace. Most especially, the staff can relax. Sounds like a good idea to me.

Cool. So I shouldn't need to do anything at all on my end then.

I was content with just keeping things as they were on my end until the problem got found and fixed on the server side.

Sounds good to me too.

I'm going back to watching some clouds. Later.
Savoir-Faire is everywhere!
The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST

ID: 2024685
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024686 - Posted: 23 Dec 2019, 18:59:49 UTC - in response to Message 2024683.  

Oh, 7.15, that sounds familiar. I know I was messing around a lot compiling different versions of stuff. The last BOINC client I compiled was 7.14.2, but maybe some other parts of the BOINC code in my compile environment (an Ubuntu 18.04 VM) are from 7.15.0.

Like I said, I don't know TBar's setup; IIRC he uses some old stuff so he can still output an executable, but I found a different way to get an executable created in the newer environment. Otherwise you get a "shared object" file instead of an "executable".

Anything downloaded from the GitHub master (development) branch picks up the current generic development version number - which is why we're running on server version 715 now - that's the same number. Development versions are always odd - the client releases are given an even number when they're frozen for release.

We're testing client v7.16.3 now, so we should be developing v7.17 - but things have got a bit stuck.

See edit to my last post - TBar's system is now reporting API_VERSION_7.5.0, which is good enough.
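Side note: that marker should come straight from version.h in whatever BOINC tree the app was linked against (the API embeds "API_VERSION_" plus BOINC_VERSION_STRING). A quick way to see what a given checkout would stamp into an app - a sketch, run from the top of a BOINC source clone:

# the version string that ends up embedded in apps built against this tree
grep BOINC_VERSION_STRING version.h
# and which tag/commit the tree is actually on
git describe --tags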
ID: 2024686
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2024687 - Posted: 23 Dec 2019, 19:00:47 UTC - in response to Message 2024678.  

You will need a Pascal or higher GPU and at least driver 440.33


Don't you mean Maxwell or higher? I see you are, or have been, running this app on cards as old as the 900 series and 750 Tis, which are all Maxwell cards, the generation before Pascal.

Unless you have a different unreleased version of this 10.2 app that you use yourself for your Maxwell cards?
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2024687
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024689 - Posted: 23 Dec 2019, 19:08:11 UTC - in response to Message 2024657.  

...these problems were known on Beta already, and really should have been hashed out before deployment.

1) On a Friday 2) Before a holiday week 3) With no reversion plan in case of failure.
Just to be complete. :^)

I've just had a whisper in from CERN, where the server code is packaged (and some of the Linux repo packaging work is done, I think).

CERN wrote:
CERN closes for two weeks over the holidays. We are encouraged not to make any changes at least a week before. That time should be used for documentation etc.

That was a private whisper so far, but it'll get passed on to Berkeley when we next speak.
ID: 2024689
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024694 - Posted: 23 Dec 2019, 19:22:39 UTC - in response to Message 2024691.  

But, is it really CERN that installs upgrades on SETI's servers? I really doubt that.

No, but they do have a role to play (as LHC@Home) in testing server code, packaging it, and making it available to new and upgrading projects.

The actual server code is mostly written at UC Berkeley, so it travels out, and travels back in. Since UCB wrote it, they presumably trust it - but it was "somebody at Berkeley" who decided to update the live server last Friday - I still don't know who - and missed the warning signs from Beta.
ID: 2024694