The Server Issues / Outages Thread - Panic Mode On! (118)

Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024626 - Posted: 23 Dec 2019, 14:11:15 UTC - in response to Message 2024611.  

Bad news: It's not received the instruction to run on device 0 / device 1
Please try to force your system to ask for cuda tasks only. I think you can achieve that by uninstalling opencl. Judging by the stderr output, if the special app runs instead of the original CUDA60 app, it will run on the designated GPU.
I've had a PM querying my 'bad news'. In case anybody else is wondering, here's the evidence I sent in reply.

I have two identical GPUs. BOINC is specifying that two tasks should run on different devices (this is the 'new' API in action - snip from init_data.xml):

Device 0:
<gpu_type>NVIDIA</gpu_type>
<gpu_device_num>0</gpu_device_num>
<gpu_opencl_dev_index>0</gpu_opencl_dev_index>
Device 1:
<gpu_type>NVIDIA</gpu_type>
<gpu_device_num>1</gpu_device_num>
<gpu_opencl_dev_index>1</gpu_opencl_dev_index>
But Petri's special app isn't listening (snip from stderr.txt):

Device 0:
setiathome_CUDA: Found 2 CUDA device(s):
  Device 1: GeForce GTX 1660 Ti, 5911 MiB, regsPerBlock 65536
     computeCap 7.5, multiProcs 24 
     pciBusID = 1, pciSlotID = 0
  Device 2: GeForce GTX 1660 Ti, 5914 MiB, regsPerBlock 65536
     computeCap 7.5, multiProcs 24 
     pciBusID = 3, pciSlotID = 0
setiathome_CUDA: No device specified, determined to use CUDA device 1: GeForce GTX 1660 Ti
Device 1:
setiathome_CUDA: Found 2 CUDA device(s):
  Device 1: GeForce GTX 1660 Ti, 5911 MiB, regsPerBlock 65536
     computeCap 7.5, multiProcs 24 
     pciBusID = 1, pciSlotID = 0
  Device 2: GeForce GTX 1660 Ti, 5914 MiB, regsPerBlock 65536
     computeCap 7.5, multiProcs 24 
     pciBusID = 3, pciSlotID = 0
setiathome_CUDA: No device specified, determined to use CUDA device 1: GeForce GTX 1660 Ti
They end up on the same device, despite instructions to run separately. Running the CUDA driver on its own may indeed work around the workround for multi-GPU hosts, but I won't be able to test that until the server sends me a CUDA task. None yet.
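To check what BOINC handed to a given task, the device fields can be read out of that slot's init_data.xml. A minimal Python sketch, assuming the three tags shown above sit directly under one root element (the `<app_init_data>` wrapper in the sample and the helper name `read_gpu_assignment` are assumptions for illustration, not copied from a real file):

```python
import xml.etree.ElementTree as ET

# Fragment modelled on the snippet above; the root wrapper is assumed.
SAMPLE = """<app_init_data>
<gpu_type>NVIDIA</gpu_type>
<gpu_device_num>1</gpu_device_num>
<gpu_opencl_dev_index>1</gpu_opencl_dev_index>
</app_init_data>"""

def read_gpu_assignment(xml_text):
    """Return the GPU assignment fields BOINC wrote for one task."""
    root = ET.fromstring(xml_text)
    return {
        "gpu_type": root.findtext("gpu_type"),
        "gpu_device_num": int(root.findtext("gpu_device_num")),
        "gpu_opencl_dev_index": int(root.findtext("gpu_opencl_dev_index")),
    }

print(read_gpu_assignment(SAMPLE))
```

If two running tasks report different `gpu_device_num` values here but the same device in stderr.txt, the app is not honouring the assignment.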
ID: 2024626
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2024632 - Posted: 23 Dec 2019, 15:00:34 UTC - in response to Message 2024626.  
Last modified: 23 Dec 2019, 15:01:13 UTC

Which special app are you trying to use? Try running my mutex build. It’s working fine on my systems. I compiled it against the BOINC 7.14.2 code IIRC

All 3 systems, a 10-GPU and 2x 7-GPU, are running the special app on each card.

It’s set up to run 2 WUs per card, but only one actually processes at a time (as designed) while the second one stays preloaded in a wait state until the first is finished.
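The "two loaded, one processing" behaviour described above can be illustrated with an ordinary lock. This is only a toy sketch in Python threads under that assumption, not the actual code of the mutex build; the function name `run_tasks_on_one_gpu` and the event labels are made up for the example:

```python
import threading

def run_tasks_on_one_gpu(task_names):
    """Start one worker per task; all preload freely, but only one computes at a time."""
    gpu_lock = threading.Lock()   # stands in for the per-GPU work mutex
    log_lock = threading.Lock()   # protects the shared event list
    events = []

    def record(event):
        with log_lock:
            events.append(event)

    def worker(name):
        # Preloading happens outside the GPU lock, so it can overlap
        # with another task's compute phase.
        record(("preloaded", name))
        with gpu_lock:
            # Only the lock holder is in its compute phase.
            record(("compute_start", name))
            record(("compute_end", name))

    threads = [threading.Thread(target=worker, args=(n,)) for n in task_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events

for event in run_tasks_on_one_gpu(["wu_a", "wu_b"]):
    print(event)
```

The order of the "preloaded" lines can vary from run to run, but a `compute_start` is always followed by the matching `compute_end` before the next task starts computing.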
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2024632
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2024635 - Posted: 23 Dec 2019, 15:06:15 UTC - in response to Message 2024626.  

Oh. Your driver version isn’t sufficient for my CUDA 10.2 build. Hang on, let me get you my CUDA 10.0 build.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2024635
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2024638 - Posted: 23 Dec 2019, 15:11:35 UTC - in response to Message 2024626.  
Last modified: 23 Dec 2019, 15:30:31 UTC

It seems to be working on my Linux machine running the 10.2 from the All-In-One:
https://setiathome.berkeley.edu/result.php?resultid=8364929644
<core_client_version>7.16.2</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 14 CUDA device(s):
  Device 1: GeForce RTX 2070, 7981 MiB, regsPerBlock 65536
     computeCap 7.5, multiProcs 36 
     pciBusID = 1, pciSlotID = 0
  Device 2: GeForce GTX 1070, 8119 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 15 
     pciBusID = 14, pciSlotID = 0
  Device 3: GeForce GTX 1070, 8119 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 15 
     pciBusID = 15, pciSlotID = 0
  Device 4: GeForce GTX 1070, 8119 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 15 
     pciBusID = 16, pciSlotID = 0
  Device 5: GeForce GTX 1070, 8119 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 15 
     pciBusID = 17, pciSlotID = 0
  Device 6: GeForce GTX 1070, 8119 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 15 
     pciBusID = 20, pciSlotID = 0
  Device 7: GeForce GTX 1070, 8119 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 15 
     pciBusID = 21, pciSlotID = 0
  Device 8: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 9 
     pciBusID = 2, pciSlotID = 0
  Device 9: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 9 
     pciBusID = 3, pciSlotID = 0
  Device 10: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 9 
     pciBusID = 6, pciSlotID = 0
  Device 11: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 9 
     pciBusID = 11, pciSlotID = 0
  Device 12: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 9 
     pciBusID = 13, pciSlotID = 0
  Device 13: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 9 
     pciBusID = 18, pciSlotID = 0
  Device 14: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 9 
     pciBusID = 19, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 5
setiathome_CUDA: CUDA Device 5 specified, checking...
   Device 5: GeForce GTX 1070 is okay
SETI@home using CUDA accelerated device GeForce GTX 1070
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1

setiathome v8 enhanced x41p_V0.98b1, Cuda 10.2 Special
Modifications done by petri33, compiled by TBar
Are you sure it's not the version of BOINC you're using?
ID: 2024638
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024639 - Posted: 23 Dec 2019, 15:12:48 UTC - in response to Message 2024635.  
Last modified: 23 Dec 2019, 15:29:35 UTC

Oh. Your driver version isn’t sufficient for my CUDA 10.2 build. Hang on, let me get you my CUDA 10.0 build.
It's OK, I think I've got that. I've certainly tested a mutex build, live. I'll give it a try (didn't stick with it, because I've only got a fast M.2 SSD in there - not much latency left to lose!)

Edit - I was previously running setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda101
I have setiathome_x41p_V0.99b1p3_x86_64-pc-linux-gnu_cuda100 - I'll try that instead.

And that looks better:

setiathome_CUDA: Found 2 CUDA device(s):
  Device 1: GeForce GTX 1660 Ti, 5911 MiB, regsPerBlock 65536
     computeCap 7.5, multiProcs 24 
     pciBusID = 1, pciSlotID = 0
  Device 2: GeForce GTX 1660 Ti, 5914 MiB, regsPerBlock 65536
     computeCap 7.5, multiProcs 24 
     pciBusID = 3, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 2
setiathome_CUDA: CUDA Device 2 specified, checking...
   Device 2: GeForce GTX 1660 Ti is okay
SETI@home using CUDA accelerated device GeForce GTX 1660 Ti
Using unroll = 5 from command line args

---------------------------------------------------------
SETI@home v8 enhanced x41p_V0.99b1p3, CUDA 10.0 special
-----------------------------------------------------------------------------------
Modifications done by petri33, work sync mutex by Oddbjornik. Compiled by Ian (^_^)>c[_]
-----------------------------------------------------------------------------------
Another lesson learned!
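The two stderr patterns seen in this thread make it easy to tell whether an app picked up BOINC's device assignment. A minimal Python sketch that looks only for the two message strings quoted above (the function name `device_selection` and the result labels are assumptions for illustration; other app versions may word these lines differently):

```python
import re

def device_selection(stderr_text):
    """Classify how the app chose its GPU, based on its stderr output."""
    passed = re.search(r"Boinc passed DevPref (\d+)", stderr_text)
    if passed:
        return ("boinc", int(passed.group(1)))
    fallback = re.search(
        r"No device specified, determined to use CUDA device (\d+)", stderr_text
    )
    if fallback:
        return ("fallback", int(fallback.group(1)))
    return ("unknown", None)

old = "setiathome_CUDA: No device specified, determined to use CUDA device 1: GeForce GTX 1660 Ti"
new = "In cudaAcc_initializeDevice(): Boinc passed DevPref 2"
print(device_selection(old))   # ('fallback', 1)
print(device_selection(new))   # ('boinc', 2)
```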
ID: 2024639
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2024643 - Posted: 23 Dec 2019, 15:40:15 UTC - in response to Message 2024639.  
Last modified: 23 Dec 2019, 15:50:06 UTC

Cool.

I see you’re using a custom unroll via command line. 2 questions:

1. Why?
2. Are you using this method of making BOINC think it’s an SoG/sah/CUDA60 app by renaming the executable? I haven’t been able to get the app to accept the -nobs parameter in either the text file or app_config.

Edit-
also, if you don't want to use the mutex functionality, you can just set the system to use 1 WU per GPU and it will act like a normal app.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2024643
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024644 - Posted: 23 Dec 2019, 15:49:32 UTC - in response to Message 2024643.  
Last modified: 23 Dec 2019, 15:58:58 UTC

It was left over in my old app_config.xml, which I'm using to test command-line passing in the workround. The unroll was originally set for the 0.98b1 app, where it made a difference. I haven't changed it yet for V0.99b1p3, where I'm told autotune is better - waiting until this test worked (which it now has), and it provides a handy marker in stderr so I can see easily what's happening.

Yes, I'm doing the 'rename executable' workround. I'll go fetch my app_config file from downstairs.
<app_config>
   <app_version>
       <app_name>setiathome_v8</app_name>
       <plan_class>opencl_nvidia_sah</plan_class>
       <avg_ncpus>0.01</avg_ncpus>
       <ngpus>1</ngpus>
       <cmdline>-nobs -unroll 5</cmdline>
   </app_version>
   <app_version>
       <app_name>setiathome_v8</app_name>
       <plan_class>opencl_nvidia_SoG</plan_class>
       <avg_ncpus>0.01</avg_ncpus>
       <ngpus>1</ngpus>
       <cmdline>-nobs -unroll 5</cmdline>
   </app_version>
   <app_version>
       <app_name>setiathome_v8</app_name>
       <plan_class>cuda101</plan_class>
       <avg_ncpus>0.01</avg_ncpus>
       <ngpus>1</ngpus>
       <cmdline>-nobs -unroll 5</cmdline>
   </app_version>
   <app_version>
       <app_name>setiathome_v8</app_name>
       <plan_class>cuda100</plan_class>
       <avg_ncpus>0.01</avg_ncpus>
       <ngpus>0.5</ngpus>
       <cmdline>-nobs</cmdline>
   </app_version>
   <app_version>
       <app_name>setiathome_v8</app_name>
       <plan_class>cuda90</plan_class>
       <avg_ncpus>0.01</avg_ncpus>
       <ngpus>1</ngpus>
       <cmdline>-nobs -unroll 5</cmdline>
   </app_version>
   <project_max_concurrent>2</project_max_concurrent>
</app_config>
From the top:

Today's stock/rename tests
opencl_nvidia_sah
opencl_nvidia_SoG

Copied from previous V98 running:
cuda101

Previous Mutex test:
cuda100
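Before dropping an app_config.xml into the project folder, it can help to confirm that it parses and to see what each entry implies. A minimal Python sketch, using a cut-down copy of the cuda100 entry above (the helper name `summarize_app_config` is an assumption for illustration; the tasks-per-GPU figure is simply 1 / `ngpus`, so 0.5 means two tasks share a card):

```python
import xml.etree.ElementTree as ET

# Cut-down sample based on the cuda100 entry above.
SAMPLE = """<app_config>
   <app_version>
       <app_name>setiathome_v8</app_name>
       <plan_class>cuda100</plan_class>
       <avg_ncpus>0.01</avg_ncpus>
       <ngpus>0.5</ngpus>
       <cmdline>-nobs</cmdline>
   </app_version>
   <project_max_concurrent>2</project_max_concurrent>
</app_config>"""

def summarize_app_config(xml_text):
    """Map each plan_class to its tasks-per-GPU count and command line."""
    # ET.fromstring raises ET.ParseError if the XML is not well-formed,
    # for example when a closing tag is truncated.
    root = ET.fromstring(xml_text)
    summary = {}
    for version in root.findall("app_version"):
        plan_class = version.findtext("plan_class")
        ngpus = float(version.findtext("ngpus"))
        summary[plan_class] = {
            "tasks_per_gpu": 1.0 / ngpus,
            "cmdline": version.findtext("cmdline"),
        }
    return summary

print(summarize_app_config(SAMPLE))
```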
ID: 2024644
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2024645 - Posted: 23 Dec 2019, 16:04:13 UTC - in response to Message 2024644.  
Last modified: 23 Dec 2019, 16:08:23 UTC

ah, I don't have the plan class entries in my app_config. so maybe that's why. let me give that a shot.

also those cuda100 and 101 plan classes don't exist. don't use them. the latest cuda apps are still filed under cuda90
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2024645
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2024649 - Posted: 23 Dec 2019, 16:28:35 UTC - in response to Message 2024644.  

that format seems to have done the trick. GPU utilization now ~98% on all GPUs and runtimes are improved. thanks!
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2024649
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 2024650 - Posted: 23 Dec 2019, 16:38:54 UTC - in response to Message 2024618.  

Bad news: It's not received the instruction to run on device 0 / device 1
Please try to force your system to ask for cuda tasks only. I think you can achieve that by uninstalling opencl. Judging by the stderr output, if the special app runs instead of the original CUDA60 app, it will run on the designated GPU.
No. It took me bloody ages to get that driver installed (about the first thing I ever tried to do in Linux), and I'm not changing it for a temporary glitch. Running at half-throttle is fine, and kinder on the servers. I have one other host running stock so I can keep an eye on things, and all the others are waiting for the recovery. Which will be fun in itself...

Edit - that was a bit harsh. I'm feeling better now I've had a bite of lunch. Don't let me stop anybody else testing this aspect of Retvari's suggestion, if they're prepared to sacrifice their OpenCL driver.

However, it depends whether the Linux Cuda app was compiled against the modern API or not. If it's even modestly modern, it'll tell BOINC to use the modern calls, and we're back at square 1. I'll check it out if the server ever chooses to send me Cuda work, but so far it's alternating between sah and SoG.


Richard,

Been watching all this and how about a cc_config.xml with excludes? You could tell seti to exclude the Opencl apps and for that matter all the other cuda except 60. That way it makes it easier to only accept the cuda 60. I knew about the app_config but since I'm at work and don't have access to a computer was unable to test. Thanks and good luck. Won't be able to test anything myself till the 25th.

Z
ID: 2024650
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024651 - Posted: 23 Dec 2019, 16:44:23 UTC - in response to Message 2024645.  

When you're working anonymous platform, you can make up any plan_class name you want. I made them different so I could transition between applications, each with their own settings, without losing work or having to run dry. When running stock, you have to match the project's names.
ID: 2024651
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024653 - Posted: 23 Dec 2019, 16:49:20 UTC - in response to Message 2024650.  

Richard,

Been watching all this and how about a cc_config.xml with excludes? You could tell seti to exclude the Opencl apps and for that matter all the other cuda except 60. That way it makes it easier to only accept the cuda 60. I knew about the app_config but since I'm at work and don't have access to a computer was unable to test. Thanks and good luck. Won't be able to test anything myself till the 25th.

Z
Unfortunately, the excludes only go down as far as

<exclude_gpu>
   <url>project_URL</url>
   [<device_num>N</device_num>]
   [<type>NVIDIA|ATI|intel_gpu</type>]
   [<app>appname</app>]
</exclude_gpu>
- so no support for excluding by app_version. Good idea anyway, and we seem to have got liftoff if you choose the right app.
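To make the limitation concrete, here is a minimal Python sketch that builds one `<exclude_gpu>` entry from the four fields in the template above. The helper name `build_exclude_gpu` is an assumption for illustration, and the URL is just the site address that appears earlier in this thread; the string must match the project URL your own client has on file. The resulting element belongs inside the `<options>` section of cc_config.xml.

```python
import xml.etree.ElementTree as ET

def build_exclude_gpu(project_url, device_num=None, gpu_type=None, app=None):
    """Build an <exclude_gpu> element; only <url> is mandatory."""
    exclude = ET.Element("exclude_gpu")
    ET.SubElement(exclude, "url").text = project_url
    if device_num is not None:
        ET.SubElement(exclude, "device_num").text = str(device_num)
    if gpu_type is not None:
        ET.SubElement(exclude, "type").text = gpu_type
    if app is not None:
        ET.SubElement(exclude, "app").text = app
    return exclude

element = build_exclude_gpu(
    "https://setiathome.berkeley.edu/",  # adjust to match your client's project URL
    device_num=1,
    gpu_type="NVIDIA",
    app="setiathome_v8",
)
print(ET.tostring(element, encoding="unicode"))
```

Note there is no parameter for a plan class or app version: the finest filter available is the application name, which is why this route cannot single out the CUDA60 app.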
ID: 2024653
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024655 - Posted: 23 Dec 2019, 16:55:49 UTC

For general information

I've had a provisional whisper back from the project. As of about an hour ago, the provisional plan is to go through normal maintenance and backup tomorrow (Tuesday), and then revert the database changes so we can go back to running the old server code.

My gloss on that is that it probably means a longer outage, and an even worse than usual recovery, but everyone should be able to enjoy the rest of the holiday in peace. Most especially, the staff can relax. Sounds like a good idea to me.
ID: 2024655
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2024656 - Posted: 23 Dec 2019, 17:02:52 UTC - in response to Message 2024655.  
Last modified: 23 Dec 2019, 17:03:35 UTC

while that sounds fine, I have to wonder why they decided to make such a change in the first place. and I thought they seemed close to finding a solution to the problem without having to revert, given Keith/Juan/Martin/etc and everyone else's inputs about possible causes.

but as has been said several times, these problems were known on Beta already, and really should have been hashed out before deployment.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2024656
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3797
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2024657 - Posted: 23 Dec 2019, 17:07:30 UTC - in response to Message 2024656.  

...these problems were known on Beta already, and really should have been hashed out before deployment.


1) On a Friday 2) Before a holiday week 3) With no reversion plan in case of failure.
Just to be complete. :^)
ID: 2024657
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024658 - Posted: 23 Dec 2019, 17:15:52 UTC - in response to Message 2024656.  

I imagine this is just a temporary step 1 - get the project running again over Christmas and the New Year. There has got to be a post-mortem, because I found the same problem in the first-ever Server Stable release running at LHC. That's a big oopsie. Also, we still don't know who decided it was safe to go ahead with the upgrade last Friday, and why.

And in due course, we'll want to move forward again. I'll be pressing for prior notice, on a more sensible day and date.
ID: 2024658
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2024660 - Posted: 23 Dec 2019, 17:21:21 UTC - in response to Message 2024655.  

For general information

I've had a provisional whisper back from the project. As of about an hour ago, the provisional plan is to go through normal maintenance and backup tomorrow (Tuesday), and then revert the database changes so we can go back to running the old server code.

My gloss on that is that it probably means a longer outage, and an even worse than usual recovery, but everyone should be able to enjoy the rest of the holiday in peace. Most especially, the staff can relax. Sounds like a good idea to me.

Well, that should be just great after everyone has hacked up their configurations 8 ways to Sunday just to try to keep contributing. It would be nice if they could post something official just before or after the maintenance to tell us where we stand. I'm pissed at myself for wasting my time, currently on a train with spotty cell hotspot access.
ID: 2024660
halfempty
Joined: 2 Jun 99
Posts: 97
Credit: 35,236,901
RAC: 114
United States
Message 2024661 - Posted: 23 Dec 2019, 17:24:10 UTC - in response to Message 2024655.  

... and then revert the database changes so we can go back to running the old server code ...

Sorry about the simplistic question, but for those of us running Windows I would like a little post-reversion information. Would it be easier/wiser to re-run the Lunatics installer on the current stock installation/cache than to run out the cache and restore from backup?
ID: 2024661
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2024662 - Posted: 23 Dec 2019, 17:25:31 UTC - in response to Message 2024649.  

that format seems to have done the trick. GPU utilization now ~98% on all GPUs and runtimes are improved. thanks!


Would you mind posting an updated app_config? I'm still having problems getting 98% utilization on my GPUs. I had renamed the cuda101 to the stock SoG and sah exes. Was that the correct thing to do?

Can't wait to restore my optimized setiathome folders once this nightmare is over.
ID: 2024662
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2024663 - Posted: 23 Dec 2019, 17:29:35 UTC - in response to Message 2024660.  

Honestly the configuration isn't that hard to go back and forth.

when I want to go back to anonymous platform, I just need to restore my backup versions of app_info and app_config and it will be back to normal. But I'm not going back to Anonymous Platform until it's confirmed back up properly.

I have my test bench sitting idle on Anonymous Platform that I never changed. when it starts getting work again, I'll flip the other systems back.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2024663


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.