The Server Issues / Outages Thread - Panic Mode On! (118)

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2024626 - Posted: 23 Dec 2019, 14:11:15 UTC - in response to Message 2024611. Bad news: It's not received the instruction to run on device 0 / device 1 Please try to force your system to ask for cuda tasks only. I think you can achieve that by uninstalling opencl. Judging by the stderr output, if the special app runs instead the original CUDA60 app, it will run on the designated GPU. I've had a PM querying my 'bad news'. In case anybody else is wondering, here's the evidence I sent in reply. I have two identical GPUs. BOINC is specifying that two tasks should run on different devices (this is the 'new' API in action - snip from init_data.xml): Device 0: <gpu_type>NVIDIA</gpu_type> <gpu_device_num>0</gpu_device_num> <gpu_opencl_dev_index>0</gpu_opencl_dev_index> Device 1: <gpu_type>NVIDIA</gpu_type> <gpu_device_num>1</gpu_device_num> <gpu_opencl_dev_index>1</gpu_opencl_dev_index> But Petri's special app isn't listening (snip from stderr.txt): Device 0: setiathome_CUDA: Found 2 CUDA device(s): Device 1: GeForce GTX 1660 Ti, 5911 MiB, regsPerBlock 65536 computeCap 7.5, multiProcs 24 pciBusID = 1, pciSlotID = 0 Device 2: GeForce GTX 1660 Ti, 5914 MiB, regsPerBlock 65536 computeCap 7.5, multiProcs 24 pciBusID = 3, pciSlotID = 0 setiathome_CUDA: No device specified, determined to use CUDA device 1: GeForce GTX 1660 Ti Device 1: setiathome_CUDA: Found 2 CUDA device(s): Device 1: GeForce GTX 1660 Ti, 5911 MiB, regsPerBlock 65536 computeCap 7.5, multiProcs 24 pciBusID = 1, pciSlotID = 0 Device 2: GeForce GTX 1660 Ti, 5914 MiB, regsPerBlock 65536 computeCap 7.5, multiProcs 24 pciBusID = 3, pciSlotID = 0 setiathome_CUDA: No device specified, determined to use CUDA device 1: GeForce GTX 1660 Ti They end up on the same device, despite instructions to run separately. 
Running the CUDA driver on its own may indeed workround the workround for multi-GPU hosts, but I won't be able to test that until the server sends me a CUDA task. None yet. ID: 2024626 ·
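A minimal sketch of how the two kinds of stderr.txt output quoted above could be told apart automatically. It assumes only the two message formats that appear in this thread ("Boinc passed DevPref N" and "No device specified, determined to use CUDA device N"); the function names are hypothetical and are not part of BOINC or the special app. The device numbers are reported exactly as the app prints them, with no attempt to map them back to BOINC's own numbering.

```python
import re

# Patterns copied from the stderr.txt excerpts quoted in this thread.
BOINC_PREF = re.compile(r"Boinc passed DevPref (\d+)")
FALLBACK = re.compile(r"No device specified, determined to use CUDA device (\d+)")


def chosen_device(stderr_text):
    """Return (how, device_number) for one task's stderr, or None if neither line appears.

    how is "boinc" when the app reports a device preference passed by BOINC,
    and "fallback" when the app reports picking a device by itself.
    """
    match = BOINC_PREF.search(stderr_text)
    if match:
        return ("boinc", int(match.group(1)))
    match = FALLBACK.search(stderr_text)
    if match:
        return ("fallback", int(match.group(1)))
    return None


def tasks_sharing_a_device(stderr_by_task):
    """Map each reported device number to the tasks on it, keeping only shared devices."""
    by_device = {}
    for task, text in stderr_by_task.items():
        result = chosen_device(text)
        if result is not None:
            by_device.setdefault(result[1], []).append(task)
    return {dev: sorted(tasks) for dev, tasks in by_device.items() if len(tasks) > 1}
```

Two tasks on one device are only a problem when each task is meant to have a GPU to itself; a setup that deliberately runs two tasks per card would show up here as well.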

Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640	Message 2024632 - Posted: 23 Dec 2019, 15:00:34 UTC - in response to Message 2024626. Last modified: 23 Dec 2019, 15:01:13 UTC Which special app are you trying to use? Try running my mutex build. It's working fine on my systems. I compiled it against the BOINC 7.14.2 code, IIRC. All 3 systems (a 10-GPU and 2x 7-GPU systems) are running the special app on each card. It's set up to run 2 WUs per card, but only one actually processes at a time (as designed) while the second one stays preloaded in a wait state until the first is finished. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours ID: 2024632 ·
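The "two WUs per card, one processing, one preloaded" behaviour described above can be illustrated with a small sketch. This is not the mutex build's actual code: the real tasks are separate processes, whereas this illustration uses two threads and a `threading.Lock` standing in for a per-GPU mutex. All names here are hypothetical.

```python
import threading


def run_two_tasks_on_one_gpu(task_names=("wu_a", "wu_b")):
    """Run the tasks concurrently, but let only one at a time hold the 'GPU'."""
    gpu_lock = threading.Lock()    # stands in for the per-GPU mutex
    state_lock = threading.Lock()  # protects the bookkeeping below
    log = []
    active = 0       # tasks currently inside the GPU section
    max_active = 0   # highest value 'active' ever reached

    def task(name):
        nonlocal active, max_active
        # Preload step: done without holding the GPU, so it can overlap
        # with another task's processing.
        with state_lock:
            log.append((name, "preloaded"))
        # Wait state: blocks here until the other task releases the GPU.
        with gpu_lock:
            with state_lock:
                active += 1
                max_active = max(max_active, active)
                log.append((name, "processing"))
            # Real GPU work would happen here.
            with state_lock:
                active -= 1
                log.append((name, "finished"))

    threads = [threading.Thread(target=task, args=(name,)) for name in task_names]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log, max_active
```

The returned `max_active` stays at 1 however the threads are scheduled, which is the property the post describes: the second task is ready to go the moment the first one finishes, but the two never compute at the same time.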

Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640	Message 2024635 - Posted: 23 Dec 2019, 15:06:15 UTC - in response to Message 2024626. Oh. Your driver version isn't sufficient for my CUDA 10.2 build. Hang on, let me get you my CUDA 10.0 build. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours ID: 2024635 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 2024638 - Posted: 23 Dec 2019, 15:11:35 UTC - in response to Message 2024626. Last modified: 23 Dec 2019, 15:30:31 UTC It seems to be working on my Linux machine running the 10.2 from the All-In-One; https://setiathome.berkeley.edu/result.php?resultid=8364929644 <core_client_version>7.16.2</core_client_version> <![CDATA[ <stderr_txt> setiathome_CUDA: Found 14 CUDA device(s): Device 1: GeForce RTX 2070, 7981 MiB, regsPerBlock 65536 computeCap 7.5, multiProcs 36 pciBusID = 1, pciSlotID = 0 Device 2: GeForce GTX 1070, 8119 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 15 pciBusID = 14, pciSlotID = 0 Device 3: GeForce GTX 1070, 8119 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 15 pciBusID = 15, pciSlotID = 0 Device 4: GeForce GTX 1070, 8119 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 15 pciBusID = 16, pciSlotID = 0 Device 5: GeForce GTX 1070, 8119 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 15 pciBusID = 17, pciSlotID = 0 Device 6: GeForce GTX 1070, 8119 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 15 pciBusID = 20, pciSlotID = 0 Device 7: GeForce GTX 1070, 8119 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 15 pciBusID = 21, pciSlotID = 0 Device 8: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 9 pciBusID = 2, pciSlotID = 0 Device 9: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 9 pciBusID = 3, pciSlotID = 0 Device 10: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 9 pciBusID = 6, pciSlotID = 0 Device 11: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 9 pciBusID = 11, pciSlotID = 0 Device 12: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 9 pciBusID = 13, pciSlotID = 0 Device 13: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 9 pciBusID = 18, 
pciSlotID = 0 Device 14: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 9 pciBusID = 19, pciSlotID = 0 In cudaAcc_initializeDevice(): Boinc passed DevPref 5 setiathome_CUDA: CUDA Device 5 specified, checking... Device 5: GeForce GTX 1070 is okay SETI@home using CUDA accelerated device GeForce GTX 1070 Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1 setiathome v8 enhanced x41p_V0.98b1, Cuda 10.2 Special Modifications done by petri33, compiled by TBar Are you sure it's not the version of BOINC you're using? ID: 2024638 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2024639 - Posted: 23 Dec 2019, 15:12:48 UTC - in response to Message 2024635. Last modified: 23 Dec 2019, 15:29:35 UTC Oh. Your driver version isn't sufficient for my CUDA 10.2 build. Hang on, let me get you my CUDA 10.0 build. It's OK, I think I've got that. I've certainly tested a mutex build, live. I'll give it a try (didn't stick with it, because I've only got a fast M.2 SSD in there - not much latency left to lose!) Edit - I was previously running setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda101 I have setiathome_x41p_V0.99b1p3_x86_64-pc-linux-gnu_cuda100 - I'll try that instead. And that looks better: setiathome_CUDA: Found 2 CUDA device(s): Device 1: GeForce GTX 1660 Ti, 5911 MiB, regsPerBlock 65536 computeCap 7.5, multiProcs 24 pciBusID = 1, pciSlotID = 0 Device 2: GeForce GTX 1660 Ti, 5914 MiB, regsPerBlock 65536 computeCap 7.5, multiProcs 24 pciBusID = 3, pciSlotID = 0 In cudaAcc_initializeDevice(): Boinc passed DevPref 2 setiathome_CUDA: CUDA Device 2 specified, checking... Device 2: GeForce GTX 1660 Ti is okay SETI@home using CUDA accelerated device GeForce GTX 1660 Ti Using unroll = 5 from command line args --------------------------------------------------------- SETI@home v8 enhanced x41p_V0.99b1p3, CUDA 10.0 special ----------------------------------------------------------------------------------- Modifications done by petri33, work sync mutex by Oddbjornik. Compiled by Ian (^_^)>c[_] ----------------------------------------------------------------------------------- Another lesson learned! ID: 2024639 ·

Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640	Message 2024643 - Posted: 23 Dec 2019, 15:40:15 UTC - in response to Message 2024639. Last modified: 23 Dec 2019, 15:50:06 UTC Cool. I see you're using a custom unroll via command line. 2 questions: 1. Why? 2. Are you using this method of making BOINC think it's an SoG/sah/CUDA60 app by renaming the executable? I haven't been able to get the app to accept the -nobs parameter in either the text file or app_config. Edit - also, if you don't want to use the mutex functionality, you can just set the system to use 1 WU per GPU and it will act like a normal app. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours ID: 2024643 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2024644 - Posted: 23 Dec 2019, 15:49:32 UTC - in response to Message 2024643. Last modified: 23 Dec 2019, 15:58:58 UTC It was left over in my old app_config.xml, which I'm using to test command-line passing in the workround. The unroll was originally set for the 0.98b1 app, where it made a difference. I haven't changed it yet for V0.99b1p3, where I'm told autotune is better - waiting until this test worked (which it now has), and it provides a handy marker in stderr so I can see easily what's happening. Yes, I'm doing the 'rename executable' workround. I'll go fetch my app_config file from downstairs. <app_config> <app_version> <app_name>setiathome_v8</app_name> <plan_class>opencl_nvidia_sah</plan_class> <avg_ncpus>0.01</avg_ncpus> <ngpus>1</ngpus> <cmdline>-nobs -unroll 5</cmdline> </app_version> <app_version> <app_name>setiathome_v8</app_name> <plan_class>opencl_nvidia_SoG</plan_class> <avg_ncpus>0.01</avg_ncpus> <ngpus>1</ngpus> <cmdline>-nobs -unroll 5</cmdline> </app_version> <app_version> <app_name>setiathome_v8</app_name> <plan_class>cuda101</plan_class> <avg_ncpus>0.01</avg_ncpus> <ngpus>1</ngpus> <cmdline>-nobs -unroll 5</cmdline> </app_version> <app_version> <app_name>setiathome_v8</app_name> <plan_class>cuda100</plan_class> <avg_ncpus>0.01</avg_ncpus> <ngpus>0.5</ngpus> <cmdline>-nobs</cmdline> </app_version> <app_version> <app_name>setiathome_v8</app_name> <plan_class>cuda90</plan_class> <avg_ncpus>0.01</avg_ncpus> <ngpus>1</ngpus> <cmdline>-nobs -unroll 5</cmdline> </app_version> <project_max_concurrent>2</project_max_concurrent> </app_config> From the top: Today's stock/rename tests: opencl_nvidia_sah, opencl_nvidia_SoG. Copied from previous V98 running: cuda101. Previous Mutex test: cuda100. ID: 2024644 ·
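A quick way to sanity-check an app_config.xml like the one above before restarting the client is to parse it and list what each `<app_version>` block asks for. The sketch below uses only Python's standard `xml.etree.ElementTree`; the function name and the `tasks_per_gpu` figure (taken as 1 divided by `<ngpus>`, so 0.5 means two tasks share a card) are illustrative assumptions, not anything BOINC itself reports.

```python
import xml.etree.ElementTree as ET


def summarize_app_config(xml_text):
    """List plan class, GPU share and command line for each <app_version> block.

    Raises xml.etree.ElementTree.ParseError if the text is not well-formed XML,
    for example when a closing tag is incomplete.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for app_version in root.findall("app_version"):
        ngpus_text = app_version.findtext("ngpus")
        ngpus = float(ngpus_text) if ngpus_text else None
        rows.append({
            "plan_class": app_version.findtext("plan_class", default=""),
            "ngpus": ngpus,
            # Assumption: a task using a fraction f of a GPU allows 1/f tasks per card.
            "tasks_per_gpu": int(1 / ngpus) if ngpus else None,
            "cmdline": app_version.findtext("cmdline", default=""),
        })
    return rows
```

Run against the file above, this would show the cuda100 entry as the only one set up for two tasks per card, matching the "Previous Mutex test" note.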

Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640	Message 2024645 - Posted: 23 Dec 2019, 16:04:13 UTC - in response to Message 2024644. Last modified: 23 Dec 2019, 16:08:23 UTC ah, I don't have the plan class entries in my app_config. so maybe that's why. let me give that a shot. also those cuda100 and 101 plan classes don't exist. don't use them. the latest cuda apps are still filed under cuda90 Seti@Home classic workunits: 29,492 CPU time: 134,419 hours ID: 2024645 ·

Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640	Message 2024649 - Posted: 23 Dec 2019, 16:28:35 UTC - in response to Message 2024644. that format seems to have done the trick. GPU utilization now ~98% on all GPUs and runtimes are improved. thanks! Seti@Home classic workunits: 29,492 CPU time: 134,419 hours ID: 2024649 ·

Zalster Volunteer tester Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 2024650 - Posted: 23 Dec 2019, 16:38:54 UTC - in response to Message 2024618. Bad news: It's not received the instruction to run on device 0 / device 1 Please try to force your system to ask for cuda tasks only. I think you can achieve that by uninstalling opencl. Judging by the stderr output, if the special app runs instead the original CUDA60 app, it will run on the designated GPU. No. It took me bloody ages to get that driver installed (about the first thing I ever tried to do in Linux), and I'm not changing it for a temporary glitch. Running at half-throttle is fine, and kinder on the servers. I have one other host running stock so I can keep an eye on things, and all the others are waiting for the recovery. Which will be fun in itself... Edit - that was a bit harsh. I'm feeling better now I've had a bite of lunch. Don't let me stop anybody else testing this aspect of Retvari's suggestion, if they're prepared to sacrifice their OpenCL driver. However, it depends whether the Linux Cuda app was compiled against the modern API or not. If it's even modestly modern, it'll tell BOINC to use the modern calls, and we're back at square 1. I'll check it out if the server ever chooses to send me Cuda work, but so far it's alternating between sah and SoG. Richard, Been watching all this and how about a cc_config.xml with excludes? You could tell seti to exclude the Opencl apps and for that matter all the other cuda except 60. That way it make it easier to only accept the cuda 60. I knew about the app_config but since I'm at work and don't have access to a computer was unable to test. Thanks and good luck. Won't be able to test anything myself till the 25th. Z ID: 2024650 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2024651 - Posted: 23 Dec 2019, 16:44:23 UTC - in response to Message 2024645. When you're working anonymous platform, you can make up any plan_class name you want. I made them different so I could transition between applications, each with its own settings, without losing work or having to run dry. When running stock, you have to match the project's names. ID: 2024651 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2024653 - Posted: 23 Dec 2019, 16:49:20 UTC - in response to Message 2024650. Richard, Been watching all this and how about a cc_config.xml with excludes? You could tell seti to exclude the Opencl apps and for that matter all the other cuda except 60. That way it make it easier to only accept the cuda 60. I knew about the app_config but since I'm at work and don't have access to a computer was unable to test. Thanks and good luck. Won't be able to test anything myself till the 25th. Z Unfortunately, the excludes only go down as far as <exclude_gpu> <url>project_URL</url> [<device_num>N</device_num>] [<type>NVIDIA|ATI|intel_gpu</type>] [<app>appname</app>] </exclude_gpu> - so no support for excluding by app_version. Good idea anyway, and we seem to have got liftoff if you choose the right app. ID: 2024653 ·
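To make the limitation concrete, here is a small sketch that builds an `<exclude_gpu>` entry from exactly the four fields listed above. The helper names are hypothetical, the URL in the usage note is a placeholder that would have to match the project URL the client has on record, and placing the element under `<cc_config><options>` reflects my understanding of the BOINC client configuration format rather than anything stated in this thread. Note that there is no argument for a plan class or app version, because the format has no such field.

```python
import xml.etree.ElementTree as ET


def build_exclude_gpu(url, device_num=None, gpu_type=None, app=None):
    """Build one <exclude_gpu> element; only <url> is mandatory."""
    element = ET.Element("exclude_gpu")
    ET.SubElement(element, "url").text = url
    if device_num is not None:
        ET.SubElement(element, "device_num").text = str(device_num)
    if gpu_type is not None:
        ET.SubElement(element, "type").text = gpu_type
    if app is not None:
        ET.SubElement(element, "app").text = app
    return element


def build_cc_config(excludes):
    """Wrap the exclude entries in <cc_config><options> and return the XML text."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    options.extend(excludes)
    return ET.tostring(root, encoding="unicode")
```

For example, `build_cc_config([build_exclude_gpu("PROJECT_URL", device_num=1, app="setiathome_v8")])` would keep every setiathome_v8 app version off device 1, OpenCL and CUDA alike, which is why the excludes cannot be used to accept only the CUDA 60 tasks.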

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2024655 - Posted: 23 Dec 2019, 16:55:49 UTC For general information I've had a provisional whisper back from the project. As of about an hour ago, the provisional plan is to go through normal maintenance and backup tomorrow (Tuesday), and then revert the database changes so we can go back to running the old server code. My gloss on that is that it probably means a longer outage, and an even worse than usual recovery, but everyone should be able to enjoy the rest of the holiday in peace. Most especially, the staff can relax. Sounds like a good idea to me. ID: 2024655 ·

Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640	Message 2024656 - Posted: 23 Dec 2019, 17:02:52 UTC - in response to Message 2024655. Last modified: 23 Dec 2019, 17:03:35 UTC while that sounds fine. I have to wonder why they decided to make such a change in the first place. and i thought they seemed close to finding a solution to the problem without having to revert with Keith/Juan/Martin/etc and everyone else's inputs about possible causes. but as has been said several times, these problems were known on Beta already, and really should have been hashed out before deployment. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours ID: 2024656 ·

Mr. Kevvy Volunteer moderator Volunteer tester Send message Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319	Message 2024657 - Posted: 23 Dec 2019, 17:07:30 UTC - in response to Message 2024656. ...these problems were known on Beta already, and really should have been hashed out before deployment. 1) On a Friday 2) Before a holiday week 3) With no reversion plan in case of failure. Just to be complete. :^) ID: 2024657 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2024658 - Posted: 23 Dec 2019, 17:15:52 UTC - in response to Message 2024656. I imagine this is just a temporary step 1 - get the project running again over Christmas and the New Year. There has got to be a post-mortem, because I found the same problem in the first-ever Server Stable release running at LHC. That's a big oopsie. Also, we still don't know who decided it was safe to go ahead with the upgrade last Friday, and why. And in due course, we'll want to move forward again. I'll be pressing for prior notice, on a more sensible day and date. ID: 2024658 ·

Freewill Send message Joined: 19 May 99 Posts: 766 Credit: 354,398,348 RAC: 11,693	Message 2024660 - Posted: 23 Dec 2019, 17:21:21 UTC - in response to Message 2024655. For general information I've had a provisional whisper back from the project. As of about an hour ago, the provisional plan is to go through normal maintenance and backup tomorrow (Tuesday), and then revert the database changes so we can go back to running the old server code. My gloss on that is that it probably means a longer outage, and an even worse than usual recovery, but everyone should be able to enjoy the rest of the holiday in peace. Most especially, the staff can relax. Sounds like a good idea to me. Well, that should be just great after everyone has hacked up their configurations 8 ways to Sunday just to try to keep contributing. It would be nice if they could post something official just before or after the maintenance to tell us where we stand. I'm pissed at myself for wasting my time, currently on a train with spotty cell hotspot access. ID: 2024660 ·

halfempty Send message Joined: 2 Jun 99 Posts: 97 Credit: 35,236,901 RAC: 114	Message 2024661 - Posted: 23 Dec 2019, 17:24:10 UTC - in response to Message 2024655. ... and then revert the database changes so we can go back to running the old server code ... Sorry about the simplistic question, but for those of us running Windows I would like a little post reversion information. Would it be easier/wiser to re-run the Lunatics installer on the current stock installation/cache than to run out the cache and restore from backup? ID: 2024661 ·

Freewill Send message Joined: 19 May 99 Posts: 766 Credit: 354,398,348 RAC: 11,693	Message 2024662 - Posted: 23 Dec 2019, 17:25:31 UTC - in response to Message 2024649. that format seems to have done the trick. GPU utilization now ~98% on all GPUs and runtimes are improved. thanks! Would you mind posting an updated app_config? I'm still having problems getting 98% utilization on my GPUs. I had renamed the cuda101 to the stock SoG and sah exes. Was that the correct thing to do? Can't wait to restore my optimized setiathome folders once this nightmare is over. ID: 2024662 ·

Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640	Message 2024663 - Posted: 23 Dec 2019, 17:29:35 UTC - in response to Message 2024660. Honestly the configuration isn't that hard to go back and forth. when i want to go back to anonymous platform, I just need to restore my backup versions of app_info and app_config and it will be back to normal. But I'm not going back to Anonymous Platform until it's confirmed back up properly. I have my test bench sitting idle on Anonymous Platform that I never changed. when it starts getting work again, I'll flip the other systems back. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours ID: 2024663 ·
