The Server Issues / Outages Thread - Panic Mode On! (118)
Author | Message |
---|---|
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14668 Credit: 200,643,578 RAC: 874 |
I've had a PM querying my 'bad news'. In case anybody else is wondering, here's the evidence I sent in reply. Quote: "Bad news: It's not received the instruction to run on device 0 / device 1" Quote: "Please try to force your system to ask for cuda tasks only. I think you can achieve that by uninstalling opencl. Judging by the stderr output, if the special app runs instead of the original CUDA60 app, it will run on the designated GPU." I have two identical GPUs. BOINC is specifying that two tasks should run on different devices (this is the 'new' API in action - snip from init_data.xml): Device 0: <gpu_type>NVIDIA</gpu_type> <gpu_device_num>0</gpu_device_num> <gpu_opencl_dev_index>0</gpu_opencl_dev_index> Device 1: <gpu_type>NVIDIA</gpu_type> <gpu_device_num>1</gpu_device_num> <gpu_opencl_dev_index>1</gpu_opencl_dev_index> But Petri's special app isn't listening (snip from stderr.txt): Device 0: setiathome_CUDA: Found 2 CUDA device(s): Device 1: GeForce GTX 1660 Ti, 5911 MiB, regsPerBlock 65536 computeCap 7.5, multiProcs 24 pciBusID = 1, pciSlotID = 0 Device 2: GeForce GTX 1660 Ti, 5914 MiB, regsPerBlock 65536 computeCap 7.5, multiProcs 24 pciBusID = 3, pciSlotID = 0 setiathome_CUDA: No device specified, determined to use CUDA device 1: GeForce GTX 1660 Ti Device 1: setiathome_CUDA: Found 2 CUDA device(s): Device 1: GeForce GTX 1660 Ti, 5911 MiB, regsPerBlock 65536 computeCap 7.5, multiProcs 24 pciBusID = 1, pciSlotID = 0 Device 2: GeForce GTX 1660 Ti, 5914 MiB, regsPerBlock 65536 computeCap 7.5, multiProcs 24 pciBusID = 3, pciSlotID = 0 setiathome_CUDA: No device specified, determined to use CUDA device 1: GeForce GTX 1660 Ti They end up on the same device, despite instructions to run separately. Running the CUDA driver on its own may indeed work round the workround for multi-GPU hosts, but I won't be able to test that until the server sends me a CUDA task. None yet. |
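For anyone who wants to check what BOINC asked for in a given slot, here is a minimal Python sketch that reads the GPU assignment out of an init_data.xml snippet. The three GPU element names are the ones quoted in the post; the `<app_init_data>` wrapper and the function name `assigned_gpu` are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the GPU elements come from the post above;
# the <app_init_data> wrapper is an assumption about the enclosing file.
SAMPLE = """<app_init_data>
<gpu_type>NVIDIA</gpu_type>
<gpu_device_num>1</gpu_device_num>
<gpu_opencl_dev_index>1</gpu_opencl_dev_index>
</app_init_data>"""


def assigned_gpu(xml_text):
    """Return (gpu_type, device_num) found below the root element."""
    root = ET.fromstring(xml_text)
    gpu_type = root.findtext(".//gpu_type")
    device_num = root.findtext(".//gpu_device_num")
    # device_num stays None when the element is absent.
    return gpu_type, None if device_num is None else int(device_num)


print(assigned_gpu(SAMPLE))
```

Comparing this value per slot against what the app reports in stderr.txt is one way to spot the mismatch described above.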
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
Which special app are you trying to use? Try running my mutex build. It’s working fine on my systems. I compiled it against the BOINC 7.14.2 code, IIRC. All 3 systems, a 10-GPU and 2x 7-GPU systems, are running the special app on each card. It’s set up to run 2 WUs per card, but only one actually processes at a time (as designed) while the second one stays preloaded in a wait state until the first is finished. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
Oh. Your driver version isn’t sufficient for my CUDA 10.2 build. Hang on, let me get you my CUDA 10.0 build. |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
It seems to be working on my Linux machine running the 10.2 from the All-In-One: https://setiathome.berkeley.edu/result.php?resultid=8364929644 <core_client_version>7.16.2</core_client_version> <![CDATA[ <stderr_txt> setiathome_CUDA: Found 14 CUDA device(s): Device 1: GeForce RTX 2070, 7981 MiB, regsPerBlock 65536 computeCap 7.5, multiProcs 36 pciBusID = 1, pciSlotID = 0 Device 2: GeForce GTX 1070, 8119 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 15 pciBusID = 14, pciSlotID = 0 Device 3: GeForce GTX 1070, 8119 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 15 pciBusID = 15, pciSlotID = 0 Device 4: GeForce GTX 1070, 8119 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 15 pciBusID = 16, pciSlotID = 0 Device 5: GeForce GTX 1070, 8119 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 15 pciBusID = 17, pciSlotID = 0 Device 6: GeForce GTX 1070, 8119 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 15 pciBusID = 20, pciSlotID = 0 Device 7: GeForce GTX 1070, 8119 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 15 pciBusID = 21, pciSlotID = 0 Device 8: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 9 pciBusID = 2, pciSlotID = 0 Device 9: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 9 pciBusID = 3, pciSlotID = 0 Device 10: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 9 pciBusID = 6, pciSlotID = 0 Device 11: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 9 pciBusID = 11, pciSlotID = 0 Device 12: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 9 pciBusID = 13, pciSlotID = 0 Device 13: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 9 pciBusID = 18, pciSlotID = 0 Device 14: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 9 pciBusID = 19, pciSlotID = 0 In cudaAcc_initializeDevice(): Boinc passed DevPref 5 setiathome_CUDA: CUDA Device 5 specified, checking... Device 5: GeForce GTX 1070 is okay SETI@home using CUDA accelerated device GeForce GTX 1070 Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1 setiathome v8 enhanced x41p_V0.98b1, Cuda 10.2 Special Modifications done by petri33, compiled by TBar Are you sure it's not the version of BOINC you're using? |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14668 Credit: 200,643,578 RAC: 874 |
Quote: "Oh. Your driver version isn’t sufficient for my CUDA 10.2 build. Hang on, let me get you my CUDA 10.0 build." It's OK, I think I've got that. I've certainly tested a mutex build, live. I'll give it a try (didn't stick with it, because I've only got a fast M.2 SSD in there - not much latency left to lose!) Edit - I was previously running setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda101. I have setiathome_x41p_V0.99b1p3_x86_64-pc-linux-gnu_cuda100 - I'll try that instead. And that looks better: setiathome_CUDA: Found 2 CUDA device(s): Device 1: GeForce GTX 1660 Ti, 5911 MiB, regsPerBlock 65536 computeCap 7.5, multiProcs 24 pciBusID = 1, pciSlotID = 0 Device 2: GeForce GTX 1660 Ti, 5914 MiB, regsPerBlock 65536 computeCap 7.5, multiProcs 24 pciBusID = 3, pciSlotID = 0 In cudaAcc_initializeDevice(): Boinc passed DevPref 2 setiathome_CUDA: CUDA Device 2 specified, checking... Device 2: GeForce GTX 1660 Ti is okay SETI@home using CUDA accelerated device GeForce GTX 1660 Ti Using unroll = 5 from command line args --------------------------------------------------------- SETI@home v8 enhanced x41p_V0.99b1p3, CUDA 10.0 special ----------------------------------------------------------------------------------- Modifications done by petri33, work sync mutex by Oddbjornik. Compiled by Ian (^_^)>c[_] ----------------------------------------------------------------------------------- Another lesson learned! |
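The two stderr patterns seen in this thread ("Boinc passed DevPref N" versus "No device specified, determined to use CUDA device N") can be told apart mechanically. The following is only a sketch: the function name `device_choice` and the labels "boinc"/"fallback" are made up for illustration, and the regular expressions assume the log wording stays exactly as quoted in the posts.

```python
import re

# Patterns copied from the stderr excerpts quoted in this thread.
PASSED = re.compile(r"Boinc passed DevPref (\d+)")
FALLBACK = re.compile(r"No device specified, determined to use CUDA device (\d+)")


def device_choice(stderr_text):
    """Classify how the app picked its GPU, based on its stderr text."""
    match = PASSED.search(stderr_text)
    if match:
        # The app received a device preference from BOINC.
        return ("boinc", int(match.group(1)))
    match = FALLBACK.search(stderr_text)
    if match:
        # The app saw no preference and chose a device itself.
        return ("fallback", int(match.group(1)))
    return ("unknown", None)


print(device_choice("In cudaAcc_initializeDevice(): Boinc passed DevPref 2"))
```

If every task on a multi-GPU host reports "fallback" with the same number, the tasks are probably stacking on one card, as in the earlier evidence.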
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
Cool. I see you’re using a custom unroll via command line. 2 questions: 1. Why? 2. Are you using this method of making BOINC think it’s an SoG/sah/CUDA60 app by renaming the executable? I haven’t been able to get the app to accept the -nobs parameter in either the text file or app_config. Edit - also, if you don't want to use the mutex functionality, you can just set the system to use 1 WU per GPU and it will act like a normal app. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14668 Credit: 200,643,578 RAC: 874 |
It was left over in my old app_config.xml, which I'm using to test command-line passing in the workround. The unroll was originally set for the 0.98b1 app, where it made a difference. I haven't changed it yet for V0.99b1p3, where I'm told autotune is better - waiting until this test worked (which it now has), and it provides a handy marker in stderr so I can see easily what's happening. Yes, I'm doing the 'rename executable' workround. I'll go fetch my app_config file from downstairs. <app_config> <app_version> <app_name>setiathome_v8</app_name> <plan_class>opencl_nvidia_sah</plan_class> <avg_ncpus>0.01</avg_ncpus> <ngpus>1</ngpus> <cmdline>-nobs -unroll 5</cmdline> </app_version> <app_version> <app_name>setiathome_v8</app_name> <plan_class>opencl_nvidia_SoG</plan_class> <avg_ncpus>0.01</avg_ncpus> <ngpus>1</ngpus> <cmdline>-nobs -unroll 5</cmdline> </app_version> <app_version> <app_name>setiathome_v8</app_name> <plan_class>cuda101</plan_class> <avg_ncpus>0.01</avg_ncpus> <ngpus>1</ngpus> <cmdline>-nobs -unroll 5</cmdline> </app_version> <app_version> <app_name>setiathome_v8</app_name> <plan_class>cuda100</plan_class> <avg_ncpus>0.01</avg_ncpus> <ngpus>0.5</ngpus> <cmdline>-nobs</cmdline> </app_version> <app_version> <app_name>setiathome_v8</app_name> <plan_class>cuda90</plan_class> <avg_ncpus>0.01</avg_ncpus> <ngpus>1</ngpus> <cmdline>-nobs -unroll 5</cmdline> </app_version> <project_max_concurrent>2</project_max_concurrent> </app_config> From the top: Today's stock/rename tests: opencl_nvidia_sah, opencl_nvidia_SoG. Copied from previous V98 running: cuda101. Previous Mutex test: cuda100. |
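Before dropping an app_config.xml like this into the project folder, it can help to confirm that it is well-formed and to see which settings apply to which plan class. Here is a rough Python sketch: the function name `plan_class_settings` is hypothetical, the sample is cut down to one entry from the post, and the fallback of 1 for a missing `<ngpus>` is an assumption of this sketch, not documented client behaviour.

```python
import xml.etree.ElementTree as ET

# Cut-down sample using one <app_version> entry from the post above.
SAMPLE = """<app_config>
  <app_version>
    <app_name>setiathome_v8</app_name>
    <plan_class>cuda100</plan_class>
    <avg_ncpus>0.01</avg_ncpus>
    <ngpus>0.5</ngpus>
    <cmdline>-nobs</cmdline>
  </app_version>
  <project_max_concurrent>2</project_max_concurrent>
</app_config>"""


def plan_class_settings(xml_text):
    """Map each plan_class to its ngpus and cmdline values."""
    # ET.fromstring raises ET.ParseError if the text is not well-formed,
    # for example when a closing tag is truncated.
    root = ET.fromstring(xml_text)
    settings = {}
    for version in root.iter("app_version"):
        plan_class = version.findtext("plan_class", default="")
        settings[plan_class] = {
            # Falling back to 1 GPU is an assumption made for this sketch.
            "ngpus": float(version.findtext("ngpus", default="1")),
            "cmdline": version.findtext("cmdline", default=""),
        }
    return settings


print(plan_class_settings(SAMPLE))
```

An `<ngpus>` of 0.5 is what lets two tasks share one card, which is the arrangement the mutex build relies on.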
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
ah, I don't have the plan class entries in my app_config. so maybe that's why. let me give that a shot. also those cuda100 and 101 plan classes don't exist. don't use them. the latest cuda apps are still filed under cuda90 |
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
that format seems to have done the trick. GPU utilization now ~98% on all GPUs and runtimes are improved. thanks! |
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
Quote: "No. It took me bloody ages to get that driver installed (about the first thing I ever tried to do in Linux), and I'm not changing it for a temporary glitch. Running at half-throttle is fine, and kinder on the servers. I have one other host running stock so I can keep an eye on things, and all the others are waiting for the recovery. Which will be fun in itself..." Quote: "Bad news: It's not received the instruction to run on device 0 / device 1" Quote: "Please try to force your system to ask for cuda tasks only. I think you can achieve that by uninstalling opencl. Judging by the stderr output, if the special app runs instead of the original CUDA60 app, it will run on the designated GPU." Richard, I've been watching all this. How about a cc_config.xml with excludes? You could tell seti to exclude the OpenCL apps and, for that matter, all the other cuda apps except 60. That way it makes it easier to accept only the cuda 60. I knew about the app_config, but since I'm at work and don't have access to a computer, I was unable to test. Thanks and good luck. Won't be able to test anything myself till the 25th. Z |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14668 Credit: 200,643,578 RAC: 874 |
When you're working anonymous platform, you can make up any plan_class name you want. I made them different so I could transition between applications, each with its own settings, without losing work or having to run dry. When running stock, you have to match the project's names. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14668 Credit: 200,643,578 RAC: 874 |
Quote: "Richard," Unfortunately, the excludes only go down as far as <exclude_gpu> <url>project_URL</url> [<device_num>N</device_num>] [<type>NVIDIA|ATI|intel_gpu</type>] [<app>appname</app>] </exclude_gpu> - so no support for excluding by app_version. Good idea anyway, and we seem to have got liftoff if you choose the right app. |
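To make the limits of the exclude concrete, here is a small Python sketch that builds an `<exclude_gpu>` element from the four fields listed above. The helper name `exclude_gpu` is hypothetical, and the URL and app name in the example call are assumptions taken from elsewhere in this thread; the URL would need to match the project URL exactly as the client lists it.

```python
import xml.etree.ElementTree as ET


def exclude_gpu(url, device_num=None, gpu_type=None, app=None):
    """Build an <exclude_gpu> element; only <url> is mandatory."""
    node = ET.Element("exclude_gpu")
    ET.SubElement(node, "url").text = url
    if device_num is not None:
        ET.SubElement(node, "device_num").text = str(device_num)
    if gpu_type is not None:
        ET.SubElement(node, "type").text = gpu_type
    if app is not None:
        ET.SubElement(node, "app").text = app
    # Note there is no field for plan class or app_version to set here.
    return ET.tostring(node, encoding="unicode")


# Example values are assumptions for illustration only.
print(exclude_gpu("https://setiathome.berkeley.edu/", device_num=1, app="setiathome_v8"))
```

Because the narrowest field is the app name, an exclude built this way would hit every version of setiathome_v8 on that device, stock CUDA60 included.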
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14668 Credit: 200,643,578 RAC: 874 |
For general information I've had a provisional whisper back from the project. As of about an hour ago, the provisional plan is to go through normal maintenance and backup tomorrow (Tuesday), and then revert the database changes so we can go back to running the old server code. My gloss on that is that it probably means a longer outage, and an even worse than usual recovery, but everyone should be able to enjoy the rest of the holiday in peace. Most especially, the staff can relax. Sounds like a good idea to me. |
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
while that sounds fine, I have to wonder why they decided to make such a change in the first place. and i thought they seemed close to finding a solution to the problem without having to revert, given the input from Keith/Juan/Martin/etc. and everyone else about possible causes. but as has been said several times, these problems were known on Beta already, and really should have been hashed out before deployment. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14668 Credit: 200,643,578 RAC: 874 |
I imagine this is just a temporary step 1 - get the project running again over Christmas and the New Year. There has got to be a post-mortem, because I found the same problem in the first-ever Server Stable release running at LHC. That's a big oopsie. Also, we still don't know who decided it was safe to go ahead with the upgrade last Friday, and why. And in due course, we'll want to move forward again. I'll be pressing for prior notice, on a more sensible day and date. |
Freewill Send message Joined: 19 May 99 Posts: 766 Credit: 354,398,348 RAC: 11,693 |
Quote: "For general information" Well, that should be just great after everyone has hacked up their configurations 8 ways to Sunday just to try to keep contributing. It would be nice if they could post something official just before or after the maintenance to tell us where we stand. I'm pissed at myself for wasting my time, currently on a train with spotty cell hotspot access. |
halfempty Send message Joined: 2 Jun 99 Posts: 97 Credit: 35,236,901 RAC: 114 |
Quote: "... and then revert the database changes so we can go back to running the old server code ..." Sorry about the simplistic question, but for those of us running Windows I would like a little post-reversion information. Would it be easier/wiser to re-run the Lunatics installer on the current stock installation/cache than to run out the cache and restore from backup? |
Freewill Send message Joined: 19 May 99 Posts: 766 Credit: 354,398,348 RAC: 11,693 |
Quote: "that format seems to have done the trick. GPU utilization now ~98% on all GPUs and runtimes are improved. thanks!" Would you mind posting an updated app_config? I'm still having problems getting 98% utilization on my GPUs. I had renamed the cuda101 to the stock SoG and sah exes. Was that the correct thing to do? Can't wait to restore my optimized setiathome folders once this nightmare is over. |
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
Honestly, the configuration isn't that hard to switch back and forth. when i want to go back to anonymous platform, I just need to restore my backup versions of app_info and app_config and it will be back to normal. But I'm not going back to Anonymous Platform until it's confirmed back up properly. I have my test bench sitting idle on Anonymous Platform that I never changed. when it starts getting work again, I'll flip the other systems back. |