Linux CUDA 'Special' App finally available, featuring Low CPU use
Author | Message |
---|---|
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
. . I am most distressed ... I have an invalid task that is not a noise bomb ... . . Thanks Grant, . . That explains the problem, it must have been when I shut down to upgrade something. Though detecting it as a V7 task is odd. But since it is a known issue I won't fret over it any more :) . . I wonder if suspending all other tasks and letting the two running tasks complete before shutting down BOINC will solve the issue. From the sound of things it should. Stephen |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
Replied to wrong post. Grant Darwin NT |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
Restarted at 72.01 percent **** Detected setiathome_enhanced_v7 task **** What about detecting the WU as v7 and not v8? Is that what causes the problem after the restart, or is it a different issue entirely? Grant Darwin NT |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
That's an interesting observation. I hadn't noticed that discrepancy before on the Invalids I've been getting on my host 8289033. It appears that if a task restarts from scratch, it sticks with "Detected setiathome_enhanced_v8 task", but if it restarts from a checkpoint, it prints the "v7" line, as you spotted. Only the developer could say if that's a significant clue or not. "Restarted at 72.01 percent" Host 8289033 is the only one of my 3 Linux boxes that gets those Invalids, which have been happening about 3-4 times a week since I started running the x41p_zi3v app. (The other 2 hosts are still running x41p_zi3t2b.) Inasmuch as I have 4 GPUs on that host, and it goes through a restart cycle every weekday evening, that's about 15-20% of the restarted tasks that go berserk. There seems to be a threshold at around 70% where, if a task restarts from a checkpoint below that point, it "detects" a mass of phantom Spikes, whereas above that point, it finds phantom Triplets. All 3 Linux hosts have the same 120 second checkpoint setting. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
The CUDA code is a kluge, some parts are from 2007. This is the part of the code that prints that line: https://setisvn.ssl.berkeley.edu/trac/browser/branches/sah_v7_opt/Xbranch/client/alpha/PetriR_raw3/analyzeFuncs.cpp#L701 if (ac_fft_len) fprintf(stderr,"Detected setiathome_enhanced_v7 task. It would be a very simple matter to change that 7 to an 8, but I doubt it would change a thing. So, I wouldn't get very concerned about it. It's going to take much more than changing the number to fix the problem, and the problem existed back when v7 was current. The overflows on restarted tasks have always been there...they were worse a while back. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"The CUDA code is a kluge, some parts are from 2007. This is the part of the code that prints that line;" I think the question, though, would be whether the program's taking an incorrect path following a restart, actually bringing it to the v7 instead of the v8 path. As I said, if a task restarts at the beginning and doesn't show a restart percentage, it still prints the v8 line. "The overflows on restarted tasks have always been there...they were worse a while back." Not on Windows, as far as I know. Never had a single one in the 4+ years I've been active. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
"The overflows on restarted tasks have always been there WITH THE SPECIAL APP...they were worse a while back." Fixed it for you. If you check the code, I think you'll find the Special code doesn't have another path. You might want to check it though, I could be wrong ;-) BTW, a quick check would be to find the line that says, fprintf(stderr,"Restarted at %.2f percent, with setiathome enhanced x41p_zi3v, Cuda %c.%c%c %s\n", progress*100,custr[0],custr[2],custr[3],(CUDART_VERSION >= 6050) ? "special":""); if (ac_fft_len) fprintf(stderr,"Detected setiathome_enhanced_v8 task. I don't see one. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Okay, I just restarted 8289033 a little early so I could easily spot today's four restarted tasks when they reported. I got the full range of results, I think. Task 5901687112: Validated; originally started on a 750Ti, restarted on the 960 at 47.66%; shows "Detected setiathome_enhanced_v7 task" after restart; no apparent phantom signals Task 5901695820: Pending; originally started on the 960, restarted on a 750Ti at 66.41%; shows "Detected setiathome_enhanced_v7 task" after restart; no apparent phantom signals Task 5901695818: Pending; originally started on a 750Ti, restarted on a different 750Ti, apparently from beginning (no restart % shown); shows "Detected setiathome_enhanced_v8 task" after restart; no apparent phantom signals Task 5901687309: Inconclusive; originally started on a 750Ti, restarted on a different 750Ti at 57.90%; shows "Detected setiathome_enhanced_v7 task" after restart; appears to have detected 17 phantom Triplets following restart and will obviously be marked Invalid once all is said and done And, before someone asks, it isn't always the same GPU that exhibits the problem following a restart. It can happen on any one of the 4. ------------------ EDIT: To provide another data point, I just took a look at one of the tasks that restarted this evening on one of my other Linux hosts, 8253697, which is running x41p_zi3t2b. It also prints the v7 line following a checkpoint restart. The thing is, this host has never had the "phantom signals" problem. Task 5902138952: Pending; originally started on one 980, restarted on the other 980 at 56.13%; shows "Detected setiathome_enhanced_v7 task" after restart; no apparent phantom signals |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
I'm going to look at the code when home. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Thought I remembered reading about the "Detected setiathome_enhanced_v7 task" errors recently. Just checked my new special app Linux box and noticed a couple of errors. These two: Task 5902461864 Task 5905308810 have the same errors talked about in the last few posts. Never had any of those kinds of errors on my Windows machines. I take it that the app is being looked at and there will be a possible update? Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"Thought I remembered reading about the 'Detected setiathome_enhanced_v7 task' errors recently. Just checked my new special app linux box and noticed a couple of errors. These two ones:" Although the "Detected setiathome_enhanced_v7 task" is apparently coming from the Special app when it's restarted from a checkpoint, the actual error you're getting on those two, "finish file present too long", is a long-standing BOINC problem. Your tasks actually completed successfully and wrote out their "finish file", but BOINC was shut down before it got around to noticing that the file was there. By the time BOINC was restarted, that 10-second window had long been exceeded and BOINC trashed the tasks. Nothing the application can do in this situation. |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
Hi, detected ... v7 ... That is just an old message, and I have now changed it to say v8. If you have trouble with restarted tasks, you could try setting 'write to disk at most every xxx seconds' in BOINC Manager to a number high enough that even the longest-running GPU task never checkpoints. My computer finishes GPU tasks in under 300 seconds, so I can set it to 300 in BOINC. Petri To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Petri, do you have your latest special app hosted somewhere? What is a sensible -pfb value for reference GTX970s? I have a command line currently at -autotune and -nobs. I searched in the top 100 computers for other 970 users and I only found Stephen, who is using the same defaults, and Mr. Kevvy, who is using autotune and the -pfb=32 setting. I think the -pfb value correlates directly to the number of compute units. I am unclear on what the -pfb parameter sets up in the card. A short explanation would be appreciated. I looked in the docs and they don't really explain what the parameters accomplish, just their syntax. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
"Petri, do you have your latest special app hosted somewhere?" Hi, I do not have the latest (cuda9) version anywhere yet. It'll come in a week or so. I'll post the links here. The -pfb can have 8, 16 or 32 on modern HW (7xx, 9xx & 10xx). It is short for Pulse Find Blocks per SM. The default is 4 if I remember correctly. Setting it to 8 or higher speeds up the computation. It is not autotuned yet. I use -pfb 32 on my computer. Petri To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Thanks Petri, I mainly asked because of your post of the fix for the outdated error messages to V7. I'll monitor the thread looking for the latest when you think it is ready for release. Thanks for the explanation about the -pfb setting. I went back to the docs for the x41Z app and refreshed my memory. I used to run -pfb=16 for the CUDA50 app on the 970s. I think I'll give that a try for the 970s on the special app. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
"Thanks Petri, I mainly asked because of your post of the fix for the outdated error messages to V7. I'll monitor the thread looking for the latest when you think it is ready for release." I hope you all can live with the message having a 'typo'. As to the -pfb N, I think you'll notice an improvement in speed. And for those who have blocking sync enabled, I recommend disabling it if you are after the ultimate speed. Petri To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
. . I have added -pfb 16 to the app_info.xml for both machines and cannot say there is a huge improvement. So I upped the number for the 970s to 32 but still only a few secs off at best. Too small a change to be sure about. It could just be a small variation in the batch of tasks coming through at the moment. Still it is doing no harm so I will leave it there. . . It was worth a try. . . Normal Arecibo (NARA) tasks were 2 mins 48 - 53 secs. Now about 2 min 44-46 secs. . . VHAR (Halflings) were 1 min 18-20 secs, now about 1 min 21-22 secs. . . A similarly small change in GB run times. Stephen .. |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
OK, you prompted me to bump my -pfb to 32 as well, since I know you have 970s too. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
It appears the new build of BOINC 7.2.47 is working very nicely. There are just a couple of minor changes over the current build. The new build uses the gpu_nvidia.cpp file from 7.4.53, which displays the nvidia_driver_version; the change from 7.6.33 that adds the repository driver library link, allowing boinc to see OpenCL with the repository driver; and the embedded wxWidgets libraries, which now contain the tiny libwx_gtk2_gl-2.8.a library. As far as I know there aren't any outstanding dependencies in the targeted Ubuntu systems, 12.04 and higher. In the newer systems such as 16.04 you need to choose Shut down connected client... in the Advanced menu to Stop boinc and the active tasks, or just Suspend the project before Exiting the Manager, which will leave boinc running harmlessly in the background until next use. This version of BOINC does Not contain the Tasks page display Bug which is present in newer versions of BOINC for Linux. As soon as I test embedding the CUDA Special App in the BOINC package I'll be posting the All In One BOINC package. This will make running the Special App as easy as Installing the OS & Driver, Expanding the Download to your Home folder, and Double Clicking boincmgr. Quite an improvement in ease of use. |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Thanks for the continued development and debug of the Linux BOINC platform, TBar. I'm sure a lot of people will be interested in your all-in-one package. And I'm sure the fact that BOINC gets installed in the Home directory will make the permissions headache a thing of the past and will be greatly appreciated by everyone. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
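
For anyone who wants to check their own results for the pattern Jeff Buck describes in the thread (a checkpoint restart prints a "Restarted at N percent" line and then the v7 message, while a restart from scratch keeps the v8 message), here is a minimal Python sketch that scans a task's stderr text for those two lines. The function name and the layout of the returned dictionary are invented for illustration, the regular expressions are based only on the log lines quoted in the posts, and the sample text is stitched together from those quotes rather than copied from a real task.

```python
import re

# Patterns based only on the stderr lines quoted in this thread.
RESTART_RE = re.compile(r"Restarted at (\d+(?:\.\d+)?) percent")
DETECTED_RE = re.compile(r"Detected setiathome_enhanced_v(\d+) task")


def summarize_stderr(stderr_text):
    """Summarize restart and version lines found in a task's stderr.

    Hypothetical helper: the name and return layout are assumptions.
    """
    restarts = [float(m.group(1)) for m in RESTART_RE.finditer(stderr_text)]
    versions = [int(m.group(1)) for m in DETECTED_RE.finditer(stderr_text)]
    return {
        "restart_percents": restarts,
        "restarted_from_checkpoint": bool(restarts),
        "last_detected_version": versions[-1] if versions else None,
    }


# Illustrative sample assembled from lines quoted in the posts.
sample = (
    "Detected setiathome_enhanced_v8 task\n"
    "Restarted at 72.01 percent, with setiathome enhanced x41p_zi3v\n"
    "Detected setiathome_enhanced_v7 task\n"
)
summary = summarize_stderr(sample)
print(summary)
```

As petri33 notes, the v7 wording is just an old message, so a hit on it only tells you the task resumed from a checkpoint; it says nothing about whether phantom signals followed.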
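
Jeff Buck's "15-20% of the restarted tasks" figure can be reproduced with a few lines of arithmetic. This is only a sketch of how the number appears to be derived: it assumes exactly one task is running on each of the 4 GPUs at every restart, which the post implies but does not state, and the function name is made up for the example.

```python
def invalid_rate(invalids_per_week, gpus, restarts_per_week):
    """Fraction of restarted tasks that ended up Invalid.

    Assumes exactly one task is running on each GPU at every restart.
    """
    restarted_tasks = gpus * restarts_per_week
    return invalids_per_week / restarted_tasks


# Figures from the post: 4 GPUs, one restart per weekday evening,
# and roughly 3-4 Invalids a week.
for invalids in (3, 4):
    rate = invalid_rate(invalids, gpus=4, restarts_per_week=5)
    print(f"{invalids} Invalids per week -> {rate:.0%} of restarted tasks")
```

With 4 GPUs and 5 restarts a week that is 20 restarted tasks, so 3 to 4 Invalids works out to the 15-20% quoted.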
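
To put Stephen's before-and-after -pfb timings on a common scale, the sketch below converts the ranges he quotes to seconds and compares their midpoints. Using the midpoint of each range is my own simplification, not something from the post, and the helper names are invented; with so few tasks the result is only a rough indication.

```python
def seconds(minutes, secs):
    """Convert a minutes/seconds pair to seconds."""
    return minutes * 60 + secs


def midpoint(low, high):
    """Midpoint of a quoted run-time range (a simplifying assumption)."""
    return (low + high) / 2


def percent_change(before, after):
    """Relative change in run time; negative means faster."""
    return (after - before) / before * 100


# Run-time ranges quoted in the post: (before, after), each as (low, high).
runs = {
    "NARA": ((seconds(2, 48), seconds(2, 53)), (seconds(2, 44), seconds(2, 46))),
    "VHAR": ((seconds(1, 18), seconds(1, 20)), (seconds(1, 21), seconds(1, 22))),
}

for name, (before, after) in runs.items():
    change = percent_change(midpoint(*before), midpoint(*after))
    print(f"{name}: {change:+.1f}%")
```

On these numbers the NARA tasks come out roughly 3% faster and the VHAR tasks roughly 3% slower, which fits Stephen's own reading that the change is too small to be sure about.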