Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
> Any news regarding checking of Jason's proposal regarding the race condition?

The last I've heard is Petri's earlier post mentioning that he'd give the strategy a try. It would be good if that works out, as I've had other setbacks with illness and catching up with other work in the meantime. It'll have to be in Petri's and TBar's hands for the short term until I'm back on my feet. Unfortunate, but that's the way the cookie crumbles for now.

[Side note:] At the same time, while down with a cold and digging into that pulsefind race, I found material on using Convolutional Neural Networks (CNNs) specifically for RF chirped-signal feature recognition. That's significant because it has the potential to recognise the features with higher certainty than we're used to (~98%) across all chirp rates in the same pass. Even if only used as a prescan to sparsify/target traditional chirp+Fourier analyses, that form of AI is what the current architectures were built to do, and the next generation (Volta) supposedly has 'Tensor processors' in addition to normal Cuda cores. The rapid development in that direction may be too important to ignore for too much longer.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
I've always had that thought at the back of my mind. That mechanism has been in place forever, but I have never jumped out of my complacency to try it out.

Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
I thought that might be the limit, but my thread searches never pinned the exact number down. The CPU2GPU script seems to work for me IF I only use it once; any more and I create ghosts. There was a fortuitous occurrence yesterday for Numbskull because it got exclusively Arecibo MHARs on the CPU side, and Mr. Kevvy's/Stubbles/Jimbocous script was run every 20 minutes or so to build my CPU count back up to 100, so I could then transfer them over to the GPUs. I got all the way past 3:30 PM PDT before running out of GPU tasks on that machine, instead of running out by noon as usual.

Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
Jeff Buck · Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0
> Or just run another instance of BOINC. It is quite a bit less work.

Yeah, I've seen that approach mentioned from time to time. Setting up a separate BOINC Data folder seems pretty straightforward, but do you know how Windows handles the Registry entries? I'm thinking that there would still be only one set, even with multiple instances of BOINC.
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
> Or just run another instance of BOINC. It is quite a bit less work.
>
> Yeah, I've seen that approach mentioned from time to time. Setting up a separate BOINC Data folder seems pretty straightforward, but do you know how Windows handles the Registry entries? I'm thinking that there would still be only one set, even with multiple instances of BOINC.

If I remember right, the registry doesn't even come into play. It's up to each BOINC instance to point at its /Data directory, wherever that may reside on whatever mounted storage medium. Somebody correct me if I don't have the facts straight.

Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
Jeff Buck · Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0
> If I remember right, the registry doesn't even come into play. It's up to each BOINC instance to point at its /Data directory, wherever that may reside on whatever mounted storage medium. Somebody correct me if I don't have the facts straight.

Perhaps, but where would it store that info if not in the Registry? I know that, for my own VLAR Rescheduler, I pick up the data folder location from the DATADIR value in "HKEY_LOCAL_MACHINE\SOFTWARE\Space Sciences Laboratory, U.C. Berkeley\BOINC Setup". And, in addition to about 20 values stored in that key, there are a bunch more in "HKEY_CURRENT_USER\Software\Space Sciences Laboratory, U.C. Berkeley\BOINC Manager". I assume that last structure could accommodate separate values for BOINC instances that run under different logon IDs, but I'm really not sure.
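For reference, a minimal sketch of that DATADIR lookup (plain Win32 C++, linked against Advapi32). The key and value name are the ones Jeff quotes; everything else here is illustrative, not BOINC's or the rescheduler's actual code.

```
// Illustrative only: read BOINC's data-directory path from the registry
// key mentioned above.  Error handling kept minimal.  Link with Advapi32.lib.
#include <windows.h>
#include <cstdio>

int main()
{
    char dataDir[MAX_PATH] = {0};
    DWORD size = sizeof(dataDir);
    LSTATUS rc = RegGetValueA(
        HKEY_LOCAL_MACHINE,
        "SOFTWARE\\Space Sciences Laboratory, U.C. Berkeley\\BOINC Setup",
        "DATADIR",
        RRF_RT_REG_SZ,              // expect a string value
        nullptr,
        dataDir,
        &size);
    if (rc == ERROR_SUCCESS)
        printf("BOINC data directory: %s\n", dataDir);
    else
        printf("DATADIR not found (error %ld)\n", (long)rc);
    return 0;
}
```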
Stephen "Heretic" · Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628
> If I remember right, the registry doesn't even come into play. It's up to each BOINC instance to point at its /Data directory, wherever that may reside on whatever mounted storage medium. Somebody correct me if I don't have the facts straight.
>
> Perhaps, but where would it store that info if not in the Registry? I know that, for my own VLAR Rescheduler, I pick up the data folder location from the DATADIR value in "HKEY_LOCAL_MACHINE\SOFTWARE\Space Sciences Laboratory, U.C. Berkeley\BOINC Setup". And, in addition to about 20 values stored in that key, there are a bunch more in "HKEY_CURRENT_USER\Software\Space Sciences Laboratory, U.C. Berkeley\BOINC Manager". I assume that last structure could accommodate separate values for BOINC instances that run under different logon IDs, but I'm really not sure.

. . Personally I suspect there are too many ways to get things wrong when trying that so I will pass ...
. . Call me chicken :)

Stephen :)
petri33 · Joined: 6 Jun 02 · Posts: 1668 · Credit: 623,086,772 · RAC: 156
> Any news regarding checking of Jason's proposal regarding the race condition?

I'll try that on the weekend.

Petri

To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
> Even if only used as a prescan to sparsify/target traditional chirp+Fourier analyses, that form of AI is what the current architectures were built to do, and the next generation (Volta) supposedly has 'Tensor processors' in addition to normal Cuda cores. The rapid development in that direction may be too important to ignore for too much longer.

Yes, that's very true... but I can't find anyone renting out another +24 hours per day - that's the issue :)

SETI apps news
We're not gonna fight them. We're gonna transcend them.
EdwardPF · Joined: 26 Jul 99 · Posts: 389 · Credit: 236,772,605 · RAC: 374
> If I remember right, the registry doesn't even come into play. It's up to each BOINC instance to point at its /Data directory, wherever that may reside on whatever mounted storage medium. Somebody correct me if I don't have the facts straight.

(off topic ... I know ...)

As I recall ... I had to remove all references to BOINC and setiathome.berkeley from the registry. Then I ran BOINC.EXE from each created sub-directory and packaged it in a SIMPLE startup ".bat" that ran at startup ... as follows:

rem Each client gets its own data directory (--dir), its own GUI RPC port (--gui_rpc_port),
rem and --allow_multiple_clients so they can coexist; "start /affinity <hex mask>" pins each
rem client to particular CPU cores.  The first instance is left commented out in this copy.
rem start Y:\BOINC_test_programs\boinc.exe --gui_rpc_port 31416 --dir Y:\BOINC_test_data_1 --allow_multiple_clients --detach
start /affinity 0A Y:\BOINC_test_programs\boinc.exe --gui_rpc_port 31417 --dir Y:\BOINC_test_data_2 --allow_multiple_clients --detach
start /affinity C0 Y:\BOINC_test_programs\boinc.exe --gui_rpc_port 31418 --dir Y:\BOINC_test_data_3 --allow_multiple_clients --detach
start /affinity 01 Y:\BOINC_test_programs\boinc.exe --gui_rpc_port 31419 --dir Y:\BOINC_test_data_4 --allow_multiple_clients --detach
start /affinity 44 Y:\BOINC_test_programs\boinc.exe --gui_rpc_port 31420 --dir Y:\BOINC_test_data_5 --allow_multiple_clients --detach
start /affinity 10 Y:\BOINC_test_programs\boinc.exe --gui_rpc_port 31421 --dir Y:\BOINC_test_data_6 --allow_multiple_clients --detach
start /affinity 04 Y:\BOINC_test_programs\boinc.exe --gui_rpc_port 31422 --dir Y:\BOINC_test_data_7 --allow_multiple_clients --detach
rem Launch one Manager against the first data directory
cd /D Y:\BOINC_test_data_1
start Y:\BOINC_test_programs\boincmgr.exe /s
cd ..
exit

or something near abouts this...

Ed F
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
> Even if only used as a prescan to sparsify/target traditional chirp+Fourier analyses, that form of AI is what the current architectures were built to do, and the next generation (Volta) supposedly has 'Tensor processors' in addition to normal Cuda cores. The rapid development in that direction may be too important to ignore for too much longer.

Yep. Currently attempting to restructure my work+study+home life a bit to get more time for development. Unfortunately it's a slow process. I'm hopeful the learning curve isn't going to be as steep as I thought a year ago, fingers crossed.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
rob smith · Joined: 7 Mar 03 · Posts: 22878 · Credit: 416,307,556 · RAC: 380
Take your time, Jason - I for one can wait.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Jeff Buck · Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0
> Agreed. The 6.5 App doesn't seem to have the same clustered Pulse problem that the 8.0 App was showing on the 780.

My statement held true for about 18 days, until Task 5744228697, which reported a cluster of 21 pulses undetected by the wingmen. Again, the WU was a guppi VLAR. Obviously, 1 occurrence in 18 days is significantly better than 42 in less than 1 day, but it does seem to indicate that whatever the incompatibility is, it doesn't originate with Cuda 8.0 but is certainly magnified by it.
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
> Agreed. The 6.5 App doesn't seem to have the same clustered Pulse problem that the 8.0 App was showing on the 780.
>
> My statement held true for about 18 days, until Task 5744228697, which reported a cluster of 21 pulses undetected by the wingmen. Again, the WU was a guppi VLAR.

Correct. Under older (simpler) Cuda driver models, more of the execution tends to be serialised, while in later variants the granularity can be remarkably fine, and is complicated by driver optimisations that fuse kernels in flight. That means the presence of any race condition can have anywhere from no effect whatsoever right through the spectrum to obviously broken results. The 'obvious' solution is to not have race conditions at all, and that is what the approach Petri is going to give a go at will hopefully resolve.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
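A minimal sketch of the scheduling point being made here (hypothetical kernel and names, not the special app's code): the same pair of launches that is effectively serial when issued into one stream may overlap when issued into independent streams, and only then does an unsynchronised shared update get a chance to misbehave.

```
// Hypothetical sketch: how much two launches overlap depends on how they are
// issued (and on what the driver does underneath), not on the kernel source.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(int* counter)
{
    // Unsynchronised read-modify-write on shared state: harmless if the
    // launches are serialised, a race if they overlap.
    int v = *counter;
    *counter = v + 1;
}

int main()
{
    int* d = nullptr;
    cudaMalloc(&d, sizeof(int));
    cudaMemset(d, 0, sizeof(int));

    // Issued into the default stream: the second launch cannot start until
    // the first finishes, so the two updates are effectively serial.
    touch<<<1, 1>>>(d);
    touch<<<1, 1>>>(d);

    // Issued into two independent streams: the driver is free to run them
    // concurrently (and on WDDM may batch/reorder submissions), so the two
    // read-modify-writes can interleave.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    touch<<<1, 1, 0, s1>>>(d);
    touch<<<1, 1, 0, s2>>>(d);
    cudaDeviceSynchronize();

    int h = 0;
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("counter = %d (4 only if every update was serialised)\n", h);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d);
    return 0;
}
```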
Jeff Buck · Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0
Is that really what you think is happening here? I was under the impression that the race conditions were on display when all tasks for a WU overflowed, just with differing signal counts and, perhaps, best signal values. In all my recent Invalids with some version of the Special App, both the earlier ones with Cuda 8.0 and this new one with 6.5, only my tasks (on the GTX 780) resulted in a -9 overflow, always with a cluster of pulses that none of the wingmen detected. The wingmen did not overflow and had fairly normal looking signal counts. BTW, the other cards in that box (a GTX 960 and a recently added GTX 980) haven't exhibited the odd behavior.
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
> Is that really what you think is happening here? I was under the impression that the race conditions were on display when all tasks for a WU overflowed, just with differing signal counts and, perhaps, best signal values. In all my recent Invalids with some version of the Special App, both the earlier ones with Cuda 8.0 and this new one with 6.5, only my tasks (on the GTX 780) resulted in a -9 overflow, always with a cluster of pulses that none of the wingmen detected. The wingmen did not overflow and had fairly normal looking signal counts.

[Not ruling out other 'bugs' though...] Yes, I'm quite certain the race is a thing. We know the CPU-side recording of pulses is the same as always: reading from a matrix of results in serial fashion after the complete batch of pulse folds at many periods is performed.

In a little more depth, we know the anomaly occurs across unroll boundaries, simply because setting unroll to 1 eliminates the issue. That restores the GPU-side processing order to the same as Baseline, with no changes to the CPU side, and produces correct results [in everything seen here so far anyway, across all the samples you sent me].

Petri's particularly effective unrolling approach spreads the fold periods across SMs, apportioning what used to be a serial run of long pulsefinds (in VLAR, the longest) over more threads. The key point is that what used to be serial now runs in parallel, yet there are multiple threads and only one result dataset for recording the pulses. In the older GPU and reference CPU applications, subsequent detections in the same PulsePoT overwrite the recorded result with the higher score. Do that detection-and-overwrite in parallel, with no form of reduction/serialisation, and the race condition manifests: purely parallelised, there's a 50% chance of getting the right one and a 50% chance of not, in the case of two detections of the same pulse at slightly differing periods smeared across an unroll boundary. The particular race internal to the pulsefind is quite visible with the special app simply by setting unroll to 1, which gives the correct result in most/all of the test cases where the wrong pulses were detected under higher unroll. Unroll 1 of course effectively serialises those updates, but then you lose the parallelism.

A key question then is: why the different behaviour by platform and Cuda version? [Apart from the LLVM compiler in the driver generating different code for the different GPUs at runtime, and the 780 having Hyper-Q...] That's readily explainable in that Cuda, and nvCuda underneath it, sits on platform DirectX and/or OpenGL/CL functionality. Windows/WDDM in particular has constantly evolved in its use of parallelism; with that being key for gaming and PhysX, the driver optimisations are quite aggressive. Launches can be transparently fused, reordered, or rescheduled in any sequence, even to the point of breakage. That fusion/reordering is less pronounced on Linux, most likely due to the traditionally simpler driver model and the lower latency overhead needed to induce such switching in the first place. Linux, IIRC, only received basic Fermi-esque virtualisation of the memory subsystem relatively recently (excluding the special TCC unified memory). Newer drivers have it, so they will manifest the issues. [Importantly, the 780 vs 960 driver compiler will be JIT-generating different kernel paths, so in a sense the different behaviour can come down to the same race-sensitive code being scheduled differently. The LLVM driver compiler does this for each compute capability, and the 780 has an additional parallelising feature called Hyper-Q, allowing some pretty sophisticated reordering likely not active in the 960.] The OS X arrangement is less known to me, being quite closed/proprietary right through; however, the seemingly irrational behaviour matches similar parallelisation-induced arbitrary ordering of what is fundamentally a serial algorithm (until proper reductions are added, or it's shunted to unroll 1).

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
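To make the recording step concrete, here is a minimal CUDA sketch (hypothetical names throughout; this is not the special app's pulsefind code). The racy form does the traditional read-compare-write on one shared best-result record from many concurrent threads; the race-free form packs the score and period index into a single 64-bit key so one atomicMax decides the winner, serialising only the contended update.

```
// Hypothetical sketch of the "record the best pulse" step -- not the special
// app's real code.  Compile with e.g.: nvcc -arch=sm_35 pulse_race.cu
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Pack a non-negative score and a period index into one 64-bit key.
// Non-negative IEEE-754 floats order the same way as their bit patterns taken
// as unsigned integers, so atomicMax on the key picks the top score.
__device__ unsigned long long packBest(float score, int period)
{
    return (static_cast<unsigned long long>(__float_as_uint(score)) << 32)
         | static_cast<unsigned int>(period);
}

// RACY: every thread (one fold period each) does read-compare-write on the
// same record.  Serial (unroll 1) this is fine; in parallel, two near-equal
// detections smeared across an unroll boundary can interleave, and the
// "wrong" one may be the last writer.
__global__ void recordBestRacy(float* bestScore, int* bestPeriod,
                               const float* scores, int nPeriods)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nPeriods) return;
    if (scores[p] > *bestScore) {   // another thread can interleave here
        *bestScore  = scores[p];
        *bestPeriod = p;
    }
}

// RACE-FREE: one atomicMax on the packed key (64-bit atomicMax needs compute
// capability >= 3.5, which the GTX 780 has).
__global__ void recordBestAtomic(unsigned long long* best,
                                 const float* scores, int nPeriods)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nPeriods) return;
    atomicMax(best, packBest(scores[p], p));
}

int main()
{
    const int n = 1 << 15;                    // arbitrary number of fold periods
    float* h = new float[n];
    for (int i = 0; i < n; ++i)               // two near-identical peaks, as if one
        h[i] = (i == 1000 || i == 1001) ? 42.0f : 1.0f;  // pulse smeared over a boundary

    float* dScores; float* dBestScore; int* dBestPeriod; unsigned long long* dBest;
    cudaMalloc(&dScores, n * sizeof(float));
    cudaMalloc(&dBestScore, sizeof(float));
    cudaMalloc(&dBestPeriod, sizeof(int));
    cudaMalloc(&dBest, sizeof(unsigned long long));
    cudaMemcpy(dScores, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dBestScore, 0, sizeof(float));
    cudaMemset(dBestPeriod, 0, sizeof(int));
    cudaMemset(dBest, 0, sizeof(unsigned long long));

    recordBestRacy<<<(n + 255) / 256, 256>>>(dBestScore, dBestPeriod, dScores, n);
    recordBestAtomic<<<(n + 255) / 256, 256>>>(dBest, dScores, n);
    cudaDeviceSynchronize();

    unsigned long long best = 0;
    cudaMemcpy(&best, dBest, sizeof(best), cudaMemcpyDeviceToHost);
    unsigned int scoreBits = static_cast<unsigned int>(best >> 32);
    float score;
    std::memcpy(&score, &scoreBits, sizeof(score));
    printf("atomic winner: period %u, score %.1f\n",
           static_cast<unsigned int>(best & 0xffffffffu), score);

    cudaFree(dScores); cudaFree(dBestScore); cudaFree(dBestPeriod); cudaFree(dBest);
    delete[] h;
    return 0;
}
```

A full reduction per PulsePoT would be the more general fix; the packed-key atomic is only the smallest illustration of removing the read-compare-write window.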
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
> ... was under the impression that the race conditions were on display when all tasks for a WU overflowed...

[Adding] The overflow situation can be connected, but, excluding excessive OCs or other failures, it is a symptom rather than a cause.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Brent Norman · Joined: 1 Dec 99 · Posts: 2786 · Credit: 685,657,289 · RAC: 835
Jason, just to ease my curious mind :) Are you saying that if I ran 4 tasks with -unroll 5 on a 1080 with 20 SMs, it should 'in theory' reduce the inconclusives to 20% of current values, since the data chunks are larger?
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
> Jason, just to ease my curious mind :) Are you saying that if I ran 4 tasks with -unroll 5 on a 1080 with 20 SMs, it should 'in theory' reduce the inconclusives to 20% of current values, since the data chunks are larger?

Lower unroll should reduce the number of susceptible boundaries, yes, though it's better to also increase pfPeriodsPerLaunch at the same time, as far as you can without excessive lag. Eventually those will be automated [with an fps target or similar]. Naturally, Murphy's law says that if something can go wrong it will, so while the issue is still there, there's every chance you'll just realign the fold-period boundaries onto a new set of smeared pulses.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
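To picture the 'susceptible boundaries', a toy host-side sketch (assumed numbers and an assumed even split, not the special app's real period scheduling): with unroll = U the fold periods are divided into U chunks that run concurrently, giving roughly U - 1 boundaries where neighbouring periods land in different chunks, which is exactly where a pulse smeared across adjacent periods can be detected twice in parallel.

```
// Toy illustration only: how an unroll factor partitions the fold periods and
// where the "susceptible boundaries" fall.  Values and the even split are
// assumptions, not the special app's real scheduling.
#include <cstdio>
#include <algorithm>

int main()
{
    const int nPeriods = 4096;   // assumed period count for one PulsePoT pass
    const int unroll   = 5;      // as in the "-unroll 5" example above

    const int perChunk = (nPeriods + unroll - 1) / unroll;
    for (int c = 0; c < unroll; ++c) {
        const int first = c * perChunk;
        const int last  = std::min(nPeriods, first + perChunk) - 1;
        printf("chunk %d folds periods %d..%d concurrently with the others\n",
               c, first, last);
        if (last + 1 < nPeriods)
            printf("  boundary between periods %d and %d: a pulse smeared across"
                   " them can be detected twice, in parallel\n", last, last + 1);
    }
    // unroll - 1 boundaries in total; unroll 1 means no boundaries (serial order).
    return 0;
}
```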
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13985 · Credit: 208,696,464 · RAC: 304
> Jason, just to ease my curious mind :) Are you saying that if I ran 4 tasks with -unroll 5 on a 1080 with 20 SMs, it should 'in theory' reduce the inconclusives to 20% of current values, since the data chunks are larger?

My guess is no. Inconclusives would most likely be reduced, but due to the way the drivers & hardware schedule & process the work, and the type of work being done, you wouldn't see such a significant reduction. I'd expect a reduction, but I doubt it would scale in such a linear manner.

Grant
Darwin NT