21)
Message boards :
Number crunching :
What should I look for in a GPU?
(Message 1992618)
Posted 4 May 2019 by petri33 Post: Hi gt40, Thanks Keith! I kind of knew who was the one to fill my sentence out. -- Petri |
22)
Message boards :
Number crunching :
What should I look for in a GPU?
(Message 1992616)
Posted 4 May 2019 by petri33 Post: Hi gt40, Since you are running on Linux you might be interested in TBar's OneForAll (or something like that) solution. You could easily triple your daily output compared to the 'old' CUDA or SoG apps for NVIDIA on Linux. The link to the software is ... -- Petri |
23)
Message boards :
Number crunching :
have more GPUs than actually exist
(Message 1991698)
Posted 27 Apr 2019 by petri33 Post: V0.99 is coming. You can run 2 at a time to fill the initialisation and post processing gap. GPU part is run one at a time. |
24)
Message boards :
Number crunching :
High performance Linux clients at SETI
(Message 1990709)
Posted 19 Apr 2019 by petri33 Post: I decided to test offline the -pfb argument for values of 16 and 32. Observed absolutely no difference running with or without the argument. The differences in run times were hundredths of a second. Likely measurement error. Thank you Keith, you can read "my English". -pfb and -unroll are needed only when a pulse (or a suspect) is found. That is a rare event. Only then is the old code run. On noisy data a larger unroll may save a second or two, but the likelihood of an error is bigger. A noisebomb is a noiseblimb/blumb/blomb, and it is pure chance which data is reported from the very beginning of a data packet. Unroll 1 gives the closest match to the CPU version. Petri |
25)
Message boards :
Number crunching :
High performance Linux clients at SETI
(Message 1990459)
Posted 18 Apr 2019 by petri33 Post: Please don't add the 750 Ti's to that rig Tom as I'll be taking a benchmark from it. Hi, One is to test. Backup everything, go off-line during the test. I can say it runs with 4. TBar is running 7 or 11 GPUs on his rig so he might be able to say a definite yes or no. |
26)
Message boards :
Number crunching :
High performance Linux clients at SETI
(Message 1990295)
Posted 16 Apr 2019 by petri33 Post: thanks for the info Petri. Just to be current and make sure the code compiles in the future too. |
27)
Message boards :
Number crunching :
RTX 2080
(Message 1990288)
Posted 16 Apr 2019 by petri33 Post: Petri you continue to amaze me with what you have done with the Cuda code optimizations. Hi, Thanks! I sometimes amaze myself too. The new v0.98 search algorithm was something that came to my mind a long time ago. Now was the time to implement it. I did initial tests with pencil and squared paper, then in Python to get the address calculations right, and then in CUDA C. It turned out to be a success. I've already got some new ideas for how to make the program even faster. The debugger says there is some call overhead and waiting when transferring data from GPU to CPU for reporting. |
28)
Message boards :
Number crunching :
High performance Linux clients at SETI
(Message 1990282)
Posted 16 Apr 2019 by petri33 Post: Hi, unroll defaults to 1. You can set it with -unroll 2, or whatever suits your RAM and number of SMs, if you want to squeeze out a second or so. Using the value 1 possibly makes it a bit slower but more compatible with the official CPU version (1 means a sequential find stage). The search stage of the pulse find algorithm was completely rewritten. It does not need any buffer for temporary values. The scan is fully unrolled across all SM units but does not require any memory to store data. Earlier, the process read the values sequentially, wrote the sums back, read the values again, did the addition and wrote again, and so on. That needed a lot of memory for each participating SM. Indexes: 01(2)3456789 56(7)89 34 89 1 6 4 9 Now it reads them in the order 0538164927 and makes pairwise sums and sums of sums ... in one go. If something is found during the scan, the original routine kicks in and reports the pulse. |
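A minimal plain-C sketch of the "sums of sums in one go" idea described above, not the actual CUDA code. It assumes that each fold pairs position i with position i + ceil(len/2) and carries an unpaired middle element forward unchanged; that is only one reading of the index example, chosen because for ten samples it reads the data in the order 0 5 3 8 1 6 4 9 2 7. The function names are hypothetical.

```c
#include <stddef.h>

/* Value of the once-folded series at position j, computed on the fly
   from the original data, so no temporary buffer is needed. */
static float fold1_at(const float *x, size_t n, size_t j)
{
    size_t half = (n + 1) / 2;   /* length after one fold */
    size_t pairs = n / 2;        /* positions that have a partner */
    return j < pairs ? x[j] + x[j + half] : x[j];
}

/* Largest value of the twice-folded series, reading only the
   original array. Returns 0 for an empty input. */
float max_after_two_folds(const float *x, size_t n)
{
    size_t m = (n + 1) / 2;       /* length after the first fold */
    size_t half2 = (m + 1) / 2;   /* length after the second fold */
    size_t pairs2 = m / 2;        /* positions paired in the second fold */
    float best = 0.0f;
    for (size_t i = 0; i < half2; ++i) {
        float s = fold1_at(x, n, i);
        if (i < pairs2)
            s += fold1_at(x, n, i + half2);
        if (i == 0 || s > best)
            best = s;
    }
    return best;
}
```

For the samples 0..9 the twice-folded values are 0+5+3+8, 1+6+4+9 and the carried 2+7. The price of skipping the intermediate buffer is the extra index arithmetic inside fold1_at.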
29)
Message boards :
Number crunching :
RTX 2080
(Message 1990177)
Posted 15 Apr 2019 by petri33 Post: Thank you Keith! Excellent test report and nice findings concerning the performance (+500 GFlops) and validity (equal to or better than the old one) on different tasks. The finding that the idle time goes down is an interesting one too. It seems to be working nicely on the 10x0 series. Thanks to Oddbjornik, TBar and some other volunteer testers too. I have compiled a new executable that should run on new Linux versions and on the 'old' GTX780 (sm35) and upwards. I've sent that to TBar (and select partners) for reference. TBar and I are still working hard to get a stable, multiplatform-capable version that tolerates the differences introduced by different driver versions combined with different GPU architectures and Linux versions. TBar will get a software update in a minute or two. Now going to do that. Bye. Petri |
30)
Message boards :
Number crunching :
Setting up Linux to crunch CUDA90 and above for Windows users
(Message 1989989)
Posted 14 Apr 2019 by petri33 Post: . . @ Petri Hi Stephen, Looking at the results and the good validation rate of the latest tests, I think that 0.98 can be released for Linux as soon as TBar is ready, and for Mac when a stable configuration of driver/compiler version is found. Oddbjornik is working on running "one and a sixteenth of WUs at a time" so that some initialization and cleanup can be done on the CPU for one task while another task is doing the heavy processing on the GPU. Let's see if that brings an additional gain in total output and if we can add that to the new executable too. Petri |
31)
Message boards :
Number crunching :
Question about the special app for linux
(Message 1989934)
Posted 13 Apr 2019 by petri33 Post: Hi oddbjornik, there is a global int variable gCUDADevPref that holds the -device num parameter value. Each GPU should be allowed to run one task at a time. See PM for additional details. Petri |
32)
Message boards :
Number crunching :
RTX 2080
(Message 1989815)
Posted 12 Apr 2019 by petri33 Post: I don't think nvidia-smi shows the average. I think it is just a 1 second snapshot at time of invocation. You still can use a power limit if concerned about power usage with the new beta. Since it has far less memory transfers going on, the beta spends more time crunching than pushing data in and out of the memory array, so higher overall average power usage. You may have a word of wisdom there... The new version allows me to run the memory at slightly higher clocks since it does not stress the chips as much as any of the previous versions. To make that happen I had to add a substantial amount of address calculation to the code. Luckily enough the new cards support simultaneous integer and floating point calculation, so I can calculate addresses at the same time as I'm calculating scientific data (sum, min, max, ...) and doing memory fetch operations. Pulse finding is kind of like folding a measuring tape in half, leaving any odd folds as they are and continuing with an index of one more per odd fold. A recursive algorithm is easy to write but compiles slowly, uses a pile of stack variables and spills from the register file to memory. I overcame that by doing a deep loop (for in a for in a for ... 14 levels deep). The CUDA compiler unrolls six of the innermost loops :) !! |
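To illustrate swapping the recursion for a loop, here is a small plain-C sketch that walks the fold levels iteratively. It is not the 14-level nested loop from the app. It only computes the length of the series after each fold, and it assumes that "one more per odd fold" means rounding an odd length up when halving. The function name fold_lengths is hypothetical.

```c
#include <stddef.h>

/* Write the successive lengths of a series of length n as it is folded
   in half repeatedly, rounding up on odd lengths. Returns how many
   lengths were written (at most max_out). */
size_t fold_lengths(size_t n, size_t *out, size_t max_out)
{
    size_t count = 0;
    while (n > 1 && count < max_out) {
        n = (n + 1) / 2;      /* an odd length keeps its middle element */
        out[count++] = n;
    }
    return count;
}
```

Ten samples fold to lengths 5, 3, 2 and 1. Because the number of levels is known up front, each level can become one loop nest with fixed bounds, which gives the compiler something it can unroll.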
33)
Message boards :
Number crunching :
RTX 2080
(Message 1989793)
Posted 12 Apr 2019 by petri33 Post: Looking forward to the general release of the 0.98 special app. The words/ coming from you/ make a world/ of difference and true/ That is the shine/ of all of us Setizens/ giving us power and endurance/ to improve the GPU code/ and CPU too! P.S. I now know where to fork the initialize and wait mutex in the code. My guess is that a simple FILE *f = fopen("2GPU_lock", "w+") will do, with an fclose(f) in the icfft-loop (analyzeFuncs.cpp). If I remember correctly, all open files are closed at program end and no two fopens can succeed with the write flag; the OS should guarantee that. I'll test that tonight, tomorrow or whenever I have time from heating the water in my "palju" (Google that). My V0.98bXX uses a lot less memory than the one I have sent to 2 users to test with. (TBar is supplied with the latest advancements in the code as best I can.) Keep on reading this channel. I hope there will be news from the 2nd party tests. -- Petri |
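A note on the lock-file idea above: on Linux and other POSIX systems, fopen with "w+" does not by itself stop a second process from opening the same file for writing, so the guess needs an explicit lock. Below is a minimal sketch using the standard open() and flock() calls. The helper names gpu_lock_try and gpu_lock_release are hypothetical, and this is an illustration rather than what the app actually does.

```c
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Try to take an exclusive lock on the given lock file without blocking.
   Returns an open descriptor on success, or -1 if another holder has
   the lock (or on any error). */
int gpu_lock_try(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
        close(fd);            /* somebody else is in the GPU stage */
        return -1;
    }
    return fd;
}

/* Release the lock taken by gpu_lock_try. */
void gpu_lock_release(int fd)
{
    flock(fd, LOCK_UN);
    close(fd);
}
```

A task would call gpu_lock_try("2GPU_lock") before entering the GPU stage and gpu_lock_release() afterwards. Dropping LOCK_NB makes flock() wait for the other task instead of returning. The kernel drops the lock when the process exits, which matches the "all open files are closed at program end" assumption.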
34)
Message boards :
Number crunching :
RTX 2080
(Message 1989702)
Posted 11 Apr 2019 by petri33 Post:
Take a close look at Ian&Steve's tasks' stderr.txt during the weekend, and if you see V0.98 then take a look at the run times. On Monday everything should look mundane again. It is a test with V0.98 on 20x0 and I have asked Ian&Steve not to give the test executable to anyone else. Petri |
35)
Message boards :
Number crunching :
Question about the special app for linux
(Message 1989682)
Posted 11 Apr 2019 by petri33 Post: Looking to improve the performance for the special app, I have a question. Probably mainly for Petri: Hi Oddbjörnik, In short: Yes! Running one at a time and initializing one in the background makes sense. I'm glad you noticed that too. I like to let my GPU cool off for those seconds, but the super crunchers with their water-cooled units would like to have that feature (I guess) right now or preferably yesterday. Those seconds could really make a difference, especially when running a long batch of shorties. Implementing such a scheme is not so hard. The source code is available, and I'd be happy to include it in the code if someone has time to experiment, develop and test. The upcoming version has a much reduced memory footprint. You will all be able to experiment. (You can try with the current code to set -unroll 1 and run 2 at a time. My machine was slow with that.) Petri |
36)
Message boards :
Number crunching :
Titan V core clocks in compute mode
(Message 1989681)
Posted 11 Apr 2019 by petri33 Post: Hi everyone, I have not had much time to twiddle with the TITAN V. My first impressions and afterthoughts after a quick test were that P0 enables the graphics processing units too, the power usage jumps to twice the normal, and there is no real speed gain in Seti. I'm still running at P2 with keepP2, with +220 for the GPU and +220 for the memory. I'm waiting for the Tuesday outages to align with my free time. Being a teacher, the rest of the spring is going to be quite hectic at work. Well -- the summer is coming at a fast pace. There is noticeably less snow here now. The roads are ice free again! I'll keep reading about your experiments and results with the new options. Thank you for your good work! Petri P.S. A new version (0.98) is still being tested by me and TBar and a couple of not so random Setizens. Memory usage will drop. Some speed gain can be expected. |
37)
Message boards :
Number crunching :
Panic Mode On (116) Server Problems?
(Message 1989076)
Posted 7 Apr 2019 by petri33 Post: Valid SETI@home v8 tasks for computer 7475713 Database Error Database Error Warning: Invalid argument supplied for foreach() in /disks/carolyn/b/home/boincadm/projects/sah/html/inc/result.inc on line 757 Database Error Warning: Invalid argument supplied for foreach() in /disks/carolyn/b/home/boincadm/projects/sah/html/inc/result.inc on line 766 |
38)
Message boards :
Number crunching :
User achievements thread.........
(Message 1988994)
Posted 6 Apr 2019 by petri33 Post: Hi, I thought I did not understand French, but reading a technical article changed my view and opinion. It was readable, understandable and a nice presentation of different CUDA architectures, covering the most recent ones too. Petri some explanations about tensor core in French indeed :D ( search for chapter 4.3.7 ) http://www.info.univ-angers.fr/pub/richer/cuda_crs4.php |
39)
Message boards :
Number crunching :
Panic Mode On (116) Server Problems?
(Message 1988468)
Posted 2 Apr 2019 by petri33 Post: There is also a BIG file in the list... This one will take a while to split Hi, Why does it have DIAG(nose) in the name? a) Is it a diagnostic 'tape' to test the validity of the crunching software with artificially created 'hard' tasks? b) Is it a diagnosis tape containing an actual suspect (real) pulse? c) Is it a diagnostic test run of a new or an existing telescope? ö) Something else (an April Fools' tape?) Petri |
40)
Message boards :
Number crunching :
User achievements thread.........
(Message 1988053)
Posted 30 Mar 2019 by petri33 Post: Thanks! I'm sure it (new code) could help ATI/AMD and NV on Windoze too. Preliminary reports from TBar say that tasks validate even better than V0.97 when compared to the 'official' CPU version. My inconclusive count was dropping from near 300 to nearer 250 when I looked at it last time. The CPU does minimal work, and during the various stages the CPU runs in parallel with the GPU. The CPU mainly waits for the GPU to finish a kernel and fetches flags to see if any results should be fetched. The default is to run with CPU-friendly options. To make more use of a spare CPU core one can specify the -nobs option. (nobs = no blocking sync, i.e. active waiting in a busy loop.) ... but what if the CPU version of Seti MB could benefit from the new algorithm? No "read"-"write"-"read again" to the RAM. Petri Ffs!! You rock! |