Average Credit Decreasing?
Author | Message |
---|---|
kittyman Send message Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004 |
Quote: "Every time I look and see my RAC dropping I keep thinking there's a problem, then I remember Credit Screw." That's the main cause of my current decline.... 20 NV GPUs having a nasty time with the Guppies. "Time is simply the mechanism that keeps everything from happening all at once." |
Al Send message Joined: 3 Apr 99 Posts: 1682 Credit: 477,343,364 RAC: 482 |
Is it more of a driver issue (where we would have to wait for Nvidia to get their act together before there is any relief), the way that the Guppies are configured when they are split (something that can be addressed internally, though with the manpower crunch, is that likely anytime soon?), or something else that causes such a penalty for them on Nvidia hardware? Are things much better in the AMD world? I haven't run an AMD card for probably 15+ years, so I basically have no experience with them. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Not this time Al :), we're digging into computer science territory now :) "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Mr. Kevvy Send message Joined: 15 May 99 Posts: 3806 Credit: 1,114,826,392 RAC: 3,319 |
Al wrote: "Is it more of a driver issue (where we would have to wait for Nvidia to get their act together before there is any relief), the way that the Guppies are configured when they are split (something that can be addressed internally, though with the manpower crunch, is that likely anytime soon?), or something else that causes such a penalty for them on Nvidia hardware? Are things much better in the AMD world? I haven't run an AMD card for probably 15+ years, so I basically have no experience with them." I haven't checked the source code enough yet to know for sure, but I have a suspicion: something in the CUDA framework doesn't like that VLAR work units have negligibly small angular size. That should exclude them from even checking for Gaussians, because a Gaussian is caused by the telescope crossing the signal, which is not happening here, so any code that checks for Gaussians shouldn't even run on a VLAR. The fact that it runs really slowly means that something in there is still using that angular size (what else but a Gaussian would need to use it?), and very likely shouldn't be. So the job is to find the code that does it, and skip it when the angular width is below the VLAR threshold. I'm going to try to bring myself up to speed to fix this thing, but I'm hoping someone will beat me to it... it's a lot of work getting there. :^) Edit: Also, when I get there I am not sure I will even recognize it when I see it...lol. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Mr. Kevvy wrote: "it's a lot of work getting there. :^)" Correct, but no shortcuts :) |
tullio Send message Joined: 9 Apr 04 Posts: 8797 Credit: 2,930,782 RAC: 1 |
Al wrote: "Is it more of a driver issue (where we would have to wait for Nvidia to get their act together before there is any relief), the way that the Guppies are configured when they are split (something that can be addressed internally, though with the manpower crunch, is that likely anytime soon?), or something else that causes such a penalty for them on Nvidia hardware? Are things much better in the AMD world? I haven't run an AMD card for probably 15+ years, so I basically have no experience with them." I am crunching both SETI@home GPU tasks and SETI Beta GPU tasks on an AMD HD 7770 in my Linux box, and they take about one hour or less even if they are VLAR. Tullio |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
AR is used for more than Gaussians. It also defines how long the telescope stares at nearly the same point, and so the length of time over which data could be accumulated (all PoT analysis uses it). For explanations of why VLAR is relatively harder for GPU than for CPU, compared with other ARs, see a few of my recent posts, for example (it has actually been repeated a few times over the years). The effect also depends strongly on memory organization. Even at the same memory frequency, memory access on an NV CC1.x device (for example) and on an AMD device is very different. Failing to achieve so-called coalesced access causes an enormous performance drop on early NV architectures when memory access is random (or close to random from the hardware's point of view). Later architectures improved this. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
very short version: low AR => longer time staring at the same point => bigger data array for a single PoT search => failure to fit in cache, failure to get enough parallel data to fill all CUs (a longer single array means fewer such arrays, because the 1M matrix of data points remains constant), decreased computation/memory access ratio (because most of PulseFind is folding (simple additions) and this search has an increased share) => performance drop for devices with massive parallelism and big memory access latencies (which GPUs are). EDIT: to find a person's posts: http://setiathome.berkeley.edu/forum_user_posts.php?userid=7779286 |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Al wrote: "Is it more of a driver issue (where we would have to wait for Nvidia to get their act together before there is any relief), the way that the Guppies are configured when they are split (something that can be addressed internally, though with the manpower crunch, is that likely anytime soon?), or something else that causes such a penalty for them on Nvidia hardware? Are things much better in the AMD world? I haven't run an AMD card for probably 15+ years, so I basically have no experience with them." Hmmmm, are you running 2 at a time...or something? My ATI 7750 runs them in under 30 minutes, http://setiathome.berkeley.edu/result.php?resultid=4932770085, which is better than the 6850s and about the same as the 150 watt 6870. You might try adding some settings; try just the basic ones my 7750 is using: -sbs 256 -oclfft_tune_gr 256 -oclfft_tune_wg 128 |
Mr. Kevvy Send message Joined: 15 May 99 Posts: 3806 Credit: 1,114,826,392 RAC: 3,319 |
Raistmer wrote: "very short version:" OK, this helps. Since slewing Arecibo work units don't do this, the AR must be greater than the scope's "aperture". Let's say that the AR is 3x the size of the aperture of the scope. Then why not break the WU into that ratio of pieces (in this case 3), run the pulsefind on each piece, and then add the results? That way each piece, having the same timebase as an Arecibo Gaussian, won't overload the cache. Edit: This will take some tinkering due to pulses at the edge of each piece that end up in both of them... |
tullio Send message Joined: 9 Apr 04 Posts: 8797 Credit: 2,930,782 RAC: 1 |
No, I am a total novice on GPUs and run them one at a time, both on the Linux box with AMD and the Windows 10 PC with a GTX 750 Ti OC, which runs mostly Einstein@home tasks, which take much longer but reward me with 4400 credits. Tullio |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
tullio wrote: "No, I am a total novice on GPUs and run them one at a time, both on the Linux box with AMD and the Windows 10 PC with a GTX 750 Ti OC, which runs mostly Einstein@home tasks, which take much longer but reward me with 4400 credits." Multiply 'Seti Time' and credits by ~3.3 +/- and you will get Einstein time :D |
tullio Send message Joined: 9 Apr 04 Posts: 8797 Credit: 2,930,782 RAC: 1 |
I made a rule-of-thumb calculation. Einstein@home is giving me 900 credits per hour of elapsed time on a GPU task, while SETI@home gives me 100 credits per hour, also on a GPU task. I am not crunching for credits; being a (retired) physicist, I am strongly interested in Einstein@home. Most of the Einstein projects use only CPUs; only the search for binary radio pulsars in Arecibo and Parkes data uses GPUs. Tullio |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Is that before, during, or after the v8 transition, and does it factor in that Guppi tasks run at lower efficiency? |
rob smith Send message Joined: 7 Mar 03 Posts: 22535 Credit: 416,307,556 RAC: 380 |
From memory of my walk-through of the credit code a few years back, and from running some simulations, it would appear that the code is all but incapable of correctly resolving the sort of changes that have happened in the last few months. It struggles with the slow increase in performance of both computer and application, but when you have a step change in both the application and the type of work unit coming out, it is all but incapable of working out what is going on. It will default to granting the lowest possible credit for each task, which will result in a drop in credit granted for a "standard" task of between 15 and 30%. Further, this drop will continue for about another 5 to 10% (of the initial credit). As some have said for a long time, CreditNew is far from "fit for purpose". Even if one assumes that the purpose is to allow comparison of performance between systems and applications within a project, it is far too dependent upon the performance of the individual computer, and NOT on the content of the task. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Works for me, so the question is: what do we do about it? |
tullio Send message Joined: 9 Apr 04 Posts: 8797 Credit: 2,930,782 RAC: 1 |
All my tasks are V8 now. guppi tasks take a little longer, but not much. Tullio |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
tullio wrote: "All my tasks are V8 now. guppi tasks take a little longer, but not much." And credit is equal for equal work? |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Well, I'll skip any discussion of where the number 3 comes from. But in essence that is the way to deal with arrays that are too long: split them into subarrays where possible. This has its own pluses and minuses: (+) fewer data points fit better in a smaller cache; (+) more separate chains of data load a parallel device better; (-) the parts need to be assembled back, that is, additional syncing between the processing of data parts. Also, it is not always possible to select even loosely independent parts of a data array. Consider the folding algorithm (in the real MultiBeam PulseFind, folding is also done by 3 and 5) in its simplest form (as implemented in AstroPulse): one takes 2 numbers separated by an "arbitrary" stride (in real life, computed from, say, the data recording params and the period to analyze), adds them, and puts the result in the next array. Then one repeats the same (probably with a different stride) on the new array, and so on. [And, of course, each iteration checks whether anything is over threshold and selects the best of them, which is another syncing point in this reduction process.] Launching a separate kernel for each cycle would be an absolute performance killer. So it has to be one pass. Then one should know that global memory is considered asynchronous between workitems in the same kernel. That is, if CU0 writes something to cellN and CU1 reads from cellN, the order of these operations is undefined. So when one tries to split the array into parts, syncing is available only inside a workgroup (256 workitems for AMD, up to 2048 for modern NV). Obviously, the part of the array handled by a single workgroup must be big enough to include all the data that make up the last point after folding. That limits the "divide and conquer" approach in this case. |
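
Editor's sketch of the folding step described in the last post. This is a minimal illustration in plain Python, not the actual PulseFind or AstroPulse code: the function names (`fold_once`, `fold_search`, `span_needed`), the fixed list of strides, and the test data are all assumptions made for the example. It only shows the idea of adding samples separated by a stride, repeating on the new array, and checking each level against a threshold, plus how many contiguous input samples feed one final folded point.

```python
from typing import List, Optional, Sequence, Tuple


def fold_once(data: Sequence[float], stride: int) -> List[float]:
    """One folding step: add each sample to the one `stride` positions later."""
    return [data[i] + data[i + stride] for i in range(len(data) - stride)]


def fold_search(
    data: Sequence[float], strides: Sequence[int], threshold: float
) -> Optional[Tuple[float, int, int]]:
    """Fold repeatedly and return (peak, level, index) of the best
    over-threshold peak seen at any level, or None if nothing qualifies."""
    best: Optional[Tuple[float, int, int]] = None
    current = list(data)
    for level, stride in enumerate(strides, start=1):
        if stride >= len(current):
            break  # array too short to fold at this stride
        current = fold_once(current, stride)
        peak = max(current)
        # The per-level threshold check is the extra sync point in the reduction.
        if peak > threshold and (best is None or peak > best[0]):
            best = (peak, level, current.index(peak))
    return best


def span_needed(strides: Sequence[int]) -> int:
    """Contiguous input samples that contribute to one final folded point."""
    return sum(strides) + 1


# Hypothetical data: four unit pulses, 8 samples apart, starting at index 3.
data = [0.0] * 32
for pos in (3, 11, 19, 27):
    data[pos] = 1.0

print(fold_search(data, strides=[8, 16], threshold=1.5))  # (4.0, 2, 3)
print(span_needed([8, 16]))  # 25
```

In this toy case one final point draws on 25 consecutive input samples, so any piece handed to a single workgroup would have to contain at least that many, and neighbouring pieces would need to overlap. That is the limit on splitting that the post describes.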