Average Credit Decreasing?

kittyman Volunteer tester Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 1787883 - Posted: 16 May 2016, 16:12:04 UTC - in response to Message 1787881.
Every time I look and see my rac dropping I keep thinking there's a problem then I remember Credit Screw Not the cause anymore... it's now due to your exclusively NVidia farm receiving GUPPI VLAR work units on the GPUs. As noted much elsewhere, they process much more slowly than Arecibo MBs but pay the same credit. That's the main cause of my current decline.... 20 NV GPUs having a nasty time with the Guppies.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

Al Joined: 3 Apr 99 Posts: 1682 Credit: 477,343,364 RAC: 482	Message 1787890 - Posted: 16 May 2016, 16:21:17 UTC
Is it more of a driver issue (where we would have to wait for Nvidia to get their act together before there is any relief), the way that the Guppies are configured when they are split (something that can be addressed internally, though with the manpower crunch, is that likely anytime soon?), or something else that causes such a penalty for them on Nvidia hardware? Are things much better in the AMD world? I haven't run an AMD card for probably 15+ years, so I basically have no experience with them.

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1787891 - Posted: 16 May 2016, 16:22:48 UTC - in response to Message 1787890.
Not this time Al :), we're digging into computer science territory now :)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

Mr. Kevvy Volunteer moderator Volunteer tester Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319	Message 1787894 - Posted: 16 May 2016, 16:28:41 UTC - in response to Message 1787890. Last modified: 16 May 2016, 16:37:49 UTC
Al wrote: "Is it more of a driver issue (where we would have to wait for Nvidia to get their act together before there is any relief), the way that the Guppies are configured when they are split (something that can be addressed internally, though with the manpower crunch, is that likely anytime soon?), or something else that causes such a penalty for them on Nvidia hardware? Are things much better in the AMD world? I haven't run an AMD card for probably 15+ years, so I basically have no experience with them."
I haven't checked the source code enough yet to know for sure, but I have a suspicion: something in the CUDA framework doesn't like that VLAR work units have negligibly small angular size. This should exclude them from even checking for Gaussians, because the telescope is not crossing the signal, which is what causes one, so any code that checks for Gaussians shouldn't even run on a VLAR. The fact that it is running really slowly means that something in there is still using that angular size (what else but a Gaussian would need to use it?), and very likely shouldn't be. So the job is to find the code that does it, and not run it if the angular width is below the VLAR threshold. I'm going to try to bring myself up to speed to fix this thing, but I'm hoping someone will beat me to it... it's a lot of work getting there. :^)
Edit: Also when I get there I am not sure I will even recognize it when I see it...lol.

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1787897 - Posted: 16 May 2016, 16:37:49 UTC - in response to Message 1787894.
Mr. Kevvy wrote: "it's a lot of work getting there. :^)"
Correct, but no shortcuts :)

tullio Volunteer tester Joined: 9 Apr 04 Posts: 8797 Credit: 2,930,782 RAC: 1	Message 1787898 - Posted: 16 May 2016, 16:39:02 UTC - in response to Message 1787890.
Al wrote: "Is it more of a driver issue (where we would have to wait for Nvidia to get their act together before there is any relief), the way that the Guppies are configured when they are split (something that can be addressed internally, though with the manpower crunch, is that likely anytime soon?), or something else that causes such a penalty for them on Nvidia hardware? Are things much better in the AMD world? I haven't run an AMD card for probably 15+ years, so I basically have no experience with them."
I am crunching both SETI@home and SETI Beta GPU tasks on an AMD HD 7770 in my Linux box, and they take about one hour or less even if they are VLAR.
Tullio

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1787902 - Posted: 16 May 2016, 16:51:45 UTC - in response to Message 1787894. Last modified: 16 May 2016, 17:00:18 UTC
Mr. Kevvy wrote: "The fact that it is running really slowly means that something in there is still using that angular size (what else but a Gaussian would need to use it?), and very likely shouldn't be. So the job is to find the code that does it, and not run it if the angular width is below the VLAR threshold."
AR is used not only for Gaussians. It also defines how long the telescope stares at nearly the same point, and so defines the length of time over which data could be accumulated (so all PoT analysis uses it). For explanations of why VLAR is relatively harder for GPU vs. CPU compared with other ARs, look at a few of my recent posts, for example (actually it has been repeated a few times through the years). The effect also strongly depends on memory organization. Even at the same memory frequency, memory access on an NV CC1.x device (for example) and on an AMD device is very different. The so-called coalesced access requirement causes an enormous performance drop on early NV architectures in the case of random (or close to random from the hardware's point of view) access to memory. Later architectures improved this.

Mr. Kevvy Volunteer moderator Volunteer tester Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319	Message 1787907 - Posted: 16 May 2016, 16:58:05 UTC - in response to Message 1787902.
Raistmer wrote: "For explanations of why VLAR is relatively harder for GPU vs. CPU compared with other ARs, look at a few of my recent posts, for example (actually it has been repeated a few times through the years)."
Thanks for the details... do you have a link to any of these posts?

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1787909 - Posted: 16 May 2016, 17:00:46 UTC - in response to Message 1787907. Last modified: 16 May 2016, 17:05:41 UTC
Very short version: low AR => longer time staring at the same point => bigger data array for a single PoT search => failure to fit in cache, failure to get enough parallel data to fill all CUs (a longer single array means fewer such arrays, because the 1M matrix of data points remains constant), and a decreased computation/memory access ratio (because most of PulseFind is folding, i.e. simple additions, and this search has an increased share) => performance drop for devices with massive parallelism and big memory access latencies (which GPUs are).
EDIT: to find a person's posts: http://setiathome.berkeley.edu/forum_user_posts.php?userid=7779286
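
The trade-off in the "very short version" above can be put into numbers with a small sketch. It assumes the "1M matrix" means 2**20 data points, 4-byte values, and a 32 KiB cache; the last two figures and the function name `pot_layout` are illustrative assumptions, not values taken from the SETI@home code.

```python
TOTAL_POINTS = 1024 * 1024  # the "1M matrix" from the post, assumed to be 2**20 points


def pot_layout(array_len, bytes_per_point=4, cache_bytes=32 * 1024):
    """Return (number of independent arrays, bytes per array, fits_in_cache).

    bytes_per_point and cache_bytes are hypothetical figures chosen only
    to make the trend visible.
    """
    if array_len <= 0 or TOTAL_POINTS % array_len != 0:
        raise ValueError("array_len must be a positive divisor of TOTAL_POINTS")
    n_arrays = TOTAL_POINTS // array_len
    array_bytes = array_len * bytes_per_point
    return n_arrays, array_bytes, array_bytes <= cache_bytes


if __name__ == "__main__":
    for length in (1024, 8192, 131072):
        n, size, fits = pot_layout(length)
        print(f"len={length:>6}  arrays={n:>4}  bytes/array={size:>7}  fits cache: {fits}")
```

As the single array gets longer, the number of arrays available to spread across compute units shrinks and each array eventually stops fitting in the assumed cache, which is the double penalty the post describes.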

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1787919 - Posted: 16 May 2016, 17:20:14 UTC - in response to Message 1787898.
Al wrote: "Is it more of a driver issue (where we would have to wait for Nvidia to get their act together before there is any relief), the way that the Guppies are configured when they are split (something that can be addressed internally, though with the manpower crunch, is that likely anytime soon?), or something else that causes such a penalty for them on Nvidia hardware? Are things much better in the AMD world? I haven't run an AMD card for probably 15+ years, so I basically have no experience with them."
tullio wrote: "I am crunching both SETI@home and SETI Beta GPU tasks on an AMD HD 7770 in my Linux box, and they take about one hour or less even if they are VLAR. Tullio"
Hmmmm, are you running 2 at a time...or something? My ATI 7750 runs them in under 30 minutes, http://setiathome.berkeley.edu/result.php?resultid=4932770085, which is better than the 6850s and about the same as the 150 watt 6870. You might try adding some settings; try just the basic ones my 7750 is using: -sbs 256 -oclfft_tune_gr 256 -oclfft_tune_wg 128

Mr. Kevvy Volunteer moderator Volunteer tester Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319	Message 1787922 - Posted: 16 May 2016, 17:30:01 UTC - in response to Message 1787909. Last modified: 16 May 2016, 17:39:19 UTC
Raistmer wrote: "Very short version: low AR => longer time staring at the same point => bigger data array for a single PoT search => failure to fit in cache, failure to get enough parallel data to fill all CUs (a longer single array means fewer such arrays, because the 1M matrix of data points remains constant)"
OK, this helps. Since slewing Arecibo work units don't do this, the AR must be greater than the scope's "aperture". Let's say that the AR is 3x the size of the aperture of the scope. Then why not break the WU into that ratio of pieces (in this case 3) and run the pulsefind on each piece, then add the results? That way each piece, being of the same timebase as an Arecibo Gaussian, won't overload the cache.
Edit: This will take some tinkering due to pulses at the edge of each piece that end up in both of them...
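
One way to picture the splitting idea and its edge problem is to extend each piece slightly past its boundaries. The sketch below only computes index ranges; `split_with_overlap` and its parameters are hypothetical names for illustration, and nothing here is taken from the actual PulseFind code.

```python
def split_with_overlap(n_points, pieces, overlap):
    """Return (start, stop) index ranges that together cover range(n_points).

    Each piece is extended by `overlap` points on both sides (clipped to the
    array bounds), so a pulse no wider than `overlap` points that straddles a
    boundary lies wholly inside at least one piece.
    """
    if n_points < 1 or pieces < 1 or overlap < 0:
        raise ValueError("need n_points >= 1, pieces >= 1 and overlap >= 0")
    base = -(-n_points // pieces)  # ceiling division: nominal piece length
    ranges = []
    for i in range(pieces):
        if i * base >= n_points:
            break  # more pieces requested than the data can fill
        start = max(0, i * base - overlap)
        stop = min(n_points, (i + 1) * base + overlap)
        ranges.append((start, stop))
    return ranges


if __name__ == "__main__":
    print(split_with_overlap(12, 3, 1))
```

With overlap, a pulse near a boundary can be reported by two neighbouring pieces, so the results would still have to be merged and de-duplicated afterwards; that step is not shown here.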

tullio Volunteer tester Joined: 9 Apr 04 Posts: 8797 Credit: 2,930,782 RAC: 1	Message 1787923 - Posted: 16 May 2016, 17:34:22 UTC - in response to Message 1787919.
No, I am a total novice on GPUs and run them one at a time, both on the Linux box with AMD and the Windows 10 PC with a GTX 750 Ti OC, which runs mostly Einstein@home tasks, which take much longer but reward me with 4400 credits.
Tullio

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1787927 - Posted: 16 May 2016, 17:39:56 UTC - in response to Message 1787923.
tullio wrote: "No, I am a total novice on GPUs and run them one at a time, both on the Linux box with AMD and the Windows 10 PC with a GTX 750 Ti OC, which runs mostly Einstein@home tasks, which take much longer but reward me with 4400 credits. Tullio"
Multiply 'Seti Time' and credits by ~3.3 +/- and you will get Einstein time :D

tullio Volunteer tester Joined: 9 Apr 04 Posts: 8797 Credit: 2,930,782 RAC: 1	Message 1787931 - Posted: 16 May 2016, 17:57:01 UTC - in response to Message 1787927.
I made a rule-of-thumb calculation. Einstein@home is giving me 900 credits per hour of elapsed time on a GPU task, while SETI@home gives me 100 credits per hour, also on a GPU task. I am not crunching for credits; being a (retired) physicist, I am strongly interested in Einstein@home. Most of the Einstein projects use only CPUs; only the search for binary radio pulsars in Arecibo and Parkes data uses GPUs.
Tullio

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1787932 - Posted: 16 May 2016, 18:00:30 UTC - in response to Message 1787931.
Is that before, during, or after the v8 transition, and does it factor in that Guppi tasks run at lower efficiency?

rob smith Volunteer moderator Volunteer tester Joined: 7 Mar 03 Posts: 22200 Credit: 416,307,556 RAC: 380	Message 1787940 - Posted: 16 May 2016, 18:14:13 UTC
From memory of my walk-through of the credit code a few years back, and from running some simulations, it would appear that the code is all but incapable of correctly resolving the sort of changes that have happened in the last few months. It struggles with the slow increase in performance of both computer and application, but when you have a step change in the application and in the type of work unit coming out, it is all but incapable of working out what is going on. It will default to granting the lowest possible credit for each task, which will result in a drop in credit granted for a "standard" task of between 15 and 30%. Further, this drop will continue for about another 5 to 10% (of the initial credit). As has been said by some for a long time, CreditNew is far from "fit for purpose" - even if one assumes that the purpose is to allow comparison of performance between systems and applications within a project - it is far too dependent upon the performance of the individual computer, and NOT on the content of the task.
Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe?

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1787943 - Posted: 16 May 2016, 18:22:46 UTC - in response to Message 1787940.
Works for me, so the question is: what do we do about it?

tullio Volunteer tester Joined: 9 Apr 04 Posts: 8797 Credit: 2,930,782 RAC: 1	Message 1787944 - Posted: 16 May 2016, 18:25:04 UTC - in response to Message 1787932.
All my tasks are V8 now. Guppi tasks take a little longer, but not much.
Tullio

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1787945 - Posted: 16 May 2016, 18:29:30 UTC - in response to Message 1787944.
tullio wrote: "All my tasks are V8 now. Guppi tasks take a little longer, but not much. Tullio"
And is credit equal for equal work?

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1787946 - Posted: 16 May 2016, 18:35:34 UTC - in response to Message 1787922. Last modified: 16 May 2016, 18:42:38 UTC
Mr. Kevvy wrote: "Then why not break the WU into that ratio of pieces (in this case 3) and run the pulsefind on each piece, then add the results? That way each piece, being of the same timebase as an Arecibo Gaussian, won't overload the cache."
Well, I will skip any discussion of where the number 3 comes from. But in essence that is the way to deal with arrays that are too long: split them into subarrays where possible. This has its own pluses and minuses:
+ a smaller number of data points fits a smaller cache better,
+ more separate chains of data load a parallel device better,
- the parts need to be assembled back, that is, additional synching between the processing of the data parts.
Also, it is not always possible to select even loosely independent parts of a data array. Consider the folding algorithm (in the real MultiBeam PulseFind, folding is also done by 3 and 5) in its simplest form (as it is implemented in AstroPulse): one needs to take 2 numbers separated by an "arbitrary" stride (in real life, computed from, let's say, the data recording params and the period one wants to analyze), add them, and put the sum in the next array. Then repeat the same (probably with a different stride) on the new array, and so on. [And, of course, check at each iteration whether anything is over threshold and select the best of them, which is another synching point in this reduction process.] Launching a separate kernel for each cycle would be an absolute performance killer. So, one pass. Then one should know that global memory is considered asynchronous between workitems in the same kernel. That is, if CU0 writes something to cellN and CU1 reads from cellN, the order of these operations is undefined. So when one tries to split the array into parts, one can have synching only inside a workgroup (256 workitems for AMD, up to 2048 for modern NV).
Obviously, the part of the array handled by a single workgroup must be big enough to include all the data that constitute the last point after folding. That limits the ability of the "divide and conquer" approach in this case.
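
The folding step described above, in its simplest form, can be sketched on the CPU as follows. This is a plain illustration and not the AstroPulse or MultiBeam implementation: the names `fold_once` and `fold_search`, the choice of strides, and the single unnormalized threshold are all assumptions made for the example.

```python
def fold_once(data, stride):
    """Add each point to the point `stride` positions later; return the new, shorter array."""
    if not 0 < stride < len(data):
        raise ValueError("stride must be between 1 and len(data) - 1")
    return [data[i] + data[i + stride] for i in range(len(data) - stride)]


def fold_search(data, strides, threshold):
    """Fold repeatedly and track the strongest point seen over `threshold`.

    Returns (value, fold_level, index) for the best candidate, or None.
    A real search would normalize per level; this sketch uses one fixed threshold.
    """
    best = None
    current = list(data)
    for level, stride in enumerate(strides, start=1):
        current = fold_once(current, stride)
        peak_index = max(range(len(current)), key=current.__getitem__)
        peak = current[peak_index]
        # The check after every fold is the extra synching point mentioned in the post.
        if peak > threshold and (best is None or peak > best[0]):
            best = (peak, level, peak_index)
    return best


if __name__ == "__main__":
    # A pulse repeating every 4 samples, at indices 1, 5, 9 and 13.
    signal = [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
    print(fold_search(signal, strides=(4, 8), threshold=3))
```

In this toy input the final folded point at index 1 is built from samples 1, 5, 9 and 13, which span almost the whole array. That is the dependency the post points to: whichever workgroup produces that last point needs all of those samples, which restricts how the array can be cut into independent parts.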

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.