Linux CUDA 'Special' App finally available, featuring Low CPU use

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1875464 - Posted: 28 Jun 2017, 8:24:04 UTC - in response to Message 1875460. All those people testing these Apps at Beta and no one picked this up? Nevermind. And you were just one of them, if I recall correctly :D Actually I've been avoiding Windows like the Plague since WinNSA Edition started trying to take over my Win8.1 machine. I just got it registered at SETI again and it showed the Last contact as Dec 2015. I missed all the Windows SoG stuff. Although I had to deal with it a little on my Mac. I still prefer the Non-SoG version on the Mac, it produces much lower Idle Wake Ups, and is just about as fast. Ever notice there aren't any SoG versions on My pages at Crunchers Anonymous? ID: 1875464 ·

Tom M Volunteer tester Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462	Message 1875466 - Posted: 28 Jun 2017, 8:59:14 UTC - in response to Message 1875441. Still waiting on my Seti Toaster :( We were going to fax it to your 3D printer but couldn't find a phone number.... A proud member of the OFA (Old Farts Association). ID: 1875466 ·

Tom M Volunteer tester Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462	Message 1875467 - Posted: 28 Jun 2017, 9:05:52 UTC - in response to Message 1875464. All those people testing these Apps at Beta and no one picked this up? I have been running SetiBeta off and on (including the 8.06(alt) and now the 8.07(alt) apps). I have been skimming this thread because I was thinking of trying out the "special sauce" app. It is clear to me that I currently don't have a clue on how to diagnose, test and read the results. It wouldn't surprise me if there are a lot of "us" know-nothings processing SetiBeta but not really able to debug anything. Tom A proud member of the OFA (Old Farts Association). ID: 1875467 ·

petri33 Volunteer tester Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156	Message 1875474 - Posted: 28 Jun 2017, 11:44:20 UTC - in response to Message 1875351. Hi, The bug is bugging me. Peak, Time, Period and Score + fft_len always the same. Freq and chirp vary. The first pulsefind in the code (8k len). Please explain more verbose here. First pulsefind is done on zero chirp and its length 8, not 8k. What did you mean by first and 8k here? The first pulse find in the source code file cudaAcc_pulsefind.cu. There are 2 versions of pulse find, the first one is used for fft lengths 1k-16k and the second l2m is used for fft len < 1k. So I was referring not to the order of pulse find runs but to the place of the kernel in the source code file. Not the l2m version. ??sorry? See above. About the rare errors: I'm running my cards at high clock speed for memory and GPU. They have fan at 100% and temperatures near 70C. That may be one cause for the errors. In summer the cards seem to have more lockups too. Petri To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones ID: 1875474 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1875481 - Posted: 28 Jun 2017, 12:19:13 UTC - in response to Message 1875467. All those people testing these Apps at Beta and no one picked this up? I have been running SetiBeta off and on (including the 8.06(alt) and now the 8.07(alt) apps). I have been skimming this thread because I was thinking of trying out the "special sauce" app. It is clear to me that I currently don't have a clue on how to diagnose, test and read the results. It wouldn't surprise me if there are a lot of "us" know-nothings processing SetiBeta but not really able to debug anything. Tom . . Putting hand up :) Stephen :) ID: 1875481 ·

rob smith Volunteer moderator Volunteer tester Send message Joined: 7 Mar 03 Posts: 22189 Credit: 416,307,556 RAC: 380	Message 1875483 - Posted: 28 Jun 2017, 12:58:09 UTC Last modified: 28 Jun 2017, 13:50:50 UTC There are two sides to Beta testing: "Bulk crunching" - different people crunching on a wide range of hardware to comb out the unforeseen bugs that can only be detected when a lot of test results are generated. Most on Beta will do this. These people should keep an eye on their results and report any "strangeness". "Debugging" - only a few have the time, equipment and skills to do this. They are the ones who will look at the pattern of the results being generated and, when they see something unexpected, act to investigate the problem. These days I fall very much into the first group as I don't have the time or the equipment (compilers etc) required to serve in the second role. I will swing a computer over to Beta when needed - thus I have just swung my Windows machine over to make sure that 8.07 (CPU) really does work correctly. Tonight when I get home I will turn on the graphics for a time as part of the testing, to make sure that doing the graphics doesn't interfere with the calculation part of the process. A "good" set of data being fed into a Beta test program should include data that is known to cause problems, as well as "generally representative" data. It is important to realise that Beta is about testing a new application and making sure that application "behaves correctly", which may involve having tasks that end in errors because they are intended to do just that. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? ID: 1875483 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1875485 - Posted: 28 Jun 2017, 13:25:54 UTC - in response to Message 1875474. Can't do that @petri, let it stew. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1875485 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1875490 - Posted: 28 Jun 2017, 13:47:14 UTC - in response to Message 1875467. It is clear to me that I currently don't have a clue on how to diagnose, test and read the results. The first step is to know there is a problem. That in itself is a problem because SETI allows for close to a 50% difference in results before the task fails validation. The task could be much different and unless you looked at the results you wouldn't know there was a problem. The problem is actually Worse at Beta since they started issuing tasks to three Hosts at once. Sometimes results that are Inconclusive spend very little time listed as Inconclusive before being Validated by the third Host, leaving No indication there was even a problem. The only way to tell is to check All the Validated tasks, which is unacceptable to most people. Someone suggested giving the Results that Validated as Weakly Similar half credit. That would probably also be unacceptable. My suggestion would be to simply Color the Results that validated as Weakly Similar a different color than the rest, say Yellow, instead of the normal color. That would make the problems easy to identify for just about anyone. Once you know there was a problem with the task it's just a matter of comparing results to find the difference and posting about the problem on the board. The first step is knowing there is actually something different; a different colored result would certainly help. ID: 1875490 ·

rob smith Volunteer moderator Volunteer tester Send message Joined: 7 Mar 03 Posts: 22189 Credit: 416,307,556 RAC: 380	Message 1875492 - Posted: 28 Jun 2017, 13:56:33 UTC That looks to be a good idea - visual and easy to spot the "problem" tasks. Another thought, picking up from your comment about inconclusives "vanishing on validation" - is that instead of purging from view "inconclusive" tasks that validate on #3 make them hang around for another 24 hours from validation to give folks a chance to see and capture them. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? ID: 1875492 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1875494 - Posted: 28 Jun 2017, 14:03:48 UTC - in response to Message 1875492. Well, they don't actually disappear. They just become Valid instead of Inconclusive, and spend the day in a different location. The color coding should take care of that, it will still be yellow even in the Valid column. Anything to help label a problem as a problem would help. ID: 1875494 ·

rob smith Volunteer moderator Volunteer tester Send message Joined: 7 Mar 03 Posts: 22189 Credit: 416,307,556 RAC: 380	Message 1875496 - Posted: 28 Jun 2017, 14:14:32 UTC ...accepted, but once in the valid pile they are "buried" - perhaps I should have said "buried" instead? But leaving them in the "inconclusive" pile for a bit longer would make them more noticeable. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? ID: 1875496 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1875523 - Posted: 28 Jun 2017, 17:25:31 UTC - in response to Message 1875324. So far, I've only noted one task with a problem using zi3v, and I'm pretty sure that's an isolated incident. Task 5828724704 originally was started on a GTX 750 Ti but, following a reboot, restarted on the GTX 960. Before the restart, it looks like it was running fine, but afterwards it went haywire, identifying 25 bogus Triplets with non-numeric peaks (i.e., "peak=-nan"). I imagine that it's just some sort of restart timing issue, though perhaps on a restart like that the memory usage spikes in some way. Only if it happens again will I really be concerned. Anyway, that task is currently in an Inconclusive state but I expect it to go Invalid once the tie-breaker reports in. Unfortunately, this was not an isolated incident. I've now had two more tasks, 5832726585 and 5832726581, which went haywire with zi3v following a restart. The first one originally ran on a GTX 750 Ti and restarted on the GTX 960, at which point it reported 15 bogus Triplets with "peak=-nan", similar to the one from Friday. The second one also started on a GTX 750 Ti but restarted on a different 750 Ti, this time quickly reporting 17 bogus Spikes. Both tasks seemed to be running fine before the shutdown and restart. Oh, oh. It appears that this problem following restarts isn't limited to zi3v. Task 5833903191, on a different host running x41p_zi3t2b (the Petri version w/o Blocking Sync), restarted at 9.60 percent. It then reported 27 bogus spikes after the restart. As with the tasks on the other machine that I previously reported, this task ran on a different GPU after the restart than the one it was running on before shutdown. On this machine, both are GTX 960s.
Spike: peak=38.84184, time=0.02623, d_freq=1420147781.16, chirp=-8.1252, fft_len=512
Spike: peak=303.4074, time=0.07866, d_freq=1420142478.3, chirp=-8.1252, fft_len=512
Spike: peak=265.8479, time=0.1311, d_freq=1420145357.97, chirp=-8.1252, fft_len=512
Spike: peak=512, time=0.1835, d_freq=1420146006.05, chirp=-8.1252, fft_len=512
Spike: peak=57.68067, time=0.2359, d_freq=1420143506.99, chirp=-8.1252, fft_len=512
Spike: peak=64, time=0.2884, d_freq=1420142190.5, chirp=-8.1252, fft_len=512
Spike: peak=247.2751, time=0.3408, d_freq=1420145299.05, chirp=-8.1252, fft_len=512
Spike: peak=511.7787, time=0.3932, d_freq=1420149037.03, chirp=-8.1252, fft_len=512
Spike: peak=36.57143, time=0.4457, d_freq=1420147987.56, chirp=-8.1252, fft_len=512
Spike: peak=160.278, time=0.4981, d_freq=1420145202.4, chirp=-8.1252, fft_len=512
Spike: peak=512, time=0.5505, d_freq=1420142264.66, chirp=-8.1252, fft_len=512
Spike: peak=511.8982, time=0.6029, d_freq=1420147166.12, chirp=-8.1252, fft_len=512
Spike: peak=43.89521, time=0.6554, d_freq=1420150827.8, chirp=-8.1252, fft_len=512
Spike: peak=512, time=0.7078, d_freq=1420150655.72, chirp=-8.1252, fft_len=512
Spike: peak=63.99994, time=0.7602, d_freq=1420143521.81, chirp=-8.1252, fft_len=512
Spike: peak=233.6334, time=0.8126, d_freq=1420144608.57, chirp=-8.1252, fft_len=512
Spike: peak=70.89178, time=0.8651, d_freq=1420142433.77, chirp=-8.1252, fft_len=512
Spike: peak=42.67112, time=0.9175, d_freq=1420143310.72, chirp=-8.1252, fft_len=512
Spike: peak=32.00002, time=0.9699, d_freq=1420147258.51, chirp=-8.1252, fft_len=512
Spike: peak=256, time=1.022, d_freq=1420149947.44, chirp=-8.1252, fft_len=512
Spike: peak=30.11851, time=1.075, d_freq=1420150671.81, chirp=-8.1252, fft_len=512
Spike: peak=120.4702, time=1.127, d_freq=1420143251.8, chirp=-8.1252, fft_len=512
Spike: peak=46.55062, time=1.18, d_freq=1420151014.28, chirp=-8.1252, fft_len=512
Spike: peak=220.6367, time=1.232, d_freq=1420148934.84, chirp=-8.1252, fft_len=512
Spike: peak=102.3289, time=1.285, d_freq=1420146855.41, chirp=-8.1252, fft_len=512
Spike: peak=47.20825, time=1.337, d_freq=1420145271.88, chirp=-8.1252, fft_len=512
Spike: peak=220.932, time=1.389, d_freq=1420150612.03, chirp=-8.1252, fft_len=512
To make it even more bizarre, there were 5 Spikes reported before the shutdown and those appear to match the wingmen's reports. But 5 + 27 should equal 32, whereas the reported Spike count for the -9 Overflow is still just 30. ID: 1875523 ·
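For anyone who wants to screen a task's log for this kind of post-restart garbage without reading every line, here is a minimal Python sketch. It assumes only the Spike line layout shown in the excerpt above; the function names and the `min_run` cutoff are made up for illustration, and Triplet lines would need their own pattern. The "same chirp and fft_len" check is a heuristic: a genuinely noisy task can look like this too, so a hit only means "go compare against the wingmen".

```python
import math
import re

# Pattern for one Spike entry, matching the layout in the log excerpt above.
SPIKE_RE = re.compile(
    r"Spike: peak=([^,\s]+), time=([^,\s]+), d_freq=([^,\s]+), "
    r"chirp=([^,\s]+), fft_len=(\d+)"
)


def parse_spikes(text):
    """Return one dict per Spike entry found in the log text."""
    spikes = []
    for peak, time, d_freq, chirp, fft_len in SPIKE_RE.findall(text):
        spikes.append({
            "peak": float(peak),      # float() also accepts "-nan"
            "time": float(time),
            "d_freq": float(d_freq),
            "chirp": float(chirp),
            "fft_len": int(fft_len),
        })
    return spikes


def looks_suspicious(spikes, min_run=10):
    """Heuristic only: NaN peaks, or a long run sharing one chirp and fft_len.

    min_run is a hypothetical cutoff, not a value taken from any app.
    """
    if any(math.isnan(s["peak"]) for s in spikes):
        return True
    if len(spikes) >= min_run:
        return len({(s["chirp"], s["fft_len"]) for s in spikes}) == 1
    return False


sample = (
    "Spike: peak=38.84184, time=0.02623, d_freq=1420147781.16, "
    "chirp=-8.1252, fft_len=512 "
    "Spike: peak=303.4074, time=0.07866, d_freq=1420142478.3, "
    "chirp=-8.1252, fft_len=512"
)
parsed = parse_spikes(sample)
```

Feeding the whole 27-line excerpt through `parse_spikes` and then `looks_suspicious` would flag it, since every entry shares chirp=-8.1252 and fft_len=512.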

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1875572 - Posted: 29 Jun 2017, 1:01:21 UTC Last modified: 29 Jun 2017, 1:14:46 UTC In the meantime, multiple response tweets from Eric (lmao). Analysing & compiling the info. my original query: @SETIEric If a 'best Gaussian' looks more 'Gaussianey' than the reportables, why may it not necessarily be reportable ? A Gaussian has to pass 3 thresholds. 1. A power threshold for the fit to even occur 2. A chisqr "gaussianness" threshold and 3. A null chisqr "integrated power" threshold. The "Best Gaussian" is chosen by a score computed from chisqr and nullchisqr. But for best Gaussian, the thresholds are lower than for reported Gaussians, especially early in the run (we always want a best Gaussian) So it's possible for a high scoring Gaussian not to meet the chisqr threshold and not be reported. Cool, seems to gel somewhat with what I thought. Will have to figure out which builds differ in what ways. my response: Thanks for the detailed responses :D. Context is we currently have different apps doing different things. Can now compare against intent. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1875572 ·
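Eric's description can be turned into a small Python sketch showing how a "best" Gaussian can end up outside the reported set. Everything concrete here is an assumption for illustration: the comparison directions (chisqr at or below a maximum, null chisqr at or above a minimum), the placeholder `score` formula, and all threshold numbers. None of it is taken from the stock or CUDA source.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    peak_power: float
    chisqr: float        # fit quality; assumed lower is better
    null_chisqr: float   # fit of the "no signal" case; assumed higher is better


@dataclass
class Thresholds:
    min_peak_power: float
    max_chisqr: float
    min_null_chisqr: float


def passes(g, t):
    """All three thresholds from Eric's list must be met."""
    return (g.peak_power >= t.min_peak_power
            and g.chisqr <= t.max_chisqr
            and g.null_chisqr >= t.min_null_chisqr)


def score(g):
    # Placeholder only: the thread says the score is computed from
    # chisqr and null chisqr but does not give the formula.
    return g.null_chisqr / g.chisqr


def classify(candidates, report_t, best_t):
    """Return (reported list, best Gaussian or None)."""
    reported = [g for g in candidates if passes(g, report_t)]
    eligible = [g for g in candidates if passes(g, best_t)]
    best = max(eligible, key=score, default=None)
    return reported, best


# Hypothetical numbers: 'best' thresholds are looser than 'report' thresholds.
report_t = Thresholds(min_peak_power=3.0, max_chisqr=1.4, min_null_chisqr=2.0)
best_t = Thresholds(min_peak_power=3.0, max_chisqr=10.0, min_null_chisqr=1.0)

a = Gaussian(peak_power=4.0, chisqr=1.3, null_chisqr=2.5)  # reportable
b = Gaussian(peak_power=5.0, chisqr=1.6, null_chisqr=8.0)  # high score, chisqr too high

reported, best = classify([a, b], report_t, best_t)
```

In this made-up case `b` outscores `a` but misses the reporting chisqr limit, so it becomes the best Gaussian without appearing among the reported ones, which is the situation being asked about.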

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1875574 - Posted: 29 Jun 2017, 1:06:25 UTC - in response to Message 1875572. In the meantime, multiple response tweets from Eric (lmao). Analysing & compiling the info. Nothing like tackling a complex issue 140 characters at a time. Oh, well, hope it all fits together. ID: 1875574 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1875579 - Posted: 29 Jun 2017, 1:25:54 UTC - in response to Message 1875523. Last modified: 29 Jun 2017, 1:27:09 UTC ... To make it even more bizarre, there were 5 Spikes reported before the shutdown and those appear to match the wingmen's reports. But 5 + 27 should equal 32, whereas the reported Spike count for the -9 Overflow is still just 30. As the spike processing is on GPU, it's parallelised, therefore theoretically it can pick up many more spikes, but will raise a -9 overflow exception during recording to the result file at 30, and bail. So they can show more in the log, but only store 30. What might be going wrong with restart though, is probably something with startup, such as the initial baseline smooth possibly being omitted by accident. Fortunately since the problem shows with spikes, it means there isn't much of complexity to check going on before that. Just the task load, smooth, FFT planning, Chirps, and FFT. So it should be relatively simple to remedy. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1875579 ·
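The 5 + 27 = 32 versus 30 puzzle can be reproduced with a toy model in Python. This is only one reading of the explanation above, not the app's code: a parallel pass hands back a whole batch of detections, all of them get printed to the log, and recording into the result stops once the cap of 30 is reached. The function name and arguments are hypothetical.

```python
MAX_SIGNALS = 30  # result-file cap mentioned in the thread


def record_batch(batch, stored, log, cap=MAX_SIGNALS):
    """Log every detection in the batch, store only up to the cap.

    Returns True when the cap has been reached (the -9 overflow case).
    """
    log.extend(batch)          # everything the parallel pass found is printed
    for signal in batch:
        if len(stored) >= cap:
            break              # overflow: stop recording and bail
        stored.append(signal)
    return len(stored) >= cap


# 5 spikes recorded before the restart, then one pass finds 27 more.
stored = [f"spike{i}" for i in range(5)]
log = list(stored)
overflow = record_batch([f"spike{i}" for i in range(5, 32)], stored, log)
```

After the call the log holds 32 entries while the stored result holds 30, matching the counts reported for task 5833903191.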

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1875581 - Posted: 29 Jun 2017, 1:37:35 UTC - in response to Message 1875579. Problems with restarts have existed forever with Petri's App. Since his tasks finish in under a few minutes I don't think he uses checkpoints. The problems can be reduced by setting the checkpoints to around 3 minutes as it seems most of the problems happen on tasks that have run less than 3 minutes before restarting. I still see a few problems but they are rare. ID: 1875581 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1875582 - Posted: 29 Jun 2017, 1:46:56 UTC - in response to Message 1875572. But for best Gaussian, the thresholds are lower than for reported Gaussians, especially early in the run (we always want a best Gaussian) So it's possible for a high scoring Gaussian not to meet the chisqr threshold and not be reported. That still only makes sense to me up until a Gaussian scores high enough to actually be reported, at which point I would think the "best" Gaussian would now have to be one of the reported ones. In other words, having a "best" Gaussian but no reported Gaussians seems fine (i.e., we always want a best Gaussian), but having one that isn't included in reported ones still just doesn't seem right. I don't get a clear sense from his response(s) that he understood that that's the situation we were wondering about. What might be going wrong with restart though, is probably something with startup, such as the initial baseline smooth possibly being omitted by accident. Fortunately since the problem shows with spikes, it means there isn't much of complexity to check going on before that. Just the task load, smooth, FFT planning, Chirps, and FFT. So it should be relatively simple to remedy. It's not just Spikes. Two of the ones I've gotten have been Spikes and two have been Triplets. ID: 1875582 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1875585 - Posted: 29 Jun 2017, 2:04:48 UTC - in response to Message 1875582. Last modified: 29 Jun 2017, 2:08:45 UTC Okey doke. Plenty to work with on both issues when I can. Fingers crossed I get some home & work things out of the way this week. [Edit:] Will walk through the stock CPU code against Eric's responses, query if I can't see why best shouldn't be a duplicate of a reportable if present. For the restart issue, I'll have a peek at the checkpointing and restart process before updating the alpha in svn, either commit as is and work on it incrementally, or fix it first if simple. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1875585 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1875586 - Posted: 29 Jun 2017, 2:05:45 UTC - in response to Message 1875581. Problems with restarts have existed forever with Petri's App. Since his tasks finish in under a few minutes I don't think he uses checkpoints. But he's not just coding for himself. Or is he? The problems can be reduced by setting the checkpoints to around 3 minutes as it seems most of the problems happen on tasks that have run less than 3 minutes before restarting. I still see a few problems but they are rare. I think I generally have my checkpoints set to 2 minutes, though I'll have to check when I start those boxes back up in a few hours. (They shut down for 5 hours on weekdays to avoid peak electric rates.) Having an app that doesn't respect checkpointing doesn't seem like one that should ever get a stamp of approval. Three of the four that I've experienced in the last several days have been guppi VLARs that ended up with total run times of 5:17 (restarted at 32.57%), 8:29 (restarted at 72.08%), and 9:22 (restarted at 71.23%). Regardless of the restart percentage, though, I would guess that those tasks overflowed very quickly after the restart, in which case even a 3 minute checkpoint wouldn't help. Only the new one I reported today wasn't a VLAR, and that one finished in 1:07 after restarting at 9.60%. ID: 1875586 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1875588 - Posted: 29 Jun 2017, 2:18:21 UTC - in response to Message 1875586. Problems with restarts have existed forever with Petri's App. Since his tasks finish in under a few minutes I don't think he uses checkpoints. But he's not just coding for himself. Or is he? Petri has repeatedly made clear to me that he's tweaking it more or less for his special situation, and that generalising it for wider use will need to come with a separate effort. Fortunately that area is kind of my speciality, and that becomes more feasible now the pulse thing is addressed. Time to work on this has been a problem for me, though I am expecting things to get better soon, especially since work is drying up, and I have better internet. I'll probably end up dedicating a portion of streaming airtime to open source development, as interest was high when I watched other developers stream, and it looked like fun to gasbag online. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1875588 ·
