Linux CUDA 'Special' App finally available, featuring Low CPU use
Author | Message |
---|---|
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
I just wish the mechanism that has supposedly been in place for the longest time actually worked. But I agree, there should be new mechanisms put into play to prevent these bad hosts from getting any more work until they clean up their act. As I said, if only the existing mechanism worked consistently: report a bad task and get penalized, then have the penalty reduced in real time for each valid task reported. . . Ironically, that process works big time if you abort any tasks: your downloads are crippled until you have uploaded x number of valid tasks (it counts them as it goes but does not indicate the target you have to reach), and then your downloads are re-enabled. If it did that with these delinquent hosts, they would be shut down, as they return very few valid results but dozens or hundreds of invalids. Stephen |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Petri is apparently testing a new version of the Special App. I don't know what improvements may be incorporated in it, but so far it still appears to have the Spike/Autocorr reporting sequence issue that seemed to start with zi3x. Workunit 2793219570 (08mr07ac.5673.4571.13.40.249) Task 6264299405 (S=28, A=2, P=0, T=0, G=0, BS=62.65578, BG=1.97871) v8.08 (alt) windows_x86_64 Task 6264299406 (S=25, A=5, P=0, T=0, G=0, BS=52.88408, BG=1.978711) x41p_zi3xs4, Cuda 9.10 special The first 25 Spikes all match up fine, and the 2 Autocorrs reported by v8.08(alt) are also found in the zi3xs4 result. However, the last 3 Spikes have been replaced by Autocorrs. As I've noted before, it's simply a matter of where the 30-signal overflow cutoff falls in the different reporting sequences. The tiebreaker is assigned to my Linux box that currently runs the Cuda 8.00 version of zi3v, so I'm pretty much expecting it to agree with the v8.08 (alt) result. It will probably run sometime overnight. |
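To illustrate the point about where the 30-signal overflow cutoff falls, here is a minimal sketch. The two signal orderings are hypothetical (the thread does not give the real interleaving); they are chosen only so that the same set of detections, 28 Spikes and 5 Autocorrs, reproduces the two summaries above when truncated at 30 signals.

```python
from collections import Counter

OVERFLOW_LIMIT = 30  # the 30-signal overflow cutoff mentioned in the post


def reported_counts(sequence, limit=OVERFLOW_LIMIT):
    """Count signal types among the first `limit` signals in reporting order."""
    return Counter(sequence[:limit])


# Hypothetical orderings of the SAME detections ("S" = Spike, "A" = Autocorr).
# Assumed stock-style order: the last 3 Spikes land before the cutoff.
stock_order = ["S"] * 25 + ["A"] * 2 + ["S"] * 3 + ["A"] * 3
# Assumed Special App order: all 5 Autocorrs land before the last 3 Spikes.
special_order = ["S"] * 25 + ["A"] * 5 + ["S"] * 3

print(reported_counts(stock_order))
print(reported_counts(special_order))
```

With these assumed orderings the first list truncates to S=28, A=2 and the second to S=25, A=5, even though both apps found identical signals.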
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Well, that's interesting. The result from my host running zi3v Cuda 8.00 matched the zi3xs4 Cuda 9.10. I thought the change to the Spike/Autocorr reporting sequence first appeared after zi3v, but perhaps not. I guess I'll have to revisit the Inconclusives from zi3v Cuda 8.00 and see if I've been overlooking those. |
Zalster Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
I'm more concerned with how the system rewarded you for your work, 1 credit.... I didn't expect your result to be different from Petri's, especially if all he is doing is modifying the base algorithm. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"I'm more concerned with how the system rewarded you for your work, 1 credit...." Just like the old days.....exactly 1 credit for every WU completed! Not really. Looking at my recent overflow results, I see credit ranging anywhere from 0.58 to 1.72 for similar run times. That's pretty "normal". :^) Actually, I thought that particular change in reporting sequence first started showing up with the zi3x version of the Special App, while the zi3v and earlier versions conformed to the standard sequence. Perhaps I was wrong about that, so I'll have to take another look at my Inconclusives when I get a chance. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"Actually, I thought that particular change in reporting sequence first started showing up with the zi3x version of the Special App, while the zi3v and earlier versions conformed to the standard sequence. Perhaps I was wrong about that, so I'll have to take another look at my Inconclusives when I get a chance." Yep, looks like I was totally wrong about when that Spike/Autocorr reporting sequence problem first started showing up. In fact, it seems to have been there at least since zi3t2b, as seen in these two Inconclusives found on one of my hosts: Workunit 2745816919 (31ja07aa.16485.7025.11.38.12) Task 6165476063 (S=22, A=8, P=0, T=0, G=0, BS=59.23683, BG=2.030948) v8.22 (opencl_nvidia_SoG) windows_intelx86 Task 6165476064 (S=19, A=11, P=0, T=0, G=0, BS=56.80987, BG=2.030948) x41p_zi3t2b, Cuda 8.00 special Workunit 2749963764 (11mr07ad.3304.25021.15.42.246) Task 6174104968 (S=23, A=7, P=0, T=0, G=0, BS=54.66829, BG=3.430506) v8.08 (alt) windows_x86_64 Task 6174104969 (S=22, A=8, P=0, T=0, G=0, BS=54.66837, BG=3.430513) x41p_zi3t2b, Cuda 8.00 special And here's a bit of a bizarre one with zi3v. The tiebreaking host (running v8.03 x86_64-apple-darwin) has actually already reported, agreeing with the SoG result, of course, but the validator missed it the first time around, thanks to one of our extended outages. That means the WU has two tasks showing as Inconclusive and one "waiting for validation". It will have to wait for the original deadline for the third task before the validator can wrap this one up. It looks like that should happen on New Year's day. 
Workunit 2774983914 (09ja07ac.28081.2935.10.37.11) Task 6226304497 (S=22, A=8, P=0, T=0, G=0, BS=52.74569, BG=0) x41p_zi3v, Cuda 8.00 special Task 6226304498 (S=25, A=5, P=0, T=0, G=0, BS=66.7202, BG=0) v8.22 (opencl_nvidia_SoG) windows_intelx86 Apparently, I chose to ignore these types of Inconclusives early on and they only started getting my attention again when they showed up with wingmen who were running zi3x and later. In any event, I just went through my complete list of current Inconclusives and discovered that, out of 169 of them on my Linux boxes, 43 look like they're due to this Spike/Autocorr reporting sequence issue, based on the summaries in my list. I didn't look at the details for each, but even assuming that a handful might actually have other issues in play, it means that about 20-25% of my Inconclusives could very well go away if that issue was addressed. |
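As a rough illustration of how such Inconclusives might be flagged from the summaries alone, here is a minimal sketch. The heuristic is an assumption of this sketch, not the method used to build the list above: it treats a pair as a likely Spike/Autocorr sequence case when both results total exactly 30 signals, the Pulse/Triplet/Gaussian counts agree, and only the S/A split differs.

```python
OVERFLOW_LIMIT = 30  # overflow cutoff referred to in the thread


def looks_like_sequence_swap(a, b, limit=OVERFLOW_LIMIT):
    """Heuristic: both results overflowed and only the Spike/Autocorr split differs.

    `a` and `b` map the summary keys S, A, P, T, G to integer counts.
    """
    keys = ("S", "A", "P", "T", "G")
    both_overflowed = all(sum(r[k] for k in keys) == limit for r in (a, b))
    others_match = all(a[k] == b[k] for k in ("P", "T", "G"))
    return both_overflowed and others_match and a["S"] != b["S"]


def share_percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Workunit 2745816919 from the post above: SoG vs. x41p_zi3t2b.
sog = dict(S=22, A=8, P=0, T=0, G=0)
special = dict(S=19, A=11, P=0, T=0, G=0)
print(looks_like_sequence_swap(sog, special))

# 43 of 169 Inconclusives attributed to the reporting sequence issue.
print(share_percent(43, 169))
```

On the figures quoted in the post, 43 out of 169 works out to roughly 25%, which is the upper end of the "about 20-25%" estimate once a handful of cases with other causes are set aside.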
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
It appears that Petri's latest version of the Special App may have a change to "Best pulse" reporting, so I've added that to my Inconclusives list (as "BP="). In this non-overflow WU, the "Best pulse" reported by zi3xs4 differs from zi3v, but it still doesn't match SoG, either, so another tiebreaker will be required. None of the apps found an actual reportable Pulse in this WU. Workunit 2792762274 (15fe07ad.18728.19295.15.42.55) Task 6263350694 (S=12, A=0, P=0, T=4, G=0, BS=28.19272, BG=0, BP=0.3216815) x41p_zi3xs4, Cuda 9.10 special Task 6273408457 (S=12, A=0, P=0, T=4, G=0, BS=28.19279, BG=0, BP=1.343807) v8.22 (opencl_nvidia_SoG) windows_intelx86 Task 6274464485 (S=12, A=0, P=0, T=4, G=0, BS=28.19273, BG=0, BP=0.2401133) x41p_zi3v, Cuda 9.00 special |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Sorry if this is the wrong thread to post in, but I can't find a better one since I use the CUDA90 builds. I just saw something strange on one of the blc5 WUs: https://setiathome.berkeley.edu/result.php?resultid=6284598198 Apparently it put the thread in a loop; it runs for minutes and does not end. The other blc5 WUs are crunched in less than 3 min, while this one takes more than 10 and the remaining time gets stuck at > 80% done. So I aborted it. What is the right thing to do if that happens again? Abort it or leave it running for more time? Edit: After I aborted the WU and looked at the Stderr output, I saw this: Restarted at 81.55 percent, with setiathome enhanced x41p_zi3v, Cuda 9.00 special Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements. Sigma 66 Sigma > GaussTOffsetStop: 66 > -2 Thread call stack limit is: 1k Find triplets Cuda kernel encountered too many triplets, or bins above threshold, reprocessing this PoT on CPU... err = 1 |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Should have just let it run. Your clue is this: Find triplets Cuda kernel encountered too many triplets, or bins above threshold, reprocessing this PoT on CPU... err = 1 The app is describing exactly what it is doing. Found too many triplets, so shifting the computation over to the CPU for completion. The task would have finished on the CPU and reported. No harm, no foul. The only thing that can bite you is if the completion time exceeds the max time allotted for a GPU task, which is based on the APR and the rsc_fpops_est number. If you get into that condition, the task errors out and you get no credit for it. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
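A back-of-the-envelope sketch of that time limit follows. It rests on an assumption about the stock BOINC client, namely that the elapsed-time limit is the workunit's rsc_fpops_bound divided by the speed (flops) assumed for the app version, where that speed tracks the host's APR and rsc_fpops_bound is normally set by the project as some multiple of rsc_fpops_est. All numbers below are hypothetical and not taken from any real SETI@home workunit.

```python
def estimated_seconds(rsc_fpops_est, flops):
    """Expected run time: estimated floating-point operations / assumed speed."""
    return rsc_fpops_est / flops


def max_elapsed_seconds(rsc_fpops_bound, flops):
    """Elapsed-time limit, assuming the client uses rsc_fpops_bound / flops."""
    return rsc_fpops_bound / flops


# Hypothetical values, for illustration only.
RSC_FPOPS_EST = 3.0e13                 # estimated operations for the task
RSC_FPOPS_BOUND = 10 * RSC_FPOPS_EST   # assumed bound: 10x the estimate
FLOPS = 5.0e11                         # assumed speed derived from the APR

print(estimated_seconds(RSC_FPOPS_EST, FLOPS))
print(max_elapsed_seconds(RSC_FPOPS_BOUND, FLOPS))
```

Under these made-up numbers the task is expected to take 60 seconds and would error out after 600. The practical point is that a high APR earned on the GPU shrinks the limit, so a task that falls back to the much slower CPU path has less headroom than the raw run time suggests.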
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"The app is describing exactly what it is doing. Found too many triplets, so shifting the computation over to the CPU for completion. The task would have finished on the CPU and reported. No harm, no foul" OK, but I only saw the message after I aborted the task and looked at the logs; before that I had no idea what was happening. Anyway, it is a waste of resources, since it holds the GPU doing nothing until it times out. |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
"The app is describing exactly what it is doing. Found too many triplets, so shifting the computation over to the CPU for completion. The task would have finished on the CPU and reported. No harm, no foul" If you have a suspicion about a running task, you can always look at its Properties to see what slot it is running in. You can then go to that slot and examine the stderr.txt output file and read about the problems before the task finishes. That would explain why the task seemed stalled or was taking longer than usual to finish. |
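For anyone who would rather script that check, here is a minimal sketch. The data directory is an assumption (the default used by the Debian/Ubuntu boinc-client package; yours may differ), and the function names are made up for this example. The slot number is the one shown in the task's Properties.

```python
from pathlib import Path

# Assumption: default data directory of the Debian/Ubuntu boinc-client package.
BOINC_DATA_DIR = Path("/var/lib/boinc-client")

# Text the Special App writes when it hands a PoT over to the CPU (from the log above).
CPU_FALLBACK_MARKER = "reprocessing this PoT on CPU"


def tail_slot_stderr(slot, data_dir=BOINC_DATA_DIR, lines=20):
    """Return the last `lines` lines of stderr.txt for the given slot number."""
    path = Path(data_dir) / "slots" / str(slot) / "stderr.txt"
    text = path.read_text(errors="replace")
    return text.splitlines()[-lines:]


def fell_back_to_cpu(stderr_lines):
    """True if any line shows the triplet search being redone on the CPU."""
    return any(CPU_FALLBACK_MARKER in line for line in stderr_lines)
```

For a task in slot 3, `fell_back_to_cpu(tail_slot_stderr(3))` would tell you whether a seemingly stalled task is simply finishing on the CPU. Reading the slot directory may require running as the BOINC user or with elevated permissions.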
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Thank you. I learned something new today. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
I don't normally post Inconclusives involving a significantly older version of the Special App, but I think it will be useful to verify that the Pulse difference that caused this Inconclusive is something that's been fixed in a later version, or whether it will cross-validate with the tiebreaker on my host, which will run with x41p_zi3v, Cuda 9.00. Workunit 2812663093 (blc05_2bit_guppi_57976_75003_HIP46005_0031.23430.818.22.45.80.vlar) Task 6304795497 (S=1, A=0, P=5, T=1, G=0, BS=24.12749, BG=0, BP=3.887372) v8.22 (opencl_nvidia_SoG) windows_intelx86 Task 6305023330 (S=1, A=0, P=5, T=1, G=0, BS=24.12747, BG=0, BP=3.887368) x41p_zi3t1f, Cuda 8.00 special The difference is in the second Pulse, which SoG reports as.... Pulse: peak=6.384412, time=45.82, period=14.39, d_freq=7495055652.29, score=1.044, chirp=24.211, fft_len=64 ....while the x41p_zi3t1f, Cuda 8.00 special reports.... Pulse: peak=6.210592, time=45.82, period=14.39, d_freq=7495055652.29, score=1.016, chirp=24.211, fft_len=64 If my host agrees with SoG, then it should confirm that upgrading to a newer version of the Special App is advisable. However, if it cross-validates, I imagine some offline runs with other apps would be appropriate to see if SoG's value is truly the correct one (in which case the ball would probably be back in Petri's court). |
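To put a number on how far apart those two Pulse reports are, here is a minimal sketch that parses the two lines quoted above and lists the fields that differ. The parsing helper is hypothetical, and no validator tolerance is assumed, since the thread does not state one.

```python
import re


def parse_signal(line):
    """Split 'Pulse: peak=..., time=...' into (kind, {field: float})."""
    kind, _, rest = line.partition(":")
    fields = {k: float(v) for k, v in re.findall(r"(\w+)=([-\d.]+)", rest)}
    return kind.strip(), fields


def relative_differences(a, b):
    """Relative difference for every shared field whose values disagree."""
    return {
        k: abs(a[k] - b[k]) / max(abs(a[k]), abs(b[k]))
        for k in a
        if k in b and a[k] != b[k]
    }


sog_line = ("Pulse: peak=6.384412, time=45.82, period=14.39, "
            "d_freq=7495055652.29, score=1.044, chirp=24.211, fft_len=64")
special_line = ("Pulse: peak=6.210592, time=45.82, period=14.39, "
                "d_freq=7495055652.29, score=1.016, chirp=24.211, fft_len=64")

_, sog = parse_signal(sog_line)
_, special = parse_signal(special_line)
for field, diff in relative_differences(sog, special).items():
    print(f"{field}: {diff:.2%}")
```

Only peak and score differ, each by a little under 3%; time, period, frequency, chirp and FFT length are identical, so both apps clearly found the same Pulse and disagree only on its strength.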
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Okay, the x41p_zi3v, Cuda 9.00 agreed with SoG, so that should confirm that this particular Pulse issue has been addressed. The x41p_zi3t1f should be retired and replaced with a newer version. In a way, this highlights one of the problems facing the Special App when, hopefully, the remaining issues one day get resolved and a stable version gets released for general use. Over the many months that some version of the Special App has appeared in my Inconclusives list, I've identified no fewer than 31 different versions, all essentially being Beta-tested in the production environment. I have no way of knowing how many are currently active but, as the previous example shows, there are certainly some that should be upgraded. Unfortunately, there's really no way to force the retirement of those earlier versions. I don't think the project has any way to do it, so that should put it in the hands of the developer. But that isn't really practical, either, so........the bottom line seems to be that even if a completely clean version of the Special App got released tomorrow, some of those earlier test versions are likely to be hanging around for a good long while. That's perfectly normal if it happens on Beta, but it shouldn't happen that way on Main. |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
I don't know that you can do anything about it either. It's been up to the person running a host to make sure it is running correctly since the project began. As we know, the project is incapable of punishing any host that isn't working correctly and returning rubbish. The kind of situation you describe has been going on since the release of the Lunatics Optimized apps and the Installer. You are just as capable of installing the wrong app with that platform as you are in not updating the early revisions of the special app for better performing releases. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Yeah, it's ultimately a problem with the whole Anonymous Platform concept. As far as I know, it has no way to "certify" non-stock applications as safe to run/test in the production environment. If it did, then apps could also be de-certified once they were deemed obsolete or significant problems were identified. I realize that SoG had many of the same growing pains that the Special App has had. However, the two major differences I see there are that the various SoG versions at least tended to be pushed through Beta first, and that the developer tended to be more responsive to addressing problems that surfaced than is the case with the Special App (though, at times there could be significant resistance ;^)). Now, the current versions of SoG seem to be pretty well accepted as mainstream apps, but with a similar legacy of numerous outdated versions still floating around in the SETI-sphere. |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
"(though, at times there could be significant resistance ;^))." :D :D :D SETI apps news We're not gonna fight them. We're gonna transcend them. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Well, I don't know what Petri's doing with that zi3xs4 version, but it sure doesn't look stable. Workunit 2813410758 (blc05_2bit_guppi_57976_77315_HIP46417_0038.12216.818.21.44.188.vlar) Task 6306343836 (S=0, A=0, P=30, T=0, G=0, BS=13.40984, BG=0, BP=23.54865) x41p_zi3xs4, Cuda 9.10 special Task 6306343837 (S=2, A=2, P=5, T=2, G=0, BS=24.3649, BG=0, BP=12.14392) x41p_zi3v, Cuda 9.00 special Workunit 2813410770 (blc05_2bit_guppi_57976_75329_HIP46343_0032.11400.818.21.44.192.vlar) Task 6306343840 (S=0, A=0, P=30, T=0, G=0, BS=12.58375, BG=0, BP=2.555692) x41p_zi3xs4, Cuda 9.10 special Task 6306343841 (S=21, A=0, P=5, T=0, G=0, BS=24.6841, BG=0, BP=9.582356) x41p_zi3v, Cuda 9.00 special Workunit 2813438980 (blc05_2bit_guppi_57976_76984_HIP46432_0037.16675.409.21.44.90.vlar) Task 6306402535 (S=0, A=0, P=30, T=0, G=0, BS=12.35322, BG=0, BP=3.181179) x41p_zi3xs4, Cuda 9.10 special Task 6306402536 (S=0, A=0, P=8, T=1, G=0, BS=23.56962, BG=0, BP=0.9703487) x41p_zi3v, Cuda 8.00 special He's coughing up 30-Pulse hairballs where my zi3v hosts are reporting normal-looking results. And, of the Pulses that are reported by my hosts, I don't see any correlation with his reported Pulses. |
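The pattern in those three workunits can be spotted mechanically. Below is a minimal sketch that parses the summary strings used in these posts and flags results that hit the 30-signal limit. The key meanings are this sketch's reading of the summaries (S=Spike, A=Autocorr, P=Pulse, T=Triplet, G=Gaussian, with BS/BG/BP as the "best" values), and the helper names are hypothetical.

```python
import re

OVERFLOW_LIMIT = 30  # signals reported before a result counts as an overflow
COUNT_KEYS = ("S", "A", "P", "T", "G")


def parse_summary(summary):
    """Turn '(S=0, A=0, P=30, ..., BP=23.5)' into a dict of floats."""
    return {k: float(v) for k, v in re.findall(r"(\w+)=([-\d.]+)", summary)}


def signal_total(fields):
    """Total number of reported signals across all signal types."""
    return int(sum(fields[k] for k in COUNT_KEYS))


def is_overflow(fields, limit=OVERFLOW_LIMIT):
    """True if the result reported at least `limit` signals."""
    return signal_total(fields) >= limit


# Workunit 2813410758 from the post above.
zi3xs4 = parse_summary("(S=0, A=0, P=30, T=0, G=0, BS=13.40984, BG=0, BP=23.54865)")
zi3v = parse_summary("(S=2, A=2, P=5, T=2, G=0, BS=24.3649, BG=0, BP=12.14392)")

print(is_overflow(zi3xs4), signal_total(zi3xs4))
print(is_overflow(zi3v), signal_total(zi3v))
```

A pair where one result is an all-Pulse overflow and the other reports a modest mix of signals is a strong hint that one app has gone off the rails, rather than a mere ordering disagreement.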
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
"Well, I don't know what Petri's doing with that zi3xs4 version, but it sure doesn't look stable." . . The moral there is for the majority of normal crunchers NOT to change over to the newer/edgier revisions UNTIL the developers announce they believe them to be trusty and stable. The good news is that zi3v in both Cuda80 and Cuda90 versions seems to be that. At least the Cuda80 version is, judging by my results. Stephen |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
". . The moral there is for the majority of normal crunchers to NOT change over to the newer/edgier revisions UNTIL the developers announce they believe them to be trusty and stable. The good news is that Zi3v in both Cuda80 and Cuda90 versions seems to be that. At least the Cuda80 version is judging by my results." The zi3v version may be stable, in the sense that it isn't changing underfoot, but it still does have several nagging issues that have yet to be addressed. I hope those are the sorts of things that Petri's working on with zi3xs4, not just trying to squeeze a few more seconds out of the run times. However, it certainly appears that zi3xs4 is a work in progress, which absolutely should not be happening in a production environment. At least it appears that Petri is the only one running that version but still, the sort of primary testing that yields results like the 3 WUs in my previous post (where all 3 of the zi3xs4 tasks got marked Invalid) is what should be taking place offline, using a collection of specific WUs which can produce repeatable results until the app gets it right. That collection should already be pretty extensive, but those 3 new WUs could certainly be added. And I'm sure Petri's seen many more such WUs crop up in his own task list. In my opinion, zi3xs4 should be in use offline, or in Beta, only. |