Linux CUDA 'Special' App finally available, featuring Low CPU use

Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1909006 - Posted: 26 Dec 2017, 7:15:53 UTC - in response to Message 1908975.  

I just wish the mechanism that has supposedly been in place for the longest time actually worked. But I agree, there should be new mechanisms put into play to prevent these bad hosts from getting any more work until they clean up their act. As I said, the existing mechanism should work consistently: report a bad task and get penalized, then have the penalty reduced in real time for each valid task reported.


. . Ironically, that process works big time if you abort any tasks: your downloads are crippled until you have uploaded some number of valid tasks (it counts them as it goes but does not indicate the target you have to reach), and then your downloads are re-enabled. If it did that with these delinquent hosts, they would be shut down, as they return very few valid results but dozens or hundreds of invalids.

Stephen

ID: 1909006

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1909216 - Posted: 28 Dec 2017, 3:59:48 UTC

Petri is apparently testing a new version of the Special App. I don't know what improvements may be incorporated in it, but so far it still appears to have the Spike/Autocorr reporting sequence issue that seemed to start with zi3x.

Workunit 2793219570 (08mr07ac.5673.4571.13.40.249)
Task 6264299405 (S=28, A=2, P=0, T=0, G=0, BS=62.65578, BG=1.97871) v8.08 (alt) windows_x86_64
Task 6264299406 (S=25, A=5, P=0, T=0, G=0, BS=52.88408, BG=1.978711) x41p_zi3xs4, Cuda 9.10 special

The first 25 Spikes all match up fine, and the 2 Autocorrs reported by v8.08(alt) are also found in the zi3xs4 result. However, the last 3 Spikes have been replaced by Autocorrs. As I've noted before, it's simply a matter of where the 30-signal overflow cutoff falls in the different reporting sequences.
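
The cutoff effect described above can be sketched in a few lines of Python. This is only an illustration: the two reporting orders below are hypothetical, chosen so that the truncated counts match the two tasks listed (S=28, A=2 versus S=25, A=5). The actual search order inside either app is not shown in this thread.

```python
from collections import Counter

OVERFLOW_LIMIT = 30  # a result overflows once this many signals are reported


def reported_counts(reporting_order, limit=OVERFLOW_LIMIT):
    """Count signal types among the first `limit` signals in reporting order."""
    return Counter(reporting_order[:limit])


# Assumption: both apps detect the same pool of 28 Spikes ("S") and
# 5 Autocorrs ("A"); only the order in which they are reported differs.
# Hypothetical order A: three of the Autocorrs come after the last Spikes.
order_a = ["S"] * 25 + ["A"] * 2 + ["S"] * 3 + ["A"] * 3
# Hypothetical order B: all five Autocorrs come before the last three Spikes.
order_b = ["S"] * 25 + ["A"] * 5 + ["S"] * 3

print(reported_counts(order_a))  # Counter({'S': 28, 'A': 2})
print(reported_counts(order_b))  # Counter({'S': 25, 'A': 5})
```

Both orders contain exactly the same 33 signals, yet the 30-signal cutoff leaves two different result summaries, which is enough to make the pair Inconclusive.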

The tiebreaker is assigned to my Linux box that currently runs the Cuda 8.00 version of zi3v, so I'm pretty much expecting it to agree with the v8.08 (alt) result. It will probably run sometime overnight.
ID: 1909216

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1909288 - Posted: 28 Dec 2017, 16:37:48 UTC - in response to Message 1909216.  

Well, that's interesting. The result from my host running zi3v Cuda 8.00 matched the zi3xs4 Cuda 9.10. I thought the change to the Spike/Autocorr reporting sequence first appeared after zi3v, but perhaps not. I guess I'll have to revisit the Inconclusives from zi3v Cuda 8.00 and see if I've been overlooking those.
ID: 1909288

Zalster
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1909290 - Posted: 28 Dec 2017, 16:52:45 UTC - in response to Message 1909288.  

I'm more concerned with how the system rewarded you for your work: 1 credit....

I didn't expect your result to be different from Petri's, especially if all he is doing is modifying the base algorithm.
ID: 1909290

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1909291 - Posted: 28 Dec 2017, 17:15:44 UTC - in response to Message 1909290.  

I'm more concerned with how the system rewarded you for your work: 1 credit....

I didn't expect your result to be different from Petri's, especially if all he is doing is modifying the base algorithm.
Just like the old days.....exactly 1 credit for every WU completed! Not really. Looking at my recent overflow results, I see credit ranging anywhere from 0.58 to 1.72 for similar run times. That's pretty "normal". :^)

Actually, I thought that particular change in reporting sequence first started showing up with the zi3x version of the Special App, while the zi3v and earlier versions conformed to the standard sequence. Perhaps I was wrong about that, so I'll have to take another look at my Inconclusives when I get a chance.
ID: 1909291

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1909392 - Posted: 29 Dec 2017, 4:14:41 UTC - in response to Message 1909291.  

Actually, I thought that particular change in reporting sequence first started showing up with the zi3x version of the Special App, while the zi3v and earlier versions conformed to the standard sequence. Perhaps I was wrong about that, so I'll have to take another look at my Inconclusives when I get a chance.
Yep, looks like I was totally wrong about when that Spike/Autocorr reporting sequence problem first started showing up. In fact, it seems to have been there at least since zi3t2b, as seen in these two Inconclusives found on one of my hosts:

Workunit 2745816919 (31ja07aa.16485.7025.11.38.12)
Task 6165476063 (S=22, A=8, P=0, T=0, G=0, BS=59.23683, BG=2.030948) v8.22 (opencl_nvidia_SoG) windows_intelx86
Task 6165476064 (S=19, A=11, P=0, T=0, G=0, BS=56.80987, BG=2.030948) x41p_zi3t2b, Cuda 8.00 special

Workunit 2749963764 (11mr07ad.3304.25021.15.42.246)
Task 6174104968 (S=23, A=7, P=0, T=0, G=0, BS=54.66829, BG=3.430506) v8.08 (alt) windows_x86_64
Task 6174104969 (S=22, A=8, P=0, T=0, G=0, BS=54.66837, BG=3.430513) x41p_zi3t2b, Cuda 8.00 special

And here's a bit of a bizarre one with zi3v. The tiebreaking host (running v8.03 x86_64-apple-darwin) has actually already reported, agreeing with the SoG result, of course, but the validator missed it the first time around, thanks to one of our extended outages. That means the WU has two tasks showing as Inconclusive and one "waiting for validation". It will have to wait for the original deadline for the third task before the validator can wrap this one up. It looks like that should happen on New Year's day.

Workunit 2774983914 (09ja07ac.28081.2935.10.37.11)
Task 6226304497 (S=22, A=8, P=0, T=0, G=0, BS=52.74569, BG=0) x41p_zi3v, Cuda 8.00 special
Task 6226304498 (S=25, A=5, P=0, T=0, G=0, BS=66.7202, BG=0) v8.22 (opencl_nvidia_SoG) windows_intelx86

Apparently, I chose to ignore these types of Inconclusives early on and they only started getting my attention again when they showed up with wingmen who were running zi3x and later. In any event, I just went through my complete list of current Inconclusives and discovered that, out of 169 of them on my Linux boxes, 43 look like they're due to this Spike/Autocorr reporting sequence issue, based on the summaries in my list. I didn't look at the details for each, but even assuming that a handful might actually have other issues in play, it means that about 20-25% of my Inconclusives could very well go away if that issue was addressed.
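
For anyone who wants to check the percentage, here is the arithmetic as a small Python sketch. The 169 and 43 come from the post above; the size of the "handful" (8 here) is purely an assumed number for illustration.

```python
total_inconclusives = 169  # from the post: current Inconclusives on the Linux boxes
sequence_related = 43      # those that look like the Spike/Autocorr sequence issue
assumed_handful = 8        # hypothetical count that might have other causes


def share(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


upper = share(sequence_related, total_inconclusives)
lower = share(sequence_related - assumed_handful, total_inconclusives)
print(f"between {lower:.1f}% and {upper:.1f}%")  # between 20.7% and 25.4%
```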
ID: 1909392

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1909751 - Posted: 31 Dec 2017, 4:18:45 UTC

It appears that Petri's latest version of the Special App may have a change to "Best pulse" reporting, so I've added that to my Inconclusives list (as "BP="). In this non-overflow WU, the "Best pulse" reported by zi3xs4 differs from zi3v, but it still doesn't match SoG, either, so another tiebreaker will be required. None of the apps found an actual reportable Pulse in this WU.

Workunit 2792762274 (15fe07ad.18728.19295.15.42.55)
Task 6263350694 (S=12, A=0, P=0, T=4, G=0, BS=28.19272, BG=0, BP=0.3216815) x41p_zi3xs4, Cuda 9.10 special
Task 6273408457 (S=12, A=0, P=0, T=4, G=0, BS=28.19279, BG=0, BP=1.343807) v8.22 (opencl_nvidia_SoG) windows_intelx86
Task 6274464485 (S=12, A=0, P=0, T=4, G=0, BS=28.19273, BG=0, BP=0.2401133) x41p_zi3v, Cuda 9.00 special
ID: 1909751

juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1910528 - Posted: 4 Jan 2018, 16:01:53 UTC
Last modified: 4 Jan 2018, 16:11:16 UTC

Sorry if this is the wrong thread to post in, but I can't find a better one since I use the CUDA90 builds.

I just saw something strange on one of the blc5 WUs:

https://setiathome.berkeley.edu/result.php?resultid=6284598198

Apparently it put the thread in a loop; it runs for minutes and does not end.
The other blc5 WUs are crunched in less than 3 min, but this one took more than 10, with the remaining time stuck at > 80% done.
So I aborted it.

What is the right thing to do if that happens again? Abort it, or leave it running for more time?

<edit>
After I aborted the WU and looked at the Stderr output, I saw this:

Restarted at 81.55 percent, with setiathome enhanced x41p_zi3v, Cuda 9.00 special
Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Sigma 66
Sigma > GaussTOffsetStop: 66 > -2
Thread call stack limit is: 1k
Find triplets Cuda kernel encountered too many triplets, or bins above threshold, reprocessing this PoT on CPU... err = 1
ID: 1910528

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1910700 - Posted: 5 Jan 2018, 0:03:45 UTC

Should have just let it run. Your clue is this:

Find triplets Cuda kernel encountered too many triplets, or bins above threshold, reprocessing this PoT on CPU... err = 1

The app is describing exactly what it is doing. It found too many triplets, so it is shifting the computation over to the CPU for completion. The task would have finished on the CPU and reported. No harm, no foul.

The only thing that can bite you is if the completion time exceeds the maximum time allotted for a GPU task, which is based on the APR and the f_pops_est number. If you get into that condition, the task errors out and you get no credit for it.
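
As a rough sketch of how that time limit behaves, assume the limit is the workunit's floating-point-operation bound divided by the host's estimated processing rate for the app (derived from the APR). All the numbers below, including the 10x ratio between bound and estimate, are hypothetical and only there to show the shape of the calculation.

```python
def max_elapsed_seconds(fpops_bound, est_flops):
    """Time limit in seconds: operation bound divided by estimated speed."""
    if est_flops <= 0:
        raise ValueError("estimated speed must be positive")
    return fpops_bound / est_flops


# Hypothetical values, for illustration only.
fpops_est = 2.0e13            # assumed estimated operations for the task
fpops_bound = 10 * fpops_est  # assumed bound: ten times the estimate
apr_gflops = 500.0            # assumed APR of the GPU app on this host

limit = max_elapsed_seconds(fpops_bound, apr_gflops * 1e9)
print(f"time limit: {limit:.0f} s")  # 2e14 / 5e11 = 400 s
```

The point of the sketch is that a very fast GPU app means a high APR and therefore a short limit, so a task that falls back to the much slower CPU path has less headroom before it errors out.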
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1910700

juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1910702 - Posted: 5 Jan 2018, 0:13:15 UTC - in response to Message 1910700.  
Last modified: 5 Jan 2018, 0:14:02 UTC

The app is describing exactly what it is doing. It found too many triplets, so it is shifting the computation over to the CPU for completion. The task would have finished on the CPU and reported. No harm, no foul.

OK, but I only saw the message after I aborted the task and looked at the logs; before that I had no idea what was happening.
Anyway, it is a waste of resources, since it holds the GPU doing nothing until it times out.
ID: 1910702

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1910747 - Posted: 5 Jan 2018, 3:44:29 UTC - in response to Message 1910702.  

The app is describing exactly what it is doing. It found too many triplets, so it is shifting the computation over to the CPU for completion. The task would have finished on the CPU and reported. No harm, no foul.

OK, but I only saw the message after I aborted the task and looked at the logs; before that I had no idea what was happening.
Anyway, it is a waste of resources, since it holds the GPU doing nothing until it times out.

If you have a suspicion about a running task, you can always look at its Properties to see what slot it is running in. You can then go to that slot's directory, examine the stderr.txt output file, and read about any problems before the task finishes. That would explain why the task seemed stalled or was taking longer than usual to finish.
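
The same check can be scripted. The sketch below assumes the slots live under a `slots` subdirectory of the BOINC data directory and that each running task writes a `stderr.txt` there; the function name and the example path in the note afterwards are my own assumptions, so adjust them for your install.

```python
from pathlib import Path


def tail_slot_stderr(data_dir, lines=10):
    """Return {slot_name: last `lines` lines of stderr.txt} for each slot that has one."""
    tails = {}
    for stderr_file in sorted(Path(data_dir).glob("slots/*/stderr.txt")):
        # errors="replace" keeps the scan going if a file contains odd bytes
        text = stderr_file.read_text(errors="replace")
        tails[stderr_file.parent.name] = text.splitlines()[-lines:]
    return tails


def print_tails(data_dir, lines=10):
    """Print the tail of every slot's stderr.txt under `data_dir`."""
    for slot, tail in tail_slot_stderr(data_dir, lines).items():
        print(f"--- slot {slot} ---")
        print("\n".join(tail))
```

Calling `print_tails("/var/lib/boinc-client")` (a common Linux location, but not guaranteed) would show lines such as the "reprocessing this PoT on CPU" message while the task is still running.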
ID: 1910747

juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1910748 - Posted: 5 Jan 2018, 3:47:13 UTC - in response to Message 1910747.  
Last modified: 5 Jan 2018, 3:50:36 UTC

Thank you. I learned something new today.
ID: 1910748

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1912455 - Posted: 12 Jan 2018, 3:52:57 UTC

I don't normally post Inconclusives involving a significantly older version of the Special App, but I think it will be useful to verify that the Pulse difference that caused this Inconclusive is something that's been fixed in a later version, or whether it will cross-validate with the tiebreaker on my host, which will run with x41p_zi3v, Cuda 9.00.

Workunit 2812663093 (blc05_2bit_guppi_57976_75003_HIP46005_0031.23430.818.22.45.80.vlar)
Task 6304795497 (S=1, A=0, P=5, T=1, G=0, BS=24.12749, BG=0, BP=3.887372) v8.22 (opencl_nvidia_SoG) windows_intelx86
Task 6305023330 (S=1, A=0, P=5, T=1, G=0, BS=24.12747, BG=0, BP=3.887368) x41p_zi3t1f, Cuda 8.00 special

The difference is in the second Pulse, which SoG reports as....
Pulse: peak=6.384412, time=45.82, period=14.39, d_freq=7495055652.29, score=1.044, chirp=24.211, fft_len=64

....while the x41p_zi3t1f, Cuda 8.00 special reports....
Pulse: peak=6.210592, time=45.82, period=14.39, d_freq=7495055652.29, score=1.016, chirp=24.211, fft_len=64

If my host agrees with SoG, then it should confirm that upgrading to a newer version of the Special App is advisable. However, if it cross-validates, I imagine some offline runs with other apps would be appropriate to see if SoG's value is truly the correct one (in which case the ball would probably be back in Petri's court).
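
To make the size of that Pulse difference concrete, here is a small Python sketch that parses the two Pulse lines quoted above and reports which fields differ and by how much. The parsing helper is my own; whatever tolerance the validator actually applies is not stated in this thread, so no threshold is assumed.

```python
import re


def parse_pulse(line):
    """Parse the key=value pairs of a Pulse line into a dict of floats."""
    return {k: float(v) for k, v in re.findall(r"(\w+)=([-+0-9.eE]+)", line)}


def relative_diff(a, b):
    """Relative difference, measured against the larger magnitude."""
    return abs(a - b) / max(abs(a), abs(b))


sog = parse_pulse(
    "Pulse: peak=6.384412, time=45.82, period=14.39, "
    "d_freq=7495055652.29, score=1.044, chirp=24.211, fft_len=64"
)
special = parse_pulse(
    "Pulse: peak=6.210592, time=45.82, period=14.39, "
    "d_freq=7495055652.29, score=1.016, chirp=24.211, fft_len=64"
)

differing = [key for key in sog if sog[key] != special[key]]
for key in differing:
    pct = 100 * relative_diff(sog[key], special[key])
    print(f"{key}: {sog[key]} vs {special[key]} ({pct:.2f}% apart)")
```

Only `peak` and `score` differ, each by a little under 3%; time, period, frequency, chirp and FFT length are identical.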
ID: 1912455

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1912586 - Posted: 12 Jan 2018, 17:40:13 UTC - in response to Message 1912455.  

Okay, the x41p_zi3v, Cuda 9.00 agreed with SoG, so that should confirm that this particular Pulse issue has been addressed. The x41p_zi3t1f should be retired and replaced with a newer version.

In a way, this highlights one of the problems facing the Special App when, hopefully, the remaining issues one day get resolved and a stable version gets released for general use. Over the many months that some version of the Special App has appeared in my Inconclusives list, I've identified no fewer than 31 different versions, all essentially being Beta-tested in the production environment. I have no way of knowing how many are currently active but, as the previous example shows, there are certainly some that should be upgraded.

Unfortunately, there's really no way to force the retirement of those earlier versions. I don't think the project has any way to do it, so that should put it in the hands of the developer. But that isn't really practical, either, so........the bottom line seems to be that even if a completely clean version of the Special App got released tomorrow, some of those earlier test versions are likely to be hanging around for a good long while. That's perfectly normal if it happens on Beta, but it shouldn't happen that way on Main.
ID: 1912586

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1912596 - Posted: 12 Jan 2018, 18:20:23 UTC - in response to Message 1912586.  

I don't know that you can do anything about it either. It's been up to the person running a host to make sure it is running correctly since the project began. As we know, the project is incapable of punishing any host that isn't working correctly and returning rubbish.

The kind of situation you describe has been going on since the release of the Lunatics Optimized apps and the Installer. You are just as capable of installing the wrong app with that platform as you are of failing to update early revisions of the Special App to better-performing releases.
ID: 1912596

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1912601 - Posted: 12 Jan 2018, 18:43:02 UTC - in response to Message 1912596.  

Yeah, it's ultimately a problem with the whole Anonymous Platform concept. As far as I know, it has no way to "certify" non-stock applications as safe to run/test in the production environment. If it did, then apps could also be de-certified once they were deemed obsolete or significant problems were identified.

I realize that SoG had many of the same growing pains that the Special App has had. However, the two major differences I see there are that the various SoG versions at least tended to be pushed through Beta first, and that the developer tended to be more responsive to addressing problems that surfaced than is the case with the Special App (though, at times there could be significant resistance ;^)). Now, the current versions of SoG seem to be pretty well accepted as mainstream apps, but with a similar legacy of numerous outdated versions still floating around in the SETI-sphere.
ID: 1912601

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1912654 - Posted: 12 Jan 2018, 23:12:12 UTC - in response to Message 1912601.  

(though, at times there could be significant resistance ;^)).


:D :D :D
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1912654

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1912717 - Posted: 13 Jan 2018, 3:30:28 UTC

Well, I don't know what Petri's doing with that zi3xs4 version, but it sure doesn't look stable.

Workunit 2813410758 (blc05_2bit_guppi_57976_77315_HIP46417_0038.12216.818.21.44.188.vlar)
Task 6306343836 (S=0, A=0, P=30, T=0, G=0, BS=13.40984, BG=0, BP=23.54865) x41p_zi3xs4, Cuda 9.10 special
Task 6306343837 (S=2, A=2, P=5, T=2, G=0, BS=24.3649, BG=0, BP=12.14392) x41p_zi3v, Cuda 9.00 special

Workunit 2813410770 (blc05_2bit_guppi_57976_75329_HIP46343_0032.11400.818.21.44.192.vlar)
Task 6306343840 (S=0, A=0, P=30, T=0, G=0, BS=12.58375, BG=0, BP=2.555692) x41p_zi3xs4, Cuda 9.10 special
Task 6306343841 (S=21, A=0, P=5, T=0, G=0, BS=24.6841, BG=0, BP=9.582356) x41p_zi3v, Cuda 9.00 special

Workunit 2813438980 (blc05_2bit_guppi_57976_76984_HIP46432_0037.16675.409.21.44.90.vlar)
Task 6306402535 (S=0, A=0, P=30, T=0, G=0, BS=12.35322, BG=0, BP=3.181179) x41p_zi3xs4, Cuda 9.10 special
Task 6306402536 (S=0, A=0, P=8, T=1, G=0, BS=23.56962, BG=0, BP=0.9703487) x41p_zi3v, Cuda 8.00 special

He's coughing up 30-Pulse hairballs where my zi3v hosts are reporting normal-looking results. And, of the Pulses that are reported by my hosts, I don't see any correlation with his reported Pulses.
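
Spotting these in a long list can be automated. The sketch below parses task summary lines in the format used in this post and flags any result whose total signal count reaches the 30-signal overflow limit. The regular expression and helper names are my own and only assume the `S=, A=, P=, T=, G=` layout shown above.

```python
import re

OVERFLOW_LIMIT = 30  # total reported signals at which a result overflows


def parse_summary(line):
    """Extract the integer signal counts (S, A, P, T, G) from a summary line."""
    # \b keeps "S=" from matching inside "BS=", and likewise for BG/BP.
    return {k: int(v) for k, v in re.findall(r"\b([SAPTG])=(\d+)\b", line)}


def is_overflow(counts, limit=OVERFLOW_LIMIT):
    """True when the total number of reported signals reaches the limit."""
    return sum(counts.values()) >= limit


lines = [
    "Task 6306343836 (S=0, A=0, P=30, T=0, G=0, BS=13.40984, BG=0, BP=23.54865) x41p_zi3xs4, Cuda 9.10 special",
    "Task 6306343837 (S=2, A=2, P=5, T=2, G=0, BS=24.3649, BG=0, BP=12.14392) x41p_zi3v, Cuda 9.00 special",
]
for line in lines:
    counts = parse_summary(line)
    flag = "OVERFLOW" if is_overflow(counts) else "normal"
    print(flag, counts)
```

Run against the first workunit above, the zi3xs4 task is flagged as an overflow (30 Pulses) while the zi3v task, with 11 signals in total, is not.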
ID: 1912717

Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1912792 - Posted: 13 Jan 2018, 13:05:08 UTC - in response to Message 1912717.  

Well, I don't know what Petri's doing with that zi3xs4 version, but it sure doesn't look stable.

He's coughing up 30-Pulse hairballs where my zi3v hosts are reporting normal-looking results. And, of the Pulses that are reported by my hosts, I don't see any correlation with his reported Pulses.


. . The moral there is for the majority of normal crunchers NOT to change over to the newer/edgier revisions UNTIL the developers announce they believe them to be trustworthy and stable. The good news is that zi3v in both Cuda80 and Cuda90 versions seems to be that. At least the Cuda80 version is, judging by my results.

Stephen

. .
ID: 1912792

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1912834 - Posted: 13 Jan 2018, 17:13:01 UTC - in response to Message 1912792.  

. . The moral there is for the majority of normal crunchers NOT to change over to the newer/edgier revisions UNTIL the developers announce they believe them to be trustworthy and stable. The good news is that zi3v in both Cuda80 and Cuda90 versions seems to be that. At least the Cuda80 version is, judging by my results.

Stephen

. .
The zi3v version may be stable, in the sense that it isn't changing underfoot, but it still does have several nagging issues that have yet to be addressed. I hope those are the sorts of things that Petri's working on with zi3xs4, not just trying to squeeze a few more seconds out of the run times.

However, it certainly appears that zi3xs4 is a work in progress, which absolutely should not be happening in a production environment. At least it appears that Petri is the only one running that version but still, the sort of primary testing that yields results like the 3 WUs in my previous post (where all 3 of the zi3xs4 tasks got marked Invalid) is what should be taking place offline, using a collection of specific WUs which can produce repeatable results until the app gets it right. That collection should already be pretty extensive, but those 3 new WUs could certainly be added. And I'm sure Petri's seen many more such WUs crop up in his own task list.

In my opinion, zi3xs4 should be in use offline, or in Beta, only.
ID: 1912834
 