Cuda50 Task Invalid Against Two Other Cuda50s

Message boards : Number crunching : Cuda50 Task Invalid Against Two Other Cuda50s
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1763057 - Posted: 7 Feb 2016, 4:40:20 UTC

I've come to expect the occasional odd Invalid when tasks run on different platforms, but I don't think I've ever run across a case where the same, apparently identical app, returns an Invalid to one host but validates the other two.

In this case, WU 2051291982, all three hosts ran "setiathome enhanced x41zi (baseline v8), Cuda 5.00". For my host, the one that got the Invalid, it was run under Anonymous Platform, while the other two ran stock.

All three hosts returned a -9 Overflow with 30 spikes. It was not an "instant" Invalid, but first was marked Inconclusive until the third host reported. My task ran on a GTX 750Ti, while the initial Inconclusive was against a Quadro K620. The deciding vote was cast by a GTX 650Ti.

I don't see anything in the Stderr outputs that would indicate any differences in configuration parameters, or any other glaring differences, for that matter. So.......is this some odd manifestation of this new "increased precision", or are cosmic rays at work again?

Of course, as a -9 overflow that resulted in a grand total credit of 0.40 for the valid tasks, there's really not much at stake with this particular WU. :^) However, any circumstance where Cuda50 doesn't agree with Cuda50 seems like it might be a concern.
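For readers unfamiliar with how a task goes from Inconclusive to Invalid, the following is a minimal Python sketch of a quorum-of-two vote. It is not the project's validator: the `agrees` comparison, the 1% tolerance, and the numbers in the usage example are all assumptions made up for illustration, with one number standing in for an entire result file.

```python
def agrees(a, b, rel_tol=0.01):
    """Hypothetical comparison: two results agree if they match within rel_tol."""
    return abs(a - b) <= rel_tol * max(abs(a), abs(b))

def vote(results, quorum=2):
    """Return {host: status} for the results reported so far.

    results maps a host label to one illustrative number that stands in
    for the whole result file (an assumption, not the real file format).
    """
    hosts = list(results)
    # Look for the largest set of hosts that agree with some reference host.
    best_group = []
    for ref in hosts:
        group = [h for h in hosts if agrees(results[ref], results[h])]
        if len(group) > len(best_group):
            best_group = group
    if len(best_group) < quorum:
        # No two results agree yet, so everything waits for another host.
        return {h: "inconclusive" for h in hosts}
    return {h: ("valid" if h in best_group else "invalid") for h in hosts}

# Made-up values: the first two hosts disagree, the third sides with the second.
print(vote({"GTX 750Ti": 24.9, "Quadro K620": 24.0}))
print(vote({"GTX 750Ti": 24.9, "Quadro K620": 24.0, "GTX 650Ti": 24.05}))
```

With only two disagreeing results, neither can be called right or wrong, which is why the task sat at Inconclusive until the third host reported.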
ID: 1763057
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1763069 - Posted: 7 Feb 2016, 5:15:41 UTC - in response to Message 1763057.  
Last modified: 7 Feb 2016, 5:19:32 UTC

I've noticed a couple of inconclusives against CUDA50 on different hardware from mine (GTX 750Tis), as well as CUDA42 & 32 & some Intel iGPUs. Most of them so far are against Apple-Darwin results.

I found Autocorr & triplets, 5 of each. The other system found 30 Autocorr:
http://setiathome.berkeley.edu/workunit.php?wuid=2047661455

The ones that have been against CUDA50 have often finished early on the other system, whereas mine crunched through to the end.

I've got one where both systems crunched to the end- I found a spike, the other system didn't.
http://setiathome.berkeley.edu/workunit.php?wuid=2050828313

I've got another where both systems found 19 spikes, but it's still inconclusive.
http://setiathome.berkeley.edu/workunit.php?wuid=2053501202

So far my Inconclusives are running at about 3.6%, and I haven't had any invalids yet. Although, as you've shown, it's only a matter of time.
Grant
Darwin NT
ID: 1763069
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1763073 - Posted: 7 Feb 2016, 5:28:03 UTC - in response to Message 1763057.  

Always hard to gauge on few results and little time, but basically good results can 'conspire', and your app could be good or bad. So many new apps, too early to point bones.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1763073
betreger
Joined: 29 Jun 99
Posts: 11358
Credit: 29,581,041
RAC: 66
United States
Message 1763156 - Posted: 7 Feb 2016, 17:42:56 UTC

On my 3 boxes, 2 for Seti and the other for Einstein, I found most of my inconclusives are from Apple, Linux machines or iGPUs.
ID: 1763156
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1764570 - Posted: 13 Feb 2016, 7:47:21 UTC - in response to Message 1763156.  

> On my 3 boxes, 2 for Seti and the other for Einstein, I found most of my inconclusives are from Apple, Linux machines or iGPUs.


I just scored my first invalid, and it was against a pair of x86_64-apple-darwins
2059369452


My result,
Spike count: 0
Autocorr count: 0
Pulse count: 0
Triplet count: 0
Gaussian count: 0


Their results,
Spike count: 0
Autocorr count: 0
Pulse count: 0
Triplet count: 0
Gaussian count: 0


Spike count: 0
Autocorr count: 0
Pulse count: 0
Triplet count: 0
Gaussian count: 0


How is it so?
Grant
Darwin NT
ID: 1764570
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1764571 - Posted: 13 Feb 2016, 7:50:55 UTC - in response to Message 1764570.  
Last modified: 13 Feb 2016, 7:51:32 UTC

> How is it so?


Even with no 'reportable' signals, there is still a best spike, autocorrelation, triplet, pulse, and gaussian in the result file (total 5). The current implementations across the board lose precision the further below reportable threshold you get, and it seems the compiler options used to make the apple stock builds may not have yet factored in the precision enhancements Eric made in v8.
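To make the point concrete, here is a small Python sketch of how two results with identical (all-zero) signal counts could still fail to match once their best signals are compared. The field names, the power values, and the 1% tolerance are assumptions for illustration only; this is not the actual validator logic or result-file layout.

```python
import math

# The five "best" signal types mentioned above.
SIGNALS = ("spike", "autocorr", "triplet", "pulse", "gaussian")

def mismatches(best_a, best_b, rel_tol=0.01):
    """List the signal types whose best values differ by more than rel_tol."""
    return [s for s in SIGNALS
            if not math.isclose(best_a[s], best_b[s], rel_tol=rel_tol)]

# Hypothetical best-signal values; both results report zero signals found.
mine = {"spike": 23.1, "autocorr": 16.8, "triplet": 0.0, "pulse": 1.02, "gaussian": 0.0}
theirs = {"spike": 23.1, "autocorr": 16.8, "triplet": 0.0, "pulse": 0.95, "gaussian": 0.0}

print(mismatches(mine, theirs))
```

In this made-up case only the best pulse differs, and it is well below any reportable threshold, yet a comparison of this kind would still flag the pair as disagreeing.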
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1764571
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1764572 - Posted: 13 Feb 2016, 7:55:22 UTC - in response to Message 1764571.  
Last modified: 13 Feb 2016, 7:59:16 UTC

>> How is it so?
>
> Even with no 'reportable' signals, there is still a best spike, autocorrelation, triplet, pulse, and gaussian in the result file (total 5). The current implementations across the board lose precision the further below reportable threshold you get, and it seems the compiler options used to make the apple stock builds may not have yet factored in the precision enhancements Eric made in v8.


Ah, OK.
I see those in the apple-darwin stderr outputs, but not in the Lunatics stderr output.
Grant
Darwin NT
ID: 1764572
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1764575 - Posted: 13 Feb 2016, 8:33:23 UTC - in response to Message 1764572.  

>>> How is it so?
>>
>> Even with no 'reportable' signals, there is still a best spike, autocorrelation, triplet, pulse, and gaussian in the result file (total 5). The current implementations across the board lose precision the further below reportable threshold you get, and it seems the compiler options used to make the apple stock builds may not have yet factored in the precision enhancements Eric made in v8.
>
> Ah, OK.
> I see those in the apple-darwin stderr outputs, but not in the Lunatics stderr output.


If you're referring to the Cuda builds missing these, I'm sad to report our missing brother in arms Joe Segur was largely responsible for those. I had indicated before his disappearance that I wanted to reduce stderr pollution (by removing the stylised ASCII art Lunatics logo for starters), and he may have taken that to heart. I've got nothing specifically against best signal printing, though I remain to be convinced that adding more random numbers would be helpful.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1764575
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1764576 - Posted: 13 Feb 2016, 8:40:05 UTC - in response to Message 1764575.  

> I've got nothing specifically against best signal printing, though I remain to be convinced that adding more random numbers would be helpful.

I'm in favour of keeping things simple, but for things like this one (where everything looks the same, but isn't) those details would be useful.
Grant
Darwin NT
ID: 1764576
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1764577 - Posted: 13 Feb 2016, 8:48:48 UTC - in response to Message 1764576.  

>> I've got nothing specifically against best signal printing, though I remain to be convinced that adding more random numbers would be helpful.
>
> I'm in favour of keeping things simple, but for things like this one (where everything looks the same, but isn't) those details would be useful.


Indeed a tough call. On one hand I would say no reportable signals means "who cares?". On the other hand I would say precision matters, and sadly you lost to two bad Apple applications this time around.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1764577
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1766977 - Posted: 22 Feb 2016, 23:13:22 UTC - in response to Message 1763073.  

> Always hard to gauge on few results and little time, but basically good results can 'conspire', and your app could be good or bad. So many new apps, too early to point bones.

Well, here's another one, WU 2051291980, that's almost identical to the one in my original post. Interestingly, although the WU number is just 2 off from the first one, the WU names are very different, the first one being 28oc15ab.17157.11110.12.39.249, while this one is 01ap11ab.8577.13973.6.33.254, created at virtually the same time but split from different "tapes".

The initial Inconclusive pitted the same GTX 750Ti on my host against the Quadro K620 on the same initial wingman's host, both running "x41zi (baseline v8), Cuda 5.00". This time the deciding vote was (finally) cast by a GT 730 running Cuda42, which apparently agreed with the K620, with my results somehow differing too greatly to validate. All three hosts returned a -9 Overflow with 30 spikes, the same as for the WU in my original post.

Again, only an insignificant amount of processing time lost here, but it just seems very strange when what is essentially the same app is returning results that apparently differ significantly enough to fail validation.
ID: 1766977
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1766982 - Posted: 22 Feb 2016, 23:32:55 UTC - in response to Message 1766977.  
Last modified: 23 Feb 2016, 0:27:32 UTC

The reported clockrate on your 750ti seems on the high side:
GPU current clockRate = 1320 MHz

Factory superclock perhaps? (Those can always be marginal, since the card is sold as a gaming device.)

Reference card clock is: 1020 Base, 1085 Boost

If temps are OK, I'd be inclined to give a small core voltage bump and back off the memory clock a little too (if not already forced to p2 power state when Cuda fires up).
[Alternative is back off the core clock a few notches. Does yours even have the Auxiliary PCIe Power connector ?]

The background is that, of NVIDIA cards, only Tesla/Titan devices are really rated for a 24x7 GPGPU 100% duty cycle, and factories often set their clocks by an acceptable number of graphical artefacts per time period. As GPGPU permeates more and more applications, that is why at one point they started throttling GPUs to the P2 power state, since no number of artefacts is acceptable for compute.

Six of one, half a dozen of the other, though, if invalids only seem to manifest with overflows and VHAR shorties.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1766982
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1767016 - Posted: 23 Feb 2016, 1:50:39 UTC - in response to Message 1766982.  

> The reported clockrate on your 750ti seems on the high side:
> GPU current clockRate = 1320 MHz
>
> Factory superclock perhaps? (Those can always be marginal, since the card is sold as a gaming device.)
>
> Reference card clock is: 1020 Base, 1085 Boost

Well, it is factory superclocked, but GPU-Z reports it as 1254 MHz, which is consistent with EVGA's specs for the 02G-P4-3753-KR (1176 MHz Base Clock, 1255 MHz Boost Clock). Not something that's surfaced as a problem before, until these two WUs. I've had that card crunching since September, 2014. However, it doesn't run 24/7. Right now that box just runs about 11-12 hours each night.

> If temps are OK, I'd be inclined to give a small core voltage bump and back off the memory clock a little too (if not already forced to p2 power state when Cuda fires up).
> [Alternative is back off the core clock a few notches. Does yours even have the Auxiliary PCIe Power connector ?]
>
> The background is that, of NVIDIA cards, only Tesla/Titan devices are really rated for a 24x7 GPGPU 100% duty cycle, and factories often set their clocks by an acceptable number of graphical artefacts per time period. As GPGPU permeates more and more applications, that is why at one point they started throttling GPUs to the P2 power state, since no number of artefacts is acceptable for compute.
>
> Six of one, half a dozen of the other, though, if invalids only seem to manifest with overflows and VHAR shorties.

Temps should be fine. Precision X would keep them in line, if necessary, but the two 750Tis actually reside outside the box and, as far as I know, the temps stay below 60C without any help.

Clock and power settings are still at factory, as I haven't tinkered with them. I haven't checked power state, but perhaps I can do that in 2 or 3 hours, when I start the box back up again for tonight's crunching. It does not have an auxiliary PCIe power connector.

While both Invalids have been -9 overflows with 30 Spikes, the ARs were different. The first one was 0.868322 and the second one was 2.724965.
ID: 1767016
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1767024 - Posted: 23 Feb 2016, 2:26:25 UTC - in response to Message 1767016.  

Due to the way Boost works, the MHz reading is more or less arbitrary, depending on when it's sampled. In our case I sample in the middle of a chirp, which is relatively compute intensive, so clocks are likely high. Yeah, I'd take a look at the power state with NVIDIA Inspector, then try reducing the clock by three notches of 5 MHz. The firmware curve isn't AI, and probably has some latency/overshoot involved, so the manufacturer may have been just a little optimistic.
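One way to see how much a clock reading depends on the moment it is sampled is to poll the card repeatedly and look at the spread. The Python sketch below shells out to `nvidia-smi` and assumes its `clocks.sm` and `pstate` query fields are available for the installed driver; the helper names, sample count, and interval are arbitrary choices for illustration, and this is not how the Cuda app itself reads the clock.

```python
import subprocess
import time

# Ask nvidia-smi for the current SM clock (MHz) and performance state per GPU.
QUERY = ["nvidia-smi", "--query-gpu=clocks.sm,pstate",
         "--format=csv,noheader,nounits"]

def parse_sample(text):
    """Parse CSV output such as '1320, P0' into [(clock_mhz, pstate), ...]."""
    rows = []
    for line in text.strip().splitlines():
        clock, pstate = (field.strip() for field in line.split(","))
        rows.append((int(clock), pstate))
    return rows

def sample_clocks(count=10, interval_s=1.0):
    """Poll nvidia-smi `count` times; returns one parsed sample per poll."""
    samples = []
    for _ in range(count):
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        samples.append(parse_sample(out))
        time.sleep(interval_s)
    return samples

def clock_range(samples, gpu_index=0):
    """Lowest and highest clock seen for one GPU across all samples."""
    clocks = [sample[gpu_index][0] for sample in samples]
    return min(clocks), max(clocks)
```

Calling `clock_range(sample_clocks())` while a task is running would give a rough idea of how far the boost clock wanders, which may help explain why different tools report different numbers for the same card.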
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1767024
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1767035 - Posted: 23 Feb 2016, 3:03:31 UTC - in response to Message 1767024.  

Okay, I'll check it this evening but will probably hold off on doing any tinkering for now. If the card is at fault, it's certainly a rare event. I just realized that I had a copy of the stdoutdae.txt from that machine over here on my daily driver, so I checked to see when those two Invalid tasks ran. It turns out they ran back-to-back in the space of 20 seconds. There were also 3 other tasks from the 01ap11ab tape and 2 from 28oc15ab which ran about the same time. They also resulted in -9 overflows, but those 5 each ran on one of the GTX 660 GPUs. Four of the five had the same host with the Quadro K620 as primary wingman and all validated on the first try, so.........

For now, I think I'll just remain watchful for any more hiccups, since that card is successfully churning through about 70 tasks a night without any other apparent problems.
ID: 1767035
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1767037 - Posted: 23 Feb 2016, 3:05:45 UTC - in response to Message 1767035.  

Yes, don't forget the vague conversation I recall us having about soft-error. There be radioactive carbon in 'dem chips.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1767037
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1767043 - Posted: 23 Feb 2016, 3:29:59 UTC - in response to Message 1767037.  

> Yes, don't forget the vague conversation I recall us having about soft-error. There be radioactive carbon in 'dem chips.

Oh, yeah...can't forget those Phantom Triplets! That was on another card, on another machine, but certainly could be applicable. Now, where did I put that Geiger counter?? Hmmm.... ;^)
ID: 1767043
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1767054 - Posted: 23 Feb 2016, 4:58:45 UTC - in response to Message 1766982.  
Last modified: 23 Feb 2016, 5:03:33 UTC

Just a quick follow-up on the power/performance state, now that the box is up and running for the night. NVIDIA Inspector shows it as P0. In fact, it's the same for all 4 GPUs.

I also see that it shows the "Boost Clock" as 1254 MHz but the "Current Clock" at 1320 MHz. Looking again at GPU-Z, I see that it depends on which tab I look at as to whether I see 1255 MHz or 1320 MHz. And Open Hardware Monitor likes the 1320 MHz number. Sheesh!

EDIT: I forgot about Precision X. It also votes for 1320 MHz.
ID: 1767054
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1767059 - Posted: 23 Feb 2016, 5:13:43 UTC - in response to Message 1767054.  

Hmphhhh, sounds like both precision and inspector use the same techniques as CreditNew.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1767059
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1767060 - Posted: 23 Feb 2016, 5:28:36 UTC - in response to Message 1767059.  

> Hmphhhh, sounds like both precision and inspector use the same techniques as CreditNew.

Then I guess we're all DOOMED! ;^)
ID: 1767060


 