runaway AMD/openCL

Message boards : Number crunching : runaway AMD/openCL
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1471153 - Posted: 31 Jan 2014, 15:03:27 UTC

http://setiathome.berkeley.edu/workunit.php?wuid=1411661827

I got 2 spikes and a triplet. My wingie got 30 autocorrs. I looked at his machine and saw that it has a whole lot of inconclusives and invalids. I didn't look at every one of them, but all the ones I saw were OpenCL tasks. He also has a number of valids that look like crap but validated against other crap (I saw one where what looks like a real result was marked invalid because this box agreed with a 560Ti).
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1471153
Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1471213 - Posted: 31 Jan 2014, 17:35:46 UTC

Yep, that's the same host that I was using as the primary example in my "Two wrongs make a right" thread. And that's not the only machine doing this.

It's clearly an ongoing problem where 2 runaway ATI rigs are getting validations for false overflows with an Autocorr count of 30. And many times, a perfectly reasonable and probably accurate result from another host gets thrown out because of it. Apparently the project admins aren't much concerned with the science database getting polluted like this.
ID: 1471213
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1471261 - Posted: 31 Jan 2014, 19:14:40 UTC
Last modified: 31 Jan 2014, 19:16:36 UTC

Such hosts will be blocked from reaching the validator in the next app release. [An autocorr overflow will be invalid for GPU builds.]

I also suggest that everyone looking into the reported results note any similarities among such false overflows that could help to distinguish them from valid overflows.
Then we could implement some internal checking to block all suspicious results from reaching the validator and prevent pollution of the science database.

Regarding the server admins, I'm afraid they can't do anything but develop similar heuristics and embed them in the validator. So let's try to find such heuristics.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1471261
skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1471264 - Posted: 31 Jan 2014, 19:21:09 UTC

I've encountered several folks like this. Typically a kind PM to them works wonders. I'd like to think things can be solved a bit more quietly than by a public display of runaway PCs/GPUs.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1471264
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1471269 - Posted: 31 Jan 2014, 19:27:27 UTC - in response to Message 1471264.  
Last modified: 31 Jan 2014, 19:28:06 UTC

I've encountered several folks like this. Typically a kind PM to them works wonders. I'd like to think things can be solved a bit more quietly than by a public display of runaway PCs/GPUs.


Perhaps all such a host needs is a reboot (this is similar to the CUDA issue with false spike overflows that hits some CUDA hosts from time to time; a reboot usually helps).
But the fact that 2 broken hosts can validate against each other makes this issue much more dangerous than simple invalid results. Hence we need some measure to prevent such results from validating.

Currently I see no means of preventing such a GPU state, because no error is reported back to the app from the OpenCL runtime. Hence we need to teach the app or the validator to stop such false overflows on exit and abort reporting the result to the server.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1471269
betreger (Project Donor)
Joined: 29 Jun 99
Posts: 11361
Credit: 29,581,041
RAC: 66
United States
Message 1471270 - Posted: 31 Jan 2014, 19:28:56 UTC - in response to Message 1471264.  

Yes, a PM is in order when possible, but many of them are anonymous and don't wish to be helped or bothered.
ID: 1471270
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1471273 - Posted: 31 Jan 2014, 19:32:40 UTC
Last modified: 31 Jan 2014, 19:36:43 UTC

Link to host from another thread: https://setiathome.berkeley.edu/results.php?hostid=7090798&offset=0&show_names=0&state=5&appid=
Here we have false Pulse overflows from CUDA. Can they validate against each other?

[Among this host's current 63 valid results I spotted no overflows with 30 pulses. It's worth watching this host for a while to see whether its 30-pulse overflows are always invalid or can pass validation from time to time.]
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1471273
skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1471277 - Posted: 31 Jan 2014, 19:38:57 UTC

I agree that false overflows should never validate. In fact, I would have thought that an overflow would create an automatic resend of work to the next PC/GPU. Overflows and errors should never be allowed to validate.

As for the people we see rapid errors from, not all are anonymous. Most are active, unhidden users, and a PM should suffice to get them to act on their misbehaving PC.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1471277
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1471282 - Posted: 31 Jan 2014, 19:50:49 UTC - in response to Message 1471277.  
Last modified: 31 Jan 2014, 19:51:20 UTC

In fact, I would have thought that an overflow would create an automatic resend of work to the next PC/GPU.


The difficulty with that: there are fully legitimate, genuine overflows that arise simply from patterns in the data, not from any failure in data processing. Sending such overflows to other hosts will just confirm the overflow, and this could become a nearly endless process.

What we need is to distinguish probably valid overflows from probably false ones before the false ones reach the validator.
Hence some common patterns should be found by human volunteers; then those patterns can be coded as heuristics by the app developer (me, in the case of OpenCL, for example), and then the app will be able to do "CRC"-style checking on its own results so as not to report suspicious ones to the validator.

One heuristic I'm going to implement now is: if a task overflowed only with autocorrs, then it's highly probable that it's a false overflow. Hence abort such a result.
This will block most of such hosts' results from reaching the validator.

This will not work if there is, for example, 1 spike and 29 autocorrs.
So the next advance in the heuristic could be: abort all overflows where the autocorr count >= X. Let's find such an X. If it is too low, the app will abort genuine overflows as well; if too high, some false overflows will still pass.
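
A minimal Python sketch of the two heuristics above, for illustration only. The `SignalCounts` type, its field names, and the overflow limit of 30 total signals are assumptions made for the example, not the actual app code; X appears as the `max_autocorrs` parameter.

```python
from dataclasses import dataclass

# Assumption: a task is reported as an overflow once 30 signals in total are found.
OVERFLOW_LIMIT = 30


@dataclass
class SignalCounts:
    """Hypothetical per-result signal counts (field names are illustrative)."""
    spikes: int = 0
    autocorrs: int = 0
    pulses: int = 0
    triplets: int = 0
    gaussians: int = 0

    @property
    def total(self) -> int:
        return (self.spikes + self.autocorrs + self.pulses
                + self.triplets + self.gaussians)


def is_overflow(counts: SignalCounts) -> bool:
    """True if the result hit the (assumed) overflow limit."""
    return counts.total >= OVERFLOW_LIMIT


def is_suspicious_overflow(counts: SignalCounts,
                           max_autocorrs: int = OVERFLOW_LIMIT) -> bool:
    """True if the result overflowed and has at least `max_autocorrs` autocorrs.

    With max_autocorrs == OVERFLOW_LIMIT this is the "overflowed only with
    autocorrs" rule; a lower value is the "autocorr count >= X" rule.
    """
    return is_overflow(counts) and counts.autocorrs >= max_autocorrs
```

With X left at 30, only pure autocorr overflows are caught, so a result with 1 spike and 29 autocorrs slips through. Lowering X catches that case at the cost of possibly aborting genuine overflows, which is exactly the trade-off to be tuned.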
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1471282
skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1471291 - Posted: 31 Jan 2014, 20:00:13 UTC - in response to Message 1471282.  

In fact, I would have thought that an overflow would create an automatic resend of work to the next PC/GPU.


The difficulty with that: there are fully legitimate, genuine overflows that arise simply from patterns in the data, not from any failure in data processing. Sending such overflows to other hosts will just confirm the overflow, and this could become a nearly endless process.

What we need is to distinguish probably valid overflows from probably false ones before the false ones reach the validator.
Hence some common patterns should be found by human volunteers; then those patterns can be coded as heuristics by the app developer (me, in the case of OpenCL, for example), and then the app will be able to do "CRC"-style checking on its own results so as not to report suspicious ones to the validator.

One heuristic I'm going to implement now is: if a task overflowed only with autocorrs, then it's highly probable that it's a false overflow. Hence abort such a result.
This will block most of such hosts' results from reaching the validator.

This will not work if there is, for example, 1 spike and 29 autocorrs.
So the next advance in the heuristic could be: abort all overflows where the autocorr count >= X. Let's find such an X. If it is too low, the app will abort genuine overflows as well; if too high, some false overflows will still pass.
Isn't that the reason to send these WUs out a maximum of 10 times when errors are returned? The rule is already there to make sure the bad WUs aren't part of the good data. Why not use what is already set up instead of assuming a WU is bad just because 2 CPUs/GPUs send in bad results?


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1471291
Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1471292 - Posted: 31 Jan 2014, 20:01:39 UTC

Joe Segur made two suggestions in the "Two wrongs make a right" thread. First:
Autocorrs with peak powers over 100 are so unlikely that might be considered as a sanity check level. Similar "too good to be true" levels for other signal types would be possible.

All of the validated overflows for 6062303 that I've looked at have those high peak powers, so perhaps that would be feasible.

Second:
it would be good to insert a check in the Validator to not accept overflowed results if there's a non-overflow one.

This is workable if either the _0 or _1 task is non-overflow. Unfortunately, what seems to also happen frequently is that the _0 and _1 are both false overflows and validate against each other before a third opinion can even be sought.
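
Both suggestions can be sketched in a few lines of Python. This is only an illustration under assumptions: the result records, the `overflow` field, and the function names are hypothetical, and only the peak-power level of 100 comes from the suggestion itself.

```python
# Level taken from the suggestion quoted above; everything else is illustrative.
AUTOCORR_PEAK_POWER_LIMIT = 100.0


def fails_peak_power_sanity(autocorr_peak_powers,
                            limit=AUTOCORR_PEAK_POWER_LIMIT):
    """True if any reported autocorr has a "too good to be true" peak power."""
    return any(power > limit for power in autocorr_peak_powers)


def eligible_for_validation(results):
    """Drop overflowed results whenever a non-overflow result is present.

    `results` is a list of dicts with a boolean 'overflow' key (hypothetical
    record layout). If every result overflowed, all of them stay eligible,
    which is the weakness described above for _0/_1 pairs.
    """
    non_overflow = [r for r in results if not r["overflow"]]
    return non_overflow if non_overflow else list(results)
```

Note that the second check cannot help when the _0 and _1 tasks are both false overflows, since there is no non-overflow result to prefer; the peak-power check does not have that limitation because it looks at each result on its own.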
ID: 1471292
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1471297 - Posted: 31 Jan 2014, 20:08:41 UTC - in response to Message 1471291.  

Isn't that the reason to send these WUs out a maximum of 10 times when errors are returned? The rule is already there to make sure the bad WUs aren't part of the good data. Why not use what is already set up instead of assuming a WU is bad just because 2 CPUs/GPUs send in bad results?

No, it's not.
There are valid overflows that do not need to be processed 10 times or marked as invalid.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1471297
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1471298 - Posted: 31 Jan 2014, 20:10:30 UTC - in response to Message 1471292.  

Joe Segur made two suggestions in the "Two wrongs make a right" thread. First:
Autocorrs with peak powers over 100 are so unlikely that might be considered as a sanity check level. Similar "too good to be true" levels for other signal types would be possible.

All of the validated overflows for 6062303 that I've looked at have those high peak powers, so perhaps that would be feasible.

Second:
it would be good to insert a check in the Validator to not accept overflowed results if there's a non-overflow one.

This is workable if either the _0 or _1 task is non-overflow. Unfortunately, what seems to also happen frequently is that the _0 and _1 are both false overflows and validate against each other before a third opinion can even be sought.


Thanks!
I either missed that proposal before or forgot about it. Well, the first one I can implement as an app heuristic, while the second one is out of my control and falls to the Eric & Co. department :)
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1471298
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1471306 - Posted: 31 Jan 2014, 20:18:48 UTC - in response to Message 1471282.  
Last modified: 31 Jan 2014, 20:20:10 UTC

In fact, I would have thought that an overflow would create an automatic resend of work to the next PC/GPU.


The difficulty with that: there are fully legitimate, genuine overflows that arise simply from patterns in the data, not from any failure in data processing. Sending such overflows to other hosts will just confirm the overflow, and this could become a nearly endless process.

What we need is to distinguish probably valid overflows from probably false ones before the false ones reach the validator.
Hence some common patterns should be found by human volunteers; then those patterns can be coded as heuristics by the app developer (me, in the case of OpenCL, for example), and then the app will be able to do "CRC"-style checking on its own results so as not to report suspicious ones to the validator.

One heuristic I'm going to implement now is: if a task overflowed only with autocorrs, then it's highly probable that it's a false overflow. Hence abort such a result.
This will block most of such hosts' results from reaching the validator.

This will not work if there is, for example, 1 spike and 29 autocorrs.
So the next advance in the heuristic could be: abort all overflows where the autocorr count >= X. Let's find such an X. If it is too low, the app will abort genuine overflows as well; if too high, some false overflows will still pass.

Someone would need to code a search program and search for values that are typical of false overflows. Personally, I don't recall any valid results where the spike count was Zero and some other value was 30. However, a search program might find a few. Sounds like a reasonable endeavor.
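
Such a search program could start out as simple as the Python sketch below. It is only a sketch: the record layout (`counts`, `state`), the limit of 30, and the sample rows are made up for illustration, and fetching real results from the project is left out entirely.

```python
from collections import Counter

LIMIT = 30  # assumed overflow limit


def saturated_type(counts, limit=LIMIT):
    """Return the signal type that reaches the limit on its own, or None."""
    for kind, n in counts.items():
        if n >= limit:
            return kind
    return None


def tally_zero_spike_overflows(results, limit=LIMIT):
    """Count results with zero spikes and one other type at the limit,
    grouped by (signal type, validation state)."""
    tally = Counter()
    for r in results:
        counts = r["counts"]
        kind = saturated_type(counts, limit)
        if kind is not None and counts.get("spikes", 0) == 0:
            tally[(kind, r["state"])] += 1
    return tally


# Made-up sample rows, only to show the expected input shape.
sample = [
    {"counts": {"spikes": 0, "autocorrs": 30, "pulses": 0, "triplets": 0}, "state": "valid"},
    {"counts": {"spikes": 0, "autocorrs": 30, "pulses": 0, "triplets": 0}, "state": "invalid"},
    {"counts": {"spikes": 30, "autocorrs": 0, "pulses": 0, "triplets": 0}, "state": "valid"},
    {"counts": {"spikes": 2, "autocorrs": 0, "pulses": 0, "triplets": 1}, "state": "valid"},
]

for (kind, state), n in sorted(tally_zero_spike_overflows(sample).items()):
    print(kind, state, n)
```

Run over a large set of real results, a tally like this would show how often a zero-spike, single-type overflow ends up valid, which is the number needed to judge whether the pattern is a safe rejection rule.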
ID: 1471306
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1471316 - Posted: 31 Jan 2014, 20:43:53 UTC - in response to Message 1471306.  

Well, a peak_power < 100 limit will cut such a host off completely.
A good enough heuristic for now.
The next big task would be to update the GPU apps to use it.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1471316
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1471320 - Posted: 31 Jan 2014, 20:49:01 UTC

What about runaway GPU AstroPulse hosts?
Any flavour, from ATI to Intel, NV included. Any links?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1471320
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1471343 - Posted: 31 Jan 2014, 21:33:50 UTC - in response to Message 1471320.  

What about runaway GPU AstroPulse hosts?
Any flavour, from ATI to Intel, NV included. Any links?

It would be nice if this were corrected;
Known issues: - For overflowed tasks found signal sequence not always match CPU version.

So repeated results such as this could be avoided; http://setiathome.berkeley.edu/workunit.php?wuid=1415148294
ID: 1471343
skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1471344 - Posted: 31 Jan 2014, 21:34:58 UTC - in response to Message 1471320.  

What about runaway GPU AstroPulse hosts?
Any flavour, from ATI to Intel, NV included. Any links?

I've only seen the MB overflowing.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1471344
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1471357 - Posted: 31 Jan 2014, 21:48:02 UTC - in response to Message 1471343.  
Last modified: 31 Jan 2014, 21:54:52 UTC

What about runaway GPU AstroPulse hosts?
Any flavour, from ATI to Intel, NV included. Any links?

It would be nice if this were corrected;
Known issues: - For overflowed tasks found signal sequence not always match CPU version.

So repeated results such as this could be avoided; http://setiathome.berkeley.edu/workunit.php?wuid=1415148294


GPU result reporting mimics the CPU builds closely. Most probably such issues are due to numeric variations in results between different builds. With 60 signals to compare, the probability of some mismatch increases greatly.
So far I have no clear test case proving that something is wrong with the signal reporting order per se, so that part of the readme is still unverified.

EDIT: what volunteers could do to help with this issue:
1) Save a task with incompletely validated results for offline testing.
2) Generate a reference result with the CPU app (an opt app can be used instead of the stock one).
3) Do an offline GPU run to prove that the issue really exists and is reproducible.
4) Increase the limit on signals to report by editing the task's header.
5) Rerun the CPU/GPU builds and see which results differ now.
With problems in the signal reporting order, an increase in the number of reported signals usually makes the results strongly valid, while a numerical difference will remain unaffected.
6) If you find a task that fails validation against the CPU with a 30/30 limit and passes validation with, let's say, a 60/60 limit, report that task to the Lunatics crew for further investigation.
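
The reasoning behind steps 4 to 6 can be shown with a toy Python model. Everything here is an assumption for illustration: the signal tuples, the 1% tolerance, and `results_match` are hypothetical and are not the project validator's actual comparison.

```python
import math


def report(found, limit):
    """Keep only the first `limit` signals, in the order the app found them."""
    return found[:limit]


def results_match(a, b, rel_tol=0.01):
    """Hypothetical comparison: same signal ids, powers within rel_tol."""
    pa, pb = dict(a), dict(b)
    return pa.keys() == pb.keys() and all(
        math.isclose(pa[k], pb[k], rel_tol=rel_tol) for k in pa
    )


# Toy data: (signal id, power). Values are made up.
cpu = [("s1", 10.0), ("s2", 11.0), ("s3", 12.0), ("s4", 13.0)]
gpu_reordered = [("s1", 10.0), ("s3", 12.0), ("s2", 11.0), ("s4", 13.0)]
gpu_numeric = [("s1", 10.9), ("s2", 11.0), ("s3", 12.0), ("s4", 13.0)]

for limit in (2, 4):
    print(limit,
          results_match(report(cpu, limit), report(gpu_reordered, limit)),
          results_match(report(cpu, limit), report(gpu_numeric, limit)))
```

In this model the reordered GPU run disagrees with the CPU run only while the limit truncates the list, and agrees once the limit is raised, whereas the run with a numerically different power disagrees at either limit. That is the distinction the 30/30 versus 60/60 rerun is meant to expose.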
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1471357
Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1471377 - Posted: 31 Jan 2014, 22:14:14 UTC - in response to Message 1471282.  

This will not work if there is 1 spike for example and 29 autocorrs.

Here's an example of such a situation, WU 1399455335, where two runaway ATI GPUs reported an Autocorr count of 29 and a Triplet count of 1. They validated against each other while a machine reporting a Triplet count of 4 (and all other counts = 0) was marked Invalid. That is the only Invalid showing for that machine (7147221), so I'm inclined to believe that its results are the reliable ones that actually should have ended up in the science database.

This is the first one of these that I've seen where the Autocorr count was less than 30, but there are probably plenty more out there. You can probably create an entire chain of machines that are doing this. Start with a host like 6062303. Check its recent WUs which have validated with an Autocorr overflow. There are two that I see at the moment:

http://setiathome.berkeley.edu/workunit.php?wuid=1415996526
http://setiathome.berkeley.edu/workunit.php?wuid=1416016634

Since it takes two to tango, check the host for the wingman in each WU that had the same Autocorr overflow. In the first example that would be 5744165. Perform the same exercise for that machine (which is how I identified the example at the beginning of this post), and so on, and so on. I suspect it could produce a fairly substantial list of hosts feeding this problem (all running some flavor of OpenCL ATI app).
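
The "and so on, and so on" exercise is a breadth-first walk over hosts. A minimal Python sketch, assuming the wingman relationships have already been collected by hand into a dictionary (the function name, the mapping, and the toy host numbers are all hypothetical):

```python
from collections import deque


def find_runaway_chain(start_host, overflow_wingmen):
    """Return every host reachable from start_host through WUs where both
    hosts validated with the same autocorr overflow.

    `overflow_wingmen` maps a host id to the host ids it validated against
    in such WUs (hypothetical, hand-collected data).
    """
    seen = {start_host}
    queue = deque([start_host])
    while queue:
        host = queue.popleft()
        for wingman in overflow_wingmen.get(host, ()):
            if wingman not in seen:
                seen.add(wingman)
                queue.append(wingman)
    return seen


# Toy ids standing in for real host numbers.
wingmen = {1: [2], 2: [1, 3], 3: [], 4: [5]}
print(sorted(find_runaway_chain(1, wingmen)))
```

The `seen` set keeps the walk from looping forever when two hosts keep validating against each other, which is the normal case here.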

The most effective short-term step that the project admins could take to reduce the likelihood of this problem occurring (although not eliminate it), is to do what's been suggested many, many times in quite a few threads on the message boards. That is to choke off the supply of tasks that these runaway rigs are receiving. Each of them is still receiving hundreds of tasks per day. The Application Details for host 6062303 shows "Number of tasks today" at 272 for "opencl_ati_cat132" and 267 for "opencl_ati5_cat132". All or most of them will overflow, and those that overflow due to the Autocorr count are at risk for being paired up with another host doing exactly the same thing! Simply choking off the supply of tasks for machines that produce mostly Invalid results in the same manner that they get choked off for producing Errors, would vastly reduce the opportunity for those machines to create mischief with the scientific accuracy of the project!
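
As a rough illustration of that idea, here is a Python sketch in which Invalid results shrink a host's daily task quota the same way Errors do. This is not BOINC's actual scheduler logic; the halving and doubling rule, the outcome names, and the limits are assumptions chosen for the example.

```python
def update_daily_quota(quota, outcome, max_quota=100, min_quota=1):
    """Return a host's new daily task quota after one reported result.

    Assumed rule: errors and invalids halve the quota, valid results
    double it, anything else (e.g. still pending) leaves it unchanged.
    """
    if outcome in ("error", "invalid"):
        return max(min_quota, quota // 2)
    if outcome == "valid":
        return min(max_quota, quota * 2)
    return quota


quota = 100
for _ in range(5):
    quota = update_daily_quota(quota, "invalid")
print(quota)
```

Under a rule like this, a host returning mostly Invalid results would quickly drop to a handful of tasks per day instead of hundreds, while a single valid result starts restoring its quota.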
ID: 1471377


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.