Astropulse Errors II-Optimized version 5.03!

Profile Robi

Send message
Joined: 24 Oct 00
Posts: 33
Credit: 886,890
RAC: 1
United States
Message 881608 - Posted: 2 Apr 2009, 9:30:15 UTC - in response to Message 880855.  

For awhile the number of Astropulse v5 results needed to get a valid pair was huge, but it has now gotten down to about 3.18 (based on the "Results waiting for db purging"/"Workunits waiting for db purging" ratio 300861/94389). It would be nice if that were closer to 2, and it may get down to the ~2.7 that old Astropulse had. I think it's a matter of glitches on hosts affecting a larger percentage of AP than MB simply because of the longer crunch time. If a host of typical speed glitches on average once a day and is doing AP, almost all its results will be affected, but if it were doing MB only a few of the larger number of results would be affected.

Unfortunately, the quota system is very ineffective for hosts of typical speed doing AP work. If a host takes one day or more to do an AP WU, the quota system never actually affects it since it always will provide one WU/day per CPU. We simply have to hope the user notices the host is producing errors rather than successful results, and isn't earning any credit.
                                                              Joe
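The arithmetic in the passage above can be sketched roughly as follows. The ratio uses the two counts quoted in the post; the glitch model is only an assumption (glitches arriving as a Poisson process at one per day, with made-up task lengths of about a day for AP and about two hours for MB), not anything the project publishes.

```python
import math

def results_per_workunit(results_waiting: int, workunits_waiting: int) -> float:
    """Average number of results issued per workunit waiting for db purging."""
    return results_waiting / workunits_waiting

def fraction_affected(glitches_per_day: float, task_days: float) -> float:
    """Chance that at least one glitch lands inside a task of the given length,
    assuming glitches arrive as a Poisson process (illustrative assumption)."""
    return 1.0 - math.exp(-glitches_per_day * task_days)

ratio = results_per_workunit(300861, 94389)
ap = fraction_affected(1.0, 1.0)        # hypothetical: an AP task takes ~1 day
mb = fraction_affected(1.0, 2.0 / 24)   # hypothetical: an MB task takes ~2 hours

print(f"results per WU: {ratio:.3f}")
print(f"AP tasks hit by a glitch: {ap:.0%}, MB tasks: {mb:.0%}")
```

Under those assumptions a much larger share of AP tasks than MB tasks overlaps a glitch, which is the effect described above; slower hosts with multi-day AP run times would push the AP figure higher still.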

Yeah, I even posted a color picture on beta some time ago to illustrate that a heavily overclocked host can do MB pretty well, failing only occasionally, but will be completely unusable for AP, failing on each task... Unfortunately, an AP task can't be split.


I wonder if the system checks for hosts that return invalid results or computation errors on every task, and blocks the host for those tasks.
I.e., if every task of one type (APv5 or MB), let's say out of 10, returns with a computation error or invalid result, then stop sending those tasks to that host and mark it on the user account page as "project AP (APv5 or MB) blocked due to excessive invalid returns". The user could unblock it if he/she wants and/or believes that he/she has corrected the source of the errors.
Robi
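
The blocking idea in this post could be sketched like this. Everything here is hypothetical (the class name, the per-application keys, the reset-on-unblock behaviour); it is not how the BOINC scheduler works, just one way the "10 bad results in a row" rule might look.

```python
from collections import defaultdict, deque

WINDOW = 10  # "let's say out of 10" from the post above

class HostErrorTracker:
    """Hypothetical per-host, per-application block list; not part of BOINC."""

    def __init__(self, window: int = WINDOW):
        self.window = window
        # Keep only the most recent `window` outcomes for each (host, app).
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.blocked = set()

    def report(self, host_id: int, app: str, ok: bool) -> None:
        key = (host_id, app)
        self.history[key].append(ok)
        recent = self.history[key]
        # Block only when the window is full and every result in it was bad.
        if len(recent) == self.window and not any(recent):
            self.blocked.add(key)

    def may_send(self, host_id: int, app: str) -> bool:
        return (host_id, app) not in self.blocked

    def unblock(self, host_id: int, app: str) -> None:
        """User-initiated reset after the host has been fixed."""
        key = (host_id, app)
        self.blocked.discard(key)
        self.history[key].clear()

tracker = HostErrorTracker()
for _ in range(10):
    tracker.report(host_id=1, app="AP", ok=False)   # ten AP failures in a row
tracker.report(host_id=1, app="MB", ok=True)         # MB is tracked separately
```

Tracking each application separately means a host that fails every AP task could still receive MB work, which matches the overclocking case mentioned in the quote.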
ID: 881608 · Report as offensive
Cosmic_Ocean
Avatar

Send message
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 881692 - Posted: 2 Apr 2009, 15:21:11 UTC - in response to Message 881608.  

For awhile the number of Astropulse v5 results needed to get a valid pair was huge, but it has now gotten down to about 3.18 (based on the "Results waiting for db purging"/"Workunits waiting for db purging" ratio 300861/94389). It would be nice if that were closer to 2, and it may get down to the ~2.7 that old Astropulse had. I think it's a matter of glitches on hosts affecting a larger percentage of AP than MB simply because of the longer crunch time. If a host of typical speed glitches on average once a day and is doing AP, almost all its results will be affected, but if it were doing MB only a few of the larger number of results would be affected.

Unfortunately, the quota system is very ineffective for hosts of typical speed doing AP work. If a host takes one day or more to do an AP WU, the quota system never actually affects it since it always will provide one WU/day per CPU. We simply have to hope the user notices the host is producing errors rather than successful results, and isn't earning any credit.
                                                              Joe

Yeah, I even posted a color picture on beta some time ago to illustrate that a heavily overclocked host can do MB pretty well, failing only occasionally, but will be completely unusable for AP, failing on each task... Unfortunately, an AP task can't be split.


I wonder if the system checks for hosts that return invalid results or computation errors on every task, and blocks the host for those tasks.
I.e., if every task of one type (APv5 or MB), let's say out of 10, returns with a computation error or invalid result, then stop sending those tasks to that host and mark it on the user account page as "project AP (APv5 or MB) blocked due to excessive invalid returns". The user could unblock it if he/she wants and/or believes that he/she has corrected the source of the errors.

The only blocking it does is to reduce the daily quota by one per CPU. The max is 100/CPU/day, and reporting a compute error (aborts, missed deadlines, and download errors count in this case, too) will reduce the quota to 99/CPU/day. The quota system is set up so that every bad task reduces the quota by one, and every good task doubles it.

I still think the system needs to be revised to be +2 instead of *2. Reporting 50 errors can be erased by reporting one good task (100-50 = 50, then *2 is 100 again). Something seems flawed in that logic, to me anyway.
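
To make the *2 versus +2 comparison concrete, here is a small sketch of the quota rule as described in this post. The function name and the floor of 1 (taken from the "one WU/day per CPU" remark quoted above) are assumptions for illustration; this is not the server's actual code.

```python
MAX_QUOTA = 100  # per CPU per day, as stated above

def apply_results(quota: int, results: list[bool], reward=lambda q: q * 2) -> int:
    """Walk a sequence of results (True = good, False = bad) and return the quota.
    A bad result subtracts one; a good result applies `reward`.
    The floor of 1 is an assumption based on the 'one WU/day per CPU' remark."""
    for ok in results:
        quota = reward(quota) if ok else quota - 1
        quota = max(1, min(MAX_QUOTA, quota))
    return quota

errors_then_one_good = [False] * 50 + [True]

doubled = apply_results(MAX_QUOTA, errors_then_one_good)
plus_two = apply_results(MAX_QUOTA, errors_then_one_good, reward=lambda q: q + 2)

print(doubled, plus_two)  # prints "100 52"
```

With doubling, one good result wipes out all 50 errors; with +2 the host would be at 52 and would need another 24 good results to get back to 100.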
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 881692 · Report as offensive
Profile Ananas
Volunteer tester

Send message
Joined: 14 Dec 01
Posts: 195
Credit: 2,503,252
RAC: 0
Germany
Message 881913 - Posted: 3 Apr 2009, 7:53:10 UTC
Last modified: 3 Apr 2009, 8:06:42 UTC

wuid=423714931

The stock application says : Found 30 single pulses and 30 repeating pulses, exiting.

The optimized one might be missing this exit criterion, so the status is now "Completed, validation inconclusive".

p.s.: This seems to be a very rare condition; it's the first one of those I have seen. The next result will be done with a stock client, so the result with the 30-pulse plausibility check will go into the database.
ID: 881913 · Report as offensive
gomeyer
Volunteer tester

Send message
Joined: 21 May 99
Posts: 488
Credit: 50,370,425
RAC: 0
United States
Message 882002 - Posted: 3 Apr 2009, 16:02:29 UTC
Last modified: 3 Apr 2009, 16:58:35 UTC

One could be a fluke, two could be coincidence, but three is a pattern. . .

http://setiathome.berkeley.edu/workunit.php?wuid=417989367
This is the third time this has happened in as many days. Unfortunately the previous two have been deleted already.
(Note to whomever, the [url] bbcode tag seems to be broken at the moment.)

All three followed exactly the same scenario:

- Task #2 completed within deadline using the stock app.
- Task #1 went past the deadline causing a third one to be sent. #1 was then returned past deadline, also using the stock app and validated with #2.
- Task #3 completed using optimized app but failed to compare with the previous two.

I don’t see any way to visually compare the results to see if the third was truly invalid, but since I’ve had no other failures on this machine, nor the previous two which were run on different machines, I’m guessing it should have validated. So this is probably a validator problem and not an optimized app or machine problem.

As I said in a previous post, Stuff Happens. But I have to admit that three within three days is getting just a bit distressing.

[edited to correct task numbers.]
ID: 882002 · Report as offensive
Profile Ananas
Volunteer tester

Send message
Joined: 14 Dec 01
Posts: 195
Credit: 2,503,252
RAC: 0
Germany
Message 882025 - Posted: 3 Apr 2009, 18:04:35 UTC - in response to Message 882002.  
Last modified: 3 Apr 2009, 18:08:24 UTC

One could be a fluke, two could be coincidence, but three is a pattern. . .


The idea that it might have tried to validate the third result against already purged ones crossed my mind earlier, as I saw the same facts: one result was late, a third one was sent out, but the late one was returned before the replacement could be returned.

But that didn't match my latest one - the one with the 30 pulses limit.
ID: 882025 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Send message
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 882047 - Posted: 3 Apr 2009, 19:38:24 UTC - in response to Message 881913.  

Ananas wrote:
wuid=423714931

The stock application says : Found 30 single pulses and 30 repeating pulses, exiting.

The optimized one might be missing this exit criterion, so the status is now "Completed, validation inconclusive".

p.s.: This seems to be a very rare condition; it's the first one of those I have seen. The next result will be done with a stock client, so the result with the 30-pulse plausibility check will go into the database.

The linked WU doesn't have any result with a 30/30 exit; did you perhaps choose the wrong one? It did get inconclusive validation, so it is certainly worth watching.

The optimized apps certainly do have that early exit; optimizers don't discard anything which finishes work sooner.


gomeyer wrote:
All three followed exactly the same scenario:

- Task #2 completed within deadline using the stock app.
- Task #1 went past the deadline causing a third one to be sent. #1 was then returned past deadline, also using the stock app and validated with #2.
- Task #3 completed using optimized app but failed to compare with the previous two.

I don’t see any way to visually compare the results to see if the third was truly invalid, but since I’ve had no other failures on this machine, nor the previous two which were run on different machines, I’m guessing it should have validated. So this is probably a validator problem and not an optimized app or machine problem.

That's a very interesting observation.

The check_pair() validation logic used when there is already a canonical result might have a flaw I can't spot, but I suspect that it simply couldn't find the canonical result to do the comparison. There's no specific error code for that case, so the web page would simply give the invalid indication you see. There is a server log message "Couldn't create canonical AP_RESULT object" which project staff could look for.
                                                                Joe
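
A rough illustration of the suspicion above: if the canonical result can't be loaded, the late result ends up with the same "invalid" outcome as a genuine mismatch. The function and storage names below are hypothetical stand-ins, not the project's validator source; only the log message text is taken from the post.

```python
from typing import Optional

def load_result(path: str, storage: dict) -> Optional[list]:
    """Hypothetical loader: returns None when the file has been purged."""
    return storage.get(path)

def check_pair_sketch(new_path: str, canonical_path: str, storage: dict) -> str:
    """Illustrative stand-in for checking a late result against a canonical one."""
    canonical = load_result(canonical_path, storage)
    if canonical is None:
        # No distinct outcome exists for this case, so the caller only sees "invalid".
        print("Couldn't create canonical AP_RESULT object")
        return "invalid"
    new = load_result(new_path, storage)
    return "valid" if new == canonical else "invalid"

signals = [(1.0, 2.5), (3.0, 4.5)]                  # made-up signal list
storage = {"wu_2": signals, "wu_3": list(signals)}  # identical result files

before_purge = check_pair_sketch("wu_3", "wu_2", storage)
del storage["wu_2"]                                  # canonical file deleted
after_purge = check_pair_sketch("wu_3", "wu_2", storage)

print(before_purge, after_purge)
```

In this sketch the same result file is "valid" while the canonical file exists and "invalid" once it is gone, with only the server log showing the difference.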
ID: 882047 · Report as offensive
Profile Ananas
Volunteer tester

Send message
Joined: 14 Dec 01
Posts: 195
Credit: 2,503,252
RAC: 0
Germany
Message 882097 - Posted: 3 Apr 2009, 22:02:45 UTC - in response to Message 882047.  
Last modified: 3 Apr 2009, 22:18:38 UTC

Ananas wrote:
wuid=423714931

The stock application says : Found 30 single pulses and 30 repeating pulses, exiting.

The optimized one might be missing this exit criterion, so the status is now "Completed, validation inconclusive".

p.s.: This seems to be a very rare condition; it's the first one of those I have seen. The next result will be done with a stock client, so the result with the 30-pulse plausibility check will go into the database.

The linked WU doesn't have any result with a 30/30 exit; did you perhaps choose the wrong one? It did get inconclusive validation, so it is certainly worth watching.
....


Oops, I have 2 of those inconclusive ones?

wuid=427650464 is the 30/30 one, sorry - my fault :-/

p.s.: It seems to be a problem with the host running the stock app. Have a look at this:

hostid=4821903

One more p.s.:

The other Linux host of the same cruncher has the same problem: lots of results that failed validation.
ID: 882097 · Report as offensive
gomeyer
Volunteer tester

Send message
Joined: 21 May 99
Posts: 488
Credit: 50,370,425
RAC: 0
United States
Message 882127 - Posted: 4 Apr 2009, 0:08:06 UTC - in response to Message 882047.  


gomeyer wrote:
All three followed exactly the same scenario:

- Task #2 completed within deadline using the stock app.
- Task #1 went past the deadline causing a third one to be sent. #1 was then returned past deadline, also using the stock app and validated with #2.
- Task #3 completed using optimized app but failed to compare with the previous two.

I don’t see any way to visually compare the results to see if the third was truly invalid, but since I’ve had no other failures on this machine, nor the previous two which were run on different machines, I’m guessing it should have validated. So this is probably a validator problem and not an optimized app or machine problem.

That's a very interesting observation.

The check_pair() validation logic used when there is already a canonical result might have a flaw I can't spot, but I suspect that it simply couldn't find the canonical result to do the comparison. There's no specific error code for that case, so the web page would simply give the invalid indication you see. There is a server log message "Couldn't create canonical AP_RESULT object" which project staff could look for.
                                                                Joe

I guess what worries me as much as anything else is how many of these are happening but not being spotted. Not just mine but everyone's.

It was just by chance that I found the first of the three I mentioned; I returned a WU and happened to notice no corresponding increase in either Total Credit or Pending. I'm now checking returned results more carefully.

As was said in another thread, the science is not being lost, only time and credit.
ID: 882127 · Report as offensive
gomeyer
Volunteer tester

Send message
Joined: 21 May 99
Posts: 488
Credit: 50,370,425
RAC: 0
United States
Message 882630 - Posted: 5 Apr 2009, 22:43:12 UTC
Last modified: 5 Apr 2009, 22:44:42 UTC

OK, This is getting friggin ridiculous. That's 4 in 5 days. Again it was my WU going against 2 stock apps one of which had gone overdue, just as the previous ones did.

Can someone with the ear of the administration ask if this can be looked into?
ID: 882630 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Send message
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 883289 - Posted: 8 Apr 2009, 4:21:49 UTC - in response to Message 882630.  

OK, This is getting friggin ridiculous. That's 4 in 5 days. Again it was my WU going against 2 stock apps one of which had gone overdue, just as the previous ones did.

Can someone with the ear of the administration ask if this can be looked into?

I've just sent an email, sorry for the delay.

If Josh or Eric or Jeff do take a look, it would be very convenient to have a fresh case so the database records haven't been purged. Even better would be a case which matches the pattern but which your host hasn't quite finished yet; in that case the canonical result file should still be available so your result file can be compared.
                                                               Joe
ID: 883289 · Report as offensive
gomeyer
Volunteer tester

Send message
Joined: 21 May 99
Posts: 488
Credit: 50,370,425
RAC: 0
United States
Message 883363 - Posted: 8 Apr 2009, 13:08:18 UTC - in response to Message 883289.  
Last modified: 8 Apr 2009, 14:08:10 UTC

OK, This is getting friggin ridiculous. That's 4 in 5 days. Again it was my WU going against 2 stock apps one of which had gone overdue, just as the previous ones did.

Can someone with the ear of the administration ask if this can be looked into?

I've just sent an email, sorry for the delay.

If Josh or Eric or Jeff do take a look, it would be very convenient to have a fresh case so the database records haven't been purged. Even better would be a case which matches the pattern but which your host hasn't quite finished yet; in that case the canonical result file should still be available so your result file can be compared.
                                                               Joe

Thanks for sending the email Joe.
If/when this happens again I'll certainly see it after the fact and will report it here immediately. Finding one in advance of a zero credit is a bit time consuming, but I will start looking as time allows.

[edit]Ask and ye shall receive.
Workunit 419896647 fits the pattern except that it was the _1 task that went over instead of _0. Don't know if that will matter tho'.

The task should complete within the next 24 hours if it is run in normal order. I'll wait to report work from that machine, just let me know.
ID: 883363 · Report as offensive
gomeyer
Volunteer tester

Send message
Joined: 21 May 99
Posts: 488
Credit: 50,370,425
RAC: 0
United States
Message 883419 - Posted: 8 Apr 2009, 16:06:55 UTC
Last modified: 8 Apr 2009, 16:11:42 UTC

These are turning out to be way too easy to find. Here are two more . . .

419755968

This is an interesting variation on the theme. _0 and _1 aborted for other reasons, then _2 and _3 validated but _2 was late. I'm _4 . . .
419662989

All three of the ones I've found should begin running on their own within about 24 hours or less, completing 12-15 hours later. I can of course start them early or suspend them before completion if that will help Berkeley get a look at what is happening. Just let me know how it should be handled.

BTW, on a separate but related subject, Eric’s script to credit successful WUs orphaned by excess errors may not have been re-enabled after last week’s server woes. I lost one that was returned 30 Mar 22:00 UTC.
ID: 883419 · Report as offensive
Terror Australis
Volunteer tester

Send message
Joined: 14 Feb 04
Posts: 1817
Credit: 262,693,308
RAC: 44
Australia
Message 883431 - Posted: 8 Apr 2009, 16:39:49 UTC

Only Getting AP V5.03 Units
Since I installed the optimised AP apps I find that on the computers with AP enabled I'm only downloading AP units and no MB.

I notice that further up the thread there were others with the same problem but no solution was posted. I've been getting around the problem by location switching but I'd rather not have to micromanage things.

What do I do to get an even mix of AP and MB units with no manual intervention?

Brodo
ID: 883431 · Report as offensive
Cosmic_Ocean
Avatar

Send message
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 883443 - Posted: 8 Apr 2009, 17:07:04 UTC - in response to Message 883431.  

Only Getting AP V5.03 Units
Since I installed the optimised AP apps I find that on the computers with AP enabled I'm only downloading AP units and no MB.

I notice that further up the thread there were others with the same problem but no solution was posted. I've been getting around the problem by location switching but I'd rather not have to micromanage things.

What do I do to get an even mix of AP and MB units with no manual intervention?

Brodo

I have a feeling that's something that needs to be addressed server-side, but it hasn't been looked into or fixed yet. During the outage issues last week, I got a bunch of MBs and no APs, but once both were available again, it was back to only receiving AP unless I do a venue change.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 883443 · Report as offensive
Profile arkayn
Volunteer tester
Avatar

Send message
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 883454 - Posted: 8 Apr 2009, 18:00:14 UTC - in response to Message 883443.  

Same thing for me on the 2 machines that I allow AP on: nothing but AP and no MB in sight.

My pending is going to clear 50,000 real soon.

ID: 883454 · Report as offensive
Profile [B^S] madmac
Volunteer tester
Avatar

Send message
Joined: 9 Feb 04
Posts: 1175
Credit: 4,754,897
RAC: 0
United Kingdom
Message 883456 - Posted: 8 Apr 2009, 18:06:10 UTC

I too only got APs, not MBs, with my optimised version, so I have switched back to just doing MBs. If it could be sorted out, I would like to do both again.
ID: 883456 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Send message
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 883834 - Posted: 10 Apr 2009, 2:18:49 UTC - in response to Message 883419.  

These are turning out to be way too easy to find. Here are two more . . .

419755968

This is an interesting variation on the theme. _0 and _1 aborted for other reasons, then _2 and _3 validated but _2 was late. I'm _4 . . .
419662989
...

Searching through the Top Hosts list, I found four more cases: WU 417759313, WU 417924830, WU 418697898, and WU 418715656. I don't know if that indicates that it always happens when there's a final result after a canonical result has been chosen; I know no way to estimate how often that situation arises. All those I found were near the top of the list (within ~220). I also looked at hosts with RACs near 2500 and 1500 to see if I could find a case where the last host to report was running stock, with no luck after looking at about 200 hosts in each of those zones. But of course those hosts don't turn in nearly as much work as those at the top of the list; many might not turn in one AP WU per day if they're running MB too.

I've sent an update email to Josh and Eric with info on those searches.
                                                                     Joe
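
For anyone who wants to screen their own task lists for the same pattern, the check might be sketched as below. The record layout (`Result` with a report time and an outcome) is invented for the example, and the quorum of two is an assumption based on the "valid pair" wording earlier in the thread; the web pages don't expose data in this form.

```python
from dataclasses import dataclass

@dataclass
class Result:
    name: str        # task suffix such as "_0", "_1", "_2"
    reported: float  # hours after the WU was first issued (hypothetical unit)
    outcome: str     # "valid", "invalid", or "error"

def fits_pattern(results: list[Result]) -> bool:
    """True when an invalid result was reported after two others had validated."""
    valid_times = sorted(r.reported for r in results if r.outcome == "valid")
    if len(valid_times) < 2:
        return False
    canonical_chosen_at = valid_times[1]  # second valid report completes the pair
    return any(r.outcome == "invalid" and r.reported > canonical_chosen_at
               for r in results)

# _0 ran past deadline, _2 was sent as a replacement, _0 came back first.
late_case = [Result("_0", 750.0, "valid"),
             Result("_1", 300.0, "valid"),
             Result("_2", 760.0, "invalid")]

# An invalid result that arrived before any pair had validated.
ordinary_case = [Result("_0", 100.0, "invalid"),
                 Result("_1", 300.0, "valid"),
                 Result("_2", 500.0, "valid")]

print(fits_pattern(late_case), fits_pattern(ordinary_case))
```

The first list mirrors the scenario reported in this thread; the second is an ordinary invalid result that would not be flagged.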
ID: 883834 · Report as offensive
gomeyer
Volunteer tester

Send message
Joined: 21 May 99
Posts: 488
Credit: 50,370,425
RAC: 0
United States
Message 883853 - Posted: 10 Apr 2009, 3:36:53 UTC - in response to Message 883834.  

These are turning out to be way too easy to find. Here are two more . . .

419755968

This is an interesting variation on the theme. _0 and _1 aborted for other reasons, then _2 and _3 validated but _2 was late. I'm _4 . . .
419662989
...

Searching through the Top Hosts list, I found four more cases: WU 417759313, WU 417924830, WU 418697898, and WU 418715656. I don't know if that indicates that it always happens when there's a final result after a canonical result has been chosen; I know no way to estimate how often that situation arises. All those I found were near the top of the list (within ~220). I also looked at hosts with RACs near 2500 and 1500 to see if I could find a case where the last host to report was running stock, with no luck after looking at about 200 hosts in each of those zones. But of course those hosts don't turn in nearly as much work as those at the top of the list; many might not turn in one AP WU per day if they're running MB too.

I've sent an update email to Josh and Eric with info on those searches.
                                                                     Joe

This is probably a bad weekend to be asking them to do this. If I don't hear from anyone by tomorrow evening I'll do as you suggested and save the result files then return the results and note if they get zero. If Josh or Eric need these they can let me know. I've also saved the original WU's in the unlikely event that they are needed.

I don't know if it's important or not, but so far all the canonical results I've seen have been from stock apps, although one of the new ones you found had compared OK with an earlier optimized app. Or that might just mean that there are a lot more people running stock.
ID: 883853 · Report as offensive
David Emigh

Send message
Joined: 13 Mar 06
Posts: 7
Credit: 36,459
RAC: 0
United States
Message 888656 - Posted: 27 Apr 2009, 1:17:18 UTC - in response to Message 879588.  

I fear it would take another 188+ hours to get it done; there's a known issue with the checkpoint file which can make the app start over. The host which was given the resend probably started it March 23 and took about 8.2 days for an earlier AP_v5. If your host finishes while the WU is unresolved you'll get credit, but that seems unlikely. If it were mine I'd swear a little and abort it.
                                                                Joe


I fear the known issue has affected me as well. Workunit #433221213 is presently at 101.5% and rising, which is to say, it started over.

I have another Astropulse task only a few hours from completion. I will wait until that one finishes (or does not...) to make a decision about continuing to crunch Astropulse workunits.

ID: 888656 · Report as offensive
David Emigh

Send message
Joined: 13 Mar 06
Posts: 7
Credit: 36,459
RAC: 0
United States
Message 888699 - Posted: 27 Apr 2009, 5:27:46 UTC

It happened again, but not in so disastrous a fashion.

Workunit #424771170 got to 100%, then promptly reset itself, but only back to 99.6xx%. When it worked its way back up to 100%, it declared itself ready to report.

I updated the project and the workunit validated immediately (I was the wingman).

I am torn at this point about continuing the troublesome workunit noted in the post immediately prior to this one. I would appreciate any advice.
ID: 888699 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.