Astropulse Errors II-Optimized version 5.03!
Author | Message |
---|---|
Robi Send message Joined: 24 Oct 00 Posts: 33 Credit: 886,890 RAC: 1 |
For a while the number of Astropulse v5 results needed to get a valid pair was huge, but it has now gotten down to about 3.18 (based on the "Results waiting for db purging"/"Workunits waiting for db purging" ratio 300861/94389). It would be nice if that were closer to 2, and it may get down to the ~2.7 that old Astropulse had. I think it's a matter of glitches on hosts affecting a larger percentage of AP than MB simply because of the longer crunch time. If a host of typical speed glitches on average once a day and is doing AP, almost all its results will be affected, but if it were doing MB only a few of the larger number of results would be affected. I wonder if the system checks for hosts that return an invalid result or a computation error on every task, and blocks those hosts for those tasks. That is, if every AP (APv5 or MB) task (let's say out of 10) returns with a computation error or invalid result, then stop sending those tasks to that host and mark it on the user account page as "project AP (APv5 or MB) blocked due to excessive invalid returns". The user could unblock it if he/she wants and/or believes that he/she has corrected the source of the errors. Robi |
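To make the suggestion above concrete, here is a minimal sketch of such a per-host, per-application block. Nothing here reflects how the project server actually behaves: the class `AppBlocker`, its method names, and the window of 10 results are all hypothetical, taken only from the "let's say out of 10" wording in the post.

```python
from collections import deque

WINDOW = 10  # hypothetical: number of most recent results examined per host/app


class AppBlocker:
    """Hypothetical tracker that blocks an app on a host after WINDOW bad results in a row."""

    def __init__(self, window=WINDOW):
        self.window = window
        self.recent = {}      # (host_id, app) -> deque of booleans, True = valid result
        self.blocked = set()  # (host_id, app) pairs that should get no more work

    def record(self, host_id, app, ok):
        """Store one returned result; block the pair if the whole window is bad."""
        key = (host_id, app)
        history = self.recent.setdefault(key, deque(maxlen=self.window))
        history.append(ok)
        if len(history) == self.window and not any(history):
            self.blocked.add(key)

    def may_send(self, host_id, app):
        """True if work for this app may still be sent to this host."""
        return (host_id, app) not in self.blocked

    def unblock(self, host_id, app):
        """User-initiated reset, e.g. after fixing the source of the errors."""
        key = (host_id, app)
        self.blocked.discard(key)
        self.recent.pop(key, None)
```

Keying the history on the (host, application) pair means a host that fails every AP task could still receive MB work, and clearing the history on unblock gives the host a fresh window of 10 results.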
Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
For a while the number of Astropulse v5 results needed to get a valid pair was huge, but it has now gotten down to about 3.18 (based on the "Results waiting for db purging"/"Workunits waiting for db purging" ratio 300861/94389). It would be nice if that were closer to 2, and it may get down to the ~2.7 that old Astropulse had. I think it's a matter of glitches on hosts affecting a larger percentage of AP than MB simply because of the longer crunch time. If a host of typical speed glitches on average once a day and is doing AP, almost all its results will be affected, but if it were doing MB only a few of the larger number of results would be affected. The only blocking it does is to reduce the daily quota by one per CPU. The max is 100/CPU/day, and reporting a compute error (aborts, missed deadlines, and download errors count in this case, too) will reduce the quota to 99/CPU/day. The way the quota system is set up, every bad task reduces the quota by one and every good task doubles it. I still think the system needs to be revised to be +2 instead of *2. Reporting 50 errors can be erased by reporting one good task (100-50, then *2 is 100 again). Something seems flawed in that logic, to me anyway. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) |
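The quota arithmetic in the post above can be checked with a short simulation. This is only a sketch of the rule as described there (minus one per bad task, times two per good task, capped at 100 per CPU per day). The function names and the floor of 1 are assumptions made for the illustration, not the server's actual code.

```python
MAX_QUOTA = 100  # daily maximum per CPU, as stated in the post
MIN_QUOTA = 1    # assumed floor; the post does not say how low the quota can go


def apply_result(quota, ok, good_step):
    """Return the new quota after one reported task."""
    if ok:
        return min(MAX_QUOTA, good_step(quota))
    return max(MIN_QUOTA, quota - 1)


def run(results, good_step):
    """Replay a list of task outcomes (True = good) starting from the maximum quota."""
    quota = MAX_QUOTA
    for ok in results:
        quota = apply_result(quota, ok, good_step)
    return quota


# 50 bad tasks followed by a single good one
history = [False] * 50 + [True]

doubling = run(history, lambda q: q * 2)  # rule as described: good task doubles the quota
additive = run(history, lambda q: q + 2)  # proposed alternative: good task adds 2

print(doubling, additive)
```

Under the doubling rule the quota drops from 100 to 50 and one good task restores it to 100. Under the proposed +2 rule the same history leaves the quota at 52, so it would take 24 more good tasks to get back to the maximum.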
Ananas Send message Joined: 14 Dec 01 Posts: 195 Credit: 2,503,252 RAC: 0 |
wuid=423714931 The stock application says: Found 30 single pulses and 30 repeating pulses, exiting. The optimized one might be missing this exit criterion, so the status is "Completed, validation inconclusive" for now. P.S.: This seems to be a very rare condition; it's the first one of those I have seen. The next result will be done with a stock client, so the result with the 30-pulse plausibility check will go into the database. |
gomeyer Send message Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0 |
One could be a fluke, two could be coincidence, but three is a pattern. . . http://setiathome.berkeley.edu/workunit.php?wuid=417989367 This is the third time this has happened in as many days. Unfortunately the previous two have been deleted already. (Note to whomever, the [url] bbcode tag seems to be broken at the moment.) All three followed exactly the same scenario: - Task #2 completed within deadline using the stock app. - Task #1 went past the deadline causing a third one to be sent. #1 was then returned past deadline, also using the stock app and validated with #2. - Task #3 completed using optimized app but failed to compare with the previous two. I don’t see any way to visually compare the results to see if the third was truly invalid, but since I’ve had no other failures on this machine, nor the previous two which were run on different machines, I’m guessing it should have validated. So this is probably a validator problem and not an optimized app or machine problem. As I said in a previous post, Stuff Happens. But I have to admit that three within three days is getting just a bit distressing. [edited to correct task numbers.] |
Ananas Send message Joined: 14 Dec 01 Posts: 195 Credit: 2,503,252 RAC: 0 |
One could be a fluke, two could be coincidence, but three is a pattern. . . The idea that it might have tried to validate the third result against already purged ones crossed my mind earlier, as I saw the same facts: one result too late, a third one sent out, but the late one returned before the replacement could be returned. But that didn't match my latest one, the one with the 30 pulses limit. |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Ananas wrote: wuid=423714931 The linked WU doesn't have any result with 30/30 exit, did you perhaps choose the wrong one? It did get inconclusive validation so is certainly worth watching. The optimized apps certainly do have that early exit, optimizers don't discard anything which finishes work sooner. gomeyer wrote: All three followed exactly the same scenario: That's a very interesting observation. The check_pair() validation logic used when there is already a canonical result might have a flaw I can't spot, but I suspect that it simply couldn't find the canonical result to do the comparison. There's no specific error code for that case, so the web page would simply give the invalid indication you see. There is a server log message "Couldn't create canonical AP_RESULT object" which project staff could look for. Joe |
Ananas Send message Joined: 14 Dec 01 Posts: 195 Credit: 2,503,252 RAC: 0 |
Ananas wrote: wuid=423714931 Oops, I have 2 of those inconclusive ones? wuid=427650464 is the 30/30 one, sorry, my fault :-/ P.S.: It seems to be a problem with the host running the stock app; have a look at this: hostid=4821903 One more P.S.: The other Linux host of the same cruncher has the same problem, with lots of results that failed validation |
gomeyer Send message Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0 |
I guess what worries me as much as anything else is how many of these are happening but not being spotted. Not just mine but everyone's. It was just by chance that I found the first of the three I mentioned; I returned a WU and happened to notice no corresponding increase in either Total Credit or Pending. I'm now checking returned results more carefully. As was said in another thread, the science is not being lost, only time and credit. |
gomeyer Send message Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0 |
OK, This is getting friggin ridiculous. That's 4 in 5 days. Again it was my WU going against 2 stock apps one of which had gone overdue, just as the previous ones did. Can someone with the ear of the administration do so and ask if this can be looked into? |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
OK, This is getting friggin ridiculous. That's 4 in 5 days. Again it was my WU going against 2 stock apps one of which had gone overdue, just as the previous ones did. I've just sent an email, sorry for the delay. If Josh or Eric or Jeff do take a look, it would be very convenient to have a fresh case so the database records haven't been purged. Even better would be a case which matches the pattern but your host hasn't quite finished it yet, in that case the canonical result file should still be available so your result file can be compared. Joe |
gomeyer Send message Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0 |
OK, This is getting friggin ridiculous. That's 4 in 5 days. Again it was my WU going against 2 stock apps one of which had gone overdue, just as the previous ones did. Thanks for sending the email Joe. If/when this happens again I'll certainly see it after the fact and will report it here immediately. Finding one in advance of a zero credit is a bit time consuming, but I will start looking as time allows. [edit]Ask and ye shall receive. Workunit 419896647 fits the pattern except that it was the _1 task that went over instead of _0. Don't know if that will matter tho'. The task should complete within the next 24 hours if it is run in normal order. I'll wait to report work from that machine, just let me know. |
gomeyer Send message Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0 |
These are turning out to be way too easy to find. Here are two more . . . 419755968 This is an interesting variation on the theme. _0 and _1 aborted for other reasons, then _2 and _3 validated but _2 was late. I'm _4 . . . 419662989 All three of the ones I've found should begin running on their own within about 24 hours or less, completing 12-15 hours later. I can of course start them early or suspend them before completion if that will help Berkeley get a look at what is happening. Just let me know how it should be handled. BTW, on a separate but related subject, Eric’s script to credit successful WUs orphaned by excess errors may not have been re-enabled after last week’s server woes. I lost one that was returned 30 Mar 22:00 UTC. |
Terror Australis Send message Joined: 14 Feb 04 Posts: 1817 Credit: 262,693,308 RAC: 44 |
Only Getting AP V5.03 Units Since I installed the optimised AP apps I find that on the computers with AP enabled I'm only downloading AP units and no MB. I notice that further up the thread there were others with the same problem but no solution was posted. I've been getting around the problem by location switching but I'd rather not have to micromanage things. What do I do to get an even mix of AP and MB units with no manual intervention? Brodo |
Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
Only Getting AP V5.03 Units I have a feeling that's something that needs to be addressed server-side, but it hasn't been looked into, or fixed yet. During the outage issues last week, I got a bunch of MBs and no APs, but once both were available again, it's back to only receiving AP unless I do a venue-change. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) |
arkayn Send message Joined: 14 May 99 Posts: 4438 Credit: 55,006,323 RAC: 0 |
|
[B^S] madmac Send message Joined: 9 Feb 04 Posts: 1175 Credit: 4,754,897 RAC: 0 |
|
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
These are turning out to be way too easy to find. Here are two more . . . Searching through the Top Hosts list, I found four more cases, WU 417759313, WU 417924830, WU 418697898, and WU 418715656. I don't know if that indicates that it always happens when there's a final result after a canonical result has been chosen, I know no way to estimate how often that situation arises. All those I found were near the top of the list (within ~220). I also looked at hosts with RACs near 2500 and 1500 to see if I could find a case where the last host to report was running stock, no luck after looking at about 200 hosts in each of those zones. But of course those hosts don't turn in nearly as much work as those at the top of the list, many might not turn in one AP WU per day if they're running MB too. I've sent an update email to Josh and Eric with info on those searches. Joe |
gomeyer Send message Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0 |
These are turning out to be way too easy to find. Here are two more . . . This is probably a bad weekend to be asking them to do this. If I don't hear from anyone by tomorrow evening I'll do as you suggested and save the result files, then return the results and note if they get zero. If Josh or Eric need these they can let me know. I've also saved the original WUs in the unlikely event that they are needed. I don't know if it's important or not, but so far all the canonical results I've seen have been from stock apps, although one of the new ones you found had compared OK with an earlier optimized app. Or that might just mean that there are a lot more people running stock. |
David Emigh Send message Joined: 13 Mar 06 Posts: 7 Credit: 36,459 RAC: 0 |
I fear it would take another 188+ hours to get it done, there's a known issue with the checkpoint file which can make the app start over. The host which was given the resend probably started it March 23 and took about 8.2 days for an earlier AP_v5. If your host finishes while the WU is unresolved you'll get credit, but that seems unlikely. If it were mine I'd swear a little and abort it. Joe I fear the known issue has affected me as well. Workunit #433221213 is presently at 101.5% and rising, which is to say, it started over. I have another astropulse only a few hours from completion. I will wait until that one finishes (or does not...) to make a decision about continuing to crunch astropulse workunits. |
David Emigh Send message Joined: 13 Mar 06 Posts: 7 Credit: 36,459 RAC: 0 |
It happened again, but not in so disastrous a fashion. Workunit #424771170 got to 100%, then promptly reset itself, but only back to 99.6xx%. When it worked its way back up to 100%, it declared itself ready to report. I updated the project and the workunit validated immediately (I was the wingman). I am torn at this point about continuing the troublesome workunit noted in the post immediately prior to this one. I would appreciate any advice. |