Message boards :
Number crunching :
Work Unit problem
Author | Message |
---|---|
bounty.hunter Joined: 22 Mar 04 Posts: 442 Credit: 459,063 RAC: 0 |
Hi all, suspend that workunit for now. BOINC will carry on with the next one. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Found another in my cache - 04mr07ab.32128.6207.15.4.179: I wonder what proportion of 04mr07ab.32128 is affected by this? [I suspect it's a deliberate ploy to stifle our plaintive cries of 'More WUs! More Wus!'. LOL] |
Havoc Joined: 18 May 99 Posts: 38 Credit: 1,454,156 RAC: 0 |
Edit: Aborting these WUs would simply impose an additional load on the servers as they reissue the WUs to others. I suggest anyone who has one Suspend it until the project has a chance to react. They could cancel the WUs to keep reissues from happening, and that should send a project abort for those WUs to each host as it contacts the Scheduler. 15/08/2007 17:01:18|SETI@home|Restarting task 04mr07aa.8827.23385.12.4.252_1 using setiathome_enhanced version 527 Suspended until the project can deal with it. |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Found another in my cache - 04mr07ab.32128.6207.15.4.179: I wonder what proportion of 04mr07ab.32128 is affected by this? I suspect almost all 04mr07ab WUs may be affected, in the sense that they have different thresholds than intended. That's the way the problem was last year, and the fact that some users are seeing eventual overflow on Pulses if they let these slow units run long enough reinforces my impression it's the same problem. There are three thresholds which are adjusted by the splitter based on the angle range of the work: gauss_null_chi_sq_thresh, pulse_thresh, and triplet_thresh. The adjustment is supposed to keep the number of false positives about constant no matter what the angle range. But if that adjustment isn't working right, the results are essentially bad, unless they can postprocess them in some fashion to compensate. Note: The Gaussian threshold won't have any effect unless the angle range later in the data set drops below 1.13, where Gaussian fitting is performed. Joe |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
I suspect almost all 04mr07ab WUs may be affected, in the sense that they have different thresholds than intended. Have you been able to get a message through to anyone in the lab, or should I PM Matt or somebody? |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
I have lots of these across two machines. To scour them for weird thresholds would take a long time. Should I suspend all of them instead? "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Bump ... suspended 35 x 04mr07ab WUs on one machine ( 90% of cache ), and 2x 04mr07ab WUs on the other (one seventh of cache), pending further advice. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
W-K 666 Joined: 18 May 99 Posts: 19367 Credit: 40,757,560 RAC: 67 |
I've just reported three of these 04mr07 units, still got 7 more. They have exited with -9 overflow after a few seconds; all report 31 triplets. Andy [edit] processed with chicken 2.4 [/edit] |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
I suspect almost all 04mr07ab WUs may be affected, in the sense that they have different thresholds than intended. I did PM Matt, don't know how often he checks or whether to expect a reply. Joe |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
I've just reported three of these 04mr07 units, still got 7 more, They have exited with -9 overflow after a few seconds, all report 31 triplets. Jolly good. Which science app / BOINC / OS was that with? If similar to mine I may just let them go. [unsuspend] "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
gomeyer Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0 |
EDIT @ JOE /EDIT Just to be clear, are you suggesting that ALL WUs beginning with 04mr07ab.xxxx should be suspended? I have just over 350 of these on 9 machines so I do NOT want to misunderstand what is happening here. Thanks! |
speedimic Joined: 28 Sep 02 Posts: 362 Credit: 16,590,653 RAC: 0 |
I just checked a bunch of WUs on some of my boxes, all 04mr07ab (also some .32128.), and none had a negative triplet_thresh. So from what I understand they will error out -9. I'll just let them go the normal way... mic. |
gomeyer Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0 |
I just checked a bunch of WUs on some of my boxes, all 04mr07ab (also some .32128.), and none had a negative triplet_thresh. Watch the 'To Completion' time. I had several that were running and the time was counting up, then down, then up again, etc. I finally suspended them. |
speedimic Joined: 28 Sep 02 Posts: 362 Credit: 16,590,653 RAC: 0 |
|
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
EDIT @ JOE /EDIT I think if they run in normal time or overflow early you might as well let them do so. I was suggesting the Suspend option for those which exhibit the slow syndrome. There are 14 channels of 04mr07ab data; some of the splitter processes may be doing the right thing or close enough. OTOH, someone with a large queue who usually doesn't watch how things are progressing might try to avoid the possibility of a lot of wasted time by Suspending all 04mr07ab until the situation clarifies. Joe |
gomeyer Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0 |
EDIT @ JOE /EDIT Thanks Joe, I'll keep an eye on them. The problem is that once they are suspended, my machines are not requesting new work even tho' the queues are below set limits including the suspended ones. I hope Berkeley comes up with something soon. Regards, Gus |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Another slight difficulty, as I've just confirmed with BOINC v5.10.13, is that BOINC won't request new work while there are suspended WUs for the project, even if the project queue is otherwise completely dry. I've just 'unsuspended' my three, and BOINC immediately requested new work: however, in the current circumstances, 'requested' and 'allocated' (let alone 'received') are not the same thing. BOINC asked for 250K seconds: it got 2 WUs, in a dozen attempts. Oh well, CPDN, Einstein and Astropulse will get another above-quota night's crunching, and I'll see what the buffer is looking like in the morning. |
Christoph Joined: 21 Apr 03 Posts: 76 Credit: 355,173 RAC: 0 |
I MAY have an explanation for BOINC not downloading new units while suspended ones are present. Suspending had been used at LHC@home to fill up one's own cache and receive more of the rare units without increasing the cache size. It MAY BE that, because of this, a function was added to BOINC to prevent it. But I am only guessing here! Happy crunching, Christoph |
gomeyer Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0 |
EDIT @ JOE /EDIT No, these are just not working. The ones I've been checking are running, but the Time To Completion is counting up and down and up again etc. As Richard said, there is always Rosetta, Einstein et al. that need our cycles. But you know darn well many will simply abort or, worse, delete these instead of suspending them. |
W-K 666 Joined: 18 May 99 Posts: 19367 Credit: 40,757,560 RAC: 67 |
I've just reported three of these 04mr07 units, still got 7 more, They have exited with -9 overflow after a few seconds, all report 31 triplets. Sorry, got distracted by family etc. Science app Chicken Pent M 2.4, BOINC 5.10.13, WinXP Pro. Update on units: 590749272: -9 after a few sec, 31 triplets; 590885684: normal time and credits; 590801394: 723 secs, cr 3.55, 1 pulse, 30 triplets, assume ok; 590688581: ok; 590728963: bad unit, CPU time 3416.25, claimed credit 0.1289, but it did exit without human(?) intervention. Andy |
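
For anyone with a big cache who would rather not open each workunit by hand: here is a rough sketch of the check speedimic describes in the thread, i.e. looking at the three splitter-adjusted thresholds Joe lists and flagging any that are negative. It assumes the workunit header stores each threshold as a plain <tag>value</tag> element; that layout, the function names, and the sample values are all assumptions for illustration, not taken from a real workunit or from the project's code.

```python
import re

# Threshold names as given in the thread. Storing them as simple
# <tag>value</tag> elements in the header is an assumption.
THRESHOLD_TAGS = ("gauss_null_chi_sq_thresh", "pulse_thresh", "triplet_thresh")


def read_thresholds(header_text):
    """Return {tag: float} for each threshold tag found in header_text."""
    found = {}
    for tag in THRESHOLD_TAGS:
        match = re.search(r"<{0}>\s*([^<\s]+)\s*</{0}>".format(tag), header_text)
        if match:
            try:
                found[tag] = float(match.group(1))
            except ValueError:
                # Skip values that do not parse rather than guess at them.
                pass
    return found


def negative_thresholds(header_text):
    """List the threshold tags whose value is below zero."""
    values = read_thresholds(header_text)
    return [tag for tag in THRESHOLD_TAGS if tag in values and values[tag] < 0.0]


if __name__ == "__main__":
    # Made-up header fragment, not from a real 04mr07ab workunit.
    sample = (
        "<analysis_cfg>"
        "<pulse_thresh>19.5</pulse_thresh>"
        "<triplet_thresh>-1.2</triplet_thresh>"
        "</analysis_cfg>"
    )
    print(negative_thresholds(sample))
```

Note that a non-negative value does not prove a unit is good: as Joe says, the thresholds may simply differ from what was intended, so this only catches the obvious cases.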