Work Unit problem

Profile bounty.hunter
Volunteer tester
Joined: 22 Mar 04
Posts: 442
Credit: 459,063
RAC: 0
India
Message 619646 - Posted: 15 Aug 2007, 16:55:19 UTC - in response to Message 619644.  

Hi All,

I have the same problem over here; a restart of BOINC didn't help. In the Messages tab I can read the following, and that is pretty clear, but what can I do about it: let it go, or something else?

15-8-2007 18:46:35|SETI@home|Restarting task 04mr07ab.32128.6616.15.4.75_1 using setiathome_enhanced version 527
15-8-2007 18:47:12|SETI@home|app reporting negative CPU: -0.281250

Regards, Browny


Suspend that workunit for now. BOINC will carry on with the next one.
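
For anyone wanting to spot affected tasks from a saved message log, here is a minimal sketch. It assumes the log uses the `timestamp|project|message` layout seen in the two lines quoted in this post; `negative_cpu_reports` is a hypothetical helper name, not part of BOINC.

```python
def negative_cpu_reports(log_lines):
    """Yield (timestamp, cpu_value) for 'app reporting negative CPU' messages."""
    marker = "app reporting negative CPU:"
    for line in log_lines:
        # Assumed layout: timestamp|project|message
        parts = line.strip().split("|", 2)
        if len(parts) != 3:
            continue
        timestamp, _project, message = parts
        if message.startswith(marker):
            yield timestamp, float(message[len(marker):])


# The two log lines quoted in this post.
log = [
    "15-8-2007 18:46:35|SETI@home|Restarting task 04mr07ab.32128.6616.15.4.75_1 using setiathome_enhanced version 527",
    "15-8-2007 18:47:12|SETI@home|app reporting negative CPU: -0.281250",
]
print(list(negative_cpu_reports(log)))
```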
ID: 619646 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 619651 - Posted: 15 Aug 2007, 17:21:39 UTC

Found another in my cache - 04mr07ab.32128.6207.15.4.179: I wonder what proportion of 04mr07ab.32128 is affected by this?

[I suspect it's a deliberate ploy to stifle our plaintive cries of 'More WUs! More Wus!'. LOL]
ID: 619651 · Report as offensive
Havoc
Volunteer tester

Joined: 18 May 99
Posts: 38
Credit: 1,454,156
RAC: 0
United Kingdom
Message 619652 - Posted: 15 Aug 2007, 17:22:30 UTC - in response to Message 619638.  

Edit: Aborting these WUs would simply impose an additional load on the servers as they reissue the WUs to others. I suggest anyone who has one Suspend it until the project has a chance to react. They could cancel the WUs to keep reissues from happening and that should give project aborts of the WUs to each host as it contacts the Scheduler.

Good thinking. Suspended it is.


15/08/2007 17:01:18|SETI@home|Restarting task 04mr07aa.8827.23385.12.4.252_1 using setiathome_enhanced version 527

Suspended until the project can deal with it.
ID: 619652 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 619663 - Posted: 15 Aug 2007, 17:47:44 UTC - in response to Message 619651.  

Found another in my cache - 04mr07ab.32128.6207.15.4.179: I wonder what proportion of 04mr07ab.32128 is affected by this?

[I suspect it's a deliberate ploy to stifle our plaintive cries of 'More WUs! More Wus!'. LOL]

I suspect almost all 04mr07ab WUs may be affected, in the sense that they have different thresholds than intended. That's the way the problem was last year, and the fact that some users are seeing eventual overflow on Pulses if they let these slow units run long enough reinforces my impression it's the same problem.

There are three thresholds which are adjusted by the splitter based on the angle range of the work: gauss_null_chi_sq_thresh, pulse_thresh, and triplet_thresh. The adjustment is supposed to keep the number of false positives about constant no matter what the angle range. But if that adjustment isn't working right, the results are essentially bad, unless they can postprocess them in some fashion to compensate.

Note: The Gaussian threshold won't have any effect unless the angle range later in the data set drops below 1.13, where Gaussian fitting is performed.
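
As a rough illustration of checking those three thresholds, the sketch below scans work-unit header text and flags any threshold that is missing, zero, or negative. It assumes each threshold appears as a simple `<name>value</name>` element; the sample header values are invented and `find_bad_thresholds` is a hypothetical helper. A non-positive value is only one possible symptom, since a threshold can be wrong while still being positive.

```python
import re

THRESHOLD_NAMES = ("gauss_null_chi_sq_thresh", "pulse_thresh", "triplet_thresh")


def find_bad_thresholds(header_text):
    """Return {name: value} for thresholds that are missing (None) or <= 0."""
    bad = {}
    for name in THRESHOLD_NAMES:
        # Assumed header layout: <name>value</name>
        match = re.search(r"<%s>\s*([-+0-9.eE]+)\s*</%s>" % (name, name), header_text)
        if match is None:
            bad[name] = None
            continue
        value = float(match.group(1))
        if value <= 0:
            bad[name] = value
    return bad


# Invented sample values, for illustration only.
SAMPLE_HEADER = """
<gauss_null_chi_sq_thresh>2.2</gauss_null_chi_sq_thresh>
<pulse_thresh>19.5</pulse_thresh>
<triplet_thresh>-3.1</triplet_thresh>
"""
print(find_bad_thresholds(SAMPLE_HEADER))
```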
                                                               Joe
ID: 619663 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 619666 - Posted: 15 Aug 2007, 17:52:57 UTC - in response to Message 619663.  

I suspect almost all 04mr07ab WUs may be affected, in the sense that they have different thresholds than intended.

Have you been able to get a message through to anyone in the lab, or should I PM Matt or somebody?
ID: 619666 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 619667 - Posted: 15 Aug 2007, 17:53:08 UTC - in response to Message 619663.  


... I suspect almost all 04mr07ab WUs may be affected...


I have lots of these across two machines. To scour them for weird thresholds would take a long time. Should I suspend all of them instead?

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 619667 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 619695 - Posted: 15 Aug 2007, 19:10:05 UTC - in response to Message 619667.  

Bump ... suspended 35 x 04mr07ab WUs on one machine (90% of cache), and 2 x 04mr07ab WUs on the other (one seventh of cache), pending further advice.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 619695 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19402
Credit: 40,757,560
RAC: 67
United Kingdom
Message 619719 - Posted: 15 Aug 2007, 19:22:01 UTC
Last modified: 15 Aug 2007, 19:23:20 UTC

I've just reported three of these 04mr07 units and still have 7 more. They exited with -9 overflow after a few seconds; all report 31 triplets.

Andy

[edit] processed with chicken 2.4 [/edit]
ID: 619719 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 619721 - Posted: 15 Aug 2007, 19:24:17 UTC - in response to Message 619666.  

I suspect almost all 04mr07ab WUs may be affected, in the sense that they have different thresholds than intended.

Have you been able to get a message through to anyone in the lab, or should I PM Matt or somebody?

I did PM Matt, don't know how often he checks or whether to expect a reply.
                                                                 Joe
ID: 619721 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 619723 - Posted: 15 Aug 2007, 19:25:20 UTC - in response to Message 619719.  
Last modified: 15 Aug 2007, 19:26:29 UTC

I've just reported three of these 04mr07 units and still have 7 more. They exited with -9 overflow after a few seconds; all report 31 triplets.

Andy

[edit] processed with chicken 2.4 [/edit]


Jolly good. Which science app / BOINC / OS was that with? If similar to mine, I may just let them go. [unsuspend]

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 619723 · Report as offensive
gomeyer
Volunteer tester

Joined: 21 May 99
Posts: 488
Credit: 50,370,425
RAC: 0
United States
Message 619812 - Posted: 15 Aug 2007, 21:15:07 UTC
Last modified: 15 Aug 2007, 21:16:21 UTC

EDIT @ JOE /EDIT
Just to be clear, are you suggesting that ALL WUs beginning with 04mr07ab.xxxx should be suspended? I have just over 350 of these on 9 machines, so I do NOT want to misunderstand what is happening here.
Thanks!
ID: 619812 · Report as offensive
Profile speedimic
Volunteer tester
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 619845 - Posted: 15 Aug 2007, 21:43:00 UTC

I just checked a bunch of WUs on some of my boxes, all 04mr07ab (also some .32128.), and none had a negative triplet_thresh.

So from what I understand, they will error out with -9.

I'll just let them go the normal way...


mic.


ID: 619845 · Report as offensive
gomeyer
Volunteer tester

Joined: 21 May 99
Posts: 488
Credit: 50,370,425
RAC: 0
United States
Message 619846 - Posted: 15 Aug 2007, 21:45:02 UTC - in response to Message 619845.  

I just checked a bunch of WUs on some of my boxes, all 04mr07ab (also some .32128.), and none had a negative triplet_thresh.

So from what I understand, they will error out with -9.

I'll just let them go the normal way...


Watch 'To Completion' time. I had several that were running and the time was counting up, then down, then up again, etc. Finally suspended it.
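
A small sketch of how that up-and-down pattern could be spotted from a series of 'To Completion' readings noted by hand. The sample numbers are invented, and `count_reversals` is a hypothetical helper, not a BOINC function; how many reversals count as "suspect" is left to the reader.

```python
def count_reversals(estimates):
    """Count how often successive 'To Completion' readings change direction."""
    reversals = 0
    last_direction = 0  # +1 rising, -1 falling, 0 not yet known
    for previous, current in zip(estimates, estimates[1:]):
        if current == previous:
            continue
        direction = 1 if current > previous else -1
        if last_direction and direction != last_direction:
            reversals += 1
        last_direction = direction
    return reversals


# Invented readings, in seconds remaining.
healthy = [3600, 3400, 3200, 3000, 2800]
suspect = [3600, 3900, 3500, 4100, 3700, 4300]
print(count_reversals(healthy), count_reversals(suspect))
```

A healthy task should show an estimate that mostly falls, so its reversal count stays near zero.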
ID: 619846 · Report as offensive
Profile speedimic
Volunteer tester
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 619861 - Posted: 15 Aug 2007, 22:06:24 UTC

Watch 'To Completion' time. I had several that were running and the time was counting up, then down, then up again, etc. Finally suspended it.


This one just worked out fine.
This one looks good, too (11min, 15%).

mic.


ID: 619861 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 619878 - Posted: 15 Aug 2007, 22:29:50 UTC - in response to Message 619812.  

EDIT @ JOE /EDIT
Just to be clear, are you suggesting that ALL WUs beginning with 04mr07ab.xxxx should be suspended? I have just over 350 of these on 9 machines, so I do NOT want to misunderstand what is happening here.
Thanks!

I think if they run in normal time or overflow early you might as well let them do so. I was suggesting the Suspend option for those which exhibit the slow syndrome. There are 14 channels of 04mr07ab data, some of the splitter processes may be doing the right thing or close enough.

OTOH, someone with a large queue who usually doesn't watch how things are progressing might try to avoid the possibility of a lot of wasted time by Suspending all 04mr07ab until the situation clarifies.
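
For the suspend-everything option, a minimal sketch of picking out the 04mr07ab tasks from a list of task names. It assumes the tape name is the first dot-separated field, as in the names posted in this thread; `tasks_from_tape` is a hypothetical helper, and actually suspending the tasks is still done in the BOINC manager.

```python
def tasks_from_tape(task_names, tape_prefix):
    """Return the task names whose first dot-separated field equals tape_prefix."""
    return [name for name in task_names if name.split(".", 1)[0] == tape_prefix]


# Task names taken from earlier posts in this thread.
queue = [
    "04mr07ab.32128.6616.15.4.75_1",
    "04mr07aa.8827.23385.12.4.252_1",
    "04mr07ab.32128.6207.15.4.179",
]
for name in tasks_from_tape(queue, "04mr07ab"):
    print(name)
```

Matching on the whole first field rather than a plain string prefix keeps 04mr07aa tasks out of the list.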
                                                                  Joe
ID: 619878 · Report as offensive
gomeyer
Volunteer tester

Joined: 21 May 99
Posts: 488
Credit: 50,370,425
RAC: 0
United States
Message 619881 - Posted: 15 Aug 2007, 22:40:26 UTC - in response to Message 619878.  

EDIT @ JOE /EDIT
Just to be clear, are you suggesting that ALL WUs beginning with 04mr07ab.xxxx should be suspended? I have just over 350 of these on 9 machines, so I do NOT want to misunderstand what is happening here.
Thanks!

I think if they run in normal time or overflow early you might as well let them do so. I was suggesting the Suspend option for those which exhibit the slow syndrome. There are 14 channels of 04mr07ab data, some of the splitter processes may be doing the right thing or close enough.

OTOH, someone with a large queue who usually doesn't watch how things are progressing might try to avoid the possibility of a lot of wasted time by Suspending all 04mr07ab until the situation clarifies.
                                                                  Joe

Thanks Joe, I'll keep an eye on them. The problem is that once they are suspended, my machines are not requesting new work even tho' the queues are below set limits including the suspended ones. I hope Berkeley comes up with something soon.
Regards,
Gus
ID: 619881 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 619882 - Posted: 15 Aug 2007, 22:40:31 UTC

Another slight difficulty, as I've just confirmed with BOINC v5.10.13, is that BOINC won't request new work while there are suspended WUs for the project, even if the project queue is otherwise completely dry. I've just 'unsuspended' my three, and BOINC immediately requested new work. However, in the current circumstances, 'requested' and 'allocated' (let alone 'received') are not the same thing. BOINC asked for 250K seconds: it got 2 WUs, in a dozen attempts. Oh well, CPDN, Einstein and Astropulse will get another above-quota night's crunching, and I'll see what the buffer is looking like in the morning.
ID: 619882 · Report as offensive
Christoph
Volunteer tester

Joined: 21 Apr 03
Posts: 76
Credit: 355,173
RAC: 0
Germany
Message 619894 - Posted: 15 Aug 2007, 22:52:08 UTC

I MAY have an explanation for BOINC not downloading new units while suspended ones are present. Suspending was used at LHC@home to fill up one's own cache and receive more of the rare units without increasing the cache setting. It MAY BE that a function was added to BOINC to prevent this.

But I am only guessing here!

Happy crunching, Christoph
Christoph
ID: 619894 · Report as offensive
gomeyer
Volunteer tester

Joined: 21 May 99
Posts: 488
Credit: 50,370,425
RAC: 0
United States
Message 619899 - Posted: 15 Aug 2007, 22:57:31 UTC - in response to Message 619881.  
Last modified: 15 Aug 2007, 22:59:55 UTC

EDIT @ JOE /EDIT
Just to be clear, are you suggesting that ALL WUs beginning with 04mr07ab.xxxx should be suspended? I have just over 350 of these on 9 machines, so I do NOT want to misunderstand what is happening here.
Thanks!

I think if they run in normal time or overflow early you might as well let them do so. I was suggesting the Suspend option for those which exhibit the slow syndrome. There are 14 channels of 04mr07ab data, some of the splitter processes may be doing the right thing or close enough.

OTOH, someone with a large queue who usually doesn't watch how things are progressing might try to avoid the possibility of a lot of wasted time by Suspending all 04mr07ab until the situation clarifies.
                                                                  Joe

Thanks Joe, I'll keep an eye on them. The problem is that once they are suspended, my machines are not requesting new work even tho' the queues are below set limits including the suspended ones. I hope Berkeley comes up with something soon.
Regards,
Gus

No, these are just not working. The ones I've been checking are running, but the Time To Completion is counting up and down and up again, etc. As Richard said, there are always Rosetta, Einstein et al. that need our cycles.

But, you know darn well many will simply abort or worse delete these instead of suspending them.
ID: 619899 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19402
Credit: 40,757,560
RAC: 67
United Kingdom
Message 620019 - Posted: 16 Aug 2007, 2:27:38 UTC - in response to Message 619723.  

I've just reported three of these 04mr07 units and still have 7 more. They exited with -9 overflow after a few seconds; all report 31 triplets.

Andy

[edit] processed with chicken 2.4 [/edit]


Jolly good. Which science app / BOINC / OS was that with? If similar to mine, I may just let them go. [unsuspend]

Sorry, got distracted by family etc.
Science app Chicken Pent M 2.4, BOINC 5.10.13, WinXP Pro.
Update on units:
590749272 -9 after few sec, 31 triplets
590885684 normal time and credits
590801394 723 secs, cr 3.55, 1 pulse 30 triplets, assume ok.
590688581 ok
590728963 Bad unit. CPU time 3416.25, claimed credit 0.1289, but it did exit without human(?) intervention.

Andy
ID: 620019 · Report as offensive


 