Work Unit problem

kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 620337 - Posted: 16 Aug 2007, 17:20:57 UTC - in response to Message 620279.  

There may be more WUs affected by this problem. I have one crunching on my x64 quad that has been at it for almost 3-1/2 hours and is only at .083% complete. And this is a different series....recently downloaded....05mr07aa.12591.24612.13.4.241...so it appears this is still perhaps a splitter problem.
Normal completion times for MB seem to be running about 1hr 20min on this rig, so it's obvious there is a major problem with this WU.
I will let it run for today, but I am going to abort it tonight if I see it has wasted another 14 hours on it and is still only a few percent complete.

Created 16 Aug 2007 11:23:35 UTC

So the splitters are running, and still creating ... gibberish.

[Mark, could you check/post the

<triplet_thresh>x.xxxxxxxx</triplet_thresh>

for that WU? Thanks.]

Anyone - everyone - please, how can we get through to the project staff to tackle this problem at source, not just keep feeding the monster?


Where would I find that info? I just looked, and the WU finally finished with a -9 overflow after over 3-1/2 hours of crunching and asking for .61 worth of credit for the trouble.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 620337
gomeyer
Volunteer tester
Joined: 21 May 99
Posts: 488
Credit: 50,370,425
RAC: 0
United States
Message 620343 - Posted: 16 Aug 2007, 17:50:05 UTC - in response to Message 620332.  

FYI - we were observing the triplet overflow behavior as soon as these particular files were being split days ago. Usually these are caused by heavy areas of RFI and we work beyond them on our own. Some fires you just let burn, you know? Anyway, we're on it.

- Matt

Any thoughts as to what we're to do with the WUs we've suspended after they stopped responding at ~0.0nnn%?
If we simply abort them someone else will get them, and if we let them sit there BOINC will never request more work.
ID: 620343
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14676
Credit: 200,643,578
RAC: 874
United Kingdom
Message 620344 - Posted: 16 Aug 2007, 17:57:16 UTC - in response to Message 620337.  

If the WU has finished and reported, it's too late now: but for future reference (and in case any other drive-by readers are interested) -

All the WU data is contained in the big (367KB) files in the ..\BOINC\projects\setiathome.berkeley.edu folder. They're plain text: you can open them with Notepad, Wordpad or any other plain-text viewer. Wordpad displays them more clearly than Notepad. Be very careful not to make or save any changes to a file, but it's perfectly safe to have a peek.

The files start with an xml section - you'll be familiar with the style from the app_info.xml files. You'll see a <workunit_header>, and then more information about the WU than you ever really wanted to know.

The bit that seems to have gotten itself messed up this time is towards the bottom of the header, half way down a section called <analysis_cfg>. The tag in one of my suspended WUs reads

<triplet_thresh>-2.06835318</triplet_thresh>

- Joe reckons the number is only meaningful if it's positive, which is why I was asking what yours was. Anyway, Matt's on the case now, so I think we can relax a bit.
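
For anyone who would rather not open the file in an editor at all, here is a minimal Python sketch of the same check. It assumes the threshold appears as a single `<triplet_thresh>` tag in a plain-text file, as described in this post; the function names are made up for illustration, and the file is only ever read, never written.

```python
import re
from pathlib import Path
from typing import Optional

# Matches e.g. <triplet_thresh>-2.06835318</triplet_thresh>
TRIPLET_RE = re.compile(
    r"<triplet_thresh>\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*</triplet_thresh>"
)


def find_triplet_thresh(text: str) -> Optional[float]:
    """Return the first triplet threshold found in the text, or None."""
    match = TRIPLET_RE.search(text)
    return float(match.group(1)) if match else None


def read_triplet_thresh(path: str) -> Optional[float]:
    """Read a WU file (read-only) and extract its triplet threshold."""
    # errors="replace" keeps the read from failing on any non-text bytes
    # that follow the XML header.
    text = Path(path).read_text(errors="replace")
    return find_triplet_thresh(text)


def looks_suspect(value: Optional[float]) -> bool:
    """Flag thresholds that are not positive (assumption based on Joe's remark)."""
    return value is not None and value <= 0
```

Calling `looks_suspect(read_triplet_thresh(path_to_wu_file))` on each file in the project folder would flag the affected WUs, on the assumption that a non-positive threshold really is the common symptom.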
ID: 620344
Blurf
Volunteer tester
Joined: 2 Sep 06
Posts: 8964
Credit: 12,678,685
RAC: 0
United States
Message 620382 - Posted: 16 Aug 2007, 19:10:16 UTC

In the future people can PM me...I've called the lab several times and tend to be more available than Pappa


ID: 620382
SATAN
Joined: 27 Aug 06
Posts: 835
Credit: 2,129,006
RAC: 0
United Kingdom
Message 620387 - Posted: 16 Aug 2007, 19:22:09 UTC

I've just aborted 10 that wouldn't even download, and have just got another 20 doing exactly the same thing. Something doesn't add up: they say there are plenty of units to download, but they very rarely get here.

I'm sure the guys will sort it out in the end.
ID: 620387
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 620407 - Posted: 16 Aug 2007, 20:20:37 UTC - in response to Message 620343.  

If you don't have too many, perhaps a Resume, get work, Suspend sequence would be practical. You'd only have to leave them active until the Scheduler had assigned the work to download. Of course we hope the project will resolve the situation before the 8.68 day deadline, otherwise those WUs will be sent to someone else anyhow.

The only other way to be kind to other users is to let the WUs run and hope they eventually overflow on Pulses. Those participants who don't watch crunching are effectively doing that, it's the only way to get a matching result.
Joe
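
The Resume, get work, Suspend sequence is normally done with the buttons in BOINC Manager. As a rough command-line sketch in Python, assuming a reasonably recent `boinccmd` that accepts `--task URL task_name {suspend|resume|abort}` and `--project URL update` (older clients may use different option names, so check `boinccmd --help` first), the commands could be built like this. The function name and the task names passed in are hypothetical.

```python
from typing import List, Tuple

Command = List[str]


def resume_update_suspend(
    project_url: str, task_names: List[str], boinccmd: str = "boinccmd"
) -> Tuple[List[Command], List[Command]]:
    """Build the command lines for a resume / get work / suspend cycle.

    Returns two phases. Run the first phase, wait until the scheduler has
    assigned new work, then run the second phase. Nothing is executed here.
    """
    # Phase 1: resume the stuck tasks, then ask the project for work.
    start = [[boinccmd, "--task", project_url, name, "resume"] for name in task_names]
    start.append([boinccmd, "--project", project_url, "update"])
    # Phase 2: put the stuck tasks back to sleep.
    finish = [[boinccmd, "--task", project_url, name, "suspend"] for name in task_names]
    return start, finish
```

Each command list can be passed to `subprocess.run(cmd, check=True)`. The wait between the two phases is deliberately left to the user, since the tasks only need to stay active until new work has been assigned.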
ID: 620407
gomeyer
Volunteer tester
Joined: 21 May 99
Posts: 488
Credit: 50,370,425
RAC: 0
United States
Message 620424 - Posted: 16 Aug 2007, 20:36:31 UTC - in response to Message 620407.  

Makes sense. Thanks again.
ID: 620424
Jesse Viviano
Joined: 27 Feb 00
Posts: 100
Credit: 3,949,583
RAC: 0
United States
Message 620426 - Posted: 16 Aug 2007, 20:38:00 UTC

Here's another one exhibiting the same behavior. I am just waiting for the WU to overflow so that others don't get stuck on it.
ID: 620426
Rongar
Joined: 4 Aug 99
Posts: 13
Credit: 149,653
RAC: 0
Germany
Message 620427 - Posted: 16 Aug 2007, 20:38:55 UTC

Hi,
To prevent the WUs from being sent out again, shouldn't we collect all WU IDs to get them removed from the database manually? Or can this be fixed by running a script?


Best regards
Michael
ID: 620427
Bob Nadler
Joined: 3 Sep 99
Posts: 7
Credit: 726,368
RAC: 0
United States
Message 620429 - Posted: 16 Aug 2007, 20:42:38 UTC - in response to Message 620427.  

Hi Everyone,

I have just recently returned to the SETI@home project. I also have one of these troublesome work units: 04mr07ab.14840.4980.6.4.243.

http://setiathome.berkeley.edu/workunit.php?wuid=147539328

Here is the <triplet_thresh>-0.764051318</triplet_thresh>

I post here in case this info can be used as a data point.

Good luck!

Bob
ID: 620429
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14676
Credit: 200,643,578
RAC: 874
United Kingdom
Message 620582 - Posted: 16 Aug 2007, 23:15:31 UTC

I think the final word in this thread should go to Matt, in the latest Technical News. Thanks to MadMac for starting the thread in the first place, and getting us all thinking.
ID: 620582
gomeyer
Volunteer tester
Joined: 21 May 99
Posts: 488
Credit: 50,370,425
RAC: 0
United States
Message 620586 - Posted: 16 Aug 2007, 23:23:12 UTC - in response to Message 620582.  

And thanks to Joe Segur for getting involved, quickly locating that negative triplet threshold, and getting a dialog started with Matt.
ID: 620586
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 620590 - Posted: 16 Aug 2007, 23:28:52 UTC - in response to Message 620586.  

Yep - Joe was quite helpful. - Matt


-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 620590
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 620775 - Posted: 17 Aug 2007, 4:56:30 UTC - in response to Message 620427.  

It probably could be done with a script, assuming someone had time to write one. Let's at least list the ones we know about and try to get them cancelled. Even that may be more effort than Matt or Jeff have time for, but we can ask. Cancelling might mean never getting the fractional credit for those which have been allowed to complete, so speak up if you have objections.

Because there are 256 WUs with identical thresholds in each group, only the first 3 fields of the WU name are needed. Here's mine plus those already mentioned in the thread:

04mr07ab.10282.4980
04mr07ab.14840.4980
04mr07ab.32128.5798
05mr07aa.12591.24612
05mr07aa.15859.24612
05mr07ab.7301.368637
Joe
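
The first-three-fields rule is easy to apply mechanically. A small Python sketch, assuming WU names are dot-separated as in the examples in this thread (the helper names are hypothetical):

```python
from typing import Iterable, List


def group_key(wu_name: str) -> str:
    """Return the first three dot-separated fields of a WU name."""
    parts = wu_name.strip().split(".")
    if len(parts) < 3:
        raise ValueError(f"expected at least 3 fields in {wu_name!r}")
    return ".".join(parts[:3])


def unique_groups(wu_names: Iterable[str]) -> List[str]:
    """Collapse full WU names into a sorted list of distinct groups."""
    return sorted({group_key(name) for name in wu_names})
```

For example, `group_key("05mr07aa.12591.24612.13.4.241")` gives `05mr07aa.12591.24612`, so reports of different WUs from the same group collapse into one entry in the list.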
ID: 620775
Tklop
Joined: 11 May 03
Posts: 175
Credit: 613,952
RAC: 0
United States
Message 620808 - Posted: 17 Aug 2007, 5:37:01 UTC - in response to Message 620775.  

Hello, fellow crunchers...

Here's my solitary example (so far)

04mr07ab.7106.5389

Have we reached a consensus on the action needed?

I thought Matt recommended letting them run until the overflow check causes them to stop. Does anyone know when we might expect that to happen? So far, this one has been crunching 16+ hours, with only .012 complete.

I don't mind just letting it run--provided it finishes before its deadline, and I sure don't want to just dump it on some other user...

The computer trying to crunch it (if it even matters) is a Pentium M 1.86GHz, with 1GB RAM, running Windows XP Pro 5.1.2600, SP2, Build 2600...

Anyways,
Keep on crunching, all...
SETI@Home Forever!


___Tklop (Step-Founder, U.S. Air Force team)
ID: 620808
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19356
Credit: 40,757,560
RAC: 67
United Kingdom
Message 620877 - Posted: 17 Aug 2007, 7:17:01 UTC

Does anybody know if these batches could be canceled at Berkeley? I have just had another one on re-issue, copy _4. Just caught it before I put nose to grindstone. It had done 20 mins for 0.018%, so I aborted it.

Andy
ID: 620877
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 620882 - Posted: 17 Aug 2007, 7:24:39 UTC - in response to Message 620808.  

Until the problem is fixed and a solution implemented, either just let it run to completion (it will probably -9 and exit), or suspend that WU for now. The only problem with suspending it is that another user commented that he thought BOINC will not attempt to get any new work while a WU is suspended. If you have additional work in your cache, that is not a problem. If you do not, I guess I would abort it to try to get new work. Except that downloads seem to be problematic at the moment as well.
Hang in there.

"Time is simply the mechanism that keeps everything from happening all at once."

ID: 620882
Tklop
Joined: 11 May 03
Posts: 175
Credit: 613,952
RAC: 0
United States
Message 620906 - Posted: 17 Aug 2007, 8:26:11 UTC - in response to Message 620882.  

Thanks, msattler!

I believe I shall let it go for another half day or so, before suspending it... If it -9's and exits, then all the better!

Upon reviewing the work unit's history, it looks like that's what happened to another user already -- see here: http://setiathome.berkeley.edu/workunit.php?wuid=147604236

So, for now, patience it is!
Keep on crunching, all...
SETI@Home Forever!


___Tklop (Step-Founder, U.S. Air Force team)
ID: 620906
Rongar
Joined: 4 Aug 99
Posts: 13
Credit: 149,653
RAC: 0
Germany
Message 620910 - Posted: 17 Aug 2007, 8:33:38 UTC
Last modified: 17 Aug 2007, 8:35:44 UTC

Hi,
my WU:

05mr07aa.12591.24612.13

I think I have to suspend it. Since my 'puter is not powered 24/7, it never seems to reach the next milestone and falls back to 0.012%. And it seems to have few signals (so far it has found 1 Pulse within 0.012% and 2h of crunching time).


Best regards
Michael
ID: 620910
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 620912 - Posted: 17 Aug 2007, 8:37:10 UTC - in response to Message 620906.  

Excellent choice, my friend, excellent choice! The kitties and I have been watching my RAC get slaughtered lately, but we will hang with it.
We are in it for the science, although the science experiment seems to have gone a bit awry lately. A few test tubes shattered and such. Remember science class? Mr Wizard would have understood all of this. It'll get sorted out.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 620912
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.