Jobs restarting, with longer time to completion

Message boards : Number crunching : Jobs restarting, with longer time to completion

Bill Walker
Joined: 4 Sep 99
Posts: 3868
Credit: 2,697,267
RAC: 0
Canada
Message 625463 - Posted: 24 Aug 2007, 12:01:20 UTC
Last modified: 24 Aug 2007, 12:04:55 UTC

I wonder if anyone else is seeing this with the Multi-Beam data. I have had several jobs that restart on their own, with the "Time to completion" increasing each time. Here is one example: when it was first downloaded it estimated about 9:30 to complete.

After 5:38 CPU time used, it showed 23.7% complete, 9:59 to go

After 6:30 CPU time it was 12.57 % complete, 13:09 to go

After 9:30 CPU time it was 1.6 % complete, 18:48 to go.
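
The three readings above can be checked mechanically. The sketch below flags every pair of consecutive readings where CPU time went up but percent complete went down, which is what a restart from an old (or missing) checkpoint would look like. It assumes the times are hours:minutes; the function names are made up for illustration and are not part of BOINC.

```python
from typing import List, Tuple

Sample = Tuple[str, float]  # (CPU time as "H:MM", percent complete)

# Readings copied from the post above (assumed to be hours:minutes).
SAMPLES: List[Sample] = [("5:38", 23.7), ("6:30", 12.57), ("9:30", 1.6)]


def to_minutes(hhmm: str) -> int:
    """Convert an 'H:MM' string to a whole number of minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def find_regressions(samples: List[Sample]) -> List[Tuple[Sample, Sample]]:
    """Return consecutive pairs where CPU time rose but progress fell."""
    regressions = []
    for before, after in zip(samples, samples[1:]):
        more_cpu = to_minutes(after[0]) > to_minutes(before[0])
        less_done = after[1] < before[1]
        if more_cpu and less_done:
            regressions.append((before, after))
    return regressions


if __name__ == "__main__":
    for before, after in find_regressions(SAMPLES):
        print(f"{before[0]} CPU at {before[1]}% -> {after[0]} CPU at {after[1]}%")
```

Both transitions in the readings are flagged, so the task lost progress twice while still accumulating CPU time.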

I exited and restarted BOINC, and let it run overnight. Here is what I found this morning:

8/23/2007 10:25:29 PM|SETI@home|Restarting task 13fe07ab.527.9479.3.5.95_1 using setiathome_enhanced version 527
8/23/2007 10:44:47 PM|SETI@home|Restarting task 13fe07ab.527.9479.3.5.95_1 using setiathome_enhanced version 527
8/23/2007 11:04:59 PM|SETI@home|Restarting task 13fe07ab.527.9479.3.5.95_1 using setiathome_enhanced version 527

(bunch of messages omitted)

8/24/2007 6:31:42 AM|SETI@home|Restarting task 13fe07ab.527.9479.3.5.95_1 using setiathome_enhanced version 527
8/24/2007 6:41:53 AM|SETI@home|Restarting task 13fe07ab.527.9479.3.5.95_1 using setiathome_enhanced version 527
8/24/2007 7:05:05 AM|SETI@home|Restarting task 13fe07ab.527.9479.3.5.95_1 using setiathome_enhanced version 527
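
For anyone wanting to measure how often a task restarts, here is a rough sketch that parses message-log lines in the `timestamp|project|message` layout shown above and reports the gap between successive restarts of each task. The function names are hypothetical, and the timestamp format is inferred only from the lines quoted here (it also assumes an English locale for AM/PM).

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List

# First three log lines quoted in the post above.
LOG = """\
8/23/2007 10:25:29 PM|SETI@home|Restarting task 13fe07ab.527.9479.3.5.95_1 using setiathome_enhanced version 527
8/23/2007 10:44:47 PM|SETI@home|Restarting task 13fe07ab.527.9479.3.5.95_1 using setiathome_enhanced version 527
8/23/2007 11:04:59 PM|SETI@home|Restarting task 13fe07ab.527.9479.3.5.95_1 using setiathome_enhanced version 527
"""


def parse_restarts(log_text: str) -> Dict[str, List[datetime]]:
    """Map each task name to the timestamps of its 'Restarting task' lines."""
    restarts = defaultdict(list)
    for line in log_text.splitlines():
        parts = line.strip().split("|", 2)
        if len(parts) != 3:
            continue  # not a timestamp|project|message line
        stamp, _project, message = parts
        if not message.startswith("Restarting task "):
            continue
        task = message.split()[2]  # third word is the task name
        restarts[task].append(datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p"))
    return dict(restarts)


def gaps_in_seconds(times: List[datetime]) -> List[int]:
    """Seconds between each restart and the next one."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    for task, times in parse_restarts(LOG).items():
        print(task, gaps_in_seconds(times))
```

On these three lines the gaps come out at 1158 and 1212 seconds, so roughly every 20 minutes rather than the three hours expected from switching between projects.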

It is currently showing 14:37 CPU used, 12.6% complete, 20:11 to completion.

I'm running three projects under BOINC, and I would expect SETI to restart about every three hours. I haven't completed a SETI job in several days; they have all either locked up or gone into this weird behaviour. I was doing one or two SETI results per day before that. The other two projects (Climate Prediction and World Community Grid) are working normally.

Any suggestions?

ID: 625463
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 625490 - Posted: 24 Aug 2007, 13:31:27 UTC

It sounds like a checkpointing problem, and no, I haven't heard of anyone else seeing (or at least, reporting) anything like this. We know that the 'Splitsville' WUs (the 2.5% with impossibly low thresholds) wouldn't checkpoint, but they only showed %age complete in the 0.0x% range.

As a temporary workround, if you're prepared to run the machine like this, you could set 'Leave applications in memory while suspended?' to Yes: provided you didn't have to restart BOINC for any reason, the WUs would then crunch to completion without having to rely on a checkpoint file.

Then, once you've uploaded and reported the WUs which have been struggling to date, we can look in <stderr_txt> and see if there are any messages about checkpointing to help the next person who asks.
ID: 625490
Bill Walker
Joined: 4 Sep 99
Posts: 3868
Credit: 2,697,267
RAC: 0
Canada
Message 625598 - Posted: 24 Aug 2007, 15:26:57 UTC - in response to Message 625490.  
Last modified: 24 Aug 2007, 16:06:21 UTC

As a temporary workround, if you're prepared to run the machine like this, you could set 'Leave applications in memory while suspended?' to Yes: provided you didn't have to restart BOINC for any reason, the WUs would then crunch to completion without having to rely on a checkpoint file.

Then, once you've uploaded and reported the WUs which have been struggling to date, we can look in <stderr_txt> and see if there are any messages about checkpointing to help the next person who asks.


Tried that, but the current project locked up after a few hours, stuck at 8.26 % complete, but no change in CPU time. The "to completion" time went from about 9:30 when it started, to currently showing 13:56. The status says running, but nothing is changing.

Unless you have another suggestion, I'm going to can this one too. BTW, job id is 02mr07ah.29213.481.10.5.111_0. Here are the last few messages:

8/23/2007 9:22:18 PM|SETI@home|Restarting task 02mr07ah.29213.481.10.5.111_0 using setiathome_enhanced version 527

8/24/2007 9:14:28 AM|SETI@home|Restarting task 02mr07ah.29213.481.10.5.111_0 using setiathome_enhanced version 527

8/24/2007 10:29:48 AM|SETI@home|Restarting task 02mr07ah.29213.481.10.5.111_0 using setiathome_enhanced version 527


FORGET ALL THAT. While I was cutting and pasting this info, the job restarted, no messages, but time counters are moving again. I'm so confused....

Added slightly later: shortly after the counters started moving, a message was generated:

8/24/2007 11:33:27 AM|SETI@home|Restarting task 02mr07ah.29213.481.10.5.111_0 using setiathome_enhanced version 527

Also, the time to completion jumped up, to 15:10 (was 13:56 when the job was frozen). It appears to me that my SETI jobs are restarting on their own, from the beginning or near the beginning. I'll let this run through the weekend to see if there is an end to all this.


ID: 625598
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 625698 - Posted: 24 Aug 2007, 16:24:11 UTC
Last modified: 24 Aug 2007, 16:24:48 UTC

I've just had a thought, and looked at your result listings. Result 594641367 has this:
<message>
- exit code 1073807364 (0x40010004)
</message>

See message 609680.
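
As a side note on the two numbers in that message: they are the same value in decimal and hex. The sketch below confirms that and, on the assumption (not confirmed anywhere in this thread) that the value follows the usual Windows NTSTATUS bit layout, splits it into severity, facility and code fields. The function name is made up for illustration.

```python
def decode_ntstatus(code: int) -> dict:
    """Split a 32-bit status value into NTSTATUS-style fields.

    Assumed layout: bits 31-30 severity, bit 29 customer flag,
    bits 27-16 facility, bits 15-0 code.
    """
    return {
        "hex": f"0x{code:08X}",
        "severity": (code >> 30) & 0x3,
        "customer": (code >> 29) & 0x1,
        "facility": (code >> 16) & 0xFFF,
        "code": code & 0xFFFF,
    }


if __name__ == "__main__":
    # Exit code quoted in the result listing above.
    print(decode_ntstatus(1073807364))
```

Under that assumption the value decodes to severity 1, facility 1, code 4, which would make it an informational status rather than an error status; what it means for this particular result is still best taken from the linked message.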

Since you're getting these random restarts, I'd add: check the graphics card fan for proper cooling, and check the card itself is properly seated in its slot, as well as following bounty.hunter's advice about graphics drivers and DirectX.
ID: 625698
Sarge
Volunteer tester
Joined: 25 Aug 99
Posts: 12273
Credit: 8,569,109
RAC: 79
United States
Message 625736 - Posted: 24 Aug 2007, 17:02:23 UTC - in response to Message 625490.  

It sounds like a checkpointing problem, and no, I haven't heard of anyone else seeing (or at least, reporting) anything like this. We know that the 'Splitsville' WUs (the 2.5% with impossibly low thresholds) wouldn't checkpoint, but they only showed %age complete in the 0.0x% range.

As a temporary workround, if you're prepared to run the machine like this, you could set 'Leave applications in memory while suspended?' to Yes: provided you didn't have to restart BOINC for any reason, the WUs would then crunch to completion without having to rely on a checkpoint file.

Then, once you've uploaded and reported the WUs which have been struggling to date, we can look in <stderr_txt> and see if there are any messages about checkpointing to help the next person who asks.


Bill and Richard: Yes, I have had one SETI WU act like this. There is another BOINC project where something similar can happen quite often. (I do not recall which project, since I am back to only running SETI@Home for the time being.)
As someone who has only asked or suggested a few things in this forum, I ask the following hesitantly: what about just aborting the WU?
Capitalize on this good fortune, one word can bring you round ... changes.
ID: 625736
Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 625737 - Posted: 24 Aug 2007, 17:06:14 UTC - in response to Message 625736.  
Last modified: 24 Aug 2007, 17:08:55 UTC

Bill and Richard: Yes, I have had one SETI WU act like this. There is another BOINC project where something similar can happen quite often. (I do not recall which project, since I am back to only running SETI@Home for the time being.)
As someone who has only asked or suggested a few things in this forum, I ask the following hesitantly: what about just aborting the WU?


Well, the thinking early on was that it might be better to just suspend the bogus WUs, in the hope that the project would get around to cancelling them. That would end the problem without propagating them to other participants until they either reached the max errors or hit another host which didn't gag on them.

At this point, I don't think it's going to happen. So if they're posing a problem for your host, just go ahead and can them.

Alinator
ID: 625737
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 625751 - Posted: 24 Aug 2007, 17:15:54 UTC
Last modified: 24 Aug 2007, 17:17:28 UTC

Sarge, Alinator:

The reason I didn't suggest that in the first place is that this report didn't sound like the known faulty WUs we're trying to get cancelled.

If they were 'Splitsville' WUs, and there was a plentiful supply of replacements, then I would agree: abort them.

But just at the moment:
1) I'm wondering whether it's a system error, not a WU error.
2) There aren't any replacements to be had.
3) It just, just possibly could be the first report of a new class of WU problem.

- all of which leads me to continue to suggest that it would be more useful to nurse at least one to completion, so we can examine the entrails.
ID: 625751
harry hirsch
Joined: 30 Nov 05
Posts: 6
Credit: 1,483,338
RAC: 0
Germany
Message 625752 - Posted: 24 Aug 2007, 17:16:26 UTC - in response to Message 625463.  

I have had the same problem for the last 4 hours. When I close the BOINC Manager, I haven't done any CPU time.
ID: 625752
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 625755 - Posted: 24 Aug 2007, 17:20:19 UTC - in response to Message 625752.  

I have had the same problem for the last 4 hours. When I close the BOINC Manager, I haven't done any CPU time.

It might be the way in which you close the manager. The upper right hand side "red x" closes just the manager. Doing a "File" then "exit" closes both the manager and the daemon (boinc.exe) when installed as single user. If installed as service it will run regardless of how the manager is closed.
ID: 625755
Bill Walker
Joined: 4 Sep 99
Posts: 3868
Credit: 2,697,267
RAC: 0
Canada
Message 625764 - Posted: 24 Aug 2007, 17:29:35 UTC - in response to Message 625698.  
Last modified: 24 Aug 2007, 17:34:45 UTC

Since you're getting these random restarts, I'd add: check the graphics card fan for proper cooling, and check the card itself is properly seated in its slot, as well as following bounty.hunter's advice about graphics drivers and DirectX.


First, let me say thanks to Richard for taking the time to work with me on this. It has been driving me crazy. And thanks to others for sharing similar problems. Misery loves company.

Checked all my drivers are up to date, fan is working, per the Intel TAT load test, and CPUs are at 60 to 61C with BOINC running flat out on both processors. I think that is acceptable, but would welcome any correction.

Not sure about re-seating video card. I'm running on a Toshiba Satellite laptop, about one year old, with no problems so far (touch wood). Can cards come loose on these? Also, if it was a hardware problem, wouldn't I see effects on my other BOINC projects?

ID: 625764
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 625766 - Posted: 24 Aug 2007, 17:30:23 UTC - in response to Message 625752.  

I have had the same problem for the last 4 hours. When I close the BOINC Manager, I haven't done any CPU time.

Your problem is the opposite of Bill Walker's. He's done lots of time, but not seen any progress as a result of it.

If you're saying that you actually had the BOINC manager open for 4 hours, but the CPU time didn't increase at all, I can only suggest:

Have you got any tasks, from any project, which are listed as 'Running', 'Waiting to run', or 'Ready to start'? Nothing else will ever get any CPU time.

If you have WUs, but BOINC is spending no time on them, then maybe you have the preference setting "Suspend work while computer is in use?" set to 'yes', either on this website (under 'Your account', 'General preferences'), or in the BOINC Manager itself - the BOINC Manager settings take priority.

If that doesn't help you cure it, restart BOINC to get a new set of start-up messages at the top of your message log, and post the first 20 lines or so here.
ID: 625766
harry hirsch
Joined: 30 Nov 05
Posts: 6
Credit: 1,483,338
RAC: 0
Germany
Message 625773 - Posted: 24 Aug 2007, 17:35:16 UTC - in response to Message 625766.  

I've deleted the work, got new work and now it works normally...
ID: 625773
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 625778 - Posted: 24 Aug 2007, 17:39:01 UTC - in response to Message 625773.  

I've deleted the work, got new work and now it works normally...

Ah. WU 147537487. That was one of the Splitsville Specials. Right thing to do.
ID: 625778
Bill Walker
Joined: 4 Sep 99
Posts: 3868
Credit: 2,697,267
RAC: 0
Canada
Message 626021 - Posted: 24 Aug 2007, 23:13:21 UTC - in response to Message 625778.  

Richard - your suggestion to leave applications in memory when suspended may be working. My most recent offending job, 02mr07ah.29213.481.10.5.111_0, appears to be switching on and off at regular intervals. Time to completion is slowly going down, and it is up to 51% complete, with 7:15:00 to completion. I'll keep my fingers crossed!

ID: 626021
Bill Walker
Joined: 4 Sep 99
Posts: 3868
Credit: 2,697,267
RAC: 0
Canada
Message 626315 - Posted: 25 Aug 2007, 11:50:34 UTC - in response to Message 626021.  

Richard - your suggestion to leave applications in memory when suspended may be working.


Success! Job completed normally last night, after 46,569 seconds. New job appears to be running normally. Results are at http://setiathome.berkeley.edu/workunit.php?wuid=150079901
waiting to have their entrails read.

ID: 626315
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 626323 - Posted: 25 Aug 2007, 12:08:08 UTC - in response to Message 626315.  
Last modified: 25 Aug 2007, 12:08:37 UTC

Richard - your suggestion to leave applications in memory when suspended may be working.

Success! Job completed normally last night, after 46,569 seconds. New job appears to be running normally. Results are at http://setiathome.berkeley.edu/workunit.php?wuid=150079901
waiting to have their entrails read.

Result 596007685. Lots of restarts, as expected, but nothing else obvious....

....except each time it restarts, it seems to choose a different set of (supposedly optimal) processor routines - ChirpData runs anything from basic to SSE3. Any programmers/optimisers like to comment? My first guess would be overheating, but I'm out of my depth here.
ID: 626323
Bill Walker
Joined: 4 Sep 99
Posts: 3868
Credit: 2,697,267
RAC: 0
Canada
Message 626350 - Posted: 25 Aug 2007, 13:30:57 UTC - in response to Message 626323.  

My first guess would be overheating, but I'm out of my depth here.


Can anybody tell me what a "normal" temperature range is for Pentium Ds in a laptop? I've been using the Intel temperature monitoring utility for the last couple of days, and I see temps normally in the 61 to 63C range. When the screen shuts down, the temp will drop to about 57C, coming back up within about 5 minutes of the screen being re-activated. The default Toshiba temp warning limit for the CPUs is set at 68C, so I thought I had some cushion.

The only other temps I can find info on (so far) are "ACPI Thermal Zone TZ01_0" and "...TZ02_0", both are always within 1C of the CPU temps.

ID: 626350
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 626356 - Posted: 25 Aug 2007, 13:40:16 UTC - in response to Message 626350.  
Last modified: 25 Aug 2007, 14:01:23 UTC


Can anybody tell me what a "normal" temperature range is for Pentium Ds in a laptop?

Pentium Ds normally run hot... using a Pentium D in a laptop sounds unusual, and I wouldn't expect much battery life... are you sure it's not one of those new 2007 Pentium Dual-Cores? (different thing)


I've been using the Intel temperature monitoring utility for the last couple of days, and I see temps normally in the 61 to 63C range. When the screen shuts down, the temp will drop to about 57C, coming back up within about 5 minutes of the screen being re-activated. The default Toshiba temp warning limit for the CPUs is set at 68C, so I thought I had some cushion.

Sounds good to me if it's a Pentium D. If the hard drive is a Toshiba, check it with the BIOS test if possible; there seem to be a few of these failing lately, at least in my area, and that causes weird things to happen.


The only other temps I can find info on (so far) are "ACPI Thermal Zone TZ01_0" and "...TZ02_0", both are always within 1C of the CPU temps.


How does the air that blows out feel (hot/cold/not there)? Can you hear the fan?

Good luck with this :D

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 626356
Bill Walker
Joined: 4 Sep 99
Posts: 3868
Credit: 2,697,267
RAC: 0
Canada
Message 626390 - Posted: 25 Aug 2007, 14:45:23 UTC - in response to Message 626356.  

Excuse the brain f*rt, they are dual core T2050s. The Pentium Ds are in my desktop, not available for BOINC.

The fan comes on full at 58/59C, with a "fairly warm", not uncomfortable for a short time, air flow. This behaviour has been pretty constant since I got the machine. Blocking the air outlet by hand is good for about an extra 2C.

I will run the Windows hard drive test, and let you know. Can anybody recommend a better (free) hard drive checker?

ID: 626390
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 626436 - Posted: 25 Aug 2007, 16:17:06 UTC - in response to Message 626390.  
Last modified: 25 Aug 2007, 16:28:29 UTC

Excuse the brain f*rt, they are dual core T2050s. The Pentium Ds are in my desktop, not available for BOINC.

The fan comes on full at 58/59C, with a "fairly warm", not uncomfortable for a short time, air flow. This behaviour has been pretty constant since I got the machine. Blocking the air outlet by hand is good for about an extra 2C.

I will run the Windows hard drive test, and let you know. Can anybody recommend a better (free) hard drive checker?


Depending on the brand of drive, most manufacturers have a free diagnostic...
[e.g. Maxtor PowerMax, Seagate SeaTools, Western Digital Diagnostics]

[Later: those are generally best for determining if a hard drive is going to fail soon, but if scandisk etc. with surface scan enabled starts to show bad clusters then that's another indicator... These can sometimes come good with a 'Darik's Boot and Nuke' (far better than a format) & reinstall, but I'd never rely on them to store important data again.]


"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 626436


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.