Progress counting backwards on some WU's

S@NL - John van Gorsel
Volunteer tester
Joined: 5 Jul 99
Posts: 193
Credit: 139,673,078
RAC: 0
Netherlands
Message 780294 - Posted: 7 Jul 2008, 16:43:52 UTC

I've been struggling with one machine for a while now and I haven't been able to pinpoint the problem. Maybe someone can point me in the right direction.

The pc is a Dell laptop (Core Duo T7100) and since day one I could not get the RAC above 250 because half of all WU's failed. Not just Seti, but Einstein, Rosetta and Leiden as well. The type of error varies: sometimes a "computation error" after a few seconds, sometimes the WU gets stuck (no progress, no graphics). With Seti WU's I noticed that around 5% of the WU's are processed to more than 99% (I had a very frustrating 99.998% today...), and then start processing backwards to around 95%, and go back up again. Some of these units actually make it to 100%, but most of them die close to 100%.

What I did so far:

  • updated BOINC to the latest version (5.10.45)
  • installed the v8 optimized applications
  • killed the virus scanner and the scheduled backups


The optimized apps certainly improved the RAC (660 after one week) but still around 5% of the WU's do this backward routine and half of those don't make it to the end.
I have no reason to suspect the hardware since I don't have any problems with other programmes. It is a laptop from work though and I don't control all that's running in the background.

Any suggestions are welcome.

Regards,
John





Seti@Netherlands website
ID: 780294
JDWhale
Volunteer tester
Joined: 6 Apr 99
Posts: 921
Credit: 21,935,817
RAC: 3
United States
Message 780304 - Posted: 7 Jul 2008, 17:18:15 UTC - in response to Message 780294.  

First off.... Welcome to the message boards and congratulations on your recent 9 year anniversary of joining Seti@home.

From a quick look through the available result history on your T7100, I cannot find any WUs ending with errors. I do see some variation in CPU times on "like" WUs (WUs with the same Angle Range), but this sometimes happens depending on workload.

Honestly, I've never noticed the Progress % decreasing but I admit that I do not pay attention to that number. I sometimes notice "To Completion" time increasing slightly, though.

Since this is your "work" computer, maybe you should enable the virus scanner and scheduled backups.

Good luck,
JDWhale
ID: 780304
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 780326 - Posted: 7 Jul 2008, 18:13:12 UTC - in response to Message 780294.  
Last modified: 7 Jul 2008, 18:17:30 UTC

Before setiathome_enhanced, there were hosts which would get to 100% progress significantly before the work was really done, and there was fairly frequent discussion here about that. The basic progress calculation depends on weighted estimates of how much time various operations will take; it theoretically shouldn't ever get past 100%, but apparently there's an undiscovered flaw. Eric Korpela decided to modify the progress for _enhanced such that it cannot possibly go past 100%. It's done by blending the basic progress with another ratio guaranteed not to exceed 100%, since it determines when the app knows it has finished. That blend is exponential, so the new ratio becomes dominant around the 90% progress point. If the basic progress happens to have gotten ahead of where it should have been, that can result in decreasing progress for a while.
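
To make the blending idea concrete, here is a rough Python sketch. It is not the actual setiathome_enhanced code: the function name, the weighting formula, and the choice of making the two inputs equally weighted at 90% are all assumptions picked only to mirror the description above. It shows how a displayed value can go down when the basic progress has run ahead of the bounded ratio.

```python
import math

# Assumption: the bounded ratio gets half the weight when basic progress is 0.90.
SHARPNESS = math.log(2) / 0.1


def blended_progress(basic: float, bounded: float) -> float:
    """Blend a possibly over-eager estimate with a ratio that cannot exceed 1.

    basic   -- weighted-estimate progress (may run ahead of the real work)
    bounded -- fraction of work actually finished, never above 1.0
    """
    # Clamp both inputs to [0, 1]; this is a simplification for the sketch.
    basic = min(max(basic, 0.0), 1.0)
    bounded = min(max(bounded, 0.0), 1.0)
    # Weight of the bounded ratio grows exponentially as basic approaches 1.
    weight = math.exp(SHARPNESS * (basic - 1.0))
    return (1.0 - weight) * basic + weight * bounded


if __name__ == "__main__":
    # Basic progress races ahead while the bounded ratio barely moves,
    # so the blended value drops between the two samples.
    earlier = blended_progress(0.85, 0.58)
    later = blended_progress(0.99, 0.60)
    print(f"earlier={earlier:.4f} later={later:.4f}")
```

Because the result is a weighted average of two values that are each at most 1, it can never exceed 100%, but nothing forces it to be monotonic.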

In short, I think your host is simply exposing a known but undiagnosed flaw in the progress calculation. Judging by your results it seems safe to ignore.

If you're curious, the state.sah checkpoint file in the slot directory has a <prog> line which preserves the basic progress. It's probably going to be 0.9999999 while BOINC Manager is showing decreasing progress; if not, my analysis is wrong.
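
For anyone who would rather script that check than open the file by hand, a minimal Python sketch follows. It assumes the value sits in a line of the form `<prog>0.9999999</prog>`; the exact layout of state.sah is not confirmed here, and the helper names `parse_prog` and `read_prog` are made up for illustration.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed layout: "<prog>VALUE</prog>" (closing tag treated as optional).
PROG_RE = re.compile(r"<prog>\s*([0-9.eE+-]+)\s*(?:</prog>)?")


def parse_prog(text: str) -> Optional[float]:
    """Return the first <prog> value found in the text, or None."""
    match = PROG_RE.search(text)
    if match is None:
        return None
    try:
        return float(match.group(1))
    except ValueError:
        # The tag was present but did not hold a valid number.
        return None


def read_prog(path) -> Optional[float]:
    """Read a checkpoint file and return its <prog> value, or None."""
    return parse_prog(Path(path).read_text(errors="replace"))
```

Call `read_prog` with the path to state.sah in the slot directory of the task in question, and compare the result with the percentage BOINC Manager shows.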

I haven't addressed WU failures on other projects, I know. Perhaps whatever is causing the progress calculation oddity has a worse effect there.
Joe
ID: 780326
S@NL - John van Gorsel
Volunteer tester
Joined: 5 Jul 99
Posts: 193
Credit: 139,673,078
RAC: 0
Netherlands
Message 780330 - Posted: 7 Jul 2008, 18:21:49 UTC

Thanks, when I get this laptop running at its full potential I may even reach 2M credits by the time I reach my 10th anniversary...
As for the virus scanner and the backup: the only suspicious incoming source on that laptop is e-mail, and since there is a very intolerant virus scanner on our mail server I'm not too concerned about that. Backups are made on a daily basis to my own network, and that feels a lot safer than the "company standard" weekly backup over a VPN connection.

I had to cancel about 5 WU's in the last week because they got stuck, and I can find only 1 of them in the task list. Since this WU is "aborted by user" there's not much info to find there.

Any suggestions on what to do once a WU gets stuck? If I abort that unit, no data is sent back to the server so I guess I need to suspend that WU and check the result file on the laptop.


Seti@Netherlands website
ID: 780330
S@NL - John van Gorsel
Volunteer tester
Joined: 5 Jul 99
Posts: 193
Credit: 139,673,078
RAC: 0
Netherlands
Message 780331 - Posted: 7 Jul 2008, 18:27:35 UTC

Thanks Joe, I'll keep an eye on the state.sah file once I see the progress bar go down.


Seti@Netherlands website
ID: 780331
Mumps [MM]
Volunteer tester
Joined: 11 Feb 08
Posts: 4454
Credit: 100,893,853
RAC: 30
United States
Message 780493 - Posted: 8 Jul 2008, 0:50:44 UTC

Total shot in the dark...

When this happens, is the system slow in general? Could it be that the system has run out of memory and started to page? My work laptop (which also has only 1 GB of memory) sometimes runs out. Paging shouldn't cause the CPU seconds reported by BOINC to go up, but I'm guessing that the activity towards the end of a SETI WU may include reviewing the results it has processed to build the "package" to be returned, and that this may be more memory-intensive than the crunching in general. Other projects, like Einstein and Rosetta, may be more memory-intensive and show ongoing issues more readily.

Just looking for a box to think outside of...
ID: 780493
S@NL - John van Gorsel
Volunteer tester
Joined: 5 Jul 99
Posts: 193
Credit: 139,673,078
RAC: 0
Netherlands
Message 780735 - Posted: 8 Jul 2008, 14:47:54 UTC

I checked the state.sah file when I noticed one of the WU's start counting backwards, and while the progress bar went back from 99.995% to around 95%, the progress counter in the state.sah file remained at 0.99999990.

I do have two more WU's that are frozen at 95.171% and 99.352%. The processor is still at 100% load but the progress bar hasn't changed for over an hour. What is the best way to diagnose this?



Seti@Netherlands website
ID: 780735