Progress counting backwards on some WU's
Author | Message |
---|---|
S@NL - John van Gorsel Joined: 5 Jul 99 Posts: 193 Credit: 139,673,078 RAC: 0 |
I've been struggling with one machine for a while now and I haven't been able to pinpoint the problem. Maybe someone can point me in the right direction. The PC is a Dell laptop (Core Duo T7100) and since day one I could not get the RAC above 250 because half of all WU's failed. Not just Seti, but Einstein, Rosetta and Leiden as well. The type of error varies: sometimes a "computation error" after a few seconds, sometimes the WU gets stuck (no progress, no graphics). With Seti WU's I noticed that around 5% of the WU's are processed to more than 99% (I had a very frustrating 99.998% today...), then start counting backwards to around 95%, and then go back up again. Some of these units actually make it to 100%, but most of them die close to 100%. What I did so far:
Seti@Netherlands website |
JDWhale Joined: 6 Apr 99 Posts: 921 Credit: 21,935,817 RAC: 3 |
"I've been struggling with one machine for a while now and I haven't been able to pinpoint the problem. Maybe someone can point me in the right direction." First off.... Welcome to the message boards and congratulations on your recent 9 year anniversary of joining Seti@home. With a quick look through the available result history on your T7100 I cannot find any WUs ending with errors. I do see some variation of CPU times on "like" WUs (WUs with the same Angle Range), but this sometimes happens depending on workload. Honestly, I've never noticed the Progress % decreasing, but I admit that I do not pay attention to that number. I sometimes notice the "To Completion" time increasing slightly, though. Since this is your "work" computer, maybe you should enable the virus scanner and scheduled backups. Good luck, JDWhale |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Before setiathome_enhanced, there were hosts which would get to 100% progress significantly before the work was really done, and there was fairly frequent discussion here about that. The basic progress calculation depends on weighted estimates of how much time various operations will take; it theoretically shouldn't ever get past 100%, but apparently there's an undiscovered flaw. Eric Korpela decided to modify the progress for _enhanced such that it cannot possibly go past 100%. It's done by blending the basic progress with another ratio guaranteed not to exceed 100%, since it determines when the app knows it has finished. That blend is exponential, so the new ratio becomes dominant around the 90% progress point. If the basic progress happens to have gotten ahead of where it should have been, that can result in decreasing progress for a while. In short, I think your host is simply exposing a known but undiagnosed flaw in the progress calculation. Judging by your results it seems safe to ignore. If you're curious, the state.sah checkpoint file in the slot directory has a <prog> line which preserves the basic progress. It's probably going to be 0.9999999 while BOINC Manager is showing decreasing progress; if not, my analysis is wrong. I haven't addressed WU failures on other projects, I know. Perhaps whatever is causing the progress calculation oddity has a worse effect there. Joe |
S@NL - John van Gorsel Joined: 5 Jul 99 Posts: 193 Credit: 139,673,078 RAC: 0 |
Thanks, when I get this laptop running at its full potential I may even reach 2M credits by the time I reach my 10th anniversary... As for the virus scanner and the backup: the only suspicious incoming source on that laptop is e-mail, and since there is a very intolerant virus scanner on our mail server I'm not too concerned about that. Backups are made on a daily basis to my own network and that feels a lot safer than the "company standard" weekly backup over a VPN connection. I had to cancel about 5 WU's in the last week because they got stuck, and I can find only 1 of them in the task list. Since this WU is "aborted by user" there's not much info to find there. Any suggestions on what to do once a WU gets stuck? If I abort that unit, no data is sent back to the server, so I guess I need to suspend that WU and check the result file on the laptop. Seti@Netherlands website |
S@NL - John van Gorsel Joined: 5 Jul 99 Posts: 193 Credit: 139,673,078 RAC: 0 |
Thanks Joe, I'll keep an eye on the state.sah file once I see the progress bar go down. Seti@Netherlands website |
Mumps [MM] Joined: 11 Feb 08 Posts: 4454 Credit: 100,893,853 RAC: 30 |
Total shot in the dark... When this happens, is the system slow in general? Could it be that the system is getting to a point where it has run out of memory and started to page? I know that my work laptop (which also only has 1 GB of memory) does sometimes run out. That shouldn't cause the CPU seconds reported by BOINC to go up, but I'm guessing that the activity towards the end of a SETI WU may include reviewing the results it has processed to build the "package" to be returned, and that this may be more memory intensive than the crunching in general. Other projects, like Einstein and Rosetta, may be more memory intensive overall and show ongoing issues more readily. Just looking for a box to think outside of... |
S@NL - John van Gorsel Joined: 5 Jul 99 Posts: 193 Credit: 139,673,078 RAC: 0 |
I checked the state.sah file when I noticed one of the WU's start counting backwards, and while the progress bar went back from 99.995% to around 95%, the progress counter in the state.sah file remained at 0.99999990. I do have two more WU's that are frozen at 95.171% and 99.352%. The processor is still at 100% load but the progress bar didn't change for over an hour. What is the best way to diagnose this? Seti@Netherlands website |
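
The observation above (basic progress pinned at 0.99999990 while the displayed value dips and recovers) matches the blending Joe describes. A minimal sketch of that idea follows. It is not the actual setiathome_enhanced code: the function name `blended_progress`, the weighting `done_ratio ** sharpness`, and the value of `sharpness` are all assumptions chosen only to illustrate how a blend that shifts toward a capped ratio late in the run can make the displayed progress go backwards.

```python
def blended_progress(basic, done_ratio, sharpness=10.0):
    """Blend an optimistic basic estimate with a ratio that cannot exceed 1.

    Hypothetical model, not the real application formula:
      basic      -- basic progress estimate (may run ahead of the real work)
      done_ratio -- fraction of work known to be finished, capped at 1.0
      sharpness  -- assumed exponent; larger values delay the hand-over
    """
    basic = min(max(basic, 0.0), 1.0)
    done_ratio = min(max(done_ratio, 0.0), 1.0)
    # The weight stays near 0 early on and grows quickly as done_ratio nears 1,
    # so done_ratio only starts to dominate late in the run.
    weight = done_ratio ** sharpness
    return min(1.0, (1.0 - weight) * basic + weight * done_ratio)


if __name__ == "__main__":
    # Basic progress has run ahead and is stuck just below 100%,
    # as seen in state.sah, while the real work is still catching up.
    stuck_basic = 0.9999999
    for percent_done in (50, 80, 90, 95, 99, 100):
        shown = blended_progress(stuck_basic, percent_done / 100.0)
        print(f"work done {percent_done:3d}%  displayed {shown * 100:.3f}%")
```

Under these assumptions the displayed value sits near 100% early, drops by a few percent once the capped ratio starts to dominate, and then climbs back to exactly 100% when the work really is finished. The size of the dip depends entirely on the assumed exponent.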
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.