Extended Outage July 23 2010 Problems

Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 1018934 - Posted: 24 Jul 2010, 0:30:44 UTC

Here is the post outage report of problems

Please consider a Donation to the Seti Project.

ID: 1018934 · Report as offensive
Profile soft^spirit
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1019008 - Posted: 24 Jul 2010, 3:18:29 UTC - in response to Message 1018934.  

After the outage was over, for some reason it stopped requesting CPU work (on an Intel i5 machine), with 4 units uncrunched, estimated at about 1 hour each for 3 processors. Took a nap, and the computer had dry CPUs. Hit upload, and it came clear.

6.10.58, not attached to other projects, latest nvidia drivers, no custom settings/applications. No limits shown. Cache was set to 5 days.


Janice
ID: 1019008 · Report as offensive
Profile Uli
Volunteer tester
Joined: 6 Feb 00
Posts: 10923
Credit: 5,996,015
RAC: 1
Germany
Message 1019033 - Posted: 24 Jul 2010, 4:05:12 UTC

No problems here, except the boards are slowing down.
Pluto will always be a planet to me.

Seti Ambassador
Not too late to order an Anni Shirt
ID: 1019033 · Report as offensive
Ianab
Volunteer tester

Joined: 11 Jun 08
Posts: 732
Credit: 20,635,586
RAC: 5
New Zealand
Message 1019153 - Posted: 24 Jul 2010, 10:36:19 UTC

Downloads of new work were painfully slow with a few retries, probably due to the higher limits this time around. Not enough congestion to cause a real problem, but I think a slightly lower limit initially would make things smoother, then ramp up as the traffic slows down?

Ian
ID: 1019153 · Report as offensive
JohnDK Crowdfunding Project Donor*Special Project $250 donor
Volunteer tester
Joined: 28 May 00
Posts: 1222
Credit: 451,243,443
RAC: 1,127
Denmark
Message 1019249 - Posted: 24 Jul 2010, 16:47:31 UTC

Not sure if it's a problem, but it's strange how difficult it is to get more tasks at the moment. The Cricket graph shows it's only running at half capacity, but 9 times or more out of 10 no work is sent when requested.

btw I'm not missing WUs at all just now, just wondering.
ID: 1019249 · Report as offensive
JohnDK Crowdfunding Project Donor*Special Project $250 donor
Volunteer tester
Joined: 28 May 00
Posts: 1222
Credit: 451,243,443
RAC: 1,127
Denmark
Message 1019256 - Posted: 24 Jul 2010, 17:11:45 UTC

Current result creation rate is very low, guess that's why it's so difficult to get WUs.
ID: 1019256 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1019281 - Posted: 24 Jul 2010, 19:27:32 UTC - in response to Message 1019256.  

Current result creation rate is very low, guess that's why it's so difficult to get WUs.

As long as there's "Results ready to send" of both types, creation merely adds to the end of that queue. If anything, low creation rate reduces database load somewhat and allows other processes to run a bit more efficiently.
Joe
ID: 1019281 · Report as offensive
Profile perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1019286 - Posted: 24 Jul 2010, 19:44:20 UTC - in response to Message 1019256.  

Current result creation rate is very low, guess that's why it's so difficult to get WUs.


Looking through the new work I've got, I found a lot of -2s and -3s, meaning they've been out before. Most were timeouts and either aborts or client detaches. One is a -6 that has been waiting since Feb. 26th for someone else to complete it. It looks like a big portion of them were ghosts timing out.

But anyway, my point is that result creation rate isn't all that important as far as the amount of work going out so long as there are ghosts, aborts, timeouts, and detaches to send.



PROUD MEMBER OF Team Starfire World BOINC
ID: 1019286 · Report as offensive
Profile Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1019365 - Posted: 25 Jul 2010, 2:31:11 UTC - in response to Message 1019281.  
Last modified: 25 Jul 2010, 2:32:38 UTC

Current result creation rate is very low, guess that's why it's so difficult to get WUs.

As long as there's "Results ready to send" of both types, creation merely adds to the end of that queue. If anything, low creation rate reduces database load somewhat and allows other processes to run a bit more efficiently.
Joe


The real bottleneck, besides high network traffic, is the Download Feeder process. It only holds 100 WUs at a time, a mix of S@H Enhanced CPU/GPU, Astropulse, and Cuda tasks. It refills on a 5-6 second cycle, so even though there are several hundred thousand "Results Ready to Send", there are actually only 100 available in each 6-second cycle. If your request for work hits the Scheduling Server during the portion of the Feeder cycle when it has no tasks of the type you are requesting, you get the "No Tasks Available" message, and have to try again later.
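
One rough way to picture the Feeder behaviour described above is a fixed-size set of slots that is topped up once per cycle. The sketch below is only an illustration and not the actual BOINC feeder code: the `Feeder` class, the task-type names, and the per-type mix are assumptions made up for the example. Only the 100-slot size and the roughly 6-second cycle come from the post.

```python
from collections import Counter

SLOTS = 100          # slot count quoted in the post
CYCLE_SECONDS = 6    # upper end of the 5-6 second refill cycle quoted in the post


class Feeder:
    """Toy model of a fixed-size cache of tasks that are ready to hand out."""

    def __init__(self, mix):
        # mix: hypothetical number of slots per task type; must fill all slots
        assert sum(mix.values()) == SLOTS
        self.mix = dict(mix)
        self.slots = Counter()
        self.refill()

    def refill(self):
        """Top every task type back up to its share of the slots."""
        self.slots = Counter(self.mix)

    def request(self, task_type):
        """Hand out one task of task_type, or report that none is cached."""
        if self.slots[task_type] > 0:
            self.slots[task_type] -= 1
            return "task sent"
        return "No Tasks Available"


def max_tasks_per_hour(slots=SLOTS, cycle=CYCLE_SECONDS):
    """Upper bound on tasks handed out per hour, however long the queue is."""
    return slots * 3600 // cycle
```

In this toy model, a request that arrives after the slots for its task type are used up gets "No Tasks Available" until the next refill, and with the numbers from the post the ceiling works out to 100 × 3600 / 6 = 60,000 tasks per hour no matter how many results are ready to send.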

So as Joe and perryjay said, as long as there are plenty of "Results Ready to Send", the actual creation rate is irrelevant.
ID: 1019365 · Report as offensive
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 1019392 - Posted: 25 Jul 2010, 5:54:03 UTC

Jeff, Eric

Looking at the Cricket Graphs and knowing I have one machine that has been attempting to get work all day... There is a process that is stuck for downloads.

Regards


Please consider a Donation to the Seti Project.

ID: 1019392 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13745
Credit: 208,696,464
RAC: 304
Australia
Message 1019398 - Posted: 25 Jul 2010, 6:21:47 UTC - in response to Message 1019392.  

Looking at the Cricket Graphs and knowing I have one machine that has been attempting to get work all day... There is a process that is stuck for downloads.

Could be the machine. Traffic is around the level it has been after the last few outages once everyone has hit the per-client Work Unit limit; I hit the limit ages back but have been getting work after each Work Unit is completed.
Grant
Darwin NT
ID: 1019398 · Report as offensive
Profile perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1019723 - Posted: 26 Jul 2010, 15:41:06 UTC

Problems?? GHOSTS, GHOSTS, GHOSTS!!!!!

There has to be something the guys at Berkeley can do to slow these things down! I've got a ton of them again. A lot of them are past ghosts from others. One of them is already a -6 so it might kill it if it gets sent in again. I have too much work to do a run-down/detach right now so I will have to hold on to them until Friday at the least.


PROUD MEMBER OF Team Starfire World BOINC
ID: 1019723 · Report as offensive
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 1019963 - Posted: 27 Jul 2010, 4:54:36 UTC
Last modified: 27 Jul 2010, 4:54:49 UTC

As quoted from Joe

My latest CPU MB WUs are way underestimated, like 17 minutes and 6 minutes; this will cause catastrophic failures (-177, here we come) going forward.

I have to agree that DA seems to have fumbled the ball again. And it's a pity, as things seemed to be working quite well the last week or two.

I have about 100 of these now; I have suspended new tasks until I find a way to handle these. Must I abort them, because they will all error out with -177? Or is there something simple I can do with them?

Thanks for your help!

There are at least two relatively simple fixes. The more sophisticated one is Fred M's new rescheduler which you can get from http://www.efmer.eu/forum_tt/index.php?topic=428.0. It can boost the rsc_fpops_bound values for all S@H MB tasks to 5e17 which amounts to more than a year on even the fastest hosts. That removes the protection against a hung task which the bound is meant to provide, but there's no other downside AFAIK.

The even simpler alternative is to shut BOINC down completely and do a global replace in client_state.xml of all <rsc_fpops_bound> with <rsc_fpops_bound>3. That puts an extra leading 3 on each value, which boosts the bound by a factor of 4 at least, but affects all tasks for all projects. If you can wait until the beginning of the outage, doing that just twice gives a boost of at least 34. That should be sufficient protection against -177 errors.
Joe
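
To see where the factors of 4 and 34 come from, here is a small sketch of the arithmetic. It assumes the bound is stored in plain decimal notation (for example `447067000000000.000000`, a made-up value) and not in exponent form. `prepend_three`, `boost_factor`, and `boost_tags` are hypothetical helper names, and this is not a tested tool for editing a live client_state.xml.

```python
def prepend_three(value_text, times=1):
    """Return the number obtained by writing `times` 3s in front of the digits."""
    return float("3" * times + value_text)


def boost_factor(value_text, times=1):
    """How much larger the bound becomes after the textual edit."""
    return prepend_three(value_text, times) / float(value_text)


def boost_tags(xml_text):
    """Apply the same global replace to a string holding client_state.xml content."""
    return xml_text.replace("<rsc_fpops_bound>", "<rsc_fpops_bound>3")
```

Under that assumption, the smallest gain comes from a value whose digits are all 9s (999 becomes 3999, a factor just over 4) and the largest from a 1 followed by zeros (100 becomes 3100, a factor of 31). Applying the replace twice puts 33 in front, so the same worst case gives just over 34.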


So there are some things afoot.
Please consider a Donation to the Seti Project.

ID: 1019963 · Report as offensive
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 1020219 - Posted: 28 Jul 2010, 1:06:14 UTC

Okay, the odd one...

My machine with the 9800 GT, which had a bunch of the shorties (~3 min each) and a DCF of 0.145, is now showing a DCF of 1.0004 and the run times look normal. The machine with the 250 has not caught up yet.

Regards


Please consider a Donation to the Seti Project.

ID: 1020219 · Report as offensive
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 1020398 - Posted: 28 Jul 2010, 16:55:32 UTC
Last modified: 28 Jul 2010, 16:55:44 UTC

I have had a user state that they received a Detach and Reattach message for no apparent reason. Can anyone else confirm this?

Regards
Please consider a Donation to the Seti Project.

ID: 1020398 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13745
Credit: 208,696,464
RAC: 304
Australia
Message 1020413 - Posted: 28 Jul 2010, 17:43:26 UTC - in response to Message 1020398.  
Last modified: 28 Jul 2010, 17:43:44 UTC

Hasn't happened to me, but I remember quite a few posts about it during the first couple of extended outages.
Grant
Darwin NT
ID: 1020413 · Report as offensive
Lonnie Christensen
Volunteer tester
Joined: 1 Feb 04
Posts: 7
Credit: 3,091,656
RAC: 0
United States
Message 1020511 - Posted: 29 Jul 2010, 3:30:22 UTC

I can't seem to get any work out of Seti. I have eighteen computers and they are set to accept four days of work ahead. I can't even get one...........
ID: 1020511 · Report as offensive
Profile Uli
Volunteer tester
Joined: 6 Feb 00
Posts: 10923
Credit: 5,996,015
RAC: 1
Germany
Message 1020534 - Posted: 29 Jul 2010, 4:34:03 UTC - in response to Message 1020511.  

Welcome to the boards, Lonnie. I won't hide or move your post, but you won't be able to get any work until tomorrow sometime, Berkeley time, just as the title of this thread states.
Pluto will always be a planet to me.

Seti Ambassador
Not too late to order an Anni Shirt
ID: 1020534 · Report as offensive
parl

Joined: 22 May 04
Posts: 95
Credit: 4,476,976
RAC: 0
United States
Message 1020843 - Posted: 30 Jul 2010, 17:16:55 UTC

Oddly enough, the u/l and d/l servers as well as the splitters are currently not operational. So no work yet.
ID: 1020843 · Report as offensive
parl

Joined: 22 May 04
Posts: 95
Credit: 4,476,976
RAC: 0
United States
Message 1020846 - Posted: 30 Jul 2010, 17:21:02 UTC

OK. I was able to u/l and report, even though the Server status page says the u/l server is Disabled.
ID: 1020846 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.