Panic Mode On (28) Server problems

Message boards : Number crunching : Panic Mode On (28) Server problems

1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 967137 - Posted: 31 Jan 2010, 1:10:51 UTC - in response to Message 967134.  

Maybe I did not state my case clearly.......

I do NOT want the client to do a project wide backoff when it feels it necessary.

You made your case perfectly clear, and I understood it perfectly.

You want all the available bandwidth, and if doing so slows everyone down including you then that's what you really want.

The technical term for what you want is a "Denial of Service Attack."
ID: 967137 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13847
Credit: 208,696,464
RAC: 304
Australia
Message 967145 - Posted: 31 Jan 2010, 1:39:54 UTC - in response to Message 967134.  

I want it to plonk the servers anytime it needs to get work or report it.

I know this is hard on the servers and bandwidth....

But that is what I want. When the servers are in trouble and it takes a couple of DAYS to clear the ready to send buffer, it drives the kitties wild.

The reason it takes a couple of days to clear is because of all the attempts to get work or report it. If that didn't happen, what takes 2 days to recover from would only take half a day.
It's the continuous retries that cause the problem.
Grant
Darwin NT
ID: 967145 · Report as offensive
FiveHamlet
Joined: 5 Oct 99
Posts: 783
Credit: 32,638,578
RAC: 0
United Kingdom
Message 967535 - Posted: 1 Feb 2010, 19:11:27 UTC

Looks to me like another shorties storm.
Got over 400 on Rig on a Bench and 200 plus on my other main cruncher.

Dave
ID: 967535 · Report as offensive
Dave

Joined: 29 Mar 02
Posts: 778
Credit: 25,001,396
RAC: 0
United Kingdom
Message 967537 - Posted: 1 Feb 2010, 19:25:03 UTC
Last modified: 1 Feb 2010, 19:25:27 UTC

What about the client "plinking" the server after a random time interval, e.g. it could be 1 min, could be 30, could be 3 hours?
ID: 967537 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 967552 - Posted: 1 Feb 2010, 21:02:18 UTC - in response to Message 967537.  

What about the client "plinking" the server after a random time interval, e.g. it could be 1 min, could be 30, could be 3 hours?

For anyone who is really interested in this, I'd recommend reading RFC-2821 (you can Google for it), because internet E-Mail has all of the same issues.

The section on "sending strategies" (section 4.5.4) goes straight to what the BOINC client is trying to do with the BOINC servers.

As a goal, you want the minimum number of connections per second to the server that is required to fully use the available resources. Double that number and everything takes twice as long, but the same number of "things" happen per minute, so staying on the low edge leaves some room for bursts and the like.
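To put rough numbers on that, here is a minimal sketch. It assumes an idealized link that is shared equally among all simultaneous transfers, and it ignores timeouts and protocol overhead. The 100 Mbit/s link and the 3 Mbit file size are made-up figures for illustration, and `transfer_stats` is a hypothetical helper, not anything in BOINC.

```python
def transfer_stats(concurrent, size_bits, link_bits_per_sec):
    """Idealized fair-share model: every transfer gets an equal slice of the link.

    Returns (seconds each transfer takes, transfers completed per minute).
    """
    # Each transfer only gets 1/concurrent of the link, so it takes longer.
    per_transfer_seconds = concurrent * size_bits / link_bits_per_sec
    # All `concurrent` transfers finish together after that many seconds.
    completed_per_minute = 60.0 * concurrent / per_transfer_seconds
    return per_transfer_seconds, completed_per_minute


# Illustrative figures only (assumptions, not measured values).
LINK = 100e6   # assumed 100 Mbit/s link
SIZE = 3e6     # assumed 3 Mbit per file

for clients in (100, 200):
    seconds, per_minute = transfer_stats(clients, SIZE, LINK)
    print(clients, "simultaneous:", seconds, "s each,", per_minute, "per minute")
```

Under these assumptions, going from 100 to 200 simultaneous transfers doubles the time each one takes while the number completed per minute stays the same. In practice it gets worse than that, because slow transfers time out and are retried.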

The project-wide backoff idea comes right out of RFC-2821. It says:

   Retries continue until the message is transmitted or the sender gives
   up; the give-up time generally needs to be at least 4-5 days.  The
   parameters to the retry algorithm MUST be configurable.

   A client SHOULD keep a list of hosts it cannot reach and
   corresponding connection timeouts, rather than just retrying queued
   mail items.

   Experience suggests that failures are typically transient (the target
   system or its connection has crashed), favoring a policy of two
   connection attempts in the first hour the message is in the queue,
   and then backing off to one every two or three hours.


The second paragraph is the interesting one. The idea is that if the receiving mail server can't accept mail right this second, the next message to them in the queue isn't likely to succeed if we send it right now.
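A minimal sketch of that idea in Python, assuming a simple doubling delay per host. `HostBackoff`, its method names, the 60-second starting delay and the 3-hour cap are all hypothetical choices for illustration, not what the BOINC client or any particular mail server actually does.

```python
from dataclasses import dataclass, field


@dataclass
class HostBackoff:
    """Remembers unreachable hosts so every queued item for a host waits together."""
    base_delay: float = 60.0         # assumed first retry delay, in seconds
    max_delay: float = 3 * 3600.0    # assumed cap, echoing "every two or three hours"
    _failures: dict = field(default_factory=dict)   # host -> consecutive failures
    _retry_at: dict = field(default_factory=dict)   # host -> earliest next attempt

    def can_try(self, host, now):
        """True if no backoff is pending for this host at time `now` (seconds)."""
        return now >= self._retry_at.get(host, 0.0)

    def record_failure(self, host, now):
        """Double the wait after each consecutive failure, up to max_delay."""
        count = self._failures.get(host, 0) + 1
        self._failures[host] = count
        delay = min(self.base_delay * 2 ** (count - 1), self.max_delay)
        self._retry_at[host] = now + delay
        return delay

    def record_success(self, host):
        """A successful connection clears the backoff for the whole host."""
        self._failures.pop(host, None)
        self._retry_at.pop(host, None)
```

The point of keying the table by host rather than by queued item is that one failed upload postpones every other upload to the same server, instead of each one discovering the outage separately.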

(I would not recommend a 4-5 day timeout for BOINC; it does not share that requirement with SMTP.)

If the client backoff were fairly extreme (or took due-date into account so that uploads for work due in two weeks were on a more leisurely schedule), the load on the SETI@Home servers would not have the peaks and valleys.
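Something along these lines is what I mean by taking the due date into account. This is only a sketch under stated assumptions: `deadline_aware_delay` and all of its parameters (10-minute floor, 1-day ceiling, waiting 10% of the remaining time) are hypothetical values picked for illustration, not BOINC settings.

```python
import random


def deadline_aware_delay(seconds_until_due, min_delay=600.0, max_delay=86400.0,
                         slack_fraction=0.1, rng=random):
    """Return a retry delay (seconds) that grows with the time left before the deadline."""
    # Wait a fixed fraction of the remaining time: lots of slack, long wait.
    target = slack_fraction * max(seconds_until_due, 0.0)
    # Clamp to the assumed floor and ceiling.
    bounded = max(min_delay, min(target, max_delay))
    # Scale by a random factor between 0.5 and 1.5 so clients do not retry in lockstep.
    return bounded * rng.uniform(0.5, 1.5)


# Work due in an hour retries within minutes; work due in two weeks waits many hours.
print(deadline_aware_delay(3600))
print(deadline_aware_delay(14 * 86400))
```

The random factor is there so that thousands of clients that failed at the same moment do not all come back at the same moment.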

But it would leave the average cruncher shocked and concerned because they've never seen how their E-Mail is handled, but they can see what the BOINC client is doing to try to send work.

... and it's the same issue, especially at busy sites.

Implement some BIG backoffs, and watch everything go through on the second try.
ID: 967552 · Report as offensive
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 967879 - Posted: 3 Feb 2010, 18:03:27 UTC

Not really a panic, but the Cricket graphs seem to have gone wonky.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 967879 · Report as offensive
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 967890 - Posted: 3 Feb 2010, 19:04:19 UTC - in response to Message 967879.  

Not really a panic, but the Cricket graphs seem to have gone wonky.

Someone mentioned that in the tech news post earlier. I was thinking it looked like the update job was still going, but just not collecting data. I've realized that when that normally happens the graph will stay at whatever level it was when it last updated. The current graph shows mostly nothing, literally "Cur: nan bits/sec". Tho that could just be how cricket responds to not getting new data. I use MRTG instead of cricket. So I don't really know its ins and outs.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 967890 · Report as offensive
Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 962
Credit: 537,293
RAC: 9
United Kingdom
Message 967926 - Posted: 3 Feb 2010, 20:36:54 UTC - in response to Message 967890.  

It's not just SETI's cricket graph that is down, I tried looking at a few other Berkeley routers and they are all missing data for the same period.
ID: 967926 · Report as offensive
52 Aces
Joined: 7 Jan 02
Posts: 497
Credit: 14,261,068
RAC: 67
United States
Message 967941 - Posted: 3 Feb 2010, 22:00:44 UTC
Last modified: 3 Feb 2010, 22:01:19 UTC

2/3/2010 1:57:12 PM SETI@home Requesting new tasks for CPU
2/3/2010 1:57:17 PM SETI@home Scheduler request completed: got 0 new tasks
2/3/2010 1:57:17 PM SETI@home Message from server: (Project has no jobs available)


Something just sucked away the spare inventory. It dropped fast. Maybe Higley school district is back online :-)

Good news is lots of 'tapes' still being split.
ID: 967941 · Report as offensive
Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 967944 - Posted: 3 Feb 2010, 22:24:55 UTC - in response to Message 967941.  
Last modified: 3 Feb 2010, 22:27:01 UTC

Good news is lots of 'tapes' still being split.

I wouldn't be sure about it. Current result creation rate: 3.3043/sec. Those are probably just resends.

EDIT: It could have something to do with "Workunits waiting for assimilation: 321,084". If they are not getting assimilated, they cannot be deleted -> no disc space. AFAIR we had that at least once.
ID: 967944 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19374
Credit: 40,757,560
RAC: 67
United Kingdom
Message 968039 - Posted: 4 Feb 2010, 7:54:20 UTC - in response to Message 967944.  

Good news is lots of 'tapes' still being split.

I wouldn't be sure about it. Current result creation rate: 3.3043/sec. Those are probably just resends.

EDIT: It could have something to do with "Workunits waiting for assimilation: 321,084". If they are not getting assimilated, they cannot be deleted -> no disc space. AFAIR we had that at least once.

I think you are correct. For the last few hours, since 05:45 UTC, the only message I get when requesting work is:
no work from project.

The server status page is not reporting problems, except for the "Workunits waiting for assimilation" numbers. And the cricket graph is all over the place.
ID: 968039 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 968269 - Posted: 5 Feb 2010, 14:00:20 UTC

Oh dear - the server status page hasn't updated since 09:20 UTC and the cricket graph has taken a dive...

F.
ID: 968269 · Report as offensive
Matthew S. McCleary
Joined: 9 Sep 99
Posts: 121
Credit: 2,288,242
RAC: 0
United States
Message 968275 - Posted: 5 Feb 2010, 15:10:46 UTC

"Message from server: (Project has no jobs available)" on several of my crunchers. Server status page hasn't been updated in six hours, but reports 51,000 multibeam available.
ID: 968275 · Report as offensive
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 968279 - Posted: 5 Feb 2010, 15:29:07 UTC - in response to Message 968269.  

Oh dear - the server status page hasn't updated since 09:20 UTC and the cricket graph has taken a dive...

F.

The cricket is not chirping much......meow.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 968279 · Report as offensive
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 968307 - Posted: 5 Feb 2010, 17:21:32 UTC

I'd guess it's related to the issue Matt posted in the Tech News last night/this morning. Seems the science DB is being a bit fussy.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 968307 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 968339 - Posted: 5 Feb 2010, 18:58:01 UTC - in response to Message 968275.  
Last modified: 5 Feb 2010, 18:58:17 UTC

"Message from server: (Project has no jobs available)" on several of my crunchers.

At least there are some generous people out there. Somebody detached a 32-core SUN SPARC-Enterprise just as the WUs ran out, and donated 400 tasks to the common good.
ID: 968339 · Report as offensive
Matthew S. McCleary
Joined: 9 Sep 99
Posts: 121
Credit: 2,288,242
RAC: 0
United States
Message 968342 - Posted: 5 Feb 2010, 19:04:31 UTC - in response to Message 968339.  

"Message from server: (Project has no jobs available)" on several of my crunchers.

At least there are some generous people out there. Somebody detached a 32-core SUN SPARC-Enterprise just as the WUs ran out, and donated 400 tasks to the common good.


Man, that guy has two 64-CPU, two 32-CPU, and two 8-CPU Suns. Wish I had that kind of hardware to monkey with.
ID: 968342 · Report as offensive
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 968344 - Posted: 5 Feb 2010, 19:11:08 UTC - in response to Message 968339.  

"Message from server: (Project has no jobs available)" on several of my crunchers.

At least there are some generous people out there. Somebody detached a 32-core SUN SPARC-Enterprise just as the WUs ran out, and donated 400 tasks to the common good.

They may not be out of work, or tapes. Could just be that the process feeding the feeder has stopped. Without the server status pages updating, it's unknown what might be going on. Uploads & reporting are going on w/o any problems tho.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 968344 · Report as offensive
Luke
Volunteer developer
Joined: 31 Dec 06
Posts: 2546
Credit: 817,560
RAC: 0
New Zealand
Message 968399 - Posted: 5 Feb 2010, 21:17:50 UTC - in response to Message 968275.  

"Message from server: (Project has no jobs available)" on several of my crunchers. Server status page hasn't been updated in six hours, but reports 51,000 multibeam available.


Same problem here. Uploading & reporting are fine though. Good time to give my cache a purge. I've set NNT on all machines.
Perhaps I'll run my laptop on PrimeGrid for a few days, once I'm out of S@H tasks.
- Luke.
ID: 968399 · Report as offensive
FiveHamlet
Joined: 5 Oct 99
Posts: 783
Credit: 32,638,578
RAC: 0
United Kingdom
Message 968414 - Posted: 5 Feb 2010, 21:50:37 UTC
Last modified: 5 Feb 2010, 21:51:57 UTC

Secondary science database has been disabled and the Server Status page has just got up to date.
Things might start to recover soon.
With some luck and a fair wind.
Now getting a "Project is temp shut down for maintenance" message.
Thanks to the team.

Dave
ID: 968414 · Report as offensive
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.