Panic Mode On (28) Server problems

1mp0£173 Volunteer tester Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 967137 - Posted: 31 Jan 2010, 1:10:51 UTC - in response to Message 967134. Maybe I did not state my case clearly....... I do NOT want the client to do a project wide backoff when it feels it necessary. You made your case perfectly clear, and I understood it perfectly. You want all the available bandwidth, and if doing so slows everyone down including you then that's what you really want. The technical term for what you want is a "Denial of Service Attack." ID: 967137 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13893 Credit: 208,696,464 RAC: 304	Message 967145 - Posted: 31 Jan 2010, 1:39:54 UTC - in response to Message 967134. I want it to plonk the servers anytime it needs to get work or report it. I know this is hard on the servers and bandwidth.... But that is what I want. When the servers are in trouble and it takes a couple of DAYS to clear the ready to send buffer, it drives the kitties wild. The reason it takes a couple of days to clear is because of all the attempts to get work or report it. If that didn't happen, what takes 2 days to recover from would only take half a day. It's the continuous retries that cause the problem. Grant Darwin NT ID: 967145 ·

FiveHamlet Send message Joined: 5 Oct 99 Posts: 783 Credit: 32,638,578 RAC: 0	Message 967535 - Posted: 1 Feb 2010, 19:11:27 UTC Looks to me like another shorties storm. Got over 400 on Rig on a Bench and 200 plus on my other main cruncher. Dave ID: 967535 ·

Dave Send message Joined: 29 Mar 02 Posts: 778 Credit: 25,001,396 RAC: 0	Message 967537 - Posted: 1 Feb 2010, 19:25:03 UTC Last modified: 1 Feb 2010, 19:25:27 UTC What about the client "plinking" the server after a random time interval e.g it could be 1 min, could be 30, could be 3 hours? ID: 967537 ·
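
The randomized-interval idea in the post above can be sketched in a few lines. This is only an illustration, not how the BOINC client actually schedules contacts: the function names, the uniform distribution, and the 1 minute to 3 hour bounds (taken from the numbers in the post) are all assumptions.

```python
import random

MIN_DELAY_S = 60          # 1 minute, the lower bound mentioned in the post
MAX_DELAY_S = 3 * 3600    # 3 hours, the upper bound mentioned in the post


def next_contact_delay(rng: random.Random) -> float:
    """Pick a uniformly random wait, in seconds, before the next server contact.

    Hypothetical helper; the uniform distribution is an assumption.
    """
    return rng.uniform(MIN_DELAY_S, MAX_DELAY_S)


def contacts_per_minute(num_clients: int, seed: int = 0) -> list:
    """Count how many clients would retry in each one-minute slot."""
    rng = random.Random(seed)
    buckets = [0] * (MAX_DELAY_S // 60 + 1)
    for _ in range(num_clients):
        buckets[int(next_contact_delay(rng) // 60)] += 1
    return buckets


if __name__ == "__main__":
    buckets = contacts_per_minute(10_000)
    print("busiest minute:", max(buckets), "contacts out of 10000 clients")
```

Spreading the retries over a wide random window means no single minute sees more than a small fraction of the clients, instead of all of them arriving together.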

1mp0£173 Volunteer tester Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 967552 - Posted: 1 Feb 2010, 21:02:18 UTC - in response to Message 967537.

What about the client "plinking" the server after a random time interval e.g it could be 1 min, could be 30, could be 3 hours?

For anyone who is really interested in this, I'd recommend reading RFC-2821 (you can Google for it), because internet E-Mail has all of the same issues. The section on "sending strategies" (section 4.5.4) goes straight to what the BOINC client is trying to do with the BOINC servers.

As a goal, you want the minimum number of connections per second to the server that are required to fully use the available resources. Double that number and everything takes twice as long, but the same number of "things" happens per minute, so staying on the low edge gives some room for bursts, etc.

The project-wide backoff idea comes right out of RFC-2821. It says:

Retries continue until the message is transmitted or the sender gives up; the give-up time generally needs to be at least 4-5 days. The parameters to the retry algorithm MUST be configurable.

A client SHOULD keep a list of hosts it cannot reach and corresponding connection timeouts, rather than just retrying queued mail items.

Experience suggests that failures are typically transient (the target system or its connection has crashed), favoring a policy of two connection attempts in the first hour the message is in the queue, and then backing off to one every two or three hours.

The second paragraph is the interesting one. The idea is that if the receiving mail server can't accept mail right this second, the next message to them in the queue isn't likely to succeed if we send it right now.

(I would not recommend a 4-5 day timeout for BOINC; it does not share that with SMTP.)

If the client backoff was fairly extreme (or took due-date into account so that those uploads for work due in two weeks were on a more leisurely schedule) the load(s) on the SETI@Home servers would not have the peaks and valleys. But it would leave the average cruncher shocked and concerned because they've never seen how their E-Mail is handled, but they can see what the BOINC client is doing to try to send work.

... and it's the same issue, especially at busy sites. Implement some BIG backoffs, and watch everything go through on the second try.

ID: 967552 ·
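
A minimal sketch of the per-host ("project-wide") backoff described in the quoted RFC text might look like the following. It is not the BOINC client's actual code: the class name, method names, and host name are hypothetical, and the 30-minute first retry is my assumption for fitting "two connection attempts in the first hour". The later delays follow the "one every two or three hours" guidance.

```python
import random


class HostBackoff:
    """Keeps one retry timer per host, so every queued item for that host waits together."""

    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.failures = {}   # host -> consecutive failure count
        self.retry_at = {}   # host -> earliest time (seconds) of the next attempt

    def may_contact(self, host, now):
        """True if the host is not currently backed off."""
        return now >= self.retry_at.get(host, 0.0)

    def record_failure(self, host, now):
        """Back off the whole host and return the delay chosen, in seconds."""
        n = self.failures.get(host, 0) + 1
        self.failures[host] = n
        if n == 1:
            delay = 30 * 60                                # second attempt inside the first hour
        else:
            delay = self.rng.uniform(2 * 3600, 3 * 3600)   # then one try every 2-3 hours
        self.retry_at[host] = now + delay
        return delay

    def record_success(self, host):
        """Clear the backoff once the host answers again."""
        self.failures.pop(host, None)
        self.retry_at.pop(host, None)


if __name__ == "__main__":
    backoff = HostBackoff(random.Random(0))
    host = "upload.example.org"   # hypothetical host name
    backoff.record_failure(host, now=0.0)
    print(backoff.may_contact(host, now=600.0))    # False: still backed off
    print(backoff.may_contact(host, now=1800.0))   # True: 30 minutes have passed
```

The key design point is that the timer is keyed by host rather than by upload or work request, so one failed contact silences every pending item for that server until the timer expires.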

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51515 Credit: 1,018,363,574 RAC: 1,004	Message 967879 - Posted: 3 Feb 2010, 18:03:27 UTC Not really a panic, but the Cricket graphs seem to have gone wonky. "Time is simply the mechanism that keeps everything from happening all at once." ID: 967879 ·

HAL9000 Volunteer tester Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57	Message 967890 - Posted: 3 Feb 2010, 19:04:19 UTC - in response to Message 967879. Not really a panic, but the Cricket graphs seem to have gone wonky. Someone mentioned that in the tech news post earlier. I was thinking it looked like the update job was still going, but just not collecting data. I've realized that when that normally happens the graph will stay at whatever level it was when it last updated. The current graph shows mostly nothing, literally "Cur: nan bits/sec". Tho that could just be how cricket responds to not getting new data. I use MRTG instead of cricket. So I don't really know its ins and outs. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu ID: 967890 ·

Keith T. Volunteer tester Send message Joined: 23 Aug 99 Posts: 962 Credit: 537,293 RAC: 9	Message 967926 - Posted: 3 Feb 2010, 20:36:54 UTC - in response to Message 967890. It's not just SETI's cricket graph that is down, I tried looking at a few other Berkeley routers and they are all missing data for the same period. ID: 967926 ·

52 Aces Send message Joined: 7 Jan 02 Posts: 497 Credit: 14,261,068 RAC: 67	Message 967941 - Posted: 3 Feb 2010, 22:00:44 UTC Last modified: 3 Feb 2010, 22:01:19 UTC
2/3/2010 1:57:12 PM SETI@home Requesting new tasks for CPU
2/3/2010 1:57:17 PM SETI@home Scheduler request completed: got 0 new tasks
2/3/2010 1:57:17 PM SETI@home Message from server: (Project has no jobs available)
Something just sucked away the spare inventory. It dropped fast. Maybe Higley school district is back online :-) Good news is lots of 'tapes' still being split. ID: 967941 ·

Link Send message Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0	Message 967944 - Posted: 3 Feb 2010, 22:24:55 UTC - in response to Message 967941. Last modified: 3 Feb 2010, 22:27:01 UTC Good news is lots of 'tapes' still being split. I wouldn't be sure about it. Current result creation rate: 3.3043/sec. Those are probably just resends. EDIT: Can have something to do with "Workunits waiting for assimilation: 321,084". If they are not getting assimilated, they cannot be deleted -> no disc space. AFAIR we had that at least once. ID: 967944 ·

W-K 666 Volunteer tester Send message Joined: 18 May 99 Posts: 19518 Credit: 40,757,560 RAC: 67	Message 968039 - Posted: 4 Feb 2010, 7:54:20 UTC - in response to Message 967944. Good news is lots of 'tapes' still being split. I wouldn't be sure about it. Current result creation rate: 3.3043/sec. Those are probably just resends. EDIT: Can have something to do with "Workunits waiting for assimilation: 321,084". If they are not getting assimilated, they cannot be deleted -> no disc space. AFAIR we had that at least once. I think you are correct. For the last few hours, since 05:45 UTC, the only msg I get when requesting work is: no work from project. The server status page is not reporting problems, except the "Workunits waiting for assimilation" numbers. And the cricket graph is all over the place. ID: 968039 ·

Fred W Volunteer tester Send message Joined: 13 Jun 99 Posts: 2524 Credit: 11,954,210 RAC: 0	Message 968269 - Posted: 5 Feb 2010, 14:00:20 UTC Oh dear - the server status page hasn't updated since 09:20 UTC and the cricket graph has taken a dive... F. ID: 968269 ·

Matthew S. McCleary Send message Joined: 9 Sep 99 Posts: 121 Credit: 2,288,242 RAC: 0	Message 968275 - Posted: 5 Feb 2010, 15:10:46 UTC "Message from server: (Project has no jobs available)" on several of my crunchers. Server status page hasn't been updated in six hours, but reports 51,000 multibeam available. ID: 968275 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51515 Credit: 1,018,363,574 RAC: 1,004	Message 968279 - Posted: 5 Feb 2010, 15:29:07 UTC - in response to Message 968269. Oh dear - the server status page hasn't updated since 09:20 UTC and the cricket graph has taken a dive... F. The cricket is not chirping much......meow. "Time is simply the mechanism that keeps everything from happening all at once." ID: 968279 ·

HAL9000 Volunteer tester Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57	Message 968307 - Posted: 5 Feb 2010, 17:21:32 UTC I'd guess it's related to the issue Matt posted in the Tech News last night/this morning. Seems the science DB is being a bit fussy. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu ID: 968307 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874	Message 968339 - Posted: 5 Feb 2010, 18:58:01 UTC - in response to Message 968275. Last modified: 5 Feb 2010, 18:58:17 UTC "Message from server: (Project has no jobs available)" on several of my crunchers. At least there are some generous people out there. Somebody detached a 32-core SUN SPARC-Enterprise just as the WUs ran out, and donated 400 tasks to the common good. ID: 968339 ·

Matthew S. McCleary Send message Joined: 9 Sep 99 Posts: 121 Credit: 2,288,242 RAC: 0	Message 968342 - Posted: 5 Feb 2010, 19:04:31 UTC - in response to Message 968339. "Message from server: (Project has no jobs available)" on several of my crunchers. At least there are some generous people out there. Somebody detached a 32-core SUN SPARC-Enterprise just as the WUs ran out, and donated 400 tasks to the common good. Man, that guy has two 64-CPU, two 32-CPU, and two 8-CPU Suns. Wish I had that kind of hardware to monkey with. ID: 968342 ·

HAL9000 Volunteer tester Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57	Message 968344 - Posted: 5 Feb 2010, 19:11:08 UTC - in response to Message 968339. "Message from server: (Project has no jobs available)" on several of my crunchers. At least there are some generous people out there. Somebody detached a 32-core SUN SPARC-Enterprise just as the WUs ran out, and donated 400 tasks to the common good. They may not be out of work, or tapes. Could just be the process feeding the feeder has stopped. Without the server status pages updating, it's unknown what might be going on. Uploads & reporting are going on w/o any problems tho. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu ID: 968344 ·

Luke Volunteer developer Send message Joined: 31 Dec 06 Posts: 2546 Credit: 817,560 RAC: 0	Message 968399 - Posted: 5 Feb 2010, 21:17:50 UTC - in response to Message 968275. "Message from server: (Project has no jobs available)" on several of my crunchers. Server status page hasn't been updated in six hours, but reports 51,000 multibeam available. Same problem here. uploading & reporting are fine though. Good time to give my cache a purge. I've set NNT on all machines. Perhaps I'll run my laptop on PrimeGrid for a few days, once I'm out of S@H tasks. - Luke. ID: 968399 ·

FiveHamlet Send message Joined: 5 Oct 99 Posts: 783 Credit: 32,638,578 RAC: 0	Message 968414 - Posted: 5 Feb 2010, 21:50:37 UTC Last modified: 5 Feb 2010, 21:51:57 UTC Secondary science database has been disabled and the Server Status page has just got up to date. Things might start to recover soon, with some luck and a fair wind. Now getting the "Project is temporarily shut down for maintenance" message. Thanks to the team. Dave ID: 968414 ·
