Message boards : Number crunching : Panic Mode On (35) Server problems
Keith Joined: 19 May 99 Posts: 483 Credit: 938,268 RAC: 0
Could this cyclic traffic be the much promised "Science" in action? Keith [Sorry, just noticed someone else already suggested this!! Great minds think alike and so did we!!]
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
BTW, the period looks more like ~5h than 4h (count the number of periods between the 24h mark lines).
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4
Or does the Scheduler/Server up the limit to 30, then drop it later because of dropped connections? Claggy
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Or does the Scheduler/Server up the limit to 30, then drop it later because of dropped connections? It could be checked! Let's ask for work at maxed-bandwidth time. Will a host get an additional 10 (or any >20) tasks or not? I'm afraid the 20-tasks-per-host limit is hardwired now and can be changed only manually by editing some config file.
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
To rule out all variants of the internal-communications theories - which router is pictured? AFAIK all SETI lab servers would be on the same side of the pictured router, no? (I.e., should traffic between SETI servers show up on those graphs or not?) EDIT: meanwhile a new spike is approaching ;D Let's see if the per-host limit changed or not. EDIT2: Nope, my host is still on its 20-task diet. EDIT3: But Joe's correlation is in place again. The splitters just stopped and the spike started to rise...
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4
Or does the Scheduler/Server up the limit to 30, then drop it later because of dropped connections? A data spike is happening now; I've already got 24 Astropulse tasks on my Laptop and the Server wouldn't give me any more. (I know that on the 3rd of July (UTC) the max was 30 tasks, then it went down after that.) Claggy
Grant (SSSF) Joined: 19 Aug 99 Posts: 13746 Credit: 208,696,464 RAC: 304
Anybody have a clue what the bandwidth cycles on the Cricket Graph are all about? I don't think I have ever seen such a well defined pattern before....

My suspicions- The Ready to Send buffer probably drops quite rapidly when bandwidth is maxed out, then continues to drop gradually once the network traffic drops off, until the splitters fire up & top up the buffer; the graphs aren't updated frequently enough to see accurately what's happening.

Extremely wild supposition- The traffic bursts may be related to odd work-request behaviour. I noticed one or two threads where people commented that the client wasn't requesting new work, even though they had fewer than 20 tasks in their cache. After a while it does request work, & that's when you're getting those bursts in network traffic. Lots of clients running their buffer down below 20 Work Units before requesting more, resulting in short bursts of network traffic.

EDIT- Just had a look at the Astropulse graphs. They show a full Ready to Send buffer, with ups & downs similar to MB, but the slope of the waveform is different. MB- buffer fills quickly, drains slowly. AP- buffer fills slowly, but drains quickly. Looks like the spikes could be very much AP-related. Just odd with their fairly consistent frequency. Grant Darwin NT
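[Editor's note: Grant's fill-quickly/drain-slowly reading of the graphs can be sketched as a toy buffer simulation. This is only an illustration; every rate and threshold below is a made-up value, not a measured SETI@home figure.]

```python
# Toy "Ready to Send" buffer with splitter hysteresis: splitters fire up
# when the buffer falls to low_mark and idle again once it reaches high_mark.
def simulate_buffer(steps, fill_rate, drain_rate, low_mark, high_mark):
    level, splitting, history = high_mark, False, []
    for _ in range(steps):
        if level <= low_mark:
            splitting = True          # splitters fire up & top up the buffer
        if level >= high_mark:
            splitting = False         # buffer full, splitters idle
        level += (fill_rate if splitting else 0) - drain_rate
        level = max(level, 0)
        history.append(level)
    return history

# MB-like shape: fills quickly, drains slowly -> short refill bursts
mb = simulate_buffer(200, fill_rate=50, drain_rate=5, low_mark=100, high_mark=500)
# AP-like shape: fills slowly, drains at a similar pace -> long, shallow refills
ap = simulate_buffer(200, fill_rate=8, drain_rate=6, low_mark=100, high_mark=500)
```

Both traces produce the sawtooth Grant describes; only the slopes differ, which is all the MB-vs-AP distinction in his post needs.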
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Or does the Scheduler/Server up the limit to 30, then drop it later because of dropped connections? Wow, how did you do that! ;D I have 20 MB tasks and GPU AP work fetch suppressed...
05/07/2010 14:04:17 SETI@home Requesting new tasks for CPU and GPU
05/07/2010 14:04:32 SETI@home Scheduler request completed: got 0 new tasks
05/07/2010 14:04:32 SETI@home Message from server: No work sent
05/07/2010 14:04:32 SETI@home Message from server: This computer has reached a limit on tasks in progress
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4
Or does the Scheduler/Server up the limit to 30, then drop it later because of dropped connections? It's my Laptop; it has an Astropulse-only app_info, and it doesn't do Seti Cuda, as its GPU is incapable with only 128Mb - that one does Collatz instead. For your situation, I suggest you disable CPU work fetch in your Setiathome preferences; then when you finish a CPU task, it can only get GPU tasks :) Claggy
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4
I think it's the scheduler deciding every ~4 hours to send out Astropulse tasks primarily. The Server Status page's Astropulse Results-ready-to-send figure dropped to 7,800 a bit after 1000 UTC and is now recovering; it's at 8,542 at 1100 UTC, and was 200 less 10 minutes earlier. Claggy
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
I think it's the scheduler deciding every ~4 hours to send out Astropulse tasks primarily. If so, fast AP will kill the bandwidth forever :D
Bill Walker Joined: 4 Sep 99 Posts: 3868 Credit: 2,697,267 RAC: 0
Just my two cents worth, previously posted over in the news thread: This looks like limit cycling, resulting from a separate detection/correction process. Some internal process drives the system to maximum output. We have seen that this causes a complete system crash, so some external method has been created to detect full output for more than x seconds, at which point an external throttle is applied. Apparently the throttle is in place for a fixed time, and when it is removed the system cycles up to its upper limit again. I have seen this in other complex systems. It suggests that the basic problem causing the limit cycling is either incurable, or at least incurable for some period of time. The external detection/correction is a patch (or kludge if you prefer), but a common one in complex systems.
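[Editor's note: the detect-and-throttle loop Bill describes can be written down in a few lines. A minimal sketch only; the ramp rate, detection window, throttle depth, and throttle duration are all invented illustrative numbers.]

```python
# Limit cycle from an external throttle: an internal process ramps output
# to the maximum; an external monitor sees "full output for more than
# detect_after ticks" and applies a fixed-duration throttle, after which
# the system ramps back up - producing a periodic waveform.
def throttle_cycle(steps, max_out=100, detect_after=3, throttle_for=10):
    out, at_max_for, throttled_left, trace = 0, 0, 0, []
    for _ in range(steps):
        if throttled_left > 0:
            throttled_left -= 1
            out = max_out // 4            # external throttle chokes output
            at_max_for = 0
        else:
            out = min(out + 25, max_out)  # internal process drives to max
            if out == max_out:
                at_max_for += 1
                if at_max_for > detect_after:
                    throttled_left = throttle_for  # fixed-time throttle kicks in
                    at_max_for = 0
        trace.append(out)
    return trace

trace = throttle_cycle(60)
```

The trace repeats indefinitely with the same shape and period, which is exactly the "well defined pattern" the thread is puzzling over: the oscillation comes from the patch, not from any periodicity in the underlying demand.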
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
But what could be used as such a limit? When bandwidth is down we can still receive work, AFAIK. So if the external work-request rate stayed the same, we should see no bandwidth drop (or we should not get work when we request it, if that limiter is in place). Currently we don't get work as requested because of the 20-task limit, but how that limit could lead to oscillations - I don't know... [In other words, the limiter should constrain something bandwidth-related. But what is it, if work requests still go through OK?]
Bill Walker Joined: 4 Sep 99 Posts: 3868 Credit: 2,697,267 RAC: 0
But what could be used as such a limit? Now you are getting out of my specialty, but... I have only received one or two WUs at a time since Friday, nowhere near my cache setting or the 20 WU limit. If I've been unlucky enough to only request work during the "low" part of the cycle (which is about 75% of the time), then maybe the "throttle" is actually a less-than-20-WU limit for a fixed period of time. Another possible throttle would be to just ignore a fixed percentage of the work requests, but then I would have expected to see the old "project has no work available" message sometimes, and I haven't seen any of those since the three day outage ended. Or possibly my original theory is totally wrong. But it does look like what I see when somebody applies an external fix while they try to figure out the actual inner workings of the total system.
Miep Joined: 23 Jul 99 Posts: 2412 Credit: 351,996 RAC: 0
Gna. Just as I was finishing I must have pressed the wrong combination of keys and lost my carefully constructed argument. I think we might be seeing the 'standard setup' machines which got synchronised by the extended outage. They ran dry, reported, got work, and now keep reporting every ~4.5h. They get new work, maxing out the bandwidth and depleting ready-to-send, and that causes the splitters to fire up and refill. The 40M background we see is most likely forum readers on higher caches, where the 20 cap pushes them into fetching one by one - spreading more evenly. Carola ------- I'm multilingual - I can misunderstand people in several languages!
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Well, new data that bears on the AP-related theories :) : currently we are on baseline bandwidth, that is, the lowest level of work fetches. There are >9k AP tasks "ready to send". My host is reconfigured (thanks Claggy!) to ask only for GPU AP. It has 19 MB tasks now, so it isn't affected by the 20-task limit and asks for GPU AP work constantly. But it still hasn't got a single AP task (despite the low bandwidth load!). Looks like AP task distribution is really turned off at all times except the bandwidth spikes... Could someone check this more thoroughly? For example, by receiving a fresh (not a resend; _0 or _1 in the task name only) AP task while bandwidth is not maxed?
Bill Walker Joined: 4 Sep 99 Posts: 3868 Credit: 2,697,267 RAC: 0
That (Carola's theory) makes sense. Have the splitters always run only when the pool of WUs ready to send gets low, or is this something introduced with the recent BOINC software changes?
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Gna. Just as I was finishing I must have pressed the wrong combination of keys and lost my carefully constructed argument. Well, it's viable indeed, BUT AFAIK almost all back-off intervals in BOINC are randomized around some mean time. Moreover, the "crowd" hosts definitely have different performance. That is, the extended outage could indeed have been a perfect synchronization event, but after it the synchronization should be lost. That is, over the oscillations we should see the spike blurring, yet it re-creates its shape and length pretty well for ~10 (or more?) iterations already. In the "initial sync event" theory I don't understand why the blurring is absent...
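[Editor's note: Raistmer's blurring argument is easy to check numerically. A sketch under assumed numbers - 1000 hosts, a ~4.5 h mean reporting period, and an invented +/-0.5 h of randomization; none of these are actual BOINC parameters.]

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

def spike_widths(hosts=1000, rounds=10, period=4.5, jitter=0.5):
    """All hosts start synchronized at t=0 (the outage); each round every
    host waits period +/- up to `jitter` hours, independently. Return the
    width of the request 'spike' (latest minus earliest host) per round."""
    t = [0.0] * hosts
    widths = []
    for _ in range(rounds):
        t = [ti + period + random.uniform(-jitter, jitter) for ti in t]
        widths.append(max(t) - min(t))
    return widths

spread = spike_widths()
# the per-round jitter accumulates, so the spike gets wider every few rounds
```

With independent randomization the spike width grows round after round, so after ~10 iterations a sharp spike should indeed have smeared out - which is exactly why Raistmer finds the pure "initial sync event" theory hard to accept.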
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
>20 AP work-fetch attempts done already; not a single AP task received, and not a single "no work" message received either. That is, the project has plenty of work to send, but it's MB work, not AP work. So it looks like we can indeed get AP work only at bandwidth-spike time, and the spike itself is created by AP-only hosts asking for (and receiving) AP work when it becomes available. That is, that "limiter" would be disabling AP task distribution and enabling it only for short time intervals (during which the bandwidth maxes out). Looks pretty logical to me :) [EDIT: it also allows an estimate of the percentage of AP-only configs. When all AP-only hosts have been served, the bandwidth spike should go down, but it remains maxed on each spike. That is, the total spike time up to this moment isn't enough to give all AP-only hosts 20 AP tasks per host. Also, if most of those hosts use the opt AP with a completion time of ~10h, such spikes will never go down, provided the system config remains the same after the next outage.]
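[Editor's note: the EDIT above invites a back-of-envelope check of how many AP-only hosts a single spike could serve. Every number here is an assumed illustrative value - the link speed, spike duration, and AP task size are guesses, and only the 20-task limit comes from the thread.]

```python
# Rough capacity of one bandwidth spike, under assumed numbers.
spike_mbit_s   = 90      # assumed sustained bandwidth during a spike, Mbit/s
spike_hours    = 1.0     # assumed spike duration, hours
ap_task_mb     = 8       # assumed Astropulse work unit size, MB
tasks_per_host = 20      # per-host limit discussed in the thread

bytes_per_spike = spike_mbit_s / 8 * 1e6 * spike_hours * 3600
tasks_per_spike = bytes_per_spike / (ap_task_mb * 1e6)
hosts_served    = tasks_per_spike / tasks_per_host
print(f"~{tasks_per_spike:.0f} AP tasks per spike, "
      f"enough to fill ~{hosts_served:.0f} AP-only hosts")
```

Under these assumptions one spike fills only a few hundred AP-only hosts to their 20-task limit, which is consistent with Raistmer's point: if the AP-only population is larger than that, each spike stays pegged at maximum.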
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
LoL, and just after such a nice theory was written, my host received an AP task :) It's not a resend - a fresh new task with a replication of 2, received at bandwidth low-plateau time... That is, AP work fetch is not blocked... theory crashed ;D
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.