Panic Mode On (35) Server problems

Profile Keith

Joined: 19 May 99
Posts: 483
Credit: 938,268
RAC: 0
United Kingdom
Message 1012039 - Posted: 5 Jul 2010, 9:46:06 UTC
Last modified: 5 Jul 2010, 9:54:18 UTC

Could this cyclic traffic be the much promised "Science" in action?

Keith

[Sorry, just noticed someone else already suggested this!!
Great minds think alike and so did we!!]
ID: 1012039
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1012040 - Posted: 5 Jul 2010, 9:46:12 UTC
Last modified: 5 Jul 2010, 9:49:25 UTC

BTW, the period looks more like ~5h than 4h (count the number of cycles between the 24h mark lines).
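
A quick sanity check on that reading; the cycle counts below are just examples, not measurements:

# Period from the Cricket graph: count whole cycles between two
# 24-hour grid lines (cycle counts here are illustrative).
hours = 24
for cycles in (5, 6):
    print(f"{cycles} cycles per {hours}h -> period ~{hours / cycles:.1f}h")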
ID: 1012040
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1012041 - Posted: 5 Jul 2010, 9:47:38 UTC

Or did the Scheduler/Server up the limit to 30, then drop it later because of dropped connections?

Claggy
ID: 1012041
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1012043 - Posted: 5 Jul 2010, 9:51:30 UTC - in response to Message 1012041.  
Last modified: 5 Jul 2010, 9:52:21 UTC

Or did the Scheduler/Server up the limit to 30, then drop it later because of dropped connections?

Claggy

It could be checked! Let's ask for work at maxed-bandwidth time. Will the host get an additional 10 (or any >20) tasks or not? I'm afraid the 20-tasks-per-host limit is hardwired now and can be changed only manually by editing some config file.
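
A minimal sketch of how this could be checked from the client side, by scanning a saved copy of the BOINC message log for scheduler replies (the log lines follow the format quoted later in this thread; the file name "messages.txt" is an assumption):

import re

# Scan a saved BOINC message log for SETI@home scheduler replies and
# tally how many tasks each request was granted, plus how often the
# per-host limit message appeared. "messages.txt" is hypothetical.
got = re.compile(r"Scheduler request completed: got (\d+) new tasks")
limit_msg = "reached a limit on tasks in progress"

grants, limit_hits = [], 0
with open("messages.txt") as log:
    for line in log:
        m = got.search(line)
        if m:
            grants.append(int(m.group(1)))
        elif limit_msg in line:
            limit_hits += 1

print("tasks granted per request:", grants)
print("'limit on tasks in progress' replies:", limit_hits)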
ID: 1012043
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1012045 - Posted: 5 Jul 2010, 9:56:21 UTC
Last modified: 5 Jul 2010, 10:02:29 UTC

To rule out all variants of the internal-communications theories - which router is pictured? AFAIK all SETI lab servers would be on the same side of the pictured router, no? (I.e. should traffic between SETI servers show up on those graphs or not?)

EDIT: meantime a new spike is approaching ;D Let's see if the per-host limit changed or not

EDIT2: Nope, my host is still on a 20-task diet

EDIT3: But Joe's correlation is in place again. The splitters just stopped and the spike started to rise...
ID: 1012045
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1012046 - Posted: 5 Jul 2010, 10:01:05 UTC - in response to Message 1012043.  
Last modified: 5 Jul 2010, 10:01:46 UTC

Or did the Scheduler/Server up the limit to 30, then drop it later because of dropped connections?

Claggy

It could be checked! Let's ask for work at maxed-bandwidth time. Will the host get an additional 10 (or any >20) tasks or not? I'm afraid the 20-tasks-per-host limit is hardwired now and can be changed only manually by editing some config file.

Data spike is happening now; I've already got 24 Astropulse tasks on my laptop, and the server wouldn't give me any more
(I know that on 3rd July (UTC) the max was 30 tasks, then it went down after that)

Claggy
ID: 1012046
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13746
Credit: 208,696,464
RAC: 304
Australia
Message 1012047 - Posted: 5 Jul 2010, 10:01:58 UTC - in response to Message 1012031.  
Last modified: 5 Jul 2010, 10:13:19 UTC

Anybody have a clue what the bandwidth cycles on the Cricket Graph are all about? I don't think I have ever seen such a well-defined pattern before...

There appears to be a correlation with when the splitters are boosting the "Results ready to send" and when they're idle. Compare Scarecrow's graphs, though it's hard to really match the time scales. It's a case where sampling the server status once an hour isn't quite enough to pin down the relationship, but my guess is the high-rate download bursts are occurring just after the splitters have stopped for a while. Or it could just be coincidence...
                                                                 Joe

And "ready to send" doesn't drop to zero when splitters are idle, but bandwidth load drops hugely nevertheless.


My suspicions-
The Ready to Send buffer probably drops quite rapidly when bandwidth is maxed out, then continues to drop gradually once the network traffic drops off, till such time as the splitters fire up & top up the buffer; the graphs aren't updated frequently enough to see accurately what's happening.
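
A toy model of that suspicion, with every rate invented for illustration: bursts drain the ready-to-send buffer quickly, quiet periods drain it slowly, and the splitters refill only once it falls below a threshold, giving a sawtooth that hourly snapshots would barely resolve:

# Toy ready-to-send buffer (all numbers invented for illustration):
# bursts drain it fast, quiet periods drain it slowly, and the
# splitters only run once it falls below a low-water mark.
rts, low_mark, capacity = 300_000, 250_000, 300_000
splitting = False
for minute in range(24 * 60):                # one day in 1-min steps
    burst = (minute % 280) < 40              # ~4.7h cycle, 40-min burst
    rts -= 400 if burst else 60              # drain rates, results/min
    if rts < low_mark:
        splitting = True
    if splitting:
        rts = min(rts + 500, capacity)       # splitter refill rate
        if rts == capacity:
            splitting = False
    if minute % 60 == 0:                     # hourly snapshot, like the status page
        print(f"h{minute // 60:02d}  ready-to-send={rts:6d}  splitters={'on' if splitting else 'off'}")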


Extremely wild supposition-
The traffic bursts may be related to odd work request behaviour.
I noticed one or two threads where people commented about the client not requesting new work, even though they had fewer than 20 tasks in their cache. After a while it does request work, & that's when you're getting those bursts in network traffic. Lots of clients running their buffers down below 20 work units before requesting more, resulting in short bursts of network traffic.


EDIT- just had a look at the Astropulse graphs. They show a full Ready to Send buffer, with ups & downs similar to MB, but the slope of the waveform is different.
MB- buffer fills quickly, drains slowly.
AP- buffer fills slowly, but drains quickly.

Looks like the spikes could be very much AP-related. Just odd with their fairly consistent frequency.
Grant
Darwin NT
ID: 1012047
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1012050 - Posted: 5 Jul 2010, 10:04:38 UTC - in response to Message 1012046.  
Last modified: 5 Jul 2010, 10:05:52 UTC

Or did the Scheduler/Server up the limit to 30, then drop it later because of dropped connections?

Claggy

It could be checked! Let's ask for work at maxed-bandwidth time. Will the host get an additional 10 (or any >20) tasks or not? I'm afraid the 20-tasks-per-host limit is hardwired now and can be changed only manually by editing some config file.

Data spike is happening now; I've already got 24 Astropulse tasks on my laptop, and the server wouldn't give me any more
(I know that on 3rd July (UTC) the max was 30 tasks, then it went down after that)

Claggy

Wow, how did you do that! ;D I have 20 MB tasks and GPU AP work fetch suppressed ...
05/07/2010 14:04:17 SETI@home Requesting new tasks for CPU and GPU
05/07/2010 14:04:32 SETI@home Scheduler request completed: got 0 new tasks
05/07/2010 14:04:32 SETI@home Message from server: No work sent
05/07/2010 14:04:32 SETI@home Message from server: This computer has reached a limit on tasks in progress
ID: 1012050
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1012053 - Posted: 5 Jul 2010, 10:12:36 UTC - in response to Message 1012050.  
Last modified: 5 Jul 2010, 10:54:43 UTC

Or did the Scheduler/Server up the limit to 30, then drop it later because of dropped connections?

Claggy

It could be checked! Let's ask for work at maxed-bandwidth time. Will the host get an additional 10 (or any >20) tasks or not? I'm afraid the 20-tasks-per-host limit is hardwired now and can be changed only manually by editing some config file.

Data spike is happening now; I've already got 24 Astropulse tasks on my laptop, and the server wouldn't give me any more
(I know that on 3rd July (UTC) the max was 30 tasks, then it went down after that)

Claggy

Wow, how did you do that! ;D I have 20 MB tasks and GPU AP work fetch suppressed ...
05/07/2010 14:04:17 SETI@home Requesting new tasks for CPU and GPU
05/07/2010 14:04:32 SETI@home Scheduler request completed: got 0 new tasks
05/07/2010 14:04:32 SETI@home Message from server: No work sent
05/07/2010 14:04:32 SETI@home Message from server: This computer has reached a limit on tasks in progress


It's my laptop; it has an Astropulse-only app_info, and it doesn't do SETI CUDA, as its GPU is incapable with only 128MB - it does Collatz instead.

For your situation, I suggest you disable CPU work fetch in your SETI@home preferences; then when you finish a CPU task, the host can only get GPU tasks. :)

Claggy
ID: 1012053
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1012063 - Posted: 5 Jul 2010, 11:09:19 UTC - in response to Message 1012035.  

I think it's the scheduler deciding every ~4 hours to send out Astropulse tasks primarily.

The Server Status page's Astropulse "Results ready to send" figure dropped to 7,800 a bit after 10:00 UTC and is now recovering; it's at 8,542 at 11:00 UTC, and was 200 less 10 minutes earlier.
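
Taking the 200-in-10-minutes reading at face value, the implied refill rate is roughly:

# Refill rate implied by the figures above.
delta_results, minutes = 200, 10
rate = delta_results / minutes
print(f"~{rate:.0f} results/min, i.e. ~{rate * 60:.0f} results/hour")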

Claggy
ID: 1012063
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1012075 - Posted: 5 Jul 2010, 11:40:09 UTC - in response to Message 1012063.  

I think it's the scheduler deciding every ~4 hours to send out Astropulse tasks primarily.

The Server Status page's Astropulse "Results ready to send" figure dropped to 7,800 a bit after 10:00 UTC and is now recovering; it's at 8,542 at 11:00 UTC, and was 200 less 10 minutes earlier.

Claggy

If so, fast AP will kill the bandwidth forever :D
ID: 1012075
Profile Bill Walker
Joined: 4 Sep 99
Posts: 3868
Credit: 2,697,267
RAC: 0
Canada
Message 1012076 - Posted: 5 Jul 2010, 11:42:33 UTC

Just my two cents worth, previously posted over in the news thread:

This looks like limit cycling resulting from a separate detection/correction process. Some internal process drives the system to maximum output. We have seen that this causes a complete system crash, so some external method has been created to detect full output for more than x seconds, at which point an external throttle is applied. Apparently the throttle is in place for a fixed time, and when it is removed the system cycles up to its upper limit again.
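
A minimal sketch of that kind of detect-and-throttle loop, with every constant invented for illustration, showing how it settles into a steady oscillation rather than a steady state:

# Bang-bang throttle sketch (all constants invented): demand ramps
# output to its ceiling; once output has sat at the ceiling for more
# than x seconds, an external throttle clamps it for a fixed hold
# time, after which the system ramps straight back up - a limit cycle.
ceiling, x_secs, hold_secs = 100, 30, 120
output, at_max_for, throttled_for = 0, 0, 0
for t in range(0, 600, 10):                  # 10-second steps
    if throttled_for > 0:
        output, throttled_for = 20, throttled_for - 10
    else:
        output = min(output + 25, ceiling)   # demand pushes output up
        at_max_for = at_max_for + 10 if output == ceiling else 0
        if at_max_for > x_secs:
            throttled_for, at_max_for = hold_secs, 0
    print(f"t={t:3d}s output={output:3d}{' (throttled)' if throttled_for else ''}")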

I have seen this in other complex systems. It suggests that the basic problem causing the limit cycling is either incurable, or at least incurable for some period of time. The external detection/correction is a patch (or kludge if you prefer), but a common one in complex systems.


ID: 1012076
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1012080 - Posted: 5 Jul 2010, 11:52:35 UTC - in response to Message 1012076.  
Last modified: 5 Jul 2010, 11:54:25 UTC

But what could be used as such a limit?
When bandwidth is down we can still receive work, AFAIK. So, if the external work-request rate stayed the same, we should see no bandwidth drop (or we should not get work when requested, if that limiter were in place). Currently we don't get work as requested because of the 20-task limit, but how that limit could lead to oscillations - I don't know...

[In other words, the limiter should constrain something bandwidth-related. But what is it, if work requests still go through OK?]
ID: 1012080
Profile Bill Walker
Joined: 4 Sep 99
Posts: 3868
Credit: 2,697,267
RAC: 0
Canada
Message 1012086 - Posted: 5 Jul 2010, 12:01:26 UTC - in response to Message 1012080.  

But what could be used as such a limit?
When bandwidth is down we can still receive work, AFAIK. So, if the external work-request rate stayed the same, we should see no bandwidth drop (or we should not get work when requested, if that limiter were in place). Currently we don't get work as requested because of the 20-task limit, but how that limit could lead to oscillations - I don't know...

[In other words, the limiter should constrain something bandwidth-related. But what is it, if work requests still go through OK?]


Now you are getting out of my specialty, but ... I have only received one or two WUs at a time since Friday, nowhere near my cache setting or the 20-WU limit. If I've been unlucky enough to only request work during the "low" part of the cycle (which is about 75% of the time), then maybe the "throttle" is actually a less-than-20-WU limit for a fixed period of time.

Another possible throttle would be to just ignore a fixed percentage of the work requests, but then I would have expected to see the old "project has no work available" message sometimes. I haven't seen any of those since the three-day outage ended.

Or possibly my original theory is totally wrong. But it does look like what I see when somebody applies an external fix while they try to figure out the actual inner workings of the total system.

ID: 1012086
Profile Miep
Volunteer moderator
Joined: 23 Jul 99
Posts: 2412
Credit: 351,996
RAC: 0
Message 1012087 - Posted: 5 Jul 2010, 12:01:29 UTC

Gna. Just as I was finishing, I must have pressed the wrong combination of keys and lost my carefully constructed argument.

I think we might be seeing the 'standard setup' machines which got synchronised by the extended outage.
They were dry, reported, got work, and now keep reporting every ~4.5h. They get new work, maxing out the bandwidth and depleting ready-to-send, and that causes the splitters to fire up and refill.

The 40M background we see is most likely forum readers on higher caches, where the 20 cap pushes them into fetching one by one - spreading the load more evenly.
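
A sketch of that picture, with the traffic numbers invented and only the ~4.5h period and ~40M background taken from above: one synchronized cohort re-asking in a lump on top of a steady background:

# Synchronized-cohort sketch (traffic numbers invented): hosts that
# went dry together re-ask for work every ~4.5h in one lump,
# producing spikes on top of a steady background of staggered hosts.
period_h, spike_mbit, background_mbit = 4.5, 55, 40
for half_hour in range(48):                  # one day in 30-min steps
    t = half_hour * 0.5
    in_spike = (t % period_h) < 0.5          # cohort hits a 30-min window
    load = background_mbit + (spike_mbit if in_spike else 0)
    print(f"{t:4.1f}h {'#' * int(load / 5):<20} {load} Mbit/s")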
Carola
-------
I'm multilingual - I can misunderstand people in several languages!
ID: 1012087
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1012088 - Posted: 5 Jul 2010, 12:02:27 UTC
Last modified: 5 Jul 2010, 12:03:16 UTC

Well, new data that bears on the AP-related theories :) :
Currently we are at baseline bandwidth, that is, the lowest rate of work fetches.
There are >9k AP tasks "ready to send".
My host is reconfigured (thanks Claggy!) to ask only for GPU AP.
It has 19 MB tasks now, so it isn't affected by the 20-task limit and asks for GPU AP work constantly.
But it still gets not a single AP task (despite the low bandwidth load!).

Looks like AP task distribution is really turned off at all times except the bandwidth spikes...
Could someone check this more thoroughly? For example, try to receive a fresh (not a resend - _0 or _1 in the task name only) AP task while bandwidth is not maxed?
ID: 1012088
Profile Bill Walker
Joined: 4 Sep 99
Posts: 3868
Credit: 2,697,267
RAC: 0
Canada
Message 1012089 - Posted: 5 Jul 2010, 12:03:17 UTC - in response to Message 1012087.  
Last modified: 5 Jul 2010, 12:05:22 UTC

That (Carola's theory) makes sense. Have the splitters always run only when the pool of WUs ready to send gets low, or is this something introduced with the recent BOINC software changes?

ID: 1012089
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1012091 - Posted: 5 Jul 2010, 12:08:01 UTC - in response to Message 1012087.  
Last modified: 5 Jul 2010, 12:09:12 UTC

Gna. Just as I was finishing, I must have pressed the wrong combination of keys and lost my carefully constructed argument.

I think we might be seeing the 'standard setup' machines which got synchronised by the extended outage.
They were dry, reported, got work, and now keep reporting every ~4.5h. They get new work, maxing out the bandwidth and depleting ready-to-send, and that causes the splitters to fire up and refill.

The 40M background we see is most likely forum readers on higher caches, where the 20 cap pushes them into fetching one by one - spreading the load more evenly.


Well, it's viable indeed, BUT AFAIK almost all backoff intervals in BOINC are randomized around some mean time. Moreover, the "crowd" hosts definitely have different performance.
That is, the extended outage could indeed be a perfect synchronization event, but after it the synchronization should be lost. That is, over the oscillations we should see the spike blurring, yet it has re-created its shape and length pretty well for ~10 (or more?) iterations already. Under the "initial sync event" theory I don't understand why the blurring is absent...
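
A sketch of the blurring argument, with the jitter distribution invented: if each host's ~4.5h interval carries a random backoff component, the spread of arrival times should grow with every iteration, smearing the spike:

import random

# Blurring sketch (jitter distribution invented): each of 1000 hosts
# re-asks after 4.5h plus a random offset. If backoffs really are
# randomized, the spike width grows every iteration - so a spike that
# keeps its shape for ~10 cycles argues against pure initial sync.
random.seed(1)
hosts = [0.0] * 1000                         # all synchronized at t=0
for iteration in range(1, 11):
    hosts = [t + 4.5 + random.uniform(-0.5, 0.5) for t in hosts]
    width = max(hosts) - min(hosts)
    print(f"iteration {iteration:2d}: spike width ~{width:.1f}h")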
ID: 1012091
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1012096 - Posted: 5 Jul 2010, 12:19:01 UTC
Last modified: 5 Jul 2010, 12:23:29 UTC

>20 AP work-fetch attempts done already; not a single AP task received, and not a single "no work" message received either. That is, the project has plenty of work to send, but it's MB work, not AP work.
So it looks like we can get AP work only at bandwidth-spike time indeed, and the spike itself is created by AP-only hosts asking for (and receiving) AP work when it becomes available. That is, that "limiter" would be disabling AP task distribution and enabling it only for short time intervals (and then bandwidth maxes out).
Looks pretty logical to me :)

[EDIT: it also allows us to estimate the percentage of AP-only configs. When all AP-only hosts have been served, the bandwidth spike should go down, but it stays maxed every spike. That is, the total spike time up to this moment isn't enough to give all AP-only hosts 20 AP tasks per host. Also, if most of those hosts run the optimized AP with a completion time of ~10h, such spikes will never go down, provided the system config stays the same after the next outage.
]
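
A back-of-the-envelope version of that estimate; the link speed and spike duration are assumptions, with Astropulse workunits taken as roughly 8 MB:

# Rough capacity of one bandwidth spike (inputs are assumptions):
link_mbit = 95           # assumed spike bandwidth, Mbit/s
spike_hours = 1.0        # assumed spike duration
ap_task_mb = 8           # Astropulse workunit size, ~8 MB
tasks_per_host = 20      # current per-host limit

mb_moved = link_mbit / 8 * 3600 * spike_hours    # MB moved per spike
tasks = mb_moved / ap_task_mb
print(f"~{tasks:.0f} AP tasks per spike, ~{tasks / tasks_per_host:.0f} AP-only hosts filled")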
ID: 1012096
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1012098 - Posted: 5 Jul 2010, 12:28:00 UTC

LoL, and just after such a nice theory was written, my host received an AP task :)
It's not a resend - it's a fresh new task with a replication of 2, received at bandwidth low-plateau time... That is, AP work fetch is not blocked... theory crashed ;D
ID: 1012098