Daily midnight traffic peaks



Message boards : Number crunching : Daily midnight traffic peaks

archae86
Send message
Joined: 31 Aug 99
Posts: 888
Credit: 1,572,688
RAC: 42
United States
Message 871354 - Posted: 2 Mar 2009, 15:53:20 UTC
Last modified: 2 Mar 2009, 15:54:37 UTC

While I recall sometimes in the past seeing clear network traffic peaks after midnight, I think there has been little of that for some time. But on four of the most recent five days, the network activity graphs have shown a strong peak: traffic quickly rises to near the bandwidth limit, holds for perhaps an hour, then begins a slower but still rather rapid decay back down to the 50 to 70 Mbits/sec range which seems to be the short-term norm lately.

Can it be that enough hosts are still chewing up their 100/CPU daily limit to create this demand spike as they go on to their next day of discards?

Clues on the SETI server status page are also interesting. It shows an AP result creation rate well over half that of MB, though the AP results-received rate is barely a twentieth of it. Taken at face value, this suggests that the very large majority of distributed AP results are erroring out or otherwise not being successfully returned.

Another possibility might be that some of the AP numbers include both 5.00 and 5.03, while others cover only 5.03. My anecdotal observation is that a newly issued 5.00 result is usually for a WU which already has one valid return, so credit is granted immediately on the next return, while a new 5.03 typically has a multi-week wait for its quorum partner.

[edit: bolded the links]
____________

Cosmic_Ocean
Avatar
Send message
Joined: 23 Dec 00
Posts: 2245
Credit: 8,586,711
RAC: 4,199
United States
Message 871359 - Posted: 2 Mar 2009, 16:02:04 UTC

It might not be errors. It is possible that some of the faster hosts (Q9xxx quads, or fast Xeons, with one or more GPUs) are simply crunching more than the quota allows per day. At midnight the quota resets, and those systems fill their caches until they hit the quota again.

I think this has been around for quite a while, even before the first AP release. Of course, with AP's occasional problems and its file size being ~24x larger, the midnight spike becomes amplified.
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

Josef W. Segur (Project donor)
Volunteer developer
Volunteer tester
Send message
Joined: 30 Oct 99
Posts: 4227
Credit: 1,042,553
RAC: 346
United States
Message 871448 - Posted: 2 Mar 2009, 21:37:09 UTC - in response to Message 871354.
Last modified: 2 Mar 2009, 21:39:12 UTC

...
Can it be that enough hosts are still chewing up their 100/CPU daily limit to create this demand spike as they go on to their next day of discards?

Maybe, or maybe it's just hosts with huge queues they're trying to fill. With the current WUs almost all having deadlines of 3 weeks or more, there are few which the Scheduler will consider infeasible for such hosts, and the host work fetch requests are not inhibited due to work being done at high priority.

Clues on the SETI server status page are also interesting. It shows an AP result creation rate well over half that of MB, though the AP results-received rate is barely a twentieth of it. Taken at face value, this suggests that the very large majority of distributed AP results are erroring out or otherwise not being successfully returned.
...

Scarecrow's graphs for AP and Enhanced are showing the average AP result creation rate as 0.625 and the average S@H Enhanced rate as 8.782. Both are operating in the good mode where they build the "Results ready to send" up to a high water mark, then rest for a while [1]. That makes the most recent sample on the Server status page fairly meaningless in isolation.

[1] Even when the splitters are resting, there's some Result creation from reissues.
Joe

1mp0£173
Volunteer tester
Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 871483 - Posted: 2 Mar 2009, 22:43:41 UTC - in response to Message 871354.

Can it be that enough hosts are still chewing up their 100/CPU daily limit to create this demand spike as they go on to their next day of discards?

"Midnight" for the purposes of daily quotas is Midnight, UTC. That's around 4:00pm PST (Berkeley time).

The Cricket graphs look like they're on local time, which is basically UTC -8 hours.

So the peak is roughly 0800 UTC -- which does not line up with the "fresh" quotas at midnight.
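The timezone arithmetic here can be checked directly; a minimal sketch (assuming Berkeley is on PST, UTC-8, with no DST in early March 2009):

```python
from datetime import datetime, timedelta, timezone

# Assumption for illustration: fixed UTC-8 offset (PST); US DST did not
# start until 8 Mar 2009, so it holds for the dates discussed here.
PST = timezone(timedelta(hours=-8))

# Midnight on the Cricket graphs, if they run on local (PST) time:
local_midnight = datetime(2009, 3, 3, 0, 0, tzinfo=PST)
print(local_midnight.astimezone(timezone.utc))  # 2009-03-03 08:00:00+00:00

# Midnight UTC (the quota-reset time, if quotas reset on UTC):
utc_midnight = datetime(2009, 3, 3, 0, 0, tzinfo=timezone.utc)
print(utc_midnight.astimezone(PST))  # 2009-03-02 16:00:00-08:00
```

So a peak at local (PST) midnight shows up as 0800 UTC, while a UTC-midnight quota reset would land at 4:00pm the previous afternoon, Berkeley time.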

____________

archae86
Send message
Joined: 31 Aug 99
Posts: 888
Credit: 1,572,688
RAC: 42
United States
Message 871485 - Posted: 2 Mar 2009, 22:46:08 UTC - in response to Message 871448.

Scarecrow's graphs for AP and Enhanced are showing the average AP result creation rate as 0.625 and the average S@H Enhanced rate as 8.782. Both are operating in the good mode where they build the "Results ready to send" up to a high water mark, then rest for a while [1]. That makes the most recent sample on the Server status page fairly meaningless in isolation.
Joe
Ah, I missed that completely.

On the other hand, clicking through a small sample of actual newly downloaded AP 5.03 work does seem to show that many of the quite common _2, _3, _4 results are resends of work sent today which drew a prompt client error/compute error with 0.00 CPU time, on hosts which are doing this repeatedly.

In one case, clicking down to the task level gets this:

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
CreateProcess() failed - A required privilege is not held by the client. (0x522)
</message>
]]>


in others this:
<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
too many exit(0)s
</message>
]]>


in others this:
<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
CreateProcess() failed - Access is denied. (0x5)
</message>
]]>


in others (actually these logged client error/downloading repeatedly, and obviously the specific file reference varies):
<core_client_version>6.2.15</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>ap_06dc08ah_B1_P0_00065_20090213_06338.wu</file_name>
<error_code>-119</error_code>
<error_message>MD5 check failed</error_message>
</file_xfer_error>

</message>
]]>


All of these came from failed quorum partners of just one host, for downloads in the last day, and in most if not all cases the offending host had failed at least a whole page of tasks.

Certainly some of this goes on all the time, but, despite my error on the creation-rate evidence, I think it is rather more prevalent on AP 5.03 at the moment than the historic norm.

Many of them, however, were not managing anywhere near 100*ncpu failures a day, more like 2, most likely because they had failed enough in the past to get down to the 1/day/CPU limit.
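The "MD5 check failed" error quoted above is the BOINC client comparing the downloaded file's digest against the one the scheduler advertised; a minimal sketch of that kind of check (the function name is mine, and the file name and digest in the comment are purely illustrative):

```python
import hashlib

def md5_matches(path: str, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's MD5 digest equals expected_hex (case-insensitive)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large WU files don't have to fit in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == expected_hex.lower()

# Hypothetical usage; real WU names and expected digests come from the
# scheduler reply, e.g.:
# md5_matches("ap_06dc08ah_B1_P0_00065_20090213_06338.wu", "d41d8cd9...")
```

A mismatch (error -119 above) usually means the file was corrupted in transit or on disk, so the client discards it and reports a download error.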
____________

Richard Haselgrove (Project donor)
Volunteer tester
Send message
Joined: 4 Jul 99
Posts: 8461
Credit: 48,823,310
RAC: 81,103
United Kingdom
Message 871487 - Posted: 2 Mar 2009, 22:54:33 UTC - in response to Message 871483.

Can it be that enough hosts are still chewing up their 100/CPU daily limit to create this demand spike as they go on to their next day of discards?

"Midnight" for the purposes of daily quotas is Midnight, UTC. That's around 4:00pm PST (Berkeley time).

The Cricket graphs look like they're on local time, which is basically UTC -8 hours.

So the peak is roughly 0800 UTC -- which does not line up with the "fresh" quotas at midnight.

I'm pretty sure that 'quota' resets at midnight, local time (@ server) - which in SETI's case means PST, in line with the Cricket graphs.

Not that I'm about to trash 99 WUs to test that theory....

Geek@Play (Project donor)
Volunteer tester
Avatar
Send message
Joined: 31 Jul 01
Posts: 2466
Credit: 85,674,440
RAC: 26,101
United States
Message 871499 - Posted: 2 Mar 2009, 23:04:51 UTC

I agree, quota resets are the most likely cause, due to all the trashed work caches and people trying to get that app_info file correct.
____________
Boinc....Boinc....Boinc....Boinc....

msattler (Project donor)
Volunteer tester
Avatar
Send message
Joined: 9 Jul 00
Posts: 38870
Credit: 577,618,846
RAC: 524,347
United States
Message 871660 - Posted: 3 Mar 2009, 8:15:41 UTC - in response to Message 871487.

Can it be that enough hosts are still chewing up their 100/CPU daily limit to create this demand spike as they go on to their next day of discards?

"Midnight" for the purposes of daily quotas is Midnight, UTC. That's around 4:00pm PST (Berkeley time).

The Cricket graphs look like they're on local time, which is basically UTC -8 hours.

So the peak is roughly 0800 UTC -- which does not line up with the "fresh" quotas at midnight.

I'm pretty sure that 'quota' resets at midnight, local time (@ server) - which in SETI's case means PST, in line with the Cricket graphs.

Not that I'm about to trash 99 WUs to test that theory....

I am pretty certain that the Cricket Graphs are running on Berkeley time.....so the spikes occur at server midnight, concurrent with the reset of the daily quotas.
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

Rudy
Volunteer tester
Send message
Joined: 23 Jun 99
Posts: 189
Credit: 565,196
RAC: 51
Canada
Message 871755 - Posted: 3 Mar 2009, 15:59:51 UTC
Last modified: 3 Mar 2009, 16:00:16 UTC

Looking at the Scarecrow graphs for results returned (7-day trends), it looks like most of the traffic spikes are coming from AP. At midnight, the AP results-returned rate spikes up and the crunch time spikes down.

Could there be a large number of crunchers with corrupt AP applications or app_info files still out there?

PhonAcq
Send message
Joined: 14 Apr 01
Posts: 1622
Credit: 22,099,274
RAC: 4,120
United States
Message 871770 - Posted: 3 Mar 2009, 16:31:58 UTC - in response to Message 871448.

...
Can it be that enough hosts are still chewing up their 100/CPU daily limit to create this demand spike as they go on to their next day of discards?

Maybe, or maybe it's just hosts with huge queues they're trying to fill. With the current WUs almost all having deadlines of 3 weeks or more, there are few which the Scheduler will consider infeasible for such hosts, and the host work fetch requests are not inhibited due to work being done at high priority.

Joe


Where can I find a description of how the deadlines are computed? There seems to be a lot of noise on the message boards (not surprising!). Thx.

Richard Haselgrove (Project donor)
Volunteer tester
Send message
Joined: 4 Jul 99
Posts: 8461
Credit: 48,823,310
RAC: 81,103
United Kingdom
Message 871774 - Posted: 3 Mar 2009, 16:41:55 UTC - in response to Message 871770.

...
Can it be that enough hosts are still chewing up their 100/CPU daily limit to create this demand spike as they go on to their next day of discards?

Maybe, or maybe it's just hosts with huge queues they're trying to fill. With the current WUs almost all having deadlines of 3 weeks or more, there are few which the Scheduler will consider infeasible for such hosts, and the host work fetch requests are not inhibited due to work being done at high priority.

Joe


Where can I find a description of how the deadlines are computed? There seems to be a lot of noise on the message boards (not surprising!). Thx.

There's a table here, in the middle of the thread Estimates and Deadlines revisited, which documents the research Joe initiated in December 2007.

1mp0£173
Volunteer tester
Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 871866 - Posted: 4 Mar 2009, 1:10:52 UTC - in response to Message 871660.
Last modified: 4 Mar 2009, 1:11:28 UTC


I am pretty certain that the Cricket Graphs are running on Berkeley time.....so the spikes occur at server midnight, concurrent with the reset of the daily quotas.

I'm in the same time zone; it's just after 5:00pm and the Cricket Graph page says "Last updated at Tue Mar 3 17:08:43 2009".

If they were on GMT, it'd be more like Wed Mar 4 01:08:43 2009.

I don't use Cricket, I use MRTG -- same author. MRTG updates every five minutes.
____________

Cosmic_Ocean
Avatar
Send message
Joined: 23 Dec 00
Posts: 2245
Credit: 8,586,711
RAC: 4,199
United States
Message 871903 - Posted: 4 Mar 2009, 2:35:47 UTC

I just noticed something... Mark, I think you must be looking at the wrong router interface for the Cricket graph; it's supposed to be 2_3, not 5_1.
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

Josef W. Segur (Project donor)
Volunteer developer
Volunteer tester
Send message
Joined: 30 Oct 99
Posts: 4227
Credit: 1,042,553
RAC: 346
United States
Message 871944 - Posted: 4 Mar 2009, 5:47:05 UTC - in response to Message 871903.

I just noticed something... Mark, I think you must be looking at the wrong router interface for the Cricket graph; it's supposed to be 2_3, not 5_1.

Either works for the big picture, but 5_1 seems to carry some percentage of traffic other than ours; for instance, it currently shows a 123 Mbits/sec peak reading, which would be a neat trick through the 100 Mbits/sec link.
Joe

msattler (Project donor)
Volunteer tester
Avatar
Send message
Joined: 9 Jul 00
Posts: 38870
Credit: 577,618,846
RAC: 524,347
United States
Message 871959 - Posted: 4 Mar 2009, 8:04:24 UTC - in response to Message 871944.
Last modified: 4 Mar 2009, 8:10:00 UTC

I just noticed something... Mark, I think you must be looking at the wrong router interface for the Cricket graph; it's supposed to be 2_3, not 5_1.

Either works for the big picture, but 5_1 seems to carry some percentage of traffic other than ours; for instance, it currently shows a 123 Mbits/sec peak reading, which would be a neat trick through the 100 Mbits/sec link.
Joe

If I recall correctly, the other difference is that the 2_3 graph is inside out....in other words, what it shows as 'in' is actually coming in from Seti (and maybe some other Berkeley traffic) and then sent out, and what it shows as 'out' is actually inbound from us and out to the Seti servers.
The 5_1 graph actually shows outbound from Seti and inbound from us.
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

Odan
Send message
Joined: 8 May 03
Posts: 91
Credit: 15,331,177
RAC: 0
United Kingdom
Message 872026 - Posted: 4 Mar 2009, 13:59:24 UTC - in response to Message 871959.

If I recall correctly, the other difference is that the 2_3 graph is inside out....in other words, what it shows as 'in' is actually coming in from Seti (and maybe some other Berkeley traffic) and then sent out, and what it shows as 'out' is actually inbound from us and out to the Seti servers.
The 5_1 graph actually shows outbound from Seti and inbound from us.



Yes, you are right; 2_3 is inside out. You will notice that a few days ago when SETI was maxed out trying to send out AP V5 the green "in" graph was maxed out.

Virtual Boss*
Volunteer tester
Avatar
Send message
Joined: 4 May 08
Posts: 417
Credit: 6,178,184
RAC: 140
Australia
Message 872036 - Posted: 4 Mar 2009, 14:38:25 UTC

Could there be a large number of crunchers with corrupt AP applications or app info still out there?


My answer would be YES.

2 out of 6 WUs I got have hosts that are trashing every WU they get.

Hosts 4077778 and 4069312.

Raistmer
Volunteer developer
Volunteer tester
Avatar
Send message
Joined: 16 Jun 01
Posts: 3396
Credit: 46,339,857
RAC: 9,834
Russia
Message 872038 - Posted: 4 Mar 2009, 14:46:05 UTC - in response to Message 872036.
Last modified: 4 Mar 2009, 14:46:20 UTC

Task 1170594009 (WU 417760101) · sent 21 Feb 2009 20:15:02 UTC · reported 1 Mar 2009 20:51:41 UTC · Over, Client error, Compute error · CPU time 464,726.30 s · claimed credit 614.09 · granted ---

So much time wasted...
It seems it would be better if AP were opt-in rather than opt-out, that is, disabled by default... It doesn't work well for set-and-forget hosts... Some user attention is required, at least to [re]install the optimized app. The stock app's crashing looks too devastating indeed.

Rudy
Volunteer tester
Send message
Joined: 23 Jun 99
Posts: 189
Credit: 565,196
RAC: 51
Canada
Message 872041 - Posted: 4 Mar 2009, 14:50:10 UTC - in response to Message 872036.



My answer would be YES

2 out of 6 WUs I got have hosts that are trashing every WU they get.


Very interesting.

Looking at when they are trashing, it seems that the quota reset is every 25 hours from the last contact and not at midnight.

Richard HaselgroveProject donor
Volunteer tester
Send message
Joined: 4 Jul 99
Posts: 8461
Credit: 48,823,310
RAC: 81,103
United Kingdom
Message 872044 - Posted: 4 Mar 2009, 14:56:33 UTC - in response to Message 872041.

My answer would be YES

2 out of 6 WUs I got have hosts that are trashing every WU they get.

very interesting

looking at when they are trashing, it seems that the quota reset is every 25 hours from the last contact and not at midnight

Not quite.

The quota is reset at midnight: but if BOINC is backed off because of quota restrictions, it backs off to a random time in the first hour after the reset. If everyone's client called in for a new quota at exactly 00:00:01, we'd see a much sharper spike than the one we're discussing - BOINC is trying to flatten it out a bit.
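The flattening behaviour described above can be sketched roughly as follows (all names are illustrative; this is not actual BOINC client code):

```python
import random
from datetime import datetime, timedelta

def next_fetch_time(now: datetime) -> datetime:
    """For a host out of quota: defer the next scheduler request to a
    random moment within the first hour after the next midnight reset,
    so all the backed-off hosts don't call in at 00:00:01 at once."""
    midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(seconds=random.uniform(0, 3600))
```

Spreading requests over an hour still produces a visible ramp on the traffic graphs, just not a vertical wall, which matches the spike-then-decay shape described at the top of the thread.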



Copyright © 2014 University of California