Daily midnight traffic peaks
archae86 (Joined: 31 Aug 99, Posts: 909, Credit: 1,582,816, RAC: 0)
While I recall sometimes in the past seeing clear network traffic peaks after midnight, I think there has been little of that for some time. But for four of the most recent five days, the network activity graphs have shown a strong peak: quickly rising to near the bandwidth limit, then, perhaps an hour later, starting a slower but still rather rapid decay back down to the 50 to 70 range which seems to be the short-term norm lately. Can it be that enough hosts are still chewing up their daily 100/CPU limit to create this demand spike as they go on to their next day of discards?

Clues on the SETI server status page are also interesting. It shows the AP result creation rate as well over half that of MB, though the results received rate is barely over a 20th. Taken at face value, this suggests that the very large majority of distributed AP results are erroring out, or otherwise not being successfully returned. Another possibility might be that some of the AP numbers include both 5.00 and 5.03, while others are only 5.03?

My anecdotal observation is that a newly issued 5.00 result is usually for a WU which already has one valid return, so credit is granted immediately on the next return, while a new 5.03 typically has a multi-week wait for its quorum partner.
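A quick back-of-the-envelope check of those ratios, as a Python sketch. The rates below are placeholder values, not the actual server-status figures, and the calculation assumes a steady state in which MB results come back at roughly the rate they are created:

```python
# Placeholder rates, NOT the actual server-status figures: the point is
# the face-value arithmetic described above. Assumes steady state, i.e.
# MB results are returned at roughly the rate they are created.
mb_creation = 10.0                  # MB results/sec (illustrative)
ap_creation = 0.55 * mb_creation    # "well over half that of MB"
ap_received = mb_creation / 20.0    # "barely over a 20th" of MB returns

implied_loss = 1.0 - ap_received / ap_creation
print(f"AP results apparently not returned: {implied_loss:.0%}")  # ~91%
```

Under these placeholder numbers roughly nine in ten AP results would be going missing, which is why the face-value reading looks alarming; Joe's reply below explains why that reading is misleading.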
Cosmic_Ocean (Joined: 23 Dec 00, Posts: 3027, Credit: 13,516,867, RAC: 13)
It might not be errors. It is possible that some of the faster hosts (Q9xxx), or the fast Xeons, with or without a GPU (or with multiple GPUs), are just crunching more than the quota allows per day. At midnight the quota starts over, and those systems fill their caches up until they hit the quota again. I think this has been around for quite a while... even before the first AP release. Of course, with AP's occasional problems, and its file size being ~24x larger, the midnight spike becomes amplified.

Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up)
Josef W. Segur (Joined: 30 Oct 99, Posts: 4504, Credit: 1,414,761, RAC: 0)
... Maybe, or maybe it's just hosts with huge queues they're trying to fill. With the current WUs almost all having deadlines of 3 weeks or more, there are few which the Scheduler will consider infeasible for such hosts, and their work fetch requests are not inhibited by work being done at high priority.

> Clues on the SETI server status page are also interesting. It shows the AP result creation rate as well over half that of MB, though the results received rate is barely over a 20th. Taken at face value, this suggests that the very large majority of distributed AP results are erroring out, or otherwise not being successfully returned.

Scarecrow's graphs for AP and Enhanced show the average AP result creation rate as 0.625 and the average S@H Enhanced rate as 8.782. Both are operating in the good mode where they build the "Results ready to send" up to a high water mark, then rest for a while[1]. That makes the most recent sample on the Server status page fairly meaningless in isolation.

[1] Even when the splitters are resting, there's some result creation from reissues.

Joe
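The "build up to a high water mark, then rest" behaviour Joe describes is a classic hysteresis loop. A minimal sketch of the idea, with invented names and thresholds rather than SETI's actual splitter code:

```python
# Minimal sketch of high/low-water-mark queue control: the splitter
# fills "results ready to send" until a high threshold, then idles
# until the queue drains below a low threshold. Thresholds invented.
HIGH_WATER = 200_000
LOW_WATER = 50_000

def splitter_should_run(ready_to_send: int, currently_running: bool) -> bool:
    if currently_running:
        return ready_to_send < HIGH_WATER   # keep splitting until full
    return ready_to_send < LOW_WATER        # rest until the queue drains

# Because output alternates between bursts and silence, any single
# instantaneous "creation rate" sample says little about the average --
# which is the point Joe makes about the server-status snapshot.
```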
1mp0£173 (Joined: 3 Apr 99, Posts: 8423, Credit: 356,897, RAC: 0)
> Can it be that enough hosts are still chewing up their daily 100/CPU limit to create this demand spike as they go on to their next day of discards?

"Midnight" for the purposes of daily quotas is midnight UTC. That's around 4:00 pm PST (Berkeley time). The Cricket graphs look like they're on local time, which is basically UTC-8. So the peak is at roughly 0800 UTC -- which does not line up with the "fresh" quotas at midnight.
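For reference, the timezone arithmetic here, as a sketch using Python's standard library (PST is treated as a fixed UTC-8 offset, ignoring daylight saving):

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8))  # fixed offset, ignores DST

# Midnight UTC, seen from Berkeley:
midnight_utc = datetime(2009, 3, 4, 0, 0, tzinfo=timezone.utc)
print(midnight_utc.astimezone(PST))           # 2009-03-03 16:00 -- 4:00 pm PST

# Midnight local (server) time, i.e. where the graph peaks sit, in UTC:
midnight_pst = datetime(2009, 3, 4, 0, 0, tzinfo=PST)
print(midnight_pst.astimezone(timezone.utc))  # 2009-03-04 08:00 -- 0800 UTC
```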
archae86 (Joined: 31 Aug 99, Posts: 909, Credit: 1,582,816, RAC: 0)
> Scarecrow's graphs for AP and Enhanced show the average AP result creation rate as 0.625 and the average S@H Enhanced rate as 8.782. Both are operating in the good mode where they build the "Results ready to send" up to a high water mark, then rest for a while. That makes the most recent sample on the Server status page fairly meaningless in isolation.

Ah, I missed that completely.

On the other hand, hand-clicking through a small sample of actual newly downloaded AP 5.03 work does seem to show that many of the quite common _2, _3, _4 results are resends for work sent today which got a prompt "client error/compute error" with 0.00 CPU time, on hosts which are doing this repeatedly.

In one case, clicking down to the task level gets this:

```
<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
CreateProcess() failed - A required privilege is not held by the client. (0x522)
</message>
]]>
```

in others this:

```
<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
too many exit(0)s
</message>
]]>
```

in others this:

```
<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
CreateProcess() failed - Access is denied. (0x5)
</message>
]]>
```

and in others (these actually logged "client error/downloading" repeatedly, and obviously the specific file reference varies):

```
<core_client_version>6.2.15</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
  <file_name>ap_06dc08ah_B1_P0_00065_20090213_06338.wu</file_name>
  <error_code>-119</error_code>
  <error_message>MD5 check failed</error_message>
</file_xfer_error>
</message>
]]>
```

All of these came from failed quorum partners of just one host, for downloads in the last day, and in most if not all cases the offending host had failed at least a whole page of tasks. Certainly some of this goes on all the time, but, despite my error on the creation-rate evidence, I think it is rather more prevalent on AP 5.03 at the moment than the historic norm. Many of these hosts, however, were not managing anywhere near 100*ncpu failures a day -- more like 2, most likely because they had failed enough in the past to get down to the 1/day/CPU limit.
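A hypothetical way to tally failure modes like these across many task pages; the match patterns come from the stderr excerpts above, and the fetching/extracting of the stderr text is left out:

```python
import re
from collections import Counter

# Patterns drawn from the stderr excerpts quoted above. A hypothetical
# helper for counting failure modes across a batch of result pages.
FAILURE_PATTERNS = {
    "privilege not held": re.compile(r"required privilege is not held"),
    "too many exit(0)s":  re.compile(r"too many exit\(0\)s"),
    "access denied":      re.compile(r"Access is denied"),
    "MD5 check failed":   re.compile(r"MD5 check failed"),
}

def classify(stderr_text: str) -> str:
    """Return the first matching failure label, or 'other'."""
    for label, pattern in FAILURE_PATTERNS.items():
        if pattern.search(stderr_text):
            return label
    return "other"

def tally(stderr_texts) -> Counter:
    """Count failure modes over an iterable of stderr texts."""
    return Counter(classify(text) for text in stderr_texts)
```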
Richard Haselgrove (Joined: 4 Jul 99, Posts: 14571, Credit: 200,643,578, RAC: 874)
> Can it be that enough hosts are still chewing up their daily 100/CPU limit to create this demand spike as they go on to their next day of discards?

I'm pretty sure that 'quota' resets at midnight, local time (@ server) -- which in SETI's case means PST, in line with the Cricket graphs. Not that I'm about to trash 99 WUs to test that theory....
Joined: 31 Jul 01, Posts: 2467, Credit: 86,146,931, RAC: 0
I agree, quota resets are the most likely cause, due to all the trashed work caches and people trying to get that app_info file correct.

Boinc....Boinc....Boinc....Boinc....
kittyman (Joined: 9 Jul 00, Posts: 51468, Credit: 1,018,363,574, RAC: 1,004)
> Can it be that enough hosts are still chewing up their daily 100/CPU limit to create this demand spike as they go on to their next day of discards?

I am pretty certain that the Cricket graphs are running on Berkeley time... so the spikes occur at server midnight, concurrent with the reset of the daily quotas.

"Freedom is just Chaos, with better lighting." Alan Dean Foster
Rudy (Joined: 23 Jun 99, Posts: 189, Credit: 794,998, RAC: 0)
Looking at Scarecrow's graphs for results returned (7-day trends), it looks like most of the traffic spikes are coming from AP. The AP results-returned rate spikes up, and the crunch time spikes down, at midnight. Could there be a large number of crunchers with corrupt AP applications or app_info files still out there?
PhonAcq (Joined: 14 Apr 01, Posts: 1656, Credit: 30,658,217, RAC: 1)
... Where can I find a description of how the deadlines are computed? There seems to be a lot of noise on the message boards (not surprising!). Thx.
Richard Haselgrove (Joined: 4 Jul 99, Posts: 14571, Credit: 200,643,578, RAC: 874)
... There's a table here, in the middle of the thread Estimates and Deadlines revisited, which documents the research Joe initiated in December 2007.
1mp0£173 (Joined: 3 Apr 99, Posts: 8423, Credit: 356,897, RAC: 0)
I'm in the same time zone; it's just after 5:00 pm, and the Cricket graph page says "Last updated at Tue Mar 3 17:08:43 2009". If the graphs were on GMT, it'd read more like Wed Mar 4 01:08:43 2009.

I don't use Cricket, I use MRTG -- same author. MRTG updates every five minutes.
Cosmic_Ocean (Joined: 23 Dec 00, Posts: 3027, Credit: 13,516,867, RAC: 13)
I just noticed something... Mark, I think you must be looking at the wrong router interface for the Cricket graph... it's supposed to be 2_3, not 5_1.

Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up)
Josef W. Segur (Joined: 30 Oct 99, Posts: 4504, Credit: 1,414,761, RAC: 0)
> I just noticed something... Mark, I think you must be looking at the wrong router interface for the Cricket graph... it's supposed to be 2_3, not 5_1.

Either works for the big picture, but 5_1 seems to carry some percentage of traffic other than ours; for instance, it currently shows a 123 MBits/sec peak reading, which would be a neat trick through the 100 MBits/sec link.

Joe
kittyman (Joined: 9 Jul 00, Posts: 51468, Credit: 1,018,363,574, RAC: 1,004)
> I just noticed something... Mark, I think you must be looking at the wrong router interface for the Cricket graph... it's supposed to be 2_3, not 5_1.

If I recall correctly, the other difference is that the 2_3 graph is inside out... in other words, what it shows as 'in' is actually coming in from SETI (and maybe some other Berkeley traffic) and then sent out, and what it shows as 'out' is actually inbound from us and out to the SETI servers. The 5_1 graph actually shows outbound from SETI and inbound from us.

"Freedom is just Chaos, with better lighting." Alan Dean Foster
Joined: 8 May 03, Posts: 91, Credit: 15,331,177, RAC: 0
> If I recall correctly, the other difference is that the 2_3 graph is inside out... in other words, what it shows as 'in' is actually coming in from SETI (and maybe some other Berkeley traffic) and then sent out, and what it shows as 'out' is actually inbound from us and out to the SETI servers.

Yes, you are right; 2_3 is inside out. You will notice that a few days ago, when SETI was maxed out trying to send out AP v5, the green "in" graph was maxed out.
Joined: 16 Jun 01, Posts: 6325, Credit: 106,370,077, RAC: 121
Task 1170594009 (workunit 417760101):

Sent | Reported | Server state | Outcome | Client state | CPU time (s) | Claimed credit | Granted credit
---|---|---|---|---|---|---|---
21 Feb 2009 20:15:02 UTC | 1 Mar 2009 20:51:41 UTC | Over | Client error | Compute error | 464,726.30 | 614.09 | ---

So a lot of time wasted... It seems it would be better if AP were opt-in rather than opt-out -- that is, disabled by default... It doesn't work well for set-and-forget hosts... Some user attention is required, at least to [re]install the optimized app. A stock app crash looks too devastating indeed.
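To put that CPU-time figure in perspective, simple arithmetic on the row above:

```python
# CPU time from the failed task above, converted to hours and days.
cpu_seconds = 464_726.30
print(f"{cpu_seconds / 3600:.0f} hours, about {cpu_seconds / 86_400:.1f} days of CPU time lost")
# -> 129 hours, about 5.4 days
```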
Rudy (Joined: 23 Jun 99, Posts: 189, Credit: 794,998, RAC: 0)
Very interesting. Looking at when they are trashing, it seems that the quota reset is every 25 hours from the last contact, and not at midnight.
Richard Haselgrove (Joined: 4 Jul 99, Posts: 14571, Credit: 200,643,578, RAC: 874)
> My answer would be YES

Not quite. The quota is reset at midnight, but if BOINC is backed off because of quota restrictions, it backs off to a random time somewhere in the first hour after the reset. If everyone's client called in for a new quota at exactly 00:00:01, we'd see a much sharper spike than the one we're discussing -- BOINC is trying to flatten it out a bit.
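A minimal sketch of the flattening behaviour Richard describes, with invented function and parameter names; BOINC's real scheduler logic differs in detail:

```python
import random
from datetime import datetime, timedelta

def next_fetch_after_quota_reset(now: datetime) -> datetime:
    """Sketch: back off to a random moment within the first hour
    after the next midnight quota reset, rather than retrying at
    exactly 00:00:01. Names and logic are illustrative only."""
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0)
    jitter = timedelta(seconds=random.uniform(0, 3600))
    return next_midnight + jitter

# Spreading retries over an hour turns a one-second thundering herd
# into a ramp: the peak still rises quickly but is visibly smoothed.
```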