Message boards :
Number crunching :
Problems...
Author | Message |
---|---|
Mr. Kevvy Send message Joined: 15 May 99 Posts: 3807 Credit: 1,114,826,392 RAC: 3,319 |
I'm hoping that this may actually be seen by a bona fide BOINC developer. Maybe I should post in a BOINC forum, but that isn't where the outage happened. So, the most recent outage happened on Feb. 17th. It's now Feb. 20th and I still have 24 workunits that won't upload. I can't get new ones until they do, so my clients are idle. I have seen this happen in perhaps a dozen outages, having run SAH with BOINC (and without) since their days of release. The problem seems to be on the upload server side. The server seems to accept so many upload connections that it has none left to send the required confirmations to the clients. Whatever the mechanism causing it, the results can be seen plainly on the client: uploads partially or even completely finish, but they still time out and then need to be restarted from zero. This saturates the upload server's bandwidth. I sometimes see half or more of the "Upload Pending" workunits with 50%-100% progress bars, so all of that is wasted bandwidth. Multiply that by a few thousand clients constantly hammering the server and it's no wonder that an outage will knock out the uploads for days. The inbound connections and bandwidth are maxed out, but the vast majority of it is being wasted. This is the longest that I have ever seen it last. If BOINC is to be touted as a viable platform for scientific computing, I think that this issue really should be addressed. It makes the problems caused by outages last several times longer than they should. Thank you and flame away. :^p Edit: All of the uploads from the most recent outage seem to be failing at 100%. I've seen perhaps one or two exceptions to the hundreds(?) of failures at 100%. |
Mr. Kevvy Send message Joined: 15 May 99 Posts: 3807 Credit: 1,114,826,392 RAC: 3,319 |
"But the cricket graphs show the bandwidth utilization to be much lower than normal. The issue is IMO something else. A router or a switch, either on the SETI side, or in between the users and SETI." Likely (I don't even know where to find those graphs), but whatever it is, it's been happening since the first outage after BOINC was released. Just retried my pending uploads. Two got to 100% and then timed out, the next two failed, and then the client went into "Project Backoff." |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
"So, the most recent outage happened on Feb. 17th." Actually, the upload problem started on the 15th (before the outage), so the outage/recovery is not implicated in the ongoing problem. Here's the URL you wanted: http://fragment1.berkeley.edu/newcricket/grapher.cgi?target=/router-interfaces/inr-250/gigabitethernet2_3&ranges=d%3Aw&view=Octets Note that this is our viewpoint, not the lab's, so the blue "bits out" trace is the uploads leaving us and heading to the lab; the green block is the downloads leaving the lab and heading in to our machines. |
Mr. Kevvy Send message Joined: 15 May 99 Posts: 3807 Credit: 1,114,826,392 RAC: 3,319 |
Thanks! That will get a bookmark. From TFU: Values at last update: Average bits in (for the day): Cur: 14.92 Mbits/sec, Avg: 31.15 Mbits/sec, Max: 74.39 Mbits/sec. Average bits out (for the day): Cur: 4.53 Mbits/sec, Avg: 5.67 Mbits/sec, Max: 10.20 Mbits/sec. This pretty well shows there's a problem... bits in and out should be roughly equal. But bits in is roughly 3, 5 and 7 times bits out for the current, average and maximum values respectively. There's all the wasted bandwidth from incomplete uploads. |
Mr. Kevvy Send message Joined: 15 May 99 Posts: 3807 Credit: 1,114,826,392 RAC: 3,319 |
"Bits in and out are never equal, because of the vast difference in size of what we download and upload." You're right... I forgot the disparate sizes. IIRC the outbound workunits are, what, about 300KB, but the inbound results are about... 30KB? So bits in should be about one-tenth of bits out. Instead, on average it's four times bits out, or about forty times too large. That equates to about 98% wasted bandwidth, or about one workunit completing for every forty that get to 100% and then need to be retried. Houston, do we have a problem? :^p |
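The estimate in the post above can be checked with a few lines of arithmetic. The following is only a sketch under that post's own assumptions: the 300KB and 30KB file sizes are the poster's recollection, the traffic figures are the daily values quoted from the Cricket graph, and the model assumes one result upload per workunit download with nothing else on the link. The function and variable names are made up for illustration.

```python
# Daily figures quoted from the Cricket graph earlier in the thread (Mbit/s).
BITS_IN = {"cur": 14.92, "avg": 31.15, "max": 74.39}
BITS_OUT = {"cur": 4.53, "avg": 5.67, "max": 10.20}

# Rough file sizes recalled in the post above (assumptions, not measurements).
WORKUNIT_KB = 300.0  # sent from the lab to the volunteer
RESULT_KB = 30.0     # sent from the volunteer back to the lab


def observed_ratio(key: str) -> float:
    """Ratio of 'bits in' to 'bits out' for one column of the graph legend."""
    return BITS_IN[key] / BITS_OUT[key]


def wasted_fraction(in_over_out: float) -> float:
    """Share of 'bits in' that would be retries if all traffic were payload.

    Assumes one result upload per workunit download and no other traffic
    on the link.
    """
    expected = RESULT_KB / WORKUNIT_KB   # 0.1 if nothing were retried
    excess = in_over_out / expected      # how many times too large
    return 1.0 - 1.0 / excess


if __name__ == "__main__":
    for key in ("cur", "avg", "max"):
        print(f"{key}: in/out = {observed_ratio(key):.2f}")
    print(f"waste at 4x: {wasted_fraction(4.0):.1%}")
```

With the four-times figure used in the post, the model gives 97.5%, which is where the "about 98%" comes from. The result is only as good as the assumption that the link carries nothing but workunits and results.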
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
"Bits in and out are never equal, because of the vast difference in size of what we download and upload." It isn't quite as simple as that. The same data link that is being monitored on that router also carries a lot of BOINC administrative overhead traffic - specifically sched_request.xml and sched_reply.xml files. For enthusiastic GPU crunchers, these files dwarf the "data" files you're thinking of - and the biggest ones are inbound to Berkeley, so they show up on the 'upload' side of the balance-sheet. But what we do know, from past experience, is that when the servers are singing the 'download' side can run at 93 - 95 Mbit/sec continuously for days on end, and the upload side can sustain around 25 Mbit/sec - though the line (nominal 100 Mbit/sec) can't do both at the same time. |
Rick Send message Joined: 3 Dec 99 Posts: 79 Credit: 11,486,227 RAC: 0 |
None of us actually knows for sure what is wrong, but here's my 2 cents' worth. I can confirm the observation that started this thread: uploads sit at some % complete (many times 100% complete) but won't actually leave your task list. I don't have any insight into the inner workings of BOINC or SETI, but it would make sense that there are multiple things going on with an upload. It's not simply push the data over the line and you're done. Getting the data there is only half of it. There must be a database update that has to take place to register the fact that you have uploaded a result. If that database update is having issues, then it would make sense that we would see 100% uploads while the task is still sitting there in your task list. The data is sitting on the disk at Berkeley but SETI doesn't know about it. If that upload times out, then eventually you'll go through the whole process again. But what happened to the data you uploaded? If it's not represented in the database, then the backend processes don't know what to do with it. If it becomes orphan data, it may never be cleaned up. Now multiply that by the number of failed uploads and you eventually end up with a full disk. Now the data can't even be uploaded, so the symptom changes to uploads with less than 100% completion. Oddly, Eric reported that a database outage and a full disk were both issues he's had to deal with. The fact that some folks can eventually get some uploads to work may be because there is some cleanup task that runs periodically and drops those orphan uploads. That would free up some disk, so some folks can upload tasks again, but they still run into the database issue, so the symptom goes back to 100% uploaded with the task staying on their list. Then the disk fills up again and the cycle starts all over. That's all just conjecture, but it does seem to be one scenario that would explain some of the symptoms we're seeing. |
kittyman Send message Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004 |
I can explain the cause, but not the symptom. .........by the light of day.... but at night I'm one hell of a lover. Baby. "Time is simply the mechanism that keeps everything from happening all at once." |
Gundolf Jahn Send message Joined: 19 Sep 00 Posts: 3184 Credit: 446,358 RAC: 0 |
"It's not simply push the data over the line and you're done. Getting the data there is only half of it. There must be a database update that has to take place to register the fact that you have uploaded a result." Wrong assumption :-) No database access happens until the task is reported. The upload really is only a copy-file process. Regards, Gundolf. Computers aren't everything in life. (Just kidding) SETI@home classic workunits 3,758 SETI@home classic CPU time 66,520 hours |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
"The upload really is only a copy-file process." But on the sort of industrial-scale servers we're talking about here, even that requires some pretty clever (and fast) file indexing/cataloging to make sure your data doesn't overwrite mine, and that both can be found again when the time comes for validation. My 8-core Vista Pro workstation was noticeably slowed recently by a mere 20K files hidden in the recycle bin: I think Bruno has got the same sort of indigestion, but on a massive scale - remember that it should be keeping track of tens of millions of files at a time, changing at the rate of ten or twenty a second. I very much doubt anyone on this side of the message boards has ever seen a machine capable of doing that, let alone had to manage it. |
Pappa Send message Joined: 9 Jan 00 Posts: 2562 Credit: 12,301,681 RAC: 0 |
We all know there are issues after the overheating of the servers. I have been emailing debug logs to the SETI staff in the hope of sorting things out. That said, what I have been seeing over the last few days is hit or miss. I surmise the "libcurl" bug is rearing its ugly head again. This is also complicated by the fact that I am moving furniture to the wife's work apartment. Don't ask. Regards Please consider a Donation to the Seti Project. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
I've posted in detail at Lunatics: but - 1) Did you get any information in reply? Which sub-systems of the server farm are under suspicion? 2) There are some wild - and actionable - assertions about Hurricane Electric flying around in Technical News. False, so far as I can tell from Wireshark. They need contradicting. 3) Is there anything - social engineering taken for granted - that we can do to help? You've mentioned logs, I've mentioned Wireshark. Is this still a diagnosis thing, or have we reached internal remediation? Status information is becoming an urgent need. |
Pappa Send message Joined: 9 Jan 00 Posts: 2562 Credit: 12,301,681 RAC: 0 |
Richard, I have been sending updated logs to DA and Rom. I even had time to downgrade from 6.10.32 to 6.10.18 to see if that would help (as it was suggested it might). I did get a scheduler request through. Uploads, no joy.... As I think about it in more depth, I am starting to suspect it is "libcurl" on the server side. I have not yet dug into Trac to see when libcurl might have been updated for ATI or other purposes, but something sticks in the back of my head that this is related. Now to prove it. So if you have a valid TCP dump, send it. You know where. From everything I have seen network-wise, it is not the network. DNS resolves, I see server contact, and I see headers sent and received. Then things go into never-never land. Timeouts happen or the contact is broken (which would imply that the server told the client to disconnect). So network communications are fine. The fact that Cricket shows a "few" users are able to send uploads and then get downloads is proof. What most people do not know is what is involved on that side to ensure that the link is solid when you pay for a Gigabit link. They have staff 24/7/365 that SETI would love to be able to afford. Regards Please consider a Donation to the Seti Project. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Yes. I know where. But I don't know what. At the moment, I have a 64KB Wireshark native-format filtered capture covering an entire - complete and now reported, for all I know validated - upload event earlier this afternoon. But can they do Wireshark? Which of the 20-odd alternative formats should I save the file in? They are - I hope - busy on the problem. They need a comms officer to handle questions like that from irritants like me. I think you volunteered [not tonight, I read you concerning the apartment move] but in general, several years ago. Please advise. |
Pappa Send message Joined: 9 Jan 00 Posts: 2562 Credit: 12,301,681 RAC: 0 |
Whatever can be cut and pasted into an email works (export specific lines as CSV if nothing else). While I suspect they have used TCPDump in the past, it would take time to look from remote (the weekend). And that is before knowing what you are looking for (for the users: google "TCP handshake" for a start, then streaming files over the Internet) or isolating a specific instance. As far as me, doing what I do... I do it, and keep doing it. I wish sometimes that I could get more feedback. I have not quit! Many times what I see here tells me things that prompt an educated email. Things then happen. That is all I can do. Wish list: errors with more specific information! Regards Please consider a Donation to the Seti Project. |
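For anyone taking up the CSV suggestion, here is a rough sketch of how such an export could be summarized before mailing it. It assumes the export has Wireshark's usual summary columns (No., Time, Source, Destination, Protocol, Length, Info) and that the TCP flags appear in the Info column in square brackets. The sample rows, addresses and the `/upload` path are invented for illustration and are not from any real capture.

```python
import csv
import io
import re
from collections import Counter

# Matches a flag summary such as "[SYN, ACK]" or "[RST]" in the Info column,
# but not notes like "[TCP Retransmission]".
_FLAG = "(?:SYN|ACK|FIN|RST|PSH|URG)"
FLAGS = re.compile(rf"\[({_FLAG}(?:, {_FLAG})*)\]")


def flag_counts(csv_text: str, server_ip: str) -> Counter:
    """Count TCP flag combinations in packets sent by server_ip."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Source"] != server_ip:
            continue
        match = FLAGS.search(row["Info"])
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample in the assumed column layout: a handshake, a request,
# and then the server ending the conversation.
SAMPLE = '''"No.","Time","Source","Destination","Protocol","Length","Info"
"1","0.000","192.0.2.10","198.51.100.7","TCP","66","50000 > 80 [SYN] Seq=0 Win=8192 Len=0"
"2","0.180","198.51.100.7","192.0.2.10","TCP","66","80 > 50000 [SYN, ACK] Seq=0 Ack=1 Win=5840 Len=0"
"3","0.181","192.0.2.10","198.51.100.7","TCP","54","50000 > 80 [ACK] Seq=1 Ack=1 Win=65536 Len=0"
"4","0.182","192.0.2.10","198.51.100.7","HTTP","512","POST /upload HTTP/1.1"
"5","30.500","198.51.100.7","192.0.2.10","TCP","54","80 > 50000 [RST] Seq=1 Win=0 Len=0"
'''

if __name__ == "__main__":
    print(flag_counts(SAMPLE, "198.51.100.7"))
```

A summary like this shows at a glance whether the handshake completed and how the server side ended each conversation, which is easier to paste into an email than a full capture.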
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Mail sent - you have copy. It's gone 3 am here (I'm on UTC) - if I don't see a response, either here or by mail, before my head stops buzzing, I'm going to crash out and pick up in the morning. See ya. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Managed to get a bit more information via Wireshark. This was the response to a scheduler request (Anakin, 208.68.240.20), reporting two tasks: (Direct link). About ten minutes later, on an automatic retry, the report went through successfully. Both tasks validated: 550828606 is an old resend, which doesn't tell us much, but 572822287 is fresh work split on Friday - two of us have been allocated, downloaded, uploaded and reported. From what I can see, every part of the system is actually working "properly". It's just that one aspect of "properly" is a server-side protective mechanism which drops connections (sends a RST packet) when the server is in some sense 'too busy'. For several days, on two servers (Anakin and Bruno), that protective mechanism has been kicking in far too often, at traffic and loading levels that the two servers can normally handle without turning a hair. If I was in the lab, rather than pontificating from an armchair six thousand miles away, I'd be asking myself two questions: 1) Are Anakin and Bruno actually showing signs of overwork? (CPU loading, temperatures, number of processes running, memory utilisation - that sort of thing) 2) Has some part of the lab complex - not the individual servers, but the wider system like the gigabit LAN to the Network Attached Storage, and the NAS itself - developed throughput or response problems? Nobody has posted any sign of a 'trigger point' yet: no actual breakdown, change in workflow pattern, or anything like that. It just slowed to a crawl on Monday morning. The only other thing I can think of (and this is veering off into complete guesswork) is the increasing amount of pending credit reported here in recent weeks. That represents lots of additional files stored, and stored for longer. Could something like a disk filing system index have grown to reach a tipping point, where it no longer fits in memory and has to spill over onto its own hard disk? |
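The "dropped connection, then success on an automatic retry" pattern described above can be sketched from the client's point of view. This is not BOINC's actual retry policy, which the thread does not describe; the function name, delays and attempt limit are illustrative assumptions. It only shows the general idea of waiting a randomized, growing interval after a reset so that thousands of clients do not retry in lockstep.

```python
import random
import time
from typing import Callable


def upload_with_backoff(
    send: Callable[[], None],
    max_attempts: int = 6,
    base_delay: float = 60.0,
    max_delay: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it succeeds; return the number of attempts used.

    A connection dropped with RST surfaces in Python as ConnectionResetError;
    a stalled transfer may surface as TimeoutError. Either one triggers a
    randomized wait whose upper bound doubles after each failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return attempt
        except (ConnectionResetError, TimeoutError):
            if attempt == max_attempts:
                raise
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0.5 * ceiling, ceiling))
    raise ValueError("max_attempts must be at least 1")


if __name__ == "__main__":
    outcomes = iter([ConnectionResetError(), ConnectionResetError(), None])

    def flaky_send() -> None:
        # Stand-in for a real upload: reset twice, then accepted.
        error = next(outcomes)
        if error is not None:
            raise error

    # Skip the real waiting so the demonstration finishes immediately.
    print(upload_with_backoff(flaky_send, sleep=lambda seconds: None))
```

Backoff of this kind protects the server only if a completed transfer is acknowledged. If uploads reach 100% and are then reset, as reported earlier in the thread, each retry still costs the full transfer.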
HSchmirPo Send message Joined: 17 Jan 06 Posts: 18 Credit: 39,561,621 RAC: 0 |
Hello, reporting from Germany. Over the last few weeks, my pending credit has reached 565,825.16. For hours, the response from SETI has been: 21.02.2010 15:05:09 SETI@home Sending scheduler request: Requested by user. 21.02.2010 15:05:09 SETI@home Reporting 3 completed tasks, not requesting new tasks 21.02.2010 15:05:11 Project communication failed: attempting access to reference site 21.02.2010 15:05:12 Internet access OK - project servers may be temporarily down. 21.02.2010 15:05:14 SETI@home Scheduler request failed: Server returned nothing (no headers, no data) 21.02.2010 15:06:15 SETI@home Fetching scheduler list 21.02.2010 15:06:20 SETI@home Master file download succeeded HSchmirPo ET? It's me! |
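Since the wish list earlier in the thread asks for more specific error information, here is a small sketch of how a pasted client log like the one above could be split into timestamped entries and scanned for failures. It assumes the day.month.year timestamp format shown in that post; the helper names are made up for illustration.

```python
import re
from datetime import datetime

# Timestamps in the pasted log look like "21.02.2010 15:05:09".
STAMP = re.compile(r"(\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2}) ")

# The log from the post above, copied as one run of text.
LOG = (
    "21.02.2010 15:05:09 SETI@home Sending scheduler request: Requested by user. "
    "21.02.2010 15:05:09 SETI@home Reporting 3 completed tasks, not requesting new tasks "
    "21.02.2010 15:05:11 Project communication failed: attempting access to reference site "
    "21.02.2010 15:05:12 Internet access OK - project servers may be temporarily down. "
    "21.02.2010 15:05:14 SETI@home Scheduler request failed: "
    "Server returned nothing (no headers, no data) "
    "21.02.2010 15:06:15 SETI@home Fetching scheduler list "
    "21.02.2010 15:06:20 SETI@home Master file download succeeded"
)


def split_log(text: str) -> list[tuple[datetime, str]]:
    """Split flattened log text into (timestamp, message) pairs."""
    parts = STAMP.split(text)
    # parts[0] is whatever precedes the first timestamp; after that the
    # list alternates between a timestamp and its message.
    return [
        (datetime.strptime(stamp, "%d.%m.%Y %H:%M:%S"), message.strip())
        for stamp, message in zip(parts[1::2], parts[2::2])
    ]


def failures(entries: list[tuple[datetime, str]]) -> list[str]:
    """Return the messages that report a failure."""
    return [message for _, message in entries if "failed" in message]


if __name__ == "__main__":
    for line in failures(split_log(LOG)):
        print(line)
```

On this log the scheduler request is reported as failed five seconds after it was sent, with no headers and no data returned.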