Problems...

Message boards : Number crunching : Problems...
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3799
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 972229 - Posted: 20 Feb 2010, 14:29:11 UTC
Last modified: 20 Feb 2010, 14:55:31 UTC

I'm hoping that this may be actually seen by a bona fide BOINC developer. Maybe I should post in a BOINC forum but that isn't where the outage happened.

So, the most recent outage happened on Feb. 17th. It's now Feb. 20th and I still have 24 workunits that won't upload. I can't get new ones until they do so my clients are idle. I have seen this happen on perhaps a dozen outages as I've been running SAH with BOINC (and without) since their day of release.

The problem seems to be on the upload server side. The server seems to accept so many connections for upload that it has none left to send the required confirmations to the clients. Whatever the mechanism that is causing it, the results can be seen on the client plainly: uploads partially or even completely finish but they still time out and then need to be restarted from zero. This saturates the upload server's bandwidth.

I sometimes see half or more of the "Upload Pending" workunits with 50%-100% progress bars. So all of that is wasted bandwidth. Multiply that by a few thousand clients constantly hammering the server and it's no wonder that an outage will knock out the uploads for days. The inbound connections and bandwidth are maxed but the vast majority of it is being wasted. This is the longest that I have ever seen it happen for.

If BOINC is to be touted as a viable platform for scientific computing I think that this issue really should be addressed. It makes the problems caused by outages last several times longer than they should.

Thank you and flame away. :^p

Edit: All of the uploads seem to be failing at 100% from the most recent outage. I've seen perhaps one or two exceptions to the hundreds(?) of failures at 100%.
ID: 972229 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3799
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 972243 - Posted: 20 Feb 2010, 15:05:11 UTC - in response to Message 972241.  

But the cricket graphs show the bandwidth utilization to be much lower than normal. The issue is IMO something else. A router or a switch, either on the SETI side, or in between the users and SETI.


Likely (I don't even know where to find that) but whatever it is, it's been happening from the first outage since BOINC was released. Just retried my pending uploads. Two got to 100% and then timed out, the next two failed and then the client went into "Project Backoff."
ID: 972243 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3799
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 972245 - Posted: 20 Feb 2010, 15:12:57 UTC - in response to Message 972244.  

After a normal weekly outage though, the bandwidth is maxed out. Now it isn't even used 25%.


Got a URL for this? Thanks.
ID: 972245 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14676
Credit: 200,643,578
RAC: 874
United Kingdom
Message 972251 - Posted: 20 Feb 2010, 15:23:57 UTC - in response to Message 972229.  
Last modified: 20 Feb 2010, 15:24:10 UTC

So, the most recent outage happened on Feb. 17th.

Actually, the upload problem started on the 15th (before the outage), so the outage/recovery is not implicated in the ongoing problem.

Here's the URL you wanted:

http://fragment1.berkeley.edu/newcricket/grapher.cgi?target=/router-interfaces/inr-250/gigabitethernet2_3&ranges=d%3Aw&view=Octets

Note that this is our viewpoint, not the lab's, so the blue "bits out" trace is the uploads leaving us and heading to the lab; the green block is the downloads leaving the lab and heading in to our machines.
ID: 972251 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3799
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 972254 - Posted: 20 Feb 2010, 15:31:40 UTC - in response to Message 972251.  
Last modified: 20 Feb 2010, 15:32:11 UTC

Thanks! That will get a bookmark.

From TFU:

Values at last update:

Average bits in (for the day):
Cur: 14.92 Mbits/sec
Avg: 31.15 Mbits/sec
Max: 74.39 Mbits/sec

Average bits out (for the day):
Cur: 4.53 Mbits/sec
Avg: 5.67 Mbits/sec
Max: 10.20 Mbits/sec


This pretty well shows there's a problem... bits in and out should be roughly equal. But bits in is respectively 3, 5 and 7 times what bits out is. There's all the wasted bandwidth from incomplete uploads.
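For anyone who wants to check those multiples, here is a rough sketch that recomputes them from the Cricket values quoted above. The dictionary and function names are just illustrative choices, not anything from Cricket or BOINC.

```python
# Cricket "values at last update" quoted in this post, in Mbits/sec.
bits_in = {"cur": 14.92, "avg": 31.15, "max": 74.39}
bits_out = {"cur": 4.53, "avg": 5.67, "max": 10.20}


def in_out_ratios(bits_in, bits_out):
    """Return bits-in divided by bits-out for each statistic."""
    return {key: bits_in[key] / bits_out[key] for key in bits_in}


ratios = in_out_ratios(bits_in, bits_out)
for key, value in ratios.items():
    print(f"{key}: bits in is {value:.1f}x bits out")
```

The ratios come out a little above 3, 5 and 7, which is where the rounded figures in the post come from.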
ID: 972254 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3799
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 972261 - Posted: 20 Feb 2010, 15:37:38 UTC - in response to Message 972259.  
Last modified: 20 Feb 2010, 15:38:48 UTC

Bits in and out is never equal, because of the vast difference in size of what we download and upload.


You're right... I forgot the disparate sizes. IIRC the outbound workunits are, what, about 300KB, but the inbound results are about... 30KB? So bits in should be about one-tenth of bits out. Instead the average is that it's four times bits out, or about forty times too large. So that equates to about 98% wasted bandwidth, or about one workunit completing for every forty that get to 100% and then need to be retried.
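The back-of-envelope estimate above can be written out as a small sketch. It takes the post's own rough figures at face value (about 30KB per result, about 300KB per workunit, and an observed ratio of roughly four), so treat the inputs as assumptions rather than measurements; `wasted_fraction` is a hypothetical helper name.

```python
def wasted_fraction(expected_ratio, observed_ratio):
    """Compare an observed in/out traffic ratio with the expected one.

    Returns (excess_factor, wasted), where wasted is the share of the
    observed traffic that exceeds what the expected ratio would allow.
    """
    excess_factor = observed_ratio / expected_ratio
    return excess_factor, 1.0 - 1.0 / excess_factor


# Assumed sizes from the post: ~30KB results against ~300KB workunits.
expected = 30 / 300   # bits in should be about one-tenth of bits out
observed = 4.0        # the post's rough reading of the daily average

factor, wasted = wasted_fraction(expected, observed)
print(f"{factor:.0f}x larger than expected, {wasted:.1%} wasted")
```

With these inputs the factor is forty and the wasted share is 97.5%, which is the "about 98%" quoted in the post. Any other traffic sharing the monitored link would lower the real figure.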

Houston, do we have a problem? :^p
ID: 972261 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3799
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 972267 - Posted: 20 Feb 2010, 15:47:43 UTC

Could very well be a network problem. I can't tell. All I know is that there is a problem, and it's been around since Day 1, and I hope someone "official" can have a look at it as it's severely wasting the already-limited resources of this project. <<shrug>>

Thanks again. :^)
ID: 972267 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14676
Credit: 200,643,578
RAC: 874
United Kingdom
Message 972273 - Posted: 20 Feb 2010, 16:00:45 UTC - in response to Message 972261.  

Bits in and out is never equal, because of the vast difference in size of what we download and upload.


You're right... I forgot the disparate sizes. IIRC the outbound workunits are, what, about 300KB, but the inbound results are about... 30KB? So bits in should be about one-tenth of bits out. Instead the average is that it's four times bits out, or about forty times too large. So that equates to about 98% wasted bandwidth, or about one workunit completing for every forty that get to 100% and then need to be retried.

Houston, do we have a problem? :^p

It isn't quite as simple as that. The same data link that is being monitored on that router also carries a lot of BOINC administrative overhead traffic - specifically sched_request.xml and sched_reply.xml files. For enthusiastic GPU crunchers, these files dwarf the "data" files you're thinking of - and the biggest ones are inbound to Berkeley, so show up on the 'upload' side of the balance-sheet.

But what we do know, from past experience, is that when the servers are singing the 'download' side can run at 93 - 95 Mbit/sec continuously for days on end, and the upload side can sustain around 25 Mbit/sec - though the line (nominal 100 Mb) can't do both at the same time.
ID: 972273 · Report as offensive
Rick
Joined: 3 Dec 99
Posts: 79
Credit: 11,486,227
RAC: 0
United States
Message 972287 - Posted: 20 Feb 2010, 16:18:04 UTC

None of us actually know for sure what is wrong but here's my 2 cents worth.

I can confirm the observation that started this thread. Uploads sitting at some % complete (many times 100% complete) but they won't actually leave your task list. I don't have any insight into the inner workings of BOINC or SETI but it would make sense that there are multiple things going on with an upload. It's not simply push the data over the line and you're done. Getting the data there is only half of it. There must be a database update that has to take place to register the fact that you have uploaded a result. If that database update is having issues then it would make sense that we would see 100% uploads but the task is still sitting there in your task list. The data is sitting on the disk at Berkeley but SETI doesn't know about it. If that upload times out then eventually you'll go through the whole process again. But, what happened to that data you uploaded? If it's not represented in the database then the backend processes don't know what to do with it. If that becomes orphan data then it may never be cleaned up. Now, multiply that times how many failed uploads and you eventually end up with a full disk. Now the data can't even be uploaded so the symptom changes to uploads with less than 100% completion. Oddly, Eric reported that a database outage and a full disk were both issues he's had to deal with.

The fact that some folks can eventually get some uploads to work may be because there is some cleanup task that runs periodically that drops those orphan uploads. That would free up some disk so now some folks can upload tasks but they still run into the database issue so the symptom goes back to 100% uploaded but the task stays on their list. Until the disk fills up again and we start the cycle all over again.

That's all just conjecture but it does seem to be one scenario that would explain some of the symptoms we're seeing.

ID: 972287 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 972309 - Posted: 20 Feb 2010, 16:47:01 UTC - in response to Message 972287.  
Last modified: 20 Feb 2010, 16:49:40 UTC



That's all just conjecture but it does seem to be one scenario that would explain some of the symptoms we're seeing.

I can explain the cause, but not the symptom.

.........by the light of day.... but at night I'm one hell of a lover.

Baby.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 972309 · Report as offensive
Profile Gundolf Jahn

Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 972360 - Posted: 20 Feb 2010, 18:13:02 UTC - in response to Message 972287.  

It's not simply push the data over the line and you're done. Getting the data there is only half of it. There must be a database update that has to take place to register the fact that you have uploaded a result.

Wrong assumption :-)

The database access doesn't happen before the task is reported. The upload really only is a copy-file process.

Regards,
Gundolf
Computers aren't everything in life. (Just kidding)

SETI@home classic workunits 3,758
SETI@home classic CPU time 66,520 hours
ID: 972360 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14676
Credit: 200,643,578
RAC: 874
United Kingdom
Message 972370 - Posted: 20 Feb 2010, 18:25:43 UTC - in response to Message 972360.  

The upload really only is a copy-file process.

But on the sort of industrial-scale servers we're talking about here, even that requires some pretty clever (and fast) file indexing/cataloging to make sure your data doesn't overwrite mine, and both of them can be found again when the time comes for validation.

My 8-core Vista Pro workstation was noticeably slowed recently by a mere 20K files hidden in the recycle bin: I think Bruno has got the same sort of indigestion, but on a massive scale - remember that it should be keeping track of tens of millions of files at a time, changing at the rate of ten or twenty a second. I very much doubt anyone on this side of the message boards has ever seen a machine capable of doing that, let alone had to manage it.
ID: 972370 · Report as offensive
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 972559 - Posted: 21 Feb 2010, 1:18:26 UTC

We all know there are issues after the OverHeat of Servers.

I have been emailing Debug Logs to the Seti Staff in hope of sorting things.

That said, what I have been seeing over the last few days is hit or miss.

I surmise the "libcurl" bug rears its ugly head again.

This is also complicated in that I am moving furniture to the wife's work apartment. Don't ask.

Regards

Please consider a Donation to the Seti Project.

ID: 972559 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14676
Credit: 200,643,578
RAC: 874
United Kingdom
Message 972567 - Posted: 21 Feb 2010, 1:35:06 UTC - in response to Message 972559.  

I've posted in detail at Lunatics: but -

1) Did you get any information in reply? Which sub-systems of the server farm are under suspicion?

2) There are some wild - and actionable - assertions about Hurricane Electric flying around in Technical News. False, so far as I can tell from Wireshark. They need contradicting.

3) Is there anything - social engineering taken for granted - that we can do to help? You've mentioned logs, I've mentioned Wireshark. Is this still a diagnosis thing, or have we reached internal remediation? Status information is becoming an urgent need.
ID: 972567 · Report as offensive
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 972573 - Posted: 21 Feb 2010, 1:53:54 UTC - in response to Message 972567.  

Richard

I have been sending updated logs to DA and Rom. I even had time to downgrade from 6.10.32 to 6.10.18 to see if that would help (which was suggested I might have). I did get a scheduler request through. Uploads, No Joy....

As I think about it more in depth, I am starting to suspect it is "libcurl" on the Server Side. I have not yet dug into Trac to see when libcurl might have been updated for ATI or other purposes, but something sticks in the back of my head that this relates. Now to prove it.

So if you have a valid TCP dump, send it. You know where.

From everything I have seen network-wise, it is Not the Network. DNS resolves, I see server contact and see headers sent and received. Then things go into Never Never land.
Timeouts happen or the contact is broken (which would imply that the server told the client to disconnect). So Network communications are fine. The fact that Cricket shows a "few" users are able to send uploads and then get downloads is proof.

What most people do not know is what is involved, when you pay for a Gigabit Link, on that side to ensure that the Link is Solid. They have staff 24/7/365 that Seti would Love to be able to afford.

Regards


Please consider a Donation to the Seti Project.

ID: 972573 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14676
Credit: 200,643,578
RAC: 874
United Kingdom
Message 972578 - Posted: 21 Feb 2010, 2:03:57 UTC

Yes. I know where. But I don't know what.

At the moment, I have a 64KB Wireshark native-format filtered capture covering an entire - complete and now reported, for all I know validated - upload event earlier this afternoon. But can they do Wireshark? Which of the 20-odd alternative formats should I save the file in? They are - I hope - busy on the problem. They need a comms officer to handle questions like that from irritants like me. I think you volunteered [not tonight, I read you concerning the apartment move] but in general, several years ago. Please advise.
ID: 972578 · Report as offensive
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 972587 - Posted: 21 Feb 2010, 2:25:30 UTC - in response to Message 972578.  

Whatever can be cut and pasted into an email works (export specific lines as CSV if nothing else).

While I suspect they have used TCPDump in the past, it would take time to look from remote (the weekend). And that is before knowing what you are looking for (for the users: google TCP Handshake for a starter, then streaming files over the Internet) or isolating a specific instance.

As far as me, doing what I do... I do it, keep doing it. I wish sometimes that I could get more feedback. I have not quit! Many times what I see here tells me things that prompt an educated email. Things then happen.

That is all I can do.

Wish List:
Errors have more specific information!

Regards

Please consider a Donation to the Seti Project.

ID: 972587 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14676
Credit: 200,643,578
RAC: 874
United Kingdom
Message 972608 - Posted: 21 Feb 2010, 3:15:33 UTC - in response to Message 972587.  

Mail sent - you have copy. It's gone 3 am here (I'm on UTC) - if I don't see a response, either here or by mail, before my head stops buzzing, I'm going to crash out and pick up in the morning. See ya.
ID: 972608 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14676
Credit: 200,643,578
RAC: 874
United Kingdom
Message 972757 - Posted: 21 Feb 2010, 13:07:24 UTC

Managed to get a bit more information via Wireshark. This was the response to a scheduler request (Anakin, 208.68.240.20), reporting two tasks:


[Wireshark screenshot not preserved; the original post linked to it directly.]

About ten minutes later, on an automatic retry, the report went through successfully. Both tasks validated: 550828606 is an old resend, which doesn't tell us much, but 572822287 is fresh work split on Friday - two of us have been allocated, downloaded, uploaded and reported.

From what I can see, every part of the system is actually working "properly". It's just that one aspect of "properly" is a server-side protective mechanism which drops connections (sends a RST packet) when the server is in some sense 'too busy'. For several days, on two servers (Anakin and Bruno), that protective mechanism has been kicking in far too often, at traffic and loading levels that the two servers can normally handle without turning a hair.
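Seen from the client side, a dropped connection of that kind usually just means "wait and try again". The sketch below is only an illustration of that idea, assuming a generic retry loop; it is not BOINC's actual backoff code. `upload_with_backoff`, its parameters and the delay values are hypothetical, and `attempt_upload` stands in for whatever performs the transfer.

```python
import random
import time


def upload_with_backoff(attempt_upload, max_tries=5, base_delay=60.0,
                        max_delay=3600.0, sleep=time.sleep, rng=random.random):
    """Call attempt_upload() until it succeeds or max_tries is reached.

    A ConnectionResetError (the peer sent RST) or a TimeoutError triggers
    an exponentially growing, randomised wait before the next try.
    Returns the number of attempts used, or re-raises the last error.
    """
    for attempt in range(1, max_tries + 1):
        try:
            attempt_upload()
            return attempt
        except (ConnectionResetError, TimeoutError):
            if attempt == max_tries:
                raise
            # Double the ceiling on each failure, capped at max_delay, then
            # pick a random point below it so clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * ceiling)
```

The randomised wait matters here: if thousands of clients all retry on the same schedule, they arrive together and trip the same protective mechanism again.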

If I was in the lab, rather than pontificating from an armchair six thousand miles away, I'd be asking myself two questions:

1) Are Anakin and Bruno actually showing signs of overwork? (CPU loading, temperatures, number of processes running, memory utilisation - that sort of thing)

2) Has some part of the lab complex - not the individual servers, but the wider system like the gigabit LAN to the Network Attached Storage, and the NAS itself - developed throughput or response problems?

Nobody has posted any sign of a 'trigger point' yet: no actual breakdown, change in workflow pattern, or anything like that. It just slowed to a crawl on Monday morning. The only other thing I can think of (and this is veering off into complete guesswork) is the increasing amount of pending credit reported here in recent weeks. That represents lots of additional files stored, and stored for longer. Could something like a disk filing system index have grown to reach a tipping point, where it no longer fits in memory and has to spill over onto its own hard disk?
ID: 972757 · Report as offensive
Profile HSchmirPo
Joined: 17 Jan 06
Posts: 18
Credit: 39,561,621
RAC: 0
Germany
Message 972776 - Posted: 21 Feb 2010, 14:12:06 UTC - in response to Message 972757.  

Hello,
reporting from Germany.
Over the last few weeks, my pending credit has reached 565,825.16.
For hours now, the response from SETI has been:

21.02.2010 15:05:09 SETI@home Sending scheduler request: Requested by user.
21.02.2010 15:05:09 SETI@home Reporting 3 completed tasks, not requesting new tasks
21.02.2010 15:05:11 Project communication failed: attempting access to reference site
21.02.2010 15:05:12 Internet access OK - project servers may be temporarily down.
21.02.2010 15:05:14 SETI@home Scheduler request failed: Server returned nothing (no headers, no data)
21.02.2010 15:06:15 SETI@home Fetching scheduler list
21.02.2010 15:06:20 SETI@home Master file download succeeded
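If you want to time these failures rather than eyeball them, a few lines of the log above can be parsed as in this rough sketch. It assumes the day.month.year timestamp layout shown in this log (other locales may print dates differently), and `parse_line` is a hypothetical helper, not part of the BOINC client.

```python
from datetime import datetime

# Three lines copied from the log above.
LOG = """\
21.02.2010 15:05:09 SETI@home Sending scheduler request: Requested by user.
21.02.2010 15:05:14 SETI@home Scheduler request failed: Server returned nothing (no headers, no data)
21.02.2010 15:06:15 SETI@home Fetching scheduler list
"""


def parse_line(line):
    """Split a log line into (timestamp, rest of line)."""
    stamp, rest = line[:19], line[20:]
    return datetime.strptime(stamp, "%d.%m.%Y %H:%M:%S"), rest


entries = [parse_line(line) for line in LOG.splitlines()]
request_time = next(t for t, msg in entries if "Sending scheduler request" in msg)
failure_time = next(t for t, msg in entries if "Scheduler request failed" in msg)
print("seconds until failure:", (failure_time - request_time).total_seconds())
```

A failure that arrives within a few seconds of the request is a quick rejection by the server, not a long client-side timeout.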

HSchmirPo

ET? It's me!
ID: 972776 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.