Problems...



Message boards : Number crunching : Problems...

Mr. Kevvy (Project donor, Volunteer tester)
Joined: 15 May 99 · Posts: 690 · Credit: 72,398,796 · RAC: 76,533 · Canada
Message 972229 - Posted: 20 Feb 2010, 14:29:11 UTC
Last modified: 20 Feb 2010, 14:55:31 UTC

I'm hoping that this may actually be seen by a bona fide BOINC developer. Maybe I should post in a BOINC forum, but that isn't where the outage happened.

So, the most recent outage happened on Feb. 17th. It's now Feb. 20th and I still have 24 workunits that won't upload. I can't get new ones until they do, so my clients are idle. I have seen this happen in perhaps a dozen outages, as I've been running SAH with BOINC (and without) since the day of their release.

The problem seems to be on the upload server side: the server appears to accept so many upload connections that it has none left to send the required confirmations back to the clients. Whatever the mechanism causing it, the result is plainly visible on the client: uploads partially or even completely finish, but they still time out and then have to be restarted from zero. This saturates the upload server's bandwidth.

I sometimes see half or more of the "Upload Pending" workunits with 50%-100% progress bars, so all of that is wasted bandwidth. Multiply that by a few thousand clients constantly hammering the server and it's no wonder that an outage knocks out uploads for days. The inbound connections and bandwidth are maxed, but the vast majority of it is being wasted. This is the longest I have ever seen it go on.
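
To put rough numbers on that failure mode, here is a toy simulation (plain Python, not BOINC code; the result size, confirmation-loss probability and client count are all assumptions, not measurements):

    import random

    # Toy model of the failure mode described above: each client pushes a whole
    # result file, but with some probability the confirmation never arrives, so
    # the client has to re-send the entire file from zero on the next attempt.
    RESULT_SIZE_KB = 30        # assumed size of one uploaded result
    CONFIRM_LOSS_PROB = 0.95   # assumed chance the confirmation never arrives
    CLIENTS = 5000             # assumed number of clients retrying
    random.seed(1)

    useful_kb = 0
    wasted_kb = 0
    for _ in range(CLIENTS):
        # Keep retrying until one attempt is actually confirmed.
        while random.random() < CONFIRM_LOSS_PROB:
            wasted_kb += RESULT_SIZE_KB   # full transfer, then thrown away
        useful_kb += RESULT_SIZE_KB       # the attempt that finally sticks

    total_kb = useful_kb + wasted_kb
    print(f"useful: {useful_kb} KB, wasted: {wasted_kb} KB "
          f"({100 * wasted_kb / total_kb:.1f}% of inbound bandwidth wasted)")

With a 95% chance of losing the confirmation, roughly 95% of the inbound bytes are retransmissions - which is the kind of waste described above.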

If BOINC is to be touted as a viable platform for scientific computing, I think this issue really should be addressed. It makes the problems caused by outages last several times longer than they should.

Thank you and flame away. :^p

Edit: All of the uploads seem to be failing at 100% from the most recent outage. I've seen perhaps one or two exceptions to the hundreds(?) of failures at 100%.

Sten-Arne (Volunteer tester)
Joined: 1 Nov 08 · Posts: 3404 · Credit: 19,573,197 · RAC: 19,124 · Sweden
Message 972241 - Posted: 20 Feb 2010, 14:56:58 UTC - in response to Message 972229.
Last modified: 20 Feb 2010, 14:57:32 UTC

I'm hoping that this may actually be seen by a bona fide BOINC developer. [...] The problem seems to be on the upload server side: the server appears to accept so many upload connections that it has none left to send the required confirmations back to the clients. [...] If BOINC is to be touted as a viable platform for scientific computing, I think this issue really should be addressed.


But the Cricket graphs show the bandwidth utilization to be much lower than normal. The issue is, IMO, something else: a router or a switch, either on the SETI side or somewhere between the users and SETI.

Sten-Arne

Mr. Kevvy (Project donor, Volunteer tester)
Joined: 15 May 99 · Posts: 690 · Credit: 72,398,796 · RAC: 76,533 · Canada
Message 972243 - Posted: 20 Feb 2010, 15:05:11 UTC - in response to Message 972241.

But the Cricket graphs show the bandwidth utilization to be much lower than normal. The issue is, IMO, something else: a router or a switch, either on the SETI side or somewhere between the users and SETI.


Likely (I don't even know where to find that), but whatever it is, it's been happening since the first outage after BOINC was released. I just retried my pending uploads: two got to 100% and then timed out, the next two failed, and then the client went into "Project Backoff".

Sten-Arne (Volunteer tester)
Joined: 1 Nov 08 · Posts: 3404 · Credit: 19,573,197 · RAC: 19,124 · Sweden
Message 972244 - Posted: 20 Feb 2010, 15:06:51 UTC - in response to Message 972243.
Last modified: 20 Feb 2010, 15:07:19 UTC

Likely (I don't even know where to find that), but whatever it is, it's been happening since the first outage after BOINC was released. [...]


After a normal weekly outage, though, the bandwidth is maxed out. Now it isn't even at 25%.

Sten-Arne

Mr. Kevvy (Project donor, Volunteer tester)
Joined: 15 May 99 · Posts: 690 · Credit: 72,398,796 · RAC: 76,533 · Canada
Message 972245 - Posted: 20 Feb 2010, 15:12:57 UTC - in response to Message 972244.

After a normal weekly outage, though, the bandwidth is maxed out. Now it isn't even at 25%.


Got a URL for this? Thanks.

Richard Haselgrove (Project donor, Volunteer tester)
Joined: 4 Jul 99 · Posts: 8460 · Credit: 48,801,015 · RAC: 82,352 · United Kingdom
Message 972251 - Posted: 20 Feb 2010, 15:23:57 UTC - in response to Message 972229.
Last modified: 20 Feb 2010, 15:24:10 UTC

So, the most recent outage happened on Feb. 17th.

Actually, the upload problem started on the 15th (before the outage), so the outage/recovery is not implicated in the ongoing problem.

Here's the URL you wanted:

http://fragment1.berkeley.edu/newcricket/grapher.cgi?target=/router-interfaces/inr-250/gigabitethernet2_3&ranges=d%3Aw&view=Octets

Note that this is our viewpoint, not the lab's, so the blue "bits out" trace shows the uploads leaving us and heading to the lab; the green block shows the downloads leaving the lab and heading in to our machines.

Mr. Kevvy (Project donor, Volunteer tester)
Joined: 15 May 99 · Posts: 690 · Credit: 72,398,796 · RAC: 76,533 · Canada
Message 972254 - Posted: 20 Feb 2010, 15:31:40 UTC - in response to Message 972251.
Last modified: 20 Feb 2010, 15:32:11 UTC

Thanks! That will get a bookmark.

From TFU:

Values at last update:

Average bits in (for the day):
Cur: 14.92 Mbits/sec
Avg: 31.15 Mbits/sec
Max: 74.39 Mbits/sec

Average bits out (for the day):
Cur: 4.53 Mbits/sec
Avg: 5.67 Mbits/sec
Max: 10.20 Mbits/sec


This pretty well shows there's a problem... bits in and out should be roughly equal. But bits in is respectively 3, 5 and 7 times what bits out is. There's all the wasted bandwidth from incomplete uploads.

Sten-Arne (Volunteer tester)
Joined: 1 Nov 08 · Posts: 3404 · Credit: 19,573,197 · RAC: 19,124 · Sweden
Message 972259 - Posted: 20 Feb 2010, 15:33:42 UTC - in response to Message 972254.

This pretty well shows there's a problem... bits in and out should be roughly equal. But bits in is respectively 3, 5 and 7 times what bits out is. There's all the wasted bandwidth from incomplete uploads.



Bits in and out are never equal, because of the vast difference in size between what we download and what we upload.

Sten-Arne

Mr. Kevvy (Project donor, Volunteer tester)
Joined: 15 May 99 · Posts: 690 · Credit: 72,398,796 · RAC: 76,533 · Canada
Message 972261 - Posted: 20 Feb 2010, 15:37:38 UTC - in response to Message 972259.
Last modified: 20 Feb 2010, 15:38:48 UTC

Bits in and out are never equal, because of the vast difference in size between what we download and what we upload.


You're right... I forgot the disparate sizes. IIRC the outbound workunits are, what, about 300KB, but the inbound results are about... 30KB? So bits in should be about one-tenth of bits out. Instead, the average is about four times bits out, or about forty times too large. That equates to about 98% wasted bandwidth, or about one workunit completing for every forty that get to 100% and then need to be retried.

Houston, do we have a problem? :^p
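
For anyone who wants to check that arithmetic, here is the back-of-envelope version using the Cricket figures quoted above (the ~300KB/~30KB sizes are the rough guesses from this thread, not official numbers, and the in/out direction labels follow the reading of the graph used in this post):

    # Back-of-envelope check of the ratio argument above. Sizes are the rough
    # guesses from this thread; "bits in" is read as results arriving at the
    # lab and "bits out" as workunits leaving it.
    wu_kb, result_kb = 300, 30
    expected_ratio = result_kb / wu_kb   # ~0.1: uploads should be a tenth of downloads

    cricket_mbps = {"cur": (14.92, 4.53), "avg": (31.15, 5.67), "max": (74.39, 10.20)}
    for label, (bits_in, bits_out) in cricket_mbps.items():
        observed_ratio = bits_in / bits_out
        excess = observed_ratio / expected_ratio   # times more upload traffic than needed
        wasted = 1 - 1 / excess
        print(f"{label}: in/out {observed_ratio:.1f}x, "
              f"~{excess:.0f}x the clean-upload estimate, ~{100 * wasted:.0f}% wasted")

Whether you use the current or the daily-average figures, the waste comes out in the 97-98% range - always assuming the size guesses hold.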

Sten-Arne (Volunteer tester)
Joined: 1 Nov 08 · Posts: 3404 · Credit: 19,573,197 · RAC: 19,124 · Sweden
Message 972266 - Posted: 20 Feb 2010, 15:43:17 UTC - in response to Message 972261.
Last modified: 20 Feb 2010, 15:44:27 UTC

So bits in should be about one-tenth of bits out. Instead, the average is about four times bits out, or about forty times too large. That equates to about 98% wasted bandwidth [...]



Don't forget that at the moment there are going to be many more uploads to SETI than downloads. Many clients have reached the limit where they can't request more work until they upload their finished WUs.

However, I still stand by my opinion that the main problem now is a router or switch somewhere on this planet.

And with that I end my participation in this discussion until we know for sure what the problem was. No need to speculate further, I think.

Sten-Arne

Mr. Kevvy (Project donor, Volunteer tester)
Joined: 15 May 99 · Posts: 690 · Credit: 72,398,796 · RAC: 76,533 · Canada
Message 972267 - Posted: 20 Feb 2010, 15:47:43 UTC

Could very well be a network problem; I can't tell. All I know is that there is a problem, it's been around since Day 1, and I hope someone "official" can have a look at it, as it's severely wasting the already-limited resources of this project. <<shrug>>

Thanks again. :^)

Richard Haselgrove (Project donor, Volunteer tester)
Joined: 4 Jul 99 · Posts: 8460 · Credit: 48,801,015 · RAC: 82,352 · United Kingdom
Message 972273 - Posted: 20 Feb 2010, 16:00:45 UTC - in response to Message 972261.

So bits in should be about one-tenth of bits out. Instead, the average is about four times bits out, or about forty times too large. That equates to about 98% wasted bandwidth [...]

It isn't quite as simple as that. The same data link that is being monitored on that router also carries a lot of BOINC administrative overhead traffic - specifically the sched_request.xml and sched_reply.xml files. For enthusiastic GPU crunchers, these files dwarf the "data" files you're thinking of - and the biggest ones are inbound to Berkeley, so they show up on the 'upload' side of the balance sheet.

But what we do know, from past experience, is that when the servers are singing, the 'download' side can run at 93-95 Mbit/sec continuously for days on end, and the upload side can sustain around 25 Mbit/sec - though the line (nominal 100 Mb) can't do both at the same time.
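
As a quick sanity check on what those rates mean in files per second (the file sizes are the rough guesses from earlier in the thread, so treat this as illustrative only):

    # Rough conversion of the sustained rates quoted above into files per second.
    # The 300 KB / 30 KB sizes are this thread's guesses, not official figures.
    download_mbps, upload_mbps = 94, 25
    wu_kb, result_kb = 300, 30

    wu_per_sec = download_mbps * 1e6 / 8 / (wu_kb * 1024)
    results_per_sec = upload_mbps * 1e6 / 8 / (result_kb * 1024)
    print(f"~{wu_per_sec:.0f} workunits/s going out, ~{results_per_sec:.0f} results/s coming in")
    print(f"doing both at once would need ~{download_mbps + upload_mbps} Mbit/s on a nominal 100 Mb line")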

Rick
Joined: 3 Dec 99 · Posts: 79 · Credit: 11,486,227 · RAC: 0 · United States
Message 972287 - Posted: 20 Feb 2010, 16:18:04 UTC

None of us actually know for sure what is wrong but here's my 2 cents worth.

I can confirm the observation that started this thread: uploads sitting at some percentage complete (often 100%) that won't actually leave your task list.

I don't have any insight into the inner workings of BOINC or SETI, but it would make sense that there are multiple things going on with an upload. It's not simply push the data over the line and you're done. Getting the data there is only half of it. There must be a database update that has to take place to register the fact that you have uploaded a result. If that database update is having issues, then it would make sense that we would see 100% uploads but the task still sitting there in your task list. The data is sitting on the disk at Berkeley, but SETI doesn't know about it.

If that upload times out, then eventually you'll go through the whole process again. But what happened to the data you uploaded? If it's not represented in the database, then the backend processes don't know what to do with it. If that becomes orphan data, it may never be cleaned up. Now multiply that by however many failed uploads and you eventually end up with a full disk. Then the data can't even be uploaded, so the symptom changes to uploads with less than 100% completion. Oddly, Eric reported that a database outage and a full disk were both issues he's had to deal with.

The fact that some folks can eventually get some uploads to work may be because there is some cleanup task that runs periodically and drops those orphan uploads. That would free up some disk, so now some folks can upload tasks, but they still run into the database issue, so the symptom goes back to 100% uploaded with the task staying on their list - until the disk fills up again and we start the cycle all over again.

That's all just conjecture but it does seem to be one scenario that would explain some of the symptoms we're seeing.


msattler (Project donor, Volunteer tester)
Joined: 9 Jul 00 · Posts: 38863 · Credit: 577,379,430 · RAC: 522,776 · United States
Message 972309 - Posted: 20 Feb 2010, 16:47:01 UTC - in response to Message 972287.
Last modified: 20 Feb 2010, 16:49:40 UTC



That's all just conjecture but it does seem to be one scenario that would explain some of the symptoms we're seeing.

I can explain the cause, but not the symptom.

.........by the light of day.... but at night I'm one hell of a lover.

Baby.
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

Gundolf Jahn
Joined: 19 Sep 00 · Posts: 3184 · Credit: 357,953 · RAC: 37 · Germany
Message 972360 - Posted: 20 Feb 2010, 18:13:02 UTC - in response to Message 972287.

It's not simply push the data over the line and you're done. Getting the data there is only half of it. There must be a database update that has to take place to register the fact that you have uploaded a result.

Wrong assumption :-)

The database access doesn't happen before the task is reported. The upload really only is a copy-file process.
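
So the flow is roughly this (a schematic sketch of what's being described here, with made-up function names - not actual BOINC server code):

    # Schematic of the separation described above (hypothetical names, not real
    # BOINC code): the upload is just a file copy; the database is only touched
    # later, when the client *reports* the finished task to the scheduler.

    def handle_upload(result_name, data, upload_dir):
        # File-upload handler: write the result file to disk and acknowledge.
        # No database work happens at this step.
        with open(f"{upload_dir}/{result_name}", "wb") as f:
            f.write(data)
        return "ACK"   # the confirmation the clients have been waiting for in vain

    def handle_report(db, result_name):
        # Scheduler request ("reporting"): only now is the database updated so
        # the back-end processes know the uploaded file exists.
        db.mark_result_uploaded(result_name)   # db is whatever object wraps the science DB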

Regards,
Gundolf

Richard Haselgrove (Project donor, Volunteer tester)
Joined: 4 Jul 99 · Posts: 8460 · Credit: 48,801,015 · RAC: 82,352 · United Kingdom
Message 972370 - Posted: 20 Feb 2010, 18:25:43 UTC - in response to Message 972360.

The upload really only is a copy-file process.

But on the sort of industrial-scale servers we're talking about here, even that requires some pretty clever (and fast) file indexing/cataloging to make sure your data doesn't overwrite mine, and both of them can be found again when the time comes for validation.

My 8-core Vista Pro workstation was noticeably slowed recently by a mere 20K files hidden in the recycle bin. I think Bruno has the same sort of indigestion, but on a massive scale - remember that it should be keeping track of tens of millions of files at a time, changing at a rate of ten or twenty a second. I very much doubt anyone on this side of the message boards has ever seen a machine capable of doing that, let alone had to manage it.
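
The usual trick for keeping that manageable - and, as far as I know, roughly what the BOINC server does with its upload directories, although the fan-out value and names below are made up - is to hash each filename into a fixed number of subdirectories, so no single directory ever holds millions of entries:

    import hashlib
    import os

    # Hashed fan-out directories: a standard way to keep tens of millions of
    # result files findable without one gigantic directory. The fan-out value
    # and paths here are illustrative assumptions.
    FANOUT = 1024

    def upload_path(base_dir, result_name):
        # Hash the result name into one of FANOUT subdirectories.
        h = int(hashlib.md5(result_name.encode()).hexdigest(), 16) % FANOUT
        subdir = os.path.join(base_dir, f"{h:x}")
        os.makedirs(subdir, exist_ok=True)
        return os.path.join(subdir, result_name)

    # 20 million files spread over 1024 directories is ~20,000 files per
    # directory - about the size of the recycle-bin pile-up mentioned above.
    print(upload_path("/tmp/upload_demo", "example_result_0"))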

Pappa (Volunteer tester)
Joined: 9 Jan 00 · Posts: 2562 · Credit: 12,301,681 · RAC: 0 · United States
Message 972559 - Posted: 21 Feb 2010, 1:18:26 UTC

We all know there are issues after the servers overheated.

I have been emailing debug logs to the SETI staff in the hope of sorting things out.

That said, what I have been seeing over the last few days is hit or miss.

I surmise the "libcurl" bug is rearing its ugly head again.

This is also complicated by the fact that I am moving furniture to the wife's work apartment. Don't ask.

Regards


Richard Haselgrove (Project donor, Volunteer tester)
Joined: 4 Jul 99 · Posts: 8460 · Credit: 48,801,015 · RAC: 82,352 · United Kingdom
Message 972567 - Posted: 21 Feb 2010, 1:35:06 UTC - in response to Message 972559.

I've posted in detail at Lunatics: but -

1) Did you get any information in reply? Which sub-systems of the server farm are under suspicion?

2) There are some wild - and actionable - assertions about Hurricane Electric flying around in Technical News. False, so far as I can tell from Wireshark. They need contradicting.

3) Is there anything - social engineering taken for granted - that we can do to help? You've mentioned logs, I've mentioned Wireshark. Is this still a diagnosis thing, or have we reached internal remediation? Status information is becoming an urgent need.

Sten-Arne (Volunteer tester)
Joined: 1 Nov 08 · Posts: 3404 · Credit: 19,573,197 · RAC: 19,124 · Sweden
Message 972572 - Posted: 21 Feb 2010, 1:53:41 UTC - in response to Message 972559.

We all know there are issues after the servers overheated. [...] I surmise the "libcurl" bug is rearing its ugly head again.


The "issues" started well before the overheat or the weekly outage. They started on last Sunday or early Monday.

Sten-Arne

Pappa (Volunteer tester)
Joined: 9 Jan 00 · Posts: 2562 · Credit: 12,301,681 · RAC: 0 · United States
Message 972573 - Posted: 21 Feb 2010, 1:53:54 UTC - in response to Message 972567.

Richard

I have been sending updated logs to DA and Rom. I even had time to downgrade from 6.10.32 to 6.10.18 to see if that would help (which it was suggested I try). I did get a scheduler request through. Uploads, no joy...

As I think about it more in depth, I am starting to suspect it is "libcurl" on the server side. I have not yet dug into Trac to see when libcurl might have been updated for ATI or other purposes, but something in the back of my head says the two are related. Now to prove it.

So if you have a valid TCP dump, send it. You know where.

From everything I have seen network-wise, it is not the network. DNS resolves, I see server contact, and I see headers sent and received. Then things go into never-never land.
Timeouts happen, or the contact is broken (which would imply that the server told the client to disconnect). So network communications are fine. The fact that Cricket shows a "few" users are able to send uploads and then get downloads is proof.
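
If anyone else wants to collect the same kind of evidence, a minimal probe along those lines looks like this (the host name is a placeholder - substitute the actual upload server - and it only checks reachability and response timing, nothing libcurl-specific):

    import socket
    import time

    # Minimal probe of the pattern described above: DNS resolves, the TCP
    # connection and request headers go through, and then nothing comes back
    # before the timeout. UPLOAD_HOST is a placeholder, not the real server.
    UPLOAD_HOST = "upload.example.org"
    UPLOAD_PORT = 80
    TIMEOUT_S = 30

    start = time.time()
    addr = socket.gethostbyname(UPLOAD_HOST)            # step 1: DNS
    print(f"resolved {UPLOAD_HOST} -> {addr} in {time.time() - start:.2f}s")

    with socket.create_connection((addr, UPLOAD_PORT), timeout=TIMEOUT_S) as s:
        print("TCP connect OK")                         # step 2: connection
        s.sendall(b"HEAD / HTTP/1.0\r\nHost: " + UPLOAD_HOST.encode() + b"\r\n\r\n")
        s.settimeout(TIMEOUT_S)
        try:
            reply = s.recv(1024)                        # step 3: any response?
            print("got reply:", reply.split(b"\r\n", 1)[0].decode(errors="replace"))
        except socket.timeout:
            print(f"no reply within {TIMEOUT_S}s - the 'never never land' case")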

What most people do not know is what is involved, when you pay for a Gigabit link, in making sure that the link is solid. They have staff 24/7/365 that SETI would love to be able to afford.

Regards


