Panic Mode On (19) Server problems

PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 914546 - Posted: 6 Jul 2009, 0:38:24 UTC - in response to Message 914425.  


As I write this, most of the bandwidth is wasted because it is divided across so many users that most slices are just too small.


Surfing these boards I don't get or retain all that is said. Notwithstanding, on the topic of upload failures I wonder if Berkeley could introduce a dedicated front end server that provides some sort of buffering. The objective would be to minimize any upload failures. The method would be to store the uploads into some sort of buffer, from which the results would be leaked into the rest of the system as fast as possible, consistent with never having a failed upload.

Yes, it looks like 'bruno' is this server, but it has shared responsibilities and we have frequent periods of upload failures. So perhaps restructuring this box makes some sense.

I suppose the total bandwidth is the main problem. So is it possible to give the upload side priority over downloads?

Why? Accepting that the uploads are less 'efficient' because of small slices, isn't it a more 'elegant' solution to use the uploads to control the overall system? There are probably many reasons to view it this way. One would be: assuming that WU storage is limited to some fixed amount, its capacity is not freed until the results are returned and validated. In the limit where nothing is uploaded, the WU storage will fill up with pending work and the system will stall. Giving uploads a larger share of the bandwidth will increase the turnover of WU storage. So in effect the computing latency would be reduced: fewer pending jobs. At times downloads would appear to stall, but only because the uploads have priority, a situation which resolves itself either due to a reduced upload demand (because of large random or quasi-random upload bursts), or due to fewer WUs being completed by those clients with a cache too small to span the period of congestion.
ID: 914546 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 914583 - Posted: 6 Jul 2009, 3:18:45 UTC - in response to Message 914544.  

Sadly, you missed my point.

No one in their right mind would take modern traffic engineering with the obvious benefits and label it "controlling silly drivers."

But when someone suggests that the same concepts and same benefits would apply here, it's immediately labelled "controlling silly users."

I have no desire to control silly users, I want to see the traffic flowing smoothly.

... and I resent the "spin."

Perhaps, though there is a balance between an approach/attitude of controlling silly users and one of looking at users as not being too silly to begin with. It is one of those long-standing conundrums. I'd like to 'control' the BOINC developers so that they would focus on clearing problems in their existing client first. For me, I find I work mostly with two BOINC clients -- 5.4.5 for non-GPU workstations, and 6.4.5 for GPU workstations and/or Vista/Win7 workstations. Some of the work allocation handling in most of the 6.x clients is, for me, just plain strange.

Of course, since the 'stuck work unit' issue seems to show up primarily in SETI, rather than change the BOINC client, which affects all projects, one would hope that the root cause gets looked at instead. Then again, the BOINC development effort for some reason does tend to be a tad SETI-centric....


Ah yes, control those silly users <smile>.

Yes, control the silly users....

In the same sense that stop signs and traffic lights control those silly drivers.

Control traffic, keep the flow smooth, and everyone gets home for dinner on time.



ID: 914583 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 914591 - Posted: 6 Jul 2009, 3:33:42 UTC - in response to Message 914546.  
Last modified: 6 Jul 2009, 3:33:54 UTC

Yes, it looks like 'bruno' is this server, but it has shared responsibilities and we have frequent periods of upload failures. So perhaps restructuring this box makes some sense.

I suppose the total bandwidth is the main problem. So is it possible to give the upload side priority over downloads?

Why? Accepting that the uploads are less 'efficient' because of small slices (much removed)


No.

The problem is taking any resource and cutting it too thin.

I did a quick back-of-the-envelope calculation and estimated 6,000 upload attempts per minute, which is clearly too many. Imagine taking one normal ten inch pie (if you need specifics, boysenberry) and trying to cut 6,000 slices and you've got the picture.

It's probably pushing the analogy too far, but you get a new pie every minute, and if you could get everyone to take their turns, have enough folks each minute that they'd each actually get a slice, well, you can fill in the rest.

It's the same for the scheduler as it is with upload and download. We don't see as big a problem with the scheduler because most of us talk to it once per day, whereas upload and download happen once per work unit, but it's the same issue.
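To put rough numbers on the pie, here is a minimal Python sketch. The 6,000 attempts per minute is the back-of-the-envelope estimate above; the 100 Mbit/s link and the 60-second average connection lifetime are assumptions picked purely for illustration, not measured values for the SETI@home link.

```python
def per_connection_rate_kbps(link_mbps, attempts_per_minute, avg_connection_seconds):
    """Fair-share rate per connection if every attempt is admitted at once.

    All three inputs are illustrative assumptions, not measurements.
    """
    # Little's law: connections open at once = arrival rate * time each stays open.
    arrivals_per_second = attempts_per_minute / 60.0
    concurrent_connections = arrivals_per_second * avg_connection_seconds
    # Split the whole link evenly across the open connections (kbit/s each).
    return (link_mbps * 1000.0) / concurrent_connections


if __name__ == "__main__":
    slice_kbps = per_connection_rate_kbps(
        link_mbps=100,             # assumed link capacity
        attempts_per_minute=6000,  # estimate from the post above
        avg_connection_seconds=60, # assumed lifetime of a struggling connection
    )
    print(f"each connection gets about {slice_kbps:.1f} kbit/s")
```

Under those assumed numbers each connection gets a slice well below dial-up speed, and the slower each transfer runs, the longer it stays open and the thinner every slice becomes.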
ID: 914591 · Report as offensive
OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15692
Credit: 84,761,841
RAC: 28
United States
Message 914592 - Posted: 6 Jul 2009, 3:43:22 UTC - in response to Message 914525.  
Last modified: 6 Jul 2009, 4:00:06 UTC

Ah yes, control those silly users <smile>.

Before doing that, I'd love to see the BOINC developers clean up existing open tickets on the BOINC client -- there are enough of those. And then of course fulfill the plan for ATI GPU support -- that would be very nice.

I take your point about the retry button when uploads are 'stuck'. I'd note that I don't encounter this with the other projects I work with though.

There was a time I hit that retry button for uploads, but these days instead of doing that, given I'm one of those multiple project folks, I use a different approach - I temporarily suspend SETI to reduce the number of 'stuck' uploads being generated for SETI until the existing stuck uploads clear. Since SETI is the ONLY project I am working with in that form of overload mode and response, this works for me.

Similarly, as I mentioned earlier in this thread, I've added projects and reduced the SETI share generally. This project doesn't really need extra user CPU cycles right now, and other BOINC projects seem to be handling their much smaller workload just fine.


I don't know why you're feeling so attacked by my simple comment. I was merely suggesting that hitting retry makes the problem worse. I'm not certain why you felt it necessary to point the finger at the BOINC developers over my comment, or why you are calling the users "silly", which are your words, not mine. I don't think users are "silly", but I do think that users who try to force comms by hitting retry are part of the problem. It's a logical conclusion to come to and there's no reason for getting defensive over it.
ID: 914592 · Report as offensive
OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15692
Credit: 84,761,841
RAC: 28
United States
Message 914594 - Posted: 6 Jul 2009, 3:49:37 UTC - in response to Message 914544.  
Last modified: 6 Jul 2009, 4:09:51 UTC

Perhaps, though there is a balance between an approach/attitude of controlling silly users and one of looking at users as not being too silly to begin with.

I'd like to 'control' the BOINC developers so that they would focus on clearing problems in their existing client first.

Of course, since the 'stuck work unit' issue seems to show up primarily in SETI, rather than change the BOINC client, which affects all projects, one would hope that the root cause gets looked at instead. Then again, the BOINC development effort for some reason does tend to be a tad SETI-centric....


As I stated previously, it's your words, not mine, that users are "silly". I don't think they're "silly" at all, but I do think forcing comms when the servers are already dropping connections is a bad thing, and that perhaps trying to control the traffic by not allowing people to do this would be a good thing. Like Ned's stoplight analogy, it's not silly at all to try to control the traffic better. Is it silly to want to reduce collisions? That's effectively what's happening on an overloaded network, and I'd like to see fewer collisions, not more, and certainly not more caused on purpose.

Yet you are getting frustrated and defensive over my single comment anyway.
ID: 914594 · Report as offensive
Simplex0
Volunteer tester

Joined: 28 May 99
Posts: 124
Credit: 205,874
RAC: 0
Message 914604 - Posted: 6 Jul 2009, 4:10:54 UTC

I would say that there IS a problem within SETI\BOINC.

Why do they send out new wu's when they are not able to receive the ones that are finished?


ID: 914604 · Report as offensive
OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15692
Credit: 84,761,841
RAC: 28
United States
Message 914606 - Posted: 6 Jul 2009, 4:20:11 UTC - in response to Message 914604.  

I would say that there IS a problem within SETI\BOINC.


The problem with SETI is that their bandwidth is maxed out for various reasons, and the servers are dropping connections.

As it pertains to this situation, there is no problem with BOINC itself. The BOINC client will simply see that it cannot connect and will retry again later automatically. The only problem is that users are frustrated right now, and they are pressing the "retry now" button in BOINC Manager, causing additional load on the SETI servers.

Why do they send out new wu's when they are not able to receive the ones that are finished?


If they don't continue sending out work, the problem will only become worse when you have over 500,000 clients suddenly asking for work once it is available. This is the exact problem the servers have after the weekly outages: suddenly everyone wants to contact the servers, which causes the servers to drop connections because there are so many hosts vying for attention.
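For anyone wondering why the automatic retry behaves better than a crowd of people pressing "retry now", here is a rough Python sketch of randomized exponential backoff, the general technique for spreading retries out in time. This is not the actual BOINC client code; the function name, the 60-second base delay, and the 4-hour cap are made up for illustration.

```python
import random


def backoff_delay(attempt, base_seconds=60.0, cap_seconds=4 * 3600.0, rng=random):
    """Return a randomized wait (seconds) before retry number `attempt` (0-based).

    Hypothetical parameters: the ceiling doubles with each failed attempt,
    up to a cap, and the actual wait is drawn uniformly below that ceiling
    so clients that failed together do not all come back together.
    """
    ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for attempt in range(6):
        print(f"attempt {attempt}: wait {backoff_delay(attempt):.0f} s")
```

Pressing "retry now" effectively resets the wait to zero, so a few thousand impatient users can recreate the post-outage stampede by hand.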
ID: 914606 · Report as offensive
Chelski
Joined: 3 Jan 00
Posts: 121
Credit: 8,979,050
RAC: 0
Malaysia
Message 914633 - Posted: 6 Jul 2009, 7:03:13 UTC

I remember that during the last AP panic someone started a donation drive for the 1 Gbps cable up the hill. Does anyone know the status of that donation drive and how much more is needed?
ID: 914633 · Report as offensive
Terror Australis
Volunteer tester

Joined: 14 Feb 04
Posts: 1817
Credit: 262,693,308
RAC: 44
Australia
Message 914660 - Posted: 6 Jul 2009, 9:08:48 UTC

Re: uploading problems.
At my farm this appears to be affecting the Windows machines very badly and not worrying the Linux boxes much at all. Each of my Windows boxes has at least 100 units backed up waiting to upload, but the Linux boxes only have a dozen or so at most on hold and are operating reasonably normally. Going back through the log, it seems they can get through to upload with only 3 or 4 tries.

Is anyone else finding this, and does anyone know why it's happening?

Brodo
ID: 914660 · Report as offensive
Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 914661 - Posted: 6 Jul 2009, 9:12:01 UTC - in response to Message 914660.  

Is it affecting all your Windows machines or only those running CUDA?
ID: 914661 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 14044
Credit: 208,696,464
RAC: 304
Australia
Message 914672 - Posted: 6 Jul 2009, 9:30:11 UTC - in response to Message 913931.  


Something new to panic about.
Uploads are now going through.
Unfortunately downloads have come to a near standstill.

Panic away.
Grant
Darwin NT
ID: 914672 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 914675 - Posted: 6 Jul 2009, 9:35:02 UTC - in response to Message 914672.  


Something new to panic about.
Uploads are now going through.
Unfortunately downloads have come to a near standstill.

Panic away.

If that's anywhere near 18 hours, please can you suggest my lottery numbers for next week?? ;-))

F.
ID: 914675 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 14044
Credit: 208,696,464
RAC: 304
Australia
Message 914679 - Posted: 6 Jul 2009, 9:44:50 UTC - in response to Message 914675.  
Last modified: 6 Jul 2009, 9:45:48 UTC


Something new to panic about.
Uploads are now going through.
Unfortunately downloads have come to a near standstill.

Panic away.

If that's anywhere near 18 hours, please can you suggest my lottery numbers for next week?? ;-))

I think we went just a bit over the 18 hours. And unfortunately I missed out on our $120,000,000 draw last week. I could have made sure these issues were a thing of the past & still had some spare change left over.
No such luck.
:-(
Grant
Darwin NT
ID: 914679 · Report as offensive
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 21999
Credit: 7,508,002
RAC: 20
United Kingdom
Message 914706 - Posted: 6 Jul 2009, 12:32:28 UTC
Last modified: 6 Jul 2009, 12:33:16 UTC

Just to give Ned's ideas a little backup (not that he needs it):

I agree that some deliberate data throttling by the s@h servers could reduce the overload on the choked link and gain smoother and FASTER data transfers.

A simple analogy is to compare the traffic flow on a busy but freely flowing motorway/freeway/autobahn with the same road congested with traffic jostling along at a very slow average speed...

So...

Server controls?

Or traffic management queues on the link itself?

(And no, NOT 'policing' controls. They are a very intrusive, blunt device that throws away a proportion of the bandwidth to kill the overload offenders, and innocent bystanders along with them!)


Alternatively, use binary transfer of the WU data to in effect increase the link capacity to 130%-ish of present?
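A figure of "130%-ish" is about what you would expect if the WU data is currently sent in a base64-style text encoding, which turns every 3 bytes into 4 characters. That encoding is an assumption here, not something confirmed in this thread. A quick Python check of the ratio, using random bytes as a stand-in for WU data:

```python
import base64
import os


def text_encoding_overhead(n_bytes=300_000):
    """Ratio of base64-encoded size to raw size for n_bytes of random data.

    Assumes a base64-style encoding; the real WU wire format may differ.
    """
    raw = os.urandom(n_bytes)          # stand-in for binary WU data
    encoded = base64.b64encode(raw)    # 4 output characters per 3 input bytes
    return len(encoded) / len(raw)


if __name__ == "__main__":
    print(f"encoded/raw size ratio: {text_encoding_overhead():.3f}")
```

If that assumption holds, sending the raw bytes would let the same link carry roughly a third more WU data, before counting headers and any compression.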


Happy crunchin',
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 914706 · Report as offensive
[B^S] madmac
Volunteer tester
Joined: 9 Feb 04
Posts: 1175
Credit: 4,754,897
RAC: 0
United Kingdom
Message 914710 - Posted: 6 Jul 2009, 13:05:08 UTC

Just finished uploading the ones that have been waiting since Saturday, I think. Now I have to wait for downloads, as I have two BOINC projects going, and for one of them the WUs take 9 hours on my little computer.
ID: 914710 · Report as offensive
Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 914711 - Posted: 6 Jul 2009, 13:10:09 UTC
Last modified: 6 Jul 2009, 13:35:14 UTC

I am no longer seeing RED.

My uploads all rapidly went to Berkeley about 30 minutes ago. Now I am slowly getting downloads. Mostly a "no work available" response, but occasionally I get 2 or 3 work units.

[edit]
We just might recover from last week's outage before tomorrow's outage starts!

[edit2]
NOT!! (Not Likely)
Boinc....Boinc....Boinc....Boinc....
ID: 914711 · Report as offensive
Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 914735 - Posted: 6 Jul 2009, 15:24:18 UTC

Grrrrrr..............
Boinc....Boinc....Boinc....Boinc....
ID: 914735 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 914757 - Posted: 6 Jul 2009, 16:21:14 UTC - in response to Message 914706.  

Server controls?

Or traffic management queues on the link itself?

I think client controls....

If you try to control at the servers, you've still got a process that gets the TCP SYN, opens a control block, decides there are too many, and closes it (gracefully?) and that is, IMO, a big part of the problem now: just too many SYN packets, too many control blocks, too many handles.

I think the only real answer is a way to tell the clients "hey, quit throwing so many packets -- I can't catch 'em all."
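One way such a "slow down" message could work, sketched in Python under stated assumptions: the server estimates how many clients are waiting and how many transfers it can complete per minute, then advises a window long enough to serve them all, and each client picks a random moment inside that window. Both function names are hypothetical, nothing like this is claimed to exist in BOINC, and the sketch assumes the advice rides along in some reply the client already receives rather than costing a fresh connection.

```python
import random


def advised_retry_window_seconds(pending_clients, capacity_per_minute):
    """Server side: window length needed to serve everyone at the sustainable rate."""
    if capacity_per_minute <= 0:
        raise ValueError("capacity_per_minute must be positive")
    return 60.0 * pending_clients / capacity_per_minute


def pick_retry_time(window_seconds, rng=random):
    """Client side: pick a uniformly random offset inside the advised window."""
    return rng.uniform(0.0, window_seconds)


if __name__ == "__main__":
    # Illustrative numbers only: 6,000 waiting clients, 600 transfers/minute capacity.
    window = advised_retry_window_seconds(6000, 600)
    print(f"advised window: {window:.0f} s, my slot: {pick_retry_time(window):.0f} s")
```

With uniform random slots the expected arrival rate matches the stated capacity, so far fewer SYN packets arrive only to be dropped.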
ID: 914757 · Report as offensive
Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 914764 - Posted: 6 Jul 2009, 16:34:39 UTC
Last modified: 6 Jul 2009, 16:38:26 UTC


Some news? NO..

Uploading was possible today between roughly 09:55 UTC and 15:42 UTC; now it is not possible again. In that window I could upload all my WUs, but I received only a few new ones, not enough for continuous full load. At times only one or two of my four GPUs were running, or all were idle. And, surprise, in about 40 minutes everything will be idle again, because work requests are not getting through.

ID: 914764 · Report as offensive
BarryAZ

Joined: 1 Apr 01
Posts: 2580
Credit: 16,982,517
RAC: 0
United States
Message 914783 - Posted: 6 Jul 2009, 17:11:55 UTC - in response to Message 914583.  

Perhaps I missed your point, it is after all an imperfect world, since not everyone agrees with me all the time <smile>. And I do apologize for posting something which you would take offense to as being spin.

There was an interesting study a while back, noting that in the US there is a significantly higher accident rate AND significantly higher use of traffic control and warning signs and devices, when compared to the UK and Germany.

Perhaps you missed my point about things being a balancing act.

Within the BOINC world (and this particular project is very much a *part* and not *all* of the BOINC world) SETI traffic flow is among the most problematic. Most folks who participate in non SETI BOINC projects likely concur with that assessment. For those who participate in a SETI only BOINC environment, the lack of hands on comparative experience can make discussions of 'what's best for the BOINC client and their users (silly or not)' perhaps a bit more argumentative than otherwise as their experience sets and perhaps their ideal results might be less in concert with one another than one would wish.

I too want to see traffic flowing smoothly -- for all BOINC projects. I would rather not see SETI specific traffic flow issues dictate the configuration of the BOINC client as the SETI specific traffic flow issues are in fact SETI project specific.

Traffic flow issues for SETI may well be something attributed to users -- in that there are too many of them for the available resource and performance level that this project has. Solving that issue is not a case (in my view) of changing the BOINC client, but something more local.

That being said, I realize that much of the BOINC client development has SETI-specific roots, resources and influence, so there is the possibility that an effort to fix SETI-specific issues might well bleed into design changes for the BOINC client.


Sadly, you missed my point.

No one in their right mind would take modern traffic engineering with the obvious benefits and label it "controlling silly drivers."

But when someone suggests that the same concepts and same benefits would apply here, it's immediately labelled "controlling silly users."

I have no desire to control silly users, I want to see the traffic flowing smoothly.

... and I resent the "spin."



ID: 914783 · Report as offensive
©2026 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.