Panic Mode On (19) Server problems
| Author | Message |
|---|---|
|
PhonAcq Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1
|
Surfing these boards I don't get or retain all that is said. Notwithstanding, on the topic of upload failures I wonder if Berkeley could introduce a dedicated front end server that provides some sort of buffering. The objective would be to minimize any upload failures. The method would be to store the uploads into some sort of buffer, from which the results would be leaked into the rest of the system as fast as possible, consistent with never having a failed upload. Yes, it looks like 'bruno' is this server, but it has shared responsibilities and we have frequent periods of up-load failures. So perhaps restructuring this box makes some sense. I suppose the total bandwidth is the main problem. So is it possible to give the upload side priority over downloads? Why? Accepting that the uploads are less 'efficient' because of small slices, isn't it a more 'elegant' solution to use the uploads to control the overall system? There are probably many reasons to view it this way. One would be: assuming that WU storage is limited to some fixed amount, its capacity is not freed until the results are returned and validated. In the limit that nothing is uploaded, the WU storage will fill up with pending work and the system will stall. Giving uploads a larger share of the bandwidth will increase the turnover of WU storage. So in effect, the computing latency would be reduced, with fewer pending jobs. At times downloads would appear to stall, but only because the uploads have priority, a situation which resolves itself either due to a reduced upload demand (because of large random or quasi-random upload bursts), or due to fewer WUs being completed by those clients with a cache too small to span the period of congestion. |
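For what it's worth, the buffering idea in the post above can be sketched in a few lines. This is only a toy model under stated assumptions: the class name `UploadBuffer`, the `capacity` limit, and the `drain_per_tick` rate are all made up for illustration, and nothing here reflects how 'bruno' or any other Berkeley server actually works.

```python
from collections import deque


class UploadBuffer:
    """Toy front-end buffer (hypothetical): accept uploads quickly,
    then leak them to the back end at a fixed rate per tick."""

    def __init__(self, capacity: int, drain_per_tick: int) -> None:
        self.capacity = capacity              # assumed maximum buffered results
        self.drain_per_tick = drain_per_tick  # assumed back-end intake per tick
        self.queue = deque()

    def accept(self, result_id: str) -> bool:
        """Store an upload; return False only when the buffer is full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(result_id)
        return True

    def drain(self) -> list:
        """Hand the oldest buffered results to the back end, oldest first."""
        count = min(self.drain_per_tick, len(self.queue))
        return [self.queue.popleft() for _ in range(count)]
```

The catch the sketch makes visible: an upload still fails once the buffer is full, so the buffer only helps if it drains faster on average than uploads arrive.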
|
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0
|
Sadly, you missed my point. No one in their right mind would take modern traffic engineering with the obvious benefits and label it "controlling silly drivers." But when someone suggests that the same concepts and same benefits would apply here, it's immediately labelled "controlling silly users." I have no desire to control silly users, I want to see the traffic flowing smoothly. ... and I resent the "spin." Perhaps, though there is a balance between an approach/attitude of controlling silly users and one of looking at users as not being too silly to begin with. It is one of those long-standing conundrums. I'd like to 'control' the BOINC developers so that they would focus on clearing problems in their existing client first. For me, I find I work mostly with two BOINC clients -- 5.4.5 for non-GPU workstations, and 6.4.5 for GPU workstations and/or Vista/Win7 workstations. To me, some of the work allocation handling in most of the 6.x clients is just plain strange. |
|
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0
|
"Yes, it looks like 'bruno' is this server, but it has shared responsibilities and we have frequent periods of up-load failures. So perhaps restructuring this box makes some sense." No. The problem is taking any resource and cutting it too thin. I did a quick back-of-the-envelope calculation and estimated 6,000 upload attempts per minute, which is clearly too many. Imagine taking one normal ten-inch pie (if you need specifics, boysenberry) and trying to cut 6,000 slices and you've got the picture. It's probably pushing the analogy too far, but you get a new pie every minute, and if you could get everyone to take their turns, have enough folks each minute that they'd each actually get a slice, well, you can fill in the rest. It's the same for the scheduler as it is with upload and download. We don't see as big a problem with the scheduler because most of us talk to it once per day, when upload and download are once per work unit, but it's the same issue. |
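To put a rough number on the pie, here is a minimal sketch of the arithmetic. Only the 6,000 attempts per minute comes from the post; the 100 Mbit/s link capacity is a placeholder assumption, since the thread does not state the real figure, and the calculation ignores protocol overhead and the download traffic sharing the link.

```python
def bytes_per_attempt(link_mbps: float, attempts_per_minute: int) -> float:
    """Average data budget per attempt if one minute of link capacity
    were shared equally among all attempts made in that minute."""
    link_bytes_per_minute = link_mbps * 1_000_000 * 60 / 8
    return link_bytes_per_minute / attempts_per_minute


ASSUMED_LINK_MBPS = 100      # hypothetical; not stated in the thread
ATTEMPTS_PER_MINUTE = 6_000  # the back-of-the-envelope estimate from the post

print(bytes_per_attempt(ASSUMED_LINK_MBPS, ATTEMPTS_PER_MINUTE))
```

Halving the number of attempts doubles each slice, which is the whole argument for getting clients to take turns.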
OzzFan Joined: 9 Apr 02 Posts: 15692 Credit: 84,761,841 RAC: 28
|
"Ah yes, control those silly users <smile>." I don't know why you're feeling so attacked by my simple comment. I was merely suggesting that hitting retry makes the problem worse. I'm not certain why you felt it necessary to point the finger at the BOINC developers over my comment, or why you are calling the users "silly", which are your words, not mine. I don't think users are "silly", but I do think that users who try to force comms by hitting retry are part of the problem. It's a logical conclusion to come to and there's no reason for getting defensive over it. |
OzzFan Joined: 9 Apr 02 Posts: 15692 Credit: 84,761,841 RAC: 28
|
"Perhaps, though there is a balance between an approach/attitude of controlling silly users and one of looking at users as not being too silly to begin with." As I stated previously, it's your words, not mine, that users are "silly". I don't think they're "silly" at all, but I do think forcing comms when the servers are already dropping connections is a bad thing, and that perhaps trying to control the traffic by not allowing people to do this would be a good thing. Like Ned's stoplight analogy, it's not silly at all to try to control the traffic better. Is it silly to want to reduce collisions? That's effectively what's happening on an overloaded network, and I'd like to see fewer collisions, not more, and certainly not more caused purposely. Yet you are getting frustrated and defensive over my single comment anyway. |
|
Simplex0 Joined: 28 May 99 Posts: 124 Credit: 205,874 RAC: 0 |
I would say that there IS a problem within SETI\BOINC. Why do they send out new wu's when they are not able to receive the ones that are finished? |
OzzFan Joined: 9 Apr 02 Posts: 15692 Credit: 84,761,841 RAC: 28
|
"I would say that there IS a problem within SETI\BOINC." The problem with SETI is that their bandwidth is maxed out for various reasons, and the servers are dropping connections. As it pertains to this situation, there is no problem with BOINC itself. The BOINC client will simply see that it cannot connect and will retry later automatically. The only problem is that users are frustrated right now, and they are pressing the "retry now" button in BOINC Manager, causing additional load on the SETI servers. "Why do they send out new wu's when they are not able to receive the ones that are finished?" If they don't continue sending out work, the problem will only become worse when you have over 500,000 clients suddenly asking for work when it is available. This is the exact problem the servers have after the weekly outages: suddenly everyone wants to contact the servers, which causes the servers to drop connections because there are so many hosts vying for attention. |
|
Chelski Joined: 3 Jan 00 Posts: 121 Credit: 8,979,050 RAC: 0
|
I remember at the last AP panic problem someone started a donation drive for the 1 Gbps cable up the hill. Does anyone know the status of that donation drive and how much more is needed? |
|
Terror Australis Joined: 14 Feb 04 Posts: 1817 Credit: 262,693,308 RAC: 44
|
Re uploading problems. At my farm this appears to be affecting the Windows machines very badly and not worrying the Linux boxes much at all. Each of my Windows boxes has at least 100 units backed up waiting to upload, but the Linux boxes only have a dozen or so maximum on hold and are operating reasonably normally. Going back through the log it seems they can get through to upload with only 3 or 4 tries. Is anyone else finding this and does anyone know why it's happening? Brodo |
Gundolf Jahn Joined: 19 Sep 00 Posts: 3184 Credit: 446,358 RAC: 0
|
Is it affecting all your Windows machines or only those running CUDA? |
|
Grant (SSSF) Joined: 19 Aug 99 Posts: 14044 Credit: 208,696,464 RAC: 304
|
Something new to panic about. Uploads are now going through. Unfortunately downloads have come to a near standstill. Panic away. Grant Darwin NT |
|
Fred W Joined: 13 Jun 99 Posts: 2524 Credit: 11,954,210 RAC: 0
|
If that's anywhere near 18 hours, please can you suggest my lottery numbers for next week?? ;-)) F.
|
|
Grant (SSSF) Joined: 19 Aug 99 Posts: 14044 Credit: 208,696,464 RAC: 304
|
I think we went just a bit over the 18 hours. And unfortunately I missed out on our $120,000,000 draw last week. I could have made sure these issues were a thing of the past & still had some spare change left over. No such luck. :-( Grant Darwin NT |
ML1 Joined: 25 Nov 01 Posts: 21999 Credit: 7,508,002 RAC: 20
|
Just to give Ned's ideas a little backup (not that he needs it): I agree that some deliberate data throttling by the s@h servers could reduce the overload on the choked link and gain smoother and FASTER data transfers. A simple analogy is to compare the traffic flow on a busy but freely flowing motorway/freeway/autobahn as compared to the same road congested with traffic jostling along at a very slow average speed... So... Server controls? Or traffic management queues on the link itself? (And no, NOT 'policing' controls. They are a very intrusive blunt device that throws away a proportion of the bandwidth to kill the (innocent bystanders as) overload offenders!) Alternatively, use binary transfer of the WU data to in effect increase the link capacity to 130%-ish of present? Happy crunchin', Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
[B^S] madmac Joined: 9 Feb 04 Posts: 1175 Credit: 4,754,897 RAC: 0
|
|
Geek@Play Joined: 31 Jul 01 Posts: 2467 Credit: 86,146,931 RAC: 0
|
I am no longer seeing RED. My uploads all rapidly went to Berkeley about 30 minutes ago. Now I am slowly getting downloads. Mostly "no work available" response but occasionally I get 2 or 3 work units. [edit] We just might recover from last week's outage before tomorrow's outage starts! [edit2] NOT!! (Not Likely) Boinc....Boinc....Boinc....Boinc.... |
Geek@Play Joined: 31 Jul 01 Posts: 2467 Credit: 86,146,931 RAC: 0
|
Grrrrrr.............. Boinc....Boinc....Boinc....Boinc.... |
|
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0
|
"Server controls?" I think client controls.... If you try to control at the servers, you've still got a process that gets the TCP SYN, opens a control block, decides there are too many, and closes it (gracefully?) and that is, IMO, a big part of the problem now: just too many SYN packets, too many control blocks, too many handles. I think the only real answer is a way to tell the clients "hey, quit throwing so many packets -- I can't catch 'em all." |
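One common way for a client to "quit throwing so many packets" is exponential backoff with random jitter. The sketch below is a generic illustration of that technique, not the BOINC client's actual retry logic; the function name `backoff_delay`, the 60-second base, and the 4-hour cap are all assumed values picked for the example.

```python
import random


def backoff_delay(attempt: int, base: float = 60.0,
                  cap: float = 4 * 3600.0, rng=random) -> float:
    """Full-jitter exponential backoff (hypothetical parameters).

    The ceiling doubles with each failed attempt up to `cap`, and the
    actual delay is drawn uniformly from 0 to that ceiling so that
    clients which failed together do not all retry together.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Example: delays in seconds a client might wait after failures 0 through 5.
for attempt in range(6):
    print(attempt, round(backoff_delay(attempt), 1))
```

The randomness matters as much as the doubling: without it, every client that failed in the same minute would come back in the same minute, and the servers would see the same burst of SYN packets again.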
Dirk Sadowski Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5
|
Some news? NO.. The UL was ~ possible today between 09:55 UTC and 15:42 UTC.. now again not possible.. in this time I could upload all my WUs.. but got only some WUs.. not continuously full load.. some time only one or two (of four) GPUs running, or idle.. and surprise.. in ~ 40 min. again continuously idle time, because work request not possible.
|
|
BarryAZ Joined: 1 Apr 01 Posts: 2580 Credit: 16,982,517 RAC: 0
|
Perhaps I missed your point, it is after all an imperfect world, since not everyone agrees with me all the time <smile>. And I do apologize for posting something which you would take offense to as being spin. There was an interesting study a while back, noting that in the US there is a significantly higher accident rate AND significantly higher use of traffic control and warning signs and devices, when compared to the UK and Germany. Perhaps you missed my point about things being a balancing act. Within the BOINC world (and this particular project is very much a *part* and not *all* of the BOINC world) SETI traffic flow is among the most problematic. Most folks who participate in non-SETI BOINC projects likely concur with that assessment. For those who participate in a SETI-only BOINC environment, the lack of hands-on comparative experience can make discussions of 'what's best for the BOINC client and their users (silly or not)' perhaps a bit more argumentative than otherwise, as their experience sets and perhaps their ideal results might be less in concert with one another than one would wish. I too want to see traffic flowing smoothly -- for all BOINC projects. I would rather not see SETI-specific traffic flow issues dictate the configuration of the BOINC client, as the SETI-specific traffic flow issues are in fact SETI project specific. Traffic flow issues for SETI may well be something attributed to users -- in that there are too many of them for the available resource and performance level that this project has. Solving that issue is not a case (in my view) of changing the BOINC client, but something more local. That being said, I realize that much of the BOINC client development has SETI-specific roots, resources and influence, and so there is the possibility that an effort to address SETI-specific issues might well bleed into design changes for the BOINC client. Sadly, you missed my point.
|
©2026 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.