Hot Day (Oct 02 2012)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Hello again. Today was the usual outage day, but we got a *lot* done, so I figured I'd report on a bit of it. Everything in the server closet is now on the new Foundry X448 switch. Of course this is all internal traffic - the workunits/results are still going over our Hurricane Electric network. Still, it's a major improvement in quality and may actually grease several wheels. In fact, we may use it to replace the HE router as well at some point. The download servers have been trading off for a bit - we have now settled on using vader and georgem as the download server pair. As well, I just moved from Apache to nginx on those servers. I think it's working well, but if any of you notice weird behavior let me know! Otherwise, Jeff and Eric worked pretty hard today to align the beta and public projects - for the first time in a while (years?) their database configurations match, which will make the immediate future of development a lot easier (we've been dealing with having several code sandboxes and so forth for a while). In less great news, carolyn (the MySQL server) crashed for no known reason. Probably a Linux hiccup of some sort, which is common for us these days. The very silver lining is that it crashed right after the backup finished, and in such a manner that didn't cause any corruption or even get the replica server in a funny state. It's as if nothing happened, really. However, one sudden crisis came up at the end of the day today: the air conditioning in the building seems to have gone kaput. Our server closet is just fine (phew!) but we do have several servers not in the closet and they are burning up. We are shutting a few of the less necessary ones off for the evening. Hopefully the a/c will be fixed before too long. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Arvid Almstrom Joined: 23 Mar 00 Posts: 98 Credit: 137,331,372 RAC: 0 |
Hi Matt, thanks for the update on the current state of things, and good luck trying to keep cool. I was wondering whether you had any thoughts on the network problems of the past 7-10 days. From a novice's point of view, it looks like there was, over this last week, a correlation between the AP splitters running and the whole SETI project ceasing to respond. Many thanks for the update and keep 'em coming. Arvid Arvid Almstrom |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13841 Credit: 208,696,464 RAC: 304 |
"The download servers have been trading off for a bit - we are now currently settled on using vader and georgem as the download server pair. As well, I just moved from apache to nginx on those servers. I think it's working well, but if any of you notice weird behavior let me know!" Downloads appear to be pretty much as normal - the maxed-out network traffic is making connections & downloads difficult (although at present not impossible). NB - i connect to only the one download server (208.68.240.13), as the other (208.68.240.18) is always timing out. What is of concern is that for the last 3 weeks or so there have been serious issues with uploading, and frequent issues getting work from the Scheduler. Uploads are a case of either all of them timing out straight away, or they just sit there, elapsed time ticking away & nothing happening. After 1-5 minutes they'll time out to try again later. Usually once they start to download, they continue OK. However, often they will sit at 100% for anything up to 3 minutes & either finally complete, or time out & have to start from scratch again. Whatever was done (i think Sunday, your time) sorted it out, but the problem is back again right now. And the problem with the Scheduler is that requests for work are often met with "Project has no tasks available", "No tasks sent" or "Timeout was reached", so when we are finally able to upload enough work to request more, we can't get any. This all seemed to happen around the time all the shortie WUs started going through the system. E.g. on one card it usually takes 15-20min to process 3 WUs. With the shorties it's doing 3 WUs in 4-5min. That's 3 to 4 times the throughput. Grant Darwin NT |
Wiggo Joined: 24 Jan 00 Posts: 36575 Credit: 261,360,520 RAC: 489 |
For me, the time since the last 3 outages has been very problematic, with the uploads mostly queuing up forever; then again, the downloads haven't been much better when I finally get the uploads cleared. Sorry Matt, but things are far from good on this end of the system. :( Cheers. |
__W__ Joined: 28 Mar 09 Posts: 116 Credit: 5,943,642 RAC: 0 |
Thank you for the update. I like this little bit of technical background information, so I know that I'm not the only one living with fortune and/or setbacks when working with computers and technical surroundings ... ;-) "... the air conditioning in the building seems to have gone kaput..." The only good news in this is that Americans use German words, but the word is "kaputt - kaputter - am kaputtesten". I hope it's not "am kaputtesten", so that the servers soon get lower temperatures ;-) Keep on running ... __W__ |
Swibby Bear Joined: 1 Aug 01 Posts: 246 Credit: 7,945,093 RAC: 0 |
Matt - I think you are trying to jam too much down the internet pipe by using two download servers. Some years ago, one of the two download servers was out of commission for a day or two, and this let the internet connections proceed smoothly. I would like to suggest that you suspend one of the d/l servers for a day or two to see if the connections smooth out and actually result in better overall throughput. Please and Thanks !!! Whit |
KWSN THE Holy Hand Grenade! Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0 |
I'm with Swibby Bear and Grant: it's time something was done about the scheduler, as 100 WUs (as last I heard) held in memory is too few, now that you guys have (as I see it) 4 to 6 different types of WUs (CPU MB and AP, NVIDIA MB & AP and ATI/OpenCL MB & AP), with the assignment of type apparently (to me) being done BEFORE the WU gets into the scheduler's memory. A doubling of the number of WUs to 200, if not a quadrupling to 400, seems to be in order... I've mentioned this before, BTW, and got shot down with "It's working the way it is" - not as well as it could (as I saw it then), and it's not working too well now, and it'll only get worse from here, as more and/or faster computers come along! Hello, from Albany, CA!... |
Alaun Joined: 29 Nov 05 Posts: 18 Credit: 9,310,773 RAC: 0 |
Always like the updates! How much would it cost to get the gigabit line up the hill? It might help donations if there were a specific equipment list and dollar amount. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13841 Credit: 208,696,464 RAC: 304 |
"Matt - I think you are trying to jam too much down the internet pipe by using two download servers. Some years ago, one of the two download servers was out of commission for a day or two, and this let the internet connections proceed smoothly. I would like to suggest that you suspend one of the d/l servers for a day or two to see if the connections smooth out and actually result in better overall throughput. Please and Thanks !!!" The only time i recall a single download server running, everything almost came to a grinding halt. The main download problem is bandwidth - there just isn't enough of it with the 100Mb/s connection. However, the present problems are due to underlying system issues. People aren't going to find out how well the new server situation is working until they can return all their present work & get new work. Until the upload problem & the Scheduler problem are resolved, that's not going to happen. Grant Darwin NT |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13841 Credit: 208,696,464 RAC: 304 |
"Always like the updates!" There have been a few oblique references to that in the past. The line is there, it's just a matter of being connected. And that requires the University's agreement. It would appear campus politics is involved; until those issues (whatever they are) are resolved it's not going to happen. Grant Darwin NT |
rob smith Joined: 7 Mar 03 Posts: 22492 Credit: 416,307,556 RAC: 380 |
Matt, in response to your request for glitch reports - Uploads are extremely sticky to non-existent. This is resulting in my main host finishing tasks far faster than they are being uploaded. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
tbret Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40 |
Matt, There is something fairly new, and terrible, happening. I'm surprised that you seem unaware. The short version: There's been something major wrong for weeks that was not wrong six weeks ago. We're bone-dry out here, having worked-through our multi-day caches while the project was up and running. We have Managers unable to communicate with the Client due to gigantic backlogs of un-uploaded, un-reported, and un-downloaded tasks. Lately, uploads stick after reporting 100% progress. "Update" requests are behaving badly. Lots of "transient HTTP errors," lots of scheduler requests that time-out, lots of work units taking far longer to download (after getting stuck) than to crunch (and I'm not referring to "shorties" but to the incredibly slow and usually interrupted downloads). We are accustomed to "the usual difficulties." This is new. Your faithful are losing hope. Please wave a dead chicken over the racks in the server closet, or maybe out at Hurricane Electric (we can't tell). Soon, please. Whatever you think my biases may be, try to hear what I'm telling you with fresh ears. This is relatively new, but not yesterday new. ...and it's bad. If the things you changed Tuesday were supposed to help, as of this writing, they haven't. Things may be worse, in fact; but they were so bad to start-with that isolating a "new" bad condition isn't possible. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Now the ULs are the problem: they take a long time to start, then actually go fast, after that they get stuck at 100%, and after a long time they return with an error. |
Astro-AL Joined: 31 Mar 00 Posts: 18 Credit: 95,868,034 RAC: 80 |
I am having the same problems and it's getting very old. Why can't the problem at least be talked about? My CPU hasn't had enough work to keep it busy for more than 1 hour. I can't get any work for the CPU, only for the GPU. What's with this? |
eaglescouter Joined: 28 Dec 02 Posts: 162 Credit: 42,012,553 RAC: 0 |
No joy on uploads, even after the weekly outage/maintenance period, and the rebooting of my local project. Transfers tab shows: "Uploading 0.00 KBps" Messages tab shows: "10/3/2012 7:59:24 AM SETI@home Temporarily failed upload of 21jn12ac.10893.7429.10.10.102_1_0: connect() failed" It's not too many computers, it's a lack of circuit breakers for this room. But we can fix it :) |
rob smith Joined: 7 Mar 03 Posts: 22492 Credit: 416,307,556 RAC: 380 |
Matt, Update on my earlier comment - the rate of failure of uploads is getting higher as time wears on. I can't comment on downloads as I haven't had any for at least a day, probably due to the vast pile of uploads.... Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Bernie Vine Joined: 26 May 99 Posts: 9957 Credit: 103,452,613 RAC: 328 |
Matt, I turned off NNT on one machine and it downloaded 100+ tasks right off, so that has improved. Uploads, however, have not. I am assured in a couple of threads that the lab is totally aware of the problems and working towards a solution. |
Svirfnebli Joined: 29 Dec 05 Posts: 3 Credit: 47,240,821 RAC: 0 |
nnt? |
Bernie Vine Joined: 26 May 99 Posts: 9957 Credit: 103,452,613 RAC: 328 |
"nnt?" Sorry - No New Tasks - I had set all my machines to NNT as it seemed sensible. |
Dave Barstow Joined: 14 May 99 Posts: 76 Credit: 15,064,044 RAC: 0 |
I usually remain silent regarding the project's failures, hiccups and just plain strange behavior, but the last couple of weeks HAVE BEEN RIDICULOUS! I just looked at my GPU temp graph for the past 24 hours and noted that it showed a TOTAL run time of less than 30 minutes. During that period I managed to get about 40 CUDA-FERMI tasks by manually 'prodding' the system and aborting some uploads that had been trying to U/L for more than 18 hours, thereby dropping the number of stalled uploads to 8 or less on this quad-core machine, allowing it to get 20 new tasks for about 15 minutes of work. This is a rough approximation of the previous two weeks. THIS SUCKS! BIG TIME! |
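
Grant's "shortie" figures earlier in the thread (3 WUs in 15-20 min normally, 3 WUs in 4-5 min with shorties) can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `throughput_ratio` is made up for this note, and the minute values are taken from that post rather than measured. It suggests the speed-up could range from about 3x to about 5x depending on which ends of the two ranges are paired, which brackets the "3 to 4 times" quoted in the post.

```python
from fractions import Fraction


def throughput_ratio(normal_minutes, shortie_minutes, tasks=3):
    """Return how many times faster tasks complete with shorties.

    Rates are tasks per minute; exact fractions avoid float rounding.
    The function name and signature are hypothetical, for illustration only.
    """
    normal_rate = Fraction(tasks, normal_minutes)
    shortie_rate = Fraction(tasks, shortie_minutes)
    return shortie_rate / normal_rate


# Figures quoted in the thread: 15-20 min normally, 4-5 min with shorties.
lowest = throughput_ratio(15, 5)    # slowest shortie vs fastest normal run
highest = throughput_ratio(20, 4)   # fastest shortie vs slowest normal run
```

Since each completed task also means one upload and one scheduler request, a host working through shorties would be expected to hit the servers roughly that many times more often, which fits the timing Grant describes.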
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.