Hot Day (Oct 02 2012)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1290461 - Posted: 2 Oct 2012, 23:18:47 UTC

Hello again. Today was the usual outage day, but we got a *lot* done, so I figured I'd report on a bit of it.

Everything in the server closet is now on the new Foundry X448 switch. Of course this is all internal traffic - the workunits/results are still going over our Hurricane Electric network. Still, it's a major improvement in quality and may actually grease several wheels. In fact, we may use it to replace the HE router as well at some point.

The download servers have been trading off for a bit - we are now currently settled on using vader and georgem as the download server pair. As well, I just moved from apache to nginx on those servers. I think it's working well, but if any of you notice weird behavior let me know!

Otherwise, Jeff and Eric worked pretty hard today to align the beta and public projects - for the first time in a while (years?) their database configurations match, which will make the immediate future of development a lot easier (we've been dealing with having several code sandboxes and so forth for a while).

In less great news, carolyn (the MySQL server) crashed for no known reason. Probably a Linux hiccup of some sort, which is common for us these days. The very silver lining is that it crashed right after the backup finished, and in such a manner that it didn't cause any corruption or even get the replica server in a funny state. It's as if nothing happened, really.

However, one sudden crisis came up at the end of the day: the air conditioning in the building seems to have gone kaput. Our server closet is just fine (phew!) but we do have several servers not in the closet and they are burning up. We are shutting a few of the less necessary ones off for the evening. Hopefully the a/c will be fixed before too long.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1290461 · Report as offensive
Arvid Almstrom
Joined: 23 Mar 00
Posts: 98
Credit: 137,331,372
RAC: 0
Australia
Message 1290466 - Posted: 2 Oct 2012, 23:31:19 UTC - in response to Message 1290461.  

Hi Matt

Thanks for the update on the current state of things, and good luck keeping cool.

I was wondering whether you had any thoughts on the network problems of the past 7-10 days. To a novice, it looks like there was, over this last week, a correlation between the AP splitters running and the whole SETI project ceasing to respond.

Many thanks for the update and keep 'em coming.

Arvid
Arvid Almstrom
ID: 1290466 · Report as offensive
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 1290476 - Posted: 2 Oct 2012, 23:50:40 UTC - in response to Message 1290461.  
Last modified: 2 Oct 2012, 23:55:14 UTC

The download servers have been trading off for a bit - we are now currently settled on using vader and georgem as the download server pair. As well, I just moved from apache to nginx on those servers. I think it's working well, but if any of you notice weird behavior let me know!

Downloads appear to be pretty much as normal - the maxed-out network traffic is making connections & downloads difficult (although at present not impossible).
NB - I connect to only one download server (208.68.240.13), as the other (208.68.240.18) always times out.

What is of concern is that for the last 3 weeks or so there have been serious issues with uploading, and frequent issues getting work from the Scheduler.

Uploads either all time out straight away, or they just sit there, elapsed time ticking away & nothing happening. After 1-5 minutes they'll time out to try again later. Usually once they start to transfer, they continue OK. However, they will often sit at 100% for anything up to 3 minutes & either finally complete, or time out & have to start from scratch again.
Whatever was done (I think Sunday, your time) sorted it out, but the problem is back again right now.


And the problem with the Scheduler is that requests for work are often met with "Project has no tasks available", "No tasks sent" or "Timeout was reached" so when we are finally able to upload enough work to request more, we can't get any.

This all seemed to happen around the time all the shortie WUs started going through the system. E.g., on one card it usually takes 15-20 min to process 3 WUs. With the shorties it's doing 3 WUs in 4-5 min. That's 3 to 4 times the throughput.
Grant
Darwin NT
ID: 1290476 · Report as offensive
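
As a rough check of the shortie figures in the post above, here is a minimal sketch of the arithmetic. The helper name `speedup` is made up for illustration, and the only inputs are the times quoted in the post.

```python
# Rough throughput comparison for the same batch of 3 WUs.
# `speedup` is a hypothetical helper, not part of BOINC.

def speedup(normal_minutes: float, shortie_minutes: float) -> float:
    """How many times faster a batch finishes when it takes less time."""
    return normal_minutes / shortie_minutes

# Normal work: 15-20 min per batch. Shorties: 4-5 min per batch.
like_for_like = (speedup(15, 4), speedup(20, 5))  # fast vs fast, slow vs slow
extremes = (speedup(15, 5), speedup(20, 4))       # smallest and largest ratio

print("like-for-like:", like_for_like)
print("extremes:", extremes)
```

Pairing the fast case with the fast case and the slow with the slow gives 3.75x to 4x, in line with the "3 to 4 times" estimate; the extremes span 3x to 5x.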
Wiggo
Joined: 24 Jan 00
Posts: 36840
Credit: 261,360,520
RAC: 489
Australia
Message 1290499 - Posted: 3 Oct 2012, 0:56:51 UTC - in response to Message 1290476.  

For me, the time since the last 3 outages has been very problematic, with the uploads mostly queuing up forever; but then again, the downloads haven't been much better when I finally get the uploads cleared.

Sorry Matt but things are far from good on this end of the system. :(

Cheers.
ID: 1290499 · Report as offensive
__W__
Joined: 28 Mar 09
Posts: 116
Credit: 5,943,642
RAC: 0
Germany
Message 1290524 - Posted: 3 Oct 2012, 2:24:10 UTC - in response to Message 1290461.  

Thank you for the update,
I like these little bits of technical background information, so I know that I'm not the only one living with fortune and/or setbacks when working with computers and technical surroundings ... ;-)

... the air conditioning in the building seems to have gone kaput...

The only good news in this is that Americans use German words, but the word is "kaputt - kaputter - am kaputtesten".
I hope it's not "am kaputtesten", so that the servers soon get lower temperatures ;-)

Keep on running ...

__W__

_______________________________________________________________________________
ID: 1290524 · Report as offensive
Swibby Bear
Joined: 1 Aug 01
Posts: 246
Credit: 7,945,093
RAC: 0
United States
Message 1290546 - Posted: 3 Oct 2012, 3:46:36 UTC

Matt - I think you are trying to jam too much down the internet pipe by using two download servers. Some years ago, one of the two download servers was out of commission for a day or two, and this let the internet connections proceed smoothly. I would like to suggest that you suspend one of the d/l servers for a day or two to see if the connections smooth out and actually result in better overall throughput. Please and Thanks !!!

Whit
ID: 1290546 · Report as offensive
KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 1290581 - Posted: 3 Oct 2012, 5:25:38 UTC
Last modified: 3 Oct 2012, 5:27:05 UTC

I'm with Swibby Bear and Grant: it's time something was done about the scheduler. Holding 100 WUs in memory (the last figure I heard) is too few now that you guys have, as I see it, 4 to 6 different types of WU (CPU MB and AP, NVIDIA MB & AP, and ATI/OpenCL MB & AP), with the type apparently (to me) being assigned BEFORE the WU gets into the scheduler's memory. Doubling the number of WUs to 200, if not quadrupling it to 400, seems to be in order... I've mentioned this before, BTW, and got shot down with "It's working the way it is". It wasn't working as well as it could then (as I saw it), it's not working too well now, and it'll only get worse from here as more and/or faster computers come along!

Hello, from Albany, CA!...
ID: 1290581 · Report as offensive
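
To put the 100/200/400 figures above in perspective, here is a minimal sketch that assumes the in-memory slots are shared evenly between WU types. That even split is an assumption made purely for illustration (the thread does not say how the scheduler actually divides them), and `slots_per_type` is a hypothetical helper, not a BOINC function.

```python
# Illustrative model only: assumes an even split of in-memory slots across
# WU types, which the thread does not confirm.

def slots_per_type(total_slots: int, type_count: int) -> int:
    """Whole slots left for each WU type under an even split."""
    if type_count <= 0:
        raise ValueError("type_count must be positive")
    return total_slots // type_count

# Slot counts and type counts are the ones mentioned in the post.
for total in (100, 200, 400):
    for types in (4, 6):
        per_type = slots_per_type(total, types)
        print(f"{total} slots / {types} types -> {per_type} per type")
```

Under that assumption, 100 slots across 6 types leaves 16 per type, against 66 per type with 400 slots.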
Alaun
Joined: 29 Nov 05
Posts: 18
Credit: 9,310,773
RAC: 0
United States
Message 1290582 - Posted: 3 Oct 2012, 5:26:38 UTC

Always like the updates!

How much would it cost to get the gigabit line up the hill?
It might help donations if there were a specific equipment list and dollar amount.

ID: 1290582 · Report as offensive
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 1290583 - Posted: 3 Oct 2012, 5:28:34 UTC - in response to Message 1290546.  

Matt - I think you are trying to jam too much down the internet pipe by using two download servers. Some years ago, one of the two download servers was out of commission for a day or two, and this let the internet connections proceed smoothly. I would like to suggest that you suspend one of the d/l servers for a day or two to see if the connections smooth out and actually result in better overall throughput. Please and Thanks !!!

The only time I recall a single download server running, everything almost came to a grinding halt.
The main download problem is bandwidth: there just isn't enough of it with the 100Mb/s connection.
However, the present problems are due to underlying system issues. People aren't going to find out how well the new server situation is working until they can return all their present work & get new work.

Until the upload problem & the Scheduler problem are resolved, that's not going to happen.
Grant
Darwin NT
ID: 1290583 · Report as offensive
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 1290584 - Posted: 3 Oct 2012, 5:31:19 UTC - in response to Message 1290582.  

Always like the updates!

How much would it cost to get the gigabit line up the hill?
It might help donations if there were a specific equipment list and dollar amount.

There have been a few oblique references to that in the past. The line is there, it's just a matter of being connected. And that requires the University's agreement.
It would appear campus politics is involved; until those issues (whatever they are) are resolved it's not going to happen.

Grant
Darwin NT
ID: 1290584 · Report as offensive
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22535
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1290586 - Posted: 3 Oct 2012, 5:40:45 UTC

Matt, in response to your request for glitch reports - Uploads are extremely sticky to non-existent. This is resulting in my main host finishing tasks far faster than they are being uploaded.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1290586 · Report as offensive
tbret
Volunteer tester
Joined: 28 May 99
Posts: 3380
Credit: 296,162,071
RAC: 40
United States
Message 1290592 - Posted: 3 Oct 2012, 5:48:28 UTC - in response to Message 1290461.

I think it's working well, but if any of you notice weird behavior let me know!

Matt,

There is something fairly new, and terrible, happening. I'm surprised that you seem unaware.

The short version: There's been something major wrong for weeks that was not wrong six weeks ago.

We're bone-dry out here, having worked through our multi-day caches while the project was up and running.

We have Managers unable to communicate with the Client due to gigantic backlogs of un-uploaded, un-reported, and un-downloaded tasks. Lately, uploads stick after reporting 100% progress. "Update" requests are behaving badly. Lots of "transient HTTP errors," lots of scheduler requests that time out, lots of work units taking far longer to download (after getting stuck) than to crunch (and I'm not referring to "shorties" but to the incredibly slow and usually interrupted downloads).

We are accustomed to "the usual difficulties." This is new.

Your faithful are losing hope.

Please wave a dead chicken over the racks in the server closet, or maybe out at Hurricane Electric (we can't tell). Soon, please.

Whatever you think my biases may be, try to hear what I'm telling you with fresh ears. This is relatively new, but not yesterday new.

...and it's bad.

If the things you changed Tuesday were supposed to help, as of this writing, they haven't. Things may be worse, in fact; but they were so bad to start with that isolating a "new" bad condition isn't possible.
ID: 1290592 · Report as offensive
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1290694 - Posted: 3 Oct 2012, 11:29:07 UTC

Now the uploads (UL) are the problem: they take a long time to start, then actually go fast, then get stuck at 100%, and after a long time return with an error.
ID: 1290694 · Report as offensive
Astro-AL
Joined: 31 Mar 00
Posts: 18
Credit: 95,868,034
RAC: 80
United States
Message 1290731 - Posted: 3 Oct 2012, 13:44:04 UTC - in response to Message 1290592.  

I am having the same problems and it's getting very old. Why can't the problem at least be talked about? My CPU hasn't had enough work to keep it busy for more than 1 hour. I can't get any work for the CPU, only the GPU. What's with this?
ID: 1290731 · Report as offensive
eaglescouter
Joined: 28 Dec 02
Posts: 162
Credit: 42,012,553
RAC: 0
United States
Message 1290762 - Posted: 3 Oct 2012, 15:01:09 UTC

No joy on uploads, even after the weekly outage/maintenance period, and the rebooting of my local project.

Transfers tab shows: "Uploading 0.00 KBps"
Messages tab shows: "10/3/2012 7:59:24 AM SETI@home Temporarily failed upload of 21jn12ac.10893.7429.10.10.102_1_0: connect() failed"


It's not too many computers, it's a lack of circuit breakers for this room. But we can fix it :)
ID: 1290762 · Report as offensive
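
For anyone wanting to count failures like the one quoted above, here is a minimal sketch that pulls the fields out of such a message line. The pattern is inferred from that single quoted line, so the layout, the month/day/year date order, and the helper name `parse_failed_upload` are all assumptions; real BOINC logs may well differ.

```python
import re
from datetime import datetime

# Pattern inferred from the one message line quoted in the post above;
# it is an assumption, not a documented BOINC log format.
FAILED_UPLOAD = re.compile(
    r"^(?P<stamp>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M) "
    r"(?P<project>\S+) Temporarily failed upload of "
    r"(?P<file>[^:]+): (?P<reason>.+)$"
)

def parse_failed_upload(line: str):
    """Return a dict of fields for a failed-upload message, or None."""
    match = FAILED_UPLOAD.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Assumes month/day/year ordering in the timestamp.
    fields["stamp"] = datetime.strptime(fields["stamp"], "%m/%d/%Y %I:%M:%S %p")
    return fields

sample = ("10/3/2012 7:59:24 AM SETI@home Temporarily failed upload of "
          "21jn12ac.10893.7429.10.10.102_1_0: connect() failed")
print(parse_failed_upload(sample))
```

Lines that do not match the assumed layout simply return `None`, so the helper can be run over a whole log and the non-matching lines ignored.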
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22535
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1290792 - Posted: 3 Oct 2012, 15:50:19 UTC

Matt,
Update on my earlier comment - the rate of failure of uploads is getting higher as time wears on.


I can't comment on downloads as I haven't had any for at least a day, probably due to the vast pile of uploads....
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1290792 · Report as offensive
Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9958
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1290825 - Posted: 3 Oct 2012, 17:13:45 UTC - in response to Message 1290792.  

Matt,
Update on my earlier comment - the rate of failure of uploads is getting higher as time wears on.


I can't comment on downloads as I haven't had any for at least a day, probably due to the vast pile of uploads....

I turned off NNT on one machine and it downloaded 100+ tasks right off, so that has improved.

Uploads, however, have not. I am assured in a couple of threads that the lab is fully aware of the problems and working towards a solution.
ID: 1290825 · Report as offensive
Svirfnebli
Joined: 29 Dec 05
Posts: 3
Credit: 47,240,821
RAC: 0
United States
Message 1290885 - Posted: 3 Oct 2012, 19:37:32 UTC - in response to Message 1290825.  

nnt?
ID: 1290885 · Report as offensive
Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9958
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1290889 - Posted: 3 Oct 2012, 19:51:04 UTC - in response to Message 1290885.  

nnt?

Sorry - No New Tasks - I had set all my machines to NNT as it seemed sensible.


ID: 1290889 · Report as offensive
Dave Barstow
Joined: 14 May 99
Posts: 76
Credit: 15,064,044
RAC: 0
Philippines
Message 1290941 - Posted: 3 Oct 2012, 21:57:59 UTC

I usually remain silent regarding the project's failures, hiccups and just plain strange behavior, but the last couple of weeks HAVE BEEN RIDICULOUS!

I just looked at my GPU temp graph for the past 24 hours and noted that it showed a TOTAL run time of less than 30 minutes. During that period I managed to get about 40 CUDA-FERMI tasks by manually 'prodding' the system and aborting some uploads that had been trying to U/L for greater than 18 hours, thereby dropping the number of stalled uploads to 8 or less on this quad-core machine, allowing it to get 20 new tasks for about 15 minutes of work. This is roughly an approximation of the previous two weeks.

THIS SUCKS! BIG TIME!
ID: 1290941 · Report as offensive
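
To put the numbers in the post above into proportion, here is a minimal sketch of the arithmetic. The helper name `utilization` is hypothetical, and the inputs are simply the figures quoted in the post (under 30 minutes of GPU run time in 24 hours, and 20 tasks lasting about 15 minutes).

```python
# Back-of-the-envelope figures from the post above; nothing here is measured.

def utilization(run_minutes: float, window_hours: float) -> float:
    """Fraction of the time window spent doing work."""
    return run_minutes / (window_hours * 60.0)

gpu_busy = utilization(30, 24)       # upper bound: "less than 30 minutes"
seconds_per_task = 15 * 60 / 20      # 20 tasks in about 15 minutes

print(f"GPU busy at most {gpu_busy:.1%} of the day")
print(f"about {seconds_per_task:.0f} seconds per task")
```

On those figures the GPU was working for roughly 2% of the day at most, with each task lasting well under a minute.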