Panic Mode On (77) Server Problems?
Author | Message |
---|---|
tbret Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40 |
It's after noon in Berkeley and I haven't seen any indication that the fellas in the lab have changed anything. I assume they know there is a problem with Uploads. I've got computers showing "project backoff" times of up to almost six hours and similarly sad "retry" times. The repeating wave on the incoming side of the cricket graph makes me think there is "something" that doesn't like "something" and whatever is broken is sneezing at semi-regular intervals, not merely that the communications are slow because of an overabundance of data (shortie storm, AP distribution, the usual suspects). The "sawtooth" shape of the thing (without correlation with the "outflow") makes me wonder. I hope it's making someone else wonder, too. I don't care, of course, until there are so many un-uploaded (and therefore unreported) work units in all of our queues that all of the Clients need manual intervention to be able to "Update." We (who read here) will be fine, but I'd wager a fair number of crunchers aren't going to be willing to babysit the process for hours (retry now / update, retry now / update) to prevent the "report" queue and client_state files from becoming unmanageable for the Client. I've noticed that several of mine have refused to auto "Update" while there are hundreds of Uploads stuck. Me? I keep up with it and try to mitigate the damage. Thank goodness I had a situation last night where I could babysit one of my rigs every few minutes for about 3 hours, "retrying" then "updating" something like 1,500 tasks (and that isn't even a particularly fast machine). I suspect that there are a lot of people who won't do that. It really isn't important whether the work gets reported now or tomorrow or a week from next Thursday, but it would be a crying shame to have all that crunching eventually time out and go to waste because it has "broken" the Client's ability to deal with the backlog. |
Bernie Vine Joined: 26 May 99 Posts: 9954 Credit: 103,452,613 RAC: 328 |
It's after noon in Berkeley and I haven't seen any indication that the fellas in the lab have changed anything. Agree 100%. I have had to set all my crunchers to NNT because unless I sit here 24/7 my fastest machine is completing WU's faster than they can be reported and the queue continues to grow. Imagine what it must be like on some of the top 10 machines. I have just done a "ping" check to all the machines on the 208.xxx.xxx subnet I can see, and all of them show packet loss at a level my old company could not have worked at!! PS I wouldn't "assume" that the lab knows anything is wrong; as we know, staffing levels are low, so there may actually be no one there. |
Waldo Joined: 19 May 12 Posts: 174 Credit: 317,086,018 RAC: 0 |
I can track batches of shorties that my 24 core box is processing via its UPS load stats. It's only about a 2% drop in power when running shorties, but at that machine's power level it is apparent. Hal would have to answer, but it looks like he is using rrdtool with snmp or something similar that utilizes rrd. This is just a guess and I might be wrong. Some of the newer web-enabled UPS cards do trending by default. You might want to check out cacti; it is one of the pieces of software I use. PS: It is an endless cycle when WUs finish faster than they can be uploaded :( I am sure someone is looking at it. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
You might want to check out cacti; it is one of the pieces of software I use. Thanks. When I get a chance I'll see if it recognises my UPS, and can make sense of its output. Grant Darwin NT |
Wiggo Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
Well I hope that the guys work out what they didn't get right during the last outage during tomorrow's outage as this past week has been very annoying. Cheers. |
Bernie Vine Joined: 26 May 99 Posts: 9954 Credit: 103,452,613 RAC: 328 |
Well I hope that the guys work out what they didn't get right during the last outage during tomorrow's outage as this past week has been very annoying. Certainly haven't had a problem "all week", just since late Saturday UTC. As there was no fix, message, or announcement on the front page, it seems likely they are still unaware of the problem. If it is "fixed" during the outage it is going to take a while to settle down!! |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
I can track batches of shorties that my 24 core box is processing via its UPS load stats. It's only about a 2% difference drop in power when running shorties, but at that machine's power level it is apparent. I use MRTG instead of RRDtool. The cricket graphs we like to fret over so much are generated using software based on RRDtool. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group (http://tinyurl.com/8y46zvu) |
Vepide Joined: 2 Feb 08 Posts: 6 Credit: 103,183 RAC: 0 |
I have two WUs downloading and it seems they've been downloading for days... when I hit retry it gets a few bytes then just craps out... not a problem with any other projects, just SETI. If they're using some kind of bandwidth throttling they should turn it off, or check to see if they have a failed NIC or something else in the pipeline. |
Keith White Joined: 29 May 99 Posts: 392 Credit: 13,035,233 RAC: 22 |
Well, the servers are up, it took all my results and reported them successfully, but the cricket graphs are still dead. Did you upgrade their backbone so our stuff isn't being piped through that link? I thought the new gigabit switch was only internal. Edit: Oops, my bad, seems like I just happened to turn on internet access just at the point the servers came up. Looks like the cricket graph is now starting to go up. "Life is just nature's way of keeping meat fresh." - The Doctor |
Akio Joined: 18 May 11 Posts: 375 Credit: 32,129,242 RAC: 0 |
No luck getting much to download here - says project has no tasks available. Must have to just wait it out for a bit. |
fscheel Joined: 13 Apr 12 Posts: 73 Credit: 11,135,641 RAC: 0 |
Not downloading here either. :( |
Bernie Vine Joined: 26 May 99 Posts: 9954 Credit: 103,452,613 RAC: 328 |
My main machine ran out of work due to the "upload" problem, now I have had 6 tasks, all shorties, all done. Oh well see how things are shaping up tomorrow. |
Starman Joined: 15 May 99 Posts: 204 Credit: 81,351,915 RAC: 25 |
Well, overall I'm mostly good for CPU work. One machine has lots to keep it going for close to a week. Unfortunately, my brand new ATI 7870 GPU has nothing to impress me with, my ATI 4870 GPU will be out of work before I go to bed tonight, and my ATI 6300 (Turks) is also twiddling its thumbs. Oh well. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
No luck getting much to download here - says project has no tasks available. Same here. And I notice the network traffic isn't maxed out, so something's gumming up the works somewhere. Grant Darwin NT |
Starman Joined: 15 May 99 Posts: 204 Credit: 81,351,915 RAC: 25 |
Well, just got 4 CPU WU, but the download was no better than the last 4-5 days. |
Wiggo Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
It seems that the guys got onto whatever it was that was causing problems since the last outage, as all but my lowly P.O.S. Vista rig are back up to full (it surprised me though that that P.O.S. was able to report well over 400 in its first 2 hits at the servers once they came back up). Now if the guys can just restrain themselves to just 1 new tape at a time until the AP's are done, things will go well. Cheers. |
Starman Joined: 15 May 99 Posts: 204 Credit: 81,351,915 RAC: 25 |
What's interesting is that all the WU's I've been getting the last 4-5 days are due on or around Oct. 5th, and all run high priority. |
S@NL Etienne Dokkum Joined: 11 Jun 99 Posts: 212 Credit: 43,822,095 RAC: 0 |
My main machine ran out of work due to the "upload" problem, now I have had 6 tasks, all shorties, all done. Nothing better, spewing out shorties like a mad man... |
Bernie Vine Joined: 26 May 99 Posts: 9954 Credit: 103,452,613 RAC: 328 |
OK, so I believe that there is something WRONG. The graph has not been maxed out since it came back, yet this morning all my machines had tasks, all backed off for 5-7 hours, and even with retry they may download 1 or 2 but go straight into backoff again. Why?? If the network is not maxed, why can't I connect? |
Wiggo Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
OK, so I believe that there is something WRONG. The graph has not been maxed out since it came back, yet this morning all my machines had tasks, all backed off for 5-7 hours, and even with retry they may download 1 or 2 but go straight into backoff again. Why?? Without AP's in the system the cricket graph is sitting about where you would normally expect it to, but as to why you can't get a look in, it's probably me & others sucking up whatever is available. Cheers. |