Panic Mode On (8) Server problems


gomeyer
Volunteer tester
Joined: 21 May 99
Posts: 488
Credit: 50,370,425
RAC: 0
United States
Message 791255 - Posted: 2 Aug 2008, 5:36:57 UTC - in response to Message 791249.  


Are there a whole bunch of short/noisy Work Units going through the system at the moment?
Just had a look at the network graphs, and the traffic has been steadily increasing over the last few hours. Presently it's sitting at 82Mb/s (it's hit 92) and even the uploads are at 11.5Mb/s.
The system's copping a pounding the likes of which would normally only occur after an outage.

The 05jn08aa and ac, and 10jn08aa and ab data I've gotten in the last half hour is all Very High Angle Range short work. The overflow rate shown on the Science status page is only 2.3% for the last 10 minutes, so unless I just picked a quiet time to look it doesn't seem there's a noise problem.
                                                                Joe

"Results out in the field" is very high also and has been going up the past couple of hours at least. I don't think I've ever seen network traffic that high except during recovery and/or when they are backing-up data offsite. There seems to be lot of something going out.

Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 791280 - Posted: 2 Aug 2008, 7:12:25 UTC


Not looking good - the ready-to-send buffer is down to 60,000 & the result creation rate is down to 0.62/s.

I'm also now unable to return completed work (system connect errors), which I expect is due to the server load.
Grant
Darwin NT

arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 791285 - Posted: 2 Aug 2008, 7:32:32 UTC

The upload server is definitely getting pounded right now; I had to hit retry several times before it finally went up. The unit would be all the way up and then it would just stop.


Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 791286 - Posted: 2 Aug 2008, 7:37:01 UTC - in response to Message 791280.  


Not looking good - the ready-to-send buffer is down to 60,000 & the result creation rate is down to 0.62/s.

I'm also now unable to return completed work (system connect errors), which I expect is due to the server load.

Ready to send is now down to 0, so work will only be available as produced. That's bad news in the sense that many requests will return a no work available reply, but good news for getting, returning, and reporting since the Cricket graphs have dropped considerably. I had a dozen downloads which had been struggling for 2 hours getting little bits and pieces then timing out. They took off ten minutes ago and downloaded at my full dial-up rate.
                                                                 Joe

Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 791291 - Posted: 2 Aug 2008, 8:12:43 UTC - in response to Message 791286.  

....the Cricket graphs have dropped considerably. I had a dozen downloads which had been struggling for 2 hours getting little bits and pieces then timing out.

That's because the ready-to-send queue is now down to 0 & the splitters still aren't splitting; the result creation rate is less than 2/s. With the workload at the time it ran out, it'd have to be around 20-25/s just to meet demand, let alone try & build up some reserves.

I've suspended my network activity till things settle down again. No point hammering the servers when they've got nothing to serve...
Grant
Darwin NT
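
As a rough cross-check of that 20-25/s figure, a minimal sketch: dividing the ~82-92 Mb/s of outgoing traffic reported earlier in the thread by an assumed ~366 KB per workunit download (the workunit size is an assumption, not a number quoted in the thread) lands in the same ballpark.

    # Back-of-envelope check of the "20-25/s just to meet demand" estimate.
    # Assumption: ~366 KB per workunit download; the 82-92 Mb/s figures are
    # the peak outgoing traffic quoted earlier in the thread.
    WU_SIZE_BYTES = 366 * 1024

    for mbit_per_s in (82, 92):
        bytes_per_s = mbit_per_s * 1_000_000 / 8
        wu_per_s = bytes_per_s / WU_SIZE_BYTES
        print(f"{mbit_per_s} Mb/s outgoing ~ {wu_per_s:.0f} workunit downloads/sec")

    # Roughly 27-31 downloads/sec, so the splitters would need to sustain
    # something in the low tens of results per second just to keep the
    # ready-to-send buffer from draining.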

Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 791690 - Posted: 3 Aug 2008, 0:02:37 UTC


A couple of the splitters that weren't running now are, but the result creation rate is only about 10/s.
Looks like there's something still gumming up the works.
Grant
Darwin NT

BarryAZ
Joined: 1 Apr 01
Posts: 2580
Credit: 16,982,517
RAC: 0
United States
Message 792388 - Posted: 3 Aug 2008, 23:22:24 UTC

In the ongoing saga of BOINC project issues this past week, Rosetta appears to have totally lost its connectivity to the internet as of about 20 minutes ago. Nada, vapor.

They (like SETI) were having a problem generating work; that seemed to have been solved yesterday, but now they're in vapor land.

That makes three projects in trouble at the moment: Einstein (which may resurface this coming week), SETI (with no work available for the past couple of days), and Rosetta. For me, it just pushes work share to the functioning projects.

At the beginning of this year, my accumulated work split was:

Einstein - 26.7%
SETI - 19.5%
World Grid - 17.2%
Rosetta - 13.8%
Climate - 13.7%
Spinhenge - 5.4%
Predictor - 2.2%
Climate BBC - 1.5%

My weekly split back then was:

Spinhenge - 28.2%
Rosetta - 22.1%
Climate - 17.0%
SETI - 14.1%
World Grid - 10.1%
Einstein - 8.5%

In the interim, I added Malaria (replacing the dead Predictor project).

So my cumulative credit split 8 months later is:

Einstein - 19.4%
SETI - 17.1%
Climate - 15.7%
Rosetta - 15.5%
World Grid - 14.0%
Spinhenge - 13.0%
Malaria - 2.7%
Predictor - 1.5%
Climate BBC - 1.1%

My current weekly split is:

Spinhenge - 25.2%
Malaria - 19.9%
Climate - 19.8%
Rosetta - 14.6%
SETI - 10.4%
World Grid - 5.9%
Einstein - 4.2%


What drives resource share settings for me are project reliability, clean work units, and project status communication.

Spinhenge has been very solid, and as a small project it is quite good at letting folks know when they have issues. Malaria is a new project for me; it has run quite well (just periodic 1-hour bumps over the past few months). Climate has occasional problems, mostly with getting trickles reported; as a long-cycle project those problems resolve themselves, plus they have solid forum moderation. Rosetta has been quite solid (notwithstanding the recent issues), but has had some problems with work units that generated false positives for some AV software.

SETI is the 'big boy' project, and the workload sometimes makes it difficult to keep it running regularly. It has (in my view) the best admin support communication, plus of course the largest (and very active) user community. But I can't run it on workstations I don't monitor closely because of the periodic 'black hole' workunits (which seem to particularly bother multi-core workstations). So I can only run it on the locally accessed farm.

World Grid seems to run well, but it is a rather different kettle of fish -- I don't participate in its newsgroups.

Then there is Einstein. I added it as my second project years ago when SETI was having teething problems, and it ran very solidly for a long time. But over time, the switch to their mid-range work units has been less happy for me, so I've reduced its share of cycles. Also, whenever they complete one major batch of work units and move to a newer 'formulation', the transition has not been pretty. We are seeing that now, and I may not reactivate it until the dust settles (at a guess, in two weeks).

arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 792405 - Posted: 4 Aug 2008, 0:01:05 UTC

My machines still have about 2 or 3 days of work left on them so I am not in any trouble yet.

I have my prefs set to .1 and 4 days.
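
For reference, here is a minimal sketch of how the ".1 and 4 days" preferences ("connect about every X days" plus the additional work buffer) translate into a target cache, assuming the simple sum used by 2008-era clients; the exact work-fetch logic varies between BOINC versions.

    # Sketch of the ".1 and 4 days" cache setting, assuming the client
    # targets roughly (connect interval + additional buffer) days of work.
    connect_every_days = 0.1   # "Computer is connected ... about every X days"
    additional_days = 4.0      # "Maintain enough work for an additional X days"

    buffer_days = connect_every_days + additional_days
    buffer_seconds = buffer_days * 86400
    print(f"Target cache: ~{buffer_days:.1f} days (~{buffer_seconds:,.0f} seconds of work)")
    # -> about a 4-day cushion, which is why a 2-3 day outage still
    #    leaves a day or two of work on hand.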


Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 792417 - Posted: 4 Aug 2008, 0:14:14 UTC - in response to Message 792405.  

I have my prefs set to .1 and 4 days.

Same here, got about 2 days worth left to go.
Grant
Darwin NT

Andre Howard
Volunteer tester
Joined: 16 May 99
Posts: 124
Credit: 217,463,217
RAC: 0
United States
Message 792425 - Posted: 4 Aug 2008, 0:27:30 UTC - in response to Message 792417.  

I have my prefs set to .1 and 4 days.

Same here, got about 2 days worth left to go.



Seems to be a popular preferences setting, same here too.


kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 792428 - Posted: 4 Aug 2008, 0:30:26 UTC

Well, I hope that the boyz (Eric and Matt) can squeeze some more horsepower out of the splitters tomorrow, or by Tuesday's outage at the latest.
"Freedom is just Chaos, with better lighting." Alan Dean Foster


Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 792429 - Posted: 4 Aug 2008, 0:32:28 UTC - in response to Message 792425.  

I have my prefs set to .1 and 4 days.

Same here, got about 2 days worth left to go.

Seems to be a popular preferences setting, same here too.

Since running with a 4-day cache I think I've only run out of work twice in the last 3-4 years, and one of those times was only for a few hours.
Most outages are only a day or so, but a 4-day buffer gives you enough work for the extended ones.
Grant
Darwin NT

BarryAZ
Joined: 1 Apr 01
Posts: 2580
Credit: 16,982,517
RAC: 0
United States
Message 792443 - Posted: 4 Aug 2008, 2:40:02 UTC - in response to Message 792405.  

It makes sense to set a higher work buffer if you're running only one or two projects. For me, a buffer of 1.5 days works OK; when a project has problems sending work, one of the other 4 active projects tends to pick up the slack.

What I did do recently was increase my 'three project' workstations to four or five projects, as insurance.



My machines still have about 2 or 3 days of work left on them so I am not in any trouble yet.

I have my prefs set to .1 and 4 days.



Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 792449 - Posted: 4 Aug 2008, 2:59:33 UTC - in response to Message 792443.  

It makes sense to set a higher work buffer if you're running only one or two projects. For me, a buffer of 1.5 days works OK; when a project has problems sending work, one of the other 4 active projects tends to pick up the slack.

The only problem with having a work buffer when connected to more than one project is that people tend to get in a lather when BOINC starts paying back any long term debt that's accumulated & doesn't download or process any work for their other projects for a while.
The consternation & anguish can be considerable.

Grant
Darwin NT
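
As a toy illustration of that debt behaviour (a simplified model, not the actual BOINC scheduler code): assume two projects with equal resource shares, give each hour of CPU to whichever available project has the highest accumulated debt, and take one project offline for two days.

    # Toy model of BOINC long-term debt: while a project is down its debt
    # grows, and once it returns it gets nearly all the CPU until the debt
    # is repaid. Shares and durations here are made up for illustration.
    projects = {"SETI": 0.5, "Einstein": 0.5}   # hypothetical resource shares
    debt = {name: 0.0 for name in projects}

    def crunch_hour(available):
        """Give this hour's CPU to the available project with the highest debt,
        then adjust every project's debt by (its share) - (what it actually got)."""
        chosen = max(available, key=lambda name: debt[name])
        for name, share in projects.items():
            got = 1.0 if name == chosen else 0.0
            debt[name] += share - got
        return chosen

    for hour in range(96):
        # SETI has no work for the first 48 hours, then comes back.
        available = ["Einstein"] if hour < 48 else ["SETI", "Einstein"]
        crunch_hour(available)
        if hour in (47, 95):
            print(f"after hour {hour + 1}: debt =", {k: round(v, 1) for k, v in debt.items()})

    # After the 48-hour outage SETI has built up +24 hours of debt, so for
    # the next two days it gets essentially all the CPU time while the other
    # project sits idle, which is the behaviour that alarms people.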

BarryAZ
Joined: 1 Apr 01
Posts: 2580
Credit: 16,982,517
RAC: 0
United States
Message 792797 - Posted: 4 Aug 2008, 18:54:51 UTC

One of the things that is different about the current lack-of-work problem here is that, like Einstein's, it is something of a self-generated one.

Typically, outages here are due to various surprises: a particular server spilling its guts, a connectivity problem, some previously unknown code problem, that sort of thing.

It seems that in this case, in working on getting out the new-style work units, something didn't get 'volume tested', and so the rollout has caused a fairly extensive (several days now) problem.

I'm hoping it gets resolved soon though.

Einstein got bitten (even worse) by their new work unit rollout; they have been pretty much offline for a week now.

I am really glad that the other projects I've attached to have been running reasonably well this past week.


Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 793693 - Posted: 6 Aug 2008, 11:30:34 UTC


Could the storm be over?
Outgoing traffic is down to 68Mb/s, incoming down to 10.6Mb/s. The ready-to-send buffer is full &, best of all, it's staying full while the result creation rate is around 15/s.
Grant
Darwin NT

kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 793740 - Posted: 6 Aug 2008, 14:40:02 UTC - in response to Message 793693.  


Could the storm be over?
Outgoing traffic is down to 68Mb/s, incoming down to 10.6Mb/s. The ready-to-send buffer is full &, best of all, it's staying full while the result creation rate is around 15/s.

Lookin' good to me... at least on the surface. I am sure there are database issues that are still being dealt with, as well as AP rollout issues. But at least hosts asking for work are likely to get it now, and uploads no longer seem to be an issue.
"Freedom is just Chaos, with better lighting." Alan Dean Foster


Dorphas
Joined: 16 May 99
Posts: 118
Credit: 8,007,247
RAC: 0
United States
Message 794019 - Posted: 7 Aug 2008, 3:06:26 UTC

grrrrrrrrrrrrrrr.. :(

kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 794516 - Posted: 8 Aug 2008, 8:35:15 UTC

Hmmmm... something for the boyz to look at in the morning.
The Scarecrow graphs show workunits awaiting validation steadily on the rise.
The status page shows ap_validate on bruno disabled; not sure if these are all AP WUs awaiting validation or if something else is afoot.
"Freedom is just Chaos, with better lighting." Alan Dean Foster


Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 794519 - Posted: 8 Aug 2008, 9:11:03 UTC
Last modified: 8 Aug 2008, 9:11:42 UTC

Been getting occasional upload/download errors; OK on the 2nd or 3rd attempt.
Had a look at the network graphs & they show a lot of traffic spikes, mostly download but a few upload ones in there as well. And the overall traffic trend appears to be upward...
Grant
Darwin NT