Panic Mode On (48) Server problems?

Iona
Joined: 12 Jul 07
Posts: 790
Credit: 22,438,118
RAC: 0
United Kingdom
Message 1120071 - Posted: 22 Jun 2011, 17:24:31 UTC

Hmmm. I wonder if the recent server problems have anything to do with the 17 orphan WUs that I've got, or should that be, not got?



Don't take life too seriously, as you'll never come out of it alive!
ID: 1120071 · Report as offensive
JohnDK Crowdfunding Project Donor*Special Project $250 donor
Volunteer tester
Joined: 28 May 00
Posts: 1222
Credit: 451,243,443
RAC: 1,127
Denmark
Message 1120073 - Posted: 22 Jun 2011, 17:26:28 UTC - in response to Message 1120067.  

Ok, uploads are working again. Now, take it easy folks, don't break something again :-)

Yes, keep the finger away from the retry button... As much as possible lol :)
ID: 1120073 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1120076 - Posted: 22 Jun 2011, 17:37:14 UTC

Got all the backed up work uploaded and reported from the Frozen 920.

But it appears the validators might not be working, as all 600 some tasks went directly to the pending bin.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1120076 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1120081 - Posted: 22 Jun 2011, 17:48:17 UTC - in response to Message 1120051.  


What I have seen lately is an unusual number of 'shorties'. When the ratio of 'shorties' drops back to the levels I noticed before, a lot of the problems will be gone. On my GPU I can do at least 4 shorties in the same time it takes to process a single 'long' workunit. Now there will always be some shorties, so you can't expect the number of workunits processed (needing both bandwidth and transaction resources for processing) to go down by a factor of four, but I think a factor of two should be possible. Maybe the servers and the internet connection can keep up again when this happens ...

Tom
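
A rough sketch of the arithmetic in the quote above, for anyone who wants to play with the numbers. It assumes a 'long' workunit takes one hour and a shorty takes a quarter of that; the 4:1 ratio is Tom's figure, while the one-hour baseline and the function name `tasks_per_hour` are made up purely for illustration.

```python
def tasks_per_hour(shorty_fraction, long_hours=1.0, speedup=4.0):
    """Tasks finished per hour by one device for a given task mix.

    shorty_fraction: share of tasks (by count) that are shorties, 0.0 to 1.0.
    long_hours:      assumed run time of one 'long' workunit (hypothetical).
    speedup:         how many shorties fit in one long workunit's run time.
    """
    shorty_hours = long_hours / speedup
    # Average run time of one task in this mix.
    mean_hours = shorty_fraction * shorty_hours + (1.0 - shorty_fraction) * long_hours
    return 1.0 / mean_hours


if __name__ == "__main__":
    for fraction in (1.0, 2.0 / 3.0, 0.0):
        rate = tasks_per_hour(fraction)
        print(f"shorty share {fraction:.2f}: about {rate:.2f} tasks/hour")
```

Under these assumptions, an all-shorty diet means four times as many tasks (and so uploads, downloads and database transactions) per hour as an all-long diet, and dropping from all shorties to roughly two thirds shorties already halves the task rate, which is in line with the "factor of two" guess.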

From now until the ALFA receiver system is brought down for servicing in late July, it will be used only by the A2130 GALFACTS project, which produces only VHAR "shorties", according to the aoschedule page. The A2010 ALFALFA observations which give most of the midrange tasks are not being done now; their "spring 2011" campaign ended March 21 or so. They would have started "fall 2011" observations in August, but will have to wait until ALFA is back in operation.

The A2133 Alfa Ultra-Deep Survey which produces mostly VLAR tasks was getting observing time most days until June 5; that will help the bandwidth. And there were some A2048 AGES and P2030 Pulsar survey observations in May which will help somewhat.

IOW, while we're basically doing work in the order received from Arecibo, the trend is likely to be toward more shorties. I have no idea how much uncrunched data is in storage. My guess is the project may work toward being able to use the GBT Kepler field observations rather than pulling Arecibo data out of storage.
Joe
ID: 1120081 · Report as offensive
Profile Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1120083 - Posted: 22 Jun 2011, 17:54:00 UTC - in response to Message 1120076.  

Got all the backed up work uploaded and reported from the Frozen 920.

But it appears the validators might not be working, as all 600 some tasks went directly to the pending bin.

[As of 22 Jun 2011 | 17:30:05 UTC] the sahvalidators on bruno were not running. They may need more of a "kick" than the upload server did.
Donald
Infernal Optimist / Submariner, retired
ID: 1120083 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1120087 - Posted: 22 Jun 2011, 17:58:12 UTC - in response to Message 1120083.  

Got all the backed up work uploaded and reported from the Frozen 920.

But it appears the validators might not be working, as all 600 some tasks went directly to the pending bin.

[As of 22 Jun 2011 | 17:30:05 UTC] the sahvalidators on bruno were not running. They may need more of a "kick" than the upload server did.

A little birdy just told me (kitties luv birdies) that Bruno is the main culprit lately. Might have a failing CPU or worst case, mobo. A CPU transplant is scheduled for some time next week, in the hopes that will cure Bruno's ills.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1120087 · Report as offensive
Profile perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1120090 - Posted: 22 Jun 2011, 18:00:05 UTC - in response to Message 1120076.  

Well, once they started, my uploads flew in. As to the validators on Bruno, I guess they can't kick him as hard anymore. He's getting too old for that. They have to just kind of gently nudge him in the direction they want him to go and hope for the best. :-)


PROUD MEMBER OF Team Starfire World BOINC
ID: 1120090 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1120094 - Posted: 22 Jun 2011, 18:05:47 UTC - in response to Message 1120092.  

Got all the backed up work uploaded and reported from the Frozen 920.

But it appears the validators might not be working, as all 600 some tasks went directly to the pending bin.

[As of 22 Jun 2011 | 17:30:05 UTC] the sahvalidators on bruno were not running. They may need more of a "kick" than the upload server did.

A little birdy just told me (kitties luv birdies) that Bruno is the main culprit lately. Might have a failing CPU or worst case, mobo. A CPU transplant is scheduled for some time next week, in the hopes that will cure Bruno's ills.


So, the CPU has scheduled crashes at 7:00 UTC? Somehow I find that hard to believe.

Dunno, unless there could be a stressful cron job that launches at that time.
Perhaps Synergy can be put back on the validation task until dear Bruno gets his feet back under him.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1120094 · Report as offensive
justsomeguy

Joined: 27 May 99
Posts: 84
Credit: 6,084,595
RAC: 11
United States
Message 1120099 - Posted: 22 Jun 2011, 18:12:20 UTC - in response to Message 1120063.  


Well, then they'd better try to find out what is happening in the lab at 7:00 UTC. I agree with you that it is indeed very strange that many of the recent "crashes" have occurred at more or less exactly 7:00 UTC. It is not normal for a hardware problem to always happen at the same time.



IIRC, we had an issue with the Sun boxes running Red Hat (RHEL4) that would force boxes down when it rolled the date over. I believe this was partly fixed by relaxing a cron process to run every 30 minutes instead of every 15, plus a BIOS update.

However, in this case, I don't think thumper has been the primary culprit.
Although it has certainly had its RAID issues.

Kevin

"Two things are infinite: The universe and human stupidity; and I'm not sure about the universe." - Albert Einstein

ID: 1120099 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1120105 - Posted: 22 Jun 2011, 18:22:12 UTC - in response to Message 1120100.  

Got all the backed up work uploaded and reported from the Frozen 920.

But it appears the validators might not be working, as all 600 some tasks went directly to the pending bin.

[As of 22 Jun 2011 | 17:30:05 UTC] the sahvalidators on bruno were not running. They may need more of a "kick" than the upload server did.

A little birdy just told me (kitties luv birdies) that Bruno is the main culprit lately. Might have a failing CPU or worst case, mobo. A CPU transplant is scheduled for some time next week, in the hopes that will cure Bruno's ills.


So, the CPU has scheduled crashes at 7:00 UTC? Somehow I find that hard to believe.

Dunno, unless there could be a stressful cron job that launches at that time.
Perhaps Synergy can be put back on the validation task until dear Bruno gets his feet back under him.


But just a few days ago, they had Synergy handle everything except uploads, and maybe one more thing. Nevertheless, there was the same lockup as we saw at 07:00 UTC today/yesterday (depending on your location.)

Well, I dunno. Maybe we should just quit and throw a tantrum :-)

It's good to know that it's never too LATE to give up.

LOL...I'll wait to give up until I get home from work in 11 hours and see what's kickin'.

And then I'll wait some more...years, that is.

"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1120105 · Report as offensive
Profile Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1120108 - Posted: 22 Jun 2011, 18:31:40 UTC - in response to Message 1120087.  
Last modified: 22 Jun 2011, 18:32:10 UTC

A little birdy just told me (kitties luv birdies) that Bruno is the main culprit lately. Might have a failing CPU or worst case, mobo. A CPU transplant is scheduled for some time next week, in the hopes that will cure Bruno's ills.

That is useful information. Hope the transplant works.
Makes it easier to be patient, not just with the S@H servers,
but also with the folks who are not patient with the S@H servers (8{b)
Donald
Infernal Optimist / Submariner, retired
ID: 1120108 · Report as offensive
DJStarfox

Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 1120128 - Posted: 22 Jun 2011, 19:18:42 UTC - in response to Message 1120108.  

A little birdy just told me (kitties luv birdies) that Bruno is the main culprit lately. Might have a failing CPU or worst case, mobo. A CPU transplant is scheduled for some time next week, in the hopes that will cure Bruno's ills.


I've never seen a (properly handled) CPU go bad. Always motherboard.
ID: 1120128 · Report as offensive
Profile Gary Charpentier Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 30930
Credit: 53,134,872
RAC: 32
United States
Message 1120183 - Posted: 22 Jun 2011, 21:48:07 UTC - in response to Message 1120179.  

Well, I wonder if they're going to turn on sah validators before they go home, or if we have to guess our RAC for yet more hours...

Just so long as they turn them on right after the once-every-24-hours stats export is done.

ID: 1120183 · Report as offensive
Profile S@NL - XP_Freak

Joined: 10 Jul 99
Posts: 99
Credit: 6,248,265
RAC: 0
Netherlands
Message 1120203 - Posted: 22 Jun 2011, 22:26:19 UTC - in response to Message 1120195.  

Well, I think the validators just came online. My credit is starting to climb.

Starting with the northern crunchers? :)
For I still get nothing. :(


Goodbye Seti Classic
ID: 1120203 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.