Panic Mode On (79) Server Problems?

Mark Fiske
Joined: 15 Aug 11
Posts: 713
Credit: 7,392,921
RAC: 0
United States
Message 1310060 - Posted: 25 Nov 2012, 6:29:29 UTC - in response to Message 1310044.  

Well, I wasn't expecting this, but I just got 94 CPU WUs out of the blue. Happy camper!

Mark
ID: 1310060
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13715
Credit: 208,696,464
RAC: 304
Australia
Message 1310067 - Posted: 25 Nov 2012, 7:26:14 UTC - in response to Message 1310044.  

Timeouts, Timeouts, Timeouts!!!!!!!

Now it's Couldn't connect to server, Couldn't connect to server, Couldn't connect to server!!!

Back to Timeouts, Timeouts, Timeouts!!!!!!! again.
Grant
Darwin NT
ID: 1310067
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13715
Credit: 208,696,464
RAC: 304
Australia
Message 1310075 - Posted: 25 Nov 2012, 7:57:21 UTC - in response to Message 1310067.  
Last modified: 25 Nov 2012, 7:57:46 UTC

Timeouts, Timeouts, Timeouts!!!!!!!

Now it's Couldn't connect to server, Couldn't connect to server, Couldn't connect to server!!!

Back to Timeouts, Timeouts, Timeouts!!!!!!! again.


I think it's now just throwing random errors. Timeouts, couldn't connect & failure when receiving data from the peer depending on the mood it's in.


I've even received a "No tasks sent", but that was on some GPU requests. I got a whole bunch of VLARs on one CPU request, so at least that particular response makes sense.
Grant
Darwin NT
ID: 1310075
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1310111 - Posted: 25 Nov 2012, 10:11:29 UTC - in response to Message 1310075.  

Use a proxy; with one, the communication errors are minimized, and you could easily rebuild your caches.
ID: 1310111
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13715
Credit: 208,696,464
RAC: 304
Australia
Message 1310206 - Posted: 25 Nov 2012, 18:03:01 UTC - in response to Message 1310111.  

Use a proxy; with one, the communication errors are minimized, and you could easily rebuild your caches.

Then I've got the hassle of finding a working proxy, then finding a new one every few days when the working one no longer does.
Grant
Darwin NT
ID: 1310206
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13715
Credit: 208,696,464
RAC: 304
Australia
Message 1310213 - Posted: 25 Nov 2012, 18:08:45 UTC - in response to Message 1310206.  


Just noticed that the AP Science Database & Assimilators haven't been running for a few days- lots of work to be assimilated is backing up.
Grant
Darwin NT
ID: 1310213
Keith White
Joined: 29 May 99
Posts: 392
Credit: 13,035,233
RAC: 22
United States
Message 1310216 - Posted: 25 Nov 2012, 18:16:03 UTC

Just posting what I'm currently getting in my event log since it seems to succeed within an hour of me posting what I'm currently getting in my event log. I don't question the voodoo, I just go with it.

11/25/2012 1:01:03 PM | SETI@home | Sending scheduler request: To fetch work.
11/25/2012 1:01:03 PM | SETI@home | Reporting 24 completed tasks, requesting new tasks for CPU and ATI
11/25/2012 1:01:25 PM | | Project communication failed: attempting access to reference site
11/25/2012 1:01:25 PM | SETI@home | Scheduler request failed: Couldn't connect to server
11/25/2012 1:01:26 PM | | Internet access OK - project servers may be temporarily down.
11/25/2012 1:03:06 PM | SETI@home | Sending scheduler request: To fetch work.
11/25/2012 1:03:06 PM | SETI@home | Reporting 24 completed tasks, requesting new tasks for CPU and ATI
11/25/2012 1:03:29 PM | | Project communication failed: attempting access to reference site
11/25/2012 1:03:29 PM | SETI@home | Scheduler request failed: Couldn't connect to server
11/25/2012 1:03:30 PM | | Internet access OK - project servers may be temporarily down.
11/25/2012 1:06:11 PM | SETI@home | Sending scheduler request: To fetch work.
11/25/2012 1:06:11 PM | SETI@home | Reporting 24 completed tasks, requesting new tasks for CPU and ATI
11/25/2012 1:06:34 PM | | Project communication failed: attempting access to reference site
11/25/2012 1:06:34 PM | SETI@home | Scheduler request failed: Couldn't connect to server
11/25/2012 1:06:35 PM | | Internet access OK - project servers may be temporarily down.
11/25/2012 1:12:29 PM | SETI@home | Sending scheduler request: To fetch work.
11/25/2012 1:12:29 PM | SETI@home | Reporting 24 completed tasks, requesting new tasks for CPU and ATI
11/25/2012 1:12:52 PM | | Project communication failed: attempting access to reference site
11/25/2012 1:12:52 PM | SETI@home | Scheduler request failed: Couldn't connect to server
11/25/2012 1:12:54 PM | | Internet access OK - project servers may be temporarily down.
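
For anyone who would rather tally these than eyeball them, here is a minimal Python sketch that counts scheduler outcomes in event-log lines shaped like the excerpt above. The function name `summarize` and the regular expression are assumptions written for this example, based only on the line format shown in this thread; this is not a BOINC tool, and other client versions or locales may format the log differently.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpts in this thread, e.g.
# "11/25/2012 1:01:25 PM | SETI@home | Scheduler request failed: Couldn't connect to server"
# The project field may be empty ("... PM | | Internet access OK ...").
LINE_RE = re.compile(
    r"^(?P<stamp>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M) \| "
    r"(?P<project>[^|]*)\| (?P<message>.*)$"
)


def summarize(lines):
    """Count scheduler requests sent, completed, and failed (grouped by reason)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not in the expected format
        message = match.group("message")
        if message.startswith("Scheduler request failed:"):
            reason = message.split(":", 1)[1].strip()
            counts["failed: " + reason] += 1
        elif message.startswith("Scheduler request completed"):
            counts["completed"] += 1
        elif message.startswith("Sending scheduler request"):
            counts["sent"] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python tally.py < eventlog.txt
    for key, value in sorted(summarize(sys.stdin).items()):
        print(f"{value:5d}  {key}")
```

Grouping failures by the text after the colon keeps "Couldn't connect to server" separate from timeouts, which is the distinction people in this thread keep reporting.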


"Life is just nature's way of keeping meat fresh." - The Doctor
ID: 1310216
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1310226 - Posted: 25 Nov 2012, 18:29:37 UTC - in response to Message 1310216.  

Just posting what I'm currently getting in my event log since it seems to succeed within an hour of me posting what I'm currently getting in my event log. I don't question the voodoo, I just go with it.



I've been getting a lot of can't connect errors here too. Not on your end.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1310226
Keith White
Joined: 29 May 99
Posts: 392
Credit: 13,035,233
RAC: 22
United States
Message 1310230 - Posted: 25 Nov 2012, 18:34:01 UTC - in response to Message 1310226.  

Just posting what I'm currently getting in my event log since it seems to succeed within an hour of me posting what I'm currently getting in my event log. I don't question the voodoo, I just go with it.



I've been getting a lot of can't connect errors here too. Not on your end.

See what I mean... VOODOO!

11/25/2012 1:12:29 PM | SETI@home | Sending scheduler request: To fetch work.
11/25/2012 1:12:29 PM | SETI@home | Reporting 24 completed tasks, requesting new tasks for CPU and ATI
11/25/2012 1:12:52 PM | | Project communication failed: attempting access to reference site
11/25/2012 1:12:52 PM | SETI@home | Scheduler request failed: Couldn't connect to server
11/25/2012 1:12:54 PM | | Internet access OK - project servers may be temporarily down.
11/25/2012 1:27:50 PM | SETI@home | Sending scheduler request: To fetch work.
11/25/2012 1:27:50 PM | SETI@home | Reporting 24 completed tasks, requesting new tasks for CPU and ATI
11/25/2012 1:29:03 PM | SETI@home | Scheduler request completed: got 11 new tasks

"Life is just nature's way of keeping meat fresh." - The Doctor
ID: 1310230
Lionel
Joined: 25 Mar 00
Posts: 680
Credit: 563,640,304
RAC: 597
Australia
Message 1310389 - Posted: 26 Nov 2012, 4:43:53 UTC - in response to Message 1310111.  

Use a proxy; with one, the communication errors are minimized, and you could easily rebuild your caches.


As Grant has basically said, it doesn't always work...they are waking up to the traffic that seti puts through and soon this avenue will be closed for many of us...what they need to do is increase the bandwidth beyond 100Mbps...

ID: 1310389
Lionel
Joined: 25 Mar 00
Posts: 680
Credit: 563,640,304
RAC: 597
Australia
Message 1310390 - Posted: 26 Nov 2012, 4:46:42 UTC - in response to Message 1310230.  

Just posting what I'm currently getting in my event log since it seems to succeed within an hour of me posting what I'm currently getting in my event log. I don't question the voodoo, I just go with it.



I've been getting a lot of can't connect errors here too. Not on your end.

See what I mean... VOODOO!

11/25/2012 1:12:29 PM | SETI@home | Sending scheduler request: To fetch work.
11/25/2012 1:12:29 PM | SETI@home | Reporting 24 completed tasks, requesting new tasks for CPU and ATI
11/25/2012 1:12:52 PM | | Project communication failed: attempting access to reference site
11/25/2012 1:12:52 PM | SETI@home | Scheduler request failed: Couldn't connect to server
11/25/2012 1:12:54 PM | | Internet access OK - project servers may be temporarily down.
11/25/2012 1:27:50 PM | SETI@home | Sending scheduler request: To fetch work.
11/25/2012 1:27:50 PM | SETI@home | Reporting 24 completed tasks, requesting new tasks for CPU and ATI
11/25/2012 1:29:03 PM | SETI@home | Scheduler request completed: got 11 new tasks


getting a lot of that here as well...suspect that many of us will be...




ID: 1310390
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13715
Credit: 208,696,464
RAC: 304
Australia
Message 1310413 - Posted: 26 Nov 2012, 7:08:23 UTC - in response to Message 1310389.  

Use a proxy; with one, the communication errors are minimized, and you could easily rebuild your caches.


As Grant has basically said, it doesn't always work...they are waking up to the traffic that seti puts through and soon this avenue will be closed for many of us...what they need to do is increase the bandwidth beyond 100Mbps...

That would help (massively- till the next bottleneck is hit), but what doesn't make sense is why using a proxy does give better connections & speeds than not using one?
Grant
Darwin NT
ID: 1310413
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13715
Credit: 208,696,464
RAC: 304
Australia
Message 1310414 - Posted: 26 Nov 2012, 7:09:55 UTC - in response to Message 1310413.  
Last modified: 26 Nov 2012, 7:10:58 UTC

Even with all the weirdness going on, my systems have managed to stay busy while at work.

And while the inbound network traffic has been rather odd (little peaks here & there & gradually increasing overall) since coming back up after the multiple Scheduler breakdowns, there have been a couple of significant dips while I was away. And they also affected the download traffic.
Grant
Darwin NT
ID: 1310414
Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 30591
Credit: 53,134,872
RAC: 32
United States
Message 1310417 - Posted: 26 Nov 2012, 7:27:07 UTC - in response to Message 1310413.  

Use a proxy; with one, the communication errors are minimized, and you could easily rebuild your caches.


As Grant has basically said, it doesn't always work...they are waking up to the traffic that seti puts through and soon this avenue will be closed for many of us...what they need to do is increase the bandwidth beyond 100Mbps...

That would help (massively- till the next bottleneck is hit), but what doesn't make sense is why using a proxy does give better connections & speeds than not using one?

Because as Eric stated there is a problem upstream from SSL possibly in the Campus tunnel. Has nothing to do with pipe size.

Oh and can you imagine how much worse the scheduler ghosts woes would be if the pipe was 10X wider? Would there be 10X the number of ghosts?

IIRC Eric was able to get a test in and a 5X increase in pipe size hits a bottleneck that may not be surmountable. I also have a question, do you think the hardware can take 5X additional 24/7 or what will break next?

ID: 1310417
Lionel
Joined: 25 Mar 00
Posts: 680
Credit: 563,640,304
RAC: 597
Australia
Message 1310420 - Posted: 26 Nov 2012, 7:52:22 UTC - in response to Message 1310417.  

Use a proxy; with one, the communication errors are minimized, and you could easily rebuild your caches.


As Grant has basically said, it doesn't always work...they are waking up to the traffic that seti puts through and soon this avenue will be closed for many of us...what they need to do is increase the bandwidth beyond 100Mbps...

That would help (massively- till the next bottleneck is hit), but what doesn't make sense is why using a proxy does give better connections & speeds than not using one?

Because as Eric stated there is a problem upstream from SSL possibly in the Campus tunnel. Has nothing to do with pipe size.

Oh and can you imagine how much worse the scheduler ghosts woes would be if the pipe was 10X wider? Would there be 10X the number of ghosts?

IIRC Eric was able to get a test in and a 5X increase in pipe size hits a bottleneck that may not be surmountable. I also have a question, do you think the hardware can take 5X additional 24/7 or what will break next?


maximum throughput (down to us) is governed by the maximum rate at which work can be created and sent...that is the natural ceiling...given a maximum down rate you can approximate a maximum up rate based on average returned work unit size...the pipe should be wider than these to allow for other things such as overhead/management traffic...
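
To put rough numbers on that reasoning, here is a minimal Python sketch. The function name `required_link_mbps` and every input value are hypothetical placeholders chosen for illustration; none of them are measured SETI@home figures, and the overhead allowance is simply a fraction you pick.

```python
def required_link_mbps(tasks_per_second, download_bytes, upload_bytes, overhead_fraction):
    """Return (down_mbps, up_mbps) needed to sustain a given task rate.

    tasks_per_second  -- rate at which work can be created and sent (assumed)
    download_bytes    -- average size of a task sent to a host (assumed)
    upload_bytes      -- average size of a returned result (assumed)
    overhead_fraction -- extra headroom for scheduler/management traffic, e.g. 0.1
    """
    # bytes/s -> bits/s -> megabits/s
    down_mbps = tasks_per_second * download_bytes * 8 / 1e6
    up_mbps = tasks_per_second * upload_bytes * 8 / 1e6
    headroom = 1 + overhead_fraction
    return down_mbps * headroom, up_mbps * headroom


if __name__ == "__main__":
    # Placeholder inputs only: swap in real averages if you have them.
    down, up = required_link_mbps(
        tasks_per_second=30,
        download_bytes=370_000,
        upload_bytes=30_000,
        overhead_fraction=0.10,
    )
    print(f"down: {down:.1f} Mbps, up: {up:.1f} Mbps")
```

The point of the sketch is only the shape of the calculation: the creation rate sets the download ceiling, the same rate times the average result size gives the upload side, and the pipe needs some margin on top of both.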



ID: 1310420
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13715
Credit: 208,696,464
RAC: 304
Australia
Message 1310421 - Posted: 26 Nov 2012, 7:55:41 UTC - in response to Message 1310417.  
Last modified: 26 Nov 2012, 7:56:29 UTC

Oh and can you imagine how much worse the scheduler ghosts woes would be if the pipe was 10X wider? Would there be 10X the number of ghosts?

Maybe, maybe not.
When the Scheduler was using the campus network, it was responding in less than 7 seconds, often within 2-4.
So it would appear the network congestion is a factor: remove it & no more ghosts at all.


IIRC Eric was able to get a test in and a 5X increase in pipe size hits a bottleneck that may not be surmountable. I also have a question, do you think the hardware can take 5X additional 24/7 or what will break next?

Keep in mind if there were a 5-fold increase in available bandwidth, the load on the servers would drop 5 times faster.
The load would probably be less than it is now because there wouldn't be all the retries going on, or the accumulation of ghosts.

I have no doubt we'd find some new major problem sooner rather than later, but it would erase completely several existing ones.
Grant
Darwin NT
ID: 1310421
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13715
Credit: 208,696,464
RAC: 304
Australia
Message 1310428 - Posted: 26 Nov 2012, 8:56:18 UTC - in response to Message 1310421.  


Inbound & outbound traffic is plummeting. Hopefully it'll recover again like it's done twice previously.
*fingers crossed*
Grant
Darwin NT
ID: 1310428
musicplayer
Joined: 17 May 10
Posts: 2430
Credit: 926,046
RAC: 0
Message 1310436 - Posted: 26 Nov 2012, 9:29:36 UTC

And I got a new batch of jobs coming my way. Thanks!
ID: 1310436
tbret
Volunteer tester
Joined: 28 May 99
Posts: 3380
Credit: 296,162,071
RAC: 40
United States
Message 1310463 - Posted: 26 Nov 2012, 12:33:56 UTC - in response to Message 1310421.  



I have no doubt we'd find some new major problem sooner rather than later, but it would erase completely several existing ones.



It would be fun to play some other game for a while.

Let's try it and see!

ID: 1310463
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1310487 - Posted: 26 Nov 2012, 15:22:48 UTC

I've been speculating for at least two years now that if we were able to increase the pipe to even just 200mbit, I'm not sure the scheduler/feeder would handle it. Even two years ago when GPUs were much slower and less common, the database was having trouble keeping up. Of course, we have better hardware now, but I think it's significantly more likely we'll run into a software limitation, on top of the disk I/O limitation.

Bigger pipe will likely cause more issues without some sort of restraint (per-host limits are a simple way to do it, but there are better ways.. like server-side cache size based on DCF). It was a good idea to run the scheduler on a different link, as long as that can be reliable. It will at least allow a high rate of successful contacts to report work and be assigned new work, and then you just have to fight for bandwidth on the download link, which in the grand scheme of things, isn't that huge of an issue.

You wouldn't end up with ghosts, you'd just end up with 10+ hour back-offs, but you can overcome those with some manual intervention, or some less draconian exponential back-off calculations in the client itself.
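
As an illustration of what a "less draconian" back-off could look like, here is a minimal Python sketch of a capped exponential back-off with jitter. This is a generic technique, not the BOINC client's actual algorithm; the function name `backoff_delay`, the 60-second base and the 4-hour cap are all assumptions picked for the example.

```python
import random


def backoff_delay(failures, base=60.0, cap=4 * 3600.0, rng=random):
    """Seconds to wait after `failures` consecutive failed scheduler contacts.

    The ceiling doubles with each failure (base, 2*base, 4*base, ...) but
    never exceeds `cap`. The actual delay is drawn uniformly from the upper
    half of that range so that many hosts do not retry at the same instant.
    """
    if failures <= 0:
        return 0.0
    # Clamp the exponent so very long failure streaks cannot overflow a float.
    exponent = min(failures - 1, 32)
    ceiling = min(cap, base * 2 ** exponent)
    return rng.uniform(ceiling / 2, ceiling)


if __name__ == "__main__":
    for n in range(1, 11):
        print(f"after {n:2d} failures: wait about {backoff_delay(n) / 60:.1f} min")
```

With a cap like this, a long outage leaves a host retrying every few hours at worst instead of drifting out to 10+ hours, at the cost of somewhat more retry traffic while the servers are down.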

Maybe once the scheduler reliability issues get sorted out, we could test the database's ability to keep up over a 24-hour period by updating some DNS records and ramping the bandwidth up? Maybe pick a Saturday, or the upcoming winter holidays, when the campus will be empty except for a select few faculty members. We could do a 1-3 day test on 200+ mbit then, assuming the red tape can be removed temporarily for such a thing.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1310487

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.