Panic Mode On (51) Server problems?

Message boards : Number crunching : Panic Mode On (51) Server problems?

kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1133037 - Posted: 28 Jul 2011, 17:04:53 UTC

Dunno what's going on, but the splitters have still not picked up the pace since the outrage.
Very little work coming down the pipeline to this cruncher.

Not even saturating the bandwidth.

Meow?


"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1133037
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1133332 - Posted: 29 Jul 2011, 4:16:20 UTC - in response to Message 1133037.  


The splitters still aren't producing enough work; they've actually cut back further on their production levels.
Grant
Darwin NT
ID: 1133332
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1133356 - Posted: 29 Jul 2011, 5:50:23 UTC

Hmmm.....
Returning home from work, I see that work generation is still in the ol' crapper.

Wonder what's holding it back.

Meowfffffffft.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1133356
Bernd Noessler
Joined: 15 Nov 09
Posts: 99
Credit: 52,635,434
RAC: 0
Germany
Message 1133367 - Posted: 29 Jul 2011, 6:39:32 UTC - in response to Message 1133348.  

@Sten-Arne

The Apache server on port 80 of 208.68.240.13 has been down since Monday.
As a workaround, you can add an entry to the hosts file on your machine:

208.68.240.18 boinc2.ssl.berkeley.edu

Then restart BOINC and the downloads should work.
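
Before editing the hosts file, it may be worth confirming which address is actually answering. The following is only a rough Python sketch, assuming that a successful TCP connect on port 80 is a good enough stand-in for "Apache is up". The helper names `port_open` and `pick_hosts_entry` are made up for illustration; only the two IPs and the hostname come from this thread, and they may no longer be valid.

```python
import socket


def port_open(ip, port, timeout=3.0):
    """Return True if a TCP connection to ip:port succeeds within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def pick_hosts_entry(hostname, candidates, port=80, probe=port_open):
    """Return a hosts-file line for the first candidate IP that answers, else None.

    `probe` is injectable (hypothetical hook) so the selection logic can be
    tried without touching the network.
    """
    for ip in candidates:
        if probe(ip, port):
            return f"{ip} {hostname}"
    return None
```

Calling `pick_hosts_entry("boinc2.ssl.berkeley.edu", ["208.68.240.13", "208.68.240.18"])` would return the line to paste into the hosts file, or `None` if neither address accepts connections on port 80.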
ID: 1133367
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1133374 - Posted: 29 Jul 2011, 6:47:38 UTC - in response to Message 1133367.  

I'm sure that the guys will sort things out in the morning. ;)

Cheers.
ID: 1133374
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1133384 - Posted: 29 Jul 2011, 6:59:39 UTC - in response to Message 1133374.  

I'm sure that the guys will sort things out in the morning. ;)

Cheers.

I kinda thought they would do that today.....
Things are just not flowing like they should be.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1133384
Bernd Noessler
Joined: 15 Nov 09
Posts: 99
Credit: 52,635,434
RAC: 0
Germany
Message 1133457 - Posted: 29 Jul 2011, 14:00:37 UTC - in response to Message 1133455.  


I mentioned it in a different thread on wednesday. But it seems nobody
has read it. :-(
ID: 1133457
Bernd Noessler
Joined: 15 Nov 09
Posts: 99
Credit: 52,635,434
RAC: 0
Germany
Message 1133467 - Posted: 29 Jul 2011, 14:31:39 UTC - in response to Message 1133459.  



Although something is strange/F'ed up anyhow. Shouldn't the system be set up so that a connect attempt to the servers would result in 50/50 or so connect attempts to either .13 or .18? In that way you would pretty soon connect to one server that is working, if one is down.



That has to be done on the client side. The TTL of the boinc2 entries is only
5 minutes, so your nameserver should give you .13 first and, after 5 minutes,
.18 in first place.
ID: 1133467
Bernd Noessler
Joined: 15 Nov 09
Posts: 99
Credit: 52,635,434
RAC: 0
Germany
Message 1133492 - Posted: 29 Jul 2011, 15:14:18 UTC - in response to Message 1133480.  



Well it could and should be done on the server side as a round robin function IMO. My DNS server did not in any situation try the working IP, it always tried the non working, and that for days. Flushing the DNS cache made no difference, rebooting made no difference, and I'm sure I'm not alone with this problem, which easily could be fixed on the server side.

Edit, added: It may be as it will with all that, I just have to edit the host file when needed.


You can check your nameserver with a ping. Send a ping to boinc2, wait 5 minutes
and then send another ping. The IPs should be different.
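
The two-ping check can also be scripted. This is a minimal Python sketch, assuming that the first address returned by the system resolver is the one a client would try first. The function names are hypothetical, and the OS resolver may cache or reorder answers, so a "no change" result does not prove the DNS side is at fault.

```python
import socket
import time


def first_ipv4(hostname, resolve=socket.getaddrinfo):
    """Return the first IPv4 address the resolver hands back for hostname."""
    infos = resolve(hostname, 80, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return infos[0][4][0]


def rotation_seen(hostname, wait_seconds=300, resolve=socket.getaddrinfo, sleep=time.sleep):
    """Resolve twice, wait_seconds apart, and report whether the first address changed.

    The 300-second default mirrors the 5-minute TTL mentioned above.
    Returns (address_before, address_after, changed).
    """
    before = first_ipv4(hostname, resolve)
    sleep(wait_seconds)
    after = first_ipv4(hostname, resolve)
    return before, after, before != after
```

Running `rotation_seen("boinc2.ssl.berkeley.edu")` blocks for five minutes and then returns both addresses plus a flag saying whether they differ.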

ID: 1133492
Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1133495 - Posted: 29 Jul 2011, 15:18:42 UTC - in response to Message 1133480.  

Well it could and should be done on the server side as a round robin function IMO. My DNS server did not in any situation try the working IP, it always tried the non working, and that for days. Flushing the DNS cache made no difference, rebooting made no difference, and I'm sure I'm not alone with this problem, which easily could be fixed on the server side.

Edit, added: It may be as it will with all that, I just have to edit the host file when needed.

AFAIK, it is implemented as round robin DNS - it's always looked that way when I've tracked it down. It's worth trying ipconfig /displaydns to find out what your local machine's DNS resolver currently thinks the IP address should be before/during/after a download request - it shows the current TTL timer countdown as well, which is useful.

If displaydns consistently shows the wrong address, then something upstream (DNS server/proxy/ISP) is mis-handling TTL. Or there might, indeed, be a mis-configuration at SETI - that would affect us all, and we can check that by comparing notes here.

There used to be a bug in BOINC, which Ned Ludd and I finally got them to acknowledge and fix in v6.10.33 (March 2009) - if BOINC had already tried a download, and failed, it carried on attempting to download from the same IP address for evermore, rather than re-querying DNS (which would pick up the round robin). It wasn't BOINC's fault - it was a bug in the underlying libcurl library that handles the TCP/IP layer. And it shouldn't be a problem in any current version of BOINC.
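
To illustrate why re-querying DNS matters, here is a small Python sketch contrasting the two retry patterns. It is not BOINC or libcurl code; `resolve` and `fetch` are hypothetical callables supplied by the caller, and the function names are invented for this example.

```python
def download_with_requery(hostname, resolve, fetch, attempts=4):
    """Try a download up to `attempts` times, asking DNS again before every try.

    resolve(hostname) -> IP string; fetch(ip) -> True on success.
    Returns the list of IPs tried, ending with the one that worked (if any).
    """
    tried = []
    for _ in range(attempts):
        ip = resolve(hostname)  # fresh lookup, so round robin can take effect
        tried.append(ip)
        if fetch(ip):
            break
    return tried


def download_pinned(hostname, resolve, fetch, attempts=4):
    """The faulty pattern described above: resolve once, then keep retrying that IP."""
    ip = resolve(hostname)
    tried = []
    for _ in range(attempts):
        tried.append(ip)
        if fetch(ip):
            break
    return tried
```

With a resolver that alternates between a dead and a live address, the first function reaches the live server on its second attempt, while the second one never leaves the dead address.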
ID: 1133495
Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1133511 - Posted: 29 Jul 2011, 15:31:07 UTC - in response to Message 1133506.  

I'm on 6.10.18 on both machines, and I refuse to upgrade, so I just have to live with it :-)

In that case, half your downloads will stall, and you will have to do a (carefully-timed) restart of BOINC to free them while the 'right' server is on DNS duty.

That's what I like about crunching for SETI, rather than other projects - it actually feels like you're doing some of the work yourself, not just leaving it to the computer. ;-)
ID: 1133511
Bernd Noessler
Joined: 15 Nov 09
Posts: 99
Credit: 52,635,434
RAC: 0
Germany
Message 1133520 - Posted: 29 Jul 2011, 15:39:04 UTC - in response to Message 1133504.  



It is not different, it is the same .13. Well that is for my Win 7 with or without flushdns. On my Vista though it changes every time I do a flushdns, no matter if it's been 5 minutes or not in between. The Vista machine did not have any download problems either.


I have tried it with Win XP in VirtualBox. I have BIND as the nameserver
on my local network. The IP changes every 5 minutes. Do you have a local
nameserver, or do you use the nameserver of your ISP?
ID: 1133520
Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1133522 - Posted: 29 Jul 2011, 15:40:35 UTC
Last modified: 29 Jul 2011, 15:41:37 UTC

I posted this in the other thread about this problem; is it maybe something better than the current round robin DNS?

Since it's not the first time that we have had problems like this here, I wonder if it would cause fewer problems if SETI had two different download server URLs, for example dl1.ssl.berkeley.edu and dl2.ssl.berkeley.edu, and sent both as possible download locations, as Rosetta does:

<url>http://srv3.bakerlab.org/rosetta/download/262/avgE_from_pdb.gz</url>
<url>http://boinc.bakerlab.org/rosetta/download/262/avgE_from_pdb.gz</url>
<url>http://srv4.bakerlab.org/rosetta/download/262/avgE_from_pdb.gz</url>
<url>http://srv3.bakerlab.org/rosetta/download/262/avgE_from_pdb.gz</url>
<url>http://boinc.bakerlab.org/rosetta/download/262/avgE_from_pdb.gz</url>
<url>http://srv4.bakerlab.org/rosetta/download/262/avgE_from_pdb.gz</url>



So for a SETI WU it could be:

<url>http://dl1.ssl.berkeley.edu/sah/download_fanout/61/08ap11ae.3480.1703.14.10.29</url>
<url>http://dl2.ssl.berkeley.edu/sah/download_fanout/61/08ap11ae.3480.1703.14.10.29</url>



Don't know how the load balancing works in that case. If the BOINC client picks just one of them, then that would be pretty easy, with no need for any big server-side changes. If the client starts from the top and tries one after the other, then the scheduler would have to send dl1,dl2 to all even-numbered results (_0, _2,...) and dl2,dl1 to all odd-numbered results. I think it might work better than the current way... but I might be wrong of course.
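
The even/odd ordering idea can be sketched in a few lines of Python. This is only an illustration of the proposal, not anything the SETI scheduler does: the function name is invented, the dl1/dl2 hostnames are the hypothetical ones suggested above, and the result name is assumed to end in "_<n>".

```python
def ordered_download_urls(
    result_name,
    path,
    hosts=("dl1.ssl.berkeley.edu", "dl2.ssl.berkeley.edu"),
):
    """Return download URLs for a result, rotating the preferred host by result number.

    result_name is assumed to end in "_<n>", e.g. "08ap11ae.3480.1703.14.10.29_0".
    With the two default hosts, even n puts dl1 first and odd n puts dl2 first.
    """
    number = int(result_name.rsplit("_", 1)[1])
    first = number % len(hosts)
    rotated = hosts[first:] + hosts[:first]
    return [f"http://{host}/{path}" for host in rotated]
```

A client that works through the list from the top would then spread first attempts roughly evenly across both servers, and still fall back to the other one if its first choice is down.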

ID: 1133522
Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1133526 - Posted: 29 Jul 2011, 15:48:43 UTC - in response to Message 1133514.  

Well, it's never happened before, or maybe it has but I've forgotten. It must have happened since I already had edited my host file long time ago, but commented out the boinc server parts.

Don't mind me, it's Alzheimers light I guess :-)

It has happened before, but it's an intermittent problem which keeps cropping up, hanging around for a while, and going away again.

I guess that because downloads are sort-of working, and they all go out over the same link, it doesn't show up as a problem on the lab monitoring tools: and they don't know it needs kicking until we kick up a fuss here, or someone on the 'inside' mailing distribution circuit passes on a message. Hint to mods?
ID: 1133526
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1133549 - Posted: 29 Jul 2011, 17:08:10 UTC

Now if we could just get the splitters back in high gear......

Meowgrrrrrrrrr.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1133549
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1133770 - Posted: 30 Jul 2011, 0:16:24 UTC

Could it be???
Did the boyz kick something into gear before locking up the lab for the weekend?

The Cricket graphs just maxxed for the first time in a while and splitter speed is up.

More power, Scotty!!!!
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1133770
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1133900 - Posted: 30 Jul 2011, 4:25:57 UTC - in response to Message 1133770.  


Well, it was nice while it lasted.
They're back to producing just a trickle again.
Grant
Darwin NT
ID: 1133900
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1133910 - Posted: 30 Jul 2011, 4:32:28 UTC - in response to Message 1133900.  


Well, it was nice while it lasted.
They're back to producing just a trickle again.

Yeah, shucks.
Dunno what's limiting it.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1133910
Bernd Noessler
Joined: 15 Nov 09
Posts: 99
Credit: 52,635,434
RAC: 0
Germany
Message 1134009 - Posted: 30 Jul 2011, 6:47:06 UTC

Nothing has changed with 208.68.240.13. The forwarding of port 80
still doesn't work. Interestingly, the forwarding of port 443 (https)
is working and connects me to vader.

ID: 1134009
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1134030 - Posted: 30 Jul 2011, 7:42:56 UTC

Well, little work is making its way down to the kitties.
Not that they cannot connect or anything, but the scheduler is not sending out any tunas.

If my cache is running down, some faster fishes than mine are gonna be flopping on the beach soon.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1134030

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.