Panic Mode On (51) Server problems?


Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8275
Credit: 44,921,085
RAC: 13,535
United Kingdom
Message 1133511 - Posted: 29 Jul 2011, 15:31:07 UTC - in response to Message 1133506.

Well, it could and should be done on the server side as a round robin function IMO. My DNS server never once tried the working IP; it always tried the non-working one, and that went on for days. Flushing the DNS cache made no difference, rebooting made no difference, and I'm sure I'm not alone with this problem, which could easily be fixed on the server side.

Edit, added: Be that as it may, I just have to edit the hosts file when needed.

AFAIK, it is implemented as round robin DNS - it's always looked that way when I've tracked it down. It's worth trying ipconfig /displaydns to find out what your local machine's DNS resolver currently thinks the IP address should be before/during/after a download request - it shows the current TTL timer countdown as well, which is useful.

If displaydns consistently shows the wrong address, then something upstream (DNS server/proxy/ISP) is mis-handling TTL. Or there might, indeed, be a mis-configuration at SETI - that would affect us all, and we can check that by comparing notes here.
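
For comparing notes from a script rather than by eye, here is a minimal sketch. It assumes Python is available, and the function names are made up for illustration; nothing here is a BOINC or SETI tool. It only lists the distinct IPv4 addresses the system resolver hands back for a host name, and it does not show the TTL countdown that ipconfig /displaydns does.

```python
import socket


def unique_ipv4(addrinfo):
    """Return the sorted, de-duplicated IPv4 addresses from getaddrinfo() output."""
    return sorted({sockaddr[0]
                   for family, _type, _proto, _canon, sockaddr in addrinfo
                   if family == socket.AF_INET})


def resolve_ipv4(hostname, port=80):
    """Ask the system resolver which IPv4 addresses it currently returns for hostname."""
    info = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    return unique_ipv4(info)
```

Calling resolve_ipv4() with the download server's host name a few minutes apart, on different machines, should show whether everyone is being handed the same single address or whether the addresses rotate.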

There used to be a bug in BOINC, which Ned Ludd and I finally got them to acknowledge and fix in v6.10.33 (March 2009) - if BOINC had already tried a download, and failed, it carried on attempting to download from the same IP address for evermore, rather than re-querying DNS (which would pick up the round robin). It wasn't BOINC's fault - it was a bug in the underlying libcurl library that handles the TCP/IP layer. And it shouldn't be a problem in any current version of BOINC.
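
To illustrate the difference between the buggy and the fixed behaviour, here is a rough sketch of a retry loop that looks the name up again on every attempt. This is not BOINC's or libcurl's actual code; the function name and the injected resolve/fetch callables are assumptions made purely for illustration.

```python
def download_with_requery(hostname, resolve, fetch, max_attempts=4):
    """Try a download up to max_attempts times, re-resolving the name each time.

    resolve(hostname) returns an IP string; fetch(ip) returns the data or
    raises OSError. Both are passed in so the retry logic can be tried out
    without touching the network.
    """
    tried = []
    for _ in range(max_attempts):
        ip = resolve(hostname)  # fresh lookup on every attempt, not a remembered address
        tried.append(ip)
        try:
            return fetch(ip), tried
        except OSError:
            continue  # the next lookup may hand back the other round robin server
    raise OSError(f"all {max_attempts} attempts failed; addresses tried: {tried}")
```

The old behaviour corresponds to calling resolve() once before the loop and reusing that address, which is why a failed download could stay stuck on the dead server.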

I'm on 6.10.18 on both machines, and I refuse to upgrade, so I just have to live with it :-)

In that case, half your downloads will stall, and you will have to do a (carefully-timed) restart of BOINC to free them while the 'right' server is on DNS duty.

That's what I like about crunching for SETI, rather than other projects - it actually feels like you're doing some of the work yourself, not just leaving it to the computer. ;-)

Sten-Arne
Volunteer tester
Joined: 1 Nov 08
Posts: 3307
Credit: 16,259,923
RAC: 11,365
Sweden
Message 1133514 - Posted: 29 Jul 2011, 15:35:25 UTC - in response to Message 1133511.


Well, it's never happened before, or maybe it has but I've forgotten. It must have happened, since I had already edited my hosts file a long time ago but commented out the BOINC server parts.

Don't mind me, it's Alzheimer's light I guess :-)
____________

Bernd Noessler
Joined: 15 Nov 09
Posts: 99
Credit: 52,635,315
RAC: 1
Germany
Message 1133520 - Posted: 29 Jul 2011, 15:39:04 UTC - in response to Message 1133504.



It is not different, it is the same .13. Well that is for my Win 7 with or without flushdns. On my Vista though it changes every time I do a flushdns, no matter if it's been 5 minutes or not in between. The Vista machine did not have any download problems either.


I have tried it with Win XP in a VirtualBox. I run BIND as the nameserver on my local network. The IP changes every 5 minutes. Do you have a local nameserver, or do you use your ISP's nameserver?

Link
Joined: 18 Sep 03
Posts: 812
Credit: 1,500,378
RAC: 447
Germany
Message 1133522 - Posted: 29 Jul 2011, 15:40:35 UTC
Last modified: 29 Jul 2011, 15:41:37 UTC

I posted this in the other thread about these problems; might it be something better than the current round robin DNS?

Since it's not the first time that we have had problems like this here, I wonder whether it would cause fewer problems if SETI had two different download server URLs, for example dl1.ssl.berkeley.edu and dl2.ssl.berkeley.edu, and sent both as possible download locations, as Rosetta does, for example:

<url>http://srv3.bakerlab.org/rosetta/download/262/avgE_from_pdb.gz</url>
<url>http://boinc.bakerlab.org/rosetta/download/262/avgE_from_pdb.gz</url>
<url>http://srv4.bakerlab.org/rosetta/download/262/avgE_from_pdb.gz</url>
<url>http://srv3.bakerlab.org/rosetta/download/262/avgE_from_pdb.gz</url>
<url>http://boinc.bakerlab.org/rosetta/download/262/avgE_from_pdb.gz</url>
<url>http://srv4.bakerlab.org/rosetta/download/262/avgE_from_pdb.gz</url>



So for a SETI WU it could be:

<url>http://dl1.ssl.berkeley.edu/sah/download_fanout/61/08ap11ae.3480.1703.14.10.29</url>
<url>http://dl2.ssl.berkeley.edu/sah/download_fanout/61/08ap11ae.3480.1703.14.10.29</url>



Don't know how the load balancing works in that case. If the BOINC client just picks one of them, then that would be pretty easy, with no need for any big server side changes. If the client starts from the top and tries one after the other, then the scheduler would have to send dl1,dl2 to all even number results (_0, _2,...) and dl2,dl1 to all odd number results. I think it might work better than the current way... but I might be wrong of course.
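
The even/odd ordering could be sketched roughly like this. It is only an illustration: the function name is made up, dl1/dl2 are the hypothetical host names from the suggestion above, and it assumes result names end in "_<n>". It says nothing about how the real scheduler builds its URL lists.

```python
def ordered_download_urls(result_name, path,
                          hosts=("dl1.ssl.berkeley.edu", "dl2.ssl.berkeley.edu")):
    """Return the download URLs for one result, rotated by the result index.

    result_name is assumed to end in "_<n>" (for example "..._0" or "..._1").
    Even-numbered results get the hosts in the given order, odd-numbered
    results get them rotated by one, so the load splits across both servers
    even if every client starts at the top of the list.
    """
    index = int(result_name.rsplit("_", 1)[1])
    shift = index % len(hosts)
    rotated = hosts[shift:] + hosts[:shift]
    return [f"http://{host}/{path}" for host in rotated]
```

With the workunit path from the example above, a "_0" result would list dl1 first and dl2 second, and a "_1" result the other way round, so a client that works down the list still has a fallback if its first server is dead.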

____________
.

Richard Haselgrove
Volunteer tester
Send message
Joined: 4 Jul 99
Posts: 8275
Credit: 44,921,085
RAC: 13,535
United Kingdom
Message 1133526 - Posted: 29 Jul 2011, 15:48:43 UTC - in response to Message 1133514.

Well, it's never happened before, or maybe it has but I've forgotten. It must have happened, since I had already edited my hosts file a long time ago but commented out the BOINC server parts.

Don't mind me, it's Alzheimer's light I guess :-)

It has happened before, but it's an intermittent problem which keeps cropping up, hanging around for a while, and going away again.

I guess that because downloads are sort-of working, and they all go out over the same link, it doesn't show up as a problem on the lab monitoring tools: and they don't know it needs kicking until we kick up a fuss here, or someone on the 'inside' mailing distribution circuit passes on a message. Hint to mods?

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 37287
Credit: 498,171,497
RAC: 494,255
United States
Message 1133549 - Posted: 29 Jul 2011, 17:08:10 UTC

Now if we could just get the splitters back in high gear......

Meowgrrrrrrrrr.
____________
******************
Crunching Seti, loving all of God's kitties.

I have met a few friends in my life.
Most were cats.

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 37287
Credit: 498,171,497
RAC: 494,255
United States
Message 1133770 - Posted: 30 Jul 2011, 0:16:24 UTC

Could it be???
Did the boyz kick something into gear before locking up the lab for the weekend?

The Cricket graphs just maxed out for the first time in a while and splitter speed is up.

More power, Scotty!!!!
____________
******************
Crunching Seti, loving all of God's kitties.

I have met a few friends in my life.
Most were cats.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5561
Credit: 51,253,726
RAC: 38,570
Australia
Message 1133900 - Posted: 30 Jul 2011, 4:25:57 UTC - in response to Message 1133770.


Well, it was nice while it lasted.
They're back to producing just a trickle again.
____________
Grant
Darwin NT.

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 37287
Credit: 498,171,497
RAC: 494,255
United States
Message 1133910 - Posted: 30 Jul 2011, 4:32:28 UTC - in response to Message 1133900.


Well, it was nice while it lasted.
They're back to producing just a trickle again.

Yeah, shucks.
Dunno what's limiting it.
____________
******************
Crunching Seti, loving all of God's kitties.

I have met a few friends in my life.
Most were cats.

Bernd Noessler
Joined: 15 Nov 09
Posts: 99
Credit: 52,635,315
RAC: 1
Germany
Message 1134009 - Posted: 30 Jul 2011, 6:47:06 UTC

Nothing has changed with 208.68.240.13. The forwarding of port 80 doesn't work. Interestingly, the forwarding of port 443 (https) is working and connects me to vader.
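
For anyone who wants to repeat the port comparison, a minimal sketch follows. The function name is an assumption for illustration, and it only tests whether a TCP connection can be opened; a port that accepts connections may still fail to serve downloads.

```python
import socket


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts alike.
        return False
```

Calling port_open("208.68.240.13", 80) and port_open("208.68.240.13", 443) one after the other would give a quick side-by-side view of the two ports.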

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 37287
Credit: 498,171,497
RAC: 494,255
United States
Message 1134030 - Posted: 30 Jul 2011, 7:42:56 UTC

Well, little work is making its way down to the kitties.
Not that they cannot connect or anything, but the scheduler is not sending out any tunas.

If my cache is running down, some faster fishes than mine are gonna be flopping on the beach soon.
____________
******************
Crunching Seti, loving all of God's kitties.

I have met a few friends in my life.
Most were cats.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5561
Credit: 51,253,726
RAC: 38,570
Australia
Message 1134056 - Posted: 30 Jul 2011, 9:00:48 UTC - in response to Message 1134030.


Yep, both my machines are getting work. Just not very much of it & only on every 10-20th request. Both caches are running down.
____________
Grant
Darwin NT.

Sten-Arne
Volunteer tester
Joined: 1 Nov 08
Posts: 3307
Credit: 16,259,923
RAC: 11,365
Sweden
Message 1134272 - Posted: 30 Jul 2011, 21:41:58 UTC

First time now in 20 hours that the splitters seem to be building up a cache of Results ready to send, and the bandwidth utilization is above 90 Mbits/sec.

Let's hope it can stay this way for a bit longer than the last time.
____________

Zapped Sparky
Volunteer moderator
Volunteer tester
Joined: 30 Aug 08
Posts: 5606
Credit: 1,118,948
RAC: 1,642
United Kingdom
Message 1134314 - Posted: 30 Jul 2011, 23:19:43 UTC

My cache ran out last night due to "no tasks available". BOINC has managed to grab a few tasks to keep going today and is now filling the cache back up quite well. Pretty much all shorties; my CPU is flying through them.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5561
Credit: 51,253,726
RAC: 38,570
Australia
Message 1134324 - Posted: 30 Jul 2011, 23:45:37 UTC - in response to Message 1134272.

Let's hope it can stay this way for a bit longer than the last time.

Fingers crossed.
Now if they could sort out the dodgy download server all should be right in time for the next outage.

____________
Grant
Darwin NT.

Helli
Volunteer tester
Joined: 15 Dec 99
Posts: 697
Credit: 77,237,966
RAC: 74,758
Germany
Message 1134411 - Posted: 31 Jul 2011, 5:06:24 UTC

Similar here. Cache ran empty three hours ago, but 1054 WUs are stuck in the download queue: HTTP error. No downloads at the moment...

Helli

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5561
Credit: 51,253,726
RAC: 38,570
Australia
Message 1134414 - Posted: 31 Jul 2011, 5:19:51 UTC - in response to Message 1134411.


Suspend network activity, then re-enable it a couple of seconds later.
Usually gets things going for me.
____________
Grant
Darwin NT.

Helli
Volunteer tester
Joined: 15 Dec 99
Posts: 697
Credit: 77,237,966
RAC: 74,758
Germany
Message 1134417 - Posted: 31 Jul 2011, 5:31:21 UTC

Yup, a few are flowing. Shorties. 5 minutes each. But if you can do 16 workunits in five minutes then you have to be very patient. ;-)

Helli

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5561
Credit: 51,253,726
RAC: 38,570
Australia
Message 1134641 - Posted: 31 Jul 2011, 22:12:33 UTC - in response to Message 1134417.


It was nice while it lasted, looks like about 4 hours ago we ran out of MB work to split.
____________
Grant
Darwin NT.

rob smith
Volunteer moderator
Joined: 7 Mar 03
Posts: 7661
Credit: 44,661,518
RAC: 75,457
United Kingdom
Message 1134734 - Posted: 1 Aug 2011, 8:23:55 UTC

Well, it's Monday morning and the splitters are running out of tapes to split, which is a good thing: it means they more or less got the right number of tapes loaded on Friday.

____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?


Copyright © 2014 University of California