Panic Mode On (51) Server problems?


Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8808
Credit: 53,434,214
RAC: 43,986
United Kingdom
Message 1133511 - Posted: 29 Jul 2011, 15:31:07 UTC - in response to Message 1133506.

Well, it could and should be done on the server side as a round robin function, IMO. My DNS server never once tried the working IP; it always tried the non-working one, and that went on for days. Flushing the DNS cache made no difference, rebooting made no difference, and I'm sure I'm not alone with this problem, which could easily be fixed on the server side.

Edit, added: Be that as it may, I just have to edit the hosts file when needed.

AFAIK, it is implemented as round robin DNS - it's always looked that way when I've tracked it down. It's worth trying ipconfig /displaydns to find out what your local machine's DNS resolver currently thinks the IP address should be before/during/after a download request - it shows the current TTL timer countdown as well, which is useful.

If displaydns consistently shows the wrong address, then something upstream (DNS server/proxy/ISP) is mis-handling TTL. Or there might, indeed, be a mis-configuration at SETI - that would affect us all, and we can check that by comparing notes here.
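For comparing notes across machines, a small script can log what the system resolver returns over time. This is only a sketch: the helper names `unique_ipv4` and `resolve` are made up for illustration, and the hostname is whatever download server you want to watch. It goes through the operating system's resolver, so it should roughly reflect the same cached answer that displaydns reports, but that is an assumption worth checking on your own setup.

```python
import socket

def unique_ipv4(addrinfo):
    """Return the distinct IPv4 addresses in a getaddrinfo() result, first seen first."""
    seen = []
    for family, _type, _proto, _canon, sockaddr in addrinfo:
        if family == socket.AF_INET and sockaddr[0] not in seen:
            seen.append(sockaddr[0])
    return seen

def resolve(hostname, port=80):
    """Ask the system resolver which IPv4 address(es) it currently has for hostname."""
    info = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    return unique_ipv4(info)
```

Calling `resolve()` on the download hostname every few minutes and noting the answers should show whether the address ever rotates. If it never does, that points at something upstream holding on to a stale answer.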

There used to be a bug in BOINC, which Ned Ludd and I finally got them to acknowledge and fix in v6.10.33 (March 2009) - if BOINC had already tried a download, and failed, it carried on attempting to download from the same IP address for evermore, rather than re-querying DNS (which would pick up the round robin). It wasn't BOINC's fault - it was a bug in the underlying libcurl library that handles the TCP/IP layer. And it shouldn't be a problem in any current version of BOINC.
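The difference between the buggy and the fixed behaviour can be sketched as a retry loop that looks the name up again before every attempt, instead of reusing the first address forever. This is not BOINC or libcurl code; `fetch_with_reresolve` and the caller-supplied `try_address` callback are hypothetical names used only to illustrate the idea.

```python
import socket

def fetch_with_reresolve(hostname, port, try_address,
                         resolver=socket.gethostbyname_ex, max_attempts=4):
    """Retry a download, re-querying DNS before each attempt.

    try_address(ip, port) is a hypothetical caller-supplied function that
    returns True when the download from that address succeeds.
    Returns the address that worked, or None if every attempt failed.
    """
    failed = []
    for _ in range(max_attempts):
        _name, _aliases, addresses = resolver(hostname)
        if not addresses:
            continue
        # Prefer addresses that have not failed yet; otherwise retry a failed one.
        candidates = [a for a in addresses if a not in failed] or addresses
        ip = candidates[0]
        if try_address(ip, port):
            return ip
        if ip not in failed:
            failed.append(ip)
    return None
```

With round robin DNS handing out two addresses, a loop like this moves on to the second address after the first one fails, which is the behaviour the old client lacked.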

I'm on 6.10.18 on both machines, and I refuse to upgrade, so I just have to live with it :-)

In that case, half your downloads will stall, and you will have to do a (carefully-timed) restart of BOINC to free them while the 'right' server is on DNS duty.

That's what I like about crunching for SETI, rather than other projects - it actually feels like you're doing some of the work yourself, not just leaving it to the computer. ;-)

Sten-Arne
Volunteer tester
Joined: 1 Nov 08
Posts: 3747
Credit: 21,447,296
RAC: 14,777
Sweden
Message 1133514 - Posted: 29 Jul 2011, 15:35:25 UTC - in response to Message 1133511.


Well, it's never happened before, or maybe it has but I've forgotten. It must have happened, since I had already edited my hosts file a long time ago, but commented out the BOINC server parts.

Don't mind me, it's Alzheimer's light I guess :-)
____________

Bernd Noessler
Joined: 15 Nov 09
Posts: 99
Credit: 52,635,434
RAC: 0
Germany
Message 1133520 - Posted: 29 Jul 2011, 15:39:04 UTC - in response to Message 1133504.



It is not different, it is the same .13. Well that is for my Win 7 with or without flushdns. On my Vista though it changes every time I do a flushdns, no matter if it's been 5 minutes or not in between. The Vista machine did not have any download problems either.


I have tried it with Win XP in a VirtualBox. I run BIND as the nameserver
on my local network. The IP changes every 5 minutes. Do you have a local
nameserver or do you use the nameserver of your ISP?

Link
Joined: 18 Sep 03
Posts: 840
Credit: 1,578,051
RAC: 55
Germany
Message 1133522 - Posted: 29 Jul 2011, 15:40:35 UTC
Last modified: 29 Jul 2011, 15:41:37 UTC

I posted this in the other thread about these problems: might it be something better than the current round robin DNS?

Since it's not the first time that we have had problems like this here, I wonder whether it would cause fewer problems if SETI had two different download server URLs, for example dl1.ssl.berkeley.edu and dl2.ssl.berkeley.edu, and sent both as possible download locations, as Rosetta does for example:

<url>http://srv3.bakerlab.org/rosetta/download/262/avgE_from_pdb.gz</url>
<url>http://boinc.bakerlab.org/rosetta/download/262/avgE_from_pdb.gz</url>
<url>http://srv4.bakerlab.org/rosetta/download/262/avgE_from_pdb.gz</url>
<url>http://srv3.bakerlab.org/rosetta/download/262/avgE_from_pdb.gz</url>
<url>http://boinc.bakerlab.org/rosetta/download/262/avgE_from_pdb.gz</url>
<url>http://srv4.bakerlab.org/rosetta/download/262/avgE_from_pdb.gz</url>



So for a SETI WU it could be:

<url>http://dl1.ssl.berkeley.edu/sah/download_fanout/61/08ap11ae.3480.1703.14.10.29</url>
<url>http://dl2.ssl.berkeley.edu/sah/download_fanout/61/08ap11ae.3480.1703.14.10.29</url>



I don't know how the load balancing works in that case. If the BOINC client just picks one of them, then that would be pretty easy, with no need for any big server-side changes. If the client starts from the top and tries one after the other, then the scheduler would have to send dl1,dl2 to all even-numbered results (_0, _2, ...) and dl2,dl1 to all odd-numbered results. I think it might work better than the current way... but I might be wrong, of course.
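The even/odd idea could be sketched like this, assuming the client tries the URLs from the top. The function name `download_urls` is made up for illustration, the dl1/dl2 hostnames are the hypothetical ones suggested above, and nothing here reflects how the real scheduler builds its replies.

```python
def download_urls(result_name, path,
                  hosts=("dl1.ssl.berkeley.edu", "dl2.ssl.berkeley.edu")):
    """Order the two (hypothetical) download hosts by result-number parity.

    result_name ends in _0, _1, ... ; even results get dl1 first,
    odd results get dl2 first, so the first-choice load splits roughly in half.
    """
    suffix = int(result_name.rsplit("_", 1)[1])
    ordered = hosts if suffix % 2 == 0 else tuple(reversed(hosts))
    return ["http://%s/%s" % (host, path) for host in ordered]
```

Either way every result still carries both URLs, so a client whose first choice is down has the other one to fall back on.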

____________
.

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8808
Credit: 53,434,214
RAC: 43,986
United Kingdom
Message 1133526 - Posted: 29 Jul 2011, 15:48:43 UTC - in response to Message 1133514.

Well, it's never happened before, or maybe it has but I've forgotten. It must have happened, since I had already edited my hosts file a long time ago, but commented out the BOINC server parts.

Don't mind me, it's Alzheimer's light I guess :-)

It has happened before, but it's an intermittent problem which keeps cropping up, hanging around for a while, and going away again.

I guess that because downloads are sort-of working, and they all go out over the same link, it doesn't show up as a problem on the lab monitoring tools: and they don't know it needs kicking until we kick up a fuss here, or someone on the 'inside' mailing distribution circuit passes on a message. Hint to mods?

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5942
Credit: 62,339,107
RAC: 37,643
Australia
Message 1133900 - Posted: 30 Jul 2011, 4:25:57 UTC - in response to Message 1133770.


Well, it was nice while it lasted.
They're back to producing just a trickle again.
____________
Grant
Darwin NT.

Bernd Noessler
Joined: 15 Nov 09
Posts: 99
Credit: 52,635,434
RAC: 0
Germany
Message 1134009 - Posted: 30 Jul 2011, 6:47:06 UTC

Nothing has changed with 208.68.240.13. The forwarding of port 80
doesn't work. Interestingly, the forwarding of port 443 (https)
is working and connects me to vader.
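Anyone who wants to repeat this check on their own connection could use something like the sketch below. The helper names `port_open` and `check_ports` are made up for illustration. A successful TCP connect only shows that something accepted the connection on that port; it does not prove that a download would then go through.

```python
import socket

def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port is accepted within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False

def check_ports(host, ports=(80, 443)):
    """Map each port to whether a TCP connection to it succeeded."""
    return {port: port_open(host, port) for port in ports}
```

Running `check_ports("208.68.240.13")` would report the state of ports 80 and 443 as seen from your own machine.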

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5942
Credit: 62,339,107
RAC: 37,643
Australia
Message 1134056 - Posted: 30 Jul 2011, 9:00:48 UTC - in response to Message 1134030.


Yep, both my machines are getting work. Just not very much of it, and only on every 10th to 20th request. Both caches are running down.
____________
Grant
Darwin NT.

Sten-Arne
Volunteer tester
Joined: 1 Nov 08
Posts: 3747
Credit: 21,447,296
RAC: 14,777
Sweden
Message 1134272 - Posted: 30 Jul 2011, 21:41:58 UTC

This is the first time in 20 hours that the splitters seem to be building up a cache of Results ready to send, and the bandwidth utilization is above 90 Mbits/sec.

Let's hope it can stay this way for a bit longer than the last time.
____________

Zapped Sparky
Project donor
Volunteer moderator
Volunteer tester
Joined: 30 Aug 08
Posts: 9370
Credit: 1,333,553
RAC: 702
United Kingdom
Message 1134314 - Posted: 30 Jul 2011, 23:19:43 UTC

My cache ran out last night due to "no tasks available". BOINC has managed to grab a few tasks to keep going today and is now filling the cache back up quite well. Pretty much all shorties; my CPU is flying through them.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5942
Credit: 62,339,107
RAC: 37,643
Australia
Message 1134324 - Posted: 30 Jul 2011, 23:45:37 UTC - in response to Message 1134272.

Let's hope it can stay this way for a bit longer than the last time.

Fingers crossed.
Now if they could sort out the dodgy download server all should be right in time for the next outage.

____________
Grant
Darwin NT.

Helli
Project donor
Volunteer tester
Joined: 15 Dec 99
Posts: 705
Credit: 92,912,853
RAC: 59,496
Germany
Message 1134411 - Posted: 31 Jul 2011, 5:06:24 UTC

Similar here. Cache ran empty three hours ago, but 1054 WUs are stuck in the
download queue: HTTP error. No downloads actually happening...

Helli

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5942
Credit: 62,339,107
RAC: 37,643
Australia
Message 1134414 - Posted: 31 Jul 2011, 5:19:51 UTC - in response to Message 1134411.


Suspend network activity, then re-enable it a couple of seconds later.
Usually gets things going for me.
____________
Grant
Darwin NT.

Helli
Project donor
Volunteer tester
Joined: 15 Dec 99
Posts: 705
Credit: 92,912,853
RAC: 59,496
Germany
Message 1134417 - Posted: 31 Jul 2011, 5:31:21 UTC

Yup, a few are flowing. Shorties. 5 minutes each. But if you can do
16 workunits in five minutes then you have to be very patient. ;-)

Helli

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5942
Credit: 62,339,107
RAC: 37,643
Australia
Message 1134641 - Posted: 31 Jul 2011, 22:12:33 UTC - in response to Message 1134417.


It was nice while it lasted. It looks like we ran out of MB work to split about 4 hours ago.
____________
Grant
Darwin NT.

rob smith
Project donor
Volunteer tester
Joined: 7 Mar 03
Posts: 8809
Credit: 62,889,669
RAC: 75,363
United Kingdom
Message 1134734 - Posted: 1 Aug 2011, 8:23:55 UTC

Well, it's Monday morning and the splitters are running out of tapes to split, which is a good thing: it means they more or less got the right number of tapes loaded on Friday.

____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?


Copyright © 2014 University of California