Panic Mode On (80) Server Problems?


Message boards : Number crunching : Panic Mode On (80) Server Problems?

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5562
Credit: 51,313,517
RAC: 39,913
Australia
Message 1332150 - Posted: 28 Jan 2013, 3:38:17 UTC - in response to Message 1332114.

Grant

The last time they used the campus network for scheduler requests

This is what happened.

The scheduler will be down until someone can get to the lab to reboot it. I'll try to convince Angela to let me go in once the turkey is in the oven.

Eric

?
____________
Grant
Darwin NT.

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2203
Credit: 8,010,496
RAC: 4,157
United States
Message 1332166 - Posted: 28 Jan 2013, 6:00:05 UTC

Not entirely sure what that is about, either.

I do remember they tried changing the IP and also the ISP the scheduler listened on. The server was still in the closet and was still hooked up to the same internal network as all the other servers, but it had an IP for the campus ISP, and DNS was of course updated to reflect that.

My memory is a little fuzzy, but I think that made things worse somehow; I don't recall just how. There also may have been some kind of issue with remote login since the scheduler was now on a different subnet, which would require someone to actually go into the lab.

If we could possibly get our soft limit of 100 Mbit increased to 150 Mbit, that would probably fix just about everything regarding communications. That won't fix the database having I/O performance issues, or getting fragmented and bloated, so limits may still be required, but maybe the limits could be increased a little, like 50% to start with, and see what happens after a week. Then add another 50%, and so on.
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

rob smith
Volunteer moderator
Joined: 7 Mar 03
Posts: 7668
Credit: 44,747,954
RAC: 75,283
United Kingdom
Message 1332272 - Posted: 28 Jan 2013, 17:39:08 UTC

A quick sum
Number of MB tasks produced per second ~60 (based on an average production rate of 30 WU/s)
Amount of MB data to be transferred per second = 60*366 ≈ 22000KB
Now that's only 22MB per second, which leaves a fair bit of change from the 100MB/s pipe.

So what is gobbling up the other 78MB???
My sums ignore overheads, but even if these run at 100% of the "real" data there is something having a fair old feast at the expense of S@H's link between the lab and the outside world....
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8275
Credit: 44,938,528
RAC: 13,679
United Kingdom
Message 1332275 - Posted: 28 Jan 2013, 17:46:15 UTC - in response to Message 1332272.

A quick sum
Number of MB tasks produced per second ~60 (based on an average production rate of 30 WU/s)
Amount of MB data to be transferred per second = 60*366 ≈ 22000KB
Now that's only 22MB per second, which leaves a fair bit of change from the 100MB/s pipe.

So what is gobbling up the other 78MB???
My sums ignore overheads, but even if these run at 100% of the "real" data there is something having a fair old feast at the expense of S@H's link between the lab and the outside world....

Check your bits and bytes.

22 MegaBytes (normal unit for file sizes and storage)
is a lot more than
100 Megabits (normal unit for communications channels)
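
The unit mismatch can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in the sum above (60 tasks per second, 366 KB per task, a 100 Mbit/s link); it assumes decimal units and, like the original sum, ignores protocol overhead.

```python
# Figures taken from the quoted sum; decimal units (1 MB = 1000 KB) are an assumption.
TASKS_PER_SECOND = 60
TASK_SIZE_KB = 366
LINK_MBIT_PER_S = 100  # the link is rated in megabits, not megabytes


def required_mbit_per_s(tasks_per_second, task_size_kb):
    """Convert a task rate and task size in KB into megabits per second."""
    kilobytes_per_s = tasks_per_second * task_size_kb
    megabytes_per_s = kilobytes_per_s / 1000
    return megabytes_per_s * 8  # 8 bits per byte


demand = required_mbit_per_s(TASKS_PER_SECOND, TASK_SIZE_KB)
print(f"demand: {demand:.1f} Mbit/s, link: {LINK_MBIT_PER_S} Mbit/s")
print(f"link capacity expressed in bytes: {LINK_MBIT_PER_S / 8:.1f} MB/s")
```

On these numbers the demand comes out well above what a 100 Mbit/s link can carry, which is the point of the correction: the link is worth about 12.5 MB/s, not 100 MB/s.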

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 37296
Credit: 498,845,548
RAC: 502,868
United States
Message 1332281 - Posted: 28 Jan 2013, 18:03:28 UTC - in response to Message 1332275.

A quick sum
Number of MB tasks produced per second ~60 (based on an average production rate of 30 WU/s)
Amount of MB data to be transferred per second = 60*366 ≈ 22000KB
Now that's only 22MB per second, which leaves a fair bit of change from the 100MB/s pipe.

So what is gobbling up the other 78MB???
My sums ignore overheads, but even if these run at 100% of the "real" data there is something having a fair old feast at the expense of S@H's link between the lab and the outside world....

Check your bits and bytes.

22 MegaBytes (normal unit for file sizes and storage)
is a lot more than
100 Megabits (normal unit for communications channels)

He would also be ignoring AP WUs........
____________
******************
Crunching Seti, loving all of God's kitties.

I have met a few friends in my life.
Most were cats.

rob smith
Volunteer moderator
Joined: 7 Mar 03
Posts: 7668
Credit: 44,747,954
RAC: 75,283
United Kingdom
Message 1332292 - Posted: 28 Jan 2013, 18:43:40 UTC - in response to Message 1332275.

In that case, why do we get reasonable download rates sometimes when the splitters are going all out, and yet at other times (like now) the performance is very poor?
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 37296
Credit: 498,845,548
RAC: 502,868
United States
Message 1332294 - Posted: 28 Jan 2013, 18:48:08 UTC - in response to Message 1332292.

In that case, why do we get reasonable download rates sometimes when the splitters are going all out, and yet at other times (like now) the performance is very poor?

It seems to be usually when the larger AP WUs are added to the download mix that things get rather tied up. I have noticed at times that it appears that AP downloads, although still slow, seem to be less likely to stall or hang, thus tying up the download link longer.
____________
******************
Crunching Seti, loving all of God's kitties.

I have met a few friends in my life.
Most were cats.

rob smith
Volunteer moderator
Joined: 7 Mar 03
Posts: 7668
Credit: 44,747,954
RAC: 75,283
United Kingdom
Message 1332304 - Posted: 28 Jan 2013, 19:24:17 UTC

An AP is about 22 times the size of an MB.
It could be that the presence of a feed of APs just trips things over the line. Likewise, a high demand, such as a shortie storm, has the same effect.
A small perturbation is just enough to upset the scheduler, which causes a higher number of "rejects" than normal, and so the snowball of delays and retries grows.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 3567
Credit: 97,948,839
RAC: 79,209
United States
Message 1332309 - Posted: 28 Jan 2013, 19:30:29 UTC - in response to Message 1332294.

In that case, why do we get reasonable download rates sometimes when the splitters are going all out, and yet at other times (like now) the performance is very poor?

It seems to be usually when the larger AP WUs are added to the download mix that things get rather tied up. I have noticed at times that it appears that AP downloads, although still slow, seem to be less likely to stall or hang, thus tying up the download link longer.

APs are ~20 times larger than MB tasks, but only take about 6 times as long to process. The 100Mb pipe is often sufficient for standard MB tasks when there isn't a large volume of shorties. Add in AP or batches of shorties and then it does get choked. Hopefully the work towards larger MB tasks will help take some of the load off the line.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!
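
The size and time ratios in the post above imply how much harder AP work leans on the download link for each hour of crunching. A rough sketch, assuming the quoted ratios (about 20 times the size, about 6 times the processing time) hold for the same host; both numbers are approximate and come from the post, not from measurement.

```python
# Approximate ratios quoted in the post; treat both as rough estimates.
AP_SIZE_RATIO = 20  # an AP task is ~20 times larger than an MB task
AP_TIME_RATIO = 6   # an AP task takes ~6 times longer to process


def relative_download_demand(size_ratio, time_ratio):
    """Bytes downloaded per unit of processing time, relative to MB work."""
    return size_ratio / time_ratio


ratio = relative_download_demand(AP_SIZE_RATIO, AP_TIME_RATIO)
print(f"AP work needs about {ratio:.1f}x the download bandwidth of MB work per hour crunched")
```

So a host kept busy on AP work would pull somewhat more than three times the data per hour that it would on MB work, which fits the observation that adding APs to the mix chokes the line.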

petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 349
Credit: 53,278,338
RAC: 139,864
Finland
Message 1332341 - Posted: 28 Jan 2013, 20:55:15 UTC - in response to Message 1332294.


Some random thoughts in the evening..

.. are AP and MB work units generated on some machine and then copied over the network to a distribution server?

If so, are they loaded to the download server using the same network card/interface that is used by users to download work units to their machines?

If so, could it be that the generator/copier saturates the channel?

If not, how about the disk read/write speed of the download machine? Simultaneous read/write operations could hurt RAID performance. The writes alone are quite costly.

But I guess that you have ruled these out already.
____________

fscheel
Joined: 13 Apr 12
Posts: 73
Credit: 11,135,641
RAC: 0
United States
Message 1332347 - Posted: 28 Jan 2013, 21:14:54 UTC - in response to Message 1331980.

Can someone recommend a good reliable source to get a paid proxy that would work with SETI?

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 3567
Credit: 97,948,839
RAC: 79,209
United States
Message 1332348 - Posted: 28 Jan 2013, 21:27:26 UTC - in response to Message 1332341.


Some random thoughts in the evening..

.. are AP and MB work units generated on some machine and then copied over the network to a distribution server?

If so, are they loaded to the download server using the same network card/interface that is used by users to download work units to their machines?

If so, could it be that the generator/copier saturates the channel?

If not, how about the disk read/write speed of the download machine? Simultaneous read/write operations could hurt RAID performance. The writes alone are quite costly.

But I guess that you have ruled these out already.

IIRC most of, if not all, the servers use a Fibre Channel interconnect to the storage array.

They have seen the FC network become saturated before, but that was from some changes they were trying I believe. Most of that kind of stuff gets posted in Technical News.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 3567
Credit: 97,948,839
RAC: 79,209
United States
Message 1332349 - Posted: 28 Jan 2013, 21:29:19 UTC - in response to Message 1332347.

Can someone recommend a good reliable source to get a paid proxy that would work with SETI?

I am sure you could find a private paid proxy to use, but you might want to hit up the free ones first.
http://www.xroxy.com/proxylist.php?port=&type=&ssl=&country=US&latency=&reliability=&sort=port#table
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3276
Credit: 40,803,490
RAC: 60,395
Russia
Message 1332358 - Posted: 28 Jan 2013, 21:57:29 UTC

For a few weeks already my main host has been almost constantly out of work from SETI.
BOINC's big download backoffs make it impossible to fill the cache.
Only when I have time to constantly press "retry now" can I fill the cache for a day or 2, and usually only for GPU; CPU remains empty or on a backup project.
____________
News about SETI opt app releases: https://twitter.com/Raistmer

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 4811
Credit: 71,585,885
RAC: 8,806
Australia
Message 1332359 - Posted: 28 Jan 2013, 22:06:35 UTC - in response to Message 1332358.
Last modified: 28 Jan 2013, 22:07:06 UTC

Been not watching closely over the traditional Australia Day long weekend chaos, and my machines were crunching when I looked occasionally. If I had stuck transfers I just put this retryMainTransfers.cmd in my scheduled tasks for every 20 mins or so:

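@ECHO OFF
REM Dump the current transfer list to a file
boinccmd --get_file_transfers > mainxfers.txt
REM For every "name:" line, show the file name and ask the client to retry it
FOR /F "tokens=1,2" %%i IN (mainxfers.txt) DO (
  IF "%%i" EQU "name:" echo %%j
  IF "%%i" EQU "name:" boinccmd --file_transfer http://setiathome.berkeley.edu/ %%j retry
)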

____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 552
Credit: 120,147,293
RAC: 85,949
United Kingdom
Message 1332362 - Posted: 28 Jan 2013, 22:21:32 UTC - in response to Message 1332359.

Been not watching closely over the traditional Australia Day long weekend chaos, and my machines were crunching when I looked occasionally. If I had stuck transfers I just put this retryMainTransfers.cmd in my scheduled tasks for every 20 mins or so:

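@ECHO OFF
REM Dump the current transfer list to a file
boinccmd --get_file_transfers > mainxfers.txt
REM For every "name:" line, show the file name and ask the client to retry it
FOR /F "tokens=1,2" %%i IN (mainxfers.txt) DO (
  IF "%%i" EQU "name:" echo %%j
  IF "%%i" EQU "name:" boinccmd --file_transfer http://setiathome.berkeley.edu/ %%j retry
)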


Similarly, I have this as a crontab entry on my Linux boxes, and Windows running cygwin:

[eesridr:~] > cat retryfiles
pgrep boinc > /dev/null
if [ $? -eq 0 ] # Test exit status of "pgrep" command.
then
cd ~/BOINC/
./boinccmd --get_file_transfers | gawk -f retry.awk
fi

[eesridr:~] > cat BOINC/retry.awk
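# Remember the most recent file name seen
/name/ { n = $2;}
# Retry only transfers that are not currently active (pattern and action must share a line)
/ xfer active: no/ { system("./boinccmd --file_transfer http://setiathome.berkeley.edu/ " n " retry");}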

____________
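
For anyone who would rather not juggle batch or awk, the same idea can be sketched in Python. This is an illustration only, not a tested tool: it assumes `boinccmd` is on the PATH and that `--get_file_transfers` prints a `name:` line followed later by an `xfer active: no` line for each stalled file, which is the layout the awk patterns above rely on. The helper names and the sample listing are hypothetical.

```python
import subprocess

PROJECT_URL = "http://setiathome.berkeley.edu/"


def stalled_transfer_names(listing):
    """Return file names whose transfer block reports 'xfer active: no'."""
    names = []
    current = None
    for raw in listing.splitlines():
        line = raw.strip()
        if line.startswith("name:"):
            # Remember the file this block of lines belongs to
            current = line.split(":", 1)[1].strip()
        elif line == "xfer active: no" and current is not None:
            names.append(current)
            current = None
    return names


def retry_commands(names, boinccmd="boinccmd"):
    """Build one 'boinccmd --file_transfer URL name retry' command per file."""
    return [[boinccmd, "--file_transfer", PROJECT_URL, name, "retry"] for name in names]


def retry_stalled(boinccmd="boinccmd"):
    """Ask the running client for its transfers and retry the stalled ones."""
    listing = subprocess.run(
        [boinccmd, "--get_file_transfers"],
        capture_output=True, text=True, check=True,
    ).stdout
    for command in retry_commands(stalled_transfer_names(listing), boinccmd):
        subprocess.run(command, check=False)


# Hypothetical sample listing, only to exercise the parser without a client running.
SAMPLE = """\
1) -----------
   name: example_wu_1
   direction: download
   xfer active: no
2) -----------
   name: example_wu_2
   direction: download
   xfer active: yes
"""
print(stalled_transfer_names(SAMPLE))
```

Calling `retry_stalled()` from cron or a scheduled task every 20 minutes or so would mirror the scripts above. Keeping the parsing separate from the `subprocess` calls means it can be checked against a saved listing first.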

ExchangeMan
Volunteer tester
Joined: 9 Jan 00
Posts: 103
Credit: 104,912,585
RAC: 215,017
United States
Message 1332384 - Posted: 29 Jan 2013, 0:49:34 UTC - in response to Message 1332359.

I have something very similar to this for the same purpose. Gotta love DOS programming.

____________

KWSN Ekky Ekky Ekky
Joined: 25 May 99
Posts: 917
Credit: 9,706,439
RAC: 11,670
United Kingdom
Message 1332457 - Posted: 29 Jan 2013, 8:44:14 UTC
Last modified: 29 Jan 2013, 9:13:57 UTC

Dip in traffic towards Seti detected?
Yes, definitely a downturn. Expect failing reports after a good day of rapid access.

[edit] The thin blue line has hit the bottom - no more reporting until later, I fear. [end edit]
____________

WinterKnight
Volunteer tester
Joined: 18 May 99
Posts: 8219
Credit: 21,791,161
RAC: 12,914
United Kingdom
Message 1332464 - Posted: 29 Jan 2013, 9:47:02 UTC - in response to Message 1332457.

Dip in traffic towards Seti detected?
Yes, definitely a downturn. Expect failing reports after a good day of rapid access.

[edit] The thin blue line has hit the bottom - no more reporting until later, I fear. [end edit]

That is not the bottom, it is the 10Mb horizontal. The weekly graph shows there is still a bit to go.

But yes, there is a problem.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5562
Credit: 51,313,517
RAC: 39,913
Australia
Message 1332467 - Posted: 29 Jan 2013, 10:04:11 UTC - in response to Message 1332464.

But yes, there is a problem.

Yep, Scheduler borked again.
"Couldn't connect to server" once again the standard response.
____________
Grant
Darwin NT.


Copyright © 2014 University of California