Panic Mode On (79) Server Problems?



Message boards : Number crunching : Panic Mode On (79) Server Problems?

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5920
Credit: 61,710,452
RAC: 17,476
Australia
Message 1310428 - Posted: 26 Nov 2012, 8:56:18 UTC - in response to Message 1310421.


Inbound & outbound traffic is plummeting. Hopefully it'll recover again like it's done twice previously.
*fingers crossed*
____________
Grant
Darwin NT.

musicplayer
Joined: 17 May 10
Posts: 1475
Credit: 745,610
RAC: 601
Message 1310436 - Posted: 26 Nov 2012, 9:29:36 UTC

And I got a new batch of jobs coming my way. Thanks!

tbret
Project donor
Volunteer tester
Joined: 28 May 99
Posts: 2897
Credit: 218,381,660
RAC: 21,868
United States
Message 1310463 - Posted: 26 Nov 2012, 12:33:56 UTC - in response to Message 1310421.



I have no doubt we'd find some new major problem sooner rather than later, but it would erase completely several existing ones.



It would be fun to play some other game for a while.

Let's try it and see!

____________

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2328
Credit: 8,869,285
RAC: 683
United States
Message 1310487 - Posted: 26 Nov 2012, 15:22:48 UTC

I've been speculating for at least two years now that if we were able to increase the pipe to even just 200mbit, I'm not sure the scheduler/feeder would handle it. Even two years ago when GPUs were much slower and less common, the database was having trouble keeping up. Of course, we have better hardware now, but I think it's significantly more likely we'll run into a software limitation, on top of the disk I/O limitation.

A bigger pipe will likely cause more issues without some sort of restraint (per-host limits are a simple way to do it, but there are better ways, like a server-side cache size based on DCF). It was a good idea to run the scheduler on a different link, as long as that link can be kept reliable. It will at least allow a high rate of successful contacts to report work and be assigned new work; then you just have to fight for bandwidth on the download link, which in the grand scheme of things isn't that huge of an issue.

You wouldn't end up with ghosts, you'd just end up with 10+ hour back-offs, but you can overcome those with some manual intervention, or with less draconian exponential back-off calculations in the client itself.

Maybe once the scheduler reliability issues get sorted out, we could test the database's ability to keep up for a 24-hour period by updating some DNS records and ramping the bandwidth up? Maybe pick a Saturday, or the winter holidays are coming up, when the campus will be empty except for a select few faculty members. We could do a 1-3 day test at 200+ mbit then, assuming the red tape can be removed temporarily for such a thing.
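
A minimal Python sketch of the "less draconian" back-off idea mentioned above. The base interval, the 4-hour cap and the jitter range are illustrative assumptions, not the BOINC client's actual values:

```python
import random

def next_backoff_seconds(consecutive_failures: int,
                         base: float = 60.0,
                         cap: float = 4 * 3600.0) -> float:
    """Exponential back-off with a hard cap and random jitter.

    Illustration only: the delay doubles with each consecutive
    scheduler failure, but never exceeds `cap` (4 hours here)
    instead of growing into 10+ hour waits.
    """
    delay = min(cap, base * (2 ** max(0, consecutive_failures - 1)))
    # Jitter spreads retries out so thousands of hosts don't all
    # hammer the scheduler at the same instant.
    return random.uniform(delay * 0.5, delay)

# Example: delays after 1..8 consecutive failures
for failures in range(1, 9):
    print(failures, round(next_backoff_seconds(failures)), "seconds")
```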
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,856,898
RAC: 16,111
Argentina
Message 1310500 - Posted: 26 Nov 2012, 16:34:23 UTC

What I think is that, besides the underlying issue, the current limits are not helping as they should.
My hosts, even the one with a Core 2 Duo and an old gt9500, are doing a request every five minutes, and I'm sure the same happens on a huge number of active hosts, so they've replaced a few long queries with a lot of short ones.

And, while the need for some kind of limit is beyond discussion, at least until they can solve the underlying issue, maybe a slightly higher limit could give a better trade-off between the duration of the queries and the number of them...
____________

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4601
Credit: 121,650,407
RAC: 38,577
United States
Message 1310509 - Posted: 26 Nov 2012, 16:54:24 UTC - in response to Message 1308561.

Quick script to provide the percentage of failures on your machine.


Thanks


Now we just need a small tweak to divide those into 'before tonight' and 'after tonight', so we know what effect Eric's changes have had.

At the moment this will work correctly only if stdoutdae.txt contains just the current month's information; otherwise matching day information from other months & years will be included. Dealing with dates in a .bat file can be a pain. I might have to go with .vbs to make things easier.
I did update it to separate the dates & allow users to enter the number of days they wish to view.


If anyone wishes to use it you can find it here:
http://www.hal6000.com/seti/_com_check_full.txt
Right click & save (depending on your browser), then rename it to .bat. Place it in the folder where your stdoutdae.txt is located, or modify the script to point to the location of the file.

Change the number on the line "set check_days=" for the number of days you wish to view.

If you wish you can remove the line "del sched_*-%computername%_*.txt" to save the daily information. You can then use this script, http://www.hal6000.com/seti/_com_check_day_calc.txt, to run each day separately by entering the date from the command line in YYYY-MM-DD format.
So entering _com_check_day_calc.bat 2012-11-26, with the sched_*-PCNAME_2012-11-26.txt files present, would give you that day's information.
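
For anyone who would rather not wrestle with dates in a .bat file, here is a rough Python sketch of the same idea: tally scheduler successes and failures per day from stdoutdae.txt. The log wording ("Scheduler request completed" / "Scheduler request failed: ...") and the dd-Mon-yyyy timestamp format are assumptions based on typical BOINC client logs, so adjust the patterns if your file differs:

```python
# Count scheduler successes and failures per day in stdoutdae.txt.
import re
from collections import Counter, defaultdict

per_day = defaultdict(Counter)
date_re = re.compile(r"^(\d{2}-\w{3}-\d{4})")  # e.g. 26-Nov-2012

with open("stdoutdae.txt", encoding="utf-8", errors="replace") as log:
    for line in log:
        m = date_re.match(line)
        if not m:
            continue
        day = m.group(1)
        if "Scheduler request completed" in line:
            per_day[day]["success"] += 1
        elif "Scheduler request failed" in line:
            per_day[day]["failure"] += 1

for day, counts in sorted(per_day.items()):
    total = counts["success"] + counts["failure"]
    if total:
        pct_fail = 100.0 * counts["failure"] / total
        print(f"{day}: {total} requests, {pct_fail:.0f}% failed")
```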
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8767
Credit: 52,716,959
RAC: 16,436
United Kingdom
Message 1310519 - Posted: 26 Nov 2012, 17:27:51 UTC - in response to Message 1310509.

I've been aggregating the stdoutdae.txt files from five computers, and splitting them up into batches by date.

First, covering the period from 01 Nov to 20 Nov inclusive - near enough, from when the acute problems started, to when Eric switched to the Campus network last Wednesday.

Scheduler Requests: 10334
Scheduler Successes: 7341
Scheduler Failures (Connect): 221
Scheduler Failures (Peer data): 62
Scheduler Failures (Timeout): 2659
Scheduler Success: 71 %
Scheduler Failure: 28 %
Scheduler Connect: 7 % of failures
Scheduler Peer data: 2 % of failures
Scheduler Timeout: 88 % of failures

Then, on Saturday morning (Berkeley time), the scheduler came back online, but back on the old HE/PAIX IP address. Since then...

Scheduler Requests: 1109
Scheduler Successes: 349
Scheduler Failures (Connect): 597
Scheduler Failures (Peer data): 131
Scheduler Failures (Timeout): 28
Scheduler Success: 31 %
Scheduler Failure: 68 %
Scheduler Connect: 78 % of failures
Scheduler Peer data: 17 % of failures
Scheduler Timeout: 3 % of failures

So, it's actually been easier to get work here over the weekend than for several weeks past - there have been more errors in total, but the vast majority have been of the 20-second 'Connect' variety, instead of the 5-minute 'Timeout' variety. And with a 'connect' failure, the database knows nothing about your attempt, so no ghosts are created.
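
The percentages above follow directly from the raw counts. A small sketch of the arithmetic, assuming failures are taken as requests minus successes and that percentages are truncated the way a batch script's integer division would do it (with those assumptions, the output matches the figures quoted for both periods):

```python
# Derive the percentage breakdown from the raw counts above.
def breakdown(requests, successes, connect, peer, timeout):
    failures = requests - successes  # assumed definition of "failures"
    print(f"Success: {successes * 100 // requests}%  "
          f"Failure: {failures * 100 // requests}%")
    print(f"Connect: {connect * 100 // failures}%  "
          f"Peer data: {peer * 100 // failures}%  "
          f"Timeout: {timeout * 100 // failures}%  (of failures)")

breakdown(10334, 7341, 221, 62, 2659)   # 01-20 Nov period
breakdown(1109, 349, 597, 131, 28)      # since the weekend
```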

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5920
Credit: 61,710,452
RAC: 17,476
Australia
Message 1310527 - Posted: 26 Nov 2012, 18:07:50 UTC - in response to Message 1310519.

So, it's actually been easier to get work here over the weekend than for several weeks past - there have been more errors in total, but the vast majority have been of the 20-second 'Connect' variety, instead of the 5-minute 'Timeout' variety. And with a 'connect' failure, the database knows nothing about your attempt, so no ghosts are created.

Things are certainly better than they were, but not as good as when the campus network was being used.
At least most of the time; we're back in one of those major network traffic dives, hoping yet again for another recovery.
____________
Grant
Darwin NT.

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,856,898
RAC: 16,111
Argentina
Message 1310529 - Posted: 26 Nov 2012, 18:08:46 UTC - in response to Message 1310519.

So, it's actually been easier to get work here over the weekend than for several weeks past - there have been more errors in total, but the vast majority have been of the 20-second 'Connect' variety, instead of the 5-minute 'Timeout' variety. And with a 'connect' failure, the database knows nothing about your attempt, so no ghosts are created.

Which supports my guess that we are overloading the scheduler with more requests than it is able to handle... (my rough estimate is that if 20% of the active hosts are trying to do an RPC every 6 minutes, the scheduler would need to be able to process about 120 RPCs per second).
I guess that no matter how many people spell BOINC as BIONIC, we don't have the technology... (nor the six million dollars)
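
A back-of-the-envelope version of that estimate. The active-host count is not stated in the post, so the 216,000 figure below is simply the value that reproduces the 120 RPC/s number; the 150,000-machine figure that appears later in the thread is shown for comparison:

```python
# Required scheduler request rate: hosts * fraction / interval.
def required_rps(active_hosts, fraction_requesting, interval_minutes):
    return active_hosts * fraction_requesting / (interval_minutes * 60)

# ~216,000 active hosts with 20% retrying every 6 minutes would need
# roughly the 120 requests/second mentioned above:
print(round(required_rps(216_000, 0.20, 6)))   # -> 120
# With 150,000 machines, the same arithmetic gives:
print(round(required_rps(150_000, 0.20, 6)))   # -> 83
```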
____________

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8767
Credit: 52,716,959
RAC: 16,436
United Kingdom
Message 1310536 - Posted: 26 Nov 2012, 18:27:43 UTC - in response to Message 1310527.

So, it's actually been easier to get work here over the weekend than for several weeks past - there have been more errors in total, but the vast majority have been of the 20-second 'Connect' variety, instead of the 5-minute 'Timeout' variety. And with a 'connect' failure, the database knows nothing about your attempt, so no ghosts are created.

Things are certainly better than they were, but not as good as when the campus network was being used.
At least most of the time; we're back in one of those major network traffic dives, hoping yet again for another recovery.

If you look carefully at the SSP, you'll see that the AP splitter processes have gone walkabout to Vader, Marvin, GeorgeM...

I think the traffic dive may be to do with reconfigurations in progress.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5920
Credit: 61,710,452
RAC: 17,476
Australia
Message 1310538 - Posted: 26 Nov 2012, 18:35:21 UTC - in response to Message 1310536.

So, it's actually been easier to get work here over the weekend than for several weeks past - there have been more errors in total, but the vast majority have been of the 20-second 'Connect' variety, instead of the 5-minute 'Timeout' variety. And with a 'connect' failure, the database knows nothing about your attempt, so no ghosts are created.

Things are certainly better than they were, but not as good as when the campus network was being used.
At least most of the time; we're back in one of those major network traffic dives, hoping yet again for another recovery.

If you look carefully at the SSP, you'll see that the AP splitter processes have gone walkabout to Vader, Marvin, GeorgeM...

I think the traffic dive may be to do with reconfigurations in progress.

Possibly; it's now bottomed out, which it hadn't done on the previous dives it recovered from.
____________
Grant
Darwin NT.

juan BFB
Project donor
Volunteer tester
Joined: 16 Mar 07
Posts: 5474
Credit: 313,442,786
RAC: 88,967
Brazil
Message 1310554 - Posted: 26 Nov 2012, 18:59:37 UTC

Finally they've started to do what we told them to do about 2 weeks ago.....

Unload some tasks from the poor overloaded Synergy!

At least now we see a light at the end of the tunnel...
____________

rob smith
Project donor
Volunteer tester
Joined: 7 Mar 03
Posts: 8755
Credit: 61,654,893
RAC: 33,301
United Kingdom
Message 1310556 - Posted: 26 Nov 2012, 19:03:26 UTC

It certainly looks as if some serious re-knitting is being done. Here's hoping it's a success.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

juan BFB
Project donor
Volunteer tester
Joined: 16 Mar 07
Posts: 5474
Credit: 313,442,786
RAC: 88,967
Brazil
Message 1310572 - Posted: 26 Nov 2012, 19:59:53 UTC - in response to Message 1310555.

Finally they've started to do what we told them to do about 2 weeks ago.....

Unload some tasks from the poor overloaded Synergy!

At least now we see a light at the end of the tunnel...

Let's just hope that it's not the light from another oncoming train....LOL.

Unless you are a Coyote you don't need to worry!

At least I hope... I'm tired of all this... At least I'm happy all hands report no problems from yesterday's power loss... now I'm going to drink a few cold beers...



____________

N9JFE David S
Project donor
Volunteer tester
Joined: 4 Oct 99
Posts: 12522
Credit: 14,826,344
RAC: 2,956
United States
Message 1310589 - Posted: 26 Nov 2012, 21:21:36 UTC - in response to Message 1310500.

What I think is that, besides the underlying issue, the current limits are not helping as they should.
My hosts, even the one with a Core 2 Duo and an old gt9500, are doing a request every five minutes, and I'm sure the same happens on a huge number of active hosts, so they've replaced a few long queries with a lot of short ones.

And, while the need for some kind of limit is beyond discussion, at least until they can solve the underlying issue, maybe a slightly higher limit could give a better trade-off between the duration of the queries and the number of them...

I never thought the current limit scheme made sense. To have any real effect on network traffic, wouldn't it have to limit tasks over a specific time period rather than just "at a time"?

I hate to say it, despite what the higher-crediting Einstein units are doing for my overall BOINC stats (and even with 99 Einstein units "aborted by user," which I certainly didn't do), but to me it's the only thing that will achieve the desired outcome.

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


N9JFE David S
Project donor
Volunteer tester
Joined: 4 Oct 99
Posts: 12522
Credit: 14,826,344
RAC: 2,956
United States
Message 1310591 - Posted: 26 Nov 2012, 21:23:22 UTC

The next big question is, why can't I get page 2 of this thread to load completely?

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4601
Credit: 121,650,407
RAC: 38,577
United States
Message 1310597 - Posted: 26 Nov 2012, 21:34:09 UTC - in response to Message 1310589.

What I think is that, besides the underlying issue, the current limits are not helping as they should.
My hosts, even the one with a Core 2 Duo and an old gt9500, are doing a request every five minutes, and I'm sure the same happens on a huge number of active hosts, so they've replaced a few long queries with a lot of short ones.

And, while the need for some kind of limit is beyond discussion, at least until they can solve the underlying issue, maybe a slightly higher limit could give a better trade-off between the duration of the queries and the number of them...

I never thought the current limit scheme made sense. To have any real effect on network traffic, wouldn't it have to limit tasks over a specific time period rather than just "at a time"?

I hate to say it, despite what the higher-crediting Einstein units are doing for my overall BOINC stats (and even with 99 Einstein units "aborted by user," which I certainly didn't do), but to me it's the only thing that will achieve the desired outcome.

They are not trying to affect the network traffic. The limits were put in place to reduce the size of the result and host tables, as they had grown so large that queries could no longer complete in a timely fashion.
Hopefully they will come up with a long-term solution, perhaps by breaking the tables up into sizes that can be searched quickly.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

dancer42
Volunteer tester
Joined: 2 Jun 02
Posts: 436
Credit: 1,160,474
RAC: 80
United States
Message 1310609 - Posted: 26 Nov 2012, 22:26:32 UTC

After reading what everyone has been saying here, I watched the log file overnight and noticed that the BOINC manager polled the scheduler several times an hour, far more often than would seem to make any sense. If, instead of polling 10 to 50 times an hour, a line could be added to the preferences to set a minimum number of units before reporting, we could then set that number higher when high traffic threatened to bring the scheduling server down. I am thinking the unnecessary polling from 150,000 machines would add a lot to the traffic problems. Does SETI really need to know every single time I start, finish or change anything?
____________

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,856,898
RAC: 16,111
Argentina
Message 1310616 - Posted: 26 Nov 2012, 22:54:39 UTC - in response to Message 1310609.

After reading what everyone has been saying here, I watched the log file overnight and noticed that the BOINC manager polled the scheduler several times an hour, far more often than would seem to make any sense. If, instead of polling 10 to 50 times an hour, a line could be added to the preferences to set a minimum number of units before reporting, we could then set that number higher when high traffic threatened to bring the scheduling server down. I am thinking the unnecessary polling from 150,000 machines would add a lot to the traffic problems. Does SETI really need to know every single time I start, finish or change anything?

The clients are not calling the scheduler to report every task; they are calling often because, due to the limits, their caches are not filled, so they keep trying to get more work to fill them. Of course, when contact is made the already-crunched units are reported, but that's just a side effect. Under normal circumstances, if there is no need for more work, the client will call the scheduler only once every 24 hours to report tasks.
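
A toy model of the behaviour described above, not the real BOINC client logic: the scheduler is contacted whenever the cache has a shortfall, finished tasks ride along on that request, and otherwise reporting waits for the long periodic interval (roughly daily in this description):

```python
from dataclasses import dataclass

@dataclass
class HostState:
    cache_shortfall_secs: float   # how much work the cache is missing
    unreported_tasks: int
    hours_since_last_contact: float

def should_contact_scheduler(s: HostState,
                             report_interval_hours: float = 24.0) -> str:
    # Work fetch dominates: a host kept short of work keeps asking,
    # and any finished tasks are reported as a side effect.
    if s.cache_shortfall_secs > 0:
        return "work fetch (finished tasks reported as a side effect)"
    # Otherwise reporting waits for the long periodic contact.
    if s.unreported_tasks and s.hours_since_last_contact >= report_interval_hours:
        return "periodic report only"
    return "no contact needed"

# A host kept short of work by the per-host limits keeps asking:
print(should_contact_scheduler(HostState(3600, 5, 0.1)))
# A host with a full cache just reports on the long interval:
print(should_contact_scheduler(HostState(0, 5, 25)))
```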
____________
