Panic Mode On (79) Server Problems?

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 7486
Credit: 91,083,099
RAC: 46,351
Australia
Message 1310428 - Posted: 26 Nov 2012, 8:56:18 UTC - in response to Message 1310421.  


Inbound & outbound traffic is plummeting. Hopefully it'll recover again like it's done twice previously.
*fingers crossed*


Grant
Darwin NT

ID: 1310428 · Report as offensive
musicplayer

Joined: 17 May 10
Posts: 1785
Credit: 842,842
RAC: 0
Message 1310436 - Posted: 26 Nov 2012, 9:29:36 UTC

And I got a new batch of jobs coming my way. Thanks!

ID: 1310436 · Report as offensive
tbret
Volunteer tester
Joined: 28 May 99
Posts: 3373
Credit: 248,474,108
RAC: 19,770
United States
Message 1310463 - Posted: 26 Nov 2012, 12:33:56 UTC - in response to Message 1310421.  



I have no doubt we'd find some new major problem sooner rather than later, but it would erase completely several existing ones.



It would be fun to play some other game for a while.

Let's try it and see!

ID: 1310463 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2871
Credit: 10,621,656
RAC: 330
United States
Message 1310487 - Posted: 26 Nov 2012, 15:22:48 UTC

I've been speculating for at least two years now that even if we were able to increase the pipe to just 200 Mbit, the scheduler/feeder might not be able to handle it. Even two years ago, when GPUs were much slower and less common, the database was having trouble keeping up. Of course, we have better hardware now, but I think it's significantly more likely we'll run into a software limitation, on top of the disk I/O limitation.

A bigger pipe will likely cause more issues without some sort of restraint (per-host limits are a simple way to do it, but there are better ways, like a server-side cache size based on DCF). It was a good idea to run the scheduler on a different link, as long as that can be made reliable. It will at least allow a high rate of successful contacts to report work and be assigned new work, and then you just have to fight for bandwidth on the download link, which in the grand scheme of things isn't that big an issue.

You wouldn't end up with ghosts, you'd just end up with 10+ hour back-offs, but you can overcome those with some manual intervention, or with less draconian exponential back-off calculations in the client itself.
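(For illustration, a minimal sketch of what a gentler client-side back-off could look like; the constants and the function name are mine, not BOINC's actual values.)

```python
import random

def next_backoff(n_failures, base=60, cap=3600):
    """A gentler exponential back-off: roughly double the wait after each
    failed scheduler contact, cap it at an hour instead of letting it grow
    to 10+ hours, and add jitter so thousands of hosts don't retry in
    lockstep. Constants are illustrative, not BOINC's actual values."""
    delay = min(cap, base * (2 ** n_failures))
    return delay * random.uniform(0.5, 1.0)

# After six consecutive failed contacts: somewhere between ~30 and ~60 minutes.
print(round(next_backoff(6)) // 60, "minutes")
```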

Maybe once the scheduler reliability issues get sorted out, we could test the database's ability to keep up over a 24-hour period by updating some DNS records and ramping the bandwidth up? Maybe pick a Saturday, or wait for the winter holidays coming up, when the campus will be empty except for a select few faculty members. We could do a 1-3 day test at 200+ Mbit then, assuming the red tape can be removed temporarily for such a thing.


Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)

ID: 1310487 · Report as offensive
Horacio

Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1310500 - Posted: 26 Nov 2012, 16:34:23 UTC

What I think is that, besides the underlying issue, the current limits are not helping as they should.
My hosts, even the one with a Core 2 Duo and an old GT9500, are making a request every five minutes, and I'm sure the same happens on a huge number of active hosts, so the limits have traded a few long queries for a lot of short ones.

And, while the need for some limit is beyond discussion, at least until they can solve the underlying issue, maybe a slightly higher limit would give a better trade-off between the duration of the queries and the number of them...


ID: 1310500 · Report as offensive
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6096
Credit: 155,215,515
RAC: 49,143
United States
Message 1310509 - Posted: 26 Nov 2012, 16:54:24 UTC - in response to Message 1308561.  

Quick script to provide the percentage of failures on your machine.


Thanks


Now we just need a small tweak to divide those into 'before tonight' and 'after tonight', so we know what effect Eric's changes have had.

At the moment this will work correctly only if stdoutdae.txt contains just the current month's information; otherwise, matching day information from other months & years will be included. Dealing with dates in a .bat file can be a pain. I might have to go with .vbs to make things easier.
I did update it to separate the dates & allow users to enter the number of days they wish to view.


If anyone wishes to use it you can find it here:
http://www.hal6000.com/seti/_com_check_full.txt
Right-click & save (depending on your browser), rename it to .bat, then place it in the folder where your stdoutdae.txt is located, or modify the script to point to the location of the file.

Change the number on the line "set check_days=" for the number of days you wish to view.

If you wish, you can remove the line "del sched_*-%computername%_*.txt" to save the daily information. You can then use this script, http://www.hal6000.com/seti/_com_check_day_calc.txt, to run each day separately by entering the date on the command line in YYYY-MM-DD format.
So entering _com_check_day_calc.bat 2012-11-26, with the sched_*-PCNAME_2012-11-26.txt files present, would give you that day's information.
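(For anyone not on Windows, a rough Python sketch of the same idea is below; the log phrases it matches and the date format are assumptions based on typical BOINC event-log wording, so adjust them to whatever your stdoutdae.txt actually contains.)

```python
# Hypothetical Python counterpart to the .bat script: tally scheduler
# successes and failure types per day from a BOINC stdoutdae.txt.
# The matched phrases and the date format are assumptions -- check your log.
import re
from collections import Counter, defaultdict

PATTERNS = {
    "success": "Scheduler request completed",
    "connect": "Couldn't connect to server",
    "peer":    "Failure when receiving data from the peer",
    "timeout": "Timeout was reached",
}

def tally(path):
    per_day = defaultdict(Counter)
    with open(path, errors="replace") as log:
        for line in log:
            m = re.match(r"\d{2}-\w{3}-\d{4}", line)   # e.g. 26-Nov-2012
            day = m.group(0) if m else "unknown"
            for key, phrase in PATTERNS.items():
                if phrase in line:
                    per_day[day][key] += 1
    return per_day

for day, counts in sorted(tally("stdoutdae.txt").items()):
    total = sum(counts.values()) or 1
    print(day, dict(counts), "success", 100 * counts["success"] // total, "%")
```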
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group today!

ID: 1310509 · Report as offensive
Richard Haselgrove (Project Donor)
Volunteer tester

Joined: 4 Jul 99
Posts: 11140
Credit: 83,744,708
RAC: 45,986
United Kingdom
Message 1310519 - Posted: 26 Nov 2012, 17:27:51 UTC - in response to Message 1310509.  

I've been aggregating the stdoutdae.txt files from five computers, and splitting them up into batches by date.

First, covering the period from 01 Nov to 20 Nov inclusive - near enough, from when the acute problems started, to when Eric switched to the Campus network last Wednesday.

Scheduler Requests: 10334
Scheduler Successes: 7341

Scheduler Failures (Connect): 221
Scheduler Failures (Peer data): 62
Scheduler Failures (Timeout): 2659

Scheduler Success: 71 %
Scheduler Failure: 28 %
Scheduler Connect: 7 % of failures
Scheduler Peer data: 2 % of failures
Scheduler Timeout: 88 % of failures

Then, on Saturday morning (Berkeley time), the scheduler came back online, but back on the old HE/PAIX IP address. Since then...

Scheduler Requests: 1109
Scheduler Successes: 349

Scheduler Failures (Connect): 597
Scheduler Failures (Peer data): 131
Scheduler Failures (Timeout): 28

Scheduler Success: 31 %
Scheduler Failure: 68 %
Scheduler Connect: 78 % of failures
Scheduler Peer data: 17 % of failures
Scheduler Timeout: 3 % of failures

So, it's actually been easier to get work here over the weekend than for several weeks past - there have been more errors in total, but the vast majority have been of the 20-second 'Connect' variety, instead of the 5-minute 'Timeout' variety. And with a 'connect' failure, the database knows nothing about your attempt, so no ghosts are created.

ID: 1310519 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 7486
Credit: 91,083,099
RAC: 46,351
Australia
Message 1310527 - Posted: 26 Nov 2012, 18:07:50 UTC - in response to Message 1310519.  

So, it's actually been easier to get work here over the weekend than for several weeks past - there have been more errors in total, but the vast majority have been of the 20-second 'Connect' variety, instead of the 5-minute 'Timeout' variety. And with a 'connect' failure, the database knows nothing about your attempt, so no ghosts are created.

Things are certainly better than they were, but not as good as when the campus network was being used.
At least most of the time; we're back in one of those major network traffic dives, hoping for yet another recovery.
Grant
Darwin NT

ID: 1310527 · Report as offensive
Horacio

Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1310529 - Posted: 26 Nov 2012, 18:08:46 UTC - in response to Message 1310519.  

So, it's actually been easier to get work here over the weekend than for several weeks past - there have been more errors in total, but the vast majority have been of the 20-second 'Connect' variety, instead of the 5-minute 'Timeout' variety. And with a 'connect' failure, the database knows nothing about your attempt, so no ghosts are created.

Which supports my guess that we are overloading the scheduler with more requests than it's able to handle... (my rough estimate is that if 20% of the active hosts are trying to do an RPC every 6 minutes, the scheduler would need to process about 120 RPCs per second).
I guess no matter how many people spell BOINC as BIONIC, we don't have the technology... (nor the six million dollars).
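(The arithmetic behind that estimate, assuming roughly 200,000 active hosts; that host count is my guess, not a project figure.)

```python
active_hosts = 200_000      # assumed figure -- the real count will differ
requesting   = 0.20         # 20% of hosts short of work
interval_s   = 6 * 60       # one scheduler RPC every 6 minutes each

print(active_hosts * requesting / interval_s)   # ~111 requests/second
```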

ID: 1310529 · Report as offensive
Richard Haselgrove (Project Donor)
Volunteer tester

Joined: 4 Jul 99
Posts: 11140
Credit: 83,744,708
RAC: 45,986
United Kingdom
Message 1310536 - Posted: 26 Nov 2012, 18:27:43 UTC - in response to Message 1310527.  

So, it's actually been easier to get work here over the weekend than for several weeks past - there have been more errors in total, but the vast majority have been of the 20-second 'Connect' variety, instead of the 5-minute 'Timeout' variety. And with a 'connect' failure, the database knows nothing about your attempt, so no ghosts are created.

Things are certainly better than they were, but not as good as when the campus network was being used.
At least most of the time; we're back in one of those major network traffic dives, hoping for yet another recovery.

If you look carefully at the SSP, you'll see that the AP splitter processes have gone walkabout to Vader, Marvin, GeorgeM...

I think the traffic dive may be to do with reconfigurations in progress.

ID: 1310536 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 7486
Credit: 91,083,099
RAC: 46,351
Australia
Message 1310538 - Posted: 26 Nov 2012, 18:35:21 UTC - in response to Message 1310536.  

So, it's actually been easier to get work here over the weekend than for several weeks past - there have been more errors in total, but the vast majority have been of the 20-second 'Connect' variety, instead of the 5-minute 'Timeout' variety. And with a 'connect' failure, the database knows nothing about your attempt, so no ghosts are created.

Things are certainly better than they were, but not as good as when the campus network was being used.
At least most of the time; we're back in one of those major network traffic dives, hoping for yet another recovery.

If you look carefully at the SSP, you'll see that the AP splitter processes have gone walkabout to Vader, Marvin, GeorgeM...

I think the traffic dive may be to do with reconfigurations in progress.

Possibly; it's now bottomed out, which it hadn't done on the previous dives it recovered from.
Grant
Darwin NT

ID: 1310538 · Report as offensive
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 5847
Credit: 330,548,657
RAC: 7,799
Panama
Message 1310554 - Posted: 26 Nov 2012, 18:59:37 UTC

Finally they have started to do what we told them to do about 2 weeks ago.....

Unload some tasks from the poor overloaded Synergy!

At least now we see a light at the end of the tunnel...


ID: 1310554 · Report as offensive
kittyman (Project Donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 45913
Credit: 815,184,750
RAC: 125,000
United States
Message 1310555 - Posted: 26 Nov 2012, 19:02:11 UTC - in response to Message 1310554.  

Finally they have started to do what we told them to do about 2 weeks ago.....

Unload some tasks from the poor overloaded Synergy!

At least now we see a light at the end of the tunnel...

Let's just hope that it's not the light from another oncoming train....LOL.
Cats.....what more does one need?

Have made friends in this life.
Most were cats.

ID: 1310555 · Report as offensive
rob smith (Project Donor)
Volunteer tester

Joined: 7 Mar 03
Posts: 13335
Credit: 154,704,489
RAC: 117,305
United Kingdom
Message 1310556 - Posted: 26 Nov 2012, 19:03:26 UTC

It certainly looks as if some serious re-knitting is being done. Here's to hoping it is a success.


Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

ID: 1310556 · Report as offensive
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 5847
Credit: 330,548,657
RAC: 7,799
Panama
Message 1310572 - Posted: 26 Nov 2012, 19:59:53 UTC - in response to Message 1310555.  

Finally they have started to do what we told them to do about 2 weeks ago.....

Unload some tasks from the poor overloaded Synergy!

At least now we see a light at the end of the tunnel...

Let's just hope that it's not the light from another oncoming train....LOL.

Unless you are a Coyote you don't need to worry!

At least I hope... I'm tired of all this... At least I'm happy all hands report no problems from yesterday's power loss... now I'm going to drink a few cold beers...



ID: 1310572 · Report as offensive
David S (Project Donor)
Volunteer tester
Joined: 4 Oct 99
Posts: 17042
Credit: 20,944,655
RAC: 6,137
United States
Message 1310589 - Posted: 26 Nov 2012, 21:21:36 UTC - in response to Message 1310500.  

What I think is that, besides the underlying issue, the current limits are not helping as they should.
My hosts, even the one with a Core 2 Duo and an old GT9500, are making a request every five minutes, and I'm sure the same happens on a huge number of active hosts, so the limits have traded a few long queries for a lot of short ones.

And, while the need for some limit is beyond discussion, at least until they can solve the underlying issue, maybe a slightly higher limit would give a better trade-off between the duration of the queries and the number of them...

I never thought the current limit scheme made sense. To have any real effect on network traffic, wouldn't it have to limit tasks over a specific time period rather than just "at a time"?

I hate to say it, despite what the higher-crediting Einstein units are doing for my overall BOINC stats (and even with 99 Einstein units "aborted by user," which I certainly didn't do), but to me it's the only thing that will achieve the desired outcome.
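(A toy sketch of the difference between the two kinds of limit; every name and number in it is invented, not the project's actual settings.)

```python
# Toy contrast between the two kinds of limit (all numbers invented).
IN_PROGRESS_CAP = 100    # hypothetical "100 tasks at a time"
DAILY_QUOTA     = 400    # hypothetical "400 tasks per host per day"

def may_send_work(tasks_in_progress, tasks_sent_today):
    at_a_time_ok = tasks_in_progress < IN_PROGRESS_CAP
    per_day_ok   = tasks_sent_today < DAILY_QUOTA
    return at_a_time_ok, per_day_ok

# A fast GPU host that clears its cache every hour stays under the
# "at a time" cap all day yet keeps asking; a per-day quota would
# eventually say no and actually cut the request traffic.
print(may_send_work(tasks_in_progress=5, tasks_sent_today=400))   # (True, False)
```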

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


ID: 1310589 · Report as offensive
David S (Project Donor)
Volunteer tester
Joined: 4 Oct 99
Posts: 17042
Credit: 20,944,655
RAC: 6,137
United States
Message 1310591 - Posted: 26 Nov 2012, 21:23:22 UTC

The next big question is, why can't I get page 2 of this thread to load completely?


David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


ID: 1310591 · Report as offensive
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6096
Credit: 155,215,515
RAC: 49,143
United States
Message 1310597 - Posted: 26 Nov 2012, 21:34:09 UTC - in response to Message 1310589.  

What I think is that, besides the underlying issue, the current limits are not helping as they should.
My hosts, even the one with a Core 2 Duo and an old GT9500, are making a request every five minutes, and I'm sure the same happens on a huge number of active hosts, so the limits have traded a few long queries for a lot of short ones.

And, while the need for some limit is beyond discussion, at least until they can solve the underlying issue, maybe a slightly higher limit would give a better trade-off between the duration of the queries and the number of them...

I never thought the current limit scheme made sense. To have any real effect on network traffic, wouldn't it have to limit tasks over a specific time period rather than just "at a time"?

I hate to say it, despite what the higher-crediting Einstein units are doing for my overall BOINC stats (and even with 99 Einstein units "aborted by user," which I certainly didn't do), but to me it's the only thing that will achieve the desired outcome.

They are not trying to affect the network traffic. The limits were put in place to reduce the size of the result and host tables, as they had grown too large for queries to complete in a timely fashion.
Hopefully they will come up with a long-term solution, perhaps by breaking up the tables into sizes that can be searched quickly.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group today!

ID: 1310597 · Report as offensive
dancer42 (Project Donor)
Volunteer tester

Joined: 2 Jun 02
Posts: 455
Credit: 2,283,606
RAC: 170
United States
Message 1310609 - Posted: 26 Nov 2012, 22:26:32 UTC

After reading what everyone has been saying here, I watched the log file overnight and noticed that the BOINC manager polled the scheduler several times an hour, far more often than would seem to make any sense. Instead of polling 10 to 50 times an hour, perhaps a line could be added to preferences to set a minimum number of units before reporting; we could then set that number higher when high traffic threatened to bring the scheduling server down. I am thinking the unnecessary polling from 150,000 machines adds a lot to the traffic problems. Does SETI really need to know every single time I start, finish, or change anything?


ID: 1310609 · Report as offensive
Horacio

Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1310616 - Posted: 26 Nov 2012, 22:54:39 UTC - in response to Message 1310609.  

After reading what everyone has been saying here, I watched the log file overnight and noticed that the BOINC manager polled the scheduler several times an hour, far more often than would seem to make any sense. Instead of polling 10 to 50 times an hour, perhaps a line could be added to preferences to set a minimum number of units before reporting; we could then set that number higher when high traffic threatened to bring the scheduling server down. I am thinking the unnecessary polling from 150,000 machines adds a lot to the traffic problems. Does SETI really need to know every single time I start, finish, or change anything?

The clients are not calling the scheduler to report every task; they are calling often because, due to the limits, their caches are not filled, so they keep trying to get more work to fill them. Of course, when contact is made the already-crunched units are reported, but that's just collateral. Under normal circumstances, if there is no need for more work, the client will call the scheduler only once every 24 hours to report tasks.
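(A much-simplified sketch of that decision logic; this is not the real BOINC client code, and the 24-hour report interval and the buffer comparison are taken from the description above.)

```python
# Simplified sketch of when a client contacts the scheduler -- not the
# real BOINC client logic, just the behaviour described above.
DAY = 24 * 3600

def next_action(cache_seconds, min_buffer_seconds,
                completed_tasks, seconds_since_last_report):
    if cache_seconds < min_buffer_seconds:
        # The cache can't fill because of the per-host limits, so this
        # branch fires again and again -- a contact every few minutes.
        return "ask for more work (finished tasks get reported as a side effect)"
    if completed_tasks and seconds_since_last_report >= DAY:
        return "report completed tasks only"
    return "stay quiet"

print(next_action(cache_seconds=600, min_buffer_seconds=86400,
                  completed_tasks=12, seconds_since_last_report=7200))
```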

ID: 1310616 · Report as offensive


 