Panic Mode On (8) Server problems

Message boards : Number crunching : Panic Mode On (8) Server problems

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 803247 - Posted: 29 Aug 2008, 22:56:57 UTC - in response to Message 803235.  

Seems to be a problem

Yep.
Ready to Send buffer is down to zero, and although the Server Status page says the Splitters are running, they're not producing any work.

Grant
Darwin NT
ID: 803247 · Report as offensive
Bert

Joined: 12 Oct 06
Posts: 84
Credit: 813,295
RAC: 0
United States
Message 803250 - Posted: 29 Aug 2008, 23:14:41 UTC - in response to Message 803247.  

Seems to be a problem

Yep.
Ready to Send buffer is down to zero, and although the Server Status page says the Splitters are running, they're not producing any work.


Friday just before a long weekend. We should be getting used to it.

I upped my queue to 6 extra days. Should keep me going if we gotta wait until Wednesday.
ID: 803250 · Report as offensive
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 803260 - Posted: 29 Aug 2008, 23:34:15 UTC


Oooopssss.. it looks like no new work until Tuesday..

Technical News : Sort-of Weekend Wrapup (Aug 28 2008)

ID: 803260 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 803332 - Posted: 30 Aug 2008, 5:03:17 UTC


Looks like someone gave it a kick- the splitters are splitting again & have been doing so for a while now. The Ready to Send buffer is slowly growing again.
Grant
Darwin NT
ID: 803332 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 803510 - Posted: 30 Aug 2008, 23:14:49 UTC


And now the Ready to Send buffer has dropped to 0 again.
Grant
Darwin NT
ID: 803510 · Report as offensive
Keith White
Joined: 29 May 99
Posts: 392
Credit: 13,035,233
RAC: 22
United States
Message 803517 - Posted: 31 Aug 2008, 0:10:23 UTC

We just hit another patch of short-duration workunits that fast optimized systems can crunch in 7-10 minutes each. Due to the short duration they also tend to go to the head of the queue, so it doesn't matter how many 10s or 100s of workunits the fast systems already have queued. The splitters simply can't keep up with the demand.

Server status shows the creation rate barely keeping ahead of the return rate at the 31 Aug 0Z snapshot.
"Life is just nature's way of keeping meat fresh." - The Doctor
ID: 803517 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 803532 - Posted: 31 Aug 2008, 1:12:27 UTC - in response to Message 803517.  

The splitters simply can't keep up with the demand.

Which indicates a problem somewhere in the system. The splitters can (and have) churned out 35+ Work Units per second for sustained periods. But for some reason they were stuck at around 10 or so for several hours & have only been doing 15 or so for the last hour & a bit. Usually once the RtS buffer drops by a few 1,000 they'll crank up the pace to 20 or more & maintain the buffer.
That hasn't happened today.
Grant
Darwin NT
ID: 803532 · Report as offensive
arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 803592 - Posted: 31 Aug 2008, 4:49:28 UTC

I think we are dealing with the lack of HD space again. With all the shorties being generated, the space fills up and then the splitters have to throttle back until some more space opens up.

ID: 803592 · Report as offensive
Keith White
Joined: 29 May 99
Posts: 392
Credit: 13,035,233
RAC: 22
United States
Message 803603 - Posted: 31 Aug 2008, 6:35:45 UTC
Last modified: 31 Aug 2008, 7:30:22 UTC

Which brings up the strange conundrum with having work unit queues.

On one hand, the client seeing a work unit with a short deadline would prioritize it if you are running with a queue of even modest length.

This adds to your turnaround time for those work units already in your queue.

This results in the work unit information and any already returned result taking up space longer on the servers.

The lack of space on the server caps new work unit creation.

However if you have a small queue, this will most likely mean you will run out of work.

Which in turn encourages members to increase their queue size to avoid running out of work.

Which leads to more space being used up on the servers.

Which caps the creation rate of new work units.

Rinse and repeat.

Oh, and I forgot another thing. Since validated work units and their results hang around for I think 24 hours before deletion, a run of short-duration workunits that are processed immediately will still take up more space due to the sheer number of work units that can get done in 24 hours. So even if everyone was running with virtually no queues, the servers will still clog up.

It's thermodynamics class all over again. You can't win, you can't break even, you can't get out of the game.
"Life is just nature's way of keeping meat fresh." - The Doctor
ID: 803603 · Report as offensive
Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9954
Credit: 103,452,613
RAC: 328
United Kingdom
Message 803635 - Posted: 31 Aug 2008, 10:25:38 UTC
Last modified: 31 Aug 2008, 10:26:30 UTC

I must be a bit slow here, but how exactly does Boinc decide which order to do work units in? On my quad I keep a 1 day queue, but currently work units with a date of the 6th or 7th of September are being left in favour of the 16th, 17th, 22nd and 23rd of September. Admittedly the earlier ones are 20 mins and the later ones about an hour; however, if I were to go on holiday today till the 8th of September (not a long time) then 54 units would have to be re-sent. Doesn't seem an efficient way of doing things.

I only run S@H by the way.

Bernie
ID: 803635 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 803642 - Posted: 31 Aug 2008, 11:06:10 UTC - in response to Message 803635.  
Last modified: 31 Aug 2008, 11:06:40 UTC

I must be a bit slow here, but how exactly does Boinc decide which order to do work units in? On my quad I keep a 1 day queue, but currently work units with a date of the 6th or 7th of September are being left in favour of the 16th, 17th, 22nd and 23rd of September. Admittedly the earlier ones are 20 mins and the later ones about an hour; however, if I were to go on holiday today till the 8th of September (not a long time) then 54 units would have to be re-sent. Doesn't seem an efficient way of doing things.

I only run S@H by the way.

Bernie

In the normal course of events, BOINC does the work in the order it was issued by Berkeley. You've identified yourself as being in the UK, so you must be familiar with the concept of a queue - first one to the bus-stop is the first to get on the bus, that sort of thing? That's how BOINC works.

Except - if tail-end Charlie is in danger of missing an important appointment (a deadline), she's allowed to jump the queue and get on the bus first. And how much danger she's in depends on how long the queue is. If the queue is one day, and the deadline is seven days, then there's no risk of missing deadline and no queue-jumping is allowed.

If you actually tell BOINC that you're going to go away on holiday.... well, you can't exactly, but you can say "I'm not going to connect to the internet again for another 10 days", then BOINC will rush through the tasks which have a shorter deadline than that. It'll also assume that you're not going to take your computer on holiday with you, and try to stock up on work it can do without contacting the internet for that length of time.
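
The queue-jumping rule described above can be sketched as a toy model. This is a simplified illustration, not the actual BOINC client scheduler: it assumes a single CPU core and known run times, and the names `Task` and `run_order` are hypothetical. Tasks run in download order unless plain first-in-first-out would make one miss its deadline, in which case the endangered tasks go first, earliest deadline first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    hours_needed: float    # estimated run time
    deadline_hours: float  # hours from now until the report deadline


def run_order(queue: List[Task]) -> List[str]:
    """Return task names in the order a simplified client would run them.

    `queue` is in download order. Toy model only, not real BOINC logic.
    """
    # Step 1: simulate plain first-in-first-out and flag tasks that would be late.
    elapsed = 0.0
    late_flags = []
    for task in queue:
        elapsed += task.hours_needed
        late_flags.append(elapsed > task.deadline_hours)

    # Step 2: endangered tasks jump the queue, earliest deadline first.
    at_risk = sorted(
        (t for t, late in zip(queue, late_flags) if late),
        key=lambda t: t.deadline_hours,
    )
    # Step 3: everything else keeps its download order.
    rest = [t for t, late in zip(queue, late_flags) if not late]
    return [t.name for t in at_risk + rest]


if __name__ == "__main__":
    # A short-deadline task downloaded behind 40 hours of other work.
    queue = [
        Task("long-1", 20, 500),
        Task("long-2", 20, 500),
        Task("short", 1, 30),
    ]
    print(run_order(queue))  # ['short', 'long-1', 'long-2']
```

With a one-day queue and seven-day deadlines nothing is flagged, so the download order is left untouched, which matches the bus-stop picture.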
ID: 803642 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 803643 - Posted: 31 Aug 2008, 11:09:26 UTC - in response to Message 803635.  

I must be a bit slow here, but how exactly does Boinc decide which order to do work units in?

It does them in the order in which they are downloaded.
If there is a chance of missing a deadline (due to an early deadline, or due to your connection settings, the amount of time Seti gets to run while the computer is on, or the number of hours the computer is actually on) then it will do those Work Units first, then go back to processing them in the order in which they were downloaded.
Grant
Darwin NT
ID: 803643 · Report as offensive
[B^S] madmac
Volunteer tester
Joined: 9 Feb 04
Posts: 1175
Credit: 4,754,897
RAC: 0
United Kingdom
Message 803644 - Posted: 31 Aug 2008, 11:14:22 UTC - in response to Message 803510.  


And now the Ready to Send buffer has dropped to 0 again.


It has dropped to 0 again after the Ready to Send went up
ID: 803644 · Report as offensive
Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9954
Credit: 103,452,613
RAC: 328
United Kingdom
Message 803646 - Posted: 31 Aug 2008, 11:40:16 UTC
Last modified: 31 Aug 2008, 11:50:49 UTC

I realise I can tell Boinc I am "going away for a few days", but I actively participate: I have 9 machines in various places and I keep tabs on what they are all doing. However, suppose my Quad belonged to "Joe Public", who just happened to think S@H was a good idea but took no active part. If today at midday UTC they shut down their PC till the 8th of September, 54 work units would default and have to be re-issued. That still doesn't seem the most efficient way to use resources. Most of my pending credit is exactly this: units that have not been returned by deadline and have to be re-issued.

I am not normally a "numbers hound" however I just noticed I am nearly at the 500,000 milestone and you start to notice things.

Bernie

PS. Oh dear:


ID: 803646 · Report as offensive
Ingleside
Volunteer developer

Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 803651 - Posted: 31 Aug 2008, 11:53:41 UTC - in response to Message 803603.  

Oh, and I forgot another thing. Since validated work units and their results hang around for I think 24 hours before deletion, a run of short-duration workunits that are processed immediately will still take up more space due to the sheer number of work units that can get done in 24 hours. So even if everyone was running with virtually no queues, the servers will still clog up.

Not quite correct. While there's a 24-hour delay so users can see the outcome of their results, this delay is 24 hours after the wu/result-files have been deleted from disk.

The way through the system is basically:
2nd. result reported -> Transitioner -> Validator -> Assimilator -> file_deleter -> Wait 24 hours before wu/result-info purged from database.

If there's no backlog, it normally takes only a couple of seconds from when the 2nd result is reported until the wu-file and result-files have been deleted. Well, as long as it passes validation, that is. ;)
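
As a rough illustration of that ordering, the sketch below walks one workunit through the stages named above and shows that the 24-hour clock starts only after the files are already gone. The stage names come from the post; the half-second per-stage delay and the function name `timeline` are made-up placeholders, not measured values or real BOINC server code.

```python
from datetime import datetime, timedelta

# Stage order as listed in the post above.
STAGES = ["transitioner", "validator", "assimilator", "file_deleter"]

# Hypothetical per-stage delay when there is no backlog (assumption).
HYPOTHETICAL_STAGE_DELAY = timedelta(seconds=0.5)

# From the post: db rows are purged 24 hours after the files are deleted.
DB_PURGE_DELAY = timedelta(hours=24)


def timeline(second_result_reported: datetime):
    """Return (files_deleted_at, db_rows_purged_at) for a workunit that validates."""
    t = second_result_reported
    for _stage in STAGES:
        t += HYPOTHETICAL_STAGE_DELAY
    files_deleted_at = t
    db_rows_purged_at = files_deleted_at + DB_PURGE_DELAY
    return files_deleted_at, db_rows_purged_at


if __name__ == "__main__":
    reported = datetime(2008, 8, 31, 12, 0, 0)
    files_gone, rows_gone = timeline(reported)
    print("files deleted after:", files_gone - reported)
    print("db rows purged after:", rows_gone - reported)
```

Under these assumptions the disk space is freed within seconds, and only the database rows linger for the extra day.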

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
ID: 803651 · Report as offensive
[B^S] madmac
Volunteer tester
Joined: 9 Feb 04
Posts: 1175
Credit: 4,754,897
RAC: 0
United Kingdom
Message 803783 - Posted: 31 Aug 2008, 19:44:19 UTC - in response to Message 803644.  


And now the Ready to Send buffer has dropped to 0 again.


It has dropped to 0 again after the Ready to Send went up


Back to zero again, it seems to be yo-yoing.
ID: 803783 · Report as offensive
Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 803789 - Posted: 31 Aug 2008, 20:11:50 UTC


. . . Database/file status

Item                                          Value          As of
Results ready to send                         0              40m
Current result creation rate                  14.23/sec      0m
Results out in the field                      3,452,221      40m
Results received in last hour                 56,568         0m
Result turnaround time (last hour average)    54.59 hours    0m
Results returned and awaiting validation      2,675,848      40m
Workunits waiting for validation              30             40m
Workunits waiting for assimilation            331            40m
Workunit files waiting for deletion           32             40m
Result files waiting for deletion             88             40m
Workunits waiting for db purging              592,515        40m
Results waiting for db purging                1,250,952      40m
Transitioner backlog (hours)                  0              0m
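
A back-of-envelope check on the snapshot above, as a sketch only. It assumes each returned result triggers roughly one request for a new one, that the figures are steady-state averages, and it ignores that some rows are 40 minutes older than others. The variable names are just for illustration.

```python
# Figures copied from the status snapshot above.
creation_rate_per_sec = 14.23
results_received_last_hour = 56_568
turnaround_hours = 54.59
results_in_field = 3_452_221
results_waiting_db_purge = 1_250_952

# Return rate in the same units as the creation rate.
return_rate_per_sec = results_received_last_hour / 3600

# If demand tracks returns, this is how far the splitters fall short.
shortfall_per_sec = return_rate_per_sec - creation_rate_per_sec

# Little's law: items in flight is roughly arrival rate times time in system.
estimated_in_field = results_received_last_hour * turnaround_hours

# How long results would have to sit to build the db purge backlog.
implied_purge_hours = results_waiting_db_purge / results_received_last_hour

print(f"return rate:         {return_rate_per_sec:.2f}/sec")
print(f"shortfall:           {shortfall_per_sec:.2f}/sec")
print(f"estimated in field:  {estimated_in_field:,.0f} (reported {results_in_field:,})")
print(f"implied purge delay: {implied_purge_hours:.1f} hours")
```

On these numbers results are coming back at about 15.7 per second against 14.23 per second being created, which fits an empty Ready to Send buffer. The in-field estimate comes out near 3.09 million against 3.45 million reported, and the purge backlog implies a hold of about 22 hours, in line with the 24-hour purge delay mentioned earlier in the thread.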

> 'ave plenty to crunch 'till somebody kicks the server again . . .
BOINC Wiki . . .

Science Status Page . . .
ID: 803789 · Report as offensive
Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9954
Credit: 103,452,613
RAC: 328
United Kingdom
Message 803843 - Posted: 31 Aug 2008, 23:35:19 UTC

This looks less than promising!!

Cricket
ID: 803843 · Report as offensive
Keith White
Joined: 29 May 99
Posts: 392
Credit: 13,035,233
RAC: 22
United States
Message 803844 - Posted: 31 Aug 2008, 23:46:28 UTC

I swear I'm cursed. I'm on dial-up so I have to manually contact the server to return results. And more times than I can count the servers are down or go wonky just as I try to upload my results.
"Life is just nature's way of keeping meat fresh." - The Doctor
ID: 803844 · Report as offensive
Keith White
Joined: 29 May 99
Posts: 392
Credit: 13,035,233
RAC: 22
United States
Message 803864 - Posted: 1 Sep 2008, 1:00:45 UTC

It looks to be back.

Cricket
"Life is just nature's way of keeping meat fresh." - The Doctor
ID: 803864 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.