The Server Issues / Outages Thread - Panic Mode On! (118)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118)

Miklos M.

Send message
Joined: 5 May 99
Posts: 955
Credit: 136,115,648
RAC: 73
Hungary
Message 2028231 - Posted: 18 Jan 2020, 0:04:44 UTC

Failing to send automatically again.
1/17/2020 7:02:41 PM | SETI@home | [sched_op] NVIDIA GPU work request: 643105.16 seconds; 0.00 devices
1/17/2020 7:02:42 PM | SETI@home | Scheduler request completed: got 0 new tasks
1/17/2020 7:02:42 PM | SETI@home | [sched_op] Server version 709
1/17/2020 7:02:42 PM | SETI@home | Project has no tasks available
1/17/2020 7:02:42 PM | SETI@home | Project requested delay of 303 seconds
1/17/2020 7:02:42 PM | SETI@home | [sched_op] Deferring communication for 00:05:03
1/17/2020 7:02:42 PM | SETI@home | [sched_op] Reason: requested by project
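For what it's worth, the backoff in the log above can be extracted mechanically. A minimal sketch, assuming the message format shown (which may differ between BOINC client versions):

```python
import re

# Parse the project-requested backoff out of a BOINC event-log line.
# The line below is copied from the log above; the format is assumed
# stable only for this client version.
line = "1/17/2020 7:02:42 PM | SETI@home | Project requested delay of 303 seconds"
match = re.search(r"Project requested delay of (\d+) seconds", line)
delay = int(match.group(1))
# 303 seconds is the 00:05:03 deferral the client reports.
print(f"{delay // 60:02d}:{delay % 60:02d}")
```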
ID: 2028231 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2028254 - Posted: 18 Jan 2020, 1:50:35 UTC - in response to Message 2028132.  

Nope. The splitters produce the actual work unit, this is then copied to send to each participating host.
Nothing is "sent" to the hosts. Hosts ask for work at their own pace, and that pace is independent of the rate at which the splitters can produce work, as long as the splitters are not the limiting bottleneck. The extra third copies will just wait in the rrts (Results ready to send) queue until the hosts can digest them.

If the splitters can produce the demanded amount of work faster, it just means they will run less often. Splitter load is reduced, and the server load past the splitting stage stays unchanged. But this is irrelevant anyway, because the third copy was never generated for all workunits: only if the result is an overflow is the third copy generated. Those are a small minority of all workunits unless an unusually noisy tape is being split.
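The producer/consumer relationship described above can be sketched with a toy simulation. All rates and the queue target are invented for illustration, not real SETI@home figures:

```python
from collections import deque

# Toy model of splitters feeding the rrts queue while hosts drain it.
# Rates and the queue target are invented for illustration only.
SPLITTER_RATE = 120   # results split per tick while the splitters run
HOST_DEMAND = 100     # results the hosts collectively request per tick
QUEUE_TARGET = 1000   # splitters pause once the queue reaches this level

rrts = deque()
produced = consumed = 0
for tick in range(100):
    # Splitters run only while the ready-to-send queue is below target.
    if len(rrts) < QUEUE_TARGET:
        rrts.extend(range(SPLITTER_RATE))
        produced += SPLITTER_RATE
    # Hosts fetch at their own pace, independent of the splitter rate.
    take = min(HOST_DEMAND, len(rrts))
    for _ in range(take):
        rrts.popleft()
    consumed += take

print(produced, consumed, len(rrts))
```

Once the queue hits the target, the splitters simply idle; the hosts' consumption rate is unchanged, which is the point being made above.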
ID: 2028254 · Report as offensive
Speedy
Volunteer tester
Avatar

Send message
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2028258 - Posted: 18 Jan 2020, 2:02:17 UTC - in response to Message 2028254.  

Nope. The splitters produce the actual work unit, this is then copied to send to each participating host.
Nothing is "sent" to the hosts. Hosts ask for work at their own pace, and that pace is independent of the rate at which the splitters can produce work, as long as the splitters are not the limiting bottleneck. The extra third copies will just wait in the rrts (Results ready to send) queue until the hosts can digest them.

If the splitters can produce the demanded amount of work faster, it just means they will run less often. Splitter load is reduced, and the server load past the splitting stage stays unchanged. But this is irrelevant anyway, because the third copy was never generated for all workunits: only if the result is an overflow is the third copy generated. Those are a small minority of all workunits unless an unusually noisy tape is being split.

Like 08ja09af, for example. I can't give you any references because the replica is over 29,000 seconds (8.05 hours) behind. For me the maximum runtime on an RTX 2070 was 1 minute 28 seconds; I think some others ran for 1 minute 18 seconds, and others ran for a matter of seconds.
ID: 2028258 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13766
Credit: 208,696,464
RAC: 304
Australia
Message 2028259 - Posted: 18 Jan 2020, 2:05:33 UTC - in response to Message 2028254.  

Nope. The splitters produce the actual work unit, this is then copied to send to each participating host.
Nothing is "sent" to the hosts.
Allocated to & downloaded by the hosts then.


But this is irrelevant anyway because the third copy was never generated for all work units.
There is never any third copy, nor even a second copy. There is only 1 copy of any given WU. It is downloaded as many times as required, but each download requires additional database entries to keep track of the systems the copies are sent to and of the returned results; and the more copies that are sent out, the longer the WU is in the system until it finally gets Validated or Errors out.
Hence the more times a WU is replicated (i.e. the _2, _3, _4 etc. on the end of the task name), the greater the load on the database.
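The one-workunit/many-results split can be sketched like this; the class, field names, and the workunit name are all hypothetical, not the real BOINC schema:

```python
# One workunit = one file on the download server; each copy sent to a
# host is a separate "result" row. Everything here is illustrative.
class Workunit:
    def __init__(self, name):
        self.name = name      # hypothetical workunit name
        self.results = []     # one entry per copy sent out

    def send_copy(self):
        # The replication suffix (_0, _1, _2, ...) grows with each copy.
        task_name = f"{self.name}_{len(self.results)}"
        self.results.append({"task": task_name, "state": "in progress"})
        return task_name

wu = Workunit("08ja09af.1234.5678")
print(wu.send_copy())  # _0 and _1 are the normal initial replication
print(wu.send_copy())
print(wu.send_copy())  # a _2 means a third host and a third database row
```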
Grant
Darwin NT
ID: 2028259 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2028260 - Posted: 18 Jan 2020, 2:06:16 UTC

I think one thing that could help the server load is if they could release a big bunch of Astropulse work. Astropulse tasks take many times longer to crunch, so if a bigger proportion of the tasks were Astropulse, the number of tasks 'in flight' at any given time would be lower.
ID: 2028260 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2028264 - Posted: 18 Jan 2020, 2:43:52 UTC - in response to Message 2028259.  

There is only 1 copy of any given WU. It is downloaded as many times as required
We are not talking about workunits (unique sets of data processed by multiple hosts) but about the actual tasks given to individual hosts. The term the SSP (Server Status Page) uses is 'result' (even when it is still an uncrunched task waiting to be sent to a host, not an actual crunched result yet). And since rrts stands for 'Results ready to send', this suggests the duplication is counted in that figure.

It is still a separate row in the database even when sharing the same file on the download server and database performance seems to be the issue here, not the disk capacity on the download servers.

One thing that seems weird is the very high ratio between 'Results returned and awaiting validation' and 'Workunits waiting for validation' on the ssp - currently about 6.4. Are those measuring different things despite their titles or is the average duplication really that high? Or is the database broken and containing millions of orphaned results without their parent workunits?
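The ratio itself is simple division; the counts below are illustrative stand-ins chosen to reproduce the ~6.4 figure, not the actual SSP values at the time:

```python
# Illustrative counts reproducing the ~6.4 ratio; not real SSP figures.
results_awaiting_validation = 9_600_000
workunits_waiting_for_validation = 1_500_000

ratio = results_awaiting_validation / workunits_waiting_for_validation
print(round(ratio, 1))

# If both counters really count parent workunits and their child results,
# the ratio is the average number of returned-but-unvalidated results per
# workunit, far above the 2 (or occasionally 3) copies normally issued,
# which is why a value this high looks suspicious.
```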
ID: 2028264 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13766
Credit: 208,696,464
RAC: 304
Australia
Message 2028271 - Posted: 18 Jan 2020, 4:28:01 UTC - in response to Message 2028264.  

There is only 1 copy of any given WU. It is downloaded as many times as required
We are not talking about workunits (unique sets of data processed by multiple hosts) but about the actual tasks given to individual hosts. The term the SSP (Server Status Page) uses is 'result' (even when it is still an uncrunched task waiting to be sent to a host, not an actual crunched result yet). And since rrts stands for 'Results ready to send', this suggests the duplication is counted in that figure.
The Results-ready-to-send are what we call Work Units.
Work Units are what are allocated & downloaded by each system (generally referred to as being sent) to process (they are also called tasks).


It is still a separate row in the database even when sharing the same file on the download server and database performance seems to be the issue here, not the disk capacity on the download servers.
No, it's not a disk capacity issue, it's a workload issue, and the more work the system has to keep track of, the greater the server problems are. So the more times a WU has to be processed before it's declared Valid, Invalid or an Error, the greater the load on the database.


One thing that seems weird is the very high ratio between 'Results returned and awaiting validation' and 'Workunits waiting for validation' on the ssp - currently about 6.4.
The number you are quoting isn't related to the 2 terms you quoted.

When things are working, 'Results returned and awaiting validation' is usually less than 'Results out in the field', quite a bit less.
And when things are working properly, 'Workunits waiting for validation' and 'Workunits waiting for assimilation' are effectively 0; they are handled as they occur, and you might occasionally see a dozen or so as a value there, but that's not very often.
When things are working properly...
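Those rules of thumb could be encoded as a quick sanity check over SSP-style counters. The field names and the threshold here are made up for illustration; they are not the real Server Status Page schema:

```python
# Hypothetical health check over SSP-style counters, encoding the rules
# of thumb above. Field names and the threshold are illustrative only.
def backlog_flags(ssp):
    flags = []
    # Normally 'returned and awaiting validation' sits well below
    # 'out in the field'.
    if ssp["results_awaiting_validation"] >= ssp["results_in_field"]:
        flags.append("validation backlog")
    # These two queues are normally drained as work arrives, so anything
    # beyond a dozen or so entries is suspicious.
    for queue in ("workunits_waiting_for_validation",
                  "workunits_waiting_for_assimilation"):
        if ssp[queue] > 100:
            flags.append(queue.replace("_", " ") + " backlog")
    return flags

print(backlog_flags({
    "results_awaiting_validation": 9_600_000,   # illustrative counts
    "results_in_field": 6_000_000,
    "workunits_waiting_for_validation": 1_500_000,
    "workunits_waiting_for_assimilation": 12,
}))
```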
Grant
Darwin NT
ID: 2028271 · Report as offensive
Speedy
Volunteer tester
Avatar

Send message
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2028275 - Posted: 18 Jan 2020, 5:18:14 UTC - in response to Message 2028260.  

I think one thing that could help the server load is if they could release a big bunch of Astropulse work. Astropulse tasks take many times longer to crunch, so if a bigger proportion of tasks would be them, then the number of tasks 'in flight' at any given time would be lower.

I agree this would certainly help, but it would only help while that work was being processed; then we would go back to being in the same situation, unless they were able to upgrade the servers to cope with the capacity.
ID: 2028275 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2028281 - Posted: 18 Jan 2020, 5:52:12 UTC - in response to Message 2028254.  

Nope. The splitters produce the actual work unit, this is then copied to send to each participating host.
Nothing is "sent" to the hosts. Hosts ask for work at their own pace, and that pace is independent of the rate at which the splitters can produce work, as long as the splitters are not the limiting bottleneck. The extra third copies will just wait in the rrts (Results ready to send) queue until the hosts can digest them.

If the splitters can produce the demanded amount of work faster, it just means they will run less often. Splitter load is reduced, and the server load past the splitting stage stays unchanged. But this is irrelevant anyway, because the third copy was never generated for all workunits: only if the result is an overflow is the third copy generated. Those are a small minority of all workunits unless an unusually noisy tape is being split.


. . I give up, have it your way, and enjoy your little world.

Stephen

:(
ID: 2028281 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13766
Credit: 208,696,464
RAC: 304
Australia
Message 2028283 - Posted: 18 Jan 2020, 6:10:21 UTC

And once again the splitters take a rest & we're out of work.
Grant
Darwin NT
ID: 2028283 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13766
Credit: 208,696,464
RAC: 304
Australia
Message 2028284 - Posted: 18 Jan 2020, 6:59:19 UTC

Even with the splitters shut down and the In-progress falling like a stone, the Validation, Assimilation & Deletion backlog remains, barely touched.
Grant
Darwin NT
ID: 2028284 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22265
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2028288 - Posted: 18 Jan 2020, 8:42:14 UTC

With the validators being very slow and picky it is very difficult to see how well my latest RPi is doing. My existing computers appear to be having tasks validated, while for the new one they are being ignored. But this may be down to the fact that the backup database is now over 8 hours behind the main one....
The backup is obviously running, as it is making progress, but not at one hour per hour. It's running down, so somebody needs to wind its clock up again.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2028288 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14656
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2028291 - Posted: 18 Jan 2020, 9:38:04 UTC
Last modified: 18 Jan 2020, 9:42:39 UTC

Although no new workunits are being split, I've been getting substantial numbers of replacement _2 tasks. That implies that the validators are working, and failing substantial numbers of matches (it may be that many of my _2s turn out to be overflows, and vanish in a flash).

Somebody mentioned 'initial replication'. I have a dim memory that we discussed this years ago, and found that the number should more accurately be called 'current replication' - the figure you see may not necessarily be the true initial number. It would be hard to check that until the databases sync up, and we can check newly-split work (ha!) in real time.

Edit - see Initial Replication of FOUR?? Any comments
ID: 2028291 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13766
Credit: 208,696,464
RAC: 304
Australia
Message 2028296 - Posted: 18 Jan 2020, 10:59:10 UTC
Last modified: 18 Jan 2020, 11:02:08 UTC

I see the splitters fired up for a short while there, then gave up again. The Validation backlog has dropped by a few hundred thousand (it needs to drop by a good 4 million). The Deletion backlog has almost cleared, but that's probably because the Assimilator backlog has increased.

Reducing the server side limits doesn't seem to have helped things much, if at all.
Grant
Darwin NT
ID: 2028296 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Avatar

Send message
Joined: 1 Apr 13
Posts: 1855
Credit: 268,616,081
RAC: 1,349
United States
Message 2028300 - Posted: 18 Jan 2020, 11:56:59 UTC

Back to Einstein already ...
ID: 2028300 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2028305 - Posted: 18 Jan 2020, 12:23:46 UTC - in response to Message 2028296.  
Last modified: 18 Jan 2020, 12:24:58 UTC

Reducing the server side limits doesn't seem to have helped things much, if at all.

The effect of reducing the limits will only be seen in days (or weeks), due to the way SETI works.
As you can see on the SSP, the total number of WUs is about 28-29 million, and the daily production of the entire SETIverse is about 3 million WUs. So even if all of those WUs were validated and cleared (something not realistic), it would take more than 4 days to bring this number down to a more manageable size (< 20 million WUs). That is one of the side effects of messing with the size of a DB: increasing is fast, decreasing is slow.
In the real world we can expect the changes to start making a difference by the end of this week, or sooner if some housekeeping on the DB can be done, maybe on the next maintenance outage day.
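The back-of-the-envelope numbers above can be checked directly. The net clearance rate here is an assumed figure, since the post only gives the gross daily production:

```python
# Rough drain-time estimate from the figures in the post. The net
# clearance rate is an assumption; the post only quotes ~3M/day gross.
total_results = 28_500_000         # ~28-29 million in the database
target = 20_000_000                # "more manageable" size from the post
net_clearance_per_day = 2_000_000  # assumed net shrink once limits bite

days = (total_results - target) / net_clearance_per_day
print(days)  # comes out above the "> 4 days" quoted in the post
```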
My 0.02
ID: 2028305 · Report as offensive
Profile Tom M
Volunteer tester

Send message
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 2028324 - Posted: 18 Jan 2020, 14:43:30 UTC

Just got told I had a stalled download after I did a manual update because I was running out of tasks on my most active box.

Retried the download. Waiting for the next scheduler request to kick in.

Sat 18 Jan 2020 08:42:40 AM CST | SETI@home | Scheduler request completed: got 0 new tasks


Maybe later. Alligator?

Tom
A proud member of the OFA (Old Farts Association).
ID: 2028324 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2028325 - Posted: 18 Jan 2020, 14:50:02 UTC - in response to Message 2028284.  

Even with the splitters shut down and the In-progress falling like a stone, the Validation, Assimilation & Deletion backlog remains, barely touched.
Validation backlog did shed a digit. It is under 10 million now.
ID: 2028325 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22265
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2028326 - Posted: 18 Jan 2020, 15:06:11 UTC

...and my latest RPi is now showing some credit
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2028326 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 2028329 - Posted: 18 Jan 2020, 15:41:04 UTC - in response to Message 2028324.  


Maybe later. Alligator?

Tom


After awhile Crocodile
ID: 2028329 · Report as offensive


©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.