The Server Issues / Outages Thread - Panic Mode On! (118)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118)


Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030536 - Posted: 2 Feb 2020, 15:51:20 UTC - in response to Message 2030529.  

Maybe it's time to start cutting the deadline of the WUs and making some changes in the way the work is distributed, like sending the resends to the fastest hosts to clear them ASAP.
Again NOT the fastest, but the ones with the shortest average turnaround time. A slow host with a tiny cache can return a result faster than a fast host with a huge spoofed cache.

One thing that could prevent this from happening again: if the system monitored the rate of overflows returned, then when any file being split exceeded some threshold, that file would be heavily throttled so that it continues being split but produces only a small percentage of all the workunits.

Or this could even happen without any monitoring if the different splitters split different files instead of all bunching up on the same file. So if some file (or a few files) produced an overflow storm, the storm would be diluted by all the other splitters splitting clean files. But I don't know how this would affect the splitter performance. Spreading out could be faster or slower than bunching up.
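The throttling idea could be sketched roughly like this (purely illustrative - the class, threshold, and file names are my own, not anything from the real SETI@home splitter code):

```python
from collections import deque

class SplitterThrottle:
    """Track the recent overflow rate per tape file and throttle
    files whose rate exceeds a threshold (hypothetical sketch)."""

    def __init__(self, threshold=0.5, window=1000, throttled_share=0.05):
        self.threshold = threshold              # overflow fraction that triggers throttling
        self.window = window                    # number of recent results to consider
        self.throttled_share = throttled_share  # share of output a noisy file may still produce
        self.history = {}                       # file -> deque of recent bools (True = overflow)

    def record_result(self, tape_file, was_overflow):
        h = self.history.setdefault(tape_file, deque(maxlen=self.window))
        h.append(was_overflow)

    def overflow_rate(self, tape_file):
        h = self.history.get(tape_file)
        if not h:
            return 0.0
        return sum(h) / len(h)

    def allowed_share(self, tape_file):
        """Fraction of new workunits this file may contribute."""
        if self.overflow_rate(tape_file) > self.threshold:
            return self.throttled_share
        return 1.0
```

A file that keeps producing overflows would then be limited to a trickle while the other splitters' clean files dilute the storm.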
ID: 2030536
juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2030560 - Posted: 2 Feb 2020, 21:43:58 UTC - in response to Message 2030536.  
Last modified: 2 Feb 2020, 21:44:42 UTC

Again NOT the fastest, but the ones with the shortest average turnaround time. A slow host with a tiny cache can return a result faster than a fast host with a huge spoofed cache.

Sorry, the meaning was lost in translation. For me the fastest hosts are the ones with the shortest average turnaround time (less than 1 day). They could clear the WUs in very little time and help reduce the DB size. Obviously the WUs must be sent with a very short deadline (less than 3 days in this case).

The way it is done now, sending the WUs to any host (with a long deadline), just makes the DB size problem even worse.
ID: 2030560
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2030566 - Posted: 2 Feb 2020, 21:58:36 UTC - in response to Message 2030508.  

I think BoincTasks can do that, as well.

I agree. It would be good if boinc tasks or another piece of software could push short tasks to the front of the queue. Does anybody know of any software that does this?
ID: 2030566
Cruncher-American (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 2030584 - Posted: 2 Feb 2020, 23:16:33 UTC

Better solution: if you can detect short tasks without running them, why not just abort them?
Can Boinc Tasks do this? Could the servers?
ID: 2030584
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19048
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2030586 - Posted: 2 Feb 2020, 23:38:11 UTC - in response to Message 2030584.  
Last modified: 3 Feb 2020, 0:23:35 UTC

Better solution: if you can detect short tasks without running them, why not just abort them?
Can Boinc Tasks do this? Could the servers?

The only known way is to run them, if only for a short time: the time it takes on a 2060-class GPU or better for a noise bomb to be -9ed.
We don't know how many tasks are sent/day but we do know how many are returned/hr.

Average tasks returned per hour × 24 × short time on GPU (s) / 86400 (s in a day) = GPUs needed
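That back-of-the-envelope formula as a small sketch (the inputs are illustrative placeholders, not actual project statistics):

```python
def gpus_needed(tasks_per_hour, seconds_per_task_on_gpu):
    """Estimate how many GPUs it would take to pre-screen every task.

    tasks_per_hour: average tasks returned per hour (a proxy for tasks sent).
    seconds_per_task_on_gpu: time for a noisy task to -9 overflow on a fast GPU.
    """
    tasks_per_day = tasks_per_hour * 24
    gpu_seconds_per_day = tasks_per_day * seconds_per_task_on_gpu
    return gpu_seconds_per_day / 86400  # seconds in a day

# e.g. at 3600 tasks/hr and 24 s per task:
# gpus_needed(3600, 24) -> 24.0
```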
ID: 2030586
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 2030593 - Posted: 3 Feb 2020, 0:20:11 UTC

Sun 02 Feb 2020 06:16:57 PM CST | SETI@home | Scheduler request completed: got 92 new tasks

Yum! Something to crunch ;)

Tom
A proud member of the OFA (Old Farts Association).
ID: 2030593
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030611 - Posted: 3 Feb 2020, 6:44:56 UTC

Looks like more trouble. About 30 minutes ago the Website got very Slow and the Scheduler checked out;
Mon Feb 3 01:08:50 2020 | SETI@home | [sched_op] Starting scheduler request
Mon Feb 3 01:10:47 2020 | SETI@home | Scheduler request failed: HTTP internal server error
Mon Feb 3 01:10:47 2020 | SETI@home | [sched_op] Reason: Scheduler request failed
Mon Feb 3 01:13:08 2020 | SETI@home | Sending scheduler request: To report completed tasks.
Mon Feb 3 01:14:23 2020 | SETI@home | Scheduler request failed: Couldn't connect to server
Mon Feb 3 01:22:01 2020 | SETI@home | [sched_op] Starting scheduler request
Mon Feb 3 01:23:15 2020 | SETI@home | Scheduler request failed: Failure when receiving data from the peer
Mon Feb 3 01:23:15 2020 | SETI@home | [sched_op] Reason: Scheduler request failed
Mon Feb 3 01:34:15 2020 | SETI@home | [sched_op] Starting scheduler request
Mon Feb 3 01:36:57 2020 | SETI@home | Scheduler request failed: HTTP internal server error
Mon Feb 3 01:36:57 2020 | SETI@home | [sched_op] Reason: Scheduler request failed
Just when everything was working well...
ID: 2030611
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 2030612 - Posted: 3 Feb 2020, 7:09:49 UTC

Well, of all the problems I was expecting to occur, the Scheduler going MIA wasn't one of them.

And it appears it might have just come back to life - no longer timing out, or HTTP errors, or failure when receiving data from the peer (I think every possible error has been given at some stage).
Now it's back to "Project has no tasks available", but at least I can report everything that's accumulated since the Scheduler went AWOL earlier.
Grant
Darwin NT
ID: 2030612
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030615 - Posted: 3 Feb 2020, 7:31:36 UTC

Looks like the validators have been MIA too, not just the scheduler. The first successful scheduler contact made my RAC drop lower than the lowest point yesterday at the end of the dry period.
ID: 2030615
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030616 - Posted: 3 Feb 2020, 7:36:40 UTC

A few machines are starting to get Downloads again. Hopefully this will blow over quickly.
ID: 2030616
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 2030617 - Posted: 3 Feb 2020, 7:46:51 UTC - in response to Message 2030615.  

Looks like the validators have been MIA too, not just the scheduler. The first successful scheduler contact made my RAC drop lower than the lowest point yesterday at the end of the dry period.
For a while there things were improving (steadily if slowly), but all the new work going out has caused the Validation backlog to increase again.
Grant
Darwin NT
ID: 2030617
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030619 - Posted: 3 Feb 2020, 7:55:46 UTC - in response to Message 2030617.  
Last modified: 3 Feb 2020, 8:55:29 UTC

For a while there things were improving (steadily if slowly), but all the new work going out has caused the Validation backlog to increase again.
The assimilation backlog was reducing until two SSP updates ago. But on the last two updates it too has grown bigger.

Here are the cumulative result counts for the last few days:

[image: stacked area plot of the cumulative result counts]

Each plotted value is the sum of that value plus all the values below it, so the width of the band between a line and the one below it represents the value of that specific variable. The plots show that database purging was primarily responsible for the database size reduction, and when the database ran out of purgeable results, the total result count started increasing again.

The results waiting for assimilation are an estimate because the SSP doesn't report them separately. The estimate is based on two assumptions: that those results are counted as waiting for validation on the SSP, and that the average replication (number of results per workunit) is 2.2.

The numbers on x-axis are days of February.
ID: 2030619
Cruncher-American (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 2030627 - Posted: 3 Feb 2020, 10:20:58 UTC - in response to Message 2030586.  

I agree. It would be good if boinc tasks or another piece of software could push short tasks to the front of the queue. Does anybody know of any software that does this?


Then how could any other piece of s/w do this...just asking for a friend.
ID: 2030627
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030629 - Posted: 3 Feb 2020, 10:42:32 UTC - in response to Message 2030627.  

I agree. It would be good if boinc tasks or another piece of software could push short tasks to the front of the queue. Does anybody know of any software that does this?
Then how could any other piece of s/w do this...just asking for a friend.
Unfortunately, can't be done - consistently, at any rate.

That's what we're here for - finding the signals in the noise. The only way to do that is to run SETI's own software.

There are occasions when a whole group of tasks is 'similar' - like the recent run of BLC35 tasks. But it wasn't 100%, and there were tasks in there that needed running. The best we can hope for is that the powers that be provide enough workers in the SETI@Home labs to manage the tape splitting process more closely, so that when one of these self-similar groups appears, they can respond by distributing it gradually, amongst other types of work.
ID: 2030629
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 2030634 - Posted: 3 Feb 2020, 12:00:57 UTC

I got up this morning and my Windows 10 box had shut down for some reason or other. When it does that I have to turn off the PSU before things will "reset" and then up it comes.

Got this when everything was up again:
2/3/2020 5:51:36 AM | SETI@home | Scheduler request completed: got 150 new tasks


Tom
A proud member of the OFA (Old Farts Association).
ID: 2030634
BetelgeuseFive Project Donor
Volunteer tester

Joined: 6 Jul 99
Posts: 158
Credit: 17,117,787
RAC: 19
Netherlands
Message 2030636 - Posted: 3 Feb 2020, 12:16:02 UTC - in response to Message 2030629.  

I agree. It would be good if boinc tasks or another piece of software could push short tasks to the front of the queue. Does anybody know of any software that does this?
Then how could any other piece of s/w do this...just asking for a friend.
Unfortunately, can't be done - consistently, at any rate.

That's what we're here for - finding the signals in the noise. The only way to do that is to run SETI's own software.

There are occasions when a whole group of tasks is 'similar' - like the recent run of BLC35 tasks. But it wasn't 100%, and there were tasks in there that needed running. The best we can hope for is that the powers that be provide enough workers in the SETI@Home labs to manage the tape splitting process more closely, so that when one of these self-similar groups appears, they can respond by distributing it gradually, amongst other types of work.


But it should be possible to move resends to the top of the queue (or at least it used to be, when all tasks were sent out as pairs: anything with a _2 or higher should be a resend).
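The "_2 or higher means resend" heuristic is easy to express in code. A sketch (it assumes result names end in an _N replica suffix, as described above):

```python
def is_resend(task_name):
    """Heuristic from the post above: with an initial replication of 2,
    a result name ending in _2 or higher is a resend."""
    try:
        replica = int(task_name.rsplit('_', 1)[1])
    except (IndexError, ValueError):
        return False  # no numeric suffix: assume not a resend
    return replica >= 2
```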

Tom
ID: 2030636
Profile Retvari Zoltan

Joined: 28 Apr 00
Posts: 35
Credit: 128,746,856
RAC: 230
Hungary
Message 2030638 - Posted: 3 Feb 2020, 12:47:23 UTC - in response to Message 2030277.  

My Inconclusive results are going up too, even though I've only had a handful of Tasks since last night. Last night I had a large number of Inconclusive results that said 'minimum quorum 1' and only listed a single Inconclusive host. I didn't see how a single Inconclusive host task could ever validate. Now, it's very difficult to bring up my Inconclusive tasks lists, but, it seems those tasks are now listed as; https://setiathome.berkeley.edu/workunit.php?wuid=3862758806
minimum quorum 1
initial replication 3
   Task    Computer            Sent                  Time reported                 Status        Runtime CPUtime Credit             Application
8495599283  1473578  31 Jan 2020, 5:02:48 UTC  31 Jan 2020, 21:47:15 UTC  Completed and validated  15.36  12.61   3.59  SETI@home v8 v8.20 (opencl_ati5_mac) x86_64-apple-darwin
8498611906  6796479   1 Feb 2020, 3:00:50 UTC   1 Feb 2020, 4:00:03 UTC   Completed and validated   4.10   1.93   3.59  SETI@home v8 v8.11 (cuda42_mac) x86_64-apple-darwin
8498669733  8673543   1 Feb 2020, 4:01:52 UTC   1 Feb 2020, 5:29:49 UTC   Completed and validated  15.11  13.09   3.59  SETI@home v8 v8.22 (opencl_nvidia_SoG)
So, the single hosts are now triple hosts, but they are still just sitting there, with a number of them showing one or two 'Completed, waiting for validation' hosts, and some with one or two Inconclusive hosts.
I have a couple of invalid tasks with minimum quorum = 1. Perhaps I have a lot of valid tasks as well with min.q.=1, but they are much harder to spot.
https://setiathome.berkeley.edu/workunit.php?wuid=3861384942
https://setiathome.berkeley.edu/workunit.php?wuid=3861339403
https://setiathome.berkeley.edu/workunit.php?wuid=3861247650
https://setiathome.berkeley.edu/workunit.php?wuid=3861247545
and so on...
https://setiathome.berkeley.edu/results.php?userid=5276&offset=0&show_names=0&state=5&appid=
ID: 2030638
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030639 - Posted: 3 Feb 2020, 12:52:34 UTC - in response to Message 2030636.  

But it should be possible to move resends to the top of the queue (or at least it used to be, when all tasks were sent out as pairs: anything with a _2 or higher should be a resend).
I don't think this is easy for an external tool to do, except perhaps by modifying the deadlines of the tasks in client_state.xml to trick BOINC into processing them in a hurry.

If you modified the BOINC client itself, then you could change the rules it uses to pick the next task to crunch, to make it prioritize _2s and higher over _0s and _1s.
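As a sketch, that rule change could amount to little more than a sort key like this (illustrative only - the real client's task selection logic is more involved):

```python
def queue_order(tasks):
    """Sort a list of result names so resends (_2 and higher) come first.

    Python's sort is stable, so FIFO order is preserved within
    each group (resends vs. first-issue tasks)."""
    def replica(name):
        try:
            return int(name.rsplit('_', 1)[1])
        except (IndexError, ValueError):
            return 0
    return sorted(tasks, key=lambda name: 0 if replica(name) >= 2 else 1)
```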
ID: 2030639
juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2030640 - Posted: 3 Feb 2020, 13:13:03 UTC - in response to Message 2030639.  
Last modified: 3 Feb 2020, 13:15:25 UTC

Or...

Instead of modifying the client itself, which is not recommended because the devs constantly release new updates to it, you could build an external app like the rescheduler.

But instead of rescheduling WUs between GPU and CPU, you could rearrange the FIFO order in which the WUs are crunched, so they will be crunched in the order you choose - any order. Obviously only until panic mode is triggered by the client.

The question could be: why would you need to do that? Keep your WU cache sized so your host crunches all its WUs within a day and you will help clear the DB fast.
ID: 2030640
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030643 - Posted: 3 Feb 2020, 13:34:33 UTC - in response to Message 2030640.  

But instead of rescheduling WUs between GPU and CPU, you could rearrange the FIFO order in which the WUs are crunched, so they will be crunched in the order you choose - any order.
Does the order in which the results are listed in client_state.xml count? There's no field for queue position, so if the physical order doesn't count, then the only way to do this would be faking the deadlines or receive times.

Hacking the client would have the advantage that you wouldn't then need to periodically stop and restart the client to edit the client_state.xml. Every restart makes you lose on average 2.5 minutes of CPU progress and half a task of GPU progress.
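A hedged sketch of the deadline-faking approach: it assumes client_state.xml parses as XML and that <result> elements carry <name> and <report_deadline> children, as in stock BOINC clients. Verify against your own file first, and only ever edit it with the client stopped:

```python
import xml.etree.ElementTree as ET

def bump_resend_deadlines(xml_text, new_deadline):
    """Shorten the report deadline of resend tasks (_2 and higher)
    so the client schedules them first. Returns the modified XML."""
    root = ET.fromstring(xml_text)
    for result in root.iter('result'):
        name = result.findtext('name', default='')
        suffix = name.rsplit('_', 1)[-1]
        if suffix.isdigit() and int(suffix) >= 2:
            dl = result.find('report_deadline')
            # Only ever move deadlines earlier, never later
            if dl is not None and float(dl.text) > new_deadline:
                dl.text = str(new_deadline)
    return ET.tostring(root, encoding='unicode')
```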
ID: 2030643



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.