The Server Issues / Outages Thread - Panic Mode On! (118)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118)


Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030536 - Posted: 2 Feb 2020, 15:51:20 UTC - in response to Message 2030529.  

Maybe it's time to start cutting the deadline of the WUs and making some changes in the way the work is distributed, like sending the resends to the fastest hosts to clear them ASAP.
Again NOT the fastest, but the ones with the shortest average turnaround time. A slow host with a tiny cache can return a result faster than a fast host with a huge spoofed cache.

One thing that could prevent this from happening again: if the system monitored the rate of overflows returned, then when any file being split exceeded some threshold, that file would be heavily throttled so that it continues being split but produces only a small percentage of all the workunits.

Or this could even happen without any monitoring if the different splitters split different files instead of all bunching up on the same file. So if some file (or a few files) produced an overflow storm, the storm would be diluted by all the other splitters splitting clean files. But I don't know how this would affect the splitter performance. Spreading out could be faster or slower than bunching up.
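The throttling idea could be sketched roughly like this (purely illustrative - the class, threshold, and file names are my own, not anything from the real SETI@home splitter code):

```python
from collections import deque

class SplitterThrottle:
    """Track the recent overflow rate per tape file and throttle
    files whose rate exceeds a threshold (hypothetical sketch)."""

    def __init__(self, threshold=0.5, window=1000, throttled_share=0.05):
        self.threshold = threshold              # overflow fraction that triggers throttling
        self.window = window                    # number of recent results to consider
        self.throttled_share = throttled_share  # share of output a noisy file may still produce
        self.history = {}                       # file -> deque of recent bools (True = overflow)

    def record_result(self, tape_file, was_overflow):
        h = self.history.setdefault(tape_file, deque(maxlen=self.window))
        h.append(was_overflow)

    def overflow_rate(self, tape_file):
        h = self.history.get(tape_file)
        if not h:
            return 0.0
        return sum(h) / len(h)

    def allowed_share(self, tape_file):
        """Fraction of new workunits this file may contribute."""
        if self.overflow_rate(tape_file) > self.threshold:
            return self.throttled_share
        return 1.0
```

A file that keeps producing overflows would then be limited to a trickle while the other splitters' clean files dilute the storm.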
ID: 2030536
juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2030560 - Posted: 2 Feb 2020, 21:43:58 UTC - in response to Message 2030536.  
Last modified: 2 Feb 2020, 21:44:42 UTC

Again NOT the fastest, but the ones with the shortest average turnaround time. A slow host with a tiny cache can return a result faster than a fast host with a huge spoofed cache.

Sorry, the meaning was lost in translation. For me the fastest hosts are the ones with the shortest average turnaround time (less than 1 day). They could clear the WUs in very little time and help reduce the DB size. Obviously the WUs must be sent with a very short deadline (less than 3 days in this case).

The way it is done now, sending the WUs to any host (with a long deadline), just makes the DB size problem even worse.
ID: 2030560
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2030566 - Posted: 2 Feb 2020, 21:58:36 UTC - in response to Message 2030508.  

I think BoincTasks can do that, as well.

I agree. It would be good if boinc tasks or another piece of software could push short tasks to the front of the queue. Does anybody know of any software that does this?
ID: 2030566
Cruncher-American (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 2030584 - Posted: 2 Feb 2020, 23:16:33 UTC

Better solution: if you can detect short tasks without running them, why not just abort them?
Can Boinc Tasks do this? Could the servers?
ID: 2030584
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19048
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2030586 - Posted: 2 Feb 2020, 23:38:11 UTC - in response to Message 2030584.  
Last modified: 3 Feb 2020, 0:23:35 UTC

Better solution: if you can detect short tasks without running them, why not just abort them?
Can Boinc Tasks do this? Could the servers?

The only known way is to run them, if only for a short time: the time it takes on a 2060-class GPU or better for a noise bomb to be -9ed.
We don't know how many tasks are sent/day but we do know how many are returned/hr.

Average tasks returned per hour × 24 × short time on GPU (s) / 86400 (s in a day) = GPUs needed
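That back-of-the-envelope formula as a small sketch (the inputs are illustrative placeholders, not actual project statistics):

```python
def gpus_needed(tasks_per_hour, seconds_per_task_on_gpu):
    """Estimate how many GPUs it would take to pre-screen every task.

    tasks_per_hour: average tasks returned per hour (a proxy for tasks sent).
    seconds_per_task_on_gpu: time for a noisy task to -9 overflow on a fast GPU.
    """
    tasks_per_day = tasks_per_hour * 24
    gpu_seconds_per_day = tasks_per_day * seconds_per_task_on_gpu
    return gpu_seconds_per_day / 86400  # seconds in a day

# e.g. at 3600 tasks/hr and 24 s per task:
# gpus_needed(3600, 24) -> 24.0
```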
ID: 2030586
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 2030593 - Posted: 3 Feb 2020, 0:20:11 UTC

Sun 02 Feb 2020 06:16:57 PM CST | SETI@home | Scheduler request completed: got 92 new tasks

Yum! Something to crunch ;)

Tom
A proud member of the OFA (Old Farts Association).
ID: 2030593
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030611 - Posted: 3 Feb 2020, 6:44:56 UTC

Looks like more trouble. About 30 minutes ago the Website got very Slow and the Scheduler checked out;
Mon Feb 3 01:08:50 2020 | SETI@home | [sched_op] Starting scheduler request
Mon Feb 3 01:10:47 2020 | SETI@home | Scheduler request failed: HTTP internal server error
Mon Feb 3 01:10:47 2020 | SETI@home | [sched_op] Reason: Scheduler request failed
Mon Feb 3 01:13:08 2020 | SETI@home | Sending scheduler request: To report completed tasks.
Mon Feb 3 01:14:23 2020 | SETI@home | Scheduler request failed: Couldn't connect to server
Mon Feb 3 01:22:01 2020 | SETI@home | [sched_op] Starting scheduler request
Mon Feb 3 01:23:15 2020 | SETI@home | Scheduler request failed: Failure when receiving data from the peer
Mon Feb 3 01:23:15 2020 | SETI@home | [sched_op] Reason: Scheduler request failed
Mon Feb 3 01:34:15 2020 | SETI@home | [sched_op] Starting scheduler request
Mon Feb 3 01:36:57 2020 | SETI@home | Scheduler request failed: HTTP internal server error
Mon Feb 3 01:36:57 2020 | SETI@home | [sched_op] Reason: Scheduler request failed
Just when everything was working well...
ID: 2030611
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 2030612 - Posted: 3 Feb 2020, 7:09:49 UTC

Well, of all the problems I was expecting to occur, the Scheduler going MIA wasn't one of them.

And it appears it might have just come back to life - no longer timing out, or HTTP errors, or failure when receiving data from the peer (I think every possible error has been given at some stage).
Now it's back to "Project has no tasks available", but at least I can report everything that's accumulated since the Scheduler went AWOL earlier.
Grant
Darwin NT
ID: 2030612
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030615 - Posted: 3 Feb 2020, 7:31:36 UTC

Looks like the validators have been MIA too, not just the scheduler. The first successful scheduler contact made my RAC drop lower than the lowest point yesterday at the end of the dry period.
ID: 2030615
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030616 - Posted: 3 Feb 2020, 7:36:40 UTC

A few machines are starting to get Downloads again. Hopefully this will blow over quickly.
ID: 2030616
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 2030617 - Posted: 3 Feb 2020, 7:46:51 UTC - in response to Message 2030615.  

Looks like the validators have been MIA too, not just the scheduler. The first successful scheduler contact made my RAC drop lower than the lowest point yesterday at the end of the dry period.
For a while there things were improving (steadily if slowly), but all the new work going out has caused the Validation backlog to increase again.
Grant
Darwin NT
ID: 2030617
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030619 - Posted: 3 Feb 2020, 7:55:46 UTC - in response to Message 2030617.  
Last modified: 3 Feb 2020, 8:55:29 UTC

For a while there things were improving (steadily if slowly), but all the new work going out has caused the Validation backlog to increase again.
The assimilation backlog was reducing until two SSP updates ago. But on the last two updates it too has grown bigger.

Here are the cumulative result counts for the last few days:

[image: stacked area plot of the cumulative result counts]

Each plotted value is the sum of that value plus all the values below it, so the width of the band between a line and the one below it represents the value of that specific variable. The plots show that database purging was primarily responsible for the database size reduction, and when the database ran out of purgeable results, the total result count started increasing again.

The results waiting for assimilation are an estimate because the SSP doesn't report them separately. The estimate is based on two assumptions: that those results are counted as waiting for validation on the SSP, and that the average replication (number of results per workunit) is 2.2.

The numbers on x-axis are days of February.
ID: 2030619
Cruncher-American (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 2030627 - Posted: 3 Feb 2020, 10:20:58 UTC - in response to Message 2030586.  

I agree. It would be good if boinc tasks or another piece of software could push short tasks to the front of the queue. Does anybody know of any software that does this?


Then how could any other piece of s/w do this...just asking for a friend.
ID: 2030627
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030629 - Posted: 3 Feb 2020, 10:42:32 UTC - in response to Message 2030627.  

I agree. It would be good if boinc tasks or another piece of software could push short tasks to the front of the queue. Does anybody know of any software that does this?
Then how could any other piece of s/w do this...just asking for a friend.
Unfortunately, can't be done - consistently, at any rate.

That's what we're here for - finding the signals in the noise. The only way to do that is to run SETI's own software.

There are occasions when a whole group of tasks is 'similar' - like the recent run of BLC35 tasks. But it wasn't 100%, and there were tasks in there that needed running. The best we can hope for is that the powers that be provide enough workers in the SETI@Home labs to manage the tape splitting process more closely, so that when one of these self-similar groups appears, they can respond by distributing it gradually, amongst other types of work.
ID: 2030629
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 2030634 - Posted: 3 Feb 2020, 12:00:57 UTC

I got up this morning and my Windows 10 box had shut down for some reason or other. When it does that I have to turn off the PSU before things will "reset" and then up it comes.

Got this when everything was up again:
2/3/2020 5:51:36 AM | SETI@home | Scheduler request completed: got 150 new tasks


Tom
A proud member of the OFA (Old Farts Association).
ID: 2030634
BetelgeuseFive Project Donor
Volunteer tester

Joined: 6 Jul 99
Posts: 158
Credit: 17,117,787
RAC: 19
Netherlands
Message 2030636 - Posted: 3 Feb 2020, 12:16:02 UTC - in response to Message 2030629.  

I agree. It would be good if boinc tasks or another piece of software could push short tasks to the front of the queue. Does anybody know of any software that does this?
Then how could any other piece of s/w do this...just asking for a friend.
Unfortunately, can't be done - consistently, at any rate.

That's what we're here for - finding the signals in the noise. The only way to do that is to run SETI's own software.

There are occasions when a whole group of tasks is 'similar' - like the recent run of BLC35 tasks. But it wasn't 100%, and there were tasks in there that needed running. The best we can hope for is that the powers that be provide enough workers in the SETI@Home labs to manage the tape splitting process more closely, so that when one of these self-similar groups appears, they can respond by distributing it gradually, amongst other types of work.


But it should be possible to move resends to the top of the queue (or at least it used to be, when all tasks were sent out as pairs: anything with a _2 or higher should be a resend).
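The "_2 or higher means resend" heuristic is easy to express in code. A sketch (it assumes result names end in an _N replica suffix, as described above):

```python
def is_resend(task_name):
    """Heuristic from the post above: with an initial replication of 2,
    a result name ending in _2 or higher is a resend."""
    try:
        replica = int(task_name.rsplit('_', 1)[1])
    except (IndexError, ValueError):
        return False  # no numeric suffix: assume not a resend
    return replica >= 2
```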

Tom
ID: 2030636
Profile Retvari Zoltan

Joined: 28 Apr 00
Posts: 35
Credit: 128,746,856
RAC: 230
Hungary
Message 2030638 - Posted: 3 Feb 2020, 12:47:23 UTC - in response to Message 2030277.  

My Inconclusive results are going up too, even though I've only had a handful of Tasks since last night. Last night I had a large number of Inconclusive results that said 'minimum quorum 1' and only listed a single Inconclusive host. I didn't see how a single Inconclusive host task could ever validate. Now, it's very difficult to bring up my Inconclusive tasks lists, but, it seems those tasks are now listed as; https://setiathome.berkeley.edu/workunit.php?wuid=3862758806
minimum quorum 1
initial replication 3
   Task    Computer            Sent                  Time reported                 Status        Runtime CPUtime Credit             Application
8495599283  1473578  31 Jan 2020, 5:02:48 UTC  31 Jan 2020, 21:47:15 UTC  Completed and validated  15.36  12.61   3.59  SETI@home v8 v8.20 (opencl_ati5_mac) x86_64-apple-darwin
8498611906  6796479   1 Feb 2020, 3:00:50 UTC   1 Feb 2020, 4:00:03 UTC   Completed and validated   4.10   1.93   3.59  SETI@home v8 v8.11 (cuda42_mac) x86_64-apple-darwin
8498669733  8673543   1 Feb 2020, 4:01:52 UTC   1 Feb 2020, 5:29:49 UTC   Completed and validated  15.11  13.09   3.59  SETI@home v8 v8.22 (opencl_nvidia_SoG)
So, the single hosts are now triple hosts, but they are still just sitting there, with a number of them showing one or two 'Completed, waiting for validation' hosts, and some with one or two Inconclusive hosts.
I have a couple of invalid tasks with minimum quorum = 1. Perhaps I have a lot of valid tasks as well with min.q.=1, but they are much harder to spot.
https://setiathome.berkeley.edu/workunit.php?wuid=3861384942
https://setiathome.berkeley.edu/workunit.php?wuid=3861339403
https://setiathome.berkeley.edu/workunit.php?wuid=3861247650
https://setiathome.berkeley.edu/workunit.php?wuid=3861247545
and so on...
https://setiathome.berkeley.edu/results.php?userid=5276&offset=0&show_names=0&state=5&appid=
ID: 2030638
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030639 - Posted: 3 Feb 2020, 12:52:34 UTC - in response to Message 2030636.  

But it should be possible to move resends to the top of the queue (or at least it used to be, when all tasks were sent out as pairs: anything with a _2 or higher should be a resend).
I don't think this is easy for an external tool to do, except perhaps by modifying the deadlines of the tasks in client_state.xml to trick BOINC into processing them in a hurry.

If you modified the BOINC client itself, then you could change the rules it uses to pick the next task to crunch, to make it prioritize _2s and higher over _0s and _1s.
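As a sketch, that rule change could amount to little more than a sort key like this (illustrative only - the real client's task selection logic is more involved):

```python
def queue_order(tasks):
    """Sort a list of result names so resends (_2 and higher) come first.

    Python's sort is stable, so FIFO order is preserved within
    each group (resends vs. first-issue tasks)."""
    def replica(name):
        try:
            return int(name.rsplit('_', 1)[1])
        except (IndexError, ValueError):
            return 0
    return sorted(tasks, key=lambda name: 0 if replica(name) >= 2 else 1)
```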
ID: 2030639
juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2030640 - Posted: 3 Feb 2020, 13:13:03 UTC - in response to Message 2030639.  
Last modified: 3 Feb 2020, 13:15:25 UTC

Or...

Instead of modifying the client itself, which is not recommended because the devs constantly release new updates to it, you could build an external app like the rescheduler.

But instead of rescheduling WUs between GPU and CPU, you could rearrange the FIFO order in which the WUs are crunched, so they will be crunched in the order you choose - any order. Obviously only until panic mode is triggered by the client.

The question could be: why would you need to do that? Keep your WU cache sized so your host crunches all its WUs within a day and you will help clear the DB fast.
ID: 2030640
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030643 - Posted: 3 Feb 2020, 13:34:33 UTC - in response to Message 2030640.  

But instead of rescheduling WUs between GPU and CPU, you could rearrange the FIFO order in which the WUs are crunched, so they will be crunched in the order you choose - any order.
Does the order in which the results are listed in client_state.xml count? There's no field for queue position, so if the physical order doesn't count, then the only way to do this would be faking the deadlines or receive times.

Hacking the client would have the advantage that you wouldn't then need to periodically stop and restart the client to edit the client_state.xml. Every restart makes you lose on average 2.5 minutes of CPU progress and half a task of GPU progress.
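A hedged sketch of the deadline-faking approach: it assumes client_state.xml parses as XML and that <result> elements carry <name> and <report_deadline> children, as in stock BOINC clients. Verify against your own file first, and only ever edit it with the client stopped:

```python
import xml.etree.ElementTree as ET

def bump_resend_deadlines(xml_text, new_deadline):
    """Shorten the report deadline of resend tasks (_2 and higher)
    so the client schedules them first. Returns the modified XML."""
    root = ET.fromstring(xml_text)
    for result in root.iter('result'):
        name = result.findtext('name', default='')
        suffix = name.rsplit('_', 1)[-1]
        if suffix.isdigit() and int(suffix) >= 2:
            dl = result.find('report_deadline')
            # Only ever move deadlines earlier, never later
            if dl is not None and float(dl.text) > new_deadline:
                dl.text = str(new_deadline)
    return ET.tostring(root, encoding='unicode')
```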
ID: 2030643



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.