Panic Mode On (114) Server Problems?

Author	Message
Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1969701 - Posted: 10 Dec 2018, 2:14:10 UTC - in response to Message 1969695. has anyone else noticed that it doesnt seem like credit is being awarded, or it's being awarded VERY slowly. we've been back up and running all day, but RAC numbers are still in nosedive. i mean, i see validation numbers going up, but credit totals arent. It's normal. RAC nosedives immediately after the project hiccups. But it takes a week for RAC to reverse course and climb back to what is normal levels. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1969701 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1969713 - Posted: 10 Dec 2018, 4:56:05 UTC - in response to Message 1969695. has anyone else noticed that it doesnt seem like credit is being awarded, or it's being awarded VERY slowly. we've been back up and running all day, but RAC numbers are still in nosedive. i mean, i see validation numbers going up, but credit totals arent. . . Likewise. Mine are still going down as well. Stephen :( ID: 1969713 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1969725 - Posted: 10 Dec 2018, 7:44:51 UTC Last modified: 10 Dec 2018, 7:45:18 UTC I had been thinking of what would help people get work sooner after long outages such as this. The most work a system can get is it's Cache setting, or the 100 WU server side limit (CPU/per GPU)- whichever is lower. The feeder can only supply 200 WUs at a time. At present, when you ask for work & have a big debt, the Scheduler may allocate up to 53 WUs (I've read some others can get more). That works out to supplying work to only 4 systems from that particular group from the feeder. What if the Scheduler allocated work based on the system's ability to process it, and the BOINC Manager's 5 minute and a bit backoff after a Scheduler request? An extreme cruncher- 4 GPUs, 30s to process a WU. So a bit over 10 WUs per GPU in 5minutes. So give let it have up to 50WUs on a request. The complete opposite- an Arm based cruncher. Takes a day to process a WU. Is it going to run out of work in the next 6 hours? If yes, give it a WU. If not, then it doesn't need any more work at this stage. Mid range system, a couple of GTX 1060s running Windows. 7min to process a WU. Is it going to run out of work in the next 5min? If yes, give it 4 WUs. If not, give it 2 WUs. Based on the mid range system, giving it enough work to keep it going for a couple of Scheduler requests even if it doesn't get any work on the very next request it'll still have some to keep it busy. And with the 200 WU Feeder limit, that one feeder group is enough to keep 50 mid range systems occupied for over 10 minutes (2 Scheduler requests) or 100 systems if they've already got some work, as opposed to the current system that would only be able to supply 4 systems with work. And the enabling & disabling of this limiter could be done automatically based on the amount of work presently in a system's cache, and it's application turnaround time. If most systems requesting work have no work, or their time to having no work is 25% or less of it's usual turnaround time, enable the limiter to help get a everyone crunching again. If it's better than 50%, disable the limiter & allocate work as it is done now. Grant Darwin NT ID: 1969725 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1969730 - Posted: 10 Dec 2018, 9:00:38 UTC - in response to Message 1969725. The feeder can only supply 200 WUs at a time. At present, when you ask for work & have a big debt, the Scheduler may allocate up to 53 WUs (I've read some others can get more). That works out to supplying work to only 4 systems from that particular group from the feeder. What if the Scheduler allocated work based on the system's ability to process it, and the BOINC Manager's 5 minute and a bit backoff after a Scheduler request? ... I'd be careful about making assumptions like that: 10/12/2018 08:39:10 \| SETI@home \| Scheduler request completed: got 161 new tasks Work fetch, and the scheduler's response, is based on time (seconds of work requested and allocated) - that one was 10/12/2018 08:39:06 \| SETI@home \| [sched_op] NVIDIA GPU work request: 303982.25 seconds; 0.00 devices 10/12/2018 08:39:10 \| SETI@home \| [sched_op] estimated total NVIDIA GPU task duration: 133545 seconds I manage my normal requests to be 10/12/2018 08:01:39 \| SETI@home \| [sched_op] NVIDIA GPU work request: 3989.27 seconds; 0.00 devices 10/12/2018 08:01:42 \| SETI@home \| [sched_op] estimated total NVIDIA GPU task duration: 4148 seconds or thereabouts. There will be enormous resistance to adding significant extra complexity to the server code: it's a critical component, one slip would crash the project, and it's hideously complicated already. There might be some efficiency savings by considering a maximum of, say, 50 tasks for any one request: there are a lot of tests to be made on each individual task (you can't be your own wingmate - have you had a task from this WU before? And so on.), so speeding that up might allow more requests to be processed. But I'd hate to be the systems analyst tasked with optimising that. ID: 1969730 ·

Brent Norman Volunteer tester Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835	Message 1969732 - Posted: 10 Dec 2018, 9:35:25 UTC It's been pretty clear that the scheduler assignments/limits aren't going to change. But I think the easiest would be to make the 100/device a variable. and base it on cache size. PerDevice = 100*FilesInCache/CacheLimit+1 All of which the scheduler 'should' know already without any database lookups of how fast you are compared to others, etc. Everything gets a little at a time, enough to keep the timers going - which the server probably prefers the backoffs, lol. Sure the CPU and slower devices would fill first, but it would be fair. ID: 1969732 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1969733 - Posted: 10 Dec 2018, 9:35:51 UTC - in response to Message 1969730. Last modified: 10 Dec 2018, 9:41:11 UTC There will be enormous resistance to adding significant extra complexity to the server code: it's a critical component, one slip would crash the project, and it's hideously complicated already. I figured that would be the situation- the added complexity to something that's already beyond complex would be a insurmountable issue to it's implementation. Edit- likewise changes to the Scheduler to allow the BOINC Manager to reschedule work between computing resources according to the users preferences (or their manual efforts based on unfathomable ideas). But it would be nice if it were possible to give enough work to each system, based on it's abilities, to keep them busy for a couple of Scheduler requests and so get everyone processing again ASAP and build up their caches faster than is presently the case after even minor outages, let alone longer ones such as we just had. Grant Darwin NT ID: 1969733 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 1969750 - Posted: 10 Dec 2018, 12:22:26 UTC I agree, mess with the servers code could be catastrophic but need to be made sooner or latter. With the advent of even faster GPU's who process the WU on the 30 secs range the 100 WU limit is simply unthinkable , it holds for less than an hour. This type of GPU could produce close to 2800 WU/day and even faster ones are ready to deploy their power in the next year. Few people needs on this days a cache greater than 1 day since high speed internet access is almost universal different from 18 years ago when we use dialing connections, and connect once a week only. LOL So a new server code who take that in mind, look the host daily production and send enough WU for a complete day of work, instead of a fixed 100 WU limit will make us pass most of the outages, scheduled or no. Sure that not handle a catastrophic outage who will take days to fix, but that is out the "normal project life". If that change happening, slower host who crunch few WU per day, will receive a lot less than 100 WU of course, but the question is: Do they really need that large quantity of WU or a 10 days cache? By doing that the size of the DB will be keep in the safe range and all host will be feeded as needed by it's real contribution to the project. ID: 1969750 ·

Tom M Volunteer tester Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462	Message 1969759 - Posted: 10 Dec 2018, 13:50:17 UTC Seems like from a complexity of code etc it would be a lot easier/safer to simply increase the cpu limit from 100 tasks to a higher limit. This would avoid tinkering with the code except for the one (I hope) location where the 100 task limit is located. The question is, would this slow down over-all production of the Seti ecology, speed it up or would it be neutral? Might want to run that change on Beta first. Tom A proud member of the OFA (Old Farts Association). ID: 1969759 ·

JohnDK Volunteer tester Send message Joined: 28 May 00 Posts: 1222 Credit: 451,243,443 RAC: 1,127	Message 1969773 - Posted: 10 Dec 2018, 15:28:38 UTC Getting Project has no tasks available for the last 25 mins or so... ID: 1969773 ·

Unixchick Send message Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22	Message 1969774 - Posted: 10 Dec 2018, 15:31:16 UTC - in response to Message 1969773. Getting Project has no tasks available for the last 25 mins or so... yup, looks like something is wrong. Early stage of PANIC !!! ID: 1969774 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 1969777 - Posted: 10 Dec 2018, 15:40:11 UTC - in response to Message 1969759. Last modified: 10 Dec 2018, 15:42:33 UTC Seems like from a complexity of code etc it would be a lot easier/safer to simply increase the cpu limit from 100 tasks to a higher limit. This would avoid tinkering with the code except for the one (I hope) location where the 100 task limit is located. Could be easier but that is against the safe of the DB itself, the main reason why the 100 WU limit exist. Change for 200 for example could double the size of it because most of the ones who don't need that rise will obtain the extra WU too. And that makes almost no difference for the ones who has powerful hosts who really needs more WU to pass the outages. That's why i suggest the limit to be related to the actual host production. Simple a host who produces 1000 WU/day has a cache size limited to 1000, one who produces 5000 WU/Day has a 5000 WU cache size, one who produce 5 WU/day receive a 5 WU cache and etc. Not related to the number of GPU's or CPU's the host has, related to the daily production. You want a bigger cache size, simple make your host produce more WU/day. Nothing complicated to code. To the moderator, if possible, please migrate this conversation to a new thread. ID: 1969777 ·

JohnDK Volunteer tester Send message Joined: 28 May 00 Posts: 1222 Credit: 451,243,443 RAC: 1,127	Message 1969778 - Posted: 10 Dec 2018, 15:44:20 UTC - in response to Message 1969774. Getting Project has no tasks available for the last 25 mins or so... yup, looks like something is wrong. Early stage of PANIC !!! Panic maybe over, getting tasks again. ID: 1969778 ·

Unixchick Send message Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22	Message 1969781 - Posted: 10 Dec 2018, 16:06:51 UTC - in response to Message 1969778. Getting Project has no tasks available for the last 25 mins or so... yup, looks like something is wrong. Early stage of PANIC !!! Panic maybe over, getting tasks again. guess it was just a hiccup. - ID: 1969781 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1969782 - Posted: 10 Dec 2018, 16:25:24 UTC Perhaps this helps? 10/12/2018 16:23:21 \| SETI@home \| Scheduler request completed: got 11 new tasks ID: 1969782 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1969796 - Posted: 10 Dec 2018, 18:39:08 UTC Since then, we have entered a phase where all splitters are offline, but are continuing to run. I'm not quite sure where the output is going, >/dev/nul comes to mind. The science database shows 'disabled', and RTS is falling fast. Apologies for my earlier optimism. ID: 1969796 ·

Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13	Message 1969805 - Posted: 10 Dec 2018, 20:16:11 UTC - in response to Message 1969628. I've also been thinking about the preference to hand out work to those machines that already have work, as someone noted. Could it be that the machines that had work after such a long outage, were asking for a smaller amount of work at a time and thus got service earlier? Maybe if the empty machines asked for a smaller amount of work to start off would they then get some sooner? I know this is from yesterday, but I believe this has already been noticed for years now. I remember back when everything moved to the co-lo, if I had an absolutely empty cache, it was near impossible to get work because I was asking for 2.8M seconds of work. But if I dropped the preference down from 10.00 days to 0.1 days, to make it into about 50,000 seconds, nearly the first request after doing so would yield some tasks, THEN I could just punch it back to 10.00 days and walk away and come back an hour later and have a full cache/task limit. I don't know if it is actually coded in this way, or if it is an unexpected behavior, but if you ask for TOO MUCH work while being empty, it won't give you anything, but once you have something onboard, there doesn't seem to be a "too much" number as long as you have more than ~5 tasks. Just my anecdotal observations. I believe it was also Richard that pointed that one out to me several years ago, so I tried it and it worked. YMMV Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) ID: 1969805 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1969812 - Posted: 10 Dec 2018, 20:51:59 UTC - in response to Message 1969805. One possible explanation of that is the client request behaviour. 1) If the client requests work, and gets none, it goes into (increasing) backoffs 2) If the client requests work, and gets some, it's free to ask again 3) If the client completes an allocated task, all backoffs are cleared, and it can report the completed task immediately and request more work at the same time. Collectively, these mean that if you're completely dry, it takes ages to get started. But if you have a few dregs to complete, or you get a few tasks at an early request, you have a much better chance of asking, and asking, and asking, until you get more. ID: 1969812 ·

Unixchick Send message Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22	Message 1969825 - Posted: 10 Dec 2018, 21:39:28 UTC Last modified: 10 Dec 2018, 21:41:26 UTC Less than an hour until the RTS is empty. I'm guessing someone is working on it as the science db is running. I have no idea what the splitters are doing. edit: looks like the splitters are having problems (or bad data) as there are "channels with errors" ID: 1969825 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1969833 - Posted: 10 Dec 2018, 22:28:30 UTC - in response to Message 1969805. I remember back when everything moved to the co-lo, if I had an absolutely empty cache, it was near impossible to get work because I was asking for 2.8M seconds of work. But if I dropped the preference down from 10.00 days to 0.1 days, to make it into about 50,000 seconds, nearly the first request after doing so would yield some tasks, THEN I could just punch it back to 10.00 days and walk away and come back an hour later and have a full cache/task limit. . . Maybe an easier and safer point of attack would be to reduce the maximum work request from 10 days to say 5? Stephen ? ? ID: 1969833 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1969835 - Posted: 10 Dec 2018, 22:31:14 UTC - in response to Message 1969812. One possible explanation of that is the client request behaviour. 1) If the client requests work, and gets none, it goes into (increasing) backoffs 2) If the client requests work, and gets some, it's free to ask again 3) If the client completes an allocated task, all backoffs are cleared, and it can report the completed task immediately and request more work at the same time. Collectively, these mean that if you're completely dry, it takes ages to get started. But if you have a few dregs to complete, or you get a few tasks at an early request, you have a much better chance of asking, and asking, and asking, until you get more. . . Except that I have found that manual requests for work in an extended backoff period still results in "No tasks available" and increases the backoff timer. Stephen :( ID: 1969835 ·

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.