Panic Mode On (104) Server Problems?

Author	Message
Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489	Message 1844260 - Posted: 25 Jan 2017, 6:57:15 UTC After the outrage, we had a 3hr power outage here due to storms, but everything is now fully loaded. Cheers. ID: 1844260 ·

Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489	Message 1844379 - Posted: 25 Jan 2017, 22:39:15 UTC - in response to Message 1844186. Hal is correct and this has happened before, several times in fact over the years, but thankfully I have none of those here this time. Cheers. Oh yes you have, and plenty of them too. Just one example from one of your computers: https://setiathome.berkeley.edu/workunit.php?wuid=2349156167 That one has been in limbo since Dec 6. You have many more.... I should've gone back further it seems, 66 of them, but their deadlines are 25, 26, 27 January so they should clear over the next few days. Cheers. Well 16 of mine have now validated. Cheers. ID: 1844379 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1844442 - Posted: 26 Jan 2017, 6:16:14 UTC - in response to Message 1844250. Last modified: 26 Jan 2017, 6:32:27 UTC What I'm still trying to figure out is why I have to change the Application settings once or twice a day to be able to keep getting work. Eventually, even after the post outage congestion, the usual response from the Scheduler these days is "Project has no tasks available." Change the Application settings, then it has work available. At least for 12 or more hours. Then I get to do it all over again. Rough gist is that when hard limits are set, such as project backoffs and queue sizes, statistically there will be people that fit into the expected area, ones that sometimes fit into the expected area while other times not, then still more that always fall into the always breaks regime. My guess is if the backoffs &/or request intervals were somewhat randomised, then it would allow fairer work distribution. Sadly actual statistics and control systems theory don't seem to be on the Agenda for Boinc anytime soon. So you'll need to either continue babysitting, or change something such that you fall into a different 'chance' bucket. [Edit:] Either manually, or some script to force update a random interval after project backoff expiration might work. The client's aggressive request on backoff expiration won't. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1844442 ·
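
[Editor's sketch] The "force update a random interval after project backoff expiration" idea above could look roughly like the following. This is only a sketch under stated assumptions: it assumes `boinccmd` is on the PATH and that the project URL below matches the one the local client is attached to, and the 300 s backoff and 120 s jitter window are illustrative numbers, not values taken from the server. The function names are hypothetical.

```python
import random
import subprocess
import time

# Assumption: must match the project URL shown by the local BOINC client.
PROJECT_URL = "https://setiathome.berkeley.edu/"


def jittered_delay(backoff_s, max_jitter_s, rng=random):
    """Return the backoff plus a uniform random extra of 0..max_jitter_s seconds."""
    if backoff_s < 0 or max_jitter_s < 0:
        raise ValueError("delays must be non-negative")
    return backoff_s + rng.uniform(0.0, max_jitter_s)


def update_command(project_url=PROJECT_URL):
    """Build the boinccmd call that asks the client to contact the project."""
    return ["boinccmd", "--project", project_url, "update"]


def update_after_backoff(backoff_s=300.0, max_jitter_s=120.0,
                         sleep=time.sleep, run=subprocess.run):
    """Wait out the backoff plus random jitter, then trigger one project update.

    `sleep` and `run` are injectable so the function can be dry-run in tests.
    """
    delay = jittered_delay(backoff_s, max_jitter_s)
    sleep(delay)
    run(update_command(), check=False)
    return delay
```

Calling `update_after_backoff()` from a loop or a scheduled job would spread this host's requests over a two-minute window instead of firing at the exact moment the backoff expires. Whether that actually changes which "bucket" a host lands in is the open question of this thread.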

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1844448 - Posted: 26 Jan 2017, 6:47:58 UTC - in response to Message 1844442. Rough gist is that when hard limits are set, such as project backoffs and queue sizes, statistically there will be people that fit into the expected area, ones that sometimes fit into the expected area while other times not, then still more that always fall into the always breaks regime. My guess is if the backoffs &/or request intervals were somewhat randomised, then it would allow fairer work distribution. Sadly actual statistics and control systems theory don't seem to be on the Agenda for Boinc anytime soon. So you'll need to either continue babysitting, or change something such that you fall into a different 'chance' bucket. [Edit:] Either manually, or some script to force update a random interval after project backoff expiration might work. The client's aggressive request on backoff expiration won't. Problem is it has nothing to do with the backoffs. Every 5 minutes the Manager will ask for work, and for some reason after a certain period of time it's necessary to change the project's application settings to keep it coming. Even though I don't have an AP application to process AP work, I have to set that option to Yes to get work (and the "If no work available for selected application, accept work from other applications" option). Then later on I have to set it to No, then Yes, then No and so on. Been this way for a few weeks now. I suspect it's related to the issue that was reported late Dec for people that had selected to do AP work, and "If no work avail, do other work" with v8 left unselected. At the end of Dec they had to specifically enable v8 work in order to get any, whereas before it wasn't necessary. (Of course running out of work during the weekly outages does result in ridiculously long backoff times if you're not around to hit retry). Grant Darwin NT ID: 1844448 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1844451 - Posted: 26 Jan 2017, 6:58:11 UTC - in response to Message 1844448. Well the 5 minutes is across the board, so constant for all of us. Likely bugs aside (a big if), there are still periods of full or empty feeder queue which you may fall into. It's a task to determine whether there is a bug in the work issue, or you are simply falling into the empty feeder bucket for some reason (could be anything such as latency you described). Perhaps an answer is to increase the backoff such that more users can get in, maybe not. Either way, there will always be some proportion of users that can never get work. What I'm proposing is that if fixed time intervals are used, then no matter what, some proportion of hosts will end up in a state of never being able to get work. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1844451 ·
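
[Editor's sketch] The claim that fixed intervals leave some hosts permanently without work can be illustrated with a toy model. Everything here is an assumption made for illustration: the feeder is modelled as serving only the first `slots` hosts to arrive in each cycle, every host has a fixed arrival phase within the cycle, and `jitter` is a random amount added to each host's phase per cycle. It is not a model of the real BOINC scheduler or feeder.

```python
import random


def count_starved(n_hosts, slots, rounds, jitter, seed=1):
    """Count hosts that never get work in a toy fixed-interval model.

    Each round the `slots` hosts with the earliest phase are served.
    With jitter == 0 the arrival order never changes; with jitter > 0
    each host's phase drifts by a random amount (modulo one cycle).
    """
    rng = random.Random(seed)
    phase = [rng.random() for _ in range(n_hosts)]
    got_work = [False] * n_hosts
    for _ in range(rounds):
        order = sorted(range(n_hosts), key=lambda i: phase[i])
        for i in order[:slots]:
            got_work[i] = True
        if jitter > 0:
            phase = [(p + rng.uniform(0.0, jitter)) % 1.0 for p in phase]
    return got_work.count(False)


# Fixed intervals: the same 25 hosts win every round, the other 75 never do.
fixed = count_starved(n_hosts=100, slots=25, rounds=200, jitter=0.0)
# Randomised intervals: the arrival order keeps changing between rounds.
randomised = count_starved(n_hosts=100, slots=25, rounds=200, jitter=0.1)
```

In this model the fixed case always starves exactly `n_hosts - slots` hosts, because the ordering is frozen. The jittered case shuffles who arrives early, which is the "fairer work distribution" argument in miniature.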

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1844452 - Posted: 26 Jan 2017, 7:11:08 UTC - in response to Message 1844451. Likely bugs aside (a big if), there are still periods of full or empty feeder queue which you may fall into. It's a task to determine whether there is a bug in the work issue, or you are simply falling into the empty feeder bucket for some reason (could be anything such as latency you described). Possible, but unlikely IMHO. Work returned per hour is less than at times in the past, yet this issue persists. Also, even after the extended outages with work returned right up there & the demand for new work way up there, if the first few requests for work result in none, changing the application settings results in getting work, up to the point the cache is full. If it starts getting work after the outages, it continues to fill up normally. In both cases it will generally get work with each request, just the odd one or 2 it might miss out on. However, if the project does ever get the number of crunchers they're hoping for to crunch all the new data, I am expecting significant issues getting work with things as they stand. A 5min +-30 seconds might help with the Scheduler and feeder loads, but as the number of active hosts increases, wouldn't the sheer number of hosts result in an (effectively) random load due to each system's randomness with the initial Scheduler request after BOINC starting? Grant Darwin NT ID: 1844452 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1844453 - Posted: 26 Jan 2017, 7:18:07 UTC - in response to Message 1844452. Yes 'unlikely'. That may possibly account for your ability to change settings, last for 12 hours, then fall back into a hole. I'll be very interested to see if you can break out of that rut without code change on client or server. Being Australia day, I'd like to point out that falling into such holes is pretty much the Australian way. Usually the result of doing so is fairly disruptive. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1844453 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1844454 - Posted: 26 Jan 2017, 7:21:32 UTC - in response to Message 1844452. A 5min +-30 seconds might help with the Scheduler and feeder loads, but as the number of active hosts increases, wouldn't the sheer number of hosts result in an (effectively) random load due to each system's randomness with the initial Scheduler request after BOINC starting? Most likely the backoff needs to be proportional to the rate of requests, since the feed rate is likely more or less constant. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1844454 ·
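
[Editor's sketch] One way to read "backoff proportional to the rate of requests" is as simple arithmetic: if the feeder supplies a roughly constant number of tasks per second, the interval between requests from each host has to grow with the number of hosts asking. The function below is a hypothetical illustration only. The parameter names, the 300 s floor (the "5 minutes" mentioned above) and the example numbers are all assumptions, not measured values from the project.

```python
def proportional_backoff(active_hosts, tasks_per_request, feed_rate_per_s,
                         floor_s=300.0):
    """Return a per-host request interval that keeps demand within supply.

    Demand per second is active_hosts * tasks_per_request / interval, so
    demand <= feed rate requires
    interval >= active_hosts * tasks_per_request / feed_rate_per_s.
    """
    if feed_rate_per_s <= 0:
        raise ValueError("feed rate must be positive")
    needed = active_hosts * tasks_per_request / feed_rate_per_s
    return max(floor_s, needed)


# Illustrative numbers only: 10,000 hosts asking for 10 tasks each against
# a feed rate of 50 tasks per second would need 2000 s between requests.
interval = proportional_backoff(10_000, 10, 50.0)
```

With few hosts the floor dominates and the interval stays at five minutes. Once demand exceeds the feed rate, the interval stretches in proportion to the number of hosts.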

Brent Norman Volunteer tester Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835	Message 1844487 - Posted: 26 Jan 2017, 10:22:23 UTC Grant, Maybe try adding the AP application, maybe that is the one thing that makes your requests different than others. There are a few getting split right now, but normally you won't get any(many) anyways. ID: 1844487 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1844488 - Posted: 26 Jan 2017, 10:37:40 UTC - in response to Message 1844487. Grant, Maybe try adding the AP application, maybe that is the one thing that makes your requests different than others. There are a few getting split right now, but normally you won't get any(many) anyways. True, but really I've no desire to re-do the setup again. I'm happy to run v8 only & crunch all the guppies others don't want. It would be nice if the issue was just fixed. I'll just continue with my manual work around for now. Grant Darwin NT ID: 1844488 ·

Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13	Message 1844598 - Posted: 26 Jan 2017, 20:27:39 UTC Just adding an idea here.. but the feeder can only re-fill so fast/frequently/often, so what if instead of letting someone get lucky and get all 200 tasks that it has in one request.. the feeder gets limited to assigning 20 or 50 tasks at a time? I know there will be more groaning and griping about that, too, but if the feeder is more likely to have tasks in it, then there should--theoretically--be less people who get absolutely nothing because it is empty. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) ID: 1844598 ·
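
[Editor's sketch] The per-request cap suggested above can be checked with a few lines of arithmetic. This is a toy model under stated assumptions: the feeder holds a fixed number of tasks between refills, hosts are served strictly in arrival order, and the request sizes in the example are made up. It does not describe how the real feeder or scheduler allocates work.

```python
def hosts_left_empty(requests, feeder_tasks, per_request_cap=None):
    """Count hosts that receive nothing before the feeder is refilled.

    `requests` lists how many tasks each host asks for, in arrival order.
    If `per_request_cap` is set, no single request is granted more than that.
    """
    remaining = feeder_tasks
    empty = 0
    for want in requests:
        limit = want if per_request_cap is None else min(want, per_request_cap)
        grant = min(limit, remaining)
        remaining -= grant
        if want > 0 and grant == 0:
            empty += 1
    return empty


# Made-up example: five hosts arrive between two refills of a 200-task feeder.
requests = [200, 60, 40, 30, 20]
uncapped = hosts_left_empty(requests, feeder_tasks=200)
capped = hosts_left_empty(requests, feeder_tasks=200, per_request_cap=50)
```

Without a cap the first host drains the feeder and the other four get nothing. With a cap of 50 all five get something, at the cost of the first host needing several requests to fill its cache.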

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 1844621 - Posted: 26 Jan 2017, 22:34:50 UTC - in response to Message 1844598. Just adding an idea here.. but the feeder can only re-fill so fast/frequently/often, so what if instead of letting someone get lucky and get all 200 tasks that it has in one request.. the feeder gets limited to assigning 20 or 50 tasks at a time? I know there will be more groaning and griping about that, too, but if the feeder is more likely to have tasks in it, then there should--theoretically--be less people who get absolutely nothing because it is empty. +1 Makes a lot of sense. Would also help for new folks who take a load and are never heard from again, in terms of wingmen waiting for timeouts. ID: 1844621 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1844633 - Posted: 26 Jan 2017, 23:01:35 UTC - in response to Message 1844598. +1 Seems to be the old norm for me anyway, at least WAS back last year before this new problem cropped up. I never got more than 40 or so tasks in the first downloads after the outage anyway. I've NEVER received the full 100 task buffer output for my download request. I always sort of thought that was the server's doing, or it was just my luck of the draw always pulling 40~ tasks out of the 100 task buffer. It never took more than 4 or 5 downloads to get back to my quota after the outage and my work request got answered in the queue deluge. I gather from the thread comments that the servers DON'T actually have this mechanism in play. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1844633 ·

HAL9000 Volunteer tester Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57	Message 1844634 - Posted: 26 Jan 2017, 23:02:10 UTC - in response to Message 1844598. Just adding an idea here.. but the feeder can only re-fill so fast/frequently/often, so what if instead of letting someone get lucky and get all 200 tasks that it has in one request.. the feeder gets limited to assigning 20 or 50 tasks at a time? I know there will be more groaning and griping about that, too, but if the feeder is more likely to have tasks in it, then there should--theoretically--be less people who get absolutely nothing because it is empty. Something similar was done with AP work requests a few years ago. At least it seemed that way once no more than ~7 tasks were assigned per request. If hitting an empty feeder is the only issue it seems like more users would be seeing their queues dropping & that toggling settings wouldn't fix the users getting no work. At another project there was an issue with the feeder running dry. So the admin adjusted some settings for it. I think they said they increased the number the feeder held at once, but they may have increased how often it was filled. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url] ID: 1844634 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1844665 - Posted: 27 Jan 2017, 0:44:22 UTC - in response to Message 1844634. Last modified: 27 Jan 2017, 0:45:46 UTC Just adding an idea here.. but the feeder can only re-fill so fast/frequently/often, so what if instead of letting someone get lucky and get all 200 tasks that it has in one request.. the feeder gets limited to assigning 20 or 50 tasks at a time? I know there will be more groaning and griping about that, too, but if the feeder is more likely to have tasks in it, then there should--theoretically--be less people who get absolutely nothing because it is empty. Something similar was done with AP work requests a few years ago. At least it seemed that way once no more than ~7 tasks were assigned per request. If hitting an empty feeder is the only issue it seems like more users would be seeing their queues dropping & that toggling settings wouldn't fix the users getting no work. At another project there was an issue with the feeder running dry. So the admin adjusted some settings for it. I think they said they increased the number the feeder held at once, but they may have increased how often it was filled. . . That is exactly how it seems to me as well. But I can offer no idea of just where the issue lies. Except to say it is a recent development. Since they upgraded the OS on one or more of the servers in fact. Stephen . ID: 1844665 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1844700 - Posted: 27 Jan 2017, 6:35:34 UTC - in response to Message 1844665. That is exactly how it seems to me as well. But I can offer no idea of just where the issue lies. Except to say it is a recent development. Since they upgraded the OS on one or more of the servers in fact. It wouldn't be the first time major changes resulted in configuration files going missing or being ignored in one way or another. Grant Darwin NT ID: 1844700 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1845428 - Posted: 30 Jan 2017, 14:10:39 UTC Last modified: 30 Jan 2017, 14:36:43 UTC It seems the problem with downloading Work has surfaced again after being absent for a while. Just as before the machines with only ATI GPUs are Not affected. The problems are with the machines that just had GPUs swapped. One machine went from having 3 nVidia cards to 2 NV and 1 ATI. The other machine went from 2 NV and 1 ATI to 3 nVidia cards. Both machines have been having problems since the GPUs were swapped a couple days ago, up until then they were not having any problems since the last post a couple weeks ago. Suddenly, they are back to having problems. The one machine with the 3 NV GPUs is down over a hundred tasks, changing the preferences works for a few hours then the problem returns.
Mon Jan 30 09:27:58 2017 | SETI@home | [sched_op] Starting scheduler request
Mon Jan 30 09:27:58 2017 | SETI@home | Sending scheduler request: To fetch work.
Mon Jan 30 09:27:58 2017 | SETI@home | Reporting 1 completed tasks
Mon Jan 30 09:27:58 2017 | SETI@home | Requesting new tasks for CPU and NVIDIA GPU and AMD/ATI GPU
Mon Jan 30 09:27:58 2017 | SETI@home | [sched_op] CPU work request: 41045.68 seconds; 0.00 devices
Mon Jan 30 09:27:58 2017 | SETI@home | [sched_op] NVIDIA GPU work request: 139220.66 seconds; 0.00 devices
Mon Jan 30 09:27:58 2017 | SETI@home | [sched_op] AMD/ATI GPU work request: 9927.06 seconds; 0.00 devices
Mon Jan 30 09:28:01 2017 | SETI@home | Scheduler request completed: got 0 new tasks
Mon Jan 30 09:28:01 2017 | SETI@home | [sched_op] Server version 707
Mon Jan 30 09:28:01 2017 | SETI@home | Project has no tasks available
Mon Jan 30 09:28:01 2017 | SETI@home | Project requested delay of 303 seconds
Mon 30 Jan 2017 09:33:05 AM EST | SETI@home | Requesting new tasks for NVIDIA
Mon 30 Jan 2017 09:33:05 AM EST | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Mon 30 Jan 2017 09:33:05 AM EST | SETI@home | [sched_op] NVIDIA work request: 526053.44 seconds; 0.00 devices
Mon 30 Jan 2017 09:33:13 AM EST | SETI@home | Scheduler request completed: got 0 new tasks
Mon 30 Jan 2017 09:33:13 AM EST | SETI@home | [sched_op] Server version 707
Mon 30 Jan 2017 09:33:13 AM EST | SETI@home | Project has no tasks available
Mon 30 Jan 2017 09:33:13 AM EST | SETI@home | Project requested delay of 303 seconds
:-( ID: 1845428 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1845439 - Posted: 30 Jan 2017, 15:52:46 UTC Change the preferences and the work returns...for a little while;
Mon 30 Jan 2017 10:26:32 AM EST | SETI@home | Sending scheduler request: To fetch work.
Mon 30 Jan 2017 10:26:32 AM EST | SETI@home | Reporting 2 completed tasks
Mon 30 Jan 2017 10:26:32 AM EST | SETI@home | Requesting new tasks for NVIDIA
Mon 30 Jan 2017 10:26:32 AM EST | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Mon 30 Jan 2017 10:26:32 AM EST | SETI@home | [sched_op] NVIDIA work request: 494789.29 seconds; 0.00 devices
Mon 30 Jan 2017 10:26:35 AM EST | SETI@home | Scheduler request completed: got 91 new tasks
Mon 30 Jan 2017 10:26:35 AM EST | SETI@home | [sched_op] Server version 707
Mon 30 Jan 2017 10:26:35 AM EST | SETI@home | Project requested delay of 303 seconds
ID: 1845439 ·
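
[Editor's sketch] For anyone wanting to track how often requests come back empty, event log text like the excerpts in this thread can be summarised with a short script. This is a sketch, not a supported tool: it only looks for the "Scheduler request completed: got N new tasks" wording seen in the logs posted here, the function names are hypothetical, and the sample text is copied from the posts above.

```python
import re

# Matches the reply line as it appears in the logs posted in this thread.
_GOT_TASKS = re.compile(r"Scheduler request completed: got (\d+) new tasks")


def tasks_per_reply(log_text):
    """Return the number of tasks granted by each scheduler reply, in order."""
    return [int(n) for n in _GOT_TASKS.findall(log_text)]


def empty_reply_fraction(log_text):
    """Return the share of scheduler replies that granted no tasks at all."""
    replies = tasks_per_reply(log_text)
    if not replies:
        return 0.0
    return sum(1 for n in replies if n == 0) / len(replies)


SAMPLE = """\
Mon 30 Jan 2017 09:33:13 AM EST | SETI@home | Scheduler request completed: got 0 new tasks
Mon 30 Jan 2017 09:33:13 AM EST | SETI@home | Project has no tasks available
Mon 30 Jan 2017 10:26:35 AM EST | SETI@home | Scheduler request completed: got 91 new tasks
"""
replies = tasks_per_reply(SAMPLE)
empty_share = empty_reply_fraction(SAMPLE)
```

Run over a full day of log output, a rising share of empty replies would show when a host drops back into the "no tasks available" state, which could help correlate it with preference changes.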

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1845455 - Posted: 30 Jan 2017, 17:11:15 UTC - in response to Message 1845439. Remind me again, what preferences are you changing to get effect. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1845455 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1845458 - Posted: 30 Jan 2017, 17:30:56 UTC - in response to Message 1845455.
Run only the selected applications
AstroPulse v7: no
SETI@home v8: yes
If no work for selected applications is available, accept work from other applications? no
It doesn't matter how they are set. Just change them. Right now one machine is set to the above while the other machine is;
Run only the selected applications
AstroPulse v7: yes
SETI@home v8: yes
If no work for selected applications is available, accept work from other applications? yes
The next time I'll just swap the settings again. It's the act of changing them that matters. ID: 1845458 ·

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.