Message boards :
Number crunching :
Panic Mode On (87) Server Problems?
Author | Message |
---|---|
Batter Up Send message Joined: 5 May 99 Posts: 1946 Credit: 24,860,347 RAC: 0 |
" I felt a great disturbance in the Force, as if millions of voices suddenly cried out in terror and were suddenly silenced. I fear something terrible has happened. " |
Miklos M. Send message Joined: 5 May 99 Posts: 955 Credit: 136,115,648 RAC: 73 |
I thought I had a bug on my screen, lol. |
Miklos M. Send message Joined: 5 May 99 Posts: 955 Credit: 136,115,648 RAC: 73 |
I wonder if the Reset button would help SETI on this computer. The other computer is getting the units just fine so far. |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1855 Credit: 268,616,081 RAC: 1,349 |
I wonder if the Reset button would help SETI on this computer. The other computer is getting the units just fine so far. Doubt it. If you look at the server status page, the splitters are struggling to produce anything to send. It's purely the luck of the draw whether you'll get anything until they get closer to keeping up with demand. Too many computers fighting for too few jobs at the moment. About an hour ago the splitters went to their knees entirely and the SSP froze, but either someone hit the button over there, or they recovered. |
Filipe Send message Joined: 12 Aug 00 Posts: 218 Credit: 21,281,677 RAC: 20 |
What's wrong with the validators/assimilators? It seems it is preventing new work from being split. Maybe a lack of disk space. |
Miklos M. Send message Joined: 5 May 99 Posts: 955 Credit: 136,115,648 RAC: 73 |
Thanks, although after hitting the Reset button I got these messages, but no WUs:

3/12/2014 2:28:49 PM | SETI@home | update requested by user
3/12/2014 2:28:54 PM | SETI@home | Master file download succeeded
3/12/2014 2:28:59 PM | SETI@home | Sending scheduler request: Requested by user.
3/12/2014 2:28:59 PM | SETI@home | Not requesting tasks: don't need
3/12/2014 2:29:02 PM | SETI@home | Scheduler request completed
3/12/2014 2:29:04 PM | SETI@home | Started download of arecibo_181.png
3/12/2014 2:29:04 PM | SETI@home | Started download of sah_40.png
3/12/2014 2:29:06 PM | SETI@home | Finished download of arecibo_181.png
3/12/2014 2:29:06 PM | SETI@home | Finished download of sah_40.png
3/12/2014 2:29:06 PM | SETI@home | Started download of sah_banner_290.png
3/12/2014 2:29:06 PM | SETI@home | Started download of sah_ss_290.png
3/12/2014 2:29:07 PM | SETI@home | Finished download of sah_banner_290.png
3/12/2014 2:29:07 PM | SETI@home | Finished download of sah_ss_290.png
3/12/2014 2:30:29 PM | SETI@home | update requested by user
3/12/2014 2:30:33 PM | SETI@home | Sending scheduler request: Requested by user.
3/12/2014 2:30:33 PM | SETI@home | Not requesting tasks: don't need
3/12/2014 2:30:35 PM | SETI@home | Scheduler request completed

Could this mean that there is hope when they have available units to send? |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1855 Credit: 268,616,081 RAC: 1,349 |
What's wrong with the validators/assimilators? If I understand the process correctly, neither the validators nor the assimilators would be the issue. They're at the tail end of the process; the splitters are at the front. The pattern I see is:
1) Either there's a problem, or the system goes down for maintenance (Tuesdays).
2) We keep crunching data, and demand builds.
3) Once they're back up, we slam the network looking to report and get work.
4) The servers struggle to meet the built-up demand, and the network is congested.
5) Eventually the oscillations cease and they're back to limping along, barely meeting demand, until the next event.
But you raise a good point: it seems to me that once MBs ready to send hits the 300k range, the splitters slow down, perhaps due to disk space, so that seems to be the extent of the cushion that can be built up. I don't think I've ever seen a similar pattern on APs, but that's probably because there's never enough AP supply to meet even the bare demand. 300k sounds like a lot of supply, but when the crunch comes you can see almost 100k returns per hour, and each return wants a new task, so it really doesn't take much to slow the traffic flow down. I've got a theory that if everyone were to reduce their cache sizes a bit this might even out, but there's no chance anyone will do that while they're competing for a limited resource, at least where APs are concerned. |
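The boom-and-bust cycle described in steps 1-5 can be sketched as a toy model. This is my own illustration, not SETI's server code: the ~100k/hour return rate comes from the post above, while the splitter rate, steady demand, and outage length are assumed round numbers.

```python
# Toy model of the post-outage cycle: demand deferred during downtime
# piles into a backlog, which pins the ready-to-send queue at zero
# until the fixed-rate splitters catch up.

def simulate(hours=12, split_rate=90_000, steady_demand=80_000, outage_hours=3):
    """Return the ready-to-send queue depth for each hour after an outage."""
    backlog = steady_demand * outage_hours   # requests deferred during the outage
    ready = 0
    depths = []
    for _ in range(hours):
        ready += split_rate                  # splitters produce at a fixed rate
        demand = steady_demand + backlog     # everyone retries at once
        sent = min(ready, demand)
        backlog = max(0, demand - sent - steady_demand)  # unmet excess carries over
        ready -= sent
        depths.append(ready)
    return depths

depths = simulate()
```

With these numbers the queue sits at zero for the first three hours while the backlog drains, then slowly rebuilds toward a cushion, matching the "limping along until the next event" pattern.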
Miklos M. Send message Joined: 5 May 99 Posts: 955 Credit: 136,115,648 RAC: 73 |
Looks like I am back to the same old same old: no work needed or wanted, and none sent. I guess I am just out of luck trying to crunch SETI on this computer, even when I suspend all other work. |
Filipe Send message Joined: 12 Aug 00 Posts: 218 Credit: 21,281,677 RAC: 20 |
I see it this way:
- There are almost 90,000 AP WUs waiting for assimilation.
- The splitters only produce new WUs as disk space is available.
- So, as the number of WUs waiting to be assimilated grows, less and less disk space is available, and the number of results out in the field is steadily dropping.
Would anyone like to comment? |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1855 Credit: 268,616,081 RAC: 1,349 |
Thanks, although after hitting the Reset button I got these messages, but no WUs: This means that you already have as much work as your machine wants, per the thresholds you've set (or the defaults). It's not that SETI isn't sending; it's that you're telling SETI not to. In BOINC Manager, look at Tools > Computing Preferences > Network Usage, and you'll see settings for the minimum and maximum work buffer, in days or fractions of a day. This is what controls how much work you have cached Ready to Start. |
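For reference, the same two thresholds can also be set locally in the BOINC client's global_prefs_override.xml file, which overrides the website preferences on that machine. The element names below are the ones the BOINC client uses; the values are purely illustrative.

```xml
<global_preferences>
    <!-- keep at least this many days of work cached -->
    <work_buf_min_days>0.5</work_buf_min_days>
    <!-- fetch up to this many additional days on top of the minimum -->
    <work_buf_additional_days>0.25</work_buf_additional_days>
</global_preferences>
```

After editing the file, have BOINC re-read its local preferences (or restart the client) for the change to take effect.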
bill Send message Joined: 16 Jun 99 Posts: 861 Credit: 29,352,955 RAC: 0 |
"The Splitters only produce new WU as disk space is available." Where did you come across that? |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14656 Credit: 200,643,578 RAC: 874 |
But you raise a good point: it seems to me that once MBs ready to send hits the 300k range, the splitters slow down, perhaps due to disk space, so that seems to be the extent of the cushion that can be built up. Yes, there are deliberately built-in "high water mark" limits of around 300K MB and 25K AP tasks 'ready to send'. The idea is not to keep splitting until all disk space is full, but to stop when there are 'enough' (for some given value of enough), and keep the disk access times snappy. Quite why everything is so slow to build up to the high water mark these last few weeks, I don't know. Shorty storms and stuck tapes are visible to all, but it feels like something else is in play too, and I haven't quite put my finger on it yet. |
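A high-water-mark gate like the one Richard describes is commonly implemented with hysteresis, so the splitters don't flap on and off right at the threshold. The 300K/25K figures come from his post; the low-water resume points and the gate logic itself are my assumptions, sketched here for illustration only.

```python
# Hysteresis gate for splitter control (assumed logic, not SETI's code).
HIGH_WATER = {"mb": 300_000, "ap": 25_000}   # stop splitting at/above this
LOW_WATER = {"mb": 250_000, "ap": 20_000}    # resume at/below this (assumed)

def splitter_should_run(app, ready_to_send, currently_running):
    """Keep 'enough' tasks ready without filling the disk."""
    if ready_to_send >= HIGH_WATER[app]:
        return False                 # queue full: pause splitting
    if ready_to_send <= LOW_WATER[app]:
        return True                  # queue low: (re)start splitting
    return currently_running         # in the band: keep the current state
```

The dead band between the two marks is what keeps disk access "snappy": the splitter only restarts once the queue has drained a meaningful amount, rather than toggling on every task sent.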
Jimbocous Send message Joined: 1 Apr 13 Posts: 1855 Credit: 268,616,081 RAC: 1,349 |
I see it this way: Possible. I guess it would all depend on what is being stored where, but based purely on watching how things normally go, I doubt it. The big question is: do the files actually live on the servers that are doing the processing of a particular step? I'd be more inclined to think it's a case of process priority. After an outage, the scheduling server, in charge of moving results out to the field and receiving result reports, gets very busy (see the server descriptions on the SSP). Since the AP validator, AP assimilators, the scheduler processes and the feeder all reside on the same physical server (Synergy), it's reasonable to assume that when Synergy gets busy it assigns higher priority to scheduling and feeding work, and cuts back on validation and assimilation as lower-priority tasks. Sending results to clients, receiving uploaded completed results, and accepting reports of those uploads are all "real time" tasks that involve communication with our client software, whereas validation and assimilation can be done anytime. Establishing priority on that basis would make sense. |
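The "real time" vs. "anytime" split amounts to a two-level priority queue. This is a toy model of the idea, not Synergy's actual configuration; the task names are taken from the post, the queueing mechanics are assumed.

```python
# Toy two-level priority model: client-facing work drains before
# batch work, no matter the order it arrived in.
import heapq

REALTIME, ANYTIME = 0, 1   # lower number = served first

def drain(tasks):
    """Pop (priority, name) tasks in priority order, FIFO within a level."""
    queue = [(prio, i, name) for i, (prio, name) in enumerate(tasks)]
    heapq.heapify(queue)
    return [heapq.heappop(queue)[2] for _ in range(len(queue))]

order = drain([
    (ANYTIME, "assimilate batch"),
    (REALTIME, "scheduler request"),
    (ANYTIME, "validate batch"),
    (REALTIME, "feeder refill"),
])
# Scheduler and feeder work comes out ahead of validation/assimilation.
```

Under load, the batch entries simply never reach the front, which would produce exactly the growing "waiting for assimilation" backlog seen on the SSP.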
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
My MB host isn't having any trouble. The two AP hosts have been dropping since Monday, when they were both full. They both should be close to 200; one is now at 107 and continues to drop, with only an occasional download. Most of the time it simply says:

Wed Mar 12 14:57:24 2014 | SETI@home | Sending scheduler request: To fetch work.
Wed Mar 12 14:57:24 2014 | SETI@home | Reporting 2 completed tasks
Wed Mar 12 14:57:24 2014 | SETI@home | Requesting new tasks for ATI
Wed Mar 12 14:57:26 2014 | SETI@home | Scheduler request completed: got 0 new tasks
Wed Mar 12 14:57:26 2014 | SETI@home | No tasks sent
Wed Mar 12 14:57:26 2014 | SETI@home | No tasks are available for AstroPulse v6
Wed Mar 12 15:03:01 2014 | SETI@home | Computation for task ap_11ap13aa_B4_P0_00393_20140310_05311.wu_2 finished
Wed Mar 12 15:03:01 2014 | SETI@home | Starting task ap_04mr13aa_B1_P1_00105_20140310_18914.wu_0 using astropulse_v6 version 607 (opencl_ati_100) in slot 3
Wed Mar 12 15:03:03 2014 | SETI@home | Started upload of ap_11ap13aa_B4_P0_00393_20140310_05311.wu_2_0
Wed Mar 12 15:03:07 2014 | SETI@home | Finished upload of ap_11ap13aa_B4_P0_00393_20140310_05311.wu_2_0
Wed Mar 12 15:03:07 2014 | SETI@home | Sending scheduler request: To fetch work.
Wed Mar 12 15:03:07 2014 | SETI@home | Reporting 1 completed tasks
Wed Mar 12 15:03:07 2014 | SETI@home | Requesting new tasks for ATI
Wed Mar 12 15:03:09 2014 | SETI@home | Scheduler request completed: got 0 new tasks
Wed Mar 12 15:03:09 2014 | SETI@home | Project has no tasks available...

I believe the problem is associated with the bolded count: State: All (502) · In progress (107) · Validation pending (124) · Validation inconclusive (5) · Valid (265) · Invalid (0) · Error (1). That number should be close to 100. I think you can blame that on "Workunits waiting for assimilation: 90,453". |
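The counts TBar quotes are consistent with a server-side cap on tasks in progress per host. A toy version of such a check follows; the per-GPU limit of 100 is inferred from his numbers ("should be close to 200" for a two-GPU host) and is an assumption, not a confirmed SETI setting.

```python
# Sketch of an assumed scheduler-side cap on tasks in progress.
PER_GPU_LIMIT = 100   # assumed per-GPU in-progress limit

def tasks_to_send(in_progress, gpus, available):
    """How many new tasks the scheduler could send this host right now."""
    headroom = max(0, PER_GPU_LIMIT * gpus - in_progress)
    return min(headroom, available)

# TBar's host: 107 in progress on 2 GPUs, but no AP tasks ready to send,
# so the request still comes back with zero tasks.
tasks_to_send(107, 2, 0)
```

In other words, the cap is not what's starving the host here: it has headroom, but with nothing in the ready-to-send queue the answer is still "Project has no tasks available".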
Miklos M. Send message Joined: 5 May 99 Posts: 955 Credit: 136,115,648 RAC: 73 |
I increased the number of days to 10 and 10 in the Network settings as well as in the preferences. It still does not want new work. |
Filipe Send message Joined: 12 Aug 00 Posts: 218 Credit: 21,281,677 RAC: 20 |
"The Splitters only produce new WU as disk space is available." From some of Matt's old Technical News posts. |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1855 Credit: 268,616,081 RAC: 1,349 |
Yes, there are deliberately built-in "high water mark" limits of around 300K MB and 25K AP tasks 'ready to send'. The idea is not to keep splitting until all disk space is full, but to stop when there are 'enough', and keep the disk access times snappy. Yeah, that is the interesting question. The whole supply-and-demand thing doesn't explain the way the splitters slow down by itself, but again it could be that process allocation is the issue. When we get a bump in the road, we start slamming the servers to download work. So georgem and vader get busy servicing downloads, and, using the "real time" vs. "anytime" process-priority theory I expressed above, it would seem that the AP splitters on georgem and the MB splitters on vader would suffer. The SSP indicates the number of results uploaded during the last hour, but not the number downloaded; I assume those two numbers are roughly equal, assuming there are files ready to send. It seems to me that splitting gets slower the more uploads happened in the last hour, which would support that. If anything, it might be a good point to look at what work lives on which machines and see if a better mix could be achieved to reduce the impact of these swings in traffic. |
rob smith Send message Joined: 7 Mar 03 Posts: 22269 Credit: 416,307,556 RAC: 380 |
Miklos - try something like 4 minimum and 0.1 extra. But don't be too surprised if it doesn't fill your buffers for some time, as the servers are being a bit reluctant just now... Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1855 Credit: 268,616,081 RAC: 1,349 |
I increased the number of days to 10 and 10 in the Network settings as well as in the preferences. It still does not want new work. Just to be clear: it doesn't want work, as opposed to didn't get any, right? After changing the parameters, did you tell BOINC to read the new config (Advanced > Read config files) or shut down and restart BOINC? If not, the changes have been stored but are not yet in effect. Are you running any projects other than SETI? |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.