Panic Mode On (87) Server Problems?

Message boards : Number crunching : Panic Mode On (87) Server Problems?

Previous · 1 . . . 11 · 12 · 13 · 14 · 15 · 16 · 17 . . . 24 · Next

Author · Message
Batter Up
Avatar

Send message
Joined: 5 May 99
Posts: 1946
Credit: 24,860,347
RAC: 0
United States
Message 1487853 - Posted: 12 Mar 2014, 16:26:06 UTC

" I felt a great disturbance in the Force, as if millions of voices suddenly cried out in terror and were suddenly silenced. I fear something terrible has happened. "
ID: 1487853 · Report as offensive
Miklos M.

Send message
Joined: 5 May 99
Posts: 955
Credit: 136,115,648
RAC: 73
Hungary
Message 1487907 - Posted: 12 Mar 2014, 18:08:09 UTC - in response to Message 1487853.  

I thought I had a bug on my screen, lol.
ID: 1487907 · Report as offensive
Miklos M.

Send message
Joined: 5 May 99
Posts: 955
Credit: 136,115,648
RAC: 73
Hungary
Message 1487908 - Posted: 12 Mar 2014, 18:09:11 UTC - in response to Message 1487632.  

I wonder if the Reset button would help SETI on this computer. The other computer is getting the units just fine so far.
ID: 1487908 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Avatar

Send message
Joined: 1 Apr 13
Posts: 1855
Credit: 268,616,081
RAC: 1,349
United States
Message 1487912 - Posted: 12 Mar 2014, 18:16:11 UTC - in response to Message 1487908.  

I wonder if the Reset button would help SETI on this computer. The other computer is getting the units just fine so far.

Doubt it. If you look at the server status page, the splitters are struggling to produce anything to send. It's purely the luck of the draw whether you'll get anything until they get closer to keeping up with demand. Too many computers fighting for too few jobs at the moment. About an hour ago the splitters went to their knees entirely and the SSP froze, but either someone hit the button over there or they recovered on their own.
ID: 1487912 · Report as offensive
Filipe

Send message
Joined: 12 Aug 00
Posts: 218
Credit: 21,281,677
RAC: 20
Portugal
Message 1487918 - Posted: 12 Mar 2014, 18:23:29 UTC

What's wrong with the validators/assimilators?

It seems something is preventing new work from being split. Maybe a lack of disk space.
ID: 1487918 · Report as offensive
Miklos M.

Send message
Joined: 5 May 99
Posts: 955
Credit: 136,115,648
RAC: 73
Hungary
Message 1487930 - Posted: 12 Mar 2014, 18:35:53 UTC - in response to Message 1487912.  

Thanks, although after hitting the Reset button I got these messages, but no WUs:
3/12/2014 2:28:49 PM | SETI@home | update requested by user
3/12/2014 2:28:54 PM | SETI@home | Master file download succeeded
3/12/2014 2:28:59 PM | SETI@home | Sending scheduler request: Requested by user.
3/12/2014 2:28:59 PM | SETI@home | Not requesting tasks: don't need
3/12/2014 2:29:02 PM | SETI@home | Scheduler request completed
3/12/2014 2:29:04 PM | SETI@home | Started download of arecibo_181.png
3/12/2014 2:29:04 PM | SETI@home | Started download of sah_40.png
3/12/2014 2:29:06 PM | SETI@home | Finished download of arecibo_181.png
3/12/2014 2:29:06 PM | SETI@home | Finished download of sah_40.png
3/12/2014 2:29:06 PM | SETI@home | Started download of sah_banner_290.png
3/12/2014 2:29:06 PM | SETI@home | Started download of sah_ss_290.png
3/12/2014 2:29:07 PM | SETI@home | Finished download of sah_banner_290.png
3/12/2014 2:29:07 PM | SETI@home | Finished download of sah_ss_290.png
3/12/2014 2:30:29 PM | SETI@home | update requested by user
3/12/2014 2:30:33 PM | SETI@home | Sending scheduler request: Requested by user.
3/12/2014 2:30:33 PM | SETI@home | Not requesting tasks: don't need
3/12/2014 2:30:35 PM | SETI@home | Scheduler request completed
Could this mean that there is hope when they have available units to send?
ID: 1487930 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Avatar

Send message
Joined: 1 Apr 13
Posts: 1855
Credit: 268,616,081
RAC: 1,349
United States
Message 1487932 - Posted: 12 Mar 2014, 18:37:25 UTC - in response to Message 1487918.  

What's wrong with the validators/assimilators?

It seems something is preventing new work from being split. Maybe a lack of disk space.

If I understand the process correctly, neither the validators nor the assimilators would be the issue. They're at the tail end of the process; the splitters are at the front.
The pattern I see is:
1) Either there's a problem or the system goes down for maintenance (Tuesdays)
2) We keep crunching data, and the demand builds.
3) Once they're back up, we slam the network looking to report and get work.
4) The servers struggle to meet the built-up demand, and the network is congested.
5) Eventually the oscillations cease and they're back to limping along, barely meeting demand, until the next event.

But you raise a good point, in that it seems to me that once MBs ready to send hit the 300k range, the splitters slow down, perhaps due to disk space, so that seems to be the extent of the cushion that can be built up. I don't think I've ever seen a similar pattern on APs, but that's probably because there's never enough AP supply to meet even the bare demand. 300k sounds like a lot of supply, but when the crunch comes you can see almost 100k returns per hour, and each return wants a new task, so it really doesn't take much to slow the flow down.
I've got a theory that if everyone were to reduce their cache sizes a bit this might even out, but there's no chance anyone will do that, because they're competing for access to a limited resource, at least where APs are concerned.
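The cache-size theory can be illustrated with a toy model of the post-outage surge (all numbers are invented for illustration; this is just the arithmetic of the burst, not a claim about the real fleet):

```python
# Toy model of the post-outage surge: while the servers are down each
# host's cache drains, and when they come back every host asks for its
# full deficit at once. All numbers here are invented for illustration.

def surge_tasks(n_hosts: int, cache_days: float, outage_days: float,
                tasks_per_host_per_day: float) -> float:
    """Tasks the fleet wants in the first scheduler pass after an outage."""
    deficit_days = min(cache_days, outage_days)  # a cache can't drain below empty
    return n_hosts * deficit_days * tasks_per_host_per_day

# For a half-day outage, smaller caches cap how much each host can
# possibly be missing, flattening the burst:
print(surge_tasks(100_000, 10.0, 0.5, 40.0))   # big caches drain a full half-day
print(surge_tasks(100_000, 0.25, 0.5, 40.0))   # small caches bound the deficit
```

Note the cap only bites when the cache is smaller than the outage; hosts with caches larger than the outage all drain the same amount regardless.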
ID: 1487932 · Report as offensive
Miklos M.

Send message
Joined: 5 May 99
Posts: 955
Credit: 136,115,648
RAC: 73
Hungary
Message 1487936 - Posted: 12 Mar 2014, 18:41:58 UTC - in response to Message 1487930.  

Looks like I am back to the same old same old: no work needed or wanted, and none sent. I guess I am just out of luck trying to crunch SETI on this computer, even when I suspend all other work.
ID: 1487936 · Report as offensive
Filipe

Send message
Joined: 12 Aug 00
Posts: 218
Credit: 21,281,677
RAC: 20
Portugal
Message 1487938 - Posted: 12 Mar 2014, 18:43:22 UTC
Last modified: 12 Mar 2014, 18:44:06 UTC

I see it this way:

- There are almost 90,000 AP WUs waiting for assimilation.
- The splitters only produce new WUs as disk space is available.
- So, as the number of WUs waiting to be assimilated grows, less and less disk space is available, and the number of results out in the field steadily drops.


Would anyone like to comment?
ID: 1487938 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Avatar

Send message
Joined: 1 Apr 13
Posts: 1855
Credit: 268,616,081
RAC: 1,349
United States
Message 1487939 - Posted: 12 Mar 2014, 18:45:18 UTC - in response to Message 1487930.  

Thanks, although after hitting the Reset button I got these messages, but no wu's.:
...
3/12/2014 2:28:59 PM | SETI@home | Not requesting tasks: don't need
...
3/12/2014 2:30:33 PM | SETI@home | Not requesting tasks: don't need
...
Could this mean that there is hope when they have available units to send?

This means that you already have as much work as your machine wants, per the thresholds you've set (or the defaults). It's not that SETI isn't sending; you're telling SETI not to.
In BOINC Manager, look at Tools > Computing Preferences > Network Usage, and you'll see settings for the minimum and maximum work buffer, in days or fractions of a day. This is what controls how much work you have cached Ready to Start.
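If it helps to picture it, the buffer check the client is doing amounts to something like this (a rough Python sketch; the function and variable names are mine, not BOINC's actual code):

```python
# Rough sketch of the BOINC client's work-fetch decision.
# Names and structure are illustrative, not the real client code.

def wants_work(cached_days: float, min_buffer_days: float) -> bool:
    """The client only asks for tasks once its cache of unstarted
    work falls below the minimum buffer setting."""
    return cached_days < min_buffer_days

def days_to_request(cached_days: float, min_buffer_days: float,
                    extra_days: float) -> float:
    """When it does ask, it tries to top up to min + extra days."""
    if not wants_work(cached_days, min_buffer_days):
        return 0.0  # logged as "Not requesting tasks: don't need"
    return (min_buffer_days + extra_days) - cached_days

# A host with 5 days cached and a 4-day minimum asks for nothing:
print(days_to_request(5.0, 4.0, 0.1))  # -> 0.0
```

So "Not requesting tasks: don't need" just means the cache is still above the minimum; it says nothing about whether the servers have work to give.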
ID: 1487939 · Report as offensive
bill

Send message
Joined: 16 Jun 99
Posts: 861
Credit: 29,352,955
RAC: 0
United States
Message 1487940 - Posted: 12 Mar 2014, 18:50:29 UTC - in response to Message 1487938.  
Last modified: 12 Mar 2014, 18:50:56 UTC

"The splitters only produce new WUs as disk space is available."

Where did you come across that?
ID: 1487940 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14656
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1487942 - Posted: 12 Mar 2014, 18:51:48 UTC - in response to Message 1487932.  

But you raise a good point, in that it seems to me that once MBs ready to send hit the 300k range, the splitters slow down, perhaps due to disk space, so that seems to be the extent of the cushion that can be built up. I don't think I've ever seen a similar pattern on APs, but that's probably because there's never enough AP supply to meet even the bare demand. 300k sounds like a lot of supply, but when the crunch comes you can see almost 100k returns per hour, and each return wants a new task, so it really doesn't take much to slow the flow down.
I've got a theory that if everyone were to reduce their cache sizes a bit this might even out, but there's no chance anyone will do that, because they're competing for access to a limited resource, at least where APs are concerned.

Yes, there are deliberately built-in "high water mark" limits of around 300K MB and 25K AP tasks 'ready to send'. The idea is not to keep splitting until all disk space is full, but to stop when there are 'enough' (for some given value of enough) and keep the disk access times snappy. Quite why everything is so slow to build up to the high water mark these last few weeks, I don't know. Shorty storms and stuck tapes are visible to all, but it feels like something else is in play too, and I haven't quite put my finger on it yet.
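The high-water-mark behaviour described above can be sketched like this (a toy illustration using the figures quoted in the thread; the real server-side logic is of course more involved):

```python
# Toy sketch of the splitter "high water mark" throttle: splitting
# pauses once enough tasks are ready to send, and resumes when the
# queue drops back below the mark. The limits are the figures quoted
# in the thread, not values read from the servers.

HIGH_WATER = {"MB": 300_000, "AP": 25_000}

def splitter_should_run(app: str, ready_to_send: int) -> bool:
    """Keep splitting only while the ready-to-send queue is below the
    high water mark, so disk access times stay snappy."""
    return ready_to_send < HIGH_WATER[app]

print(splitter_should_run("MB", 250_000))  # below the mark: keep splitting
print(splitter_should_run("AP", 25_000))   # at the mark: pause
```

The point of the throttle is that the cushion is bounded by design, not by total disk space: once demand outruns the mark, the queue can only drain.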
ID: 1487942 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Avatar

Send message
Joined: 1 Apr 13
Posts: 1855
Credit: 268,616,081
RAC: 1,349
United States
Message 1487950 - Posted: 12 Mar 2014, 19:04:49 UTC - in response to Message 1487938.  

I see it this way:

- There are almost 90,000 AP WUs waiting for assimilation.
- The splitters only produce new WUs as disk space is available.
- So, as the number of WUs waiting to be assimilated grows, less and less disk space is available, and the number of results out in the field steadily drops.


Would anyone like to comment?


Possible. I guess it would all depend on what is being stored where. But based purely on watching how things normally go, I doubt it. The big question is: are the files actually living on the servers that are doing the processing for a particular step?

I'd be more inclined to think it's a case of process priority.

After an outage, the scheduling server, which is in charge of moving results out to the field and receiving result reports, gets very busy (see the server descriptions on the SSP). Since the AP validator, the AP assimilators, the scheduler processes and the feeder all reside on the same physical server (Synergy), it's reasonable to assume that when Synergy gets busy it assigns higher priority to scheduling and feeding work, and cuts back on validation and assimilation as lower-priority tasks. Sending results to the clients, receiving uploaded completed results and accepting reports of the uploads are all "real time" tasks that involve communication with our client software, whereas validation and assimilation can be done anytime. So establishing priority on that basis would make sense.
ID: 1487950 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1487951 - Posted: 12 Mar 2014, 19:08:58 UTC - in response to Message 1487942.  
Last modified: 12 Mar 2014, 19:28:28 UTC

My MB host isn't having any trouble. The two AP hosts have been dropping since Monday, when they were both full. They both should be close to 200; one is now at 107 and continues to drop, with only an occasional download. Most of the time it simply says:
Wed Mar 12 14:57:24 2014 | SETI@home | Sending scheduler request: To fetch work.
Wed Mar 12 14:57:24 2014 | SETI@home | Reporting 2 completed tasks
Wed Mar 12 14:57:24 2014 | SETI@home | Requesting new tasks for ATI
Wed Mar 12 14:57:26 2014 | SETI@home | Scheduler request completed: got 0 new tasks
Wed Mar 12 14:57:26 2014 | SETI@home | No tasks sent
Wed Mar 12 14:57:26 2014 | SETI@home | No tasks are available for AstroPulse v6
Wed Mar 12 15:03:01 2014 | SETI@home | Computation for task ap_11ap13aa_B4_P0_00393_20140310_05311.wu_2 finished
Wed Mar 12 15:03:01 2014 | SETI@home | Starting task ap_04mr13aa_B1_P1_00105_20140310_18914.wu_0 using astropulse_v6 version 607 (opencl_ati_100) in slot 3
Wed Mar 12 15:03:03 2014 | SETI@home | Started upload of ap_11ap13aa_B4_P0_00393_20140310_05311.wu_2_0
Wed Mar 12 15:03:07 2014 | SETI@home | Finished upload of ap_11ap13aa_B4_P0_00393_20140310_05311.wu_2_0
Wed Mar 12 15:03:07 2014 | SETI@home | Sending scheduler request: To fetch work.
Wed Mar 12 15:03:07 2014 | SETI@home | Reporting 1 completed tasks
Wed Mar 12 15:03:07 2014 | SETI@home | Requesting new tasks for ATI
Wed Mar 12 15:03:09 2014 | SETI@home | Scheduler request completed: got 0 new tasks
Wed Mar 12 15:03:09 2014 | SETI@home | Project has no tasks available...

I believe the problem is associated with the bold text:
State: All (502) · In progress (107) · Validation pending (124) · Validation inconclusive (5) · Valid (265) · Invalid (0) · Error (1)
That number should be close to 100. I think you can blame that on "Workunits waiting for assimilation: 90,453"
ID: 1487951 · Report as offensive
Miklos M.

Send message
Joined: 5 May 99
Posts: 955
Credit: 136,115,648
RAC: 73
Hungary
Message 1487953 - Posted: 12 Mar 2014, 19:12:46 UTC - in response to Message 1487939.  

I increased the number of days to 10 and 10 in the Network settings as well as the preferences. It still does not want new work.
ID: 1487953 · Report as offensive
Filipe

Send message
Joined: 12 Aug 00
Posts: 218
Credit: 21,281,677
RAC: 20
Portugal
Message 1487955 - Posted: 12 Mar 2014, 19:17:00 UTC

"The splitters only produce new WUs as disk space is available."

Where did you come across that?



From some old tech news from Matt
ID: 1487955 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Avatar

Send message
Joined: 1 Apr 13
Posts: 1855
Credit: 268,616,081
RAC: 1,349
United States
Message 1487958 - Posted: 12 Mar 2014, 19:22:54 UTC - in response to Message 1487942.  

Yes, there are deliberately built-in "high water mark" limits of around 300K MB, and 25K AP, tasks 'ready to send'. The idea is not to keep splitting until all disk space is full, but to stop when there are 'enough' (for some given value of enough), and keep the disk access times snappy. Quite why everything is quite so slow to build up to the high water mark these last last few weeks, I don't know. Shorty storms and stuck tapes are visible to all, but it feels like something else is in play too. And I haven't quite put my finger on it yet.

Yeah, that is the interesting question. The whole supply-and-demand thing doesn't explain the way the splitters slow down by itself, but again it could be that process allocation is the issue. When we hit a bump in the road, we start slamming the servers to download work. So georgem and vader get busy servicing downloads, and again, using the "real time" vs. "anytime" process-priority theory I expressed above, it would seem that the AP splitters on georgem and the MB splitters on vader would suffer. The SSP indicates the number of results uploaded during the last hour, but not the number downloaded; I assume those two numbers are roughly equal, assuming there are files ready to send. It seems to me that I've seen splitting get slower the more uploads happened in the last hour, which would support that.

If anything, it might be a good time to look at what work lives on which machines and see whether a better mix could be achieved to reduce the impact of these swings in traffic.
ID: 1487958 · Report as offensive
rob smith Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22269
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1487959 - Posted: 12 Mar 2014, 19:23:08 UTC

Miklos - try something like 4 days minimum and 0.1 extra.
But don't be too surprised if it doesn't fill your buffers for some time, as the servers are being a bit reluctant just now...
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1487959 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Avatar

Send message
Joined: 1 Apr 13
Posts: 1855
Credit: 268,616,081
RAC: 1,349
United States
Message 1487960 - Posted: 12 Mar 2014, 19:24:22 UTC - in response to Message 1487953.  
Last modified: 12 Mar 2014, 19:32:43 UTC

I increased the number of days to 10 and 10 in the Network settings as well as the preferences. It still does not want new work.

Just to be clear, doesn't want, not didn't get, right?

After changing the parameters, did you tell BOINC to read the new config (Advanced > Read config files) or shut down and restart BOINC? If not, the changes have been stored but are not yet in effect.

Are you running any projects other than SETI?
ID: 1487960 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.