Panic Mode On (107) Server Problems?

Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1891864 - Posted: 26 Sep 2017, 2:55:31 UTC - in response to Message 1891859.  

My crunch-only machines are shut down from 4 PM to 9 PM on weekdays, so it's better for me to stock up before that shutdown. And since the CPUs on those machines are relatively slow, they don't chew through very much of that stash. I just factor that into my calculations. The GPUs, on the other hand, just maintain their usual queues up until the outage starts.

One of the things that I assume enters into the scheduler request/response cycle is some sort of timer on the server. If the timer expires before the request has been fulfilled, the scheduler simply responds with whatever it's gathered up until that instant, which may be nothing. I'm guessing that's what happens on those rare occasions when I do happen to get a "no tasks sent" response, but accompanied by another line saying only that no tasks are available for Astropulse v7, when clearly the request asked for both AP and MB tasks. I've just always assumed that the request simply timed out before it even got to the point of looking for MB tasks. If that's true, the next question might be, when does that timer's clock start...when the request reaches the server, when the server hands it to the scheduler, or back when it leaves the host making the request? If it's the latter, could there be a clock synchronization issue in play? I don't know if any of this is valid, but I thought I'd try tossing out a few random thoughts. ;^)

Obviously, if someone familiar with the internal scheduler code would chime in on this topic, it would save a whole bunch of time and speculation......but I don't think I'd hold my breath waiting for that to happen!
ID: 1891864 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1891872 - Posted: 26 Sep 2017, 4:22:32 UTC - in response to Message 1891864.  



Obviously, if someone familiar with the internal scheduler code would chime in on this topic, it would save a whole bunch of time and speculation......but I don't think I'd hold my breath waiting for that to happen!

Ha Ha. I wouldn't hold my breath either since Jeff went bye-bye. He was the only project scientist that ever explained any of the inner workings of the project servers. We don't have any input on how they are set up or any understanding of the specific mechanisms. We have only minimal input to the client side of things like the apps or the Manager.

I always wait out longer than the 305 second interval after shutting the host down and rescheduling before contacting the servers again for a work request. And I try to make sure the 4 crunchers are staggered enough in schedule requests so that the splitter buffer gets a chance to refill since my last request. I don't know how fast the 100 task buffer fills in reality but would expect it to be fairly fast since a couple 100,000 hosts are constantly hitting it.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1891872 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 36774
Credit: 261,360,520
RAC: 489
Australia
Message 1891876 - Posted: 26 Sep 2017, 4:50:51 UTC - in response to Message 1891859.  

Yes, the great mystery of the century ....... why do I constantly have troubles getting work on request when there are ~600,000 tasks in the buffer. I don't usually try stocking up till later in the evening since if I started earlier I would crunch through the majority of my overload in the early hours of the night before the outage starts while I'm sleeping and not shepherding the systems. If I could depend on getting work normally or on time, then I could build my overstock earlier in the day.

I'm sorry Keith, but I can't help you out there as I'm certainly having no problems here at all (unless all do).

Cheers.
ID: 1891876 · Report as offensive
Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 11415
Credit: 29,581,041
RAC: 66
United States
Message 1891879 - Posted: 26 Sep 2017, 5:04:37 UTC - in response to Message 1891872.  

since Jeff went bye-bye

I missed that. Does anyone know where he went?
ID: 1891879 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1891888 - Posted: 26 Sep 2017, 6:21:53 UTC - in response to Message 1891879.  

Rumor was that he went to work directly for the Breakthrough Listen project. Not involved directly with SETI anymore.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1891888 · Report as offensive
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9958
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1891890 - Posted: 26 Sep 2017, 6:55:26 UTC

Don't you mean Matt?
ID: 1891890 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1891891 - Posted: 26 Sep 2017, 6:56:32 UTC - in response to Message 1891890.  

Don't you mean Matt?

Duh. Yeah.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1891891 · Report as offensive
Cruncher-American Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1891896 - Posted: 26 Sep 2017, 11:42:59 UTC - in response to Message 1891872.  

I think the "100 WU" buffer is no more, as I occasionally get > 100 WUs on a work request. I assume it has been enlarged, but I don't know what size it is now.
ID: 1891896 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 36774
Credit: 261,360,520
RAC: 489
Australia
Message 1891909 - Posted: 26 Sep 2017, 22:51:27 UTC

Well I had just run out of GPU work on my main rig when the outrage ended and now both rigs have full caches again. :-)

Cheers.
ID: 1891909 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1891938 - Posted: 27 Sep 2017, 1:35:42 UTC

I had built up a big enough buffer on the Windows machines, but the Linux cruncher processes so fast that I had already worked through my buffer and 1/3 of the way into my cache. I wasn't getting enough work to keep up with production, so I was down over 100 tasks when I got home an hour ago. I had to use the server wakeup procedure to get them to send work to start replenishing my cache. Seems like the faster a host processes work the more it is ignored.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1891938 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1891954 - Posted: 27 Sep 2017, 4:08:59 UTC - in response to Message 1891896.  

I think the "100 WU" buffer is no more, as I occasionally get > 100 WUs on a work request. I assume it has been enlarged, but I don't know what size it is now.

You've got 2 video cards in each of your systems, so each system has a limit of 300 WUs: 100 for the CPU and 100 for each of the GPUs.
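
The arithmetic above can be written down as a tiny helper. It is a sketch only: the 100-per-device figure comes from this thread, not from any server documentation, and `host_task_limit` is a hypothetical name.

```python
# Assumed per-device cap, taken from the figures quoted in this thread.
TASKS_PER_DEVICE = 100


def host_task_limit(num_gpus: int, per_device: int = TASKS_PER_DEVICE) -> int:
    """Return the cap on tasks in progress: one CPU pool plus one pool per GPU."""
    if num_gpus < 0:
        raise ValueError("num_gpus must be non-negative")
    return per_device * (1 + num_gpus)


print(host_task_limit(2))  # 300
```

With no GPU the same helper gives 100, which matches the CPU-only case.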
Grant
Darwin NT
ID: 1891954 · Report as offensive
Cruncher-American Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1891960 - Posted: 27 Sep 2017, 5:52:16 UTC - in response to Message 1891954.  

Having a max of 300 WUs doesn't affect the server buffer size - no matter what I want, I can only get the max buffer size on one request for work - right? So if I get > 100 on a work request, the buffer must be > 100.
ID: 1891960 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1891962 - Posted: 27 Sep 2017, 6:01:10 UTC - in response to Message 1891960.  
Last modified: 27 Sep 2017, 6:08:01 UTC

Having a max of 300 WUs doesn't affect the server buffer size - no matter what I want, I can only get the max buffer size on one request for work - right? So if I get > 100 on a work request, the buffer must be > 100.

Pretty sure the feeder has been 200 WUs for a while now.
Grant
Darwin NT
ID: 1891962 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1891965 - Posted: 27 Sep 2017, 6:41:01 UTC - in response to Message 1891962.  

Having a max of 300 WUs doesn't affect the server buffer size - no matter what I want, I can only get the max buffer size on one request for work - right? So if I get > 100 on a work request, the buffer must be > 100.

Pretty sure the feeder has been 200 WUs for a while now.

That would make sense since I have seen > 100 tasks delivered on request when I had cache levels low. Unless the server can quickly dump and refill on the same request. Never seen > 200 tasks so think Grant is correct.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1891965 · Report as offensive
Profile Kissagogo27 Special Project $75 donor
Joined: 6 Nov 99
Posts: 716
Credit: 8,032,827
RAC: 62
France
Message 1891973 - Posted: 27 Sep 2017, 10:19:13 UTC

Here is a log of received tasks, taken from setispirit 3.3.0 when I launch it from another PC (shared BOINC folder over the LAN):

02-Aug-2017 07:53:02 [SETI@home] Scheduler request completed: got 126 new tasks
03-Aug-2017 19:27:56 [SETI@home] Scheduler request completed: got 9 new tasks
04-Aug-2017 16:51:26 [SETI@home] Scheduler request completed: got 94 new tasks
06-Aug-2017 11:05:43 [SETI@home] Scheduler request completed: got 99 new tasks
08-Aug-2017 13:02:19 [SETI@home] Scheduler request completed: got 58 new tasks
09-Aug-2017 17:33:45 [SETI@home] Scheduler request completed: got 117 new tasks
24-Aug-2017 10:55:24 [SETI@home] Scheduler request completed: got 155 new tasks
26-Aug-2017 10:42:42 [SETI@home] Scheduler request completed: got 131 new tasks
30-Aug-2017 13:38:48 [SETI@home] Scheduler request completed: got 100 new tasks
31-Aug-2017 17:45:31 [SETI@home] Scheduler request completed: got 118 new tasks
02-Sep-2017 10:42:43 [SETI@home] Scheduler request completed: got 100 new tasks
17-Sep-2017 09:22:50 [SETI@home] Scheduler request completed: got 124 new tasks
18-Sep-2017 19:00:05 [SETI@home] Scheduler request completed: got 95 new tasks
19-Sep-2017 13:27:53 [SETI@home] Scheduler request completed: got 45 new tasks
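
As a rough check on the numbers above, the log lines can be tallied with a few lines of Python. This is only a sketch: the regular expression assumes every line has exactly the "Scheduler request completed: got N new tasks" form shown here, and the function name is made up for illustration.

```python
import re

# The 14 log lines from this post, copied verbatim.
LOG = """\
02-Aug-2017 07:53:02 [SETI@home] Scheduler request completed: got 126 new tasks
03-Aug-2017 19:27:56 [SETI@home] Scheduler request completed: got 9 new tasks
04-Aug-2017 16:51:26 [SETI@home] Scheduler request completed: got 94 new tasks
06-Aug-2017 11:05:43 [SETI@home] Scheduler request completed: got 99 new tasks
08-Aug-2017 13:02:19 [SETI@home] Scheduler request completed: got 58 new tasks
09-Aug-2017 17:33:45 [SETI@home] Scheduler request completed: got 117 new tasks
24-Aug-2017 10:55:24 [SETI@home] Scheduler request completed: got 155 new tasks
26-Aug-2017 10:42:42 [SETI@home] Scheduler request completed: got 131 new tasks
30-Aug-2017 13:38:48 [SETI@home] Scheduler request completed: got 100 new tasks
31-Aug-2017 17:45:31 [SETI@home] Scheduler request completed: got 118 new tasks
02-Sep-2017 10:42:43 [SETI@home] Scheduler request completed: got 100 new tasks
17-Sep-2017 09:22:50 [SETI@home] Scheduler request completed: got 124 new tasks
18-Sep-2017 19:00:05 [SETI@home] Scheduler request completed: got 95 new tasks
19-Sep-2017 13:27:53 [SETI@home] Scheduler request completed: got 45 new tasks
"""

PATTERN = re.compile(r"Scheduler request completed: got (\d+) new tasks")


def tasks_per_request(log_text: str) -> list[int]:
    """Extract the task count from every matching log line, in order."""
    return [int(m.group(1)) for m in PATTERN.finditer(log_text)]


counts = tasks_per_request(LOG)
over_100 = [c for c in counts if c > 100]
print(f"{len(counts)} requests, largest {max(counts)}, {len(over_100)} above 100")
# 14 requests, largest 155, 6 above 100
```

Six of the fourteen requests returned more than 100 tasks and none returned more than 200, which is consistent with (but does not prove) the 200 WU feeder size suggested earlier in the thread.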


I made one Home location for BOINC with only SETI WUs (CPU + GPU) enabled for crunching and download, and a 1 day cache,
and another Work location with only AP WUs (CPU + GPU) enabled for crunching and download, and a 10 day cache.

When I set the location to Work, no SETI WUs download (normal) and the GPU cache runs empty before I switch to the Home location. Then a lot of WUs download, but with more than a 1 day cache for the CPU (I don't understand why; strange behavior to me)...

Then I set the Work location again, waiting for some rare AP WUs to download until the Ar/Blc tasks are processed...
ID: 1891973 · Report as offensive
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1891988 - Posted: 27 Sep 2017, 13:43:50 UTC

My reply on how the scheduler works was swallowed by the maintenance :(

I posted just as the servers were being turned off
ID: 1891988 · Report as offensive
Profile Bill G Special Project $75 donor
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1891991 - Posted: 27 Sep 2017, 13:50:56 UTC - in response to Message 1891938.  
Last modified: 27 Sep 2017, 13:51:32 UTC

........... Seems like the faster a host processes work the more it is ignored.

I have noticed that for some time now but just did not comment. It has always been the case with my computers and there is not that much difference between them.

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours
ID: 1891991 · Report as offensive
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1892003 - Posted: 27 Sep 2017, 14:45:01 UTC - in response to Message 1891864.  
Last modified: 27 Sep 2017, 14:47:31 UTC

Obviously, if someone familiar with the internal scheduler code would chime in on this topic, it would save a whole bunch of time and speculation......but I don't think I'd hold my breath waiting for that to happen!


Ok try 2.

So I have skimmed the scheduling code, so I am a little familiar with how it works. The first thing is what we see on the SSP page (RAS! Redundant Acronym Syndrome), namely "Results ready to send". As the label says, it is the tasks in the database that have the unsent status on them. So it is a query from the PHP page, and done: it has counted the number of tasks that haven't been sent.
Second, what we don't see, but is listed on the SSP page, is the feeder + scheduler combo. The scheduler has an internal buffer of tasks (and therefore a portion of the database) in memory, which is constantly replenished by the feeder. When it assigns a task to a person, it obviously has to record that in the database. The size of the internal buffer can differ from project to project.
Now the third and final process we see is the actual scheduler, which deals with handing out work. That is the logic of the scheduler: it determines whether the tasks in the buffer are suitable for the compute type that is requesting the work. Obviously, with the Arecibo VLAR limit on Nvidia cards, there is a little more logic processing involved, and while this is happening there is a timeout that the scheduler has to respect. That is what is happening when we get "Project has no available work", etc. for Nvidia cards when there is stuff ready to send.
There is another thing: the scheduler will never go and query the database for available work, as that is computationally EXPENSIVE!!! All task retrieval and insertion into the scheduler's buffer is done by the feeder.
When I say expensive, I mean it has to wait for disk IO to become available, it has to wait for the DBMS to run the query and respond, and then the scheduler has to parse the response and build an understanding of what it sees in the results. That can easily exceed the timeout that the scheduler has to work with, so it never does it.
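
To make the moving parts above concrete, here is a toy model in Python. It is not BOINC code: `Task`, `Feeder`, `handle_request`, the `vlar` flag and the time budget are all invented for illustration, and it only assumes the behaviour described in this post (the scheduler scans an in-memory buffer filled by the feeder, never queries the database itself, and gives up when its time budget runs out).

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    name: str
    vlar: bool  # hypothetical flag standing in for "Arecibo VLAR"


class Feeder:
    """Toy stand-in for the feeder: keeps a fixed number of slots filled."""

    def __init__(self, slots: int, unsent: list[Task]):
        self.unsent = unsent  # stands in for the unsent results in the database
        self.slots: list[Optional[Task]] = [None] * slots

    def refill(self) -> None:
        """Copy unsent tasks into any empty slots, oldest first."""
        for i, slot in enumerate(self.slots):
            if slot is None and self.unsent:
                self.slots[i] = self.unsent.pop(0)


def handle_request(
    feeder: Feeder,
    wanted: int,
    suitable: Callable[[Task], bool],
    budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> list[Task]:
    """Scan only the in-memory slots; stop when the time budget runs out."""
    deadline = clock() + budget_s
    sent: list[Task] = []
    for i, task in enumerate(feeder.slots):
        if len(sent) >= wanted or clock() >= deadline:
            break
        if task is not None and suitable(task):
            sent.append(task)
            feeder.slots[i] = None  # slot is now free for the feeder to refill
    return sent


# Five unsent tasks, but the three buffer slots happen to fill with VLARs.
unsent = [Task("vlar-1", True), Task("vlar-2", True), Task("vlar-3", True),
          Task("blc-1", False), Task("blc-2", False)]
feeder = Feeder(slots=3, unsent=unsent)
feeder.refill()

# A request that cannot take VLAR tasks sees nothing suitable in the buffer.
sent = handle_request(feeder, wanted=10, suitable=lambda t: not t.vlar, budget_s=1.0)
print(len(sent), "sent;", len(feeder.unsent), "still unsent")  # 0 sent; 2 still unsent
```

In this toy run the request comes back empty although two suitable tasks are still waiting in the unsent list, which mirrors the "no work available" replies described above.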
ID: 1892003 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1892020 - Posted: 27 Sep 2017, 17:46:25 UTC - in response to Message 1892003.  

Obviously, if someone familiar with the internal scheduler code would chime in on this topic, it would save a whole bunch of time and speculation......but I don't think I'd hold my breath waiting for that to happen!


Ok try 2.

So I have skimmed the scheduling code, so I am a little familiar with how it works. The first thing is what we see on the SSP page (RAS! Redundant Acronym Syndrome), namely "Results ready to send". As the label says, it is the tasks in the database that have the unsent status on them. So it is a query from the PHP page, and done: it has counted the number of tasks that haven't been sent.
Second, what we don't see, but is listed on the SSP page, is the feeder + scheduler combo. The scheduler has an internal buffer of tasks (and therefore a portion of the database) in memory, which is constantly replenished by the feeder. When it assigns a task to a person, it obviously has to record that in the database. The size of the internal buffer can differ from project to project.
Now the third and final process we see is the actual scheduler, which deals with handing out work. That is the logic of the scheduler: it determines whether the tasks in the buffer are suitable for the compute type that is requesting the work. Obviously, with the Arecibo VLAR limit on Nvidia cards, there is a little more logic processing involved, and while this is happening there is a timeout that the scheduler has to respect. That is what is happening when we get "Project has no available work", etc. for Nvidia cards when there is stuff ready to send.
There is another thing: the scheduler will never go and query the database for available work, as that is computationally EXPENSIVE!!! All task retrieval and insertion into the scheduler's buffer is done by the feeder.
When I say expensive, I mean it has to wait for disk IO to become available, it has to wait for the DBMS to run the query and respond, and then the scheduler has to parse the response and build an understanding of what it sees in the results. That can easily exceed the timeout that the scheduler has to work with, so it never does it.

This is the part that needs to be fixed, whether that means removing the Arecibo VLAR restriction on Nvidia cards, buying better and faster hardware to perform the database query, or extending the timeout long enough for the task insertion query to finish.

+1 Thanks for the explanation of the feeder mechanism.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1892020 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1892024 - Posted: 27 Sep 2017, 18:34:57 UTC

I suspect, and it's only a suspicion, that the reason invoking the "ghost recovery" process is often successful in retrieving new tasks, even when no ghosts are present, is that a different timer is used, or at least a different, longer time interval. That "ghost recovery" process would, by necessity, require a database query in order to determine what tasks the server thinks are on hand for the requesting host. The results of that query then would have to be compared, task by task, against the tasks identified in the "<other_results>" section of the scheduler request, in order to see if any are missing and need to be resent. It would make sense to me (if making sense matters) that a longer response time might be allowed in order to accomplish that database retrieval and comparison, thus perhaps providing an extra cushion for normal scheduler operations.
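
The task-by-task comparison guessed at above amounts to a set difference. The sketch below only illustrates that idea: the function name and the task names are made up, and nothing here is taken from the real scheduler code.

```python
def find_ghosts(server_in_progress: set[str], other_results: set[str]) -> set[str]:
    """Tasks the server believes are on the host but the host did not report."""
    return server_in_progress - other_results


# Hypothetical task names, for illustration only.
server_side = {"task_a", "task_b", "task_c"}   # what the database query returns
reported = {"task_a", "task_c"}                # what the request lists in <other_results>

print(sorted(find_ghosts(server_side, reported)))  # ['task_b']
```

The comparison itself is cheap once both lists are in memory; if the speculation above is right, the costly step would be the database query that produces the server-side list.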
ID: 1892024 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.