Panic Mode On (107) Server Problems?

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1891864 - Posted: 26 Sep 2017, 2:55:31 UTC - in response to Message 1891859. My crunch-only machines are shut down from 4 PM to 9 PM on weekdays, so it's better for me to stock up before that shutdown. And since the CPUs on those machines are relatively slow, they don't chew through very much of that stash. I just factor that into my calculations. The GPUs, on the other hand, just maintain their usual queues up until the outage starts. One of the things that I assume enters into the scheduler request/response cycle is some sort of timer on the server. If the timer expires before the request has been fulfilled, the scheduler simply responds with whatever it's gathered up until that instant, which may be nothing. I'm guessing that's what happens on those rare occasions when I do happen to get a "no tasks sent" response, but accompanied by another line saying only that no tasks are available for Astropulse v7, when clearly the request asked for both AP and MB tasks. I've just always assumed that the request simply timed out before it even got to the point of looking for MB tasks. If that's true, the next question might be, when does that timer's clock start...when the request reaches the server, when the server hands it to the scheduler, or back when it leaves the host making the request? If it's the latter, could there be a clock synchronization issue in play? I don't know if any of this is valid, but I thought I'd try tossing out a few random thoughts. ;^) Obviously, if someone familiar with the internal scheduler code would chime in on this topic, it would save a whole bunch of time and speculation......but I don't think I'd hold my breath waiting for that to happen! ID: 1891864 ·
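The timer hypothesis above can be illustrated with a small sketch. This is not the real scheduler code: the function name `handle_request`, the per-type `check_cost` values, and the use of simulated seconds instead of a real clock are all assumptions made for illustration. It only shows how a fixed time budget could produce a "no tasks sent" reply that mentions Astropulse v7 alone, even though MB work was also requested.

```python
def handle_request(requested, available, budget, check_cost):
    """Walk the requested task types in order, charging a simulated cost
    (in seconds) for each check, and stop once the budget would be exceeded.

    All names and numbers here are hypothetical; the real server may differ.
    """
    elapsed = 0.0
    sent = []
    messages = []
    for kind in requested:
        if elapsed + check_cost[kind] > budget:
            break  # timer expired: reply with whatever was gathered so far
        elapsed += check_cost[kind]
        tasks = available.get(kind, [])
        if tasks:
            sent.extend(tasks)
        else:
            messages.append(f"No tasks are available for {kind}")
    if not sent:
        messages.insert(0, "No tasks sent")
    return sent, messages


# The AP check uses most of the budget, so MB is never examined.
requested = ["Astropulse v7", "MB"]
available = {"MB": ["mb_1", "mb_2"]}
cost = {"Astropulse v7": 0.8, "MB": 0.5}
short = handle_request(requested, available, budget=1.0, check_cost=cost)
longer = handle_request(requested, available, budget=2.0, check_cost=cost)
```

With the short budget the reply carries no tasks and only the Astropulse message; with the longer budget the same request returns the MB tasks. Whether the real timer starts at the server or at the host is not something this sketch can answer.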

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1891872 - Posted: 26 Sep 2017, 4:22:32 UTC - in response to Message 1891864. "Obviously, if someone familiar with the internal scheduler code would chime in on this topic, it would save a whole bunch of time and speculation......but I don't think I'd hold my breath waiting for that to happen!" Ha Ha. I wouldn't hold my breath either since Jeff went bye-bye. He was the only project scientist that ever explained any of the inner workings of the project servers. We don't have any input on how they are set up or any understanding of the specific mechanisms. We have only minimal input to the client side of things like the apps or the Manager. I always wait longer than the 305-second interval after shutting the host down and rescheduling before contacting the servers again with a work request. And I try to make sure the 4 crunchers are staggered enough in their schedule requests that the splitter buffer gets a chance to refill after my last request. I don't know how fast the 100-task buffer fills in reality, but I would expect it to be fairly fast since a couple hundred thousand hosts are constantly hitting it. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1891872 ·

Wiggo Send message Joined: 24 Jan 00 Posts: 35586 Credit: 261,360,520 RAC: 489	Message 1891876 - Posted: 26 Sep 2017, 4:50:51 UTC - in response to Message 1891859. "Yes, the great mystery of the century ....... why do I constantly have troubles getting work on request when there are ~600,000 tasks in the buffer. I don't usually try stocking up till later in the evening since if I started earlier I would crunch through the majority of my overload in the early hours of the night before the outage starts while I'm sleeping and not shepherding the systems. If I could depend on getting work normally or on time, then I could build my overstock earlier in the day." I'm sorry Keith, but I can't help you out there as I'm certainly having no problems here at all (unless all do). Cheers. ID: 1891876 ·

betreger Send message Joined: 29 Jun 99 Posts: 11385 Credit: 29,581,041 RAC: 66	Message 1891879 - Posted: 26 Sep 2017, 5:04:37 UTC - in response to Message 1891872. "since Jeff went bye-bye" I missed that. Does anyone know where he went? ID: 1891879 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1891888 - Posted: 26 Sep 2017, 6:21:53 UTC - in response to Message 1891879. Rumor was that he went to work directly for the Breakthrough Listen project. Not involved directly with SETI anymore. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1891888 ·

Bernie Vine Volunteer moderator Volunteer tester Send message Joined: 26 May 99 Posts: 9954 Credit: 103,452,613 RAC: 328	Message 1891890 - Posted: 26 Sep 2017, 6:55:26 UTC Don't you mean Matt? ID: 1891890 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1891891 - Posted: 26 Sep 2017, 6:56:32 UTC - in response to Message 1891890. "Don't you mean Matt?" Duh. Yeah. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1891891 ·

Cruncher-American Send message Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340	Message 1891896 - Posted: 26 Sep 2017, 11:42:59 UTC - in response to Message 1891872. I think the "100 WU" buffer is no more, as I occasionally get > 100 WUs on a work request. I assume it has been enlarged, but I don't know what size it is now. ID: 1891896 ·

Wiggo Send message Joined: 24 Jan 00 Posts: 35586 Credit: 261,360,520 RAC: 489	Message 1891909 - Posted: 26 Sep 2017, 22:51:27 UTC Well, I had just run out of GPU work on my main rig when the outage ended, and now both rigs have full caches again. :-) Cheers. ID: 1891909 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1891938 - Posted: 27 Sep 2017, 1:35:42 UTC I had built up a big enough buffer on the Windows machines, but the Linux cruncher processes so fast that I had already worked through my buffer and a third of the way into my cache. I wasn't getting enough work to keep up with production, so I was down over 100 tasks when I got home an hour ago. I had to use the server wakeup procedure to get them to send work to start replenishing my cache. Seems like the faster a host processes work the more it is ignored. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1891938 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13797 Credit: 208,696,464 RAC: 304	Message 1891954 - Posted: 27 Sep 2017, 4:08:59 UTC - in response to Message 1891896. "I think the '100 WU' buffer is no more, as I occasionally get > 100 WUs on a work request. I assume it has been enlarged, but I don't know what size it is now." You've got 2 video cards on each of your systems, so each system has a limit of 300 WUs: 100 for the CPU, and 100 for each of the GPUs, total = 300. Grant Darwin NT ID: 1891954 ·
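The per-host limit described above is simple arithmetic, sketched here in Python. The function name `host_task_limit` and the constant name are made up for illustration; the figure of 100 per device is the one quoted in the post, not a value read from the server configuration.

```python
PER_DEVICE_LIMIT = 100  # figure quoted in the post; assumed, not verified


def host_task_limit(gpu_count, per_device_limit=PER_DEVICE_LIMIT):
    """Total tasks a host may hold: one allowance for the CPU
    plus one allowance for each GPU."""
    if gpu_count < 0:
        raise ValueError("gpu_count must be non-negative")
    return per_device_limit + per_device_limit * gpu_count


two_gpu_limit = host_task_limit(2)   # CPU + 2 GPUs
cpu_only_limit = host_task_limit(0)  # CPU alone
```

Note that this is a cap on what a host may hold in total, which is a separate question from how many tasks a single scheduler request can return.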

Cruncher-American Send message Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340	Message 1891960 - Posted: 27 Sep 2017, 5:52:16 UTC - in response to Message 1891954. Having a max of 300 WUs doesn't affect the server buffer size - no matter what I want, I can only get the max buffer size on one request for work - right? So if I get > 100 on a work request, the buffer must be > 100. ID: 1891960 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13797 Credit: 208,696,464 RAC: 304	Message 1891962 - Posted: 27 Sep 2017, 6:01:10 UTC - in response to Message 1891960. Last modified: 27 Sep 2017, 6:08:01 UTC "Having a max of 300 WUs doesn't affect the server buffer size - no matter what I want, I can only get the max buffer size on one request for work - right? So if I get > 100 on a work request, the buffer must be > 100." Pretty sure the feeder has been 200 WUs for a while now. Grant Darwin NT ID: 1891962 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1891965 - Posted: 27 Sep 2017, 6:41:01 UTC - in response to Message 1891962. "Having a max of 300 WUs doesn't affect the server buffer size - no matter what I want, I can only get the max buffer size on one request for work - right? So if I get > 100 on a work request, the buffer must be > 100." "Pretty sure the feeder has been 200 WUs for a while now." That would make sense since I have seen > 100 tasks delivered on request when I had cache levels low. Unless the server can quickly dump and refill on the same request. Never seen > 200 tasks so think Grant is correct. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1891965 ·

Kissagogo27 Send message Joined: 6 Nov 99 Posts: 716 Credit: 8,032,827 RAC: 62	Message 1891973 - Posted: 27 Sep 2017, 10:19:13 UTC Here is some of the received-task log from setispirit 3.3.0 when I launch it from another PC (BOINC folder shared over the LAN):
02-Aug-2017 07:53:02 [SETI@home] Scheduler request completed: got 126 new tasks
03-Aug-2017 19:27:56 [SETI@home] Scheduler request completed: got 9 new tasks
04-Aug-2017 16:51:26 [SETI@home] Scheduler request completed: got 94 new tasks
06-Aug-2017 11:05:43 [SETI@home] Scheduler request completed: got 99 new tasks
08-Aug-2017 13:02:19 [SETI@home] Scheduler request completed: got 58 new tasks
09-Aug-2017 17:33:45 [SETI@home] Scheduler request completed: got 117 new tasks
24-Aug-2017 10:55:24 [SETI@home] Scheduler request completed: got 155 new tasks
26-Aug-2017 10:42:42 [SETI@home] Scheduler request completed: got 131 new tasks
30-Aug-2017 13:38:48 [SETI@home] Scheduler request completed: got 100 new tasks
31-Aug-2017 17:45:31 [SETI@home] Scheduler request completed: got 118 new tasks
02-Sep-2017 10:42:43 [SETI@home] Scheduler request completed: got 100 new tasks
17-Sep-2017 09:22:50 [SETI@home] Scheduler request completed: got 124 new tasks
18-Sep-2017 19:00:05 [SETI@home] Scheduler request completed: got 95 new tasks
19-Sep-2017 13:27:53 [SETI@home] Scheduler request completed: got 45 new tasks
I made one Home location for BOINC that crunches and downloads only SETI WUs (CPU + GPU) with a 1-day cache, and another Work location that crunches and downloads only AP WUs (CPU + GPU) with a 10-day cache. When I set the location to Work, no SETI WUs download (normal) and the GPU cache runs empty before I set the Home location again; then a lot of WUs download, but with more than a 1-day cache for the CPU (I don't understand why, strange behavior to me)... Then I set the Work location, waiting for some rare AP WUs to download until the Ar/Blc ones are processed... ID: 1891973 ·
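The log above can be checked against the buffer-size guesses made earlier in the thread. The sketch below copies the log lines from the post and pulls out the task count per request; the helper name `tasks_per_request` is made up, and the conclusion drawn is only as good as the assumption that one request cannot return more than the feeder holds.

```python
import re

# Log lines copied from the post above.
LOG = """\
02-Aug-2017 07:53:02 [SETI@home] Scheduler request completed: got 126 new tasks
03-Aug-2017 19:27:56 [SETI@home] Scheduler request completed: got 9 new tasks
04-Aug-2017 16:51:26 [SETI@home] Scheduler request completed: got 94 new tasks
06-Aug-2017 11:05:43 [SETI@home] Scheduler request completed: got 99 new tasks
08-Aug-2017 13:02:19 [SETI@home] Scheduler request completed: got 58 new tasks
09-Aug-2017 17:33:45 [SETI@home] Scheduler request completed: got 117 new tasks
24-Aug-2017 10:55:24 [SETI@home] Scheduler request completed: got 155 new tasks
26-Aug-2017 10:42:42 [SETI@home] Scheduler request completed: got 131 new tasks
30-Aug-2017 13:38:48 [SETI@home] Scheduler request completed: got 100 new tasks
31-Aug-2017 17:45:31 [SETI@home] Scheduler request completed: got 118 new tasks
02-Sep-2017 10:42:43 [SETI@home] Scheduler request completed: got 100 new tasks
17-Sep-2017 09:22:50 [SETI@home] Scheduler request completed: got 124 new tasks
18-Sep-2017 19:00:05 [SETI@home] Scheduler request completed: got 95 new tasks
19-Sep-2017 13:27:53 [SETI@home] Scheduler request completed: got 45 new tasks
"""

PATTERN = re.compile(r"got (\d+) new tasks")


def tasks_per_request(log_text):
    """Return the number of tasks delivered by each scheduler request."""
    return [int(match.group(1)) for match in PATTERN.finditer(log_text)]


counts = tasks_per_request(LOG)
largest = max(counts)                            # biggest single delivery
over_100 = sum(1 for c in counts if c > 100)     # requests that beat 100
```

In this sample several requests return more than 100 tasks and none returns more than 200, which fits the suggestion that the feeder holds 200 but does not prove it.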

Kiska Volunteer tester Send message Joined: 31 Mar 12 Posts: 302 Credit: 3,067,762 RAC: 0	Message 1891988 - Posted: 27 Sep 2017, 13:43:50 UTC My reply for how the scheduler works, was swallowed by the maintenance :( I posted just as the servers were being turned off ID: 1891988 ·

Bill G Send message Joined: 1 Jun 01 Posts: 1282 Credit: 187,688,550 RAC: 182	Message 1891991 - Posted: 27 Sep 2017, 13:50:56 UTC - in response to Message 1891938. Last modified: 27 Sep 2017, 13:51:32 UTC "........... Seems like the faster a host processes work the more it is ignored." I have noticed that for some time now but just did not comment. It has always been the case with my computers and there is not that much difference between them. SETI@home classic workunits 4,019 SETI@home classic CPU time 34,348 hours ID: 1891991 ·

Kiska Volunteer tester Send message Joined: 31 Mar 12 Posts: 302 Credit: 3,067,762 RAC: 0	Message 1892003 - Posted: 27 Sep 2017, 14:45:01 UTC - in response to Message 1891864. Last modified: 27 Sep 2017, 14:47:31 UTC "Obviously, if someone familiar with the internal scheduler code would chime in on this topic, it would save a whole bunch of time and speculation......but I don't think I'd hold my breath waiting for that to happen!" OK, try 2. I have skimmed the scheduling code, so I am a little familiar with how it works. The first thing is what we see on the SSP page (RAS! Redundant Acronym Syndrome), that is "Results ready to send". As the name says, it is the tasks in the database that have the unsent status on them. So it is a query from the PHP page, and done: it has counted the number of tasks that haven't been sent. The second thing, which we don't see but which is listed on the SSP page, is the feeder + scheduler combo. The scheduler has an internal buffer of tasks (and therefore a portion of the database) in memory, which is constantly being replenished by the feeder. When it assigns a task to a person, it obviously has to record that in the database. The size of the internal buffer can differ from project to project. The third and final process we see is the actual scheduler, which deals with handing out work. That is the logic of the scheduler: it determines whether the tasks in the buffer are suitable for the compute type that is requesting the work. Obviously, with the Arecibo VLAR limit on Nvidia cards, there is a little more logic processing, and while this is happening there is a timeout that the scheduler has to follow. That is what is happening when we get "Project has no available work", etc. for Nvidia cards when there is stuff ready to send. Another thing is that the scheduler will never go and query the database for available work, as that is computationally EXPENSIVE!!! All that task retrieval and insertion into the scheduler's buffer is done by the feeder. When I say expensive, I mean it has to wait for disk IO to become available, it has to wait for the DBMS to respond to the query and actually run the query, then the scheduler has to parse the response and build an understanding of what it sees in the results, and that can easily exceed the timeout that the scheduler has to work with, so it never does it. ID: 1892003 ·
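The feeder and scheduler split described above can be sketched as a toy model. Everything here is hypothetical: the class and function names, the `vlar` flag, and the three-slot buffer are invented for illustration and are not taken from the BOINC source. The point is only that the scheduler looks at the in-memory buffer and never at the database, so an Nvidia request can come back empty while unsent work still exists.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    vlar: bool = False  # hypothetical flag marking an Arecibo VLAR task


class Feeder:
    """Copies unsent tasks from the (simulated) database into a small buffer."""

    def __init__(self, database, slots):
        self.database = deque(database)  # stands in for unsent database rows
        self.slots = slots               # buffer size; varies by project
        self.buffer = []

    def refill(self):
        while len(self.buffer) < self.slots and self.database:
            self.buffer.append(self.database.popleft())


def schedule(feeder, device, wanted):
    """Hand out up to `wanted` suitable tasks, reading only the buffer."""
    sent = []
    for task in list(feeder.buffer):
        if len(sent) == wanted:
            break
        if device == "nvidia" and task.vlar:
            continue  # assumed restriction: no Arecibo VLARs to Nvidia cards
        feeder.buffer.remove(task)
        sent.append(task)
    return sent


db = [Task("a", True), Task("b", True), Task("c", True), Task("d"), Task("e")]
feeder = Feeder(db, slots=3)
feeder.refill()
first_nvidia = schedule(feeder, "nvidia", 2)  # buffer holds only VLARs
cpu_batch = schedule(feeder, "cpu", 2)        # CPU request drains two slots
feeder.refill()
second_nvidia = schedule(feeder, "nvidia", 2)
```

Here the first Nvidia request gets nothing even though two suitable tasks are waiting in the database. Only after another host drains some slots and the feeder refills them does the same request succeed.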

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1892020 - Posted: 27 Sep 2017, 17:46:25 UTC - in response to Message 1892003. "'Obviously, if someone familiar with the internal scheduler code would chime in on this topic, it would save a whole bunch of time and speculation......but I don't think I'd hold my breath waiting for that to happen!' OK, try 2. I have skimmed the scheduling code, so I am a little familiar with how it works. The first thing is what we see on the SSP page (RAS! Redundant Acronym Syndrome), that is 'Results ready to send'. As the name says, it is the tasks in the database that have the unsent status on them. So it is a query from the PHP page, and done: it has counted the number of tasks that haven't been sent. The second thing, which we don't see but which is listed on the SSP page, is the feeder + scheduler combo. The scheduler has an internal buffer of tasks (and therefore a portion of the database) in memory, which is constantly being replenished by the feeder. When it assigns a task to a person, it obviously has to record that in the database. The size of the internal buffer can differ from project to project. The third and final process we see is the actual scheduler, which deals with handing out work. That is the logic of the scheduler: it determines whether the tasks in the buffer are suitable for the compute type that is requesting the work. Obviously, with the Arecibo VLAR limit on Nvidia cards, there is a little more logic processing, and while this is happening there is a timeout that the scheduler has to follow. That is what is happening when we get 'Project has no available work', etc. for Nvidia cards when there is stuff ready to send. Another thing is that the scheduler will never go and query the database for available work, as that is computationally EXPENSIVE!!! All that task retrieval and insertion into the scheduler's buffer is done by the feeder. When I say expensive, I mean it has to wait for disk IO to become available, it has to wait for the DBMS to respond to the query and actually run the query, then the scheduler has to parse the response and build an understanding of what it sees in the results, and that can easily exceed the timeout that the scheduler has to work with, so it never does it." This is the part that needs to be fixed, whether that means removing the Arecibo VLAR restriction on Nvidia cards, buying better and faster hardware to perform the database query, or extending the timeout long enough for the task insertion query to finish. +1 Thanks for the explanation of the feeder mechanism. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1892020 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1892024 - Posted: 27 Sep 2017, 18:34:57 UTC I suspect, and it's only a suspicion, that the reason invoking the "ghost recovery" process is often successful in retrieving new tasks, even when no ghosts are present, is that a different timer is used, or at least a different, longer time interval. That "ghost recovery" process would, by necessity, require a database query in order to determine what tasks the server thinks are on hand for the requesting host. The results of that query then would have to be compared, task by task, against the tasks identified in the "<other_results>" section of the scheduler request, in order to see if any are missing and need to be resent. It would make sense to me (if making sense matters) that a longer response time might be allowed in order to accomplish that database retrieval and comparison, thus perhaps providing an extra cushion for normal scheduler operations. ID: 1892024 ·
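The task-by-task comparison described above amounts to a set difference. The sketch below assumes the server's view and the host's report are both available as plain lists of task names; the function name `find_ghosts` and the sample names are hypothetical, and the real request format and database query are not modeled.

```python
def find_ghosts(server_assigned, reported_by_host):
    """Return tasks the server believes the host holds but which the host
    did not list in its request (the names would come from the
    <other_results> section mentioned above). Server order is preserved."""
    reported = set(reported_by_host)
    return [name for name in server_assigned if name not in reported]


# Hypothetical task names for illustration only.
server_view = ["wu_1", "wu_2", "wu_3"]
host_report = ["wu_1", "wu_3"]
ghosts = find_ghosts(server_view, host_report)
no_ghosts = find_ghosts(server_view, server_view)
```

The comparison itself is cheap; the expensive part would be the database query that produces the server's list, which is why a longer time allowance for this path seems plausible, though that remains a guess.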
