The Server Issues / Outages Thread - Panic Mode On! (118)

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2026799 - Posted: 8 Jan 2020, 12:07:58 UTC Referring back to the server issue of 20 December (Anonymous Platform failure after upgrade), I have written up the story so far at #3419. Nils Høimyr of LHC has produced some useful diagnostics, but LHC now feel that they have to refer the problem back to David Anderson. ID: 2026799 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 2026806 - Posted: 8 Jan 2020, 13:18:37 UTC - in response to Message 2026795. I'm going to hang out in the wilds for a bit. I'll do some howling at the moon and dancing around the fire in hopes that these silly rituals help keep the seti machines working well :-). . . Have fun and don't bring any stray coyotes home :) Stephen What about "friendly" stray Coyotes? . . If she is running about in the woods howling you just never know ... Stephen ? ? ID: 2026806 ·

jdzukley Send message Joined: 6 Apr 11 Posts: 19 Credit: 26,357,809 RAC: 74	Message 2026828 - Posted: 8 Jan 2020, 14:59:43 UTC From my perspective, and while I do not want to discount that there are issues and problems out there... If you look at the current server status page lately and currently, we have maxed out the site. We are all crunching all the work the site can give to us. The ready to send queue is very low, and generally is, and the task counts being created are high. This is all great stuff as far as I am concerned! ID: 2026828 ·

Kiska Volunteer tester Send message Joined: 31 Mar 12 Posts: 302 Credit: 3,067,762 RAC: 0	Message 2026830 - Posted: 8 Jan 2020, 15:19:43 UTC - in response to Message 2026828. From my perspective, and while I do not want to discount that there are issues and problems out there... If you look at the current server status page lately and currently, we have maxed out the site. We are all crunching all the work the site can give to us. The ready to send queue is very low, and generally is, and the task counts being created are high. This is all great stuff as far as I am concerned! Munin graphs kinda confirm this: ID: 2026830 ·

W-K 666 Volunteer tester Send message Joined: 18 May 99 Posts: 19065 Credit: 40,757,560 RAC: 67	Message 2026891 - Posted: 9 Jan 2020, 0:34:48 UTC - in response to Message 2026859. Beta is back, or at least there's life there. It's slowly coming to life, but don't expect that everything works yet, It's been down since Dec 20, so it will take time to recover. Reported 50 tasks, got two new tasks from Beta. ID: 2026891 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 2026911 - Posted: 9 Jan 2020, 2:49:47 UTC - in response to Message 2026859. Beta is back, or at least there's life there. It's slowly coming to life, but don't expect that everything works yet, It's been down since Dec 20, so it will take time to recover. Curious as to what the issue turned out to be. Grant Darwin NT ID: 2026911 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 2026913 - Posted: 9 Jan 2020, 2:52:13 UTC Last modified: 9 Jan 2020, 2:54:14 UTC Ready-to-send buffer still not refilling, mostly due to the sustained very high return rate over the last few hours (165k+), and the splitters just aren't capable of the sustained output required to meet that demand, and fill the RTS buffer. Just as well the return rate isn't that little bit higher, or we'd be struggling to get work (more than we are - just had a period of "Project has no tasks available" or reporting 10 WUs and only getting 2 back). Grant Darwin NT ID: 2026913 ·
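The arithmetic behind the post above can be sketched roughly as follows. This is only an illustration, assuming that demand for new work tracks the return rate one-for-one and that the 165k+ figure is per hour; the buffer size and splitter output used below are hypothetical numbers, not values from the server status page.

```python
def hours_until_empty(buffer_tasks, created_per_hour, returned_per_hour):
    """Estimate how long the ready-to-send (RTS) buffer lasts.

    Assumption (not from the thread): every returned task triggers a
    request for one replacement task, so demand equals the return rate.
    Returns None when the splitters keep up and the buffer is not draining.
    """
    net_drain = returned_per_hour - created_per_hour
    if net_drain <= 0:
        return None
    return buffer_tasks / net_drain


# Hypothetical example: 30,000 tasks in the buffer, splitters producing
# 150,000 tasks/hour against a return rate of 165,000 tasks/hour.
print(hours_until_empty(30_000, 150_000, 165_000))
```

Even a modest shortfall in splitter output empties the buffer within a few hours at these rates, which is consistent with the "Project has no tasks available" replies described above.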

Wiggo Send message Joined: 24 Jan 00 Posts: 34770 Credit: 261,360,520 RAC: 489	Message 2026917 - Posted: 9 Jan 2020, 3:15:12 UTC If a way can be found to stop splitters from bunching up on 1 file (such as blc54_2bit_guppi_58692_62680_HIP24094_0035 ATM) then that would help a great deal. Cheers. ID: 2026917 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 2026926 - Posted: 9 Jan 2020, 3:48:26 UTC - in response to Message 2026917. If a way can be found to stop splitters from bunching up on 1 file (such as blc54_2bit_guppi_58692_62680_HIP24094_0035 ATM) then that would help a great deal. It's been a long, long time since the 1 file, 1 splitter days. Grant Darwin NT ID: 2026926 ·

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 2026957 - Posted: 9 Jan 2020, 10:51:24 UTC I guess this would be a Server Issue, though it may need to be moved to another thread if it gets out of hand. Bringing up a new (Ubuntu) client and, due to some hardware issues I was fighting, I aborted the initial 5 CPU tasks I was given so as to eliminate CPU load as a cause for the slow-downs I was seeing. Set the box to a venue that allows GPU only. So far, so good. Finally solved (I believe) the hardware issue, and decided it was time to get some CPU tasks going as well. Set the box to a different venue that allows both CPU and GPU work, but got only GPU work. If I recall correctly, CPU tasks won't be sent until the GPU cache is full or other limits kick in. Set the box to a third venue that allows only CPU work. Had the box updated to the new venue, and it correctly asks for only CPU work now. So far, so good. (And I do understand that it takes 2 updates for this to happen, 1 for the server to change the venue at the client, and the second for the client to do a request for the new venue.) However, when I hit the server with a request for CPU tasks, it responds by sending new work which is still classed as GPU work. Not a one-time thing, it's consistent after restarting the client, resetting the preferences venue flags several times and changing venues again as well. For example: (client ID: 8881207)
977 SETI@home 01/09/20 3:24:20 AM Sending scheduler request: To report completed tasks.
978 SETI@home 01/09/20 3:24:20 AM Reporting 2 completed tasks
979 SETI@home 01/09/20 3:24:20 AM Requesting new tasks for CPU
980 SETI@home 01/09/20 3:24:21 AM Scheduler request completed: got 3 new tasks
981 SETI@home 01/09/20 3:24:24 AM Started download of blc75_2bit_guppi_58693_06659_HIP100511_0136.26291.0.22.45.111.vlar
982 SETI@home 01/09/20 3:24:24 AM Started download of blc64_2bit_guppi_58693_08596_HIP98819_0142.23980.409.22.45.58.vlar
983 SETI@home 01/09/20 3:24:24 AM Started download of blc64_2bit_guppi_58693_07938_HIP100511_0140.25367.409.22.45.71.vlar
984 SETI@home 01/09/20 3:24:30 AM Finished download of blc64_2bit_guppi_58693_08596_HIP98819_0142.23980.409.22.45.58.vlar
985 SETI@home 01/09/20 3:24:31 AM Finished download of blc75_2bit_guppi_58693_06659_HIP100511_0136.26291.0.22.45.111.vlar
986 SETI@home 01/09/20 3:24:31 AM Finished download of blc64_2bit_guppi_58693_07938_HIP100511_0140.25367.409.22.45.71.vlar
I look in the project directory, the files are present, but no CPU work is processing. Finding the task names via BOINCTasks shows them as CUDA90 tasks waiting to run. I get that I might be in the penalty box for CPU work as 100% (5) of the CPU tasks I had I aborted, but I can't see how client configuration could cause the scheduler to send me GPU work telling me it's CPU work. I verified the venue selections were correct by watching for logged scheduler requests as above, which correctly reflected my venue settings after updates reset them at the client. Just can't think of anything I might do that would cause the Scheduling server to do this. Am I missing something? ID: 2026957 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2026958 - Posted: 9 Jan 2020, 10:57:13 UTC - in response to Message 2026957. Looking at that computer on the website - All tasks for computer 8881207 - you currently have at least three tasks on the front page assigned to the CPU. That looks good, but I'm not looking through all 504 tasks in progress... ID: 2026958 ·

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 2026960 - Posted: 9 Jan 2020, 11:09:01 UTC - in response to Message 2026958. Looking at that computer on the website - All tasks for computer 8881207 - you currently have at least three tasks on the front page assigned to the CPU. That looks good, but I'm not looking through all 504 tasks in progress... Yeah, naturally the moment that message posted I got 10 supposed CPU tasks that were GPU tasks, and the next session I got 4 that were for real. It looks to me like perhaps regardless of the venue it was going to fill the cache of GPU before releasing any more CPU. Just seemed strange that it would send that way ignoring the venue. Strange interaction, but thanks for taking a peek. ID: 2026960 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2026965 - Posted: 9 Jan 2020, 12:45:24 UTC - in response to Message 2026960. ... I got 10 supposed CPU tasks that were GPU tasks ... What's your evidence for that? You can't tell from the download file names - all tasks are identical until they are assigned to a particular class of device, by the server on allocation. <sched_op_debug> in the Event Log gives a useful summary:
09/01/2020 12:09:04 | SETI@home | Sending scheduler request: To report completed tasks.
09/01/2020 12:09:04 | SETI@home | Reporting 14 completed tasks
09/01/2020 12:09:04 | SETI@home | Requesting new tasks for NVIDIA GPU
09/01/2020 12:09:04 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
09/01/2020 12:09:04 | SETI@home | [sched_op] NVIDIA GPU work request: 8165.73 seconds; 0.00 devices
09/01/2020 12:09:04 | SETI@home | [sched_op] Intel GPU work request: 0.00 seconds; 0.00 devices
09/01/2020 12:09:07 | SETI@home | Scheduler request completed: got 15 new tasks
09/01/2020 12:09:07 | SETI@home | [sched_op] estimated total CPU task duration: 0 seconds
09/01/2020 12:09:07 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 8702 seconds
09/01/2020 12:09:07 | SETI@home | [sched_op] estimated total Intel GPU task duration: 0 seconds
That's a proper 'GPU only' transaction. ID: 2026965 ·
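For anyone who wants to check a longer Event Log for the same pattern, here is a minimal sketch that pulls the per-resource work requests out of [sched_op] lines like the ones quoted above. The function names and the regular expression are illustrative assumptions, not part of BOINC; the sketch only assumes the "<resource> work request: <n> seconds" wording shown in that log and one message per line.

```python
import re

# Matches e.g. "[sched_op] NVIDIA GPU work request: 8165.73 seconds; 0.00 devices".
# The pattern is an assumption based on the log lines quoted in the thread.
REQUEST_RE = re.compile(
    r"\[sched_op\]\s+(?P<resource>.+?) work request:\s+(?P<seconds>[\d.]+) seconds"
)


def requested_seconds(log_text):
    """Return {resource name: seconds of work requested} for one log excerpt."""
    return {
        m.group("resource"): float(m.group("seconds"))
        for m in REQUEST_RE.finditer(log_text)
    }


def requested_resources(log_text):
    """Return the sorted resources for which a non-zero amount of work was requested."""
    return sorted(r for r, s in requested_seconds(log_text).items() if s > 0)


SAMPLE_LOG = """\
09/01/2020 12:09:04 | SETI@home | Requesting new tasks for NVIDIA GPU
09/01/2020 12:09:04 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
09/01/2020 12:09:04 | SETI@home | [sched_op] NVIDIA GPU work request: 8165.73 seconds; 0.00 devices
09/01/2020 12:09:04 | SETI@home | [sched_op] Intel GPU work request: 0.00 seconds; 0.00 devices
09/01/2020 12:09:07 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 8702 seconds
"""

print(requested_resources(SAMPLE_LOG))
```

A request that lists only one resource with a non-zero figure is the "GPU only" (or "CPU only") transaction described above; the "estimated total ... task duration" lines are deliberately ignored by this sketch.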

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 2026990 - Posted: 9 Jan 2020, 18:15:10 UTC - in response to Message 2026960. Last modified: 9 Jan 2020, 18:16:09 UTC It looks to me like perhaps regardless of the venue it was going to fill the cache of GPU before releasing any more CPU. Just seemed strange that it would send that way ignoring the venue. Strange interaction, but thanks for taking a peek. It will always fill the GPU cache before it starts on the CPU (unless you have a really fast CPU and a really, really slow GPU). You just need to make sure to set the cache settings small enough that it doesn't take long to fill the GPU, to get it started on the CPU. Once the settings are changed back, it will continue to fill the new GPU cache, then start filling the CPU cache again. Grant Darwin NT ID: 2026990 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 2026992 - Posted: 9 Jan 2020, 18:20:03 UTC Oh, and the splitter output has fallen off to not much more than nothing and we should be out of work in the next 30min or so if they don't come back to life. Grant Darwin NT ID: 2026992 ·

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 2027000 - Posted: 9 Jan 2020, 19:45:17 UTC - in response to Message 2026965. ... I got 10 supposed CPU tasks that were GPU tasks ... What's your evidence for that? You can't tell from the download file names - all tasks are identical until they are assigned to a particular class of device, by the server on allocation. I understand that the filename indicates nothing. Not going from that. The evidence was: 1) requested CPU work, not GPU work, as indicated by log, because 2) venue flag excluded GPU work by not being selected, and 3) tasks were received, and their presence verified in project directory, yet 4) listing shows tasks as GPU jobs in a) boincmgr and b) boinctasks. Agreed that turning the debug on would have been useful, but I have to assume that the fact that CPU tasks were the only ones requested is correct in the log. Not a big thing, but my only point in noting this was that the work type flag was (apparently) overridden by other considerations in the scheduler, which is strange. ID: 2027000 ·

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 2027001 - Posted: 9 Jan 2020, 19:49:30 UTC - in response to Message 2026990. It looks to me like perhaps regardless of the venue it was going to fill the cache of GPU before releasing any more CPU. Just seemed strange that it would send that way ignoring the venue. Strange interaction, but thanks for taking a peek. It will always fill the GPU cache before it starts on the CPU (unless you have a really fast CPU and a really, really slow GPU). You just need to make sure to set the cache settings small enough that it doesn't take long to fill the GPU, to get it started on the CPU. Once the settings are changed back, it will continue to fill the new GPU cache, then start filling the CPU cache again. And I did note my awareness of that in the original message. But the point of the message was that it ignored my exclusion of GPU tasks via venue flag, and responded to a request for CPU work by sending tasks for GPU. FWIW, my cache settings are .33 days. ID: 2027001 ·

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 2027020 - Posted: 9 Jan 2020, 22:30:46 UTC - in response to Message 2026965. ... I got 10 supposed CPU tasks that were GPU tasks ... What's your evidence for that? You can't tell from the download file names - all tasks are identical until they are assigned to a particular class of device, by the server on allocation. (client ID: 8881207)
977 SETI@home 01/09/20 3:24:20 AM Sending scheduler request: To report completed tasks.
978 SETI@home 01/09/20 3:24:20 AM Reporting 2 completed tasks
979 SETI@home 01/09/20 3:24:20 AM Requesting new tasks for CPU
980 SETI@home 01/09/20 3:24:21 AM Scheduler request completed: got 3 new tasks
981 SETI@home 01/09/20 3:24:24 AM Started download of blc75_2bit_guppi_58693_06659_HIP100511_0136.26291.0.22.45.111.vlar
https://setiathome.berkeley.edu/workunit.php?wuid=3831002783
982 SETI@home 01/09/20 3:24:24 AM Started download of blc64_2bit_guppi_58693_08596_HIP98819_0142.23980.409.22.45.58.vlar
https://setiathome.berkeley.edu/workunit.php?wuid=3831002683
983 SETI@home 01/09/20 3:24:24 AM Started download of blc64_2bit_guppi_58693_07938_HIP100511_0140.25367.409.22.45.71.vlar
https://setiathome.berkeley.edu/workunit.php?wuid=3831002789
Seems to me this pretty well proves the point ... please note that the WU assignment clearly shows them assigned to GPU. ID: 2027020 ·
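The cross-check described above (what the client asked for versus which files arrived) can be collected from an Event Log excerpt with something like the sketch below. The helper name is hypothetical, and it assumes one log message per line with the wording shown in the excerpt. It does not determine the device class a task was assigned to; as noted earlier in the thread, that still has to be looked up on the website or in the task list.

```python
import re


def summarize_request(log_text):
    """Collect what was requested and what was downloaded from one log excerpt.

    Assumes one Event Log message per line, worded as in the thread.
    """
    requested = re.findall(r"Requesting new tasks for (.+)$", log_text, flags=re.MULTILINE)
    granted = re.findall(r"got (\d+) new tasks", log_text)
    downloads = re.findall(r"Started download of (\S+)", log_text)
    return {
        "requested": [r.strip() for r in requested],
        "granted": sum(int(n) for n in granted),
        "downloads": downloads,
    }


SAMPLE_LOG = """\
979 SETI@home 01/09/20 3:24:20 AM Requesting new tasks for CPU
980 SETI@home 01/09/20 3:24:21 AM Scheduler request completed: got 3 new tasks
981 SETI@home 01/09/20 3:24:24 AM Started download of blc75_2bit_guppi_58693_06659_HIP100511_0136.26291.0.22.45.111.vlar
982 SETI@home 01/09/20 3:24:24 AM Started download of blc64_2bit_guppi_58693_08596_HIP98819_0142.23980.409.22.45.58.vlar
983 SETI@home 01/09/20 3:24:24 AM Started download of blc64_2bit_guppi_58693_07938_HIP100511_0140.25367.409.22.45.71.vlar
"""

summary = summarize_request(SAMPLE_LOG)
print(summary["requested"], summary["granted"], len(summary["downloads"]))
```

The downloaded file names from the summary are what you would then search for on the task pages to see which device class the server recorded.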

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2027023 - Posted: 9 Jan 2020, 22:40:54 UTC - in response to Message 2027020. Thanks - that's why I asked the question. Seems like there is something screwy with the scheduler, and we need hard evidence like this to track it down. I spent some time exchanging emails with Eric this evening, using Beta to confirm the 'Anonymous Platform' error that plagued us before Christmas. He started Beta, I said 'nada'. He said it had fallen over, and restarted it again. I crashed it. He said 'damn', but we got a log out of it. ID: 2027023 ·

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 2027026 - Posted: 9 Jan 2020, 23:09:29 UTC - in response to Message 2027023. Last modified: 9 Jan 2020, 23:10:18 UTC Thanks - that's why I asked the question. Seems like there is something screwy with the scheduler, and we need hard evidence like this to track it down. I spent some time exchanging emails with Eric this evening, using Beta to confirm the 'Anonymous Platform' error that plagued us before Christmas. He started Beta, I said 'nada'. He said it had fallen over, and restarted it again. I crashed it. He said 'damn', but we got a log out of it. Yup, it's all good. Took me a bit to realize all I had to do was search for the tasks on the site and correlate them. I learned that if you ask the site to display tasks by name it auto-sorts the results. Makes it pretty quick to find them. Hard to say if it's worth spending much effort to fix this, but, on the other hand it's often true that what causes this might be causing other things. At least it's another piece of info that might be useful in future. And each time I learn a bit more, which is the main reason I'm here, after all, besides the science of it. Have to keep the brain cells engaged, else they rust. If you ever need someone to help bang on stuff or help prove a case, just ask. If it's within my skill set I'm always willing. Appreciate your taking time to help dig it out! ID: 2027026 ·

