The Server Issues / Outages Thread - Panic Mode On! (118)
Author | Message |
---|---|
Grant (SSSF) Joined: 19 Aug 99 Posts: 13947 Credit: 208,696,464 RAC: 304 |
The ready-to-send buffer is still not refilling, mostly due to the sustained very high return rate over the last few hours (165k+); the splitters just aren't capable of the sustained output required to meet that demand and fill the RTS buffer. Just as well the return rate isn't that little bit higher, or we'd be struggling to get work (more than we are: I just had a period of "Project has no tasks available", or reporting 10 WUs and only getting 2 back). Grant Darwin NT |
Joined: 24 Jan 00 Posts: 38119 Credit: 261,360,520 RAC: 489 |
If a way can be found to stop splitters from bunching up on 1 file (such as blc54_2bit_guppi_58692_62680_HIP24094_0035 ATM) then that would help a great deal. Cheers. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13947 Credit: 208,696,464 RAC: 304 |
"If a way can be found to stop splitters from bunching up on 1 file (such as blc54_2bit_guppi_58692_62680_HIP24094_0035 ATM) then that would help a great deal." It's been a long, long time since the 1 file, 1 splitter days. Grant Darwin NT |
Joined: 1 Apr 13 Posts: 1859 Credit: 268,616,081 RAC: 1,349 |
I guess this would be a Server Issue, though it may need to be moved to another thread if it gets out of hand. I'm bringing up a new (Ubuntu) client and, due to some hardware issues I was fighting, I aborted the initial 5 CPU tasks I was given so as to eliminate CPU load as a cause of the slow-downs I was seeing. Set the box to a venue that allows GPU only. So far, so good. Finally solved (I believe) the hardware issue, and decided it was time to get some CPU tasks going as well. Set the box to a different venue that allows both CPU and GPU work, but got only GPU work. If I recall correctly, CPU tasks won't be sent until the GPU cache is full or other limits kick in. Set the box to a third venue that allows only CPU work. Had the box updated to the new venue, and it correctly asks for only CPU work now. So far, so good. (And I do understand that it takes 2 updates for this to happen: 1 for the server to change the venue at the client, and the second for the client to do a request for the new venue.) However, when I hit the server with a request for CPU tasks, it responds by sending new work which is still classed as GPU work. Not a one-time thing; it's consistent after restarting the client, resetting the preferences venue flags several times and changing venues again as well. For example (client ID: 8881207): 977 SETI@home 01/09/20 3:24:20 AM Sending scheduler request: To report completed tasks. 978 SETI@home 01/09/20 3:24:20 AM Reporting 2 completed tasks 979 SETI@home 01/09/20 3:24:20 AM Requesting new tasks for CPU 980 SETI@home 01/09/20 3:24:21 AM Scheduler request completed: got 3 new tasks 981 SETI@home 01/09/20 3:24:24 AM Started download of blc75_2bit_guppi_58693_06659_HIP100511_0136.26291.0.22.45.111.vlar 982 SETI@home 01/09/20 3:24:24 AM Started download of blc64_2bit_guppi_58693_08596_HIP98819_0142.23980.409.22.45.58.vlar 983 SETI@home 01/09/20 3:24:24 AM Started download of blc64_2bit_guppi_58693_07938_HIP100511_0140.25367.409.22.45.71.vlar 984 SETI@home 01/09/20 3:24:30 AM Finished download of blc64_2bit_guppi_58693_08596_HIP98819_0142.23980.409.22.45.58.vlar 985 SETI@home 01/09/20 3:24:31 AM Finished download of blc75_2bit_guppi_58693_06659_HIP100511_0136.26291.0.22.45.111.vlar 986 SETI@home 01/09/20 3:24:31 AM Finished download of blc64_2bit_guppi_58693_07938_HIP100511_0140.25367.409.22.45.71.vlar I look in the project directory and the files are present, but no CPU work is processing. Finding the task names via BOINCTasks shows them as CUDA90 tasks waiting to run. I get that I might be in the penalty box for CPU work, as I aborted 100% (5) of the CPU tasks I had, but I can't see how client configuration could cause the scheduler to send me GPU work while telling me it's CPU work. I verified the venue selections were correct by watching for logged scheduler requests as above, which correctly reflected my venue settings after updates reset them at the client. Just can't think of anything I might do that would cause the Scheduling server to do this. Am I missing something? |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874 |
Looking at that computer on the website - All tasks for computer 8881207 - you currently have at least three tasks on the front page assigned to the CPU. That looks good, but I'm not looking through all 504 tasks in progress... |
Joined: 1 Apr 13 Posts: 1859 Credit: 268,616,081 RAC: 1,349 |
"Looking at that computer on the website - All tasks for computer 8881207 - you currently have at least three tasks on the front page assigned to the CPU. That looks good, but I'm not looking through all 504 tasks in progress..." Yeah, naturally the moment that message posted I got 10 supposed CPU tasks that were GPU tasks, and the next session I got 4 that were for real. It looks to me like perhaps regardless of the venue it was going to fill the cache of GPU before releasing any more CPU. Just seemed strange that it would send that way ignoring the venue. Strange interaction, but thanks for taking a peek. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874 |
"... I got 10 supposed CPU tasks that were GPU tasks ..." What's your evidence for that? You can't tell from the download file names - all tasks are identical until they are assigned to a particular class of device, by the server on allocation. <sched_op_debug> in the Event Log gives a useful summary: 09/01/2020 12:09:04 \| SETI@home \| Sending scheduler request: To report completed tasks. 09/01/2020 12:09:04 \| SETI@home \| Reporting 14 completed tasks 09/01/2020 12:09:04 \| SETI@home \| Requesting new tasks for NVIDIA GPU 09/01/2020 12:09:04 \| SETI@home \| [sched_op] CPU work request: 0.00 seconds; 0.00 devices 09/01/2020 12:09:04 \| SETI@home \| [sched_op] NVIDIA GPU work request: 8165.73 seconds; 0.00 devices 09/01/2020 12:09:04 \| SETI@home \| [sched_op] Intel GPU work request: 0.00 seconds; 0.00 devices 09/01/2020 12:09:07 \| SETI@home \| Scheduler request completed: got 15 new tasks 09/01/2020 12:09:07 \| SETI@home \| [sched_op] estimated total CPU task duration: 0 seconds 09/01/2020 12:09:07 \| SETI@home \| [sched_op] estimated total NVIDIA GPU task duration: 8702 seconds 09/01/2020 12:09:07 \| SETI@home \| [sched_op] estimated total Intel GPU task duration: 0 seconds That's a proper 'GPU only' transaction. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13947 Credit: 208,696,464 RAC: 304 |
"It looks to me like perhaps regardless of the venue it was going to fill the cache of GPU before releasing any more CPU. Just seemed strange that it would send that way ignoring the venue. Strange interaction, but thanks for taking a peek." It will always fill the GPU cache before it starts on the CPU (unless you have a really fast CPU and a really, really slow GPU). You just need to make sure to set the cache settings small enough that it doesn't take long to fill the GPU, to get it started on the CPU. Once the settings are changed back, it will continue to fill the new GPU cache, then start filling the CPU cache again. Grant Darwin NT |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13947 Credit: 208,696,464 RAC: 304 |
Oh, and the splitter output has fallen off to not much more than nothing, and we should be out of work in the next 30 min or so if they don't come back to life. Grant Darwin NT |
Joined: 1 Apr 13 Posts: 1859 Credit: 268,616,081 RAC: 1,349 |
"'... I got 10 supposed CPU tasks that were GPU tasks ...' What's your evidence for that? You can't tell from the download file names - all tasks are identical until they are assigned to a particular class of device, by the server on allocation." I understand that the filename indicates nothing. Not going from that. The evidence was: 1) requested CPU work, not GPU work, as indicated by log, because 2) venue flag excluded GPU work by not being selected, and 3) tasks were received, and their presence verified in project directory, yet 4) listing shows tasks as GPU jobs in a) boincmgr and b) boinctasks. Agreed that turning the debug on would have been useful, but I have to assume that the log is correct in showing that CPU tasks were the only ones requested. Not a big thing, but my only point in noting this was that the work type flag was (apparently) overridden by other considerations in the scheduler, which is strange. |
Joined: 1 Apr 13 Posts: 1859 Credit: 268,616,081 RAC: 1,349 |
"'It looks to me like perhaps regardless of the venue it was going to fill the cache of GPU before releasing any more CPU. Just seemed strange that it would send that way ignoring the venue. Strange interaction, but thanks for taking a peek.' It will always fill the GPU cache before it starts on the CPU (unless you have a really fast CPU and a really, really slow GPU). You just need to make sure to set the cache settings small enough that it doesn't take long to fill the GPU, to get it started on the CPU. Once the settings are changed back, it will continue to fill the new GPU cache, then start filling the CPU cache again." And I did note my awareness of that in the original message. But the point of the message was that it ignored my exclusion of GPU tasks via venue flag, and responded to a request for CPU work by sending tasks for GPU. FWIW, my cache settings are 0.33 days. |
Joined: 1 Apr 13 Posts: 1859 Credit: 268,616,081 RAC: 1,349 |
"'... I got 10 supposed CPU tasks that were GPU tasks ...' What's your evidence for that? You can't tell from the download file names - all tasks are identical until they are assigned to a particular class of device, by the server on allocation." (client ID: 8881207) 977 SETI@home 01/09/20 3:24:20 AM Sending scheduler request: To report completed tasks. 978 SETI@home 01/09/20 3:24:20 AM Reporting 2 completed tasks 979 SETI@home 01/09/20 3:24:20 AM Requesting new tasks for CPU 980 SETI@home 01/09/20 3:24:21 AM Scheduler request completed: got 3 new tasks 981 SETI@home 01/09/20 3:24:24 AM Started download of blc75_2bit_guppi_58693_06659_HIP100511_0136.26291.0.22.45.111.vlar https://setiathome.berkeley.edu/workunit.php?wuid=3831002783 982 SETI@home 01/09/20 3:24:24 AM Started download of blc64_2bit_guppi_58693_08596_HIP98819_0142.23980.409.22.45.58.vlar https://setiathome.berkeley.edu/workunit.php?wuid=3831002683 983 SETI@home 01/09/20 3:24:24 AM Started download of blc64_2bit_guppi_58693_07938_HIP100511_0140.25367.409.22.45.71.vlar https://setiathome.berkeley.edu/workunit.php?wuid=3831002789 Seems to me this pretty well proves the point ... please note that the WU assignment clearly shows them assigned to GPU. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874 |
Thanks - that's why I asked the question. Seems like there is something screwy with the scheduler, and we need hard evidence like this to track it down. I spent some time exchanging emails with Eric this evening, using Beta to confirm the 'Anonymous Platform' error that plagued us before Christmas. He started Beta, I said 'nada'. He said it had fallen over, and restarted it again. I crashed it. He said 'damn', but we got a log out of it. |
Joined: 1 Apr 13 Posts: 1859 Credit: 268,616,081 RAC: 1,349 |
"Thanks - that's why I asked the question. Seems like there is something screwy with the scheduler, and we need hard evidence like this to track it down." Yup, it's all good. Took me a bit to realize all I had to do was search for the tasks on the site and correlate them. I learned that if you ask the site to display tasks by name it auto-sorts the results. Makes it pretty quick to find them. Hard to say if it's worth spending much effort to fix this but, on the other hand, it's often true that whatever causes this might be causing other things. At least it's another piece of info that might be useful in future. And each time I learn a bit more, which is the main reason I'm here, after all, besides the science of it. Have to keep the brain cells engaged, else they rust. If you ever need someone to help bang on stuff or help prove a case, just ask. If it's within my skill set I'm always willing. Appreciate your taking time to help dig it out! |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
These BLC 35s are bad news, most of them are Instant Overflows. The Results received in last hour is already up to 205,271 and I only have One machine running them so far. Once other machines start running them I believe we will be constantly Out of Work, https://setiathome.berkeley.edu/results.php?hostid=6796479&offset=1120 I seem to be getting a large number of stuck Uploads as well... |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
"These BLC 35s are bad news, most of them are Instant Overflows. The Results received in last hour is already up to 205,271 and I only have One machine running them so far. Once other machines start running them I believe we will be constantly Out of Work, https://setiathome.berkeley.edu/results.php?hostid=6796479&offset=1120" . . Yep. 90% or more of the Blc35 tasks are noise bombs. This is gonna add to the havoc ... Stephen :( |
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
I see they added some old Arecibo files to the splitters in an attempt to slow down the return rate from all the noise bombs. Seti@Home classic workunits: 20,676 CPU time: 74,226 hours A proud member of the OFA (Old Farts Association) |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Now the Server isn't responding. One machine's cache is down 50% and can't contact the Server. Naturally the Results Received and RTS are showing the change; it appears no one can contact the Server.... |
Joined: 28 Nov 02 Posts: 5126 Credit: 276,046,078 RAC: 462 |
Website is very unresponsive as of this moment. Tom A proud member of the OFA (Old Farts Association). |
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Yep, every host in scheduler backoff due to "internal server error". Can't report work and caches falling. Seti@Home classic workunits: 20,676 CPU time: 74,226 hours A proud member of the OFA (Old Farts Association) |
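
The CPU-versus-GPU question in the thread comes down to comparing what the client requested with what the scheduler sent. As a rough illustration only, here is a minimal Python sketch that reads `[sched_op]` Event Log lines in the format Richard quoted and reports any device that received work without asking for it. The function names (`summarize_sched_op`, `unrequested_work`) and the regular expressions are assumptions made for this example, not part of BOINC, and other client versions may word these lines differently.

```python
import re

# Patterns modelled on the [sched_op] lines quoted in the thread.
# Assumption: the client writes these lines in exactly this wording.
REQUEST_RE = re.compile(r"\[sched_op\] (.+?) work request: ([\d.]+) seconds")
RECEIVED_RE = re.compile(
    r"\[sched_op\] estimated total (.+?) task duration: ([\d.]+) seconds"
)


def summarize_sched_op(lines):
    """Return {device: {"requested": secs, "received": secs}} for one exchange."""
    summary = {}
    for line in lines:
        for pattern, key in ((REQUEST_RE, "requested"), (RECEIVED_RE, "received")):
            match = pattern.search(line)
            if match:
                device = match.group(1)
                entry = summary.setdefault(
                    device, {"requested": 0.0, "received": 0.0}
                )
                entry[key] = float(match.group(2))
    return summary


def unrequested_work(summary):
    """List devices that received work although zero seconds were requested."""
    return sorted(
        device
        for device, entry in summary.items()
        if entry["requested"] == 0 and entry["received"] > 0
    )


# Lines copied from the Event Log excerpt quoted in the thread.
SAMPLE_LOG = [
    "09/01/2020 12:09:04 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices",
    "09/01/2020 12:09:04 | SETI@home | [sched_op] NVIDIA GPU work request: 8165.73 seconds; 0.00 devices",
    "09/01/2020 12:09:04 | SETI@home | [sched_op] Intel GPU work request: 0.00 seconds; 0.00 devices",
    "09/01/2020 12:09:07 | SETI@home | [sched_op] estimated total CPU task duration: 0 seconds",
    "09/01/2020 12:09:07 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 8702 seconds",
    "09/01/2020 12:09:07 | SETI@home | [sched_op] estimated total Intel GPU task duration: 0 seconds",
]

if __name__ == "__main__":
    result = summarize_sched_op(SAMPLE_LOG)
    for device, entry in result.items():
        print(device, entry)
    print("Unrequested work on:", unrequested_work(result))
```

For the 'GPU only' exchange quoted above, no device is flagged. A CPU-only request answered with GPU tasks, as reported for host 8881207, would be expected to flag the GPU, provided the debug flag was on when the request was made.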
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.