Panic Mode On (111) Server Problems?
Author | Message |
---|---|
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
"Probably someone collected those" Hmmm, are you saying those messages are bouncing around inside the server creating problems? I've never received any of those messages, and since I run Anonymous platform I really don't have any need to change those Preferences, as I just list the apps I'm using in the app_info. I'm running 2 CPU tasks and 3 nVidia GPUs on that machine; it asks for GPU tasks much more often than CPU tasks. It's also back to not being sent much work in the last hour or so. It has around 50 CPU tasks onboard, meaning right now it's down around 70 GPU tasks. It does receive some tasks every so often, but not enough to replace the completed tasks: "Thu Apr 5 14:37:54 2018 | SETI@home | [sched_op] Starting scheduler request". 89 tasks only last for so long. Oh look, the server woke again: "Thu Apr 5 16:01:11 2018 | SETI@home | [sched_op] Starting scheduler request". So... how does it figure 76 tasks will last 332 minutes? |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
OK, had much better luck this time. This has something to do with defining g_wreq->max_jobs_exceeded(): if (config.max_wus_to_send) { g_wreq->max_jobs_per_rpc = mult * config.max_wus_to_send; } else { g_wreq->max_jobs_per_rpc = 999999; g_reply->set_delay(DELAY_NO_WORK_CACHE); } if (g_wreq->max_jobs_exceeded()) { sprintf(buf, "This computer has reached a limit on tasks in progress"); (Last indexed on Jul 30, 2017, whatever "indexed" means) and: bool max_jobs_exceeded() { if (max_jobs_on_host_exceeded) return true; for (int i=0; i<NPROC_TYPES; i++) { extern WORK_REQ* g_wreq; extern double capped_host_fpops(); static inline void add_no_work_message(const char* m) { g_wreq->add_no_work_message(m); (Last indexed on Feb 7.) So max_jobs_per_rpc can only be as high as 999999 per request. Now I have to figure out what config.max_wus_to_send is defined as. And what does multiplying by mult do to that variable? And extern double capped_host_fpops() looks interesting too. Good ol' fpops comes into play again. [Edit] OK, it has to do with determining how many tasks you get based on how many gpus are on the host: if (n > MAX_GPUS) n = MAX_GPUS; ninstances[proc_type] = n; effective_ngpus += n; } int mult = effective_ncpus + config.gpu_multiplier * effective_ngpus; if (config.max_wus_to_send) { g_wreq->max_jobs_per_rpc = mult * config.max_wus_to_send; } else { g_wreq->max_jobs_per_rpc = 999999; It would seem that the number of tasks per cpu is defined somewhere else. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
That's a VERY GOOD question. I have always thought the fpops_est was screwed up and didn't reflect the true computing power of gpus, even less so for the special app. The APR for the gpu tasks done on a special app host doesn't seem to be that wrong. So how does the scheduler mess up the estimated gpu task completion time so badly? |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Look on the tasks tab in BOINC Manager. Each task has a "Remaining (estimated)" runtime. I'm guessing most of them are around 00:04:22. "estimated total NVIDIA GPU task duration: 19947 seconds. So... how does it figure 76 tasks will last 332 minutes?" That's usually a pretty good estimate, if all your cards run at the same speed. The server keeps track of your performance, and tweaks the figures so the estimate is realistic. If you run stock apps, the server monitors and adjusts speed (APR). If you run Anonymous Platform, the server takes your word for the speed, and tweaks the size of the task instead. Both routes end up in the same place. |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
"Look on the tasks tab in BOINC Manager. Each task has a 'Remaining (estimated)' runtime. I'm guessing most of them are around 00:04:22. estimated total NVIDIA GPU task duration: 19947 seconds. So... how does it figure 76 tasks will last 332 minutes?" I don't understand, Richard. What do you mean, the server "takes your word for the speed"? I don't know how we alter or affect the calculated APR other than through what the server calculates for us. I don't think any of us are messing with the fpops_est value in the client_state. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Bedtime approaches, and the board is slow - it may take me until tomorrow to re-locate that code. But: Speed: in the stock case, APR in GigaFlop/sec; in the anonymous platform case, CPU benchmark for CPU tasks, peak flops * fiddle factor for GPUs (the fiddle factor might be 1/20th). Task size: in the stock case, workunit <rsc_fpops_est>, raw, from the splitters; in the anonymous platform case, <rsc_fpops_est>, tweaked by the inverse of the ratio of speed (as above - you're following me?) to APR. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
If you look at the posted logs you can see it's reporting 5 to 9 completed tasks every 5 minutes. 5 x 12 = 60 tasks in an hour. I just received a load of shorties estimated to take 85 seconds apiece. 85 seconds. 76 won't last long. The longest estimate on the tasks page is 3:47 and the shortest is 1:25; that's how you complete well over 1000 tasks a day. Then there are all those that finish in about 5 seconds. We need more tasks. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
That depends whether the project exists to provide kibble to you, or whether you exist to do science for the project. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
I'm running Low end hardware. I'll let what you posted sink in to the people running High end hardware. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
"So how does the scheduler mess up the estimated gpu task completion time so badly?" Easy. Just take the longest estimate and consider ONE device. That would be a little closer to reality. |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
I just looked at the estimated time for completion for all my tasks on the Intel machine: 52 minutes for cpu tasks, 43 seconds for shorties, and 1 minute 24 seconds to 1 minute 54 seconds for VLARs. The Ryzen machines do cpu tasks in 28-45 minutes. |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
"Now both Linux crunchers are back to being down 100 tasks from full again like last night. Only 1 in 5 task requests get any work and then only 1 or 2 tasks. The rest of the time I get the 'you've reached the limit of tasks in progress' message." . . And those pesky Blc01 tapes seem to still be stuck in the splitters ... Stephen ?? |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
"Seems the messages have changed. Along with more of the Reached a Limit, I'm now just being told Nothing was sent;" . . Getting that here as well. Stephen :( |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
"All of these look new... I wonder with the recent long out" . . Hey there Mr Kevvy, . . Long time no hear ... . . That sounds very plausible to me ... Stephen :( |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
"I'm reading, but I don't have an explanation." . . I have observed just that sort of behaviour a lot lately. When getting new work the CPU queue will be completely refilled but the GPU queue will be shortchanged. Even the other way around on a rare occasion. So maybe it is contextual, and whichever queue gets work allocated first triggers the "enough" signal regardless of the status of the second queue. That to me would be an error in the procedure, OR a deliberate change to the code to prevent bunkering, as was suggested earlier. Stephen :( |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
"I just looked at the estimated time for completion for all my tasks on the Intel machine." The question was: "Thu Apr 5 16:01:20 2018 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 19947 seconds". The answer is that it took the longest estimate of around 4 minutes, multiplied that by 76, and came up with a time close to 304 minutes. BUT, that's for just ONE GPU... the machine has 3 GPUs. Therefore, the estimate is immediately off by a factor of 3, and then there is the problem of tasks taking much less than the high estimate. By the time all is corrected, the 76 tasks will probably take about 76 minutes using 3 GPUs, with some tasks taking only 80 seconds to complete. Apparently the estimate completely ignores the number of devices the machine has. |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
That snippet of code I posted is supposed to calculate the number of seconds of work based on the number of gpus in the host. From your calculation, that part of the code evidently is not working, as I agree the estimate is for only ONE device, not for three gpus. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13848 Credit: 208,696,464 RAC: 304 |
So, apart from the random work allocation rearing its ugly head again, the database re-organisation seems to be helping. Things were a bit messy after the outage, but the splitters (in spite of some slowdowns after a very good start) have been able to fill the Ready-to-send buffer, and keep it filled for over half a day. It's been a while since that has been the case. The Results & WUs Awaiting-purge have both reached & generally settled around their more normal levels. WUs Awaiting-deletion, while not back to (effectively) zero like they used to be, are at least close enough to it, and not heading for yet another record high. Now we just need another bunch of short WUs so we can hammer the servers with 145,000/hour again to see just how well they can hold up. If we can get the Scheduler to reliably allocate work when a host hasn't reached its cache or server-side limits, we might actually be ready to cope with more crunchers. Grant Darwin NT |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13848 Credit: 208,696,464 RAC: 304 |
"That snippet of code I posted is supposed to calculate the number of seconds of work based on the number of gpus in the host." Or it's working the way it's meant to; wasn't it a glitch in the code that allows each GPU to get 100 WU, instead of like the CPU, where it's a limit of 100 regardless of the number of cores/threads? Grant Darwin NT |
rob smith Joined: 7 Mar 03 Posts: 22508 Credit: 416,307,556 RAC: 380 |
Since the routine is working "correctly" on two of my four crunchers, and "incorrectly" on the other two I would suggest there is something amiss in the communication between the cruncher and the calculation. It is worth noting that the two that are "incorrect" are my top two.... Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |