Panic Mode On (111) Server Problems?

Sirius B Volunteer tester Send message Joined: 26 Dec 00 Posts: 24920 Credit: 3,081,182 RAC: 7	Message 1928521 - Posted: 7 Apr 2018, 1:52:50 UTC - in response to Message 1928520. Good point, but I want to see if it works with more than 50 at a time :-) After that hit, I'll keep a close eye on the log for when the scheduler has issues. ID: 1928521 ·

Wiggo Send message Joined: 24 Jan 00 Posts: 36949 Credit: 261,360,520 RAC: 489	Message 1928524 - Posted: 7 Apr 2018, 1:59:05 UTC - in response to Message 1928521. Good point, but I want to see if it works with more than 50 at a time :-) After that hit, I'll keep a close eye on the log for when the scheduler has issues. I doubt that you'll see much difference there Sirius as you're only doing CPU work and not requesting constant GPU work. ;-) But then again I don't suffer those same problems here, but when I do have a problem everyone has a problem. Cheers. ID: 1928524 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1928528 - Posted: 7 Apr 2018, 3:12:35 UTC - in response to Message 1928524. Good point, but I want to see if it works with more than 50 at a time :-) After that hit, I'll keep a close eye on the log for when the scheduler has issues. I doubt that you'll see much difference there Sirius as you're only doing CPU work and not requesting constant GPU work. ;-) But then again I don't suffer those same problems here, but when I do have a problem everyone has a problem. Cheers. Yes, if Wiggo is having issues . . . . he is the proverbial canary in the coalmine. He runs a version 6 client and if he is having issues . . . . EVERYONE is having issues. Ever since the last of the Arecibo tasks cleared out from the RTS buffer, I have kept all machines at full caches. So maybe the problem is that the scheduler is having issues differentiating cpu or gpu work requests when the mix in the buffer contains Arecibo VLAR's. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1928528 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1928532 - Posted: 7 Apr 2018, 3:28:14 UTC - in response to Message 1928528. But then again I don't suffer those same problems here, but when I do have a problem everyone has a problem. Cheers. Yes, if Wiggo is having issues . . . . he is the proverbial canary in the coalmine. He runs a version 6 client and if he is having issues . . . . EVERYONE is having issues. Ever since the last of the Arecibo tasks cleared out from the RTS buffer, I have kept all machines at full caches. So maybe the problem is that the scheduler is having issues differentiating cpu or gpu work requests when the mix in the buffer contains Arecibo VLAR's. . . LOL I love it ... I have a picture in my head of Wiggo in a canary suit sitting on a perch. Sadly though in that analogy the canary would be Grant or maybe even yourself, being among the first hosts to experience difficulties when there is a problem. . . But I still love that image :) . . Yes, whatever the key to the problem turns out to be, I think it is clearly related to the point in the software where distinctions are made in task selection, particularly when there are Arecibo VLARs in the mix. Stephen :) ID: 1928532 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1928534 - Posted: 7 Apr 2018, 4:29:31 UTC - in response to Message 1928532. The scheduler never used to be this sensitive. Something in the database reconfiguration brought this to the fore and made it more obvious and obnoxious. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1928534 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14680 Credit: 200,643,578 RAC: 874	Message 1928555 - Posted: 7 Apr 2018, 6:57:37 UTC

Did my morning fetch while you lot were all asleep and I had the server to myself... ;-)

07/04/2018 07:17:50 | SETI@home | Sending scheduler request: To fetch work.
07/04/2018 07:17:53 | SETI@home | Scheduler request completed: got 41 new tasks
07/04/2018 07:23:00 | SETI@home | Sending scheduler request: To fetch work.
07/04/2018 07:23:00 | SETI@home | Reporting 1 completed tasks
07/04/2018 07:23:04 | SETI@home | Scheduler request completed: got 1 new tasks
07/04/2018 07:19:01 | SETI@home | Sending scheduler request: To fetch work.
07/04/2018 07:19:04 | SETI@home | Scheduler request completed: got 39 new tasks
07/04/2018 07:20:11 | SETI@home | Sending scheduler request: To fetch work.
07/04/2018 07:20:14 | SETI@home | Scheduler request completed: got 37 new tasks
07/04/2018 07:25:22 | SETI@home | Sending scheduler request: To fetch work.
07/04/2018 07:25:22 | SETI@home | Reporting 1 completed tasks
07/04/2018 07:25:24 | SETI@home | Scheduler request completed: got 1 new tasks

So a scheduler turnround of 3 seconds seems pretty consistent. If TBar is getting a 1 second turnround, that suggests that he's triggering some sort of early exit path in the scheduler code. OK, that's something to think about when the coffee kicks in.

ID: 1928555 ·
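
For anyone who wants to measure the turnround from their own Event log rather than eyeballing it, here is a minimal sketch. It assumes the fields are separated by plain `|` characters and that timestamps are in the day/month/year form shown in the post above; the function name `turnarounds` and the `LOG` sample are only illustrative, and logs in another date format would need a different format string.

```python
from datetime import datetime

# Sample lines copied from the post above ("|" assumed to be the field separator).
LOG = """\
07/04/2018 07:17:50 | SETI@home | Sending scheduler request: To fetch work.
07/04/2018 07:17:53 | SETI@home | Scheduler request completed: got 41 new tasks
07/04/2018 07:23:00 | SETI@home | Sending scheduler request: To fetch work.
07/04/2018 07:23:00 | SETI@home | Reporting 1 completed tasks
07/04/2018 07:23:04 | SETI@home | Scheduler request completed: got 1 new tasks
"""


def turnarounds(log_text):
    """Return the seconds between each request and the completion that follows it."""
    results = []
    started = None
    for line in log_text.splitlines():
        parts = [p.strip() for p in line.split("|", 2)]
        if len(parts) < 3:
            continue  # not a "time | project | message" line
        stamp = datetime.strptime(parts[0], "%d/%m/%Y %H:%M:%S")
        message = parts[2]
        if message.startswith("Sending scheduler request"):
            started = stamp
        elif message.startswith("Scheduler request completed") and started is not None:
            results.append(int((stamp - started).total_seconds()))
            started = None
    return results


print(turnarounds(LOG))
```

Event-log timestamps only have one-second resolution, so a reported "1 second" and "3 seconds" can each be off by nearly a second.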

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1928558 - Posted: 7 Apr 2018, 7:26:09 UTC I just had the Ryzen Linux box get 8 tasks stuck in download with 5 minutes on the counter trying and clocking up. Project backoff was 3 1/2 hours. Meanwhile uploads were going through with no issues every 20 seconds. Set http_transfer and http_debug and watched the host communicate with the scheduler. It knew it had transfers pending but couldn't seem to locate the files. After watching several go through their countdown timers, get no response and start a new increased timeout I tried a manual retry on a task and it eventually came down and that cleared the logjam on the others shortly. A different situation than the stuck downloads we've seen in the past. Something I'd not seen before. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1928558 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13859 Credit: 208,696,464 RAC: 304	Message 1928560 - Posted: 7 Apr 2018, 7:48:53 UTC - in response to Message 1928558. Last modified: 7 Apr 2018, 7:50:39 UTC I just had the Ryzen Linux box get 8 tasks stuck in download with 5 minutes on the counter trying and clocking up. Project backoff was 3 1/2 hours. Meanwhile uploads were going through with no issues every 20 seconds. No network issues at your end? Just had a quick look through my Event log, and no signs of download problems over the last 7 hours. Edit- just checked my Hosts file & still had the usually good download server address still there. Commented it out & will see how things go now. Grant Darwin NT ID: 1928560 ·

rob smith Volunteer moderator Volunteer tester Send message Joined: 7 Mar 03 Posts: 22551 Credit: 416,307,556 RAC: 380	Message 1928564 - Posted: 7 Apr 2018, 9:02:17 UTC It's the server's way of winding Keith up and telling him "time for bed" Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? ID: 1928564 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14680 Credit: 200,643,578 RAC: 874	Message 1928580 - Posted: 7 Apr 2018, 11:17:21 UTC Last modified: 7 Apr 2018, 11:42:42 UTC

g_wreq->max_jobs_exceeded()

Turns out to be easier than I expected. The process starts with https://github.com/BOINC/boinc/blob/master/sched/sched_send.cpp#L735:

    // return true if additional work is needed,
    // and there's disk space left,
    // and we haven't exceeded result per RPC limit,
    // and we haven't exceeded results per day limit
    //
    bool work_needed(bool locality_sched) {

Locality_sched applies to Einstein@Home only - we can ignore it (always false) for SETI. work_needed() starts with some very quick checks - enough disk, speed, memory, allowed tasks. Measure that in microseconds. Then (same file, line 779):

    // see if we've reached limits on in-progress jobs
    //
    bool some_type_allowed = false;
    for (int i=0; i<NPROC_TYPES; i++) {

we run a loop over each processor type - CPU, NVidia, ATI, intel_gpu. That's the order they were introduced into BOINC's capability, so I'll bet anyone a beer that's the order in which they're tested. The bit that bites us is line 796:

    if (proj_pref_exceeded || config.max_jobs_in_progress.exceeded(NULL, i)) {
        if (config.debug_quota) {
            log_messages.printf(MSG_NORMAL,
                "[quota] reached limit on %s jobs in progress\n",
                proc_type_name(i)
            );
            config.max_jobs_in_progress.print_log();
        }
        g_wreq->clear_req(i);
        g_wreq->max_jobs_on_host_proc_type_exceeded[i] = true;

Note that proc_type_name is printed to the server logs, although we don't see it in user messages. So, that segment leaves some_type_allowed as false, and triggers line 811 (edited):

    if (!some_type_allowed) {
        g_wreq->max_jobs_on_host_exceeded = true;
        return false;

which in turn triggers line 1635:

    if (!work_needed(false)) {
        goto done;
    }

which sends messages back to the user and effectively drops us out of the scheduler session.

ID: 1928580 ·
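
The control flow walked through above can be played with in a toy model. This is only a sketch of the quoted fragments, not the BOINC scheduler itself: the dictionaries, the `PROC_TYPES` list and the Python `work_needed` are hypothetical stand-ins, the processor order is the one guessed at in the post, and the branch that sets `some_type_allowed` to true is an assumption, since the post does not quote it.

```python
# Processor types, in the order the post above guesses they are tested.
PROC_TYPES = ["CPU", "NVIDIA GPU", "AMD/ATI GPU", "Intel GPU"]


def work_needed(in_progress, limits, requested):
    """Toy model of the per-type in-progress check described above.

    in_progress, limits: dict of processor type -> number of jobs.
    requested: dict of processor type -> seconds of work asked for (mutated).
    Returns False when no processor type is left under its limit,
    which is the early exit the post describes.
    """
    some_type_allowed = False
    for proc_type in PROC_TYPES:
        if proc_type not in limits:
            continue  # sketch-only assumption: no limit entry means type not in use
        if in_progress.get(proc_type, 0) >= limits[proc_type]:
            requested[proc_type] = 0.0  # stands in for g_wreq->clear_req(i)
        else:
            some_type_allowed = True  # assumed branch, not quoted in the post
    return some_type_allowed


# Illustrative numbers only: CPU cache full, GPU cache not.
request = {"CPU": 3600.0, "NVIDIA GPU": 7200.0}
limits = {"CPU": 100, "NVIDIA GPU": 100}
print(work_needed({"CPU": 100, "NVIDIA GPU": 40}, limits, request), request)
print(work_needed({"CPU": 100, "NVIDIA GPU": 100}, limits, dict(request)))
```

In this model the request for a type at its limit is zeroed, and the early return happens only when every type has been zeroed. Whether the server the project actually runs behaves the same way is not something a sketch like this can establish.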

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51481 Credit: 1,018,363,574 RAC: 1,004	Message 1928582 - Posted: 7 Apr 2018, 11:39:20 UTC - in response to Message 1928580. All I can say is you are quite the sleuth, Mr. Haselgrove. By now you are probably getting to the point where you understand some things about the Boinc code better than a certain author of it......LOL. "Time is simply the mechanism that keeps everything from happening all at once." ID: 1928582 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14680 Credit: 200,643,578 RAC: 874	Message 1928583 - Posted: 7 Apr 2018, 11:42:09 UTC - in response to Message 1928241.

Thu Apr 5 16:01:20 2018 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 19947 seconds

So....How does it figure 76 tasks will last 332 minutes? The answer is it took the longest estimate of around 4 minutes, multiplied that by 76, and came up with a time close to 304 minutes. BUT, that's for just ONE GPU...the machine has 3 GPUs. Therefore, the estimate is immediately off by a factor of 3, and then there is the problem of tasks taking much less than the High estimate. By the time all is corrected, the 76 tasks will probably take about 76 minutes using 3 GPUs with some tasks taking only 80 seconds to complete. Apparently the estimate completely ignores the number of devices the machine has.

That's a misunderstanding of how it works. Here at SETI, every task occupies, by default, one CPU or one GPU. The estimates are always based on single devices. The exception (at other projects only) is for multi-threaded (MT) tasks, which run on multiple CPU cores at the same time.

The correction for multiple CPUs or GPUs is done at work-fetch time. If you set a cache level of 1 day (86,400 seconds), BOINC will ask for 259,200 seconds, a day's work for each of the three GPUs.

So, for those hitting unexpected 'reached a limit of tasks in progress' messages, my suggestion would be to calculate how long it would take your particular CPU to complete 100 tasks. Say a task takes 1 hour, and you use 4 CPU cores to process them, the hundred tasks would take 25 hours - just over a day. Set your cache level just below that figure - one day and no additional would be neat in this simplified example - and your CPU should cruise along just below the CPU limit. That should allow the scheduler to process the GPU iteration of the loop, and send those tasks too.

If each GPU finishes its own task in under 15 minutes (and for the people in this discussion, I expect that's true), then each GPU should get its own 100 allocation from a 1-day cache, too.

ID: 1928583 ·
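
The arithmetic in the post above is easy to redo for your own host. A minimal sketch follows, using only the figures quoted in the post (1-day cache, 3 GPUs, 100 tasks at 1 hour each on 4 cores, and the 19947-second estimate); the function names are made up for illustration and are not BOINC settings.

```python
SECONDS_PER_DAY = 86_400


def work_request_seconds(cache_days, n_devices):
    """Seconds of work requested: the cache setting scaled by the number of devices."""
    return cache_days * SECONDS_PER_DAY * n_devices


def hours_to_clear(n_tasks, hours_per_task, n_cores):
    """Wall-clock hours to finish n_tasks when n_cores each run one task at a time."""
    return n_tasks * hours_per_task / n_cores


print(work_request_seconds(1, 3))    # 1-day cache, 3 GPUs
print(hours_to_clear(100, 1.0, 4))   # 100 one-hour tasks on 4 cores
print(round(19947 / 60))             # the quoted estimate, in minutes
```

Swap in your own task runtime and core count in `hours_to_clear` to see where a cache setting "just below that figure" would land for your machine.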

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14680 Credit: 200,643,578 RAC: 874	Message 1928587 - Posted: 7 Apr 2018, 12:00:11 UTC One more, and then I'll shut up (go out for lunch). The code I've been examining, https://github.com/BOINC/boinc/commits/master/sched/sched_send.cpp was last edited in July last year, and that was for something completely different. The bits we're interested in probably date back to July 2016 or before. And I have some knowledge of the way David and Eric work: David likes to use SETI for his testing, and Eric has far more important things to worry about than scheduler code (he's got a NASA launch coming up), so he tries not to touch it. My guess is that we're still running the published code. I could request a behaviour change so that the loop executes unconditionally for every processor type, regardless of whether the CPU is stuffed, but I'm not going to. It'll be better for the project as a whole to get you to turn down the cache sizes, which should shrink the database - except, it'll have the contrary effect, if it allows you to stuff the GPU caches instead. And allowing the early drop out of scheduler requests to continue will save CPU time on Synergy, too... ID: 1928587 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51481 Credit: 1,018,363,574 RAC: 1,004	Message 1928599 - Posted: 7 Apr 2018, 13:32:34 UTC Creation rate is ramping up at the moment. We shall see if it keeps up. Meowsplitters! "Time is simply the mechanism that keeps everything from happening all at once." ID: 1928599 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1928603 - Posted: 7 Apr 2018, 13:44:10 UTC - in response to Message 1928555. Last modified: 7 Apr 2018, 14:43:12 UTC

So a scheduler turnround of 3 seconds seems pretty consistent. If TBar is getting a 1 second turnround, that suggests that he's triggering some sort of early exit path in the scheduler code. OK, that's something to think about when the coffee kicks in.

A One second turnaround is pretty consistent on this side of the pond if you're sitting on top of a major FiOS cable.

Sat Apr 7 09:23:42 2018 | SETI@home | Sending scheduler request: To report completed tasks.
Sat Apr 7 09:23:42 2018 | SETI@home | Reporting 5 completed tasks
Sat Apr 7 09:23:42 2018 | SETI@home | Requesting new tasks for NVIDIA GPU
Sat Apr 7 09:23:42 2018 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Sat Apr 7 09:23:42 2018 | SETI@home | [sched_op] NVIDIA GPU work request: 265812.04 seconds; 0.00 devices
Sat Apr 7 09:23:43 2018 | SETI@home | Scheduler request completed: got 5 new tasks
Sat Apr 7 09:23:43 2018 | SETI@home | [sched_op] estimated total CPU task duration: 0 seconds
Sat Apr 7 09:23:43 2018 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 1193 seconds
Sat Apr 7 09:23:43 2018 | SETI@home | [sched_op] Deferring communication for 00:05:03
Sat Apr 7 09:23:43 2018 | SETI@home | [sched_op] Reason: requested by project

Right now it's working as expected. I think the last time I looked at the scheduler code I came to the conclusion that it really didn't think it had any tasks to send. Hasn't there been recent talk of problems with the DataBase? Something about not being able to assign new tasks? So it couldn't send any new work? There's also the Arecibo VLAR problem. Has anyone considered allowing those VLARs on Anonymous platform so the faster machines can run them with the Special App?

It might work if Arecibo VLARs could be enabled on Anonymous platform, maybe even restrict them to CUDA 6.0 and above, if that were possible.

Oh, look at this other machine. It has the ATI card set to run APs only, and is currently out of work. Yet the machine is still getting the reached a limit message. So, it would appear the reached a limit message only applies to one particular App.

Sat Apr 7 10:24:22 2018 | SETI@home | [sched_op] Starting scheduler request
Sat Apr 7 10:24:22 2018 | SETI@home | Sending scheduler request: To fetch work.
Sat Apr 7 10:24:22 2018 | SETI@home | Requesting new tasks for NVIDIA GPU and AMD/ATI GPU
Sat Apr 7 10:24:22 2018 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Sat Apr 7 10:24:22 2018 | SETI@home | [sched_op] NVIDIA GPU work request: 444308.99 seconds; 0.00 devices
Sat Apr 7 10:24:22 2018 | SETI@home | [sched_op] AMD/ATI GPU work request: 263520.00 seconds; 1.00 devices
Sat Apr 7 10:24:24 2018 | SETI@home | Scheduler request completed: got 0 new tasks
Sat Apr 7 10:24:24 2018 | SETI@home | [sched_op] Server version 709
Sat Apr 7 10:24:24 2018 | SETI@home | No tasks sent
Sat Apr 7 10:24:24 2018 | SETI@home | No tasks are available for AstroPulse v7
Sat Apr 7 10:24:24 2018 | SETI@home | This computer has reached a limit on tasks in progress
Sat Apr 7 10:24:24 2018 | SETI@home | Project requested delay of 303 seconds
Sat Apr 7 10:24:24 2018 | SETI@home | [sched_op] Deferring communication for 00:05:03
Sat Apr 7 10:24:24 2018 | SETI@home | [sched_op] Reason: requested by project

Why did it take 2 seconds? I dunno, maybe because the other machine is just running CUDA tasks and Not APs.

ID: 1928603 ·
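
To compare what a host asked for against what the server replied, the `[sched_op]` lines can be pulled apart with a regular expression. The sketch below assumes the line layout seen in the log above with plain `|` separators; the `summarize` function and the `SAMPLE` excerpt are illustrative only, and other client versions may word these lines differently.

```python
import re

# Matches e.g. "[sched_op] NVIDIA GPU work request: 444308.99 seconds; 0.00 devices"
REQUEST = re.compile(
    r"\[sched_op\] (.+?) work request: ([\d.]+) seconds; ([\d.]+) devices"
)

# Excerpt of the second log in the post above.
SAMPLE = """\
Sat Apr 7 10:24:22 2018 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Sat Apr 7 10:24:22 2018 | SETI@home | [sched_op] NVIDIA GPU work request: 444308.99 seconds; 0.00 devices
Sat Apr 7 10:24:22 2018 | SETI@home | [sched_op] AMD/ATI GPU work request: 263520.00 seconds; 1.00 devices
Sat Apr 7 10:24:24 2018 | SETI@home | Scheduler request completed: got 0 new tasks
Sat Apr 7 10:24:24 2018 | SETI@home | This computer has reached a limit on tasks in progress
"""


def summarize(log_text):
    """Return ({resource: (seconds, devices)}, limit_message_seen)."""
    requests = {}
    limit_hit = False
    for line in log_text.splitlines():
        match = REQUEST.search(line)
        if match:
            requests[match.group(1)] = (float(match.group(2)), float(match.group(3)))
        elif "reached a limit on tasks in progress" in line:
            limit_hit = True
    return requests, limit_hit


requests, limit_hit = summarize(SAMPLE)
for resource, (seconds, devices) in requests.items():
    print(f"{resource}: {seconds:.0f} s requested, {devices:.0f} idle devices")
print("limit message seen:", limit_hit)
```

Note that the user-facing limit message does not name a processor type, so a summary like this can show that the message arrived alongside a GPU request but not which resource triggered it.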

Bill G Send message Joined: 1 Jun 01 Posts: 1282 Credit: 187,688,550 RAC: 182	Message 1928631 - Posted: 7 Apr 2018, 15:45:33 UTC Am I misreading something or is the same tape loaded twice? Tape blc03_2bit_guppi_58185_76028_Dw1_off_0033 SETI@home classic workunits 4,019 SETI@home classic CPU time 34,348 hours ID: 1928631 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51481 Credit: 1,018,363,574 RAC: 1,004	Message 1928632 - Posted: 7 Apr 2018, 15:49:24 UTC - in response to Message 1928631. Last modified: 7 Apr 2018, 15:50:57 UTC Am I misreading something or is the same tape loaded twice? Tape blc03_2bit_guppi_58185_76028_Dw1_off_0033 blc01 and blc03, I think. Or maybe 28 and 38. "Time is simply the mechanism that keeps everything from happening all at once." ID: 1928632 ·

Bill G Send message Joined: 1 Jun 01 Posts: 1282 Credit: 187,688,550 RAC: 182	Message 1928636 - Posted: 7 Apr 2018, 15:55:34 UTC - in response to Message 1928632. Am I misreading something or is the same tape loaded twice? Tape blc03_2bit_guppi_58185_76028_Dw1_off_0033 blc01 and blc03, I think. Or maybe 28 and 38. Thanks, I knew something was wrong, and it was me.... blc01 and 03 certainly makes them different. (hitting head with mouse) SETI@home classic workunits 4,019 SETI@home classic CPU time 34,348 hours ID: 1928636 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1928645 - Posted: 7 Apr 2018, 16:24:20 UTC - in response to Message 1928560. I just had the Ryzen Linux box get 8 tasks stuck in download with 5 minutes on the counter trying and clocking up. Project backoff was 3 1/2 hours. Meanwhile uploads were going through with no issues every 20 seconds. No network issues at your end? Just had a quick look through my Event log, and no signs of download problems over the last 7 hours. Edit- just checked my Hosts file & still had the usually good download server address still there. Commented it out & will see how things go now. No, no network issues. All other machines were downloading and uploading fine. Even that machine was popping out uploads constantly while I was watching the stuck downloads. It's like the servers lost track of just those 8 tasks when it tried to get them using the normal mechanism. They would have just sat there and prevented any other downloads until I manually intervened. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1928645 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1928647 - Posted: 7 Apr 2018, 16:34:51 UTC // and we haven't exceeded result per RPC limit What is the definition of result per RPC limit? Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1928647 ·
