Message boards :
Number crunching :
Panic Mode On (111) Server Problems?
Author | Message |
---|---|
Sirius B Send message Joined: 26 Dec 00 Posts: 24901 Credit: 3,081,182 RAC: 7 |
Good point, but I want to see if it works with more than 50 at a time :-) After that hit, I'll keep a close eye on the log for when the scheduler has issues. |
Wiggo Send message Joined: 24 Jan 00 Posts: 36075 Credit: 261,360,520 RAC: 489 |
Good point, but I want to see if it works with more than 50 at a time :-) I doubt that you'll see much difference there Sirius as you're only doing CPU work and not requesting constant GPU work. ;-) But then again I don't suffer those same problems here, but when I do have a problem everyone has a problem. Cheers. |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Good point, but I want to see if it works with more than 50 at a time :-) Yes, if Wiggo is having issues . . . . he is the proverbial canary in the coalmine. He runs a version 6 client and if he is having issues . . . . EVERYONE is having issues. Ever since the last of the Arecibo tasks cleared out from the RTS buffer, I have kept all machines at full caches. So maybe the problem is that the scheduler is having issues differentiating between CPU and GPU work requests when the mix in the buffer contains Arecibo VLARs. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
. . LOL I love it ... I have a picture in my head of Wiggo in a canary suit sitting on a perch. Sadly though in that analogy the canary would be Grant or maybe even yourself, being among the first hosts to experience difficulties when there is a problem. . . But I still love that image :) . . Yes, whatever the key to the problem turns out to be I think it is clearly related to the point in the software where distinctions are made in task selection, particularly when there are Arecibo VLARs in the mix. Stephen :) |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
The scheduler never used to be this sensitive. Something in the database reconfiguration brought this to the fore and made it more obvious and obnoxious. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14666 Credit: 200,643,578 RAC: 874 |
Did my morning fetch while you lot were all asleep and I had the server to myself... ;-) 07/04/2018 07:17:50 | SETI@home | Sending scheduler request: To fetch work. 07/04/2018 07:19:01 | SETI@home | Sending scheduler request: To fetch work. 07/04/2018 07:20:11 | SETI@home | Sending scheduler request: To fetch work. So a scheduler turnround of 3 seconds seems pretty consistent. If TBar is getting a 1 second turnround, that suggests that he's triggering some sort of early exit path in the scheduler code. OK, that's something to think about when the coffee kicks in. |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
I just had the Ryzen Linux box get 8 tasks stuck in download with 5 minutes on the counter trying and clocking up. Project backoff was 3 1/2 hours. Meanwhile uploads were going through with no issues every 20 seconds. Set http_transfer and http_debug and watched the host communicate with the scheduler. It knew it had transfers pending but couldn't seem to locate the files. After watching several go through their countdown timers, get no response and start a new increased timeout I tried a manual retry on a task and it eventually came down and that cleared the logjam on the others shortly. A different situation than the stuck downloads we've seen in the past. Something I'd not seen before. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13822 Credit: 208,696,464 RAC: 304 |
I just had the Ryzen Linux box get 8 tasks stuck in download with 5 minutes on the counter trying and clocking up. Project backoff was 3 1/2 hours. Meanwhile uploads were going through with no issues every 20 seconds. No network issues at your end? Just had a quick look through my Event log, and no signs of download problems over the last 7 hours. Edit- just checked my Hosts file & the usually good download server address was still in there. Commented it out & will see how things go now. Grant Darwin NT |
rob smith Send message Joined: 7 Mar 03 Posts: 22394 Credit: 416,307,556 RAC: 380 |
It's the server's way of winding Keith up and telling him "time for bed" Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14666 Credit: 200,643,578 RAC: 874 |
g_wreq->max_jobs_exceeded() Turns out to be easier than I expected. The process starts with https://github.com/BOINC/boinc/blob/master/sched/sched_send.cpp#L735: // return true if additional work is needed, // and there's disk space left, // and we haven't exceeded result per RPC limit, // and we haven't exceeded results per day limit // bool work_needed(bool locality_sched) { Locality_sched applies to Einstein@Home only - we can ignore it (always false) for SETI. work_needed() starts with some very quick checks - enough disk, speed, memory, allowed tasks. Measure that in microseconds. Then (same file, line 779): // see if we've reached limits on in-progress jobs // bool some_type_allowed = false; for (int i=0; i<NPROC_TYPES; i++) { we run a loop over each processor type - CPU, NVidia, ATI, intel_gpu. That's the order they were introduced into BOINC's capability, so I'll bet anyone a beer that's the order in which they're tested. The bit that bites us is line 796: if (proj_pref_exceeded || config.max_jobs_in_progress.exceeded(NULL, i)) { if (config.debug_quota) { log_messages.printf(MSG_NORMAL, "[quota] reached limit on %s jobs in progress\n", proc_type_name(i) ); config.max_jobs_in_progress.print_log(); } g_wreq->clear_req(i); g_wreq->max_jobs_on_host_proc_type_exceeded[i] = true; Note that proc_type_name is printed to the server logs, although we don't see it in user messages. So, that segment leaves some_type_allowed as false, and triggers line 811 (edited): if (!some_type_allowed) { g_wreq->max_jobs_on_host_exceeded = true; return false; which in turn triggers line 1635: if (!work_needed(false)) { goto done; } which sends messages back to the user and effectively drops us out of the scheduler session. |
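The flow walked through in the post above can be modelled in a few lines of standalone C++. This is only a rough sketch under stated assumptions, not BOINC's code: `WorkRequest`, `work_needed_sketch`, `kProcTypes` and the plain integer limits are hypothetical stand-ins, the `>=` comparison is a guess at what `exceeded()` means, and the `else` branch that sets `some_type_allowed` is assumed because the quoted fragments do not show it. The sketch also ignores whether the host actually has a device of each type.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for NPROC_TYPES: CPU, NVIDIA, ATI, Intel GPU
// (the order is the one guessed at in the post).
constexpr std::size_t kProcTypes = 4;

// Hypothetical stand-in for the per-request state that g_wreq carries.
struct WorkRequest {
    std::array<bool, kProcTypes> requested{};       // client asked for this type
    std::array<bool, kProcTypes> type_limit_hit{};  // like max_jobs_on_host_proc_type_exceeded
    bool host_limit_hit = false;                    // like max_jobs_on_host_exceeded
};

// Returns true if the scheduler should go on to look for jobs to send.
bool work_needed_sketch(WorkRequest& req,
                        const std::array<int, kProcTypes>& in_progress,
                        const std::array<int, kProcTypes>& limit) {
    bool some_type_allowed = false;
    for (std::size_t i = 0; i < kProcTypes; ++i) {
        assert(limit[i] >= 0);  // limits are assumed non-negative in this model
        if (in_progress[i] >= limit[i]) {
            req.requested[i] = false;       // models g_wreq->clear_req(i)
            req.type_limit_hit[i] = true;
        } else {
            some_type_allowed = true;       // assumed branch, not quoted in the post
        }
    }
    if (!some_type_allowed) {
        req.host_limit_hit = true;
        return false;                       // the caller would then "goto done"
    }
    return true;
}
```

In this model a host whose CPU queue is at its limit but whose GPU queue is not still gets `true` back, with only the CPU request cleared. The early exit happens only when every type is at its limit.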
kittyman Send message Joined: 9 Jul 00 Posts: 51470 Credit: 1,018,363,574 RAC: 1,004 |
All I can say is you are quite the sleuth, Mr. Haselgrove. By now you are probably getting to the point where you understand some things about the BOINC code better than a certain author of it......LOL. "Time is simply the mechanism that keeps everything from happening all at once." |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14666 Credit: 200,643,578 RAC: 874 |
That's a misunderstanding of how it works. Here at SETI, every task occupies, by default, one CPU or one GPU. The estimates are always based on single devices. The exception (at other projects only) is for multi-threaded (MT) tasks, which run on multiple CPU cores at the same time. Thu Apr 5 16:01:20 2018 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 19947 seconds The answer is it took the longest estimate of around 4 minutes, multiplied that by 76, and came up with a time close to 304 minutes. BUT, that's for just ONE GPU...the machine has 3 GPUs. Therefore, the estimate is immediately off by a factor of 3, and then there is the problem of tasks taking much less than the High estimate. By the time all is corrected, the 76 tasks will probably take about 76 minutes using 3 GPUs with some tasks taking only 80 seconds to complete. The correction for multiple CPUs or GPUs is done at work-fetch time. If you set a cache level of 1 day (86,400 seconds), BOINC will ask for 259,200 seconds, a day's work for each of the three GPUs. So, for those hitting unexpected 'reached a limit of tasks in progress' messages, my suggestion would be to calculate how long it would take your particular CPU to complete 100 tasks. Say a task takes 1 hour, and you use 4 CPU cores to process them, the hundred tasks would take 25 hours - just over a day. Set your cache level just below that figure - one day and no additional would be neat in this simplified example - and your CPU should cruise along just below the CPU limit. That should allow the scheduler to process the GPU iteration of the loop, and send those tasks too. If each GPU finishes its own task in under 15 minutes (and for the people in this discussion, I expect that's true), then each GPU should get its own 100 allocation from a 1-day cache, too. |
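The arithmetic in the post above can be checked with two small helpers. This is a minimal sketch that assumes the simple model the post describes (one task per device at a time, work requests scaled by device count). The function names are made up for illustration and are not part of BOINC.

```cpp
#include <cassert>

// Seconds of work the client would ask for, given a cache setting in days
// and the number of devices of that type (assumed: simple multiplication).
double requested_seconds(double cache_days, int devices) {
    assert(devices > 0);
    return cache_days * 86400.0 * devices;
}

// Hours needed to work through `task_limit` tasks when each of `devices`
// runs one task at a time and a task takes `hours_per_task`.
double hours_to_drain(int task_limit, double hours_per_task, int devices) {
    assert(devices > 0);
    return task_limit * hours_per_task / devices;
}
```

With the post's numbers, `requested_seconds(1.0, 3)` gives 259,200 seconds for a one-day cache on three GPUs, and `hours_to_drain(100, 1.0, 4)` gives 25 hours for 100 one-hour tasks on four CPU cores, which is why a one-day cache sits just under the 100-task limit in that example.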
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14666 Credit: 200,643,578 RAC: 874 |
One more, and then I'll shut up (go out for lunch). The code I've been examining, https://github.com/BOINC/boinc/commits/master/sched/sched_send.cpp was last edited in July last year, and that was for something completely different. The bits we're interested in probably date back to July 2016 or before. And I have some knowledge of the way David and Eric work: David likes to use SETI for his testing, and Eric has far more important things to worry about than scheduler code (he's got a NASA launch coming up), so he tries not to touch it. My guess is that we're still running the published code. I could request a behaviour change so that the loop executes unconditionally for every processor type, regardless of whether the CPU is stuffed, but I'm not going to. It'll be better for the project as a whole to get you to turn down the cache sizes, which should shrink the database - except, it'll have the contrary effect, if it allows you to stuff the GPU caches instead. And allowing the early drop out of scheduler requests to continue will save CPU time on Synergy, too... |
kittyman Send message Joined: 9 Jul 00 Posts: 51470 Credit: 1,018,363,574 RAC: 1,004 |
Creation rate is ramping up at the moment. We shall see if it keeps up. Meowsplitters! "Time is simply the mechanism that keeps everything from happening all at once." |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
So a scheduler turnround of 3 seconds seems pretty consistent. If TBar is getting a 1 second turnround, that suggests that he's triggering some sort of early exit path in the scheduler code. OK, that's something to think about when the coffee kicks in. A One second turnaround is pretty consistent on this side of the pond if you're sitting on top of a major FiOS cable. Sat Apr 7 09:23:42 2018 | SETI@home | Sending scheduler request: To report completed tasks. Right now it's working as expected. I think the last time I looked at the scheduler code I came to the conclusion that it really didn't think it had any tasks to send. Hasn't there been recent talk of problems with the DataBase? Something about not being able to assign new tasks? So it couldn't send any new work? There's also the Arecibo VLAR problem. Has anyone considered allowing those VLARs on Anonymous platform so the faster machines can run them with the Special App? It might work if Arecibo VLARs could be enabled on Anonymous platform, maybe even restrict them to CUDA 6.0 and above, if that were possible. Oh, look at this other machine. It has the ATI card set to run APs only, and is currently out of work. Yet the machine is still getting the reached a limit message. So, it would appear the reached a limit message only applies to one particular App. Sat Apr 7 10:24:22 2018 | SETI@home | [sched_op] Starting scheduler request Why did it take 2 seconds? I dunno, maybe because the other machine is just running CUDA tasks and Not APs. |
Bill G Send message Joined: 1 Jun 01 Posts: 1282 Credit: 187,688,550 RAC: 182 |
Am I misreading something or is the same tape loaded twice? Tape blc03_2bit_guppi_58185_76028_Dw1_off_0033 SETI@home classic workunits 4,019 SETI@home classic CPU time 34,348 hours |
kittyman Send message Joined: 9 Jul 00 Posts: 51470 Credit: 1,018,363,574 RAC: 1,004 |
Am I misreading something or is the same tape loaded twice? blc01 and blc03, I think. Or maybe 28 and 38. "Time is simply the mechanism that keeps everything from happening all at once." |
Bill G Send message Joined: 1 Jun 01 Posts: 1282 Credit: 187,688,550 RAC: 182 |
Am I misreading something or is the same tape loaded twice? Thanks, I knew something was wrong, and it was me... blc01 and 03 certainly makes them different. (hitting head with mouse) SETI@home classic workunits 4,019 SETI@home classic CPU time 34,348 hours |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
I just had the Ryzen Linux box get 8 tasks stuck in download with 5 minutes on the counter trying and clocking up. Project backoff was 3 1/2 hours. Meanwhile uploads were going through with no issues every 20 seconds. No, no network issues. All other machines were downloading and uploading fine. Even that machine was popping out uploads constantly while I was watching the stuck downloads. It's like the servers lost track of just those 8 tasks when it tried to get them using the normal mechanism. They would have just sat there and prevented any other downloads until I manually intervened. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
// and we haven't exceeded result per RPC limit What is the definition of result per RPC limit? Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.