Panic Mode On (111) Server Problems?

Sirius B Project Donor
Volunteer tester
Joined: 26 Dec 00
Posts: 24909
Credit: 3,081,182
RAC: 7
Ireland
Message 1928521 - Posted: 7 Apr 2018, 1:52:50 UTC - in response to Message 1928520.  

Good point, but I want to see if it works with more than 50 at a time :-)

After that hit, I'll keep a close eye on the log for when the scheduler has issues.
ID: 1928521
Profile Wiggo
Joined: 24 Jan 00
Posts: 36592
Credit: 261,360,520
RAC: 489
Australia
Message 1928524 - Posted: 7 Apr 2018, 1:59:05 UTC - in response to Message 1928521.  

Good point, but I want to see if it works with more than 50 at a time :-)

After that hit, I'll keep a close eye on the log for when the scheduler has issues.

I doubt that you'll see much difference there Sirius as you're only doing CPU work and not requesting constant GPU work. ;-)

But then again I don't suffer those same problems here, but when I do have a problem everyone has a problem.

Cheers.
ID: 1928524
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1928528 - Posted: 7 Apr 2018, 3:12:35 UTC - in response to Message 1928524.  

Good point, but I want to see if it works with more than 50 at a time :-)

After that hit, I'll keep a close eye on the log for when the scheduler has issues.

I doubt that you'll see much difference there Sirius as you're only doing CPU work and not requesting constant GPU work. ;-)

But then again I don't suffer those same problems here, but when I do have a problem everyone has a problem.

Cheers.

Yes, if Wiggo is having issues . . . . he is the proverbial canary in the coalmine. He runs a version 6 client and if he is having issues . . . . EVERYONE is having issues.

Ever since the last of the Arecibo tasks cleared out from the RTS buffer, I have kept all machines at full caches. So maybe the problem is that the scheduler is having issues differentiating cpu or gpu work requests when the mix in the buffer contains Arecibo VLAR's.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1928528
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1928532 - Posted: 7 Apr 2018, 3:28:14 UTC - in response to Message 1928528.  


But then again I don't suffer those same problems here, but when I do have a problem everyone has a problem.
Cheers.

Yes, if Wiggo is having issues . . . . he is the proverbial canary in the coalmine. He runs a version 6 client and if he is having issues . . . . EVERYONE is having issues.

Ever since the last of the Arecibo tasks cleared out from the RTS buffer, I have kept all machines at full caches. So maybe the problem is that the scheduler is having issues differentiating cpu or gpu work requests when the mix in the buffer contains Arecibo VLAR's.


. . LOL I love it ... I have a picture in my head of Wiggo in a canary suit sitting on a perch. Sadly though in that analogy the canary would be Grant or maybe even yourself, being among the first hosts to experience difficulties when there is a problem.

. . But I still love that image :)

. . Yes, whatever the key to the problem turns out to be, I think it is clearly related to the point in the software where distinctions are made in task selection, particularly when there are Arecibo VLARs in the mix.

Stephen

:)
ID: 1928532
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1928534 - Posted: 7 Apr 2018, 4:29:31 UTC - in response to Message 1928532.  

The scheduler never used to be this sensitive. Something in the database reconfiguration brought this to the fore and made it more obvious and obnoxious.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1928534
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14677
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1928555 - Posted: 7 Apr 2018, 6:57:37 UTC

Did my morning fetch while you lot were all asleep and I had the server to myself... ;-)

07/04/2018 07:17:50 | SETI@home | Sending scheduler request: To fetch work.
07/04/2018 07:17:53 | SETI@home | Scheduler request completed: got 41 new tasks
07/04/2018 07:23:00 | SETI@home | Sending scheduler request: To fetch work.
07/04/2018 07:23:00 | SETI@home | Reporting 1 completed tasks
07/04/2018 07:23:04 | SETI@home | Scheduler request completed: got 1 new tasks
07/04/2018 07:19:01 | SETI@home | Sending scheduler request: To fetch work.
07/04/2018 07:19:04 | SETI@home | Scheduler request completed: got 39 new tasks
07/04/2018 07:20:11 | SETI@home | Sending scheduler request: To fetch work.
07/04/2018 07:20:14 | SETI@home | Scheduler request completed: got 37 new tasks
07/04/2018 07:25:22 | SETI@home | Sending scheduler request: To fetch work.
07/04/2018 07:25:22 | SETI@home | Reporting 1 completed tasks
07/04/2018 07:25:24 | SETI@home | Scheduler request completed: got 1 new tasks
So a scheduler turnround of 3 seconds seems pretty consistent. If TBar is getting a 1 second turnround, that suggests that he's triggering some sort of early exit path in the scheduler code. OK, that's something to think about when the coffee kicks in.
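For anyone who wants to check turnround times against their own Event Log, here is a minimal sketch in Python. It assumes the log lines use the `date time | project | message` layout shown above with day/month/year dates; `turnaround_seconds` is a hypothetical helper name, not part of BOINC. It pairs each "Sending scheduler request" line with the next "Scheduler request completed" line and reports the gap.

```python
from datetime import datetime


def turnaround_seconds(log_lines):
    """Return the gap in seconds between each scheduler request and its reply."""
    gaps = []
    sent_at = None
    for line in log_lines:
        # Split into timestamp, project name, and message text.
        stamp, _project, message = (part.strip() for part in line.split("|", 2))
        # Assumption: dates are day/month/year, as in the log quoted above.
        when = datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
        if message.startswith("Sending scheduler request"):
            sent_at = when
        elif message.startswith("Scheduler request completed") and sent_at is not None:
            gaps.append(int((when - sent_at).total_seconds()))
            sent_at = None
    return gaps


log = [
    "07/04/2018 07:17:50 | SETI@home | Sending scheduler request: To fetch work.",
    "07/04/2018 07:17:53 | SETI@home | Scheduler request completed: got 41 new tasks",
    "07/04/2018 07:23:00 | SETI@home | Sending scheduler request: To fetch work.",
    "07/04/2018 07:23:00 | SETI@home | Reporting 1 completed tasks",
    "07/04/2018 07:23:04 | SETI@home | Scheduler request completed: got 1 new tasks",
]
print(turnaround_seconds(log))  # [3, 4]
```

Logs that use a different timestamp layout (for example `Sat Apr 7 09:23:42 2018`) would need a different format string.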
ID: 1928555
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1928558 - Posted: 7 Apr 2018, 7:26:09 UTC

I just had the Ryzen Linux box get 8 tasks stuck in download with 5 minutes on the counter trying and clocking up. Project backoff was 3 1/2 hours. Meanwhile uploads were going through with no issues every 20 seconds.

Set http_transfer and http_debug and watched the host communicate with the scheduler. It knew it had transfers pending but couldn't seem to locate the files. After watching several go through their countdown timers, get no response and start a new, increased timeout, I tried a manual retry on one task. It eventually came down, and that cleared the logjam on the others shortly afterwards.

A different situation than the stuck downloads we've seen in the past. Something I'd not seen before.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1928558
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13842
Credit: 208,696,464
RAC: 304
Australia
Message 1928560 - Posted: 7 Apr 2018, 7:48:53 UTC - in response to Message 1928558.  
Last modified: 7 Apr 2018, 7:50:39 UTC

I just had the Ryzen Linux box get 8 tasks stuck in download with 5 minutes on the counter trying and clocking up. Project backoff was 3 1/2 hours. Meanwhile uploads were going through with no issues every 20 seconds.

No network issues at your end?
Just had a quick look through my Event log, and no signs of download problems over the last 7 hours.

Edit- just checked my Hosts file & it still had the usually good download server address in it. Commented it out & will see how things go now.
Grant
Darwin NT
ID: 1928560
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22495
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1928564 - Posted: 7 Apr 2018, 9:02:17 UTC

It's the server's way of winding Keith up and telling him "time for bed"
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1928564
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14677
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1928580 - Posted: 7 Apr 2018, 11:17:21 UTC
Last modified: 7 Apr 2018, 11:42:42 UTC

g_wreq->max_jobs_exceeded()
Turns out to be easier than I expected.

The process starts with https://github.com/BOINC/boinc/blob/master/sched/sched_send.cpp#L735:

// return true if additional work is needed,
// and there's disk space left,
// and we haven't exceeded result per RPC limit,
// and we haven't exceeded results per day limit
//
bool work_needed(bool locality_sched) {
Locality_sched applies to Einstein@Home only - we can ignore it (always false) for SETI.

work_needed() starts with some very quick checks - enough disk, speed, memory, allowed tasks. Measure that in microseconds. Then (same file, line 779):

    // see if we've reached limits on in-progress jobs
    //
    bool some_type_allowed = false;

    for (int i=0; i<NPROC_TYPES; i++) {
we run a loop over each processor type - CPU, NVidia, ATI, intel_gpu. That's the order they were introduced into BOINC's capability, so I'll bet anyone a beer that's the order in which they're tested. The bit that bites us is line 796:

        if (proj_pref_exceeded || config.max_jobs_in_progress.exceeded(NULL, i)) {
            if (config.debug_quota) {
                log_messages.printf(MSG_NORMAL,
                    "[quota] reached limit on %s jobs in progress\n",
                    proc_type_name(i)
                );
                config.max_jobs_in_progress.print_log();
            }
            g_wreq->clear_req(i);
            g_wreq->max_jobs_on_host_proc_type_exceeded[i] = true;
Note that proc_type_name is printed to the server logs, although we don't see it in user messages.

So, that segment leaves some_type_allowed as false, and triggers line 811 (edited):

    if (!some_type_allowed) {
        g_wreq->max_jobs_on_host_exceeded = true;
        return false;
which in turn triggers line 1635:

    if (!work_needed(false)) {
        goto done;
    }
which sends messages back to the user and effectively drops us out of the scheduler session.
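The control flow traced above can be modelled in a few lines of Python. This is a toy sketch, not the BOINC code: the dictionaries, the function signature and the limit of 100 are illustrative, and the `else` branch that sets `some_type_allowed` is an assumption, because that branch does not appear in the quoted excerpt.

```python
def work_needed(in_progress, limits):
    """Toy model of the per-processor-type limit check.

    in_progress and limits map a processor type name to a job count.
    Returns (work_is_needed, list_of_types_at_their_limit).
    """
    some_type_allowed = False
    exceeded = []
    for proc_type, limit in limits.items():
        if in_progress.get(proc_type, 0) >= limit:
            # Stands in for clear_req(i) and the per-type "exceeded" flag.
            exceeded.append(proc_type)
        else:
            # ASSUMPTION: not shown in the excerpt quoted above.
            some_type_allowed = True
    # With no type allowed, the caller takes the early "goto done" path.
    return some_type_allowed, exceeded


limits = {"CPU": 100, "NVIDIA GPU": 100}  # illustrative limits
print(work_needed({"CPU": 100, "NVIDIA GPU": 100}, limits))  # (False, ['CPU', 'NVIDIA GPU'])
print(work_needed({"CPU": 100, "NVIDIA GPU": 40}, limits))   # (True, ['CPU'])
```

Under that assumption the early return fires only when every processor type is at its limit. The real behaviour should be checked against the source file linked above.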
ID: 1928580
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1928582 - Posted: 7 Apr 2018, 11:39:20 UTC - in response to Message 1928580.  

All I can say is you are quite the sleuth, Mr. Haselgrove.
By now you are probably getting to the point where you understand some things about the BOINC code better than a certain author of it... LOL.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1928582
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14677
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1928583 - Posted: 7 Apr 2018, 11:42:09 UTC - in response to Message 1928241.  

Thu Apr 5 16:01:20 2018 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 19947 seconds
So....How does it figure 76 tasks will last 332 minutes?
The answer is it took the longest estimate of around 4 minutes, multiplied that by 76, and came up with a time close to 304 minutes. BUT, that's for just ONE GPU...the machine has 3 GPUs. Therefore, the estimate is immediately off by a factor of 3, and then there is the problem of tasks taking much less than the High estimate. By the time all is corrected, the 76 tasks will probably take about 76 minutes using 3 GPUs with some tasks taking only 80 seconds to complete.
Apparently the estimate completely ignores the number of devices the machine has.
That's a misunderstanding of how it works. Here at SETI, every task occupies, by default, one CPU or one GPU. The estimates are always based on single devices. The exception (at other projects only) is for multi-threaded (MT) tasks, which run on multiple CPU cores at the same time.

The correction for multiple CPUs or GPUs is done at work-fetch time. If you set a cache level of 1 day (86,400 seconds), BOINC will ask for 259,200 seconds, a day's work for each of the three GPUs.
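The arithmetic in the two paragraphs above can be written out as a short Python sketch. The function names are hypothetical, and the second function assumes tasks spread evenly across identical devices, which is a simplification rather than anything BOINC guarantees.

```python
SECONDS_PER_DAY = 86_400


def work_request_seconds(cache_days, device_count):
    """Seconds of work requested: one cache's worth per device."""
    return cache_days * SECONDS_PER_DAY * device_count


def wall_clock_minutes(total_estimate_seconds, device_count):
    """Rough wall-clock time when single-device estimates run in parallel."""
    return total_estimate_seconds / device_count / 60


print(work_request_seconds(1, 3))                 # 259200
print(round(wall_clock_minutes(19947, 3), 1))     # 110.8
```

So the 19947-second estimate from the quoted log is a total over single-device tasks; spread over three GPUs it would be nearer 111 minutes of wall-clock time, before any correction for tasks that finish faster than estimated.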

So, for those hitting unexpected 'reached a limit of tasks in progress' messages, my suggestion would be to calculate how long it would take your particular CPU to complete 100 tasks. Say a task takes 1 hour, and you use 4 CPU cores to process them, the hundred tasks would take 25 hours - just over a day. Set your cache level just below that figure - one day and no additional would be neat in this simplified example - and your CPU should cruise along just below the CPU limit. That should allow the scheduler to process the GPU iteration of the loop, and send those tasks too. If each GPU finishes its own task in under 15 minutes (and for the people in this discussion, I expect that's true), then each GPU should get its own 100 allocation from a 1-day cache, too.
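As a rough aid for picking a cache size, the calculation above can be sketched in Python. `days_to_finish` is a hypothetical helper, and the figures are the simplified ones from this post; substitute your own run times and core counts.

```python
def days_to_finish(task_limit, hours_per_task, parallel_tasks):
    """Days needed to work through task_limit tasks at the given rate."""
    return task_limit * hours_per_task / parallel_tasks / 24


# CPU example from the post: 100 tasks, 1 hour each, 4 cores.
cpu_days = days_to_finish(100, 1.0, 4)
print(round(cpu_days, 2))  # 1.04

# GPU example: 100 tasks at 15 minutes each on one GPU.
gpu_days = days_to_finish(100, 0.25, 1)
print(round(gpu_days, 2))  # 1.04
```

Both cases come out at 25 hours, so a cache setting of one day sits just under the time needed to drain 100 tasks.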
ID: 1928583
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14677
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1928587 - Posted: 7 Apr 2018, 12:00:11 UTC

One more, and then I'll shut up (go out for lunch).

The code I've been examining, https://github.com/BOINC/boinc/commits/master/sched/sched_send.cpp was last edited in July last year, and that was for something completely different. The bits we're interested in probably date back to July 2016 or before. And I have some knowledge of the way David and Eric work: David likes to use SETI for his testing, and Eric has far more important things to worry about than scheduler code (he's got a NASA launch coming up), so he tries not to touch it. My guess is that we're still running the published code.

I could request a behaviour change so that the loop executes unconditionally for every processor type, regardless of whether the CPU is stuffed, but I'm not going to. It'll be better for the project as a whole to get you to turn down the cache sizes, which should shrink the database - except, it'll have the contrary effect, if it allows you to stuff the GPU caches instead. And allowing the early drop out of scheduler requests to continue will save CPU time on Synergy, too...
ID: 1928587
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1928599 - Posted: 7 Apr 2018, 13:32:34 UTC

Creation rate is ramping up at the moment.
We shall see if it keeps up.
Meowsplitters!
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1928599
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1928603 - Posted: 7 Apr 2018, 13:44:10 UTC - in response to Message 1928555.  
Last modified: 7 Apr 2018, 14:43:12 UTC

So a scheduler turnround of 3 seconds seems pretty consistent. If TBar is getting a 1 second turnround, that suggests that he's triggering some sort of early exit path in the scheduler code. OK, that's something to think about when the coffee kicks in.
A one-second turnaround is pretty consistent on this side of the pond if you're sitting on top of a major FiOS cable.
Sat Apr 7 09:23:42 2018 | SETI@home | Sending scheduler request: To report completed tasks.
Sat Apr 7 09:23:42 2018 | SETI@home | Reporting 5 completed tasks
Sat Apr 7 09:23:42 2018 | SETI@home | Requesting new tasks for NVIDIA GPU
Sat Apr 7 09:23:42 2018 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Sat Apr 7 09:23:42 2018 | SETI@home | [sched_op] NVIDIA GPU work request: 265812.04 seconds; 0.00 devices
Sat Apr 7 09:23:43 2018 | SETI@home | Scheduler request completed: got 5 new tasks
Sat Apr 7 09:23:43 2018 | SETI@home | [sched_op] estimated total CPU task duration: 0 seconds
Sat Apr 7 09:23:43 2018 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 1193 seconds
Sat Apr 7 09:23:43 2018 | SETI@home | [sched_op] Deferring communication for 00:05:03
Sat Apr 7 09:23:43 2018 | SETI@home | [sched_op] Reason: requested by project
Right now it's working as expected.
I think the last time I looked at the scheduler code I came to the conclusion that it really didn't think it had any tasks to send. Hasn't there been recent talk of problems with the database? Something about not being able to assign new tasks? So it couldn't send any new work? There's also the Arecibo VLAR problem. Has anyone considered allowing those VLARs on Anonymous platform so the faster machines can run them with the Special App? It might work if Arecibo VLARs could be enabled on Anonymous platform, maybe even restrict them to CUDA 6.0 and above, if that were possible.

Oh, look at this other machine. It has the ATI card set to run APs only, and is currently out of work. Yet the machine is still getting the reached a limit message. So, it would appear the reached a limit message only applies to one particular App.

Sat Apr 7 10:24:22 2018 | SETI@home | [sched_op] Starting scheduler request
Sat Apr 7 10:24:22 2018 | SETI@home | Sending scheduler request: To fetch work.
Sat Apr 7 10:24:22 2018 | SETI@home | Requesting new tasks for NVIDIA GPU and AMD/ATI GPU
Sat Apr 7 10:24:22 2018 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Sat Apr 7 10:24:22 2018 | SETI@home | [sched_op] NVIDIA GPU work request: 444308.99 seconds; 0.00 devices
Sat Apr 7 10:24:22 2018 | SETI@home | [sched_op] AMD/ATI GPU work request: 263520.00 seconds; 1.00 devices
Sat Apr 7 10:24:24 2018 | SETI@home | Scheduler request completed: got 0 new tasks
Sat Apr 7 10:24:24 2018 | SETI@home | [sched_op] Server version 709
Sat Apr 7 10:24:24 2018 | SETI@home | No tasks sent
Sat Apr 7 10:24:24 2018 | SETI@home | No tasks are available for AstroPulse v7
Sat Apr 7 10:24:24 2018 | SETI@home | This computer has reached a limit on tasks in progress
Sat Apr 7 10:24:24 2018 | SETI@home | Project requested delay of 303 seconds
Sat Apr 7 10:24:24 2018 | SETI@home | [sched_op] Deferring communication for 00:05:03
Sat Apr 7 10:24:24 2018 | SETI@home | [sched_op] Reason: requested by project
Why did it take 2 seconds? I dunno, maybe because the other machine is just running CUDA tasks and Not APs.
ID: 1928603
Profile Bill G Special Project $75 donor
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1928631 - Posted: 7 Apr 2018, 15:45:33 UTC

Am I misreading something or is the same tape loaded twice?
Tape blc03_2bit_guppi_58185_76028_Dw1_off_0033

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours
ID: 1928631
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1928632 - Posted: 7 Apr 2018, 15:49:24 UTC - in response to Message 1928631.  
Last modified: 7 Apr 2018, 15:50:57 UTC

Am I misreading something or is the same tape loaded twice?
Tape blc03_2bit_guppi_58185_76028_Dw1_off_0033

blc01 and blc03, I think.
Or maybe 28 and 38.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1928632
Profile Bill G Special Project $75 donor
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1928636 - Posted: 7 Apr 2018, 15:55:34 UTC - in response to Message 1928632.  

Am I misreading something or is the same tape loaded twice?
Tape blc03_2bit_guppi_58185_76028_Dw1_off_0033

blc01 and blc03, I think.
Or maybe 28 and 38.

Thanks, I knew something was wrong, and it was me... blc01 and 03 certainly makes them different. (hitting head with mouse)

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours
ID: 1928636
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1928645 - Posted: 7 Apr 2018, 16:24:20 UTC - in response to Message 1928560.  

I just had the Ryzen Linux box get 8 tasks stuck in download with 5 minutes on the counter trying and clocking up. Project backoff was 3 1/2 hours. Meanwhile uploads were going through with no issues every 20 seconds.

No network issues at your end?
Just had a quick look through my Event log, and no signs of download problems over the last 7 hours.

Edit- just checked my Hosts file & it still had the usually good download server address in it. Commented it out & will see how things go now.

No, no network issues. All other machines were downloading and uploading fine. Even that machine was popping out uploads constantly while I was watching the stuck downloads. It's like the servers lost track of just those 8 tasks when it tried to get them using the normal mechanism. They would have just sat there and prevented any other downloads until I manually intervened.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1928645
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1928647 - Posted: 7 Apr 2018, 16:34:51 UTC

// and we haven't exceeded result per RPC limit


What is the definition of result per RPC limit?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1928647


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.