Panic Mode On (108) Server Problems?

Stephen "Heretic" (Volunteer tester, Australia)
Message 1900472 - Posted: 11 Nov 2017, 5:11:44 UTC - in response to Message 1900449.

And just as I typed that ... I see 120 tasks coming my way, of course on my slowest computer...
EDIT: Logjam clearing??? 80 on another computer ..

Still nothing on either of mine.


. . I guess we need to wait for the trickle down ....

Stephen

<sigh>

Stephen "Heretic" (Volunteer tester, Australia)
Message 1900473 - Posted: 11 Nov 2017, 5:17:27 UTC - in response to Message 1900455.

One thing noticeably missing from manual updates is ... last request too recent ... so the database is either not recording task requests, or not reading them ...


. . That is a concern for me as well, surely a sign something is not doing its job when premature d/l requests are not even acknowledged as such. I will feel better when I begin to see that response again :)

Stephen

:(

Stephen "Heretic" (Volunteer tester, Australia)
Message 1900476 - Posted: 11 Nov 2017, 5:20:25 UTC - in response to Message 1900453.

Ditto that for my 2 biggest computers - nothing but popcorn farts ...

Typical. I should have posted sooner - 5 min after posting, I scored some work.
All shorties, one of them noisy to boot.
Should be out of GPU work in 30 min, another 45 for the CPU. The C2D is so slow it'll probably have GPU work till tomorrow & the CPU for a day or 2 after that.


. . I wonder what the trigger word is that kicks off that eerie achievement ... still nothing coming my way ...

. . I just fired this rig back up and it is about to upload its last task ... If that gets nothing then back to sleep it goes ....

Stephen

:(

Wiggo "Socialist" (Australia)
Message 1900478 - Posted: 11 Nov 2017, 5:27:19 UTC

. . I wonder what the trigger word is that kicks off that eerie achievement ...

ATM it's hitting the servers while something is in the feeder. ;-)

Cheers.

Stephen "Heretic" (Volunteer tester, Australia)
Message 1900479 - Posted: 11 Nov 2017, 5:31:44 UTC - in response to Message 1900478.

. . I wonder what the trigger word is that kicks off that eerie achievement ...

ATM it's hitting the servers while something is in the feeder. ;-)

Cheers.


. . Well clearly I cannot manage that because I am completely out of work, back to sleep for my rigs until tonight ... :(

Stephen

:(

Brent Norman (Volunteer tester, Canada)
Message 1900481 - Posted: 11 Nov 2017, 5:50:16 UTC - in response to Message 1900479.

Well you can't get work if they aren't running, and they use a LOT less power when empty, LOL.

Keith Myers (Volunteer tester, United States)
Message 1900483 - Posted: 11 Nov 2017, 6:08:21 UTC

Well I've had some success getting work on my slowest crunchers, but the Linux box and the Ryzen 1700X are still unable to get work. The Win10 box is doing lots of MW and Einstein in the meantime.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

Grant (SSSF) (Volunteer tester, Australia)
Message 1900484 - Posted: 11 Nov 2017, 6:13:54 UTC

CPU is cold, but GPU somehow keeps picking up dribs & drabs every now & then.
And better still the last bunch weren't all shorties.
Grant
Darwin NT

Brent Norman (Volunteer tester, Canada)
Message 1900486 - Posted: 11 Nov 2017, 6:21:11 UTC
Last modified: 11 Nov 2017, 6:22:48 UTC

Ditto there Keith, my 1080 has been dry since this afternoon other than a few dribbles that rarely last until the next request - and lately nothing at all. And it looks like my next biggest computer will be joining it again.
EDIT: just when I hit enter ... 8 tasks, 10.5m estimate, divided by 3 = 3.5 LOL

Grant (SSSF) (Volunteer tester, Australia)
Message 1900507 - Posted: 11 Nov 2017, 10:09:12 UTC

Well, things are still broken, but they're not as broken as they were.
Every now and then my systems pick up a bit of work.

And the work-in-progress has stopped falling, and even moved back up, a bit.
But now it looks like there are issues with the splitters as well. Before, as the ready-to-send buffer emptied out, they would fire up & top it off. Now the ready-to-send buffer is emptying, and the splitters just aren't cranking up to meet the demand.
So as well as difficulty in getting work when it's available, in another 6-8 hours there won't even be much work available to send anyway.
Grant
Darwin NT

Stephen "Heretic" (Volunteer tester, Australia)
Message 1900511 - Posted: 11 Nov 2017, 12:06:12 UTC - in response to Message 1900481.

Well you can't get work if they aren't running, and they use a LOT less power when empty, LOL.


. . and even less again when off .. :)

. . I am willing to live with the pain when they are processing work but not when they are doing nothing ...

. . It seems there is no joy in Smurfville tonight, so I guess they will have the entire weekend off :(

Stephen

:(

Iona (United Kingdom)
Message 1900512 - Posted: 11 Nov 2017, 12:33:21 UTC

Don't worry about it, Stephen, you are not missing much! I may still be running and have a reasonable amount of work, but now I can't report the tasks! Uploads are fine, but that's as far as it goes.
Don't take life too seriously, as you'll never come out of it alive!

Mike (Volunteer tester, Germany)
Message 1900516 - Posted: 11 Nov 2017, 13:31:42 UTC

Uploads are working fine here; downloads only come once in a while.

11.11.2017 13:18:45 | SETI@home | [sched_op] CPU work request: 3878897.73 seconds; 0.00 devices
11.11.2017 13:18:45 | SETI@home | [sched_op] AMD/ATI GPU work request: 651031.34 seconds; 0.00 devices
11.11.2017 13:18:47 | SETI@home | Scheduler request completed: got 7 new tasks
11.11.2017 13:18:47 | SETI@home | [sched_op] Server version 707
11.11.2017 13:18:47 | SETI@home | Project requested delay of 303 seconds
11.11.2017 13:18:47 | SETI@home | [sched_op] estimated total CPU task duration: 18009 seconds
11.11.2017 13:18:47 | SETI@home | [sched_op] estimated total AMD/ATI GPU task duration: 1309 seconds
11.11.2017 13:18:47 | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task 14fe07aa.4256.10706.5.32.128_0
11.11.2017 13:18:47 | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task 14fe07aa.24680.890.9.36.104_0
11.11.2017 13:18:47 | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task 11mr07ac.30053.72.6.33.107_0
11.11.2017 13:18:47 | SETI@home | [sched_op] Deferring communication for 00:05:03
11.11.2017 13:18:47 | SETI@home | [sched_op] Reason: requested by project
11.11.2017 13:18:49 | SETI@home | Started download of 14fe07aa.16643.11115.12.39.126
11.11.2017 13:18:49 | SETI@home | Started download of 11ja07ab.2450.8252.3.30.249
11.11.2017 13:18:53 | SETI@home | Finished download of 14fe07aa.16643.11115.12.39.126
11.11.2017 13:18:53 | SETI@home | Finished download of 11ja07ab.2450.8252.3.30.249
11.11.2017 13:18:53 | SETI@home | Started download of 04ja07ab.16675.10706.9.36.114.vlar
11.11.2017 13:18:53 | SETI@home | Started download of 04ja07ab.16675.10706.9.36.107.vlar
11.11.2017 13:18:57 | SETI@home | Finished download of 04ja07ab.16675.10706.9.36.114.vlar
11.11.2017 13:18:57 | SETI@home | Finished download of 04ja07ab.16675.10706.9.36.107.vlar
11.11.2017 13:18:57 | SETI@home | Started download of 04ja07ab.16675.10706.9.36.104.vlar
11.11.2017 13:18:57 | SETI@home | Started download of 04ja07ab.16675.10706.9.36.103.vlar
11.11.2017 13:19:00 | SETI@home | Finished download of 04ja07ab.16675.10706.9.36.103.vlar
11.11.2017 13:19:00 | SETI@home | Started download of 04ja07ab.16675.10706.9.36.109.vlar
11.11.2017 13:19:02 | SETI@home | Finished download of 04ja07ab.16675.10706.9.36.104.vlar
11.11.2017 13:19:03 | SETI@home | Finished download of 04ja07ab.16675.10706.9.36.109.vlar
11.11.2017 13:20:27 | SETI@home | Computation for task 08ja07ad.4244.10706.10.37.135_0 finished
11.11.2017 13:20:27 | SETI@home | Starting task 08ja07ad.4244.10706.10.37.26_1
11.11.2017 13:20:28 | SETI@home | Started upload of 08ja07ad.4244.10706.10.37.135_0_r1032251672_0
11.11.2017 13:20:32 | SETI@home | Finished upload of 08ja07ad.4244.10706.10.37.135_0_r1032251672_0
11.11.2017 13:23:53 | SETI@home | [sched_op] Starting scheduler request
11.11.2017 13:23:53 | SETI@home | Sending scheduler request: To fetch work.
11.11.2017 13:23:53 | SETI@home | Reporting 1 completed tasks
11.11.2017 13:23:53 | SETI@home | Requesting new tasks for CPU and AMD/ATI GPU
11.11.2017 13:23:53 | SETI@home | [sched_op] CPU work request: 3862504.63 seconds; 0.00 devices
11.11.2017 13:23:53 | SETI@home | [sched_op] AMD/ATI GPU work request: 649975.19 seconds; 0.00 devices
11.11.2017 13:23:55 | SETI@home | Scheduler request completed: got 0 new tasks
11.11.2017 13:23:55 | SETI@home | [sched_op] Server version 707
11.11.2017 13:23:55 | SETI@home | Project has no tasks available
11.11.2017 13:23:55 | SETI@home | Project requested delay of 303 seconds
11.11.2017 13:23:55 | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task 08ja07ad.4244.10706.10.37.135_0
11.11.2017 13:23:55 | SETI@home | [sched_op] Deferring communication for 00:05:03
11.11.2017 13:23:55 | SETI@home | [sched_op] Reason: requested by project
11.11.2017 13:25:12 | SETI@home | Computation for task 08ja07ad.4244.10706.10.37.26_1 finished
11.11.2017 13:25:12 | SETI@home | Starting task 08ja07ad.4244.10706.10.37.22_1
11.11.2017 13:25:14 | SETI@home | Started upload of 08ja07ad.4244.10706.10.37.26_1_r1682938500_0
11.11.2017 13:25:17 | SETI@home | Finished upload of 08ja07ad.4244.10706.10.37.26_1_r1682938500_0
11.11.2017 13:25:26 | SETI@home | Computation for task 08ja07ad.4244.10706.10.37.22_1 finished
11.11.2017 13:25:26 | SETI@home | Starting task 14fe07aa.4256.10706.5.32.119_1
11.11.2017 13:25:29 | SETI@home | Started upload of 08ja07ad.4244.10706.10.37.22_1_r1149134663_0
11.11.2017 13:25:32 | SETI@home | Finished upload of 08ja07ad.4244.10706.10.37.22_1_r1149134663_0
11.11.2017 13:29:01 | SETI@home | [sched_op] Starting scheduler request
11.11.2017 13:29:01 | SETI@home | Sending scheduler request: To fetch work.
11.11.2017 13:29:01 | SETI@home | Reporting 2 completed tasks
11.11.2017 13:29:01 | SETI@home | Requesting new tasks for CPU and AMD/ATI GPU
11.11.2017 13:29:01 | SETI@home | [sched_op] CPU work request: 3864762.18 seconds; 0.00 devices
11.11.2017 13:29:01 | SETI@home | [sched_op] AMD/ATI GPU work request: 650496.87 seconds; 0.00 devices
11.11.2017 13:29:03 | SETI@home | Scheduler request completed: got 0 new tasks
11.11.2017 13:29:03 | SETI@home | [sched_op] Server version 707
11.11.2017 13:29:03 | SETI@home | Project has no tasks available
11.11.2017 13:29:03 | SETI@home | Project requested delay of 303 seconds
11.11.2017 13:29:03 | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task 08ja07ad.4244.10706.10.37.26_1
11.11.2017 13:29:03 | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task 08ja07ad.4244.10706.10.37.22_1
11.11.2017 13:29:03 | SETI@home | [sched_op] Deferring communication for 00:05:03
11.11.2017 13:29:03 | SETI@home | [sched_op] Reason: requested by project
With each crime and every kindness we birth our future.

kittyman (Volunteer tester, United States)
Message 1900518 - Posted: 11 Nov 2017, 13:40:07 UTC

Something just broke loose.
Got a boatload of work on all crunchers in the last 20 minutes or so.

Meow!!!
A kitty keeps loneliness away.
More meowing, less hissing. I speak meow, do you?

Have made friends in this life.
Most were cats.

Keith Myers (Volunteer tester, United States)
Message 1900532 - Posted: 11 Nov 2017, 15:54:35 UTC

My Windows machines seem to have received work during the night and are close to full. But the Linux machine is still way down on work, with about a quarter of what it is supposed to have. I think part of the issue is that at each 'no tasks available' message, it just keeps incrementing the Nvidia GPU backoff interval.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

Bernie Vine (Volunteer moderator, Volunteer tester, United Kingdom)
Message 1900537 - Posted: 11 Nov 2017, 16:33:26 UTC

Well I got this

13755 SETI@home 11/11/2017 3:52:48 PM Scheduler request completed: got 71 new tasks

On my main machine and caches are now full

2nd machine has been downloading in 1's and 2's and also now has a full cache.

Richard Haselgrove (Volunteer tester, United Kingdom)
Message 1900538 - Posted: 11 Nov 2017, 16:34:17 UTC - in response to Message 1900532.

My Windows machines seem to have received work during the night and are close to full. But the Linux machine is still way down on work, with about a quarter of what it is supposed to have. I think part of the issue is that at each 'no tasks available' message, it just keeps incrementing the Nvidia GPU backoff interval.
Interesting point. But I think we worked very hard on tweaking those backoffs, and as far as I know, the current one is still the compromise we reached at v6.11.8:

client: fix bug that cause wasted scheduler RPC.

Old: when a job finished, we cleared the backoffs for the resources it used. The idea was to get more jobs immediately in the case where the client was at a jobs-in-progress limit.
Problem: this resulted in an RPC immediately, typically before the output files were uploaded. So the client is still at the limit, and doesn't get jobs.

New: clear the backoffs at the point when output files have been uploaded and the job is ready to report.
So, if you have any NVidia tasks left, every time one of them completes (successfully), you should upload the result file and then do a scheduler contact. And every scheduler contact should combine reporting results (if any) with requesting new work (if needed).
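
As an illustration only, here is a minimal C++ sketch of that sequencing; the names (Task, on_computation_finished_old, on_upload_finished_new, resource_backoff) are invented for this post, not the client's real identifiers:

// Hypothetical sketch of the v6.11.8 change quoted above.
struct Task {
    bool computed = false;
    bool uploaded = false;   // output files on the server?
};

// Old: backoff cleared as soon as computation finished. The immediate
// RPC then fired before the upload, so the host was still at its
// jobs-in-progress limit and got no work - a wasted scheduler contact.
void on_computation_finished_old(Task& t, double& resource_backoff) {
    t.computed = true;
    resource_backoff = 0;    // too early
}

// New: backoff cleared only once the output files are uploaded and the
// task is ready to report, so one scheduler contact can report the
// result (freeing the in-progress slot) and request work together.
void on_upload_finished_new(Task& t, double& resource_backoff) {
    t.uploaded = true;
    resource_backoff = 0;    // next RPC reports and fetches in one go
}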

That's the way I've always observed my Windows machines to work, and from the sound of it your Windows machines do the same. So, why should your Linux machine behave differently? It describes itself as v7.8.3, the same as the Windows machines, so it should be the same codebase and the same behaviour.

Could there be anything odd about the version numbering for your Linux build, or are we looking for a misplaced 'Windows only' wrapper round that 2010 tweak?

Keith Myers (Volunteer tester, United States)
Message 1900541 - Posted: 11 Nov 2017, 16:48:50 UTC - in response to Message 1900538.

So Richard, is there anything I can set in cc_config or logging options that can pinpoint why I keep getting larger and larger backoff intervals? What about the report_results_immediately flag in cc_config? Would that prevent the backoff?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

Jeff Buck (Volunteer tester, United States)
Message 1900543 - Posted: 11 Nov 2017, 17:06:29 UTC
Last modified: 11 Nov 2017, 17:08:29 UTC

I had my #1 and #2 machines shut down all night, but my #3 machine was left running (w/ a backup project) and by this morning had gradually filled the queue. So, about 45 minutes ago, I fired up my #1 machine. It got 3 tasks, all Arecibo VLARs, adding to the 16 of same that it already had. Nothing for the GPUs. After half a dozen non-productive scheduler requests, I finally decided to reschedule all non-running Arecibo VLARs to the GPUs, just to give them something to do. When BOINC restarted, the first scheduler request immediately snagged 111 new tasks. Go figure!

EDIT: And it looks like the second request got 135 more.

Richard Haselgrove (Volunteer tester, United Kingdom)
Message 1900544 - Posted: 11 Nov 2017, 17:15:27 UTC - in response to Message 1900541.
Last modified: 11 Nov 2017, 18:04:35 UTC

So Richard, is there anything I can set in cc_config or logging options that can pinpoint why I keep getting larger and larger backoff intervals? What about the report_results_immediately flag in cc_config? Would that prevent the backoff?
You can see the backoffs using the work_fetch_debug Event Log flag, although you need your thinking head on - it's very dense and technical. I'd be more interested in doing that first, to find where the problem lies, rather than guessing at potential fixes without fully understanding what's going on.
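
For reference, a minimal cc_config.xml that turns on the relevant Event Log flags (work_fetch_debug and sched_op_debug are standard BOINC log flags; reload it with 'Read config files' or restart the client):

<cc_config>
  <log_flags>
    <sched_op_debug>1</sched_op_debug>      <!-- scheduler requests and replies -->
    <work_fetch_debug>1</work_fetch_debug>  <!-- backoff and shortfall detail -->
  </log_flags>
</cc_config>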

I'll try and force a WFD log with backoffs, and annotate it.

Edit - here's a simple one, with all the other projects removed.

11/11/2017 17:43:10 |  | [work_fetch] ------- start work fetch state -------
11/11/2017 17:43:10 |  | [work_fetch] target work buffer: 108000.00 + 864.00 sec
11/11/2017 17:43:10 |  | [work_fetch] --- project states ---
11/11/2017 17:43:10 | SETI@home | [work_fetch] REC 392100.423 prio -0.019 can't request work: scheduler RPC backoff (297.81 sec)
11/11/2017 17:43:10 |  | [work_fetch] --- state for CPU ---
11/11/2017 17:43:10 |  | [work_fetch] shortfall 257739.36 nidle 0.00 saturated 41144.36 busy 0.00
11/11/2017 17:43:10 | SETI@home | [work_fetch] share 0.000 blocked by project preferences
11/11/2017 17:43:10 |  | [work_fetch] --- state for NVIDIA GPU ---
11/11/2017 17:43:10 |  | [work_fetch] shortfall 60371.40 nidle 0.00 saturated 78256.81 busy 0.00
11/11/2017 17:43:10 | SETI@home | [work_fetch] share 0.000 project is backed off  (resource backoff: 552.68, inc 600.00)
11/11/2017 17:43:10 |  | [work_fetch] --- state for Intel GPU ---
11/11/2017 17:43:10 |  | [work_fetch] shortfall 87875.71 nidle 0.00 saturated 20988.29 busy 0.00
11/11/2017 17:43:10 | SETI@home | [work_fetch] share 0.000 blocked by project preferences
11/11/2017 17:43:10 |  | [work_fetch] ------- end work fetch state -------
Took a while to force it, because every request got work, and I was in the middle of a batch of shorties which reset the backoffs as quickly as I could fetch work.

So, the data lines, in the order they appear:

target work buffer - what you ask for. 1.25 days plus 0.01 days, in this case. No work request unless you're below the sum of these two.

Project state - still early in the 5:03 server backoff. Won't ask by itself (and no point in pressing 'update') until this is zero.

No CPU (or iGPU) requests for SETI on this machine - my preference.

state for NVIDIA GPU - the one we're interested in. Showing a shortfall, so it would fetch work if it could. But it's in resource backoff, because I've reached a quota limit in this case - the same would show for 'no tasks available'.

The two figures showing for backoff are:

First - the current 'how long to wait' - will count down by 60 seconds every minute.
Second (inc) - the current baseline for the backoff. Will double at each consecutive failure to get work until it reaches (I think) 4 hours / 14,400 seconds. The actual backoff will be set to a random number of roughly the same magnitude as 'inc', so the machines don't get into lockstep.

The theory is that resource backoff should be set to zero after successful task completion, and 'inc' should be set to zero after every task allocation. You're saying that the first half of that statement doesn't apply under Linux?
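
To make that policy concrete, a minimal C++ sketch of the backoff behaviour as described above; the class name, the 60-second starting baseline, and the half-to-full jitter range are assumptions - only the doubling, the ~4-hour cap, the randomisation to avoid lockstep, and the two reset rules come from the posts:

#include <algorithm>
#include <random>

// Hypothetical model of the per-resource work-fetch backoff.
class ResourceBackoff {
    double inc = 0.0;       // baseline ('inc'); doubles on each failed fetch
    double backoff = 0.0;   // seconds to wait before asking again
    std::mt19937 rng{std::random_device{}()};
public:
    // A work request for this resource got no tasks (or hit a quota limit).
    void on_fetch_failed() {
        inc = (inc == 0.0) ? 60.0 : std::min(inc * 2.0, 14400.0);
        // Randomise around 'inc' so hosts don't retry in lockstep.
        std::uniform_real_distribution<double> pick(inc * 0.5, inc);
        backoff = pick(rng);
    }
    // A task for this resource completed successfully: clear the wait.
    void on_task_completed() { backoff = 0.0; }
    // The scheduler actually allocated tasks: clear the baseline too.
    void on_tasks_received() { inc = 0.0; backoff = 0.0; }
    double seconds_to_wait() const { return backoff; }
};

If the Linux client never calls the equivalent of on_task_completed(), you would see exactly the symptom Keith reports: 'inc' climbing toward the cap while completed tasks make no difference.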