Panic Mode On (108) Server Problems?


Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1900596 - Posted: 11 Nov 2017, 21:18:38 UTC - in response to Message 1900572.  

I would like to study a contiguous segment of message log from that machine, with WFD active, showing resource backoff at the beginning, a task completion and upload, and the next WFD afterwards. What we do next depends on what we see there - if I see anything suspicious, I'll have a dig through the source code before writing anything on github.

If this is a bug, it's existed for 7 years without anyone noticing. Another couple of days is preferable to going off half-cocked and making fools of both of us.

I can do that. But I think I need to cause the machine to go into a backoff. Correct? Then enable the WFD and capture the entire log sequence from WFD log start to task completion and then to task request? Correct?
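(For reference: the work_fetch_debug flag can be switched on without restarting the client. A minimal log_flags section for cc_config.xml in the BOINC data directory might look like the lines below - sched_op_debug is an optional extra that also logs each scheduler request and reply:

<cc_config>
  <log_flags>
    <work_fetch_debug>1</work_fetch_debug>
    <sched_op_debug>1</sched_op_debug>
  </log_flags>
</cc_config>

Then "boinccmd --read_cc_config", or the equivalent read-config option in BOINC Manager, makes the client pick the flags up on the fly.)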

Back to an earlier part of one of your posts that I missed the first time around.
The theory is that resource backoff should be set to zero after successful task completion, and 'inc' should be set to zero after every task allocation. You're saying that the first half of that statement doesn't apply under Linux?

Yes, that seems to be the case on the Linux machine. I report successful task completions every 303 seconds, because at least one GPU task completes in each 303-second interval thanks to the special app. Normally I report at least half a dozen tasks every machine report interval. If I understand your statement, that should reset the backoff every time. It doesn't.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1900596
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14686
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1900599 - Posted: 11 Nov 2017, 21:31:55 UTC - in response to Message 1900596.  

If I understand your statement, that should reset the resource backoff every time. It doesn't.
Yes, that's the key point - provided we're talking about the right backoff, as noted.

All I need to see is WFD before (showing backoff) - task completed - WFD after (showing whatever it shows).
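A quick way to pull just that sequence out of a saved copy of the Event Log is to filter for the work-fetch lines plus the task-completion and scheduler lines. Here is a rough sketch in Python - the default file name stdoutdae.txt and the exact marker strings are assumptions, so adjust them to whatever your log actually contains:

#!/usr/bin/env python3
# Filter a saved BOINC event log down to the work-fetch story:
# WFD state dumps, task completions, uploads and scheduler replies.
import sys

MARKERS = (
    "[work_fetch]",                 # the work_fetch_debug state dumps
    "Computation for task",         # "... finished" lines mark completions
    "Finished upload of",           # result uploads
    "Scheduler request completed",  # reports and any new work granted
)

def main(path):
    with open(path, errors="replace") as log:
        for line in log:
            if any(marker in line for marker in MARKERS):
                print(line.rstrip())

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "stdoutdae.txt")

Run that over the log file and the before / completion / after picture comes out as one contiguous listing.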
ID: 1900599
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1900608 - Posted: 11 Nov 2017, 22:44:54 UTC

Well, the only way I can think of to cause a backoff is to reduce cache levels. But that would affect all machines. Right now I don't want to upset the apple cart since I am getting tasks regularly across all machines and the caches are staying topped off. I'll wait until the Linux cruncher gets into a backoff situation on its own and then capture the WFD log entries.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1900608
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14686
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1900610 - Posted: 11 Nov 2017, 22:47:00 UTC - in response to Message 1900608.  

Fair enough. Heading towards bedtime on this side of the pond, anyway - I wouldn't look at anything until tomorrow now, whatever turns up.
ID: 1900610
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1900707 - Posted: 12 Nov 2017, 13:53:35 UTC - in response to Message 1900544.  

So Richard, is there anything I can set in cc_config or logging options that can pinpoint why I keep getting larger and larger backoff intervals? What about the report_tasks_immediately flag in cc_config? Would that prevent the backoff?
You can see the backoffs using the work_fetch_debug Event Log flag, although you need your thinking head on - it's very dense and technical. I'd be more interested in doing that first, to find where the problem lies, rather than guess at potential fixes without fully understanding what's going on.

I'll try and force a WFD log with backoffs, and annotate it.

Edit - here's a simple one, with all the other projects removed.

11/11/2017 17:43:10 |  | [work_fetch] ------- start work fetch state -------
11/11/2017 17:43:10 |  | [work_fetch] target work buffer: 108000.00 + 864.00 sec
11/11/2017 17:43:10 |  | [work_fetch] --- project states ---
11/11/2017 17:43:10 | SETI@home | [work_fetch] REC 392100.423 prio -0.019 can't request work: scheduler RPC backoff (297.81 sec)
11/11/2017 17:43:10 |  | [work_fetch] --- state for CPU ---
11/11/2017 17:43:10 |  | [work_fetch] shortfall 257739.36 nidle 0.00 saturated 41144.36 busy 0.00
11/11/2017 17:43:10 | SETI@home | [work_fetch] share 0.000 blocked by project preferences
11/11/2017 17:43:10 |  | [work_fetch] --- state for NVIDIA GPU ---
11/11/2017 17:43:10 |  | [work_fetch] shortfall 60371.40 nidle 0.00 saturated 78256.81 busy 0.00
11/11/2017 17:43:10 | SETI@home | [work_fetch] share 0.000 project is backed off  (resource backoff: 552.68, inc 600.00)
11/11/2017 17:43:10 |  | [work_fetch] --- state for Intel GPU ---
11/11/2017 17:43:10 |  | [work_fetch] shortfall 87875.71 nidle 0.00 saturated 20988.29 busy 0.00
11/11/2017 17:43:10 | SETI@home | [work_fetch] share 0.000 blocked by project preferences
11/11/2017 17:43:10 |  | [work_fetch] ------- end work fetch state -------
Took a while to force it, because every request got work, and I was in the middle of a batch of shorties which reset the backoffs as quickly as I could fetch work.

So, the data lines, in the order they appear.

target work buffer - what you ask for. 1.25 days plus 0.01 days, in this case. No work request unless you're below the sum of these two.
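(Checking the arithmetic: 1.25 days × 86,400 s/day = 108,000 s and 0.01 days × 86,400 s/day = 864 s, which matches the "target work buffer: 108000.00 + 864.00 sec" line in the log above.)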

Project state - still early in the 5:03 server backoff. Won't ask by itself (and no point in pressing 'update') until this is zero.

No CPU (or iGPU) requests for SETI on this machine - my preference.

state for NVIDIA GPU - the one we're interested in. Showing a shortfall, so it would fetch work if it could. But it's in resource backoff, because I've reached a quota limit in this case - the same would show for 'no tasks available'.

The two figures showing for backoff are:

First - the current 'how long to wait' - will count down by 60 seconds every minute.
Second (inc) - the current baseline for the backoff. Will double at each consecutive failure to get work until it reaches (I think) 4 hours / 14,400 seconds. The actual backoff will be set to a random number of roughly the same magnitude as 'inc', so the machines don't get into lockstep.
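To make those two numbers concrete, here is a minimal sketch of the behaviour as just described - plain Python rather than the client's actual C++, with the 600-second starting value and the 14,400-second cap taken from this post, and everything else (the jitter range, the names) assumed purely for illustration. It also encodes the 'theory' restated just below: a completed task clears the backoff, a work allocation clears 'inc'.

import random

MAX_INC = 14400.0  # the ~4 hour cap mentioned above

class ResourceBackoff:
    """Illustrative model of one resource's work-fetch backoff."""

    def __init__(self):
        self.inc = 0.0      # baseline; doubles on each consecutive failure
        self.backoff = 0.0  # seconds to wait before asking again

    def fetch_failed(self):
        # 'no tasks available', quota reached, and so on
        self.inc = 600.0 if self.inc == 0 else min(self.inc * 2, MAX_INC)
        # randomised so a fleet of hosts doesn't retry in lockstep;
        # 0.5-1.0 x inc is an assumed range ("roughly the same magnitude")
        self.backoff = random.uniform(0.5, 1.0) * self.inc

    def work_allocated(self):
        # "'inc' should be set to zero after every task allocation"
        self.inc = 0.0
        self.backoff = 0.0

    def task_completed(self):
        # "resource backoff should be set to zero after successful task
        # completion" - the step Keith reports not happening on Linux
        self.backoff = 0.0

If that last reset never takes effect on the Linux host, the WFD dumps would keep showing a non-zero resource backoff even though tasks are being reported every cycle - which is exactly what a before/after log capture should confirm or rule out.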

The theory is that resource backoff should be set to zero after successful task completion, and 'inc' should be set to zero after every task allocation. You're saying that the first half of that statement doesn't apply under Linux?
This is interesting. I tried it with my 3 Ubuntu machines, all running BOINC 7.8.3. I found it very difficult to get the machines to go into backoff, as they usually finished a task within 5 minutes and always reported a completed task, which reset the counter. The one instance where there wasn't a completed task at the end of the 5 minutes, it waited until there was a completed task before reporting, which took only about a minute. During that minute it just sat there without a counter in the Projects tab, and reported after the next task completed, which reset the counter. My three machines have never been touched by BOINC Tasks, and have never had the problem Keith had with only downloading about 20 tasks at a time. I believe that problem was found to be caused by settings he made in BOINC Tasks. If he is having another problem with Work Fetch on the same machine, my first suspect would be BOINC Tasks, as it was already found to cause problems with Work Fetch previously.
ID: 1900707
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14686
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1900725 - Posted: 12 Nov 2017, 16:14:24 UTC - in response to Message 1900707.  

This is interesting. I tried it with my 3 Ubuntu machines, all running BOINC 7.8.3. I found it very difficult to get the machines to go into backoff, as they usually finished a task within 5 minutes and always reported a completed task, which reset the counter. The one instance where there wasn't a completed task at the end of the 5 minutes, it waited until there was a completed task before reporting, which took only about a minute. During that minute it just sat there without a counter in the Projects tab, and reported after the next task completed, which reset the counter. My three machines have never been touched by BOINC Tasks, and have never had the problem Keith had with only downloading about 20 tasks at a time. I believe that problem was found to be caused by settings he made in BOINC Tasks. If he is having another problem with Work Fetch on the same machine, my first suspect would be BOINC Tasks, as it was already found to cause problems with Work Fetch previously.
Well, what you see in the Event Log is written by the BOINC Client, and nothing else. What's written into the Event Log is what BOINC is using, what it's acting on.

There are many and various ways of giving operational settings to the BOINC Client - project web sites, via BOINC Manager, via Boinc Tasks, via an account manager like BAM!, or by directly editing the appropriate XML file. And probably more that I haven't thought of.

How they get there doesn't matter in the slightest (although it's helpful to the user's sanity if they pick one way that's convenient for them, and stick with it. Otherwise trouble-shooting tends to frazzle the brain. He says, with feeling and bitter experience.)

But however you set them, they end up in the same place, and they get written to the event log. That's why the log is so useful, and why - if you like tinkering - it repays the effort of learning to decipher what it's trying to tell you.
ID: 1900725
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1900736 - Posted: 12 Nov 2017, 17:36:21 UTC - in response to Message 1900707.  


This is interesting. I tried it with my 3 Ubuntu machines, all running BOINC 7.8.3. I found it very difficult to get the machines to go into backoff, as they usually finished a task within 5 minutes and always reported a completed task, which reset the counter. The one instance where there wasn't a completed task at the end of the 5 minutes, it waited until there was a completed task before reporting, which took only about a minute. During that minute it just sat there without a counter in the Projects tab, and reported after the next task completed, which reset the counter. My three machines have never been touched by BOINC Tasks, and have never had the problem Keith had with only downloading about 20 tasks at a time. I believe that problem was found to be caused by settings he made in BOINC Tasks. If he is having another problem with Work Fetch on the same machine, my first suspect would be BOINC Tasks, as it was already found to cause problems with Work Fetch previously.

Well, that is an interesting comment. I wouldn't be at all surprised that BoincTasks is involved, now that you mention it. I wonder if BT has something to do with the machines all getting synched up with respect to their reporting timing as well. After all, only BT touches all machines at all times, other than the SETI servers.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1900736
Stephen "Heretic" Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1900789 - Posted: 12 Nov 2017, 22:07:20 UTC - in response to Message 1900707.  

This is interesting. I tried it with my 3 Ubuntu machines, all running BOINC 7.8.3. I found it very difficult to get the machines to go into backoff, as they usually finished a task within 5 minutes and always reported a completed task, which reset the counter. The one instance where there wasn't a completed task at the end of the 5 minutes, it waited until there was a completed task before reporting, which took only about a minute. During that minute it just sat there without a counter in the Projects tab, and reported after the next task completed, which reset the counter. My three machines have never been touched by BOINC Tasks, and have never had the problem Keith had with only downloading about 20 tasks at a time. I believe that problem was found to be caused by settings he made in BOINC Tasks. If he is having another problem with Work Fetch on the same machine, my first suspect would be BOINC Tasks, as it was already found to cause problems with Work Fetch previously.


. . FWIW, my Linux machines behave mostly as you describe, except for the limited-downloads issue. The fast rig seems to be limited to about 100 tasks with each request (give or take a dozen) when they are needed and available. The C2D unit had a problem with the max downloads getting smaller and smaller, but I discovered that was due to shrinking free space on the flash drive it is running on. However, since getting that space problem under control, it is still limited to about 30 tasks per download no matter how empty it is. But, having only one GPU, its cache is limited to 100 tasks, so perhaps there is a relationship there - 30/100 compared to 100/300??

Stephen

??
ID: 1900789
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1900797 - Posted: 13 Nov 2017, 0:02:24 UTC - in response to Message 1900789.  

This is interesting. I tried it with my 3 Ubuntu machines, all running BOINC 7.8.3. I found it very difficult to get the machines to go into backoff, as they usually finished a task within 5 minutes and always reported a completed task, which reset the counter. The one instance where there wasn't a completed task at the end of the 5 minutes, it waited until there was a completed task before reporting, which took only about a minute. During that minute it just sat there without a counter in the Projects tab, and reported after the next task completed, which reset the counter. My three machines have never been touched by BOINC Tasks, and have never had the problem Keith had with only downloading about 20 tasks at a time. I believe that problem was found to be caused by settings he made in BOINC Tasks. If he is having another problem with Work Fetch on the same machine, my first suspect would be BOINC Tasks, as it was already found to cause problems with Work Fetch previously.


. . FWIW, my Linux machines behave mostly as you describe, except for the limited-downloads issue. The fast rig seems to be limited to about 100 tasks with each request (give or take a dozen) when they are needed and available. The C2D unit had a problem with the max downloads getting smaller and smaller, but I discovered that was due to shrinking free space on the flash drive it is running on. However, since getting that space problem under control, it is still limited to about 30 tasks per download no matter how empty it is. But, having only one GPU, its cache is limited to 100 tasks, so perhaps there is a relationship there - 30/100 compared to 100/300??

Stephen

??

I solved the limited downloads problem back during the summer sometime, if I remember correctly. I have seen as many as 198 tasks downloaded when a machine is empty - I think I remember someone saying that 200 was the buffer limit - so there are no restrictions on downloads anymore. I need to get into the backoff situation on the Linux machine so I can capture a work_fetch_debug log for Richard. So far this weekend, all machines are keeping topped off. I did have to kick the server on Friday night, when I was way down on work on the Linux machine, after (I think) Eric fixed the server problem. But no other issues so far on any machine. Keeping my fingers and toes crossed that I haven't now jinxed myself.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1900797
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13882
Credit: 208,696,464
RAC: 304
Australia
Message 1900799 - Posted: 13 Nov 2017, 0:27:01 UTC - in response to Message 1900797.  

So far this weekend, all machines are keeping topped off. I did have to kick the server on Friday night, when I was way down on work on the Linux machine, after (I think) Eric fixed the server problem.

Likewise.
The application issue reared its head a couple of times for me as well, but TBar's triple update got things going again.
Grant
Darwin NT
ID: 1900799
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13882
Credit: 208,696,464
RAC: 304
Australia
Message 1901016 - Posted: 14 Nov 2017, 6:28:24 UTC - in response to Message 1899489.  

I often see a drop out somewhere around 06:00 UTC - I guess that might be when the data dump for the third party sites is done.

I'll have to keep an eye on the time in future.
Looking at the Haveland graphs, it was around 07:00 on them - the database master queries-per-second took a dive around the time the website & scheduler went AWOL. It probably lasted about 10 minutes.

Both Web site & Scheduler MIA again for a while there, and around the same time as previously.
Grant
Darwin NT
ID: 1901016
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13882
Credit: 208,696,464
RAC: 304
Australia
Message 1901053 - Posted: 14 Nov 2017, 22:50:56 UTC

Well, that would have to be one of the best after-outage recoveries yet. The first report after the outage also got work, and I have picked up new work with each scheduler contact since (so far). Already hit the server-side limits.
Whatever caused that last work shortage issue, Eric's efforts appear to have well & truly sorted it out.
Grant
Darwin NT
ID: 1901053
Profile Brent Norman Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1901065 - Posted: 14 Nov 2017, 23:23:21 UTC - in response to Message 1901053.  

Yeah, that went smooth for me too, but I was watching for it to come up and got in early. Reported and loaded in 3 requests.
ID: 1901065
Stephen "Heretic" Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1901072 - Posted: 15 Nov 2017, 0:24:04 UTC - in response to Message 1901065.  

Yeah, that went smooth for me too, but I was watching for it to come up and got in early. Reported and loaded in 3 requests.


. . The outage was a little shorter than the norm, and I got work right away here too. Seems the cobwebs have been removed :)

Stephen

:)
ID: 1901072
rob smith Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22658
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1901132 - Posted: 15 Nov 2017, 7:12:03 UTC

Eric obviously oiled the servers' hinges, things are a lot smoother than they have been for weeks :-)
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1901132
Profile Brent Norman Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1901162 - Posted: 15 Nov 2017, 11:36:32 UTC - in response to Message 1901132.  

Methinks Eric might have to look again - it appears the servers fell on their faces an hour ago ... :(
ID: 1901162
Profile Brent Norman Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1901177 - Posted: 15 Nov 2017, 14:07:49 UTC

I just fired off a mail to Eric Re: empty cache ...
ID: 1901177
Profile Sid
Volunteer tester

Joined: 12 Jun 07
Posts: 16
Credit: 10,968,872
RAC: 0
United States
Message 1901185 - Posted: 15 Nov 2017, 15:07:43 UTC

Glad that I went to 1 day cache. . . .
ID: 1901185
kittyman Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51505
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1901191 - Posted: 15 Nov 2017, 16:11:34 UTC
Last modified: 15 Nov 2017, 16:20:16 UTC

Oh meow. Last work I got was almost an hour ago.
Meowsigh.

EDIT.....
And, as usual, just 'bout the time I posted that, my best rig got a burst of 37 tasks.
Dunno if that means it is fixed, or just the luck of the draw.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1901191
Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 1901197 - Posted: 15 Nov 2017, 16:26:47 UTC

"Results ready to send: 28"
"Current result creation rate: 1.2/sec"

...someone needs to give one or more of the splitters a kick - they should be up at 20 or 30 results created per second when the queue is that low...
.

Hello, from Albany, CA!...
ID: 1901197