Message boards :
Number crunching :
Panic Mode On (108) Server Problems?
Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
> I would like to study a contiguous segment of message log from that machine, with WFD active, showing resource backoff at the beginning, a task completion and upload, and the next WFD afterwards. What we do next depends on what we see there - if I see anything suspicious, I'll have a dig through the source code before writing anything on GitHub.

I can do that. But I think I need to cause the machine to go into a backoff, correct? Then enable the WFD and capture the entire log sequence, from WFD log start to task completion and then to task request, correct?

Back to an earlier part of one of your posts I missed comprehending.

> The theory is that resource backoff should be set to zero after successful task completion, and 'inc' should be set to zero after every task allocation.

You're saying that the first half of that statement doesn't apply under Linux? Yes, that seems to be the case on the Linux machine. I report successful task completions every 303 seconds, because there is at least one GPU task completed every 303 seconds since it uses the special app. Normally I report at least half a dozen tasks every machine report interval. If I understand your statement, that should reset the backoff every time. It doesn't.

Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours · A proud member of the OFA (Old Farts Association)
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14686 · Credit: 200,643,578 · RAC: 874
> If I understand your statement, that should reset the resource backoff every time. It doesn't.

Yes, that's the key point - provided we're talking about the right backoff, as noted. All I need to see is WFD before (showing backoff) - task completed - WFD after (showing whatever it shows).
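For readers following along, the behaviour the two posts above are debating can be sketched roughly as follows. This is an illustrative model only, emphatically not BOINC's actual source: the doubling rule and the 60-second floor / 24-hour ceiling are assumed values for illustration.

```python
# Illustrative sketch of the resource-backoff behaviour under discussion.
# NOT BOINC's source code; the doubling rule, 60 s floor and 24 h ceiling
# are assumptions for illustration only.

class ResourceBackoff:
    MIN_INTERVAL = 60       # assumed floor, in seconds
    MAX_INTERVAL = 86400    # assumed ceiling (24 h), in seconds

    def __init__(self):
        self.interval = 0   # no backoff to start with

    def on_fetch_failure(self):
        # an unsuccessful work request grows the backoff exponentially
        if self.interval == 0:
            self.interval = self.MIN_INTERVAL
        else:
            self.interval = min(self.interval * 2, self.MAX_INTERVAL)

    def on_task_completed(self):
        # the theory: a successful completion clears the resource backoff
        self.interval = 0

backoff = ResourceBackoff()
backoff.on_fetch_failure()
backoff.on_fetch_failure()
print(backoff.interval)  # 120
backoff.on_task_completed()
print(backoff.interval)  # 0
```

The complaint in this thread is that, on the Linux machine, the step corresponding to `on_task_completed()` does not appear to happen, so the interval keeps growing.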
Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
Well, the only way I can think of to cause a backoff is to reduce cache levels. But that would affect all machines. Right now I don't want to upset the apple cart, since I am getting tasks regularly across all machines and the caches are staying topped off. I'll wait until the Linux cruncher gets into a backoff situation on its own and then capture the WFD log entries.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14686 · Credit: 200,643,578 · RAC: 874
Fair enough. Heading towards bedtime on this side of the pond, anyway - I wouldn't look at anything until tomorrow now, whatever turns up.
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
This is interesting. I tried it with my 3 Ubuntu machines, all running BOINC 7.8.3. I found it very difficult to make the machines go into backoff, as they usually finished a task within 5 minutes and always reported a completed task, which reset the counter. The one instance it didn't have a completed task at the end of the 5 minutes, it waited until there was a completed task before reporting, which was only about a minute. During that minute it just sat there without a counter in the Projects tab, and reported after the next task completed, which reset the counter. My three machines have never been touched by BOINC Tasks, and have never had the problem Keith had with only downloading about 20 tasks at a time. I believe that problem Keith had was found to be caused by settings he made in BOINC Tasks. If he is having another problem with Work Fetch on the same machine, my first suspect would be BOINC Tasks, as it was already found to cause problems with Work Fetch previously.

> So Richard, is there anything I can set in cc_config or logging options that can pinpoint why I keep getting larger and larger backoff intervals? What about the report_tasks_immediately flag in cc_config? Would that prevent the backoff?

You can see the backoffs using the work_fetch_debug Event Log flag, although you need your thinking head on - it's very dense and technical. I'd be more interested in doing that first, to find where the problem lies, rather than guess at potential fixes without fully understanding what's going on.
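For anyone wanting to try the logging suggestion above, work_fetch_debug is switched on in the log_flags section of cc_config.xml in the BOINC data directory. This is a minimal sketch; check your client version's documentation for the options it actually supports. (The reporting flag asked about above is spelled report_results_immediately in the client documentation I have seen, as far as I recall.)

```xml
<!-- cc_config.xml, placed in the BOINC data directory.
     Apply with Options -> Read config files, or restart the client. -->
<cc_config>
  <log_flags>
    <work_fetch_debug>1</work_fetch_debug>
  </log_flags>
</cc_config>
```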
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14686 · Credit: 200,643,578 · RAC: 874
> [...] If he is having another problem with Work Fetch on the same machine, my first suspect would be BOINC Tasks, as it was already found to cause problems with Work Fetch previously.

Well, what you see in the Event Log is written by the BOINC Client, and nothing else. What's written into the Event Log is what BOINC is using, what it's acting on.

There are many and various ways of giving operational settings to the BOINC Client - project web sites, via BOINC Manager, via BOINC Tasks, via an account manager like BAM!, by directly editing the appropriate XML file. And probably more that I haven't thought of. How they get there doesn't matter in the slightest (although it's helpful to the user's sanity if they pick one way that's convenient for them, and stick with it. Otherwise trouble-shooting tends to frazzle the brain. He says, with feeling and bitter experience.)

But however you set them, they end up in the same place, and they get written to the Event Log. That's why the log is so useful, and why - if you like tinkering - it repays the effort of learning to decipher what it's trying to tell you.
Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
Well, that is an interesting comment. I wouldn't be at all surprised that BoincTasks is involved, now that you mention it. I wonder if BT also has something to do with the machines all getting synched up with respect to their reporting timing. After all, only BT touches all machines at all times, other than the SETI servers.
Stephen "Heretic" · Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628
> [...] If he is having another problem with Work Fetch on the same machine, my first suspect would be BOINC Tasks, as it was already found to cause problems with Work Fetch previously.

. . FWIW, my Linux machines behave mostly as you describe, except for the limited downloads issue. The fast rig seems to be limited to about 100 tasks with each request (give or take a dozen) when they are needed and available. The C2D unit had a problem with the max downloads getting less and less, but I discovered that was due to shrinking free space on the flash drive it is running on. However, since getting that space problem under control, it is still limited to about 30 tasks per download no matter how empty it is. But, having only one GPU, its cache is limited to 100 tasks, so perhaps there is a relationship there - 30/100 compared to 100/300??

Stephen

??
Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
> [...] If he is having another problem with Work Fetch on the same machine, my first suspect would be BOINC Tasks, as it was already found to cause problems with Work Fetch previously.

I solved the limited downloads problem back during the summer sometime, if I remember. I have seen as many as 198 tasks downloaded when a machine is empty, and I think I remember someone saying that 200 was the buffer limit, so no restrictions on downloads anymore. I need to get into the backoff situation on the Linux machine so I can set work_fetch_debug in the logfile for Richard.

So far, this weekend, all machines are keeping topped off. I did have to kick the server on Friday night when I was way down on work on the Linux machine, after Eric fixed the server problem, I think. But no other issues so far on any machine. Keeping my fingers and toes crossed that I haven't now jinxed myself.
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13882 · Credit: 208,696,464 · RAC: 304
> So far, this weekend, all machines are keeping topped off. I did have to kick the server on Friday night when I was way down on work on the Linux machine, after Eric fixed the server problem, I think.

Likewise. The application issue reared its head a couple of times for me as well, but TBar's triple update got things going again.

Grant
Darwin NT
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13882 · Credit: 208,696,464 · RAC: 304
> I often see a drop out somewhere around 06:00 UTC - I guess that might be when the data dump for the third party sites is done.

Both web site & scheduler MIA again for a while there, and around the same time as previously.
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13882 · Credit: 208,696,464 · RAC: 304
Well, that would have to be one of the best after-outage recoveries yet. First report after the outage & also got work. And I have picked up new work with each scheduler contact since (so far). Already hit the server-side limits.

Whatever caused that last work shortage issue, Eric's efforts appear to have well & truly sorted it out.
Joined: 1 Dec 99 · Posts: 2786 · Credit: 685,657,289 · RAC: 835
Yea, that went smooth for me too, but I was watching for it to come up and got in early. Reported and loaded in 3 requests.
Stephen "Heretic" · Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628
> Yea, that went smooth for me too, but I was watching for it to come up and got in early. Reported and loaded in 3 requests.

. . The outage was a little shorter than the norm, and I got work right away here too. Seems the cobwebs have been removed :)

Stephen :)
rob smith · Joined: 7 Mar 03 · Posts: 22658 · Credit: 416,307,556 · RAC: 380
Eric obviously oiled the servers' hinges; things are a lot smoother than they have been for weeks :-)

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Joined: 1 Dec 99 · Posts: 2786 · Credit: 685,657,289 · RAC: 835
Methinks Eric might have to look again; it appears the servers fell on their face an hour ago ... :(
Joined: 1 Dec 99 · Posts: 2786 · Credit: 685,657,289 · RAC: 835
I just fired off an email to Eric re: empty cache ...
Joined: 12 Jun 07 · Posts: 16 · Credit: 10,968,872 · RAC: 0
Glad that I went to a 1-day cache.
kittyman · Joined: 9 Jul 00 · Posts: 51505 · Credit: 1,018,363,574 · RAC: 1,004
Oh meow. Last work I got was almost an hour ago. Meowsigh.

EDIT..... And, as usual, just 'bout the time I posted that, my best rig got a burst of 37 tasks. Dunno if that means it is fixed, or just the luck of the draw.

"Time is simply the mechanism that keeps everything from happening all at once."
Joined: 20 Dec 05 · Posts: 3187 · Credit: 57,163,290 · RAC: 0
"Results ready to send: 28"
"Current result creation rate: 1.2/sec"

...someone needs to give one or more of the splitters a kick; they should be up in the 20 or 30 results created per second when the queue is that low...

Hello, from Albany, CA!...
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.