The Server Issues / Outages Thread - Panic Mode On! (118)

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13893
Credit: 208,696,464
RAC: 304
Australia
Message 2032448 - Posted: 14 Feb 2020, 21:55:09 UTC

A large volume of shorties, with many noise bombs thrown in, is going through the system at the moment. In progress is climbing, Ready to send is falling, and all the clearing of the Validation & Assimilation backlogs over the past couple of weeks has been undone. Backlogs are growing, and growing fast.
Grant
Darwin NT
ID: 2032448
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1644
Credit: 12,921,799
RAC: 89
New Zealand
Message 2032488 - Posted: 15 Feb 2020, 5:09:18 UTC - in response to Message 2032448.  
Last modified: 15 Feb 2020, 5:10:21 UTC

A large volume of shorties, with many noise bombs thrown in, is going through the system at the moment. In progress is climbing, Ready to send is falling, and all the clearing of the Validation & Assimilation backlogs over the past couple of weeks has been undone. Backlogs are growing, and growing fast.

As I type, the splitters seem to be coping quite well, splitting at over 80 a second. RTS is over 2500; it could be higher, but it could also be a lot lower.
ID: 2032488
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2032497 - Posted: 15 Feb 2020, 10:01:09 UTC

The assimilation backlog had been going down at a steady rate for several days, but a couple of days ago it reversed direction. It was 2.2 mil at the lowest point but has now grown to almost 2.8 mil.

And each workunit in the assimilation queue is keeping about 2.2 results stuck in the database. They seem to be shown in the validation queue on the SSP. This has pushed the total number of results to a level that forces their generation to be throttled again by stopping the splitters :(
ID: 2032497
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2032515 - Posted: 15 Feb 2020, 13:51:57 UTC - in response to Message 2032500.  

And the replica again unable to keep up, as of now 5,478 seconds (1.5 hours) behind.


I’m starting to think this is a weekend problem. 3rd Saturday in a row that it’s happened, and usually catches back up by Monday morning.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2032515
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19520
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2032531 - Posted: 15 Feb 2020, 16:08:00 UTC - in response to Message 2032515.  

And the replica again unable to keep up, as of now 5,478 seconds (1.5 hours) behind.


I’m starting to think this is a weekend problem. 3rd Saturday in a row that it’s happened, and usually catches back up by Monday morning.

Looks like someone has applied a touch of TLC, or a size 10 boot, to the problem. The replica has caught up.
ID: 2032531
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13893
Credit: 208,696,464
RAC: 304
Australia
Message 2032570 - Posted: 15 Feb 2020, 21:57:37 UTC

Validation / Assimilation backlog continues to grow.
Shouldn't be too long and all splitter output will cease again till the backlog starts going down again. The number of shorties & noise bombs has dropped off, so we might get lucky.
Grant
Darwin NT
ID: 2032570
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13893
Credit: 208,696,464
RAC: 304
Australia
Message 2032589 - Posted: 15 Feb 2020, 23:31:37 UTC - in response to Message 2032570.  

The number of shorties & noise bombs has dropped off, so we might get lucky.
Looks like i spoke too soon.
Another batch of shorties going through.
Grant
Darwin NT
ID: 2032589
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19520
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2032594 - Posted: 16 Feb 2020, 0:12:37 UTC

Panic Mode; ON

The replica is falling behind again.


As of 16 Feb 2020, 0:00:04 UTC
Replica seconds behind master: 1 (0m)


LOL

With that it's Good Night.
C U Domani (see you tomorrow)
ID: 2032594
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13893
Credit: 208,696,464
RAC: 304
Australia
Message 2032650 - Posted: 16 Feb 2020, 10:19:41 UTC

Getting long delays in Scheduler response, and it's not supplying as many WUs as are being reported.
Hopefully just a brief passing glitch.
Grant
Darwin NT
ID: 2032650
Siran d'Vel'nahr
Volunteer tester
Joined: 23 May 99
Posts: 7380
Credit: 44,181,323
RAC: 238
United States
Message 2032681 - Posted: 16 Feb 2020, 15:53:05 UTC

Greetings,

I do believe that we have finally made it beyond the major slowdown we've been experiencing over the past 5 or 6 Sundays. :)

I do hope I didn't just jinx the whole operation! ;)

Have a great day! :)

Siran
CAPT Siran d'Vel'nahr - L L & P _\\//
Winders 11 OS? "What a piece of junk!" - L. Skywalker
"Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath
ID: 2032681
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13893
Credit: 208,696,464
RAC: 304
Australia
Message 2032765 - Posted: 17 Feb 2020, 5:10:13 UTC

Got home from work to find my Linux system out of GPU work. Heaps of work to be downloaded, all in super extended extreme backoff mode.
Grant
Darwin NT
ID: 2032765
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1858
Credit: 268,616,081
RAC: 1,349
United States
Message 2032767 - Posted: 17 Feb 2020, 5:27:28 UTC - in response to Message 2032765.  
Last modified: 17 Feb 2020, 5:33:36 UTC

@Grant:
Got home from work to find my Linux system out of GPU work. Heaps of work to be downloaded, all in super extended extreme backoff mode.

Throw this in a cron job, executed from within the BOINC directory, every 15 minutes or so to solve that:
./boinccmd  --project http://setiathome.berkeley.edu update

Note that you can also run this from outside the BOINC directory, but you will then have to add the password from gui_rpc_auth.cfg:
./BOINC/boinccmd  --passwd [password]  --project http://setiathome.berkeley.edu update
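
A rough sketch of how the first form could be wrapped for cron, under a few assumptions: the `BOINC_DIR` default is a hypothetical path (use wherever your BOINC directory and gui_rpc_auth.cfg actually live), and the function name `request_update` and the script path in the comment are made up for illustration. It changes into the BOINC directory first, which is what lets boinccmd run without `--passwd`.

```shell
#!/bin/sh
# Hypothetical location of the BOINC directory; adjust to your install.
BOINC_DIR="${BOINC_DIR:-$HOME/BOINC}"
# The post runs ./boinccmd from inside that directory.
BOINCCMD="${BOINCCMD:-./boinccmd}"
PROJECT_URL="http://setiathome.berkeley.edu"

# Ask the client to contact the project scheduler, run from inside BOINC_DIR.
request_update() {
    ( cd "$BOINC_DIR" && "$BOINCCMD" --project "$PROJECT_URL" update )
}

# Example crontab line (every 15 minutes), assuming this file is saved as the
# hypothetical $HOME/bin/boinc-update.sh with a final line calling request_update:
#   */15 * * * * /bin/sh "$HOME/bin/boinc-update.sh"
```

The subshell keeps the `cd` from leaking into whatever calls the function.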

ID: 2032767
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13893
Credit: 208,696,464
RAC: 304
Australia
Message 2032769 - Posted: 17 Feb 2020, 5:38:28 UTC - in response to Message 2032767.  

@Grant:
Got home from work to find my Linux system out of GPU work. Heaps of work to be downloaded, all in super extended extreme backoff mode.

Throw this in a cron job, executed from within the BOINC directory, every 15 minutes or so to solve that:
./boinccmd  --project http://setiathome.berkeley.edu update

Note that you can also run this from outside the BOINC directory, but you will then have to add the password from gui_rpc_auth.cfg:
./BOINC/boinccmd  --passwd [password]  --project http://setiathome.berkeley.edu update
Thanks, i'll put a copy of this aside for when i feel like wrestling with something new.
After today i need a shower, a bit of a read, then a good lie down.
:-)
Grant
Darwin NT
ID: 2032769
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2032781 - Posted: 17 Feb 2020, 8:49:39 UTC - in response to Message 2032767.  

@Grant:
Got home from work to find my Linux system out of GPU work. Heaps of work to be downloaded, all in super extended extreme backoff mode.

Throw this in a cron job, executed from within the BOINC directory, every 15 minutes or so to solve that:
./boinccmd  --project http://setiathome.berkeley.edu update
The solution doesn't match the problem. 'Update' will (amongst other things) try to get more work allocated by the scheduler: but this work is already allocated - it just needs to be downloaded. What's needed is some variation on

--file_transfer URL filename {retry | abort}
Do operation on a file transfer
but as it stands that needs a filename. You could use

--get_file_transfers
Show file transfers
and pipe the output to a file - then pick filenames from that.
ID: 2032781
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2032787 - Posted: 17 Feb 2020, 11:35:18 UTC

Also, running update unconditionally will force a scheduler request even when you are still within the cooldown from the previous request. This makes the scheduler refuse to send you any work and resets the cooldown back to the full five minutes. When it happens every 15 min, you may lose up to 33% of your opportunities to get more work, which puts you at a disadvantage in a situation where work generation is being throttled.
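
One way to read the 33% figure, sketched as shell arithmetic: each forced request that lands inside the cooldown can cost up to one full cooldown (five minutes) out of every interval between forced updates (15 minutes). The function name is made up for illustration, and the result is a worst-case bound under that reading, not a measurement.

```shell
#!/bin/sh
# Worst-case share of time lost to reset cooldowns, as a whole percentage.
# $1 = scheduler cooldown in seconds
# $2 = interval between forced updates in seconds
max_loss_percent() {
    echo $(( 100 * $1 / $2 ))
}

max_loss_percent 300 900   # 5 min cooldown, 15 min interval: prints 33
```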
ID: 2032787
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2032788 - Posted: 17 Feb 2020, 11:36:10 UTC - in response to Message 2032781.  

for i in $(boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'); do boinccmd --file_transfer http://setiathome.berkeley.edu "$i" retry; done
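
The same idea unpacked into two small functions, as a sketch rather than a tested tool. It assumes, as the sed expression in the one-liner implies, that `--get_file_transfers` lists each transfer with a `name: <filename>` line. The function names and the `BOINCCMD` override variable are made up for illustration.

```shell
#!/bin/sh
PROJECT_URL="http://setiathome.berkeley.edu"
# Allow the boinccmd path to be overridden; defaults to whatever is on PATH.
BOINCCMD="${BOINCCMD:-boinccmd}"

# Read --get_file_transfers output on stdin, print one filename per line.
# Assumption: every pending transfer appears with a "name: <filename>" line.
transfer_names() {
    sed -n -e 's/^.*name: //p'
}

# Ask the client to retry every pending transfer for the project.
retry_transfers() {
    "$BOINCCMD" --get_file_transfers | transfer_names |
    while IFS= read -r name; do
        "$BOINCCMD" --file_transfer "$PROJECT_URL" "$name" retry
    done
}
```

Reading names line by line with `read -r` avoids the word splitting that a plain `for` loop does on the command output.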

Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2032788
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2032789 - Posted: 17 Feb 2020, 11:38:48 UTC - in response to Message 2032787.  

Also, running update unconditionally will force a scheduler request even when you are still within the cooldown from the previous request. This makes the scheduler refuse to send you any work and resets the cooldown back to the full five minutes. When it happens every 15 min, you may lose up to 33% of your opportunities to get more work, which puts you at a disadvantage in a situation where work generation is being throttled.


i run it on an interval of 930s. while you might miss a couple of times, it doesn't seem to have any real adverse effects. my systems still get topped up all the way. it's a lot better than going into extended backoffs while you're sleeping and waking up to a cold system
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2032789
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2032791 - Posted: 17 Feb 2020, 11:49:42 UTC - in response to Message 2032789.  

i run it on an interval of 930s. while you might miss a couple of times, it doesn't seem to have any real adverse effects. my systems still get topped up all the way. it's a lot better than going into extended backoffs while you're sleeping and waking up to a cold system
Your cron can use smaller units than minutes?
ID: 2032791
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2032792 - Posted: 17 Feb 2020, 11:51:07 UTC - in response to Message 2032791.  

I don’t run it in cron. I just open a terminal window and use the watch command and let it run.
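
Something along these lines would match that description. The exact command isn't given in the thread, so this is an assumption: `watch -n` takes the interval in seconds, 930 is the figure mentioned earlier, and the function name is made up for illustration.

```shell
#!/bin/sh
PROJECT_URL="http://setiathome.berkeley.edu"

# Re-run the project update every N seconds (default 930) until interrupted
# with Ctrl-C. Assumes boinccmd is on PATH and can authenticate, for example
# by being run from the BOINC directory so it can read gui_rpc_auth.cfg.
watch_update() {
    watch -n "${1:-930}" boinccmd --project "$PROJECT_URL" update
}
```

Calling `watch_update` with no argument uses the 930 second interval; `watch_update 600` would use ten minutes instead.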
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2032792
Harri Liljeroos
Joined: 29 May 99
Posts: 5150
Credit: 85,281,665
RAC: 126
Finland
Message 2032802 - Posted: 17 Feb 2020, 13:55:02 UTC

BoincTasks can do the upload/download retries for you automatically. There is an example config.xml file under "Program Files\eFMer\BoincTasks\examples". Copy this file to the folder where the BoincTasks exe-files are, edit it, and restart BoincTasks. By default it checks the upload/download queue every 4000 seconds and retries the transfers. If file transfers still fail, it shortens the interval automatically by 360 seconds (or something like that, I don't remember exactly). The interval is decreased after each failed retry until it reaches 180/360 seconds (sorry, I don't remember the exact value). If file transfers are successful, the interval is increased by the same step until it is back at 4000 seconds. The config.xml also lets you control work requests, but I haven't experimented with those parameters. The file has comments describing the different parameters and their purpose. You can see how it has operated under the BoincTasks menu Show->Log.
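
The interval adjustment described here can be sketched as a small shell function. This only illustrates the behaviour as described in this post and is not BoincTasks code: the numbers are the approximate ones quoted, the 360 second floor is picked from the "180/360" guess, and the function and variable names are made up.

```shell
#!/bin/sh
# Figures from the post, which flags them as approximate.
MAX_INTERVAL=4000   # default seconds between queue checks
MIN_INTERVAL=360    # floor; the post says 180 or 360
STEP=360            # adjustment applied after each check

# Print the next check interval in seconds.
# $1 = current interval in seconds
# $2 = "ok" if the last retry succeeded, anything else counts as a failure
next_interval() {
    if [ "$2" = "ok" ]; then
        # Success: back off towards the default, never above it.
        next=$(( $1 + $STEP ))
        if [ "$next" -gt "$MAX_INTERVAL" ]; then next=$MAX_INTERVAL; fi
    else
        # Failure: check sooner, never below the floor.
        next=$(( $1 - $STEP ))
        if [ "$next" -lt "$MIN_INTERVAL" ]; then next=$MIN_INTERVAL; fi
    fi
    echo "$next"
}
```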
ID: 2032802


 