The Server Issues / Outages Thread - Panic Mode On! (118)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118)

Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1858
Credit: 268,616,081
RAC: 1,349
United States
Message 2032830 - Posted: 17 Feb 2020, 20:35:02 UTC - in response to Message 2032781.  
Last modified: 17 Feb 2020, 20:41:46 UTC

The solution doesn't match the problem. 'Update' will (amongst other things) try to get more work allocated by the scheduler: but this work is already allocated - it just needs to be downloaded.
That's what I get for posting so late (early) at night. Sorry, I was conflating two different issues. What I was addressing there was BOINC "forgetting" to even ask for work for extended periods when there was none available. NVM.
Sounds like the BoincTasks option may be worth exploring... appreciate it.
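To make the distinction above concrete, here is a minimal sketch (not something posted in the thread): `update` contacts the scheduler to report results and request work, while work that is already allocated only needs its stuck transfers retried. It assumes the `http://setiathome.berkeley.edu` project URL used later in the thread and that queued transfers show up as `name: ...` lines in `boinccmd --get_file_transfers` output; the `nudge_client` function and the `BOINCCMD` override are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating the distinction above:
#   - "update" asks the scheduler to take results and hand out (more) work;
#   - work that is already allocated only needs its stuck transfers retried.
# Assumes the project URL used elsewhere in this thread. BOINCCMD can be
# overridden (e.g. with a stub function) to try the logic without a client.

PROJECT_URL="${PROJECT_URL:-http://setiathome.berkeley.edu}"
BOINCCMD="${BOINCCMD:-boinccmd}"

nudge_client() {
  local names name
  # File names of queued transfers (uploads or downloads), one per line.
  names=$("$BOINCCMD" --get_file_transfers | sed -n -e 's/^.*name: //p')

  if [[ -n "$names" ]]; then
    # Something is already queued: retry each transfer instead of updating.
    while IFS= read -r name; do
      "$BOINCCMD" --file_transfer "$PROJECT_URL" "$name" retry
    done <<< "$names"
  else
    # Nothing queued: ask the scheduler for work.
    "$BOINCCMD" --project "$PROJECT_URL" update
  fi
}

# Only run when executed directly and boinccmd is actually available.
if [[ "${BASH_SOURCE[0]}" == "$0" ]] && command -v "$BOINCCMD" >/dev/null 2>&1; then
  nudge_client
fi
```

This is a simplification: with finished results waiting to upload it also retries rather than updating, which may or may not be what you want.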
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2032846 - Posted: 17 Feb 2020, 23:03:24 UTC - in response to Message 2032788.  

for i in `boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'`;do boinccmd --file_transfer http://setiathome.berkeley.edu $i retry;done


This command will give the boot to stuck transfers if you don’t want to run additional software.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours
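For readability, the same one-liner can be spread out with comments. This is a sketch rather than the exact command posted: like the original `sed`, it assumes each transfer's file name appears after `name: ` in `boinccmd --get_file_transfers` output, and it differs mainly in reading names one per line and quoting `"$name"`, so the shell never word-splits or glob-expands a name. `retry_stuck_transfers` is a hypothetical function name.

```shell
#!/usr/bin/env bash
# Commented version of the one-liner posted above. Same commands; names are
# read one per line and quoted so the shell never splits or globs them.

PROJECT_URL="http://setiathome.berkeley.edu"   # URL as given in the original

retry_stuck_transfers() {
  # List queued transfers and keep only the text after "name: " (the file names).
  boinccmd --get_file_transfers \
    | sed -n -e 's/^.*name: //p' \
    | while IFS= read -r name; do
        # Tell the client to retry this transfer right away.
        boinccmd --file_transfer "$PROJECT_URL" "$name" retry
      done
}

# Only run when executed directly and boinccmd is installed.
if [[ "${BASH_SOURCE[0]}" == "$0" ]] && command -v boinccmd >/dev/null 2>&1; then
  retry_stuck_transfers
fi
```

Saved as a script, it can be run by hand whenever transfers look stuck.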

Buckeye4LF Project Donor
Joined: 19 Jun 00
Posts: 173
Credit: 54,916,209
RAC: 833
United States
Message 2032854 - Posted: 18 Feb 2020, 0:10:40 UTC - in response to Message 2032846.  
Last modified: 18 Feb 2020, 0:16:25 UTC

for i in `boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'`;do boinccmd --file_transfer http://setiathome.berkeley.edu $i retry;done


This command will give the boot to stuck transfers if you don’t want to run additional software.


Do you just run this every time you power up the machine, or do you wait for an issue to come up? I am new to Linux as well; what options, if any, do you set for the watch command?

Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2032858 - Posted: 18 Feb 2020, 1:37:35 UTC

I would only run the watch command when you expect issues or are already getting them, and aren't going to be monitoring the machine while you are away or asleep. That's because it can preempt the normal 305-second scheduler connection and force a "too soon" backoff interval. Generally that doesn't cause a problem because it happens infrequently; the worst case is that it synchronizes with the stock interval and keeps preempting it. But if you choose an interval much longer than 305 seconds, and an odd number, that won't happen very often.
But it is safe to use.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
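Keith's interval advice can be checked with a little arithmetic. Under a simplified model (an assumption: both timers are strictly periodic and start together, which real backoffs and `watch`'s own timing will blur), a job repeated every N seconds lines up with the 305-second scheduler contact once every lcm(N, 305) seconds. The script below is a hypothetical helper that computes that period.

```shell
#!/usr/bin/env bash
# Back-of-envelope check: how often would a job repeated every N seconds
# land on the same second as the client's ~305 s scheduler contact?
# Simplified model (an assumption): both timers are strictly periodic and
# start together; real backoffs will blur this.

readonly SCHEDULER_PERIOD=305

# Greatest common divisor by Euclid's algorithm.
gcd() {
  local a=$1 b=$2 t
  while (( b != 0 )); do
    t=$(( a % b ))
    a=$b
    b=$t
  done
  echo "$a"
}

# Seconds between coinciding firings = lcm(N, SCHEDULER_PERIOD).
coincidence_period() {
  local interval=$1 g
  g=$(gcd "$interval" "$SCHEDULER_PERIOD")
  echo $(( interval / g * SCHEDULER_PERIOD ))
}

# Print the coincidence period for each candidate interval given as an argument.
if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then
  for candidate in "$@"; do
    printf '%6d s interval -> lines up every %d s\n' \
      "$candidate" "$(coincidence_period "$candidate")"
  done
fi
```

For example, `./interval_check.sh 610 907 915` (hypothetical file name) reports 610, 276635 and 915 seconds respectively. Under this model the useful property is sharing no factor with 305 (= 5 × 61) rather than being odd as such: 915 is odd but is exactly 3 × 305, whereas 907 lines up only about every 77 hours. The chosen value is then what goes to `watch -n`, e.g. `watch -n 907 ./your-retry-script.sh`, where the script name is a placeholder.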
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1646
Credit: 12,921,799
RAC: 89
New Zealand
Message 2032879 - Posted: 18 Feb 2020, 3:42:54 UTC

Do you need to panic? I'm not panicking in the slightest
Results ready to send	0	0	391,092	9m
Current result creation rate **	0/sec	0.0184/sec	3.9030/sec	5m
Results out in the field	0	32,248	6,382,603	9m

Niteryder
Volunteer tester
Joined: 1 Mar 99
Posts: 64
Credit: 22,663,988
RAC: 18
United States
Message 2032908 - Posted: 18 Feb 2020, 22:57:28 UTC

Are we back?
Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2032909 - Posted: 18 Feb 2020, 23:00:46 UTC - in response to Message 2032908.  

Are we back?


yup, but expect some recovery lag as the system gets pounded by EVERYONE all returning the finished WUs and asking for fresh data. The system will be wonky while dealing with the volume for a while.
Cruncher-American Crowdfunding Project Donor*, Special Project $75 donor, Special Project $250 donor
Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 2032935 - Posted: 19 Feb 2020, 0:37:27 UTC

hello?
AllgoodGuy
Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2032936 - Posted: 19 Feb 2020, 0:50:50 UTC - in response to Message 2032909.  

Are we back?


yup, but expect some recovery lag as the system gets pounded by EVERYONE all returning the finished WUs and asking for fresh data. The system will be wonky while dealing with the volume for a while.


Yep. Much Wonkiness. That's the computer equivalent to Jerkiness in Calculus.
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1646
Credit: 12,921,799
RAC: 89
New Zealand
Message 2032951 - Posted: 19 Feb 2020, 3:06:55 UTC - in response to Message 2032935.  

hello?

Good evening cruncher America
AllgoodGuy
Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2032953 - Posted: 19 Feb 2020, 3:16:00 UTC - in response to Message 2032951.  

hello?

Good evening cruncher America


Good Afternoon in NZ! Not a lot of crunching going on right now. Seems things are a little stuck at the moment. Validations are slow, downloads aren't happening despite a growing number of ready to send.
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1646
Credit: 12,921,799
RAC: 89
New Zealand
Message 2032954 - Posted: 19 Feb 2020, 3:24:04 UTC - in response to Message 2032953.  

hello?

Good evening cruncher America


Good Afternoon in NZ! Not a lot of crunching going on right now. Seems things are a little stuck at the moment. Validations are slow, downloads aren't happening despite a growing number of ready to send.

All part of the recovery process
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2032956 - Posted: 19 Feb 2020, 3:26:00 UTC

Splitters stopped too late. There are now over 1.2 million tasks in the RTS queue, and the total number of tasks in the database is way past the 20 million stability limit :(
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2032957 - Posted: 19 Feb 2020, 3:29:15 UTC

Yes, that has been typical of the recent long outages. They let the splitters run unthrottled and don't allow any downloads until the RTS buffer holds over a million tasks ready to send; only then do they allow downloads. By setting NNT (No New Tasks) you should have been able to report all your finished work by now. You can now unset NNT and just wait for the floodgates to open and tasks to flow out to all the hungry hosts.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
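The NNT routine described above can be sketched with boinccmd's per-project operations (`nomorework`, `update`, `allowmorework`). This is a minimal sketch, not Keith's actual procedure: the project URL is the one used earlier in the thread, and the function names, the `report`/`resume` arguments and the `BOINCCMD` override (handy for a dry run with `BOINCCMD=echo`) are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the "set NNT, report, then unset NNT" routine.
# PROJECT_URL is the URL used earlier in this thread; BOINCCMD may be
# overridden (e.g. BOINCCMD=echo) to print the commands instead of running them.

PROJECT_URL="${PROJECT_URL:-http://setiathome.berkeley.edu}"
BOINCCMD="${BOINCCMD:-boinccmd}"

# Step 1: set No New Tasks, then contact the scheduler to report finished work.
report_with_nnt() {
  "$BOINCCMD" --project "$PROJECT_URL" nomorework
  "$BOINCCMD" --project "$PROJECT_URL" update
}

# Step 2: once everything is reported, allow new work again and ask for it.
resume_work_fetch() {
  "$BOINCCMD" --project "$PROJECT_URL" allowmorework
  "$BOINCCMD" --project "$PROJECT_URL" update
}

main() {
  case "$1" in
    report) report_with_nnt ;;
    resume) resume_work_fetch ;;
    *) echo "usage: $0 report|resume" >&2; return 1 ;;
  esac
}

# Only act when executed directly with an argument.
if [[ "${BASH_SOURCE[0]}" == "$0" && $# -gt 0 ]]; then
  main "$@"
fi
```

Run it with `report` right after the outage and with `resume` once everything has been reported. Whether NNT actually helps appears to vary between hosts, as the next two posts discuss.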
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2032959 - Posted: 19 Feb 2020, 3:39:56 UTC
Last modified: 19 Feb 2020, 3:53:59 UTC

NNT doesn't seem to make much difference. For the first couple of hours after the outage, when all my scheduler requests failed, they failed with and without NNT, and when they started working, they also worked with and without NNT. I had one computer running with NNT and another without.

Some hosts somewhere seemed to be able to get new work already while I was receiving just timeouts and HTTP errors. There were 852 Astropulse tasks in the RTS queue shortly after the server status page (SSP) started updating after the outage. Those 852 tasks melted away fast: over half of them were gone by the next SSP update 20 minutes later. But I haven't received any new tasks for any of my computers yet.
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2032961 - Posted: 19 Feb 2020, 3:49:09 UTC - in response to Message 2032959.  

It does in my case. Everyone's farm and internet connection is different. Yes, the scheduler connection always times out right after the outage finishes and then eventually starts connecting. But in my case, if I don't set NNT, it still doesn't connect even once other connections have started going through. With NNT set, the connection can finally report finished work.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13904
Credit: 208,696,464
RAC: 304
Australia
Message 2032966 - Posted: 19 Feb 2020, 4:38:06 UTC
Last modified: 19 Feb 2020, 4:39:43 UTC

So, has anyone actually been able to get some work?
Not a one here, even after all completed work was returned.

In progress has fallen by 1.5 million, and is still falling. Looks like work is being reported, but nothing is going out.
Grant
Darwin NT
Alien Seeker
Joined: 23 May 99
Posts: 57
Credit: 511,652
RAC: 32
France
Message 2032967 - Posted: 19 Feb 2020, 4:46:08 UTC - in response to Message 2032966.  

Since the recovery I got one task on one of my computers, and three on the other. So not zero, but it's really only a trickle.
Gazing at the skies, hoping for contact... Unlikely, but it would be such a fantastic opportunity to learn.

Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2032968 - Posted: 19 Feb 2020, 4:50:35 UTC
Last modified: 19 Feb 2020, 4:52:54 UTC

The replica is 30 minutes behind, so it's hard to see what is going on.
A bunch of APs went out, and I can see an increased return rate for APs, but the AP "out in the field" count should be higher, so things are still wonky.
I have a slow machine, so I just did a test request for work, and it said no tasks were available. Usually my machine gets tasks before all you fast machines do. The system is still in recovery mode; it's hard to tell whether it just needs more time or a swift kick. I'm guessing it will resolve once the total number of results falls below the magic 20 million mark. Honestly, that is just a wild guess on my part, though.

Edit: RTS is falling. The dam has broken; I think work is going out.
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1858
Credit: 268,616,081
RAC: 1,349
United States
Message 2032970 - Posted: 19 Feb 2020, 4:59:24 UTC - in response to Message 2032966.  
Last modified: 19 Feb 2020, 5:04:04 UTC

So, has anyone actually been able to get some work?
Across 4 clients, I've gotten 10 total tasks so far :)
But at least with the boinccmd update kluge launched every 15 minutes, my clients do ask for work rather than sitting there stupid. Once the tasks start flowing, it will be interesting to see whether putting stuck-transfer retries under the control of BoincTasks, as mentioned earlier, helps with that issue.
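The update kluge itself isn't posted, so the following is only a guess at what such a thing might look like: a script that asks each listed client to contact the scheduler, intended to be launched every 15 minutes (for example from a crontab entry with `*/15` in the minute field). The host list, the `update_all_clients` function and the `BOINCCMD` override are hypothetical; `--host` and `--project URL update` are standard boinccmd options.

```shell
#!/usr/bin/env bash
# Hypothetical "update kluge": ask each listed BOINC client to contact the
# scheduler now, meant to be launched every 15 minutes (e.g. from cron).
# Host names are placeholders; BOINCCMD can be overridden for a dry run.

PROJECT_URL="${PROJECT_URL:-http://setiathome.berkeley.edu}"
BOINCCMD="${BOINCCMD:-boinccmd}"
CLIENT_HOSTS=(localhost)   # placeholder: list your own clients here

update_all_clients() {
  local host
  for host in "$@"; do
    # Scheduler contact: reports finished work and, unless NNT is set,
    # requests new tasks.
    "$BOINCCMD" --host "$host" --project "$PROJECT_URL" update
  done
}

# Only run when executed directly and boinccmd is actually available.
if [[ "${BASH_SOURCE[0]}" == "$0" ]] && command -v "$BOINCCMD" >/dev/null 2>&1; then
  update_all_clients "${CLIENT_HOSTS[@]}"
fi
```

Depending on how each client is set up, boinccmd may also need `--passwd` (or to be run from the directory holding `gui_rpc_auth.cfg`) before it can authenticate.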


 
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.