The Server Issues / Outages Thread - Panic Mode On! (118)

Siran d'Vel'nahr
Volunteer tester
Joined: 23 May 99
Posts: 7381
Credit: 44,181,323
RAC: 238
United States
Message 2030715 - Posted: 3 Feb 2020, 23:01:17 UTC - in response to Message 2030282.  

How does one tell if the jobs are resends?
In this case, we're talking about new - extra - replications of existing tasks, not about resending lost tasks.

You tell from the task name, as shown in BOINC Manager. You need to be able to see the very end of the name - so use advanced view, and make the column as wide as you need. The last two characters are as follows:

_0 - always the first time a workunit has been sent to a cruncher. Every WU has a task _0
_1 - in normal times, usually created at the same time as _0 and sent out straight away. At the moment, some are being created and distributed later.
_2 onwards - probably a new replication, because the first two failed to validate (either because they returned different answers, or one of them never returned at all). But again, just at the moment, some results are untrustworthy, so _2 may be created for a safety-check.
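
As a rough illustration of the suffix rule above, here is a minimal Python sketch that reads the replication number off the end of a task name. The function names and the sample task name are made up for the example; the only thing taken from the thread is the _0 / _1 / _2 onwards convention.

```python
import re

# Matches the trailing "_<number>" that ends a task name.
_SUFFIX = re.compile(r"_(\d+)$")


def replication_number(task_name: str) -> int:
    """Return the number after the final underscore of a task name."""
    match = _SUFFIX.search(task_name.strip())
    if match is None:
        raise ValueError(f"no _N suffix found in {task_name!r}")
    return int(match.group(1))


def describe_task(task_name: str) -> str:
    """Classify a task using the convention described in the list above."""
    n = replication_number(task_name)
    if n == 0:
        return "first task of the workunit"
    if n == 1:
        return "second initial task, normally created alongside _0"
    return "extra replication (earlier tasks did not validate, or a safety check)"


# Hypothetical task name, used only to exercise the functions.
print(describe_task("example_workunit.vlar_3"))
```

Copying the full name out of BOINC Manager's advanced view and passing it to describe_task would be one way to sort a long task list without widening the column.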

Hi Richard,

I would like to add to your list above. I just found a WU ending with _3. I counted about 30 _2 WUs. Any ideas? :)



Have a great day! :)

Siran
CAPT Siran d'Vel'nahr - L L & P _\\//
Winders 11 OS? "What a piece of junk!" - L. Skywalker
"Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath
ID: 2030715
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2030749 - Posted: 4 Feb 2020, 3:13:56 UTC - in response to Message 2030702.  

Certainly something is going on. I went out at midday, UK time, and the replica was over 14 hours behind at that time.


Just looked and it is "down to" 5+ hours behind....

Tom
A proud member of the OFA (Old Farts Association).
ID: 2030749
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13986
Credit: 208,696,464
RAC: 304
Australia
Message 2030763 - Posted: 4 Feb 2020, 6:17:17 UTC

I see that all the usual backlogs (Validation, Assimilation & Deletion) continue to grow.

The Replica is down to around 3 hrs behind.
Grant
Darwin NT
ID: 2030763
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13986
Credit: 208,696,464
RAC: 304
Australia
Message 2030764 - Posted: 4 Feb 2020, 6:20:27 UTC - in response to Message 2030715.  

I would like to add to your list above. I just found a WU ending with _3.
I've been getting several each day, even the very occasional _4.
*shrug*
Grant
Darwin NT
ID: 2030764
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030768 - Posted: 4 Feb 2020, 7:46:03 UTC - in response to Message 2030715.  

Hi Richard,

I would like to add to your list above. I just found a WU ending with _3. I counted about 30 _2 WUs. Any ideas? :)
I did say "_2 onwards". Just keep counting until you find two which match. They get killed off after _10.
ID: 2030768
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030772 - Posted: 4 Feb 2020, 8:27:08 UTC - in response to Message 2030708.  

Are they purging overflows as soon as they get validated now?
When the replica db is catching up, we effectively see time running in fast-forward on the web site.
ID: 2030772
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1861
Credit: 268,616,081
RAC: 1,349
United States
Message 2030777 - Posted: 4 Feb 2020, 10:53:03 UTC

Nice to be heading into the outage with full caches and a database that's looking at least a bit healthier.
ID: 2030777
Niteryder
Volunteer tester
Joined: 1 Mar 99
Posts: 64
Credit: 22,663,988
RAC: 18
United States
Message 2030793 - Posted: 5 Feb 2020, 1:47:50 UTC

Are we back?
ID: 2030793
betreger Project Donor
Joined: 29 Jun 99
Posts: 11451
Credit: 29,581,041
RAC: 66
United States
Message 2030794 - Posted: 5 Feb 2020, 1:57:11 UTC - in response to Message 2030793.  

Are we back?

Yep
ID: 2030794
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13986
Credit: 208,696,464
RAC: 304
Australia
Message 2030801 - Posted: 5 Feb 2020, 2:16:50 UTC

Forums are extremely sluggish at the moment, and the Scheduler is slow to respond, but at least it is responding (with the usual after outage "Project has no tasks available").
Grant
Darwin NT
ID: 2030801
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1861
Credit: 268,616,081
RAC: 1,349
United States
Message 2030804 - Posted: 5 Feb 2020, 2:23:41 UTC

So far, a pretty great recovery. Got all work reported in just a few retries, got a modest download on just one cruncher, db reported pretty quick and forum responsive. Here's hoping it's the new normal!
ID: 2030804
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13986
Credit: 208,696,464
RAC: 304
Australia
Message 2030805 - Posted: 5 Feb 2020, 2:38:34 UTC - in response to Message 2030804.  
Last modified: 5 Feb 2020, 2:41:35 UTC

So far, a pretty great recovery. Got all work reported in just a few retries, got a modest download on just one cruncher, db reported pretty quick and forum responsive. Here's hoping it's the new normal!
Maybe where you are, but here the forums have gotten even slower, and Scheduler requests are all erroring out now, "HTTP internal server error" or "HTTP service unavailable" or "Failure when receiving data from the peer", or it's just timing out.

It's very, very borked.
Hopefully it won't take more than an hour or three to settle down.
Grant
Darwin NT
ID: 2030805
Stephen "Heretic" Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2030808 - Posted: 5 Feb 2020, 3:20:15 UTC - in response to Message 2030805.  

So far, a pretty great recovery. Got all work reported in just a few retries, got a modest download on just one cruncher, db reported pretty quick and forum responsive. Here's hoping it's the new normal!
Maybe where you are, but here the forums have gotten even slower, and Scheduler requests are all erroring out now, "HTTP internal server error" or "HTTP service unavailable" or "Failure when receiving data from the peer", or it's just timing out.

It's very, very borked.
Hopefully it won't take more than an hour or three to settle down.


. . +1

. . My 2 smaller Linux rigs have reported their work OK, but the 2 main Linux rigs have not been able to report at all because of the errors above. And while I have had one or two "no tasks" mostly I cannot even talk to the servers.

Stephen

:(
ID: 2030808
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1861
Credit: 268,616,081
RAC: 1,349
United States
Message 2030809 - Posted: 5 Feb 2020, 3:32:35 UTC

Setting nnt until all work is reported has been very effective for me.
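
For anyone who wants to script this, here is a hedged sketch. "nnt" is the Manager's "No new tasks" setting, and BOINC's command-line tool boinccmd takes `--project URL nomorework`, `allowmorework` and `update` for the same actions. The helper names below are invented for the example, and the project URL is an assumption: it has to match the URL your client is attached with. Actually running a command also assumes boinccmd is on the PATH and can talk to the local client.

```python
import subprocess

# Project operations this sketch passes to `boinccmd --project URL <op>`.
ALLOWED_OPS = ("nomorework", "allowmorework", "update")

# Assumption: replace with the exact URL shown for the project in your client.
PROJECT_URL = "http://setiathome.berkeley.edu/"


def boinccmd_args(project_url: str, op: str) -> list[str]:
    """Build the argument list for one boinccmd project operation."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operation: {op!r}")
    return ["boinccmd", "--project", project_url, op]


def run_boinccmd(project_url: str, op: str) -> None:
    """Run the command; raises CalledProcessError if boinccmd reports failure."""
    subprocess.run(boinccmd_args(project_url, op), check=True)


# Print the commands instead of running them, so nothing changes by accident.
for op in ("nomorework", "update", "allowmorework"):
    print(" ".join(boinccmd_args(PROJECT_URL, op)))
```

The idea would be to issue nomorework first, repeat update until everything has reported, then issue allowmorework to start fetching again.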
ID: 2030809
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13986
Credit: 208,696,464
RAC: 304
Australia
Message 2030810 - Posted: 5 Feb 2020, 3:50:57 UTC - in response to Message 2030809.  
Last modified: 5 Feb 2020, 3:51:23 UTC

Setting nnt until all work is reported has been very effective for me.
Yet another item on the list of things to be sorted out once the current mess has cleared.
Upload issues, download issues and Scheduler issues.
Grant
Darwin NT
ID: 2030810
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2030822 - Posted: 5 Feb 2020, 4:55:42 UTC

I just noticed we are back. And it wasn't a multi-day shutdown. Just a basic long Tuesday.

Tom.
A proud member of the OFA (Old Farts Association).
ID: 2030822
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13986
Credit: 208,696,464
RAC: 304
Australia
Message 2030829 - Posted: 5 Feb 2020, 5:59:09 UTC
Last modified: 5 Feb 2020, 5:59:47 UTC

Ready to send is just short of 1 million, now if only I could get a few of them...
My Windows system has managed to pick up almost 200 WUs in the last 20 min, my Linux system fewer than 6 since the resumption of services.

At least the Scheduler errors have cleared up & the forums are responsive again.
Grant
Darwin NT
ID: 2030829
Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 2030832 - Posted: 5 Feb 2020, 6:16:10 UTC

Interesting that the "ready to send" for Astropulse keeps climbing and none appear to be going out.
Dave

ID: 2030832
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030834 - Posted: 5 Feb 2020, 6:33:54 UTC - in response to Message 2030829.  
Last modified: 5 Feb 2020, 6:44:39 UTC

My Windows system has managed to pick up almost 200 WUs in the last 20 min, my Linux system fewer than 6 since the resumption of services....
The same thing happens on the Mac. Back when I had a Windows machine next to the Mac, connected to the same router, I'd watch the Windows machine receive work every five minutes while the Mac was told there wasn't any work available. After the Windows machine had a full cache, then there was magically work available for the Mac. I watched this dozens of times, to the point I was sure it wasn't a coincidence, and it hasn't changed one bit. Except, now I don't have a Windows machine, so, now I just sit and wait for hours after work flow has begun before I start receiving more than just a handful of tasks every 20 minutes or so. After the Windows machines are full, we Might start getting some serious downloads on our UNIX machines.

These two are still empty:
https://setiathome.berkeley.edu/results.php?hostid=6813106&offset=4500
https://setiathome.berkeley.edu/results.php?hostid=6796479&offset=1100
ID: 2030834
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13986
Credit: 208,696,464
RAC: 304
Australia
Message 2030835 - Posted: 5 Feb 2020, 6:49:15 UTC
Last modified: 5 Feb 2020, 6:51:29 UTC

Linux system now out of all work, Windows system out of CPU work.

Oh, and the Validation backlog reaches another record high.
Grant
Darwin NT
ID: 2030835


 
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.