The Server Issues / Outages Thread - Panic Mode On! (118)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118)


AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2031985 - Posted: 11 Feb 2020, 23:12:17 UTC

That didn't take too long today.
ID: 2031985
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2031987 - Posted: 11 Feb 2020, 23:27:15 UTC - in response to Message 2031985.  

That didn't take too long today.

About double what it should be from what it was in the past.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2031987
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13794
Credit: 208,696,464
RAC: 304
Australia
Message 2032038 - Posted: 12 Feb 2020, 4:05:25 UTC - in response to Message 2031987.  
Last modified: 12 Feb 2020, 4:08:49 UTC

That didn't take too long today.
About double what it should be from what it was in the past.
But still way better than it has been recently.
Got home to find one system with a full cache, and the other with a bunch of downloads in extended backoff mode; one Retry and everything came down OK. Since then the Scheduler has been dishing out work, and it hasn't taken any effort on my part to download it. First time for several weeks.


Edit- although it looks like we are about to run out of work; the splitters are still having issues getting going again after an outage.
And one of my systems has a tonne of Shorties in its cache- that's not going to help the Validation & Assimilation backlogs clear.
Grant
Darwin NT
ID: 2032038
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2032040 - Posted: 12 Feb 2020, 4:16:25 UTC - in response to Message 2032038.  

True, better than the past couple of weeks, but nowhere near the Tuesday outage boilerplate of a 4-5 hour outage for database backup.

I still have one system that stubbornly refuses to get any cpu work even though it requests it and only finally will get cpu tasks once the gpu cache is filled. Infuriating when the cpus provide a good portion of the house heating during Winter.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2032040
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1855
Credit: 268,616,081
RAC: 1,349
United States
Message 2032048 - Posted: 12 Feb 2020, 6:08:21 UTC

Best recovery I've seen in quite a while.
Hopefully this means they've gotten a handle on the issues.
Perfect? No, but I can live with this ...
ID: 2032048
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13794
Credit: 208,696,464
RAC: 304
Australia
Message 2032051 - Posted: 12 Feb 2020, 6:47:12 UTC - in response to Message 2032048.  

Best recovery I've seen in quite a while.
Hopefully this means they've gotten a handle on the issues.
It's just a case of finally having no more BLC35 files being split & putting out pretty much nothing but noise bombs. The backlogs from that (and the added replication for the RX 5000 series issues) are still to be cleared; luckily they're presently low enough not to cause everything to fall over or come to a grinding halt.
Grant
Darwin NT
ID: 2032051
Stephen "Heretic" Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2032052 - Posted: 12 Feb 2020, 6:54:05 UTC - in response to Message 2032048.  

Best recovery I've seen in quite a while.
Hopefully this means they've gotten a handle on the issues.
Perfect? No, but I can live with this ...


. . Hi Jimbo,

. . Less than 9 hours is a pleasant change from recent outages.

. . And the recovery has been very smooth, with only a few niggling 'http internal error' messages.

Stephen

:)
ID: 2032052
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13794
Credit: 208,696,464
RAC: 304
Australia
Message 2032054 - Posted: 12 Feb 2020, 7:51:55 UTC
Last modified: 12 Feb 2020, 7:53:03 UTC

I'm wondering if they ran some sort of script during this outage to jiggle any missed WUs? I'm seeing a lot more groups of re-sends than usual. Getting batches of 20+ at a time.
Grant
Darwin NT
ID: 2032054
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2032061 - Posted: 12 Feb 2020, 10:08:10 UTC - in response to Message 2032052.  
Last modified: 12 Feb 2020, 10:09:28 UTC

. . Less than 9 hours is a pleasant change from recent outages.
You apparently count the outage length differently than I do. My system logged that as an 11.93-hour outage. That's the time between the last reported or received task before the outage and the first received new task after it. So for me the outage ends when the servers have recovered to the point where they can actually hand out new work. That's what matters if I want to determine whether my cache was sufficient to 'survive' the outage.

Actually the end should be at the point when the server can feed my computers new work faster than they can process so that the caches start filling. The first received task could be a random single task long before the faucets open for real. The first received task is just much easier to measure.

December 3rd was the last 'normal' Tuesday outage (4.21 hours). All outages after that have been way longer - except some unplanned ones.
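Ville's outage measure can be sketched in a few lines of Python. This is purely illustrative (the timestamps below are made up, not from any actual client log), but it shows the definition: the gap between the last task contact before the outage and the first new task received after it.

```python
from datetime import datetime

def outage_hours(last_task_before, first_task_after):
    """Outage length as defined above: time between the last task
    reported or received before the outage and the first new task
    received after it."""
    fmt = "%d %b %Y %H:%M:%S"
    start = datetime.strptime(last_task_before, fmt)
    end = datetime.strptime(first_task_after, fmt)
    return round((end - start).total_seconds() / 3600, 2)

# Illustrative timestamps giving roughly the gap described above:
print(outage_hours("11 Feb 2020 22:10:00", "12 Feb 2020 10:05:48"))  # 11.93
```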
ID: 2032061
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19214
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2032096 - Posted: 12 Feb 2020, 17:40:04 UTC - in response to Message 2032040.  

True, better than the past couple of weeks, but nowhere near the Tuesday outage boilerplate of a 4-5 hour outage for database backup.

I still have one system that stubbornly refuses to get any cpu work even though it requests it and only finally will get cpu tasks once the gpu cache is filled. Infuriating when the cpus provide a good portion of the house heating during Winter.

During the outage did your CPUs run out of tasks?
ID: 2032096
Stephen "Heretic" Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2032104 - Posted: 12 Feb 2020, 19:15:44 UTC - in response to Message 2032054.  

I'm wondering if they ran some sort of script during this outage to jiggle any missed WUs? I'm seeing a lot more groups of re-sends than usual. Getting batches of 20+ at a time.

. . A measure to try and clear the backlog perhaps ??
ID: 2032104
Stephen "Heretic" Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2032109 - Posted: 12 Feb 2020, 19:31:20 UTC - in response to Message 2032061.  

. . Less than 9 hours is a pleasant change from recent outages.
You apparently count the outage length differently than I do. My system logged that as an 11.93-hour outage. That's the time between the last reported or received task before the outage and the first received new task after it. So for me the outage ends when the servers have recovered to the point where they can actually hand out new work. That's what matters if I want to determine whether my cache was sufficient to 'survive' the outage.

Actually the end should be at the point when the server can feed my computers new work faster than they can process so that the caches start filling. The first received task could be a random single task long before the faucets open for real. The first received task is just much easier to measure.

December 3rd was the last 'normal' Tuesday outage (4.21 hours). All outages after that have been way longer - except some unplanned ones.


. . I think the general consensus is that it starts when the servers go into maintenance mode and ends when they come back online and we no longer receive a "Shut down for maintenance" message. After that, the time until we receive "normal" feeds of WUs is the recovery period.

. . I found the recovery pretty smooth and not long this time. This outage began at approx. 12:30 am local time and I was able to report work from about 9:05 am, with only a (relatively) small number of "http internal error" messages, so the outage was around 8.5 hours.

. . I agree that the assessment of work required to get through an outage must include the recovery period, but I was receiving 'significant' downloads (20 or more WUs) by about 10:15 am. So all up, still less than 10 hours here. After 24 to 48 hour monsters I welcome the improvement ...

Stephen

< shrug >
ID: 2032109
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2032120 - Posted: 12 Feb 2020, 21:35:56 UTC - in response to Message 2032096.  
Last modified: 12 Feb 2020, 21:36:12 UTC

True, better than the past couple of weeks, but nowhere near the Tuesday outage boilerplate of a 4-5 hour outage for database backup.

I still have one system that stubbornly refuses to get any cpu work even though it requests it and only finally will get cpu tasks once the gpu cache is filled. Infuriating when the cpus provide a good portion of the house heating during Winter.

During the outage did your CPUs run out of tasks?

The Ryzens, ALWAYS. The Intel Xeon almost always still has CPU work during these long outages. It is just way slower than the Ryzens.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2032120
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19214
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2032135 - Posted: 13 Feb 2020, 0:31:54 UTC - in response to Message 2032120.  
Last modified: 13 Feb 2020, 0:34:22 UTC

True, better than the past couple of weeks, but nowhere near the Tuesday outage boilerplate of a 4-5 hour outage for database backup.

I still have one system that stubbornly refuses to get any cpu work even though it requests it and only finally will get cpu tasks once the gpu cache is filled. Infuriating when the cpus provide a good portion of the house heating during Winter.

During the outage did your CPUs run out of tasks?

The Ryzens, ALWAYS. The Intel Xeon almost always still has CPU work during these long outages. It is just way slower than the Ryzens.

I just took a quick look at one of your Ryzens (computer 5741129) and guesstimate that about 60 tasks/day would keep each core busy, excluding noise bombs.

So the next question has to be: why are there not enough tasks being cached for the CPUs?
ID: 2032135
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2032156 - Posted: 13 Feb 2020, 2:33:41 UTC - in response to Message 2032135.  
Last modified: 13 Feb 2020, 2:35:00 UTC

I just took a quick look at one of your Ryzens (computer 5741129) and guesstimate that about 60 tasks/day would keep each core busy, excluding noise bombs.

So the next question has to be: why are there not enough tasks being cached for the CPUs?


I don't know how you came up with 60 tasks a day on that host. BoincTasks says that host does 576 cpu tasks a day.

Generally I do a cpu task in 25-45 minutes for most work. Have 16 threads running cpu work.

I don't generally reschedule gpu work to the cpu. I just crunch through my allotted 150 tasks and then the cpu goes idle for the rest of the outage. I only do cpu work for Seti.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2032156
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2032160 - Posted: 13 Feb 2020, 2:42:54 UTC - in response to Message 2032156.  

I am just guessing, Keith; I am thinking possibly a CPU slower by 600 MHz and running Windows 10.
ID: 2032160
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2032161 - Posted: 13 Feb 2020, 2:51:08 UTC - in response to Message 2032160.  

Well, that host has the new Ryzen 3950X CPU, so 32 threads available, and I have the cores locked to 4.2 GHz and run the memory at 3600 MHz CL14.

So it crunches through the CPU tasks the fastest of all my hosts. The second fastest is the 3900X with the same memory clock speed, but only running 4.15 GHz and with only 12 threads running CPU work.

That one is currently in pieces on the kitchen table getting a custom loop cooling upgrade so I can punch the clock up on it.

The Threadripper is next fastest and is mostly hamstrung by slower memory and a slower clock that is auto-boosted and variable. I'm waiting on a new CPU block for it so I can move to locked clocks, which will improve the CPU processing time. Hoping to punch the memory clocks up too with a cooler CPU.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2032161
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19214
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2032165 - Posted: 13 Feb 2020, 3:13:11 UTC - in response to Message 2032156.  

I just took a quick look at one of your Ryzens (computer 5741129) and guesstimate that about 60 tasks/day would keep each core busy, excluding noise bombs.

So the next question has to be: why are there not enough tasks being cached for the CPUs?


I don't know how you came up with 60 tasks a day on that host. BoincTasks says that host does 576 cpu tasks a day.

Generally I do a cpu task in 25-45 minutes for most work. Have 16 threads running cpu work.

I don't generally reschedule gpu work to the cpu. I just crunch through my allotted 150 tasks and then the cpu goes idle for the rest of the outage. I only do cpu work for Seti.

I did say each core.

Surely the limits set by Seti are per processor not per type of processor.

So I must ask again: why is your CPU cache not storing enough to last a 12-hour outage when, by your own numbers, each core/thread is averaging 36 tasks/day?

At a max cache of 100 per processor, a number I think is now out of date, you should be able to cache nearly three days of CPU work, assuming no -9s.
ID: 2032165
AllgoodGuy
Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2032182 - Posted: 13 Feb 2020, 6:29:22 UTC - in response to Message 2032165.  
Last modified: 13 Feb 2020, 6:37:07 UTC

At a max cache/processor of 100, a number I think is now out of date, you should be able to cache nearly three days of cpu work. Assuming no -9's.


There we go using common sense again :)

By your maths, I should be able to store about 1500 GPU tasks to keep me purring along, although I wish I could get a cache of about 150 -200 opencl_ati_mac AP files a day as a decent replacement.

Edit:
Next you'll be saying we should no longer crunch AP on CPU because it is too inefficient, and has too many casual users who don't devote their machines to the extent full time crunchers do. But hey, it is a good hobby.
ID: 2032182
Stephen "Heretic" Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2032183 - Posted: 13 Feb 2020, 6:35:10 UTC - in response to Message 2032165.  

I did say each core.
Surely the limits set by Seti are per processor not per type of processor.


. . The limit is per CPU not per core. So we only get 150 WUs whether you have 4 cores or 64 cores.

Stephen

< shrug >
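Putting the thread's numbers together: Keith's host gets through roughly 576 CPU tasks a day, and the server-side limit is 150 CPU tasks per host, not per core. A quick sketch (the helper function is purely illustrative, not part of BOINC) shows why that cache can't bridge a 12-hour outage, and why the assumed per-core limit would have given "nearly three days":

```python
def cache_duration_hours(task_limit, tasks_per_day):
    """How long a full task cache lasts at a given processing rate."""
    return task_limit / tasks_per_day * 24

# 150-task per-host CPU limit vs ~576 CPU tasks/day (the 3950X host):
print(cache_duration_hours(150, 576))            # 6.25 hours - far short of 12
# vs. an assumed 100 tasks per core across 16 threads:
print(cache_duration_hours(100 * 16, 576) / 24)  # ~2.78 days ("nearly three days")
```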
ID: 2032183


©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.