The Server Issues / Outages Thread - Panic Mode On! (118)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118)

Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13946
Credit: 208,696,464
RAC: 304
Australia
Message 2032054 - Posted: 12 Feb 2020, 7:51:55 UTC
Last modified: 12 Feb 2020, 7:53:03 UTC

I'm wondering if they ran some sort of script during this outage to jiggle any missed WUs? I'm seeing a lot more groups of re-sends than usual. Getting batches of 20+ at a time.
Grant
Darwin NT
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2032061 - Posted: 12 Feb 2020, 10:08:10 UTC - in response to Message 2032052.  
Last modified: 12 Feb 2020, 10:09:28 UTC

. . Less than 9 hours is a pleasant change from recent outages.
You apparently count the outage length differently than I do. My system logged that as an 11.93-hour outage. That's the time between the last task reported or received before the outage and the first new task received after it. So for me the outage ends when the servers have recovered to the point where they can actually hand out new work. That's what matters if I want to determine whether my cache was sufficient to 'survive' the outage.

Actually the end should be at the point when the server can feed my computers new work faster than they can process so that the caches start filling. The first received task could be a random single task long before the faucets open for real. The first received task is just much easier to measure.
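The measurement rule described here (latest task event before the outage to the first new task received after it) is easy to compute from client-side timestamps. A minimal Python sketch, assuming the timestamps are already available as `datetime` values; the function name and the sample times are hypothetical, and it does not attempt the harder "caches start filling" end point.

```python
from datetime import datetime, timedelta

def outage_hours(task_events_before, first_task_after):
    # Hypothetical helper: the outage starts at the latest task report/receipt
    # logged before the outage and ends at the first new task received after it.
    last_before = max(task_events_before)
    return (first_task_after - last_before) / timedelta(hours=1)

# Hypothetical timestamps, for illustration only (not from anyone's log).
before = [datetime(2020, 1, 7, 15, 10), datetime(2020, 1, 7, 15, 40)]
after = datetime(2020, 1, 8, 3, 10)
print(round(outage_hours(before, after), 2))  # 11.5
```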

December 3rd was the last 'normal' Tuesday outage (4.21 hours). All outages since then have been way longer, except some unplanned ones.
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19691
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2032096 - Posted: 12 Feb 2020, 17:40:04 UTC - in response to Message 2032040.  

True, better than the past couple of weeks, but nowhere near the Tuesday outage boilerplate of a 4-5 hour outage for database backup.

I still have one system that stubbornly refuses to get any cpu work even though it requests it, and only finally gets cpu tasks once the gpu cache is filled. Infuriating when the cpus provide a good portion of the house heating during winter.

During the outage, did your CPUs run out of tasks?
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2032104 - Posted: 12 Feb 2020, 19:15:44 UTC - in response to Message 2032054.  

I'm wondering if they ran some sort of script during this outage to jiggle any missed WUs? I'm seeing a lot more groups of re-sends than usual. Getting batches of 20+ at a time.

. . A measure to try and clear the backlog perhaps ??
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2032109 - Posted: 12 Feb 2020, 19:31:20 UTC - in response to Message 2032061.  

. . Less than 9 hours is a pleasant change from recent outages.
You apparently count the outage length differently than I do. My system logged that as an 11.93-hour outage. That's the time between the last task reported or received before the outage and the first new task received after it. So for me the outage ends when the servers have recovered to the point where they can actually hand out new work. That's what matters if I want to determine whether my cache was sufficient to 'survive' the outage.

Actually the end should be at the point when the server can feed my computers new work faster than they can process so that the caches start filling. The first received task could be a random single task long before the faucets open for real. The first received task is just much easier to measure.

December 3rd was the last 'normal' Tuesday outage (4.21 hours). All outages since then have been way longer, except some unplanned ones.


. . I think the general consensus is that it starts when the servers go into maintenance mode and ends when they come back online and we no longer receive a "Shut down for maintenance" message. After that, the time until we receive "normal" feeds of WUs is the recovery period. I found the recovery pretty smooth and not long this time. This outage began at approx. 12:30 am local time and I was able to report work from about 9:05 am with only a (relatively) small number of "HTTP internal error" messages, so the outage was around 8.5 hours. I agree that the assessment of work required to get through an outage must include the recovery period, but I was receiving 'significant' downloads (20 or more WUs) by about 10:15 am. So all up, still less than 10 hours here. After 24- to 48-hour monsters I welcome the improvement ...

Stephen

< shrug >
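Stephen's clock times can be turned into durations with a few lines of Python. This sketch assumes the outage started just after midnight, so all three times fall on the same calendar day, and converts them to 24-hour "HH:MM" strings; the helper name is illustrative only.

```python
from datetime import datetime

def hours_between(start: str, end: str) -> float:
    # Assumes both times fall on the same calendar day (24-hour "HH:MM").
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 3600

outage = hours_between("00:30", "09:05")         # maintenance start -> able to report
with_recovery = hours_between("00:30", "10:15")  # maintenance start -> significant downloads
print(f"{outage:.2f} h, {with_recovery:.2f} h")  # 8.58 h, 9.75 h
```

Both figures agree with the post: roughly 8.5 hours of outage, and still under 10 hours once the recovery period is included.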
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2032120 - Posted: 12 Feb 2020, 21:35:56 UTC - in response to Message 2032096.  
Last modified: 12 Feb 2020, 21:36:12 UTC

True, better than the past couple of weeks, but nowhere near the Tuesday outage boilerplate of a 4-5 hour outage for database backup.

I still have one system that stubbornly refuses to get any cpu work even though it requests it, and only finally gets cpu tasks once the gpu cache is filled. Infuriating when the cpus provide a good portion of the house heating during winter.

During the outage, did your CPUs run out of tasks?

The Ryzens, ALWAYS. The Intel Xeon almost always has cpu work during these long outages. It is just way slower than the Ryzens.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19691
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2032135 - Posted: 13 Feb 2020, 0:31:54 UTC - in response to Message 2032120.  
Last modified: 13 Feb 2020, 0:34:22 UTC

True, better than the past couple of weeks, but nowhere near the Tuesday outage boilerplate of a 4-5 hour outage for database backup.

I still have one system that stubbornly refuses to get any cpu work even though it requests it, and only finally gets cpu tasks once the gpu cache is filled. Infuriating when the cpus provide a good portion of the house heating during winter.

During the outage, did your CPUs run out of tasks?

The Ryzens, ALWAYS. The Intel Xeon almost always has cpu work during these long outages. It is just way slower than the Ryzens.

I just took a quick look at one of your Ryzens (computer 5741129) and guesstimate that about 60 tasks/day would keep each core busy, excluding noise bombs.

So the next question has to be: why are there not enough tasks being cached for the CPUs?
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2032156 - Posted: 13 Feb 2020, 2:33:41 UTC - in response to Message 2032135.  
Last modified: 13 Feb 2020, 2:35:00 UTC

I just took a quick look at one of your Ryzens (computer 5741129) and guesstimate that about 60 tasks/day would keep each core busy, excluding noise bombs.

So the next question has to be: why are there not enough tasks being cached for the CPUs?


I don't know how you came up with 60 tasks a day on that host. BoincTasks says that host does 576 cpu tasks a day.

Generally I do a cpu task in 25-45 minutes for most work. I have 16 threads running cpu work.

I don't generally reschedule gpu work to the cpu. I just crunch through my allotted 150 tasks and then the cpu goes idle for the rest of the outage. I only do cpu work for Seti.
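The two figures in this post can be cross-checked: 576 CPU tasks a day spread over 16 threads is 36 tasks per thread per day, or an average of 40 minutes per task, which sits inside the quoted 25-45 minute range. A quick Python sketch of that arithmetic, assuming all 16 threads run CPU work around the clock (the function name is just for illustration):

```python
MINUTES_PER_DAY = 24 * 60

def avg_minutes_per_task(tasks_per_day: float, threads: int) -> float:
    # Average wall-clock minutes per task on one thread, assuming every
    # thread runs CPU work continuously all day.
    tasks_per_thread = tasks_per_day / threads
    return MINUTES_PER_DAY / tasks_per_thread

avg = avg_minutes_per_task(576, 16)
print(avg)  # 40.0
```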
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1646
Credit: 12,921,799
RAC: 89
New Zealand
Message 2032160 - Posted: 13 Feb 2020, 2:42:54 UTC - in response to Message 2032156.  

I am just guessing, Keith. I am thinking possibly a CPU slower by 600 MHz, and running Windows 10.
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2032161 - Posted: 13 Feb 2020, 2:51:08 UTC - in response to Message 2032160.  

Well, that host has the new Ryzen 3950X cpu. So 32 threads available, and I have the cores locked to 4.2 GHz and I run the memory at 3600 MHz CL14.

So it crunches through the cpu tasks the fastest of all my hosts. The second fastest is the 3900X with the same memory clock speed, but only running at 4.15 GHz and with only 12 threads running cpu work.

That one is currently in pieces on the kitchen table getting a custom loop cooling upgrade so I can punch the clock up on it.

The Threadripper is next fastest and is mostly hamstrung by slower memory and slower clock that is autoboosted and variable. I'm waiting on a new hot-shirt cpu block for it so I can move to locked clocks on it which will improve the cpu processing time. Hoping to punch the memory clocks up too with a cooler cpu.
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19691
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2032165 - Posted: 13 Feb 2020, 3:13:11 UTC - in response to Message 2032156.  

I just took a quick look at one of your Ryzens (computer 5741129) and guesstimate that about 60 tasks/day would keep each core busy, excluding noise bombs.

So the next question has to be: why are there not enough tasks being cached for the CPUs?


I don't know how you came up with 60 tasks a day on that host. BoincTasks says that host does 576 cpu tasks a day.

Generally I do a cpu task in 25-45 minutes for most work. I have 16 threads running cpu work.

I don't generally reschedule gpu work to the cpu. I just crunch through my allotted 150 tasks and then the cpu goes idle for the rest of the outage. I only do cpu work for Seti.

I did say each core.

Surely the limits set by Seti are per processor, not per type of processor.

So I must ask again: why is your cpu cache not storing enough to last a 12-hour outage when, by your own numbers, each core/thread is averaging 36 tasks/day?

At a max cache/processor of 100, a number I think is now out of date, you should be able to cache nearly three days of cpu work. Assuming no -9's.
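The "nearly three days" estimate follows directly from the numbers quoted, but only under this post's premise that the cache limit is 100 tasks per core/thread (a figure the post itself flags as possibly out of date). A minimal Python sketch of that arithmetic, with an illustrative function name:

```python
def cache_days(cache_per_thread: float, tasks_per_thread_per_day: float) -> float:
    # Days of work a cache lasts IF the limit applied per core/thread
    # (the premise of this post, not a confirmed server rule).
    return cache_per_thread / tasks_per_thread_per_day

tasks_per_thread_per_day = 576 / 16   # 36.0, from the BoincTasks figure
days = cache_days(100, tasks_per_thread_per_day)
print(f"{days:.2f} days")  # 2.78 days
```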
AllgoodGuy
Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2032182 - Posted: 13 Feb 2020, 6:29:22 UTC - in response to Message 2032165.  
Last modified: 13 Feb 2020, 6:37:07 UTC

At a max cache/processor of 100, a number I think is now out of date, you should be able to cache nearly three days of cpu work. Assuming no -9's.


There we go using common sense again :)

By your maths, I should be able to store about 1500 GPU tasks to keep me purring along, although I wish I could get a cache of about 150-200 opencl_ati_mac AP files a day as a decent replacement.

Edit:
Next you'll be saying we should no longer crunch AP on CPU because it is too inefficient, and has too many casual users who don't devote their machines to the extent full time crunchers do. But hey, it is a good hobby.
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2032183 - Posted: 13 Feb 2020, 6:35:10 UTC - in response to Message 2032165.  

I did say each core.
Surely the limits set by Seti are per processor, not per type of processor.


. . The limit is per CPU not per core. So we only get 150 WUs whether you have 4 cores or 64 cores.

Stephen

< shrug >
AllgoodGuy
Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2032186 - Posted: 13 Feb 2020, 6:49:15 UTC - in response to Message 2032183.  
Last modified: 13 Feb 2020, 6:52:24 UTC

The limit is per CPU not per core. So we only get 150 WUs whether you have 4 cores or 64 cores.


See...this is where we are messing up. We could create a SETI@home Pro, as a tiered level system.
Free just as it currently is.
$5/mo for 2 days advanced work
$10/mo for 3
up to $50 for 10.

Then an additional $5/mo for customizations: pick and choose which data types to work with for each machine.

Edit: Ok, I'll stop with the comedy.
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2032187 - Posted: 13 Feb 2020, 6:49:59 UTC - in response to Message 2032165.  

As Stephen mentioned, each host no matter how many cpus or cores gets 150 cpu tasks. That's it. That is all. No more. So you can never carry more than the server allotment of 150 on any host.

So I start the outage with 150 tasks and retire 24 tasks per hour. So I am out of cpu tasks in 6 1/4 hours. And that is if I get no overflows or shorties with high angle range which can be finished in 20 minutes or less.

Since our outages lately have gone on for 12 hours or more before you actually start getting replacement work, the Ryzen cpus sit cold for several hours with no work.
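Keith's arithmetic can be sketched in Python as a simple steady-rate model. It takes the 150-task per-host limit and the 24 tasks/hour rate from the post and uses 12 hours (the low end of "12 hours or more") as the outage length; overflows and shorties are ignored, and they would only make the cache run dry sooner. Function names are illustrative.

```python
def hours_until_dry(cache_tasks: int, tasks_per_hour: float) -> float:
    # How long a full cache lasts at a steady completion rate.
    return cache_tasks / tasks_per_hour

def idle_hours(outage_hours: float, cache_tasks: int, tasks_per_hour: float) -> float:
    # Hours the CPUs sit idle if no new work arrives before the cache runs dry.
    return max(0.0, outage_hours - hours_until_dry(cache_tasks, tasks_per_hour))

dry = hours_until_dry(150, 24)      # 150-task per-host limit, 24 tasks/hour
idle = idle_hours(12, 150, 24)      # a 12-hour outage
print(dry, idle)  # 6.25 5.75
```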
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19691
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2032189 - Posted: 13 Feb 2020, 7:03:31 UTC - in response to Message 2032187.  

As Stephen mentioned, each host no matter how many cpus or cores gets 150 cpu tasks. That's it. That is all. No more. So you can never carry more than the server allotment of 150 on any host.

So I start the outage with 150 tasks and retire 24 tasks per hour. So I am out of cpu tasks in 6 1/4 hours. And that is if I get no overflows or shorties with high angle range which can be finished in 20 minutes or less.

Since our outages lately have gone on for 12 hours or more before you actually start getting replacement work, the Ryzen cpus sit cold for several hours with no work.

I thought it was per core.

I can only conclude the system is illogical.
AllgoodGuy
Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2032192 - Posted: 13 Feb 2020, 7:35:06 UTC - in response to Message 2032189.  

I can only conclude the system is illogical.


I get all tingly inside when I see deductive humor.
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19691
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2032193 - Posted: 13 Feb 2020, 8:17:49 UTC - in response to Message 2032192.  
Last modified: 13 Feb 2020, 8:18:25 UTC

I can only conclude the system is illogical.


I get all tingly inside when I see deductive humor.

LMAO.

Next step in the conclusion.
The special sauce needs more ingredients, so that each core or thread is translated into a CPU when asking for work.
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2032195 - Posted: 13 Feb 2020, 9:00:25 UTC - in response to Message 2032193.  

I can only conclude the system is illogical.


I get all tingly inside when I see deductive humor.

LMAO.

Next step in the conclusion.
The special sauce needs more ingredients, so that each core or thread is translated into a CPU when asking for work.

We already tried. The scheduler wants nothing to do with spoofed cpus. A host can only have one cpu, so it only gets 150 tasks.

The only way around the issue is to reschedule gpu tasks to the cpu.
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19691
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2032196 - Posted: 13 Feb 2020, 9:12:07 UTC - in response to Message 2032195.  

I can only conclude the system is illogical.


I get all tingly inside when I see deductive humor.

LMAO.

Next step in the conclusion.
The special sauce needs more ingredients, so that each core or thread is translated into a CPU when asking for work.

We already tried. The scheduler wants nothing to do with spoofed cpus. A host can only have one cpu, so it only gets 150 tasks.

The only way around the issue is to reschedule gpu tasks to the cpu.

I had a dual-CPU computer on here at one time and I'm pretty sure the cache was per CPU; that's why I'm surprised the cache is not per core or thread.

That computer isn't listed here these days, but it is still in the list at Beta https://setiweb.ssl.berkeley.edu/beta/show_host_detail.php?hostid=6318


 
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.