The Server Issues / Outages Thread - Panic Mode On! (118)


Profile Eric B

Send message
Joined: 9 Mar 00
Posts: 88
Credit: 168,875,085
RAC: 762
United States
Message 2033304 - Posted: 21 Feb 2020, 13:44:55 UTC

I'm afraid this is going to recur over and over until they reach out for help to someone who deals with very large and very busy databases (Google, for example). No amount of shutting down and letting things catch up is going to solve this problem. As far as I know they haven't even been able to root-cause the issue.
ID: 2033304 · Report as offensive
Profile Siran d'Vel'nahr
Volunteer tester
Avatar

Send message
Joined: 23 May 99
Posts: 7379
Credit: 44,181,323
RAC: 238
United States
Message 2033307 - Posted: 21 Feb 2020, 14:54:25 UTC - in response to Message 2033303.  

I'd almost be at the point that they should think about shutting it down for a few days to clear that backlog problem.
The problem would then immediately reappear when all the starved hosts are reporting the millions of tasks they crunched during the long outage and asking millions of new tasks to fill their caches.

Hi Ville,

Something else that could help is to end the spoofing. If a host has 4 GPUs, it should only get WUs for those 4 GPUs. No more editing software to spoof that a host has 15, 20, 30 or more GPUs when it actually has no more than 8.

My main is OUT of WUs now and probably stands little chance of getting any in the foreseeable future. My other hosts are getting dangerously low as well. :(

Have a great day! :)

Siran
CAPT Siran d'Vel'nahr - L L & P _\\//
Winders 11 OS? "What a piece of junk!" - L. Skywalker
"Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath
ID: 2033307 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033309 - Posted: 21 Feb 2020, 15:17:15 UTC - in response to Message 2033307.  

Something else that could help is to end the spoofing. If a host has 4 GPUs they should only get WUs for those 4 GPUs. No more editing software to spoof that a host has 15, 20, 30 or more GPUs when they have no more than 8.
Spoofing has actually allowed me to be nice to the servers and other users. I reduce my fake GPU count when the Tuesday outage starts, so that when the outage ends I'm still above my new cap and am only reporting results, not competing with the other hosts for new tasks. When my host finally starts asking for new tasks, it asks for only a few at a time, matching the number it just reported. And by then, the post-outage congestion is already over.

Also, I have configured my computers to report at most 100 results per scheduler request, so they aren't flooding the server with a ridiculous bomb after the outage.
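For anyone wanting to do the same without a modified client: if memory serves, the stock BOINC client has a cc_config.xml option for exactly this cap. The option name below is from the standard client configuration; treat it as an assumption and check your client's docs, since spoofed builds may handle it differently.

```xml
<!-- cc_config.xml (in the BOINC data directory):
     cap completed-task reports at 100 per scheduler RPC -->
<cc_config>
  <options>
    <max_tasks_reported>100</max_tasks_reported>
  </options>
</cc_config>
```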
ID: 2033309 · Report as offensive
Ian&Steve C.
Avatar

Send message
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2033313 - Posted: 21 Feb 2020, 15:48:50 UTC
Last modified: 21 Feb 2020, 15:49:30 UTC

The issue seems to be with the assimilators and validators. Their queues have been steadily increasing; they aren't able to work at the rate that work is being returned, which causes the backlog. In the past they had issues with the deleters and db purgers backing up like this, but they seem to have gotten a handle on those; they haven't been backing up. They really just need better hardware.

Oh well, as is usual now, when my systems run out of work they will shift over to Einstein as a backup. No worries.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2033313 · Report as offensive
pututu

Send message
Joined: 21 Jul 16
Posts: 12
Credit: 10,108,801
RAC: 6
United States
Message 2033315 - Posted: 21 Feb 2020, 16:05:34 UTC - in response to Message 2033313.  

....they really just need better hardware...


These are really powerful systems that they are currently running ¯\_(ツ)_/¯

Hosts
bruno: Intel Server (2 x 2.66GHz Xeon, 8 GB RAM)
carolyn: Intel Server (2 x quad-core 2.4GHz Xeon, 96 GB RAM)
centurion: Intel Server (2 x hexa-core 3.4GHz Xeon, 512 GB RAM)
georgem: Intel Server (2 x hexa-core 3.07GHz Xeon, 96 GB RAM)
khan: Intel Server (2 x 3.0GHz Xeon, 32 GB RAM)
lando: Intel Server (2 x quad-core 2.4GHz Xeon, 12 GB RAM)
marvin: Intel Server (2 x 2.66GHz Xeon, 16 GB RAM)
oscar: Intel Server (2 x quad-core 2.4GHz Xeon, 96 GB RAM)
paddym: Intel Server (2 x hexa-core 3.07GHz Xeon, 132 GB RAM)
synergy: Intel Server (2 x hexa-core 2.53GHz Xeon, 96 GB RAM)
muarae1: Intel Server (2 x hexa-core 3.07GHz Xeon, 76 GB RAM)
muarae4: Intel Server (2 x hexa-core 3.07GHz Xeon, 76 GB RAM)
thumper: Sun Fire X4500 (2 x dual-core 2.6GHz Opteron, 16 GB RAM)
vader: Intel Server (2 x dual-core 3GHz Xeon, 32 GB RAM)
ID: 2033315 · Report as offensive
Profile Freewill Project Donor
Avatar

Send message
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2033316 - Posted: 21 Feb 2020, 16:13:31 UTC - in response to Message 2033307.  

Hi Siran,
The project has far more work than can be processed. Even with current computing power, the database and servers cannot handle the load during this "steady state" period between outages. The spoofing just helps fast PCs keep processing during an outage. At times like today, I don't think it's a problem, other than adding to the results out in the field, and that number is only about 1/3 of the results returned and awaiting validation. I get no more priority on task downloads than you do on a Pi system. Every 5 minutes, we each get a shot.

You may recall when they increased the tasks per CPU and GPU from 100 to 300(?), the system jammed up. With moderate GPUs even that is not a lot. They need to find and address the root cause. Various solutions have been suggested and I'm sure they've considered all of them. I'm ready to contribute some $ if they'll just tell us what they need.

Roger
ID: 2033316 · Report as offensive
Ian&Steve C.
Avatar

Send message
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2033320 - Posted: 21 Feb 2020, 16:40:11 UTC - in response to Message 2033315.  

....they really just need better hardware...


These are really powerful systems that they are currently running ¯\_(ツ)_/¯

Hosts
bruno: Intel Server (2 x 2.66GHz Xeon, 8 GB RAM)
carolyn: Intel Server (2 x quad-core 2.4GHz Xeon, 96 GB RAM)
centurion: Intel Server (2 x hexa-core 3.4GHz Xeon, 512 GB RAM)
georgem: Intel Server (2 x hexa-core 3.07GHz Xeon, 96 GB RAM)
khan: Intel Server (2 x 3.0GHz Xeon, 32 GB RAM)
lando: Intel Server (2 x quad-core 2.4GHz Xeon, 12 GB RAM)
marvin: Intel Server (2 x 2.66GHz Xeon, 16 GB RAM)
oscar: Intel Server (2 x quad-core 2.4GHz Xeon, 96 GB RAM)
paddym: Intel Server (2 x hexa-core 3.07GHz Xeon, 132 GB RAM)
synergy: Intel Server (2 x hexa-core 2.53GHz Xeon, 96 GB RAM)
muarae1: Intel Server (2 x hexa-core 3.07GHz Xeon, 76 GB RAM)
muarae4: Intel Server (2 x hexa-core 3.07GHz Xeon, 76 GB RAM)
thumper: Sun Fire X4500 (2 x dual-core 2.6GHz Opteron, 16 GB RAM)
vader: Intel Server (2 x dual-core 3GHz Xeon, 32 GB RAM)


In terms of modern server hardware, this stuff is ancient, probably ~10 years old. I mean, several users here are running more capable systems. Dual quad-core and dual hex-core systems are not impressive anymore.

And many of them are maxed out, with no ability to upgrade short of a full platform change. They need more cores and, in general, more modern platforms to handle more I/O. Theoretically they could replace all of these systems with just a couple of AMD Epyc based servers. A full overhaul is the "best" solution, but also the most costly and time consuming, and time and money are in short supply over there from what it seems.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2033320 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033322 - Posted: 21 Feb 2020, 17:03:06 UTC

Spoofing has no impact on database size whatsoever as long as the hosts using it are fast enough to process their spoofed cache faster than their wingmen. The result wouldn't go any further in the pipe before the wingman has returned his result anyway.

I have observed the effect of varying my fake GPU count on the number of tasks the web site lists for my host, and it has no effect on the total unless I go way further than I usually do. Changing the GPU count just moves tasks between the 'in progress' and 'validation pending' states.
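Ville's accounting argument can be put as a toy model (the numbers below are made up for illustration, not measured): a workunit stays in the pipeline until the slower wingman returns, so a fast host holding a bigger spoofed cache doesn't change how long the row lives in the database.

```python
# Toy model of the point above: a workunit can't leave the pipeline
# before BOTH wingmen have returned it, so a fast spoofed host holding
# tasks longer only moves its copies between "in progress" and
# "validation pending". Hours below are illustrative assumptions.
def db_lifetime(fast_return_h, wingman_return_h):
    # residency is governed by the slower of the two returns
    return max(fast_return_h, wingman_return_h)

wingman = 72.0  # slow wingman returns after 72 h
small_cache = db_lifetime(fast_return_h=2.0, wingman_return_h=wingman)
big_cache = db_lifetime(fast_return_h=20.0, wingman_return_h=wingman)
print(small_cache == big_cache)  # True: the cache size didn't matter
```

The equality only breaks once the fast host's cache grows so large that it returns *after* its wingmen, which matches the "unless I go way further than I usually do" observation.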
ID: 2033322 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033325 - Posted: 21 Feb 2020, 17:12:19 UTC - in response to Message 2033320.  

Theoretically they could replace all of these systems with just a couple AMD Epyc based servers.
A single modern dual socket Epyc server can have more cores than all those listed servers combined! There are even many single core chips - those must be really ancient.
ID: 2033325 · Report as offensive
Ian&Steve C.
Avatar

Send message
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2033326 - Posted: 21 Feb 2020, 17:25:26 UTC - in response to Message 2033325.  

Theoretically they could replace all of these systems with just a couple AMD Epyc based servers.
A single modern dual socket Epyc server can have more cores than all those listed servers combined! There are even many single core chips - those must be really ancient.


Indeed. By my count it's <60 cores and ~1 TB of RAM total. You can do that in a SINGLE socket Epyc board! 64 cores, 128 threads, MUCH better IPC, and 1-2 TB of faster DDR4 memory.

But it's probably best to spread it out over at least a couple of systems, to reduce sources of bottlenecks (network connectivity, disk I/O, etc.) and to not have all your eggs in one basket, so to speak, in case a hardware issue takes down the whole project, lol.

This stuff isn't cheap, but we can dream. The point is, even if they upgrade to more modern setups, not necessarily bleeding edge, they will be a lot better off. Intel Xeon E5-2600 v2 chips can be had cheaply and are available in up to 12c/24t parts, and registered ECC DDR3 RAM is cheap and plentiful. Even a meager upgrade like that on some key systems would go a LONG way.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2033326 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22674
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2033327 - Posted: 21 Feb 2020, 17:30:11 UTC - in response to Message 2033316.  

While it MAY have more work than can be processed (a claim for which there is NO evidence), if there is a problem delivering that work to the users then it makes no sense to attempt to grab all one can, and so turn the average user away because they can't get work, due to the greed of a very vocal minority.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2033327 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033328 - Posted: 21 Feb 2020, 17:32:46 UTC - in response to Message 2033326.  

indeed. by my count its <60Cores and ~1TB RAM total. you can do that in a SINGLE socket Epyc board!
By my math those are 110 cores. Note that all the listed servers are dual socket ones.
ID: 2033328 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51510
Credit: 1,018,363,574
RAC: 1,004
United States
Message 2033329 - Posted: 21 Feb 2020, 17:41:12 UTC - in response to Message 2033327.  
Last modified: 21 Feb 2020, 17:43:02 UTC

While it MAY have more work than can be processed (a claim for which there is NO evidence), if there is a problem delivering that work to the users then it makes no sense to attempt to grab all one can, and so turn the average user away because they can't get work, due to the greed of a very vocal minority.

I think everybody has about the same odds of hitting the servers when there is work in the RTS queue to hand out.
I am far short of having a full cache, and most work requests are getting the 'project has no tasks available' response.
But about 20 minutes ago I got a 36-task hit to keep my cruncher going.
This does not help those who have mega-crunchers very much.
So, work is going out and being returned.
Wish things were better, but it is what it is.

Meow.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 2033329 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033330 - Posted: 21 Feb 2020, 17:51:58 UTC - in response to Message 2033327.  
Last modified: 21 Feb 2020, 17:53:22 UTC

While it MAY have more work than can be processed (a claim for which there is NO evidence), if there is a problem delivering that work to the users then it makes no sense to attempt to grab all one can, and so turn the average user away because they can't get work, due to the greed of a very vocal minority.
Every host has an equal chance to get tasks in a scheduler request. In throttled situations like this, it's the fast ones who suffer, because they need more work to keep running but only get the same trickle that everyone gets.

If I wanted to grab an unfair share of the work, I wouldn't be spoofing my GPU count but running multiple instances of unmodified BOINC instead. That would allow me to spam scheduler requests more frequently and grab many computers' share for one computer. I actually considered that when I had bought my current GPU and started to have issues 'surviving' the Tuesday downtimes, as that trick would also have allowed me to multiply my cache size. But that would have been too dirty a trick for my taste, so I modified my client to report imaginary GPUs instead.
ID: 2033330 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22674
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2033332 - Posted: 21 Feb 2020, 18:01:27 UTC

NO - if the serving pot were open then that would be true. But there is a limit of 200 tasks in the pot, and if ONE cruncher grabs 100 of them there are fewer left for anyone else coming along after. And when the pot is empty there is a pause in delivery while it is refilled - which is why we see so many "project has no tasks" messages, even when there are thousands apparently available in the RTS.
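The pot behaviour described above can be sketched as a toy simulation. This is NOT the real BOINC feeder (its slot count and refill logic differ); the 200-task limit is taken from the post, and the refill rate and host appetites are assumptions chosen so that demand outruns the refill.

```python
# Toy model of a limited "pot": a buffer of at most POT_SIZE tasks is
# topped up from the ready-to-send (RTS) queue once per pass, while hosts
# with very different appetites drain it.
POT_SIZE = 200         # slot limit quoted in the post above
REFILL_PER_PASS = 50   # assumed refill rate; demand below exceeds it
rts_queue = 10_000     # thousands "apparently available" in RTS
pot = POT_SIZE
empty_replies = 0      # "project has no tasks available" responses

for _ in range(50):
    # refill the pot from the RTS queue, up to its slot limit
    take = min(REFILL_PER_PASS, POT_SIZE - pot, rts_queue)
    pot += take
    rts_queue -= take
    # one big cruncher asks for 100 tasks, then four small hosts ask for 4
    for request in (100, 4, 4, 4, 4):
        granted = min(request, pot)
        pot -= granted
        if granted == 0:
            empty_replies += 1

print(empty_replies)  # > 0: small hosts hit an empty pot despite a full RTS
```

Even with thousands of tasks still sitting in the simulated RTS queue, the small hosts regularly find the pot empty once the big request has drained it, which is the behaviour the post describes.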
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2033332 · Report as offensive
Profile Freewill Project Donor
Avatar

Send message
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2033333 - Posted: 21 Feb 2020, 18:11:13 UTC - in response to Message 2033327.  

While it MAY have more work than can be processed (a claim for which there is NO evidence), if there is a problem delivering that work to the users then it makes no sense to attempt to grab all one can, and so turn the average user away because they can't get work, due to the greed of a very vocal minority.

I seem to recall from another thread that SAH is only taking a few percent of the Breakthrough Listen data from Green Bank. That's my evidence. Plus, since I've been here, we have never run out of tapes that I recall. Regardless, the servers cannot dish out the stack of tapes they have loaded, since everyone's caches are dropping and I see plenty of tapes mounted and unprocessed.
ID: 2033333 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14687
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2033334 - Posted: 21 Feb 2020, 18:12:09 UTC - in response to Message 2033332.  

NO - if the serving pot was open then that would be true, but there is a limit of 200 tasks in the pot, and if ONE cruncher grabs 100 of them there are fewer left for anyone else coming along after, and when the pot is empty there is a pause in delivery while it is refilled - which is why we see so many "project has no tasks" messages, even when there are thousands apparently available in the RTS.
I think it's also worth ensuring that your work request is as 'quick to process' as possible. I happen to have seen that this afternoon.

Fast cruncher had run itself dry while I was out:
21/02/2020 17:02:49 | SETI@home | [sched_op] NVIDIA GPU work request: 88128.00 seconds; 0.00 devices
21/02/2020 17:02:52 | SETI@home | Scheduler request completed: got 0 new tasks
So I turned down the work cache from 0.5 days to 0.05 days:
21/02/2020 17:24:24 | SETI@home | [sched_op] NVIDIA GPU work request: 10368.00 seconds; 0.00 devices
21/02/2020 17:24:26 | SETI@home | Scheduler request completed: got 96 new tasks
21/02/2020 17:24:26 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 5039 seconds
If it takes too long to carry out all the checks (has any of your other computers acted as wingmate on this WU?), the available tasks are likely to have been grabbed by a more agile computer while your particular scheduler instance is still thinking about it.
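A rough back-of-envelope of how the cache setting scales the request size. This is a simplification of the real client logic (which also accounts for work on hand, extra-days settings and resource shares); the 2-instance GPU host and empty cache are assumptions, chosen to land in the same ballpark as the logged 88128- and 10368-second requests.

```python
# Simplified sketch (not the actual BOINC client algorithm): the client
# requests enough estimated runtime seconds to top its cache up to target.
def work_request_seconds(cache_days, n_instances, buffered_seconds):
    target = cache_days * 86400 * n_instances  # 86400 s per day
    return max(0.0, target - buffered_seconds)

# Dry host with 2 GPU instances and a 0.5-day cache: a big request
print(work_request_seconds(0.5, 2, 0))   # 86400.0
# Same host with the cache turned down to 0.05 days: a much smaller one
print(work_request_seconds(0.05, 2, 0))  # 8640.0
```

A tenfold smaller cache means a roughly tenfold smaller request, which is why the turned-down request above got serviced while the big one timed out against the scheduler's checks.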
ID: 2033334 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033336 - Posted: 21 Feb 2020, 18:23:15 UTC - in response to Message 2033250.  

More of a whine/observation/OCD thing, but why have there been 5 'chunks' of data waiting to be processed since late 2019? Would flushing everything out of the repositories have a potential cleansing effect? I'm no DB or systems design guy, just wondering... it has come close a few times, only to have another couple of days of data pushed in front - like today.


. . There are 4 'tapes' that have been sitting on the splitters since October 2018 but never split despite being the oldest tapes mounted. In the last couple of months there is another tape that has joined this group so now there are 2 x Blc22, 2 x Blc34 and 1 x Blc62 tapes that are very old but will not split. I have no idea if the reason these tapes will not split has anything to do with the general malaise that is affecting the splitters and other functions. But I would still like to see them either kicked off to split or just kicked off if the data is faulty.

Stephen

< shrug >
ID: 2033336 · Report as offensive
Ian&Steve C.
Avatar

Send message
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2033338 - Posted: 21 Feb 2020, 18:29:57 UTC - in response to Message 2033328.  
Last modified: 21 Feb 2020, 18:32:57 UTC

indeed. by my count its <60Cores and ~1TB RAM total. you can do that in a SINGLE socket Epyc board!
With my math those are 110 cores. Note that all the listed servers are dual socket ones.

Edit - whoops, I did forget to double it. Yes indeed, a dual socket 64-core Epyc system would have more cores, all in one box.

I still think it's wiser to spread it across 2-3 systems, for the reasons mentioned previously.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2033338 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14687
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2033340 - Posted: 21 Feb 2020, 18:30:59 UTC - in response to Message 2033336.  

I just think the 'what tape shall I run next?' algorithm is running LIFO instead of FIFO.
ID: 2033340 · Report as offensive

 