The Server Issues / Outages Thread - Panic Mode On! (118)

juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2033341 - Posted: 21 Feb 2020, 18:32:20 UTC - in response to Message 2033332.  

NO - if the serving pot was open then that would be true, but there is a limit of 200 tasks in the pot, and if ONE cruncher grabs 100 of them there are fewer left for anyone else coming along after, and when the pot is empty there is a pause in delivery while it is refilled - which is why we see so many "project has no tasks" messages, even when there are thousands apparently available in the RTS.

Why not simply increase the size of this pot, make it refill intelligently and continuously, and do away with the fixed size?

IMHO the real problem is that SETI (and BOINC itself) is obsolete, in both software and hardware, compared with the hardware and software the volunteers use today. Trying to serve very slow devices (which take days to crunch a single WU) alongside superfast devices (a few of them spoofed, crunching a WU in a couple of seconds) is the real source of all our constant problems. Releasing untested changes to the servers' WU limits just makes everything worse.

A quick look at the SSP shows: Results returned and awaiting validation 0 / 35,474 / 14,150,778.
Why is this number so high? Surely not because of the superfast or spoofed hosts. It comes from the slow hosts (the vast majority of hosts) and the long WU deadline.

Short of the impossible solution of buying new hardware (lots of $ and time), the only real fix is to reduce the WU limits on the slow devices even further and drastically shorten the WU deadlines. It makes little or no sense to allow a host that produces a couple of WUs per day to hold a 150 WU buffer. The number of WUs available to each host should be limited to its capacity to crunch and return valid WUs; maybe one day's worth would be enough. Please remember... I'm talking about thousands of hosts that DL large caches and will never return them on time.

So please stop blaming the maybe 30-40 hosts that run spoofed clients and DL and return their WUs within a day or two. They actually help to clear the DB.

my 0.02
ID: 2033341 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2033342 - Posted: 21 Feb 2020, 18:38:47 UTC - in response to Message 2033336.  

More of a whine / observation / OCD thing, but why are there 5 'chunks' of data that have been waiting to be processed since late 2019? Would flushing everything out of the repositories have a potential cleansing effect? I'm no DB or systems design guy, just wondering... it has come close a few times, only to have another couple of days of data pushed in front, like today.


. . There are 4 'tapes' that have been sitting on the splitters since October 2018 but have never split, despite being the oldest tapes mounted. In the last couple of months another tape has joined this group, so there are now 2 x Blc22, 2 x Blc34 and 1 x Blc62 tapes that are very old but will not split. I have no idea whether the reason these tapes will not split has anything to do with the general malaise affecting the splitters and other functions, but I would still like to see them either kicked off to split, or simply kicked off if the data is faulty.

Stephen

< shrug >


Before we got the new blc73s I had hope that we would split the old files in the queue, but alas that was not to be. I don't think files sitting in the to-be-split queue affect the process much; I don't think we are near any "limit" on that queue, as I've seen it much longer in the past. There is a possibility that files which started splitting but were left hanging as the splitters moved on to new files might be an issue, but I have no idea. They did increase the number of splitter processes a couple of months ago: there are now 16 processes working on 11 files versus the previous 14 processes working on 9 files.

I have no idea how the splitters choose which file to work on next. It doesn't have a pattern anymore.

My opinion, based on nothing, is that they don't watch the splitting queue that much. We usually have to let them know when files have gotten stuck splitting. They just don't monitor it that closely. My guess is that they have no idea that there are 5 "stale" files hanging out in the queue. I truly think that there is no problem caused by the stale files, but I will admit it still bothers me.
ID: 2033342 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033344 - Posted: 21 Feb 2020, 18:40:49 UTC - in response to Message 2033309.  

Something else that could help is to end the spoofing. If a host has 4 GPUs they should only get WUs for those 4 GPUs. No more editing software to spoof that a host has 15, 20, 30 or more GPUs when they have no more than 8.
Spoofing has actually allowed me to be nice to the servers and other users. I reduce my fake GPU count when the Tuesday outage starts, so that when the outage ends I'm still above my new cap and am only reporting results, not competing with the other hosts for new tasks. When my host finally starts asking for new tasks, it only asks for a few at a time, matching the number it just reported. And by the time that happens, the post-outage congestion is already over.
I have also configured my computers to report at most 100 results per scheduler request, so that they aren't flooding the server with a ridiculous bomb after the outage.


. . Or do as many others do and set No New Tasks until the reporting crush has finished and the servers are getting back on their feet; these days that takes several hours. Either way the benefit is minimal, because the majority of volunteers do none of the above, and so the problem persists. As far as spoofing goes, while telling the system that 2 GPUs are really 64 GPUs is ridiculously extreme, most of those spoofing such high numbers do have a lot of GPUs, 6 to 8 physical units. And the total impact of this practice is also quite minimal, because those spoofing are a tiny, tiny fraction of the overall volunteers. But that is just my 2c worth.

Stephen

< shrug >
ID: 2033344 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2033346 - Posted: 21 Feb 2020, 18:46:42 UTC - in response to Message 2033341.  
Last modified: 21 Feb 2020, 18:47:01 UTC



So please stop blaming the maybe 30-40 hosts that run spoofed clients and DL and return their WUs within a day or two. They actually help to clear the DB.

my 0.02

+1

If there is a limit of 10 days' worth of WUs one can have (assuming you don't hit the #/CPU & #/GPU limit first), why can't they shorten the due date to 3 weeks? I think this would get the WUs that are ghosted, or stuck on machines that have left the project, back into circulation quicker. Some of my WUs have a due date 8 weeks out.

Would this be an issue for those running multiple projects?? Is there some unintended consequence I'm overlooking??
ID: 2033346 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2033348 - Posted: 21 Feb 2020, 19:09:13 UTC - in response to Message 2033346.  

Would this be an issue for those running multiple projects?? Is there some unintended consequence I'm overlooking??

No, none. Project deadlines only apply within each individual project. The effect if Seti shortened its deadlines would be that hosts would do Seti work more quickly, and Seti tasks would not get preempted as much by other projects' shorter-deadline tasks. Seti has longer deadlines than the majority of projects, apart from some outliers like WCG and Climate Prediction, which have almost year-long deadlines.
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2033348 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51580
Credit: 1,018,363,574
RAC: 1,004
United States
Message 2033349 - Posted: 21 Feb 2020, 19:11:43 UTC

The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project.
As it should be.

Meow.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 2033349 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033351 - Posted: 21 Feb 2020, 19:23:32 UTC - in response to Message 2033327.  

While it MAY have more work than can be processed (a claim for which there is NO evidence), if there is a problem delivering that work to the users then it makes no sense to attempt to grab all one can, and so turn the average user away because they can't get work, due to the greed of a very vocal minority.


. . Sorry Rob, but I have to disagree with you there. We have seen on a couple of occasions that if we get all the data that Greenbank produces in one day, our whole setup is busy crunching for several months. The daily output from one GBT recorder takes several days to get through, and there are 64 potential recorders we could process. If/when we finally start processing data from Parkes (to which I am looking forward), that output will double. While our processing capacity has grown manifold over the decades this program has been operating, so has the data we need to process. If our daily capacity grew 10-fold, we would still be lagging behind the data available. One thing in our favour is that we do not see all the data from every day; we only get some of the output from each observatory, and so far none from Parkes (had to throw that in). Eric and David have both stressed the need for more volunteers, and they ain't lying.

. . As far as being greedy goes, since when is it greedy to want to do work for somebody else at no charge, despite the cost to yourself? And these spoofing hosts you seem to feel are denying you the work you want are simply trying to keep their supply of work up to the level of productivity their machines are capable of. I would call that efficiency, not greed, and working towards the processing power the project requires. The issue is that the crunching power of the volunteer force has grown beyond the capacity of the server farm. While there is at least one very powerful new server under test, it appears there are problems delaying its deployment to main. And there is a new data storage unit under construction that will be many times larger, and several times faster, than what is currently available, but that could still be a long way from being deployed. The whole issue has been seriously aggravated by recent large runs of noisy data, such as the Blc35 and Blc41 series, which have in large part been about 90% noise bombs. The servers are choking on the output of a far greater crunching force than has ever worked on the project before. There are probably also other issues biting at the ankles of the system since the unfortunate BOINC upgrade that kicked off most of these problems before Christmas and required a 'roll back', which I have no doubt has left several time bombs in the machinery that could be part of the problem as well. The cure is a matter of time and money, and I cannot see the boffins at Berkeley actually asking all the volunteers to do less work.

. . Sorry but you tripped a trigger with that message ...

Stephen

< shrug >
ID: 2033351 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2033352 - Posted: 21 Feb 2020, 19:25:09 UTC - in response to Message 2033349.  
Last modified: 21 Feb 2020, 19:27:39 UTC

The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project.
As it should be.

Meow.

Nice to see you around.

Back to topic. Even the slowest computers, or devices like cell phones, Pis, etc., can crunch a WU in less than a month...
Then why keep a deadline of 2-3 months?
Remember the DB needs to keep track of the WU for all that time, and if at the end it still has not been returned: rinse & repeat, another 2-3 months?

My point is simple: drop the fixed limit of WUs per GPU/CPU and allow each host to DL only up to the number of WUs it crunches and returns valid in a specific period of time (1 day, for example). And shorten the deadlines. That will squeeze the DB down to a more manageable size for sure.

Desperate times need desperate measures...
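To make that allowance idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the function name, the HostStats record, the one-day target and the clamping bounds are invented, and none of it is actual BOINC scheduler code.

# Hypothetical sketch: size a host's in-progress allowance from its own
# recently proven throughput instead of a fixed per-CPU/per-GPU cap.
from dataclasses import dataclass

@dataclass
class HostStats:
    valid_results_last_day: int  # WUs returned AND validated in the last 24 h

def allowed_in_progress(stats: HostStats,
                        target_days: float = 1.0,
                        floor: int = 10,
                        ceiling: int = 600) -> int:
    """Allow roughly `target_days` worth of work based on proven throughput,
    clamped so new or erratic hosts still get a small, sane buffer."""
    allowance = int(stats.valid_results_last_day * target_days)
    return max(floor, min(ceiling, allowance))

print(allowed_in_progress(HostStats(valid_results_last_day=2)))    # slow host -> 10
print(allowed_in_progress(HostStats(valid_results_last_day=500)))  # fast rig -> 500

Under this kind of rule a host returning a couple of valid WUs a day holds only a small buffer, while a fast GPU rig returning hundreds gets a correspondingly larger one, with no client-side spoofing needed.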
ID: 2033352 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51580
Credit: 1,018,363,574
RAC: 1,004
United States
Message 2033353 - Posted: 21 Feb 2020, 19:31:13 UTC

It would be grand if the project could meter work allocation out to the host computers based on their ability to return processed work.
But that would require more programming and a lot more work on the project servers to figure out what to send or not send on every single work request.

Methinks the overhead would be too high to be worth it.

Meow.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 2033353 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033354 - Posted: 21 Feb 2020, 19:32:30 UTC - in response to Message 2033329.  

While it MAY have more work than can be processed (a claim for which there is NO evidence), if there is a problem delivering that work to the users then it makes no sense to attempt to grab all one can, and so turn the average user away because they can't get work, due to the greed of a very vocal minority.

I think everybody has about the same odds of hitting the servers when it has work in the RTS queue to hand out.
I am far short of having a full cache, and most work requests are getting the 'project has no tasks available' response.
But, about 20 minutes ago I got a 36 task hit to keep my cruncher going.
This does not help those who have mega-crunchers very much.
So, work is going out and being returned.
Wish things were better, but it is what it is.


Meow.


. . Exactly, we are all in the same boat ...

Stephen

< shrug >
ID: 2033354 · Report as offensive
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2033355 - Posted: 21 Feb 2020, 19:32:40 UTC - in response to Message 2033349.  

The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project.
As it should be.

Meow.

but when the number of computers that need 4-6 weeks to complete 1 WU is almost non-existent (even an RPi can complete a WU in under a day), and given the impact this kind of setting is having on the project... it's probably best to do what's in the best interest of the project.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2033355 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033357 - Posted: 21 Feb 2020, 19:35:30 UTC - in response to Message 2033341.  
Last modified: 21 Feb 2020, 19:35:50 UTC

NO - if the serving pot was open then that would be true, but there is a limit of 200 tasks in the pot, and if ONE cruncher grabs 100 of them there are fewer left for anyone else coming along after,
Why not simply increase the size of this pot, make it refill intelligently and continuously, and do away with the fixed size?
The easiest solution would be a limit on the number of tasks given to any one host at a time. This limit could start out equal to the size of the pot, but whenever a request had to be offered 0 tasks because the pot was empty, the limit would be reduced a bit. And whenever there were still tasks left in the pot when it was refilled, the limit would be increased a bit.

So it would scale dynamically: when everything is fine, every request gets what it asks for, but in a throttled situation like today everyone would get a little bit with each request whenever there are tasks in the RTS queue, instead of the current situation where most requests get nothing and an occasional lucky one gets a lot.
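A minimal sketch of that self-tuning limit, in Python. The pot size of 200, the step sizes and the method names are arbitrary assumptions for illustration only; this is not real feeder/scheduler code.

# Toy model of an adaptive per-request task limit, as described above.
class AdaptiveFeederLimit:
    def __init__(self, pot_size: int = 200):
        self.pot_size = pot_size
        self.per_request_limit = pot_size  # start wide open

    def on_empty_reply(self):
        """A request got 0 tasks because the pot was empty: tighten the limit."""
        self.per_request_limit = max(1, self.per_request_limit - 10)

    def on_refill_with_leftovers(self):
        """The pot still held tasks when it was refilled: loosen the limit."""
        self.per_request_limit = min(self.pot_size, self.per_request_limit + 5)

    def grant(self, requested: int, tasks_in_pot: int) -> int:
        """Never hand one host more than the current limit or than the pot holds."""
        return min(requested, self.per_request_limit, tasks_in_pot)

Under heavy demand the limit would keep shrinking until most requests get a modest handful; once the pot stops running dry between refills, it drifts back towards "give each host whatever it asks for".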
ID: 2033357 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2033358 - Posted: 21 Feb 2020, 19:36:47 UTC - in response to Message 2033349.  

The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project.
As it should be.

Meow.


I totally agree on including as many people as possible! I just figured that any device could return a WU within 21 days. There is a limit of 10 days' worth of WUs (I'm assuming so, as this was a big deal when I was on my slower machine and I would hit that limit before the 100 WU limit; if this is no longer the case, let me know). If a machine takes longer than 21 days (my suggested timeout limit) to do 1 WU, would it then only get 1 WU at a time?

This brings up so many questions... what is the slowest system? how long does it take to do 1 WU??
ID: 2033358 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2033360 - Posted: 21 Feb 2020, 19:37:32 UTC - in response to Message 2033353.  
Last modified: 21 Feb 2020, 19:40:36 UTC

It would be grand if the project could meter work allocation out to the host computers based on their ability to return processed work.
But that would require more programming and a lot more work on the project servers to figure out what to send or not send on every single work request.

Methinks the overhead would be too high to be worth it.

Meow.

They actually compute that each day, for all hosts (from the slowest to the fastest one); just look at the stats...

So no extra load on the servers from doing this.
ID: 2033360 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033362 - Posted: 21 Feb 2020, 19:46:06 UTC - in response to Message 2033332.  

NO - if the serving pot was open then that would be true, but there is a limit of 200 tasks in the pot, and if ONE cruncher grabs 100 of them there are fewer left for anyone else coming along after, and when the pot is empty there is a pause in delivery while it is refilled - which is why we see so many "project has no tasks" messages, even when there are thousands apparently available in the RTS.


. . True, but that is because EVERY host has the potential to empty that buffer, with a limit of 150 WUs per device; it is in no way because of spoofing. What is needed has been suggested before, generally by the guys spoofing GPUs whom you hold accountable for the problem: in times of crisis such as now, impose a work-fetch limit of, say, 10 or 20 WUs per request. This would reduce the impact on average and would be 'fairer' even by your definition. Slower hosts would refill their caches in a relatively short time and even the faster hosts would not be completely devoid of work, though overall effectiveness would still be limited by the behaviour of the splitters and SETI servers.

Stephen

< shrug >
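For what it's worth, a rough toy simulation in Python of the 200-slot feeder described above shows why a flat per-request cap would cut down the "project has no tasks" replies. The refill interval, the request sizes and the host mix are all invented for illustration; nothing here reflects the real scheduler's behaviour or numbers.

# Toy simulation: how often does a work request find the feeder pot empty,
# with and without a per-request cap? All parameters are assumptions.
import random

def simulate(per_request_cap=None, rounds=10_000, pot_size=200, refill_every=10):
    pot, empty_replies = pot_size, 0
    for i in range(rounds):
        if i % refill_every == 0:
            pot = pot_size                      # splitters top the pot back up
        want = random.choice([1, 4, 12, 100])   # mix of small and huge cache requests
        if per_request_cap is not None:
            want = min(want, per_request_cap)
        granted = min(want, pot)
        pot -= granted
        if granted == 0:
            empty_replies += 1                  # a "project has no tasks" reply
    return empty_replies / rounds

random.seed(0)
print("share of no-task replies, uncapped:", simulate())
print("share of no-task replies, capped at 20:", simulate(per_request_cap=20))

With the cap in place the pot rarely empties between refills, so slow and fast hosts alike get a small, steady trickle instead of feast-or-famine.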
ID: 2033362 · Report as offensive
Alien Seeker
Joined: 23 May 99
Posts: 57
Credit: 511,652
RAC: 32
France
Message 2033363 - Posted: 21 Feb 2020, 19:46:11 UTC - in response to Message 2033341.  

A quick look at the SSP shows: Results returned and awaiting validation 0 / 35,474 / 14,150,778.
Why is this number so high? Surely not because of the superfast or spoofed hosts. It comes from the slow hosts (the vast majority of hosts) and the long WU deadline.


According to the site, your computer has an average turnaround time of 1.17/1.24 days (CPU/GPU), which isn't even twice as fast as my CPU-only, seriously-throttled, switched-off-at-night computers (1.50 days for one, 1.91 days for the other). So in the end, your superfast spoofed host keeps validation pending nearly as long as my slow computers; it just crunches many more tasks in the same time.

What weighs heavily on the number of tasks/workunits around is ghosts, and the more in-progress tasks you have at a given time, the more likely you are not to notice that some of them never actually reached your computer. Shortening the deadline to, say, 2 or 3 weeks would help a lot without affecting even the slower systems.
Gazing at the skies, hoping for contact... Unlikely, but it would be such a fantastic opportunity to learn.

My alternative profile
ID: 2033363 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033365 - Posted: 21 Feb 2020, 19:50:15 UTC - in response to Message 2033349.  
Last modified: 21 Feb 2020, 19:59:38 UTC

The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project.
As it should be.
How old is an old computer? My older cruncher is 11 years old and its ancient Core 2 Duo CPU can crunch a slow AstroPulse task in 8 hours and other tasks in 1 to 2 hours.

Single thread power of CPUs hasn't grown a lot in the last decade. They have just gained a lot more cores.

That chip has the same wattage as the new 8-core Zen 2 chip in my other computer, which has about 10 times its crunching power. So using very old hardware for number crunching is bad for the climate (and for the wallet too).
ID: 2033365 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033366 - Posted: 21 Feb 2020, 19:51:11 UTC - in response to Message 2033340.  

I just think the 'what tape shall I run next?' algorithm is running LIFO instead of FIFO.


. . It would certainly seem so ...

Stephen

:)
ID: 2033366 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033367 - Posted: 21 Feb 2020, 20:02:21 UTC - in response to Message 2033349.  

The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project.
As it should be.

Meow.


. . So how slow would a computer need to be to take 12 weeks to process one WU?

Stephen

. . Just curious ....

:)
ID: 2033367 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2033370 - Posted: 21 Feb 2020, 20:11:19 UTC

As an example of the timing problems that a volunteer-based project like SETI has to navigate, I've just cleared a _5 task that has been hanging around for 9 days:

Task ID     Computer  Sent                       Reported                   Status                   Run time (s)  CPU time (s)  Credit  Application
8538194490  6551171   12 Feb 2020, 17:20:04 UTC  13 Feb 2020, 11:18:10 UTC  Error while computing           40.65         38.64     ---  SETI@home v8 v8.22 (opencl_nvidia_SoG) windows_intelx86
8538194491  8889086   12 Feb 2020, 17:19:54 UTC  13 Feb 2020, 2:56:07 UTC   Aborted                          0.00          0.00     ---  SETI@home v8 v8.22 (opencl_nvidia_SoG) windows_intelx86
8539994504  8740693   13 Feb 2020, 3:29:55 UTC   14 Feb 2020, 2:43:40 UTC   Completed and validated     12,326.90     10,470.39   41.14  SETI@home v8 v8.08 (alt) windows_x86_64
8541276844  8637291   13 Feb 2020, 11:53:47 UTC  15 Feb 2020, 21:05:49 UTC  Error while computing        2,728.58         13.72     ---  SETI@home v8, Anonymous platform (NVIDIA GPU)
8551093378  8687393   15 Feb 2020, 21:30:46 UTC  21 Feb 2020, 19:36:02 UTC  Error while computing        4,098.76         12.27     ---  SETI@home v8 v8.22 (opencl_nvidia_SoG) windows_intelx86
8572354368  6910484   21 Feb 2020, 19:36:04 UTC  21 Feb 2020, 19:46:26 UTC  Completed and validated         36.09         33.22   41.14  SETI@home v8, Anonymous platform (NVIDIA GPU)
The three 'error' tasks were from

NVIDIA GeForce GTX 960 (2048MB) driver: 436.48 OpenCL: 1.2
NVIDIA GeForce GTX 1080 Ti (4095MB) driver: 441.66 OpenCL: 1.2
NVIDIA GeForce GTX 1060 6GB (4095MB) driver: 441.66 OpenCL: 1.2

- so we're still suffering (and suffering badly) from NVidia's mistake. _3 and _4 between them held up the WU for eight of the nine days it's spent in the database. And just look at the runtime differential between the two valid instances.
ID: 2033370 · Report as offensive