The Server Issues / Outages Thread - Panic Mode On! (118)

juan BFP
Message 2033360 - Posted: 21 Feb 2020, 19:37:32 UTC - in response to Message 2033353.  
Last modified: 21 Feb 2020, 19:40:36 UTC

It would be grand if the project could meter work allocation out to the host computers based on their ability to return processed work.
But that would require more programming and a lot more work on the project servers to figure out what to send or not send on every single work request.

Methinks the overhead would be too high to be worth it.

Meow.

They actually compute that each day, for every host (from the slowest to the fastest) - just look at the stats.

So this would add no extra load on the servers.
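
A minimal sketch of that idea, assuming the server already tracks each host's daily return count; every name and threshold here is hypothetical, not actual scheduler code:

# Hypothetical sketch: cap how many tasks a host may hold based on
# its measured daily return rate, with a small floor for new hosts.
def task_allowance(daily_returns: float, buffer_days: float = 1.5,
                   floor: int = 10, ceiling: int = 150) -> int:
    """Allow roughly buffer_days' worth of the host's proven throughput."""
    allowance = int(daily_returns * buffer_days)
    return max(floor, min(allowance, ceiling))

# A host returning 400 tasks/day keeps the existing 150-task cap;
# a host returning 3 tasks/day gets only the 10-task floor.
print(task_allowance(400.0))  # -> 150
print(task_allowance(3.0))    # -> 10
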
ID: 2033360
Stephen "Heretic"
Message 2033362 - Posted: 21 Feb 2020, 19:46:06 UTC - in response to Message 2033332.  

NO - if the serving pot were open, that would be true. But there is a limit of 200 tasks in the pot, and if ONE cruncher grabs 100 of them there are fewer left for anyone coming along after; when the pot is empty there is a pause in delivery while it is refilled - which is why we see so many "project has no tasks" messages, even when there are thousands apparently available in the RTS.


. . True, but that is because EVERY host has the potential to empty that buffer with a limit of 150 WUs per device; it is in no way because of spoofing. What is needed has been suggested before, generally by the very GPU spoofers you hold accountable for the problem: in times of crisis such as now, impose a work-fetch limit of, say, 10 or 20 WUs per request. That would reduce the impact on average and would be 'fairer' even by your definition. Slower hosts would refill their caches in a relatively short time, and even the faster hosts would not be completely devoid of work, though overall effectiveness would still be limited by the behaviour of the splitters and the SETI servers.
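
A rough sketch of that suggestion; the 20-WU cap and all names are assumptions for illustration, not the real scheduler's logic:

# Hypothetical sketch: in a crisis, cap the tasks granted per scheduler
# request so one request cannot drain the 200-task ready-to-send pot.
CRISIS_PER_REQUEST_CAP = 20  # assumed value from the suggestion above

def tasks_to_send(requested: int, host_quota_left: int,
                  pot_remaining: int, crisis: bool) -> int:
    grant = min(requested, host_quota_left, pot_remaining)
    if crisis:
        grant = min(grant, CRISIS_PER_REQUEST_CAP)
    return grant

# One 100-task request can no longer empty half of a 200-task pot:
print(tasks_to_send(100, 150, 200, crisis=True))   # -> 20
print(tasks_to_send(100, 150, 200, crisis=False))  # -> 100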

Stephen

< shrug >
ID: 2033362
Alien Seeker
Message 2033363 - Posted: 21 Feb 2020, 19:46:11 UTC - in response to Message 2033341.  

With a simple look at the SSP you see: Results returned and awaiting validation 0 35,474 14,150,778
Why is this number so high? Surely not because of the superfast or spoofed hosts. It comes from the slow hosts (the vast majority of hosts) and the long WU deadline.


According to the site, your computer has an average turnaround time of 1.17/1.24 days (CPU/GPU), which isn't even twice as fast as my CPU-only, seriously throttled, switched-off-at-night computers (1.50 days for one, 1.91 days for the other). So in the end, your superfast spoofed host keeps validation pending nearly as long as my slow computers do; it just crunches many more tasks in the same time.

What weighs heavily on the number of tasks/workunits in circulation is ghosts, and the more in-progress tasks you have at a given time, the more likely you are not to realise that some of them never actually reached your computer. Shortening the deadline to, say, 2 or 3 weeks would help a lot without affecting even slower systems.
Gazing at the skies, hoping for contact... Unlikely, but it would be such a fantastic opportunity to learn.

My alternative profile
ID: 2033363
Ville Saari
Message 2033365 - Posted: 21 Feb 2020, 19:50:15 UTC - in response to Message 2033349.  
Last modified: 21 Feb 2020, 19:59:38 UTC

The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project.
As it should be.
How old is an old computer? My older cruncher is 11 years old and its ancient Core 2 Duo CPU can crunch a slow AstroPulse task in 8 hours and other tasks in 1 to 2 hours.

Single-thread CPU performance hasn't grown much in the last decade; chips have just gained a lot more cores.

That chip has the same wattage as the new 8-core Zen 2 chip in my other computer, which has about 10 times the crunching power. So using very old hardware for number crunching is bad for the climate (and for the wallet too).
ID: 2033365
Stephen "Heretic"
Message 2033366 - Posted: 21 Feb 2020, 19:51:11 UTC - in response to Message 2033340.  

I just think the 'what tape shall I run next?' algorithm is running LIFO instead of FIFO.


. . It would certainly seem so ...
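
Purely as an illustration of the difference between the two policies (tape names invented):

from collections import deque

# 'What tape shall I run next?' under FIFO vs LIFO.
tapes = deque(["tape_A", "tape_B", "tape_C"])  # arrival order, oldest first

next_fifo = tapes[0]   # FIFO: oldest tape next -> "tape_A"
next_lifo = tapes[-1]  # LIFO: newest tape next -> "tape_C"
print(next_fifo, next_lifo)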

Stephen

:)
ID: 2033366
Stephen "Heretic"
Message 2033367 - Posted: 21 Feb 2020, 20:02:21 UTC - in response to Message 2033349.  

The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project.
As it should be.

Meow.


. . So how slow would a computer have to be to take 12 weeks to process one WU?

Stephen

. . Just curious ....

:)
ID: 2033367
Richard Haselgrove
Message 2033370 - Posted: 21 Feb 2020, 20:11:19 UTC

As an example of the timing problems that a volunteer-based project like SETI has to navigate, I've just cleared a _5 task that has been hanging around for 9 days:

Task        Computer  Sent (UTC)            Returned (UTC)        Status                   Run time (s)  CPU time (s)  Credit  Application
8538194490  6551171   12 Feb 2020 17:20:04  13 Feb 2020 11:18:10  Error while computing           40.65         38.64     ---  SETI@home v8 v8.22 (opencl_nvidia_SoG) windows_intelx86
8538194491  8889086   12 Feb 2020 17:19:54  13 Feb 2020 02:56:07  Aborted                          0.00          0.00     ---  SETI@home v8 v8.22 (opencl_nvidia_SoG) windows_intelx86
8539994504  8740693   13 Feb 2020 03:29:55  14 Feb 2020 02:43:40  Completed and validated     12,326.90     10,470.39   41.14  SETI@home v8 v8.08 (alt) windows_x86_64
8541276844  8637291   13 Feb 2020 11:53:47  15 Feb 2020 21:05:49  Error while computing        2,728.58         13.72     ---  SETI@home v8 Anonymous platform (NVIDIA GPU)
8551093378  8687393   15 Feb 2020 21:30:46  21 Feb 2020 19:36:02  Error while computing        4,098.76         12.27     ---  SETI@home v8 v8.22 (opencl_nvidia_SoG) windows_intelx86
8572354368  6910484   21 Feb 2020 19:36:04  21 Feb 2020 19:46:26  Completed and validated         36.09         33.22   41.14  SETI@home v8 Anonymous platform (NVIDIA GPU)
The three 'error' tasks were from

NVIDIA GeForce GTX 960 (2048MB) driver: 436.48 OpenCL: 1.2
NVIDIA GeForce GTX 1080 Ti (4095MB) driver: 441.66 OpenCL: 1.2
NVIDIA GeForce GTX 1060 6GB (4095MB) driver: 441.66 OpenCL: 1.2

- so we're still suffering (and suffering badly) from NVidia's mistake. _3 and _4 between them held up the WU for eight of the nine days it's spent in the database. And just look at the runtime differential between the two valid instances.
ID: 2033370
Ian&Steve C.
Message 2033371 - Posted: 21 Feb 2020, 20:11:54 UTC - in response to Message 2033369.  
Last modified: 21 Feb 2020, 20:12:13 UTC

At some point the project just needs to move on. With the computational power available today, it's unreasonable to wait 6+ weeks for someone to return a WU. If they haven't returned it within two weeks, it should be abandoned and given to someone who's actually willing to do the work.

Many other projects have much shorter deadlines, and I don't see anyone (much less the hordes of what are being called "most" users) complaining that they can't participate because of it.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2033371
Stephen "Heretic"
Message 2033372 - Posted: 21 Feb 2020, 20:14:33 UTC - in response to Message 2033353.  

It would be grand if the project could meter work allocation out to the host computers based on their ability to return processed work.
But that would require more programming and a lot more work on the project servers to figure out what to send or not send on every single work request.
Methinks the overhead would be too high to be worth it.
Meow.


. . Sadly that is a problem. But if an index were created for each host, based on that host's daily return rate, it could be applied to work assignment (a rough sketch follows). That would take time to construct and would probably be very difficult to incorporate into the current systems, so it is very unlikely. :(
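
A minimal sketch of such an index, assuming the daily return rate is already in the stats; all names, rates, and thresholds are hypothetical:

from dataclasses import dataclass

@dataclass
class HostStats:
    host_id: int
    returned_last_day: int  # tasks returned in the past 24 h

def return_index(stats: HostStats, reference_rate: float = 100.0) -> float:
    """1.0 means the host returns reference_rate tasks/day."""
    return stats.returned_last_day / reference_rate

def assignment_cap(stats: HostStats, base_cap: int = 150) -> int:
    # Scale the existing 150-task cap by the host's proven return rate.
    return max(1, min(base_cap, int(base_cap * return_index(stats))))

print(assignment_cap(HostStats(1, returned_last_day=400)))  # -> 150
print(assignment_cap(HostStats(2, returned_last_day=3)))    # -> 4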

Stephen

< shrug >
ID: 2033372
Freewill
Message 2033374 - Posted: 21 Feb 2020, 20:17:38 UTC - in response to Message 2033364.  

Should they be denied the chance to participate in something they find interesting, just because the 24/7 club doesn't like it when they can't get thousands of tasks every day?

No, everyone should be able to participate as much as they wish. I just wish the servers and database could accommodate all the interest. Perhaps setting the number of tasks based on average turnaround time would cover both machine speed and time online; see the sketch below. For example, someone who runs 1 hr/day, CPU only, should need fewer tasks to reach an average turnaround of, say, 10 days than someone with 8 x 2080 Tis.
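
A back-of-the-envelope sketch of that idea, using tasks in flight ≈ throughput × turnaround (Little's law); the figures and names are illustrative only:

# Hypothetical sketch: size a host's cache so its average turnaround
# lands near a target, given its measured daily throughput.
def cache_size_for_turnaround(tasks_per_day: float,
                              target_turnaround_days: float = 10.0,
                              hard_cap: int = 150) -> int:
    # Little's law: tasks in flight = throughput x turnaround.
    return min(hard_cap, max(1, int(tasks_per_day * target_turnaround_days)))

# A 1 h/day CPU-only host finishing ~2 tasks/day needs far fewer tasks
# in flight than an 8 x 2080 Ti rig to hit the same 10-day turnaround:
print(cache_size_for_turnaround(2.0))     # -> 20
print(cache_size_for_turnaround(5000.0))  # -> 150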

If I run out of tasks, at least my 24/7 club dues will go down for the month. :)
ID: 2033374
Ville Saari
Message 2033375 - Posted: 21 Feb 2020, 20:21:08 UTC - in response to Message 2033370.  

And just look at the runtime differential between the two valid instances.
You are comparing the stock Windows CPU app to the Linux special-sauce GPU app running on a Turing card.
ID: 2033375
Keith Myers
Message 2033376 - Posted: 21 Feb 2020, 20:26:53 UTC

I've been rooting around in the scheduler code trying to find the places where turnaround time and APR are generated. Both are known for every host. So if you know those parameters for every host, you should be able to generate a priority list of which hosts should get the majority of the work and clear the database the fastest.
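
A toy sketch of what that priority list could look like; the host records are made up, not from the real database:

# Rank hosts by how quickly they clear work: fast turnaround first,
# ties broken by higher APR (average processing rate).
hosts = [
    {"id": 1, "avg_turnaround_days": 0.4, "apr_gflops": 900.0},
    {"id": 2, "avg_turnaround_days": 9.5, "apr_gflops": 12.0},
    {"id": 3, "avg_turnaround_days": 1.5, "apr_gflops": 60.0},
]

priority = sorted(hosts, key=lambda h: (h["avg_turnaround_days"],
                                        -h["apr_gflops"]))
print([h["id"] for h in priority])  # -> [1, 3, 2]
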
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2033376
Speedy
Message 2033377 - Posted: 21 Feb 2020, 20:26:55 UTC - in response to Message 2033370.  

Richard, on a bright note: thanks for helping remove 5 results from the system.
ID: 2033377
Ville Saari
Message 2033378 - Posted: 21 Feb 2020, 20:28:32 UTC

One way to discourage oversized caches would be to include the turnaround time in the credit calculation: return the result immediately for maximum credit, and the longer you sit on it, the less you get.

Having a two-week cache would be a lot less cool if it hurt your RAC ;)
ID: 2033378
juan BFP
Message 2033379 - Posted: 21 Feb 2020, 20:32:12 UTC - in response to Message 2033363.  
Last modified: 21 Feb 2020, 20:41:18 UTC

With a simple look at the SSP you see: Results returned and awaiting validation 0 35,474 14,150,778
Why is this number so high? Surely not because of the superfast or spoofed hosts. It comes from the slow hosts (the vast majority of hosts) and the long WU deadline.


According to the site, your computer has an average turnaround time of 1.17/1.24 days (CPU/GPU), which isn't even twice as fast as my CPU-only, seriously throttled, switched-off-at-night computers (1.50 days for one, 1.91 days for the other). So in the end, your superfast spoofed host keeps validation pending nearly as long as my slow computers do; it just crunches many more tasks in the same time.

What weighs heavily on the number of tasks/workunits in circulation is ghosts, and the more in-progress tasks you have at a given time, the more likely you are not to realise that some of them never actually reached your computer. Shortening the deadline to, say, 2 or 3 weeks would help a lot without affecting even slower systems.

I never said my host is a superfast one; I use an old CPU and a GPU that is relatively slow by today's standards.
But following your example: my host has a buffer of close to 10K WUs, and the whole buffer is crunched in less than a day and a half. The fastest hosts do the same in less than half a day. That is why we use such large WU caches.
Your host has a buffer of about 15-20 WUs and crunches it in about the same day and a half.
So both your buffer and mine are in the 1-2 days max range I suggest.
When I say a fast or slow host, I mean a host with a low or high APR, not one with a fast or slow CPU or GPU.
Then why does a host that crunches, say, 3 WUs/day, returns only invalids, or has a low APR need a 10-day or 150 WU buffer?
Now imagine a host that crunches less than 1 WU/day with an average turnaround of 10 or more days (there are thousands of them) holding a cache of up to 150 WUs.
Surely that is a far larger impact on the DB than your host or mine.
That is what I am trying to explain.
ID: 2033379
Speedy
Message 2033381 - Posted: 21 Feb 2020, 20:40:58 UTC - in response to Message 2033378.  

One way to discourage oversized caches would be to include the turnaround time in the credit calculation: return the result immediately for maximum credit, and the longer you sit on it, the less you get.

Having a two-week cache would be a lot less cool if it hurt your RAC ;)

I see where you are coming from. I believe the only way you can return a result "immediately" is if it is a noise bomb (runs for 10 seconds) and is started as soon as it is downloaded. I cannot see any other way to return a result "immediately".
ID: 2033381
Stephen "Heretic"
Message 2033382 - Posted: 21 Feb 2020, 20:42:51 UTC - in response to Message 2033371.  

At some point the project just needs to move on. With the computational power available today, it's unreasonable to wait 6+ weeks for someone to return a WU. If they haven't returned it within two weeks, it should be abandoned and given to someone who's actually willing to do the work.

Many other projects have much shorter deadlines, and I don't see anyone (much less the hordes of what are being called "most" users) complaining that they can't participate because of it.


. . Let's consider a very old computer in contemporary terms, something like a Core 2 Duo or Core 2 Quad (I have, and am using, one of each). Even without a SETI-usable GPU, such machines can process 1 to 4 WUs at a time on their CPUs. Taking the worst case (the C2D) doing one WU at a time, it would take between 2 and 3 hours to process a WU, letting it get through about 8 WUs per day.

. . Let's assume the owner is on a dial-up connection (is there actually anyone who still is?) and only calls in once a week. They have the current task limit of 150 WUs (10 days + 10 days - now that might actually meet the definition of greedy), and each week they call in and return their yield of, say, 55 WUs. A 3-week deadline would still allow them to 'participate' without any restrictions beyond those on ALL other users. So why 8 or 12 weeks?

. . In reality, to participate they only need to set their work fetch to cover their 7-day return period, but let's allow some margin and say the full primary fetch of 10 days without the additional - about 80 WUs. Then only a 2-week deadline would really be required; the sketch below checks the arithmetic. Are there any hosts out there actually that slow, much less slower? I can find no logic or reason in the claim that such long deadlines are required to allow people to participate. Even in this hypothetical dial-up scenario, calling in every other day would let them 'participate' with a 1-week deadline.
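
Checking that arithmetic with the figures assumed above:

# All values are the post's assumptions, not measurements.
hours_per_wu = 3.0                         # worst case on a Core 2 Duo
wus_per_day = 24 / hours_per_wu            # -> 8.0
weekly_yield = wus_per_day * 7             # -> 56.0 (~55 in the post)
cache = 80                                 # ~10-day primary fetch
days_to_clear_cache = cache / wus_per_day  # -> 10.0 days

# Even calling in only once a week, a 3-week deadline leaves this host
# about a week of slack after working through its whole cache.
print(wus_per_day, weekly_yield, days_to_clear_cache)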

. . Just how low does the bar have to be set?

Stephen

? ? ?
ID: 2033382
Richard Haselgrove
Message 2033383 - Posted: 21 Feb 2020, 20:45:33 UTC - in response to Message 2033375.  

And just look at the runtime differential between the two valid instances.
You are comparing the stock Windows CPU app to the Linux special-sauce GPU app running on a Turing card.
That's what I was drawing attention to! The project finds itself in a position where differentials like that exist (and the CPU in question is an AMD A10-9700, no dinosaur), and it probably no longer has enough tools to manage every contingency.
ID: 2033383
Stephen "Heretic"
Message 2033384 - Posted: 21 Feb 2020, 20:55:23 UTC - in response to Message 2033381.  

One way to discourage oversized caches would be to include the turnaround time in the credit calculation: return the result immediately for maximum credit, and the longer you sit on it, the less you get.

Having a two-week cache would be a lot less cool if it hurt your RAC ;)

I see where you are coming from. I believe the only way you can return a result "immediately" is if it is a noise bomb (runs for 10 seconds) and is started as soon as it is downloaded. I cannot see any other way to return a result "immediately".


. . 'Immediately' can only ever be a relative term: even if your cache were empty and you received just one WU on an RTX 2080 Ti that completes in 30 secs, your return time would be nearly a minute. But in context, let's assume that a few minutes to a few (2-3) hours would satisfy the idea of 'immediately'. I'll restate that my personal target is 12 to 24 hours, and I still see no need for more than that.

Stephen

< shrug >
ID: 2033384
Ville Saari
Message 2033385 - Posted: 21 Feb 2020, 21:02:48 UTC - in response to Message 2033381.  

One way to discourage oversized caches would be to include the turnaround time in the credit calculation: return the result immediately for maximum credit, and the longer you sit on it, the less you get.
I see where you are coming from. I believe the only way you can return a result "immediately" is if it is a noise bomb (runs for 10 seconds) and is started as soon as it is downloaded. I cannot see any other way to return a result "immediately".
When the time scale is the 7-week deadline setiathome is using, anything within the first couple of hours is pretty much 'immediately'. The shortest time in which you can return anything without manual micromanagement is the 5-minute cooldown between scheduler requests, and most non-ancient GPUs can process at least one setiathome task in that time even when it isn't a noise bomb.

The average turnaround of all setiathome users is about 1.5 days. Make results returned within 1.5 days give the current credit, make results returned exactly at the deadline give zero credit, and interpolate/extrapolate linearly between those two fixed points to get the multiplier for other times. So if you return faster than 1.5 days, you get a few % more credit than you get now.
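
A minimal sketch of that linear multiplier, taking the 1.5-day average turnaround and a 7-week (49-day) deadline as the two fixed points; this illustrates the proposal, not any actual credit code:

def credit_multiplier(turnaround_days: float,
                      reference_days: float = 1.5,
                      deadline_days: float = 49.0) -> float:
    # 1.0 at the average turnaround, 0.0 at the deadline, linear in
    # between; returns faster than average extrapolate slightly above 1.
    m = (deadline_days - turnaround_days) / (deadline_days - reference_days)
    return max(0.0, m)

print(credit_multiplier(1.5))   # -> 1.0
print(credit_multiplier(49.0))  # -> 0.0
print(credit_multiplier(0.1))   # -> ~1.03, a few % bonus for fast returns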

Or an alternative - make it a race: Return the task before your wingman for a bit of extra credit ;)
ID: 2033385