Message boards :
Number crunching :
The Server Issues / Outages Thread - Panic Mode On! (118)
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
It would be grand if the project could meter out work allocation to the host computers based on their ability to return processed work. What each host actually returns each day is already known, for all hosts (from the slowest to the fastest one) - just look at the stats. So no extra load on the servers from doing this. |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
NO - if the serving pot was open then that would be true, but there is a limit of 200 tasks in the pot, and if ONE cruncher grabs 100 of them there are fewer left for anyone else coming along after, and when the pot is empty there is a pause in delivery while it is refilled - which is why we see so many "project has no tasks" messages, even when there are thousands apparently available in the RTS.
. . True, but that is because EVERY host has the potential to empty that buffer with a limit of 150 WUs per device. It is in no way because of spoofing. What is needed has been suggested before, generally by the guys spoofing GPUs whom you hold accountable for the problem: in times of crisis such as now, impose a work fetch limit of, say, 10 or 20 WUs per request. This would reduce the impact on average and would be 'fairer' even by your definition. Slower hosts would refill their caches in a relatively short time and even the faster hosts would not be completely devoid of work, but the overall effectiveness would still be limited by the behaviour of the splitters and SETI servers. Stephen < shrug > |
Alien Seeker Send message Joined: 23 May 99 Posts: 57 Credit: 511,652 RAC: 32 |
With a simple look at the SSP you see: Results returned and awaiting validation 0 35,474 14,150,778
According to the site, your computer has an average turnaround time of 1.17/1.24 days (CPU/GPU). That isn't even twice as fast as my CPU-only, seriously-throttled, switched-off-at-night computers (1.50 days for one, 1.91 days for the other). So in the end, your superfast spoofed host keeps validation pending nearly as long as my slow computers do; it just crunches many more tasks in the same time. What weighs heavily on the number of tasks/workunits around are ghosts, and the more in-progress tasks you have at a given time, the more likely you are not to realise that some of them never actually reached your computer. Shortening the deadline to, say, 2 or 3 weeks would help a lot without affecting even slower systems. Gazing at the skies, hoping for contact... Unlikely, but it would be such a fantastic opportunity to learn. My alternative profile |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project.
How old is an old computer? My older cruncher is 11 years old and its ancient Core 2 Duo CPU can crunch a slow AstroPulse task in 8 hours and other tasks in 1 to 2 hours. Single-thread performance of CPUs hasn't grown a lot in the last decade; they have just gained a lot more cores. That chip has the same wattage as the new 8-core Zen 2 chip in my other computer, which has about 10 times its crunching power. So using very old hardware for number crunching is bad for the climate (and for the wallet too). |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
I just think the 'what tape shall I run next?' algorithm is running LIFO instead of FIFO. . . It would certainly seem so ... Stephen :) |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project. . . So how slow a computer would you need to take 12 weeks to process one WU????? Stephen . . Just curious .... :) |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14668 Credit: 200,643,578 RAC: 874 |
As an example of the timing problems that a volunteer-based project like SETI has to navigate, I've just cleared a _5 task that has been hanging around for 9 days:
8538194490 6551171 12 Feb 2020, 17:20:04 UTC 13 Feb 2020, 11:18:10 UTC Error while computing 40.65 38.64 --- SETI@home v8 v8.22 (opencl_nvidia_SoG) windows_intelx86
8538194491 8889086 12 Feb 2020, 17:19:54 UTC 13 Feb 2020, 2:56:07 UTC Aborted 0.00 0.00 --- SETI@home v8 v8.22 (opencl_nvidia_SoG) windows_intelx86
8539994504 8740693 13 Feb 2020, 3:29:55 UTC 14 Feb 2020, 2:43:40 UTC Completed and validated 12,326.90 10,470.39 41.14 SETI@home v8 v8.08 (alt) windows_x86_64
8541276844 8637291 13 Feb 2020, 11:53:47 UTC 15 Feb 2020, 21:05:49 UTC Error while computing 2,728.58 13.72 --- SETI@home v8 Anonymous platform (NVIDIA GPU)
8551093378 8687393 15 Feb 2020, 21:30:46 UTC 21 Feb 2020, 19:36:02 UTC Error while computing 4,098.76 12.27 --- SETI@home v8 v8.22 (opencl_nvidia_SoG) windows_intelx86
8572354368 6910484 21 Feb 2020, 19:36:04 UTC 21 Feb 2020, 19:46:26 UTC Completed and validated 36.09 33.22 41.14 SETI@home v8 Anonymous platform (NVIDIA GPU)
The three 'error' tasks were from:
NVIDIA GeForce GTX 960 (2048MB) driver: 436.48 OpenCL: 1.2
NVIDIA GeForce GTX 1080 Ti (4095MB) driver: 441.66 OpenCL: 1.2
NVIDIA GeForce GTX 1060 6GB (4095MB) driver: 441.66 OpenCL: 1.2
- so we're still suffering (and suffering badly) from NVidia's mistake. _3 and _4 between them held up the WU for eight of the nine days it's spent in the database. And just look at the runtime differential between the two valid instances. |
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
at some point, the project just needs to move on. at this point in time with computational power available, it's unreasonable to wait 6+ weeks for someone to return a WU. if they haven't returned it by two weeks, then it should be abandoned and let someone who's actually willing to do the work process it. many other projects have much shorter deadlines, and I don't see anyone (much less the hordes of what is being called "most" users) complaining that they can't participate because of it. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
It would be grand if the project could meter work allocation out to the host computers based on their ability to return processed work.
. . Sadly, that is a problem. But if an index were created for each host, based on that host's daily return rate, it could be applied to work assignment. That would take time to construct and would probably be very difficult to incorporate into the current systems, so it is very unlikely. :( Stephen < shrug > |
Freewill Send message Joined: 19 May 99 Posts: 766 Credit: 354,398,348 RAC: 11,693 |
Should they be denied to participate in something they find interesting, just because the 24/7 club don't like it when they can't get thousands of tasks every day?
No, everyone should be able to participate as much as they wish to. I just wish the servers and database could accommodate all the interest. Perhaps setting the number of tasks based on average turnaround time would cover both machine speed and on-time. For example, someone who runs 1 hr/day CPU-only should need fewer tasks to reach an average turnaround of, say, 10 days than someone with 8 x 2080 Tis. If I run out of tasks, at least my 24/7 club dues will go down for the month. :) |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
And just look at the runtime differential between the two valid instances.
You are comparing the stock Windows CPU app to the Linux Special Sauce GPU app running on a Turing card. |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
I've been rooting around in the scheduler code trying to find the places where turnaround time and APR are generated. Both are known for every host. So if you know those parameters for every host, you should be able to generate a priority list of which hosts should get the majority of the work and clear the database the fastest. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
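The host-ranking idea above can be sketched in a few lines. This is only an illustration, not the real BOINC scheduler code: the field names (`avg_turnaround_days`, `apr`) and the simple turnaround-to-APR ratio used as the ranking key are my own assumptions about how such a priority list might be built.

```python
# Hypothetical sketch: rank hosts so the fastest-returning get work first.
# Lower average turnaround and higher APR (average processing rate) both
# indicate a host that clears work out of the database quickly.

def allocation_priority(hosts):
    """Sort hosts by turnaround/APR ratio, best candidates first."""
    return sorted(hosts, key=lambda h: h["avg_turnaround_days"] / max(h["apr"], 1e-9))

hosts = [
    {"id": 1, "avg_turnaround_days": 0.5, "apr": 120.0},  # fast GPU cruncher
    {"id": 2, "avg_turnaround_days": 9.0, "apr": 4.0},    # slow host, big cache
]
print([h["id"] for h in allocation_priority(hosts)])  # fast host ranked first
```

The scheduler could then walk this list and hand out larger batches to hosts near the top, shrinking the per-request quota for hosts near the bottom.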
Speedy Send message Joined: 26 Jun 04 Posts: 1643 Credit: 12,921,799 RAC: 89 |
Richard, on a bright note, thanks for helping remove 5 results from the system. |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
One way to discourage oversized caches would be to include the turnaround time in the credit calculation. Return the result immediately for max credit, and the longer you sit on it, the less you get. Having a two-week cache would be a lot less cool if it hurt your RAC ;) |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
With a simple look at the SSP you see: Results returned and awaiting validation 0 35,474 14,150,778
I never said my host is a super fast one; I use an old CPU and a relatively slow GPU by today's standards. But following your example: my host has close to a 10K WU buffer, and all of that buffer is crunched in less than a day and a half. The fastest hosts do the same in less than half a day. That is why we use such large WU cache buffers. Your host has a buffer of about 15-20 WUs and crunches that buffer in about the same day and a half. So your buffer and mine are both within the 1-2 day range I suggest as a maximum. When I say fast/slow host I mean a host with a low/high APR, not necessarily related to CPU or GPU speed. So why does a host that crunches, say, 3 WUs/day, returns only invalids, or has a low APR need a 10-day or 150 WU buffer? Now imagine a host that crunches less than 1 WU/day, has an APR of 10 or more days (there are thousands of them), and still holds an up-to-150 WU cache. That has a much larger impact on the DB than your host or mine. That is what I am trying to explain. |
Speedy Send message Joined: 26 Jun 04 Posts: 1643 Credit: 12,921,799 RAC: 89 |
One way to discourage oversized cache would be to include the turnaround time in credit calculation. Return the result immediately for max credit and longer you sit on it, the less you get.
I see where you are coming from. I believe the only way you can return a result "immediately" is if it is a noise bomb (runs for 10 seconds) and is started as soon as it is downloaded. I cannot see any other way to return a result "immediately". |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
at some point, the project just needs to move on. at this point in time with computational power available, it's unreasonable to wait 6+ weeks for someone to return a WU. if they haven't returned it by two weeks, then it should be abandoned and let someone who's actually willing to do the work process it.
. . Let's consider a very old computer in contemporary terms, something like a Core 2 Duo or Core 2 Quad (I have and am using one of each). Even without a SETI-usable GPU, those machines can process from 1 to 4 WUs at a time on their CPUs. So taking the worst case (the C2D) doing one WU at a time, it would take between 2 and 3 hours to process a WU, allowing it to get through about 8 WUs per day. Let's assume the owner is on a dial-up connection (is there actually anyone who is?) and only calls in once a week. They have the current task limit of 150 WUs (10 days + 10 days - now that might actually meet the definition of greedy), and each week they call in and return their yield of, say, 55 WUs. A 3-week deadline would still allow them to 'participate' without any other restrictions compared to ALL other users. So why 8 or 12? In reality, of course, to actually participate they only need to set their work fetch to cover their return period of 7 days, but let's allow some margin and say the full primary fetch of 10 days without the additional: about 80 WUs. Then only a 2-week deadline would really be required. Are there any hosts out there actually as slow as that, much less slower than that? I can find no logic or reason in the claim that such long deadlines are required to allow people to participate. Even with this hypothetical dial-up scenario, if they called in every other day they could 'participate' even with a 1-week deadline. . . Just how low does the bar have to be set? Stephen ? ? ? |
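A quick sanity check of the arithmetic in the scenario above, using the figures quoted in the post (one WU at a time on the Core 2 Duo, about 3 hours each in the worst case, one call-in per week):

```python
# Worst-case Core 2 Duo from the dial-up scenario above.
hours_per_wu = 3                    # worst case quoted for the C2D
wus_per_day = 24 // hours_per_wu    # one WU at a time, around the clock
wus_per_week = wus_per_day * 7      # returned at the weekly call-in

print(wus_per_day)    # 8 - matches "about 8 WUs per day"
print(wus_per_week)   # 56 - matches the "say 55" weekly yield
```

Even this deliberately pessimistic host clears its whole weekly yield well inside a 2-3 week deadline, which is the point the post is making.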
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14668 Credit: 200,643,578 RAC: 874 |
And just look at the runtime differential between the two valid instances.
You are comparing stock windows CPU app to Linux Special Sauce GPU app running in a Turing card.
That's what I was drawing attention to! The project finds itself in a position where differentials like that exist (and the CPU in question is an AMD A10-9700, no dinosaur). It probably no longer has enough tools to manage every contingency. |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
One way to discourage oversized cache would be to include the turnaround time in credit calculation. Return the result immediately for max credit and longer you sit on it, the less you get.
. . Immediately can only ever be a relative term: even if your cache is empty and you received just one WU on an RTX 2080 Ti which completes in 30 secs, your return time would still be nearly one minute. But in context, let's assume that a few minutes to a few (2-3) hours would satisfy the idea of immediately. I'll restate that my personal target is 12 to 24 hours, and I still see no need for more than that. Stephen < shrug > |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
One way to discourage oversized cache would be to include the turnaround time in credit calculation. Return the result immediately for max credit and longer you sit on it, the less you get.
I see where you are coming from. I believe the only way you can return a result "immediately" is if it is a noise bomb (runs for 10 seconds) and is started as soon as it is downloaded. I cannot see any other way to return a result "immediately".
When the time scale is the 7-week deadline setiathome is using, then anything within the first couple of hours is pretty much 'immediately'. The shortest time in which you can return anything without manual micromanagement is the 5-minute cooldown between scheduler requests. Most non-ancient GPUs can process at least one setiathome task in that time even when it isn't a noise bomb. The average turnaround of all setiathome users is about 1.5 days. Make results returned in 1.5 days give the current credit, results returned exactly at the deadline give zero credit, and interpolate/extrapolate linearly using those two fixed points to get the multiplier for other times. So if you return faster than 1.5 days, you get a few % more credit than you get now. Or an alternative - make it a race: return the task before your wingman for a bit of extra credit ;) |
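The proposed linear multiplier can be written out directly from the two fixed points named above. This is only a sketch of the suggestion, not anything in the actual credit code; the 49-day figure assumes the 7-week deadline mentioned in the post, and clamping late results at zero is my own choice.

```python
# Sketch of the proposed turnaround-based credit multiplier:
# full credit at the 1.5-day project-wide average turnaround,
# zero credit exactly at the deadline, linear in between, and
# extrapolated above 1.0 for returns faster than the average.

DEADLINE_DAYS = 49.0        # the 7-week deadline mentioned above
AVG_TURNAROUND_DAYS = 1.5   # project-wide average cited in the post

def credit_multiplier(turnaround_days):
    m = (DEADLINE_DAYS - turnaround_days) / (DEADLINE_DAYS - AVG_TURNAROUND_DAYS)
    return max(m, 0.0)      # never negative for results past the deadline

print(round(credit_multiplier(1.5), 3))   # 1.0   -> current credit
print(round(credit_multiplier(49.0), 3))  # 0.0   -> zero at the deadline
print(round(credit_multiplier(0.0), 3))   # 1.032 -> the "few % more" bonus
```

With these numbers the bonus for instant return is only about 3%, but the penalty ramps up steadily for hosts that sit on work for weeks, which is exactly the incentive the suggestion is after.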
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.