The Server Issues / Outages Thread - Panic Mode On! (118)

Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2033371 - Posted: 21 Feb 2020, 20:11:54 UTC - in response to Message 2033369.  
Last modified: 21 Feb 2020, 20:12:13 UTC

At some point, the project just needs to move on. With the computational power available today, it's unreasonable to wait 6+ weeks for someone to return a WU. If a task hasn't been returned within two weeks, it should be abandoned and reissued to someone who's actually willing to do the work.

Many other projects have much shorter deadlines, and I don't see anyone (much less the hordes of what are being called "most" users) complaining that they can't participate because of it.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2033371
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033372 - Posted: 21 Feb 2020, 20:14:33 UTC - in response to Message 2033353.  

It would be grand if the project could meter work allocation out to the host computers based on their ability to return processed work.
But that would require more programming and a lot more work on the project servers to figure out what to send or not send on every single work request.
Methinks the overhead would be too high to be worth it.
Meow.


. . Sadly that is a problem. But if an index were created for each host based on the daily return rate of that host, it could be applied to work assignment. That would take time to construct and probably be very difficult to incorporate into the current systems. So it is very unlikely. :(

Stephen

< shrug >
ID: 2033372
Freewill
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2033374 - Posted: 21 Feb 2020, 20:17:38 UTC - in response to Message 2033364.  

Should they be denied the chance to participate in something they find interesting, just because the 24/7 club doesn't like it when they can't get thousands of tasks every day?

No, everyone should be able to participate as much as they wish to. I just wish the servers and database could accommodate all the interest. Perhaps setting the number of tasks based on average turnaround time would cover both machine speed and on-time. For example, someone who runs 1 hr/day CPU-only should need fewer tasks to reach an average turnaround of, say, 10 days than someone with 8 x 2080 Tis.
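
A minimal sketch of that idea (hypothetical names; a real scheduler would use its own host fields): cap each host's in-progress tasks at what it can clear within the target turnaround.

#include <algorithm>

// Hypothetical: limit a host's cache to what it can return within a
// target average turnaround, e.g. 10 days.
int allowed_in_progress(double tasks_returned_per_day,
                        double target_turnaround_days) {
    int limit = static_cast<int>(tasks_returned_per_day * target_turnaround_days);
    return std::max(1, limit);  // always allow at least one task
}

An 8 x 2080 Ti host returning thousands of tasks a day would earn a deep cache; a 1 hr/day CPU-only host would get only a handful.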

If I run out of tasks, at least my 24/7 club dues will go down for the month. :)
ID: 2033374
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033375 - Posted: 21 Feb 2020, 20:21:08 UTC - in response to Message 2033370.  

And just look at the runtime differential between the two valid instances.
You are comparing a stock Windows CPU app to the Linux Special Sauce GPU app running on a Turing card.
ID: 2033375
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2033376 - Posted: 21 Feb 2020, 20:26:53 UTC

I've been rooting around in the scheduler code trying to find the places where turnaround time and APR are generated. Those are known for every host. So if you know those parameters for every host, you should be able to generate a priority list of which hosts should get the majority of the work and clear the database the fastest.
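
As a rough illustration (hypothetical names and types, not actual BOINC scheduler code), such a priority list could rank hosts by how much work they clear per day of turnaround:

#include <algorithm>
#include <vector>

// Hypothetical host record built from fields the scheduler already
// tracks for every host (assumes avg_turnaround_days > 0).
struct HostStats {
    int host_id;
    double avg_turnaround_days;  // average time to return a result
    double apr;                  // average processing rate
};

// Hosts that process fast AND return fast sort to the front; they
// would receive the majority of the work and clear the database fastest.
void rank_hosts(std::vector<HostStats>& hosts) {
    std::sort(hosts.begin(), hosts.end(),
              [](const HostStats& a, const HostStats& b) {
                  return a.apr / a.avg_turnaround_days >
                         b.apr / b.avg_turnaround_days;
              });
}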
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2033376
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1646
Credit: 12,921,799
RAC: 89
New Zealand
Message 2033377 - Posted: 21 Feb 2020, 20:26:55 UTC - in response to Message 2033370.  

Richard, on a bright note, thanks for helping remove 5 results from the system.
ID: 2033377
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033378 - Posted: 21 Feb 2020, 20:28:32 UTC

One way to discourage oversized caches would be to include the turnaround time in the credit calculation. Return the result immediately for maximum credit; the longer you sit on it, the less you get.

Having a two-week cache would be a lot less cool if it hurts your RAC ;)
ID: 2033378
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2033379 - Posted: 21 Feb 2020, 20:32:12 UTC - in response to Message 2033363.  
Last modified: 21 Feb 2020, 20:41:18 UTC

With a simple look at the SSP you see: Results returned and awaiting validation 0 35,474 14,150,778
Why is this number so high? Surely not because of the superfast or spoofed hosts. It comes from the slow hosts (the vast majority of the hosts) and the long WU deadline.


According to the site, your computer has an average turnaround time of 1.17/1.24 days (CPU/GPU), which isn't even twice as fast as my CPU-only, seriously-throttled, switched-off-at-night computers (1.50 days for one, 1.91 days for the other). So in the end, your superfast spoofed host keeps validation pending nearly as long as my slow computers do; it just crunches many more tasks in the same duration.

What weighs heavily on the number of tasks/workunits in circulation is ghosts, and the more in-progress tasks you have at a given time, the more likely you are to not realise that some of them never actually reached your computer. Shortening the deadline to, say, 2 or 3 weeks would help a lot without affecting even the slower systems.

I never said my host is a super fast one; I use an old CPU and a relatively slow GPU by today's standards.
But following your example, my host has close to a 10K WU buffer, and all of it is crunched in less than 1 1/2 days. The fastest hosts do the same in less than 1/2 a day. That is why we use such a large WU cache buffer.
Your host has a buffer of about 15-20 WU and crunches that buffer in about the same 1 1/2 days.
So your buffer and mine are both in the range I suggest: 1-2 days max.
When I say a fast/slow host, I mean a host with a high/low APR, not one actually related to the CPU or GPU speed.
So why does a host that crunches, let's say, 3 WU/day, returns only invalids, or has a low APR need a 10-day or 150 WU buffer?
Now imagine a host that crunches less than 1 WU/day, has an APR of 10 or more days (there are 1000's of them), and has an up-to-150 WU cache.
That is surely a larger impact on the DB than your host or mine.
That is what I am trying to explain.
ID: 2033379
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1646
Credit: 12,921,799
RAC: 89
New Zealand
Message 2033381 - Posted: 21 Feb 2020, 20:40:58 UTC - in response to Message 2033378.  

One way to discourage oversized caches would be to include the turnaround time in the credit calculation. Return the result immediately for maximum credit; the longer you sit on it, the less you get.

Having a two-week cache would be a lot less cool if it hurts your RAC ;)

I see where you are coming from. I believe the only way you can return a result "immediately" is if it is a noise bomb (runs for 10 seconds) and is started as soon as it is downloaded. I cannot see any other way to return a result "immediately".
ID: 2033381
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033382 - Posted: 21 Feb 2020, 20:42:51 UTC - in response to Message 2033371.  

At some point, the project just needs to move on. With the computational power available today, it's unreasonable to wait 6+ weeks for someone to return a WU. If a task hasn't been returned within two weeks, it should be abandoned and reissued to someone who's actually willing to do the work.

Many other projects have much shorter deadlines, and I don't see anyone (much less the hordes of what are being called "most" users) complaining that they can't participate because of it.


. . Let's consider a very old computer in contemporary terms, something like a Core 2 Duo or Core 2 Quad (I have and am using one of each). Even without a SETI-usable GPU, such machines can process from 1 to 4 WUs at a time on their CPUs. Taking the worst case (the C2D) doing one WU at a time, it would take between 2 and 3 hours to process a WU, allowing it to get through about 8 WUs per day.

. . Let's assume the owner is on a dial-up connection (is there actually anyone who is?) and only calls in once a week. They have the current task limit of 150 WUs (10 days + 10 days, now that might actually meet the definition of greedy), and each week they call in and return their yield of, say, 55 WUs. A 3-week deadline would still allow them to 'participate' without any other restrictions compared to ALL other users. So why 8 or 12 weeks?

. . In reality, to actually participate they only need to set their work fetch to cover their return period of 7 days, but let's allow some margin and say the full primary fetch of 10 days without the additional: about 80 WUs. Then only a 2-week deadline would really be required. Are there any hosts out there actually as slow as that, much less slower than that? I can find no logic or reason in the claim that such long deadlines are required to allow people to participate. Even in this hypothetical dial-up scenario, if they called in every other day they could 'participate' even with a 1-week deadline.

. . Just how low does the bar have to be set?

Stephen

? ? ?
ID: 2033382
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2033383 - Posted: 21 Feb 2020, 20:45:33 UTC - in response to Message 2033375.  

And just look at the runtime differential between the two valid instances.
You are comparing a stock Windows CPU app to the Linux Special Sauce GPU app running on a Turing card.
That's what I was drawing attention to! The project finds itself in a position where differentials like that exist (and the CPU in question is an AMD A10-9700, no dinosaur). It probably no longer has enough tools to manage every contingency.
ID: 2033383
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033384 - Posted: 21 Feb 2020, 20:55:23 UTC - in response to Message 2033381.  

One way to discourage oversized caches would be to include the turnaround time in the credit calculation. Return the result immediately for maximum credit; the longer you sit on it, the less you get.

Having a two-week cache would be a lot less cool if it hurts your RAC ;)

I see where you are coming from. I believe the only way you can return a result "immediately" is if it is a noise bomb (runs for 10 seconds) and is started as soon as it is downloaded. I cannot see any other way to return a result "immediately".


. . Immediately can only ever be a relative term: even if your cache is empty and you receive just one WU on an RTX 2080 Ti which completes in 30 secs, your return time would be nearly one minute. But in context, let's assume that a few minutes to a few (2-3) hours would satisfy the idea of immediately. I'll restate that my personal target is 12 to 24 hours, and I still see no need for more than that.

Stephen

< shrug >
ID: 2033384
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033385 - Posted: 21 Feb 2020, 21:02:48 UTC - in response to Message 2033381.  

One way to discourage oversized caches would be to include the turnaround time in the credit calculation. Return the result immediately for maximum credit; the longer you sit on it, the less you get.
I see where you are coming from. I believe the only way you can return a result "immediately" is if it is a noise bomb (runs for 10 seconds) and is started as soon as it is downloaded. I cannot see any other way to return a result "immediately".
When the time scale is the 7-week deadline setiathome is using, anything within the first couple of hours is pretty much 'immediately'. The shortest time in which you can return anything without manual micromanagement is the 5-minute cooldown between scheduler requests. Most non-ancient GPUs can process at least one setiathome task in that time even when it isn't a noise bomb.

The average turnaround of all setiathome users is about 1.5 days. Make results returned in 1.5 days give the current credit, make results returned exactly at the deadline give zero credit, and interpolate/extrapolate linearly between those two fixed points to get the multiplier for other times. So if you return faster than 1.5 days, you get a few percent more credit than you do now.
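
A quick sketch of that linear scheme (hypothetical function; the constants come straight from the post: 1.5-day average, zero credit at the deadline):

// Linear credit multiplier: 1.0 at the 1.5-day average turnaround,
// 0.0 at the deadline; returns faster than average extrapolate
// slightly above 1.0.
double credit_multiplier(double turnaround_days, double deadline_days) {
    const double avg_turnaround = 1.5;  // project-wide average, in days
    double m = (deadline_days - turnaround_days) /
               (deadline_days - avg_turnaround);
    return m > 0.0 ? m : 0.0;  // no credit at or after the deadline
}

With a 49-day deadline, a return within an hour gets a multiplier of about 1.03, while sitting on a task for two weeks drops it to about 0.74.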

Or an alternative - make it a race: Return the task before your wingman for a bit of extra credit ;)
ID: 2033385
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2033387 - Posted: 21 Feb 2020, 21:03:33 UTC - in response to Message 2033381.  

One way to discourage oversized caches would be to include the turnaround time in the credit calculation. Return the result immediately for maximum credit; the longer you sit on it, the less you get.

Having a two-week cache would be a lot less cool if it hurts your RAC ;)

I see where you are coming from. I believe the only way you can return a result "immediately" is if it is a noise bomb (runs for 10 seconds) and is started as soon as it is downloaded. I cannot see any other way to return a result "immediately".

GPUGrid rewards fast-turnaround hosts with 50% more credit if work is returned within 24 hours, and 25% more if it is returned within 48 hours. The same could be implemented here.
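
For comparison, the GPUGrid scheme described above is a simple step function rather than a linear decay (a sketch only, with the thresholds as stated in the post):

// GPUGrid-style bonus: +50% credit within 24 h, +25% within 48 h,
// base credit after that.
double gpugrid_bonus(double turnaround_hours) {
    if (turnaround_hours <= 24.0) return 1.50;
    if (turnaround_hours <= 48.0) return 1.25;
    return 1.00;
}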
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2033387
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2033388 - Posted: 21 Feb 2020, 21:08:06 UTC - in response to Message 2033381.  

I see where you are coming from. I believe the only way you can return a result "immediately" is if it is a noise bomb (runs for 10 seconds) and is started as soon as it is downloaded. I cannot see any other way to return a result "immediately".

Actually you can. If you set:

<report_results_immediately>1</report_results_immediately>

in the cc_config.xml file. From the client configuration wiki.

<report_results_immediately>0|1</report_results_immediately>
If 1, each job will be reported to the project server as soon as it's finished, with an inbuilt 60 second delay from completion of result upload. (normally it's deferred for up to one hour, so that several jobs can be reported in one request). Using this option increases the load on project servers, and should generally be avoided. This is intended to be used only on computers whose disks are reformatted daily.
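
For anyone who wants to try it: the flag goes inside the <options> section of cc_config.xml in the BOINC data directory. A minimal file would look like the sketch below; afterwards, restart the client or run boinccmd --read_cc_config to pick it up.

<cc_config>
   <options>
      <report_results_immediately>1</report_results_immediately>
   </options>
</cc_config>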


But early overflows that run for only 15 seconds would still get reported only at each 305-second scheduler connect interval.
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2033388
Kissagogo27
Joined: 6 Nov 99
Posts: 717
Credit: 8,032,827
RAC: 62
France
Message 2033389 - Posted: 21 Feb 2020, 21:08:57 UTC

It's not really a problem of the fastest ones against the slowest ones, but...

1. The slowest ones are many, so cross-validating pending work is not a problem for them (slow CPU and core count, slow GPU), like mine, but running 24 hours a day, 7 days a week, 52 weeks a year, and for a whole lifetime I hope...

2. The intermediate ones (multi-core CPU and a fast GPU) are not the problem either; they are numerous enough to cross-validate pending work with another intermediate host.

3. The problem is that there are only a few of the fastest ones (multi-core CPU, multiple fast GPUs), which in fact can't cross-validate pending work with another of the fastest computers... they have to wait for groups 2 and 1.

Some possible solutions...

Increase the number of the fastest hosts (group 3)... (not really possible: money cost, electricity cost, maintenance to do, hard configuration, etc.), plus server-side problems feeding all of those hosts.

Split the fastest (group 3) down into intermediates (group 2)... more intermediates, and higher intermediates, to increase the cross-validated pendings between them; easier to configure, to supply, etc.

Eliminate all the slowest ones... against the values of the scientific project that SETI is.

Or all the proposals you've made before me :D

(sorry for the language mistakes) :p
ID: 2033389
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033394 - Posted: 21 Feb 2020, 21:21:11 UTC
Last modified: 21 Feb 2020, 21:22:18 UTC

Looks like the current panic is ending. My hosts are only a few tasks short of having full caches now.
ID: 2033394
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033397 - Posted: 21 Feb 2020, 21:29:43 UTC - in response to Message 2033341.  

With a simple look at the SSP you see: Results returned and awaiting validation 0 35,474 14,150,778
Why is this number so high? Surely not because of the superfast or spoofed hosts. It comes from the slow hosts (the vast majority of the hosts) and the long WU deadline.
Actually, about 9 million of those 14 million results are in there not because someone is still crunching them, but because the corresponding workunit is stuck in the assimilation queue.
ID: 2033397
Unixchick
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2033400 - Posted: 21 Feb 2020, 21:38:50 UTC - in response to Message 2033369.  

The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project.
As it should be.

Meow.


. . So how slow would a computer have to be to take 12 weeks to process one WU?

Stephen

. . Just curious ....

:)

Most people do NOT run their computers 24/7, and therefore a slow computer that runs maybe only a few hours a week can take a very long time to crunch one task. Should they be kicked out of their long-time interest in SETI, just because the 24/7 club wants all the WUs they can get?

This is getting into elitist territory now, and that is not what SETI is (was?) about.



I think this is a valid point. I didn't think of the people who have slow machines that are only on for a few hours every day. I would love to see the stats on machines in that category and see IF the return-time allowance could be shortened. I want to be as inclusive as possible, but I'll be honest: if the number of machines in this category is small, then it might be better to sacrifice a few participants to maybe gain even more. How many people join the project but quit because it isn't stable and they can't reliably get WUs every week? How much shorter would the Tuesday outage be if the db were a better size? I want SETI to be run by as many people as possible. This project is not only about finding the alien; it's also about PR and making people feel part of something. But what if we are losing more people by not changing the system to run better?
ID: 2033400
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 2033403 - Posted: 21 Feb 2020, 21:44:57 UTC - in response to Message 2033369.  

Most people do NOT run their computers 24/7, and therefore a slow computer that runs maybe only a few hours a week can take a very long time to crunch one task. Should they be kicked out of their long-time interest in SETI, just because the 24/7 club wants all the WUs they can get?

This is getting into elitist territory now, and that is not what SETI is (was?) about.
And for extremely slow, rarely-on systems, 1 month is plenty of time to return a WU. It's actually plenty of time to return many WUs.
While deadlines as short as one week wouldn't affect such systems, they would affect those that are having problems, be it hardware, internet, or power supply (fires, floods, storms, etc.). A 1-month deadline reduces the time it takes to clear a WU from the database, but still allows people time to recover from problems and not lose any of the work they have processed.
Grant
Darwin NT
ID: 2033403