About Deadlines or Database reduction proposals

Message boards : Number crunching : About Deadlines or Database reduction proposals

W-K 666
Message 2034496 - Posted: 29 Feb 2020, 0:32:08 UTC - in response to Message 2034493.  

How long do we have to wait or what "floor" percentage of incorrectly validated tasks caused by bad drivers/cards of AMD/Nvidia is needed to remove the extra replications?
It's been a while now since both vendors' fixes became available to the user/host/vendor population. So how long do we need to wait? Until every conceivable host has installed proper drivers or left the project? Or what percentage of "bad" data is acceptable to let slip into the database?

That's why I suggested earlier making more use of the "Computer Details" page, which gives details of OS, hardware and drivers, and stopping sending tasks to those who haven't updated. Plus sending a Notice informing everybody what is happening.
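
To make that proposal concrete, here is a minimal sketch of the kind of scheduler-side check being suggested; the field names, version thresholds and record layout are illustrative assumptions, not how the SETI@home scheduler actually works:

```python
# Sketch of the proposed "check Computer Details before sending work" idea.
# Thresholds and record layout are illustrative assumptions only.

MIN_GOOD_NVIDIA_DRIVER = 442.19   # Win10/Nvidia fix mentioned in this thread
BAD_AMD_SERIES = "RX 5000"        # cards with the broken driver/application combination

def should_send_gpu_tasks(host):
    """Return True if this host's reported GPU/driver details look safe."""
    gpu = host.get("gpu_vendor")
    driver = host.get("driver_version", 0.0)
    model = host.get("gpu_model", "")

    if gpu == "NVIDIA" and host.get("os") == "Windows 10" and driver < MIN_GOOD_NVIDIA_DRIVER:
        return False              # old driver: likely "exceeded time limit" errors
    if gpu == "AMD" and BAD_AMD_SERIES in model:
        return False              # block until the driver/application issue is resolved
    return True

# Example host record, as it might appear on a "Computer Details" page:
host = {"os": "Windows 10", "gpu_vendor": "NVIDIA",
        "gpu_model": "GTX 1060", "driver_version": 441.87}
print(should_send_gpu_tasks(host))   # False - would be skipped and sent a Notice instead
```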

Grant (SSSF)
Message 2034500 - Posted: 29 Feb 2020, 0:38:46 UTC - in response to Message 2034494.  
Last modified: 29 Feb 2020, 0:42:48 UTC

But couldn't the "Results returned and awaiting Validation" increase be down to the fact that they cannot move on to "Assimilation" because there is no room as that number is now 4 million instead of close to zero.
As Richard has posted many times in this (and probably other threads), they can't move on to Assimilation because the WU is still waiting for systems to return their result so it can be Validated. Until a WU is Validated, or declared dead due to too many errors, it cannot move on to Assimilation.
You cannot Assimilate a WU until all results for it have been returned.


The reason the Assimilation backlog is so big is that WUs are being moved on for Assimilation, but that isn't happening because of the database I/O problems, which in turn are due to the huge number of Results returned and awaiting Validation.
The Assimilation backlog is an effect of the cause - which is the Awaiting Validation blowout in numbers.
Cause and effect.
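
To make the ordering concrete, a minimal sketch of the life cycle described above; the structures and field names are illustrative, not BOINC's actual schema:

```python
# A WU can only be validated once every issued copy has come back (returned,
# errored or timed out), and only a validated (or dead) WU can be flagged for
# assimilation. Illustrative structures only.

def can_validate(workunit):
    # All replicas must be back before the validator can compare them.
    return all(r["state"] in ("returned", "error", "timed_out")
               for r in workunit["results"])

def can_assimilate(workunit):
    # Only after validation succeeds, or the WU is declared dead from too many errors.
    return workunit["status"] in ("validated", "too_many_errors")

wu = {"status": "in_progress",
      "results": [{"state": "returned"}, {"state": "in_progress"}]}

print(can_validate(wu))    # False - one wingman still hasn't reported, so the
                           # returned result sits "awaiting validation" until they do
print(can_assimilate(wu))  # False - and it can't move on to assimilation either
```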


But assuming you are correct, then the work cache needs to be reduced to the previous limit of 100 tasks and scrap the present day 150, immediately.
Or better yet limit the cache of systems, not the number of WUs.
Or just block all RX 5000 systems from doing any Seti work till the driver & application issue is fully resolved, then the extra replication will no longer be needed & the Results returned and awaiting Validation will eventually return to normal levels.
However due to the existing deadlines, that is going to take quite a while.

W-K 666
Message 2034502 - Posted: 29 Feb 2020, 0:47:23 UTC - in response to Message 2034500.  

But couldn't the "Results returned and awaiting Validation" increase be down to the fact that they cannot move on to "Assimilation" because there is no room as that number is now 4 million instead of close to zero.
As Richard has posted many times in this (and probably other threads), they can't move on to Assimilation because the WU is still waiting for systems to return their result so it can be Validated. Until a WU is Validated, or declared dead due to too many errors it cannot move on to Assimilation.
You cannot Assimilate a WU until all results for it have been returned.

But quite a few of them have actually been cleared in the last few days, or will be in the next couple of days, as they were issued in early January with deadlines from 24th February to 2nd March. The task in https://setiathome.berkeley.edu/forum_thread.php?id=85239&postid=2034464#2034464 was one of these; my task was the _2.
The other block of bad tasks does have a way to go, as the deadlines for those are towards the end of March.

But assuming you are correct, then the work cache needs to be reduced to the previous limit of 100 tasks and scrap the present day 150, immediately.
Or better yet limit the cache of systems, not the number of WUs.
Or just block all RX 5000 systems from doing any Seti work till the driver & application issue is fully resolved, then the extra replication will no longer be needed & the Results returned and awaiting Validation will eventually return to normal levels.
However due to the existing deadlines, that is going to take quite a while.

Grant (SSSF)
Message 2034503 - Posted: 29 Feb 2020, 0:54:00 UTC - in response to Message 2034502.  
Last modified: 29 Feb 2020, 0:54:51 UTC

But quite a few of them have actually been cleared in the last few days, or will be in the next couple of days, as they were issued in early January with deadlines from 24th February to 2nd March. The task in https://setiathome.berkeley.edu/forum_thread.php?id=85239&postid=2034464#2034464 was one of these; my task was the _2.
The other block of bad tasks does have a way to go, as the deadlines for those are towards the end of March.
And every time a bunch of shorties get issued, they require a minimum of 3 systems to Validate them, not the usual 2 (if all goes well).
So the present problems are going to continue until we get a new database server that can handle the load, or until the extra replication of WUs is no longer required to protect the integrity of the Science database (or the number of Seti crunchers drops off so much that the load is reduced to the point the database can handle it).
It is that simple.
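
For a sense of scale, a rough calculation of what that extra replication does to the number of result rows the database has to carry; the workunit count is an arbitrary example, not a server figure:

```python
# Rough effect of raising the initial replication from 2 results per WU to 3.
workunits = 1_000_000            # arbitrary example, not an actual server figure

normal_results = workunits * 2   # usual case: 2 copies that must agree
extra_results  = workunits * 3   # current case: 3 copies while bad hosts are about

print(normal_results, extra_results)            # 2000000 3000000
print(f"{extra_results / normal_results:.0%}")  # 150% - half as many result rows again
```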

Keith Myers
Message 2034504 - Posted: 29 Feb 2020, 0:54:34 UTC - in response to Message 2034500.  

Or just block all RX 5000 systems from doing any Seti work till the driver & application issue is fully resolved

Win10/Nvidia driver <442.19 hosts aren't innocent either. Assuming those hosts produce a large quantity of "Exceeded task time limit" errors on VHAR tasks, those will go out for further replication before finally validating also.

Grant (SSSF)
Message 2034506 - Posted: 29 Feb 2020, 1:01:00 UTC - in response to Message 2034504.  
Last modified: 29 Feb 2020, 1:03:41 UTC

Or just block all RX 5000 systems from doing any Seti work till the driver & application issue is fully resolved
Win10/Nvidia driver <442.19 hosts aren't innocent either. Assuming those hosts produce a large quantity of "Exceeded task time limit" errors on VHAR tasks, those will go out for further replication before finally validating also.
Yep. But fortunately (or unfortunately) they produce only a fraction of what the RX 5000 series do.
An RX 5000 series card produces an error every few seconds (let's say 5 sec). An Nvidia card with a dodgy driver produces one every 50 min.

So crappy WU production per affected card.
Nvidia 29 per day.
RX 5000 17,280 per day (or 28,800 at 3sec per WU).
A slight difference in effect on the database there.
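
Those figures follow directly from the error rates quoted; as a quick check:

```python
# Quick check of the per-card figures quoted above.
SECONDS_PER_DAY = 24 * 60 * 60               # 86400

rx5000_at_5s  = SECONDS_PER_DAY / 5          # one bad result every 5 seconds
rx5000_at_3s  = SECONDS_PER_DAY / 3          # ... or every 3 seconds
nvidia_at_50m = SECONDS_PER_DAY / (50 * 60)  # one "exceeded time limit" every 50 minutes

print(rx5000_at_5s)    # 17280.0 bad results per day per card
print(rx5000_at_3s)    # 28800.0
print(nvidia_at_50m)   # 28.8 - the "29 per day" above
```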

W-K 666
Message 2034507 - Posted: 29 Feb 2020, 1:06:39 UTC - in response to Message 2034504.  
Last modified: 29 Feb 2020, 1:07:05 UTC

Or just block all RX 5000 systems from doing any Seti work till the driver & application issue is fully resolved

Win10/Nvidia driver <442.19 hosts aren't innocent either. Assuming those hosts produce a large quantity of "Exceeded task time limit" errors on VHAR tasks, those will go out for further replication before finally validating also.

It could be argued that that problem actually reduced the load, as VHARs normally take only 2 or 3 minutes to process on an Nvidia GPU, but were taking hours, so those hosts weren't asking for more tasks.

Keith Myers
Message 2034509 - Posted: 29 Feb 2020, 1:23:43 UTC - in response to Message 2034506.  

So crappy WU production per affected card.
Nvidia 29 per day.
RX 5000 17,280 per day (or 28,800 at 3sec per WU).
A slight difference in effect on the database there.

True. More of a molehill problem for the Nvidia base compared to the AMD issue.

Grant (SSSF)
Message 2034515 - Posted: 29 Feb 2020, 1:44:18 UTC - in response to Message 2034509.  

So crappy WU production per affected card.
Nvidia 29 per day.
RX 5000 17,280 per day (or 28,800 at 3sec per WU).
A slight difference in effect on the database there.
True. More of a molehill problem for the Nvidia base compared to the AMD issue.
And I suspect that problem will resolve itself relatively quickly. Those that don't change drivers never would have gone for the problem ones; those that do change drivers would have got the problem ones, and will then (hopefully) get the fixed ones now they are available.

Grant (SSSF)
Message 2034522 - Posted: 29 Feb 2020, 2:22:20 UTC

What would help is if they could implement the shorter deadline for resends that Richard mentioned, which BOINC already supports. The sooner they are cleared, the sooner everything will work again.
At least we'll be back to just the usual upload & download issues. Hopefully it will also help with the after-outage recoveries.
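
As a sketch of what a shorter resend deadline might look like - BOINC's scheduler does have options along these lines (a reduced delay bound applied to resends), but the numbers and names below are assumptions for illustration, not the project's actual settings:

```python
# Illustrative only: the 53-day figure and the resend fraction are assumptions,
# not SETI@home's real configuration.
NORMAL_DELAY_BOUND_DAYS = 53     # roughly the long deadline on a normal MB task (assumed)
RESEND_FRACTION = 0.25           # assumed: resends get a quarter of the normal deadline

def deadline_days(is_resend):
    return NORMAL_DELAY_BOUND_DAYS * (RESEND_FRACTION if is_resend else 1)

print(deadline_days(False))   # 53    - first issue of a task
print(deadline_days(True))    # 13.25 - a _2/_3 resend would clear in weeks, not months
```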

Tom M
Message 2034524 - Posted: 29 Feb 2020, 2:33:39 UTC - in response to Message 2033902.  

Did the screensaver make that much difference?
Can't say I was aware of that, I thought the added load was about 5%.
It did make a difference, but the point I was trying to make is that many of the original ideas/assumptions are no longer relevant. People no longer use screen savers; there is no need. So an hour of screen saver time a day really isn't relevant any more. Not to mention it can now be run on everything from phones, tablets, laptops, desktops, servers, etc., when it was originally just for people's desktop or laptop computer.
I'm surprised someone hasn't got it running on their Smart TV yet. Watch TV, and look for aliens all at the same time.


I don't know about smart TVs, but someone was trying to run it on some Android-based TV controllers and they kept burning out. Something about not enough cooling available...

I do still like my screen savers, which is why I still run it under Windows sometimes...

Tom

Tom M
Message 2034525 - Posted: 29 Feb 2020, 2:36:28 UTC - in response to Message 2034522.  

What would help is if they could implement the shorter deadline for resends that Richard mentioned, which BOINC already supports. The sooner they are cleared, the sooner everything will work again.
At least we'll be back to just the usual upload & download issues. Hopefully it will also help with the after-outage recoveries.


Maybe the shorter deadlines should be tried out in beta while everyone interested devotes a lot of threads there so we can see how the load behaves.

Tom

Wiggo
Message 2034527 - Posted: 29 Feb 2020, 2:53:41 UTC

I just went through the oldest 200 pending tasks on my 2500K, and there would be many thousands of tasks taking up space until deadlines are met in just that small sample.

Rigs like this 1 can't be helping. :-(

https://setiathome.berkeley.edu/show_host_detail.php?hostid=8629071

Or this 1.

https://setiathome.berkeley.edu/show_host_detail.php?hostid=8873378

Also, I've a hell of a lot of wingmen that haven't contacted the project since early January. Just a couple of examples.

https://setiathome.berkeley.edu/show_host_detail.php?hostid=8870423

https://setiathome.berkeley.edu/show_host_detail.php?hostid=8881964

But there are a great many more of them, and then there are those that miss immediate validation and have to wait for deadlines to pass for correction.

https://setiathome.berkeley.edu/workunit.php?wuid=3785491713

Thankfully the number of these has dropped since the last time I did this check.

And then there are the many "hit and run" hosts that return a few completed tasks, fill up to the max, and then just stop (far too many to even bother linking them). This could be put down to the way that either this project or BOINC itself naturally overcommits any rig by maxing out its CPUs, a problem that only gets worse with the addition of GPUs (and the number of them), leaving the now "standard" setup's rigs in a state that many users find unacceptable. Many of these people never come here to find out why, so they just uninstall it, leaving those tasks lying around, and this is a problem that also needs to be looked into.

Another problem (very likely the cause of my 1st link's problem) is overzealous antivirus programs that don't seem to know about the BOINC project at all, and many "normal" users don't know about excluding those folders from being scanned by them. This would likely have to be worked out between the project's managers and the antivirus developers themselves, so that each becomes aware of the other.

Cheers.

rob smith
Message 2034556 - Posted: 29 Feb 2020, 8:38:23 UTC - in response to Message 2034494.  

Yet again - they do not move to assimilation from validation; they are FLAGGED for assimilation and sit exactly where they were when they were validated. We get the number waiting for assimilation by running a query on the "day-file" (I use that name to distinguish it from all the other databases and tables that SETI uses).
The assimilator process uses a query to find data that has been validated and then copies that data into the science database. This process is very rapid on the "get data" side, but the "place data" side is a bit slower.
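
A minimal sketch of that "flagged, not moved" behaviour; it is loosely modelled on BOINC's assimilate_state flag, but the structures and names here are illustrative, not the actual schema:

```python
# Results never move between physical queues: the assimilator queries for
# validated-but-not-yet-assimilated rows and flips a state field once the data
# has been copied to the science database. Illustrative structures only.

ASSIM_INIT, ASSIM_READY, ASSIM_DONE = 0, 1, 2

workunits = [
    {"id": 1, "assimilate_state": ASSIM_READY, "canonical_result": "signal data"},
    {"id": 2, "assimilate_state": ASSIM_INIT,  "canonical_result": None},
]

def assimilator_pass(rows, copy_to_science_db):
    """One pass of an assimilator: quick to find work, slower to write it out."""
    for wu in rows:
        if wu["assimilate_state"] == ASSIM_READY:       # the cheap "get data" query
            copy_to_science_db(wu["canonical_result"])  # the slower "place data" step
            wu["assimilate_state"] = ASSIM_DONE         # flag it; the row never moves

assimilator_pass(workunits, copy_to_science_db=lambda result: None)
print([wu["assimilate_state"] for wu in workunits])     # [2, 0]
```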

There is no 150-a-day limit (that went a long time ago); today the limit is 150 for the CPUs and 150 per GPU, both being "in progress at any one time" limits.
Certainly dropping these figures to, say, 100 would help, provided EVERYBODY lived with it and didn't artificially inflate their GPU counts. But it would not be an instant fix, as it will take time for all the "excess" tasks to work through the system.

Richard Haselgrove
Message 2034570 - Posted: 29 Feb 2020, 9:27:46 UTC

I think it's probably time that we devoted a bit of thought to the source - the causes - of all these extra tasks in the daily database table.

Apart from the big jump in the "in progress" limits around 6 December (which should have largely worked its way through the system by now), the biggest problem I see are the two different GPU driver problems. They are different, and we haven't perhaps thought about the differences enough.

NVidia released - as they often do - a new driver version for their existing cards. Some people who worry about that sort of thing (mainly gamers) leapt onto the bandwagon: other people (for a while at least) had the new drivers foisted on them by Windows 10. In either case, it was an upgrade process, and we can reasonably expect that the same people will upgrade their drivers again, for the same reasons and through the same process. This one will cure itself over time - until NVidia write the next bug.

AMD released new hardware, and (if I understand it right) their bug was present in their day 0 drivers released to support the new cards. We can't assume that the purchasers of the new cards (or new machines fitted with the new cards) will be the same 'constant upgraders': many people will simply use their machine 'as is' unless there is a significant problem visibly affecting their foreground use of the machine. We had an example in the sticky thread a few days ago. We also had a developer report yesterday suggesting that the existing applications will never return to working as before, because of an apparent change in the underlying OpenCL compiler behaviour. We need to pay attention to that.

Darrell Wilcox
Message 2034574 - Posted: 29 Feb 2020, 9:53:20 UTC - in response to Message 2034556.  

@ rob smith
Certainly dropping these figures to, say, 100 would help, provided EVERYBODY lived with it and didn't
artificially inflate their GPU counts.
I mostly agree with what you wrote. Where we differ is that EVEN IF we have some spoofers and others
who game the system, overall it would help reduce the numbers in the queues. As always, there will be
some with very big and fast systems who will complain that they run out of work during the maintenance
and unplanned outages. Again, I would suggest that before at least the scheduled outage that a small
window (5 minutes?) be opened with a large (400? 500?) limit followed by an hour of time to allow the
transmission of the tasks to those big and fast systems. Yes, a few slow systems will also get in, but overall,
the number of tasks in the queues will decrease.

I think an even smaller limit would be appropriate. Most of the big systems have no more than 32 threads
running SETI work. If they have GPUs, each gets a thread of its own. Therefore, a limit of 50 would also work
quite nicely. I don't know how fast the big boys can process on their CPU threads, but on my Threadripper 1950X
using the stock client they take about 90 minutes or a bit more. So on the NEW beasts, maybe 60 minutes
between tasks for the CPU.

GPU processes are much faster. Someone posted they process a task on their GPU in about 60 seconds. Depending
on how busy "Synergy" is, the added contacts might be too much, and a larger limit (60? 80?) would be needed.
Unless, of course, the big boys agree to forgo just a little bit of processing for the good of the majority of processors.
Note also, as the numbers in the queues decrease, the amount of work "Synergy" must do to process those tasks
also decreases.
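
Working through the figures in this post gives a rough feel for how long the suggested limits would actually last a big host; all the numbers are the estimates quoted above, not measurements:

```python
# Rough cache-duration estimate using the figures quoted above (estimates, not measurements).
cpu_threads      = 32
cpu_task_minutes = 90        # Threadripper 1950X, stock client, per the post
gpu_task_seconds = 60        # the "about 60 seconds per GPU task" figure

def hours_of_cpu_work(task_limit):
    return task_limit * cpu_task_minutes / cpu_threads / 60

def hours_of_gpu_work(task_limit):
    return task_limit * gpu_task_seconds / 3600

for limit in (50, 100, 150):
    print(limit, round(hours_of_cpu_work(limit), 1), "h CPU,",
          round(hours_of_gpu_work(limit), 1), "h per GPU")
# 50  2.3 h CPU, 0.8 h per GPU
# 100 4.7 h CPU, 1.7 h per GPU
# 150 7.0 h CPU, 2.5 h per GPU
```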

Darrell Wilcox
Message 2034576 - Posted: 29 Feb 2020, 10:00:58 UTC - in response to Message 2034570.  

@ Richard Haselgrove
I think it's probably time that we devoted a bit of thought to the source - the causes - of all these extra tasks in the daily database table.
+1 on that!
Does anyone here know if the time limit/deadline/whatever it might be called for moving tasks out of the
waiting-validation queue into the assimilation queue has been tweaked to allow work on recovering from the
driver problems? Or perhaps a special status code inserted into the suspect tasks to "freeze" them while the recovery work is ongoing?

Darrell Wilcox
Message 2034577 - Posted: 29 Feb 2020, 10:25:11 UTC - in response to Message 2034443.  

@ Keith Myers
Notice the very low count for both the Results returned and awaiting validation and Workunits waiting for assimilation.
What? Wait! Didn't we have slow computers back then? With a tasks-in-progress limit of 100, or was it 150?

Does this mean ... it is NOT the slow return of tasks that is causing our problem now?

Darrell Wilcox
Message 2034578 - Posted: 29 Feb 2020, 10:42:52 UTC - in response to Message 2034494.  

@ W-K 666

But couldn't the "Results returned and awaiting Validation" increase be down to the fact that they
cannot move on to "Assimilation" because there is no room as that number is now 4 million instead of
close to zero.
No. The various queues from "ready to send" to "DB purging" are "logical queues" and a task is
"moved" from one queue to another by changing a status value within the row of the DB.

The limit we are bumping up against is the number of rows within the DB, not the values within the rows.

Darrell Wilcox
Message 2034579 - Posted: 29 Feb 2020, 10:44:13 UTC

A quick note: we are processing about 10,000 more tasks/hour now than last November.