Message boards : Number crunching : About Deadlines or Database reduction proposals
W-K 666 · Joined: 18 May 99 · Posts: 19317 · Credit: 40,757,560 · RAC: 67
How long do we have to wait, or what "floor" percentage of incorrectly validated tasks caused by bad AMD/Nvidia drivers/cards is needed, before the extra replication is removed? That's why I suggested earlier making more use of the "Computer Details" page, which gives details of the OS, hardware and drivers, and stopping sending tasks to those who haven't updated. Plus send a Notice informing everybody what is happening.
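A rough sketch of the sort of scheduler-side check being suggested here; the host fields, vendor rules and version threshold below are purely illustrative, not actual SETI@home or BOINC scheduler code:

    # Skip GPU work for hosts whose reported driver/card combination is on a
    # known-bad list. All names and thresholds are illustrative only.
    BAD_DRIVER_RULES = {
        "NVIDIA": lambda h: h["os"].startswith("Windows 10") and h["driver_version"] < (442, 19),
        "AMD":    lambda h: h["gpu_model"].startswith("RX 5"),   # no fixed driver yet, so always block
    }

    def should_send_gpu_work(host: dict) -> bool:
        """Return False while the host's GPU/driver combination matches a known-bad rule."""
        rule = BAD_DRIVER_RULES.get(host["gpu_vendor"])
        if rule is None:
            return True            # no rule for this vendor, send work as normal
        return not rule(host)

    # Example: a Win10 host still on driver 441.87 would get no GPU tasks.
    host = {"gpu_vendor": "NVIDIA", "os": "Windows 10", "gpu_model": "RTX 2070",
            "driver_version": (441, 87)}
    print(should_send_gpu_work(host))   # False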
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13835 · Credit: 208,696,464 · RAC: 304
> But couldn't the "Results returned and awaiting Validation" increase be down to the fact that they cannot move on to "Assimilation" because there is no room, as that number is now 4 million instead of close to zero?
As Richard has posted many times in this (and probably other) threads, they can't move on to Assimilation because the WU is still waiting for systems to return their results so it can be Validated. Until a WU is Validated, or declared dead due to too many errors, it cannot move on to Assimilation. You cannot Assimilate a WU until all results for it have been returned.
The reason the Assimilation backlog is so big is that WUs are being moved to be Assimilated, but that is not occurring due to the database I/O problems, which are themselves due to the huge number of Results returned and awaiting Validation. The Assimilation backlog is an effect of the cause, which is the blowout in the Awaiting Validation numbers. Cause and effect.
> But assuming you are correct, then the work cache needs to be reduced to the previous limit of 100 tasks and scrap the present day 150, immediately.
Or better yet, limit the cache of systems, not the number of WUs. Or just block all RX 5000 systems from doing any Seti work till the driver & application issue is fully resolved; then the extra replication will no longer be needed and the Results returned and awaiting Validation will eventually return to normal levels. However, due to the existing deadlines, that is going to take quite a while.
Grant
Darwin NT
W-K 666 · Joined: 18 May 99 · Posts: 19317 · Credit: 40,757,560 · RAC: 67
> > But couldn't the "Results returned and awaiting Validation" increase be down to the fact that they cannot move on to "Assimilation" because there is no room, as that number is now 4 million instead of close to zero?
> As Richard has posted many times in this (and probably other) threads, they can't move on to Assimilation because the WU is still waiting for systems to return their results so it can be Validated. Until a WU is Validated, or declared dead due to too many errors, it cannot move on to Assimilation.
But quite a few of them have actually been cleared in the last few days, or will be in the next couple of days, as they were issued in early January with deadlines from 24th February to 2nd March. The task in https://setiathome.berkeley.edu/forum_thread.php?id=85239&postid=2034464#2034464 was one of these; my task was the _2. The other block of bad tasks does have a way to go, as the deadlines for those are towards the end of March.
> > But assuming you are correct, then the work cache needs to be reduced to the previous limit of 100 tasks and scrap the present day 150, immediately.
> Or better yet, limit the cache of systems, not the number of WUs.
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13835 · Credit: 208,696,464 · RAC: 304
> But quite a few of them have actually been cleared in the last few days, or will be in the next couple of days, as they were issued in early January with deadlines from 24th February to 2nd March. The task in https://setiathome.berkeley.edu/forum_thread.php?id=85239&postid=2034464#2034464 was one of these; my task was the _2.
And every time a bunch of shorties get issued, they require a minimum of 3 systems to Validate them, not the usual 2 (if all goes well). So the present problems are going to continue until we get a new database server that can handle the load, or until the extra replication of WUs is no longer required to protect the integrity of the Science database (or the number of Seti crunchers drops off so much that the load is reduced to the point the database can handle it). It is that simple.
Grant
Darwin NT
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
> Or just block all RX 5000 systems from doing any Seti work till the driver & application issue is fully resolved
Win10/Nvidia driver <442.19 hosts aren't innocent either. Assuming those hosts produce a large quantity of "Exceeded task time limit" errors on VHAR tasks, those will also go out for further replication before finally validating.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13835 · Credit: 208,696,464 · RAC: 304
> > Or just block all RX 5000 systems from doing any Seti work till the driver & application issue is fully resolved
> Win10/Nvidia driver <442.19 hosts aren't innocent either. Assuming those hosts produce a large quantity of "Exceeded task time limit" errors on VHAR tasks, those will also go out for further replication before finally validating.
Yep. But fortunately (or unfortunately) they produce only a fraction of what the RX 5000 series do.
An RX 5000 series card produces an error every few seconds (let's say 5 sec). An Nvidia card with a dodgy driver produces one every 50 min. So, crappy WU production per affected card: Nvidia, 29 per day; RX 5000, 17,280 per day (or 28,800 at 3 sec per WU). A slight difference in effect on the database there.
Grant
Darwin NT
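For reference, the per-card arithmetic behind those figures works out as below; the error intervals are the rough values quoted in the post, not measurements:

    # Per-card arithmetic behind the figures above. The error intervals are the
    # rough values quoted in the post, not measurements.
    SECONDS_PER_DAY = 24 * 60 * 60

    def bad_results_per_day(seconds_per_error: float) -> float:
        return SECONDS_PER_DAY / seconds_per_error

    print(round(bad_results_per_day(50 * 60)))  # Nvidia, ~1 error per 50 min ->    29
    print(round(bad_results_per_day(5)))        # RX 5000, ~1 error per 5 sec -> 17280
    print(round(bad_results_per_day(3)))        # RX 5000, ~1 error per 3 sec -> 28800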
W-K 666 · Joined: 18 May 99 · Posts: 19317 · Credit: 40,757,560 · RAC: 67
> Or just block all RX 5000 systems from doing any Seti work till the driver & application issue is fully resolved
It could be argued that that problem reduced the load, as VHARs normally only take 2 or 3 minutes to process on an Nvidia GPU, and they were taking hours and therefore not asking for more tasks.
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
> So, crappy WU production per affected card.
True. More of a molehill problem for the Nvidia base compared to the AMD issue.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13835 · Credit: 208,696,464 · RAC: 304
> > So, crappy WU production per affected card.
> True. More of a molehill problem for the Nvidia base compared to the AMD issue.
And I suspect that problem will resolve itself relatively quickly. Those that don't change drivers never would have gone for the problem ones; those that do change drivers would have got the problem ones, and will then (hopefully) get the fixed ones now they are available.
Grant
Darwin NT
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13835 · Credit: 208,696,464 · RAC: 304
What would help is if they can implement the shorter deadline for resends that Richard mentioned BOINC already supports. The sooner they are cleared, the sooner everything will work again. At least we'll be back to just the usual upload & download issues. Hopefully it will also help with the after-outage recoveries.
Grant
Darwin NT
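A minimal sketch of the idea, assuming the mechanism simply gives reissued tasks a fraction of the original deadline; the 53-day bound and 0.25 factor are purely illustrative, and BOINC's real implementation is a server-side scheduler setting rather than anything like this code:

    # Sketch only: resent tasks get a fraction of the original delay bound so
    # failed or timed-out work comes back sooner. Figures are illustrative.
    ORIGINAL_DELAY_BOUND_DAYS = 53     # illustrative first-issue deadline
    RESEND_FACTOR = 0.25               # illustrative reduction for resends

    def deadline_days(is_resend: bool) -> float:
        return ORIGINAL_DELAY_BOUND_DAYS * (RESEND_FACTOR if is_resend else 1.0)

    print(deadline_days(False))   # 53.0  - first issue keeps the long deadline
    print(deadline_days(True))    # 13.25 - a resend has to come back much sooner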
Tom M · Joined: 28 Nov 02 · Posts: 5126 · Credit: 276,046,078 · RAC: 462
> > Did the screensaver make that much difference?
> It did make a difference, but the point I was trying to make is that many of the original ideas/assumptions are no longer relevant. People no longer use screen savers; there is no need. So an hour of screen saver time a day really isn't relevant any more. Not to mention it can now be run on everything from phones, tablets, laptops, desktops, servers etc., when it was originally just for people's desktop or laptop computer.
I don't know about smart TVs, but someone was trying to run it on some Android-based TV controllers and they kept burning out. Something about not enough cooling available... I do still like my screen savers, which is why I still run it under Windows sometimes...
Tom
A proud member of the OFA (Old Farts Association).
Tom M · Joined: 28 Nov 02 · Posts: 5126 · Credit: 276,046,078 · RAC: 462
> What would help is if they can implement the shorter deadline for resends that Richard mentioned BOINC already supports. The sooner they are cleared, the sooner everything will work again.
Maybe the shorter deadlines should be tried out in beta while everyone interested devotes a lot of threads there, so we can see how the load behaves.
Tom
A proud member of the OFA (Old Farts Association).
Wiggo · Joined: 24 Jan 00 · Posts: 36387 · Credit: 261,360,520 · RAC: 489
I just went through the 200 oldest pending tasks on my 2500K, and there would be many thousands of tasks taking up space until deadlines are met in just that small sample.
Rigs like this 1 can't be helping. :-(
https://setiathome.berkeley.edu/show_host_detail.php?hostid=8629071
Or this 1.
https://setiathome.berkeley.edu/show_host_detail.php?hostid=8873378
Also I've a hell of a lot of wingmen that haven't contacted the project since early January. Just a couple of examples.
https://setiathome.berkeley.edu/show_host_detail.php?hostid=8870423
https://setiathome.berkeley.edu/show_host_detail.php?hostid=8881964
But there are a great deal more of them, and then there are those that miss immediate validation and have to wait for deadlines to pass for correction.
https://setiathome.berkeley.edu/workunit.php?wuid=3785491713
Thankfully the number of these has dropped since the last time I did this check.
And then there are the many "hit and run" hosts that return a few completed tasks, fill up to the max, and then just stop (far too many to even bother linking them). This could be put down to the way that either this project or BOINC itself naturally overcommits any rig maxing out its CPUs, a problem that only gets worse with the addition of GPUs, and the number of them, so the now "standard" setup leaves rigs in a state that many users find unacceptable. Many of these people never come here to find out why, so they just uninstall it, leaving those tasks lying around, and this is a problem that also needs to be looked into.
Another problem (very likely the cause of my 1st link's problem) is overzealous antivirus programs that don't seem to know about the BOINC project at all, and many "normal" users don't know about excluding the BOINC folders from being scanned by them. Fixing this would likely require the project's managers and the antivirus developers to make each other aware of the issue.
Cheers.
rob smith · Joined: 7 Mar 03 · Posts: 22455 · Credit: 416,307,556 · RAC: 380
Yet again - they do not move to assimilation from validation, they are FLAGGED for assimilation; they sit exactly where they were when they were validated. We get the number waiting for assimilation by running a query on the "day-file" (I use that name to distinguish it from all the other databases and tables that SETI uses). The assimilator process uses a query to find data that has been validated and then copies that data into the science database. This process is very rapid on the "get data" side, but the "place data" side is a bit slower.
There is no 150-per-day limit (that went a long time ago); today the limit is 150 for the CPUs and 150 per GPU, both applying at any one time. Certainly dropping these figures to, say, 100 would help, provided EVERYBODY lived with it and didn't artificially inflate their GPU counts. But it would not be an instant fix, as it will take time for all the "excess" tasks to work through the system.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
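A loose illustration of that "flagged, not moved" behaviour; the table layout, column names and state values below are invented for the example and are not the real SETI@home schema:

    # A result row never changes tables; only a state flag changes, and a separate
    # assimilator pass copies the science data out. Schema is invented for the example.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE workunit (id INTEGER PRIMARY KEY, assimilate_state INTEGER, payload TEXT)")
    db.execute("CREATE TABLE science (wu_id INTEGER, payload TEXT)")
    db.execute("INSERT INTO workunit VALUES (1, 1, 'validated signal data')")  # 1 = flagged ready

    def assimilate_pass():
        # "get data" side: a quick query for anything flagged as ready
        ready = db.execute("SELECT id, payload FROM workunit WHERE assimilate_state = 1").fetchall()
        for wu_id, payload in ready:
            # "place data" side: copy into the science table (the slower half)
            db.execute("INSERT INTO science VALUES (?, ?)", (wu_id, payload))
            # the row stays put in workunit; only its flag changes
            db.execute("UPDATE workunit SET assimilate_state = 2 WHERE id = ?", (wu_id,))
        db.commit()

    assimilate_pass()
    print(db.execute("SELECT COUNT(*) FROM workunit").fetchone()[0])  # still 1 row in workunit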
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14674 · Credit: 200,643,578 · RAC: 874
I think it's probably time that we devoted a bit of thought to the source - the causes - of all these extra tasks in the daily database table. Apart from the big jump in the "in progress" limits around 6 December (which should have largely worked its way through the system by now), the biggest problems I see are the two different GPU driver problems. They are different, and we haven't perhaps thought about the differences enough.
NVidia released - as they often do - a new driver version for their existing cards. Some people who worry about that sort of thing (mainly gamers) leapt onto the bandwagon: other people (for a while at least) had the new drivers foisted on them by Windows 10. In either case, it was an upgrade process, and we can reasonably expect that the same people will upgrade their drivers again, for the same reasons and through the same process. This one will cure itself over time - until NVidia write the next bug.
AMD released new hardware, and (if I understand it right) their bug was present in their day-0 drivers released to support the new cards. We can't assume that the purchasers of the new cards (or new machines fitted with the new cards) will be the same 'constant upgraders': many people will simply use their machine 'as is' unless there is a significant problem visibly affecting their foreground use of the machine. We had an example in the sticky thread a few days ago.
We also had a developer report yesterday suggesting that the existing applications will never return to working as before, because of an apparent change in the underlying OpenCL compiler behaviour. We need to pay attention to that.
Darrell Wilcox · Joined: 11 Nov 99 · Posts: 303 · Credit: 180,954,940 · RAC: 118
@ rob smith
> Certainly dropping these figures to, say, 100 would help, provided EVERYBODY lived with it and didn't ...
I mostly agree with what you wrote. Where we differ is that EVEN IF we have some spoofers and others who game the system, overall it would help reduce the numbers in the queues.
As always, there will be some with very big and fast systems who will complain that they run out of work during the maintenance and unplanned outages. Again, I would suggest that, before at least the scheduled outage, a small window (5 minutes?) be opened with a large (400? 500?) limit, followed by an hour of time to allow the transmission of the tasks to those big and fast systems. Yes, a few slow systems will also get in, but overall the number of tasks in the queues will decrease.
I think an even smaller limit would be appropriate. Most of the big systems have no more than 32 threads running SETI work. If they have GPUs, each gets a thread of its own. Therefore, a limit of 50 would also work quite nicely. I don't know how fast the big boys can process on their CPU threads, but on my Threadripper 1950X using the stock client they take about 90 minutes or a bit more. So on the NEW beasts, maybe 60 minutes between tasks for the CPU. GPU processes are much faster; someone posted they process a task on their GPU in about 60 seconds. Depending on how busy "Synergy" is, the added contacts might be too much, and a larger limit (60? 80?) would be needed. Unless, of course, the big boys agree to forgo just a little bit of processing for the good of the majority of processors.
Note also, as the numbers in the queues decrease, the amount of work "Synergy" must do to process those tasks also decreases.
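As a rough sanity check on those numbers, here is the sort of back-of-the-envelope sum involved; the 90-minute CPU and 60-second GPU task times are the figures quoted above, and the limits and thread counts are illustrative:

    # How long a per-device task limit lasts at the task times quoted above.
    # 90 min per CPU task, ~60 s per GPU task; limits and thread counts illustrative.
    def hours_of_work(task_limit: int, minutes_per_task: float, concurrent_tasks: int = 1) -> float:
        return task_limit * minutes_per_task / concurrent_tasks / 60

    # 32-thread CPU host with a 50-task limit: roughly 1.5 tasks queued per thread.
    print(round(hours_of_work(50, 90, concurrent_tasks=32), 1))   # ~2.3 hours of CPU work

    # A single fast GPU with the current 150-task limit at ~1 minute per task.
    print(round(hours_of_work(150, 1), 1))                        # 2.5 hours of GPU work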
Darrell Wilcox · Joined: 11 Nov 99 · Posts: 303 · Credit: 180,954,940 · RAC: 118
@ Richard Haselgrove
> I think it's probably time that we devoted a bit of thought to the source - the causes - of all these extra tasks in the daily database table.
+1 on that! Does anyone here know if the (time limit/deadline/whatever it might be called) to move tasks out of the waiting-validation queue into the assimilation queue has been tweaked to allow work on recovering from the driver problems? Or has perhaps a special status code been inserted into the suspect tasks to "freeze" them while the recovery work is ongoing?
Darrell Wilcox · Joined: 11 Nov 99 · Posts: 303 · Credit: 180,954,940 · RAC: 118
@ Keith Myers
> Notice the very low count for both the Results returned and awaiting validation and Workunits waiting for assimilation.
What? Wait! Didn't we have slow computers back then? With a tasks-in-progress limit of 100, or was it 150? Does this mean ... it is NOT the slow return of tasks that is causing our problem now?
Darrell Wilcox · Joined: 11 Nov 99 · Posts: 303 · Credit: 180,954,940 · RAC: 118
@ W-K 666
> But couldn't the "Results returned and awaiting Validation" increase be down to the fact that they ...
No. The various queues from "ready to send" to "DB purging" are "logical queues", and a task is "moved" from one queue to another by changing a status value within the row of the DB. The limit we are bumping up against is the number of rows within the DB, not the values within the rows.
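A tiny sketch of that "logical queue" point; the state names are invented for the example, but the idea is that changing queue never changes the number of rows the database has to carry:

    # A task never leaves its row; only the row's state value changes.
    # State names are invented for the example.
    from enum import Enum

    class State(Enum):
        READY_TO_SEND = 1
        IN_PROGRESS = 2
        AWAITING_VALIDATION = 3
        WAITING_FOR_ASSIMILATION = 4
        DB_PURGING = 5

    tasks = {101: State.AWAITING_VALIDATION, 102: State.READY_TO_SEND}

    # "Moving" task 101 to the next queue is just an in-place update ...
    tasks[101] = State.WAITING_FOR_ASSIMILATION

    # ... so the row count, which is what the database is actually choking on, is unchanged.
    print(len(tasks))   # still 2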
Darrell Wilcox · Joined: 11 Nov 99 · Posts: 303 · Credit: 180,954,940 · RAC: 118
A quick note: we are processing about 10,000 more tasks/hour now than last November. |