Message boards : Number crunching : About Deadlines or Database reduction proposals
Grant (SSSF) Joined: 19 Aug 99 Posts: 13947 Credit: 208,696,464 RAC: 304
"So currently the assimilator backlog is the problem, and the cramped database and related problems (throttled work generation, difficult recoveries after Tuesday downtimes) are the symptoms."

Many times in the past we have had the Deleter backlog increase until it stopped the splitters from splitting. The Deleter backlog would then start to clear, and eventually the splitters would start splitting again. Then the Deleter backlog would rise again... This kept happening until the work returned per hour dropped off and the splitter output no longer needed to be as high; then everything would eventually clear, until the next surge in demand. I think it was Richard who made a suggestion to re-work the database, which Eric implemented, and that problem hasn't re-occurred since. It also resulted in much shorter weekly maintenance outages.

At present the database is (most likely) mostly cached, but its performance is still much poorer than usual because of the bloated "Results returned and awaiting validation" count. That impacts other processes, and at present the one most affected is the Assimilators. It is a symptom, not the cause (although it could itself cause further problems). The Assimilator backlog came along after the "Results returned and awaiting validation" numbers blew out, not the other way round.

Grant
Darwin NT
Grant (SSSF) Joined: 19 Aug 99 Posts: 13947 Credit: 208,696,464 RAC: 304
Implementing the shorter deadline for Resends, to get them out of the system as quickly as possible, will make things better the next time a surge of noise bombs/corrupt WUs occurs. But it will also help clean up the present mess (although it'll take over a month to have much effect; the existing ones need to start timing out to finally get them cleared out).

There are (at least) two current significant reasons for the creation of _2 and later replications:

I've just done a bit of a clearout on one of my fast hosts (0.5 day cache, ~1,000 tasks), pending some possible maintenance later in the week. I cleared

Grant
Darwin NT
W-K 666 Joined: 18 May 99 Posts: 19691 Credit: 40,757,560 RAC: 67
"I'm going to say, I believe the terminology used on the Server Status page is correct, and therefore must reject your theory."
"What I posted isn't a theory, it's a statement of facts as they presently stand."

No, you are changing the units of measurement to fit your theory.

I know what caused the problem: a mixture of the increase in tasks per device, a series of bad drivers for Nvidia GPUs, the ATI/AMD GPU problems, and the subsequent increase to three tasks for validation of noise bombs. The last of these is a problem because, if the mix includes two ATI/AMD devices, the bad boys gang up on the good guy and "win".

All of this caused the SETI database to increase in size so that it can no longer fit in RAM, and it therefore shuffles data back and forwards to the HDDs, a relatively very slow process. This has caused at least one blockage in how results are processed through the system, specifically after they have been crunched and reported. At the moment the problem is being managed by throttling the splitters, but this is not curing it.

The problem doesn't seem to be in Validation, because if your computer is the 2nd to report you can observe almost immediate granting of credit etc. What you can see is that validated MB tasks are now staying in view longer than the 'normal' 24 hours, so they are not being purged as quickly as before (the longest of my tasks still in view was validated 27 Feb 2020, 21:53:44 UTC). Note: the opposite is happening to AP tasks, which are being removed from view in 6 hours or less. Therefore my diagnosis is that there is a blockage for MB tasks between 'validation' and 'purging'.

edit] And the large numbers in the Validation row are not caused by the GPU problems mentioned above, as those have mostly been cleared and are definitely less than 2% of all tasks outstanding.
Darrell Wilcox Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118
@ Grant (SSSF)

I think you are a knowledgeable person, but also a bit stubborn and unwilling to consider changes that, while they don't FIX the problem, at least start to lessen the impact the problem is having on the system. I agree with you that in the longer term we will need a bigger and faster server. It won't happen fast enough to fix this problem. Setting a short(er) deadline will result in a smaller DB, but not soon enough to FIX this problem. My suggestions won't fix this problem either. At best, over a few days, they will improve the situation, and they can begin doing so IMMEDIATELY and easily. Take the easy (low-hanging fruit) suggestions now, and work toward a better long-term solution that will take much longer to implement. The system these days is processing about 20,000 MORE tasks per HOUR than it was last November. The system isn't broken, just a bit impaired. Let's remove some of that impairment.

"And will be of no benefit in reducing the database size in any significant or meaningful way, so"
Just how much is "significant" or "meaningful"? Any reduction might just be the little bit that allows ALL of the DB to reside in RAM. Why do it? Because it is easy and fast to do, even if the reduction is relatively small and slowly made.

EDIT: Everyone should occasionally read the very bottom line on these pages, the one that says SETI is funded by our government, so everyone should be allowed to participate. Donations by volunteers, as I have also made, are great for special needs, but don't entitle us to actually "drive the train", just to ride on it. Oh, I mean the U.S. government, not the Vietnamese one.
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628
"Yes, and if everyone lowered their cache to a Day, the Turnaround Time would probably drop significantly. I'd guess most people out there have their cache set at 10 days. Imagine if they All dropped it to a Day..."

. . Considering the inaccuracy in calculating run times, it would probably be best if the work fetch were set to the current average of 1.2 days to achieve an actual single day of WUs. But if everyone did that then yes, the average would go lower and some of the load would reduce. I think everyone who reads these forums knows my attitude to keeping caches under one day.

Stephen

. .
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628
"Yes. If you spot any host in the database with a 10 day average turnaround, please let us know."

. . Looks like the classic newbie not liking the real-world process and leaving ...

Stephen

:(
Darrell Wilcox Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118
@ Stephen "Heretic"

Got a new toy to talk with E.T., and when it didn't work instantly, gave up.
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628
"I got one like that"

. . The worst I've seen was a 3-time rollover where the 3rd wingmate finally crunched the task right at the end of their deadline, so it sat in limbo for over 6 months ... I'm sure others have seen even worse. So the combination of delinquent hosts and overlong deadlines is very damaging to the project. At least we can tackle the deadlines.

Stephen

. .
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628
"Try 0,01, and 0,01. Works perfectly, and makes you return the tasks very fast."

. . Sadly it seems that the final part of the process is not suitable for distributed computing, so terminating the first part (us) will not speed it up one little bit. If they could deal with the final phase via BOINC then I would happily devote at least half of my computers to that purpose.

Stephen

:(
Grant (SSSF) Joined: 19 Aug 99 Posts: 13947 Credit: 208,696,464 RAC: 304
"@ Grant (SSSF)"

I'm all for changes I can see having a significant positive impact: reduced deadlines, and a further reduced deadline for resends, being the main examples. The reduction in the number of WUs per device you propose will leave many without full work, let alone get them through the many not-so-infrequent periods where the Scheduler takes 10 min to an hour off. And the resulting reduction in Work in progress won't necessarily improve things enough to help the situation. A change like that, which will have a significant negative impact on a large number of users but very likely won't result in a significant improvement in what it is being changed for, doesn't make sense to me.

"The system these days is processing about 20,000 MORE tasks per HOUR now than it was last November. The system isn't broken, just a bit impaired. Let's remove some of that impairment."
It's not so much a case of being broken as a case of reaching its limits. Reducing deadlines looks to offer a good reduction in database size, with minimal if any impact on crunchers (if a system can't return 1 WU a month then it doesn't really have anything to offer, IMHO). But any benefit we get from these changes won't stop the problem from occurring again in the future if the project gets its wish and gains more crunchers & computing resources to process work (even if Quorums return to 2 for all WUs). The fact is the database has reached a size the present hardware can't support, and the load is only going to grow (hopefully). It's time for new hardware.

Grant
Darwin NT
Darrell Wilcox Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118
@ Stephen "Heretic"

". . The worst I've seen was a 3-time rollover where the 3rd wingmate finally crunched the task right at the end of their"

I believe the server DB could have several million crunched and returned tasks, validated or not, in the assimilation queue, the Purge queue, the Field queue, and NOT have a significant impact on processing. As several posters have stated or implied, if the DB fits in RAM we don't have a significant processing issue. When it won't fit, then we do have an issue.
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628
"@ Stephen "Heretic""

. . We see a lot of that ... the instant-gratification generation ...

Stephen

< shrug >
Darrell Wilcox Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118
@ Grant (SSSF)

"But the resulting reduction of Work in progress won't necessarily improve things enough to help the situation."
Again you dismiss this suggestion out of hand without actually knowing whether it will or will not have a "significant positive impact". At least it will have a positive impact of some amount, and it requires perhaps as much as 5 minutes to add or modify two parameters in the cc_config.xml file for the server. Cheap! Easy! Positive impact on the DB! And it doesn't cut out anyone.

Reducing deadlines would have a positive impact on the DB, but would cut out the processing of the slow or infrequent systems. If the user base gets too small, or too angry at being cut out, their votes could backfire on SETI and have the government funding reduced or eliminated. I don't want that to happen. It is far better to put up with some "noise" in the DB from the ones who over-cache, or sample and run away.

Let's do these changes, and then let's get a fundraiser started for new/added server hardware.
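For reference, in a stock BOINC server the per-device task limits being debated here are scheduler options set in the project's configuration file. The option names below come from BOINC's server configuration documentation; the values are illustrative only, not a recommendation:

```xml
<config>
  <!-- Illustrative values only. Both limits are scaled by the host's
       device count: in-progress tasks capped at 10 per CPU core
       and 100 per GPU. -->
  <max_wus_in_progress>10</max_wus_in_progress>
  <max_wus_in_progress_gpu>100</max_wus_in_progress_gpu>
</config>
```

Changing two numbers like these is the "5 minutes of work" Darrell is describing; no client-side or software changes are involved.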
W-K 666 Joined: 18 May 99 Posts: 19691 Credit: 40,757,560 RAC: 67
"Reducing deadlines looks to offer a good reduction in database size,"
Are you sure? Eric's graph https://setiathome.berkeley.edu/forum_thread.php?id=83848&postid=1978208#1978208 indicates 98% of all tasks get returned in 3 days.

edit] I think a better suggestion is to reduce the cache to a true 24-hour limit. My 0.6 day (14.4 hrs) cache only lasts ~10 hrs at best. (0.6 days = ~150 tasks for GPU)
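The gap between a nominal "0.6 day" cache and the ~10 hours it actually lasts comes down to the client underestimating task runtimes. A toy calculation, where the per-task runtime and concurrency figures are assumptions for illustration, not measured values:

```python
# Sketch: how long a task queue really lasts, given actual (not estimated)
# runtimes. Numbers are illustrative assumptions only.
def actual_cache_hours(n_tasks, real_minutes_per_task, concurrent_tasks):
    """Hours the queue lasts if each task really takes real_minutes_per_task."""
    return n_tasks * real_minutes_per_task / concurrent_tasks / 60.0

# ~150 GPU tasks, ~8 real minutes each, 2 running concurrently
print(actual_cache_hours(150, 8, 2))  # 10.0 hours, not the nominal 14.4
```

If the client's runtime estimates were accurate, the nominal and actual durations would match; the shortfall is the estimation error.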
Grant (SSSF) Joined: 19 Aug 99 Posts: 13947 Credit: 208,696,464 RAC: 304
"Again you dismiss this suggestion out of hand without actually knowing if it will or will not have a "significant positive impact". At least it will have a positive impact of some amount, and it"
I am not dismissing it out of hand; I looked at the effect of reducing the limits from their highest level down to the present one. It had a significant effect on the work in progress, but it didn't have much of an effect on the overall database size, because a large impact on a small proportion of a large value is a small impact.

And it does cut out people, by not using all that they offer. Look at the size of the limit you are proposing. Many CPU systems now have more than 10 cores; they would no longer be able to contribute fully to the project. 100 per GPU, fair enough. But 10 per CPU? You are concerned about alienating people that contribute next to nothing to the project, but not those that contribute significantly? If we are going to limit the size of the In progress tasks, the best way would be to limit the size of the cache, not to set an absolute limit on tasks. Hence my lack of support for this suggestion.

"Reducing deadlines would have a positive impact on the DB, but would cut out the processing of the slow or infrequent system. If the user base gets too small, or too angry at being cut out,"
And here you are dismissing my arguments without serious consideration. Do you honestly consider the return of 1 WU per month to be a worthy contribution to the project? I don't. One a week, sure. One a fortnight, maybe. But one WU per month? No, not in a million years. And the fact is that a 28 day deadline will still make it possible for such a system to contribute to the project. If people get upset that they can no longer return 3 or 6 or 8 WUs a year, then so be it.

Let the ability to do 12 a year be the new minimum required to process work for SETI (even if they only choose to do 1, or even 0; the reduced deadline, along with a 7 day deadline for resends, would help keep the database smaller than it presently is).

"Let's do these changes, and then let's get a fund raiser started for new/added server hardware."
Changes that have the minimum impact on the majority of crunchers, and the greatest impact on the database issue.

Grant
Darwin NT
Darrell Wilcox Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118
@ W-K 666

"Are you sure? Eric's graph https://setiathome.berkeley.edu/forum_thread.php?id=83848&postid=1978208#1978208 indicates 98% of all tasks get returned in 3 days."
These are my calculated percentages:

Day      Workunit storage        %      Cumulative %
1               1,450,000    26.73%          26.73%
2                 790,000    14.56%          41.29%
3                 380,000     7.00%          48.29%
4                 110,000     2.03%          50.32%
5                 190,000     3.50%          53.82%
6                 150,000     2.76%          56.59%
7                 100,000     1.84%          58.43%
8                  80,000     1.47%          59.91%
9                  90,000     1.66%          61.57%
10                 85,000     1.57%          63.13%
11-60           2,000,000    36.87%         100.00%

(about 0.75% per day after 10 days)
"tasks returned" estimated by eyeball.
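The percentage columns above can be reproduced directly from the eyeballed daily counts; a quick sketch:

```python
# Recompute the per-day and cumulative percentages from the eyeballed
# workunit-storage counts in the table above.
counts = {1: 1_450_000, 2: 790_000, 3: 380_000, 4: 110_000, 5: 190_000,
          6: 150_000, 7: 100_000, 8: 80_000, 9: 90_000, 10: 85_000,
          "11-60": 2_000_000}
total = sum(counts.values())  # 5,425,000 workunits in total
running = 0
for day, n in counts.items():
    running += n
    print(f"{day}\t{n:>9,}\t{n / total:6.2%}\t{running / total:7.2%}")
```

Note that on these figures only ~48% of workunit storage is cleared within 3 days, which is the point of contention with the "98% returned in 3 days" reading of Eric's graph: tasks returned and workunits still stored are different measures.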
rob smith Joined: 7 Mar 03 Posts: 22799 Credit: 416,307,556 RAC: 380
"I believe the server DB could have several million crunched and returned, validated or not, in the assimilation queue, the Purge queue, the Field queue, and NOT have a significant impact on processing."

Well, the working database has millions of tasks at various stages along the path from "newly created" to "ready for deletion", and there is an issue: this temporary table is far too big to sit in memory, so it has to be "paged" in and out to allow work to be processed by the servers. All the data you describe is actually held in a single table, and queries are run by each of the processes to update each task until it is completed and deleted; the processes are "crunching", "validation", "assimilation", "purge" & "delete". A set of flags identifies where in the process a task is sitting: a process cannot work on a task that is marked as being used in an earlier step, or indeed go back and look at one it has already dealt with.

There is the odd issue in that the validators do not always correctly flag tasks they've dealt with, and this appears to coincide with either a major server "bump" or with tasks being processed just as the validators are stopped. Richard(?) did do some digging on this a long time ago, but since it was such a rare occurrence nothing appears to have been done about it.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
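The flag scheme rob describes matches how a stock BOINC workunit row moves through the back-end daemons. A minimal sketch; the field names (`assimilate_state`, `file_delete_state`, `canonical_result`) follow the stock BOINC schema, but the logic is simplified for illustration:

```python
# Simplified model of per-stage flags on a BOINC workunit row: each daemon
# only picks up rows whose flag says the previous stage has finished, then
# advances its own flag. Field names follow the stock BOINC schema.
INIT, READY, DONE = 0, 1, 2

class Workunit:
    def __init__(self):
        self.canonical_result = None   # set by the validator on consensus
        self.assimilate_state = INIT   # advanced by the assimilator
        self.file_delete_state = INIT  # advanced by the file deleter
        # db_purge removes the row only after deletion has finished

def validator_pass(wu, canonical_id):
    wu.canonical_result = canonical_id
    wu.assimilate_state = READY        # hand off to the assimilator

def assimilator_pass(wu):
    if wu.assimilate_state == READY:   # skip rows an earlier stage still owns
        wu.assimilate_state = DONE
        wu.file_delete_state = READY   # hand off to the file deleter

def deleter_pass(wu):
    if wu.file_delete_state == READY:
        wu.file_delete_state = DONE    # db_purge may now drop the row

wu = Workunit()
validator_pass(wu, canonical_id=42)
assimilator_pass(wu)
deleter_pass(wu)
```

A backlog in any one daemon shows up as millions of rows stuck at one flag value, which is exactly the Assimilator and Purge symptoms discussed upthread.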
Darrell Wilcox Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118
@ Grant (SSSF)

Ahh, as Winston Churchill supposedly said, "Now we are negotiating the price."

"But 10 per CPU?"
I concede this point to you, so let's make it 50 or even 60.

"Do you honestly consider the return of 1 WU per month to be a worthy contribution to the project? I don't."
Yes, I do, because the U.S. already has enough people who don't believe in science. If they came, let them contribute even a small amount. Maybe they will inspire somebody who has a big fast computer to try it too.

"One a week, sure. One a fortnight, maybe. But one WU per month? No, not in a million years. But the fact is that a 28 day deadline will still make it possible for such a system to contribute to the project."
The "cost" to the project of carrying those few is very small, so any contribution made is worth it.

"... along with a 7 day deadline for resends ..."
This part is also very doable without software changes, and I support it. By changing cc_config.xml on the server, the Resends could be preferentially sent to "reliable" (BOINC scheduler code-speak for "a system with a short turnaround and few errors") systems for reprocessing.

So will you contact Mr. Kevvy to start a fundraiser for more RAM, a new motherboard and RAM, or something else?
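In stock BOINC the "reliable host" behaviour Darrell refers to is controlled by scheduler options in the project's configuration file. The option names below come from BOINC's server configuration documentation; the values are illustrative only:

```xml
<config>
  <!-- Illustrative values only. Hosts averaging under a day's turnaround
       with a very low error rate are classed as "reliable". -->
  <reliable_max_avg_turnaround>86400</reliable_max_avg_turnaround>
  <reliable_max_error_rate>0.001</reliable_max_error_rate>
  <!-- Retries sent to reliable hosts get their deadline (delay bound)
       multiplied by this factor, e.g. 28 days becomes 7. -->
  <reliable_reduced_delay_bound>0.25</reliable_reduced_delay_bound>
</config>
```

This is why a shorter deadline for resends needs no software changes: the scheduler already supports routing retries to fast, accurate hosts with a reduced deadline.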
Grant (SSSF) Joined: 19 Aug 99 Posts: 13947 Credit: 208,696,464 RAC: 304
"So will you contact Mr. Kevvy to start a fund raiser for more RAM, a new motherboard and RAM, or something else?"
We need Eric to state his requirements, then the fun begins. I've already posted my suggested system somewhere around these forums (it can handle 3TB of RAM, sufficient improvement over the current system's 96GB to be good for a couple of years, I should think).

Grant
Darwin NT
Grant (SSSF) Joined: 19 Aug 99 Posts: 13947 Credit: 208,696,464 RAC: 304
"Reducing deadlines looks to offer a good reduction in database size,"
"Are you sure? Eric's graph https://setiathome.berkeley.edu/forum_thread.php?id=83848&postid=1978208#1978208 indicates 98% of all tasks get returned in 3 days."

Maybe so, but a quick look at one of my systems shows 5,461 Pending. Unfortunately the Task pages are barely responsive, but before things came to a halt I got to around 30 days out. There were just under 800 WUs over the 30 day line. That's almost 15% of my Pendings.

Grant
Darwin NT
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.