About Deadlines or Database reduction proposals

Message boards : Number crunching : About Deadlines or Database reduction proposals

Previous · 1 . . . 9 · 10 · 11 · 12 · 13 · 14 · 15 . . . 16 · Next

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13947
Credit: 208,696,464
RAC: 304
Australia
Message 2034965 - Posted: 2 Mar 2020, 6:55:52 UTC - in response to Message 2034805.  
Last modified: 2 Mar 2020, 7:04:41 UTC

So currently the assimilator backlog is the problem and the cramped database and related problems like throttled work generation and difficult recoveries after Tuesday downtimes are the symptoms.
Many times in the past we have had the Deleter backlog increase until it stopped the splitters from splitting. Then the Deleter backlog would start to clear, and eventually the splitters would start splitting again. Then the Deleter backlog would rise again...
This would keep happening until the work returned in the last hour dropped off and splitter output no longer needed to be as high. Then everything would eventually clear, until the next surge in demand. I think it was Richard who suggested re-working the database, which Eric implemented, and that problem hasn't recurred since. It also resulted in much shorter weekly maintenance outages.
At present, the database is (most likely) being (mostly) cached. But database performance is still much poorer than usual because of the bloated Results returned and awaiting validation. That impacts other processes, and at present the one most affected is the Assimilators.
It is a symptom, not the cause (though of course it could itself cause further problems). The Assimilator backlog arose after the Results returned and awaiting validation numbers blew out, not the other way round.
Grant
Darwin NT
ID: 2034965
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13947
Credit: 208,696,464
RAC: 304
Australia
Message 2034968 - Posted: 2 Mar 2020, 7:02:41 UTC - in response to Message 2034809.  

There are (at least) two current significant reasons for the creation of _2 and later replications:

1) bad drivers, and the associated compulsory re-check for overflow tasks
2) tasks issued, but never returned by absent hosts - reissued at deadline

Any solution has to take account of both problems.
I've just done a bit of a clearout on one of my fast hosts (0.5 day cache, ~1,000 tasks), pending some possible maintenance later in the week. I cleared

a) shorties - removes many rows from the database with little work
b) _2 or later replications.

I found far fewer resends than shorties. That, coupled with the tracking table I re-posted a little while ago, reinforces my view that absent hosts, which will never return the work however long we give them, are the bigger contributor to the longevity of the current database problems, and that reducing the deadlines would be an effective contributor to shrinking the database, with very little downside.
And implementing the shorter deadline for Resends, to get them out of the system as quickly as possible, will make things better the next time a surge in noise bombs/corrupt WUs occurs, but it will also help clean up the present mess (although it'll take over a month to have much effect; the existing tasks need to start timing out before they are finally cleared).
Grant
Darwin NT
ID: 2034968
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19691
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034971 - Posted: 2 Mar 2020, 7:29:02 UTC - in response to Message 2034964.  
Last modified: 2 Mar 2020, 7:38:45 UTC

I'm going to say, I believe the terminology used on the Server Status page is correct, and therefore must reject your theory.
What i posted isn't a theory, it's a statement of facts as they presently stand.
If you wish to contribute anything of value to this discussion, you need to understand what the problem is, and how it came about. And that requires understanding what is being discussed, which means understanding what the terms mean & apply to.
Since you chose to ignore the facts, then any input you continue to provide on this subject will not be of any relevance or use.

No, you are changing the units of measurement to fit your theory.

I know what caused the problem. It was a mixture of three things: the increase in tasks/device, a series of bad drivers for Nvidia GPUs, and the ATI/AMD GPU problems with the subsequent increase to three tasks for validation of noise bombs. The last of these causes a problem because if the mix includes two ATI/AMD devices, the bad boys gang up on the good guy and "win".

All of which caused the SETI database to increase in size so that it no longer fits in RAM, forcing it to shuffle data back and forth to the HDDs, a relatively very slow process.
This has caused at least one blockage in how results are processed through the system, specifically after they have been crunched and reported.

At the moment the problem is being managed by throttling the splitters but this is not curing the problem.

The problem doesn't seem to be in Validation, because you can observe the reports, and if your computer is the 2nd to report, credit etc. is granted almost immediately.

What you can see is that validated MB tasks are now staying in view longer than the 'normal' 24 hours, so they are not being purged as quickly as before. (The longest of my tasks still in view was validated 27 Feb 2020, 21:53:44 UTC.)
Note. The opposite is happening to AP tasks, they are being removed from view in 6 hours or less.

Therefore my prognosis is that there is a blockage for MB tasks between 'validation' and 'purging'.

edit] And the large numbers in the Validation row are not caused by the GPU problems mentioned above, as those have mostly been cleared and are definitely less than 2% of all tasks outstanding.
ID: 2034971
Darrell Wilcox Project Donor
Volunteer tester

Joined: 11 Nov 99
Posts: 303
Credit: 180,954,940
RAC: 118
Vietnam
Message 2034977 - Posted: 2 Mar 2020, 8:07:30 UTC - in response to Message 2034963.  
Last modified: 2 Mar 2020, 8:14:25 UTC

@ Grant (SSSF)
I think you are a knowledgeable person, but also a bit stubborn and unwilling to consider changes that, while they don't FIX the problem, at least start to lessen the impact the problem is having on the system.

I agree with you that in the longer term, we will need a bigger and faster server. It won't happen
fast enough to fix this problem.

Setting a short(er) deadline will result in a smaller DB, but not soon enough to FIX this problem.

My suggestions won't fix this problem, either. At best, over a few days, they will improve the
situation, and they can begin doing so IMMEDIATELY and easily.

Take the easy (low hanging fruit) suggestions now, and work toward a better long-term solution
that will take much longer to implement.

The system these days is processing about 20,000 MORE tasks per HOUR than it was last November. The system isn't broken, just a bit impaired. Let's remove some of that impairment.

And will be of no benefit in reducing the database size in any significant or meaningful way, so why even do it?
Just how much is "significant" or "meaningful"? Any reduction might be just the little bit that allows ALL of the DB to reside in RAM. Why do it? Because it is easy and fast to do, even if the reduction is relatively small and slowly made.

EDIT: Everyone should occasionally read the very bottom line on these pages. The one that says
SETI is funded by our government, so everyone should be allowed to participate. Donations by
volunteers, as I have also done, are great for special needs, but don't entitle us to actually "drive
the train", just to ride on it. Oh, I mean the U.S. government, not the Vietnamese one.
ID: 2034977
Stephen "Heretic"
Volunteer tester

Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2034979 - Posted: 2 Mar 2020, 8:12:08 UTC - in response to Message 2034865.  

Yes, and if everyone lowered their cache to a Day, the Turnaround Time would probably drop significantly. I'd guess most people out there have their cache set at 10 days. Imagine if they All dropped it to a Day...


. . Considering the inaccuracy in calculating run times, it would probably be best if the work fetch were set to the current average of 1.2 days to achieve an actual single day of WUs. But if everyone did that then yes, the average would go lower and some of the load would reduce. I think everyone who reads these forums knows my attitude to keeping caches under one day.

Stephen

. .
ID: 2034979
Stephen "Heretic"
Volunteer tester

Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2034980 - Posted: 2 Mar 2020, 8:14:13 UTC - in response to Message 2034867.  

Yes. If you spot any host in the database with a 10 day average turnround, please let us know.

I'll start the ball rolling with host 8873865 (average turnround 62.14 days) - that's gone up from 53.85 days when I first saw it.


. . Looks like the classic newbie not liking the real world process and leaving ...

Stephen

:(
ID: 2034980
Darrell Wilcox Project Donor
Volunteer tester

Joined: 11 Nov 99
Posts: 303
Credit: 180,954,940
RAC: 118
Vietnam
Message 2034981 - Posted: 2 Mar 2020, 8:16:55 UTC - in response to Message 2034980.  

@ Stephen "Heretic"
Got a new toy to talk with E.T., and when it didn't work instantly, gave up.
ID: 2034981
Stephen "Heretic"
Volunteer tester

Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2034982 - Posted: 2 Mar 2020, 8:19:23 UTC - in response to Message 2034873.  

I got one like that

https://setiathome.berkeley.edu/workunit.php?wuid=3715153965

Crunched it 29th Oct; it will roll over again on 17th Mar

Got 20+ from Dec


. . The worst I've seen was a 3-time rollover where the 3rd wingmate finally crunched the task right at the end of their deadline, so it sat in limbo for over 6 months ... I'm sure others have seen even worse. So the combination of delinquent hosts and overlong deadlines is very damaging to the project. At least we can tackle the deadlines.

Stephen

. .
ID: 2034982
Stephen "Heretic"
Volunteer tester

Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2034983 - Posted: 2 Mar 2020, 8:24:36 UTC - in response to Message 2034880.  

Try 0.01 and 0.01. Works perfectly, and makes you return the tasks very fast.
And if your computer dies or you stop crunching for other reasons, there's not
going to be many tasks hanging around.
(Sure, you will run out of tasks during outages and glitches, but so what?)
But I guess that's totally out of the question for some people here.....
And by all means, NEVER EVER comment on the fact that nothing has really been analyzed of all the millions (billions) of tasks that have been crunched here over the years. All we do is crunch for the "warehouse", so to speak.
We may already have all the evidence we need of ET, but we will never know until the crunched data is analyzed.
No hurry really to crunch more for the "warehouse".
Shut down this part of the project for a couple of years (or permanently), and concentrate on the most important part now, which is ANALYZING.


. . Sadly it seems that the final part of the process is not suitable for distributed computing, so terminating the first part (us) will not speed it up one little bit. If they could deal with the final phase via BOINC then I would happily devote at least half of my computers to that purpose.

Stephen

:(
ID: 2034983
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13947
Credit: 208,696,464
RAC: 304
Australia
Message 2034985 - Posted: 2 Mar 2020, 8:27:35 UTC - in response to Message 2034977.  

@ Grant (SSSF)
I think you are a knowledgeable person, but also a bit stubborn and unwilling to consider changes that, while they don't FIX the problem, they at least start to lessen the impact the problem is having on the system.
I'm all for changes I can see having a significant positive impact, reduced deadlines and a further reduced deadline for resends being the main example. The reduction in the number of WUs per device you propose will leave many without full work, let alone get them through the many not-so-infrequent periods where the Scheduler takes 10 minutes to an hour off. And the resulting reduction in Work in progress won't necessarily improve things enough to help the situation.
A change like that, which will have a significant negative impact on a large number of users but very likely won't result in a significant improvement in the thing it is being changed for, doesn't make sense to me.


The system these days is processing about 20,000 MORE tasks per HOUR now than it was last November. The system isn't broken, just a bit impaired. Let's remove some of that impairment.
It's not so much a case of being broken as a case of reaching its limits. Reducing deadlines looks to offer a good reduction in database size, with minimal if any impact on crunchers (if a system can't return 1 WU a month then it doesn't really have anything to offer, IMHO). But if the project gets its wish for more crunchers & computing resources to process work, any benefits we get from these changes won't stop the problem from occurring again in the future (even if Quorums return to 2 for all WUs).
The fact is the database has reached the point that the present hardware can't support it, and the load is only going to grow (hopefully). It's time for new hardware.
Grant
Darwin NT
ID: 2034985
Darrell Wilcox Project Donor
Volunteer tester

Joined: 11 Nov 99
Posts: 303
Credit: 180,954,940
RAC: 118
Vietnam
Message 2034987 - Posted: 2 Mar 2020, 8:30:38 UTC - in response to Message 2034982.  

@ Stephen "Heretic"
. . The worst I've seen was a 3-time rollover where the 3rd wingmate finally crunched the task right at the end of their deadline, so it sat in limbo for over 6 months ... I'm sure others have seen even worse. So the combination of delinquent hosts and overlong deadlines is very damaging to the project. At least we can tackle the deadlines.

I believe the server DB could have several million crunched and returned tasks, validated or not, in the assimilation queue, the Purge queue, and the Field queue, and NOT have a significant impact on processing. As several posters have stated or implied, if the DB fits in RAM, we don't have a significant processing issue. When it won't fit, then we do have an issue.
ID: 2034987
Stephen "Heretic"
Volunteer tester

Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2034989 - Posted: 2 Mar 2020, 8:32:24 UTC - in response to Message 2034981.  

@ Stephen "Heretic"
Got a new toy to talk with E.T., and when it didn't work instantly, gave up.


. . We see a lot of that ... the instant gratification generation ...

Stephen

< shrug >
ID: 2034989
Darrell Wilcox Project Donor
Volunteer tester

Joined: 11 Nov 99
Posts: 303
Credit: 180,954,940
RAC: 118
Vietnam
Message 2034991 - Posted: 2 Mar 2020, 8:46:53 UTC - in response to Message 2034985.  

@ Grant (SSSF)
But the resulting reduction of Work in progress won't necessarily improve things enough to help the situation.

Again you dismiss this suggestion out of hand without actually knowing whether it will or will not have a "significant positive impact". At least it will have a positive impact of some amount, and it requires perhaps as much as 5 minutes to add or modify two parameters in the cc_config.xml file for the server. Cheap! Easy! Positive impact on the DB! And it doesn't cut out anyone.

Reducing deadlines would have a positive impact on the DB, but would cut out the processing of the slow or infrequent systems. If the user base gets too small, or too angry at being cut out, their votes could backfire on SETI and have the government funding reduced or eliminated. I don't want that to happen. It is far better to put up with some "noise" in the DB from the ones who over-cache, or sample and run away.

Let's do these changes, and then let's get a fund raiser started for new/added server hardware.
ID: 2034991
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19691
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034992 - Posted: 2 Mar 2020, 8:47:20 UTC - in response to Message 2034985.  
Last modified: 2 Mar 2020, 8:55:25 UTC

Reducing deadlines looks to offer a good reduction in database size,


Are you sure? Eric's graph https://setiathome.berkeley.edu/forum_thread.php?id=83848&postid=1978208#1978208 indicates 98% of all tasks get returned within 3 days.

edit] I think a better suggestion is to reduce the cache to a true 24 hour limit.
My 0.6 day (14.4 hrs) cache only lasts ~10 hrs at best. (0.6 days = ~150 tasks for GPU.)
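The gap between the requested cache and its real lifetime can be sanity-checked from those figures (a sketch using only the numbers quoted in this post; the per-task "estimate" is merely implied by the cache size, not something BOINC reports directly):

```python
# Figures quoted above: a 0.6-day cache, mapping to ~150 GPU tasks,
# that in practice drains in about 10 hours.
cache_days = 0.6
tasks_in_cache = 150
observed_hours = 10.0

requested_hours = cache_days * 24                          # 14.4 hours requested
est_min_per_task = requested_hours * 60 / tasks_in_cache   # implied server estimate
real_min_per_task = observed_hours * 60 / tasks_in_cache   # observed crunch rate

print(f"requested {requested_hours:.1f} h; "
      f"estimated {est_min_per_task:.2f} min/task vs observed {real_min_per_task:.2f} min/task")
```

On these numbers the implied estimate is ~5.8 min/task against ~4 min observed, i.e. run-time estimates running some 40% high, which is why the cache drains well short of the requested 14.4 hours.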
ID: 2034992
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13947
Credit: 208,696,464
RAC: 304
Australia
Message 2034997 - Posted: 2 Mar 2020, 9:13:03 UTC - in response to Message 2034991.  

Again you dismiss this suggestion out of hand without actually knowing if it will or will not have a "significant positive impact". At least it will have a positive impact of some amount, and it
requires, perhaps, as much as 5 minutes to add or modify two parameters in the cc_config.xml file for the server. Cheap! Easy! Positive impact on the DB! And it doesn't cut out anyone.
I am not dismissing it out of hand; I looked at the effect of reducing the limits from their highest level down to the present one. It had a significant effect on the work in progress, but it didn't have much of an effect on the overall database size, because a large impact on a small proportion of a large value is still a small impact overall.
And it does cut people out, by not using all that they offer: look at the size of the limit you are proposing. Many CPU systems now have more than 10 cores, and they would no longer be able to contribute fully to the project. 100 per GPU, fair enough. But 10 per CPU? You are concerned about alienating people who contribute next to nothing to the project, but not those who contribute significantly?
If we are going to limit the size of the In progress tasks, the best way would be to limit the size of the cache, not set an absolute limit on tasks.
Hence my lack of support for this suggestion.


Reducing deadlines would have a positive impact on the DB, but would cut out the processing of the slow or infrequent systems. If the user base gets too small, or too angry at being cut out, their votes could backfire on SETI and have the government funding reduced or eliminated. I don't want that to happen. It is far better to put up with some "noise" in the DB from the ones who over-cache, or sample and run away.
And here you are dismissing my arguments without serious consideration.
Do you honestly consider the return of 1 WU per month to be a worthy contribution to the project? I don't. One a week, sure. One a fortnight, maybe. But one WU per month? No, not in a million years. Yet the fact is that a 28 day deadline will still make it possible for such a system to contribute to the project.
If people get upset that they can no longer return 3 or 6 or 8 WUs a year, then so be it. Let the ability to do 12 a year be the new minimum required to process work for SETI (even if they only choose to do 1, or even 0; the reduced deadline, along with a 7 day deadline for resends, would help keep the database smaller than it presently is).


Let's do these changes, and then let's get a fund raiser started for new/added server hardware.
Changes that have the minimum impact on the majority of crunchers, and the greatest impact on the database issue.
Grant
Darwin NT
ID: 2034997
Darrell Wilcox Project Donor
Volunteer tester

Joined: 11 Nov 99
Posts: 303
Credit: 180,954,940
RAC: 118
Vietnam
Message 2035000 - Posted: 2 Mar 2020, 9:28:12 UTC - in response to Message 2034992.  

@ W-K 666

Are you sure. Eric's graph https://setiathome.berkeley.edu/forum_thread.php?id=83848&postid=1978208#1978208 indicates 98% of all tasks get returned in 3 days.
These are my calculated percentages:

Day     Workunit storage          %    Cumulative %
1              1,450,000     26.73%          26.73%
2                790,000     14.56%          41.29%
3                380,000      7.00%          48.29%
4                110,000      2.03%          50.32%
5                190,000      3.50%          53.82%
6                150,000      2.76%          56.59%
7                100,000      1.84%          58.43%
8                 80,000      1.47%          59.91%
9                 90,000      1.66%          61.57%
10                85,000      1.57%          63.13%
11-60          2,000,000     36.87%         100.00%   (about 0.75% per day after day 10)


"tasks returned" estimated by eyeball.
ID: 2035000
rob smith
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22799
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2035001 - Posted: 2 Mar 2020, 9:28:42 UTC

I believe the server DB could have several million crunched and returned, validated or not, in the assimilation queue, the Purge queue, the Field queue, and NOT have a significant impact on processing.

Well, the working database has millions of tasks at various stages along the path from "newly created" to "ready for deletion", and there is an issue: this table is far too big to sit in memory, so it has to be "paged" in and out to allow work to be processed by the servers. All the data you describe is actually held in a single table, and queries are run by each of the processes to update each task until it is completed and deleted; the processes are "crunching", "validation", "assimilation", "purge" & "delete". A set of flags identifies where in the process a task is sitting: a process cannot work on a task that is marked as being used in an earlier step, or indeed go back and look at one it has already dealt with.
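The flag-driven hand-off described here can be sketched as a toy state machine (names are illustrative only; BOINC's real schema uses several separate state columns rather than one, e.g. for validation and assimilation state):

```python
from enum import IntEnum

# Toy sketch of the single-table, flag-driven pipeline described above.
# Field and state names are illustrative, not BOINC's actual schema.
class Stage(IntEnum):
    CRUNCHING = 0
    RETURNED = 1
    VALIDATED = 2
    ASSIMILATED = 3
    PURGED = 4

def daemon_pass(rows, works_on, advances_to):
    """One pass of a server process: it only touches rows flagged at its
    own stage, marks them for the next process, and never revisits them."""
    moved = 0
    for row in rows:
        if row["stage"] == works_on:
            row["stage"] = advances_to
            moved += 1
    return moved

# Four results that have been crunched and reported:
rows = [{"id": i, "stage": Stage.RETURNED} for i in range(4)]
daemon_pass(rows, Stage.RETURNED, Stage.VALIDATED)     # validator pass
daemon_pass(rows, Stage.VALIDATED, Stage.ASSIMILATED)  # assimilator pass
```

The point of the flags is that an assimilator pass finds nothing until a validator pass has advanced the rows, which is why a bloated "awaiting validation" population backs everything up behind it.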

There is the odd issue in that the validators do not always correctly flag tasks they've dealt with, and this appears to coincide either with a major server "bump" or with tasks being processed as the validators are stopped. Richard(?) did do some digging on this a long time ago, but since it was such a rare occurrence nothing appears to have been done about it.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2035001
Darrell Wilcox Project Donor
Volunteer tester

Joined: 11 Nov 99
Posts: 303
Credit: 180,954,940
RAC: 118
Vietnam
Message 2035008 - Posted: 2 Mar 2020, 9:49:08 UTC - in response to Message 2034997.  

@ Grant (SSSF)
Ahh, as Winston Churchill supposedly said, "Now we are negotiating the price."
But 10 per CPU?
I concede to you this point, so let's make it 50 or even 60.
Do you honestly consider the return of 1 WU per month to be a worthy contribution to the project? I don't.
Yes, I do, because the U.S. already has enough people who don't believe in science. If they come, let them contribute even a small amount. Maybe they will inspire somebody who has a big fast computer to try it, too.
One a week, sure. One a fortnight, maybe. But one WU per month? No, not in a million years. But the fact is that a 28 day deadline will still make it possible for such a system to contribute to the project.
The "cost" to the project to carry those few is very small, so any contribution made is worth it.
... along with a 7 day deadline for resends ...
This part is also very doable without software changes, and I support it. By changing cc_config.xml on the server, the Resends could be preferentially sent to "reliable" systems (BOINC scheduler code speak for "a system with a short turnaround and few errors") for reprocessing.

So will you contact Mr. Kevvy to start a fund raiser for more RAM, a new motherboard and RAM, or something else?
ID: 2035008
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13947
Credit: 208,696,464
RAC: 304
Australia
Message 2035009 - Posted: 2 Mar 2020, 9:54:01 UTC - in response to Message 2035008.  

So will you contact Mr. Kevvy to start a fund raiser for more RAM, a new motherboard and RAM, or something else?
We need Eric to state his requirements, then the fun begins.

I've already posted my suggested system somewhere around these forums (it can handle 3TB of RAM, a sufficient improvement over the current system's 96GB to be good for a couple of years, I should think).
Grant
Darwin NT
ID: 2035009
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13947
Credit: 208,696,464
RAC: 304
Australia
Message 2035010 - Posted: 2 Mar 2020, 10:08:30 UTC - in response to Message 2034992.  

Reducing deadlines looks to offer a good reduction in database size,
Are you sure? Eric's graph https://setiathome.berkeley.edu/forum_thread.php?id=83848&postid=1978208#1978208 indicates 98% of all tasks get returned within 3 days.
edit] I think a better suggestion is to reduce the cache to a true 24 hour limit.
My 0.6 day (14.4 hrs) cache only lasts ~10 hrs at best. (0.6 days = ~150 tasks for GPU.)
Maybe so, but a quick look at one of my systems shows 5,461 Pending. Unfortunately the Task pages are barely responsive, but before things came to a halt I got to around 30 days out. There were just under 800 WUs over the 30-day line.
That's almost 15% of my Pendings.
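For reference, the arithmetic behind "almost 15%", using only the figures quoted here:

```python
pending = 5_461        # tasks listed as Pending on the host
over_30_days = 800     # "just under 800" WUs beyond the 30-day mark

share = 100.0 * over_30_days / pending
print(f"{share:.1f}% of pendings are over 30 days old")
```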
Grant
Darwin NT
ID: 2035010


 
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.