The Server Issues / Outages Thread - Panic Mode On! (118)
| Author | Message |
|---|---|
|
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
|
"My vote was for shutting down the splitters for a week (or 2 or however long it takes), and just have people process resends until such time as the Validation & Assimilation backlogs have cleared." This would alienate all those users who are not following these forums, making them quit or switch to other projects permanently. Loss of users would help the server congestion but hurt the science progress. I think letting the backlogs clear at the start of a Tuesday downtime would make a big difference, especially if they also trigger the validation of all those results that have missed validation for various reasons over the last weeks and are now waiting for the deadlines. The resend cycle wouldn't clear, but resends are a small percentage of all the tasks. The huge 'Workunits waiting for assimilation' backlog, which is now 3.5 million and still rising, would clear. Those workunits waiting for assimilation must have corresponding result rows still in the database, at least for the canonical result but probably for all the results, because I have never seen any workunit on the website show part of its results deleted while the workunit still exists. The number of results waiting for assimilation is not shown on the SSP, so I guess those results may still be counted in the validation queue. If this is the case, then those may explain over 7 million of the current 12 million result validation queue! "Once we have the new NAS device up and running, bump up the limits &..." When the problem is the database not fitting in RAM, the disk performance increase won't fix the problem. It only reduces the magnitude of the consequences a bit. |
Unixchick Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22
|
"My vote was for shutting down the splitters for a week (or 2 or however long it takes), and just have people process resends until such time as the Validation & Assimilation backlogs have cleared." +1 I think this is a great idea. We will all still get work... just make it the resends, until the db reaches a good size. P.S. The idea of processing data without a wingman, or having a bad result put in over my good result, is BS and worthless. I love Seti, but I don't want feel-good theater, I want SCIENCE! So I'm NNT until it gets better. |
|
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
|
The Splitters have fallen off again, most requests are receiving 'Project has No tasks...' again. Caches are falling.... one is down by 50% already, and so it continues. Oh, the problem with failed Uploads has also returned. Probably has something to do with returning around 60 to 70 completed tasks every 5 minutes. |
|
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
|
Now they have apparently switched to 'initial replication 1': 3861450832. So there is no more risk of bad results returned first making good results returned later fail, but also no chance whatsoever of catching the bad results. |
|
Grant (SSSF) Joined: 19 Aug 99 Posts: 14041 Credit: 208,696,464 RAC: 304
|
"Once we have the new NAS device up and running, bump up the limits &..." "When the problem is the database not fitting in RAM, the disk performance increase won't fix the problem. It only reduces the magnitude of the consequences a bit." Depending on how much better they perform, the need for all of it to fit in RAM may not arise (although that is rather wishful thinking - i am expecting the new storage to be significantly faster than the existing storage, however i don't expect the difference to be significant enough). Or they could be replaced by an AFA (All Flash Array), negating the need for the entire thing to fit in RAM. Or a new server with more RAM. Or better yet, both. Grant Darwin NT |
|
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
|
Results received in last hour = 197,095. Just a matter of time now, probably not long. |
|
Grant (SSSF) Joined: 19 Aug 99 Posts: 14041 Credit: 208,696,464 RAC: 304
|
"Results received in last hour = 197,095" Already getting "Project has no tasks available" messages, i think TBar posted similarly in another thread. Caches running down. Not surprising considering the return rate & the increasing Validation & Assimilation backlogs - both have reached new record highs. Grant Darwin NT |
|
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
|
"Or they could be replaced by an AFA (All Flash Array), negating the need for the entire thing to fit in RAM." Setiathome Boinc database running from flash would burn out the flash in a short time! |
Unixchick Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22
|
If we are no longer validating the WUs properly... some of mine don't even have a wingman... why is the 'Results returned and awaiting validation' number growing in the status?? Edit: Ville - I love your Pluto pic. |
|
Grant (SSSF) Joined: 19 Aug 99 Posts: 14041 Credit: 208,696,464 RAC: 304
|
"Or they could be replaced by an AFA (All Flash Array), negating the need for the entire thing to fit in RAM." "Setiathome Boinc database running from flash would burn out the flash in a short time!" After several decades. Yes, if you were to use consumer/client SSDs they would die rather quickly, however SSDs designed for enterprise use will last an extremely long time, under much heavier use than Seti provides. For example, take DWPD (Drive Writes Per Day, where the entire capacity of the drive is written in a 24hr period): consumer drives are rated at around 0.1 to 0.5 DWPD, enterprise drives are rated as high as 3 DWPD, and some specialised write drives even higher. And of course with multiple drives in an array or pool, even under the heaviest of loads, they will never come close to their rated maximum DWPD limit. Grant Darwin NT |
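The DWPD argument in the post above can be put into rough numbers. The following is only a sketch: the 5-year rating period, the 1 TB drive size and the 0.5 TB/day write load are hypothetical values chosen for illustration, not figures from the project, and the function names are made up here. The DWPD values (3 for an enterprise drive, 0.3 for a consumer drive) come from the ranges quoted in the post.

```python
def rated_total_writes_tb(capacity_tb, dwpd, rating_years=5.0):
    """Total data (TB) a drive is rated to absorb over its rating period.

    DWPD means 'full drive writes per day', so the rated total is
    capacity * DWPD * days in the rating period.
    rating_years=5.0 is an assumption for this sketch.
    """
    return capacity_tb * dwpd * 365 * rating_years


def years_to_rated_limit(capacity_tb, dwpd, array_writes_tb_per_day,
                         drives=1, rating_years=5.0):
    """Years until each drive reaches its rated write total.

    Assumes the write load is spread evenly over all drives in the
    array or pool, which is a simplification.
    """
    per_drive_tb_per_day = array_writes_tb_per_day / drives
    total = rated_total_writes_tb(capacity_tb, dwpd, rating_years)
    return total / (per_drive_tb_per_day * 365)


# Hypothetical load: 0.5 TB written per day to a single 1 TB drive.
enterprise_years = years_to_rated_limit(1.0, 3.0, 0.5)
consumer_years = years_to_rated_limit(1.0, 0.3, 0.5)

# Same per-drive load, but spread over an 8-drive pool (4 TB/day total).
pooled_years = years_to_rated_limit(1.0, 3.0, 4.0, drives=8)
```

Under these assumed numbers the enterprise drive reaches its rated write total about ten times later than the consumer drive, simply because its DWPD rating is ten times higher; adding drives to the pool lowers the per-drive load in the same proportion.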
|
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
|
"Ville - I love your Pluto pic." I got the idea of using a Pluto pic from you ;) |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874
|
"If we are no longer validating the WUs properly... some of mine don't even have a wingman... why is Results returned and awaiting validation number growing in the status??" My theory is that the Transitioner hasn't marked all those returns as 'ready to validate' - I think the bulk of them have been sitting there untouched since the December troubles. Eric replied - very late on Thursday night, his time - "I will see if I can figure out a transitioner trick tomorrow, in which case I will revert to standard replication." (I suggested that Matt might have a script for that - I think we've done it before) |
|
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
|
The first machine is now out of work, https://setiathome.berkeley.edu/results.php?hostid=6796479 The Next machine's cache is down by 60%, it will be out soon, https://setiathome.berkeley.edu/results.php?hostid=6813106 The load is still above 190k, and the Splitters can't keep up, https://setiathome.berkeley.edu/show_server_status.php |
|
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
|
"If we are no longer validating the WUs properly... some of mine don't even have a wingman... why is Results returned and awaiting validation number growing in the status??" If my theory about results belonging to workunits waiting for assimilation being shown as waiting for validation is correct, then about 7.5 million of the 12.2 million results there could be ones that have been validated but not assimilated yet. And that is growing fast. The 'Workunits waiting for assimilation' number is supposed to be close to zero in a normal situation because workunits get assimilated immediately after they have been validated. But for more than a week now that number has been steadily growing, recently by about 30,000 per hour. The Astropulse number has also been growing for the last couple of hours. There is some serious performance problem in assimilation. |
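The arithmetic behind the estimate in the post above can be sketched as follows. This assumes two results per workunit (the standard initial replication, which is an assumption here and not stated in the post) and a workunit backlog of about 3.75 million, which is the figure implied by "7.5 million results" under that assumption. The function names are hypothetical.

```python
RESULTS_PER_WORKUNIT = 2  # assumption: standard initial replication of 2


def results_held_by_backlog(workunits_waiting,
                            results_per_workunit=RESULTS_PER_WORKUNIT):
    """Result rows kept in the database by workunits awaiting assimilation."""
    return workunits_waiting * results_per_workunit


def share_of_queue(held_results, validation_queue):
    """Fraction of the displayed validation queue explained by the backlog."""
    return held_results / validation_queue


# Figures from the post: about 7.5 million of 12.2 million results,
# with the workunit backlog growing by about 30,000 per hour.
workunits_waiting = 3_750_000          # implied by 7.5 million / 2
validation_queue = 12_200_000

held = results_held_by_backlog(workunits_waiting)
share = share_of_queue(held, validation_queue)
results_growth_per_hour = results_held_by_backlog(30_000)

print(f"held results: {held:,}")
print(f"share of validation queue: {share:.1%}")
print(f"growth: {results_growth_per_hour:,} results per hour")
```

If the theory holds, a little over 60% of the displayed validation queue would really be assimilation backlog, and it would keep growing until the assimilators catch up.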
|
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
|
"The first machine is now out of work" You are crunching too fast. My caches are nearly full on both machines. |
rob smith Joined: 7 Mar 03 Posts: 23016 Credit: 416,307,556 RAC: 380
|
Because I'm nowhere near the bulk of my computers I've had to resort to using the web options page to set don't do any SETI work - the first time I've had to resort to this sort of thing due to actions by SETI :-( Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
|
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
|
There is a noise bombing window in blc35 at around 58692_07 and _08. Those are probably causing the current high return rate. |
|
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
|
"Eric replied - very late on Thursday night, his time -" Is he aware of the assimilation problem? |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874
|
"Eric replied - very late on Thursday night, his time -" "Is he aware of the assimilation problem?" He didn't mention it, but I would expect so, yes: that's an unambiguous figure on the face of the SSP (and in the more complete figures which, I presume, they have access to via internal monitoring). The 'results awaiting validation' and 'workunits awaiting validation' figures are also unambiguous, but they are unusual - why are they so different? The first usually hovers around 4 million, but recently it's been 12 million. Why? The rise started when the 'in progress' limit was raised - an obvious direct connection, no alarm bells. But why is it still so high? That needs explanation, and I've suggested a possible way of finding out the answer. Let's hope it works, else someone is going to have to come up with another suggestion. |
rob smith Joined: 7 Mar 03 Posts: 23016 Credit: 416,307,556 RAC: 380
|
If this is an attempt to reduce the amount of work sitting around waiting to be validated it's not working as the number has increased from about 11,500,000 last night to about 12,250,000 this morning. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
©2026 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.