Message boards :
Number crunching :
The Server Issues / Outages Thread - Panic Mode On! (118)
Author | Message |
---|---|
Kevin Olley Send message Joined: 3 Aug 99 Posts: 906 Credit: 261,085,289 RAC: 572 |
Switched to AP only until this is sorted. NNT it is then :-( Kevin |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13766 Credit: 208,696,464 RAC: 304 |
And even with this self-Validation, the Validation & Assimilation backlogs continue to grow. My Inconclusives look to be heading for an all-time record, and there would appear to be a return of the BLC35 noise bombs. Grant Darwin NT |
Speedy Send message Joined: 26 Jun 04 Posts: 1643 Credit: 12,921,799 RAC: 89 |
If we all set NNT, no work will be processed and no resends will get processed. That is just my opinion. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13766 Credit: 208,696,464 RAC: 304 |
We could cut our caches to 0.0 + 0.0, return every task after just 5 minutes, and get the 'first back' reward? No, I didn't think so either. My vote was for shutting down the splitters for a week (or 2, or however long it takes), and just have people process resends until such time as the Validation & Assimilation backlogs have cleared. Not started to clear, but fully cleared. Pull all BLC35 files and then restart the splitters with 100 + 100 serverside limits again. Once we have the new NAS device up and running, bump up the limits & reintroduce the BLC35 files, then use them to stress-test the system. If it fails again, then it's fundraising time for new database servers that are capable of handling the load (that really needs to be done anyway in order to meet the project's goals of many more crunchers returning much more work). Grant Darwin NT |
Speedy Send message Joined: 26 Jun 04 Posts: 1643 Credit: 12,921,799 RAC: 89 |
I agree, Grant, in regards to the pulling of the data tapes; however, I think you will find that they don't have the manpower to sift through the data and pull selected tapes. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13766 Credit: 208,696,464 RAC: 304 |
"I agree Grant in regards to the pulling of the data tapes however I think you will find that they don't have the manpower to sift through the data and pull selected tapes" Hence just pull all files named BLC35 and hold them over until such time as the servers can handle the load they generate. Plenty of other files to be processed, so no need to do these ones now. Grant Darwin NT |
Gene Send message Joined: 26 Apr 99 Posts: 150 Credit: 48,393,279 RAC: 118 |
I got 3 invalids on the 30th. In all three cases the wingman, who got "valid" credit, returned a StdErr file that was empty - just one line: <core_client_version>7.14.2</core_client_version> It's not sensible that a result which returned no info was marked "valid". I'm going NNT as others have. |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
"My vote was for shutting down the splitters for a week (or 2 or however long it takes), and just have people process resends until such time as the Validation & Assimilation backlogs have cleared." This would alienate all those users who are not following these forums, making them quit or switch to other projects permanently. Loss of users would help the server congestion but hurt the science progress. I think letting the backlogs clear at the start of a Tuesday downtime would make a big difference, especially if they also trigger the validation of all those results that have missed validation for various reasons over the last weeks and are now waiting for the deadlines. The resend cycle wouldn't clear, but resends are a small percentage of all the tasks. The huge 'Workunits waiting for assimilation' backlog that is now 3.5 million and still rising would clear. Those workunits waiting for assimilation must have corresponding result rows still in the database, at least for the canonical result but probably for all the results, because I have never seen any workunit on the website show part of the results deleted while the workunit still exists. The number of results waiting for assimilation is not shown on the SSP, so I guess those results may still be counted in the validation queue. If this is the case, then those may explain over 7 million of the current 12 million result validation queue! "Once we have the new NAS device up and running, bump up the limits &..." When the problem is the database not fitting in RAM, a disk performance increase won't fix the problem. It only reduces the magnitude of the consequences a bit. |
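The back-of-envelope arithmetic in the post above can be sketched as follows. The figures are the rounded ones quoted in the post, not live server statistics, and the two-results-per-workunit factor is an assumption about what is still in the database:

```python
# Rough check of the queue arithmetic quoted in the post above.
# All numbers are the rounded figures from the post, not live server stats.

workunits_awaiting_assimilation = 3_500_000
results_per_workunit = 2          # assumption: both result rows still in the DB
validation_queue_shown = 12_000_000

# Results possibly miscounted into the validation queue on the SSP:
hidden_results = workunits_awaiting_assimilation * results_per_workunit
print(hidden_results)                             # 7,000,000

# Results that would then be genuinely awaiting validation:
print(validation_queue_shown - hidden_results)    # 5,000,000
```

Under those assumptions the "over 7 million of the current 12 million" figure in the post checks out.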
Unixchick Send message Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22 |
"My vote was for shutting down the splitters for a week (or 2 or however long it takes), and just have people process resends until such time as the Validation & Assimilation backlogs have cleared." +1 I think this is a great idea. We will all still get work... just make it the resends, until the db reaches a good size. p.s. The idea of processing data without a wingman, or having a bad result put in over my good result, is BS and worthless. I love Seti, but I don't want feel-good theater, I want SCIENCE! So I'm NNT until it gets better. |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
The Splitters have fallen off again; most requests are receiving 'Project has No tasks...' again. Caches are falling... one is down by 50% already, and so it continues. Oh, the problem with failed Uploads has also returned. Probably has something to do with returning around 60 to 70 completed tasks every 5 minutes. |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
Now they have apparently switched to 'initial replication 1': 3861450832 So no more risk of bad results returned first making good results returned later fail, but also no chance whatsoever of catching the bad results. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13766 Credit: 208,696,464 RAC: 304 |
"Once we have the new NAS device up and running, bump up the limits &..." "When the problem is the database not fitting in RAM, the disk performance increase won't fix the problem. It only reduces the magnitude of the consequences a bit." Depending on how much better they perform, the need for all of it to fit in RAM may not arise (although that is rather wishful thinking - I am expecting the new storage to be significantly faster than the existing storage, however I don't expect it to be significant enough). Or they could be replaced by an AFA (All Flash Array), negating the need for the entire thing to fit in RAM. Or a new server with more RAM. Or better yet, both. Grant Darwin NT |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Results received in last hour = 197,095 just a matter of time now, probably not long. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13766 Credit: 208,696,464 RAC: 304 |
"Results received in last hour = 197,095" Already getting "Project has no tasks available" messages; I think TBar posted similarly in another thread. Caches running down. Not surprising considering the return rate & the increasing Validation & Assimilation backlogs - both have reached new record highs. Grant Darwin NT |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
"Or they could be replaced by an AFA (All Flash Array), negating the need for the entire thing to fit in RAM." The Setiathome Boinc database running from flash would burn out the flash in a short time! |
Unixchick Send message Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22 |
If we are no longer validating the WUs properly... some of mine don't even have a wingman... why is the 'Results returned and awaiting validation' number growing in the status? edit: Ville - I love your Pluto pic. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13766 Credit: 208,696,464 RAC: 304 |
"Or they could be replaced by an AFA (All Flash Array), negating the need for the entire thing to fit in RAM." "Setiathome Boinc database running from flash would burn out the flash in a short time!" After several decades. Yes, if you were to use consumer/client SSDs they would die rather quickly; however, SSDs designed for enterprise use will last an extremely long time, under much heavier use than Seti provides. For example, consider DWPD (Drive Writes Per Day, where the entire capacity of the drive is written in a 24hr period): consumer drives are rated at around 0.1 to 0.5 DWPD, enterprise drives are rated as high as 3 DWPD, and some specialised write-intensive drives even higher. And of course with multiple drives in an array or pool, even under the heaviest of loads, they will never come close to their rated maximum DWPD limit. Grant Darwin NT |
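As a rough illustration of the DWPD arithmetic above, here is a minimal sketch. The drive capacity, warranty period, and array write load are hypothetical figures chosen for illustration, not SETI@home measurements:

```python
# Hypothetical SSD endurance estimate based on the DWPD rating discussed above.
# Capacity, warranty, and write-load figures are illustrative assumptions.

def years_to_wear_out(capacity_tb, rated_dwpd, warranty_years,
                      actual_writes_tb_per_day):
    """Estimate drive lifetime at a given sustained write load.

    Rated endurance (TBW) = capacity * DWPD * warranty days.
    """
    tbw = capacity_tb * rated_dwpd * warranty_years * 365
    return tbw / (actual_writes_tb_per_day * 365)

# A 3.84 TB enterprise drive rated 3 DWPD over a 5-year warranty gives
# about 21,000 TB of rated writes. If a database wrote, say, 2 TB/day
# spread across an 8-drive array, each drive would see ~0.25 TB/day:
print(years_to_wear_out(3.84, 3, 5, 0.25))  # ≈ 230 years
```

Which illustrates the point in the post: an array of enterprise drives at a plausible database write load sits far below its rated DWPD limit.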
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
"Ville - I love your Pluto pic." I got the idea of using a Pluto pic from you ;) |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14656 Credit: 200,643,578 RAC: 874 |
"If we are no longer validating the WUs properly... some of mine don't even have a wingman... why is the 'Results returned and awaiting validation' number growing in the status?" My theory is that the Transitioner isn't marking (hasn't marked) all those returns as 'ready to validate' - I think the bulk of them have been sitting there untouched since the December troubles. Eric replied - very late on Thursday night, his time - "I will see if I can figure out a transitioner trick tomorrow, in which case I will revert to standard replication." (I suggested that Matt might have a script for that - I think we've done it before.) |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
The first machine is now out of work: https://setiathome.berkeley.edu/results.php?hostid=6796479 The next machine's cache is down by 60%; it will be out soon: https://setiathome.berkeley.edu/results.php?hostid=6813106 The load is still above 190k, and the Splitters can't keep up: https://setiathome.berkeley.edu/show_server_status.php |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.