The Server Issues / Outages Thread - Panic Mode On! (118)
Author | Message |
---|---|
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
"Until we can get 'Results returned and awaiting validation' down to around 3.5 million (given the present amount of Work in progress, so 7 million to go), and the 'Workunits waiting for assimilation' back down to 0 (3.7 million to go), any new work just causes those numbers to climb." If the underlying problem is not fixed, the numbers will just start growing again no matter how low they were driven. Apparently the splitters are occasionally running in such short bursts that the SSP can't catch them. I got a small bunch of freshly split _0s and _1s. Mostly noise bombs. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13755 Credit: 208,696,464 RAC: 304 |
"If the underlying problem is not fixed, the numbers will just start growing again no matter how low they were driven." Yep. It appears we've just about finished all the BLC35 noise bombs***. And there is now a fix for the AMD RX 5000 card issues. While the increased serverside limits didn't help things, it was those 2 issues that really brought things undone, as the way to stop dodgy results getting into the science database was to require more than 1 wingman to verify a noisy WU result. Combined with files that were producing almost nothing but noise bombs, the size of the database exploded as the hardware just couldn't keep up with the load. And there may have been other performance-related issues that contributed to the initial rapid database expansion & the correspondingly excruciatingly slow recovery. Having said that, it shows that we really do need new hardware in order to meet (not too distant) future workloads (let alone the continuing upload & download server issues). Edit: *** Having said that, there's still a big heap of them to come (there were that many noisy files there). Grant Darwin NT |
Peter Send message Joined: 12 Feb 14 Posts: 19 Credit: 1,385,738 RAC: 6 |
Yeaaaaah, a lot of tasks for CPU and CPU+GPU are now waiting :) |
Kiska Send message Joined: 31 Mar 12 Posts: 302 Credit: 3,067,762 RAC: 0 |
Edit: Except for the replica, which is now 5,91 hours behind, and it's getting worse for each update of the SSP. :-( Fun time, I just config'd graphs for replica: https://munin.kiska.pw/munin/Munin-Node/Munin-Node/replica_setiathome.html This should make Grant happy :D |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13755 Credit: 208,696,464 RAC: 304 |
"Yeaaaaah, a lot of tasks for CPU and CPU+GPU are now waiting :)" It's nice to get work, but it would have been nicer (given how things are at present) for the backlogs to be a few more million down before that happened. Grant Darwin NT |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13755 Credit: 208,696,464 RAC: 304 |
"This should make Grant happy :D" Very nice. Now, if the "Results returned and awaiting validation" were on the same graph as the "Results out in the field" for both MB & AP it'd be perfect. They're the same order of magnitude as each other (millions for MB and hundreds of thousands for AP), whereas the Assimilation & Deletion numbers are usually around 0 when things aren't broken, so with those values in the millions it's harder to see what's been going on with the smaller values. Oh, and the "Workunits waiting for db purging" and "Results waiting for db purging" could also go on the "Results returned and awaiting validation" and "Results out in the field" graph (or have their own). Pretty please. Pretty please with a cherry on top. Grant Darwin NT |
Kiska Send message Joined: 31 Mar 12 Posts: 302 Credit: 3,067,762 RAC: 0 |
"This should make Grant happy :D" "Very nice." Once it starts populating :D https://munin.kiska.pw/munin/Munin-Node/Munin-Node/results_setiathomev8_in_progress_validation.html Remind me to do the other stuff later |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
"And there is now a fix for the AMD RX 5000 card issues." They can force only 'vanilla' hosts to upgrade their apps. So they can't really revert the triple validation kludge for overflow results before enough of the anonymous platform hosts have updated their apps to make the risk of a task getting sent to two bad hosts tiny enough to be acceptable. Unless they can 'blacklist' AMD GPUs from receiving the _1 if the corresponding _0 was sent to one. But I don't think the system supports this, because if it did, they would have already done it instead of using this triple validation kludge, which isn't even 100% watertight because there's still the risk of all three going to bad hosts. |
Tom M Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
I am waiting and waiting to have the website confirm that I have a full cache. Everything is running Seti@Home except for three weather forecast tasks from WCG. Eyeballing it, it looks like I have a full set of CPU tasks and a less than full set of GPU tasks. But all the GPUs are engaged and I think I may have 150 GPU tasks, so hopefully it will stay that way. Apparently the Replica DB is "just a bit behind". It just reported I have 6 tasks in progress. I know I have to take off my shoes to count past 10 but I am sure I have more than "6" :) Here it is Sunday morning and I/we? are finally getting a steady flow of tasks? Tom A proud member of the OFA (Old Farts Association). |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
"Now, if the 'Results returned and awaiting validation' were on the same graph as the 'Results out in the field' for both MB & AP it'd be perfect" Actually, one of the more interesting graphs would be the SUM of 'Results ready to send', 'Results out in the field', 'Results returned and awaiting validation' and 'Results waiting for db purging' for both MB & AP. That is all eight fields in one sum. This would be the number of results in the database: the value that Eric said has to be kept under 20 million to avoid the result table spilling out of RAM. It is now 18.9 million. Those 71 ancient zombie S@Hv7 results appear to have finally been purged! |
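For anyone wanting to track that sum themselves, here is a minimal sketch of the arithmetic described above. The four field names and the 20 million figure come from the post; the `status` dictionary layout, the function names and the `RAM_LIMIT` constant are made up for illustration, and fetching the numbers from the server status page is left out entirely.

```python
# Sketch only: sums the eight server status fields named in the post above.
# The (app, field) dictionary layout is a hypothetical choice, not anything
# the SSP itself provides.

RESULT_FIELDS = (
    "Results ready to send",
    "Results out in the field",
    "Results returned and awaiting validation",
    "Results waiting for db purging",
)
APPS = ("MB", "AP")

# Figure attributed to Eric in the post: keep the result table under this.
RAM_LIMIT = 20_000_000


def total_results(status):
    """Return the sum of all four result fields for both MB and AP."""
    return sum(status[(app, field)] for app in APPS for field in RESULT_FIELDS)


def headroom(status, limit=RAM_LIMIT):
    """Return how many more results fit before the limit is reached."""
    return limit - total_results(status)
```

A negative `headroom` would mean the result table is already over the limit.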
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
"I am waiting and waiting to have the website confirm that I have a full cache." Do what I did: Write a program that reads the client_state.xml and reports the number of tasks for CPU and GPU. That way you can easily see how full your queues are and you don't need the website for that, so it works even during the out(r)ages. And the data will always be fresh no matter how far behind the replica db is. |
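A rough sketch of what such a program might look like follows (this is not the poster's actual program). It assumes each task appears as a `<result>` element in client_state.xml with an optional `<plan_class>` child, and it guesses "GPU" from substrings in the plan class. That guess is a crude assumption: the `GPU_HINTS` list is hypothetical and may misclassify apps, especially on anonymous platform setups.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: plan classes containing these substrings are GPU apps.
# This is a heuristic, not something BOINC guarantees.
GPU_HINTS = ("cuda", "opencl", "nvidia", "intel_gpu")


def count_tasks(xml_text):
    """Count <result> entries in client_state.xml text, split into CPU/GPU."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for result in root.iter("result"):
        # findtext returns None when <plan_class> is absent.
        plan_class = (result.findtext("plan_class") or "").lower()
        is_gpu = any(hint in plan_class for hint in GPU_HINTS)
        counts["GPU" if is_gpu else "CPU"] += 1
    return counts
```

To use it, read client_state.xml from the BOINC data directory (its location depends on the installation) and pass the text to `count_tasks`; the returned counter holds the "CPU" and "GPU" totals. Since it only reads the local file, it does not depend on the website or the replica.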
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874 |
I think BoincTasks can do that, as well. |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
I think BoincTasks can do that, as well. Quite well, in fact. |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
And, at least for the moment, the floodgates appear to have opened. |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
Something has changed. The floodgates are wide open but the assimilation queue is still getting smaller. |
Chris904395093209d Send message Joined: 1 Jan 01 Posts: 112 Credit: 29,923,129 RAC: 6 |
I'm not seeing the '71' under the S@H V7 column on the server status page. Did those finally get cleaned up in the dbase? ~Chris |
Kissagogo27 Send message Joined: 6 Nov 99 Posts: 716 Credit: 8,032,827 RAC: 62 |
UTC+1 ^^ |
Mr. Kevvy Send message Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319 |
|
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"I'm not seeing the '71' under the S@H V7 column on the server status page. Did those finally get cleaned up in the dbase?" Maybe it is time to start cutting the timeline of the WUs and to make some changes in the way the work is distributed, like sending the resends to the fastest hosts to clear them ASAP. Otherwise we will be trapped in an endless loop of no new work each time the total reaches 20 MM. |
Mr. Kevvy Send message Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319 |
"Or we will be trapped in an endless loop of no new work each time the total reaches 20 MM." Possible explanation of why this has only been happening recently here.... Briefly: Quorum=3 for overflows, coupled with BLC35 files which generate little except overflows. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.