Panic Mode On (92) Server Problems?

David S Volunteer tester Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12	Message 1602714 - Posted: 19 Nov 2014, 21:53:16 UTC - in response to Message 1602483.
I think Cosmic sums it up quite nicely within my limits of comprehension. Matt said a couple of years ago that he couldn't foresee any Informix limitation that Seti might hit for the foreseeable future. That may have to be re-visited.
Actually, I think we've got confused over the two different types of database.
Firstly, we have the 'BOINC' database - master and replica - which handles all the transactional stuff for daily processing. That's the one which typically has ~3 million rows for tasks in progress, which means ~1.6 million for WUs in progress - and judging by the message number on Chris's post, 1.6 million (and growing) rows for the forums. More to the point, it has a huge rate of churn, with a turnover of ~1.5 million rows per day in normal operation. That fragments the database and index structure: as I understand it, compacting and re-indexing the BOINC database is the main reason for the duration of the weekly maintenance (and by implication, if the 'tasks in progress' limits were removed, the weekly outage would take much longer).
This is the database which is re-loaded from disk into RAM by the initial queries after each outage: it's run by a MySQL (free, open-source) database engine, and I don't see any prospect of (or need for) changing that: all the BOINC server daemons (splitter, validator, etc.) have to interact directly with this database, and even the slightest change in query syntax would require a lot of work - and render our version of the code incompatible with all the other BOINC projects.
Informix is used for the other databases - the SETI@home and Astropulse science databases. We don't see any data from those databases in our day-to-day interactions with BOINC.
They hold data on all the signals found since the beginning of SETI@Home, 15 years ago: about 14 billion rows, according to the science status page. That's three orders of magnitude greater than the BOINC transactional stuff. If we are using 100 GB of RAM to cache the BOINC database, we might need 100 Terabytes of RAM to cache the science DB - which probably accounts for the difficulties they're having getting Ntpckr up to speed. I think the last suggestion I read was to leave it on disk, but to use SSD disks for speed: I don't know how far they've got with that. The numbers are still eye-watering.
It occurs to me after reading this and subsequent posts that with the AP database and all its attendant processes being down, but (IIRC, and I think my point is still valid if I don't) the splitters having continued for a while after it went down, the Master database is probably holding a lot more than the usual number of rows devoted to AP. The validators were the first thing to go down, so there are a lot of AP WUs just sitting there waiting for validation. If I understand the above correctly, all of that is being held in the Master's RAM. Plus, we have the transition from AP6 to 7, which may be causing another larger than normal chunk of table to be held in RAM. Perhaps they should have started a new AP science database concurrent with that transition.
David
Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri.
ID: 1602714 ·

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1602721 - Posted: 19 Nov 2014, 22:14:27 UTC - in response to Message 1602714.
I think Cosmic sums it up quite nicely within my limits of comprehension. Matt said a couple of years ago that he couldn't foresee any Informix limitation that Seti might hit for the foreseeable future. That may have to be re-visited.
Actually, I think we've got confused over the two different types of database.
Firstly, we have the 'BOINC' database - master and replica - which handles all the transactional stuff for daily processing. That's the one which typically has ~3 million rows for tasks in progress, which means ~1.6 million for WUs in progress - and judging by the message number on Chris's post, 1.6 million (and growing) rows for the forums. More to the point, it has a huge rate of churn, with a turnover of ~1.5 million rows per day in normal operation. That fragments the database and index structure: as I understand it, compacting and re-indexing the BOINC database is the main reason for the duration of the weekly maintenance (and by implication, if the 'tasks in progress' limits were removed, the weekly outage would take much longer).
This is the database which is re-loaded from disk into RAM by the initial queries after each outage: it's run by a MySQL (free, open-source) database engine, and I don't see any prospect of (or need for) changing that: all the BOINC server daemons (splitter, validator, etc.) have to interact directly with this database, and even the slightest change in query syntax would require a lot of work - and render our version of the code incompatible with all the other BOINC projects.
Informix is used for the other databases - the SETI@home and Astropulse science databases. We don't see any data from those databases in our day-to-day interactions with BOINC.
They hold data on all the signals found since the beginning of SETI@Home, 15 years ago: about 14 billion rows, according to the science status page. That's three orders of magnitude greater than the BOINC transactional stuff. If we are using 100 GB of RAM to cache the BOINC database, we might need 100 Terabytes of RAM to cache the science DB - which probably accounts for the difficulties they're having getting Ntpckr up to speed. I think the last suggestion I read was to leave it on disk, but to use SSD disks for speed: I don't know how far they've got with that. The numbers are still eye-watering.
It occurs to me after reading this and subsequent posts that with the AP database and all its attendant processes being down, but (IIRC, and I think my point is still valid if I don't) the splitters having continued for a while after it went down, the Master database is probably holding a lot more than the usual number of rows devoted to AP. The validators were the first thing to go down, so there are a lot of AP WUs just sitting there waiting for validation. If I understand the above correctly, all of that is being held in the Master's RAM. Plus, we have the transition from AP6 to 7, which may be causing another larger than normal chunk of table to be held in RAM. Perhaps they should have started a new AP science database concurrent with that transition.
Yes and no. There is certainly a lot of stalled AP v7 work hanging around, waiting to be moved on down the line - but that's in the transactional (BOINC - MySQL) database. And yes, that would be held in RAM. But Astropulse has always generated far fewer tasks (and hence result rows in the database) than the MB figures I was quoting earlier, by about 30-fold. In roughly 3 months since AP v7 was launched, the BOINC database will have grown by the equivalent of less than 3 days of MB throughput.
And - the point I was trying to make before - that has nothing whatsoever to do with the Informix (science) databases.
ID: 1602721 ·
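
The scaling argument in the post above can be checked with a few lines of Python. This is only a rough sketch using the approximate figures quoted in the thread. It assumes that cache size scales linearly with row count (that is, both databases have a similar average row size, which the thread does not state), and all variable names are illustrative.

```python
import math

# Row counts quoted in the post (all approximate).
boinc_rows = 3_000_000 + 1_600_000 + 1_600_000   # tasks + WUs + forum posts
science_rows = 14_000_000_000                    # science status page figure

ratio = science_rows / boinc_rows
orders = math.floor(math.log10(ratio))           # whole orders of magnitude

# Assumption: RAM needed scales linearly with row count, i.e. both
# databases have a similar average row size (not stated in the thread).
boinc_cache_gb = 100                             # the post's hypothetical figure
rounded_tb = boinc_cache_gb * 10**orders / 1000  # using the rounded factor
unrounded_tb = boinc_cache_gb * ratio / 1000     # using the raw ratio

# Astropulse vs. multibeam: ~3 months of AP at ~1/30th of the MB task rate.
ap_days = 90                                     # "roughly 3 months"
mb_equivalent_days = ap_days / 30

print(f"ratio ~{ratio:,.0f}x ({orders} orders of magnitude)")
print(f"cache estimate: {rounded_tb:.0f} TB rounded, ~{unrounded_tb:.0f} TB unrounded")
print(f"AP v7 growth ~ {mb_equivalent_days:.0f} days of MB throughput")
```

The raw ratio comes out at a little over 2,000, so "three orders of magnitude" and "100 Terabytes" are best read as order-of-magnitude figures rather than precise estimates. Likewise, 90 days at one-thirtieth of the MB rate gives about 3 days, which is consistent with the post given that both inputs are approximate.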

Julie Volunteer moderator Volunteer tester Joined: 28 Oct 09 Posts: 34053 Credit: 18,883,157 RAC: 18	Message 1602734 - Posted: 19 Nov 2014, 22:41:22 UTC That's some good news:) rOZZ Music Pictures ID: 1602734 ·

Claggy Volunteer tester Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4	Message 1602742 - Posted: 19 Nov 2014, 22:48:36 UTC - in response to Message 1602734. Last modified: 19 Nov 2014, 22:53:37 UTC That's some good news:) Work is now getting Split, my T8100 received a good bucket full, all VLARs. All tasks for computer 7118863 Claggy ID: 1602742 ·

HAL9000 Volunteer tester Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57	Message 1602745 - Posted: 19 Nov 2014, 22:53:50 UTC - in response to Message 1602742. That's some good news:) Work is now getting Split, my T8100 received a good bucket full, all VLARs. Claggy Aye, The SETI@home bit buckets are refilling at the moment. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group (http://tinyurl.com/8y46zvu) ID: 1602745 ·

Wiggo Joined: 24 Jan 00 Posts: 34758 Credit: 261,360,520 RAC: 489	Message 1602748 - Posted: 19 Nov 2014, 22:58:22 UTC Yep, getting some tasks here 2, but why does Rosetta work always want to go into high priority mode whenever I get SETI work? :-O Cheers. ID: 1602748 ·

Gary Charpentier Volunteer tester Joined: 25 Dec 00 Posts: 30651 Credit: 53,134,872 RAC: 32	Message 1602758 - Posted: 19 Nov 2014, 23:29:40 UTC - in response to Message 1602748. Yep, getting some tasks here 2, but why does Rosetta work always want to go into high priority mode whenever I get SETI work? :-O Cheers. It is called a denial of crunch attack, a/k/a short deadlines. ID: 1602758 ·

Wiggo Joined: 24 Jan 00 Posts: 34758 Credit: 261,360,520 RAC: 489	Message 1602759 - Posted: 19 Nov 2014, 23:36:34 UTC - in response to Message 1602758. Yep, getting some tasks here 2, but why does Rosetta work always want to go into high priority mode whenever I get SETI work? :-O Cheers. It is called a denial of crunch attack, a/k/a short deadlines. They may be short on their deadlines, but there's no way that they won't be finished well before those deadlines arrive. Cheers. ID: 1602759 ·

Wiggo Joined: 24 Jan 00 Posts: 34758 Credit: 261,360,520 RAC: 489	Message 1602862 - Posted: 20 Nov 2014, 2:56:27 UTC I wonder if someone will get around to doing something about file 07oc11af. Cheers. ID: 1602862 ·

Zalster Volunteer tester Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 1602864 - Posted: 20 Nov 2014, 3:00:58 UTC - in response to Message 1602862. Woo hoo...just got a bunch of MB for 1 rig...unfortunately half were shorties but at least the GPUs fired back up.. ;) ID: 1602864 ·

Grant (SSSF) Volunteer tester Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1602915 - Posted: 20 Nov 2014, 5:59:15 UTC - in response to Message 1602677. Eric's post mentioned it has taken 5 days to get to 25% & that was about 6 days ago. So if it is still going it should be somewhere around 50% complete. In theory that would put the completion time around the 28/29th. With it being a holiday weekend in the US, AP may return after maintenance on the 2nd. Keep in mind those figures were before the Database became non-responsive & the whole show went down for a while. It's possible the rebuild had to be restarted from scratch. Grant Darwin NT ID: 1602915 ·
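
The completion estimate in the post above is a straight-line extrapolation, which can be reproduced roughly as follows. This sketch assumes linear progress, takes the date of the post (20 Nov 2014) as "today", and treats "about 6 days ago" as exactly 6 days; the variable names are illustrative.

```python
from datetime import date, timedelta

today = date(2014, 11, 20)     # date of the post (assumed reference point)
days_since_report = 6          # "about 6 days ago"
days_to_reported = 5           # "5 days to get to 25%"
reported_fraction = 0.25

# Assumption: the rebuild progresses at a constant rate.
rate_per_day = reported_fraction / days_to_reported
elapsed_days = days_to_reported + days_since_report
start = today - timedelta(days=elapsed_days)

progress_now = min(1.0, rate_per_day * elapsed_days)
total_days = days_to_reported / reported_fraction
eta = start + timedelta(days=total_days)

print(f"estimated progress: {progress_now:.0%}")
print(f"estimated completion: {eta.isoformat()}")
```

Under these assumptions the rebuild would be a little over halfway and would finish on 29 Nov 2014, in line with the "around 50%" and "28/29th" figures. If the rebuild had to be restarted, as the post warns, the estimate no longer holds.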

Donald L. Johnson Joined: 5 Aug 02 Posts: 8240 Credit: 14,654,533 RAC: 20	Message 1602916 - Posted: 20 Nov 2014, 6:03:23 UTC - in response to Message 1602864. Woo hoo...just got a bunch of MB for 1 rig...unfortunately half were shorties but at least the GPUs fired back up.. ;) Just got 14 MBs for my Core2Duo - all seem to be mid-range/normal tasks. My other WinXP box has a couple of tasks, and my iBook has 1, but I just added a Core2 Duo running Vista, and it has no tasks yet... Donald Infernal Optimist / Submariner, retired ID: 1602916 ·

Donald L. Johnson Joined: 5 Aug 02 Posts: 8240 Credit: 14,654,533 RAC: 20	Message 1602918 - Posted: 20 Nov 2014, 6:06:18 UTC - in response to Message 1602915. Eric's post mentioned it has taken 5 days to get to 25% & that was about 6 days ago. So if it is still going it should be somewhere around 50% complete. In theory that would put the completion time around the 28/29th. With it being a holiday weekend in the US, AP may return after maintenance on the 2nd. Keep in mind those figures were before the Database became non-responsive & the whole show went down for a while. It's possible the rebuild had to be restarted from scratch. Eric has posted an Update in Tech News - looks like the AP rebuild failed and will have to be redone..... Donald Infernal Optimist / Submariner, retired ID: 1602918 ·

Wiggo Joined: 24 Jan 00 Posts: 34758 Credit: 261,360,520 RAC: 489	Message 1602960 - Posted: 20 Nov 2014, 9:14:20 UTC My 2 rigs have picked up over 200 tasks between them today. Cheers. ID: 1602960 ·

Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340	Message 1602977 - Posted: 20 Nov 2014, 10:00:30 UTC I got a (technical term here) bunch of tasks today, too, but only a few GPU ones. I may just pack it in for the duration (won't hurt to save even more on the electric bill) when these are gone... sigh ID: 1602977 ·

Phil Burden Joined: 26 Oct 00 Posts: 264 Credit: 22,303,899 RAC: 0	Message 1603069 - Posted: 20 Nov 2014, 12:39:35 UTC - in response to Message 1602960. My 2 rigs have picked up over 200 tasks between them today. Cheers. You were lucky, I got 3 ;-( Now long since processed. P. ID: 1603069 ·

JohnDK Volunteer tester Joined: 28 May 00 Posts: 1222 Credit: 451,243,443 RAC: 1,127	Message 1603078 - Posted: 20 Nov 2014, 13:02:31 UTC - in response to Message 1603069. My 2 rigs have picked up over 200 tasks between them today. Cheers. You were lucky, I got 3 ;-( Now long since processed. P. I think the difference between you 2 is the BOINC versions. I have recently updated all 3 of my PCs to BOINC V7 from V6. V6 kept asking for work trying to fill the cache; V7 asks for work less frequently, and my 24/7 PC hasn't got a single task. The last time it requested new work was about 3 hours ago... ID: 1603078 ·

Mike Volunteer tester Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80	Message 1603087 - Posted: 20 Nov 2014, 13:26:09 UTC - in response to Message 1603078. My 2 rigs have picked up over 200 tasks between them today. Cheers. You were lucky, I got 3 ;-( Now long since processed. P. I think the difference between you 2 is the BOINC versions. I have recently updated all 3 of my PCs to BOINC V7 from V6. V6 kept asking for work trying to fill the cache; V7 asks for work less frequently, and my 24/7 PC hasn't got a single task. The last time it requested new work was about 3 hours ago... I'm on BOINC 6 and haven't got much so far. One single task here and there. So I do some more Beta. With each crime and every kindness we birth our future. ID: 1603087 ·

Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340	Message 1603104 - Posted: 20 Nov 2014, 14:08:04 UTC Last modified: 20 Nov 2014, 14:09:25 UTC Looks like the Server Page has been dead for 2 hours. EDIT: Whoops! It was my browser cache...Sorry! ID: 1603104 ·

David S Volunteer tester Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12	Message 1603156 - Posted: 20 Nov 2014, 16:18:59 UTC - in response to Message 1602721.
It occurs to me after reading this and subsequent posts that with the AP database and all its attendant processes being down, but (IIRC, and I think my point is still valid if I don't) the splitters having continued for a while after it went down, the Master database is probably holding a lot more than the usual number of rows devoted to AP. The validators were the first thing to go down, so there are a lot of AP WUs just sitting there waiting for validation. If I understand the above correctly, all of that is being held in the Master's RAM. Plus, we have the transition from AP6 to 7, which may be causing another larger than normal chunk of table to be held in RAM. Perhaps they should have started a new AP science database concurrent with that transition.
Yes and no. There is certainly a lot of stalled AP v7 work hanging around, waiting to be moved on down the line - but that's in the transactional (BOINC - MySQL) database. And yes, that would be held in RAM. But Astropulse has always generated far fewer tasks (and hence result rows in the database) than the MB figures I was quoting earlier, by about 30-fold. In roughly 3 months since AP v7 was launched, the BOINC database will have grown by the equivalent of less than 3 days of MB throughput.
That size differential occurred to me after I posted.
And - the point I was trying to make before - that has nothing whatsoever to do with the Informix (science) databases.
I understood that point. I noticed this morning that I now also have a significant number of MB WUs where both hosts have completed, but validation is still pending. This process usually happens faster than you can reload the web page to look at it.
Speaking of vlars, when I kicked my i7 to ask Beta for work yesterday, it got over 50 vlars (and immediately put Einsteins on hold to start working on them).
David
Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri.
ID: 1603156 ·

