Panic Mode On (92) Server Problems?

David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1602714 - Posted: 19 Nov 2014, 21:53:16 UTC - in response to Message 1602483.  

I think Cosmic sums it up quite nicely within my limits of comprehension. Matt said a couple of years ago that he couldn't foresee any Informix limitation that Seti might hit for the foreseeable future. That may have to be re-visited.

Actually, I think we've got confused over the two different types of database.

Firstly, we have the 'BOINC' database - master and replica - which handles all the transactional stuff for daily processing. That's the one which typically has ~3 million rows for tasks in progress, which means ~1.6 million for WUs in progress - and judging by the message number on Chris's post, 1.6 million (and growing) rows for the forums.

More to the point, it has a huge rate of churn, with a turnover of ~1.5 million rows per day in normal operation. That fragments the database and index structure: as I understand it, compacting and re-indexing the BOINC database is the main reason for the duration of the weekly maintenance (and by implication, if the 'tasks in progress' limits were removed, the weekly outage would take much longer). This is the database which is re-loaded from disk into RAM by the initial queries after each outage: it's run by a MySQL (free, open-source) database engine, and I don't see any prospect of (or need for) changing that: all the BOINC server daemons (splitter, validator, etc.) have to interact directly with this database, and even the slightest change in query syntax would require a lot of work - and render our version of the code incompatible with all the other BOINC projects.

Informix is used for the other databases - the SETI@home and Astropulse science databases. We don't see any data from those databases in our day-to-day interactions with BOINC. They hold data on all the signals found since the beginning of SETI@home, 15 years ago: about 14 billion rows, according to the science status page. That's three orders of magnitude greater than the BOINC transactional stuff. If we are using 100 GB of RAM to cache the BOINC database, we might need 100 Terabytes of RAM to cache the science DB - which probably accounts for the difficulties they're having getting Ntpckr up to speed. I think the last suggestion I read was to leave it on disk, but to use SSDs for speed: I don't know how far they've got with that. The numbers are still eye-watering.

It occurs to me after reading this and subsequent posts that with the AP database and all its attendant processes being down, but (IIRC, and I think my point is still valid if I don't) the splitters having continued for a while after it went down, the Master database is probably holding a lot more than the usual number of rows devoted to AP. The validators were the first thing to go down, so there are a lot of AP WUs just sitting there waiting for validation. If I understand the above correctly, all of that is being held in the Master's RAM. Plus, we have the transition from AP6 to 7, which may be causing another larger than normal chunk of table to be held in RAM.

Perhaps they should have started a new AP science database concurrent with that transition.
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1602714
Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1602721 - Posted: 19 Nov 2014, 22:14:27 UTC - in response to Message 1602714.  

I think Cosmic sums it up quite nicely within my limits of comprehension. Matt said a couple of years ago that he couldn't foresee any Informix limitation that Seti might hit for the foreseeable future. That may have to be re-visited.

Actually, I think we've got confused over the two different types of database.

Firstly, we have the 'BOINC' database - master and replica - which handles all the transactional stuff for daily processing. That's the one which typically has ~3 million rows for tasks in progress, which means ~1.6 million for WUs in progress - and judging by the message number on Chris's post, 1.6 million (and growing) rows for the forums.

More to the point, it has a huge rate of churn, with a turnover of ~1.5 million rows per day in normal operation. That fragments the database and index structure: as I understand it, compacting and re-indexing the BOINC database is the main reason for the duration of the weekly maintenance (and by implication, if the 'tasks in progress' limits were removed, the weekly outage would take much longer). This is the database which is re-loaded from disk into RAM by the initial queries after each outage: it's run by a MySQL (free, open-source) database engine, and I don't see any prospect of (or need for) changing that: all the BOINC server daemons (splitter, validator, etc.) have to interact directly with this database, and even the slightest change in query syntax would require a lot of work - and render our version of the code incompatible with all the other BOINC projects.

Informix is used for the other databases - the SETI@home and Astropulse science databases. We don't see any data from those databases in our day-to-day interactions with BOINC. They hold data on all the signals found since the beginning of SETI@home, 15 years ago: about 14 billion rows, according to the science status page. That's three orders of magnitude greater than the BOINC transactional stuff. If we are using 100 GB of RAM to cache the BOINC database, we might need 100 Terabytes of RAM to cache the science DB - which probably accounts for the difficulties they're having getting Ntpckr up to speed. I think the last suggestion I read was to leave it on disk, but to use SSDs for speed: I don't know how far they've got with that. The numbers are still eye-watering.

It occurs to me after reading this and subsequent posts that with the AP database and all its attendant processes being down, but (IIRC, and I think my point is still valid if I don't) the splitters having continued for a while after it went down, the Master database is probably holding a lot more than the usual number of rows devoted to AP. The validators were the first thing to go down, so there are a lot of AP WUs just sitting there waiting for validation. If I understand the above correctly, all of that is being held in the Master's RAM. Plus, we have the transition from AP6 to 7, which may be causing another larger than normal chunk of table to be held in RAM.

Perhaps they should have started a new AP science database concurrent with that transition.

Yes and no. There is certainly a lot of stalled AP v7 work hanging around, waiting to be moved on down the line - but that's in the transactional (BOINC - MySQL) database. And yes, that would be held in RAM. But Astropulse has always generated far fewer tasks (and hence result rows in the database) than the MB figures I was quoting earlier, by about 30-fold. In roughly 3 months since AP v7 was launched, the BOINC database will have grown by the equivalent of less than 3 days of MB throughput.

And - the point I was trying to make before - that has nothing whatsoever to do with the Informix (science) databases.
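
For anyone who wants to check the arithmetic behind these figures, here is a rough back-of-envelope sketch in Python. It is only an illustration: it assumes a science row costs about as much RAM as a BOINC row, and it takes "roughly 3 months" as 90 days. Neither assumption comes from the project staff, and the variable names are made up for this example.

```python
import math

# Approximate figures quoted in the thread.
TASKS_IN_PROGRESS = 3_000_000      # BOINC rows for tasks in progress
WUS_IN_PROGRESS = 1_600_000        # BOINC rows for workunits in progress
FORUM_ROWS = 1_600_000             # BOINC rows for forum posts
SCIENCE_ROWS = 14_000_000_000      # Informix science database rows

# Assumption for illustration: a science row costs about as much RAM
# as a BOINC row. The thread does not state row sizes.
boinc_rows = TASKS_IN_PROGRESS + WUS_IN_PROGRESS + FORUM_ROWS
orders_of_magnitude = math.floor(math.log10(SCIENCE_ROWS / boinc_rows))

BOINC_CACHE_GB = 100               # the thread's hypothetical "if" figure
science_cache_tb = BOINC_CACHE_GB * 10 ** orders_of_magnitude / 1000

# Astropulse produces roughly 30 times fewer result rows than MB.
AP_V7_DAYS = 90                    # assumption: "roughly 3 months"
AP_TO_MB_RATIO = 30
mb_equivalent_days = AP_V7_DAYS / AP_TO_MB_RATIO

print(orders_of_magnitude, science_cache_tb, mb_equivalent_days)
```

With these inputs the row counts differ by three orders of magnitude, the scaled cache comes out at 100 TB, and three months of AP v7 is worth about 3 days of MB throughput, which is in line with the figures quoted above.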
ID: 1602721
Julie
Volunteer moderator
Volunteer tester
Joined: 28 Oct 09
Posts: 34053
Credit: 18,883,157
RAC: 18
Belgium
Message 1602734 - Posted: 19 Nov 2014, 22:41:22 UTC

That's some good news:)
rOZZ
Music
Pictures
ID: 1602734
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1602742 - Posted: 19 Nov 2014, 22:48:36 UTC - in response to Message 1602734.  
Last modified: 19 Nov 2014, 22:53:37 UTC

That's some good news:)

Work is now getting Split, my T8100 received a good bucket full, all VLARs.

All tasks for computer 7118863

Claggy
ID: 1602742
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1602745 - Posted: 19 Nov 2014, 22:53:50 UTC - in response to Message 1602742.  

That's some good news:)

Work is now getting Split, my T8100 received a good bucket full, all VLARs.

Claggy

Aye, the SETI@home bit buckets are refilling at the moment.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group (http://tinyurl.com/8y46zvu)
ID: 1602745
Wiggo
Joined: 24 Jan 00
Posts: 34758
Credit: 261,360,520
RAC: 489
Australia
Message 1602748 - Posted: 19 Nov 2014, 22:58:22 UTC

Yep, getting some tasks here 2, but why does Rosetta work always want to go into high priority mode whenever I get SETI work? :-O

Cheers.
ID: 1602748
Gary Charpentier (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 30651
Credit: 53,134,872
RAC: 32
United States
Message 1602758 - Posted: 19 Nov 2014, 23:29:40 UTC - in response to Message 1602748.  

Yep, getting some tasks here 2, but why does Rosetta work always want to go into high priority mode whenever I get SETI work? :-O

Cheers.

It is called a denial of crunch attack, a/k/a short deadlines.
ID: 1602758
Wiggo
Joined: 24 Jan 00
Posts: 34758
Credit: 261,360,520
RAC: 489
Australia
Message 1602759 - Posted: 19 Nov 2014, 23:36:34 UTC - in response to Message 1602758.  

Yep, getting some tasks here 2, but why does Rosetta work always want to go into high priority mode whenever I get SETI work? :-O

Cheers.

It is called a denial of crunch attack, a/k/a short deadlines.

They may be short on their deadlines, but there's no way that they won't be finished well before those deadlines arrive.

Cheers.
ID: 1602759
Wiggo
Joined: 24 Jan 00
Posts: 34758
Credit: 261,360,520
RAC: 489
Australia
Message 1602862 - Posted: 20 Nov 2014, 2:56:27 UTC

I wonder if someone will get around to doing something about file 07oc11af.

Cheers.
ID: 1602862
Zalster (Special Project $250 donor)
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1602864 - Posted: 20 Nov 2014, 3:00:58 UTC - in response to Message 1602862.  

Woo hoo...just got a bunch of MB for 1 rig...unfortunately half were shorties but at least the GPUs fired back up.. ;)
ID: 1602864
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1602915 - Posted: 20 Nov 2014, 5:59:15 UTC - in response to Message 1602677.  

Eric's post mentioned it has taken 5 days to get to 25% & that was about 6 days ago. So if it is still going it should be somewhere around 50% complete. In theory that would put the completion time around the 28/29th. Being a holiday weekend for the US, maybe AP shall return after maintenance on the 2nd.

Keep in mind those figures were before the database became unresponsive & the whole show went down for a while.
It's possible the rebuild had to be restarted from scratch.
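
The estimate quoted above is a straight-line extrapolation, which can be sketched in Python as below. This is only an illustration: it assumes the rebuild runs at a constant rate and was never interrupted, which may well not hold given the outage, and the function name is invented for this example.

```python
def extrapolate_rebuild(days_for_sample, fraction_at_sample, days_since_sample):
    """Linearly extrapolate rebuild progress from one progress report.

    Assumes a constant rate and no restart, which is a simplification.
    Returns (estimated fraction complete now, estimated days remaining).
    """
    total_days = days_for_sample / fraction_at_sample
    elapsed_days = days_for_sample + days_since_sample
    fraction_now = min(1.0, elapsed_days / total_days)
    days_remaining = max(0.0, total_days - elapsed_days)
    return fraction_now, days_remaining


# Figures from the quoted post: 25% after 5 days, reported about 6 days ago.
fraction_now, days_left = extrapolate_rebuild(5, 0.25, 6)
print(f"{fraction_now:.0%} complete, about {days_left:.0f} days to go")
```

With the quoted figures this gives roughly 55% complete with about 9 days to go, close to the "around 50%" guess. If the rebuild was restarted from scratch, the elapsed time resets and the estimate no longer applies.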
Grant
Darwin NT
ID: 1602915
Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1602916 - Posted: 20 Nov 2014, 6:03:23 UTC - in response to Message 1602864.  

Woo hoo...just got a bunch of MB for 1 rig...unfortunately half were shorties but at least the GPUs fired back up.. ;)

Just got 14 MBs for my Core 2 Duo - all seem to be mid-range/normal tasks. My other WinXP box has a couple tasks, and my iBook has 1, but I just added a Core 2 Duo running Vista, and it has no tasks yet...
Donald
Infernal Optimist / Submariner, retired
ID: 1602916
Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1602918 - Posted: 20 Nov 2014, 6:06:18 UTC - in response to Message 1602915.  

Eric's post mentioned it has taken 5 days to get to 25% & that was about 6 days ago. So if it is still going it should be somewhere around 50% complete. In theory that would put the completion time around the 28/29th. Being a holiday weekend for the US, maybe AP shall return after maintenance on the 2nd.

Keep in mind those figures were before the database became unresponsive & the whole show went down for a while.
It's possible the rebuild had to be restarted from scratch.

Eric has posted an Update in Tech News - looks like the AP rebuild failed and will have to be redone.....
Donald
Infernal Optimist / Submariner, retired
ID: 1602918
Wiggo
Joined: 24 Jan 00
Posts: 34758
Credit: 261,360,520
RAC: 489
Australia
Message 1602960 - Posted: 20 Nov 2014, 9:14:20 UTC

My 2 rigs have picked up over 200 tasks between them today.

Cheers.
ID: 1602960
Cruncher-American (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1602977 - Posted: 20 Nov 2014, 10:00:30 UTC

I got a (technical term here) bunch of tasks today, too, but only a few GPU ones. I may just pack it in for the duration (won't hurt to save even more on the electric bill) when these are gone...

*sigh*
ID: 1602977
Phil Burden
Joined: 26 Oct 00
Posts: 264
Credit: 22,303,899
RAC: 0
United Kingdom
Message 1603069 - Posted: 20 Nov 2014, 12:39:35 UTC - in response to Message 1602960.  

My 2 rigs have picked up over 200 tasks between them today.

Cheers.


You were lucky, I got 3 ;-( Now long since processed.

P.
ID: 1603069
JohnDK (Crowdfunding Project Donor, Special Project $250 donor)
Volunteer tester
Joined: 28 May 00
Posts: 1222
Credit: 451,243,443
RAC: 1,127
Denmark
Message 1603078 - Posted: 20 Nov 2014, 13:02:31 UTC - in response to Message 1603069.  

My 2 rigs have picked up over 200 tasks between them today.

Cheers.


You were lucky, I got 3 ;-( Now long since processed.

P.

I think the difference between you 2 is the BOINC versions.

I have recently updated all 3 of my PCs from BOINC V6 to V7. V6 kept asking for work trying to fill the cache; V7 asks for work less frequently, and my 24/7 PC hasn't got a single task. The last time it requested new work was about 3 hours ago...
ID: 1603078
Mike (Special Project $75 donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1603087 - Posted: 20 Nov 2014, 13:26:09 UTC - in response to Message 1603078.  

My 2 rigs have picked up over 200 tasks between them today.

Cheers.


You were lucky, I got 3 ;-( Now long since processed.

P.

I think the difference between you 2 is the BOINC versions.

I have recently updated all 3 of my PCs from BOINC V6 to V7. V6 kept asking for work trying to fill the cache; V7 asks for work less frequently, and my 24/7 PC hasn't got a single task. The last time it requested new work was about 3 hours ago...


I'm on BOINC 6 and haven't got much so far.
A single task here and there.
So I'm doing some more beta.


With each crime and every kindness we birth our future.
ID: 1603087
Cruncher-American (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1603104 - Posted: 20 Nov 2014, 14:08:04 UTC
Last modified: 20 Nov 2014, 14:09:25 UTC

Looks like the Server Page has been dead for 2 hours.

EDIT: Whoops! It was my browser cache...Sorry!
ID: 1603104
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1603156 - Posted: 20 Nov 2014, 16:18:59 UTC - in response to Message 1602721.  

It occurs to me after reading this and subsequent posts that with the AP database and all its attendant processes being down, but (IIRC, and I think my point is still valid if I don't) the splitters having continued for a while after it went down, the Master database is probably holding a lot more than the usual number of rows devoted to AP. The validators were the first thing to go down, so there are a lot of AP WUs just sitting there waiting for validation. If I understand the above correctly, all of that is being held in the Master's RAM. Plus, we have the transition from AP6 to 7, which may be causing another larger than normal chunk of table to be held in RAM.

Perhaps they should have started a new AP science database concurrent with that transition.

Yes and no. There is certainly a lot of stalled AP v7 work hanging around, waiting to be moved on down the line - but that's in the transactional (BOINC - MySQL) database. And yes, that would be held in RAM. But Astropulse has always generated far fewer tasks (and hence result rows in the database) than the MB figures I was quoting earlier, by about 30-fold. In roughly 3 months since AP v7 was launched, the BOINC database will have grown by the equivalent of less than 3 days of MB throughput.

That size differential occurred to me after I posted.

And - the point I was trying to make before - that has nothing whatsoever to do with the Informix (science) databases.

I understood that point.

I noticed this morning that I now also have a significant number of MB WUs where both hosts have completed, but validation is still pending. This process usually happens faster than you can reload the web page to look at it.

Speaking of VLARs, when I kicked my i7 to ask Beta for work yesterday, it got over 50 VLARs (and immediately put Einsteins on hold to start working on them).
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1603156
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.