Panic Mode On (108) Server Problems?

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1904269 - Posted: 2 Dec 2017, 0:08:20 UTC - in response to Message 1904260.  

Since the issue is with the SETI@home science database and not the master or replica databases, I'm not sure how the deadlines would even be relevant, given that the data is stored in the science database after it has been validated.

The reason for the server-side limits was the load on the master database, which was falling over or just plain getting bogged down.
Grant
Darwin NT
ID: 1904269 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1904273 - Posted: 2 Dec 2017, 0:25:04 UTC - in response to Message 1904269.  

Since the issue is with the SETI@home science database and not the master or replica databases, I'm not sure how the deadlines would even be relevant, given that the data is stored in the science database after it has been validated.

The reason for the server-side limits was the load on the master database, which was falling over or just plain getting bogged down.

That is correct. Either the MySQL database software or the hardware the BOINC master db system is using couldn't handle scanning a table of 11,000,000+ tasks several hundred (or thousand?) times a second.

Which, again, is in fact not the science database; that one runs Informix.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1904273 · Report as offensive
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1639
Credit: 12,921,799
RAC: 89
New Zealand
Message 1904275 - Posted: 2 Dec 2017, 0:33:28 UTC - in response to Message 1904218.  


Down until the new year...

Can somebody please give me a link to the official message saying it will be down until next year? I haven't been able to find anything.
ID: 1904275 · Report as offensive
Profile Dr.Diesel Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor

Joined: 14 May 99
Posts: 41
Credit: 123,695,755
RAC: 139
United States
Message 1904276 - Posted: 2 Dec 2017, 0:33:30 UTC - in response to Message 1904273.  

Which, again, is in fact not the science database; that one runs Informix.


Is this, along with each system's OS and primary DB, documented anywhere?
ID: 1904276 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1904277 - Posted: 2 Dec 2017, 0:41:29 UTC - in response to Message 1904276.  
Last modified: 2 Dec 2017, 0:43:47 UTC

Which, again, is in fact not the science database; that one runs Informix.


Is this, along with each system's OS and primary DB, documented anywhere?

You can check the Server Status for most of that information.
https://setiathome.berkeley.edu/show_server_status.php
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1904277 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1904279 - Posted: 2 Dec 2017, 0:46:18 UTC - in response to Message 1904275.  


Down until the new year...

Can somebody please give me a link to the official message saying it will be down until next year? I haven't been able to find anything.

There has been no message, so people are making it up as they feel like it.
I'm voting for next week.
Grant
Darwin NT
ID: 1904279 · Report as offensive
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1639
Credit: 12,921,799
RAC: 89
New Zealand
Message 1904281 - Posted: 2 Dec 2017, 0:54:43 UTC - in response to Message 1904279.  


Down until the new year...

Can somebody please give me a link to the official message saying it will be down until next year? I haven't been able to find anything.

There has been no message, so people are making it up as they feel like it.
I'm voting for next week.

Thank you Grant for the clarification.
Thanks to Eric & the team for working on this issue
ID: 1904281 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1904284 - Posted: 2 Dec 2017, 1:06:42 UTC - in response to Message 1904277.  

Which, again, is in fact not the science database; that one runs Informix.


Is this, along with each system's OS and primary DB, documented anywhere?

You can check the Server Status for most of that information.
https://setiathome.berkeley.edu/show_server_status.php

Uhh, where on the SSP (Server Status page) is the requested information? I see no mention of the OS or database software each server is running.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1904284 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1904289 - Posted: 2 Dec 2017, 1:39:30 UTC - in response to Message 1904273.  

Since the issue is with the SETI@home science database and not the master or replica databases, I'm not sure how the deadlines would even be relevant, given that the data is stored in the science database after it has been validated.

The reason for the server-side limits was the load on the master database, which was falling over or just plain getting bogged down.

That is correct. Either the MySQL database software or the hardware the BOINC master db system is using couldn't handle scanning a table of 11,000,000+ tasks several hundred (or thousand?) times a second.

Which, again, is in fact not the science database; that one runs Informix.
My very first message on this subject stated:
...speeding up the turnaround would also lessen the storage requirements of the master and replica databases, an issue that seems to rear its head quite often.
...
Now, I don't know what percentage of the master database is occupied by task data, versus workunit data, account data, host data, and so on,...
And I tried in a subsequent message to reiterate that distinction:
If, by "main" database, you mean the science database, that's true, and it's the science database that crashed this week. But the master and replica databases are the ones impacted by the active task and WU volume, and also the ones most often running into space-related problems.
Why does the science database keep getting dragged in?
ID: 1904289 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1904292 - Posted: 2 Dec 2017, 2:04:55 UTC - in response to Message 1904268.  

Or maybe we set it so there is a minimum deadline of 1 week, to allow for Seti server issues as well as local issues. For any system that returns work within 1 week or less, that will be the deadline. For systems that take over 1 week average turnaround time, their deadline is their average turnaround time + 1 week.
So for a system that takes 8 days, the deadline will be 8 + 7 = 15 days; for one that takes 23 days to return work, the deadline will be 30 days, etc.
I don't know that it even needs to be cut that close. Heck, even a three or four week cushion (for the longer-running tasks) would effect a significant improvement, I think. People do go on vacation, or off on business trips, or shut down their machines for a while for other reasons. It would be nice if, before they did so, they drained their queues, but that tends not to happen and the system should allow sufficient latitude for that. Heck, I had a significant unplanned outage across the board last February, when a major storm knocked out the electricity in my area for five and a half days. So, the turnaround average on my main crunchers took a big hit. :^)

Another thing that I think would help, but perhaps just to a small degree, would be a way for conscientious users to abandon or abort tasks via the web site. I mention this because it seems we periodically see questions and/or apologies in the forum from users whose systems have irrevocably died, or met with some extreme reconfiguration, while still having lots of tasks in the queue. There's currently no alternative but to simply let those tasks time out.
ID: 1904292 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1904294 - Posted: 2 Dec 2017, 2:09:44 UTC - in response to Message 1904284.  

Uhh, where on the SSP (Server Status page) is the requested information? I see no mention of the OS or database software each server is running.
I don't know about the OS, but the DB software is mentioned in those brief DB descriptions at the very beginning of the Glossary: "mysql" for the master and "informix" for the science DBs.
ID: 1904294 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1904298 - Posted: 2 Dec 2017, 2:25:07 UTC - in response to Message 1904294.  

Uhh, where on the SSP (Server Status page) is the requested information? I see no mention of the OS or database software each server is running.
I don't know about the OS, but the DB software is mentioned in those brief DB descriptions at the very beginning of the Glossary: "mysql" for the master and "informix" for the science DBs.

Ahh, thanks for pointing that out to me Jeff. I was looking for that information in the Hosts table describing the hardware.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1904298 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1904301 - Posted: 2 Dec 2017, 2:43:52 UTC - in response to Message 1904292.  
Last modified: 2 Dec 2017, 2:45:05 UTC

Or maybe we set it so there is a minimum deadline of 1 week, to allow for Seti server issues as well as local issues. For any system that returns work within 1 week or less, that will be the deadline. For systems that take over 1 week average turnaround time, their deadline is their average turnaround time + 1 week.
So for a system that takes 8 days, the deadline will be 8 + 7 = 15 days; for one that takes 23 days to return work, the deadline will be 30 days, etc.

I don't know that it even needs to be cut that close. Heck, even a three or four week cushion (for the longer-running tasks) would effect a significant improvement, I think.

Even the longest running tasks generally don't take that long to run as such, so more time for longer tasks isn't really necessary; it'd be better just to have a larger minimum return time to allow for problems.

My C2D is presently doing its last CPU tasks, taking about 3hrs 15min for Arecibo VLARs (when it's also processing GPU work those can take up to 7 hours).
The main issue for long return times isn't so much how long it takes to actually process a WU, but how long it takes to process a WU on systems that aren't on 24/7, or that have rather silly (IMHO) settings for BOINC (e.g. Suspend when non-BOINC CPU usage is above 25%), combined with running multiple projects.
So to allow for system outages (both cruncher and Seti) you could go with the present minimum of 2 weeks, or even bump it up to 3 weeks, but still base it on the application's Average turnaround time. For applications with 3 weeks or less turnaround time, then make the deadline 3 weeks. For those that take longer than 3 weeks, the deadline time should be their application turnaround time + 3 weeks.
Basically all work has a 3 week safety margin.
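
As a rough illustration of that rule, here is a minimal Python sketch. It is not anything taken from the BOINC scheduler; the function name and its parameters are made up for this example, and the 21-day default is simply the 3 week margin suggested above.

```python
def proposed_deadline_days(avg_turnaround_days: float,
                           margin_days: float = 21.0) -> float:
    """Return the deadline (in days) under the rule described above.

    Hypothetical helper for illustration only:
    - turnaround at or below the margin -> deadline is the margin itself
    - turnaround above the margin       -> deadline is turnaround + margin
    """
    if avg_turnaround_days <= margin_days:
        return margin_days
    return avg_turnaround_days + margin_days


if __name__ == "__main__":
    # A few sample turnaround times, in days, with the 3 week margin.
    for turnaround in (2, 21, 22, 40):
        print(turnaround, proposed_deadline_days(turnaround))
```

Passing margin_days=7 gives the earlier 1 week version (8 days becomes 15, 23 days becomes 30). Taken literally, the rule jumps at the threshold: a 21-day turnaround gets 21 days, while a 22-day turnaround gets 43.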

This should significantly reduce the time it takes to get a validated result without affecting the slower systems' ability to do Seti work, and it should also reduce the load on the main database (which we know isn't the issue this time around but has been several times in the past).
Grant
Darwin NT
ID: 1904301 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1904305 - Posted: 2 Dec 2017, 3:00:01 UTC

All very sensible suggestions today about how to achieve slimmer databases and shorten the deadlines. I saw a post that as long as Eric was happy with the deadlines, all was good. My opinion is that the "grrrr -grumble" factor of being matched up with slow wingmen and a large pending count would dramatically go down if the deadlines were shortened.

But we aren't here to make setizens happy ... are we?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1904305 · Report as offensive
Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 11358
Credit: 29,581,041
RAC: 66
United States
Message 1904308 - Posted: 2 Dec 2017, 3:34:01 UTC - in response to Message 1904305.  
Last modified: 2 Dec 2017, 3:36:50 UTC

But we aren't here to make setizens happy ... are we?

NO! We are here to find ETI.
ID: 1904308 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1904313 - Posted: 2 Dec 2017, 4:19:30 UTC - in response to Message 1904240.  

One thing I think I've noticed about the Android devices, also, is that many of them are sent tasks that fail every time, simply because the devices don't support the assigned app. I've seen some that have never returned an error-free task, even after many months of crunching, yet their owners don't seem to notice.
Just to put a little something behind this statement I made earlier, here are three such machines that are/were wingmen on some of my currently assigned tasks:

http://setiathome.berkeley.edu/show_host_detail.php?hostid=8262256
http://setiathome.berkeley.edu/show_host_detail.php?hostid=8365042
http://setiathome.berkeley.edu/show_host_detail.php?hostid=8369269

They each have ZERO total credits, despite faithfully cycling through tasks pretty much on a daily basis. The first listed host has been at it for nearly 7 months. (I've seen them with over a year of such futility, but don't happen to have any on my wing at the moment.)

If Eric is truly so concerned about embracing low-volume crunchers such as these, he needs to see to it that the scheduler only sends them tasks they can actually process. Having a 14 Jan 2018 deadline is no help to a device that has no hope of ever returning a valid task. I'm sure these folks sincerely believe they're making a worthwhile contribution to the project but, through no fault of their own (as far as I know), they're not.
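
To make the idea concrete, here is a minimal Python sketch of how such hosts might be flagged for review. It is an assumption-laden illustration, not the project's scheduler logic: the HostStats record, its fields, the 50-task threshold and the sample numbers are all hypothetical, and the host IDs are placeholders rather than the hosts linked above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class HostStats:
    """Hypothetical per-host summary; not a real BOINC table or API."""
    host_id: int
    tasks_returned: int  # tasks reported back, valid or not
    tasks_valid: int     # tasks that eventually validated


def never_validates(host: HostStats, min_returned: int = 50) -> bool:
    """True if the host has returned plenty of work but none of it validated."""
    return host.tasks_returned >= min_returned and host.tasks_valid == 0


def hosts_to_review(hosts: Iterable[HostStats],
                    min_returned: int = 50) -> List[int]:
    """Return the IDs of hosts that look unable to run their assigned app."""
    return [h.host_id for h in hosts if never_validates(h, min_returned)]


if __name__ == "__main__":
    sample = [
        HostStats(host_id=1, tasks_returned=200, tasks_valid=0),    # flagged
        HostStats(host_id=2, tasks_returned=200, tasks_valid=150),  # healthy
        HostStats(host_id=3, tasks_returned=10, tasks_valid=0),     # too new to judge
    ]
    print(hosts_to_review(sample))
```

The minimum-returned threshold is there so that a brand-new host with a handful of early errors isn't written off too soon.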
ID: 1904313 · Report as offensive
Darth Beaver Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Joined: 20 Aug 99
Posts: 6728
Credit: 21,443,075
RAC: 3
Australia
Message 1904316 - Posted: 2 Dec 2017, 4:37:06 UTC

If you really wish to help, do another project while this one is down, or start mining crypto and then donate it to the project.

If enough people do it for the next 3 weeks, there could be up to $400,000 made. With that sort of cash they might be able to fix the problems and maybe set up a separate server for all the very slow machines, without causing a problem for the main project.
ID: 1904316 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1904333 - Posted: 2 Dec 2017, 7:39:44 UTC

Other than the AP splitters, Server Status has gone green, and 1 of my systems just picked up a dozen WUs.
Grant
Darwin NT
ID: 1904333 · Report as offensive
Ghia
Joined: 7 Feb 17
Posts: 238
Credit: 28,911,438
RAC: 50
Norway
Message 1904345 - Posted: 2 Dec 2017, 8:24:11 UTC - in response to Message 1904333.  

Other than the AP splitters, Server Status has gone green, and 1 of my systems just picked up a dozen WUs.

Yes, the guys at Berkeley must have worked late. 53 WUs have landed here. Not rejoicing quite yet, but things are looking up :)


...Ghia...
Humans may rule the world...but bacteria run it...
ID: 1904345 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1904347 - Posted: 2 Dec 2017, 8:41:27 UTC - in response to Message 1904333.  

Obviously, the guys in the lab were burning the midnight oil. I just restarted machines and one of them picked up 87 tasks right away.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1904347 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.