Panic Mode On (108) Server Problems?

Message boards : Number crunching : Panic Mode On (108) Server Problems?

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 9900
Credit: 128,487,763
RAC: 81,242
Australia
Message 1904237 - Posted: 1 Dec 2017, 22:19:27 UTC - in response to Message 1904235.  

Take those and pre-Core 2 Duo architectures out of the picture and you could reduce the deadlines by at least 2/3.

Well, you could suggest that to Eric, but don't expect him to do anything remotely like that.
Since he went to the effort of developing applications that would run on phones, I don't think it's very likely either.
But if he really wants to increase the amount of work being processed, and not have the servers collapse under the increased load, it's something he's going to have to give serious consideration to in the not-too-distant future.
Grant
Darwin NT
ID: 1904237
Profile Jeff Buck Special Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 2
United States
Message 1904240 - Posted: 1 Dec 2017, 22:25:51 UTC - in response to Message 1904229.  

Jeff, I agree that the deadlines are way too long for tasks. It just makes no sense to me why, with a 10 max cache setting, the deadlines are as much as 8 weeks. But your numbers are way out. What is being flagged as a resend by timing out is part of the normal 4.5M tasks that are usually in the field, not the 2.5M that are left, and yes, pendings should be included too. AP tasks usually have a shorter deadline of around 25 days, MB much longer. I think a reasonable deadline would be 20 days, 30 at most. Not 8 weeks.

But yes, it would help a small amount for the database size.
Ah, good point about the 4.5M. I had overlooked just how much that and, probably, the "awaiting validation" count had shrunk even before the scheduler was shut down. But the 25K per day (actually now up to 27K, and it will probably add another K or two before we even hit the 24-hour mark) should still be a valid number, even if the percentage is smaller.

I don't know where the happy medium would be, but 8 weeks is definitely excessive, even for Android devices, I would think. Even cutting the longer deadlines back by a week (and the shorties by a corresponding percentage) should ease the database load. I also wonder why deadlines couldn't be based on a host's turnaround time, with a shorter deadline for the faster hardware. Of course, that would mean that any given WU would likely have different deadlines for each task, but so what?

One other thing I think I've noticed about the Android devices is that many of them are sent tasks that fail every time, simply because the devices don't support the assigned app. I've seen some that have never returned an error-free task, even after many months of crunching, yet their owners don't seem to notice.
ID: 1904240
Profile Jeff Buck Special Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 2
United States
Message 1904242 - Posted: 1 Dec 2017, 22:31:27 UTC - in response to Message 1904230.  

Further to Mark's comment.
SETI has a far lower timeout ratio than most other projects, because those projects have deadlines that are demonstrably too short. Why send out 800 hours' worth of tasks to a computer that will only complete 600 hours of processing before the deadline is reached?

Actually, long deadlines do not impact the size of the main database, but the size of one of the intermediate databases.
What are those timeout ratios? I've never seen them anywhere, which is why this buildup of timeout resends seemed like a good time to bring up the topic.

If, by "main" database, you mean the science database, that's true, and it's the science database that crashed this week. But the master and replica databases are the ones impacted by the active task and WU volume, and also the ones most often running into space-related problems.
ID: 1904242
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 11991
Credit: 118,675,413
RAC: 40,489
United Kingdom
Message 1904244 - Posted: 1 Dec 2017, 22:39:16 UTC

Don't forget it isn't just the raw crunching time you need to consider for deadlines - it's also all the dead time when the computer is switched off or in use. And for Android, when it's away from the charger.
ID: 1904244
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 4485
Credit: 279,197,494
RAC: 628,305
United States
Message 1904246 - Posted: 1 Dec 2017, 22:45:34 UTC - in response to Message 1904227.  

I too have wondered why SETI has stuck with the extremely long deadlines, which I assume were implemented for the original hardware used on the project. That kind of hardware is 18 years in the past and does not need to continue to be supported. I agree with you, Jeff; I would expect the sizes of the databases, and the strain they put on the project, would be greatly lessened if the deadlines were reduced by a month, let's say, from the current 2-month deadline.

As I understand it, the reason there has been no adjustment is because Eric does not wish to disenfranchise anybody from participating in this project.

And that would include folks with very meager hardware resources. Not everybody can afford what some of us are able to.

That is why.

But if my 5-year-old phone can easily finish tasks within a month, and Android devices must be among the weakest hardware in use, is it reasonable to send them tasks with a deadline 8 weeks out from when they were sent?
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1904246
Profile Jeff Buck Special Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 2
United States
Message 1904248 - Posted: 1 Dec 2017, 22:53:21 UTC - in response to Message 1904244.  

Don't forget it isn't just the raw crunching time you need to consider for deadlines - it's also all the dead time when the computer is switched off or in use. And for Android, when it's away from the charger.
Agreed. That's why I mentioned turnaround time as a possible yardstick (or meterstick, if you prefer). I'm sure there are some who just run S@h as it was originally designed, as a pretty screensaver, for short bursts here and there. But assigning deadlines to everyone based on those "lowest common denominators" seems like a horrible waste.
ID: 1904248
Profile Dr.Diesel Special Project $75 donor

Joined: 14 May 99
Posts: 35
Credit: 39,546,940
RAC: 41,129
United States
Message 1904255 - Posted: 1 Dec 2017, 23:29:31 UTC - in response to Message 1904101.  

Eric mentioned Informix, which I suspect is IBM's Informix DB; they're probably waiting on a new build from IBM, or at minimum awaiting an undocumented config switch as a temporary workaround. There was some backend work happening last night; perhaps they gave it a go and ran into further issues.

It is that, and last I heard we are on an older version of it.
I would hope it's a well-known issue they ran into, and that time, because of the large size of the db, is the only factor.


Having worked with IBM in the past, I further feel for Eric's sanity.
ID: 1904255
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6530
Credit: 185,655,154
RAC: 44,478
United States
Message 1904260 - Posted: 1 Dec 2017, 23:46:02 UTC - in response to Message 1904227.  

I too have wondered why SETI has stuck with the extremely long deadlines, which I assume were implemented for the original hardware used on the project. That kind of hardware is 18 years in the past and does not need to continue to be supported. I agree with you, Jeff; I would expect the sizes of the databases, and the strain they put on the project, would be greatly lessened if the deadlines were reduced by a month, let's say, from the current 2-month deadline.

As I understand it, the reason there has been no adjustment is because Eric does not wish to disenfranchise anybody from participating in this project.

And that would include folks with very meager hardware resources. Not everybody can afford what some of us are able to.

That is why.

Since the issue is with the SETI@home science database and not the master or replica databases, I'm not sure how the deadlines would even be relevant, given the data is stored in the science database after it has been validated.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group today!
ID: 1904260
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 11991
Credit: 118,675,413
RAC: 40,489
United Kingdom
Message 1904266 - Posted: 1 Dec 2017, 23:51:57 UTC - in response to Message 1904248.  

Don't forget it isn't just the raw crunching time you need to consider for deadlines - it's also all the dead time when the computer is switched off or in use. And for Android, when it's away from the charger.
Agreed. That's why I mentioned turnaround time as a possible yardstick (or meterstick, if you prefer). I'm sure there are some who just run S@h as it was originally designed, as a pretty screensaver, for short bursts here and there. But assigning deadlines to everyone based on those "lowest common denominators" seems like a horrible waste.
If a computer already has a much faster recorded average turnaround, it won't be pushing deadlines. 100 tasks per device, with 7-week deadlines, equates to about 12 hours per task.
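(To check that arithmetic, here is a quick back-of-envelope sketch in Python; the 100-task limit and 7-week deadline are the figures from this post, not verified server settings:)

```python
# Time budget per task implied by a full cache at the longest deadline.
tasks_per_device = 100                     # server-side per-device task limit
deadline_weeks = 7                         # longest MB deadline discussed here

hours_available = deadline_weeks * 7 * 24  # 7 weeks = 1,176 hours
hours_per_task = hours_available / tasks_per_device

print(f"{hours_per_task:.2f} hours per task")  # 11.76, i.e. about 12 hours
```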

The other thing that people in this conversation often forget is that many users share their available resources across multiple projects. That increases deadline pressure too - especially for the 'other' projects with shorter deadlines. And the whole deadline question only applies to the BOINC transaction database - which has been running pretty smoothly, apart from the disk failure recently (and that might have been in the file storage array, not the database). Either way, nothing to do with the Informix science database which is the source of today's malaise.

If Eric is happy with the current deadlines, then I'm satisfied by that.
ID: 1904266
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 9900
Credit: 128,487,763
RAC: 81,242
Australia
Message 1904268 - Posted: 2 Dec 2017, 0:06:03 UTC - in response to Message 1904248.  
Last modified: 2 Dec 2017, 0:09:32 UTC

But assigning deadlines to everyone based on those "lowest common denominators" seems like a horrible waste.

I guess it depends on how it is implemented.
Keep the present per-WU deadlines (2-8 weeks depending on expected run time).
However, when a WU is allocated to a system, the actual deadline for that allocated WU would be based on the host's actual average turnaround time for each application.

That appears to vary from 1.5hrs to over 3 weeks (the longest I've seen).

So for a monster host (such as Petri's), that would be a 1.5-hour deadline for GPU work (with CPU deadlines based on CPU application turnaround times). After that, the WU would be re-issued to another host.
But what if something happens to one of those faster hosts? Here in Darwin, the power going out for up to 6 hours at a time isn't that unusual. It would be a bit rough for all of those WUs to error out when the power finally comes back, just because of a short deadline the host can normally meet.
And what if the Seti servers go down in a screaming heap, such as has occurred this time? Would the servers declare all of the hosts' cached work as having missed the deadline before the hosts are able to report the work they have completed?

Do they need to make an addition to the server code that adds the server downtime + 1 to 12 hours (depending on how long the servers were down for) to all the WU deadlines, and amends those deadlines before the servers allow any Scheduler requests, so people's work isn't marked as "No response by deadline date"? A lot of effort for little gain, IMHO.

Or maybe we set it so there is a minimum deadline of 1 week, to allow for Seti server issues as well as local issues, so for any system that returns work within 1 week or less, that will be the deadline. For systems whose average turnaround time is over 1 week, their deadline is their average turnaround time + 1 week.
So a system that takes 8 days would get a deadline of 8+7 = 15 days; one that takes 23 days to return work would get a 30-day deadline, and so on.
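(As an illustrative sketch only, not anything in the actual BOINC server code, the rule described above could be written as:)

```python
def proposed_deadline_days(avg_turnaround_days, floor_days=7):
    """Hypothetical deadline rule from the post above: hosts that turn
    work around within a week get the one-week minimum (headroom for
    server or local outages); slower hosts get their average turnaround
    time plus a week."""
    if avg_turnaround_days <= floor_days:
        return floor_days
    return avg_turnaround_days + floor_days

print(proposed_deadline_days(8))       # 8 + 7 = 15 days
print(proposed_deadline_days(23))      # 23 + 7 = 30 days
print(proposed_deadline_days(0.0625))  # a 1.5-hour host still gets 7 days
```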
Grant
Darwin NT
ID: 1904268
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 9900
Credit: 128,487,763
RAC: 81,242
Australia
Message 1904269 - Posted: 2 Dec 2017, 0:08:20 UTC - in response to Message 1904260.  

Since the issue is with the SETI@home science database and not the master or replica databases, I'm not sure how the deadlines would even be relevant, given the data is stored in the science database after it has been validated.

The reason for the server-side limits was the load on the master database, which kept falling over or just plain getting bogged down.
Grant
Darwin NT
ID: 1904269
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6530
Credit: 185,655,154
RAC: 44,478
United States
Message 1904273 - Posted: 2 Dec 2017, 0:25:04 UTC - in response to Message 1904269.  

Since the issue is with the SETI@home science database and not the master or replica databases, I'm not sure how the deadlines would even be relevant, given the data is stored in the science database after it has been validated.

The reason for the server-side limits was the load on the master database, which kept falling over or just plain getting bogged down.

That is correct. Either the MySQL database software or the hardware the BOINC master db system is using couldn't handle scanning a table of 11,000,000+ tasks several hundred (or thousand?) times a second.

Which, again, is in fact not the science database, which runs Informix.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group today!
ID: 1904273
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1012
Credit: 8,992,251
RAC: 3,501
New Zealand
Message 1904275 - Posted: 2 Dec 2017, 0:33:28 UTC - in response to Message 1904218.  


Down until the new year...

Can somebody please give me a link to the official message saying it will be down until next year? I haven't been able to find anything.
ID: 1904275
Profile Dr.Diesel Special Project $75 donor

Joined: 14 May 99
Posts: 35
Credit: 39,546,940
RAC: 41,129
United States
Message 1904276 - Posted: 2 Dec 2017, 0:33:30 UTC - in response to Message 1904273.  

Which, again, is in fact not the science database, which runs Informix.


Is this, along with each system's OS and primary DB, documented anywhere?
ID: 1904276
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6530
Credit: 185,655,154
RAC: 44,478
United States
Message 1904277 - Posted: 2 Dec 2017, 0:41:29 UTC - in response to Message 1904276.  
Last modified: 2 Dec 2017, 0:43:47 UTC

Which, again, is in fact not the science database, which runs Informix.


Is this, along with each system's OS and primary DB, documented anywhere?

You can check the Server Status for most of that information.
https://setiathome.berkeley.edu/show_server_status.php
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group today!
ID: 1904277
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 9900
Credit: 128,487,763
RAC: 81,242
Australia
Message 1904279 - Posted: 2 Dec 2017, 0:46:18 UTC - in response to Message 1904275.  


Down until the new year...

Can somebody please give me a link to the official message saying it will be down until next year? I haven't been able to find anything.

There has been no message, so people are making it up as they feel like it.
I'm voting for next week.
Grant
Darwin NT
ID: 1904279
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1012
Credit: 8,992,251
RAC: 3,501
New Zealand
Message 1904281 - Posted: 2 Dec 2017, 0:54:43 UTC - in response to Message 1904279.  


Down until the new year...

Can somebody please give me a link to the official message saying it will be down until next year? I haven't been able to find anything.

There has been no message, so people are making it up as they feel like it.
I'm voting for next week.

Thank you Grant for the clarification.
Thanks to Eric & the team for working on this issue.
ID: 1904281
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 4485
Credit: 279,197,494
RAC: 628,305
United States
Message 1904284 - Posted: 2 Dec 2017, 1:06:42 UTC - in response to Message 1904277.  

Which, again, is in fact not the science database, which runs Informix.


Is this, along with each system's OS and primary DB, documented anywhere?

You can check the Server Status for most of that information.
https://setiathome.berkeley.edu/show_server_status.php

Uhh, where on the SSP is the information requested? I see no mention of the OS or database software each server is running.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1904284
Profile Jeff Buck Special Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 2
United States
Message 1904289 - Posted: 2 Dec 2017, 1:39:30 UTC - in response to Message 1904273.  

Since the issue is with the SETI@home science database and not the master or replica databases, I'm not sure how the deadlines would even be relevant, given the data is stored in the science database after it has been validated.

The reason for the server-side limits was the load on the master database, which kept falling over or just plain getting bogged down.

That is correct. Either the MySQL database software or the hardware the BOINC master db system is using couldn't handle scanning a table of 11,000,000+ tasks several hundred (or thousand?) times a second.

Which, again, is in fact not the science database, which runs Informix.
My very first message on this subject stated:
...speeding up the turnaround would also lessen the storage requirements of the master and replica databases, an issue that seems to rear its head quite often.
...
Now, I don't know what percentage of the master database is occupied by task data, versus workunit data, account data, host data, and so on,...
And I tried in a subsequent message to reiterate that distinction:
If, by "main" database, you mean the science database, that's true, and it's the science database that crashed this week. But the master and replica databases are the ones impacted by the active task and WU volume, and also the ones most often running into space-related problems.
Why does the science database keep getting dragged in?
ID: 1904289
Profile Jeff Buck Special Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 2
United States
Message 1904292 - Posted: 2 Dec 2017, 2:04:55 UTC - in response to Message 1904268.  

Or maybe we set it so there is a minimum deadline of 1 week, to allow for Seti server issues as well as local issues, so for any system that returns work within 1 week or less, that will be the deadline. For systems whose average turnaround time is over 1 week, their deadline is their average turnaround time + 1 week.
So a system that takes 8 days would get a deadline of 8+7 = 15 days; one that takes 23 days to return work would get a 30-day deadline, and so on.
I don't know that it even needs to be cut that close. Heck, even a three- or four-week cushion (for the longer-running tasks) would effect a significant improvement, I think. People do go on vacation, or off on business trips, or shut down their machines for a while for other reasons. It would be nice if they drained their queues before doing so, but that tends not to happen, and the system should allow sufficient latitude for it. Heck, I had a significant unplanned outage across the board last February, when a major storm knocked out the electricity in my area for five and a half days, so the turnaround average on my main crunchers took a big hit. :^)

Another thing that I think would help, though perhaps just to a small degree, would be a way for conscientious users to abandon or abort tasks via the web site. I mention this because we periodically see questions and/or apologies in the forum from users whose systems have irrevocably died, or undergone some extreme reconfiguration, while still having lots of tasks in the queue. There's currently no alternative but to simply let those tasks time out.
ID: 1904292
©2018 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.