Panic Mode On (47) Server problems?

Profile Dimly Lit Lightbulb 😀
Volunteer tester
Avatar

Send message
Joined: 30 Aug 08
Posts: 15399
Credit: 7,423,413
RAC: 1
United Kingdom
Message 1110750 - Posted: 28 May 2011, 13:11:25 UTC
Last modified: 28 May 2011, 13:12:33 UTC

Two thousand five hundred gigabytes

I didn't realise it was about that high, crikey.

Seti staff are supposed to be project scientists first and foremost, and spend their time analysing the data we provide for them. But of necessity, they also have to undertake a secondary role as server admins.

I think they do rather well all things considered.

It must be frustrating when all you want to do is the science but then a server conks out and you have to deal with that. They certainly do a magnificent job.

Anyhow I'm down to half an astropulse and one beta task to crunch with thirteen uploads pending. Still got a project backoff counting down, 45 mins to go!
ID: 1110750 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14677
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1110756 - Posted: 28 May 2011, 13:53:08 UTC - in response to Message 1110727.  

As I understand it, the reason for using RAID configurations is because there is redundancy built in, i.e. a backup. RAID disks are hot swappable, so if a hard drive fails you simply take it out and throw it away and plug a new one in. Then the RAID array will copy whatever it needs to the new disk.

The easiest RAID of all to manage for a small business is hot spare plus hot swap. You have, say, three drives in RAID 5 (that's the smallest configuration), plus a fourth, empty, drive powered up but taking no other part in the proceedings. When an active drive fails, something - probably a hardware watchdog program running in the controller card firmware - switches it out, wakes up the spare drive, and starts copying data from the surviving drives until full redundancy is restored. Then, on the next daily or weekly check, the server operator can pull the failed drive, and replace it with a new one which becomes the new hot spare. The trouble is, very few small business owners do daily or weekly checks.....
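For anyone wondering how the array can copy data "from the surviving drives" onto a replacement: RAID 5 stores a parity block for each stripe, and any single missing block is the XOR of the blocks that remain. A minimal Python sketch of that idea, assuming a three-drive stripe with made-up block contents; the `xor_blocks` helper is hypothetical, and a real controller rotates parity across the drives and works on far larger stripes.

```python
from functools import reduce
from operator import xor


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR any number of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


# One stripe on a three-drive RAID 5: two data blocks plus one parity block.
d0 = b"SETI"                   # block on drive 0 (made-up contents)
d1 = b"data"                   # block on drive 1 (made-up contents)
parity = xor_blocks(d0, d1)    # block on drive 2

# Drive 1 fails. Its block is rebuilt from the two survivors.
rebuilt_d1 = xor_blocks(d0, parity)
print(rebuilt_d1 == d1)
```

The rebuild has to read every surviving block on every surviving drive, which is why it competes with normal traffic for disk time.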

There are two flies in the ointment as far as SETI is concerned.

1) AFAIK, the scenario above relies on dedicated and compatible hardware RAID controllers - which have to be paid for. Historically, the SETI staff have preferred to rely on software RAID maintained through the operating system. That gives them finer control of the configuration, but possibly rules out some of the more sophisticated recovery options.

2) In a typical small business, most of the data on the server drives will be pretty stable, not changing much from one day/week/month/year to the next. The SETI workunit data is incredibly volatile - again, from the SSP figures, the turnover is typically 25 thousand MB (multibeam) workunit datafiles per hour (half the 'result per hour' rate), plus a thousand AP (Astropulse) workunit files.

The same maths gives sixteen gigabytes per hour of new data to be written, and outdated data to be deleted. Not only does that give you (potentially) a huge fragmentation problem, it would also keep a disk controller card very busy - probably ruling out the 'hot rebuild' while the array was online. And remember that SETI's disks have that average throughput 24/7 - small business servers have quieter times overnight and at weekends where these sorts of processes have a chance to catch up, even if they run slowly during the business day.
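The arithmetic behind that hourly figure can be sketched as below. The per-hour counts are the ones quoted above; the file sizes (about 366 KB per multibeam workunit and about 8 MB per Astropulse workunit) are my assumptions about typical sizes, not figures from this thread, so treat the result as a ballpark only.

```python
# Assumed typical file sizes (NOT taken from this thread).
MB_WU_BYTES = 366 * 1024           # multibeam workunit, roughly 366 KB
AP_WU_BYTES = 8 * 1024 * 1024      # Astropulse workunit, roughly 8 MB

# Turnover quoted in the post.
MB_WU_PER_HOUR = 25_000
AP_WU_PER_HOUR = 1_000


def hourly_bytes(mb_per_hour: int, ap_per_hour: int) -> int:
    """Bytes of new workunit data written per hour."""
    return mb_per_hour * MB_WU_BYTES + ap_per_hour * AP_WU_BYTES


gib_per_hour = hourly_bytes(MB_WU_PER_HOUR, AP_WU_PER_HOUR) / 1024**3
print(f"{gib_per_hour:.1f} GiB of new data per hour")
print(f"{gib_per_hour * 24:.0f} GiB of new data per day")
```

With those assumed sizes the result lands close to the sixteen gigabytes per hour quoted, and the same volume again has to be deleted.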
ID: 1110756 · Report as offensive
Profile Donald L. Johnson
Avatar

Send message
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1110772 - Posted: 28 May 2011, 14:46:56 UTC

Jeff just posted an update here.

We have confirmed Downloads are GO (I just got a couple), but still no word on when Uploads will come back.
Donald
Infernal Optimist / Submariner, retired
ID: 1110772 · Report as offensive
Profile Jason Safoutin
Volunteer tester
Avatar

Send message
Joined: 8 Sep 05
Posts: 1386
Credit: 200,389
RAC: 0
United States
Message 1110779 - Posted: 28 May 2011, 15:15:46 UTC - in response to Message 1110774.  

The upload is back.. i'm uploading right now :)


Same here. Some took a few tries, but everything I had is now uploaded and reported. BETA still down though :(
"By faith we understand that the universe was formed at God's command, so that what is seen was not made out of what was visible". Hebrews 11.3

ID: 1110779 · Report as offensive
Cosmic_Ocean
Avatar

Send message
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1110803 - Posted: 28 May 2011, 16:53:53 UTC

I believe I read a while back in one of Matt's posts that the volume where uploads (and I think the active tasks) live is 4TB. Now another volume on that array may have the raw data files, as well. I want to say that the total array size that they're working with is well over 10TB, but somewhere under 30 usable.

I also believe last year they switched to raid 10 since it was significantly faster than 5/6. Yeah that's a lot of wasted disks, but it comes with the benefit of much more I/O throughput and makes it easier to rebuild a missing/replaced disk.

That said, some would say "well if it's just one disk that gets swapped out, why does it take a day and a half to rebuild the data on it?" That is because a FULL consistency check is run, and it verifies the data on ALL of the disks in the array. I run one of those on my 4-disk raid-5 every couple of weeks just for good measure, and it takes 2h30m. Roughly 35 minutes per disk involved, and their storage server has 20+ (I think Thumper has 48). Takes time.
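To put rough numbers on that: 2h30m over 4 disks works out to 37.5 minutes per disk. The sketch below simply scales that linearly with disk count, which is my assumption (it ignores disk size, controller speed and live traffic), so read it as a back-of-envelope guess rather than a measurement of their array.

```python
MY_DISKS = 4
MY_CHECK_MINUTES = 2 * 60 + 30    # 2h30m for a full check on the 4-disk raid-5

MINUTES_PER_DISK = MY_CHECK_MINUTES / MY_DISKS


def estimated_check_hours(disk_count: int) -> float:
    """Linear extrapolation: assumes check time grows in proportion to disk count."""
    return disk_count * MINUTES_PER_DISK / 60


for disks in (20, 48):
    print(f"{disks} disks: about {estimated_check_hours(disks):.1f} hours")
```

On that assumption a 48-disk box comes out at about 30 hours, which is in the same region as "a day and a half".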
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1110803 · Report as offensive
Ellis Hardin

Send message
Joined: 15 Mar 01
Posts: 33
Credit: 26,603,764
RAC: 0
United States
Message 1110825 - Posted: 28 May 2011, 19:05:10 UTC

Wow. What a hornet's nest I've stirred up with my 20 day cache claim. Let me explain. Whatever is used to determine how many workunits you download is based on the amount of time that BOINC estimates it takes for you to complete a workunit, and whatever method BOINC uses is never accurate. It never takes me as long as the estimated time to process a workunit. So, it never takes me 20 days to work through my cache. When I had a 10 day cache set, I noticed that every now and then I would still run out of work, so that's why I changed to a 20 day cache. FYI, the max amount of extra days of work you can set is 10, but there is no max on the number of days to connect to BOINC, so, theoretically, you can get a much higher amount of work cached.

One time something weird happened, and my estimated completion times were greatly reduced, like from hours to seconds. As a result my computers ended up downloading 30+ days worth of work. This is how I know the estimated workunit completion times are used to determine how much work you download. And this was when I still had a 10 day cache set. That was the closest I came to missing workunit deadlines, but I made them all. The only workunit deadlines I've missed were due to lost workunits.
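Here is a rough sketch of why the estimate matters so much. This is not BOINC's actual work-fetch code (the real client applies further corrections and limits); the function names and the runtimes are hypothetical, picked only to show the effect.

```python
SECONDS_PER_DAY = 86_400


def tasks_to_fetch(cache_days: float, estimated_seconds_per_task: float) -> int:
    """Tasks needed to fill the cache if every task took the estimated time."""
    return int(cache_days * SECONDS_PER_DAY / estimated_seconds_per_task)


def real_days_of_work(task_count: int, actual_seconds_per_task: float) -> float:
    """How long that many tasks really last on one core."""
    return task_count * actual_seconds_per_task / SECONDS_PER_DAY


ACTUAL_SECONDS = 3600    # hypothetical: a task really takes one hour

# Estimate too high (2 hours): a 10 day cache holds only 5 days of real work.
over = tasks_to_fetch(10, estimated_seconds_per_task=7200)
print(over, real_days_of_work(over, ACTUAL_SECONDS))

# Estimate too low (20 minutes): the same 10 day cache holds 30 days of real work.
under = tasks_to_fetch(10, estimated_seconds_per_task=1200)
print(under, real_days_of_work(under, ACTUAL_SECONDS))
```

So an estimate that runs high leaves you short of work, and an estimate that collapses leaves you with far more than the cache setting suggests.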
ID: 1110825 · Report as offensive
Profile Slavac
Volunteer tester
Avatar

Send message
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1110841 - Posted: 28 May 2011, 19:50:39 UTC - in response to Message 1110825.  

http://setiathome.berkeley.edu/forum_thread.php?id=64277

Don't forget to toss a donation at SETI if you can guys.


Executive Director GPU Users Group Inc. -
brad@gpuug.org
ID: 1110841 · Report as offensive
Profile Khangollo
Avatar

Send message
Joined: 1 Aug 00
Posts: 245
Credit: 36,410,524
RAC: 0
Slovenia
Message 1110869 - Posted: 28 May 2011, 21:22:43 UTC

Shortie storm!
This will clog up the servers just nicely for quite a while...
ID: 1110869 · Report as offensive
tbret
Volunteer tester
Avatar

Send message
Joined: 28 May 99
Posts: 3380
Credit: 296,162,071
RAC: 40
United States
Message 1110874 - Posted: 28 May 2011, 21:34:44 UTC - in response to Message 1110756.  



The same maths gives sixteen gigabytes per hour of new data to be written, and outdated data to be deleted. Not only does that give you (potentially) a huge fragmentation problem, it would also keep a disk controller card very busy - probably ruling out the 'hot rebuild' while the array was online. And remember that SETI's disks have that average throughput 24/7 - small business servers have quieter times overnight and at weekends where these sorts of processes have a chance to catch up, even if they run slowly during the business day.


I'm replying to this because this is the most clear answer to what I did not know how to ask.

Just for general consumption -- There is no need to defend the project or the guys. I'm *sure* they know what they are doing and I'm equally sure that I don't.

I'm asking about the technology so that I have some understanding of the state of things.

I'm thinking of multiple gigabytes of read/writes per hour of small files (I had forgotten about all the deletions) and I'm thinking to myself, "Self; wouldn't RAM be a FAR preferable place to do all these read/writes (and deletions)?"

Way back in time, when the Earth first cooled, RAM was very expensive and "hard to address." Not-so-much now. I know you'd still have to have redundancy, so you'd need a simple RAID to read and write results from and completed work units to, but you wouldn't be working the snot out of a bank of hard drives in a RAID array on a flaky controller across the room connected by fiber or whatever.

But I have no clue how you would tell the system to only write-to-disk what you wanted to preserve. I'm imagining you'd say "take programs and data from Disk A and write to RAM Drive X, do all your stuff out on X, then write only results you want to keep to Disk B," instead of "read from a subdirectory on the array, write to a different subdirectory on the array, and use a program in a different subdirectory on the array to decide what to keep and discard, and..." Well, you get my point.

I don't even have the current terminology to communicate what I'm trying to ask, but I guess I'm generally asking if a huge RAM drive is possible and if it is, wouldn't it be faster and more reliable?
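In case it helps pin down what I mean, here is the idea as a toy Python sketch. It is purely illustrative: the `scratch` dict stands in for the RAM drive, `is_worth_keeping` and `flush_keepers` are names I made up, and none of it reflects how the SETI servers are actually set up.

```python
import tempfile
from pathlib import Path


def is_worth_keeping(name: str) -> bool:
    """Hypothetical rule: only files ending in .result get preserved."""
    return name.endswith(".result")


def flush_keepers(scratch: dict, disk_dir: Path) -> list:
    """Write only the keepers from RAM to disk and return their names."""
    written = []
    for name, payload in scratch.items():
        if is_worth_keeping(name):
            (disk_dir / name).write_bytes(payload)
            written.append(name)
    return written


# All the churn happens in memory ("RAM Drive X").
scratch = {
    "wu_001.tmp": b"intermediate junk",
    "wu_001.result": b"something worth keeping",
}

# Only the keepers ever touch the disk ("Disk B", a temp directory here).
with tempfile.TemporaryDirectory() as tmp:
    disk_dir = Path(tmp)
    kept = flush_keepers(scratch, disk_dir)
    on_disk = sorted(p.name for p in disk_dir.iterdir())

print(kept, on_disk)
```

The obvious catch is that anything still sitting in `scratch` when the power goes is lost.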
ID: 1110874 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13841
Credit: 208,696,464
RAC: 304
Australia
Message 1110880 - Posted: 28 May 2011, 22:09:00 UTC - in response to Message 1110874.  

I don't even have the current terminology to communicate what I'm trying to ask, but I guess I'm generally asking if a huge RAM drive is possible and if it is, wouldn't it be faster and more reliable?

RAM is much more expensive than magnetic storage, and much less reliable. Also all it takes is a power hiccup & all that data is lost. Not so once it's saved to magnetic storage.
There are now SSDs (Solid State Drives) that use flash memory for storage & they use the same interface as a normal HDD. They are blindingly fast compared to HDDs (HDDs aren't even in the race); however, they are much smaller in capacity, and mind-bogglingly expensive compared to conventional HDDs.

Without accurate numbers or knowing the current size of storage used, but given that you would need more drives & more storage devices to give the same capacity as HDDs, off the top of my head i would expect the cost of replacing the present major HDD storage with all SLC (Single Level Cell) SSDs (i wouldn't use MLC SSDs for this sort of work without *at least* 75% over-provisioning) would have to be at least US$750,000.
Money that could be better spent elsewhere.
Grant
Darwin NT
ID: 1110880 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13841
Credit: 208,696,464
RAC: 304
Australia
Message 1110886 - Posted: 28 May 2011, 22:27:53 UTC - in response to Message 1110880.  


There's definitely some sort of problem with uploads- normally after an outage they go through on the first attempt, even if slowly. At the present rate, it'll take about 5 days for all of mine to go through.
Grant
Darwin NT
ID: 1110886 · Report as offensive
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Avatar

Send message
Joined: 26 May 99
Posts: 9958
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1110889 - Posted: 28 May 2011, 22:33:56 UTC - in response to Message 1110886.  


There's definitely some sort of problem with uploads- normally after an outage they go through on the first attempt, even if slowly. At the present rate, it'll take about 5 days for all of mine to go through.


My modest 4 machines started today with 311 uploads waiting, went down as low as 275, and are now up to 326, so they're crunching faster than they can upload! :-)

ID: 1110889 · Report as offensive
Profile SciManStev (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Avatar

Send message
Joined: 20 Jun 99
Posts: 6658
Credit: 121,090,076
RAC: 0
United States
Message 1110890 - Posted: 28 May 2011, 22:38:56 UTC

Don't worry. Traffic is very high at the moment with thousands of us trying to upload. It will all be sorted out before you know it.

Steve
Warning, addicted to SETI crunching!
Crunching as a member of GPU Users Group.
GPUUG Website
ID: 1110890 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14677
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1110897 - Posted: 28 May 2011, 23:01:05 UTC - in response to Message 1110886.  

There's definitely some sort of problem with uploads- normally after an outage they go through on the first attempt, even if slowly. At the present rate, it'll take about 5 days for all of mine to go through.

Don't forget we're uploading to a temporary storage facility cobbled together with duct tape and chewing gum. Don't be surprised that it's slow, be surprised that it works at all!
ID: 1110897 · Report as offensive
Profile SciManStev (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Avatar

Send message
Joined: 20 Jun 99
Posts: 6658
Credit: 121,090,076
RAC: 0
United States
Message 1110898 - Posted: 28 May 2011, 23:13:40 UTC - in response to Message 1110897.  

Don't be surprised that it's slow, be surprised that it works at all!

I am surprised it is working. I didn't expect anything until Tuesday. This is an exceptional moment that demonstrates the dedication of the SETI staff! They aren't paid for weekends.

Steve
Warning, addicted to SETI crunching!
Crunching as a member of GPU Users Group.
GPUUG Website
ID: 1110898 · Report as offensive
Profile Jason Safoutin
Volunteer tester
Avatar

Send message
Joined: 8 Sep 05
Posts: 1386
Credit: 200,389
RAC: 0
United States
Message 1110915 - Posted: 29 May 2011, 0:12:40 UTC - in response to Message 1110898.  

Don't be surprised that it's slow, be surprised that it works at all!

I am surprised it is working. I didn't expect anything until Tuesday. This is an exceptional moment that demonstrates the dedication of the SETI staff! They aren't paid for weekends.

Steve


Indeed! They could have just left it until Tuesday and shut everything off. I am glad we have dedicated staff members. The slow uploads are also due to the fact that everyone will be trying to upload, often from multiple computers at once. It is still taking me a few tries to upload WUs, but all of them end up being uploaded. Just try not to contact the server as much. I have mine set to update/upload/report on my command only for the time being.
"By faith we understand that the universe was formed at God's command, so that what is seen was not made out of what was visible". Hebrews 11.3

ID: 1110915 · Report as offensive
Profile Slavac
Volunteer tester
Avatar

Send message
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1110923 - Posted: 29 May 2011, 0:24:48 UTC - in response to Message 1110915.  

Don't be surprised that it's slow, be surprised that it works at all!

I am surprised it is working. I didn't expect anything until Tuesday. This is an exceptional moment that demonstrates the dedication of the SETI staff! They aren't paid for weekends.

Steve


Indeed! They could have just left it until Tuesday and shut everything off. I am glad we have dedicated staff members. The slow uploads are also due to the fact that everyone will be trying to upload, often from multiple computers at once. It is still taking me a few tries to upload WUs, but all of them end up being uploaded. Just try not to contact the server as much. I have mine set to update/upload/report on my command only for the time being.


Yep just got the last of my several hundred uploaded. Some of them had an elapsed time of over 30 mins.

But that doesn't matter at all, SETI has the data and all is well.


Executive Director GPU Users Group Inc. -
brad@gpuug.org
ID: 1110923 · Report as offensive
Profile arkayn
Volunteer tester
Avatar

Send message
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 1110927 - Posted: 29 May 2011, 0:29:00 UTC

I have 600+ on one machine and 60 on my other, it will be a long slow process.

Just not as slow as Steve!!

ID: 1110927 · Report as offensive
Profile Dimly Lit Lightbulb 😀
Volunteer tester
Avatar

Send message
Joined: 30 Aug 08
Posts: 15399
Credit: 7,423,413
RAC: 1
United Kingdom
Message 1110939 - Posted: 29 May 2011, 1:59:24 UTC

I don't crunch much but I'm still getting a backoff of several hours. I'm not hitting the retry button every few minutes because I think it's pointless (not to mention the server stuff), and sooner or later I'll have success.
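For the curious, the general idea of a backoff looks something like the sketch below: each failed attempt doubles the ceiling on the wait, up to a cap, with some randomness so thousands of clients don't all retry at the same instant. This is a generic exponential backoff with jitter, not BOINC's actual code, and the `base` and `cap` values are numbers I picked for illustration.

```python
import random


def backoff_seconds(failures: int, base: float = 60, cap: float = 4 * 3600,
                    rng=random) -> float:
    """Wait before the next retry after `failures` consecutive failed attempts."""
    ceiling = min(cap, base * 2 ** failures)
    # Pick a random point in the upper half of the window.
    return rng.uniform(ceiling / 2, ceiling)


# Seeded generator so the example is repeatable.
seeded = random.Random(0)
delays = [backoff_seconds(n, rng=seeded) for n in range(1, 11)]
for attempt, delay in enumerate(delays, start=1):
    print(f"after failure {attempt}: wait about {delay / 60:.0f} min")
```

Hitting retry by hand defeats the point of the jitter, since it puts your request back into the crowd early.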
ID: 1110939 · Report as offensive
Profile [B^S] madmac
Volunteer tester
Avatar

Send message
Joined: 9 Feb 04
Posts: 1175
Credit: 4,754,897
RAC: 0
United Kingdom
Message 1111034 - Posted: 29 May 2011, 11:34:02 UTC - in response to Message 1110939.  

I don't crunch much but I'm still getting a backoff of several hours. I'm not hitting the retry button every few minutes because I think it's pointless (not to mention the server stuff), and sooner or later I'll have success.

Totally agree, I have a lot to download and will wait until I get them all, probably sometime tomorrow.
ID: 1111034 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.