Panic Mode On (47) Server problems?

Author	Message
Dimly Lit Lightbulb 😀 Volunteer tester Send message Joined: 30 Aug 08 Posts: 15399 Credit: 7,423,413 RAC: 1	Message 1110750 - Posted: 28 May 2011, 13:11:25 UTC Last modified: 28 May 2011, 13:12:33 UTC Two thousand five hundred gigabytes. I didn't realise it was that high, crikey. SETI staff are supposed to be project scientists first and foremost, and spend their time analysing the data we provide for them. But of necessity, they also have to take on a secondary role as server admins. I think they do rather well, all things considered. It must be frustrating when all you want to do is the science but then a server conks out and you have to deal with that. They certainly do a magnificent job. Anyhow, I'm down to half an Astropulse and one beta task to crunch, with thirteen uploads pending. Still got a project backoff counting down, 45 mins to go! ID: 1110750 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14661 Credit: 200,643,578 RAC: 874	Message 1110756 - Posted: 28 May 2011, 13:53:08 UTC - in response to Message 1110727. As I understand it, the reason for using RAID configurations is because there is redundancy built in, i.e. a backup. RAID disks are hot swappable, so if a hard drive fails you simply take it out and throw it away and plug a new one in. Then the RAID array will copy whatever it needs to the new disk. The easiest RAID of all to manage for a small business is hot spare plus hot swap. You have, say, three drives in RAID 5 (that's the smallest configuration), plus a fourth, empty, drive powered up but taking no other part in the proceedings. When an active drive fails, something - probably a hardware watchdog program running in the controller card firmware - switches it out, wakes up the spare drive, and starts rebuilding the lost data from the surviving drives until full redundancy is restored. Then, on the next daily or weekly check, the server operator can pull the failed drive and replace it with a new one, which becomes the new hot spare. The trouble is, very few small business owners do daily or weekly checks..... There are two flies in the ointment as far as SETI is concerned. 1) AFAIK, the scenario above relies on dedicated and compatible hardware RAID controllers - which have to be paid for. Historically, the SETI staff have preferred to rely on software RAID maintained through the operating system. That gives them finer control of the configuration, but possibly rules out some of the more sophisticated recovery options. 2) In a typical small business, most of the data on the server drives will be pretty stable, not changing much from one day/week/month/year to the next. The SETI workunit data is incredibly volatile - again, from the SSP figures, the turnover is typically 25 thousand multibeam (MB) workunit datafiles per hour (half the 'result per hour' rate), plus a thousand AP workunit files. 
The same maths gives sixteen gigabytes per hour of new data to be written, and outdated data to be deleted. Not only does that give you (potentially) a huge fragmentation problem, it would also keep a disk controller card very busy - probably ruling out the 'hot rebuild' while the array was online. And remember that SETI's disks have that average throughput 24/7 - small business servers have quieter times overnight and at weekends where these sorts of processes have a chance to catch up, even if they run slowly during the business day. ID: 1110756 ·
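For anyone curious what "rebuilding from the surviving drives" amounts to in RAID 5, here is a minimal sketch, assuming plain XOR parity over a single stripe of a three-drive set. The block contents and function name are made up for illustration; it says nothing about how the SETI arrays are actually configured.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a three-drive RAID 5 set: two data blocks plus one parity block.
data_a = b"workunit"
data_b = b"result01"
parity = xor_blocks([data_a, data_b])

# The drive holding data_b fails. The hot spare is filled by XORing
# whatever the surviving drives hold for this stripe.
rebuilt_b = xor_blocks([data_a, parity])
assert rebuilt_b == data_b
```

Every stripe on the array needs the same treatment, and every one of those reads competes with the normal workunit traffic, which is why a rebuild on a busy array is slow.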

Donald L. Johnson Send message Joined: 5 Aug 02 Posts: 8240 Credit: 14,654,533 RAC: 20	Message 1110772 - Posted: 28 May 2011, 14:46:56 UTC Jeff just posted an update here. We have confirmed Downloads are GO (I just got a couple), but still no word on when Uploads will come back. Donald Infernal Optimist / Submariner, retired ID: 1110772 ·

Jason Safoutin Volunteer tester Send message Joined: 8 Sep 05 Posts: 1386 Credit: 200,389 RAC: 0	Message 1110779 - Posted: 28 May 2011, 15:15:46 UTC - in response to Message 1110774. The upload is back.. i'm uploading right now :) Same here. Some took a few tries, but all that I had uploaded and reported. BETA still down though :( "By faith we understand that the universe was formed at God's command, so that what is seen was not made out of what was visible". Hebrews 11.3 ID: 1110779 ·

Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13	Message 1110803 - Posted: 28 May 2011, 16:53:53 UTC I believe I read a while back in one of Matt's posts that the volume where uploads (and I think the active tasks) live is 4TB. Another volume on that array may hold the raw data files as well. I want to say that the total array size they're working with is well over 10TB, but somewhere under 30 usable. I also believe they switched to RAID 10 last year, since it was significantly faster than RAID 5/6. Yeah, that's a lot of wasted disks, but it comes with the benefit of much more I/O throughput, and it is easier to rebuild a missing/replaced disk. That said, some would say "well if it's just one disk that gets swapped out, why does it take a day and a half to rebuild the data on it?" That is because a FULL consistency check is run, and it verifies the data on ALL of the disks in the array. I run one of those on my 4-disk RAID 5 every couple of weeks just for good measure, and it takes 2h30m. That's roughly 35 minutes per disk involved, and their storage server has 20+ (I think Thumper has 48). Takes time. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up) ID: 1110803 ·
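The rule of thumb in the post above can be worked through as a rough sketch. It takes the per-disk figure at face value and assumes the check time grows linearly with the number of disks, which is the post's assumption rather than a measured fact; real times also depend on disk size and on how busy the array is. The function names are illustrative.

```python
def minutes_per_disk(total_minutes, disks):
    """Average check time per disk, from one observed full check."""
    return total_minutes / disks


def estimated_check_hours(per_disk_minutes, disks):
    """Linear extrapolation of a full consistency check to a larger array."""
    return per_disk_minutes * disks / 60


# Observed: 4-disk RAID 5, full check takes 2h30m (150 minutes).
per_disk = minutes_per_disk(150, 4)

for disks in (20, 48):
    print(disks, "disks:", estimated_check_hours(per_disk, disks), "hours")
```

On those numbers a 48-disk box would need something over a day, which is in the same ballpark as the "day and a half" people have been asking about.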

Ellis Hardin Send message Joined: 15 Mar 01 Posts: 33 Credit: 26,603,764 RAC: 0	Message 1110825 - Posted: 28 May 2011, 19:05:10 UTC Wow. What a hornet's nest I've stirred up with my 20 day cache claim. Let me explain. Whatever is used to determine how many workunits you download is based on the amount of time that BOINC estimates it takes for you to complete a workunit, and whatever method BOINC uses is never accurate. It never takes me as long as the estimated time to process a workunit. So, it never takes me 20 days to work through my cache. When I had a 10 day cache set, I noticed that every now and then I would still run out of work, so that's why I changed to a 20 day cache. FYI, the max amount of extra days of work you can set is 10, but there is no max on the number of days to connect to BOINC, so, theoretically, you can get a much higher amount of work cached. One time something weird happened, and my estimated completion times were greatly reduced, like from hours to seconds. As a result my computers ended up downloading 30+ days worth of work. This is how I know the estimated workunit completion times are used to determine how much work you download. And this was when I still had a 10 day cache set. That was the closest I came to missing workunit deadlines, but I made them all. The only workunit deadlines I've missed were due to lost workunits. ID: 1110825 ·
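A toy model of the effect described above, assuming the client simply asks for enough tasks to fill the cache at the estimated run time per task. This is a deliberately simplified sketch, not BOINC's actual work-fetch logic, and the function names and numbers are made up for illustration.

```python
SECONDS_PER_DAY = 86400


def tasks_to_fetch(cache_days, estimated_seconds_per_task, cores=1):
    """Tasks needed to fill the cache if every task ran for its estimated time."""
    seconds_to_fill = cache_days * SECONDS_PER_DAY * cores
    return int(seconds_to_fill // estimated_seconds_per_task)


def real_days_of_work(tasks, actual_seconds_per_task, cores=1):
    """How long the fetched tasks really last at the actual run time."""
    return tasks * actual_seconds_per_task / (SECONDS_PER_DAY * cores)


# Estimate too high (2 h estimated, 1 h actual): a "10 day" cache lasts 5 days.
fetched = tasks_to_fetch(10, 2 * 3600)
print(fetched, real_days_of_work(fetched, 1 * 3600))

# Estimate too low (40 min estimated, 2 h actual): the same setting pulls 30 days of work.
fetched = tasks_to_fetch(10, 40 * 60)
print(fetched, real_days_of_work(fetched, 2 * 3600))
```

Either way the cache setting only means what it says when the estimates are right, which matches both halves of the story: running dry on a 10 day cache, and ending up with 30+ days of work when the estimates collapsed.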

Slavac Volunteer tester Send message Joined: 27 Apr 11 Posts: 1932 Credit: 17,952,639 RAC: 0	Message 1110841 - Posted: 28 May 2011, 19:50:39 UTC - in response to Message 1110825. http://setiathome.berkeley.edu/forum_thread.php?id=64277 Don't forget to toss a donation at SETI if you can guys. Executive Director GPU Users Group Inc. - brad@gpuug.org ID: 1110841 ·

Khangollo Send message Joined: 1 Aug 00 Posts: 245 Credit: 36,410,524 RAC: 0	Message 1110869 - Posted: 28 May 2011, 21:22:43 UTC Shortie storm! This will clog up the servers just nicely for quite a while... ID: 1110869 ·

tbret Volunteer tester Send message Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40	Message 1110874 - Posted: 28 May 2011, 21:34:44 UTC - in response to Message 1110756. The same maths gives sixteen gigabytes per hour of new data to be written, and outdated data to be deleted. Not only does that give you (potentially) a huge fragmentation problem, it would also keep a disk controller card very busy - probably ruling out the 'hot rebuild' while the array was online. And remember that SETI's disks have that average throughput 24/7 - small business servers have quieter times overnight and at weekends where these sorts of processes have a chance to catch up, even if they run slowly during the business day. I'm replying to this because this is the most clear answer to what I did not know how to ask. Just for general consumption -- There is no need to defend the project or the guys. I'm sure they know what they are doing and I'm equally sure that I don't. I'm asking about the technology so that I have some understanding of the state of things. I'm thinking of multiple gigabytes of read/writes per hour of small files (I had forgotten about all the deletions) and I'm thinking to myself, "Self; wouldn't RAM be a FAR preferable place to do all these read/writes (and deletions)?" Way back in time, when the Earth first cooled, RAM was very expensive and "hard to address." Not-so-much now. I know you'd still have to have redundancy, so you'd need a simple RAID to read and write results from and completed work units to, but you wouldn't be working the snot out of a bank of hard drives in a RAID array on a flaky controller across the room connected by fiber or whatever. But I have no clue how you would tell the system to only write-to-disk what you wanted to preserve. 
I'm imagining you'd say "take programs and data from Disk A and write to RAM Drive X, do all your stuff out on X, then write only results you want to keep to Disk B," instead of "read from a subdirectory on the array, write to a different subdirectory on the array, and use a program in a different subdirectory on the array to decide what to keep and discard, and..." Well, you get my point. I don't even have the current terminology to communicate what I'm trying to ask, but I guess I'm generally asking if a huge RAM drive is possible and if it is, wouldn't it be faster and more reliable? ID: 1110874 ·
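The "Disk A to RAM Drive X to Disk B" idea can be sketched in a few lines, assuming the scratch area is an ordinary directory that happens to live on a RAM-backed filesystem (on many Linux systems `/dev/shm` is one). The function, the `.result` suffix and the uppercase "processing" step are all hypothetical placeholders, not anything the project runs.

```python
import shutil
import tempfile
from pathlib import Path


def process_in_scratch(source_dir, results_dir, scratch_root=None, keep_suffix=".result"):
    """Stage files into a scratch directory, work there, keep only the results.

    scratch_root=None uses the default temp location; pass a RAM-backed
    mount point (an assumption about the host) to keep the churn off disk.
    """
    source_dir, results_dir = Path(source_dir), Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    kept = []
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        work = Path(scratch) / "work"
        shutil.copytree(source_dir, work)  # Disk A -> scratch "drive X"
        for item in sorted(work.iterdir()):
            if item.is_file() and item.suffix != keep_suffix:
                # Placeholder for the real processing step.
                output = work / (item.name + keep_suffix)
                output.write_bytes(item.read_bytes().upper())
        for item in sorted(work.glob("*" + keep_suffix)):
            shutil.copy2(item, results_dir / item.name)  # scratch -> Disk B
            kept.append(item.name)
    # Leaving the with-block deletes everything that was not copied out.
    return kept
```

Calling `process_in_scratch("incoming", "validated", scratch_root="/dev/shm")` would do all the intermediate reads, writes and deletions in memory and touch the disks only for the initial copy and the kept results. Anything still in the scratch area when the power goes is gone, so the approach only suits data that can be regenerated.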

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13797 Credit: 208,696,464 RAC: 304	Message 1110880 - Posted: 28 May 2011, 22:09:00 UTC - in response to Message 1110874. I don't even have the current terminology to communicate what I'm trying to ask, but I guess I'm generally asking if a huge RAM drive is possible and if it is, wouldn't it be faster and more reliable? RAM is much more expensive than magnetic storage, and much less reliable. Also, all it takes is a power hiccup and all that data is lost. Not so once it's saved to magnetic storage. There are now SSDs (Solid State Drives) that use flash memory for storage, and they use the same interface as a normal HDD. They are blindingly fast compared to HDDs (HDDs aren't even in the race), but they are much smaller in capacity and mind-bogglingly expensive compared to conventional HDDs. Without accurate numbers or knowing the current size of the storage used, and given that you would need more drives and more storage devices to match the capacity of the HDDs, off the top of my head I would expect the cost of replacing the present major HDD storage with all SLC (Single Level Cell) SSDs (I wouldn't use MLC SSDs for this sort of work without at least 75% over-provisioning) to be at least US$750,000. Money that could be better spent elsewhere. Grant Darwin NT ID: 1110880 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13797 Credit: 208,696,464 RAC: 304	Message 1110886 - Posted: 28 May 2011, 22:27:53 UTC - in response to Message 1110880. There's definitely some sort of problem with uploads - normally after an outage they go through on the first attempt, even if slowly. At the present rate, it'll take about 5 days for all of mine to go through. Grant Darwin NT ID: 1110886 ·

Bernie Vine Volunteer moderator Volunteer tester Send message Joined: 26 May 99 Posts: 9954 Credit: 103,452,613 RAC: 328	Message 1110889 - Posted: 28 May 2011, 22:33:56 UTC - in response to Message 1110886. There's definitely some sort of problem with uploads - normally after an outage they go through on the first attempt, even if slowly. At the present rate, it'll take about 5 days for all of mine to go through. My modest 4 machines started today with 311 uploads waiting, went down as low as 275, and are now up to 326, so they're crunching faster than they can upload! :-) ID: 1110889 ·

SciManStev Volunteer tester Send message Joined: 20 Jun 99 Posts: 6656 Credit: 121,090,076 RAC: 0	Message 1110890 - Posted: 28 May 2011, 22:38:56 UTC Don't worry. Traffic is very high at the moment with thousands of us trying to upload. It will all be sorted out before you know it. Steve Warning, addicted to SETI crunching! Crunching as a member of GPU Users Group. GPUUG Website ID: 1110890 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14661 Credit: 200,643,578 RAC: 874	Message 1110897 - Posted: 28 May 2011, 23:01:05 UTC - in response to Message 1110886. There's definitely some sort of problem with uploads - normally after an outage they go through on the first attempt, even if slowly. At the present rate, it'll take about 5 days for all of mine to go through. Don't forget we're uploading to a temporary storage facility cobbled together with duct tape and chewing gum. Don't be surprised that it's slow, be surprised that it works at all! ID: 1110897 ·

SciManStev Volunteer tester Send message Joined: 20 Jun 99 Posts: 6656 Credit: 121,090,076 RAC: 0	Message 1110898 - Posted: 28 May 2011, 23:13:40 UTC - in response to Message 1110897. Don't be surprised that it's slow, be surprised that it works at all! I am surprised it is working. I didn't expect anything until Tuesday. This is an exceptional moment that demonstrates the dedication of the SETI staff! They aren't paid for weekends. Steve Warning, addicted to SETI crunching! Crunching as a member of GPU Users Group. GPUUG Website ID: 1110898 ·

Jason Safoutin Volunteer tester Send message Joined: 8 Sep 05 Posts: 1386 Credit: 200,389 RAC: 0	Message 1110915 - Posted: 29 May 2011, 0:12:40 UTC - in response to Message 1110898. Don't be surprised that it's slow, be surprised that it works at all! I am surprised it is working. I didn't expect anything until Tuesday. This is an exceptional moment that demonstrates the dedication of the SETI staff! They aren't paid for weekends. Steve Indeed! They could have just shut everything off and left it until Tuesday. I am glad we have dedicated staff members. The slow uploads are also due to the fact that everyone will be trying to upload, often from multiple computers at once. It is still taking me a few tries to upload WUs, but all of them end up being uploaded. Just try not to contact the server as much. I have mine set to update/upload/report on my command only for the time being. "By faith we understand that the universe was formed at God's command, so that what is seen was not made out of what was visible". Hebrews 11.3 ID: 1110915 ·

Slavac Volunteer tester Send message Joined: 27 Apr 11 Posts: 1932 Credit: 17,952,639 RAC: 0	Message 1110923 - Posted: 29 May 2011, 0:24:48 UTC - in response to Message 1110915. Don't be surprised that it's slow, be surprised that it works at all! I am surprised it is working. I didn't expect anything until Tuesday. This is an exceptional moment that demonstrates the dedication of the SETI staff! They aren't paid for weekends. Steve Indeed! They could have just shut everything off and left it until Tuesday. I am glad we have dedicated staff members. The slow uploads are also due to the fact that everyone will be trying to upload, often from multiple computers at once. It is still taking me a few tries to upload WUs, but all of them end up being uploaded. Just try not to contact the server as much. I have mine set to update/upload/report on my command only for the time being. Yep, just got the last of my several hundred uploaded. Some of them had an elapsed time of over 30 mins. But that doesn't matter at all, SETI has the data and all is well. Executive Director GPU Users Group Inc. - brad@gpuug.org ID: 1110923 ·

arkayn Volunteer tester Send message Joined: 14 May 99 Posts: 4438 Credit: 55,006,323 RAC: 0	Message 1110927 - Posted: 29 May 2011, 0:29:00 UTC I have 600+ on one machine and 60 on my other, it will be a long slow process. Just not as slow as Steve!! ID: 1110927 ·

Dimly Lit Lightbulb 😀 Volunteer tester Send message Joined: 30 Aug 08 Posts: 15399 Credit: 7,423,413 RAC: 1	Message 1110939 - Posted: 29 May 2011, 1:59:24 UTC I don't crunch much but I'm still getting a backoff of several hours. I'm not hitting the retry button every few minutes because I think it's pointless (not to mention the server stuff), and sooner or later I'll have success. ID: 1110939 ·

[B^S] madmac Volunteer tester Send message Joined: 9 Feb 04 Posts: 1175 Credit: 4,754,897 RAC: 0	Message 1111034 - Posted: 29 May 2011, 11:34:02 UTC - in response to Message 1110939. I don't crunch much but I'm still getting a backoff of several hours. I'm not hitting the retry button every few minutes because I think it's pointless (not to mention the server stuff), and sooner or later I'll have success. Totally agree. I have a lot to download and will wait until I get them all, probably sometime tomorrow. ID: 1111034 ·
