No Work Issues (Jun 09 2010)

Author	Message
Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 1002267 - Posted: 9 Jun 2010, 22:34:27 UTC
Let me address the "no work" issues as of late. We've been running low on work to send out (or had the schedulers turned off) for several reasons:
1. Each raw data file has to go through a local software-based radar analysis - a suite of programs that takes over 3 hours to run per file. This should keep up with the incoming data flow, but some nagging NFS/mounting bugs cause this suite to lock up several times a week. Each time it does, the whole system for getting new data online is clogged until a human can figure out where it was in the process, clean it up, and start the broken file over again (resulting in many hours of lost processing time). For example, this morning we found it had jammed last night, cleaned it up around 9am, and finally around 12:30pm new workunits were available again. We're working on adding some band-aid solutions to this particular problem.
2. Server crashes: mork and ptolemy are prone to crashing for no apparent reason. Either of them going down causes the project to halt until we recover. Sometimes it takes days to fully get back to a regular work-flow pace again. We're trying to shuffle services around to get ptolemy out of the picture. Why ptolemy instead of mork? Mork is a much bigger system and therefore much harder to replace - plus when it goes down the download servers are at least still able to work for a while.
3. Some data files error out pretty quickly due to noise or garbage data.
4. The CUDA clients sure burn through work fast.
5. Some CUDA clients were returning garbage. To combat this, a fix to the scheduler was put online this Monday, but we were unable to start it without errors. It took Eric, Jeff, and me all day, and most of the next morning, to finally find the obscure problem - which was actually a misleading redirect in the apache config (that was put in many months ago). By the time we fixed it, we were already into the weekly outage.
So lots of battles on this front. In any case we are collecting data at this point (on 2TB drives, which means we'll lose less data waiting for the Arecibo operators to swap out the older 500/750GB drives), and still have a backlog of stuff to process in our archives. The lab is also getting a Gbit link to the world in July, so the slow transfers to/from these archives will no longer be a bottleneck. Note this link is for the whole lab, and our SETI-specific data link will remain at 100 Mbit. Still, it's an improvement.
- Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 1002267 ·

Claggy Volunteer tester Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4	Message 1002272 - Posted: 9 Jun 2010, 22:39:33 UTC - in response to Message 1002267. Thanks Matt for the update and all the efforts, Claggy ID: 1002272 ·

ront Send message Joined: 25 Aug 01 Posts: 77 Credit: 386,336 RAC: 0	Message 1002308 - Posted: 10 Jun 2010, 0:02:51 UTC Thanks a bunch. Appreciate the update/information ron tillman ID: 1002308 ·

SciManStev Volunteer tester Send message Joined: 20 Jun 99 Posts: 6652 Credit: 121,090,076 RAC: 0	Message 1002331 - Posted: 10 Jun 2010, 0:53:57 UTC Thank you Matt! This news certainly will be appreciated by all. You are right, in that GPU crunchers can burn through a lot of work. Steve Warning, addicted to SETI crunching! Crunching as a member of GPU Users Group. GPUUG Website ID: 1002331 ·

Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489	Message 1002342 - Posted: 10 Jun 2010, 1:31:45 UTC - in response to Message 1002308.
Quote (Message 1002308): Thanks a bunch. Appreciate the update/information
The same here too. ID: 1002342 ·

FrostKing9 Send message Joined: 20 Oct 01 Posts: 39 Credit: 23,815,960 RAC: 0	Message 1002351 - Posted: 10 Jun 2010, 2:12:09 UTC Last modified: 10 Jun 2010, 2:14:02 UTC WOW... Matt... thanks for the very specific update. Congrats on getting the Gbit link to the outside world. That will surely make things a lot better... in some ways. And... a hearty "thank you" to all of the people who work on SETI... and keep it going. I DONATE money to SETI@home... DO YOU? I'm just slowly BOINC'ing along. Hey... ET... you have a sister who likes earthlings? ID: 1002351 ·

soft^spirit Send message Joined: 18 May 99 Posts: 6497 Credit: 34,134,168 RAC: 0	Message 1002494 - Posted: 10 Jun 2010, 12:10:23 UTC Afraid I was among the "returning garbage" Computers. I finally changed the GPU to "wait until idle 9999 minutes" (NVIDIA GeForce GT 220 (986MB) driver: 19745). A heads up when it is safe for us to return them to work would be appreciated. Until then, crunching away on 6.03 only. Thank you for the update Matt! Janice ID: 1002494 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1002501 - Posted: 10 Jun 2010, 12:43:51 UTC - in response to Message 1002494. Last modified: 10 Jun 2010, 12:44:28 UTC
Quote (Message 1002494): Afraid I was among the "returning garbage" Computers. I finally changed the GPU to "wait until idle 9999 minutes" (NVIDIA GeForce GT 220 (986MB) driver: 19745). A heads up when it is safe for us to return them to work would be appreciated. Until then, crunching away on 6.03 only. Thank you for the update Matt! Janice
I don't see any sign of trouble on your GT 220 - host 5385323. The main 'garbage' offenders were the new GTX 470 and GTX 480. Yours isn't affected - you're fine to go back to normal crunching (as and when work is available, of course). ID: 1002501 ·

Byron Leigh Hatch @ team Carl Sagan Volunteer tester Send message Joined: 5 Jul 99 Posts: 4548 Credit: 35,667,570 RAC: 4	Message 1002506 - Posted: 10 Jun 2010, 13:10:17 UTC - in response to Message 1002267. Thanks Matt for the update. Byron ID: 1002506 ·

ront Send message Joined: 25 Aug 01 Posts: 77 Credit: 386,336 RAC: 0	Message 1002519 - Posted: 10 Jun 2010, 14:09:24 UTC Again, thanks and kudos to the SETI staff and volunteers who labor long and hard to keep what appears to be a rather cantankerous system up and running. One cannot help but be impressed with the dedication and determination displayed by all of you. In the meantime... when work becomes available, is there any chance of getting some "Astropulse" tasks? thank you, rt ID: 1002519 ·

DJStarfox Send message Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2	Message 1002521 - Posted: 10 Jun 2010, 14:14:32 UTC - in response to Message 1002267.
Matt, thanks for the info on work issues.
Quote (Message 1002267): 1. Each raw data file has to go through a local software based radar analysis -
I don't suppose you have extensive file locking & error checking code? Some sort of time-out, retry, or checkpoint of input file would be really helpful. Things like signal handling and using chsize(), if you know the output file size ahead of time, can help. Help Wanted forum? There are a lot of nerd programmers out there :cough: who would love to use their 1TB drives, core i7, and Virtualbox with NFS mounts to optimize the software radar blanking code. With internet2 access, testfile size is not an issue. And, congratulations on the gigabit link next month. You'll be able to use Internet2 to its full potential. :) ID: 1002521 ·
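The time-out / retry / checkpoint idea suggested above could be sketched roughly like this. It is only a minimal Python illustration, not the project's actual radar-analysis code: the function name `run_with_retry`, the JSON checkpoint file, and the step labels are all hypothetical, and the real suite may be structured quite differently.

```python
import json
import subprocess
import sys
from pathlib import Path


def run_with_retry(cmd, checkpoint, step, timeout_s=60.0, max_attempts=3):
    """Run one pipeline step with a time-out, retries, and a checkpoint file.

    `checkpoint` is a hypothetical JSON file listing the steps already done,
    so a restart after a lock-up can skip finished work.
    """
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    if step in done:
        return "skipped"
    for attempt in range(1, max_attempts + 1):
        try:
            # subprocess.run kills the child if it exceeds the time-out
            subprocess.run(cmd, timeout=timeout_s, check=True)
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            continue  # hung or failed: try again
        done.append(step)
        checkpoint.write_text(json.dumps(done))  # record progress
        return f"ok after {attempt} attempt(s)"
    return "failed"


if __name__ == "__main__":
    # Stand-in command; a real step would invoke the analysis program.
    print(run_with_retry([sys.executable, "-c", "pass"],
                         Path("checkpoint.json"), "example_file"))
```

The point of the design is that a hung step gets killed and retried without a human having to notice, and a restart only redoes the steps that are not yet in the checkpoint.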

Astro-AL Send message Joined: 31 Mar 00 Posts: 18 Credit: 95,868,034 RAC: 80	Message 1002531 - Posted: 10 Jun 2010, 14:46:22 UTC - in response to Message 1002267. Thanks, Matt and all the SETI staff, for all your hard work. ID: 1002531 ·

Richard Send message Joined: 10 Jul 99 Posts: 19 Credit: 17,341,684 RAC: 0	Message 1002545 - Posted: 10 Jun 2010, 15:18:46 UTC - in response to Message 1002267. Thanks a lot for keeping us informed... A question: might this post be a return to the good old times of "several posts per week" to keep the contact with the users alive? (Apologies for my English; it is not my mother tongue.) ID: 1002545 ·

soft^spirit Send message Joined: 18 May 99 Posts: 6497 Credit: 34,134,168 RAC: 0	Message 1002550 - Posted: 10 Jun 2010, 15:28:49 UTC - in response to Message 1002501.
Quote (Message 1002494): Afraid I was among the "returning garbage" Computers. I finally changed the GPU to "wait until idle 9999 minutes" (NVIDIA GeForce GT 220 (986MB) driver: 19745). A heads up when it is safe for us to return them to work would be appreciated. Until then, crunching away on 6.03 only. Thank you for the update Matt! Janice
Quote (Message 1002501): I don't see any sign of trouble on your GT 220 - host 5385323. The main 'garbage' offenders were the new GTX 470 and GTX 480. Yours isn't affected - you're fine to go back to normal crunching (as and when work is available, of course).
Most of the invalid/error results have timed out; they were from the beginning of May, when I gave up and went CPU only (no more cuda/6.09). I will try turning it back on... hopefully it is fixed. I just figured no units was better for the project than mutilated units. ID: 1002550 ·

Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 1002666 - Posted: 10 Jun 2010, 21:14:34 UTC Oh - seems like I forgot another major reason for little work: 6. "Cannot Find Coincident Blanking Signal" - that's the error reported by half the beams in many recent raw data files. What does that mean? For some reason the data files are being written in a format which makes it impossible for some beams to find the blanking signal information, which they need to process (or else they will be cluttered with radar interference). So they error out, and we're effectively losing half our data at this point, which of course speeds up the burn rate quite a bit. Another thing we're looking into. We should be able to figure it out and reprocess the missing beams in the future. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 1002666 ·

Jeff Lane Send message Joined: 31 Mar 09 Posts: 1 Credit: 1,097,029 RAC: 0	Message 1002675 - Posted: 10 Jun 2010, 21:44:12 UTC Hello! Servers crashing for no reason? Maybe you should rename 'em. Zatar and Apollo! Or something. OK I'm just bored. ID: 1002675 ·

Byron Leigh Hatch @ team Carl Sagan Volunteer tester Send message Joined: 5 Jul 99 Posts: 4548 Credit: 35,667,570 RAC: 4	Message 1002684 - Posted: 10 Jun 2010, 22:10:17 UTC - in response to Message 1002666. Thanks Matt, for the update. Byron ID: 1002684 ·

SciManStev Volunteer tester Send message Joined: 20 Jun 99 Posts: 6652 Credit: 121,090,076 RAC: 0	Message 1002726 - Posted: 10 Jun 2010, 23:14:34 UTC Thank you! That is valuable information indeed! Steve Warning, addicted to SETI crunching! Crunching as a member of GPU Users Group. GPUUG Website ID: 1002726 ·

Jeff Dahn Volunteer tester Send message Joined: 23 Nov 02 Posts: 2 Credit: 8,227,107 RAC: 0	Message 1002816 - Posted: 11 Jun 2010, 5:10:46 UTC - in response to Message 1002726. I certainly appreciate the update. It seems as though you all are fighting an uphill battle to keep things viable. Thanks muchly for all the effort. Jeff ID: 1002816 ·

zoom3+1=4 Volunteer tester Send message Joined: 30 Nov 03 Posts: 65746 Credit: 55,293,173 RAC: 49	Message 1002833 - Posted: 11 Jun 2010, 6:06:47 UTC - in response to Message 1002550.
Quote (Message 1002494): Afraid I was among the "returning garbage" Computers. I finally changed the GPU to "wait until idle 9999 minutes" (NVIDIA GeForce GT 220 (986MB) driver: 19745). A heads up when it is safe for us to return them to work would be appreciated. Until then, crunching away on 6.03 only. Thank you for the update Matt! Janice
Quote (Message 1002501): I don't see any sign of trouble on your GT 220 - host 5385323. The main 'garbage' offenders were the new GTX 470 and GTX 480. Yours isn't affected - you're fine to go back to normal crunching (as and when work is available, of course).
Quote (Message 1002550): Most of the invalid/error results have timed out; they were from the beginning of May, when I gave up and went CPU only (no more cuda/6.09). I will try turning it back on... hopefully it is fixed. I just figured no units was better for the project than mutilated units.
Lovely. I use 6.03, and I know from what I've read that 6.09 is dead, dead, dead; 6.10 is for Fermi. The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's ID: 1002833 ·
