No Work Issues (Jun 09 2010)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Let me address the "no work" issues as of late. We've been running low on work to send out (or had the schedulers turned off) for several reasons: 1. Each raw data file has to go through a local software-based radar analysis - a suite of programs that takes over 3 hours to run per file. This should keep up with the incoming data flow, but some nagging NFS/mounting bugs cause this suite to lock up several times a week. Each time it does, the whole system for getting new data online is clogged until a human can figure out where it was in the process, clean it up, and start the broken file over again (resulting in many hours of lost processing time). For example, this morning we found it had jammed last night, cleaned it up around 9am, and finally around 12:30pm new workunits were available again. We're working on adding some band-aid solutions to this particular problem. 2. Server crashes: mork and ptolemy are prone to crashing for no apparent reason. Either of them going down causes the project to halt until we recover. Sometimes it takes days to fully get back to a regular work-flow pace again. We're trying to shuffle services around to get ptolemy out of the picture. Why ptolemy instead of mork? Mork is a much bigger system and therefore much harder to replace - plus when it goes down the download servers are at least still able to work for a while. 3. Some data files error out pretty quickly due to noise or garbage data. 4. The CUDA clients sure burn through work fast. 5. Some CUDA clients were returning garbage. To combat this, a fix to the scheduler was put online this Monday, but we were unable to start it without errors. It took Eric, Jeff, and me all day, and most of the next morning, to finally find the obscure problem - which was actually a misleading redirect in the Apache config (that was put in many months ago). By the time we fixed it, we were already into the weekly outage. So lots of battles on this front. 
In any case, we are collecting data at this point (on 2TB drives, which means we'll lose less data waiting for the Arecibo operators to swap out the older 500/750GB drives), and still have a backlog of stuff to process in our archives. The lab is also getting a Gbit link to the world in July, so the slow transfers to/from these archives will no longer be a bottleneck. Note this link is for the whole lab, and our SETI-specific data link will remain at 100 Mbit. Still, it's an improvement. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
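To put the link speeds mentioned above in perspective, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 TB = 10^12 bytes), a link that runs at its full nominal rate, and no protocol overhead or sharing with other lab traffic, none of which is stated in the post, so real transfers would take longer. The function name `transfer_hours` is made up for illustration.

```python
def transfer_hours(size_bytes, link_bits_per_s):
    """Ideal time in hours to move size_bytes over a link of the given rate.

    Assumes the link is fully saturated and ignores all overhead.
    """
    return size_bytes * 8 / link_bits_per_s / 3600


if __name__ == "__main__":
    two_tb = 2e12  # one 2TB drive, decimal terabytes (assumption)
    for label, rate in (("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9)):
        print(f"{label}: {transfer_hours(two_tb, rate):.1f} hours")
```

Under these idealized assumptions a full 2TB drive's worth of data needs roughly 44 hours at 100 Mbit/s and about 4.4 hours at 1 Gbit/s.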
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Thanks Matt for the update and all the efforts, Claggy |
ront Joined: 25 Aug 01 Posts: 77 Credit: 386,336 RAC: 0 |
Thanks a bunch. Appreciate the update/information ron tillman |
SciManStev Joined: 20 Jun 99 Posts: 6657 Credit: 121,090,076 RAC: 0 |
Thank you Matt! This news certainly will be appreciated by all. You are right, in that GPU crunchers can burn through a lot of work. Steve Warning, addicted to SETI crunching! Crunching as a member of GPU Users Group. GPUUG Website |
Wiggo Joined: 24 Jan 00 Posts: 36318 Credit: 261,360,520 RAC: 489 |
Thanks a bunch. The same here too. |
FrostKing9 Joined: 20 Oct 01 Posts: 39 Credit: 23,815,960 RAC: 0 |
WOW... Matt.... thanks for the very specific update. Congrats on getting the Gbit link to the outside world. That will surely make things a lot better... in some ways. And... a hearty "thank you" to all of the people who work on SETI.... and keep it going. I DONATE money to SETI@home.... DO YOU? I'm just slowly BOINC'ing along. Hey... ET... you have a sister who likes earthlings? |
soft^spirit Joined: 18 May 99 Posts: 6497 Credit: 34,134,168 RAC: 0 |
Afraid I was among the "returning garbage" Computers. I finally changed the GPU to "wait until idle 9999 minutes" (NVIDIA GeForce GT 220 (986MB) driver: 19745). A heads up when it is safe for us to return them to work would be appreciated. Until then, crunching away on 6.03 only. Thank you for the update Matt! Janice |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14672 Credit: 200,643,578 RAC: 874 |
Afraid I was among the "returning garbage" Computers. I finally changed the GPU to "wait until idle 9999 minutes" (NVIDIA GeForce GT 220 (986MB) driver: 19745). I don't see any sign of trouble on your GT 220 - host 5385323. The main 'garbage' offenders were the new GTX 470 and GTX 480. Yours isn't affected - you're fine to go back to normal crunching (as and when work is available, of course). |
Byron Leigh Hatch @ team Carl Sagan Joined: 5 Jul 99 Posts: 4548 Credit: 35,667,570 RAC: 4 |
Thanks Matt for the update. Byron |
ront Joined: 25 Aug 01 Posts: 77 Credit: 386,336 RAC: 0 |
Again, thanks and kudos to the SETI staff and volunteers who labor long and hard to keep what appears to be a rather cantankerous system up and running. One cannot help but be impressed with the dedication and determination displayed by all of you. In the meantime... when work becomes available, is there any chance of getting some "Astropulse" tasks? thank you, rt |
DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
Matt, thanks for the info on work issues. 1. Each raw data file has to go through a local software based radar analysis - I don't suppose you have extensive file locking & error checking code? Some sort of time-out, retry, or checkpoint of input file would be really helpful. Things like signal handling and using chsize(), if you know the output file size ahead of time, can help. Help Wanted forum? There are a lot of nerd programmers out there :cough: who would love to use their 1TB drives, core i7, and Virtualbox with NFS mounts to optimize the software radar blanking code. With internet2 access, testfile size is not an issue. And, congratulations on the gigabit link next month. You'll be able to use Internet2 to its full potential. :) |
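As a rough illustration of the time-out, retry, and checkpoint idea suggested above, here is a minimal Python sketch. It is not the project's actual pipeline: the stage names, commands, checkpoint file, and the helper names `run_with_retry` and `run_stages` are all hypothetical. It assumes each step of the analysis suite can be launched as a separate command that exits with status 0 on success.

```python
import json
import os
import subprocess
import sys
import time


def run_with_retry(cmd, timeout_s, max_attempts=3, pause_s=0.0):
    """Run cmd, killing it if it exceeds timeout_s; retry on failure.

    Returns the 1-based attempt that succeeded, or None if all attempts failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # subprocess.run kills the child and raises TimeoutExpired on timeout.
            if subprocess.run(cmd, timeout=timeout_s).returncode == 0:
                return attempt
        except subprocess.TimeoutExpired:
            pass  # treat a hang like any other failure
        if attempt < max_attempts:
            time.sleep(pause_s)
    return None


def run_stages(stages, checkpoint_path, timeout_s):
    """Run (name, cmd) stages in order, skipping those already checkpointed.

    Returns True if every stage has completed, False if one gave up.
    """
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    for name, cmd in stages:
        if name in done:
            continue  # finished in an earlier run
        if run_with_retry(cmd, timeout_s) is None:
            return False  # leave the checkpoint so a later run resumes here
        done.append(name)
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(done, f)
        os.replace(tmp, checkpoint_path)  # atomic swap of the checkpoint file
    return True


if __name__ == "__main__":
    # Placeholder stages; a real suite would list its own programs here.
    noop = [sys.executable, "-c", "pass"]
    print(run_stages([("unpack", noop), ("blank", noop)], "stages.json", 60))
```

With a checkpoint like this, a restart would pick up at the first unfinished stage instead of needing someone to work out where the run stopped. One caveat: a process blocked in uninterruptible I/O on a hung NFS mount may not respond to being killed, so a time-out of this kind would not necessarily cover every lock-up described in the original post.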
Astro-AL Joined: 31 Mar 00 Posts: 18 Credit: 95,868,034 RAC: 80 |
Thanks, Matt and all the SETI staff, for all your hard work. |
Richard Joined: 10 Jul 99 Posts: 19 Credit: 17,341,684 RAC: 0 |
Thanks a lot for keeping us informed... A question: might this post mark a return to the good old days of "several posts per week" to keep the contact with users alive? (Apologies for my English; it is not my mother tongue.) |
soft^spirit Joined: 18 May 99 Posts: 6497 Credit: 34,134,168 RAC: 0 |
Afraid I was among the "returning garbage" Computers. I finally changed the GPU to "wait until idle 9999 minutes" (NVIDIA GeForce GT 220 (986MB) driver: 19745). Most of the invalid/error results have timed out; they were from the beginning of May, when I gave up and went CPU only (no more CUDA/6.09). I will try turning it back on... hopefully it is fixed. I just figured no units was better for the project than mutilated units. |
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Oh - seems like I forgot another major reason for little work: 6. "Cannot Find Coincident Blanking Signal" - that's the error reported by half the beams in many recent raw data files. What does that mean? For some reason the data files are being written in a format which makes it impossible for some beams to find the blanking signal information, which they need to process (or else they will be cluttered with radar interference). So they error out, and we're effectively losing half our data at this point, which of course speeds up the burn rate quite a bit. Another thing we're looking into. We should be able to figure it out and reprocess the missing beams in the future. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Jeff Lane Joined: 31 Mar 09 Posts: 1 Credit: 1,097,029 RAC: 0 |
Hello! server crashes for no reasons? Maybe you should rename 'em. Zatar and Apollo! Or something. OK I'm just bored. |
Byron Leigh Hatch @ team Carl Sagan Joined: 5 Jul 99 Posts: 4548 Credit: 35,667,570 RAC: 4 |
Thanks Matt, for the update. Byron |
SciManStev Joined: 20 Jun 99 Posts: 6657 Credit: 121,090,076 RAC: 0 |
Thank you! That is valuable information indeed! Steve Warning, addicted to SETI crunching! Crunching as a member of GPU Users Group. GPUUG Website |
Jeff Dahn Joined: 23 Nov 02 Posts: 2 Credit: 8,227,107 RAC: 0 |
I certainly appreciate the update. It seems as though you all are fighting an uphill battle to keep things viable. Thanks muchly for all the effort. Jeff |
zoom3+1=4 Joined: 30 Nov 03 Posts: 66196 Credit: 55,293,173 RAC: 49 |
Afraid I was among the "returning garbage" Computers. I finally changed the GPU to "wait until idle 9999 minutes" (NVIDIA GeForce GT 220 (986MB) driver: 19745). Lovely. I use 6.03 and I know from what I've read that 6.09 is dead, dead, dead, 6.10 is for Fermi. Savoir-Faire is everywhere! The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.