No Work Issues (Jun 09 2010)

Message boards : Technical News : No Work Issues (Jun 09 2010)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1002267 - Posted: 9 Jun 2010, 22:34:27 UTC

Let me address the "no work" issues as of late. We've been running low on work to send out (or had the schedulers turned off) for several reasons:

1. Each raw data file has to go through a local software-based radar analysis - a suite of programs that takes over 3 hours to run per file. This should keep up with the incoming data flow, but some nagging NFS/mounting bugs cause this suite to lock up several times a week. Each time it does, the whole system for getting new data online is clogged until a human can figure out where it was in the process, clean it up, and start the broken file over again (resulting in many hours of lost processing time). For example, this morning we found it had jammed last night, cleaned it up around 9am, and finally around 12:30pm new workunits were available again. We're working on adding some band-aid solutions to this particular problem.

2. Server crashes: mork and ptolemy are prone to crashing for no apparent reason. Either of them going down causes the project to halt until we recover. Sometimes it takes days to fully get back to a regular work-flow pace again. We're trying to shuffle services around to get ptolemy out of the picture. Why ptolemy instead of mork? Mork is a much bigger system and therefore much harder to replace - plus when it goes down the download servers are at least still able to work for a while.

3. Some data files error out pretty quickly due to noise or garbage data.

4. The CUDA clients sure burn through work fast.

5. Some CUDA clients were returning garbage. To combat this, a fix to the scheduler was put online this Monday, but we were unable to start it without errors. It took Eric, Jeff, and me all day, and most of the next morning, to finally find the obscure problem - which was actually a misleading redirect in the Apache config (that was put in many months ago). By the time we fixed it, we were already into the weekly outage.

So lots of battles on this front. In any case we are collecting data at this point (on 2TB drives, which means we'll lose less data waiting for the Arecibo operators to swap out the older 500/750GB drives), and still have a backlog of stuff to process in our archives. The lab is also getting a Gbit link to the world in July, so the slow transfers to/from these archives will no longer be a bottleneck. Note this link is for the whole lab, and our SETI-specific data link will remain at 100 Mbit. Still, it's an improvement.
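As a rough illustration of why link speed matters here, this is a back-of-the-envelope calculation of how long one 2TB drive's worth of data would take to move at the two link speeds mentioned. It assumes decimal units (2 TB = 2e12 bytes) and an ideal, fully dedicated link with no protocol overhead, which real transfers will not achieve. The function name is made up for this sketch.

```python
def transfer_hours(size_bytes, link_bits_per_s):
    """Idealized transfer time in hours: size in bits divided by line rate."""
    return size_bytes * 8 / link_bits_per_s / 3600


DRIVE_BYTES = 2e12  # one 2TB drive, decimal terabytes (assumption)

for label, rate in [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9)]:
    print(f"{label}: {transfer_hours(DRIVE_BYTES, rate):.1f} hours")
```

Under those assumptions the difference is roughly 44 hours versus about 4.4 hours per drive.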

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1002267
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1002272 - Posted: 9 Jun 2010, 22:39:33 UTC - in response to Message 1002267.  

Thanks Matt for the update and all the efforts,

Claggy
ID: 1002272
ront
Joined: 25 Aug 01
Posts: 77
Credit: 386,336
RAC: 0
United States
Message 1002308 - Posted: 10 Jun 2010, 0:02:51 UTC

Thanks a bunch.

Appreciate the update/information

ron tillman
ID: 1002308
SciManStev
Volunteer tester
Joined: 20 Jun 99
Posts: 6651
Credit: 121,090,076
RAC: 0
United States
Message 1002331 - Posted: 10 Jun 2010, 0:53:57 UTC

Thank you Matt! This news certainly will be appreciated by all. You are right, in that GPU crunchers can burn through a lot of work.

Steve
Warning, addicted to SETI crunching!
Crunching as a member of GPU Users Group.
GPUUG Website
ID: 1002331
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1002342 - Posted: 10 Jun 2010, 1:31:45 UTC - in response to Message 1002308.  

The same here too.

ID: 1002342
FrostKing9
Joined: 20 Oct 01
Posts: 39
Credit: 23,815,960
RAC: 0
United States
Message 1002351 - Posted: 10 Jun 2010, 2:12:09 UTC
Last modified: 10 Jun 2010, 2:14:02 UTC

WOW... Matt.... thanks for the very specific update. Congrats on getting the Gbit link to the outside world. That will surely make things a lot better... in some ways.

And... a hearty "thank you" to all of the people who work on SETI.... and keep it going.


I DONATE money to SETI@home.... DO YOU?

I'm just slowly BOINC'ing along.

Hey... ET... you have a sister who likes earthlings?
ID: 1002351
soft^spirit
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1002494 - Posted: 10 Jun 2010, 12:10:23 UTC

Afraid I was among the "returning garbage" Computers. I finally changed the GPU to "wait until idle 9999 minutes" (NVIDIA GeForce GT 220 (986MB) driver: 19745).

A heads up when it is safe for us to return them to work would be appreciated.
Until then, crunching away on 6.03 only. Thank you for the update Matt!

Janice
ID: 1002494
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1002501 - Posted: 10 Jun 2010, 12:43:51 UTC - in response to Message 1002494.  
Last modified: 10 Jun 2010, 12:44:28 UTC

I don't see any sign of trouble on your GT 220 - host 5385323.

The main 'garbage' offenders were the new GTX 470 and GTX 480. Yours isn't affected - you're fine to go back to normal crunching (as and when work is available, of course).
ID: 1002501
Byron Leigh Hatch @ team Carl Sagan
Volunteer tester
Joined: 5 Jul 99
Posts: 4548
Credit: 35,667,570
RAC: 4
Canada
Message 1002506 - Posted: 10 Jun 2010, 13:10:17 UTC - in response to Message 1002267.  

Thanks Matt for the update.

Byron
ID: 1002506
ront
Joined: 25 Aug 01
Posts: 77
Credit: 386,336
RAC: 0
United States
Message 1002519 - Posted: 10 Jun 2010, 14:09:24 UTC

Again,

Thanks & Kudos to the SETI staff and volunteers who labor long and hard to keep what appears to be a rather cantankerous system up and running.

One cannot help but be impressed with the dedication and determination displayed by all of you.

In the mean... when work becomes available, is there any chance of getting some "Astropulse" tasks?

thank you,


rt
ID: 1002519
DJStarfox
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 1002521 - Posted: 10 Jun 2010, 14:14:32 UTC - in response to Message 1002267.  

Matt, thanks for the info on work issues.

1. Each raw data file has to go through a local software based radar analysis -

I don't suppose you have extensive file locking & error checking code? Some sort of time-out, retry, or checkpoint of the input file would be really helpful. Things like signal handling and using chsize(), if you know the output file size ahead of time, can help. Help Wanted forum? There are a lot of nerd programmers out there :cough: who would love to use their 1TB drives, Core i7, and VirtualBox with NFS mounts to optimize the software radar blanking code. With Internet2 access, test file size is not an issue.
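A minimal sketch of the time-out, retry, and checkpoint idea, assuming the analysis suite can be split into separate command-line stages (nothing in this thread confirms how the real suite is structured). The stage names, the JSON checkpoint file, and the 4-hour default time-out are all hypothetical, chosen only because a file is said to take a bit over 3 hours.

```python
import json
import os
import subprocess
import sys
import tempfile


def load_checkpoint(path):
    """Return the set of stage names already completed (empty if no checkpoint)."""
    try:
        with open(path) as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()


def save_checkpoint(path, done):
    """Write the checkpoint to a temp file, then rename it over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, path)


def run_stage(cmd, timeout_s, retries):
    """Run one stage; kill it if it exceeds timeout_s, retrying up to `retries` times."""
    for attempt in range(1, retries + 1):
        try:
            subprocess.run(cmd, timeout=timeout_s, check=True)
            return True
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError) as exc:
            print(f"attempt {attempt}/{retries} failed: {exc}", file=sys.stderr)
    return False


def run_pipeline(stages, checkpoint_path, timeout_s=4 * 3600, retries=3):
    """Run (name, cmd) stages in order, skipping any recorded in the checkpoint."""
    done = load_checkpoint(checkpoint_path)
    for name, cmd in stages:
        if name in done:
            continue  # finished in an earlier run, so do not redo it
        if not run_stage(cmd, timeout_s, retries):
            return False  # give up; the checkpoint still holds the finished stages
        done.add(name)
        save_checkpoint(checkpoint_path, done)
    return True
```

Rerunning run_pipeline after a failure resumes at the first unfinished stage instead of starting the file over. One caveat: a time-out like this may not help if the process is stuck in uninterruptible I/O on a hung NFS mount, since such a process may not respond to being killed.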

And, congratulations on the gigabit link next month. You'll be able to use Internet2 to its full potential. :)
ID: 1002521
Astro-AL
Joined: 31 Mar 00
Posts: 18
Credit: 95,868,034
RAC: 80
United States
Message 1002531 - Posted: 10 Jun 2010, 14:46:22 UTC - in response to Message 1002267.  

Thanks Matt for all your hard work and all the SETI staff.
ID: 1002531
Richard
Joined: 10 Jul 99
Posts: 19
Credit: 17,341,684
RAC: 0
Argentina
Message 1002545 - Posted: 10 Jun 2010, 15:18:46 UTC - in response to Message 1002267.  

Thanks a lot for keeping us informed...

A question: might this post mark a return to the good old days of "several posts per week" to keep the contact with the users alive?
(Apologies for my English; it is not my mother tongue)
ID: 1002545
soft^spirit
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1002550 - Posted: 10 Jun 2010, 15:28:49 UTC - in response to Message 1002501.  

Most of the invalid/error results have timed out; they were from the beginning of May, when I gave up and went CPU only (no more CUDA/6.09).

I will try turning it back on.. hopefully it is fixed. I just figured no units was better for the project than mutilated units.
ID: 1002550
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1002666 - Posted: 10 Jun 2010, 21:14:34 UTC

Oh - seems like I forgot another major reason for little work:

6. "Cannot Find Coincident Blanking Signal" - that's the error reported by half the beams in many recent raw data files. What does that mean? For some reason the data files are being written in a format which makes it impossible for some beams to find the blanking signal information, which they need to process (or else they will be cluttered with radar interference). So they error out, and we're effectively losing half our data at this point, which of course speeds up the burn rate quite a bit. Another thing we're looking into. We should be able to figure it out and reprocess the missing beams in the future.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1002666
Jeff Lane
Joined: 31 Mar 09
Posts: 1
Credit: 1,097,029
RAC: 0
United States
Message 1002675 - Posted: 10 Jun 2010, 21:44:12 UTC

Hello! Server crashes for no reason? Maybe you should rename 'em. Zatar and Apollo! Or something. OK I'm just bored.
ID: 1002675
Byron Leigh Hatch @ team Carl Sagan
Volunteer tester
Joined: 5 Jul 99
Posts: 4548
Credit: 35,667,570
RAC: 4
Canada
Message 1002684 - Posted: 10 Jun 2010, 22:10:17 UTC - in response to Message 1002666.  

Thanks Matt, for the update.

Byron

ID: 1002684
SciManStev
Volunteer tester
Joined: 20 Jun 99
Posts: 6651
Credit: 121,090,076
RAC: 0
United States
Message 1002726 - Posted: 10 Jun 2010, 23:14:34 UTC

Thank you! That is valuable information indeed!

Steve
Warning, addicted to SETI crunching!
Crunching as a member of GPU Users Group.
GPUUG Website
ID: 1002726
Jeff Dahn
Volunteer tester
Joined: 23 Nov 02
Posts: 2
Credit: 8,227,107
RAC: 0
United States
Message 1002816 - Posted: 11 Jun 2010, 5:10:46 UTC - in response to Message 1002726.  

I certainly appreciate the update. It seems as though you all are fighting an uphill battle to keep things viable.

Thanks muchly for all the effort.
Jeff
ID: 1002816
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65689
Credit: 55,293,173
RAC: 49
United States
Message 1002833 - Posted: 11 Jun 2010, 6:06:47 UTC - in response to Message 1002550.  

Lovely. I use 6.03, and I know from what I've read that 6.09 is dead, dead, dead; 6.10 is for Fermi.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1002833