No Work Issues (Jun 09 2010)


Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 1002267 - Posted: 9 Jun 2010, 22:34:27 UTC

Let me address the "no work" issues as of late. We've been running low on work to send out (or had the schedulers turned off) for several reasons:

1. Each raw data file has to go through a local, software-based radar analysis - a suite of programs that takes over 3 hours to run per file. This should keep up with the incoming data flow, but some nagging NFS/mounting bugs cause this suite to lock up several times a week. Each time it does, the whole system for getting new data online is clogged until a human can figure out where it was in the process, clean it up, and start the broken file over again (resulting in many hours of lost processing time). For example, this morning we found it had all jammed last night, cleaned it up around 9am, and finally around 12:30pm new workunits were available again. We're working on adding some band-aid solutions to this particular problem.

2. Server crashes: mork and ptolemy are prone to crashing for no apparent reason. Either of them going down causes the project to halt until we recover. Sometimes it takes days to fully get back to a regular work-flow pace again. We're trying to shuffle services around to get ptolemy out of the picture. Why ptolemy instead of mork? Mork is a much bigger system and therefore much harder to replace - plus when it goes down the download servers are at least still able to work for a while.

3. Some data files error out pretty quickly due to noise or garbage data.

4. The CUDA clients sure burn through work fast.

5. Some CUDA clients were returning garbage. To combat this, a fix to the scheduler was put online this Monday, but we were unable to start it without errors. It took Eric, Jeff, and me all day, and most of the next morning, to finally find the obscure problem - which was actually a misleading redirect in the Apache config (that was put in many months ago). By the time we fixed it, we were already into the weekly outage.

So lots of battles on this front. In any case we are collecting data at this point (on 2TB drives, which means we'll lose less data waiting for the Arecibo operators to swap out the older 500/750GB drives), and still have a backlog of stuff to process in our archives. The lab is also getting a Gbit link to the world in July, so the slow transfers to/from these archives will no longer be a bottleneck. Note this link is for the whole lab, and our SETI-specific data link will remain at 100 Mbit. Still, it's an improvement.
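
As a rough sense of scale for the link speeds mentioned above, here is a back-of-the-envelope sketch. It is not a project figure: it assumes decimal units (2 TB = 2 x 10^12 bytes), a link running at its full nominal rate, and no protocol overhead, so real transfers would take longer. The function name `transfer_hours` is made up for illustration.

```python
def transfer_hours(size_bytes, link_bits_per_s):
    """Idealized time to move size_bytes over a link, in hours (no overhead)."""
    return size_bytes * 8 / link_bits_per_s / 3600


DRIVE_BYTES = 2 * 10**12  # one 2 TB drive, decimal units (assumption)

for label, rate in (("100 Mbit/s", 100 * 10**6), ("1 Gbit/s", 10**9)):
    print(f"{label}: {transfer_hours(DRIVE_BYTES, rate):.1f} hours")
```

Under those assumptions, one full 2 TB drive is roughly 44 hours of transfer at 100 Mbit/s and roughly 4.4 hours at 1 Gbit/s.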

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4087
Credit: 32,996,037
RAC: 5,766
United Kingdom
Message 1002272 - Posted: 9 Jun 2010, 22:39:33 UTC - in response to Message 1002267.

Thanks Matt for the update and all the efforts,

Claggy

ront
Joined: 25 Aug 01
Posts: 77
Credit: 386,336
RAC: 0
United States
Message 1002308 - Posted: 10 Jun 2010, 0:02:51 UTC

Thanks a bunch.

Appreciate the update/information

ron tillman
____________

SciManStev
Project donor
Volunteer tester
Joined: 20 Jun 99
Posts: 4843
Credit: 81,569,590
RAC: 38,928
United States
Message 1002331 - Posted: 10 Jun 2010, 0:53:57 UTC

Thank you Matt! This news certainly will be appreciated by all. You are right, in that GPU crunchers can burn through a lot of work.

Steve
____________
Warning, addicted to SETI crunching!
Crunching as a member of GPU Users Group.
GPUUG Website

Wiggo
Joined: 24 Jan 00
Posts: 6917
Credit: 94,252,042
RAC: 75,331
Australia
Message 1002342 - Posted: 10 Jun 2010, 1:31:45 UTC - in response to Message 1002308.

> Thanks a bunch.
>
> Appreciate the update/information


The same here too.

____________

FrostKing9
Joined: 20 Oct 01
Posts: 39
Credit: 23,815,960
RAC: 0
United States
Message 1002351 - Posted: 10 Jun 2010, 2:12:09 UTC
Last modified: 10 Jun 2010, 2:14:02 UTC

WOW... Matt.... thanks for the very specific update. Congrats on getting the Gbit link to the outside world. That will surely make things a lot better... in some ways.

And... a hearty "thank you" to all of the people who work on SETI.... and keep it going.
____________


I DONATE money to SETI@home.... DO YOU?

I'm just slowly BOINC'ing along.

Hey... ET... you have a sister who likes earthlings?

soft^spirit
Joined: 18 May 99
Posts: 6374
Credit: 28,631,059
RAC: 12
United States
Message 1002494 - Posted: 10 Jun 2010, 12:10:23 UTC

Afraid I was among the "returning garbage" Computers. I finally changed the GPU to "wait until idle 9999 minutes" (NVIDIA GeForce GT 220 (986MB) driver: 19745).

A heads up when it is safe for us to return them to work would be appreciated.
Until then, crunching away on 6.03 only. Thank you for the update Matt!

Janice

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8491
Credit: 49,745,549
RAC: 55,476
United Kingdom
Message 1002501 - Posted: 10 Jun 2010, 12:43:51 UTC - in response to Message 1002494.
Last modified: 10 Jun 2010, 12:44:28 UTC

> Afraid I was among the "returning garbage" Computers. I finally changed the GPU to "wait until idle 9999 minutes" (NVIDIA GeForce GT 220 (986MB) driver: 19745).
>
> A heads up when it is safe for us to return them to work would be appreciated.
> Until then, crunching away on 6.03 only. Thank you for the update Matt!
>
> Janice

I don't see any sign of trouble on your GT 220 - host 5385323.

The main 'garbage' offenders were the new GTX 470 and GTX 480. Yours isn't affected - you're fine to go back to normal crunching (as and when work is available, of course).

Byron Leigh Hatch @ team Carl Sagan
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 3617
Credit: 11,868,585
RAC: 1,105
Canada
Message 1002506 - Posted: 10 Jun 2010, 13:10:17 UTC - in response to Message 1002267.

Thanks Matt for the update.

Byron
____________

ront
Joined: 25 Aug 01
Posts: 77
Credit: 386,336
RAC: 0
United States
Message 1002519 - Posted: 10 Jun 2010, 14:09:24 UTC

Again,

Thanks & Kudos to the SETI Staff and volunteers who labor long and hard to keep what appears to be a rather cantankerous system up and running.

One cannot help but be impressed with the dedication and determination displayed by all of you.

In the meantime... when work becomes available, is there any chance of getting some "Astropulse" tasks?

thank you,


rt
____________

DJStarfox
Joined: 23 May 01
Posts: 1040
Credit: 546,900
RAC: 244
United States
Message 1002521 - Posted: 10 Jun 2010, 14:14:32 UTC - in response to Message 1002267.

Matt, thanks for the info on work issues.

> 1. Each raw data file has to go through a local software based radar analysis -

I don't suppose you have extensive file locking & error checking code? Some sort of time-out, retry, or checkpoint of the input file would be really helpful. Things like signal handling and using chsize(), if you know the output file size ahead of time, can help. Help Wanted forum? There are a lot of nerd programmers out there :cough: who would love to use their 1TB drives, Core i7, and VirtualBox with NFS mounts to optimize the software radar blanking code. With Internet2 access, test file size is not an issue.
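
To make the time-out / retry / checkpoint idea concrete, here is a minimal sketch in Python. It is not the project's actual pipeline code: the step names, the commands, the JSON checkpoint file, and the helper names `run_step` and `process_file` are all hypothetical. It only shows the general pattern of killing a hung step, retrying it, and recording finished steps so a restart does not begin the file over again.

```python
import json
import subprocess
from pathlib import Path


def run_step(cmd, timeout_s, retries):
    """Run one command; on timeout or non-zero exit, try again up to `retries` times."""
    for _ in range(retries):
        try:
            # subprocess.run kills the child and raises if the timeout expires.
            result = subprocess.run(cmd, timeout=timeout_s)
            if result.returncode == 0:
                return True
        except subprocess.TimeoutExpired:
            pass  # treat a hang like a failure and fall through to the next attempt
    return False


def process_file(steps, checkpoint_path, timeout_s=5.0, retries=3):
    """Run (name, cmd) steps in order, skipping any already listed in the checkpoint."""
    checkpoint = Path(checkpoint_path)
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    for name, cmd in steps:
        if name in done:
            continue  # finished in an earlier run; do not redo it
        if not run_step(cmd, timeout_s, retries):
            return False  # give up; the checkpoint still lists what completed
        done.append(name)
        # Write to a temp file and rename, so a crash cannot leave a half-written checkpoint.
        tmp = checkpoint.with_suffix(".tmp")
        tmp.write_text(json.dumps(done))
        tmp.replace(checkpoint)
    return True
```

One caveat: a timeout like this only helps when the stuck process can actually be killed. A process blocked in uninterruptible I/O on a hard-mounted NFS share may not respond until the mount recovers, so this would be a band-aid around the NFS problem rather than a fix for it.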

And, congratulations on the gigabit link next month. You'll be able to use Internet2 to its full potential. :)

Astro-AL
Joined: 31 Mar 00
Posts: 17
Credit: 51,121,022
RAC: 64,944
United States
Message 1002531 - Posted: 10 Jun 2010, 14:46:22 UTC - in response to Message 1002267.

Thanks Matt for all your hard work and all the SETI staff.
____________

Richard
Joined: 10 Jul 99
Posts: 19
Credit: 8,895,932
RAC: 8,783
Argentina
Message 1002545 - Posted: 10 Jun 2010, 15:18:46 UTC - in response to Message 1002267.

Thanks a lot for keeping us informed...

Question: might this post be a return to the good old times of "several posts per week" to keep the contact with the users alive?
(Apologies for my English; it is not my mother tongue.)
____________

soft^spirit
Joined: 18 May 99
Posts: 6374
Credit: 28,631,059
RAC: 12
United States
Message 1002550 - Posted: 10 Jun 2010, 15:28:49 UTC - in response to Message 1002501.

> > Afraid I was among the "returning garbage" Computers. I finally changed the GPU to "wait until idle 9999 minutes" (NVIDIA GeForce GT 220 (986MB) driver: 19745).
> >
> > A heads up when it is safe for us to return them to work would be appreciated.
> > Until then, crunching away on 6.03 only. Thank you for the update Matt!
> >
> > Janice
>
> I don't see any sign of trouble on your GT 220 - host 5385323.
>
> The main 'garbage' offenders were the new GTX 470 and GTX 480. Yours isn't affected - you're fine to go back to normal crunching (as and when work is available, of course).


Most of the invalid/error results have timed out; they were from the beginning of May, when I gave up and went CPU only (no more CUDA/6.09).

I will try turning it back on.. hopefully it is fixed. I just figured no units was better for the project than mutilated units.

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 1002666 - Posted: 10 Jun 2010, 21:14:34 UTC

Oh - seems like I forgot another major reason for little work:

6. "Cannot Find Coincident Blanking Signal" - that's the error reported by half the beams in many recent raw data files. What does that mean? For some reason the data files are being written in a format which makes it impossible for some beams to find the blanking signal information, which they need to process (or else they will be cluttered with radar interference). So they error out, and we're effectively losing half our data at this point, which of course speeds up the burn rate quite a bit. Another thing we're looking into. We should be able to figure it out and reprocess the missing beams in the future.

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Jeff Lane
Joined: 31 Mar 09
Posts: 1
Credit: 820,758
RAC: 446
United States
Message 1002675 - Posted: 10 Jun 2010, 21:44:12 UTC

Hello! Server crashes for no reason? Maybe you should rename 'em. Zatar and Apollo! Or something. OK I'm just bored.

Byron Leigh Hatch @ team Carl Sagan
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 3617
Credit: 11,868,585
RAC: 1,105
Canada
Message 1002684 - Posted: 10 Jun 2010, 22:10:17 UTC - in response to Message 1002666.

Thanks Matt, for the update.

Byron

____________

SciManStev
Project donor
Volunteer tester
Joined: 20 Jun 99
Posts: 4843
Credit: 81,569,590
RAC: 38,928
United States
Message 1002726 - Posted: 10 Jun 2010, 23:14:34 UTC

Thank you! That is valuable information indeed!

Steve
____________
Warning, addicted to SETI crunching!
Crunching as a member of GPU Users Group.
GPUUG Website

Jeff Dahn
Volunteer tester
Joined: 23 Nov 02
Posts: 2
Credit: 4,752,901
RAC: 5,162
United States
Message 1002816 - Posted: 11 Jun 2010, 5:10:46 UTC - in response to Message 1002726.

I certainly appreciate the update. It seems as though you all are fighting an uphill battle to keep things viable.

Thanks muchly for all the effort.
Jeff
____________

zoom314
Project donor
Joined: 30 Nov 03
Posts: 46272
Credit: 36,676,404
RAC: 5,329
Message 1002833 - Posted: 11 Jun 2010, 6:06:47 UTC - in response to Message 1002550.

> > > Afraid I was among the "returning garbage" Computers. I finally changed the GPU to "wait until idle 9999 minutes" (NVIDIA GeForce GT 220 (986MB) driver: 19745).
> > >
> > > A heads up when it is safe for us to return them to work would be appreciated.
> > > Until then, crunching away on 6.03 only. Thank you for the update Matt!
> > >
> > > Janice
> >
> > I don't see any sign of trouble on your GT 220 - host 5385323.
> >
> > The main 'garbage' offenders were the new GTX 470 and GTX 480. Yours isn't affected - you're fine to go back to normal crunching (as and when work is available, of course).
>
> Most of the invalid/error have timed out, they were beginning of may when I gave up and went CPU only (no more cuda/6.09).
>
> I will try turning it back on.. hopefully it is fixed. I just figured no units was better for the project than mutilated units.

Lovely. I use 6.03, and I know from what I've read that 6.09 is dead, dead, dead; 6.10 is for Fermi.
____________
My Facebook, War Commander, 2015


Copyright © 2014 University of California