Astropulse Beam 3 Polarity 1 Errors
Josef W. Segur (Joined: 30 Oct 99, Posts: 4504, Credit: 1,414,761, RAC: 0)
Richard Haselgrove noticed that the "generate_envelope: num_ffts_performed < 100." error noted earlier in this thread seems to be happening only on one channel of the data: B3_P1. That reminded me of the Curiously Compressible Data seen at Beta last year. In fact it seems to be the same problem recurring a year later: one of the two bits in each sample is stuck. That triggers the noise detection code, all the data is blanked, and there's nothing left to generate the envelope from. I've sent an email to Josh and Eric, though there's probably not much they can do. The other 13 channels aren't affected, so skipping the 'tapes' completely would be overreacting.

Joe
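For anyone who wants to poke at a captured file themselves, here is a minimal sketch of the kind of check Joe describes: if one of the two bits in every sample is stuck, some bit positions in the packed payload bytes never change over the whole file. The file name is hypothetical, and the sketch assumes you have already stripped the WU header and are looking at raw payload bytes only.

```python
# Sketch: look for bit positions that never vary across a raw payload dump.
# If a bit position is identical in every byte, its OR and AND accumulators
# agree at that position -- a sign of a stuck bit in the packed samples.

def find_constant_bits(payload: bytes):
    acc_or, acc_and = 0x00, 0xFF
    for b in payload:
        acc_or |= b
        acc_and &= b
    return [pos for pos in range(8)
            if ((acc_or >> pos) & 1) == ((acc_and >> pos) & 1)]

if __name__ == "__main__":
    # Hypothetical dump of the data portion of a suspect B3_P1 workunit.
    with open("ap_14mr09ac_B3_P1_sample.bin", "rb") as f:
        payload = f.read()
    stuck = find_constant_bits(payload)
    print("bit positions that never change:", stuck or "none")
```

Healthy telescope noise should leave no position constant over megabytes of data, so any hit from a check like this would point at the kind of hardware fault Joe suspects.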
Richard Haselgrove (Joined: 4 Jul 99, Posts: 14679, Credit: 200,643,578, RAC: 874)
I've checked my captured files, and yes, they are super-compressible. I noted in my post at Lunatics that my WUs came from different days' recordings than the first one mentioned here (12 and 13 March): I've got some from B3_P1 recorded on 19 March and they're super-compressible too. So although it's the same B/P as the Beta event, this one seems to be lasting much longer.
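A rough way to reproduce the "super-compressible" observation is simply to compare the zlib-compressed size against the original: healthy noise barely compresses, while a channel with a stuck bit shrinks dramatically. A minimal sketch, assuming you have a raw payload dump of the suspect channel (the file name is hypothetical):

```python
# Sketch: estimate how compressible a raw channel dump is.
import zlib

def compression_ratio(payload: bytes) -> float:
    """Return compressed size as a fraction of the original size."""
    return len(zlib.compress(payload, 9)) / len(payload)

if __name__ == "__main__":
    with open("ap_19mr09aa_B3_P1_sample.bin", "rb") as f:   # hypothetical dump
        data = f.read()
    print(f"compressed to {compression_ratio(data):.0%} of original size")
```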
Josef W. Segur (Joined: 30 Oct 99, Posts: 4504, Credit: 1,414,761, RAC: 0)
[quote]I've checked my captured files, and yes, they are super-compressible.[/quote]

The initial cases on Beta were seen in 27fe08ac data, and Josh's spot check indicated it lasted at least until March 5, 2008; so far the durations seem about the same. We'll get a better look at how long it lasts this time, since there's data with 20, 21, and 22mr tags. Then it skips to 30mr; Arecibo was doing planetary radar in between. I did get a second one (from 14mr09aa), but I don't process nearly as much data as you, so I probably won't see any more for this episode.

Joe
Speedy (Joined: 26 Jun 04, Posts: 1643, Credit: 12,921,799, RAC: 89)
Three people have crashed Workunit 438261330 so far.

In ap_gfx_main.cpp: in ap_graphics_init(): Starting client.
AstroPulse v. 5.03 Non-graphics FFTW USE_CONVERSION_OPT USE_SSE3 Windows x86 rev 112, Don't Panic!, by Raistmer with support of Lunatics.kwsn.net team.
SSE3 static fftw lib, built by Jason G.
ffa threshold mod, by Joe Segur.
SSE3 dechirping by JDWhale
CPUID: Intel(R) Core(TM)2 Quad CPU Q6700 @ 2.66GHz
Cache: L1=64K L2=4096K
Features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3
Error in ap_remove_radar.cpp: generate_envelope: num_ffts_performed < 100. Blanking too much RFI?

Looking at the stderr out text it looks like it's an application issue, but I'm not 100% sure. This is the only task I've come across with this issue. Thanks in advance.
Richard Haselgrove (Joined: 4 Jul 99, Posts: 14679, Credit: 200,643,578, RAC: 874)
[quote]Three people have crashed Workunit 438261330 so far.[/quote]

As Josef has identified, it's a stuck bit in the B3_P1 data channel - so a data problem, NOT an application problem. Quite where the bit is stuck awaits further investigation. Joe thinks it's seasonal (weather affecting the recorder at the observing site?), which would mess up one channel of the MB observations too (same data, different splitter). On the other hand, it might be an AP splitter issue - though I don't see why that would only affect one channel - but if so, MB should be undamaged.
Josef W. Segur (Joined: 30 Oct 99, Posts: 4504, Credit: 1,414,761, RAC: 0)
The seasonality could just as well be an annual cleanup campaign at Arecibo, because someone from the NSF does an inspection each year. But I don't know what the air conditioning arrangements are where the multibeam recorder is, either. The one picture I've seen is just a rack sitting against a wall.

I'd give very high odds that the problem is in the recorder rack, most likely the cabling from the output of the quadrature detector for that channel. It might be on the detector card itself or the digital input card in the computer. Before the detector there's just a single channel of analog IF; after the digital input there's the usual computer data handling, and any stuck-bit problem there would affect far more than we're seeing. But the recorder was designed and put together by Berkeley SSL; they know the details better than I do.

When an mb_splitter does some of that data, the obvious pattern will be lost in the FFTs that produce the 256 subbands. I think the effect will simply be a bunch of quiet WUs with less noise than usual. WU names will be like NNmr09ax.xxxx.xxxxx.10.x.xxx, since beam 3 polarity 1 corresponds to s4_id 10.

Joe
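For anyone who wants to spot the affected multibeam work by name, here is a minimal sketch based on the pattern Joe gives (the fourth dot-separated field of an MB WU name is the s4_id, and s4_id 10 means beam 3 / polarity 1). The example names below are invented for illustration.

```python
# Sketch: pick out multibeam WU names from beam 3 / polarity 1, assuming the
# name layout Joe describes: tape.x.y.s4_id.z.n with s4_id 10 == B3_P1.

def is_b3_p1_mb(wu_name: str) -> bool:
    fields = wu_name.split(".")
    return len(fields) >= 4 and fields[3] == "10"

# Hypothetical example names, purely for illustration.
names = [
    "20mr09ax.1234.56789.10.8.123",   # would be a quiet B3_P1 WU
    "20mr09ax.1234.56789.7.8.124",    # another beam/polarity
]
for n in names:
    print(n, "->", "B3_P1 (expect a quiet WU)" if is_b3_p1_mb(n) else "other channel")
```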
daysteppr (Joined: 22 Mar 05, Posts: 80, Credit: 19,575,419, RAC: 53)
http://setiathome.berkeley.edu/workunit.php?wuid=437881143

It's a Demon WU! No one's been able to get more than 9 seconds on it.

WU details from my machine:

Name: ap_14mr09ac_B3_P1_00375_20090429_24593.wu_1
Workunit: 437881143
Created: 29 Apr 2009 22:43:10 UTC
Sent: 29 Apr 2009 22:43:14 UTC
Received: 10 May 2009 6:14:39 UTC
Server state: Over
Outcome: Client error
Client state: Compute error
Exit status: -1 (0xffffffffffffffff)
Computer ID: 4217559
Report deadline: 29 May 2009 22:43:14 UTC
CPU time: 5.28125

stderr out:

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
- exit code -1 (0xffffffff)
</message>
<stderr_txt>
In ap_gfx_main.cpp: in ap_graphics_init(): Starting client.
AstroPulse v. 5.03 Non-graphics FFTW USE_CONVERSION_OPT USE_SSE3 Windows x86 rev 112, Don't Panic!, by Raistmer with support of Lunatics.kwsn.net team.
SSE3 static fftw lib, built by Jason G.
ffa threshold mod, by Joe Segur.
SSE3 dechirping by JDWhale
CPUID: AMD Athlon(tm) 64 X2 Dual Core Processor 4600+
Cache: L1=64K L2=512K
Features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3
Error in ap_remove_radar.cpp: generate_envelope: num_ffts_performed < 100. Blanking too much RFI?
</stderr_txt>
]]>

Validate state: Invalid
Claimed credit: 0.0183885662974026
Granted credit: 0.0183885662974026

I thought it was my computer giving me issues, till I looked at the others....

Sincerely,
Daysteppr
Richard Haselgrove (Joined: 4 Jul 99, Posts: 14679, Credit: 200,643,578, RAC: 874)
[quote]http://setiathome.berkeley.edu/workunit.php?wuid=437881143[/quote]

It's one of a whole block of demon WUs, which we identified over a week ago - see message 890237 further down this thread. It looks as if every B3_P1 WU between 12mr09 and 07ap09 suffers from the same problem.

And the AP assimilator has been down for over 24 hours, which has backed up the work generation process (filled the available storage space, I guess) - no new AP work has been available for download since about 23:00 UTC last night. So I would expect that almost all of the AP work people receive this morning will be resends of these demon WUs: you might consider opting out of AP temporarily if you have a bandwidth cap or other limited internet connection, because an 8MB download every 9 seconds isn't really worth it.
Cosmic_Ocean (Joined: 23 Dec 00, Posts: 3027, Credit: 13,516,867, RAC: 13)
Yeah, I've gone through about 10 of those across my 4 AP crunchers so far. Looks like I have 2-3 more to go.

Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up)
Cosmic_Ocean (Joined: 23 Dec 00, Posts: 3027, Credit: 13,516,867, RAC: 13)
Well, that was fun. I just burned through about 15 of those B3_P1 WUs on two systems. I was doing some "housekeeping" with my Excel spreadsheet of all the r112 tasks I've done and noticed a few B3_P1s, and saw that I was _4 or _5 on them, so I suspended all the other tasks and left only my B3_P1s to run. After 3 seconds they errored out; I requested 600,000 seconds of work, got four more APs, 3 of which were B3_P1s, and repeated the process until I got rid of them all. That's not being greedy, that's being generous: I'm doing my part to get rid of those WUs so we can free up WU storage and get some new tasks going.

Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up)
Speedy (Joined: 26 Jun 04, Posts: 1643, Credit: 12,921,799, RAC: 89)
Thank you for helping clear these B3_P1 tasks. On the bright side, it looks as if all of the tasks on your host 2889590 are resends, because the Server Status page shows all of the tapes with this erroring data as (Done). Is there a possibility of getting more data like this in the near future? Thanks in advance.
Cosmic_Ocean (Joined: 23 Dec 00, Posts: 3027, Credit: 13,516,867, RAC: 13)
I think Matt said that we are done splitting the bad tapes, but we now have thousands of 8MB WUs that will just have to run through the system and collect 6 errors each. Other than that, I don't have a clue when the next tape with this "stuck bit" will surface, but as Ozz pointed out, this was noticed over in Beta this time last year, so it may be a yearly thing, or just a total coincidence.

Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up)
Speedy (Joined: 26 Jun 04, Posts: 1643, Credit: 12,921,799, RAC: 89)
Thanks for the info. Is there any reason why the tapes with erroring data on them can't be formatted and sent straight back to get fresh data, or is this data bundled with good data? Does anyone know which tapes this data is on, so we can watch them move through the system? Thanks in advance.

P.S. If I receive any of these units, I'll let them run as soon as I notice them in my job queue.
Cosmic_Ocean (Joined: 23 Dec 00, Posts: 3027, Credit: 13,516,867, RAC: 13)
I believe we have narrowed it down to somewhere in the 20-28feb09 tapes, and it's only the B3_P1 channel of those tapes; but I have one that is 22feb09ac and B3_P1, and it's fine. It has not been determined whether the problem is in the data recorder or in the splitters, but it seems more likely to be the data recorder.

As far as getting new data onto those tapes goes, the hard drives get sent in batches and then filled up. The drives are either 500 GB or 750 GB, and get "split" up into ~50 GB files (a more manageable size for both in-house and off-site data storage).

Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up)
Speedy (Joined: 26 Jun 04, Posts: 1643, Credit: 12,921,799, RAC: 89)
[quote]I believe we have narrowed it down to somewhere in the 20-28feb09 tapes, and it's only the B3_P1 channel of those tapes; but I have one that is 22feb09ac and B3_P1, and it's fine. It has not been determined whether the problem is in the data recorder or in the splitters, but it seems more likely to be the data recorder.[/quote]

Thanks, this is interesting information.
Richard Haselgrove (Joined: 4 Jul 99, Posts: 14679, Credit: 200,643,578, RAC: 874)
Actually, the current best estimate for the problematic B3_P1 recording range is 12mr09 through 07ap09 - Josef Segur has been keeping track of them at Lunatics. If anyone has any hard evidence for an example in February, I'm sure Joe would be most interested - but we need a link to a result, please.
Cosmic_Ocean (Joined: 23 Dec 00, Posts: 3027, Credit: 13,516,867, RAC: 13)
I suppose I mistyped. I think I meant 22mr09ac instead of 22fe09ac. I might have missed making the PDF of the task ID for that particular task, though. So far what I've got are:

ap_02ap09ac_B3_P1_00013_20090505_16821.wu_2
ap_03ap09aa_B3_P1_00002_20090505_10103.wu_1
ap_05ap09aa_B3_P1_00390_20090506_23779.wu_0
ap_05ap09aa_B3_P1_00390_20090506_23779.wu_0
ap_06ap09ab_B3_P1_00108_20090507_32618.wu_5
ap_06ap09ac_B3_P1_00004_20090507_07720.wu_5
ap_06ap09ac_B3_P1_00067_20090507_07720.wu_4
ap_06ap09ac_B3_P1_00067_20090507_07720.wu_4
ap_06ap09ac_B3_P1_00292_20090507_07720.wu_5
ap_06ap09ae_B3_P1_00000_20090508_19810.wu_0
ap_14mr09aa_B3_P1_00399_20090429_18241.wu_6
ap_14mr09ae_B3_P1_00154_20090502_29770.wu_3
ap_14mr09ae_B3_P1_00154_20090502_29770.wu_3
ap_20mr09aa_B3_P1_00091_20090430_24201.wu_5
ap_20mr09aa_B3_P1_00091_20090430_24201.wu_5
ap_20mr09ad_B3_P1_00361_20090501_16027.wu_6
ap_21mr09ab_B3_P1_00067_20090501_27055.wu_2
ap_21mr09ab_B3_P1_00096_20090501_27055.wu_4
ap_21mr09ac_B3_P1_00198_20090509_02086.wu_4
ap_21mr09ac_B3_P1_00388_20090510_02086.wu_2
ap_30mr09ab_B3_P1_00347_20090512_28745.wu_4
ap_30mr09ab_B3_P1_00347_20090512_28745.wu_4

If more show up, I'll post those. So far, it looks like all of mine are within Josef's observations.

Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up)
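For anyone else keeping a spreadsheet like this, here is a minimal sketch that parses AP task names of the form shown above and flags the ones that are B3_P1 with a tape date inside the 12mr09 through 07ap09 window identified earlier in the thread. The month-code table only covers the codes that appear in this thread, and the second example name is made up for illustration.

```python
# Sketch: flag AP task names like ap_14mr09ac_B3_P1_00375_20090429_24593.wu_1
# that are B3_P1 with a tape date in the suspect 12mr09 - 07ap09 range.
from datetime import date
from typing import Optional

MONTHS = {"fe": 2, "mr": 3, "ap": 4}   # partial map (only codes seen in this thread)

def tape_date(task_name: str) -> Optional[date]:
    parts = task_name.split("_")
    if len(parts) < 4 or parts[0] != "ap":
        return None
    tape = parts[1]                     # e.g. "14mr09ac"
    mon = MONTHS.get(tape[2:4])
    if mon is None:
        return None
    return date(2000 + int(tape[4:6]), mon, int(tape[0:2]))

def is_suspect(task_name: str) -> bool:
    parts = task_name.split("_")
    d = tape_date(task_name)
    return (d is not None and parts[2] == "B3" and parts[3] == "P1"
            and date(2009, 3, 12) <= d <= date(2009, 4, 7))

print(is_suspect("ap_14mr09ac_B3_P1_00375_20090429_24593.wu_1"))  # True
print(is_suspect("ap_22fe09ac_B3_P1_00123_20090301_11111.wu_0"))  # False (February, hypothetical name)
```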
perryjay (Joined: 20 Aug 02, Posts: 3377, Credit: 20,676,751, RAC: 0)
Well, I thought I had found one: http://setiathome.berkeley.edu/workunit.php?wuid=443174608, dated 22fe09. So I started it and let it run for about 10 minutes, but it didn't fail. I then looked a little closer at the other wingmen that had errored out and saw that all four of them had different problems. If it messes up on me when I get back to it, I will report it here.

PROUD MEMBER OF Team Starfire World BOINC
perryjay (Joined: 20 Aug 02, Posts: 3377, Credit: 20,676,751, RAC: 0)
Ok, running it now. 23.7% in and no problems. Another wingman has already completed it successfully and is waiting on me.

PROUD MEMBER OF Team Starfire World BOINC
Cosmic_Ocean (Joined: 23 Dec 00, Posts: 3027, Credit: 13,516,867, RAC: 13)
[quote]Ok, running it now. 23.7% in and no problems. Another wingman has already completed it successfully and is waiting on me.[/quote]

Yeah, that task dated 22fe09 was a full month before the problem showed up. That one should be fine. I figured out why I thought I had an extra one of those B3_P1 tasks. I had one that was B3_P1 and 22fe09ac and looked at the date on it too quickly. Somewhere in my mind that was the same as 22mr09ac.

Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up)