Astropulse Beam 3 Polarity 1 Errors


Josef W. Segur
Project donor
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4348
Credit: 1,126,925
RAC: 861
United States
Message 890237 - Posted: 1 May 2009, 17:40:49 UTC

Richard Haselgrove noticed that the "generate_envelope: num_ffts_performed < 100." error noted earlier in this thread seems to be happening only on one channel of the data: B3_P1. That reminded me of the Curiously Compressible Data seen at Beta last year. In fact it seems to be the same problem recurring a year later: one bit of the two in each sample is stuck. That triggers the noise detection code, all data is blanked, and there's nothing left to generate the envelope from. I've sent an email to Josh and Eric, though there's probably not much they can do. The other 13 channels aren't affected, so skipping the 'tapes' completely would be overreacting.

Joe
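
As a rough illustration of what "one bit of the two in each sample is stuck" means in practice, here is a minimal Python sketch. It assumes, purely for illustration, that four 2-bit samples are packed into each byte; the real recorder format may differ, and the function names are made up for this example.

```python
import random


def bit_set_fractions(data: bytes) -> tuple[float, float]:
    """Return the fraction of samples with the low bit set and with the high bit set.

    Assumption (hypothetical layout): four 2-bit samples packed into each byte.
    """
    if not data:
        raise ValueError("no data to inspect")
    low = high = 0
    for byte in data:
        for shift in (0, 2, 4, 6):
            sample = (byte >> shift) & 0b11
            low += sample & 1
            high += sample >> 1
    total = 4 * len(data)
    return low / total, high / total


def stuck_bits(data: bytes) -> list[str]:
    """Report any bit position that never changes across all samples."""
    low, high = bit_set_fractions(data)
    report = []
    for name, frac in (("low", low), ("high", high)):
        if frac == 0.0:
            report.append(f"{name} bit stuck at 0")
        elif frac == 1.0:
            report.append(f"{name} bit stuck at 1")
    return report


rng = random.Random(0)
healthy = bytes(rng.getrandbits(8) for _ in range(4096))
# Force the high bit of every 2-bit sample to 1 to imitate a stuck line.
faulty = bytes(b | 0b10101010 for b in healthy)

print(stuck_bits(healthy))
print(stuck_bits(faulty))
```

With a bit stuck, each sample can only take two of its four possible values, which is consistent with the noise detection treating the whole channel as bad.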

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8824
Credit: 53,579,388
RAC: 47,197
United Kingdom
Message 890246 - Posted: 1 May 2009, 18:42:12 UTC - in response to Message 890237.

I've checked my captured files, and yes, they are super-compressible.

I noted in my post at Lunatics that my WUs came from different days' recordings from the first one mentioned here (12 and 13 March): I've got some from B3_P1 recorded on 19 March and they're super-compressible too.

So although it's the same B/P as the Beta event, this one seems to be lasting much longer.
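
A quick way to check a captured file for this kind of "super-compressibility" is to compare its compressed size with its original size. The sketch below uses Python's standard zlib module on synthetic data rather than a real workunit; the 0.95 and 0.75 figures in the comments are only what this synthetic example should clear, not thresholds from the project.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means more compressible)."""
    if not data:
        raise ValueError("no data to compress")
    return len(zlib.compress(data, 9)) / len(data)


rng = random.Random(1)
# Stand-in for normal noisy telescope data: random bytes barely compress.
noisy = bytes(rng.getrandbits(8) for _ in range(65536))
# Imitate one stuck bit per 2-bit sample: only the low bit of each pair still varies.
stuck = bytes(b & 0b01010101 for b in noisy)

print(f"noisy: {compression_ratio(noisy):.2f}")   # expected to stay above 0.95
print(f"stuck: {compression_ratio(stuck):.2f}")   # expected to fall below 0.75
```

For a real check, the same function could be applied to the raw bytes of a downloaded workunit file and compared against a file from an unaffected channel.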

Josef W. Segur
Project donor
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4348
Credit: 1,126,925
RAC: 861
United States
Message 890313 - Posted: 1 May 2009, 22:49:27 UTC - in response to Message 890246.

I've checked my captured files, and yes, they are super-compressible.

I noted in my post at Lunatics that my WUs came from different days' recordings from the first one mentioned here (12 and 13 March): I've got some from B3_P1 recorded on 19 March and they're super-compressible too.

So although it's the same B/P as the Beta event, this one seems to be lasting much longer.

The initial cases on Beta were seen in 27fe08ac data, and Josh's spot check indicated it lasted at least until March 5 2008. So far the durations seem about the same.

We'll get a better look at how long it lasts this time, since there's data with 20, 21, and 22mr tags. Then it skips to 30mr; Arecibo was doing planetary radar in between.

I did get a second one (from 14mr09aa), but I don't process nearly as much data as you so probably won't see any more for this episode.
Joe

Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 711
Credit: 6,045,611
RAC: 2,750
New Zealand
Message 890392 - Posted: 2 May 2009, 1:49:50 UTC
Last modified: 2 May 2009, 1:53:20 UTC

Three people have crashed Workunit 438261330 so far.
In ap_gfx_main.cpp: in ap_graphics_init(): Starting client.
AstroPulse v. 5.03
Non-graphics FFTW USE_CONVERSION_OPT USE_SSE3
Windows x86 rev 112, Don't Panic!, by Raistmer with support of Lunatics.kwsn.net team. SSE3
static fftw lib, built by Jason G.
ffa threshold mod, by Joe Segur.
SSE3 dechirping by JDWhale
CPUID: Intel(R) Core(TM)2 Quad CPU Q6700 @ 2.66GHz

Cache: L1=64K L2=4096K
Features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3
Error in ap_remove_radar.cpp: generate_envelope: num_ffts_performed < 100. Blanking too much RFI?

Looking at the stderr out text, it looks like it's an application issue, but I'm not 100% sure. This is the only task I've come across with this issue. Thanks in advance.
____________

Live in NZ y not join Smile City?

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8824
Credit: 53,579,388
RAC: 47,197
United Kingdom
Message 890488 - Posted: 2 May 2009, 10:06:50 UTC - in response to Message 890392.

Three people have crashed Workunit 438261330 so far.
In ap_gfx_main.cpp: in ap_graphics_init(): Starting client.
AstroPulse v. 5.03
Non-graphics FFTW USE_CONVERSION_OPT USE_SSE3
Windows x86 rev 112, Don't Panic!, by Raistmer with support of Lunatics.kwsn.net team. SSE3
static fftw lib, built by Jason G.
ffa threshold mod, by Joe Segur.
SSE3 dechirping by JDWhale
CPUID: Intel(R) Core(TM)2 Quad CPU Q6700 @ 2.66GHz

Cache: L1=64K L2=4096K
Features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3
Error in ap_remove_radar.cpp: generate_envelope: num_ffts_performed < 100. Blanking too much RFI?

Looking at the stderr out text, it looks like it's an application issue, but I'm not 100% sure. This is the only task I've come across with this issue. Thanks in advance.

As Josef has identified, it's a stuck bit in the B3_P1 data channel - so a data problem, NOT an application problem.

Quite where the bit is stuck awaits further investigation. Joe thinks it's seasonal (weather affecting the recorder at the observing site?), which would mess up one channel of the MB observations too (same data, different splitter). On the other hand, it might be an AP splitter issue - though I don't see why that would only affect one channel - but if so, MB should be undamaged.

Josef W. Segur
Project donor
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4348
Credit: 1,126,925
RAC: 861
United States
Message 890629 - Posted: 2 May 2009, 18:39:19 UTC - in response to Message 890488.


As Josef has identified, it's a stuck bit in the B3_P1 data channel - so a data problem, NOT an application problem.

Quite where the bit is stuck awaits further investigation. Joe thinks it's seasonal (weather affecting the recorder at the observing site?), which would mess up one channel of the MB observations too (same data, different splitter). On the other hand, it might be an AP splitter issue - though I don't see why that would only affect one channel - but if so, MB should be undamaged.

The seasonality could as well be an annual cleanup campaign at Arecibo because someone from the NSF does an inspection each year. But I don't know what the air conditioning arrangements are where the multibeam recorder is either. The one picture I've seen is just a rack sitting against a wall.

I'd give very high odds that the problem is in the recorder rack, most likely the cabling from the output of the quadrature detector for that channel. It might be on the detector card itself or the digital input card in the computer. Before the detector there's just a single channel of analog IF; after the digital input there's the usual computer data handling, and any stuck bit problems there would be affecting far more than we're seeing. But the recorder was designed and put together by Berkeley SSL, and they know the details better than I.

When an mb_splitter does some of that data, the obvious pattern will be lost in the FFTs to produce the 256 subbands. I think the effect will simply be a bunch of quiet WUs with less noise than usual. WU names will be like NNmr09ax.xxxx.xxxxx.10.x.xxx since beam 3 polarity 1 corresponds to s4_id 10.
Joe
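
For anyone wanting to pick the affected multibeam WUs out of a task list, the name pattern above can be checked with a few lines of Python. This is only a sketch: it assumes the s4_id is always the fourth dot-separated field, as in the NNmr09ax.xxxx.xxxxx.10.x.xxx pattern, and the sample names below are invented for illustration, not real workunits.

```python
B3_P1_S4_ID = 10  # beam 3 polarity 1, per the post above


def mb_s4_id(wu_name: str) -> int:
    """Extract the s4_id from a multibeam WU name (assumed to be the 4th dot-separated field)."""
    fields = wu_name.split(".")
    if len(fields) < 4 or not fields[3].isdigit():
        raise ValueError(f"unexpected WU name format: {wu_name!r}")
    return int(fields[3])


def is_b3_p1_multibeam(wu_name: str) -> bool:
    """True if the multibeam WU name carries the s4_id for beam 3 polarity 1."""
    return mb_s4_id(wu_name) == B3_P1_S4_ID


# Hypothetical names, made up to match the pattern in the post.
print(is_b3_p1_multibeam("14mr09ac.1234.56789.10.8.123"))
print(is_b3_p1_multibeam("14mr09ac.1234.56789.9.8.123"))
```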

daysteppr
Joined: 22 Mar 05
Posts: 69
Credit: 4,321,958
RAC: 6,045
United States
Message 893302 - Posted: 10 May 2009, 8:49:47 UTC

http://setiathome.berkeley.edu/workunit.php?wuid=437881143
It's a Demon WU! No one's been able to get more than 9 seconds on it.

WU details from my machine:

Name ap_14mr09ac_B3_P1_00375_20090429_24593.wu_1
Workunit 437881143
Created 29 Apr 2009 22:43:10 UTC
Sent 29 Apr 2009 22:43:14 UTC
Received 10 May 2009 6:14:39 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status -1 (0xffffffffffffffff)
Computer ID 4217559
Report deadline 29 May 2009 22:43:14 UTC
CPU time 5.28125
stderr out <core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
- exit code -1 (0xffffffff)
</message>
<stderr_txt>
In ap_gfx_main.cpp: in ap_graphics_init(): Starting client.
AstroPulse v. 5.03
Non-graphics FFTW USE_CONVERSION_OPT USE_SSE3
Windows x86 rev 112, Don't Panic!, by Raistmer with support of Lunatics.kwsn.net team. SSE3
static fftw lib, built by Jason G.
ffa threshold mod, by Joe Segur.
SSE3 dechirping by JDWhale
CPUID: AMD Athlon(tm) 64 X2 Dual Core Processor 4600+

Cache: L1=64K L2=512K
Features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3
Error in ap_remove_radar.cpp: generate_envelope: num_ffts_performed < 100. Blanking too much RFI?

</stderr_txt>
]]>

Validate state Invalid
Claimed credit 0.0183885662974026
Granted credit 0.0183885662974026


I thought it was my computer giving me issues, till I looked at the others....

Sincerely,
Daysteppr
____________

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8824
Credit: 53,579,388
RAC: 47,197
United Kingdom
Message 893304 - Posted: 10 May 2009, 9:38:16 UTC - in response to Message 893302.

http://setiathome.berkeley.edu/workunit.php?wuid=437881143
It's a Demon WU! No one's been able to get more than 9 seconds on it.

Name ap_14mr09ac_B3_P1_00375_20090429_24593.wu_1

It's one of a whole block of demon WUs, which we identified over a week ago - see message 890237 further down this thread.

It looks as if every B3_P1 wu between 12mr09 and 07ap09 suffers from the same problem.

And the AP assimilator has been down for over 24 hours, which has backed up the work generation process (filled the available storage space, I guess) - no new AP work has been available for download since about 23:00 UTC last night. So I would expect that almost all of any AP work that people receive this morning will be resends of these demon WUs: you might consider opting out of AP temporarily if you have a bandwidth cap or other limited internet connection, because an 8MB download every 9 seconds isn't really worth it.
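
To put the "8MB download every 9 seconds" figure in perspective, here is the arithmetic as a small Python sketch. It assumes the worst case, that a host keeps receiving one resend after another with no pause between them; in practice the client and scheduler would probably space requests out, so treat the result as an upper bound.

```python
def download_rate(wu_megabytes: float = 8.0, seconds_per_task: float = 9.0) -> tuple[float, float, float]:
    """Return (MB per second, Mbit per second, MB per hour) for back-to-back failing tasks."""
    mb_per_second = wu_megabytes / seconds_per_task
    mbit_per_second = mb_per_second * 8
    mb_per_hour = mb_per_second * 3600
    return mb_per_second, mbit_per_second, mb_per_hour


mb_s, mbit_s, mb_h = download_rate()
print(f"{mb_s:.2f} MB/s, {mbit_s:.1f} Mbit/s, {mb_h:.0f} MB per hour")
```

That works out to roughly 0.9 MB/s, or about 3.2 GB per hour of downloads that produce no useful result.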

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2357
Credit: 8,951,347
RAC: 3,889
United States
Message 893557 - Posted: 11 May 2009, 1:28:01 UTC

Yeah, I've gone through about 10 of those across my 4 AP crunchers so far. Looks like I have 2-3 more to go.
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2357
Credit: 8,951,347
RAC: 3,889
United States
Message 894493 - Posted: 14 May 2009, 4:17:22 UTC

Well that was fun. I just burned through about 15 of those B3_P1 WUs on two systems. Was doing some "housekeeping" with my Excel spreadsheet of all the r112 tasks I've done, and noticed a few B3_P1's, but noticed that I was _4 or _5 on them, so I suspended all the other tasks and only left all of my B3_P1's to run. After 3 seconds, they errored out and I requested 600,000 seconds of work, got four more APs, 3 of which were B3_P1's, so I repeated the process until I got rid of them all.

That's not being greedy, that's being generous. I'm doing my part to get rid of those WUs so we can have more WU storage and get some new tasks going.
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 711
Credit: 6,045,611
RAC: 2,750
New Zealand
Message 894505 - Posted: 14 May 2009, 5:18:04 UTC - in response to Message 894493.


That's not being greedy, that's being generous. I'm doing my part to get rid of those WUs so we can have more WU storage and get some new tasks going.

Thank you for helping clear these B3_P1 tasks. On the bright side, it looks as if all of the ones on your host 2889590 are resends, because the Server Status shows (Done) for all of the tapes with this erroring data. Is there a possibility of getting more data like this in the near future? Thanks in advance.
____________

Live in NZ y not join Smile City?

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2357
Credit: 8,951,347
RAC: 3,889
United States
Message 894510 - Posted: 14 May 2009, 5:38:08 UTC - in response to Message 894505.


That's not being greedy, that's being generous. I'm doing my part to get rid of those WUs so we can have more WU storage and get some new tasks going.

Thank you for helping clear these B3_P1 tasks. On the bright side, it looks as if all of the ones on your host 2889590 are resends, because the Server Status shows (Done) for all of the tapes with this erroring data. Is there a possibility of getting more data like this in the near future? Thanks in advance.

I think Matt said that we are done splitting the bad tapes, but now have thousands of 8MB WUs that will just have to run through the system and get 6 errors.

Other than that, I don't have a clue when the next tape with this "stuck bit" will surface again, but as Ozz pointed out, this was noticed over in beta this time last year, so it's possible that it may be a yearly thing, or just a total coincidence.
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 711
Credit: 6,045,611
RAC: 2,750
New Zealand
Message 894513 - Posted: 14 May 2009, 6:18:31 UTC - in response to Message 894510.


I think Matt said that we are done splitting the bad tapes, but now have thousands of 8MB WUs that will just have to run through the system and get 6 errors.

Thanks for the info. Is there any reason why tapes with erroring data on them can't be formatted & sent straight back to get fresh data, or is this data bundled with good data? Does anyone know what tapes this data is on, so we can watch the tapes move through the system? Thanks in advance.
P.S. If I receive any of these units I'll let them run as soon as I notice them in my job queue.
____________

Live in NZ y not join Smile City?

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2357
Credit: 8,951,347
RAC: 3,889
United States
Message 894523 - Posted: 14 May 2009, 7:16:36 UTC - in response to Message 894513.
Last modified: 14 May 2009, 7:19:25 UTC


I think Matt said that we are done splitting the bad tapes, but now have thousands of 8MB WUs that will just have to run through the system and get 6 errors.

Thanks for the info. Is there any reason why tapes with erroring data on them can't be formatted & sent straight back to get fresh data, or is this data bundled with good data? Does anyone know what tapes this data is on, so we can watch the tapes move through the system? Thanks in advance.
P.S. If I receive any of these units I'll let them run as soon as I notice them in my job queue.

I believe we have narrowed it down to somewhere in the 20-28feb09 tapes, and it's only the B3_P1 channel of those tapes, but, I have one that is 22feb09ac and B3_P1, and is fine. It has not been determined if it is in the data recorder or in the splitters, but seems more likely to be the data recorder.

As far as getting new data on those tapes, the hard drives get sent in batches and then filled up. The drives are either 500 GB or 750 GB, and get "split" up into ~50 GB files (it's a more manageable size for both in-house and off-site data storage).
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 711
Credit: 6,045,611
RAC: 2,750
New Zealand
Message 894527 - Posted: 14 May 2009, 7:47:15 UTC - in response to Message 894523.

[quote]
I believe we have narrowed it down to somewhere in the 20-28feb09 tapes, and it's only the B3_P1 channel of those tapes, but, I have one that is 22feb09ac and B3_P1, and is fine. It has not been determined if it is in the data recorder or in the splitters, but seems more likely to be the data recorder.
[/quote]

Thanks, this is interesting information.
____________

Live in NZ y not join Smile City?

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8824
Credit: 53,579,388
RAC: 47,197
United Kingdom
Message 894530 - Posted: 14 May 2009, 8:00:37 UTC - in response to Message 894527.


I believe we have narrowed it down to somewhere in the 20-28feb09 tapes, and it's only the B3_P1 channel of those tapes, but, I have one that is 22feb09ac and B3_P1, and is fine. It has not been determined if it is in the data recorder or in the splitters, but seems more likely to be the data recorder.

Thanks, this is interesting information.

Actually, the current best estimate for the problematic B3_P1 recording range is 12mr09 through 07ap09 - Josef Segur has been keeping track of them at Lunatics.

If anyone has any hard evidence for an example in February, I'm sure Joe would be most interested - but we need a link to a result, please.

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2357
Credit: 8,951,347
RAC: 3,889
United States
Message 894883 - Posted: 15 May 2009, 7:20:44 UTC

I suppose I mis-typed. I think I meant 22mr09ac instead of 22fe09ac. I might have missed making the PDF of the task ID for that particular task though.

So far what I've got are:

ap_02ap09ac_B3_P1_00013_20090505_16821.wu_2
ap_03ap09aa_B3_P1_00002_20090505_10103.wu_1
ap_05ap09aa_B3_P1_00390_20090506_23779.wu_0
ap_05ap09aa_B3_P1_00390_20090506_23779.wu_0
ap_06ap09ab_B3_P1_00108_20090507_32618.wu_5
ap_06ap09ac_B3_P1_00004_20090507_07720.wu_5
ap_06ap09ac_B3_P1_00067_20090507_07720.wu_4
ap_06ap09ac_B3_P1_00067_20090507_07720.wu_4
ap_06ap09ac_B3_P1_00292_20090507_07720.wu_5
ap_06ap09ae_B3_P1_00000_20090508_19810.wu_0
ap_14mr09aa_B3_P1_00399_20090429_18241.wu_6
ap_14mr09ae_B3_P1_00154_20090502_29770.wu_3
ap_14mr09ae_B3_P1_00154_20090502_29770.wu_3
ap_20mr09aa_B3_P1_00091_20090430_24201.wu_5
ap_20mr09aa_B3_P1_00091_20090430_24201.wu_5
ap_20mr09ad_B3_P1_00361_20090501_16027.wu_6
ap_21mr09ab_B3_P1_00067_20090501_27055.wu_2
ap_21mr09ab_B3_P1_00096_20090501_27055.wu_4
ap_21mr09ac_B3_P1_00198_20090509_02086.wu_4
ap_21mr09ac_B3_P1_00388_20090510_02086.wu_2
ap_30mr09ab_B3_P1_00347_20090512_28745.wu_4
ap_30mr09ab_B3_P1_00347_20090512_28745.wu_4

If more show up, I'll post those. So far, it looks like all of mine are within Josef's observations.
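
Checking a list like this against the reported range (12mr09 through 07ap09) can be automated. The Python sketch below parses AstroPulse WU names of the form seen in this thread; the regular expression and the month-code table are assumptions inferred from those names, and only the codes that actually appear here (fe, mr, ap) are mapped.

```python
import re
from datetime import date

# Month codes seen in this thread only; other codes are deliberately left out.
MONTHS = {"fe": 2, "mr": 3, "ap": 4}
# Assumed layout: ap_<DD><mon><YY><xx>_B<beam>_P<polarity>_...
NAME_RE = re.compile(r"^ap_(\d{2})([a-z]{2})(\d{2})[a-z]{2}_B(\d)_P(\d)_")

RANGE_START = date(2009, 3, 12)  # 12mr09
RANGE_END = date(2009, 4, 7)     # 07ap09


def parse_ap_name(name: str) -> tuple[date, int, int]:
    """Return (recording date, beam, polarity) parsed from an AstroPulse WU name."""
    match = NAME_RE.match(name)
    if match is None or match.group(2) not in MONTHS:
        raise ValueError(f"unrecognised WU name: {name!r}")
    day, month_code, year, beam, polarity = match.groups()
    recorded = date(2000 + int(year), MONTHS[month_code], int(day))
    return recorded, int(beam), int(polarity)


def in_reported_bad_range(name: str) -> bool:
    """True for B3_P1 names recorded within the range reported in this thread."""
    recorded, beam, polarity = parse_ap_name(name)
    return (beam, polarity) == (3, 1) and RANGE_START <= recorded <= RANGE_END


names = [
    "ap_14mr09aa_B3_P1_00399_20090429_18241.wu_6",
    "ap_30mr09ab_B3_P1_00347_20090512_28745.wu_4",
    "ap_06ap09ae_B3_P1_00000_20090508_19810.wu_0",
]
for name in names:
    print(name, in_reported_bad_range(name))
```

A name tagged 22fe09 on the same channel would come back False, which matches the later observation that the February task ran fine.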
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 16,389,227
RAC: 9,466
United States
Message 894940 - Posted: 15 May 2009, 14:29:15 UTC

Well, I thought I had found this one http://setiathome.berkeley.edu/workunit.php?wuid=443174608 dated 22fe09 so I started it and let it run about 10 minutes but it didn't fail. I then looked a little closer at the other wingmen that had errored out and saw all four of them had different problems. If it messes up on me when I get back to it I will report it here then.
____________


PROUD MEMBER OF Team Starfire World BOINC

perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 16,389,227
RAC: 9,466
United States
Message 895944 - Posted: 17 May 2009, 15:56:06 UTC - in response to Message 894940.

Ok, running it now. 23.7% in and no problems. Another wingman has already completed it successfully and is waiting on me.
____________


PROUD MEMBER OF Team Starfire World BOINC

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2357
Credit: 8,951,347
RAC: 3,889
United States
Message 896021 - Posted: 17 May 2009, 18:35:21 UTC - in response to Message 895944.

Ok, running it now. 23.7% in and no problems. Another wingman has already completed it successfully and is waiting on me.

Yeah, that task dated 22fe09 was a full month before the problem showed up. That one should be fine.

I figured out why I thought I had an extra one of those B3_P1 tasks. I had one that was B3_P1 and 22fe09ac and looked at the date on it too quickly. Somewhere in my mind that was the same as 22mr09ac.
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact


Copyright © 2014 University of California