Take Two (Sep 24 2009)



Message boards : Technical News : Take Two (Sep 24 2009)

Author Message
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 935692 - Posted: 24 Sep 2009, 19:29:14 UTC

Hey gang. Sorry to say the first software radar blanker tests were kind of a bust - apparently some radar still leaked through. But we have strong theories as to why, and the fixes are trivial. I'll probably start another test this afternoon (a long process to reanalyze/reblank/resplit the whole test file - may be a day or two before workunits go out again).

To answer one question: these tests are happening in public. As far as crunchers are concerned this is all data driven, so none of the plumbing that usually requires more rigorous testing has changed, which obviates the need for beta. And since there are far more flops in the public project, I got enough results returned right away for a first diagnosis. I imagine if I had done this in beta it would have taken about a month (literally) before I realized there was a problem.

To sort of answer another question: the software blanker actually finds two kinds of radar - FAA and Aerostat - the latter of which hits us less frequently but is equally bad when it's there. The hardware blanker only locks onto FAA radar, and, as we're finding, it misses some echoes, goes out of phase occasionally, or just isn't present in the data at all. Once we trust the software blanker, we'll probably just stick with that.
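The core idea behind a software blanker can be sketched roughly like this: if you know (or can detect) a radar's pulse repetition interval, you zero out the samples that fall inside each pulse. This is only an illustrative sketch - the function and all parameter values are made up, not the project's actual blanking code or the real radar characteristics.

```python
import numpy as np

def blank_radar(samples, sample_rate, pulse_period, pulse_width, first_pulse):
    """Zero out samples that fall inside periodic radar pulses.

    All times are in seconds. The parameters are illustrative only -
    not the real FAA or Aerostat radar characteristics.
    """
    t = np.arange(len(samples)) / sample_rate        # timestamp of each sample
    phase = (t - first_pulse) % pulse_period         # time since the last pulse began
    mask = phase < pulse_width                       # True for samples inside a pulse
    blanked = samples.copy()
    blanked[mask] = 0.0                              # "blank" (zero) those samples
    return blanked, int(mask.sum())
```

A real blanker also has to detect the pulse timing in the first place and track any phase drift, which is presumably where the hard part lies.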

On the upload front: sorry I've been ignoring this problem for a while, if only because I really see no obvious signs of a problem outside of complaints here on the forums. Traffic graphs look stable, the upload server shows no errors/drops, the result directories are continually updated with good-looking result files, and the database queues are normal/stable. Also, Eric has been tweaking this himself, so I didn't want to step on his work. Nevertheless, I just took his load-balancing fixes out of the way on the upload server and put my own fix in - one that sends every 4th result upload request to the scheduling server (which has the headroom to handle it, I think). We'll see if that improves matters. I wonder if this problem is ISP-specific or something like that...
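For what it's worth, the "every 4th request" routing described here amounts to a simple modulo counter. A minimal sketch (the server names are placeholders, not the project's actual host names):

```python
class UploadBalancer:
    """Route every Nth upload request to a spare server.

    A sketch of the idea described above; "upload_server" and
    "sched_server" are placeholder names, not real hosts.
    """

    def __init__(self, main="upload_server", spare="sched_server", every=4):
        self.main = main
        self.spare = spare
        self.every = every
        self._count = 0

    def route(self):
        """Return the server that should handle the next upload request."""
        self._count += 1
        # Every `every`-th request is diverted to the spare (scheduling) server.
        if self._count % self.every == 0:
            return self.spare
        return self.main
```

In practice a split like this usually lives in the web server or load-balancer configuration rather than in application code, but the arithmetic is the same.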

I'll slowly start up the processes that hit the science database - the science status page generator, the NTPCkrs, etc. We'll see if Bob's recent database optimizations have helped.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Aurora Borealis
Volunteer tester
Joined: 14 Jan 01
Posts: 2975
Credit: 5,030,152
RAC: 1,337
Canada
Message 935701 - Posted: 24 Sep 2009, 19:54:35 UTC

Thanks for the update. Too bad about the radar blanker test fizzling out on the first tryout, but then, when has any software ever worked as expected when put into full-blown use? We have complete confidence that you'll get the kinks out.
____________
Questions? Answers are in the "Unofficial" BOINC Wiki.

Boinc V7.0.27
Win7 i5 3.33G 4GB, GTX470

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8499
Credit: 49,933,448
RAC: 51,087
United Kingdom
Message 935718 - Posted: 24 Sep 2009, 21:19:26 UTC - in response to Message 935692.
Last modified: 24 Sep 2009, 21:31:12 UTC

On the upload front: sorry I've been ignoring this problem for a while, if only because I really see no obvious signs of a problem outside of complaints here on the forums. Traffic graphs look stable, the upload server shows no errors/drops, the result directories are continually updated with good-looking result files, and the database queues are normal/stable. Also, Eric has been tweaking this himself, so I didn't want to step on his work. Nevertheless, I just took his load-balancing fixes out of the way on the upload server and put my own fix in - one that sends every 4th result upload request to the scheduling server (which has the headroom to handle it, I think). We'll see if that improves matters. I wonder if this problem is ISP-specific or something like that...

I think I was the first to report this, so here are some observations to help you narrow it down and decide whether it's significant or not.

In reverse order:
ISP - I doubt it. I reported from the UK ("uploads sticky - anyone else seeing this?"); both times, the first responder was from Australia ("same here"). I doubt we share an ISP!
Both events seemed to start over the (late) weekend, and be cured by the Tuesday maintenance outage (we got some standard congestion backoffs this week, but that's different).
All uploads completed eventually, so no reduction in total work reported. It was just an unusual number of retries before success.
There seemed to be time periods when all uploads passed successfully at the first attempt, and others when failures were particularly common - with a (subjective) frequency of ~1 hour. I have logs which I could possibly analyse to give more specific time information, if that would be helpful.
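For anyone wanting to do the same kind of log analysis, bucketing the failures by hour of day is straightforward. A rough sketch, assuming client log lines that begin with a 'DD-Mon-YYYY HH:MM:SS' timestamp (the exact log format here is an assumption, not taken from the post):

```python
from collections import Counter
from datetime import datetime

def failures_per_hour(log_lines):
    """Count 'Server returned nothing' errors per hour of day.

    Assumes each log line starts with a 'DD-Mon-YYYY HH:MM:SS'
    timestamp; that format is an assumption for illustration.
    """
    hist = Counter()
    for line in log_lines:
        if "Server returned nothing" in line:
            stamp = " ".join(line.split()[:2])       # e.g. '11-Sep-2009 13:05:22'
            when = datetime.strptime(stamp, "%d-%b-%Y %H:%M:%S")
            hist[when.hour] += 1
    return hist
```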
Regarding your comments yesterday about the science database stalling - I note that Bruno runs the majority of the validator processes, which need science DB access. Could a science DB freeze possibly cause another process, say an upload, to timeout and miss its appointment? Just brainstorming here.

Finally, did you see the specific link to a Cisco advisory suggesting an IOS upgrade to forestall a DoS attack on the router? The symptom quoted was dropped packets (not empty replies to packets, which is what I'm seeing), but it suggested that, in extremis before patching, a router reboot might be needed to restore throughput.

Edit - just at the moment, everything is running perfectly, with no retries at all: not a good time for troubleshooting. [ooooops, I invoked Murphy again :-( ]

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 935728 - Posted: 24 Sep 2009, 22:32:06 UTC - in response to Message 935718.

Good points. I'll digest, but to clarify:

Regarding your comments yesterday about the science database stalling - I note that Bruno runs the majority of the validator processes, which need science DB access. Could a science DB freeze possibly cause another process, say an upload, to timeout and miss its appointment? Just brainstorming here.


Good brainstorming, but the validators actually do not require science database access. They read the actual result files, do mathematical checking/comparison, and set the validation passed/failed flag in the BOINC/MySQL database. So, yes, they're checking science, but all of this happens outside of any need for the science database. For what that's worth...
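As a toy illustration of that division of labour: a validator only needs the uploaded result files themselves, e.g. comparing two crunchers' numbers within a tolerance and deciding pass/fail. This is just the general shape of the process, not the real BOINC validator code:

```python
def validate_pair(result_a, result_b, tolerance=1e-5):
    """Decide whether two crunchers' results for the same workunit agree.

    Each result is a list of floats parsed from an uploaded result file.
    A toy version only - the real validator compares detected signals,
    not bare numbers, and the tolerance here is made up.
    """
    if len(result_a) != len(result_b):
        return False                                  # mismatched output: invalid
    # Agree if every value matches within the tolerance.
    return all(abs(a - b) <= tolerance for a, b in zip(result_a, result_b))
```

On agreement, the validator then records the pass/fail state in the BOINC MySQL database, as Matt describes - no science database involved.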

Edit - just at the moment, everything is running perfectly, with no retries at all: not a good time for troubleshooting. [ooooops, I invoked Murphy again :-( ]


I'm guessing things were okay, and I made them even better with my tweak, but whatever problem it is that usually crops up these days will rear its head as we approach the weekend (i.e. the end of the database compression cycle).

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Profile ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 8418
Credit: 4,134,325
RAC: 1,462
United Kingdom
Message 935749 - Posted: 24 Sep 2009, 23:31:48 UTC - in response to Message 935692.
Last modified: 24 Sep 2009, 23:32:47 UTC

Hey gang. Sorry to say the first software radar blanker tests were kind of a bust - apparently some radar still leaked through. But we have strong theories as to why, and the fixes are trivial. I'll probably start another test this afternoon (a long process to reanalyze/reblank/resplit the whole test file - may be a day or two before workunits go out again).

All part of the science!

To answer one question: these tests are happening in public. As far as crunchers are concerned this is all data driven, so none of the plumbing that usually required more rigorous testing has changed, thus obviating the need for beta. And since there are far more flops in the public project, I got enough results returned right away for a first diagnosis. I imagine if I did this in beta it would take about a month (literally) before I would have realized there was a problem.

With real science in action... :-)

That reminds me of an episode from the old Dr Who series where he goes to a city called Logopolis, whose people are (religiously) dedicated to mathematical calculations. He takes momentary advantage of their combined mathematical diligence to perform a near-impossible calculation to reprogram the chameleon circuit of his TARDIS. Of course, other things happen too...


Good luck on the next attempt!

Regards,
Martin
____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8499
Credit: 49,933,448
RAC: 51,087
United Kingdom
Message 935831 - Posted: 25 Sep 2009, 15:00:07 UTC

I've checked those logs. Across three CUDA machines (their rapid turnover makes problems more noticeable!), I had 6377 occurrences of "HTTP error: Server returned nothing (no headers, no data)" between 12:00:00 UTC on Friday 11 Sep (when we first started commenting in Problems Uploading) and 18:00:00 UTC on Tuesday 15 Sep, when I finally cleared everything out during maintenance. During the same interval, I had 1336 successful uploads on the same three machines, so each file took on average 5.77 attempts to get through - usually they go at the first attempt, unless the pipe is maxed out.
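The 5.77 figure follows directly from those two counts - total attempts (failures plus successes, since every file eventually got through) divided by the number of successes:

```python
# Figures quoted above: failed and successful upload attempts over the
# same four-day window, across the same three machines.
failures = 6377
successes = 1336

# Every file eventually succeeded, so total attempts = failures + successes.
attempts_per_file = (failures + successes) / successes
print(round(attempts_per_file, 2))  # 5.77
```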

I can't deduce much from the hourly pattern:


[image missing: hourly plot of upload failures (note log scale)]

except that I resorted to the retry button on Tuesday morning! Data available on request.

Rick Heil
Joined: 16 Aug 09
Posts: 1
Credit: 873,126
RAC: 0
United States
Message 935918 - Posted: 25 Sep 2009, 22:40:16 UTC - in response to Message 935831.

For those suspecting internet issues, it is sometimes interesting to view what is going on at the backbone in the US, not just the local ISP. A tracert will tell you where in the network your traffic is being routed. The critical networks are shown in red, and sometimes the packet losses are astounding. I am not suggesting anything about the retry issue mentioned here, only offering another data point to compare with your data. See -> http://www.internetpulse.net - click on the boxes. You can also drill down by location.

Rick


Copyright © 2014 University of California