Drive (Aug 11 2011)



Message boards : Technical News : Drive (Aug 11 2011)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1390
Credit: 74,079
RAC: 0
United States
Message 1139085 - Posted: 11 Aug 2011, 23:07:27 UTC

Okay, we didn't fix the HE connections problem, but are getting closer to understanding what's going on. Basically our router down at the PAIX keeps getting a corrupted routing table. We reboot it, which flushes the pipes, but this only "evolves" the issue: people who couldn't connect before now can, but people who could connect before now cannot, or people don't see any change in behavior. This is likely due to a mixture of: (a) low memory on this old router, (b) our ridiculously high, constant rate of traffic, and perhaps also (c) a broken default route.

We're looking into (c) at the moment, and solving (a) may be far too painful (we don't have easy access to this router, which is a donated box mounted in donated rack space 30 miles away). So I've been arguing that we need to deal with (b) first, i.e. reduce our rate of traffic.

Part of reducing our traffic means breaking open our splitter code. Basically, one of the seven beams down at Arecibo has been busted for a while, thus causing a much-higher-than-normal rate of noisy workunits. We've come up with a way to detect a busted beam automatically in the splitter (so it won't bother creating workunits for said beam) but this means cracking open the splitter. This is a delicate procedure, as you can really screw things up if the splitter is broken - and it usually needs oversight from Eric, who is the only one qualified to bless any changes to it. Of course, Eric has been busy with a zillion other things, so this kept getting kicked down the street. But at this point we all feel this needs to happen, which should reduce general traffic loads, and maybe clear up other problems - like our seemingly overworked router facing HE being unable to handle the load.

Of course, it doesn't help that we're all bogged down in a wave of grant proposals and conferences, and I'm having to write a bunch of notes as part of a major brain dump since I'm leaving for two months (starting two weeks from now). I'll be on the road (all over eastern North America in September, all over Europe in October) playing keyboards/guitar with the band Secret Chiefs 3. It's been a crazy month thus far getting ready for that.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4216
Credit: 34,480,029
RAC: 12,783
United Kingdom
Message 1139093 - Posted: 11 Aug 2011, 23:24:26 UTC - in response to Message 1139085.
Last modified: 11 Aug 2011, 23:33:38 UTC

Thanks for the update Matt,

Another idea is to more heavily penalise hosts that are producing invalid work on an app version; hosts with Fermi GPUs running non-Fermi apps spring to mind.

Claggy

Byron Leigh Hatch @ team Carl Sagan
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 3622
Credit: 11,947,405
RAC: 1,146
Canada
Message 1139094 - Posted: 11 Aug 2011, 23:32:04 UTC - in response to Message 1139085.

Thanks for the update Matt, and thanks to the SETI@home crew for all your long hours of hard work ... Best Wishes Byron.

arkayn
Project donor
Volunteer tester
Joined: 14 May 99
Posts: 3728
Credit: 48,768,260
RAC: 1,737
United States
Message 1139104 - Posted: 11 Aug 2011, 23:54:02 UTC

Thanks for the update Matt.

Maybe we should start a drive to fund a new replacement router.
____________

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4602
Credit: 121,650,970
RAC: 38,434
United States
Message 1139123 - Posted: 12 Aug 2011, 0:34:36 UTC - in response to Message 1139104.

Thanks for the update Matt.

Maybe we should start a drive to fund a new replacement router.

Maybe we should shoot for replacing the on-campus routers as well, in the event that higher bandwidth could be utilized.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Gary Charpentier
Project donor
Volunteer tester
Joined: 25 Dec 00
Posts: 13006
Credit: 7,666,318
RAC: 6,097
United States
Message 1139139 - Posted: 12 Aug 2011, 1:06:03 UTC
Last modified: 12 Aug 2011, 1:10:13 UTC

Thanks for the update.

While cleaning up the work units is important, that is at best a temporary patch.

You said donated rack space and router. Are you saying if the donated router was replaced the rack space donation would go away? Or is the rack space such that no other box will fit? Just trying to get the political angle if any sorted.

Another question that comes to mind: is the box on the latest version of software from the manufacturer? Perhaps an update for this has been issued and it needs to be flashed.

[edit] Isn't that box a gig router? Shouldn't it be able to handle 10X the traffic (routing table) it is now getting? RAM failure?
____________

Jeff Mercer
Joined: 14 Aug 08
Posts: 90
Credit: 162,139
RAC: 0
United States
Message 1139230 - Posted: 12 Aug 2011, 3:39:52 UTC

Once again, Thanks For The Update Matt !

OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 13664
Credit: 31,501,134
RAC: 7,606
United States
Message 1139235 - Posted: 12 Aug 2011, 3:43:43 UTC - in response to Message 1139085.

I'll be on the road (all over eastern North America in September, all over Europe in October) playing keyboards/guitar with the band Secret Chiefs 3. It's been a crazy month thus far getting ready for that.


Les Claypool? Faith No More? Who knew you were so cool, Matt? Let me know when you make it to Ozzy Osbourne and I'll book the first flight out there and be your groupie, carrying your instruments. ;-D

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5920
Credit: 61,710,452
RAC: 17,476
Australia
Message 1139276 - Posted: 12 Aug 2011, 5:10:50 UTC - in response to Message 1139085.

Part of reducing our traffic means breaking open our splitter code. Basically, one of the seven beams down at Arecibo has been busted for a while, thus causing a much-higher-than-normal rate of noisy workunits.

How many units per 1,000 (or 10,000 etc) are noisy?
I myself haven't seen many noisy Work Units- a couple of times I've had a group of 5-8 that only ran for 10-20 secs before finishing- but this is out of 2,500-3,000 Work Units in the cache at that time. What I have been seeing is a lot more shorties than in the past- i.e. WUs that take only 3 min or less to run on my video card. The work mix, apart from a burst of almost nothing but shorties a week or two ago, appears to be a mix of some longer running WUs but a much higher percentage of shorter running WUs, with very few mid-range WUs at all (although over the last few days there have been a few more mid-runtime WUs than there had been).

Also, the fact that my caches have only been full a couple of times over the last month would be contributing to the extended periods of high traffic- the frequent glitches that have been bringing the system down, or limiting WU production, or limiting its allocation, mean that the caches just haven't had a chance to re-fill.
Many of the faster systems would be having the same problem- the cache just isn't filling up, as something occurs that stops it from getting work, but it's still processing work at a significant rate. So by the time Seti comes up again, it has wiped out the gains it made in building up its cache since the last outage.

I suspect that if the system was able to remain up without any slowdowns in work production or allocation between weekly outages then after 2 weeks the traffic would drop off considerably as all the faster machines would have finally been able to fill their caches.
____________
Grant
Darwin NT.

PKnowles
Joined: 24 Apr 10
Posts: 44
Credit: 6,596,057
RAC: 6,226
United States
Message 1140675 - Posted: 15 Aug 2011, 3:23:06 UTC

What kind of router are we speaking of here? What is in place now? What might a good replacement be?

edjcox
Joined: 20 May 99
Posts: 71
Credit: 4,043,111
RAC: 763
United States
Message 1140698 - Posted: 15 Aug 2011, 4:59:42 UTC



Do they have any real IT engineers working on this effort?

It seems I see a pattern of hit/miss efforts to fix problems that are never really pinned down.

Before the righteous all chime in here, let's seriously look at what's going on. There are a lot of good companies who would contribute hardware and maybe even communications hardware and bandwidth. It just seems the people within the project, while doing yeoman's work for little reward, are simply not getting the support, funding, and outside help this project deserves.

We seem to have but a few who even bother to communicate what's going down, and that will shortly meet a two-month hiatus as Matt hits the road for his music tour.

Where are the principals on this team, and why can't they manage to set up a regular weekly update, a regular status update, and a troubleshooting direction?

A lot of us out here are available for more than CPU cycles if we're communicated with. We might even be able to dissect the system if someone would take the time to block chart a schematic and layout of the systems and the hardware.

Those of you that are offended by my viewpoints need not throw electrons, as frankly I am not a happy SETI at Home camper anymore. I'm aware of the limited manpower, limited funds, and so on. I also expect the people handling this project to give it their best work.

A prime example is the dated information on the website and how it's not current at all.

What, for example, has come from the data acquired a few weeks back? Has it even been analysed at all? How about some feedback on the front page...

Sorry, with my three PCs working on SETI and Einstein I am getting a better feel looking for gravitational waves than I am looking for evidence of communications between ETs.


____________
Never engage stupid people at their level, they then have the home court advantage.....

Chris S
Project donor
Volunteer tester
Joined: 19 Nov 00
Posts: 32348
Credit: 14,277,714
RAC: 7,161
United Kingdom
Message 1140772 - Posted: 15 Aug 2011, 13:23:22 UTC

I'll be on the road (all over eastern North America in September, all over Europe in October) playing keyboards/guitar with the band Secret Chiefs 3. It's been a crazy month thus far getting ready for that.


Matt's site

Enjoy it Matt, you've earned it. We'll welcome you back later in the year.
____________
Damsel Rescuer, Uli Devotee, Julie Supporter, ES99 Admirer,
Raccoon Friend, Anniet fan, Official crusty old fart


rob smith
Project donor
Volunteer tester
Joined: 7 Mar 03
Posts: 8755
Credit: 61,655,007
RAC: 33,169
United Kingdom
Message 1140843 - Posted: 15 Aug 2011, 18:08:03 UTC

Matt, Thanks for all you've been doing to keep S@H running.

Europe in October - must keep my eyes open and try to get my ears filled.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Francis Noel
Joined: 30 Aug 05
Posts: 424
Credit: 61,009,239
RAC: 30,594
Canada
Message 1141266 - Posted: 16 Aug 2011, 20:43:06 UTC
Last modified: 16 Aug 2011, 20:45:05 UTC

Touring with SC3 !
Dude...

What can I say ? Congratulations on having a kick-ass life, Matt. I am genuinely happy for you and I truly hope you enjoy the experience.

I love it when Good Things happen to Good People.

This is beautiful.
____________
mambo

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2328
Credit: 8,869,285
RAC: 683
United States
Message 1141275 - Posted: 16 Aug 2011, 21:14:56 UTC

I don't know if it's even possible, but during the software blanking stage of cleaning the tapes up before splitting them, could APs that are 100% blanked be prevented from being sent out at all? That's 16MB of data transfer that can be saved for every WU affected by it. My machines are nowhere near power crunchers, but I still pick up a handful of those WUs back-to-back every now and then.

I guess something along the lines of a CSV file that says blanking starts at this offset and lasts for this many bytes, so when the splitters come along, they look at the CSV table and see what sections of the tape to skip entirely (obviously only sections where a whole WU will be 100% blanked).
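To make the CSV idea above concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function names, the two-column "offset,length" layout, and the fixed workunit size are hypothetical, and none of it reflects how the real splitter or the blanking stage is written.

```python
import csv
import io


def load_blanked_ranges(csv_text):
    """Parse 'offset,length' rows into a sorted, merged list of (start, end) byte ranges."""
    ranges = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip empty lines and '#' comment rows (hypothetical file convention).
        if not row or row[0].startswith("#"):
            continue
        offset, length = int(row[0]), int(row[1])
        ranges.append((offset, offset + length))
    ranges.sort()
    merged = []
    for start, end in ranges:
        # Join ranges that touch or overlap so back-to-back blanking counts as one span.
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def fully_blanked(wu_start, wu_end, blanked):
    """True if [wu_start, wu_end) lies entirely inside one merged blanked range."""
    return any(start <= wu_start and wu_end <= end for start, end in blanked)


def workunits_to_split(tape_bytes, wu_bytes, blanked):
    """Yield (start, end) for each workunit-sized chunk that is not 100% blanked."""
    for wu_start in range(0, tape_bytes - wu_bytes + 1, wu_bytes):
        wu_end = wu_start + wu_bytes
        if not fully_blanked(wu_start, wu_end, blanked):
            yield (wu_start, wu_end)
```

Only chunks that sit entirely inside a blanked span are skipped, which matches the "obviously only sections where a whole WU will be 100% blanked" condition; a partially blanked chunk would still be split and sent out.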

Just a thought.
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4216
Credit: 34,480,029
RAC: 12,783
United Kingdom
Message 1143602 - Posted: 21 Aug 2011, 21:40:12 UTC - in response to Message 1139093.
Last modified: 21 Aug 2011, 21:40:52 UTC

Thanks for the update Matt,

Another idea is to more heavily penalise hosts that are producing invalid work on an app version; hosts with Fermi GPUs running non-Fermi apps spring to mind.

Claggy

Here's an example of an (Anonymous) host which should be penalised for producing invalid results:

hostid=3378825

It has a Phenom II X6 1055T CPU and three Cayman GPUs. Most of its MB CPU tasks are inconclusive/invalid, as are its ATI Astropulse GPU tasks.
It has 765 AP tasks in its list: 24 of those are valid, 60 invalid, and 673 pending, most of which only took a few hundred seconds if that, and it still has a Max tasks per day of 100 for ATI AP.
The counts for CPU Multibeam are hardly any better, and that also has a Max tasks per day of 100,
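As a rough illustration of the kind of penalty being suggested, here is a small Python sketch that scales a host's daily quota by the share of its judged results that validated. It is purely hypothetical: the function name, the default values, and the scaling rule are assumptions made up for this example, and it is not a description of how the BOINC scheduler actually sets Max tasks per day.

```python
def adjusted_daily_quota(valid, invalid, base_quota=100, min_quota=1, min_sample=20):
    """Scale a per-app-version daily task quota by the host's valid-result ratio.

    Pending tasks are ignored because they have not been judged yet.
    All parameter names and defaults are hypothetical.
    """
    judged = valid + invalid
    # Too few judged results to say anything about the host: leave the quota alone.
    if judged < min_sample:
        return base_quota
    valid_ratio = valid / judged
    # Never drop to zero, so a repaired host can still earn its quota back.
    return max(min_quota, int(base_quota * valid_ratio))


# Counts quoted above for the ATI AP app version: 24 valid, 60 invalid.
print(adjusted_daily_quota(24, 60))
```

With those counts the host would fall from 100 tasks per day to 28 under this made-up rule, and a host returning nothing but invalid results would be held at the minimum of 1.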

Claggy

Hauser Investments, inc
Joined: 22 Nov 02
Posts: 1
Credit: 2,794,720
RAC: 0
United States
Message 1143853 - Posted: 22 Aug 2011, 15:45:32 UTC

ok
____________

Jeff Mercer
Joined: 14 Aug 08
Posts: 90
Credit: 162,139
RAC: 0
United States
Message 1144112 - Posted: 22 Aug 2011, 23:47:18 UTC

Hello to all. Since the project is down at the moment, and I am out of work, I think I'll just back out of the program for a short time. Got a lot of things going on here at the moment, and so, I'm shutting down for a while. All work units that I had are now, "CRUNCHED" and sent back in. I'll be back though ! Thanks !

Jason Safoutin
Volunteer tester
Joined: 8 Sep 05
Posts: 1386
Credit: 200,389
RAC: 0
United States
Message 1144152 - Posted: 23 Aug 2011, 2:12:22 UTC

I have been crunching with SETI@home on and off for a while now. Whenever there were major issues that brought the project down, these guys (Matt et al.) always pulled through to get the project up and running again, whether it be sooner or later. I have faith that they will pull through this problem as well.

It's a shame there are not more donations both financially and in terms of hardware. So sad to see a good program go underfunded. If I ever run into a lot of money, you can bet I would make a large donation to the project.
____________
"By faith we understand that the universe was formed at God's command, so that what is seen was not made out of what was visible". Hebrews 11.3

Jeff Mercer
Joined: 14 Aug 08
Posts: 90
Credit: 162,139
RAC: 0
United States
Message 1144168 - Posted: 23 Aug 2011, 2:48:01 UTC - in response to Message 1144152.

Well Said....... Well said INDEED !!!!


Copyright © 2014 University of California