Afternoon Break (Jan 09 2008)


Message boards : Technical News : Afternoon Break (Jan 09 2008)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 698737 - Posted: 9 Jan 2008, 22:51:15 UTC
Last modified: 9 Jan 2008, 22:51:36 UTC

More blips and blops in our traffic caused by who-knows-what. We still don't have enough data yet to see if yesterday's BOINC result outcome index build helped with those regular slow validation-fix updates. In any case, I misspoke: we are running a version of MySQL where triggers are available to us - we only have to figure out how to implement them to do what we need. This morning the secondary download server bane was having a mount headache and I had to give it a virtual kick to get it going again. And that router is still a problem, but we're not convinced it's the only problem. Swapped out cables, switches etc. to no avail this morning. I installed some real load balancing between vader and bane (in practice round robin DNS is hardly balanced) which may help.

There was still slowness to the web site as of a few minutes ago. This had nothing to do with recent web code tinkering/updates or database load or any such thing - this was strictly due to the aforementioned router problems, as half the web traffic was going through the same router (the other half over the standard campus network). I just moved the competing traffic onto the campus network as well, so that should improve web site performance in general.

Regarding recent assimilator clogs, we had another one this afternoon. And yes, once again it was from a result produced by an optimized client. This time around I attached a debugger and found the problem was in XML parsing of the result and sure enough with enough eye-squinting I found a couple garbage characters in the uploaded result file. Specifically, in the power-of-time declaration of a pulse. Instead of:

<pot length=211 encoding="x-csv">

It was:

<pot length=211 encoding71x-csv">

So there are two problems. First, something is causing corruption in the xml (the non-standard client? something else on our end?). And second, the assimilator is too sensitive to such corruption. It shouldn't bail out so readily and create these large ready-to-assimilate queues.
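
As a rough illustration of the second point, the sketch below checks each result for a malformed <pot> opening tag and sets a corrupt result aside instead of halting the whole queue. It is a minimal Python sketch, not the actual assimilator code: the regular expression and the names check_pot_tags, assimilate_all, and assimilate_one are hypothetical and introduced here only for illustration.

```python
import re

# Hypothetical pattern for the <pot> opening tag quoted in this post;
# the real assimilator's parser and result schema are not shown here.
POT_TAG = re.compile(r'<pot length=(\d+) encoding="([\w-]+)">')


def check_pot_tags(result_text):
    """Return a list of malformed <pot ...> opening tags found in result_text."""
    bad = []
    for tag in re.findall(r'<pot\b[^>]*>', result_text):
        if POT_TAG.fullmatch(tag) is None:
            bad.append(tag)
    return bad


def assimilate_all(results, assimilate_one):
    """Assimilate each (name, text) pair; quarantine corrupt ones instead of stopping."""
    quarantined = []
    for name, text in results:
        bad = check_pot_tags(text)
        if bad:
            quarantined.append((name, bad))  # set aside for later inspection
            continue                         # keep the queue moving
        assimilate_one(name, text)
    return quarantined
```

With this shape, one corrupt result ends up in the quarantine list while the results behind it are still processed.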

Minor updates to the server status page: I changed references to "beam/polarization pair" to the more concise "channel." I then added a parenthetic numeric value to the end of each data file (representing total working/done channels for each file) so you don't have to count the little green squares. I also added total values at the bottom for all data files (mostly so we can see how long we have before we run out of data to split). Note how the "vertical" process (i.e. splitting multiple files at once) has a negative side effect: we are forced to keep data files around much longer, which makes it difficult to keep a queue of data on disk. Some better "vertical" logic has been coded, to be rolled out in the next day or so.

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8491
Credit: 49,764,523
RAC: 55,072
United Kingdom
Message 698746 - Posted: 9 Jan 2008, 23:24:17 UTC - in response to Message 698737.

....
Regarding recent assimilator clogs, we had another one this afternoon. And yes, once again it was from a result produced by an optimized client. This time around I attached a debugger and found the problem was in XML parsing of the result and sure enough with enough eye-squinting I found a couple garbage characters in the uploaded result file. Specifically, in the power-of-time declaration of a pulse. Instead of:

<pot length=211 encoding="x-csv">

It was:

<pot length=211 encoding71x-csv">

So there are two problems. First, something is causing corruption in the xml (the non-standard client? something else on our end?). And second, the assimilator is too sensitive to such corruption. It shouldn't bail out so readily and create these large ready-to-assimilate queues.

There's been a bit of a kerfuffle about this in Number Crunching. Matt, could you possibly clarify whether you mean client (by which, we normally mean the version of BOINC in use - and yes, there are 'optimised' versions of BOINC out in the field), or are you referring to an optimised SETI cruncher, which we would usually refer to as an optimised application? Can you identify the particular version of the client/app which has given rise to the problem and - privately - get word to that particular optimiser or crew that they need to do a bit of debugging?

Minor updates to the server status page: I changed references to "beam/polarization pair" to the more concise "channel." I then added a parenthetic numeric value to the ends of each data file (representing total working/done channels for each file) so you don't have to count the little green squares. I also added total values at the bottom for all data files (mostly so we can see how long we have before we run out of data to split). Note how the "vertical" processes (i.e. splitting multiple files at once) has a negative side effect: we are forced to keep data files around much longer, which makes it difficult to keep a queue of data on disk. Some better "vertical" logic has been coded, to be rolled out in the next day or so.

- Matt

The "vertical" splitting process has been a great success (and the extra clarity on the progress blocks is nice, too). There was a bit of a problem when you injected four new 'tapes' after the maintenance outage yesterday, and they got subjected to a sort of server-side EDF (as predicted by Henk Haneveld): the low channels on the new 'tapes' got split in preference to the pre-existing work. Unless your new logic can get round that problem, you'll have to delay putting new 'tapes' into the system until the existing ones have worked through channel 14. Still, the new 'ready to send' buffer limit of ~800,000 tasks should represent around 15 hours of work in hand, so there's no hurry to put another disk online the moment the last of the current crop are split.
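
For a sense of scale, the two approximate figures above imply the following send rate. This is plain arithmetic on the quoted numbers (both of which are rough), not a measured value, and the variable names are only illustrative.

```python
READY_TO_SEND_TASKS = 800_000   # approximate buffer limit quoted above
HOURS_OF_WORK = 15              # approximate figure quoted above

tasks_per_hour = READY_TO_SEND_TASKS / HOURS_OF_WORK
tasks_per_second = tasks_per_hour / 3600

print(f"{tasks_per_hour:,.0f} tasks/hour, about {tasks_per_second:.1f} tasks/second")
```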

Josef W. Segur
Project donor
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4244
Credit: 1,047,276
RAC: 293
United States
Message 698769 - Posted: 10 Jan 2008, 1:42:36 UTC - in response to Message 698746.

....
Regarding recent assimilator clogs, we had another one this afternoon. And yes, once again it was from a result produced by an optimized client. This time around I attached a debugger and found the problem was in XML parsing of the result and sure enough with enough eye-squinting I found a couple garbage characters in the uploaded result file. Specifically, in the power-of-time declaration of a pulse. Instead of:

<pot length=211 encoding="x-csv">

It was:

<pot length=211 encoding71x-csv">

So there are two problems. First, something is causing corruption in the xml (the non-standard client? something else on our end?). And second, the assimilator is too sensitive to such corruption. It shouldn't bail out so readily and create these large ready-to-assimilate queues.

There's been a bit of a kerfuffle about this in Number Crunching. Matt, could you possibly clarify whether you mean client (by which, we normally mean the version of BOINC in use - and yes, there are 'optimised' versions of BOINC out in the field), or are you referring to an optimised SETI cruncher, which we would usually refer to as an optimised application? Can you identify the particular version of the client/app which has given rise to the problem and - privately - get word to that particular optimiser or crew that they need to do a bit of debugging?

The problem certainly indicates something related to the application rather than the BOINC core client, since it's the application which writes all the details into a result file. However, the code to format that line is in the db/schema_master.cpp file rather than the client/ path, and there's no reason to modify anything in db/ since that code isn't used frequently enough to consume an appreciable fraction of crunch time. It's also a puzzle how a 0x3D 0x22 sequence could get changed to 0x37 0x31. It tends to make me consider that overclocking often goes hand in glove with use of optimized applications, and can certainly lead to corrupted data.
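
To see how far apart the two byte pairs are, the short Python check below XORs them and counts the differing bits. The byte values are the ones given above; the helper name flipped_bits is only illustrative.

```python
expected = b'="'    # 0x3D 0x22, the bytes that should follow "encoding"
observed = b'71'    # 0x37 0x31, the bytes found in the corrupt result


def flipped_bits(a, b):
    """Count differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


xor_masks = [hex(x ^ y) for x, y in zip(expected, observed)]
print(xor_masks)                          # ['0xa', '0x13']
print(flipped_bits(expected, observed))   # 5
```

Five bits differ across two adjacent bytes, so this is not a single flipped bit. That observation alone does not point to a cause.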

If all the cases have been from a particular optimized build, that would be very interesting and indicative. And if the other result files were preserved and show a similar corruption pattern then perhaps one might guess that one of the optimization options specified when making that build has a weakness.

If the problem seems to be in one of the 2.4 or 2.4V apps built from lunatics.at source, I will certainly try to figure it out since I'm responsible for many of the source changes in those builds.
Joe

jason_gee
Project donor
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 4981
Credit: 73,358,922
RAC: 15,919
Australia
Message 698817 - Posted: 10 Jan 2008, 6:45:15 UTC - in response to Message 698769.
Last modified: 10 Jan 2008, 6:57:53 UTC

If the problem seems to be in one of the 2.4 or 2.4V apps built from lunatics.at source, I will certainly try to figure it out since I'm responsible for many of the source changes in those builds.
Joe


In case it helps, I can fairly completely understand how an error may have worked into a build in this area of code. This is exactly where I was having problems finding a stable boinc SVN revision to build against the api for my opt app experiments.

There has been 'seemingly minor' twiddling related to string and xml processing lately [During/Since 2.4]. The newly introduced str_util.cpp interfacing with xml_utils all make or break depending on which boincapi revision is chosen, and what 'tweaks' are used to make it build [i.e. arbitrary build time adjustments....] . So my guess is the error will likely be from a specific build source rather than all of them.

Some of those 'buildability' issues may lead to string corruption or an omitted buffer initialisation, in special rare circumstances.

Don't suppose you saved the unprocessed results in question? We could test them standalone - Although we don't have an assimilator, a few coffees and my wired eyeball should be able to spot an errant '71'... ;D

Jason
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 15993
Credit: 690,597
RAC: 0
United States
Message 698896 - Posted: 10 Jan 2008, 12:03:20 UTC


Matt, Berkeley - Thanks for the work You're each doin' . . . it is appreciated
____________
BOINC Wiki . . .

Science Status Page . . .

Robert Mills
Joined: 7 Jan 08
Posts: 1
Credit: 8,042
RAC: 0
United States
Message 698908 - Posted: 10 Jan 2008, 14:12:55 UTC - in response to Message 698896.


Matt, Berkeley - Thanks for the work You're each doin' . . . it is appreciated


I just joined the project and have installed seti@home software and been up and running

cyberbob-desktop 41.22 449 AuthenticAMD Dual-Core
AMD Opteron(tm) Processor 1210 [Family 15 Model 67 Stepping 2] Linux
2.6.22-14-generic 16 10 Jan 2008 13:47:31 UTC

4130010 VAGABOND4 58.75 666 GenuineIntel
Intel(R) Core(TM) Duo CPU T2450 @ 2.00GHz [x86 Family 6 Model 14 Stepping 12] Microsoft Windows XP
Professional Edition, Service Pack 2,

for a few days now. I have the two computers running 24/7 and figured I could share computing power, especially with new technology becoming available. The Linux machine has additional scientific and mathematical programs installed, including Celestia, StarPlot and Stellarium. Mathematical programs include Gnumeric Spreadsheet and Qalculate. I am set up for remote server/client through Krdc/krfb and through zeroconf (KDE) on Sun Ultra 20 64 bit.

FYI - my computers are designed for this kind of stuff as well

Let me know how we might make use of this, if collaborative tasks need to be done. On the Intel machine I have Microsoft Office 2007, and I am participating in the Office Live Workstation Beta program and can set up an Office Live project there as well.


____________

Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 15993
Credit: 690,597
RAC: 0
United States
Message 698911 - Posted: 10 Jan 2008, 14:44:36 UTC


. . . Welcome to the Project Robert - Glad You Joined the Search Sir

richard
____________
BOINC Wiki . . .

Science Status Page . . .

KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 1937
Credit: 10,046,945
RAC: 17,883
United States
Message 698923 - Posted: 10 Jan 2008, 16:40:36 UTC

Note to project: "sah_assimilator1" currently shows as not running.
____________
.

Jesse Viviano
Joined: 27 Feb 00
Posts: 95
Credit: 474,230
RAC: 0
United States
Message 698926 - Posted: 10 Jan 2008, 16:57:25 UTC

I would think that the only way to solve this is to write or use an XML grammar checker that immediately marks any result containing invalid XML as invalid, before it goes through the normal validator, and takes the work unit off the validation queue if that leaves too few results to validate. I would normally think that this is good practice, but your servers seem loaded to the max.

If this is implemented, someone running the buggy optimized client will notice that he is getting zero credit and will do something about it, because those who run optimized clients probably care about their score more than the average user.

Jan Schotsmans
Joined: 27 Oct 00
Posts: 98
Credit: 92,693
RAC: 0
Belgium
Message 698964 - Posted: 10 Jan 2008, 20:37:33 UTC

Matt, doesn't Campus have a backup router or a multi port router around to which you can hook the pipe for some tests?

I also read in many places that the router you mentioned is CPU limited to between 61 and 65Mbit, depending on the packet sizes being pushed through it, while it has plenty of other resources available.

Also, if cost is an issue for buying routers, check out these guys:

http://www.routerboard.com/comparison.html#powerSeries

I built a RB600 based router at work and it works brilliantly, routing a gigabit connection between 2 of our buildings. And that for nearly no money at all.

It can handle 84000 1500-byte packets per second, which works out to about 120MB/s. So it wouldn't even break a sweat routing the 100Mbit pipe to the Seti farm.
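
The arithmetic behind that figure can be checked as follows, using only the two numbers quoted above. Whether "120MB/s" means decimal megabytes or binary mebibytes is not stated, so both are computed; the variable names are only illustrative.

```python
PACKETS_PER_SECOND = 84_000   # figure quoted in the post
PACKET_BYTES = 1_500          # figure quoted in the post

bytes_per_second = PACKETS_PER_SECOND * PACKET_BYTES
mb_per_second = bytes_per_second / 1_000_000        # decimal megabytes
mib_per_second = bytes_per_second / 2**20           # binary mebibytes
mbit_per_second = bytes_per_second * 8 / 1_000_000  # megabits

print(f"{mb_per_second:.0f} MB/s, {mib_per_second:.1f} MiB/s, {mbit_per_second:.0f} Mbit/s")
```

Either way, the raw packet rate corresponds to roughly 1008 Mbit/s, about ten times the 100Mbit pipe.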

Michael Sinatra
Joined: 23 Jul 07
Posts: 11
Credit: 5,173
RAC: 0
United States
Message 699418 - Posted: 12 Jan 2008, 0:44:48 UTC - in response to Message 698964.
Last modified: 12 Jan 2008, 0:45:18 UTC

Matt, doesn't Campus have a backup router or a multi port router around to which you can hook the pipe for some tests?


Campus does, but there are a lot of obstacles to doing this sort of test. I am not going to go into them now, but they are sizable, and they have to do both with technical issues and with agreements that were made between campus and S@h regarding support of their ISP tunnel.


I also read in many places that the router you mentioned is CPU limited to between 61 and 65Mbit, depending on the packet sizes being pushed through it, while it has plenty of other resources available.


Keep in mind that this router isn't just routing, it's also encapsulating every packet inside a GRE tunnel packet. That encapsulation process is relatively CPU intensive, and ensures that the tunneling *and* routing must take place on the main CPU and can't be offloaded onto any ASIC or sub-processor. (Some routers, like Junipers, require you have a special tunnel processor just to do the tunneling. It works very well, but is not in the price range we're talking about.)
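
As a back-of-the-envelope illustration of the per-packet cost, the sketch below assumes a basic 4-byte GRE header and a 20-byte outer IPv4 header with no options, i.e. 24 bytes of overhead per packet. These header sizes and the function names are assumptions for illustration, not details of the actual tunnel configuration.

```python
GRE_HEADER_BYTES = 4     # basic GRE header, no optional fields (assumed)
OUTER_IPV4_BYTES = 20    # outer IPv4 header without options (assumed)
OVERHEAD = GRE_HEADER_BYTES + OUTER_IPV4_BYTES


def packets_per_second(throughput_bps, outer_packet_bytes):
    """Packets the CPU must encapsulate each second at a given line rate."""
    return throughput_bps / (outer_packet_bytes * 8)


def payload_fraction(outer_packet_bytes):
    """Share of each tunnel packet that carries the original (inner) packet."""
    return (outer_packet_bytes - OVERHEAD) / outer_packet_bytes


# At 60 Mbit/s, smaller packets mean far more encapsulations per second.
for size in (1500, 576, 128):
    print(size, round(packets_per_second(60_000_000, size)), round(payload_fraction(size), 3))
```

Because each packet has to be encapsulated on the main CPU, the packet rate matters as much as the bit rate, which fits the observation that the limit depends on packet size.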


Also, if cost is an issue for buying routers, check out these guys:

http://www.routerboard.com/comparison.html#powerSeries

I built a RB600 based router at work and it works brilliantly, routing a gigabit connection between 2 of our buildings. And that for nearly no money at all.

It can handle 84000 1500-byte packets per second, which works out to about 120MB/s. So it wouldn't even break a sweat routing the 100Mbit pipe to the Seti farm.


Looks cool, but did you test its throughput when it was doing GRE encapsulation? Another possibility is to simply use a PC running FreeBSD or Linux with the Zebra/Quagga software and GRE tunnel interfaces configured. Since GRE is CPU-intensive, it's much easier to scale the CPUs in PCs than in embedded devices. However, you have to be very careful about reliability. Devices that are designed to be routers are also designed with reliability and maximum uptime in mind. Also, unless you really understand the integration process, the mix of hardware and software requirements and the complexity involved can have an impact on reliability. In many respects, you will get what you pay for.

But I would generally agree with you that ~60 mb/s throughput is what we'd expect from a 2811 with GRE encapsulation.

michael
UCB network person

DJStarfox
Joined: 23 May 01
Posts: 1040
Credit: 547,098
RAC: 253
United States
Message 699468 - Posted: 12 Jan 2008, 4:12:22 UTC - in response to Message 699418.

But I would generally agree with you that ~60 mb/s throughput is what we'd expect from a 2811 with GRE encapsulation.


I suppose the *real* solution would be to find a network topology agreement with campus so GRE can be eliminated. I.E., just have a straight ISP line to the building, or perhaps a simple fiber switch between ISP line to campus and the S@H building. Anyone got a concave planar lever-operated dirt removal device? :)

seti@elrcastor.com
Volunteer tester
Joined: 30 Jan 00
Posts: 35
Credit: 4,868,442
RAC: 0
United States
Message 699483 - Posted: 12 Jan 2008, 4:54:04 UTC - in response to Message 699468.

Anyone got a concave planar lever-operated dirt removal device? :)


That would definitely help; it needs a bit of manpower and some parts, and it's done.
____________

Jan Schotsmans
Joined: 27 Oct 00
Posts: 98
Credit: 92,693
RAC: 0
Belgium
Message 699569 - Posted: 12 Jan 2008, 14:15:40 UTC

Michael: I'm purely routing the Cat7 Gbit pipe to that building, the cable is in an underground maintenance tunnel which can only be accessed from a secure area in the buildings, so we didn't see the point of wasting bandwidth by doing encryption.

We have other buildings where the residents pretty much demanded encryption on the pipe, so they are stuck with a crappy pipe for now.

I did build the router with VPN in mind (we'll probably use it to replace our current VPN routers from Symantec during the run of this year, these Symantecs can only pull 13Mbit over VPN connections, in other words they SUCK, and no, not my fault, they are from before my time) for that I installed 2 Soekris VPN 1411 MiniPCI adapters, which are capable of the following:

* Compression, LZS and MPPC at 420 to 510 Mbps
* Encryption, 128/192/256 AES, DES, 3-DES and RC4 at 210 to 460 Mbps
* Authentication, SHA-1 and MD5 at 325 to 360 Mbps
* Public Key, RSA, DSA, SSL, IKE and DH, 24 to 70 connections/sec using 1024 bit keys
* Hardware random number generator
* Form: 33/66 Mhz Mini-PCI type III form factor
* Power max 1.8 Watt
* Operating temperature 0-60 °C

For $50 a pop, I'd say that isn't half bad, especially not if all you need to route is 100Mbit.

I used 2 because I'll be attempting to dedicate separate cards to separate ports, rather than offload everything to a single one of these cards.

Also, since the router board supports 4 MiniPCI cards and can be extended to support 4 more (so that's 8 of them), it should be possible to engineer a gigabit router and configure Linux to do encryption without losing any of the gigabit pipe, and all that for (way) under $1000.
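
A rough way to size this is to divide the target rate by the per-card encryption range listed above. The sketch assumes encryption throughput adds up linearly across cards and ignores bus and CPU limits, which is an assumption rather than a tested result; the function name cards_needed is illustrative.

```python
import math

ENCRYPTION_MBPS_RANGE = (210, 460)  # per-card encryption figures listed above
MAX_CARDS = 8                       # 4 slots plus 4 via the extension
TARGET_MBPS = 1000                  # gigabit pipe


def cards_needed(target_mbps, per_card_mbps):
    """Cards required if throughput scaled linearly with card count (assumed)."""
    return math.ceil(target_mbps / per_card_mbps)


worst_case = cards_needed(TARGET_MBPS, ENCRYPTION_MBPS_RANGE[0])
best_case = cards_needed(TARGET_MBPS, ENCRYPTION_MBPS_RANGE[1])
print(worst_case, best_case, worst_case <= MAX_CARDS)
```

Under that linear-scaling assumption, between 3 and 5 cards would cover a gigabit of encryption, which fits within the 8 available slots.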

The VPN Tunnels we use are IPSec 3DES/SHA1.

Since you should run BSD or Linux on a router like this, adding support for GRE and offloading that to the VPN chips is all in your hands and control.

Michael Sinatra
Joined: 23 Jul 07
Posts: 11
Credit: 5,173
RAC: 0
United States
Message 699612 - Posted: 12 Jan 2008, 19:02:18 UTC - in response to Message 699468.

But I would generally agree with you that ~60 mb/s throughput is what we'd expect from a 2811 with GRE encapsulation.


I suppose the *real* solution would be to find a network topology agreement with campus so GRE can be eliminated. I.E., just have a straight ISP line to the building, or perhaps a simple fiber switch between ISP line to campus and the S@H building. Anyone got a concave planar lever-operated dirt removal device? :)


Agreements with campus don't really have much to do with *what* solution is chosen, only with how that solution is supported when it's in production. The second suggestion you make is largely how the Cogent ISP service for S@H was implemented. When S@H decided to move beyond that, a lot of options were examined, including the first one you suggest. A big part of the issue is that it's hard for an outside ISP to get connectivity up to the top of a 1,000-foot rocky hill that has an active earthquake fault at its foot. (Since most of the campus is on the flats, getting connectivity between the campus and the hills is also difficult.) And difficult==very expensive.

My personal preference would be for the S@H data servers to move to a colo facility close to a Hurricane Electric PoP. No GRE tunnel, and only one router there, instead of two. But I recognize that that solution raises a lot of potentially insurmountable support issues for the S@H folks.

I actually think the current solution isn't bad at all, and it has already proven itself. All we need is a better router in place of the 2811.

michael

perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 15,500,772
RAC: 11,231
United States
Message 699618 - Posted: 12 Jan 2008, 19:14:29 UTC

Michael,

Just wanted to stop in and thank you for posting in here. Poor Matt was wearing his fingers out keeping us up to date by himself. It's good to see some fresh blood.
____________


PROUD MEMBER OF Team Starfire World BOINC

Geek@Play
Project donor
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,108,613
RAC: 20,851
United States
Message 699623 - Posted: 12 Jan 2008, 19:24:50 UTC - in response to Message 699619.


I actually think the current solution isn't bad at all, and it has already proven itself. All we need is a better router in place of the 2811.

michael


Maybe you could come up with the ideal replacement for the 2811 and its cost and pass that along here and to Blurf.....
Perhaps we could ramp up the donation machinery to acquire it for the project...I am sure everybody would be in favor of having more bandwidth on tap.


Michael.......I agree with msattler. Many of us would happily donate for a new router. Let us know what you need.


ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 8408
Credit: 4,127,453
RAC: 1,345
United Kingdom
Message 699624 - Posted: 12 Jan 2008, 19:25:06 UTC

OK, so rather than struggle with wires/fibre, would a line-of-sight microwave link or optical link be feasible? You should be able to get the kit to do the job for less than $10k.

Regards,
Martin

____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

Jan Schotsmans
Joined: 27 Oct 00
Posts: 98
Credit: 92,693
RAC: 0
Belgium
Message 699641 - Posted: 12 Jan 2008, 21:21:09 UTC - in response to Message 699612.

Michael, Cisco has extension boards available to offload encryption onto as well, but at the last price I saw, I could build 200 of these homemade routers that can do the work just as well.

Since Seti is a project running on fumes and the money coming in is needed for the science first, I would feel it my duty as an ITer to do whatever I could to create and support these kinds of cheap solutions.

And if you're responsible for administering the network, believe me, it's a lot of fun, and even more fulfilling and rewarding, to get something like this running well than to get a 200K-a-month job.


Copyright © 2014 University of California