Chaos at the Greasy Spoon (May 24 2007)


Author Message
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 575023 - Posted: 24 May 2007, 21:23:41 UTC
Last modified: 24 May 2007, 21:23:59 UTC

Jeff, Eric, and I had our software meeting this morning, which happens every Thursday. As usual we discuss the game plan as far as bringing a new splitter on line, coding conventions for the near time persistency checker, etc. Then something happens to keep us from doing anything on this front.

Today, at least for me and Jeff, it was isaac crashing. This machine is the boinc.berkeley.edu web server, among other things. Short story: lots of CPU errors, rebooting doesn't help, we tried putting in new memory, no sign of overheating. We got it into rescue mode and put in a non-xen kernel. It's been stable for the past 15 minutes. We'll see if that holds. Doubtful. A service call may be in order. There's a DNS redirect pointing to a stub page in the meantime.

We still haven't figured out the magic settings on bruno and ptolemy, so packets are still getting dropped here and there, causing all kinds of headaches near and far. A lot of work is getting sent and results returned, and we're creating a healthy backlog of workunits to send out as I type, but there is still work to be done. I have no insights on ghost workunits outside of what has already been discussed on these boards.

Hmm. Isaac still hasn't crashed, and Jeff is really exercising the system at this point. Maybe it was a bad kernel after all, though not sure why this would have broken all of a sudden (no new kernel has been installed in a while). I'll revert to the original page in 30 minutes or so if we remain up.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Profile ajinbc
Joined: 15 Mar 06
Posts: 484
Credit: 318,444
RAC: 0
Canada
Message 575027 - Posted: 24 May 2007, 21:41:49 UTC

*keeps fingers crossed*
____________

n7rfa
Volunteer tester
Joined: 13 Apr 04
Posts: 370
Credit: 9,058,599
RAC: 0
United States
Message 575046 - Posted: 24 May 2007, 22:31:42 UTC - in response to Message 575023.
Last modified: 24 May 2007, 22:32:02 UTC

Hmm. Isaac still hasn't crashed, and Jeff is really exercising the system at this point. Maybe it was a bad kernel after all, though not sure why this would have broken all of a sudden (no new kernel has been installed in a while). I'll revert to the original page in 30 minutes or so if we remain up.

- Matt


On an old IBM Mainframe that a former company had, we would occasionally have to recompile a program without making any changes.

When asked why we had to recompile, we would say that "a bit had turned sideways".
____________

Profile Sir Ulli
Volunteer tester
Joined: 21 Oct 99
Posts: 2246
Credit: 6,135,885
RAC: 401
Germany
Message 575047 - Posted: 24 May 2007, 22:36:42 UTC - in response to Message 575027.

*keeps fingers crossed*


Agreed....

Greetings from Germany NRW
Ulli


Profile Ageless
Joined: 9 Jun 99
Posts: 12326
Credit: 2,631,619
RAC: 1,140
Netherlands
Message 575062 - Posted: 24 May 2007, 23:41:09 UTC - in response to Message 575023.

There's a DNS redirect pointing to a stub page in the meantime.

We are experiencing hardware problems on the main BOINC web server. Downloads are currently unavailable at this time.

You do know that currently and at this time is the same thing, right? ;-)
____________
Jord

Fighting for the correct use of the apostrophe, together with Weird Al Yankovic

Profile Andy Lee Robinson
Joined: 8 Dec 05
Posts: 615
Credit: 42,683,899
RAC: 27,270
Hungary
Message 575086 - Posted: 25 May 2007, 0:36:08 UTC - in response to Message 575023.
Last modified: 25 May 2007, 0:38:43 UTC

We still haven't figured out the magic settings on bruno and ptolemy, so packets are still getting dropped here and there, causing all kinds of headaches near and far.


Hi Matt, I'm reposting this here for you which I posted in the 'number crunching' forum, in reply to msattler twiddling with his MTU settings...

so I reset the router MTU from 1500 down to 1400. Shouldn't make a difference on my DSL connection, 1500 is recommended by most TCP advisory programs, but what the heck.


Well, my take is that the MTU should be 1500.
It seems that the machine on 208.68.240.16 has an MTU of 1476

i:\program files\boinc>ping 208.68.240.16 -f -l 1450
Pinging 208.68.240.16 with 1450 bytes of data:
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.

i:\program files\boinc>ping 208.68.240.16 -f -l 1448
Pinging 208.68.240.16 with 1448 bytes of data:
Reply from 208.68.240.16: bytes=1448 time=192ms TTL=53
Reply from 208.68.240.16: bytes=1448 time=192ms TTL=53


MTU = size + 28, so MTU is 1448+28 = 1476
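
The "+ 28" above is the 20-byte IPv4 header (assuming no IP options) plus the 8-byte ICMP echo header that ping wraps around its payload. A minimal Python sketch of that arithmetic follows; the function names are made up for illustration and are not part of any tool mentioned in this thread.

```python
# Overhead that ping adds around its payload (IPv4, no IP options assumed).
IPV4_HEADER_BYTES = 20
ICMP_HEADER_BYTES = 8


def mtu_from_ping_payload(largest_payload: int) -> int:
    """MTU implied by the largest ping payload that passes with DF set."""
    return largest_payload + IPV4_HEADER_BYTES + ICMP_HEADER_BYTES


def largest_payload_for_mtu(mtu: int) -> int:
    """Largest ping payload that fits in one unfragmented packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


if __name__ == "__main__":
    # 1448 was the payload size that got a reply in the test above.
    print(mtu_from_ping_payload(1448))
    # Payload size to try with "ping -f -l" when checking for a 1500 MTU.
    print(largest_payload_for_mtu(1500))
```

By the same reasoning, a path with a full 1500-byte MTU should pass a payload of 1472 bytes with the DF flag set.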

This seems suspicious. Why is bruno using 1476 when the rest of the world uses 1500? This will almost double the amount of packets to deal with and could explain the serious difficulty I still have in transferring anything from a couple of crunching linux web servers. They are using state based firewalls (default fedora core 6), a side effect is the filtering of incoming state related RST packets, which I thought could be a reason for the extremely poor performance. So, I explicitly allowed 208.68.240.16 through. Even after this, it's still only successful in communicating 1% of the time, and no, I'm not going to change the MTU of 1500 on a production web server!

Andy.


It may help to explain the difference in transfer reliability between machines.
My main windows machine/gateway has no problems connecting/transferring
A masqueraded windows machine on the same network has terrible difficulty connecting/transferring/reporting
Another two production linux web servers also had terrible difficulty - though at present all have magically uploaded results and reported.

All have an MTU of 1500. Fragmentation and firewall state may interfere with rst at the end of a session.
Observation is that data is sent but then times out trying to complete.

Hope this helps,
Andy.

Admiral Marith
Joined: 3 Jun 99
Posts: 25
Credit: 22,081,761
RAC: 743
United States
Message 575179 - Posted: 25 May 2007, 4:19:28 UTC - in response to Message 575046.

Hmm. Isaac still hasn't crashed, and Jeff is really exercising the system at this point. Maybe it was a bad kernel after all, though not sure why this would have broken all of a sudden (no new kernel has been installed in a while). I'll revert to the original page in 30 minutes or so if we remain up.

- Matt


On an old IBM Mainframe that a former company had, we would occasionally have to recompile a program without making any changes.

When asked why we had to recompile, we would say that "a bit had turned sideways".


For a period of about 3 weeks, I had a 9672 (IBM mainframe) processor all to myself just running linux. This was classic seti, and there was a client. So I ran it.

I don't know how many other people can say they ran seti@home on an IBM mainframe. But I did.

____________

EVE: Retsil Evad
Joined: 3 Apr 99
Posts: 7
Credit: 905,487
RAC: 0
New Zealand
Message 575212 - Posted: 25 May 2007, 7:01:11 UTC - in response to Message 575179.


For a period of about 3 weeks, I had a 9672 (IBM mainframe) processor all to myself just running linux. This was classic seti, and there was a client. So I ran it.

I don't know how many other people can say they ran seti@home on an IBM mainframe. But I did.

To quote Monty Python - You lucky,lucky bastard.

:)

Wasabi Peanut
Joined: 14 Jul 99
Posts: 62
Credit: 32,646,911
RAC: 0
Switzerland
Message 575216 - Posted: 25 May 2007, 7:13:42 UTC

Things are running smoothly here. Thanks for all your efforts at SSL!

One very curious thing happened two days ago, though: a number of my boxes spawned one - and in one case two - new computer IDs, received a few units under that ID, and then continued crunching with the original computer ID. This all happened without any interference at all on my part.

Has anyone seen such behavior before? I certainly have not...

I'm using BOINC 5.4.9 CLI on a variety of Macs with custom workers, i.e. the app_info.xml work-around is currently in place on all machines in question.

Cheers,

Ron

TarracoServer
Volunteer tester
Joined: 11 Apr 07
Posts: 38
Credit: 296,807
RAC: 117
Spain
Message 575218 - Posted: 25 May 2007, 7:24:32 UTC

What a nasty month!
I hope nothing more explodes at the lab!

Keep on with this good job, guys!
____________

Profile tpl
Joined: 12 Nov 03
Posts: 364
Credit: 193,915,878
RAC: 12,372
Germany
Message 575227 - Posted: 25 May 2007, 7:51:08 UTC - in response to Message 575216.

Things are running smoothly here. Thanks for all your efforts at SSL!

One very curious thing happened two days ago, though: a number of my boxes spawned one - and in one case two - new computer IDs, received a few units under that ID, and then continued crunching with the original computer ID. This all happened without any interference at all on my part.

Has anyone seen such behavior before? I certainly have not...

I'm using BOINC 5.4.9 CLI on a variety of Macs with custom workers, i.e. the app_info.xml work-around is currently in place on all machines in question.

Cheers,

Ron


Hallo,
On 20.05.07 one of my winXP core2 machines got a new ID with 20 workunits, but it's only a ghost machine with ghost results...
Thomas
____________

Profile Geek@Play
Project donor
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,143,646
RAC: 1,043
United States
Message 575354 - Posted: 25 May 2007, 12:41:07 UTC - in response to Message 575216.

Things are running smoothly here. Thanks for all your efforts at SSL!

One very curious thing happened two days ago, though: a number of my boxes spawned one - and in one case two - new computer IDs, received a few units under that ID, and then continued crunching with the original computer ID. This all happened without any interference at all on my part.

Has anyone seen such behavior before? I certainly have not...

I'm using BOINC 5.4.9 CLI on a variety of Macs with custom workers, i.e. the app_info.xml work-around is currently in place on all machines in question.

Cheers,

Ron


I had the same thing happen several months ago. I just let it go and when the ghost results expired I removed the ghost computer ID. Kind of disturbing how all these ghosts pop up!



____________
Boinc....Boinc....Boinc....Boinc....

Profile ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 8499
Credit: 4,193,052
RAC: 1,752
United Kingdom
Message 575381 - Posted: 25 May 2007, 14:24:35 UTC - in response to Message 575086.

...This seems suspicious. Why is bruno using 1476 when the rest of the world uses 1500? This will almost double the amount of packets to deal with and could explain the serious difficulty...

It may help to explain the difference in transfer reliability between machines.
My main windows machine/gateway has no problems connecting/transferring
A masqueraded windows machine on the same network has terrible difficulty connecting/transferring/reporting
Another two production linux web servers also had terrible difficulty - though at present all have magically uploaded results and reported...

I have a dim recollection that the "rest of the world" need not use an MTU of 1500. Unfortunately, Microsoft have a very broken implementation that doesn't support fragmentation properly and so the rest of the world is forced to use that magic Microsoft number. Anyone know further?

Aside: As networks get faster, we should be using much larger MTUs...

It will be interesting if part of the problem is indeed packet size and fragmentation problems...

Regards,
Martin

____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

Urs Echternacht
Project donor
Volunteer tester
Joined: 15 May 99
Posts: 547
Credit: 52,265,170
RAC: 46,322
Germany
Message 575384 - Posted: 25 May 2007, 14:32:09 UTC - in response to Message 575215.


I posted the question back in NC, but I'll post it here also. What do I do with my router's MTU at this point?? Go back to 1500? Go to 1476? What's gonna optimize (some of us are just fixated on that 'O' word, aren't we) communications with Seti?

You could try www.speedguide.net; the TCP/IP Analyzer link will tell you what your MTU, MSS, RWIN and other values look like. If you are behind a firewall, port 8080 is needed. Don't forget to close the port again after analyzing.
____________
_\|/_
U r s

Profile B0BHILL
Joined: 19 Jul 03
Posts: 23
Credit: 203,166
RAC: 0
United States
Message 575418 - Posted: 25 May 2007, 15:35:21 UTC - in response to Message 575354.

Things are running smoothly here. Thanks for all your efforts at SSL!

Just for your information, things are not running smoothly here. I have four processors that have been waiting on work for over 12 hours and I fear this weekend will be a bust again. Project reports no work?? Any news on this? Or am I the only one that has this problem?

Kim Vater
Volunteer tester
Joined: 27 May 99
Posts: 227
Credit: 22,743,307
RAC: 0
Norway
Message 575422 - Posted: 25 May 2007, 15:43:20 UTC

Hi,

You're not alone on this one.

There's not much activity on the SETI network as can be seen here on this graph:
http://fragment1.berkeley.edu/newcricket/grapher.cgi?target=%2Frouter-interfaces%2Finr-250%2Fgigabitethernet2_3;view=Octets;ranges=d

It's early morning in California - so they will look into the problem soon. ;-)

Kiva
____________
Greetings from Norway

Crunch3er & AK-V8 Inside

Profile Dingo
Volunteer tester
Joined: 28 Jun 99
Posts: 97
Credit: 3,714,597
RAC: 660
Australia
Message 575423 - Posted: 25 May 2007, 15:43:54 UTC - in response to Message 575418.

Things are running smoothly here. Thanks for all your efforts at SSL!

Just for your information, things are not running smoothly here. I have four processors that have been waiting on work for over 12 hours and I fear this weekend will be a bust again. Project reports no work?? Any news on this? Or am I the only one that has this problem?


I am also returning work OK but getting the no work from project message.
____________


Proud Founder and member of
Have a look at my WebCam

Profile Carlos
Project donor
Volunteer tester
Joined: 9 Jun 99
Posts: 10507
Credit: 25,957,569
RAC: 2,566
United States
Message 575424 - Posted: 25 May 2007, 15:45:58 UTC - in response to Message 575418.

Things are running smoothly here. Thanks for all your efforts at SSL!

Just for your information, things are not running smoothly here. I have four processors that have been waiting on work for over 12 hours and I fear this weekend will be a bust again. Project reports no work?? Any news on this? Or am I the only one that has this problem?


Not sure where your problem might be, but I have 11 computers up and running. All have a full supply of work units and are working just like they did before the outage. I had changed things during the problem time, such as renaming app_info.xml and detaching then reattaching. But as of Tuesday everything was working as it had before. You might try the detach then reattach.
____________

Profile Rudi
Joined: 20 Jan 00
Posts: 20
Credit: 17,904,159
RAC: 0
Austria
Message 575425 - Posted: 25 May 2007, 15:47:04 UTC - in response to Message 575381.

I have a dim recollection that the "rest of the world" need not use an MTU of 1500. Unfortunately, Microsoft have a very broken implementation that doesn't support fragmentation properly and so the rest of the world is forced to use that magic Microsoft number. Anyone know further?

There is no such thing as an "optimal" MTU, it depends on the protocol(s) used during an internet transmission.
Ethernet frames normally are limited to 1518 bytes. Of those 1518 bytes the ethernet frame header uses 14 bytes, and 4 bytes are used for the FCS (frame check sequence) at the end of the frame. So there remain 1500 bytes for the IP header and the payload of the packet. This is the so-called MTU (maximum transmission unit). In normal LANs all 1500 bytes are available, i.e. your MTU on the LAN is 1500 bytes.
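
As a quick check of the frame arithmetic, here is a small Python sketch using only the byte counts given in this post. The constant and function names are illustrative, and it assumes plain untagged Ethernet (no VLAN tag, no jumbo frames).

```python
# Byte counts for a standard untagged Ethernet frame, as described above.
MAX_FRAME_BYTES = 1518     # largest normal Ethernet frame
FRAME_HEADER_BYTES = 14    # destination MAC + source MAC + EtherType
FCS_BYTES = 4              # frame check sequence at the end of the frame


def ethernet_mtu(max_frame: int = MAX_FRAME_BYTES) -> int:
    """Bytes left for the IP header and payload inside one frame."""
    return max_frame - FRAME_HEADER_BYTES - FCS_BYTES


if __name__ == "__main__":
    print(ethernet_mtu())
```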

If you connect to the internet over a DSL, some additional bytes are used for the protocol your provider requires (PPPoE, L2TP, PPTP etc.). PPPoE e.g. uses 8 bytes, your MTU is then reduced to 1492. The correct MTU normally is adjusted automatically when you connect to your ISP using the LCP protocol.
The "Path MTU" of an internet transmission defines the smallest MTU on any of the hops on the path of the transmission. If your original MTU is larger than the smallest MTU on the path, additional fragmentation will occur.
So even if your ISP gives you the "correct" MTU, it still may be "wrong" for the actual transmission path.
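
To make the path MTU idea concrete, here is a rough Python sketch. The hop list is hypothetical (it is not a trace of the real route to the SETI servers), and the fragment count assumes IPv4 with a 20-byte header and no options, where every fragment except the last carries a multiple of 8 payload bytes.

```python
import math

IPV4_HEADER_BYTES = 20    # IPv4 header without options (assumption)
PPPOE_OVERHEAD_BYTES = 8  # PPPoE overhead quoted in the post above


def path_mtu(hop_mtus):
    """The path MTU is the smallest MTU of any hop on the path."""
    return min(hop_mtus)


def fragment_count(packet_bytes, link_mtu):
    """Number of IPv4 fragments needed to carry one packet over a link."""
    if packet_bytes <= link_mtu:
        return 1
    payload = packet_bytes - IPV4_HEADER_BYTES
    # Fragment payloads (except the last) must be a multiple of 8 bytes.
    per_fragment = (link_mtu - IPV4_HEADER_BYTES) // 8 * 8
    return math.ceil(payload / per_fragment)


if __name__ == "__main__":
    # Hypothetical path: LAN, PPPoE DSL link, a 1476-byte hop, far-end LAN.
    hops = [1500, 1500 - PPPOE_OVERHEAD_BYTES, 1476, 1500]
    smallest = path_mtu(hops)
    print(smallest)
    print(fragment_count(1500, smallest))
```

Under these assumptions, a full 1500-byte packet crossing a 1476-byte hop is split into two fragments, while a packet of 1476 bytes or less passes through unfragmented.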

HTH and sorry for being a wise guy... ;-)




____________
"il faut imaginer Sisyphe heureux", Albert Camus


Copyright © 2014 University of California