Chaos at the Greasy Spoon (May 24 2007)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 575023 - Posted: 24 May 2007, 21:23:41 UTC
Last modified: 24 May 2007, 21:23:59 UTC

Jeff, Eric, and I had our software meeting this morning, which happens every Thursday. As usual we discuss the game plan as far as bringing a new splitter on line, coding conventions for the near time persistency checker, etc. Then something happens to keep us from doing anything on this front.

Today, at least for me and Jeff, it was isaac crashing. This machine is the boinc.berkeley.edu web server, among other things. Short story: lots of CPU errors, rebooting doesn't help, we tried putting in new memory, no sign of overheating. We got it into rescue mode and put in a non-Xen kernel. It's been stable for the past 15 minutes. We'll see if that holds. Doubtful. A service call may be in order. There's a DNS redirect pointing to a stub page in the meantime.

We still haven't figured out the magic settings on bruno and ptolemy, so packets are still getting dropped here and there, causing all kinds of headaches near and far. A lot of work is getting sent and results returned, and we're creating a healthy backlog of workunits to send out as I type, but there is still work to be done. I have no insights on ghost workunits outside of what has already been discussed on these boards.

Hmm. Isaac still hasn't crashed, and Jeff is really exercising the system at this point. Maybe it was a bad kernel after all, though not sure why this would have broken all of a sudden (no new kernel has been installed in a while). I'll revert to the original page in 30 minutes or so if we remain up.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 575023
ajinbc
Joined: 15 Mar 06
Posts: 484
Credit: 318,444
RAC: 0
Canada
Message 575027 - Posted: 24 May 2007, 21:41:49 UTC

*keeps fingers crossed*
ID: 575027
n7rfa
Volunteer tester
Joined: 13 Apr 04
Posts: 370
Credit: 9,058,599
RAC: 0
United States
Message 575046 - Posted: 24 May 2007, 22:31:42 UTC - in response to Message 575023.  
Last modified: 24 May 2007, 22:32:02 UTC

Hmm. Isaac still hasn't crashed, and Jeff is really exercising the system at this point. Maybe it was a bad kernel after all, though not sure why this would have broken all of a sudden (no new kernel has been installed in a while). I'll revert to the original page in 30 minutes or so if we remain up.

- Matt


On an old IBM Mainframe that a former company had, we would occasionally have to recompile a program without making any changes.

When asked why we had to recompile, we would say that "a bit had turned sideways".
ID: 575046
Sir Ulli
Volunteer tester
Joined: 21 Oct 99
Posts: 2246
Credit: 6,136,250
RAC: 0
Germany
Message 575047 - Posted: 24 May 2007, 22:36:42 UTC - in response to Message 575027.  

*keeps fingers crossed*


Agreed....

Greetings from Germany NRW
Ulli


ID: 575047
Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 575062 - Posted: 24 May 2007, 23:41:09 UTC - in response to Message 575023.  

There's a DNS redirect pointing to a stub page in the meantime.

We are experiencing hardware problems on the main BOINC web server. Downloads are currently unavailable at this time.

You do know that "currently" and "at this time" are the same thing, right? ;-)
ID: 575062
Andy Lee Robinson
Joined: 8 Dec 05
Posts: 630
Credit: 59,973,836
RAC: 0
Hungary
Message 575086 - Posted: 25 May 2007, 0:36:08 UTC - in response to Message 575023.  
Last modified: 25 May 2007, 0:38:43 UTC

We still haven't figured out the magic settings on bruno and ptolemy, so packets are still getting dropped here and there, causing all kinds of headaches near and far.


Hi Matt, I'm reposting here for you something I posted in the 'number crunching' forum, in reply to msattler twiddling with his MTU settings...

so I reset the router MTU from 1500 down to 1400. Shouldn't make a difference on my DSL connection, 1500 is recommended by most TCP advisory programs, but what the heck.


Well, my take is that the MTU should be 1500.
It seems that the machine on 208.68.240.16 has an MTU of 1476

i:\program files\boinc>ping 208.68.240.16 -f -l 1450
Pinging 208.68.240.16 with 1450 bytes of data:
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.


i:\program files\boinc>ping 208.68.240.16 -f -l 1448
Pinging 208.68.240.16 with 1448 bytes of data:
Reply from 208.68.240.16: bytes=1448 time=192ms TTL=53
Reply from 208.68.240.16: bytes=1448 time=192ms TTL=53


MTU = size + 28, so MTU is 1448+28 = 1476
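
The arithmetic above can be sketched in a few lines of Python. The 28 bytes are the 20-byte IPv4 header (assuming no IP options) plus the 8-byte ICMP echo header that ping wraps around the payload given with -l. The function name is made up for illustration, and since only payloads of 1448 and 1450 appear in the output above, the result is the MTU implied by the largest payload that was seen to pass.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header, assuming no options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def mtu_from_ping_payload(largest_passing_payload: int) -> int:
    """MTU implied by the largest ping payload that got through with DF set."""
    return largest_passing_payload + IPV4_HEADER_BYTES + ICMP_HEADER_BYTES


# 1448 is the largest payload that got a reply in the ping output above.
print(mtu_from_ping_payload(1448))  # 1476
```

By the same formula, a path with a full 1500-byte MTU would pass a payload of up to 1472 bytes.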

This seems suspicious. Why is bruno using 1476 when the rest of the world uses 1500? This will almost double the number of packets to deal with and could explain the serious difficulty I still have in transferring anything from a couple of crunching Linux web servers. They are using state-based firewalls (the Fedora Core 6 default); a side effect is the filtering of incoming state-related RST packets, which I thought could be a reason for the extremely poor performance. So I explicitly allowed 208.68.240.16 through. Even after this, it's still only successful in communicating 1% of the time, and no, I'm not going to change the MTU of 1500 on a production web server!

Andy.


It may help to explain the difference in transfer reliability between machines.
My main Windows machine/gateway has no problems connecting/transferring.
A masqueraded Windows machine on the same network has terrible difficulty connecting/transferring/reporting.
Another two production Linux web servers also had terrible difficulty, though at present all have magically uploaded results and reported.

All have an MTU of 1500. Fragmentation and firewall state may interfere with the RST at the end of a session.
My observation is that data is sent, but the transfer then times out trying to complete.

Hope this helps,
Andy.
ID: 575086
Admiral Marith
Joined: 3 Jun 99
Posts: 25
Credit: 24,564,328
RAC: 0
United States
Message 575179 - Posted: 25 May 2007, 4:19:28 UTC - in response to Message 575046.  

Hmm. Isaac still hasn't crashed, and Jeff is really exercising the system at this point. Maybe it was a bad kernel after all, though not sure why this would have broken all of a sudden (no new kernel has been installed in a while). I'll revert to the original page in 30 minutes or so if we remain up.

- Matt


On an old IBM Mainframe that a former company had, we would occasionally have to recompile a program without making any changes.

When asked why we had to recompile, we would say that "a bit had turned sideways".


For a period of about 3 weeks, I had a 9672 (IBM mainframe) processor all to myself just running Linux. This was classic seti, and there was a client. So I ran it.

I don't know how many other people can say they ran seti@home on an IBM mainframe. But I did.

ID: 575179
EVE: Retsil Evad
Joined: 3 Apr 99
Posts: 7
Credit: 905,487
RAC: 0
New Zealand
Message 575212 - Posted: 25 May 2007, 7:01:11 UTC - in response to Message 575179.  


For a period of about 3 weeks, I had a 9672 (IBM mainframe) processor all to myself just running Linux. This was classic seti, and there was a client. So I ran it.

I don't know how many other people can say they ran seti@home on an IBM mainframe. But I did.

To quote Monty Python - You lucky, lucky bastard.

:)
ID: 575212
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 575215 - Posted: 25 May 2007, 7:06:28 UTC - in response to Message 575086.  

We still haven't figured out the magic settings on bruno and ptolemy, so packets are still getting dropped here and there, causing all kinds of headaches near and far.


Hi Matt, I'm reposting here for you something I posted in the 'number crunching' forum, in reply to msattler twiddling with his MTU settings...

so I reset the router MTU from 1500 down to 1400. Shouldn't make a difference on my DSL connection, 1500 is recommended by most TCP advisory programs, but what the heck.


Well, my take is that the MTU should be 1500.
It seems that the machine on 208.68.240.16 has an MTU of 1476

i:\program files\boinc>ping 208.68.240.16 -f -l 1450
Pinging 208.68.240.16 with 1450 bytes of data:
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.


i:\program files\boinc>ping 208.68.240.16 -f -l 1448
Pinging 208.68.240.16 with 1448 bytes of data:
Reply from 208.68.240.16: bytes=1448 time=192ms TTL=53
Reply from 208.68.240.16: bytes=1448 time=192ms TTL=53


MTU = size + 28, so MTU is 1448+28 = 1476

This seems suspicious. Why is bruno using 1476 when the rest of the world uses 1500? This will almost double the number of packets to deal with and could explain the serious difficulty I still have in transferring anything from a couple of crunching Linux web servers. They are using state-based firewalls (the Fedora Core 6 default); a side effect is the filtering of incoming state-related RST packets, which I thought could be a reason for the extremely poor performance. So I explicitly allowed 208.68.240.16 through. Even after this, it's still only successful in communicating 1% of the time, and no, I'm not going to change the MTU of 1500 on a production web server!

Andy.


It may help to explain the difference in transfer reliability between machines.
My main Windows machine/gateway has no problems connecting/transferring.
A masqueraded Windows machine on the same network has terrible difficulty connecting/transferring/reporting.
Another two production Linux web servers also had terrible difficulty, though at present all have magically uploaded results and reported.

All have an MTU of 1500. Fragmentation and firewall state may interfere with the RST at the end of a session.
My observation is that data is sent, but the transfer then times out trying to complete.

Hope this helps,
Andy.


I posted the question back in NC, but I'll post it here also. What do I do with my router's MTU at this point?? Go back to 1500? Go to 1476? What's gonna optimize (some of us are just fixated on that 'O' word, aren't we) communications with Seti?

"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 575215
Wasabi Peanut
Joined: 14 Jul 99
Posts: 62
Credit: 32,646,911
RAC: 0
Switzerland
Message 575216 - Posted: 25 May 2007, 7:13:42 UTC

Things are running smoothly here. Thanks for all your efforts at SSL!

One very curious thing happened two days ago, though: a number of my boxes spawned one - and in one case two - new computer IDs, received a few units under that ID, and then continued crunching with the original computer ID. This all happened without any interference at all on my part.

Has anyone seen such behavior before? I certainly have not...

I'm using BOINC 5.4.9 CLI on a variety of Macs with custom workers, i.e. the app_info.xml work-around is currently in place on all machines in question.

Cheers,

Ron
ID: 575216
TarracoServer
Volunteer tester
Joined: 11 Apr 07
Posts: 38
Credit: 595,022
RAC: 0
Spain
Message 575218 - Posted: 25 May 2007, 7:24:32 UTC

What a nasty month!
I hope nothing else explodes in the lab!

Keep up the good work, guys!
ID: 575218
tpl
Joined: 12 Nov 03
Posts: 461
Credit: 243,368,408
RAC: 14
Germany
Message 575227 - Posted: 25 May 2007, 7:51:08 UTC - in response to Message 575216.  

Things are running smoothly here. Thanks for all your efforts at SSL!

One very curious thing happened two days ago, though: a number of my boxes spawned one - and in one case two - new computer IDs, received a few units under that ID, and then continued crunching with the original computer ID. This all happened without any interference at all on my part.

Has anyone seen such behavior before? I certainly have not...

I'm using BOINC 5.4.9 CLI on a variety of Macs with custom workers, i.e. the app_info.xml work-around is currently in place on all machines in question.

Cheers,

Ron


Hello,
On 20.05.07 one of my WinXP Core 2 machines got a new ID with 20 workunits, but it's only a ghost machine with ghost results...
Thomas
ID: 575227
Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 575354 - Posted: 25 May 2007, 12:41:07 UTC - in response to Message 575216.  

Things are running smoothly here. Thanks for all your efforts at SSL!

One very curious thing happened two days ago, though: a number of my boxes spawned one - and in one case two - new computer IDs, received a few units under that ID, and then continued crunching with the original computer ID. This all happened without any interference at all on my part.

Has anyone seen such behavior before? I certainly have not...

I'm using BOINC 5.4.9 CLI on a variety of Macs with custom workers, i.e. the app_info.xml work-around is currently in place on all machines in question.

Cheers,

Ron


I had the same thing happen several months ago. I just let it go and when the ghost results expired I removed the ghost computer ID. Kind of disturbing how all these ghosts pop up!



Boinc....Boinc....Boinc....Boinc....
ID: 575354
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20140
Credit: 7,508,002
RAC: 20
United Kingdom
Message 575381 - Posted: 25 May 2007, 14:24:35 UTC - in response to Message 575086.  

...This seems suspicious. Why is bruno using 1476 when the rest of the world uses 1500? This will almost double the amount of packets to deal with and could explain the serious difficulty...

It may help to explain the difference in transfer reliability between machines.
My main windows machine/gateway has no problems connecting/transferring
A masqueraded windows machine on the same network has terrible difficulty connecting/transferring/reporting
Another two production linux web servers also had terrible difficulty - though at present all have magically uploaded results and reported...

I have a dim recollection that the "rest of the world" need not use an MTU of 1500. Unfortunately, Microsoft have a very broken implementation that doesn't support fragmentation properly and so the rest of the world is forced to use that magic Microsoft number. Anyone know further?

Aside: As networks get faster, we should be using much larger MTUs...

It will be interesting if part of the problem is indeed packet size and fragmentation problems...

Regards,
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 575381
Urs Echternacht
Volunteer tester
Joined: 15 May 99
Posts: 692
Credit: 135,197,781
RAC: 211
Germany
Message 575384 - Posted: 25 May 2007, 14:32:09 UTC - in response to Message 575215.  


I posted the question back in NC, but I'll post it here also. What do I do with my router's MTU at this point?? Go back to 1500? Go to 1476? What's gonna optimize (some of us are just fixated on that 'O' word, aren't we) communications with Seti?

You could try www.speedguide.net; the TCP/IP Analyzer link will tell you what your MTU, MSS, RWIN and other values look like. If you are behind a firewall, port 8080 is needed. Don't forget to close the port again after analyzing.
_\|/_
U r s
ID: 575384
B0BHILL
Joined: 19 Jul 03
Posts: 23
Credit: 203,166
RAC: 0
United States
Message 575418 - Posted: 25 May 2007, 15:35:21 UTC - in response to Message 575354.  

Things are running smoothly here. Thanks for all your efforts at SSL!



Just for your information, things are not running smoothly here. I have four processors that have been waiting for work for over 12 hours, and I fear this weekend will be a bust again. The project reports no work?? Any news on this? Or am I the only one who has this problem?
ID: 575418
Kim Vater
Volunteer tester
Joined: 27 May 99
Posts: 227
Credit: 22,743,307
RAC: 0
Norway
Message 575422 - Posted: 25 May 2007, 15:43:20 UTC

Hi,

You're not alone on this one.

There's not much activity on the SETI network as can be seen here on this graph:
http://fragment1.berkeley.edu/newcricket/grapher.cgi?target=%2Frouter-interfaces%2Finr-250%2Fgigabitethernet2_3;view=Octets;ranges=d

It's early morning in California, so they will look into the problem soon. ;-)

Kiva
Greetings from Norway

Crunch3er & AK-V8 Inside
ID: 575422
Dingo
Volunteer tester
Joined: 28 Jun 99
Posts: 104
Credit: 16,364,896
RAC: 1
Australia
Message 575423 - Posted: 25 May 2007, 15:43:54 UTC - in response to Message 575418.  

Things are running smoothly here. Thanks for all your efforts at SSL!



Just for your information, things are not running smoothly here. I have four processors that have been waiting for work for over 12 hours, and I fear this weekend will be a bust again. The project reports no work?? Any news on this? Or am I the only one who has this problem?


I am also returning work OK but getting the "no work from project" message.

ID: 575423
Carlos
Volunteer tester
Joined: 9 Jun 99
Posts: 29752
Credit: 57,275,487
RAC: 157
United States
Message 575424 - Posted: 25 May 2007, 15:45:58 UTC - in response to Message 575418.  

Things are running smoothly here. Thanks for all your efforts at SSL!



Just for your information, things are not running smoothly here. I have four processors that have been waiting for work for over 12 hours, and I fear this weekend will be a bust again. The project reports no work?? Any news on this? Or am I the only one who has this problem?


Not sure where your problem might be, but I have 11 computers up and running. All have a full supply of work units and are working just like they did before the outage. I had changed things during the problem time, such as renaming app_info.xml and detaching then reattaching. But as of Tuesday everything was working as it had before. You might try detaching and then reattaching.
ID: 575424
Rudi
Joined: 20 Jan 00
Posts: 20
Credit: 17,904,159
RAC: 0
Austria
Message 575425 - Posted: 25 May 2007, 15:47:04 UTC - in response to Message 575381.  

I have a dim recollection that the "rest of the world" need not use an MTU of 1500. Unfortunately, Microsoft have a very broken implementation that doesn't support fragmentation properly and so the rest of the world is forced to use that magic Microsoft number. Anyone know further?

There is no such thing as an "optimal" MTU; it depends on the protocol(s) used during an internet transmission.
Ethernet frames are normally limited to 1518 bytes. Of those 1518 bytes, the Ethernet header uses 14 bytes and 4 bytes are used for the FCS (frame check sequence) at the end of the frame. That leaves 1500 bytes for the IP header and the payload of the packet. This is the so-called MTU (maximum transmission unit). In normal LANs all 1500 bytes are available, i.e. your MTU on the LAN is 1500 bytes.
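
As a rough illustration of that subtraction, here is a small Python sketch. The constant and function names are invented for this example; the byte counts are the ones quoted above for a standard untagged Ethernet frame.

```python
ETHERNET_MAX_FRAME_BYTES = 1518  # largest standard untagged Ethernet frame
ETHERNET_HEADER_BYTES = 14       # destination MAC (6) + source MAC (6) + EtherType (2)
ETHERNET_FCS_BYTES = 4           # frame check sequence at the end of the frame


def ethernet_mtu(max_frame: int = ETHERNET_MAX_FRAME_BYTES,
                 header: int = ETHERNET_HEADER_BYTES,
                 fcs: int = ETHERNET_FCS_BYTES) -> int:
    """Bytes left for the IP packet (IP header plus payload) in one frame."""
    return max_frame - header - fcs


print(ethernet_mtu())  # 1500
```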

If you connect to the internet over DSL, some additional bytes are used for the protocol your provider requires (PPPoE, L2TP, PPTP etc.). PPPoE, for example, uses 8 bytes, so your MTU is reduced to 1492. The correct MTU is normally negotiated automatically via LCP when you connect to your ISP.
The "Path MTU" of an internet transmission is the smallest MTU on any of the hops along the path. If your original MTU is larger than the smallest MTU on the path, additional fragmentation will occur.
So even if your ISP gives you the "correct" MTU, it may still be "wrong" for the actual transmission path.
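
To make the path MTU idea concrete, here is a minimal Python sketch, not a description of how any real stack is implemented. The hop list is hypothetical: a 1500-byte LAN, a PPPoE link at 1500 - 8, and a 1476-byte hop like the one measured earlier in this thread. The function names are made up, and the fragment count assumes a 20-byte IPv4 header with no options and the IPv4 rule that every fragment except the last carries a multiple of 8 data bytes.

```python
import math

IPV4_HEADER_BYTES = 20    # IPv4 header, assuming no options
PPPOE_OVERHEAD_BYTES = 8  # PPPoE overhead quoted above


def path_mtu(hop_mtus):
    """The path MTU is the smallest MTU of any hop on the path."""
    return min(hop_mtus)


def fragment_count(packet_size, mtu, header=IPV4_HEADER_BYTES):
    """Number of IPv4 fragments needed to send one packet over a link."""
    if packet_size <= mtu:
        return 1
    # Data bytes per fragment, rounded down to a multiple of 8.
    per_fragment = (mtu - header) // 8 * 8
    if per_fragment <= 0:
        raise ValueError("MTU too small to carry any payload")
    return math.ceil((packet_size - header) / per_fragment)


# Hypothetical path: LAN, PPPoE link, and a 1476-byte hop.
hops = [1500, 1500 - PPPOE_OVERHEAD_BYTES, 1476]
print(path_mtu(hops))                        # 1476
print(fragment_count(1500, path_mtu(hops)))  # 2
```

In this sketch a full 1500-byte packet no longer fits the 1476-byte hop, so it is split into two fragments, while a packet of 1476 bytes or less goes through whole.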

HTH and sorry for being a wise guy... ;-)




"il faut imaginer Sisyphe heureux", Albert Camus
ID: 575425