| Author |
Message |
Matt Lebofsky (Volunteer moderator, Project administrator, Project developer, Project scientist)
Joined: 1 Mar 99 Posts: 1376 Credit: 74,079 RAC: 0

|
|
Jeff, Eric, and I had our software meeting this morning, which happens every Thursday. As usual we discuss the game plan as far as bringing a new splitter on line, coding conventions for the near time persistency checker, etc. Then something happens to keep us from doing anything on this front.
Today, at least for me and Jeff, it was isaac crashing. This machine is the boinc.berkeley.edu web server, among other things. Short story: lots of CPU errors, rebooting doesn't help, we tried putting in new memory, no sign of overheating. We got it into rescue mode and put in a non-xen kernel. It's been stable for the past 15 minutes. We'll see if that holds. Doubtful. A service call may be in order. There's a DNS redirect pointing to a stub page in the meantime.
We still haven't figured out the magic settings on bruno and ptolemy, so packets are still getting dropped here and there, causing all kinds of headaches near and far. A lot of work is getting sent and results returned, and we're creating a healthy backlog of workunits to send out as I type, but there is still work to be done. I have no insights on ghost workunits outside of what has already been discussed on these boards.
Hmm. Isaac still hasn't crashed, and Jeff is really exercising the system at this point. Maybe it was a bad kernel after all, though not sure why this would have broken all of a sudden (no new kernel has been installed in a while). I'll revert to the original page in 30 minutes or so if we remain up.
- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
|
|
|
|
|
*keeps fingers crossed*
____________
|
|
|
n7rfa (Volunteer tester)
Joined: 13 Apr 04 Posts: 370 Credit: 9,001,976 RAC: 1,306

|
Hmm. Isaac still hasn't crashed, and Jeff is really exercising the system at this point. Maybe it was a bad kernel after all, though not sure why this would have broken all of a sudden (no new kernel has been installed in a while). I'll revert to the original page in 30 minutes or so if we remain up.
- Matt
On an old IBM Mainframe that a former company had, we would occasionally have to recompile a program without making any changes.
When asked why we had to recompile, we would say that "a bit had turned sideways".
____________
|
|
|
|
|
*keeps fingers crossed*
Agreed....
Greetings from Germany NRW
Ulli
|
|
|
|
|
There's a DNS redirect pointing to a stub page in the meantime.
We are experiencing hardware problems on the main BOINC web server. Downloads are currently unavailable at this time.
You do know that currently and at this time is the same thing, right? ;-)
____________
Jord
- BOINC FAQ Service
- BOINC User Wiki
Real is just a matter of perception. |
|
|
|
|
We still haven't figured out the magic settings on bruno and ptolemy, so packets are still getting dropped here and there, causing all kinds of headaches near and far.
Hi Matt, I'm reposting this here for you which I posted in the 'number crunching' forum, in reply to msattler twiddling with his MTU settings...
so I reset the router MTU from 1500 down to 1400. Shouldn't make a difference on my DSL connection, 1500 is recommended by most TCP advisory programs, but what the heck.
Well, my take is that the MTU should be 1500.
It seems that the machine on 208.68.240.16 has an MTU of 1476
i:\program files\boinc>ping 208.68.240.16 -f -l 1450
Pinging 208.68.240.16 with 1450 bytes of data:
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
i:\program files\boinc>ping 208.68.240.16 -f -l 1448
Pinging 208.68.240.16 with 1448 bytes of data:
Reply from 208.68.240.16: bytes=1448 time=192ms TTL=53
Reply from 208.68.240.16: bytes=1448 time=192ms TTL=53
MTU = size + 28, so MTU is 1448+28 = 1476
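For anyone who wants to repeat the arithmetic, here is a minimal Python sketch of the calculation above. The 28 bytes are the 20-byte IPv4 header (assuming no IP options) plus the 8-byte ICMP header. The `probe` callable is hypothetical: it stands in for running `ping -f -l <size>` by hand and noting whether a reply came back, and the simulated probe below is only an illustration, not a real network test.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo header


def mtu_from_ping_payload(payload_bytes):
    """MTU implied by the largest ping payload that passes with DF set."""
    return payload_bytes + IPV4_HEADER + ICMP_HEADER


def largest_unfragmented_payload(probe, low=0, high=1472):
    """Binary search for the largest payload size for which probe() is True.

    probe is a hypothetical callable: probe(size) -> True if a ping with
    that payload and the DF bit set got a reply. Assumes probe(low) is True.
    """
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


# Simulated path with an MTU of 1476, matching the result reported above.
def simulated_probe(size):
    return size + IPV4_HEADER + ICMP_HEADER <= 1476


best = largest_unfragmented_payload(simulated_probe)
print(best, mtu_from_ping_payload(best))
```

Note that the two pings above only bracket the answer between 1448 and 1450; a search like this just automates narrowing it down.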
This seems suspicious. Why is bruno using 1476 when the rest of the world uses 1500? This will almost double the amount of packets to deal with and could explain the serious difficulty I still have in transferring anything from a couple of crunching linux web servers. They are using state based firewalls (default fedora core 6), a side effect is the filtering of incoming state related RST packets, which I thought could be a reason for the extremely poor performance. So, I explicitly allowed 208.68.240.16 through. Even after this, it's still only successful in communicating 1% of the time, and no, I'm not going to change the MTU of 1500 on a production web server!
Andy.
It may help to explain the difference in transfer reliability between machines.
My main windows machine/gateway has no problems connecting/transferring
A masqueraded windows machine on the same network has terrible difficulty connecting/transferring/reporting
Another two production linux web servers also had terrible difficulty - though at present all have magically uploaded results and reported.
All have an MTU of 1500. Fragmentation and firewall state may interfere with rst at the end of a session.
Observation is that data is sent but then times out trying to complete.
Hope this helps,
Andy. |
|
|
|
|
Hmm. Isaac still hasn't crashed, and Jeff is really exercising the system at this point. Maybe it was a bad kernel after all, though not sure why this would have broken all of a sudden (no new kernel has been installed in a while). I'll revert to the original page in 30 minutes or so if we remain up.
- Matt
On an old IBM Mainframe that a former company had, we would occasionally have to recompile a program without making any changes.
When asked why we had to recompile, we would say that "a bit had turned sideways".
For a period of about 3 weeks, I had a 9672 (IBM mainframe) processor all to myself just running linux. This was classic seti, and there was a client. So I ran it.
I don't know how many other people can say they ran seti@home on an IBM mainframe. But I did.
____________
|
|
|
|
|
For a period of about 3 weeks, I had a 9672 (IBM mainframe) processor all to myself just running linux. This was classic seti, and there was a client. So I ran it.
I don't know how many other people can say they ran seti@home on an IBM mainframe. But I did.
To quote Monty Python - You lucky, lucky bastard.
:) |
|
|
|
|
We still haven't figured out the magic settings on bruno and ptolemy, so packets are still getting dropped here and there, causing all kinds of headaches near and far.
Hi Matt, I'm reposting this here for you which I posted in the 'number crunching' forum, in reply to msattler twiddling with his MTU settings...
so I reset the router MTU from 1500 down to 1400. Shouldn't make a difference on my DSL connection, 1500 is recommended by most TCP advisory programs, but what the heck.
Well, my take is that the MTU should be 1500.
It seems that the machine on 208.68.240.16 has an MTU of 1476
i:\program files\boinc>ping 208.68.240.16 -f -l 1450
Pinging 208.68.240.16 with 1450 bytes of data:
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
i:\program files\boinc>ping 208.68.240.16 -f -l 1448
Pinging 208.68.240.16 with 1448 bytes of data:
Reply from 208.68.240.16: bytes=1448 time=192ms TTL=53
Reply from 208.68.240.16: bytes=1448 time=192ms TTL=53
MTU = size + 28, so MTU is 1448+28 = 1476
This seems suspicious. Why is bruno using 1476 when the rest of the world uses 1500? This will almost double the amount of packets to deal with and could explain the serious difficulty I still have in transferring anything from a couple of crunching linux web servers. They are using state based firewalls (default fedora core 6), a side effect is the filtering of incoming state related RST packets, which I thought could be a reason for the extremely poor performance. So, I explicitly allowed 208.68.240.16 through. Even after this, it's still only successful in communicating 1% of the time, and no, I'm not going to change the MTU of 1500 on a production web server!
Andy.
It may help to explain the difference in transfer reliability between machines.
My main windows machine/gateway has no problems connecting/transferring
A masqueraded windows machine on the same network has terrible difficulty connecting/transferring/reporting
Another two production linux web servers also had terrible difficulty - though at present all have magically uploaded results and reported.
All have an MTU of 1500. Fragmentation and firewall state may interfere with rst at the end of a session.
Observation is that data is sent but then times out trying to complete.
Hope this helps,
Andy.
I posted the question back in NC, but I'll post it here also. What do I do with my router's MTU at this point?? Go back to 1500? Go to 1476? What's gonna optimize (some of us are just fixated on that 'O' word, aren't we) communications with Seti?
____________
******
"Ask not, what your kitty can do for you. Ask what you can do for your kitty."
As it is kitten, so shall it be done.
|
|
|
|
|
|
Things are running smoothly here. Thanks for all your efforts at SSL!
One very curious thing happened two days ago, though: a number of my boxes spawned one - and in one case two - new computer IDs, received a few units under that ID, and then continued crunching with the original computer ID. This all happened without any interference at all on my part.
Has anyone seen such behavior before? I certainly have not...
I'm using BOINC 5.4.9 CLI on a variety of Macs with custom workers, i.e. the app_info.xml work-around is currently in place on all machines in question.
Cheers,
Ron |
|
|
|
|
|
What a nasty month!
I hope nothing more explodes at the lab!
Keep on with this good job, guys!
____________
|
|
|
|
|
Things are running smoothly here. Thanks for all your efforts at SSL!
One very curious thing happened two days ago, though: a number of my boxes spawned one - and in one case two - new computer IDs, received a few units under that ID, and then continued crunching with the original computer ID. This all happened without any interference at all on my part.
Has anyone seen such behavior before? I certainly have not...
I'm using BOINC 5.4.9 CLI on a variety of Macs with custom workers, i.e. the app_info.xml work-around is currently in place on all machines in question.
Cheers,
Ron
Hallo,
On 20.05.07 one of my WinXP Core2 machines got a new ID with 20 workunits, but it's only a ghost machine with ghost results...
Thomas
____________
|
|
|
|
|
Things are running smoothly here. Thanks for all your efforts at SSL!
One very curious thing happened two days ago, though: a number of my boxes spawned one - and in one case two - new computer IDs, received a few units under that ID, and then continued crunching with the original computer ID. This all happened without any interference at all on my part.
Has anyone seen such behavior before? I certainly have not...
I'm using BOINC 5.4.9 CLI on a variety of Macs with custom workers, i.e. the app_info.xml work-around is currently in place on all machines in question.
Cheers,
Ron
I had the same thing happen several months ago. I just let it go and when the ghost results expired I removed the ghost computer ID. Kind of disturbing how all these ghosts pop up!
____________
Boinc....Boinc....Boinc....Boinc.... |
|
|
ML1 (Volunteer tester)
Joined: 25 Nov 01 Posts: 7210 Credit: 3,703,390 RAC: 728

|
...This seems suspicious. Why is bruno using 1476 when the rest of the world uses 1500? This will almost double the amount of packets to deal with and could explain the serious difficulty...
It may help to explain the difference in transfer reliability between machines.
My main windows machine/gateway has no problems connecting/transferring
A masqueraded windows machine on the same network has terrible difficulty connecting/transferring/reporting
Another two production linux web servers also had terrible difficulty - though at present all have magically uploaded results and reported...
I have a dim recollection that the "rest of the world" need not use an MTU of 1500. Unfortunately, Microsoft have a very broken implementation that doesn't support fragmentation properly and so the rest of the world is forced to use that magic Microsoft number. Anyone know further?
Aside: As networks get faster, we should be using much larger MTUs...
It will be interesting if part of the problem is indeed packet size and fragmentation problems...
Regards,
Martin
____________
Mandriva Linux A user friendly OS!
See new freedom Mageia2
The Future is what We make IT (GPLv3) |
|
|
|
|
I posted the question back in NC, but I'll post it here also. What do I do with my router's MTU at this point?? Go back to 1500? Go to 1476? What's gonna optimize (some of us are just fixated on that 'O' word, aren't we) communications with Seti?
You could try www.speedguide.net; the TCP/IP Analyzer link will tell you what your MTU, MSS, RWIN and other values look like. If you are behind a firewall, port 8080 is needed. Don't forget to close the port again after analyzing.
____________
_\|/_
U r s |
|
|
|
|
|
Things are running smoothly here. Thanks for all your efforts at SSL!
Just for your information, things are not running smoothly here. I have four processors that have been waiting on work for over 12 hours, and I fear this weekend will be a bust again. Project reports no work?? Any news on this? Or am I the only one that has this problem? |
|
|
|
|
|
Hi,
You're not alone on this one.
There's not much activity on the SETI network as can be seen here on this graph:
http://fragment1.berkeley.edu/newcricket/grapher.cgi?target=%2Frouter-interfaces%2Finr-250%2Fgigabitethernet2_3;view=Octets;ranges=d
It's early morning in California - so they will look into the problem soon. ;-)
Kiva
____________
Greetings from Norway
Crunch3er & AK-V8 Inside |
|
|
Dingo (Volunteer tester)
Joined: 28 Jun 99 Posts: 96 Credit: 3,222,513 RAC: 22,814

|
Things are running smoothly here. Thanks for all your efforts at SSL!
Just for your information, things are not running smoothly here. I have four processors that have been waiting on work for over 12 hours, and I fear this weekend will be a bust again. Project reports no work?? Any news on this? Or am I the only one that has this problem?
I am also returning work OK but getting the no work from project message.
____________
Proud Founder and member of BOINC@AUSTRALIA
Have a look at my webcam |
|
|
Carlos (Volunteer tester)
Joined: 9 Jun 99 Posts: 9750 Credit: 24,674,778 RAC: 4,408

|
Things are running smoothly here. Thanks for all your efforts at SSL!
Just for your information, things are not running smoothly here. I have four processors that have been waiting on work for over 12 hours, and I fear this weekend will be a bust again. Project reports no work?? Any news on this? Or am I the only one that has this problem?
Not sure where your problem might be, but I have 11 computers up and running. All have a full supply of work units and are working just like they did before the outage. I had changed things during the problem time, such as renaming app_info.xml and detaching then reattaching. But as of Tuesday everything was working as it had before. You might try the detach then reattach.
____________
|
|
|
|
|
I have a dim recollection that the "rest of the world" need not use an MTU of 1500. Unfortunately, Microsoft have a very broken implementation that doesn't support fragmentation properly and so the rest of the world is forced to use that magic Microsoft number. Anyone know further?
There is no such thing as an "optimal" MTU, it depends on the protocol(s) used during an internet transmission.
Ethernet frames are normally limited to 1518 bytes. Of those 1518 bytes, the Ethernet frame header uses 14 bytes and 4 bytes are used for the FCS (frame check sequence) at the end of the frame. That leaves 1500 bytes for the IP packet, i.e. the IP header plus the payload. This is the so-called MTU (maximum transmission unit). On a normal LAN all 1500 bytes are available, i.e. your MTU on the LAN is 1500 bytes.
If you connect to the internet over DSL, some additional bytes are used by the protocol your provider requires (PPPoE, L2TP, PPTP etc.). PPPoE, for example, uses 8 bytes, so your MTU is reduced to 1492. The correct MTU is normally negotiated automatically via the LCP protocol when you connect to your ISP.
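The byte counting above can be written out as a small Python sketch. The constant and function names are my own, chosen only for illustration; the numbers are the ones given in this post (1518-byte frame, 14-byte header, 4-byte FCS, 8 bytes for PPPoE).

```python
ETHERNET_MAX_FRAME = 1518  # bytes, standard untagged Ethernet frame
ETHERNET_HEADER = 14       # destination MAC + source MAC + EtherType
ETHERNET_FCS = 4           # frame check sequence at the end of the frame
PPPOE_OVERHEAD = 8         # bytes taken by PPPoE, as stated above


def ethernet_mtu():
    """Bytes left for the IP packet inside one Ethernet frame."""
    return ETHERNET_MAX_FRAME - ETHERNET_HEADER - ETHERNET_FCS


def mtu_after_encapsulation(base_mtu, overhead):
    """MTU left once a tunnelling/encapsulation layer takes its share."""
    return base_mtu - overhead


lan_mtu = ethernet_mtu()
dsl_mtu = mtu_after_encapsulation(lan_mtu, PPPOE_OVERHEAD)
print(lan_mtu, dsl_mtu)
```

Other encapsulations (L2TP, PPTP) would just plug in a different overhead value; I have not filled those in because the exact figure depends on the setup.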
The "Path MTU" of an internet transmission is the smallest MTU of any hop on the path of the transmission. If your original MTU is larger than the smallest MTU on the path, additional fragmentation will occur.
So even if your ISP gives you the "correct" MTU, it still may be "wrong" for the actual transmission path.
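To put rough numbers on that, here is a hedged Python sketch of the path MTU and of how many IPv4 fragments a full-size packet turns into. The hop list is made up for illustration. It assumes a 20-byte IPv4 header with no options and that the DF bit is not set (with DF set, the packet is dropped instead, as in the ping test quoted earlier in the thread).

```python
import math

IPV4_HEADER = 20  # bytes, assuming no IP options


def path_mtu(hop_mtus):
    """The path MTU is the smallest MTU of any hop on the path."""
    return min(hop_mtus)


def fragment_count(packet_size, mtu):
    """Number of IPv4 fragments needed to carry one packet across a link.

    packet_size includes the IPv4 header. Every fragment except the last
    must carry a payload that is a multiple of 8 bytes.
    """
    if packet_size <= mtu:
        return 1
    payload = packet_size - IPV4_HEADER
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    return math.ceil(payload / per_fragment)


hops = [1500, 1492, 1476, 1500]  # hypothetical hop MTUs along one path
smallest = path_mtu(hops)
print(smallest, fragment_count(1500, smallest))
```

With these example hops, a full 1500-byte packet no longer fits and is split in two, which is where the "almost double the packets" concern comes from for full-size packets.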
HTH and sorry for being a wise guy... ;-)
____________
"il faut imaginer Sisyphe heureux", Albert Camus |
|
|