Chaos at the Greasy Spoon (May 24 2007)
Author | Message |
---|---|
Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Jeff, Eric, and I had our software meeting this morning, which happens every Thursday. As usual we discuss the game plan as far as bringing a new splitter on line, coding conventions for the near time persistency checker, etc. Then something happens to keep us from doing anything on this front. Today, at least for me and Jeff, it was isaac crashing. This machine is the boinc.berkeley.edu web server, among other things. Short story: lots of CPU errors, rebooting doesn't help, we tried putting in new memory, no sign of overheating. We got it into rescue mode and put in a non-xen kernel. It's been stable for the past 15 minutes. We'll see if that holds. Doubtful. A service call may be in order. There's a DNS redirect pointing to a stub page in the meantime. We still haven't figured out the magic settings on bruno and ptolemy, so packets are still getting dropped here and there, causing all kinds of headaches near and far. A lot of work is getting sent and results returned, and we're creating a healthy backlog of workunits to send out as I type, but there is still work to be done. I have no insights on ghost workunits outside of what has already been discussed on these boards. Hmm. Isaac still hasn't crashed, and Jeff is really exercising the system at this point. Maybe it was a bad kernel after all, though I'm not sure why this would have broken all of a sudden (no new kernel has been installed in a while). I'll revert to the original page in 30 minutes or so if we remain up. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Joined: 15 Mar 06 Posts: 484 Credit: 318,444 RAC: 0 |
*keeps fingers crossed* |
n7rfa Joined: 13 Apr 04 Posts: 370 Credit: 9,058,599 RAC: 0 |
Hmm. Isaac still hasn't crashed, and Jeff is really exercising the system at this point. Maybe it was a bad kernel after all, though not sure why this would have broken all of a sudden (no new kernel has been installed in a while). I'll revert to the original page in 30 minutes or so if we remain up. On an old IBM mainframe that a former company had, we would occasionally have to recompile a program without making any changes. When asked why we had to recompile, we would say that "a bit had turned sideways". |
Joined: 21 Oct 99 Posts: 2246 Credit: 6,136,250 RAC: 0 |
|
Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
There's a DNS redirect pointing to a stub page in the meantime. "We are experiencing hardware problems on the main BOINC web server. Downloads are currently unavailable at this time." You do know that "currently" and "at this time" are the same thing, right? ;-) |
Joined: 8 Dec 05 Posts: 630 Credit: 59,973,836 RAC: 0 |
We still haven't figured out the magic settings on bruno and ptolemy, so packets are still getting dropped here and there, causing all kinds of headaches near and far. Hi Matt, I'm reposting here for you something I posted in the 'number crunching' forum, in reply to msattler twiddling with his MTU settings: "...so I reset the router MTU from 1500 down to 1400. Shouldn't make a difference on my DSL connection, 1500 is recommended by most TCP advisory programs, but what the heck." It may help to explain the difference in transfer reliability between machines. My main Windows machine/gateway has no problems connecting/transferring. A masqueraded Windows machine on the same network has terrible difficulty connecting/transferring/reporting. Another two production Linux web servers also had terrible difficulty, though at present all have magically uploaded results and reported. All have an MTU of 1500. Fragmentation and firewall state may interfere with the RST at the end of a session. My observation is that data is sent but the connection then times out trying to complete. Hope this helps, Andy. |
Admiral Marith Joined: 3 Jun 99 Posts: 25 Credit: 24,564,328 RAC: 0 |
Hmm. Isaac still hasn't crashed, and Jeff is really exercising the system at this point. Maybe it was a bad kernel after all, though not sure why this would have broken all of a sudden (no new kernel has been installed in a while). I'll revert to the original page in 30 minutes or so if we remain up. For a period of about 3 weeks, I had a 9672 (IBM mainframe) processor all to myself, just running Linux. This was classic SETI, and there was a client. So I ran it. I don't know how many other people can say they ran SETI@home on an IBM mainframe. But I did. |
EVE: Retsil Evad Joined: 3 Apr 99 Posts: 7 Credit: 905,487 RAC: 0 |
To quote Monty Python: "You lucky, lucky bastard." :) |
kittyman Joined: 9 Jul 00 Posts: 51507 Credit: 1,018,363,574 RAC: 1,004 |
We still haven't figured out the magic settings on bruno and ptolemy, so packets are still getting dropped here and there, causing all kinds of headaches near and far. I posted the question back in NC, but I'll post it here also. What do I do with my router's MTU at this point?? Go back to 1500? Go to 1476? What's gonna optimize (some of us are just fixated on that 'O' word, aren't we) communications with SETI? "Time is simply the mechanism that keeps everything from happening all at once." |
Wasabi Peanut Joined: 14 Jul 99 Posts: 62 Credit: 32,646,911 RAC: 0 |
Things are running smoothly here. Thanks for all your efforts at SSL! One very curious thing happened two days ago, though: a number of my boxes spawned one - and in one case two - new computer IDs, received a few units under that ID, and then continued crunching with the original computer ID. This all happened without any interference at all on my part. Has anyone seen such behavior before? I certainly have not... I'm using BOINC 5.4.9 CLI on a variety of Macs with custom workers, i.e. the app_info.xml work-around is currently in place on all machines in question. Cheers, Ron |
TarracoServer Joined: 11 Apr 07 Posts: 38 Credit: 595,022 RAC: 0 |
What a nasty month! I hope nothing more explodes in the lab! Keep up the good work, guys! |
Joined: 12 Nov 03 Posts: 461 Credit: 243,368,408 RAC: 14 |
Things are running smoothly here. Thanks for all your efforts at SSL! Hello, on 20.05.07 one of my WinXP Core 2 machines got a new ID with 20 workunits, but it's only a ghost machine with ghost results... Thomas |
Joined: 31 Jul 01 Posts: 2467 Credit: 86,146,931 RAC: 0 |
Things are running smoothly here. Thanks for all your efforts at SSL! I had the same thing happen several months ago. I just let it go and when the ghost results expired I removed the ghost computer ID. Kind of disturbing how all these ghosts pop up! Boinc....Boinc....Boinc....Boinc.... |
Joined: 25 Nov 01 Posts: 21567 Credit: 7,508,002 RAC: 20 |
...This seems suspicious. Why is bruno using 1476 when the rest of the world uses 1500? This will almost double the amount of packets to deal with and could explain the serious difficulty... I have a dim recollection that the "rest of the world" need not use an MTU of 1500. Unfortunately, Microsoft have a very broken implementation that doesn't support fragmentation properly and so the rest of the world is forced to use that magic Microsoft number. Anyone know further? Aside: As networks get faster, we should be using much larger MTUs... It will be interesting if part of the problem is indeed packet size and fragmentation problems... Regards, Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
Urs Echternacht Joined: 15 May 99 Posts: 692 Credit: 135,197,781 RAC: 211 |
You could try www.speedguide.net; the TCP/IP Analyzer link will tell you what your MTU, MSS, RWIN and other values look like. If you are behind a firewall, port 8080 is needed. Don't forget to close the port again after analyzing. _\|/_ U r s |
Joined: 19 Jul 03 Posts: 23 Credit: 203,166 RAC: 0 |
Things are running smoothly here. Thanks for all your efforts at SSL! Just for your information, things are not running smoothly here. I have four processors that have been waiting on work for over 12 hours, and I fear this weekend will be a bust again. Project reports no work?? Any news on this? Or am I the only one that has this problem? |
Kim Vater Joined: 27 May 99 Posts: 227 Credit: 22,743,307 RAC: 0 |
Hi, You're not alone on this one. There's not much activity on the SETI network, as can be seen here on this graph: http://fragment1.berkeley.edu/newcricket/grapher.cgi?target=%2Frouter-interfaces%2Finr-250%2Fgigabitethernet2_3;view=Octets;ranges=d It's early morning in California, so they will look into the problem soon. ;-) Kiva Greetings from Norway Crunch3er & AK-V8 Inside |
Joined: 28 Jun 99 Posts: 104 Credit: 16,364,896 RAC: 1 |
Things are running smoothly here. Thanks for all your efforts at SSL! I am also returning work OK but getting the "no work from project" message. |
Joined: 9 Jun 99 Posts: 30974 Credit: 57,275,487 RAC: 157 |
Things are running smoothly here. Thanks for all your efforts at SSL! Not sure where your problem might be, but I have 11 computers up and running. All have a full supply of work units and are working just like they did before the outage. I had changed things during the problem time, such as renaming app_info.xml and detaching then reattaching. But as of Tuesday everything was working as it had before. You might try the detach then reattach. |
Joined: 20 Jan 00 Posts: 20 Credit: 17,904,159 RAC: 0 |
I have a dim recollection that the "rest of the world" need not use an MTU of 1500. Unfortunately, Microsoft have a very broken implementation that doesn't support fragmentation properly and so the rest of the world is forced to use that magic Microsoft number. Anyone know further? There is no such thing as an "optimal" MTU; it depends on the protocol(s) used during an internet transmission. Ethernet frames normally are limited to 1518 bytes. Of those 1518 bytes, the Ethernet header uses 14 bytes, and 4 bytes are used for the FCS (frame check sequence) at the end of the frame. That leaves 1500 bytes for the IP header and the payload of the packet. This is the so-called MTU (maximum transmission unit). In normal LANs all 1500 bytes are available, i.e. your MTU on the LAN is 1500 bytes. If you connect to the internet over DSL, some additional bytes are used for the protocol your provider requires (PPPoE, L2TP, PPTP etc.). PPPoE, for example, uses 8 bytes, so your MTU is then reduced to 1492. The correct MTU normally is adjusted automatically via LCP when you connect to your ISP. The "Path MTU" of an internet transmission is the smallest MTU on any of the hops on the path of the transmission. If your original MTU is larger than the smallest MTU on the path, additional fragmentation will occur. So even if your ISP gives you the "correct" MTU, it still may be "wrong" for the actual transmission path. HTH and sorry for being a wise guy... ;-) "il faut imaginer Sisyphe heureux", Albert Camus |
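
The MTU arithmetic in the last post can be checked with a short Python sketch. The frame, header, FCS and PPPoE sizes are the ones quoted in the post; the function names, the example hop list, and the 20-byte IPv4 header (the minimum, with no options) are assumptions made for illustration, not anything taken from the SETI@home servers.

```python
import math

# Sizes quoted in the post above (bytes).
ETHERNET_FRAME_MAX = 1518
ETHERNET_HEADER = 14
ETHERNET_FCS = 4

# Extra per-link encapsulation overhead. Only the PPPoE figure comes
# from the post; other tunnel types would need their own values.
ENCAP_OVERHEAD = {"none": 0, "pppoe": 8}


def link_mtu(encapsulation="none"):
    """MTU left on a link after Ethernet framing and encapsulation."""
    return (ETHERNET_FRAME_MAX - ETHERNET_HEADER - ETHERNET_FCS
            - ENCAP_OVERHEAD[encapsulation])


def path_mtu(hop_mtus):
    """The path MTU is the smallest MTU of any hop on the path."""
    return min(hop_mtus)


def ipv4_fragment_count(packet_size, mtu, ip_header=20):
    """Number of IPv4 fragments needed to carry one packet over a link.

    Assumes a 20-byte IPv4 header with no options. Every fragment
    except the last must carry a multiple of 8 payload bytes.
    """
    if packet_size <= mtu:
        return 1
    payload = packet_size - ip_header
    per_fragment = (mtu - ip_header) // 8 * 8
    return math.ceil(payload / per_fragment)


if __name__ == "__main__":
    print(link_mtu())                       # plain Ethernet LAN
    print(link_mtu("pppoe"))                # DSL with PPPoE
    print(path_mtu([1500, 1492, 1476]))     # hypothetical three-hop path
    print(ipv4_fragment_count(1500, 1476))  # full-size packet on a 1476 link
```

Under these assumptions a full 1500-byte packet crossing a 1476-byte hop is split into two fragments, which is where the "almost double the amount of packets" remark earlier in the thread comes from. That only applies when fragmentation actually happens; a sender that already knows the path MTU just sends smaller packets.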