Eric's biannual post #6: You can tuna fish, but you can't tune a TCP


Message boards : SETI@home Staff Blog : Eric's biannual post #6: You can tuna fish, but you can't tune a TCP

Author Message
gomeyer
Volunteer tester
Joined: 21 May 99
Posts: 488
Credit: 50,157,953
RAC: 0
United States
Message 569653 - Posted: 17 May 2007, 17:11:30 UTC - in response to Message 569633.
Last modified: 17 May 2007, 17:24:06 UTC

Hi Eric, Thanks for the information. I just want to confirm something. . .


. . . Only penguin is on download duty, but that may change if downloads start becoming a problem. . .
. . . Eric


Are downloads indeed going out at all now? I've received none for about two days on any of my eight workstations. Also, while uploads are going we are unable to Report any of them.

Please note the unusual verbiage (Message from server: Incomplete request received.) in the following exchange. This seems to be common to many, per the message threads.

5/17/2007 12:48:15 PM|SETI@home|Sending scheduler request: To report completed tasks
5/17/2007 12:48:15 PM|SETI@home|Requesting 4159 seconds of new work, and reporting 25 completed tasks
5/17/2007 12:48:30 PM|SETI@home|Scheduler RPC succeeded [server version 509]
5/17/2007 12:48:30 PM|SETI@home|Message from server: Incomplete request received.
5/17/2007 12:48:30 PM|SETI@home|Deferring communication for 11 sec
5/17/2007 12:48:30 PM|SETI@home|Reason: requested by project
5/17/2007 12:48:30 PM|SETI@home|Deferring communication for 5 min 35 sec
5/17/2007 12:48:30 PM|SETI@home|Reason: no work from project
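For anyone wanting to sift a longer log for these server messages, here is a minimal sketch. It assumes only what the excerpt above shows: each line has three fields separated by "|" (timestamp, project, message). The names LogEntry, parse_log_line and server_messages are invented for this illustration and are not part of BOINC.

```python
from typing import NamedTuple


class LogEntry(NamedTuple):
    timestamp: str  # kept as text; the date format differs between clients
    project: str    # empty for client-wide messages
    message: str


def parse_log_line(line: str) -> LogEntry:
    """Split one 'timestamp|project|message' line into its three fields."""
    # maxsplit=2 so that a '|' inside the message text would be preserved
    timestamp, project, message = line.split("|", 2)
    return LogEntry(timestamp.strip(), project.strip(), message.strip())


def server_messages(lines):
    """Return the text of every 'Message from server:' entry."""
    prefix = "Message from server:"
    return [
        entry.message[len(prefix):].strip()
        for entry in map(parse_log_line, lines)
        if entry.message.startswith(prefix)
    ]


sample = [
    "5/17/2007 12:48:30 PM|SETI@home|Scheduler RPC succeeded [server version 509]",
    "5/17/2007 12:48:30 PM|SETI@home|Message from server: Incomplete request received.",
]
print(server_messages(sample))  # ['Incomplete request received.']
```

The timestamp is deliberately left unparsed, since other posts in this thread show a different date layout.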

I'm not trying to bust your chops, I KNOW how hard you are all rowing trying to fix this monster. I'm just looking for information and to be sure you are aware how many problems remain.

Best regards, and thanks for the hard work.

EDIT
The upload/download server also APPEARS to be bouncing although this may be related to a problem noted by Matt in an earlier post:
...Bruno is dropping lots of packets right now, resulting in all kinds of upload/download snags and showing up as "disabled" on the server status page...
/EDIT

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8551
Credit: 50,423,746
RAC: 51,119
United Kingdom
Message 569675 - Posted: 17 May 2007, 17:47:45 UTC - in response to Message 569653.

Please note the unusual verbiage (Message from server: Incomplete request received.) in the following exchange. This seems to be common to many, per the message threads.

As you say, many people have reported this - yet others say things are working normally.

I have two older hosts/older clients (v5.3.12.tx36) which have been trying to report at intervals all day. Sometimes they failed to contact the scheduler at all, sometimes they got the 'Incomplete request received' response. Since Eric mentioned that the scheduler has been moved, I thought I'd give them a prod.

Just did an 'Exit BOINC - start BOINC' sequence, and both have now reported. Coincidence or.....?

gomeyer
Volunteer tester
Joined: 21 May 99
Posts: 488
Credit: 50,157,953
RAC: 0
United States
Message 569685 - Posted: 17 May 2007, 17:59:22 UTC - in response to Message 569675.
Last modified: 17 May 2007, 18:09:52 UTC

Please note the unusual verbiage (Message from server: Incomplete request received.) in the following exchange. This seems to be common to many, per the message threads.

As you say, many people have reported this - yet others say things are working normally.

I have two older hosts/older clients (v5.3.12.tx36) which have been trying to report at intervals all day. Sometimes they failed to contact the scheduler at all, sometimes they got the 'Incomplete request received' response. Since Eric mentioned that the scheduler has been moved, I thought I'd give them a prod.

Just did an 'Exit BOINC - start BOINC' sequence, and both have now reported. Coincidence or.....?


That's a good thought, when in doubt reboot or restart. I just tried it on two of my closest hosts but they did not report. This is all most likely just the luck of the draw plus an artifact of incredible network traffic, but thought Eric may be able to offer a different perspective.

Update. Two of my more remote hosts have now reported. We're making progress it seems.

Clyde C. Phillips, III
Joined: 2 Aug 00
Posts: 1851
Credit: 5,955,047
RAC: 0
United States
Message 569697 - Posted: 17 May 2007, 18:23:41 UTC

I see that my caches emptied overnight of all finished results; a few were awarded credit, but most are "pending" - more than two days' worth. Besides, I see no new work and have only five hours before two of my four cores run out of Einstein work. I guess I'll have to fetch about a day's worth more of Einstein; otherwise my two machines will be absolutely idle. Congratulations and thanks to the team (and anyone else helping) for doing its best in coping with all these problems and solving some of them.
____________

Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 569723 - Posted: 17 May 2007, 19:07:35 UTC

@Eric

Several of us have tested and confirmed an issue where we contact the scheduler and get an HTTP Internal Server Error and it causes our hosts to be assigned a result, but the result never comes to us. I would guess that this is causing other people to get "No work from project" messages, as the system thinks that everything has been sent out????

If you want to read the discussion on it, please see the thread in Number Crunching called Ghost WU issue (and some talk about deadlines) from that post I pointed you at and on upwards. The prior discussion was me musing about deadline extensions and is not nearly as important as addressing this.

Thanks...

Brian

zombie67 [MM]
Volunteer tester
Joined: 22 Apr 04
Posts: 753
Credit: 16,696,896
RAC: 10,472
United States
Message 569766 - Posted: 17 May 2007, 20:33:14 UTC - in response to Message 569633.

Only penguin is on download duty, but that may change if downloads start becoming a problem.

That would be now. Nothing is downloading, even though the account information on the server says that it has.
____________
Dublin, CA

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8551
Credit: 50,423,746
RAC: 51,119
United Kingdom
Message 569768 - Posted: 17 May 2007, 20:41:31 UTC - in response to Message 569766.

Only penguin is on download duty, but that may change if downloads start becoming a problem.

That would be now. Nothing is downloading, even though the account information on the server says that it has.

Do you mean you have tasks shown on the 'Transfers' tab in BOINC Manager, trying but failing to download? If so, then we need an additional download server.

Or do you mean you see tasks in the 'Results for computer' on this web site, but nothing in BOINC Manager? We call those "Ghost WUs" - different problem, different solution.
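The distinction between the two cases can be written down as a small sketch. It assumes you have copied three lists of task names by hand: those shown under 'Results for computer' on the web site, those visible in BOINC Manager, and those stuck on the 'Transfers' tab. The function name diagnose and the task names are hypothetical.

```python
def diagnose(server_tasks, client_tasks, stalled_transfers):
    """Separate stalled downloads from 'ghost' tasks the client never saw."""
    server = set(server_tasks)
    # Assigned by the server and visibly trying to download on the client
    stalled = server & set(stalled_transfers)
    # Assigned by the server but absent from the client altogether
    ghosts = server - set(client_tasks) - stalled
    return {"stalled_downloads": sorted(stalled), "ghosts": sorted(ghosts)}


# Hypothetical task names, for illustration only
report = diagnose(
    server_tasks=["task_a", "task_b", "task_c"],
    client_tasks=["task_a"],
    stalled_transfers=["task_b"],
)
print(report)  # {'stalled_downloads': ['task_b'], 'ghosts': ['task_c']}
```

A non-empty 'stalled_downloads' list points at download capacity; a non-empty 'ghosts' list points at the scheduler reply never reaching the client.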

zombie67 [MM]
Volunteer tester
Joined: 22 Apr 04
Posts: 753
Credit: 16,696,896
RAC: 10,472
United States
Message 569782 - Posted: 17 May 2007, 21:12:02 UTC - in response to Message 569768.

Do you mean you have tasks shown on the 'Transfers' tab in BOINC Manager, trying but failing to download? If so, then we need an additional download server.

Nope.
Or do you mean you see tasks in the 'Results for computer' on this web site, but nothing in BOINC Manager? We call those "Ghost WUs" - different problem, different solution.

Yep. Every time BOINC tries to connect, it generates another ghost WU for my account. FWIW, there are two different error messages:

Thu May 17 13:06:29 2007|SETI@home|Requesting 4104198 seconds of new work
Thu May 17 13:06:44 2007|SETI@home|Scheduler request failed: HTTP internal server error
Thu May 17 13:06:44 2007|SETI@home|Deferring communication for 13 min 41 sec
Thu May 17 13:06:44 2007|SETI@home|Reason: scheduler request failed
Thu May 17 13:20:26 2007|SETI@home|Sending scheduler request: Requested by user
Thu May 17 13:20:26 2007|SETI@home|Requesting 4106271 seconds of new work
Thu May 17 13:20:51 2007|SETI@home|Scheduler RPC succeeded [server version 509]
Thu May 17 13:20:51 2007|SETI@home|Deferring communication for 11 sec
Thu May 17 13:20:51 2007|SETI@home|Reason: requested by project
Thu May 17 13:20:51 2007|SETI@home|Deferring communication for 9 min 52 sec
Thu May 17 13:20:51 2007|SETI@home|Reason: no work from project
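To see how often each kind of reply turns up over a longer log, a rough tally like the following may help. The substrings are taken from client messages quoted in this thread; the short labels and the function name tally_outcomes are made up for this sketch.

```python
from collections import Counter

# Substrings from messages quoted in this thread, mapped to labels
# that were chosen here purely for illustration.
OUTCOMES = {
    "HTTP internal server error": "http_internal_error",
    "server returned nothing": "empty_reply",
    "Incomplete request received": "incomplete_request",
    "no work from project": "no_work",
}


def tally_outcomes(lines):
    """Count how many log lines contain each known outcome substring."""
    counts = Counter()
    for line in lines:
        for needle, label in OUTCOMES.items():
            if needle in line:
                counts[label] += 1
    return counts


sample = [
    "Thu May 17 13:06:44 2007|SETI@home|Scheduler request failed: HTTP internal server error",
    "Thu May 17 13:20:51 2007|SETI@home|Reason: no work from project",
]
print(tally_outcomes(sample))
```

Matching on substrings rather than whole lines means the sketch does not depend on the timestamp layout, which varies between the logs posted here.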

____________
Dublin, CA

archae86
Joined: 31 Aug 99
Posts: 889
Credit: 1,572,688
RAC: 1
United States
Message 569788 - Posted: 17 May 2007, 21:13:41 UTC - in response to Message 569675.

...
Sometimes they failed to contact the scheduler at all, sometimes they got the 'Incomplete request received' response. Since Eric mentioned that the scheduler has been moved, I thought I'd give them a prod.

Just did an 'Exit BOINC - start BOINC' sequence, and both have now reported. Coincidence or.....?

A simple exit boincmgr, start boincmgr gave first trial success on two of my three repeat offending machines. The third was not healed by that, but was healed by a full power off reboot.

Thanks Richard.

____________

gomeyer
Volunteer tester
Joined: 21 May 99
Posts: 488
Credit: 50,157,953
RAC: 0
United States
Message 569831 - Posted: 17 May 2007, 22:19:28 UTC

Just one more weird happening to report and I'll be quiet. One of my hosts contacted the scheduler to request work and report 36 results. It got the dreaded:

5/17/2007 6:07:11 PM Scheduler request failed: HTTP internal server error

When I checked that host under My Computers on the web site, I see that all 36 were indeed reported at that time even though they still show up in BOINC as "Ready to Report". The system also claims to have sent two ghost work units at the same time which never arrived.

Anyway, things are still very broken. Just my 2c.

KenKLRC
Joined: 12 Jul 06
Posts: 27
Credit: 7,791,658
RAC: 0
United States
Message 569902 - Posted: 18 May 2007, 0:09:11 UTC
Last modified: 18 May 2007, 0:09:39 UTC

Eric,

I still have this "Stand By" box offline. If you think more H/W is the answer, I can overnight it. You have my number.

Kirsten
Volunteer tester
Joined: 7 Jul 00
Posts: 190
Credit: 565,264
RAC: 0
Denmark
Message 569922 - Posted: 18 May 2007, 0:48:16 UTC - in response to Message 568867.
Last modified: 18 May 2007, 0:49:42 UTC

Eric, it looks like you hit the jackpot. Slowly but surely my upload queue is shrinking and WUs are trickling down too. Thanks for making it happen!


I got many WUs, too. Unfortunately they were all ghosts. The upload queue is gone, though.

____________
Kind regards
Kirsten

picantecomputing
Volunteer tester
Joined: 1 Jan 07
Posts: 4
Credit: 104,652
RAC: 0
United States
Message 569985 - Posted: 18 May 2007, 2:19:19 UTC - in response to Message 569922.

I got many WUs, too. Unfortunately they were all ghosts. The upload queue is gone, though.


Same thing here. Tons of ghosts showing up in my results, but nothing has downloaded for days. Last attempt to request work gave me this:

5/17/2007 8:15:51 PM|SETI@home|Sending scheduler request: Requested by user
5/17/2007 8:15:51 PM|SETI@home|Requesting 442 seconds of new work
5/17/2007 8:16:01 PM||Project communication failed: attempting access to reference site
5/17/2007 8:16:01 PM|SETI@home|Scheduler request failed: server returned nothing (no headers, no data)
5/17/2007 8:16:01 PM|SETI@home|Deferring communication for 1 min 0 sec
5/17/2007 8:16:01 PM|SETI@home|Reason: scheduler request failed
5/17/2007 8:16:02 PM||Access to reference site succeeded - project servers may be temporarily down.

Eric Korpela
Project donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1088
Credit: 8,982,837
RAC: 12,330
United States
Message 570083 - Posted: 18 May 2007, 5:52:03 UTC - in response to Message 569985.

It's still pretty hit and miss (as you can see). Hopefully it's getting toward more hit than miss at this point.

Ptolemy lost connectivity to the upload directories on bruno for a while. Just fixed that, so our upload rate should double.

This Graph is still your best bet of checking your chances of getting through. The higher the graph is, the better. But we should be hovering around 22 Mbps rather than 15.

We're still operating on a single scheduler due to compile problems. G'nite. I'll catch up on where we are in the morning.

Eric


____________

Jim Franklin
Joined: 3 Apr 99
Posts: 55
Credit: 1,578,201
RAC: 0
United Kingdom
Message 570151 - Posted: 18 May 2007, 10:14:44 UTC - in response to Message 570083.

Eric, just to add to what others have said, none of my ten workstations are able to upload/download, and are returning the same messages, "Scheduler request failed: server returned nothing (no headers, no data)" and then they return the "Message from server: Incomplete request received." message.

It would appear from my stats that some of the machines managed some form of upload in the previous 36 hours, although the quantity uploaded is dwarfed by what is in the upload queues, and the rate of completing units is outstripping the exchanges that are taking place. Currently I have about 150 completed units in my queues and about 2 days of total crunch time left, and that is with moving units from one machine to another to ensure they stay crunching!

Zeus, my main workstation, is now running Einstein again as its 8 cores ran out of SETI units yesterday. It has about 60 units in its upload queue and has not connected in about 10 days; currently it is trying to receive some 3,815,477 seconds of new work, but gets dropped constantly.

If it were not for the sheer cost of transport, I would send you Zeus as an upload server. It is a DELL quad 3.2GHz HT Xeon machine running with 4GB of Registered DDRII at the moment and Ubuntu 4.07. I have just sourced some DDRII (unbuffered or registered) to put into it (10GB). However, a check showed the shipping cost from London to you was more than £1000 unless it went by ship, in which case it would not arrive with you for 8 to 9 weeks!!

Hopefully all the glitches will be sorted soon and we will be back to as normal as things ever are.

Jim
____________

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8551
Credit: 50,423,746
RAC: 51,119
United Kingdom
Message 570200 - Posted: 18 May 2007, 11:57:57 UTC

@ Eric,

A lot of people are reporting incomplete scheduler requests (with various error messages) and the creation of a lot of 'ghost' results - potentially storing up another batch of problems in the future when the 'ghost' deadlines expire.

At least some of this seems to relate to the anonymous platform mechanism. The following recipe has worked for me on three machines, and independently verified by another user:

Rename app_info.xml so it won't be recognised
Restart BOINC (service)
Update SETI - may not get through first time, but keep trying
Restore app_info.xml to original name
Wait until all transfers have finished
Restart BOINC (service)

I don't know why this works - one user has speculated about app_info processing overhead, I'm wondering about the BOINC v5.10 <platform> tag - but it seems consistent and reproducible, so it may help to narrow down the debug search.
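The rename-and-restore part of the recipe above can be sketched as a small helper, so that app_info.xml is put back under its original name even if something goes wrong in between. This is only an illustration: the ".disabled" suffix and the name app_info_hidden are assumptions, and restarting BOINC and updating the project are left as manual steps because how to do that differs between installations.

```python
import contextlib
import tempfile
from pathlib import Path


@contextlib.contextmanager
def app_info_hidden(project_dir, suffix=".disabled"):
    """Temporarily rename app_info.xml; restore the original name on exit."""
    original = Path(project_dir) / "app_info.xml"
    hidden = original.with_name(original.name + suffix)
    original.rename(hidden)          # step 1: file is no longer recognised
    try:
        yield hidden
    finally:
        hidden.rename(original)      # step 4: restore the original name


if __name__ == "__main__":
    # Demonstration on a throwaway directory, not a real BOINC project folder
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "app_info.xml"
        demo.write_text("<app_info/>")
        with app_info_hidden(tmp):
            # Steps 2 and 3 are done by hand here:
            # restart BOINC, then update SETI until it gets through.
            print("hidden:", not demo.exists())
        # Steps 5 and 6 are done by hand here:
        # wait for transfers to finish, then restart BOINC again.
        print("restored:", demo.exists())
```

On a real host, project_dir would be the SETI@home project folder inside the BOINC data directory, whose location depends on the platform.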

Lycanthrope
Joined: 27 May 05
Posts: 31
Credit: 1,338,589
RAC: 0
Belgium
Message 570284 - Posted: 18 May 2007, 13:45:53 UTC - in response to Message 570200.

I don't know why this works - one user has speculated about app_info processing overhead, I'm wondering about the BOINC v5.10 <platform> tag - but it seems consistent and reproducible, so it may help to narrow down the debug search.


I've tried this workaround on my MacIntel, G5, and XP laptop, and it has worked in each case!

Added to that I have solved my caching issue so I've loaded a nine queue of WU's. Now I can switch back to the optimized app.

Thanks for the interim solution!

____________

Eric Korpela
Project donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1088
Credit: 8,982,837
RAC: 12,330
United States
Message 570285 - Posted: 18 May 2007, 13:46:51 UTC - in response to Message 570151.


Eric, just to add to what others have said, none of my ten workstations are able to upload/download, and are returning the same messages, "Scheduler request failed: server returned nothing (no headers, no data)" and then they return the "Message from server: Incomplete request received." message.


Could you give me an IP address for this machine? I'd like to scan the logs to see what's going on on this side.

Eric
____________

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8551
Credit: 50,423,746
RAC: 51,119
United Kingdom
Message 570297 - Posted: 18 May 2007, 14:03:29 UTC - in response to Message 570285.
Last modified: 18 May 2007, 14:09:41 UTC


Eric, just to add to what others have said, none of my ten workstations are able to upload/download, and are returning the same messages, "Scheduler request failed: server returned nothing (no headers, no data)" and then they return the "Message from server: Incomplete request received." message.


Could you give me an IP address for this machine? I'd like to scan the logs to see what's going on on this side.

Eric

Here are some examples from host 2901600 - Windows Vista, BOINC v5.8.16

IP 81.156.16.160
[edit - IP may have been 86.141.28.126 - router has lost DSL since then, been re-assigned - but I think that was before these events]

SETI@home 17/05/2007 18:18:18 Message from server: Incomplete request received.

[BOINC restarted - no 'incomplete request' since then]

SETI@home 17/05/2007 19:01:39 Scheduler request failed: HTTP internal server error
SETI@home 17/05/2007 19:06:49 Scheduler request failed: server returned nothing (no headers, no data)

Times have been adjusted to UTC - should be pretty exact.

Hope that gives you something to look for while you're waiting for Jim to post.

Mat
Volunteer tester
Joined: 20 Oct 01
Posts: 1
Credit: 8,611,384
RAC: 692
Thailand
Message 570347 - Posted: 18 May 2007, 14:41:21 UTC

@ Eric,

A lot of people are reporting incomplete scheduler requests (with various error messages) and the creation of a lot of 'ghost' results - potentially storing up another batch of problems in the future when the 'ghost' deadlines expire.

At least some of this seems to relate to the anonymous platform mechanism. The following recipe has worked for me on three machines, and independently verified by another user:

Rename app_info.xml so it won't be recognised
Restart BOINC (service)
Update SETI - may not get through first time, but keep trying
Restore app_info.xml to original name
Wait until all transfers have finished
Restart BOINC (service)

I don't know why this works - one user has speculated about app_info processing overhead, I'm wondering about the BOINC v5.10 <platform> tag - but it seems consistent and reproducible, so it may help to narrow down the debug search.



That solution seems to work for me as well, as I use the anonymous platform mechanism in conjunction with BOINC 5.8.16 and have had severe difficulties downloading any new WUs since the 16th, around 16:00 UTC.
Now new WUs are pouring in.
The above-mentioned workaround works.

ty Richard


____________

Copyright © 2014 University of California