What exactly does this mean?

Message boards : Number crunching : What exactly does this mean?
Profile gcpeters
Joined: 20 May 99
Posts: 67
Credit: 109,352,237
RAC: 1
United States
Message 1158658 - Posted: 4 Oct 2011, 0:01:37 UTC

10/3/2011 4:56:07 PM | SETI@home | Scheduler request failed: HTTP service unavailable

I have multiple (identical) systems on this same network segment that are chugging along just fine...relatively speaking. I can only assume this must be server side noise...
ID: 1158658
Blake Bonkofsky
Volunteer tester
Joined: 29 Dec 99
Posts: 617
Credit: 46,383,149
RAC: 0
United States
Message 1158659 - Posted: 4 Oct 2011, 0:04:51 UTC - in response to Message 1158658.  
Last modified: 4 Oct 2011, 0:05:43 UTC

Either something broke again on the network link, or they are working on it. The cricket graph shows the servers are basically dead to the world. You can see there that a couple of hours ago the link went completely flat; the few bits/sec that remain are, I believe, just router-to-router communications.

http://fragment1.berkeley.edu/newcricket/grapher.cgi?target=%2Frouter-interfaces%2Finr-250%2Fgigabitethernet2_3;ranges=d%3Aw%3Am%3Ay;view=Octets
ID: 1158659
Profile gcpeters
Joined: 20 May 99
Posts: 67
Credit: 109,352,237
RAC: 1
United States
Message 1159233 - Posted: 5 Oct 2011, 23:24:05 UTC
Last modified: 5 Oct 2011, 23:35:17 UTC

But why does the one machine on this network segment upload and download just fine while this one suddenly keeps throwing out these log messages and doesn't get squat??? Is there something server side that tracks these machines' names and puts them in a queue for when they can upload and download? The code running here is just wacky...two identical machines running S@H...one works...the other no longer does.

And again, what exactly does this code (Scheduler request failed: HTTP service unavailable) refer to? It would be nice if someone published a Batman decoder ring explaining all the stuff we see in the log files.
ID: 1159233
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1159241 - Posted: 5 Oct 2011, 23:56:05 UTC - in response to Message 1159233.  

But why does the one machine on this network segment upload and download just fine while this one suddenly keeps throwing out these log messages and doesn't get squat??? Is there something server side that tracks these machines' names and puts them in a queue for when they can upload and download? The code running here is just wacky...two identical machines running S@H...one works...the other no longer does...

The exact meaning of that message is that you got a 503 response from the server, which means something along the lines of "The server is currently unable to handle the request due to a temporary overloading or maintenance of the server."

I have found when the network is working correctly I sometimes have to bounce BOINC, or set the network activity to suspend for a few seconds then back on before anything will work again.

Each machine you have makes a unique connection to the server every time it makes a request. It could be that one of those intangible resources in the machine needs to be cleared by restarting BOINC or rebooting.

It could also just be that when that machine says "Hi" to the server, it freaks out and responds with "#$@&*!". TCP/IP traffic does normally include the host machine name at some point, so there could be something in the process that is getting tripped up. However, I wouldn't expect it to be host name filtering, as that would probably take too much CPU time on the backend.

More than likely it is just the result of the servers getting hammered constantly.
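The usual response to a 503 is to back off and retry, since the condition is by definition temporary. A minimal Python sketch of that idea (illustrative only, not BOINC's actual scheduler code; the `fetch` callable, exception class, and delays are made up for the example):

```python
import time

class ServiceUnavailable(Exception):
    """Stand-in for an HTTP 503 response from the server."""

def request_with_backoff(fetch, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on a 503, wait and retry with exponential backoff.

    The sleep function is injectable so the behavior can be tested
    without actually waiting.
    """
    for attempt in range(max_tries):
        try:
            return fetch()
        except ServiceUnavailable:
            if attempt == max_tries - 1:
                raise  # out of retries; let the caller see the failure
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

A client behaving like this is why you see the same "Scheduler request failed" line repeat a few times and then succeed once the server catches its breath.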
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1159241
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1159268 - Posted: 6 Oct 2011, 1:35:08 UTC - in response to Message 1159241.  

...
However I wouldn't expect it to be host name filtering, as that would probably take too much CPU time on the backend.
...

Actually, BOINC does have a blacklist feature, though it works from the hostID rather than the name. If a project sets a field in a host's record to -1, a work request from that host gets a "Not accepting requests from this host" error message in the reply.

With BOINCstats showing 223782 active hosts on the project, there may be a few which have been lucky enough to get work every time they ask, and a few which have gotten no work for weeks by pure bad luck. But if I had a host which wasn't getting any work, I'd be capturing packets from its requests and comparing them to packets from a host which was successfully getting work, or trying any other method I could conceive of to see what was different.
                                                                   Joe
ID: 1159268
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 1159274 - Posted: 6 Oct 2011, 1:45:18 UTC
Last modified: 6 Oct 2011, 1:45:54 UTC

SETI has two upload/download servers. Each machine gets the URL of one at random when it first signs up. It may very well be that the IP used by one of your machines is different from the IP used by the other.

[edit]

At least this used to be the case. Not certain if it is still the case.


BOINC WIKI
ID: 1159274
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1159279 - Posted: 6 Oct 2011, 2:11:11 UTC - in response to Message 1159274.  

SETI has two upload/download servers. Each machine gets the URL of one at random when it first signs up. It may very well be that the IP used by one of your machines is different from the IP used by the other.

[edit]

At least this used to be the case. Not certain if it is still the case.


One upload, two download. Download is load-balanced by round-robin DNS every 5 minutes. Sometimes .13 (download_1) has major packet drop issues, and sometimes .18 (download_2) does. Some seasoned veterans around here have both in their HOSTS file and uncomment whichever one seems to be working better when there are a lot of time-outs and retries.
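The five-minute DNS rotation can be modeled in a couple of lines (a toy Python model of the scheme described above; the server labels are placeholders, not the real SETI hostnames or IPs):

```python
def pick_download_server(servers, now, period=300):
    """Round-robin between servers, switching every `period` seconds
    (300 s = the 5-minute DNS rotation described in the post).

    `now` is a timestamp in seconds; every machine resolving the name
    in the same window gets pointed at the same server.
    """
    return servers[int(now // period) % len(servers)]
```

This is also why pinning one entry in a HOSTS file works: it simply opts the machine out of the rotation so it always hits whichever server is behaving better.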
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1159279
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1159280 - Posted: 6 Oct 2011, 2:13:20 UTC - in response to Message 1159268.  

...
However I wouldn't expect it to be host name filtering, as that would probably take too much CPU time on the backend.
...

Actually, BOINC does have a blacklist feature though it works from the hostID rather than name. If a project sets a field in a host's record to -1, a work request from that host gets a "Not accepting requests from this host" error message in the reply.

With BOINCstats showing 223782 active hosts on the project, there may be a few which have been lucky enough to get work every time they ask, and a few which have gotten no work for weeks by pure bad luck. But if I had a host which wasn't getting any work, I'd be capturing packets from its requests and comparing them to packets from a host which was successfully getting work, or trying any other method I could conceive of to see what was different.
                                                                   Joe

Good to know. I would guess this is the same mechanism that is used for things like "Not sending work - last request too recent: X sec" & "This computer has reached a limit on tasks in progress".

Normally I just get:
"Project communication failed: attempting access to reference site"
"Internet access OK - project servers may be temporarily down."

I have seen the "HTTP service unavailable" message, though. At work I normally see some kind of "gateway timeout" message.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1159280
Profile gcpeters
Joined: 20 May 99
Posts: 67
Credit: 109,352,237
RAC: 1
United States
Message 1161153 - Posted: 11 Oct 2011, 5:29:57 UTC - in response to Message 1159280.  
Last modified: 11 Oct 2011, 5:34:40 UTC

Holy Jebus. Is this much manual intervention really needed? Whatever happened to coding "set and forget"? Once again my RAC is taking a nosedive due to various and sundry "under the covers" code that in my opinion could be better written. Instead, we have code that allows some folks to connect semi-regularly, but only on Thursdays, if the moon is full, and if they bounced their BOINC client 3 times while tapping their heels. Others just have to baby their systems along and then suddenly rejoice if they get some WUs to run...

There has got to be a better way to do this...
ID: 1161153
Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1161186 - Posted: 11 Oct 2011, 9:46:13 UTC - in response to Message 1161153.  

This would/could be a nice example of how a project is "killed" by its own success....

I thought we already have a forum thread called: HE connection problem???


ID: 1161186
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1161223 - Posted: 11 Oct 2011, 13:30:29 UTC

Hi

Just received this message today:

11/10/2011 10:22:08 | SETI@home | Reporting 3 completed tasks, requesting new tasks for CPU and NVIDIA GPU
11/10/2011 10:22:11 | SETI@home | Scheduler request completed: got 0 new tasks
11/10/2011 10:22:11 | SETI@home | No tasks sent
11/10/2011 10:22:11 | SETI@home | This computer has finished a daily quota of 8 tasks

Anybody know what happened? A new bug? Is this a new limit of tasks per day? Only 8?

This computer is an i7 with 3 GTX560s working 24/7, so it is capable of crunching more than 500 WU/day.
ID: 1161223
W-K 666
Volunteer tester

Joined: 18 May 99
Posts: 19401
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1161224 - Posted: 11 Oct 2011, 13:40:24 UTC - in response to Message 1161223.  

Hi

Just received this message today:

11/10/2011 10:22:08 | SETI@home | Reporting 3 completed tasks, requesting new tasks for CPU and NVIDIA GPU
11/10/2011 10:22:11 | SETI@home | Scheduler request completed: got 0 new tasks
11/10/2011 10:22:11 | SETI@home | No tasks sent
11/10/2011 10:22:11 | SETI@home | This computer has finished a daily quota of 8 tasks

Anybody know what happened? A new bug? Is this a new limit of tasks per day? Only 8?

This computer is an i7 with 3 GTX560s working 24/7, so it is capable of crunching more than 500 WU/day.

If you mean this computer Host 5264653 Error Tasks, then why did you abort all those tasks? Each error decreases the number that can be d/loaded by 1; 362 in error will cause the 8 tasks/day msg. The good news is each successful task will double the amount/day.
ID: 1161224
LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1161228 - Posted: 11 Oct 2011, 13:52:16 UTC - in response to Message 1161223.  

Hi

Just received this message today:

11/10/2011 10:22:08 | SETI@home | Reporting 3 completed tasks, requesting new tasks for CPU and NVIDIA GPU
11/10/2011 10:22:11 | SETI@home | Scheduler request completed: got 0 new tasks
11/10/2011 10:22:11 | SETI@home | No tasks sent
11/10/2011 10:22:11 | SETI@home | This computer has finished a daily quota of 8 tasks

Anybody know what happened? A new bug? Is this a new limit of tasks per day? Only 8?

This computer is an i7 with 3 GTX560s working 24/7, so it is capable of crunching more than 500 WU/day.


Checking the application details for your host 5264653

shows 'consecutive valid' as 3 and 'max tasks' as 37 [when I first looked, it's now at 4/101] for CPU.

Checking your task list I find you aborted about 300 tasks earlier today [11 Oct 2011 | 13:17:18 UTC]. Those count as errors. Errors reset your 'consecutive valid' count to 0 and reduce your 'maximum tasks per day'. As soon as you restart delivering valid results (or some of your pendings validate) numbers start to climb again.
The mechanism is to prevent hosts that have suffered some sort of breakdown and are only returning errors from getting lots of WUs they won't be able to crunch properly.

Beaten by WinterKnight :)
small correction - it doubles only till it reaches 100 - after that it is one more per validation.
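The quota mechanism described above can be put into a small Python sketch (numbers and function names are illustrative; the real server-side logic may differ in details such as the starting quota and any overall cap):

```python
def update_quota(quota, valid, threshold=100):
    """One result's effect on the daily task quota, as described above:
    an error drops the quota by 1 (never below 1); a valid result
    doubles it until it reaches `threshold`, then adds 1 per validation.
    """
    if valid:
        return min(quota * 2, threshold) if quota < threshold else quota + 1
    return max(quota - 1, 1)

def simulate(quota, results):
    """Apply a sequence of results (True = valid, False = error)."""
    for ok in results:
        quota = update_quota(quota, ok)
    return quota
```

So a batch of aborts walks the quota down one step per error, but a handful of validated results climbs back quickly, which is why the limit recovers on its own once the host resumes returning good work.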
ID: 1161228
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1161248 - Posted: 11 Oct 2011, 15:33:27 UTC

Thanks for the help. You must be right; my system works OK now.




ID: 1161248
Profile gcpeters
Joined: 20 May 99
Posts: 67
Credit: 109,352,237
RAC: 1
United States
Message 1161569 - Posted: 12 Oct 2011, 17:50:57 UTC

So, I'm noticing a trend. I have several systems in the lab that are completely full with nothing but completed WUs (some in the hundreds) and these systems seem to be hung up in a state where they cannot upload anything. Communication is fine to the internets and the extremely robust and helpful log files (yes, that was sarcasm) say only:

10/12/2011 9:56:11 AM | SETI@home | Scheduler request failed: HTTP service unavailable

Yet again, another system literally sitting in the same server rack, connected to the same gigE switch, to the same internets, does not experience this issue. Is there something in the code that is causing this to happen when a system is full with nothing but upload tasks? Other systems in the lab, again on the same network segments, chug along just fine as long as they don't get into this state of being full with finished WUs they are simply trying to upload. One of those completed WUs could have alien signal contact in them!!! Just kidding. And the only reason they seem to get into this state is the extensive connectivity issues that continuously seem to plague the S@H hardware infrastructure.

Why the bizarro-world server discrimination? Obviously with only a single system running S@H, like the majority probably have, most people won't notice this type of S@H weirdness. But when I have 10 systems normally running S@H just fine, and right next to them 5 aren't able to upload or download for whatever software reason, it gets pretty unnerving. I can only infer that something is broken in the S@H backend code.

Contrary to how it might sound, I really do care about this project. I'm just frustrated by all the inconsistencies I'm seeing (I'm a validation engineer by trade) and I have almost zero insight to help fix the issues I'm having as a user. I have literally hundreds of gflops if not tflops worth of compute power at my disposal to throw at this project...it just seems like a waste to not be able to use it all...
ID: 1161569
W-K 666
Volunteer tester

Joined: 18 May 99
Posts: 19401
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1161579 - Posted: 12 Oct 2011, 18:10:52 UTC - in response to Message 1161569.  

So, I'm noticing a trend. I have several systems in the lab that are completely full with nothing but completed WUs (some in the hundreds) and these systems seem to be hung up in a state where they cannot upload anything. Communication is fine to the internets and the extremely robust and helpful log files (yes, that was sarcasm) say only:

10/12/2011 9:56:11 AM | SETI@home | Scheduler request failed: HTTP service unavailable

Yet again, another system literally sitting in the same server rack, connected to the same gigE switch, to the same internets, does not experience this issue. Is there something in the code that is causing this to happen when a system is full with nothing but upload tasks? Other systems in the lab, again on the same network segments, chug along just fine as long as they don't get into this state of being full with finished WUs they are simply trying to upload. One of those completed WUs could have alien signal contact in them!!! Just kidding. And the only reason they seem to get into this state is the extensive connectivity issues that continuously seem to plague the S@H hardware infrastructure. Why the bizarro-world server discrimination? Obviously with only a single system running S@H, like the majority probably have, most people won't notice this type of S@H weirdness. But when I have 10 systems normally running S@H just fine, and right next to them 5 aren't able to upload or download for whatever software reason, it gets pretty unnerving. I can only infer that something is broken in the S@H backend code. Contrary to how it might sound, I really do care about this project. I'm just frustrated by all the inconsistencies I'm seeing (I'm a validation engineer by trade) and I have almost zero insight to help fix the issues I'm having as a user. I have literally hundreds of gflops if not tflops worth of compute power at my disposal to throw at this project...it just seems like a waste to not be able to use it all...

BOINC will not request work if the number of uploads is greater than 2 * number of processors.
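As a one-line sketch of that rule (a hypothetical function mirroring the stated behavior, not the actual BOINC client source):

```python
def work_fetch_allowed(pending_uploads, ncpus):
    """The rule quoted above: the client defers asking for new work
    while the number of pending uploads exceeds twice the processor count.
    """
    return pending_uploads <= 2 * ncpus
```

With 80 logical processors, work fetch would stop once more than 160 results are stuck waiting to upload, which matches the "full of completed WUs" state described earlier.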
ID: 1161579
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1161639 - Posted: 12 Oct 2011, 21:22:18 UTC - in response to Message 1161569.  

So, I'm noticing a trend. I have several systems in the lab that are completely full with nothing but completed WUs (some in the hundreds) and these systems seem to be hung up in a state where they cannot upload anything. Communication is fine to the internets and the extremely robust and helpful log files (yes, that was sarcasm) say only:

10/12/2011 9:56:11 AM | SETI@home | Scheduler request failed: HTTP service unavailable
...

BOINC's default logging is of course limited to information the devs think appropriate for non-technical users. A cc_config.xml file can be used to get extremely detailed logging. For "HTTP service unavailable", the log would then show a line indicating a 503 error, among many other lines for each contact attempt, if the cc_config.xml were:

<cc_config>
  <log_flags>
    <http_debug>1</http_debug>
  </log_flags>
</cc_config>
                                                                      Joe
ID: 1161639
Profile gcpeters
Joined: 20 May 99
Posts: 67
Credit: 109,352,237
RAC: 1
United States
Message 1161697 - Posted: 13 Oct 2011, 0:04:38 UTC - in response to Message 1161579.  

"BOINC will not request work if the number of uploads is greater than 2 * number of processors."

What should that matter? BOINC got itself into this mess by having so much downtime while my system is still chugging away at 10 days worth of WUs. Suddenly it's so full that it's hit some kind of arbitrary limit as to what it can upload? Poopoo. So penalize the power crunchers? Who wrote that code?

I have 80 logical processors in this system. I can't believe this is a real issue...

/me throws arms up in disbelief

Ok, I'm better now. So, what do I do to unscrew this situation that is no fault of mine? Throw away all these perfectly good crunched WUs? I am tired of seeing:

10/11/2011 12:00:09 PM | SETI@home | Scheduler request failed: HTTP service unavailable

Which is total garbage messaging btw. HTTP service is available because my other 10+ systems are downloading, crunching and uploading just fine on the same network. This is just plain and simple bad code.
ID: 1161697
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 1161745 - Posted: 13 Oct 2011, 2:56:08 UTC - in response to Message 1161697.  

"BOINC will not request work if the number of uploads is greater than 2 * number of processors."

What should that matter? BOINC got itself into this mess by having so much downtime while my system is still chugging away at 10 days worth of WUs. Suddenly it's so full that it's hit some kind of arbitrary limit as to what it can upload? Poopoo. So penalize the power crunchers? Who wrote that code?

I have 80 logical processors in this system. I can't believe this is a real issue...

/me throws arms up in disbelief

Ok, I'm better now. So, what do I do to unscrew this situation that is no fault of mine? Throw away all these perfectly good crunched WUs? I am tired of seeing:

10/11/2011 12:00:09 PM | SETI@home | Scheduler request failed: HTTP service unavailable

Which is total garbage messaging btw. HTTP service is available because my other 10+ systems are downloading, crunching and uploading just fine on the same network. This is just plain and simple bad code.

BOINC != SETI. SETI has gotten into a bit of trouble. Some projects have had spells where tasks could be produced, delivered, and crunched faster than the results could be uploaded. This caused an ever-increasing list of uploads on the clients, which is not a good state to be in. There was nothing preventing the number of pending uploads from growing without bound on every computer attached to the project - thus the arbitrary limit on the number of tasks waiting for uploads to complete before new work is requested.


BOINC WIKI
ID: 1161745
LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1161830 - Posted: 13 Oct 2011, 10:09:18 UTC - in response to Message 1161697.  

"BOINC will not request work if the number of uploads is greater than 2 * number of processors."

What should that matter? BOINC got itself into this mess by having so much downtime while my system is still chugging away at 10 days worth of WUs. Suddenly it's so full that it's hit some kind of arbitrary limit as to what it can upload? Poopoo. So penalize the power crunchers? Who wrote that code?

I have 80 logical processors in this system. I can't believe this is a real issue...

/me throws arms up in disbelief

Ok, I'm better now. So, what do I do to unscrew this situation that is no fault of mine? Throw away all these perfectly good crunched WUs? I am tired of seeing:

10/11/2011 12:00:09 PM | SETI@home | Scheduler request failed: HTTP service unavailable

Which is total garbage messaging btw. HTTP service is available because my other 10+ systems are downloading, crunching and uploading just fine on the same network. This is just plain and simple bad code.


try ping/tracert to 208.68.240.20 from the affected machine to see if that particular machine can reach the servers. Just because the rest of the LAN can, doesn't mean that one can as well. Don't assume - test.

if it works, enable the debugging log Joe posted and post the log.
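If ICMP ping happens to be blocked somewhere along the path, a TCP probe from the affected machine works too (a minimal Python sketch; port 80 is an assumption about what the servers answer on):

```python
import socket

def can_reach(host, port=80, timeout=5.0):
    """Per-machine reachability check: try to open a TCP connection
    to the server and report whether it succeeded.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unroutable, DNS failure...
        return False

# e.g. run can_reach("208.68.240.20") from the affected machine
```

A False here from the sick machine while its rack-mate returns True would confirm the problem is specific to that host's path, not the servers.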
ID: 1161830



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.