Message boards :
Number crunching :
What exactly does this mean?
Message board moderation
Author | Message |
gcpeters Send message Joined: 20 May 99 Posts: 67 Credit: 109,352,237 RAC: 1 |
10/3/2011 4:56:07 PM | SETI@home | Scheduler request failed: HTTP service unavailable

I have multiple (identical) systems on this same network segment that are chugging along just fine...relatively speaking. I can only assume this must be server-side noise... |
Blake Bonkofsky Send message Joined: 29 Dec 99 Posts: 617 Credit: 46,383,149 RAC: 0 |
Either something broke again on the network link, or they are working on it. The cricket graph shows the servers are basically dead to the world. You can see there, a couple of hours ago the link went completely flat, and the few bits/sec that remain, I believe are just from router to router communications. http://fragment1.berkeley.edu/newcricket/grapher.cgi?target=%2Frouter-interfaces%2Finr-250%2Fgigabitethernet2_3;ranges=d%3Aw%3Am%3Ay;view=Octets |
gcpeters Send message Joined: 20 May 99 Posts: 67 Credit: 109,352,237 RAC: 1 |
But why does one machine on this network segment upload and download just fine, while this one keeps throwing out these log messages and doesn't get squat??? Is there something server-side that tracks these machines' names and puts them in a queue for when they can upload and download? The code running here is just wacky...two identical machines running S@H...one works...the other no longer does. And again, what exactly does this code (Scheduler request failed: HTTP service unavailable) refer to? It would be nice if someone published a Batman decoder ring explaining all the stuff we see in the log files. |
HAL9000 Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
But why does the one machine on this network segment upload and download just fine and suddenly this one keeps throwing out these log messages and doesn't get squat??? Is there something server side that tracks these machine's names and puts them in a queue for when they can upload and download? The code running here is just wacky...two identical machines running S@H...one works...the other no longer does... The exact meaning of that message is that you got a 503 response from the server, which is something along the lines of "The server is currently unable to handle the request due to a temporary overloading or maintenance of the server." I have found that even when the network is working correctly, I sometimes have to bounce BOINC, or set the network activity to suspend for a few seconds and then back on, before anything will work again. Each machine you have makes a unique connection to the server every time it makes a request. It could be that one of those intangible resources in the machine needs to be cleared by restarting BOINC or rebooting. It could also just be that when that machine says "Hi" to the server, it freaks out and responds with "#$@&*!". Normally TCP/IP packets do include the host machine name at some point, so there could be something in that process getting tripped up. However, I wouldn't expect it to be host name filtering, as that would probably take too much CPU time on the backend. More than likely it is just the result of the servers getting hammered constantly. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url] |
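The behavior HAL9000 describes (a 503 that clears up on its own, fixed by backing off and retrying) can be sketched roughly in Python. This is a minimal illustration, not BOINC's actual retry logic; the URL, retry count, and backoff values are all hypothetical:

```python
import time
import urllib.request
import urllib.error

def should_retry(status_code):
    # 503 means the server is temporarily overloaded or in
    # maintenance, so the request is worth retrying later.
    return status_code == 503

def scheduler_request(url, retries=3, backoff=10):
    """Attempt a scheduler contact, backing off on HTTP 503."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if should_retry(e.code):
                # Wait a little longer after each failed attempt.
                time.sleep(backoff * (attempt + 1))
                continue
            raise  # any other HTTP error is a real failure
    return None
```

The real client spaces its retries out much further (minutes to hours), but the shape of the loop is the same: 503 is a "try again later" signal, not a hard error.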
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
... Actually, BOINC does have a blacklist feature though it works from the hostID rather than name. If a project sets a field in a host's record to -1, a work request from that host gets a "Not accepting requests from this host" error message in the reply. With BOINCstats showing 223782 active hosts on the project, there may be a few which have been lucky enough to get work every time they ask, and a few which have gotten no work for weeks by pure bad luck. But if I had a host which wasn't getting any work I'd be capturing packets from its requests and comparing them to packets from a host which was successfully getting work, or any other methods of seeing what was different I could conceive. Joe |
John McLeod VII Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0 |
SETI has two upload/download servers. Each machine gets the URL of one at random when it first signs up. It may very well be that the IP used by one of your machines is different than the IP used by the other. [edit] At least this used to be the case. Not certain if it is still the case. BOINC WIKI |
Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
SETI has two upload/download servers. Each machine gets the URL of one at random when it first signs up. It may very well be that the IP used by one of your machines is different than the IP used by the other. One upload, two download. Download is load-balanced by round-robin DNS every 5 minutes. Sometimes .13 (download_1) has major packet drop issues, and sometimes .18 (download_2) does. Some seasoned veterans around here have both in their HOSTS file and uncomment whichever one seems to be working better when there are a lot of time-outs and retries. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up) |
HAL9000 Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
... Good to know. I would guess this is the same mechanism that is used for things like "Not sending work - last request too recent: X sec" & "This computer has reached a limit on tasks in progress". Normally I just get: "Project communication failed: attempting access to reference site" "Internet access OK - project servers may be temporarily down." Although I have seen the "HTTP service unavailable" message. At work I normally see some kind of "gateway timeout" message. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url] |
gcpeters Send message Joined: 20 May 99 Posts: 67 Credit: 109,352,237 RAC: 1 |
Holy Jebus. Is this much manual intervention really needed? Whatever happened to coding "set and forget?" Once again my RAC is taking a nosedive due to various and sundry "under the covers" code that in my opinion could be better written. Instead, we have code that allows some folks to connect semi-regularly, but only on Thursdays, if the moon is full, and if they bounced their BOINC client 3 times while tapping their heels. Others just have to baby their systems along and then suddenly rejoice if they get some WUs to run... There has got to be a better way to do this... |
Fred J. Verster Send message Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
|
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Hi, just received this message today:

11/10/2011 10:22:08 | SETI@home | Reporting 3 completed tasks, requesting new tasks for CPU and NVIDIA GPU
11/10/2011 10:22:11 | SETI@home | Scheduler request completed: got 0 new tasks
11/10/2011 10:22:11 | SETI@home | No tasks sent
11/10/2011 10:22:11 | SETI@home | This computer has finished a daily quota of 8 tasks

Does anybody know what happened? Is this a new bug? Is there a new limit of tasks per day? Only 8? This computer is an i7 with 3 GTX 560s working 24/7, so it is capable of crunching more than 500 WU/day |
W-K 666 Send message Joined: 18 May 99 Posts: 19403 Credit: 40,757,560 RAC: 67 |
Hi, if you mean this computer Host 5264653 Error Tasks, then why did you abort all those tasks? Each error decreases the number that can be d/loaded by 1; 362 in error will cause the 8 tasks/day msg. The good news is each successful task will double the amount/day. |
LadyL Send message Joined: 14 Sep 11 Posts: 1679 Credit: 5,230,097 RAC: 0 |
Hi Checking the application details for your host 5264653 shows 'consecutive valid' as 3 and 'max tasks' as 37 [when I first looked, it's now at 4/101] for CPU. Checking your task list I find you aborted about 300 tasks earlier today [11 Oct 2011 | 13:17:18 UTC]. Those count as errors. Errors reset your 'consecutive valid' count to 0 and reduce your 'maximum tasks per day'. As soon as you restart delivering valid results (or some of your pendings validate) numbers start to climb again. The mechanism is to prevent hosts that have suffered some sort of breakdown and are only returning errors from getting lots of WUs they won't be able to crunch properly. Beaten by WinterKnight :) small correction - it doubles only till it reaches 100 - after that it is one more per validation. |
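The quota mechanism LadyL describes (errors knock the per-day limit down, valid results double it until 100, then add one per validation) can be sketched as a small Python function. This is an illustration of the rule as stated in the thread, not the actual BOINC server code, and the cap and floor values are taken from the posts above:

```python
def update_quota(quota, result_ok, cap=100, floor=1):
    """Adjust a host's max-tasks-per-day after one reported result.

    Per the thread: a valid result doubles the quota up to the cap of
    100, then adds one per validation; an error (including an aborted
    task) decreases it by one, down to a floor of 1.
    """
    if result_ok:
        if quota >= cap:
            return quota + 1       # past the cap, +1 per valid result
        return min(quota * 2, cap)  # doubling, clamped at the cap
    return max(quota - 1, floor)    # each error costs one, never below 1
```

So a host stuck at 8/day recovers quickly once it starts validating again: 8, 16, 32, 64, 100, 101, ...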
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Thanks for the help. You must be right; my system works OK now. |
gcpeters Send message Joined: 20 May 99 Posts: 67 Credit: 109,352,237 RAC: 1 |
So, I'm noticing a trend. I have several systems in the lab that are completely full with nothing but completed WUs (some in the hundreds), and these systems seem to be hung up in a state where they cannot upload anything. Communication is fine to the internets, and the extremely robust and helpful log files (yes, that was sarcasm) say only:

10/12/2011 9:56:11 AM | SETI@home | Scheduler request failed: HTTP service unavailable

Yet again, another system literally sitting in the same server rack, connected to the same gigE switch, to the same internets, does not experience this issue. Is there something in the code that is causing this to happen when a system is full with nothing but upload tasks? Other systems in the lab, again on the same network segments, chug along just fine as long as they don't get into this state of being full with finished WUs they are simply trying to upload. One of those completed WUs could have an alien signal contact in it!!! Just kidding. And the only reason they seem to get into this state is the extensive connectivity issues that continuously seem to plague the S@H hardware infrastructure. Why the bizarro-world server discrimination? Obviously, with only a single system running S@H, like the majority probably have, most people won't notice this type of S@H weirdness. But when I have 10 systems normally running S@H just fine, and right next to them 5 that aren't able to upload or download for whatever software reason, it gets pretty unnerving. I can only infer that something is broken in the S@H backend code. Contrary to how it might sound, I really do care about this project. I'm just frustrated by all the inconsistencies I'm seeing (I'm a validation engineer by trade), and I have almost zero insight to help fix the issues I'm having as a user. I have literally hundreds of gflops, if not tflops, worth of compute power at my disposal to throw at this project...it just seems like a waste to not be able to use it all... |
W-K 666 Send message Joined: 18 May 99 Posts: 19403 Credit: 40,757,560 RAC: 67 |
So, I'm noticing a trend. I have several systems in the lab that are completely full with nothing but completed WUs (some in the hundreds) and these systems seem to be hung up in a state where they cannot upload anything. Communication is fine to the internets and the extremely robust and helpful log files (yes, that was sarcasm) say only: BOINC will not request work if the number of uploads is greater than 2 * number of processors. |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
So, I'm noticing a trend. I have several systems in the lab that are completely full with nothing but completed WUs (some in the hundreds) and these systems seem to be hung up in a state where they cannot upload anything. Communication is fine to the internets and the extremely robust and helpful log files (yes, that was sarcasm) say only: BOINC is of course defaulted to information the devs think appropriate for non-technical users. A cc_config.xml file can be used to get extremely detailed logging. "HTTP service unavailable" would show a line indicating a 503 error, among many other lines for each contact attempt, if the cc_config.xml was:

<cc_config>
  <log_flags>
    <http_debug>1</http_debug>
  </log_flags>
</cc_config>

Joe |
gcpeters Send message Joined: 20 May 99 Posts: 67 Credit: 109,352,237 RAC: 1 |
"BOINC will not request work if the number of uploads is greater than 2 * number of processors." Why should that matter? BOINC got itself into this mess by having so much downtime while my system is still chugging away at 10 days' worth of WUs. Suddenly it's so full that it's hit some kind of arbitrary limit on what it can upload? Poopoo. So penalize the power crunchers? Who wrote that code? I have 80 logical processors in this system. I can't believe this is a real issue... /me throws arms up in disbelief Ok, I'm better now. So, what do I do to unscrew this situation that is no fault of mine? Throw away all these perfectly good crunched WUs? I am tired of seeing:

10/11/2011 12:00:09 PM | SETI@home | Scheduler request failed: HTTP service unavailable

which is total garbage messaging, btw. The HTTP service clearly is available, because my other 10+ systems are downloading, crunching, and uploading just fine on the same network. This is just plain and simple bad code. |
John McLeod VII Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0 |
"BOINC will not request work if the number of uploads is greater than 2 * number of processors." BOINC != SETI. SETI has gotten into a bit of trouble. Some projects have had spells where they could produce, deliver, and crunch tasks faster than the results could upload. This caused an ever increasing list of uploads on the clients. This is not a good state to be in. The situation was such that there was nothing preventing the number of uploads from reaching infinity on every computer attached to the project - thus the arbitrary limit on the number of tasks waiting for uploads to complete before new work is requested. BOINC WIKI |
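The rule John describes is a simple gate: once the upload backlog grows past a multiple of the processor count, stop asking for new work until the backlog drains. A minimal sketch, using the "2 * number of processors" threshold quoted in this thread (the real client's constant and bookkeeping may differ):

```python
def can_request_work(pending_uploads, n_processors):
    """Work-fetch gate: refuse to request new tasks while the
    number of results stuck in upload exceeds twice the number
    of processors, so the backlog cannot grow without bound."""
    return pending_uploads <= 2 * n_processors
```

On an 80-logical-processor host like gcpeters', that allows up to 160 stuck uploads before work fetch stops, which explains why heavily loaded machines hit the limit first during an outage.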
LadyL Send message Joined: 14 Sep 11 Posts: 1679 Credit: 5,230,097 RAC: 0 |
"BOINC will not request work if the number of uploads is greater than 2 * number of processors." Try ping/tracert to 208.68.240.20 from the affected machine to see if that particular machine can reach the servers. Just because the rest of the LAN can doesn't mean that one can as well. Don't assume - test. If it works, enable the debugging log Joe posted and post the log. |
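LadyL's "don't assume - test" advice can also be done at the TCP level, which is closer to what the BOINC client actually does than ICMP ping (some paths block ping but pass HTTP). A minimal sketch; the host and port are whatever server you are diagnosing:

```python
import socket

def can_reach(host, port=80, timeout=5):
    """Return True if this machine can complete a TCP connection
    to host:port within the timeout, analogous to testing the
    scheduler from the affected box rather than from the LAN."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. can_reach("208.68.240.20")  # the server IP mentioned above
```

Running this from the machine that is failing, and again from one that is working, tells you quickly whether the problem is per-host routing or something on the server side.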
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.