Suddenly BOINC Decides to Abandon 71 APs...WTH?

Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH?

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1696105 - Posted: 27 Jun 2015, 2:48:26 UTC

Being fast isn't necessarily all there is to it. I'm on Comcast and for a long time, it was by far the fastest (on average) ISP in the country, but every now and then, some webpages would be really slow to respond, and I thought it was just that one server/site that was slow, but it turned out to be everything. Doing an endless ping (-t switch on the command-line) showed that over a 20 minute period of pinging google, I had a 22% packet loss.

Power cycling the modem and router fixed it for a few minutes, and then it came back again. Power cycled one more time, and kept doing it until the channel frequency my modem bonded to (that's how cable modems work) was something other than 555 MHz. I got on 597 MHz and the SNR went from 30.1 to 39.3 and the packet loss went away.

So.. basically, that's great if you can download at 12 MiB/sec, but if you have packet loss.. you're still going to have issues with stable, reliable connections.
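The endless-ping check described above can be scripted. Here is a minimal sketch (mine, not from the post) that pulls the loss figure out of a Linux-style ping summary line; the Windows `ping -t` summary wording differs slightly, so the pattern would need adjusting there.

```python
import re

def packet_loss_percent(ping_summary: str) -> float:
    """Extract the packet-loss percentage from a ping statistics line,
    e.g. '1200 packets transmitted, 936 received, 22% packet loss, ...'."""
    m = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_summary)
    if m is None:
        raise ValueError("no packet-loss figure found in summary")
    return float(m.group(1))

# e.g. the summary line printed by: ping -c 1200 google.com
summary = "1200 packets transmitted, 936 received, 22% packet loss, time 1199919ms"
print(packet_loss_percent(summary))  # 22.0
```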
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1696106 - Posted: 27 Jun 2015, 2:58:25 UTC - in response to Message 1696105.  

Being fast isn't necessarily all there is to it. I'm on Comcast and for a long time, it was by far the fastest (on average) ISP in the country, but every now and then, some webpages would be really slow to respond, and I thought it was just that one server/site that was slow, but it turned out to be everything. Doing an endless ping (-t switch on the command-line) showed that over a 20 minute period of pinging google, I had a 22% packet loss.

Power cycling the modem and router fixed it for a few minutes, and then it came back again. Power cycled one more time, and kept doing it until the channel frequency my modem bonded to (that's how cable modems work) was something other than 555 MHz. I got on 597 MHz and the SNR went from 30.1 to 39.3 and the packet loss went away.

So.. basically, that's great if you can download at 12 MiB/sec, but if you have packet loss.. you're still going to have issues with stable, reliable connections.

Sorry to hear of your troubles, however, you must have missed this part;
It's long been the gold standard in what customers would want in a connection—namely, fiber-to-the-home (FTTH)....It remains the top ISP in four out of the six regions of the continental U.S. Also, as in previous years, our Readers' Choice awards indicate that the PCMag audience considers FiOS the utter pinnacle of great customer service. FiOS, we also should note, has an astronomically good upload speed to go with its download speed.

The only trouble I've had over the years is when it goes out. If it's working at all, it's as fast as I would ever need.

The point is, as I mentioned earlier, the bad-internet ploy won't cut it this time. SETI works on dial-up, and fiber's worst day is still light years ahead of dial-up.
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1696108 - Posted: 27 Jun 2015, 3:10:04 UTC

I'm not disagreeing at all. I wouldn't expect it to be an ISP issue either, especially since only one machine did it and not the rest.

I still think the analysis of the timestamps on the chain of events indicates that the first request did not get a timely response, so BOINC decided it was a failed connection. Another contact was made successfully, and then that first request finally arrived or got processed after the second one that was successful.

That points to out-of-sequence communication. I strongly doubt it is anything to do with your equipment, or even the local system for the ISP, but the Internet backbone is a pretty vast place. Either the request bounced around for a few minutes, or as Jason suggested, maybe for whatever reason, the hostID was missing from the request and it took the server nine minutes to run a query to try to figure out who that request belongs to.

Once the server processed that delayed request, it did what it did, and of course BOINC isn't going to get a spontaneous response from the server when there are no active, open connections awaiting a response at that point.

However, on the next contact after that, the scheduler should have told your client that it isn't supposed to have all those WUs anymore (I'm assuming they stayed in your cache and you could have continued crunching them, right?). It is my understanding that these days, when you make a scheduler contact, a list of what you currently have is sent in with the request, so the scheduler knows you're still working on them; and if you were assigned a few WUs but that reply got lost and never made it to you, those tasks can be resent (if that particular feature is enabled at the time).


So, as has been said numerous times, the question here is: why did that first request take 9m 34s to finally be processed by the server? What happened to it? Did it get lost in transit and finally find its way there, or was it received right away but then took ~9m 30s to run a query on the DB? And then why wasn't your client notified on the next request that you should no longer have those WUs?
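To make the out-of-sequence scenario concrete, here is a toy model of the behaviour being inferred in this thread. Nothing below is actual BOINC server code; `HostRecord` and `handle_request` are made-up names for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class HostRecord:
    last_seqno: int = 0                        # highest request number seen so far
    tasks: list = field(default_factory=list)  # work units assigned to the host

def handle_request(host: HostRecord, seqno: int) -> str:
    """Toy model: a request numbered lower than one already processed is
    treated as a sign the client state moved hosts, and work is abandoned."""
    if seqno < host.last_seqno:
        host.tasks.clear()
        return "abandoned"
    host.last_seqno = seqno
    return "ok"

host = HostRecord(tasks=["ap_task_1", "ap_task_2"])
handle_request(host, 101)  # request 2 arrives and is processed first: "ok"
handle_request(host, 100)  # delayed request 1 finally arrives: "abandoned"
print(host.tasks)          # []
```

In that model, the delayed request alone is enough to trigger the abandonment, which matches the timeline analysed above.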
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
Jeff Buck · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1696110 - Posted: 27 Jun 2015, 3:22:48 UTC - in response to Message 1696103.  

So you see, when people try to insinuate the long time Best ISP in the country isn't good enough for the SETI server, it's kinda funny.

If your ISP provided a dedicated direct pipe from your router to Berkeley, then I suppose the ISP rankings would matter. However, a typical scheduler request from your house to SETI's scheduler probably passes through 15-20 or more switches/routers/servers or other connections on the way, and any one of them could hit a temporary bottleneck. Heck, I'm only about 100 miles from Berkeley and a traceroute shows 12 hops from my PC to the server, 9 of those after it leaves my router:

  1	0 ms	1367 ms	4102 ms	[local]
  2	1 ms	1 ms	1 ms	[local]
  3	20 ms	32 ms	53 ms	108-239-176-2.lightspeed.mtryca.sbcglobal.net [108.239.176.2]
  4	-1 ms	0 ms	0 ms	[local]
  5	21 ms	22 ms	23 ms	12.83.47.129
  6	29 ms	29 ms	30 ms	12.122.200.9
  7	31 ms	31 ms	31 ms	192.205.33.46
  8	31 ms	31 ms	32 ms	palo-b1-link.telia.net [62.115.139.13]
  9	31 ms	37 ms	42 ms	hurricane-ic-308019-palo-b1.c.telia.net [80.239.167.174]
  10	146 ms	187 ms	211 ms	64.71.140.42
  11	34 ms	34 ms	34 ms	208.68.243.254
  12	47 ms	47 ms	48 ms	setiboinc.ssl.berkeley.edu [208.68.240.20]
Completed.

If I run the same TraceRoute another time, it may follow a slightly different path.

Anyway, I don't really think the issue here is that a packet got held up somewhere temporarily, albeit for an extraordinarily long time. That can happen to anybody at any time I would think, and probably does, every day. To me, the issue is how the scheduler handles it. From Jason's code diving, it appears that it's designed to automatically assume that any out-of-sequence scheduler requests result from nefarious intentions and trigger what is essentially a punitive action. Seems to me that it would be better for the scheduler to simply ignore any requests with an earlier sequence number. Even better would be for it to respond with a message actually stating that it was ignoring the request for just that reason. At least that would provide something for the requesting host's log.
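The alternative suggested above could look something like this sketch (again hypothetical, not scheduler source): a stale request is ignored and the reply says why, so nothing is abandoned and the client gets a log entry.

```python
def handle_stale_request(last_seqno: int, seqno: int) -> tuple[int, str]:
    """Hypothetical lenient variant: ignore any request whose sequence
    number is earlier than one already processed, and say so in the
    reply so the requesting host's log shows what happened."""
    if seqno < last_seqno:
        return last_seqno, f"ignoring out-of-sequence request {seqno} (already at {last_seqno})"
    return seqno, "ok"

print(handle_stale_request(101, 100)[1])  # stale request ignored, tasks untouched
```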
cliff
Joined: 16 Dec 07
Posts: 625
Credit: 3,590,440
RAC: 0
United Kingdom
Message 1696111 - Posted: 27 Jun 2015, 3:38:06 UTC
Last modified: 27 Jun 2015, 3:52:02 UTC

Hi Folks,
Well, add me to the list of users with postponed, then abandoned AP WUs; just lost a few, as soon as the 1st one started to be crunched.

Beginning to be a bit of ye olde Laurel & Hardy routine.. 'This is another fine mess you've gotten me into' :-)
[edit]
BTW, there was a 'CL file build failure' error message as well:

27/06/2015 04:25:32 | SETI@home | Starting task ap_02fe15ab_B2_P1_00339_20150623_10703.wu_0
27/06/2015 04:25:33 | SETI@home | Task postponed: CL file build failure
27/06/2015 04:25:33 | SETI@home | Starting task ap_07ja15aa_B6_P1_00058_20150624_13380.wu_0
27/06/2015 04:25:35 | SETI@home | Task postponed: CL file build failure

Cheers,
Cliff,
Been there, Done that, Still no damm T shirt!
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13765
Credit: 208,696,464
RAC: 304
Australia
Message 1696112 - Posted: 27 Jun 2015, 3:45:57 UTC - in response to Message 1696108.  

So as has been said numerous times: the question here is: why did that first request take 9m 34s to finally be processed by the server? What happened to it? Did it get lost in-transit and finally find its way there, or was it received right away, but then took ~9m 30s to do a query on the DB?


Or is it an error that occurs on the client side? All requests are sent & received in order, but for some reason one of them ends up with an incorrect time stamp.
Grant
Darwin NT
Brent Norman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1696114 - Posted: 27 Jun 2015, 3:58:41 UTC - in response to Message 1696111.  

Cliff, your problem is with Lunatics and driver versions. Read this thread about it:

http://setiathome.berkeley.edu/forum_thread.php?id=77251
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1696116 - Posted: 27 Jun 2015, 4:06:50 UTC - in response to Message 1696108.  
Last modified: 27 Jun 2015, 4:46:54 UTC

I'm not disagreeing at all. I wouldn't expect it to be an ISP issue, not to mention only if one machine did it and not the rest.

I still think the analysis of the timestamps on the chain of events indicates that the first request did not get a timely response, so BOINC decided it was a failed connection. Another contact was made successfully, and then that first request finally arrived or got processed after the second one that was successful.

That points to out-of-sequence communication. I strongly doubt it is anything to do with your equipment, or even the local system for the ISP, but the Internet backbone is a pretty vast place. Either the request bounced around for a few minutes, or as Jason suggested, maybe for whatever reason, the hostID was missing from the request and it took the server nine minutes to run a query to try to figure out who that request belongs to.

Once the server processed that delayed request, it did what it did, and of course BOINC isn't going to get a spontaneous response from the server when there are no active, open, awaiting a response connections at that point.

However, on the next contact after that, the scheduler should have told your client that it isn't supposed to have all those WUs anymore (I'm assuming they stayed in your cache and you could have continued crunching them, right?). It is my understanding that these days, when you make a scheduler contact, a list of what you currently have is sent in with the request, so that the scheduler knows that you're still working on them, or if you ended up being assigned a few WUs, but that reply got lost and never made it to you, you can be resent those tasks (if that particular feature is enabled at the time).


So as has been said numerous times: the question here is: why did that first request take 9m 34s to finally be processed by the server? What happened to it? Did it get lost in-transit and finally find its way there, or was it received right away, but then took ~9m 30s to do a query on the DB? And then why wasn't your client notified on the next request that you should no longer have those WUs?

My Event was similar to this one, Phantom errors/abandoned tasks;
Posted: 17 Jun 2015, 12:05:31 UTC
According to the 'listing' for one of my machines, I/it has abandoned/errored 47 tasks, all at the same time and has no tasks in progress. Not according to my eyes!!! There are currently 43 tasks in progress on it and according to the event log in BOINC, none have been abandoned. The only things listed in the log, are the normal start, finish, upload and report. So, why the difference and why is my PC going to get a bad rep, because of an error at the S@H end?

Mine was still starting and running the tasks as usual. If I hadn't looked at the web page, I wouldn't have known anything was wrong.

Here it is again; 27 Jun 2015, 1:20:21 UTC - Abandoned. He just got whacked again.
I count 48 tasks this time, and I doubt he's aware of any problems, which means his machine will work all night on those worthless tasks. None of the new tasks are being completed, but he is consistently being sent a new task in what would appear to be the time it takes to complete one of the abandoned tasks. What a waste.
Did anyone catch the server log on that host? 'Cause I told ya it was going to happen.
It will most likely happen tomorrow as well.
cliff
Joined: 16 Dec 07
Posts: 625
Credit: 3,590,440
RAC: 0
United Kingdom
Message 1696117 - Posted: 27 Jun 2015, 4:13:33 UTC - in response to Message 1696114.  

Hi Brent,

Cliff your problem is with Lunatics and driver versions. Read this thread about it.

http://setiathome.berkeley.edu/forum_thread.php?id=77251

Obliged, guess I'll be editing the dratted .cl file :-)



Regards,
Cliff,
Been there, Done that, Still no damm T shirt!
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696124 - Posted: 27 Jun 2015, 4:59:44 UTC - in response to Message 1696103.  
Last modified: 27 Jun 2015, 5:02:42 UTC

No, I'm not considering 'throwing the baby out with the bathwater' as far as sanity-check logic goes. I'm just finding it extremely challenging to rationalise the coded logic that an out-of-sequence receipt implies only a client-state migration to another host, or that a failed hostid lookup means the user detached and reattached, with no other possibilities considered. It seems like simply incomplete logic, with minimal checking of the data already on hand that could verify those assumptions before (necessary and appropriate) action is taken.

One other item. Any clue why this alleged request at 25 Jun 2015, 18:36:13 UTC was never acknowledged? There should be an entry such as;
Thu 25 Jun 2015, 18:36:13 UTC Scheduler request completed: got 0 new tasks
But there isn't one, there is nothing. No indication there was anything to be out of sequence.


First, these questionable logic portions I've traced through so far (and stopped at for the moment) are very early in the process, during request authentication. AFAICT that's possibly before any log message has occurred, and certainly long before any successful server response would be due to print anywhere.

I'm doubtful anything has to be out of sequence before it reaches Berkeley's network for things to go wrong with this (incomplete) logic; I suspect it could even be getting jumbled up within the scheduler operation itself. If that's, at least in some rare proportion of cases, simply due to an earlier request taking longer to authenticate than a more recent arrival, then it fits a pattern that appears throughout the BOINC codebase: not using any locks while servicing different requests. I'm not saying that's necessarily the precise mechanism of the weirdness here, but it fits the pattern.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1696157 - Posted: 27 Jun 2015, 6:54:41 UTC - in response to Message 1696108.  

So as has been said numerous times: the question here is: why did that first request take 9m 34s to finally be processed by the server? What happened to it? Did it get lost in-transit and finally find its way there, or was it received right away, but then took ~9m 30s to do a query on the DB? And then why wasn't your client notified on the next request that you should no longer have those WUs?

If the request had been received in a timely fashion, and had started processing right away, there would have been no database anomalies to trigger the query.

We know that everything was hunky-dory with the host record at 14:33:31/14:33:34 (Request 2 sent/completed), so the abandonment only makes sense - even given the crude sledgehammer response to out-of-sequence requests - if the 9m 34s is broken down into two parts:

From 14:26:47 to 14:33:34 lost somewhere in cyberspace or Berkeley's servers
From 14:33:34 to 14:36:13 running the query and scrubbing the tasks

That's why I want somebody to look at the server logs. The user at Beta who has this happen repeatedly would be a good candidate - but as things stand, at this project, at the moment - only staff can access those logs.

Bless - he runs Einstein as well, on what looks like the same computer. No trashed tasks, but this is what I want to see at SETI (probably Beta only, because of the server loading):

http://einstein.phys.uwm.edu/host_sched_logs/10981/10981248
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696164 - Posted: 27 Jun 2015, 6:59:54 UTC - in response to Message 1696157.  

So which project has more capable scheduler setups, and which hostid is lower ?
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1696166 - Posted: 27 Jun 2015, 7:05:12 UTC - in response to Message 1696164.  

So which project has more capable scheduler setups, and which hostid is lower ?

It's not the raw power of the servers that matters so much as the ratio of capability to user demand. It may be wishful thinking, but I'd imagine that Beta has low loading relative to the server power available - and there'd be fewer public complaints if Eric tried it there and found I was wrong.

Not sure what the question is about HostID?
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696168 - Posted: 27 Jun 2015, 7:08:33 UTC - in response to Message 1696166.  

So which project has more capable scheduler setups, and which hostid is lower ?

It's not the raw power of the servers that matters, as the ratio of capability to user demand. It may be wishful thinking, but I'd imagine that Beta has relatively low loading comparative to the server power available - and there'd be fewer public complaints if Eric tried it there and found I was wrong.

Not sure what the question is about HostID?


I never mentioned power; by 'capable' I of course mean in proportion to the tasks demanded of it. Thanks anyway for the attempt at a correction.

Isn't Beta here co-residing with Main?
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696170 - Posted: 27 Jun 2015, 7:18:40 UTC - in response to Message 1696168.  
Last modified: 27 Jun 2015, 7:19:09 UTC

Not sure what the question is about HostID?


Only that I'm tripping over really suspicious logic early in authentication, where the server looks up the hostid and fails (for whatever reason). Nothing more than that.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Rasputin42
Volunteer tester
Joined: 25 Jul 08
Posts: 412
Credit: 5,834,661
RAC: 0
United States
Message 1696171 - Posted: 27 Jun 2015, 7:18:42 UTC

The scheduler should not jump to any rash conclusions just because there was a late response, 9-minute delay or not.
That needs to be changed.
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1696173 - Posted: 27 Jun 2015, 7:19:46 UTC - in response to Message 1696168.  
Last modified: 27 Jun 2015, 7:51:55 UTC

Isn't beta here coresiding with main ?

Yes, I see familiar names on the Beta status page.

To be honest, I don't know how much (if any) pre-processing work is done on the Einstein logs to make them ready for inspection on demand, or whether what you see is created on-the-fly by a processing script.

When you follow a link like the one I posted, you or I - as public viewers - only get the data for the single most recent RPC, and some of the user security information is redacted. BOINC servers keep longer logs than that, and the full information (user ID and IP address) is accessible to administrators. So there's processing done somewhere.

If the processing is done routinely in advance (unknown), then the load to implement it at the Main project would be much greater than for Beta.

Edit - answered my own (implied) question. The Einstein server logs must be pre-processed, because if you go up a level to, e.g., http://einstein.phys.uwm.edu/host_sched_logs/10981/, you can see the file times and sizes for each host in that part of the fanout. So, implementing it here would cost both time (added to every RPC) and storage space.
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696175 - Posted: 27 Jun 2015, 7:21:23 UTC - in response to Message 1696171.  

The scheduler should not jump to any rash conclusions,because there was a late response! 9 minute delay or not.
That needs to be changed.


I happen to agree about the jumping to conclusions part. Still trying to figure out why those conclusions would be made at all...
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696176 - Posted: 27 Jun 2015, 7:25:06 UTC - in response to Message 1696173.  

Look, Bernd and Oliver are great blokes with a lot of insight, and I respect them a lot without ever being likely to meet them. We all know we're thrown into a puddle of poo that doesn't work the way it should, but the quicksand of complacency and lies is draining. What it needs now is a hero.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696296 - Posted: 28 Jun 2015, 1:20:18 UTC
Last modified: 28 Jun 2015, 1:24:38 UTC

Proposal to test/reproduce at least part of what happens, if someone is game: while having some tasks in cache, blank the <hostid> field contents in client_state.xml (but leave the cpid alone). This *should* simulate a hostid lookup failure and be interpreted as a detach/reattach of the same host, marking the tasks abandoned, without actually having done a detach/reattach.

It'd be interesting to verify whether the tasks remain on the host (as I guess they might), where a real detach would see them gone. That *might* lead to an easy improvement to that scheduler logic: if the scheduler request lists tasks, then obviously it wasn't a real detach/reattach.

There are many reasons that hostid lookup could either fail or take a very long time. One example would be if that part of the host table happened to be locked for other purposes by the database; another, on a RAID-5 array with a damaged block, would be the read having to wait for the whole block to be corrected.

Anyone game ?
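For anyone game, the blanking step might be sketched like this; it just rewrites the <hostid> element's contents in the client_state.xml text. Stop the BOINC client and keep a backup first; the file path and the single-line <hostid> layout are assumptions here, not guarantees.

```python
import re

def blank_hostid(state_xml: str) -> str:
    """Return client_state.xml text with the <hostid> contents blanked,
    leaving the cross-project id (cpid) alone. Assumes the id sits on
    one line as <hostid>12345</hostid>."""
    return re.sub(r"<hostid>\d+</hostid>", "<hostid></hostid>", state_xml)

# usage sketch (path varies by platform; client must not be running):
# text = open("client_state.xml").read()
# open("client_state.xml", "w").write(blank_hostid(text))
```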
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.


 
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.