Posts by Jeff Buck


1) Message boards : Number crunching : Panic Mode On (98) Server Problems? (Message 1699028)
Posted 8 hours ago by Jeff Buck (Project donor)
Hi folks!

I just checked my pending MBs and noticed some recent results haven't validated yet, although my wingmen and I finished the WUs. For example http://setiathome.berkeley.edu/workunit.php?wuid=1830619685
Any idea what's wrong with the validators?

It seems that, from time to time, the validators take a nap and some WUs like that one slip past them. Last September I had a whole bunch of tasks that fell into a validation black hole that lasted about 10 minutes or so. Richard Haselgrove responded to my report, as follows:

There is a failsafe in place, which weeds out most errors like that - the transitioners/validators take a second look at stuck WUs on the day the original deadlines would have passed (three weeks for shorties, six/seven/eight weeks away for the rest).

He was right. When the original deadline was reached for those tasks, the validators successfully picked them all up. In the case of your task, that should be about August 18. You'll just have to be patient until then.
2) Message boards : Number crunching : Panic Mode On (98) Server Problems? (Message 1699015)
Posted 9 hours ago by Jeff Buck (Project donor)
I attribute it to BOINC 7.2.33 which seems to be the best version I've come across.

Well, take a look at my active hosts and see which BOINC version I'm running....on all of them. :^) I still average a couple of truncations a day, I think, though as I mentioned, only(!) about 5 a month end up Invalid.
3) Message boards : Number crunching : Panic Mode On (98) Server Problems? (Message 1699005)
Posted 10 hours ago by Jeff Buck (Project donor)
Yes, that thread was a Long time ago, the problem was identified, yet it still exists.

Most irritating to me, not only was the problem identified, but a simple fix for the validation side of it was proposed by Joe Segur and passed along to Eric....never to be heard of again. While Jason's commode build seems to successfully address the Stderr truncation, it only covers NVIDIA GPUs. Unfortunately, the truncations can happen on CPUs and ATI GPUs, as well, so the validation-side fix would be more comprehensive, even though it wouldn't eliminate the truncations. I usually see the truncations happening every day, but only a small subset of those end up Invalid, currently averaging about 5 a month for me in 2015.
4) Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH? (Message 1697594)
Posted 4 days ago by Jeff Buck (Project donor)
Well, I certainly can't speak for Richard, but if I look at the task detail page for that first task in your list, I see:

Report deadline 26 Jul 2015, 13:03:07 UTC

That would have been the original 25-day deadline. The workunit detail page is showing when the task was "reported", or in this case, abandoned. It looks like all your tasks were abandoned at the same time, so you might take a look in your Event Log around that time to see if you've had a timeout similar to what TBar reported in the first post in this thread. If you did, then that really is what they're busily trying to fix here (and may very well have, if their proposed solution passes muster).
5) Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH? (Message 1697079)
Posted 6 days ago by Jeff Buck (Project donor)
well we actually have two conditions - one is the backup scenario the other is the 'ask twice - rpcs get out of order' that keeps killing the cache of some people here. the bits of code triggered are the same.

Yes, so should the response to those two conditions be different? And where does the fraud-blocking that was discussed earlier come in?
6) Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH? (Message 1697076)
Posted 6 days ago by Jeff Buck (Project donor)
My goodness you boys have been busy while some of us slept! Seems to have gotten a lot more complicated since the last installment I read, where we were simply discussing the handling of out-of-sequence requests. Let's see, what caught my eye....

edit2: I still think it's exceedingly impertinent to insinuate that you were doing something dodgy, when the most probable cause is having reverted to a backup for some reason.
Actually, the probable cause that triggered this discussion was having an initial request get hung up in transmission, timeout, then have a second request arrive at the scheduler before the first one eventually trundles in. My successful test of the abandonment simply used a backup copy to simulate the out-of-sequence condition.

exactly - so just check the host really hasn't anything running before we ditch the lot.
If you want to be more sophisticated, clean out what's really not there.
I like that even better than my "do nothing except report the out-of-sequence condition" suggestion. If active tasks are included in the request, why abandon them?

I have a few other thoughts, but with the outage looming, I think I'll just post this quick.

EDIT: Actually, one minor issue with that last suggestion just occurred to me. If the second request (arriving first) generated new tasks, those tasks would not show up in the later-arriving first request. You wouldn't want to abandon those new ones.
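
A minimal sketch of the "clean out what's really not there" idea, with the caveat from the EDIT built in: tasks issued in reply to a later request are skipped, because the late-arriving request could not have listed them. This is only an illustration, not BOINC scheduler code; `ServerTask`, `sent_seqno`, and `tasks_to_abandon` are hypothetical names, and it assumes the server records which request each task was issued in reply to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerTask:
    name: str
    # Hypothetical field: rpc_seqno of the request this task was issued in reply to.
    sent_seqno: int


def tasks_to_abandon(in_progress, reported_names, request_seqno):
    """Return the names of tasks the server could mark as abandoned.

    in_progress    -- ServerTask records the server believes are on the host
    reported_names -- set of task names the host listed in this request
    request_seqno  -- sequence number carried by this (possibly late) request
    """
    abandon = []
    for task in in_progress:
        if task.name in reported_names:
            continue  # the host still has it, so leave it alone
        if task.sent_seqno >= request_seqno:
            continue  # issued after this request was built; it could not know about it
        abandon.append(task.name)
    return abandon
```

With tasks `a` and `b` issued at sequence number 5, task `c` issued at 11, and a late request numbered 10 that lists only `a`, this returns just `b`.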
7) Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH? (Message 1696892)
Posted 6 days ago by Jeff Buck (Project donor)
Could just all be signs of bandaid induced entropy.

I think that description pretty much fits any programs that have been around for more than, oh say, six months, especially if they have more than one person's fingerprints on them.

Well, I don't think there's anything more I can really offer here so, as I said before, I sure hope you can convince "someone" to make, or at least implement, some changes. ;^) Good luck!
8) Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH? (Message 1696888)
Posted 6 days ago by Jeff Buck (Project donor)
hmmm, yeah definitely seems backwards. perhaps it wasn't fully thought out.

In any case, I think the basic trigger of assuming a low rpc number, followed by host match, means the user is juggling hosts/folders, the reason to leave out cpid search in this path, is pretty thin logic.

If you transfer the data folder to an identical host [name it the same], adjust the local IP to the old one, and the key hardware is the same, why care ? Maybe it's assuming you copied the client state and forgot the rest of the data folder ?

[Even then, the sequence number would be fine...]

It almost seems like that "make_new_host" logic was originally written for another purpose, then just co-opted later for use by the rpc_seqno checking. Are there other routines that perform, or "goto", that code? (BTW, I'm a retired dinosaur, and if you ever really want to try a brain-bender, take a crack at following the logic of an old COBOL program with ALTER statements in it, or the equivalent in ALC. AAAAARRRRGGGGHHH!)
9) Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH? (Message 1696881)
Posted 7 days ago by Jeff Buck (Project donor)
LOL! Yep, that explains it. It seems kind of mystifying in the second case for it to have to try to "locate" the host after the rpc sequence number check fails, when it's already succeeded in looking up the hostid and the user, and authenticating the request. Why in the world does it have to do all that additional scanning and then, only when it succeeds, trash the tasks in progress? Definite weirdness! I certainly hope you can convince "someone" to make some changes! ;^)


Yeah definitely oddball logic. The best I can fathom of the intent from the comments and code, is that the idea is to punish you for moving the client state to another host. I'd have to think if that's the case, then the collateral damage for legitimately scrambled rpc sequence is too high.

A less destructive choice, in my mind, is they could use cpid match, AND token match other elements, but leave out local IP as they can be dynamic, and especially volatile under communication stresses (that may cause a scrambled rpc sequence number).

It'd be one thing if it failed one of those matches for "hostname, IP, processor and amount of RAM". but to abandon tasks when it was successful seems awfully strange.

I agree with you, too, about relying on the IP lookup as part of the validation. Personally, I don't use DHCP, and the static IPs I've assigned rarely change. (That host I tested with shows "same the last 1874 times".) But I could conceivably shuffle some IPs if I make a change, and DHCP would certainly seem like a crapshoot for those using it, especially when adding or deleting a device.
10) Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH? (Message 1696879)
Posted 7 days ago by Jeff Buck (Project donor)
Is there any way something simple could work? Such as having the client send a cc: to the Server when it Times Out a Request? You know, when it logs a timeout on the host send a copy to the Server informing the Server it is canceling the request.

In a sense, simply sending the next request should serve as that kind of notification, IF the higher rpc_seqno would cause the scheduler to ignore any request that it receives later but with the lower sequence number. Then, again, who's to say that the second request (or some other notification like you suggest) will always get to the scheduler before the first request? Even with the timeout, the first one could still conceivably get there first. The bottleneck might not cause a 9+ minute delay but maybe just a long enough delay that clears about the same time the host reaches its timeout deadline or, for that matter, anytime during that minute and a half between the timeout and the sending of the next request. Of course, the next request might also happen to hit a similar bottleneck. I don't really know how they could reliably synchronize requests for every possible situation.
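
The timing argument above can be made concrete with a toy model. This is only a sketch: `arrival_order` is a hypothetical helper, and the 300-second timeout used in the usage note is an assumed figure for illustration, not a documented client setting. Only the roughly 90-second gap before the next request and the 9-minute delay come from the discussion.

```python
def arrival_order(first_delay, second_delay, timeout, retry_gap):
    """Return the order in which two requests reach the scheduler.

    Request 1 is sent at t=0. The client gives up after `timeout` seconds
    and sends request 2 `retry_gap` seconds later. The two delays are
    network transit times in seconds.
    """
    arrivals = [
        (first_delay, 1),
        (timeout + retry_gap + second_delay, 2),
    ]
    # Sort by arrival time and keep only the request numbers.
    return [number for _, number in sorted(arrivals)]
```

With `timeout=300` and `retry_gap=90`, a first request delayed by 540 seconds arrives after a second request that takes 2 seconds, so the order is `[2, 1]`. If the first request is delayed by only 350 seconds, it clears after the timeout but before the retry arrives, and the order is `[1, 2]`. The client has timed out in both cases, so the second request by itself cannot tell the scheduler which situation it is in.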
11) Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH? (Message 1696876)
Posted 7 days ago by Jeff Buck (Project donor)
Which it certainly doesn't seem like it's currently accomplishing. By the way, have you figured out why the first group of tests with both doctored hostid and rpc_seqno fields didn't trigger the abandonment, while my final test with the lower rpc_seqno but an untouched hostid field was successful? Was the hostid check executed first, and then the rpc_seqno check bypassed after the hostid was corrected?


yeah hostid lookup is first. Personally I would have made user authentication first so as to reduce the exposure to DoS attacks, but that's a side issue for these purposes.

In the first case [no abandonment]:
- lookup by hostid (fails)
-- lookup by rpc seqno in user's hosts (fails, goto (!) lookup_user_and_make_new_host)
lookup_user_and_make_new_host:
- lookup user, match authenticators
- if cpid is present, scan the user's hosts and match it. (succeeds, last ditch attempt)

In the second case [tasks abandoned]:
- lookup by hostid (succeeds)
- lookup the user (succeeds)
- Authenticate (succeeds)
- rpc sequence number check (fails, goto (!) make_new_host )
make_new_host:
- Final attempt to locate host by scanning back through user's hosts matching hostname, IP, processor and amount of RAM. (succeeds; next do ***)
*** if found (it was), use the existing record AND mark results as over (except if allow_multiple_clients is enabled)

LOL! Yep, that explains it. It seems kind of mystifying in the second case for it to have to try to "locate" the host after the rpc sequence number check fails, when it's already succeeded in looking up the hostid and the user, and authenticating the request. Why in the world does it have to do all that additional scanning and then, only when it succeeds, trash the tasks in progress? Definite weirdness! I certainly hope you can convince "someone" to make some changes! ;^)
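
The two cases outlined above can be condensed into a small decision function. This is a sketch of the flow as described in this thread, not the actual BOINC scheduler source. `scheduler_path` and its parameters are hypothetical names, and the branches that return `"new"` are assumptions, since the outline does not say what happens when the cpid or hardware match fails.

```python
def scheduler_path(hostid_found, seqno_ok, cpid_match, hw_match,
                   allow_multiple_clients=False):
    """Return (host_record, tasks_abandoned) for one request."""
    if not hostid_found:
        # First case: lookup_user_and_make_new_host, reached after the
        # hostid lookup fails. A cpid match reuses the record; nothing is abandoned.
        if cpid_match:
            return ("existing", False)
        return ("new", False)  # assumption: not covered by the outline

    # hostid lookup, user lookup and authentication have all succeeded here.
    if seqno_ok:
        return ("existing", False)

    # Second case: make_new_host, reached via the failed rpc_seqno check.
    if hw_match:
        # Host found by hostname/IP/processor/RAM: the record is reused AND
        # results are marked as over, unless allow_multiple_clients is set.
        return ("existing", not allow_multiple_clients)
    return ("new", False)  # assumption: not covered by the outline
```

Calling `scheduler_path(True, False, False, True)` returns `("existing", True)`: the request that passed every identity check is the one that loses its tasks, which is the oddity being pointed out.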
12) Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH? (Message 1696862)
Posted 7 days ago by Jeff Buck (Project donor)
True. What I'm trying to picture is any situation that, scaled across many occurrences, would indicate a source of compounding bloat.
I guess you'd have to have some way of finding out how often legitimate detach/reattach events get caught by this code trap. But if database bloat was a serious concern of the powers that be, I can think of several other ongoing problem areas that could probably cut into that load significantly, IF they were ever addressed, starting with all the runaway hosts that maintain a revolving stash of thousands of Invalid tasks.

To me, in that context, freeing up the tasks makes sense, but only really after the detach/reattach is certain.
Which it certainly doesn't seem like it's currently accomplishing. By the way, have you figured out why the first group of tests with both doctored hostid and rpc_seqno fields didn't trigger the abandonment, while my final test with the lower rpc_seqno but an untouched hostid field was successful? Was the hostid check executed first, and then the rpc_seqno check bypassed after the hostid was corrected?

Doing nothing is of course simpler to code and maintain :D
Now there's a worthy goal!!
13) Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH? (Message 1696833)
Posted 7 days ago by Jeff Buck (Project donor)
One possible fix comes to mind. On a lower sequence number, instead of immediately enacting the detach/reattach and abandon, it could set a flag for the host 'abandon_if_nextrequest_RPCseqno_follows_this_one' and do nothing apart from storing the old (greater) sequence number, and the new lower one (+1) as current.

Then on subsequent contact, if the sequence number follows the current one, but not the earlier greater one, the detach/reattach should be genuine, or at least have a much higher probability of being genuine. Do the detach if so, and in either case finally clear the stored flags before continuing with the full request.

Storing that extra little bit of data may not need to happen in the main hosts table, but a small lookup table or file called suspect_contacts_for_possible_reattach.
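
One possible reading of that deferred-abandon proposal, as a sketch only. `HostState`, `suspect_old_seqno`, and `on_request` are hypothetical names, and "current" is taken to mean the next sequence number the server expects. The real scheduler may track this differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostState:
    current_seqno: int                       # next sequence number the server expects
    suspect_old_seqno: Optional[int] = None  # greater seqno saved while a reattach is suspected


def on_request(host, seqno):
    """Update the host state for one request; return True if tasks should be abandoned."""
    if host.suspect_old_seqno is not None:
        old = host.suspect_old_seqno
        follows_low = seqno == host.current_seqno
        host.suspect_old_seqno = None  # clear the flag in either case
        if follows_low:
            # The host carried on from the lower number: treat the reattach as genuine.
            host.current_seqno = seqno + 1
            return True
        # The host carried on from the higher number: the low one was just late.
        host.current_seqno = max(old, seqno + 1)
        return False

    if seqno < host.current_seqno:
        # Lower number than expected: remember the old value, do nothing else yet.
        host.suspect_old_seqno = host.current_seqno
        host.current_seqno = seqno + 1
        return False

    host.current_seqno = seqno + 1
    return False
```

In the delayed-packet case (requests arriving as 11, 10, 12) nothing is abandoned. In the restored-backup case (server expecting 12, host sends 5 and then 6) the second low request triggers the abandonment.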

I'm curious as to what the worst consequence would be if an out-of-sequence request resulted in no scheduler action at all, other than perhaps a message to the requesting host to that effect, which would be posted in the event log.

To me it seems that if the situation was not an actual detach/reattach, but simply an in-transit delay like we've been discussing, there would be no consequences at all, since the later request (which was successfully processed earlier) would have already taken care of any completed task reporting and new task retrieval. Subsequent requests would just continue as normal after that one.

For a legitimate detach/reattach, wouldn't a non-action simply leave those "in progress" tasks on the server until they time out? The only thing the forced abandonment seems to accomplish is that those tasks get resent to new hosts more quickly.
14) Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH? (Message 1696799)
Posted 7 days ago by Jeff Buck (Project donor)
You appear to be suggesting None of the Request arrived at the server in the 6:49 between the first Request and the second Request. I find that hard to believe considering the previous requests were taking a second or two. Even if SETI doesn't Trust the same Time Stamps other institutions such as Banks and Finance Trust surely they can Trust when the First few packets arrived. Personally I find the suggestion that SETI can't Trust the same Time Stamps everyone else uses rather evasive and questionable. In any event, something should be changed in the ways SETI responds to something as simple as a delayed packet, if that is the case.

I'm only talking about a possible time stamp in the body of the actual request message that the scheduler receives, not header time stamps in individual packets (if, in fact, the scheduler request actually does get broken into separate packets). I'm hardly a communications expert, but I seriously doubt that the scheduler sees anything except the whole message body, once it's been completely received and, if necessary, reassembled from individual packets. It certainly can't do any processing on the message until it's got the whole thing.

In any event, using your example that began this thread, I really can't see where it would have made a difference whether it was using a time stamp or a request sequence number. Either way, the scheduler didn't receive the first request until after the second one had been successfully processed. Therefore, an out-of-sequence condition would have been raised no matter what. However, the assumption the scheduler apparently made, in deciding that there was some sort of detach/reattach scenario that required trashing all the tasks in progress, is what certainly seems to me to need fixing.
15) Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH? (Message 1696785)
Posted 7 days ago by Jeff Buck (Project donor)
Someone Please tell me SETI actually checks the Time Stamps on Requests before labeling them as out of sequence, Abandoning All your tasks, Not removing them from your Host, Not informing you of their actions, and leaving your Host to waste Time and Energy working Worthless tasks.

Why sure, TBar, and would you also like someone to tell you that the Easter Bunny and the Tooth Fairy are real? I could probably do that, all for the same price. And for just a nominal extra charge, I could throw in Santa Claus. ;^)

Seriously, though, I rather doubt that time stamps from local hosts would be a very reliable method of verification, even if the scheduler was storing them in the database. The time stamps would be outside of BOINC's control and subject to all sorts of adjustments that could occur for a variety of reasons on the local hosts. In addition to minor automatic syncs by the OS (or manual ones by the user), I can think of Daylight Saving Time changes (varying by locale and user option), traveling laptops whose owners like to have them on local time, dead or dying CMOS batteries, and probably many other manual manipulations for reasons that have nothing at all to do with trying to game BOINC. No, I think that anything BOINC would use for sequence checking would have to be something that was pretty much completely in BOINC's control. But, again, I don't think the issue is as much detecting the out-of-sequence condition as it is how the scheduler deals with it.
16) Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH? (Message 1696596)
Posted 8 days ago by Jeff Buck (Project donor)
Well, I received a resend on one of my hosts and it had been Abandoned. I checked the host and found him again, on Main this time. Same same, still getting those Abandoned whatevers;
28 Jun 2015, 15:07:31 UTC - Abandoned
Amazing.

Can someone check the Server Log on this Host, http://setiathome.berkeley.edu/show_host_detail.php?hostid=7206136
Back in business!

That's interesting. I have a database with almost all my tasks for the last 2+ years. I just checked it and found I've been paired with host 7206136 14 times in that span. Four of them were abandoned and one was a timeout, all of them in 2014: April 18, July 11 (the timeout), September 16, September 18, and October 5, so it appears to be a long-standing issue. No problems in the 7 WUs since then, although the last one was way back in February of this year. I checked his other host, too, 1850030, but only shared 3 WUs, all last year, and all were fine.
17) Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH? (Message 1696555)
Posted 8 days ago by Jeff Buck (Project donor)
Yes, there are the multiple conditions required there, which is when I got the image of space shuttle O-rings being connected to the same piece of metal.

The exact sequence can go to one of two places the abandonments occur, and either one or both of them could need attention. My current feeling is that the server shouldn't be doing anything to host records, or associated tasks, until authentication is completed successfully.

You won't get your front door open if you insert a half sucked lozenge into the lock before the key. This feels like a claymore connected to a lozenge detector.

[Edit:] OMG this code has goto statements in it; how quaint!

Nothing like spaghetti code to really make things interesting. Sort of like Alice sliding down the rabbit hole!
18) Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH? (Message 1696549)
Posted 8 days ago by Jeff Buck (Project donor)
Ah, good morning, Jason! Happy to provide some grist for your mill. ;^)

I think any of my tests last night where I was resetting the rpc_seqno to a lower number also always followed at least one manual update with the higher number. However, I don't recall doing any where I wasn't also tinkering with the hostid field, since that was the primary focus. Could it have been that the scheduler was dealing with the missing hostid first, and then ignoring the rpc_seqno once it finished correcting the hostid?
19) Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH? (Message 1696540)
Posted 8 days ago by Jeff Buck (Project donor)
The concept behind the coding of BOINC is that it should be fault-tolerant, but cheating-intolerant.

The problem here is that faults are being sent down the cheaters' pathway, which is far from ideal for anyone. The question is, what needs to change to route them down a fault-tolerant pathway?

My inclination would be for the scheduler to simply take no action at all on an out-of-sequence request, other than perhaps to send a response back to the requesting host that such a request was received. It would neither accept any reported completed tasks nor send out any new tasks when the request is out of sequence, and it certainly wouldn't abort everything in progress without alerting the host to that action.
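
A minimal sketch of that "take no action" policy, assuming the scheduler tracks the next sequence number it expects from each host. `handle_request` and the reply fields are hypothetical names for illustration, not part of BOINC.

```python
def handle_request(expected_seqno, rpc_seqno, completed):
    """Return (reply, next_expected_seqno) for one scheduler request.

    expected_seqno -- next sequence number the server expects from this host
    rpc_seqno      -- sequence number carried by the incoming request
    completed      -- names of the completed tasks the host is reporting
    """
    if rpc_seqno < expected_seqno:
        # Out of sequence: accept nothing, send nothing, abandon nothing,
        # and tell the host so the event shows up in its log.
        reply = {
            "processed": False,
            "accepted_reports": [],
            "abandoned": [],
            "message": "out-of-sequence request received; no action taken",
        }
        return reply, expected_seqno

    # In sequence: process normally (work assignment is omitted here).
    reply = {
        "processed": True,
        "accepted_reports": list(completed),
        "abandoned": [],
        "message": "",
    }
    return reply, rpc_seqno + 1
```

For a late request, `handle_request(12, 10, ["t1"])` leaves the expected number at 12 and reports nothing accepted, so tasks in progress are untouched and the host is told what happened.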
20) Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH? (Message 1696530)
Posted 8 days ago by Jeff Buck (Project donor)
The next question is why packets that have time stamps could be deemed out of sequence even if they arrive late. Simply checking the time stamp would identify the sequence. The Server does check time stamps, doesn't it?

To do that, it would have to store the time stamp from the previous message in the DB in order to have something to compare it with. I kind of doubt that it would do that since it thinks that the rpc_seqno addresses the issue.

In any event, I think that the underlying problem is not so much that the requests arrive out of sequence (regardless of the reason), it's that the scheduler applies such a drastic solution when it does happen. I would think that could be improved.



Copyright © 2015 University of California