Serious design flaw part of ul problem
Author | Message |
---|---|
Mibe, ZX-81 16kb Joined: 30 Jun 99 Posts: 42 Credit: 2,622,033 RAC: 0 |
Since the server marks scientifically valid WUs as errors only because some user couldn't dl them, unnecessary strain is put on the network and servers: replacement WUs are generated even though perfectly sound WUs have already been returned by one or two clients. My suggestion is to not let dl errors invalidate an otherwise correct WU. See http://setiweb.ssl.berkeley.edu/workunit.php?wuid=20702791 and http://setiweb.ssl.berkeley.edu/forum_thread.php?id=16563 |
Harry.nl Joined: 21 Apr 03 Posts: 53 Credit: 67,821 RAC: 0 |
[quote]Since the server marks scientifically valid wu's as error, only because some user couldn't dl it.[/quote] See also this thread: http://setiathome.berkeley.edu/forum_thread.php?id=17234 A good idea, because this is costing a lot of users a lot of precious processing time. Another solution is to stop sending work to older client versions. Version 4.13 especially is a problem... |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
[quote]Since the server marks scientifically valid wu's as error, only because some user couldn't dl it.[/quote] There is a reason... It is theoretically possible for a work unit to be impossible to download. Over time, you can reach a state where most of the work units in the system are these "broken" work units. That impacts database performance, bandwidth, etc. The alternative is to flag these work units, take 'em out of the queue, and at some point the operations people and/or developers can try to figure out why, or re-release the work. ... and since we're really supposed to be running on CPU cycles that would otherwise be wasted, it's not a big deal. For those who are actually making CPU cycles just for SETI, you still want the "bad stuff" out of the way so you can crunch good stuff. |
spacemeat Joined: 4 Oct 99 Posts: 239 Credit: 8,425,288 RAC: 0 |
I feel bad for #3 here: http://setiathome.berkeley.edu/workunit.php?wuid=20706449 He spent 12 hours crunching a WU just to be granted 0 credit because of all the other client errors. |
Mibe, ZX-81 16kb Joined: 30 Jun 99 Posts: 42 Credit: 2,622,033 RAC: 0 |
Since a lot of people here demand that when we point out things that are wrong we also propose a solution, I would like you to give an example of an impossible dl. |
Jim Baize Joined: 6 May 00 Posts: 758 Credit: 149,536 RAC: 0 |
He cited an example of a known problem that may or may not happen (but has happened in the past) and cited the known solution regarding a problem. He is not picking up a new problem and complaining about it without offering a solution. Jim |
David C Thompson Joined: 1 Jun 05 Posts: 27 Credit: 90,446 RAC: 0 |
It sounds like this is a legit structural issue (not a design flaw) with the way WUs are read off tape, split, and distributed. Short of a major re-design, the existence of WUs that can't be downloaded seems to be a permanent feature of SETI@home. Killing off WUs that seem to jam might not be a bad idea. If it's possible to eliminate 4.13 users then that might help a bit. [quote]He cited an example of a known problem that may or may not happen (but has happened in the past) and cited the known solution regarding a problem.[/quote] David Thompson (http://www.davidcthompson.com), Intellectual Property Law, Stanford Law and Policy Society (http://www.stanford.edu/group/slaps/) |
Ingleside Joined: 4 Feb 03 Posts: 1546 Credit: 15,832,022 RAC: 13 |
Example... At the start of April a fix-script was re-run on SETI@Home/BOINC to fix any stuck WU with 3 or more "success" results. All of these went through validation again and failed with "no consensus yet", so more results were generated, ready to be sent out again. In the meantime, AFAIK, all the WU files had been deleted from disk, and therefore all of these gave download errors. Most download errors from v4.13 are "WU download error: couldn't get input files: 16ja05ab.28093.25954.423578.148: MD5 computation error", which can look like a failure due to an incorrect MD5, possibly indicating a corrupt file on disk. For anyone who remembers back to the BOINC beta: for one of the releases a file got corrupted when copied from alpha to beta, and once downloaded it crashed any WU if you tried to display the screensaver... In that case the MD5 had been updated to correspond to the corrupt file, but it shows that files can be corrupted on disk and therefore be impossible to download. Not to forget, when the BOINC client started enforcing signatures on all application files, there was a bug in the application upgrader, so not all files were signed anyway... Back in SETI@Home "classic", just like under BOINC, there were some WUs with too much RFI that normally terminated after 1 minute or so. For a very small group of these "turbo-WUs", the result file grew past 32 KB before the client terminated. Under v3.03, any result file bigger than 32 KB was never successfully uploaded, and the server program didn't clean out any WU before it got N results back. This meant a small group of WUs was re-issued again and again, and over time more and more of the available WUs were ones that could never be finished. Well, eventually this was detected, and a new SETI application was released. That many continued to use v3.03 because it was faster is another matter... So, to guard against something similar happening, there are various error limits under BOINC that decide when to give up.
Since "incorrect MD5" can be due to a file corrupted on disk, not counting these errors could lead to a WU being re-issued again and again, indefinitely... Anyway, since AFAIK most download errors are due to a bug in v4.13 and earlier clients, my guess is that if v4.19 had been enforced this discussion wouldn't really be here at all. :) Depending on the release schedule of v5, there's maybe no point in enforcing v4.19 now... |
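To make the MD5 point above concrete, here is a minimal Python sketch of the kind of check a client could run on a downloaded input file. It is not BOINC's actual code: the function names and the sample payload are made up for illustration. It shows why retrying cannot help when the copy on the server's disk is itself corrupt, since every download fails the same check.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a lowercase hex string."""
    return hashlib.md5(data).hexdigest()


def download_is_valid(data: bytes, expected_md5: str) -> bool:
    """Compare the digest of the downloaded bytes with the expected digest."""
    return md5_hex(data) == expected_md5.lower()


# Hypothetical payload standing in for a work unit input file.
original = b"example work unit payload"
expected = md5_hex(original)  # digest recorded when the file was created

# Simulate a file that was corrupted on the server's disk afterwards.
corrupted = original[:-1] + b"X"

# Every client that downloads the corrupted copy sees the same failure,
# no matter how many times the work unit is re-issued.
results = [download_is_valid(corrupted, expected) for _ in range(3)]
```

In this sketch a failed check says nothing about whether the fault lies with the client or the server, which is the ambiguity the error limits are meant to guard against.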
TheSleuth Joined: 9 Aug 01 Posts: 19 Credit: 171,559 RAC: 0 |
The ZERO CREDIT happened to me at least once on each of my computers, and with the large number of "pending" results I expect to see it some more. It's one of those facts of life that you accept, or risk going over the edge because things don't go your way. I'm just happy to see things back to normal, working again as designed. |
Mibe, ZX-81 16kb Joined: 30 Jun 99 Posts: 42 Credit: 2,622,033 RAC: 0 |
Sure, I have no problem handling things that don't go my way, unless I can avoid them. This dl problem can be separated from the rest of the errors listed below (MD5 etc). Once identified, it should not be treated as a WU error. That would ease the burden on the servers and decrease the network load. An added bonus would be that faulty versions like 4.13 wouldn't nullify others' credit or the crunched science. And believe me, if a version has been released with a bug like the one in 4.13, it'll happen again. And if the proposed fix to the dl error is in place, the negative impact of such a bug will be much less. [EDITED]: 4.19 -> 4.13 |
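The proposal above, to count download errors separately from other errors, can be sketched as follows. This is a hypothetical model in Python, not the server's real logic: the outcome labels, the quorum of 3, and both caps are assumptions chosen only to illustrate the idea. A separate cap on download errors is kept so that a file that truly cannot be downloaded is still retired eventually.

```python
from collections import Counter

# Hypothetical outcome labels; the real server uses its own codes.
SUCCESS = "success"
COMPUTE_ERROR = "compute_error"
DOWNLOAD_ERROR = "download_error"


def workunit_state(outcomes, quorum=3, max_compute_errors=5,
                   max_download_errors=15):
    """Decide what to do with a work unit given the outcomes so far.

    Download errors never block validation once enough successful
    results exist, and they are capped separately from compute errors.
    All thresholds are illustrative assumptions.
    """
    counts = Counter(outcomes)
    if counts[SUCCESS] >= quorum:
        return "validate"
    if counts[COMPUTE_ERROR] >= max_compute_errors:
        return "error"
    if counts[DOWNLOAD_ERROR] >= max_download_errors:
        return "error"
    return "send_more"


# Three good results plus many failed downloads: the science is kept.
state = workunit_state([SUCCESS] * 3 + [DOWNLOAD_ERROR] * 9)
```

With these assumed numbers, a work unit nobody can download is still flagged after 15 failed downloads, so the "re-issued forever" case is bounded.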
Jord Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
deleted. |
Mibe, ZX-81 16kb Joined: 30 Jun 99 Posts: 42 Credit: 2,622,033 RAC: 0 |
There's just enough time to edit that post, going from the error in 4.19 to 4.13. Thanks! |
Don Erway Joined: 18 May 99 Posts: 305 Credit: 471,946 RAC: 0 |
I think the "WU's just won't download" is a red herring. What is going on is that the WU has not only been downloaded, but completed by 1 or often 2 computers. The rest "failed download", likely due to old BOINC software. It is nonsensical to throw away this perfectly good expenditure of CPU effort. Just cause a resched, and send the WU back out to a smaller number of hosts when this happens. Or, for that matter, do it immediately, when the download failure is detected, as the WU is initially sent out. Why not have a list of hosts and just send it to 4 (or whatever you consider the minimum)? If they all succeed in downloading, fine; if not, keep sending it out to more hosts until you have successful downloads of the WU to your required minimum. Limit it to 10 or 15 host dl attempts, or something, and you have your "bad WU" detector built in... This is the way to go. Why let 1 or 2 computers crank away on a WU that you know, a priori, is going to be useless? Just ensure it gets sent, *successfully*, to enough hosts from the start. No? Don |
Jim Baize Joined: 6 May 00 Posts: 758 Credit: 149,536 RAC: 0 |
There are a couple of good explanations in the forums as to why the WUs error out like they do. I don't remember exactly where they are so I cannot give you a link, but if you do some searching they are there. Jim [quote]I think the "WU's just won't download" is a red herring.[/quote] |
Mibe, ZX-81 16kb Joined: 30 Jun 99 Posts: 42 Credit: 2,622,033 RAC: 0 |
[quote]Why let 1 or 2 computers crank away, on a WU that you know, a priori, is going to be useless? Just ensure it gets sent, *successfully*, to enough hosts from the start.[/quote] Yes, this is an urgent bug to look into. The worst thing about it is that not only 1 or 2 users can get screwed; sometimes even 3 clients return their results within the time limit but the WU is dismissed because of others' dl errors. See http://setiweb.ssl.berkeley.edu/workunit.php?wuid=20882964 It's not serious to let blunders like this persist. OK if there are bugs with higher priority to take care of, but not recognizing the problem is a shame. |
ML1 Joined: 25 Nov 01 Posts: 20283 Credit: 7,508,002 RAC: 20 |
[quote]...[/quote] Thanks for the detail and history. Interesting. Regards, Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
Mibe, ZX-81 16kb Joined: 30 Jun 99 Posts: 42 Credit: 2,622,033 RAC: 0 |
[quote]There are a couple of good explanations in the forums as to why the WUs error out like they do. I don't remember exactly where they are so I cannot give you a link, but if you do some searching they are there.[/quote] Fine, but the problem is that client dl errors are mixed with those you mention above. And that is the design flaw. The Internet, by its very nature, can make a dl fail, but in the long run it will succeed. So to not recognize this (perhaps for some people a marginal difference) is to make extra work (totally unnecessary) with server and network resources already stretched thin, and to piss some people off by cheating them of credits. Which is fine if you want to make many blunders with one fault, but not all too bright if you aim for a broad user group of volunteers with different personal goals for the offered CPU cycles. |
Jim Baize Joined: 6 May 00 Posts: 758 Credit: 149,536 RAC: 0 |
The current system takes up to 9 failures before it tags the WU as invalid. Perhaps it could use some tweaking. This last outage may have shown them that it could use tweaking. I'm not sure that it is a "serious design flaw" as the title of this thread suggests, however. Jim [quote]There are a couple of good explanations in the forums as to why the WUs error out like they do. I don't remember exactly where they are so I cannot give you a link, but if you do some searching they are there.[/quote] |
Don Erway Joined: 18 May 99 Posts: 305 Credit: 471,946 RAC: 0 |
[quote]The current system takes up to 9 failures before it tags the WU as invalid. Perhaps it could use some tweaking. This last outage may have shown them that it could use tweaking. I'm not sure that it is a "serious design flaw" as the title of this thread suggests, however. Jim[/quote] In the presence of so many hosts that get "client error downloading" almost every single time, this is a serious design flaw. This may be phone-line related; it may be BOINC-version related. But as long as it is happening with such frequency, it is a real loss of valuable calculations: results are being crunched by 2 hosts and then thrown away, when the system already knows they are useless, and then, presumably, the entire WU needs to be recalculated at some point, wasting the calcs that already succeeded and denying the users credit. It just doesn't seem that hard: As long as you have fewer than 4 successful downloads, keep trying new hosts. If the first 4 all fail download, abort the WU. If it gets downloaded by at least 1 host, then keep trying until you get 4, or reach 20 attempts. Treating "client error downloading" as a result is the flaw. |
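The rule described in the post above can be written down as a small decision function. This is only a sketch of that rule in Python under one reading of it: the function name, the returned labels, and the choice to treat "reach 20 attempts" as giving up are assumptions, and nothing here reflects how the real scheduler is implemented.

```python
def next_action(successful_downloads, attempts,
                needed=4, first_batch=4, max_attempts=20):
    """Apply the proposed dispatch rule to one work unit.

    successful_downloads: hosts that downloaded the WU without error
    attempts: total hosts the WU has been sent to so far
    Returns "done", "abort", "give_up", or "send" (try another host).
    """
    if successful_downloads >= needed:
        return "done"      # enough hosts have the WU
    if attempts >= first_batch and successful_downloads == 0:
        return "abort"     # the first batch all failed: likely a bad WU
    if attempts >= max_attempts:
        return "give_up"   # hard cap on download attempts reached
    return "send"          # keep trying new hosts


# Example: 5 hosts tried so far, only 1 downloaded successfully.
action = next_action(successful_downloads=1, attempts=5)
```

Written this way, the difference from a flat error limit is small: the cap moves from 9 errors to 20 attempts, plus an early abort when nobody at all can download the file.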
Jim Baize Joined: 6 May 00 Posts: 758 Credit: 149,536 RAC: 0 |
I wish I had some hard statistics on the following statements, but I don't. However, most of the older clients usually do successfully download the WUs. It was only during this bottleneck that this problem surfaced. Your suggested solution is basically SETI's current solution. You've just made the hard cap 20 instead of 9. The error handling is much more robust in the newer clients. Once all of the clients are upgraded this will be a moot point. Jim |