Serious design flaw part of ul problem

Message boards : Number crunching : Serious design flaw part of ul problem
Mibe, ZX-81 16kb
Volunteer tester

Joined: 30 Jun 99
Posts: 42
Credit: 2,622,033
RAC: 0
Sweden
Message 139503 - Posted: 20 Jul 2005, 8:37:14 UTC

The server marks scientifically valid WUs as errors only because some other users couldn't download them.

Unnecessary strain is put on the network and servers when replacement WUs are generated even though perfectly sound results have already been returned by one or two clients.

My suggestion is to not let download errors invalidate an otherwise correct WU.

See http://setiweb.ssl.berkeley.edu/workunit.php?wuid=20702791

and http://setiweb.ssl.berkeley.edu/forum_thread.php?id=16563

Harry.nl
Joined: 21 Apr 03
Posts: 53
Credit: 67,821
RAC: 0
Netherlands
Message 139505 - Posted: 20 Jul 2005, 8:49:01 UTC - in response to Message 139503.  

See also this thread: http://setiathome.berkeley.edu/forum_thread.php?id=17234

A good idea, because this is costing a lot of users a lot of precious processing time. Another solution is to stop sending work to older client versions; 4.13 especially is a problem.

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 139664 - Posted: 20 Jul 2005, 17:10:38 UTC - in response to Message 139505.  

There is a reason....

It is theoretically possible for a work unit to be impossible to download.

Over time, you can reach a state where most of the work units in the system are these "broken" work units. That impacts database performance, bandwidth, etc.

The alternative is to flag these work units, take 'em out of the queue, and at some point the operations people and/or developers can try to figure out why, or re-release the work.

... and since we're really supposed to be running on CPU cycles that would otherwise be wasted, it's not a big deal. For those who are actually making CPU cycles just for SETI, you still want the "bad stuff" out of the way so you can crunch good stuff.

spacemeat
Joined: 4 Oct 99
Posts: 239
Credit: 8,425,288
RAC: 0
United States
Message 139733 - Posted: 20 Jul 2005, 18:49:22 UTC

I feel bad for #3 here:
http://setiathome.berkeley.edu/workunit.php?wuid=20706449

He spent 12 hours crunching a WU just to be granted 0 credit because of all the other client errors.

Mibe, ZX-81 16kb
Volunteer tester
Joined: 30 Jun 99
Posts: 42
Credit: 2,622,033
RAC: 0
Sweden
Message 139754 - Posted: 20 Jul 2005, 19:31:01 UTC - in response to Message 139664.  


[quote]It is theoretically possible for a work unit to be impossible to download.[/quote]

Since a lot of people here demand that when we point out things that are wrong we also propose a solution, I would like you to give an example of an impossible download.

Jim Baize
Volunteer tester
Joined: 6 May 00
Posts: 758
Credit: 149,536
RAC: 0
United States
Message 139841 - Posted: 20 Jul 2005, 22:09:24 UTC - in response to Message 139754.  

He cited an example of a known problem that may or may not happen (but has happened in the past) and cited the known solution to that problem.

He is not picking up a new problem and complaining about it without offering a solution.

Jim


David C Thompson
Volunteer tester
Joined: 1 Jun 05
Posts: 27
Credit: 90,446
RAC: 0
United States
Message 139846 - Posted: 20 Jul 2005, 22:23:37 UTC - in response to Message 139841.  

It sounds like this is a legit structural issue (not a design flaw) with the way WUs are read off tape, split, and distributed. Short of a major redesign, the existence of WUs that can't be downloaded seems to be a permanent feature of SETI@home. Killing off WUs that seem to jam might not be a bad idea. If it's possible to eliminate 4.13 users, then that might help a bit.


David Thompson, Intellectual Property Law (http://www.davidcthompson.com)

Stanford Law and Policy Society (http://www.stanford.edu/group/slaps/)

Ingleside
Volunteer developer
Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 139857 - Posted: 20 Jul 2005, 22:38:01 UTC - in response to Message 139754.  


Example: a fix-script was re-run at the start of April on SETI@Home/BOINC to fix any stuck WU with 3 or more "success" results. All of these were run through validation and failed with "no consensus yet", so more results were generated, ready to be sent out again. In the meantime, AFAIK all the WU files had been deleted from disk, and therefore all of these gave download errors.

Most download errors from v4.13 are "WU download error: couldn't get input files: 16ja05ab.28093.25954.423578.148: MD5 computation error". This can look like a failure due to an incorrect MD5, indicating a possibly corrupt file on disk.

Anyone remembering back to BOINC beta: for one of the releases a file got corrupted when copied from alpha to beta, and once downloaded it crashed any WU if you tried to display the screensaver. In that case the MD5 had been updated so that it corresponded to the corrupt file, but it shows that files can be corrupted on disk and therefore be impossible to download.
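
To make the MD5 point concrete, here is a minimal Python sketch. It assumes the client compares the digest of the downloaded bytes against a digest published by the server; the function names and sample payloads are made up for illustration and are not BOINC's actual code.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a lowercase hex string."""
    return hashlib.md5(data).hexdigest()


def verify_download(data: bytes, expected_md5: str) -> bool:
    """Hypothetical client-side check: accept the file only if digests match."""
    return md5_hex(data) == expected_md5.strip().lower()


original = b"workunit payload"      # illustrative file contents
published_md5 = md5_hex(original)   # digest recorded before any corruption
corrupted = b"workunit pay1oad"     # the same file after corruption on disk

# An intact copy passes; a copy corrupted on the server fails for every
# client, no matter how many times it is re-sent.
print(verify_download(original, published_md5))
print(verify_download(corrupted, published_md5))

# If the digest is recomputed from the corrupted file, the check passes
# even though the content is bad.
print(verify_download(corrupted, md5_hex(corrupted)))
```

The middle case is the one that matters for this thread: when the stored file no longer matches its published digest, retrying the download on another host cannot help.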

Not to forget, when the BOINC client started enforcing signatures on all application files, there was a bug in the application upgrader, so not all files were signed anyway...


Back in SETI@Home "classic", just like under BOINC, there were some WUs with too much RFI that normally terminate after 1 minute or so. For a very small group of these "turbo-WUs", the result file grew past 32 KB before the client terminated. Under v3.03, any result file bigger than 32 KB was never successfully uploaded, and the server program didn't clean out any WU before it had got N results back. This meant a small group of WUs was re-issued again and again, and over time more and more of the available WUs were ones that could never be finished.
Well, eventually this was detected, and a new SETI application was released. That many continued to use v3.03 because it was faster is another matter...
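
The way unfinishable WUs crowd out good ones can be shown with a toy model. This is only a sketch under loud assumptions: the pool has a fixed size, every finishable WU completes each round and is replaced by a fresh one, and exactly every `bad_every`-th fresh WU can never finish. None of these numbers come from the real SETI@home servers.

```python
from typing import List


def simulate_pool(pool_size: int, bad_every: int, rounds: int) -> List[int]:
    """Return the number of stuck work units in the pool after each round."""
    stuck = 0     # WUs that can never be completed and are never cleaned out
    issued = 0    # total fresh WUs generated so far
    history = []
    for _ in range(rounds):
        # Finishable WUs complete and free their slots; stuck ones stay.
        free_slots = pool_size - stuck
        for _ in range(free_slots):
            issued += 1
            if issued % bad_every == 0:   # assumed: every bad_every-th WU is unfinishable
                stuck += 1
        history.append(stuck)
    return history


if __name__ == "__main__":
    for round_no, stuck in enumerate(simulate_pool(10, 5, 20), start=1):
        print(round_no, stuck)
```

Because nothing ever removes a stuck WU in this model, the count can only go up, and eventually the whole pool consists of WUs that can never be finished.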


So, to guard against something similar happening, there are various error limits under BOINC deciding when to give up. Since "incorrect MD5" can be due to a file corrupted on disk, not counting these errors can lead to a WU being re-issued again and again, indefinitely...
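
A rough Python sketch of such an error limit, and of what happens if one class of error is exempted from it. The `WorkUnit` class, the `max_errors` field and the `counts_toward_limit` flag are hypothetical names chosen for illustration; they are not the real BOINC server schema.

```python
from dataclasses import dataclass


@dataclass
class WorkUnit:
    name: str
    max_errors: int          # hypothetical per-WU error limit
    errors: int = 0
    abandoned: bool = False


def report_error(wu: WorkUnit, counts_toward_limit: bool = True) -> bool:
    """Record a failed result; return True if the WU should be re-issued."""
    if wu.abandoned:
        return False
    if counts_toward_limit:
        wu.errors += 1
    if wu.errors >= wu.max_errors:
        wu.abandoned = True   # give up: stop generating replacement results
        return False
    return True


if __name__ == "__main__":
    counted = WorkUnit("corrupt-on-disk", max_errors=3)
    print([report_error(counted) for _ in range(4)])

    exempt = WorkUnit("corrupt-on-disk", max_errors=3)
    print(all(report_error(exempt, counts_toward_limit=False) for _ in range(1000)))
```

With the error counted, the corrupt WU is abandoned once the limit is reached. With the error exempted, the same WU is re-issued on every failure and is never abandoned.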


Anyway, since AFAIK most download errors are due to a bug in v4.13 and earlier clients, if v4.19 had been enforced my guess is this discussion wouldn't really be here at all. :)

Depending on the release schedule of v5, there's maybe no point in enforcing v4.19 now...

TheSleuth
Joined: 9 Aug 01
Posts: 19
Credit: 171,559
RAC: 0
United States
Message 139868 - Posted: 20 Jul 2005, 23:00:54 UTC

The ZERO CREDIT happened to me at least once on each of my computers, and with the large number of "pending" results I expect to see it some more. It's one of those facts of life that you accept, or you risk going over the edge because things don't go your way. I'm just happy to see things back to normal, working again as designed.

Mibe, ZX-81 16kb
Volunteer tester
Joined: 30 Jun 99
Posts: 42
Credit: 2,622,033
RAC: 0
Sweden
Message 139878 - Posted: 20 Jul 2005, 23:27:06 UTC
Last modified: 20 Jul 2005, 23:36:09 UTC

Sure, I have no problem handling things that don't go my way, unless I can avoid them.

This download problem can be separated from the other errors listed in this thread (MD5 etc.). Once identified, it should not be treated as a WU error.
That would ease the burden on the servers and decrease the network load.

An added bonus would be that faulty versions like 4.13 wouldn't nullify others' credit or the crunched science.

And believe me, if a version has been released with a bug like the one in 4.13, it'll happen again. And if the proposed fix to the download error is in place, the negative impact of such a bug will be much smaller.

[Edited]: 4.19 -> 4.13

Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 139879 - Posted: 20 Jul 2005, 23:29:49 UTC
Last modified: 20 Jul 2005, 23:39:16 UTC

deleted.

Mibe, ZX-81 16kb
Volunteer tester
Joined: 30 Jun 99
Posts: 42
Credit: 2,622,033
RAC: 0
Sweden
Message 139882 - Posted: 20 Jul 2005, 23:36:48 UTC - in response to Message 139879.  

There's just enough time to edit that post, going from the error in 4.19 to 4.13.


Thanks!

Don Erway
Volunteer tester
Joined: 18 May 99
Posts: 305
Credit: 471,946
RAC: 0
United States
Message 139897 - Posted: 21 Jul 2005, 0:42:30 UTC
Last modified: 21 Jul 2005, 0:44:48 UTC

I think the "WUs just won't download" argument is a red herring.

What is going on is that the WU has not only been downloaded, but completed by 1 or often 2 computers. The rest "failed download", likely due to old BOINC software.

It is nonsensical to throw away this perfectly good expenditure of CPU effort.

Just reschedule and send the WU back out, to a smaller number of hosts, when this happens.

Or, for that matter, do it immediately, when the download failure is detected, as the WU is initially sent out. Why not have a list of hosts and send it to 4 (or whatever you consider the minimum)? If they all succeed in downloading, you are done; if not, keep sending it out to more hosts until you have successful downloads of the WU to your required minimum. Limit it to 10 or 15 host download attempts, or something, and you have your "bad WU" detector built in... This is the way to go.

Why let 1 or 2 computers crank away on a WU that you know, a priori, is going to be useless? Just ensure it gets sent, *successfully*, to enough hosts from the start.

No?

Don

Jim Baize
Volunteer tester
Joined: 6 May 00
Posts: 758
Credit: 149,536
RAC: 0
United States
Message 139980 - Posted: 21 Jul 2005, 4:36:02 UTC - in response to Message 139897.  

There are a couple of good explanations in the forums as to why the WUs error out like they do. I don't remember exactly where they are so I cannot give you a link, but if you do some searching you will find them.

Jim

Mibe, ZX-81 16kb
Volunteer tester
Joined: 30 Jun 99
Posts: 42
Credit: 2,622,033
RAC: 0
Sweden
Message 140048 - Posted: 21 Jul 2005, 7:53:12 UTC - in response to Message 139897.  

[quote]Why let 1 or 2 computers crank away on a WU that you know, a priori, is going to be useless? Just ensure it gets sent, *successfully*, to enough hosts from the start.[/quote]

Yes, this is an urgent bug to look into. The worst thing about it is that it's not only 1 or 2 users who can get screwed; sometimes even 3 clients return their results within the time limit, but the WU is dismissed because of others' download errors.

See http://setiweb.ssl.berkeley.edu/workunit.php?wuid=20882964

It's not acceptable to let blunders like this persist. It's OK if there are bugs with higher priority to take care of, but not recognizing the problem is a shame.

ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20283
Credit: 7,508,002
RAC: 20
United Kingdom
Message 140071 - Posted: 21 Jul 2005, 9:43:50 UTC - in response to Message 139857.  

[quote]
It is theoretically possible for a work unit to be impossible to download.
...[/quote]

Example... a fix-script was re-run start April on SETI@Home/BOINC to fix any stuck wu...

Depending on release-schedule of v5, there's maybe no point to enforce v4.19 now...

Thanks for the detail and history. Interesting.

Regards,
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)

Mibe, ZX-81 16kb
Volunteer tester
Joined: 30 Jun 99
Posts: 42
Credit: 2,622,033
RAC: 0
Sweden
Message 140293 - Posted: 21 Jul 2005, 18:33:47 UTC - in response to Message 139980.  


Fine, but the problem is that client download errors are lumped together with the ones you mention above, and that is the design flaw. The Internet, by its very nature, can make a download fail, but in the long run it will succeed.

So to not recognize this (perhaps a marginal difference to some people) is to create totally unnecessary extra work, with server and network resources already stretched thin, and to piss some people off by cheating them of credits.

Which is fine if you want to make many blunders with one fault, but not all that bright if you aim for a broad user group of volunteers with different personal goals for the CPU cycles they offer.

Jim Baize
Volunteer tester
Joined: 6 May 00
Posts: 758
Credit: 149,536
RAC: 0
United States
Message 140357 - Posted: 21 Jul 2005, 19:53:27 UTC - in response to Message 140293.  

The current system takes up to 9 failures before it tags the WU as invalid. Perhaps it could use some tweaking. This last outage may have shown them that it could use tweaking. I'm not sure that it is a "serious design flaw" as the title of this thread suggests, however.

Jim

Don Erway
Volunteer tester
Joined: 18 May 99
Posts: 305
Credit: 471,946
RAC: 0
United States
Message 140412 - Posted: 21 Jul 2005, 21:05:01 UTC - in response to Message 140357.  

[quote]The current system takes up to 9 failures before it tags the WU as invalid. Perhaps it could use some tweaking. This last outage may have shown them that it could use tweaking. I'm not sure that it is a "serious design flaw" as the title of this thread suggests, however.

Jim

[/quote]

With so many hosts getting "client error downloading" almost every single time, this is a serious design flaw.

This may be phone-line related, or it may be BOINC-version related.

But as long as it is happening with such frequency, it is a real loss of valuable calculations. Results crunched by 2 hosts are thrown away even though the system already knows they will be useless, and then, presumably, the entire WU needs to be recalculated at some point, wasting the calculations that already succeeded and denying the users credit.

It just doesn't seem that hard: as long as you have fewer than 4 successful downloads, keep trying new hosts. If the first 4 all fail to download, abort the WU. If it gets downloaded by at least 1 host, then keep trying until you get 4, or reach 20 attempts.

Treating "client error downloading" as a result is the flaw.
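
The rule proposed in this post can be sketched in a few lines of Python. The thresholds 4 and 20 come from the post; everything else, including the `dispatch` function, the `try_download` callback and the status strings, is a hypothetical illustration and not how the SETI@home scheduler is actually written.

```python
from typing import Callable, Iterable, List, Tuple


def dispatch(
    hosts: Iterable[str],
    try_download: Callable[[str], bool],
    needed: int = 4,
    max_attempts: int = 20,
) -> Tuple[str, List[str]]:
    """Offer one WU to hosts until `needed` of them have downloaded it."""
    succeeded: List[str] = []
    attempts = 0
    for host in hosts:
        if len(succeeded) >= needed or attempts >= max_attempts:
            break
        attempts += 1
        if try_download(host):
            succeeded.append(host)
        elif attempts == needed and not succeeded:
            # The first batch failed without a single success: treat the WU as bad.
            return "aborted", succeeded
    if len(succeeded) >= needed:
        return "ready", succeeded
    return "gave_up", succeeded   # attempt cap reached or no hosts left


if __name__ == "__main__":
    hosts = [f"h{i}" for i in range(30)]
    # Pretend that only even-numbered hosts manage to download the file.
    print(dispatch(hosts, lambda h: int(h[1:]) % 2 == 0))
```

In this sketch a failed download costs only one more attempt and is never stored as a result for the WU, which is the distinction the post is arguing for.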




Jim Baize
Volunteer tester
Joined: 6 May 00
Posts: 758
Credit: 149,536
RAC: 0
United States
Message 140507 - Posted: 22 Jul 2005, 0:05:25 UTC - in response to Message 140412.  

I wish I had some hard statistics to back up the following statements, but I don't. However, most of the older clients usually do download the WUs successfully. It was only during this bottleneck period that the problem surfaced.

Your suggested solution is basically SETI's current solution. You've just set the hard cap at 20 instead of 9.

The error handling is much more robust in the newer clients. Once all of the clients are upgraded this will be a moot point.

Jim




 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.