Testing resend old WU feature

Message boards : Technical News : Testing resend old WU feature
Profile Henk Haneveld
Volunteer tester

Send message
Joined: 16 May 99
Posts: 154
Credit: 1,577,293
RAC: 1
Netherlands
Message 727692 - Posted: 18 Mar 2008, 21:45:11 UTC - in response to Message 727680.  

It looks to me like the check for lost results is made every time a host connects; this is not needed.

A check per host once every couple of days is enough to resend the lost results before even the shortest-deadline results expire.


A common situation that would have use for this feature is a failure of a hard drive, or some other disaster that caused the BOINC folder to be wiped. In that type of situation, if the host contacts the scheduler and the resend is enabled, but the client doesn't initiate the check, the host will get a large batch of work, then potentially a few hours later be socked with another large batch. Somewhere along the line it would have to be determined whether or not the results could be completed in time. If they couldn't be completed in time, would they then sit on the server until such time as that specific host proves it has the time to complete them, or would they be voided and redistributed to another host?



I don't think this matters. If the resend feature is off, the results have to wait until they time out before they are sent to another host.

If the resend feature is turned on but a host is only checked every couple of days, then it is possible that a host gets too much work and is not able to return it on time, but this will likely only happen if the results are all short units and the host has a large cache. More likely, the host will go into EDF for a while. Worst case, the results are not returned on time and time out at the same moment as in option 1.
ID: 727692
Brian Silvers

Send message
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 727701 - Posted: 18 Mar 2008, 22:14:39 UTC - in response to Message 727692.  
Last modified: 18 Mar 2008, 22:17:04 UTC


More likely, the host will go into EDF for a while. Worst case, the results are not returned on time and time out at the same moment as in option 1.


Ah, but many people hate EDF / High Priority with a passion that is matched only by the need to breathe... What will happen, if coordination between the host and the server works as designed, is that the server will reject the attempt at downloading work if it determines that the work cannot be completed in time. If the resend mechanism works outside of this normal checking, then the host will be overloaded and will potentially endure two separate periods of EDF, i.e. one for the original resend batch and then another for the new work. During this time, unless tinkered with, BOINC will refuse to get any work from other projects because it is in EDF workaholic mode...

IOW, I'm trying to say that this idea of doing the check "only every so many connections", while a nice thought, has some design hurdles to be overcome... The thought I'd have is that if a request for attachment comes in, the attachment request immediately initiates the check on that same connection, before new work is sent...
ID: 727701
Profile KWSN THE Holy Hand Grenade!
Volunteer tester

Send message
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 727765 - Posted: 19 Mar 2008, 1:51:12 UTC - in response to Message 727658.  
Last modified: 19 Mar 2008, 2:07:21 UTC

Might I point out that there are probably a lot of "lost WUs" out there at the current time (even I have one), and that (therefore) the initial response to re-activating this feature will probably be intense? I remind you that this feature has been off for at least two months... This may mean an initial heavy workload on the servers - but if you wait for the ~300k active SETI users to all get whatever lost WUs are still available, then I think you'll find that this workload will taper off to a much lower figure.


I'm not certain I can agree with that. Ever since the feature was made available SETI had problems with their server load. Since disabling it, a lot of those issues went away.

I could be wrong, so they can always try your suggestion to verify, but I think the outcome will be constant increased server load just from merely having the option enabled.


I agree with you OzzFan. The increased server load is not from the actual re-transmission of lost WUs. Rather, it comes from the processing involved in looking up which WUs the host is supposed to have and comparing that to the WUs the host actually has on hand. Perhaps if there were an index or a separate DB record containing this info (assuming it's not there already), the load might decrease.

Perhaps a better way to handle this process would be to let the BOINC client do most of the actual work: when the client phones home, the project would respond with a list of WUs it thinks the host should have; the client could compare this list with the WUs it actually has and then request download of any missing WUs.
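The client-side reconciliation suggested above could be sketched roughly as follows; the function name, result names, and list format are hypothetical illustrations, not actual BOINC code:

```python
# Hypothetical sketch of the client-side reconciliation idea: the server
# replies with the list of result names it believes the host holds; the
# client diffs that against its local queue and asks only for what is
# missing. Names and data shapes are illustrative, not BOINC's.

def find_missing(server_list, local_tasks):
    """Return result names the server expects the host to have but the client lacks."""
    return sorted(set(server_list) - set(local_tasks))

server_list = ["wu_001_0", "wu_002_1", "wu_003_0"]  # what the project thinks we have
local_tasks = ["wu_002_1"]                          # what survived, e.g. after data loss
missing = find_missing(server_list, local_tasks)
print(missing)  # -> ['wu_001_0', 'wu_003_0']
```

The appeal of this arrangement is that the expensive comparison moves off the project database and onto each client; the server only has to emit its list.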


You are both missing my point: the large workload that resulted during the short time that resend was turned on is/was not representative of the day-to-day workload that the servers would have to contend with. Sure, that first blast of DB activity is big - but a fair portion of that will go away as soon as all the lost WUs are sent...

Yes, I realize that there will be extra workload on the db server, as well as the scheduler. (and probably the "download to client" server as well...)
.

Hello, from Albany, CA!...
ID: 727765
Profile KWSN THE Holy Hand Grenade!
Volunteer tester

Send message
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 727770 - Posted: 19 Mar 2008, 1:56:02 UTC - in response to Message 727688.  
Last modified: 19 Mar 2008, 2:09:24 UTC


Ahh!! Thanks for the clarification, Joe.

Is it any wonder there is a massive hit on the database when "Resend lost WUs" is switched on, then? Effectively, every Scheduler Request from every host generates a query on the database saying "Check that my list matches your list and re-send any that are missing from my list". One can see that this may have been supportable before multi-core CPUs came out, but now, with Duo/Quad/Octo being common and caches of anything up to 20 days, these lists have grown significantly.

If my reading of the mechanism is anything like accurate, then I can't see the Resend coming back in the foreseeable future.

F.



Umm, Octo? I haven't seen [add] or heard of [/add] one of those yet, although I've seen a motherboard for 4 AMD Opteron quads, hence 16 cores...
.

Hello, from Albany, CA!...
ID: 727770
OzzFan
Volunteer tester

Send message
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 727780 - Posted: 19 Mar 2008, 2:15:04 UTC - in response to Message 727765.  

Might I point out that there are probably a lot of "lost WUs" out there at the current time (even I have one), and that (therefore) the initial response to re-activating this feature will probably be intense? I remind you that this feature has been off for at least two months... This may mean an initial heavy workload on the servers - but if you wait for the ~300k active SETI users to all get whatever lost WUs are still available, then I think you'll find that this workload will taper off to a much lower figure.


I'm not certain I can agree with that. Ever since the feature was made available SETI had problems with their server load. Since disabling it, a lot of those issues went away.

I could be wrong, so they can always try your suggestion to verify, but I think the outcome will be constant increased server load just from merely having the option enabled.


I agree with you OzzFan. The increased server load is not from the actual re-transmission of lost WUs. Rather, it comes from the processing involved in looking up which WUs the host is supposed to have and comparing that to the WUs the host actually has on hand. Perhaps if there were an index or a separate DB record containing this info (assuming it's not there already), the load might decrease.

Perhaps a better way to handle this process would be to let the BOINC client do most of the actual work: when the client phones home, the project would respond with a list of WUs it thinks the host should have; the client could compare this list with the WUs it actually has and then request download of any missing WUs.


You are both missing my point: the large workload that resulted during the short time that resend was turned on is/was not representative of the day-to-day workload that the servers would have to contend with. Sure, that first blast of DB activity is big - but a fair portion of that will go away as soon as all the lost WUs are sent...

Yes, I realize that there will be extra workload on the db server, as well as the scheduler. (and probably the "download to client" server as well...)


No, I'm positive I got your point. I believe the day-to-day workload, while not as big as it would be initially, would still cause more stress on the servers than the SETI team can afford to spare at the moment.
ID: 727780
Profile speedimic
Volunteer tester

Send message
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 727781 - Posted: 19 Mar 2008, 2:17:45 UTC

[off topic]
Umm, Octo? I haven't seen [add]or heard of [/add] one of those yet, although I've seen a board for 4 AMD Opteron quads, hence 16 cores...


Some of those come to mind: Sun's T1 (4, 6 or 8 cores), T2 (8 cores) and IBM's z6 (20 cores).

At least the Sparcs could already work for S@h.

[/off topic]
mic.


ID: 727781
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 727822 - Posted: 19 Mar 2008, 9:20:20 UTC - in response to Message 727781.  

[off topic]
Umm, Octo? I haven't seen [add]or heard of [/add] one of those yet, although I've seen a board for 4 AMD Opteron quads, hence 16 cores...

Some of those come to my mind: Sun's T1 (4, 6 or 8 cores), T2 (8 cores) and IBM's z6 (20 cores).

At least the Sparcs could already work for S@h.

[/off topic]

And I've been posting about my adventures with host 2901600 for 15 months now. The new Mac Pros are all octos, too.
ID: 727822
Profile Arion

Send message
Joined: 6 Aug 99
Posts: 50
Credit: 140,650
RAC: 0
United States
Message 727827 - Posted: 19 Mar 2008, 10:21:37 UTC - in response to Message 727822.  
Last modified: 19 Mar 2008, 10:22:27 UTC

And I've been posting about my adventures with host 2901600 for 15 months now. The new Mac Pros are all octos, too.


I know this is off the subject, but I was looking at your WUs, and your oldest one waiting to validate is from Nov. 18th of last year: two missed deadlines, and now it's been extended to April 15th or so. Somehow I wonder if resending lost WUs and adjusted deadlines would help eliminate such long wait times for WUs to be validated. Waiting 4 or 5 months for credit to be awarded seems absurd to me...

Sorry to have jumped in with this if it's already a sore spot. I just couldn't help commenting after seeing someone else waiting for WUs to be sent back.
ID: 727827
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 727829 - Posted: 19 Mar 2008, 10:36:26 UTC - in response to Message 727827.  
Last modified: 19 Mar 2008, 10:43:40 UTC

And I've been posting about my adventures with host 2901600 for 15 months now. The new Mac Pros are all octos, too.

I know this is off the subject, but I was looking at your WUs, and your oldest one waiting to validate is from Nov. 18th of last year: two missed deadlines, and now it's been extended to April 15th or so. Somehow I wonder if resending lost WUs and adjusted deadlines would help eliminate such long wait times for WUs to be validated. Waiting 4 or 5 months for credit to be awarded seems absurd to me...

Sorry to have jumped in with this if it's already a sore spot. I just couldn't help commenting after seeing someone else waiting for WUs to be sent back.

It wouldn't have helped with the first or second wingmates - the first seems to have left the project round about the time that task was allocated, and the second never even completed his very first WU. The third (current) wingmate seems to be crunching now, but to have lost a whole big block in the middle. That would have been picked up by 'resend old WU' (as Bobb2 calls it, to get back on topic), but as it is I think I'll have to wait until 15 April and a fourth wingmate.

Edit - going back off-topic, my second-oldest pending is from WU 196689608 (a different host) - returned on 28 December, still waiting for the first and original wingmate to phone home. Ain't going to happen (but what a waste of an octo server), so I wait until 31 March. At least these three-month deadlines won't happen any more with the new splitter code.
ID: 727829
Profile KWSN THE Holy Hand Grenade!
Volunteer tester

Send message
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 727863 - Posted: 19 Mar 2008, 15:29:16 UTC - in response to Message 727822.  
Last modified: 19 Mar 2008, 15:45:32 UTC

[off topic]
Umm, Octo? I haven't seen [add]or heard of [/add] one of those yet, although I've seen a board for 4 AMD Opteron quads, hence 16 cores...

Some of those come to my mind: Sun's T1 (4, 6 or 8 cores), T2 (8 cores) and IBM's z6 (20 cores).

At least the Sparcs could already work for S@h.

[/off topic]

And I've been posting about my adventures with host 2901600 for 15 months now. The new Mac Pros are all octos, too.


Perhaps I should have mentioned that I'm almost exclusively a Windo$e man... What I meant is that I haven't seen an Octo from Inte£ or from AMD...

[add] from what I've been able to research, the E5320 is a quad! Do you have a second chip in the system?[/add]
.

Hello, from Albany, CA!...
ID: 727863
PhonAcq

Send message
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 727874 - Posted: 19 Mar 2008, 17:20:50 UTC

Isn't "resend" a bit absurd? If the client has lost the WU (disk crash/whatever), and the server learns of it, why doesn't the server just reallocate the WU to the outgoing queue and clean up the database? That is, why send it back to the original client? There is nothing special about him, is there?
ID: 727874
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 727877 - Posted: 19 Mar 2008, 17:42:01 UTC - in response to Message 727863.  

[add] from what I've been able to research, the E5320 is a quad! Do you have a second chip in the system?[/add]

Yes. Dual socket motherboard, 2 x quad CPUs.
ID: 727877
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 727878 - Posted: 19 Mar 2008, 17:47:05 UTC - in response to Message 727874.  

Isn't "resend" a bit absurd? If the client has lost the WU (disk crash/whatever), and the server learns of it, why doesn't the server just reallocate the WU to the outgoing queue and clean up the database? That is, why send it back to the original client? There is nothing special about him, is there?

That's just the point. Usually the server doesn't learn of it, and that's why we have WUs waiting for one, or several, deadlines to expire. If the resend feature is active, then the servers are actively checking whether the WUs they have sent are actually on the host to be crunched. IMO, it's that checking process which takes up all the time, rather than the actual resend.
ID: 727878
DJStarfox

Send message
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 727948 - Posted: 19 Mar 2008, 22:23:17 UTC - in response to Message 727878.  

That's just the point. Usually the server doesn't learn of it, and that's why we have WUs waiting for one, or several, deadlines to expire. If the resend feature is active, then the servers are actively checking whether the WUs they have sent are actually on the host to be crunched. IMO, it's that checking process which takes up all the time, rather than the actual resend.


There is a mechanism that does this... it's called a trickle. As the WU is being crunched, at certain checkpoints it will generate a trickle that BOINC will upload to the server. Tasks that do not have a checkpoint within the past x days will be forced to expire and another task issued.

Trouble is, the fastest machines do a SETI WU inside of 2 hours.

Maybe this is a silly question, but no one has answered this for me. If the client "loses a file", why can't it just download any missing tasks when it connects to the server?

Also, why can't the server, when it gets a client request for more work, learn that the client's queue does NOT have the tasks it gave it? Then it could auto-expire and re-issue those tasks. When I request more work, it could check that my queued tasks = server's list of queued tasks for me.

Perhaps I don't understand all the different cases that may arise to make a client lose a task. There has to be a clever way to address all/most of the cases with minimal impact on the database.
ID: 727948
Brian Silvers

Send message
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 727962 - Posted: 19 Mar 2008, 22:35:00 UTC - in response to Message 727948.  


Maybe this is a silly question, but no one has answered this for me. If the client "loses a file", why can't it just download any missing tasks when it connects to the server?


...because the typical scenario isn't that it "loses a file", but that it "loses the entire installation". The only other main use for the resend feature is if the download server went squirrelly and so the server thinks that the client has a file that it never really received.


Also, why can't the server, when it gets a client request for more work, learn that the client's queue does NOT have the tasks it gave it? Then it could auto-expire and re-issue those tasks. When I request more work, it could check that my queued tasks = server's list of queued tasks for me.


This is exactly what the resend does, except it doesn't expire them, it attempts to let you have another shot at them.
ID: 727962
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 727971 - Posted: 19 Mar 2008, 22:44:47 UTC - in response to Message 727948.  

When I request more work, it could check that my queued tasks = server's list of queued tasks for me.

That's what it does, that's what takes the time, and that's why it's such a big hit on the database.

With people having legitimate queues of hundreds, if not thousands, of tasks, and contacting the servers a couple of dozen times a day, that's a heck of a matching job.

If you wrote a thousand checks a day, how long would it take you to reconcile your bank statement?
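To put a rough number on that reconciliation job (the figures below are illustrative assumptions, not measured SETI@home load): every scheduler contact means fetching the host's outstanding results from the database and diffing them against what the client reports.

```python
# Illustrative sketch of the per-contact matching described above; the
# function name and the numbers are assumptions for the example, not
# measurements of the real servers.

def resend_check(db_results, reported):
    """Results recorded server-side that the client did not report holding."""
    return set(db_results) - set(reported)

# The set difference itself is cheap; the pain is the database fetch that
# feeds it, repeated on every scheduler contact.
contacts_per_day = 24        # "a couple of dozen times a day"
results_per_host = 1000      # "queues of hundreds, if not thousands"
rows_fetched_per_day = contacts_per_day * results_per_host
print(rows_fetched_per_day)  # -> 24000 rows scanned per host, per day
```

Multiply that by hundreds of thousands of active hosts and the bank-statement analogy starts to look generous.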
ID: 727971
DJStarfox

Send message
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 728084 - Posted: 20 Mar 2008, 6:52:58 UTC - in response to Message 727962.  


Maybe this is a silly question, but no one has answered this for me. If the client "loses a file", why can't it just download any missing tasks when it connects to the server?


...because the typical scenario isn't that it "loses a file", but that it "loses the entire installation". The only other main use for the resend feature is if the download server went squirrelly and so the server thinks that the client has a file that it never really received.


That exact scenario has happened to me.

But I think the majority of people who want the feature probably lose an entire installation folder, rather than just one or two tasks.

I think, unless cache sizes are significantly capped, the WU resend feature will never be used again.

The only other thing I could think of is a server-side setting. E.g., in the S@H tasks list, have a checkbox at the top that says "resend any missing WUs"; when the scheduler checks your preferences, it will only do the extra DB calls if that preference is checked. When the client contacts the scheduler, the checkbox is cleared by the scheduler.
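A one-shot checkbox like that could behave something like the sketch below; the field and function names are made up for illustration and this is not the actual scheduler code:

```python
# Hypothetical sketch of a one-shot "resend any missing WUs" preference:
# the scheduler performs the expensive lost-task lookup only when the flag
# is set, then clears it so subsequent requests skip the extra DB calls.

def handle_scheduler_request(host_prefs, lookup_lost_tasks):
    if host_prefs.get("resend_lost_wus"):
        lost = lookup_lost_tasks()             # the expensive DB comparison
        host_prefs["resend_lost_wus"] = False  # one-shot: cleared after use
        return lost
    return []                                  # normal path: no extra DB work

prefs = {"resend_lost_wus": True}
print(handle_scheduler_request(prefs, lambda: ["wu_042_1"]))  # -> ['wu_042_1']
print(handle_scheduler_request(prefs, lambda: ["wu_042_1"]))  # -> []
```

The point of the design is that the costly check runs only when a user explicitly asks for it, rather than on every scheduler request.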
ID: 728084
PhonAcq

Send message
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 728197 - Posted: 20 Mar 2008, 15:14:25 UTC - in response to Message 727962.  




Also, why can't the server, when it gets a client request for more work, learn that the client's queue does NOT have the tasks it gave it? Then it could auto-expire and re-issue those tasks. When I request more work, it could check that my queued tasks = server's list of queued tasks for me.


This is exactly what the resend does, except it doesn't expire them, it attempts to let you have another shot at them.


This is what I'd say is absurd. Why give the same WU to the client for another shot? That adds a constraint that doesn't seem necessary. Just move the 'lost' WUs back to the pool, and satisfy the hungry clients using the usual routines.
ID: 728197
Alinator
Volunteer tester

Send message
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 728262 - Posted: 20 Mar 2008, 19:19:10 UTC - in response to Message 728197.  



This is exactly what the resend does, except it doesn't expire them, it attempts to let you have another shot at them.


This is what I'd say is absurd. Why give the same WU to the client for another shot? That adds a constraint that doesn't seem necessary. Just move the 'lost' WUs back to the pool, and satisfy the hungry clients using the usual routines.


Because if you do that, the reissue is placed at the end of the queue due to the new RID and has to work its way up the list to be reissued. By sending it back to the same host, it maintains its place in the time-frame hierarchy.

In the case of the classic ghost task, the client asked for work, the project thought it assigned it, but the assignment never got back to the host. When the client comes back after the RPC deferral, says "What's the story? You didn't give me anything!", and asks again, what's the point of recycling the one already assigned back to the end of the queue and sending a different one at that point? In a SAH-only case you're only talking about seconds before that happens, and even if you run more than one project, the delay between the requests will still be only a small fraction of the deadline window. The worst thing which might happen is that it could force a period of EDF to get back on track overall, but so what.

In the case of BOINC augering in on the host, the case is less clear, depending on how much time has passed between when the event happened and when it is discovered.

Assuming it is even possible to recover the installation easily, it still makes sense to resend to the host, as long as the problem was fixed quickly enough that the host isn't in an unresolvable deadline jam.

In any event, the question of whether resend functionality is good, bad, or indifferent is moot. It is shut off for SAH, and will most likely stay shut off as long as they choose to 'overbook' the backend's capacity in terms of total outstanding workload.

IOW, there is way more complaining about not being able to carry way more work than you really need under most circumstances than there is about having a bunch of ghosts rearing their ugly 'red heads' as they time out, for a Computer Summary composed of 2500 results over 125 pages or so.

Alinator
ID: 728262
Yellow Horror

Send message
Joined: 10 Jun 03
Posts: 3
Credit: 10,157,045
RAC: 7
Russia
Message 728557 - Posted: 21 Mar 2008, 4:59:35 UTC
Last modified: 21 Mar 2008, 5:06:07 UTC

I think in most cases the user knows if he/she has some "lost WUs". So, redesigning the "resend" feature as a manually activated one-shot option, instead of an automatic check that happens on every client/server negotiation, would keep the feature effective and solve the load issue.
ID: 728557


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.