Strange New Error Message

Author	Message
Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340	Message 1006380 - Posted: 19 Jun 2010, 23:48:10 UTC Last modified: 20 Jun 2010, 0:03:13 UTC All of a sudden I am seeing the following: 6/19/2010 7:09:10 PM SETI@home Aborting task 04no09ad.13023.12342.13.10.28_0: exceeded elapsed time limit 10267.334105 I have gotten this from several WUs from 04no09ad. I have never before seen BOINC abort a WU for taking too long to crunch (note: NOT for exceeding return date and time). Is this another irritating outcome of the recent server changes? What genius decided this would be a cool thing to do? It wastes almost 3 hours of crunching per WU. And aborting the WUs now has further consequences in the new scheme of counting consecutive successful crunches to upgrade quotas (I think). I've gotten this about 30 times in the last day: exit status -177, maximum time limit exceeded. So 100+ hours of crunch time is totally wasted for no reason. WTF? Especially since the recent server changes have screwed up DCF, so estimates of time for WUs can be way off. Are the devs trying to drive away those of us who are willing to donate cycles, etc., to the project?

Gundolf Jahn Joined: 19 Sep 00 Posts: 3184 Credit: 446,358 RAC: 0	Message 1006389 - Posted: 20 Jun 2010, 0:15:21 UTC - in response to Message 1006380. See the threads "What causes this ERROR and how to prevent?" and "-177 (0xffffffffffffff4f) Faults". :-) Regards, Gundolf Computers aren't everything in life. (Just kidding) SETI@home classic workunits 3,758 SETI@home classic CPU time 66,520 hours

Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340	Message 1006396 - Posted: 20 Jun 2010, 0:46:49 UTC - in response to Message 1006389. No "flops" in my app_info.xml. So what else can be done? And, again, why is this occurring now? It MUST be a server-side change, as I haven't done anything recently except go to 6.10.56 (from 6.10.18) so I could exclude the GPU in my chipset from crunching. Is it a 6.10.56 feature? And is there a way of turning it off in app_info.xml?

Josef W. Segur Volunteer developer Volunteer tester Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 1006408 - Posted: 20 Jun 2010, 1:40:21 UTC - in response to Message 1006396. The only thing you could do is look in BOINC Manager for tasks with ridiculously small estimates, then shut down BOINC and edit client_state.xml so the <rsc_fpops_bound> values for those workunits are reasonable. 2.5e15 (2500000000000000.0) could be used for any MB WU. The issue is that the server is doing what amounts to DCF for each application on your host by adjusting the rsc_fpops_est, and the rsc_fpops_bound is adjusted by the same factor. Sometimes the adjustment goes too far; we can hope it will stabilize as the server averaging has more data. Joe
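With several hundred tasks, Joe's manual edit could be scripted. The following is only a minimal sketch in Python, assuming each bound appears in client_state.xml as a plain number inside a `<rsc_fpops_bound>` element; the function name `raise_fpops_bounds` is hypothetical and not part of BOINC. It works on the file's text rather than re-serializing the XML, so everything else in the file is left byte-for-byte unchanged. BOINC should be shut down and the file backed up before anything like this is tried.

```python
import re

# Value Joe suggests as reasonable for any multibeam (MB) workunit.
SAFE_BOUND = 2.5e15

# Assumed layout: <rsc_fpops_bound>NUMBER</rsc_fpops_bound>
_BOUND_RE = re.compile(
    r"(<rsc_fpops_bound>)\s*([^<\s]+)\s*(</rsc_fpops_bound>)"
)


def raise_fpops_bounds(xml_text, minimum=SAFE_BOUND):
    """Return (new_text, changed): every bound below `minimum` is raised to it."""
    changed = 0

    def fix(match):
        nonlocal changed
        try:
            value = float(match.group(2))
        except ValueError:
            return match.group(0)  # leave anything unparseable untouched
        if value >= minimum:
            return match.group(0)  # already large enough
        changed += 1
        return f"{match.group(1)}{minimum:f}{match.group(3)}"

    return _BOUND_RE.sub(fix, xml_text), changed


if __name__ == "__main__":
    # Hypothetical fragment; a real client_state.xml holds many more fields.
    sample = (
        "<workunit>\n"
        "    <name>example_wu</name>\n"
        "    <rsc_fpops_bound>120000000000000.000000</rsc_fpops_bound>\n"
        "</workunit>\n"
    )
    fixed, count = raise_fpops_bounds(sample)
    print(count)
    print(fixed)
```

The sketch raises every small bound to the same value, which follows Joe's suggestion for MB workunits only; it does not distinguish other task types such as Astropulse, so a real run would need to filter by workunit first.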

Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340	Message 1006412 - Posted: 20 Jun 2010, 1:59:15 UTC - in response to Message 1006408. That's too much damn work; I have at least several hundred WUs. Why can't the devs stop screwing around with the production server(s) and code until they have debugged their changes on the Beta site?

Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340	Message 1006428 - Posted: 20 Jun 2010, 2:47:13 UTC Last modified: 20 Jun 2010, 3:03:08 UTC OK - here's the deal: the WUs that are timing out are marked as "Anonymous platform - NVIDIA GPU" - i.e. (I think) they were sent to me as CUDA WUs. I use Reschedule, which moved them to CPU as part of load balancing on my machine (automated). So I am getting screwed by BOINC for it. If BOINC is going to do this, shouldn't it be cognizant of Reschedule, and be smarter than this? What a screwup! Do I have to abort all these WUs so as to avoid wasting all the hours of comp time that are going to be aborted because of the disconnect between Reschedule and BOINC? And are there any more IEDs buried in the BOINC / server code changes????? Gah! EDIT: Actually, this is a reversal of the old CUDA VLAR problem that Reschedule was largely a fix for - VLARs took forever on CUDA, so the tool changed them to CPU WUs. But now, that makes them take too long on the CPU, so they get aborted by BOINC (on the CPU) rather than by the user on the GPU, when he/she notices a lack of progress and a l-o-n-g execution time! Talk about working at cross purposes... Perhaps we need a better Properties button on the Tasks tab, to give this info (i.e., whether a VLAR or not) and a new button to allow the user to turn off the execution timeout that BOINC now has for a specific WU, or to modify it (with 0 meaning "no execution timeout").
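The mismatch described in this post can be illustrated with a little arithmetic. This is a sketch under an assumption that is not confirmed anywhere in the thread: that the client derives the elapsed-time limit by dividing rsc_fpops_bound by the speed (in floating-point operations per second) it assumes for the application. The function names and every number below are invented for illustration.

```python
def elapsed_limit_seconds(rsc_fpops_bound, assumed_flops):
    """Time limit under the assumed rule: limit = bound / assumed speed."""
    return rsc_fpops_bound / assumed_flops


def will_be_aborted(rsc_fpops_bound, assumed_flops, actual_runtime_seconds):
    """True if the task would run past its elapsed-time limit."""
    limit = elapsed_limit_seconds(rsc_fpops_bound, assumed_flops)
    return actual_runtime_seconds > limit


# Hypothetical values: a bound scaled down for a fast GPU application,
# then the task is moved to a CPU where it needs three hours.
scaled_bound = 1.0e14
assumed_flops = 1.0e10
cpu_runtime = 3 * 3600

print(elapsed_limit_seconds(scaled_bound, assumed_flops))
print(will_be_aborted(scaled_bound, assumed_flops, cpu_runtime))

# With the larger bound suggested earlier in the thread, the same run fits.
print(will_be_aborted(2.5e15, assumed_flops, cpu_runtime))
```

Under these made-up numbers the scaled-down bound allows 10,000 seconds, so a three-hour CPU run is cut off, while the 2.5e15 bound leaves ample room.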

Aurora Borealis Volunteer tester Joined: 14 Jan 01 Posts: 3075 Credit: 5,631,463 RAC: 0	Message 1006434 - Posted: 20 Jun 2010, 2:59:55 UTC - in response to Message 1006428. Last modified: 20 Jun 2010, 3:02:57 UTC You really don't expect that BOINC can adapt to micromanaging, do you? You are using a third-party, project-specific app to shuffle things around. There's no way BOINC can react well to this. It may compensate to some degree over time, but it won't be instantaneous. Boinc V7.2.42 Win7 i5 3.33G 4GB, GTX470

Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340	Message 1006435 - Posted: 20 Jun 2010, 3:08:34 UTC - in response to Message 1006434. Perhaps - but it would be even nicer if BOINC didn't make it necessary. Reschedule makes it possible for users to do a lot more science by using resources more effectively (VLARs to CPU, load balancing when SETI doesn't distribute CPU/GPU in the proper ratio). Perhaps these new changes should be thought through a little better before being implemented. We keep hearing that "it's the science, stupid, not the credits" and I agree. So why make it harder to do the science efficiently?

kittyman Volunteer tester Joined: 9 Jul 00 Posts: 51469 Credit: 1,018,363,574 RAC: 1,004	Message 1006437 - Posted: 20 Jun 2010, 3:14:57 UTC - in response to Message 1006412. Cruncher-American wrote: "That's too much damn work; I have at least several hundred WUs. Why can't the devs stop screwing around with the production server(s) and code until they have debugged their changes on the Beta site?" Aww... come on now, what would be the fun in that? "Freedom is just Chaos, with better lighting." Alan Dean Foster

Aurora Borealis Volunteer tester Joined: 14 Jan 01 Posts: 3075 Credit: 5,631,463 RAC: 0	Message 1006440 - Posted: 20 Jun 2010, 3:22:54 UTC - in response to Message 1006435. And that is precisely the direction BOINC is going. One of the things on the agenda is to maintain separate stats for different apps on a project. I don't know for sure if that is part of the current server changes, but whatever modifications they need to make for it to happen, it's not likely to be a smooth transition, as new tables of stats need to be established and adjusted from the current system. Boinc V7.2.42 Win7 i5 3.33G 4GB, GTX470
