Strange New Error Message

Cruncher-American (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1006380 - Posted: 19 Jun 2010, 23:48:10 UTC
Last modified: 20 Jun 2010, 0:03:13 UTC

All of a sudden I am seeing the following:

6/19/2010 7:09:10 PM SETI@home Aborting task 04no09ad.13023.12342.13.10.28_0: exceeded elapsed time limit 10267.334105

I have gotten this from several WUs from 04no09ad.
I have never before seen BOINC abort a WU for taking too long to crunch (note: NOT for exceeding return date and time).

Is this another irritating outcome of the recent server changes? What genius decided this would be a cool thing to do? It wastes almost 3 hours of crunching per WU, and aborting the WUs now has further consequences under the new scheme of counting consecutive successful crunches to upgrade quotas (I think).
I've gotten this about 30 times in the last day: exit status -177, maximum time limit exceeded. So 100+ hours of crunch time is totally wasted for no reason. WTF?

It's especially bad since the recent server changes have screwed up DCF (duration correction factor), so estimates of time for WUs can be way off.

Are the devs trying to drive away those of us who are willing to donate cycles, etc., to the project?

Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 1006389 - Posted: 20 Jun 2010, 0:15:21 UTC - in response to Message 1006380.  

See the threads "What causes this ERROR and how to prevent?" and "-177 (0xffffffffffffff4f) Faults". :-)

Regards,
Gundolf
Computers aren't everything in life. (Just kidding)

SETI@home classic workunits 3,758
SETI@home classic CPU time 66,520 hours

Cruncher-American
Message 1006396 - Posted: 20 Jun 2010, 0:46:49 UTC - in response to Message 1006389.  

No "flops" in my app_info.xml.

So what else can be done? And, again, why is this occurring now? MUST be a server-side change, as I haven't done anything recently except go to 6.10.56 (from 6.10.18) so I could exclude the GPU in my chipset from crunching.

Is it a 6.10.56 feature? And is there a way of turning it off in app_info.xml?

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1006408 - Posted: 20 Jun 2010, 1:40:21 UTC - in response to Message 1006396.  

The only thing you could do is look in BOINC Manager for tasks with ridiculously small estimates, then shut down BOINC and edit client_state.xml so that the <rsc_fpops_bound> values for those workunits are reasonable. 2.5e15 (2500000000000000.0) could be used for any MB (multibeam) WU.
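
For anyone with too many tasks to fix by hand, the edit could in principle be scripted. What follows is only a sketch, assuming each workunit's bound appears in client_state.xml as a single <rsc_fpops_bound>NUMBER</rsc_fpops_bound> element. The function name raise_fpops_bounds is made up for illustration, and it raises every bound below the threshold, not only those of MB WUs. Shut down BOINC and back up the file before trying anything like it.

```python
import re
from typing import Tuple

# 2.5e15 is the value suggested above for any MB WU.
MIN_BOUND = 2.5e15

# Assumption: the bound is stored as <rsc_fpops_bound>NUMBER</rsc_fpops_bound>.
_BOUND_RE = re.compile(r"(<rsc_fpops_bound>)\s*([^<\s]+)\s*(</rsc_fpops_bound>)")


def raise_fpops_bounds(state_xml: str, min_bound: float = MIN_BOUND) -> Tuple[str, int]:
    """Raise every rsc_fpops_bound below min_bound; return (new_text, count_changed)."""
    changed = 0

    def fix(match):
        nonlocal changed
        if float(match.group(2)) >= min_bound:
            return match.group(0)  # already large enough, leave untouched
        changed += 1
        return "{}{:f}{}".format(match.group(1), min_bound, match.group(3))

    new_text = _BOUND_RE.sub(fix, state_xml)
    return new_text, changed


if __name__ == "__main__":
    # Tiny made-up fragment, not a real client_state.xml.
    sample = (
        "<workunit>\n"
        "    <name>example_wu</name>\n"
        "    <rsc_fpops_bound>1000000000000.000000</rsc_fpops_bound>\n"
        "</workunit>\n"
    )
    fixed, count = raise_fpops_bounds(sample)
    print(count, "bound(s) raised")
    print(fixed)
```

To apply it for real, one would read client_state.xml into a string, pass it through the function, and write the result back while BOINC is not running. Everything outside the matched elements is left byte-for-byte as it was.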

The issue is that the server is doing what amounts to DCF for each application on your host by adjusting the rsc_fpops_est, and the rsc_fpops_bound is adjusted by the same factor. Sometimes the adjustment goes too far; we can hope it will stabilize as the server averaging gets more data.
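
To make the scaling concrete, here is a rough sketch with made-up numbers. It assumes the client's elapsed-time limit is rsc_fpops_bound divided by the speed (flops) it assumes for the app. That rule, the helper names, and all the figures are assumptions for illustration, not values from this thread.

```python
def elapsed_limit_seconds(rsc_fpops_bound, app_flops):
    """Assumed rule: the task is aborted once elapsed time exceeds bound / flops."""
    return rsc_fpops_bound / app_flops


def scaled_workunit(base_est, base_bound, factor):
    """The server adjustment as described above: est and bound share one factor."""
    return base_est * factor, base_bound * factor


# Hypothetical figures, chosen only for illustration.
BASE_EST = 3.0e13        # nominal rsc_fpops_est
BASE_BOUND = 3.0e14      # nominal rsc_fpops_bound (10x the estimate here)
APP_FLOPS = 3.0e9        # speed the client assumes for the app
ACTUAL_RUNTIME = 9000.0  # seconds the task really needs on this host

if __name__ == "__main__":
    for factor in (1.0, 0.1, 0.01):
        est, bound = scaled_workunit(BASE_EST, BASE_BOUND, factor)
        limit = elapsed_limit_seconds(bound, APP_FLOPS)
        verdict = "aborted" if ACTUAL_RUNTIME > limit else "finishes"
        print("factor %.2f: estimate %.0f s, limit %.0f s -> %s"
              % (factor, est / APP_FLOPS, limit, verdict))
```

Under these assumptions, a factor that shrinks the estimate too far also shrinks the limit below the real runtime, which would produce exactly the kind of "exceeded elapsed time limit" abort reported in this thread.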
Joe

Cruncher-American
Message 1006412 - Posted: 20 Jun 2010, 1:59:15 UTC - in response to Message 1006408.  

That's too much damn work; I have at least several hundred WUs. Why can't the devs stop screwing around with the production server(s) and code until they have debugged their changes on the Beta site?

Cruncher-American
Message 1006428 - Posted: 20 Jun 2010, 2:47:13 UTC
Last modified: 20 Jun 2010, 3:03:08 UTC

OK - here's the deal: the WUs that are timing out are marked as "Anonymous platform - NVIDIA GPU" - i.e. (I think) they were sent to me as CUDA WUs. I use Reschedule, which moved them to CPU as part of load balancing on my machine (automated). So I am getting screwed by BOINC for it.

If BOINC is going to do this, shouldn't it be cognizant of Reschedule, and be smarter than this? What a screwup! Do I have to abort all these WUs so as to avoid wasting all the hours of comp time that are going to be aborted because of the disconnect between Reschedule and BOINC?

And are there any more IEDs buried in the BOINC / server code changes?????

Gah!

EDIT: Actually, this is a reversal of the old CUDA VLAR problem that Reschedule was largely a fix for: VLARs (very low angle range WUs) took forever on CUDA, so the tool changed them to CPU WUs. But now that makes them exceed the time limit on the CPU, so they get aborted by BOINC (on the CPU) rather than by the user on the GPU, when he/she notices a lack of progress and a l-o-n-g execution time!
Talk about working at cross purposes...

Perhaps we need a better Properties button on the Tasks tab, to give this info (i.e., whether a VLAR or not) and a new button to allow the user to turn off the execution timeout that BOINC now has for a specific WU, or to modify it (with 0 meaning "no execution timeout").

Aurora Borealis
Volunteer tester
Joined: 14 Jan 01
Posts: 3075
Credit: 5,631,463
RAC: 0
Canada
Message 1006434 - Posted: 20 Jun 2010, 2:59:55 UTC - in response to Message 1006428.  
Last modified: 20 Jun 2010, 3:02:57 UTC

You really don't expect that Boinc can adapt to micromanaging, do you? You are using a third-party, project-specific app to shuffle things around. There's no way Boinc can react well to this. It may compensate to some degree over time, but it won't be instantaneous.

Boinc V7.2.42
Win7 i5 3.33G 4GB, GTX470

Cruncher-American
Message 1006435 - Posted: 20 Jun 2010, 3:08:34 UTC - in response to Message 1006434.  

Perhaps - but it would be even nicer if BOINC didn't make it necessary.

Reschedule makes it possible for users to do a lot more science by using resources more effectively (VLARs to CPU, load balancing when SETI doesn't distribute CPU/GPU in the proper ratio). Perhaps these new changes should be thought through a little better before being implemented. We keep hearing that "it's the science, stupid, not the credits" and I agree. So why make it harder to do the science efficiently?

kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51469
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1006437 - Posted: 20 Jun 2010, 3:14:57 UTC - in response to Message 1006412.  


Aww.. come on now, what would be the fun in that?
"Freedom is just Chaos, with better lighting." Alan Dean Foster


Aurora Borealis
Message 1006440 - Posted: 20 Jun 2010, 3:22:54 UTC - in response to Message 1006435.  

And that is precisely the direction Boinc is going. One of the things on the agenda is to maintain separate stats for the different apps on a project. I don't know for sure if that is part of the current server changes, but whatever modifications they need to make for it to happen, it's not likely to be a smooth transition, as new tables of stats need to be established and adjusted from the current system.

Boinc V7.2.42
Win7 i5 3.33G 4GB, GTX470


 