F cuda.............

Profile Byron S Goodgame
Volunteer tester
Joined: 16 Jan 06
Posts: 1145
Credit: 3,936,993
RAC: 0
United States
Message 850315 - Posted: 7 Jan 2009, 1:42:07 UTC - in response to Message 850312.  
Last modified: 7 Jan 2009, 1:43:07 UTC

It really doesn't seem possible that it's a CUDA problem, since you're still using BOINC 5.10.45.

You are getting "too many normally harmless exit" errors, which, if I'm not mistaken, might have to do with CPU throttling. In your BOINC Manager, under processor usage, do you have CPU time set to anything other than 100%?
ID: 850315
Profile dragon1

Joined: 17 Sep 05
Posts: 33
Credit: 4,438,013
RAC: 0
Canada
Message 850321 - Posted: 7 Jan 2009, 1:53:59 UTC - in response to Message 850315.  

None of my preferences have been changed since this machine first went online last spring, and the current setting is still "100% of the processors" and 98% of CPU time... nothing has been altered at all in the past 8-9 months. This is a Core 2 Duo... guess you already have access to that info... lol.
ID: 850321
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 850333 - Posted: 7 Jan 2009, 2:08:05 UTC
Last modified: 7 Jan 2009, 2:38:57 UTC

Hmmm...

After looking over a few of the errored-out tasks for your host, it looks like you are having CC heartbeat problems.

In general terms, this means that something is causing the CC to 'stall' and not send heartbeat signals to the science app in a timely fashion. That makes the science app think the CC has died, and the apps are designed to exit on their own when this happens, to prevent them from continuing to run as orphaned processes.
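In rough pseudo-code terms, the app-side behaviour boils down to a timeout check like the sketch below (the names, interval, and limit are made up for illustration; the real client and science apps communicate through shared memory):

```python
import time

HEARTBEAT_PERIOD = 1.0   # illustrative: how often the CC is expected to signal
MISSED_LIMIT = 30        # illustrative: give up after ~30 missed heartbeats

def run_science_app(get_last_heartbeat, crunch_a_slice):
    """Sketch of the 'exit if the core client stalls' behaviour.

    get_last_heartbeat() returns the time of the last heartbeat seen from
    the core client (CC); crunch_a_slice() does a small chunk of the task.
    Both are hypothetical stand-ins for the real mechanism.
    """
    while True:
        crunch_a_slice()
        if time.time() - get_last_heartbeat() > MISSED_LIMIT * HEARTBEAT_PERIOD:
            # The CC looks dead (or is just stalled), so exit rather than
            # keep running as an orphaned process; the task then shows up
            # as an error on the server side.
            print("no heartbeat from core client -- exiting")
            return
        time.sleep(HEARTBEAT_PERIOD)
```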

There are several things which can cause this.

The first thing I'd try would be to set the CPU throttle to 100%. There have been reports of the throttle causing problems for some hosts with the later 5.10 CCs.

The next thing to look at is to exclude the BOINC directory and all its subfolders from virus scanning.

Another possibility is if you have the host set up for 'remote' monitoring and control. If so, you can try increasing the 'polling' interval at which the tools request status/state updates. If the interval is set too short for the specific circumstances on the host, it can lead to the aforementioned CC stalls, and too many of those will result in tasks aborting like you have been observing.

One last, sort-of-desperation thing to do is to defrag the drive where BOINC lives. I know this sounds like groping, but it has solved weird BOINC malfunctions a couple of times in the past, when everything else I had tried failed and I was fresh out of ideas. ;-)

HTH,

Alinator
ID: 850333
cowboy

Joined: 2 Aug 08
Posts: 51
Credit: 18,580
RAC: 0
United States
Message 850339 - Posted: 7 Jan 2009, 2:15:55 UTC - in response to Message 850298.  

If you don't want to run a CUDA app, go into your account, click on SETI@home Preferences, edit the preferences for wherever your computer is (whether Home/Work/School, etc.), change the setting of "Use Graphics Processing Unit (GPU) if available" to no, save your changes, open BOINC Manager, click on your SETI project, click Update, and you should no longer get CUDA work. It's an easy way to opt out of running it until it's fixed.

The issue isn't with those of us who do not want to use CUDA, as I doubt that anyone in this thread who is expressing frustration with this rollout is using it.

The issue is that there is a concern that the CUDA application is poorly written and is returning bad information, crashing, and requesting ridiculously small amounts of credit, forcing workunits to be unnecessarily distributed to additional hosts for further verification. Or, to put it succinctly, CUDA is harming the project and alienating some of its previously strong advocates.


I fully understand that, which is why my point still stands. If you don't want to run it while there are still issues, opt out of it. Matt posted in the Tech News thread that they found at least one bug in the validator relating to the CUDA and non-CUDA results not matching, so work is being done to fix the issue. No one is being alienated at all.
ID: 850339
Profile dragon1

Joined: 17 Sep 05
Posts: 33
Credit: 4,438,013
RAC: 0
Canada
Message 850347 - Posted: 7 Jan 2009, 2:29:42 UTC - in response to Message 850333.  

Here's something I was just told... about 14 hrs ago the system was actually told to shut off by the operator and it failed to do so, and when I came by an hour or so ago the screen was displaying the Windows XP error box saying that "svchost.exe" was not responding; the system had obviously been that way all day with no attention. Could that have eliminated all the WUs in the queue as it continued to try each one in succession and failed? Hence an empty queue now?
I have no idea how to change or check the CPU throttle to 100%, but I will run a defrag now. Guess there's no way to get any more WUs for now.
Thanks so far.

ID: 850347
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 850352 - Posted: 7 Jan 2009, 2:37:48 UTC - in response to Message 850347.  

Here's something I was just told... about 14 hrs ago the system was actually told to shut off by the operator and it failed to do so, and when I came by an hour or so ago the screen was displaying the Windows XP error box saying that "svchost.exe" was not responding; the system had obviously been that way all day with no attention. Could that have eliminated all the WUs in the queue as it continued to try each one in succession and failed? Hence an empty queue now?
I have no idea how to change or check the CPU throttle to 100%, but I will run a defrag now. Guess there's no way to get any more WUs for now.
Thanks so far.


Yep, that could definitely do it. Especially if it was a network-related service running through the Services and Controller Application (the user-friendly name for svchost) which had gagged.

Alinator
ID: 850352
Profile dragon1

Joined: 17 Sep 05
Posts: 33
Credit: 4,438,013
RAC: 0
Canada
Message 850372 - Posted: 7 Jan 2009, 3:43:05 UTC - in response to Message 850352.  

Ok... thanks, mate. Then I guess I'll leave my system up and running BOINC and see if and when it will start sending me some WUs. I guess it's not been happy with me sending too many Client Errors to SETI in one day, so they shut me down for a while?
ID: 850372
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 850375 - Posted: 7 Jan 2009, 4:10:02 UTC - in response to Message 850372.  
Last modified: 7 Jan 2009, 4:17:07 UTC

Ok... thanks, mate. Then I guess I'll leave my system up and running BOINC and see if and when it will start sending me some WUs. I guess it's not been happy with me sending too many Client Errors to SETI in one day, so they shut me down for a while?


Yep, the project side has two cutoffs to slow down 'renegade' hosts.

The basic quota limit is 100 tasks per quota day per core up to 4 cores (last I knew), or 400 tasks total per quota day.

A quota day is defined as from midnight to midnight Pacific time for SAH.

The first cutoff is that the quota is reduced by one for every errored-out task the host returns. The project will double the current value for every successful task returned (successful doesn't mean it validates for credit, though; an important distinction). This one is intended to shut down a host which goes 'insane' and isn't noticed right off, like the situation you had on your host, for example.

The second cutoff is intended to help keep the overall work-in-progress load on the backend within reason by limiting how much work any host can download in a given day (in theory). This one hardly ever came into play for most participants until recently, since there weren't many folks crunching who could afford hosts that could sustain a throughput of 400 tasks per day. ;-)

However, it is possible for quads and higher to reach the 400-task-per-day limit when they are filling their cache initially, or after prolonged outages, if you have the cache settings set high enough. Of course, when the 'true' 8-core (16 virtual) i7s are launched, we could be seeing mainstream hosts hit the absolute daily limit a lot more.
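In rough code terms, that first cutoff behaves something like the sketch below (the function, the floor of 1, and the cap are my reading of the behaviour described above, not the project's actual server code):

```python
DAILY_QUOTA_MAX = 100  # per core per quota day, up to 4 cores, as noted above

def update_quota(quota, task_succeeded):
    """First cutoff as described: -1 per errored task, double (capped) per good one."""
    if task_succeeded:
        return min(quota * 2, DAILY_QUOTA_MAX)
    return max(quota - 1, 1)  # assumed floor of 1 so a host can always recover

# Example: 50 errored tasks in a row, then one successful return.
quota = DAILY_QUOTA_MAX
for _ in range(50):
    quota = update_quota(quota, task_succeeded=False)
print(quota)                                      # 50
print(update_quota(quota, task_succeeded=True))   # 100 again
```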

HTH,

Alinator
ID: 850375
Profile dragon1

Joined: 17 Sep 05
Posts: 33
Credit: 4,438,013
RAC: 0
Canada
Message 850378 - Posted: 7 Jan 2009, 4:34:19 UTC - in response to Message 850375.  

Thanks for the quick response. Not sure how to interpret "The first cutoff is that the quota is reduced by one for every errored-out task the host returns"... so I'll just leave everything running for a few days and see what happens. There is currently nothing in the "task" listing waiting to run or running. I'll just wait.
Thanks again.
ID: 850378
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 850398 - Posted: 7 Jan 2009, 5:55:53 UTC - in response to Message 850378.  

Thanks for the quick response. Not sure how to interpret "The first cutoff is that the quota is reduced by one for every errored-out task the host returns"... so I'll just leave everything running for a few days and see what happens. There is currently nothing in the "task" listing waiting to run or running. I'll just wait.
Thanks again.

You start off at 100. If you have one 'compute error', or any other error (including an abort or a missed deadline), your quota is now 99/CPU/day. If this happens a few more times, it continues to decrease by one for every problem.

Now say it's down to 50/CPU/day. If you turn in one good result, it goes up by two, so now it is 52. If you turn in 10 good tasks all at the same time, it goes up by 20, and so on.

So, one bad task decreases the quota by one. One good task increases the quota by two.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 850398
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 850403 - Posted: 7 Jan 2009, 6:18:10 UTC - in response to Message 850398.  

Thanks for the quick response. Not sure how to interpret "The first cutoff is that the quota is reduced by one for every errored-out task the host returns"... so I'll just leave everything running for a few days and see what happens. There is currently nothing in the "task" listing waiting to run or running. I'll just wait.
Thanks again.

You start off at 100. If you have one 'compute error', or any other error (including an abort or a missed deadline), your quota is now 99/CPU/day. If this happens a few more times, it continues to decrease by one for every problem.

Now say it's down to 50/CPU/day. If you turn in one good result, it goes up by two, so now it is 52. If you turn in 10 good tasks all at the same time, it goes up by 20, and so on.

So, one bad task decreases the quota by one. One good task increases the quota by two.


Hmmm...

Unless there's been a change I'm not aware of, good returns double the current quota value (up to the maximum).

So if you had enough errors in a row to drop the quota to 50, the next successful task would return the quota to 100.

Or, put another way, if you were at 1 for the quota, each good return would double it, so it would go:

1-2-4-8-16-32-64-100

IOW, seven straight successes get you back to the max.
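The same progression as a quick illustrative snippet (cap of 100 assumed, as above; not the project's actual code):

```python
quota = 1
while quota < 100:
    quota = min(quota * 2, 100)
    print(quota)   # 2, 4, 8, 16, 32, 64, 100 -- seven successes in all
```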

Alinator
ID: 850403