BOINC v6.6.31 available

Questor
Volunteer tester
Joined: 3 Sep 04
Posts: 471
Credit: 230,506,401
RAC: 157
United Kingdom
Message 903637 - Posted: 4 Jun 2009, 16:27:06 UTC - in response to Message 903612.  


The VLARs don't seem to be a problem. The autokill has been working OK, I think.


I wondered if you'd had so many VLAR kills it had cleared all your tasks out - making it look like you'd had lots of the preempt problems.

The other -5 I mentioned is the memory problem rather than the pre-empt issue. 1247338546

<core_client_version>6.6.31</core_client_version>
<![CDATA[
<message>
- exit code -5 (0xfffffffb)
</message>
<stderr_txt>
setiathome_CUDA: Found 2 CUDA device(s):
Device 1 : GeForce GTX 285
totalGlobalMem = 1073414144
sharedMemPerBlock = 16384
regsPerBlock = 16384
warpSize = 32
memPitch = 262144
maxThreadsPerBlock = 512
clockRate = 1548000
totalConstMem = 65536
major = 1
minor = 3
textureAlignment = 256
deviceOverlap = 1
multiProcessorCount = 30
Device 2 : GeForce GTX 285
totalGlobalMem = 1073479680
sharedMemPerBlock = 16384
regsPerBlock = 16384
warpSize = 32
memPitch = 262144
maxThreadsPerBlock = 512
clockRate = 1548000
totalConstMem = 65536
major = 1
minor = 3
textureAlignment = 256
deviceOverlap = 1
multiProcessorCount = 30
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce GTX 285 is okay
SETI@home using CUDA accelerated device GeForce GTX 285
V10 modification by Raistmer
Priority of worker thread rised successfully
Priority of process adjusted successfully
Total GPU memory 1073414144 free GPU memory 745774080
setiathome_enhanced 6.02 Visual Studio/Microsoft C++

Build features: Non-graphics VLAR autokill enabled FFTW x86
CPUID: Intel(R) Xeon(R) CPU W5580 @ 3.20GHz

Cache: L1=64K L2=256K

CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3
libboinc: 6.4.5

Work Unit Info:
...............
WU true angle range is : 8.984127
Cuda error 'cudaMemcpy(dev_cx_DataArray, cx_DataArray, NumDataPoints * sizeof(*cx_DataArray), cudaMemcpyHostToDevice)' in file 'd:/BTR/SETI6/SETI_MB_CUDA/client/cuda/cudaAcceleration.cu' in line 262 : unspecified launch failure.
SETI@home error -5 Can't open file
(work_unit.sah) in read_wu_state() errno=2

File: ..\worker.cpp
Line: 123


</stderr_txt>
]]>



Question: What is the decision process that makes BOINC preempt in this situation? Is this a bug, or is there some logic to it?

Bob


As I understand it, the pre-empt issue is a bug - described here (a code sketch of the failure mode follows the changelog extract):-

25110

- client: fixed nasty bug that caused GPU jobs to crash on startup when they're preempting another GPU job. The problem was as follows:

* job A is chosen to preempt job B
* we tell job B to quit, and initialize job A but don't start it; however, we set its scheduler state to SCHEDULED (rather than UNINITIALIZED)

* job B exits, and we start job A. Since its state is not UNINITIALIZED, we don't set up its slot dir.

* job A runs in an empty slot dir, doesn't find its files, and bombs out.

* client: add <slot_debug> option (prints messages about allocation of slots, creating/removing files in slot dirs).
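
To make that failure mode concrete, here is a minimal sketch in C++ of the state-machine slip being described. The names are invented for illustration - this is not the actual BOINC client source:

// Illustrative sketch of the pre-empt bug above - invented names,
// not the real BOINC client code.
#include <stdexcept>
#include <string>

enum SchedState { UNINITIALIZED, SCHEDULED };

struct Task {
    std::string name;
    SchedState state = UNINITIALIZED;
    bool slot_dir_ready = false;

    void setup_slot_dir() { slot_dir_ready = true; }  // copy app + WU files in

    // Called once the pre-empted job has finally quit.
    void start() {
        if (state == UNINITIALIZED)   // buggy path had already set state to
            setup_slot_dir();         //   SCHEDULED, so this step was skipped
        state = SCHEDULED;
        if (!slot_dir_ready)          // app then ran in an empty slot dir...
            throw std::runtime_error("error -5: Can't open file (work_unit.sah)");
    }
};

Under that (hypothetical) logic, marking job A SCHEDULED while it waits for job B to quit means start() skips the slot-dir setup, and the science app bombs out exactly as in the stderr above.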

---------------------------------

I seem to recall reading on the VLAR front (probably on Lunatics) that by upgrading to the CUDA 2.2 DLLs and the newest Nvidia driver, the VLAR kill was no longer necessary as the tasks ran in reasonable time - which would possibly avoid clearing your cache if you are unlucky and get a skewed number of VLARs. I never went down the kill route but did rebranding to a CPU task instead, and I have much less CUDA processing power than you, so things happen on a smaller scale - but over the last couple of days I do seem to have predominantly got VLAR tasks (probably sweeping up all the VLAR kills :-) )

Perhaps it might be worth trying reverting to the normal opt. apps. to see if that gets you past this hurdle. Hopefully someone else with a better memory can confirm whether this is the case?


John.
GPU Users Group



ID: 903637
Questor
Volunteer tester
Joined: 3 Sep 04
Posts: 471
Credit: 230,506,401
RAC: 157
United Kingdom
Message 903639 - Posted: 4 Jun 2009, 16:39:22 UTC - in response to Message 903622.  

I did install v6.6.31 over v6.6.28 and OK'd its installation defaults (admittedly I didn't look at what they were after seeing that the v6.6.31 installer properly detected my (default) installation directories ... so maybe if apps run as a service, that would be my fault).



I am running my XP SP3 machines (32 and 64 bit) all as services (I think that is what you say you are doing?) and did my 6.6.31 installs over the top of existing installs. I checked all the install options and they seemed standard.

Will be going home shortly and will give the 6.6.31 XP machines another try out to see if I can replicate your experience.


John.
GPU Users Group



ID: 903639
Bob Mahoney Design
Joined: 4 Apr 04
Posts: 178
Credit: 9,205,632
RAC: 0
United States
Message 903640 - Posted: 4 Jun 2009, 16:39:35 UTC - in response to Message 903637.  


I wondered if you'd had so many VLAR kills it had cleared all your tasks out - making it look like you'd had lots of the preempt problems.

The other -5 I mentioned is the memory problem rather than the pre-empt issue. 1247338546
...
<core_client_version>6.6.31</core_client_version>
<![CDATA[
<message>
- exit code -5 (0xfffffffb)
</message>
<stderr_txt>
...
setiathome_CUDA: Found 2 CUDA device(s):
Work Unit Info:
...............
WU true angle range is : 8.984127
Cuda error 'cudaMemcpy(dev_cx_DataArray, cx_DataArray, NumDataPoints * sizeof(*cx_DataArray), cudaMemcpyHostToDevice)' in file 'd:/BTR/SETI6/SETI_MB_CUDA/client/cuda/cudaAcceleration.cu' in line 262 : unspecified launch failure.
SETI@home error -5 Can't open file
(work_unit.sah) in read_wu_state() errno=2

</stderr_txt>
]]>

Perhaps it is a bad GPU. I assumed it was overloaded with resident "wait to run" tasks; maybe it is actually bad memory. I will remove that card and try some test runs without it.
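
One detail worth knowing about that stderr: an 'unspecified launch failure' returned by cudaMemcpy usually surfaces a fault from an earlier kernel launch rather than from the copy itself, since CUDA reports such errors on the next API call - which fits either flaky GPU memory or a card wedged by stacked-up tasks. A minimal sketch of the kind of error-check wrapper that prints a message like the one above (illustrative only, not the actual SETI@home CUDA source):

// Minimal CUDA error-check wrapper - illustrative, not the real app code.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            std::fprintf(stderr, \
                "Cuda error '%s' in file '%s' in line %d : %s.\n", \
                #call, __FILE__, __LINE__, cudaGetErrorString(err)); \
            std::exit(-5); \
        } \
    } while (0)

int main() {
    float host[4] = {0};
    float* dev = 0;
    CUDA_CHECK(cudaMalloc((void**)&dev, sizeof(host)));
    // If an earlier kernel launch had crashed, this copy is where the
    // deferred "unspecified launch failure" would typically surface:
    CUDA_CHECK(cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaFree(dev));
    return 0;
}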

Bob
ID: 903640
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14686
Credit: 200,643,578
RAC: 874
United Kingdom
Message 903642 - Posted: 4 Jun 2009, 16:44:48 UTC - in response to Message 903630.  

Since the preempting is also happening on my single-core GPU system, it goes back to my question: Why is the preempting happening in the first place? In other words, with a short cache (1.6 days), and plenty of time for all tasks to complete before deadline, why does a WU go EDF in the first place? Is the WU born and flagged that way before I get it? I thought EDF was a calculated state based on the immediate context within the local host?

Bob

That is indeed a good question. Let's try and track it down.

First, BOINC versions up to and including v6.6.20 always ran CUDA tasks in 'earliest deadline' order. That was a design decision, since abandoned.

Starting with v6.6.23 (I think they missed a couple of release numbers), CUDA tasks should run in the order they're received from the server, unless there's a deadline problem. Then, they would switch to 'earliest deadline' mode, but should also show a flag for running in "High Priority". Are you seeing that? (You may need to extend the width of the 'status' column to be sure). Fred's screenshot doesn't show High Priority, so the shorty tasks may simply have been the next due to run in FIFO order: unfortunately, with a bespoke sort order set, we can't see the tasks in issue order.
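
A minimal sketch of that FIFO-versus-EDF choice, with invented names (illustrative C++, not the actual client code):

// Illustrative sketch of the scheduling rule described above.
#include <algorithm>
#include <vector>

struct Job {
    double received;             // order the server sent it
    double deadline;             // report deadline
    bool high_priority = false;  // what the Manager shows as "High Priority"
};

void order_runnable(std::vector<Job>& jobs, bool deadline_miss_predicted) {
    if (!deadline_miss_predicted) {
        // Normal case from ~v6.6.23 on: run in the order received (FIFO).
        std::sort(jobs.begin(), jobs.end(),
                  [](const Job& a, const Job& b) { return a.received < b.received; });
        return;
    }
    // Deadline trouble: earliest-deadline-first, flagged High Priority.
    std::sort(jobs.begin(), jobs.end(),
              [](const Job& a, const Job& b) { return a.deadline < b.deadline; });
    for (Job& j : jobs) j.high_priority = true;
}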

[Tip:
If, like Fred, you have a bespoke sort order set, you can clear it by resetting these two registry values:

[HKEY_CURRENT_USER\Software\Space Sciences Laboratory, U.C. Berkeley\BOINC Manager\Tasks]
"SortColumn"=dword:ffffffff
"SortAscending"=dword:00000001

It's the only way I've found.
/Tip]

With a 1.6 day cache, there should never be any need for High Priority running, and hence no EDF. But beware: if you have a full cache, and something goes wrong with your Duration Correction Factor or other time metrics, a 1.6 day cache can suddenly evaluate to seven or more days, and trigger EDF. I've had that happen with a near-VLAR which escaped my rebranding. Look and see if your current BOINC estimates for unstarted tasks seem to be realistic.
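
As a back-of-envelope illustration of that DCF effect (all numbers invented): BOINC multiplies every raw runtime estimate by the host's Duration Correction Factor, so one badly overrunning task can inflate the whole queue at a stroke.

// Back-of-envelope DCF illustration - all numbers invented.
#include <cstdio>

int main() {
    const double queued_hours = 38.4;  // raw estimates: a 1.6-day cache
    double dcf = 1.0;                  // healthy Duration Correction Factor

    // One task (say a near-VLAR) runs ~5x its estimate; DCF jumps up with
    // it, and every remaining estimate is multiplied by the new factor:
    dcf = 5.0;
    std::printf("1.6-day cache now evaluates to %.1f days -> EDF territory\n",
                queued_hours * dcf / 24.0);   // prints 8.0 days
    return 0;
}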

If that doesn't throw up any clues, we may be moving into the territory of extended debug logging - see Client configuration. Are any of you up for that? We're probably talking about <coproc_debug>, <cpu_sched> and <cpu_sched_debug>.
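
For reference, those flags would go in a cc_config.xml file in the BOINC data directory, along these lines (the client reads it at startup, or via the Manager's 'Read config file' command):

<cc_config>
    <log_flags>
        <coproc_debug>1</coproc_debug>
        <cpu_sched>1</cpu_sched>
        <cpu_sched_debug>1</cpu_sched_debug>
    </log_flags>
</cc_config>
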
ID: 903642
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14686
Credit: 200,643,578
RAC: 874
United Kingdom
Message 903648 - Posted: 4 Jun 2009, 16:56:11 UTC - in response to Message 903637.  

I seem to recall reading on the VLAR front (probably on Lunatics) that by upgrading to the CUDA 2.2 DLLs and the newest Nvidia driver, the VLAR kill was no longer necessary as the tasks ran in reasonable time - which would possibly avoid clearing your cache if you are unlucky and get a skewed number of VLARs. I never went down the kill route but did rebranding to a CPU task instead, and I have much less CUDA processing power than you, so things happen on a smaller scale - but over the last couple of days I do seem to have predominantly got VLAR tasks (probably sweeping up all the VLAR kills :-) )

Perhaps it might be worth trying reverting to the normal opt. apps. to see if that gets you past this hurdle. Hopefully someone else with a better memory can confirm whether this is the case?

No - I tried VLARs with the 2.2 DLL kit and they were just as bad.

The 'nasty bug' you've picked out of the changelog is indeed the most significant change in v6.6.31, and the reason I started this thread. The fix seems successful, in that when a task is pre-empted (EDF or user intervention), the task which replaces it no longer errors immediately on initial startup. But these reports suggest there is some other cause of 'error -5', not yet addressed by that bug-fix.
ID: 903648
BMaytum
Volunteer tester
Joined: 3 Apr 99
Posts: 104
Credit: 4,382,041
RAC: 2
United States
Message 903651 - Posted: 4 Jun 2009, 17:06:28 UTC - in response to Message 903639.  

@ Questor/ John:

When I installed v6.6.28 a week or so ago, I made sure I was NOT installing as a service (I don't want it to run as a service, just on-demand for all users) on this single-user WinXP32SP3 32-bit PC. As I noted, I (stoopidly) didn't verify that when I installed v6.6.31 over v6.6.28. I look forward to the results of your replication study.

-Bruce
Sabertooth Z77, i7-3770K@4.2GHz, GTX680, W8.1Pro x64
P5N32-E SLI, C2D E8400@3Ghz, GTX580, Win7SP1Pro x64 & PCLinuxOS2015 x64
ID: 903651
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 66513
Credit: 55,293,173
RAC: 49
United States
Message 903667 - Posted: 4 Jun 2009, 17:56:11 UTC - in response to Message 903648.  

I seem to recall reading on the VLAR front (probably on Lunatics) that by upgrading to the CUDA 2.2 DLLs and the newest Nvidia driver, the VLAR kill was no longer necessary as the tasks ran in reasonable time - which would possibly avoid clearing your cache if you are unlucky and get a skewed number of VLARs. I never went down the kill route but did rebranding to a CPU task instead, and I have much less CUDA processing power than you, so things happen on a smaller scale - but over the last couple of days I do seem to have predominantly got VLAR tasks (probably sweeping up all the VLAR kills :-) )

Perhaps it might be worth trying reverting to the normal opt. apps. to see if that gets you past this hurdle. Hopefully someone else with a better memory can confirm whether this is the case?

No - I tried VLARs with the 2.2 DLL kit and they were just as bad.

The 'nasty bug' you've picked out of the changelog is indeed the most significant change in v6.6.31, and the reason I started this thread. The fix seems successful, in that when a task is pre-empted (EDF or user intervention), the task which replaces it no longer errors immediately on initial startup. But these reports suggest there is some other cause of 'error -5', not yet addressed by that bug-fix.

I'm stayin' away from those evil 2.2 dll files as they just lock up My PC; the older ones work fine. I also like 6.6.20 and I don't like being urged to upgrade to a slower version of BOINC. I use the Nvidia 185.85 XP x64 driver with the V11 cross watch app that's Nvidia and WHQL approved, and that's good enough for Me.
CA HSR built a foundation, is laying Track!
PRR T1 Class 4-4-4-4 #5550 Loco, US's 1st HST

ID: 903667
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 903670 - Posted: 4 Jun 2009, 18:01:19 UTC - in response to Message 903648.  

Bob, you may be thinking of the way they have narrowed the range on the VLARs. I've got a few that would have been -6'd before but are now deemed to be acceptable to run. Raistmer was a bit on the cautious side when he first figured the range.


PROUD MEMBER OF Team Starfire World BOINC
ID: 903670
Bob Mahoney Design
Joined: 4 Apr 04
Posts: 178
Credit: 9,205,632
RAC: 0
United States
Message 903680 - Posted: 4 Jun 2009, 18:33:48 UTC - in response to Message 903642.  

From Richard:
...
Starting with v6.6.23 (I think they missed a couple of release numbers), CUDA tasks should run in the order they're received from the server, unless there's a deadline problem. Then, they would switch to 'earliest deadline' mode, but should also show a flag for running in "High Priority". Are you seeing that? (You may need to extend the width of the 'status' column to be sure). Fred's screenshot doesn't show High Priority, so the shorty tasks may simply have been the next due to run in FIFO order: unfortunately, with a bespoke sort order set, we can't see the tasks in issue order.

[Tip:
If, like Fred, you have a bespoke sort order set, you can clear it by resetting these two registry values:

[HKEY_CURRENT_USER\Software\Space Sciences Laboratory, U.C. Berkeley\BOINC Manager\Tasks]
"SortColumn"=dword:ffffffff
"SortAscending"=dword:00000001

It's the only way I've found.
/Tip]

With a 1.6 day cache, there should never be any need for High Priority running, and hence no EDF. But beware: if you have a full cache, and something goes wrong with your Duration Correction Factor or other time metrics, a 1.6 day cache can suddenly evaluate to seven or more days, and trigger EDF. I've had that happen with a near-VLAR which escaped my rebranding. Look and see if your current BOINC estimates for unstarted tasks seem to be realistic.

If that doesn't throw up any clues, we may be moving into the territory of extended debug logging - see Client configuration. Are any of you up for that? We're probably talking about <coproc_debug>, <cpu_sched> and <cpu_sched_debug>.

I miss the old "Accessible view" option in BOINC - where tasks re-sorted to 'natural' order.

Points:

1. I run only SETI@home, no other projects at this time.
2. Apparent EDF mode (preempting) still has NONE of my tasks saying "High Priority"
3. Preempted tasks usually have a deadline EARLIER THAN the tasks that replace them.
4. "Waiting to run" tasks never get restarted.
5. I don't think my duration correction factor ever got skewed enough (from near-VLAR runtime influence) to force EDF. But I will double check this.
6. I DID have AP running on the CPU on both computers. This might be a factor.

Before I volunteer the big system for sacrificial testing, I'll try running it as CUDA-only, with no AP. This will eliminate some obvious questions in order to purify the test environment. As soon as it sees the first "waiting to run", you can tell me how to torture the computer in any way you like. Then we will at least know it is not related to AP on the same system. I'm ready to retire it and take it to storage, but if it can help out with the problem, let's do it.

Bob
Opinion stated as fact? Who, me?
ID: 903680
Bob Mahoney Design
Joined: 4 Apr 04
Posts: 178
Credit: 9,205,632
RAC: 0
United States
Message 903682 - Posted: 4 Jun 2009, 18:37:44 UTC - in response to Message 903670.  

Bob, you may be thinking of the way they have narrowed the range on the VLARs. I've got a few that would have been -6'd before but are now deemed to be acceptable to run. Raistmer was a bit on the cautious side when he first figured the range.

Good point. I'll watch for a near-VLAR to see if the duration correction factor goes out of whack and forces high priority.

Also, I'll try starting with a cache of .1 day to decrease the odds of a long-running WU skewing durations too far.

Bob
ID: 903682
Questor
Volunteer tester
Joined: 3 Sep 04
Posts: 471
Credit: 230,506,401
RAC: 157
United Kingdom
Message 903694 - Posted: 4 Jun 2009, 19:21:01 UTC - in response to Message 903648.  

I seem to recall reading on the VLAR front (probably on Lunatics) that by upgrading to the CUDA 2.2 DLLs and the newest Nvidia driver, the VLAR kill was no longer necessary as the tasks ran in reasonable time - which would possibly avoid clearing your cache if you are unlucky and get a skewed number of VLARs. I never went down the kill route but did rebranding to a CPU task instead, and I have much less CUDA processing power than you, so things happen on a smaller scale - but over the last couple of days I do seem to have predominantly got VLAR tasks (probably sweeping up all the VLAR kills :-) )

Perhaps it might be worth trying reverting to the normal opt. apps. to see if that gets you past this hurdle. Hopefully someone else with a better memory can confirm whether this is the case?

No - I tried VLARs with the 2.2 DLL kit and they were just as bad.

The 'nasty bug' you've picked out of the changelog is indeed the most significant change in v6.6.31, and the reason I started this thread. The fix seems successful, in that when a task is pre-empted (EDF or user intervention), the task which replaces it no longer errors immediately on initial startup. But these reports suggest there is some other cause of 'error -5', not yet addressed by that bug-fix.


OK - I'll stick with the rebranding for now then. As you say, these new -5 errors are definitely different to the ones I saw caused by the preempting issue.

GPU Users Group



ID: 903694
Questor
Volunteer tester
Joined: 3 Sep 04
Posts: 471
Credit: 230,506,401
RAC: 157
United Kingdom
Message 903696 - Posted: 4 Jun 2009, 19:30:40 UTC - in response to Message 903651.  

@ Questor/ John:

When I installed v6.6.28 a week or so ago, I made sure I was NOT installing as a service (I don't want it to run as a service, just on-demand for all users) on this single-user WinXP32SP3 32-bit PC. As I noted, I (stoopidly) didn't verify that when I installed v6.6.31 over v6.6.28. I look forward to the results of your replication study.

-Bruce



OK - I misread your previous message. I have confirmed just now that the service element is the significant factor.

All my XP machines run as a service and don't exhibit the tasks not stopping problem.

My Vista machine is set to NOT run as a service and does exhibit the problem.

I changed one of my XP machines to not run as a service and that then had exactly the same problem - when exiting BOINC manager with the "Stop apps" option selected in 6.6.31 it fails to stop the tasks every time.

I then reinstalled as a service and it now stops the tasks again every time.

Doesn't help fix the problem, but at least we know when it happens, and with Fred's workaround you can at least avoid it.


John.

GPU Users Group



ID: 903696
Bob Mahoney Design
Joined: 4 Apr 04
Posts: 178
Credit: 9,205,632
RAC: 0
United States
Message 904018 - Posted: 5 Jun 2009, 17:26:05 UTC

Re. the "waiting to run" issue, where tasks say "waiting to run" and never get restarted...

It happens to Fred W when he is running CUDA tasks plus either AP or MB on his CPU. Same for me.

Is everyone who is experiencing this problem running tasks on their CPU as well as on CUDA? If so, this might be a very good clue about the bug.

Bob
ID: 904018
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 904024 - Posted: 5 Jun 2009, 17:40:23 UTC - in response to Message 904018.  

Yes, it's happening to me (accumulating 'Waiting to Run' tasks) on both machines, AP+MB (CPU & CUDA). I can't quite fathom a pattern to it yet, but I haven't looked that hard either.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 904024
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 66513
Credit: 55,293,173
RAC: 49
United States
Message 904044 - Posted: 5 Jun 2009, 18:25:53 UTC - in response to Message 904018.  

Re. the "waiting to run" issue, where tasks say "waiting to run" and never get restarted...

It happens to Fred W when he is running CUDA tasks plus either AP or MB on his CPU. Same for me.

Is everyone who is experiencing this problem running tasks on their CPU as well as on CUDA? If so, this might be a very good clue about the bug.

Bob

Nope, not here - I'm still using BOINC 6.6.20.
CA HSR built a foundation, is laying Track!
PRR T1 Class 4-4-4-4 #5550 Loco, US's 1st HST

ID: 904044
Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 904067 - Posted: 5 Jun 2009, 19:36:21 UTC - in response to Message 904018.  
Last modified: 5 Jun 2009, 19:36:47 UTC

Re. the "waiting to run" issue, where tasks say "waiting to run" and never get restarted...

It happens to Fred W when he is running CUDA tasks plus either AP or MB on his CPU. Same for me.

Is everyone who is experiencing this problem running tasks on their CPU as well as on CUDA? If so, this might be a very good clue about the bug.

Bob

"...never get restarted" is not quite correct. Because we are talking EDF here, these tasks are dragged forward from when they would have crunched in FIFO mode. And EDF was normally kicked off (when not by something that I had done like switching on my 2 CPDN models that I am crunching slowly!) when new tasks were downloaded that had a short deadline - I am running a nominal 3 day cache. So the tasks being pulled in by EDF should have been processed (say) 3 days hence.
The effect I was seeing with my GTX295 (have not seen it yet using v6.6.33 - [crosses fingers]) was that when entering EDF mode, existing CIDA tasks would be suspended and 2 new ones started up. So far, so normal. But when the first of those 2 finished crunching, the second was put into "waiting to run" (usually with only a few seconds left to run) and two more tasks were started. This continued with one task completing normally and the second being put into "waiting to run" mode until Boinc decided that EDF was no longer warranted and returned to processing in FIFO order. So all those "waiting to run" now hang around until they are picked up FIFO; i.e. 3 days hence.
What happens then is also interesting, because crunching restarts from the beginning of the WU but the timer is not reset so they *appear* to take double their allotted time to crunch and then error out because there is no result file to upload. This, of course, doubles the DCF and often kicks the thing into EDF once again.
Since I am confident that these "waiting to run" are going to fail anyway, I have got into the habit of aborting them all once out of EDF mode.

I am sure that I have read that the Devs are on top of this one but I have just riffled through my Alpha digests and can't lay my fingers on the text. I am waiting with interest to see how v6.6.33 handles it.

F.
ID: 904067
Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 904080 - Posted: 5 Jun 2009, 20:10:36 UTC - in response to Message 903642.  


... Fred's screenshot doesn't show High Priority, so the shorty tasks may simply have been the next due to run in FIFO order: unfortunately, with a bespoke sort order set, we can't see the tasks in issue order.

[Tip:
If, like Fred, you have a bespoke sort order set, you can clear it by resetting these two registry values:

[HKEY_CURRENT_USER\Software\Space Sciences Laboratory, U.C. Berkeley\BOINC Manager\Tasks]
"SortColumn"=dword:ffffffff
"SortAscending"=dword:00000001

It's the only way I've found.
/Tip]


2 points on this:

I have not seen "High Priority" in Boinc Manager for weeks, and the column is normally wide enough to see it if it is there.

As to the sort order, I find that using the "Elapsed Time" column as the sort key *does* show FIFO order for all tasks that have not yet started (after a stop/restart of BM, if you have manually played with the sort order during the current session). Obviously it will have to drag a CUDA task forward if it needs one and the next in the queue is an AP or a 603 (or vice versa), but I can readily predict from the BM list which task is going to be the next to crunch. I don't find this an issue at all.

F.

ID: 904080
Basshopper
Joined: 5 Aug 99
Posts: 6
Credit: 20,615,691
RAC: 0
United States
Message 904092 - Posted: 5 Jun 2009, 20:34:21 UTC - in response to Message 904018.  

Re. the "waiting to run" issue, where tasks say "waiting to run" and never get restarted...

It happens to Fred W when he is running CUDA tasks plus either AP or MB on his CPU. Same for me.

Is everyone who is experiencing this problem running tasks on their CPU as well as on CUDA? If so, this might be a very good clue about the bug.

Bob



DITTO Exactly
ID: 904092
Questor
Volunteer tester
Joined: 3 Sep 04
Posts: 471
Credit: 230,506,401
RAC: 157
United Kingdom
Message 904666 - Posted: 7 Jun 2009, 7:00:09 UTC - in response to Message 903640.  


I wondered if you'd had so many VLAR kills it had cleared all your tasks out - making it look like you'd had lots of the preempt problems.

The other -5 I mentioned is the memory problem rather than the pre-empt issue. 1247338546
...
<core_client_version>6.6.31</core_client_version>
<![CDATA[
<message>
- exit code -5 (0xfffffffb)
</message>
<stderr_txt>
...
setiathome_CUDA: Found 2 CUDA device(s):
Work Unit Info:
...............
WU true angle range is : 8.984127
Cuda error 'cudaMemcpy(dev_cx_DataArray, cx_DataArray, NumDataPoints * sizeof(*cx_DataArray), cudaMemcpyHostToDevice)' in file 'd:/BTR/SETI6/SETI_MB_CUDA/client/cuda/cudaAcceleration.cu' in line 262 : unspecified launch failure.
SETI@home error -5 Can't open file
(work_unit.sah) in read_wu_state() errno=2

</stderr_txt>
]]>

Perhaps it is a bad GPU. I assumed it was overloaded with resident "wait to run" tasks; maybe it is actually bad memory. I will remove that card and try some test runs without it.

Bob


I just spotted one of these cudaMemcpyHostToDevice errors on one of my machines, which had bombed out with an error but hadn't been reported.

As a test, I shut down BOINC and edited the client_state.xml file to set everything for this task back to 0 but left it as a CUDA task.

Restarted BOINC and this task started to run. It ran all the way through and completed OK; nothing seemingly out of the ordinary.
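
For anyone curious, "setting everything back to 0" means zeroing the task's progress fields in its <active_task> block in client_state.xml, roughly as below. The task name is invented and the field list is from memory, so treat this as an illustration rather than a recipe - and only ever edit client_state.xml with BOINC fully stopped:

<!-- Illustrative fragment only: task name invented, fields from memory -->
<active_task>
    <result_name>04jn09ab.12345.6789.10.8.99_0</result_name>
    <active_task_state>0</active_task_state>
    <checkpoint_cpu_time>0.000000</checkpoint_cpu_time>
    <current_cpu_time>0.000000</current_cpu_time>
    <fraction_done>0.000000</fraction_done>
</active_task>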

I assume that some of the "Waiting to run" tasks were still locked in GPU memory, which then caused this one to fail to load - or perhaps some other factor, such as memory fragmentation, was going on. I hadn't rebooted the machine, just stopped and started the apps, and everything seems to be continuing to run OK.

The card is a GE9600T with 512MB of memory, so not a huge capacity, and perhaps not fast enough to spawn too many Waiting tasks (37 GFLOPS).

I've got 15 CUDA Waiting tasks at the moment but they don't seem to be doing any harm.

My machines are shut down once per day as I mainly crunch in the evenings (except at weekends) so perhaps memory is flushed normally before the problem builds up.

GPU Users Group



ID: 904666
Terror Australis
Volunteer tester
Joined: 14 Feb 04
Posts: 1817
Credit: 262,693,308
RAC: 44
Australia
Message 905030 - Posted: 7 Jun 2009, 23:26:40 UTC

I'm having this problem as well; it's become a real pain with the recent batch of "shorties".
It's my old P4 running in HT mode, with 2 x CUDA cards, crunching MB on both the CPU (using AK_v8) and the GPU (using the stock app).
When the shorties come in, the running CUDA units are pre-empted; it does not appear to go into EDF mode. If only one unit is put on hold there appears to be no problem: the shortie crunches out, then the bumped unit resumes and finishes OK. However, if two are bumped, one appears to stay in memory and a third 6.08 process starts, suffers a memory error and falls back to the CPU. It doesn't seem specific to either CUDA card.

Restarting BOINC clears the problem: the on-hold unit is removed from memory, the shorties restart on the GPU and finish OK, and the bumped tasks are restarted and appear to finish without further problem.

You can check these units out:
http://setiathome.berkeley.edu/result.php?resultid=1253679589
http://setiathome.berkeley.edu/result.php?resultid=1252829884
http://setiathome.berkeley.edu/result.php?resultid=1251873005

There is also this one, which "minus 9'ed" after the restart, but I don't think it was due to the above problem - this has happened to the odd unit lately anyway:
http://setiathome.berkeley.edu/result.php?resultid=1251992014

Brodo
ID: 905030