Message boards : Number crunching : BOINC v6.6.31 available
Questor · Joined: 3 Sep 04 · Posts: 471 · Credit: 230,506,401 · RAC: 157
Interesting that your tasks actually report an out-of-memory error and fall back to the CPU. An example of mine is 1242853008, with Cuda error 'cudaMemcpy(PowerSpectrumSumMax, dev_PowerSpectrumSumMax, cudaAcc_NumDataPoints / fftlen * sizeof(*dev_PowerSpectrumSumMax), cudaMemcpyDeviceToHost)' in file 'c:/sw/gpgpu/seti/seti_boinc/client/cuda/cudaAcc_summax.cu' in line 160 : unspecified launch failure. I was assuming the failure on the CUDA memory copy was an out-of-memory problem. If I rerun the task after a restart of BOINC, it runs successfully.

GPU Users Group
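As an aside, here is a minimal sketch of the kind of return-code check that produces messages like the one above; the function and buffer names are illustrative, not the actual SETI@home CUDA source:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: copy a device buffer back to the host and distinguish a genuine
// out-of-memory condition from the generic "unspecified launch failure" that
// a broken earlier kernel launch can leave behind. Names are illustrative.
bool copy_power_spectrum(float* host_buf, const float* dev_buf, size_t n) {
    cudaError_t err = cudaMemcpy(host_buf, dev_buf, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err == cudaSuccess) return true;

    if (err == cudaErrorMemoryAllocation) {
        fprintf(stderr, "CUDA out of memory - falling back to CPU path\n");
    } else {
        // e.g. cudaErrorLaunchFailure: a preceding kernel died, so every later
        // runtime call reports "unspecified launch failure" until a reset.
        fprintf(stderr, "Cuda error in cudaMemcpy: %s\n",
                cudaGetErrorString(err));
    }
    return false;
}
```

An "unspecified launch failure" at a memcpy usually means an earlier kernel launch had already failed, which would explain why the same task runs cleanly after BOINC is restarted and the CUDA context is recreated.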
Bob Mahoney Design · Joined: 4 Apr 04 · Posts: 178 · Credit: 9,205,632 · RAC: 0
Waiting to Run issue: I have a system here that created 12 "waiting to run" tasks last night (after which it crashed). I rebooted it, aborted the 12 hung tasks, and it has just created 3 more in the past few minutes. If it is not too tough to do, I'll volunteer it for debugging. (Thanks to everyone who responded so far. No obvious pattern has emerged regarding this problem.)

Bob
Bob Mahoney Design · Joined: 4 Apr 04 · Posts: 178 · Credit: 9,205,632 · RAC: 0
Waiting to Run issue: Hoo hah! It just created another seven! This thing is a "waiting to run" factory!

Bob

(Edit: changed one to two. Oh yeah.) (Edit again: changed two to seven. Wow.) (Last edit: I'm shutting this computer off. Ready for debugging.)
Fred W · Joined: 13 Jun 99 · Posts: 2524 · Credit: 11,954,210 · RAC: 0
Waiting to Run issue: @Bob, if it's anything like mine (running v6.6.28, .31, or .33), then once the GPU is in EDF mode (and it doesn't show "high priority" in BOINC Manager), for every 2 tasks that start, one will be left in "waiting to run" when the next 2 start, until it comes out of EDF mode. I have recently noticed that with v6.6.33, when the GPU is in EDF mode, the completion of a CPU task will put *both* GPU tasks into "waiting to run" and start a new CPU task and 2 new CUDA tasks. The only way to avoid all this seems to be to stay out of EDF mode on the CUDA side, which I do manage to do most of the time with my 3-day cache.

F.
Bob Mahoney Design · Joined: 4 Apr 04 · Posts: 178 · Credit: 9,205,632 · RAC: 0
> If it's anything like mine (running v6.6.28, .31, or .33), then once the GPU is in EDF mode (and it doesn't show "high priority" in BOINC Manager), for every 2 tasks that start, one will be left in "waiting to run" when the next 2 start, until it comes out of EDF mode. I have recently noticed that with v6.6.33, when the GPU is in EDF mode, the completion of a CPU task will put *both* GPU tasks into "waiting to run" and start a new CPU task and 2 new CUDA tasks. The only way to avoid all this seems to be to stay out of EDF mode on the CUDA side, which I do manage to do most of the time with my 3-day cache.

I think that is exactly what happened. It looked like the first "waiting to run" was proper: a later deadline, a single task suspended. As you say, the next time (minutes later) a GPU task (typically a shorty) started, it caused two running GPU tasks to revert to "waiting to run". After that, suspension happened in multiples of two. Keep in mind we are running two GPUs. Let us not forget that Questor is experiencing this with a single GPU, so he must be getting singles of "waiting to run" as it happens.

It does NOT happen when I run ONLY CPU or ONLY GPU. I can beat the heck out of such a setup and it never fails. As soon as I add a task to the other side of the computer, it eventually has this issue.

Fred, is there an imbalance in cache size between your 608 vs. 603+AP queues on your system? I mean, do you have an intentional difference in cache size, achieved by filling up one, then going back to a shorter queue for running? On mine there is a full 10 days of AP for the CPU, and half that duration of MB for CUDA. I'm wondering if BOINC is confused by the cache size contrast?

Note: Don't anyone get too upset about the big AP cache here - a 6.6.31 "waiting to run" crash took most of my hard drive with it last week. I finally got it running long enough to retrieve the 'lost' AP units yesterday.

Bob
Fred W · Joined: 13 Jun 99 · Posts: 2524 · Credit: 11,954,210 · RAC: 0
> Fred, is there an imbalance in cache size between your 608 vs. 603+AP queues on your system? I mean, do you have an intentional difference in cache size, achieved by filling up one, then going back to a shorter queue for running? On mine there is a full 10 days of AP for the CPU, and half that duration of MB for CUDA. I'm wondering if BOINC is confused by the cache size contrast?

I work on the principle that the minimum turn-round time specified for MB tasks is 7 days, so with a 3-day cache I am rarely going to run into EDF. I don't try to micro-manage the cache - I just tell it 3 days (with a 0.1-day connect interval) and leave BOINC to it.

You have an (approx.) 5-day cache for MB. If your cache is filled up when your DCF is at or near its minimum value and then the DCF is doubled by a task taking longer than expected (I have watched this happen), then any 7-day-return task that was toward the end of your queue is in deadline trouble and you are into EDF. Running a 3-day (or smaller) cache gives much more headroom in this respect, and I have not run out of work at any time recently, even with the major network problems we have experienced. I am pretty sure that halving your cache size would virtually eliminate the "waiting to run"s.

F.
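A back-of-the-envelope illustration of Fred's point (a deliberately simplified model - BOINC's real estimates also depend on benchmark figures and resource share - and the numbers are made up):

```cpp
#include <cstdio>

// Simplified model: the estimated completion of the last task in the queue is
// roughly (days of cached work) * DCF after the re-estimate. If that exceeds
// the 7-day MB deadline, the client drops into EDF (earliest deadline first).
int main() {
    const double deadline_days = 7.0;    // typical MB turn-round limit
    const double caches[] = {3.0, 5.0};  // days of work fetched while DCF was 1.0
    const double dcf_after = 2.0;        // DCF doubles after one slow task

    for (double cache : caches) {
        double last_task_days = cache * dcf_after;  // re-estimated queue depth
        printf("%.0f-day cache -> last task now estimated at %.0f days: %s\n",
               cache, last_task_days,
               last_task_days > deadline_days ? "EDF triggered" : "still OK");
    }
    return 0;
}
```

With these figures the 3-day cache survives a DCF doubling with a day to spare, while the 5-day cache blows straight past the 7-day deadline.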
Questor · Joined: 3 Sep 04 · Posts: 471 · Credit: 230,506,401 · RAC: 157
Interestingly, I had been micromanaging some of my caches - why can't I just leave well alone? A while back I had gone through a patch where there weren't many MB tasks coming through, and eventually I ended up with a buffer setting of 10 days. I then ended up with a huge number of AP tasks which I was worried about not completing on time, so I deselected the preference to avoid downloading any more. On a couple of machines my DCF had also gone way out for some reason. I foolishly corrected this but didn't reduce the buffer back down, and ended up with an equally ridiculously large number of MBs. I've since got everything about right but haven't yet reduced the buffer back down to a sensible level - so I am going to do that tonight.

The only tweak I've got at the moment is to have the APv5 preference selected with no MB but "do other tasks if none available" - attempting to chew up a few AP abort/returns when they turn up, to clear them out of the way - so I am really only getting MBs anyway. The other factor is that I have been rebranding VLAR tasks to the CPU to avoid the problems with them, which will have skewed the ratio of CPU/GPU tasks I have. (I haven't rebranded any back the other way to compensate.) It's a wonder the poor thing is working at all really :-)

GPU Users Group
Bob Mahoney Design · Joined: 4 Apr 04 · Posts: 178 · Credit: 9,205,632 · RAC: 0
> Fred, is there an imbalance in cache size between your 608 vs. 603+AP queues on your system? I mean, do you have an intentional difference in cache size, achieved by filling up one, then going back to a shorter queue for running? On mine there is a full 10 days of AP for the CPU, and half that duration of MB for CUDA. I'm wondering if BOINC is confused by the cache size contrast?

That is it! That is exactly what I just observed. I turned the host back on for 5 minutes and observed the following:

1. An AP task completed.
2. This caused the CUDA tasks to nearly triple their "To completion" estimates, from 8 min to 22 min.
3. EDF mode arrived, and CUDA task hijacking began.

Yesterday, after restoring 10+ days of AP and running ONLY AP (no CUDA), my system put the 8 running AP tasks into "Running, high priority" with no ill effect. The problem is only with EDF on CUDA, possibly only while running CUDA in conjunction with AP or 603 on the CPU.

What is troubling is that completion of an AP task on the CPU can so heavily influence the "time to completion" estimate for CUDA tasks. If that is how the current BOINC works, that does not seem appropriate.

Bob
Gundolf Jahn · Joined: 19 Sep 00 · Posts: 3184 · Credit: 446,358 · RAC: 0
> What is troubling is that completion of an AP task on the CPU can so heavily influence the "time to completion" estimate for CUDA tasks. If that is how the current BOINC works, that does not seem appropriate.

It is how BOINC works, and it is not appropriate, but the developers seem to have pushed that fix to a later revision.

Regards, Gundolf
(Computers aren't everything in life. Just a little joke.)
SETI@home classic workunits 3,758 · SETI@home classic CPU time 66,520 hours
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14681 · Credit: 200,643,578 · RAC: 874
1) AP tasks influencing the anticipated running times of CUDA tasks is a design flaw in BOINC which we're going to have to live with until at least BOINC v6.10. The worst effects can be mitigated by careful fine-tuning of the FLOPs figures in app_info.xml: if you get it right (carefully balanced for the performance bias of your particular hardware), you can avoid EDF almost entirely.

2) Except that then, you wouldn't have discovered the misbehaviour of CUDA under stress! From what I'm reading, there are problems:

a) when there are two or more CUDA devices in a single host (which rules me out for testing, sadly);
b) when a combination of cache size/DCF/deadlines brings 'shorties' forward in EDF - which to my mind should be flagged as 'High Priority', but doesn't seem to be.

I think someone - not me, I'm single-GPU only - is going to have to do some logging with <coproc_debug> (see Client configuration), try to work out what's happening, and report the analysis and logs to boinc_alpha.
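For anyone who hasn't tried the FLOPs tuning Richard mentions: the figure lives inside each <app_version> block of app_info.xml. The fragment below is only a sketch - the flops value is a placeholder to be replaced with a figure measured on your own host, and the executable name is illustrative:

```xml
<app_version>
    <app_name>setiathome_enhanced</app_name>
    <version_num>608</version_num>
    <!-- Estimated speed of this app on this host, in FLOPS. Raising the CUDA
         figure relative to the CPU apps shortens the CUDA time estimates and
         helps keep the GPU queue out of EDF. Placeholder value only. -->
    <flops>25000000000</flops>
    <coproc>
        <type>CUDA</type>
        <count>1</count>
    </coproc>
    <file_ref>
        <!-- illustrative file name - use the one from your own app_info.xml -->
        <file_name>setiathome_6.08_windows_intelx86__cuda.exe</file_name>
        <main_program/>
    </file_ref>
</app_version>
```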
Fred W · Joined: 13 Jun 99 · Posts: 2524 · Credit: 11,954,210 · RAC: 0
> What is troubling is that completion of an AP task on the CPU can so heavily influence the "time to completion" estimate for CUDA tasks. If that is how the current BOINC works, that does not seem appropriate.

With only a single DCF controlling both, this interaction is unavoidable. What bothers me is why completion of a CPU WU pre-empts both running (EDF) CUDA tasks and starts 2 more. I guess I'll have to try the logging route and publish the output, since I doubt I will be able to make head or tail of it.

F.
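A rough sketch of why the single DCF couples the two queues. The update rule here is a simplification of BOINC's actual behaviour (it jumps up quickly when a task overruns and drifts down slowly otherwise), and the 8-to-22-minute figures are borrowed from Bob's post above:

```cpp
#include <cstdio>

// Simplified shared-DCF model: one correction factor scales the "to completion"
// estimates of BOTH CPU (AP) and GPU (CUDA) tasks on the host.
struct Host {
    double dcf = 1.0;

    // When any task finishes, compare actual vs. estimated runtime.
    void task_finished(double actual_h, double estimated_h) {
        double ratio = actual_h / estimated_h;
        if (ratio > dcf)
            dcf = ratio;                    // rises quickly on an overrun
        else
            dcf = 0.9 * dcf + 0.1 * ratio;  // drifts down slowly otherwise
    }

    double cuda_estimate_min(double base_min) const { return base_min * dcf; }
};

int main() {
    Host host;
    printf("CUDA estimate before: %.0f min\n", host.cuda_estimate_min(8.0));

    // An AP task on the CPU takes ~2.75x longer than estimated...
    host.task_finished(/*actual_h=*/22.0, /*estimated_h=*/8.0);

    // ...and every CUDA estimate is rescaled by the same shared DCF.
    printf("CUDA estimate after:  %.0f min\n", host.cuda_estimate_min(8.0));
    return 0;
}
```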
Fred W · Joined: 13 Jun 99 · Posts: 2524 · Credit: 11,954,210 · RAC: 0
Now I am *really* confused. I set up the cc_config file with the <cpu_sched>, <cpu_sched_debug>, <coproc_debug> and <sched_op_debug> flags as a starter. Then I forced EDF by changing Connect Interval (local prefs) from 0.1 to 4.0 days, leaving Extra Days at 3.0. Great - 2 CUDAs paused ("Waiting to run") and 2 new ones started, as expected (the 4 CPU cores running MB ploughed on regardless).

BUT: after that, everything worked as it should. When one EDF CUDA WU finished, another started and the partner EDF CUDA WU continued to completion. When a CPU MB WU completed and uploaded, there was no effect on the CUDA WUs in flight. And when I gave up and restored the original settings (including removing the config flags) and allowed the pre-empted CUDA WUs to complete, they did so without error (no -5!!).

The only difference on my host at the moment is that I am running down my cache so that I can release tasks that were lost in the abortive attempt to load v6.6.34 (without checking the Alpha postings first!!), but I am not sure how that would affect this, since I was able to force CUDA EDF mode.

Unless there are any other ideas forthcoming, I will leave the flags (switched OFF) in my cc_config file until I see this happen spontaneously again, and then try to capture the debug info. I will try this exercise again after I have cleared out the garbage and rebuilt my cache. Certainly the last time I experienced the problem was when I re-branded some 400 non-VLAR 603s to push them onto the GPU, in the hope of snagging a few APs for the CPU cores, so I will probably try that method of forcing EDF again when the cache is full.

F.
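For anyone wanting to capture the same output, a cc_config.xml along these lines (in the BOINC data directory, re-read at startup or via the Manager's "Read config file") should enable the flags Fred lists; set a flag to 0 to switch it off while leaving it in place:

```xml
<cc_config>
    <log_flags>
        <!-- task start/pre-empt decisions and the reasoning behind them -->
        <cpu_sched>1</cpu_sched>
        <cpu_sched_debug>1</cpu_sched_debug>
        <!-- coprocessor (CUDA) assignment debugging -->
        <coproc_debug>1</coproc_debug>
        <!-- scheduler RPC (work fetch / report) details -->
        <sched_op_debug>1</sched_op_debug>
    </log_flags>
</cc_config>
```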
Bob Mahoney Design · Joined: 4 Apr 04 · Posts: 178 · Credit: 9,205,632 · RAC: 0
Fred said:

> ... When one EDF CUDA WU finished, another started and the partner EDF CUDA WU continued to completion. When a CPU MB WU completed and uploaded, there was no effect on the CUDA WUs in flight. And when I gave up and restored the original settings (including removing the config flags) and allowed the pre-empted CUDA WUs to complete, they did so without error (no -5!!).

The only difference I can point out with my situation is as follows: whenever the "waiting to run" problem happened, I had a queue of waiting CPU tasks that was way beyond my "Additional work buffer" setting. On one computer I had 5 days of CPU tasks waiting, with "Additional work buffer" set to 1.6 days. The other computer had 10+ days of CPU tasks waiting, with "Additional work buffer" set to 4.5 days.

Perhaps BOINC (or whatever) only gets confused when the CPU task queue is full of extra days of work in comparison to the GPU queue?

Bob
Fred W · Joined: 13 Jun 99 · Posts: 2524 · Credit: 11,954,210 · RAC: 0
> Perhaps BOINC (or whatever) only gets confused when the CPU task queue is full of extra days of work in comparison to the GPU queue?

Hmmm... that doesn't quite gel with my last paragraph. Last time this happened to me, the box was chuntering away quite happily and stably (so I assume there was about 3 days of CPU work and 3 days of GPU work on board) when I decided to tinker. I found that nearly 400 of the CPU tasks (603s) were non-VLAR, so I re-branded them as 608s to run on the GPU. Since there were no VLAR CUDA tasks in the cache at the time, there was no transfer back in the other direction. So the CPU cache was now nearly 400 tasks (probably equating to about 1.25 days of crunching) short of 3 days' worth. Obviously the nearly 400 extra tasks were too much for the GPU queue and tipped it into EDF, but the CPU cache should have been down below 2 days at this point.

Still, we're trying to knit fog until we can get some debug facts, so...

F.
tullio · Joined: 9 Apr 04 · Posts: 8797 · Credit: 2,930,782 · RAC: 1
I've downloaded BOINC 6.6.36 on my Linux box running SuSE Linux 10.3 and it does not work: a library is missing. I had to go back to 6.6.31, which works, although it says it is 6.6.29.

Tullio
Jord · Joined: 9 Jun 99 · Posts: 15184 · Credit: 4,362,181 · RAC: 3
> A library is missing.

Which one? Have you checked if that library is on your system? If it isn't, have you tried to install it?
tullio · Joined: 9 Apr 04 · Posts: 8797 · Credit: 2,930,782 · RAC: 1
> A library is missing.

No, I just reinstalled 6.6.31. But I suspect it is a library used by the manager for graphics purposes, not by the client itself. Maybe I shall try a reinstall next time I shut down BOINC, because CPDN Beta freezes it after each 60-minute period allowed to it. I have 5 other projects running and all function properly.
Jord · Joined: 9 Jun 99 · Posts: 15184 · Credit: 4,362,181 · RAC: 3
Else reinstall 6.6.36, then run `ldd boincmgr` and you'll see what it needs and what is missing. Then install that library file. It might be the new sqlite3 library. As far as I know, it won't come with the installer; it expects those libraries to be on your system already. (Or `ldd boinc` if there's nothing wrong with the manager's libraries.)
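A quick way to see only the unresolved entries; the sample output is illustrative and assumes the missing libpcre that turns up in the next post:

```
$ ldd boincmgr | grep "not found"
        libpcre.so.3 => not found
```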
tullio · Joined: 9 Apr 04 · Posts: 8797 · Credit: 2,930,782 · RAC: 1
The missing library is libpcre.so.3. I installed pcre from the SuSE repository but it gives me libpcre.so.0.0.1, so I shall reinstall 6.6.31.

Tullio

(Edit: I've compiled the latest release of pcre (7.9) and it also gives me libpcre.so.0, linked to libpcre.so.0.0.1.)
Jord · Joined: 9 Jun 99 · Posts: 15184 · Credit: 4,362,181 · RAC: 3
OK, someone else reported the exact same problem, so I reported it to the developers. Why they released these versions as recommended without any testing is really beyond me.