BOINC v6.6.31 available

Message boards : Number crunching : BOINC v6.6.31 available
Questor
Volunteer tester
Joined: 3 Sep 04
Posts: 471
Credit: 230,506,401
RAC: 157
United Kingdom
Message 905734 - Posted: 10 Jun 2009, 10:54:48 UTC - in response to Message 905030.  

Interesting that your tasks actually report an out-of-memory error and fall back to the CPU.

An example of mine is task 1242853008, which failed with:

Cuda error 'cudaMemcpy(PowerSpectrumSumMax, dev_PowerSpectrumSumMax, cudaAcc_NumDataPoints / fftlen * sizeof(*dev_PowerSpectrumSumMax), cudaMemcpyDeviceToHost)' in file 'c:/sw/gpgpu/seti/seti_boinc/client/cuda/cudaAcc_summax.cu' in line 160 : unspecified launch failure.

I was assuming the failure to copy from CUDA memory was an out-of-memory problem.

If I rerun the task after a restart of BOINC it runs successfully.

GPU Users Group



ID: 905734

Bob Mahoney Design
Joined: 4 Apr 04
Posts: 178
Credit: 9,205,632
RAC: 0
United States
Message 905761 - Posted: 10 Jun 2009, 12:36:53 UTC

Waiting to Run issue:

I have a system here that created 12 "waiting to run" tasks last night (after which it crashed). I rebooted it, aborted the 12 hung tasks, and it just created 3 more in the past few minutes.

If it is not too tough to do, I'll volunteer it for debugging.

(Thanks to everyone who responded so far. No obvious pattern has emerged regarding this problem.)

Bob
ID: 905761

Bob Mahoney Design
Joined: 4 Apr 04
Posts: 178
Credit: 9,205,632
RAC: 0
United States
Message 905763 - Posted: 10 Jun 2009, 12:41:35 UTC - in response to Message 905761.  
Last modified: 10 Jun 2009, 12:52:10 UTC

Waiting to Run issue:

...and it just created 3 more in the past few minutes.

Bob

Hoo Hah! It just created another seven! This thing is a "Waiting to run" factory!

Bob

(Edit changed one to two. Oh yeah.)
(Edit again, changed two to seven. Wow.)
(Last edit: I'm shutting this computer off. Ready for debugging.)
ID: 905763

Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 905769 - Posted: 10 Jun 2009, 13:26:12 UTC - in response to Message 905763.  

Waiting to Run issue:

...and it just created 3 more in the past few minutes.

Bob

Hoo Hah! It just created another seven! This thing is a "Waiting to run" factory!

Bob

(Edit changed one to two. Oh yeah.)
(Edit again, changed two to seven. Wow.)
(Last edit: I'm shutting this computer off. Ready for debugging.)

@Bob,
If it's anything like mine (running v6.6.28, .31, or .33), then once the GPU is in EDF mode (and it doesn't show "high priority" in BM), then for every 2 tasks that start, one will be left in "waiting to run" when the next 2 start, until it comes out of EDF mode. I have recently noticed that with v6.6.33, when the GPU is in EDF mode, the completion of a CPU task will put *both* GPU tasks into "waiting to run" and start a new CPU task and 2 new CUDA tasks. The only way to avoid all this seems to be to stay out of EDF mode on the CUDA side, which I do manage to do most of the time with my 3-day cache.

F.
ID: 905769

Bob Mahoney Design
Joined: 4 Apr 04
Posts: 178
Credit: 9,205,632
RAC: 0
United States
Message 905783 - Posted: 10 Jun 2009, 14:17:06 UTC - in response to Message 905769.  

If it's anything like mine (running v6.6.28, .31, or .33), then once the GPU is in EDF mode (and it doesn't show "high priority" in BM), then for every 2 tasks that start, one will be left in "waiting to run" when the next 2 start, until it comes out of EDF mode. I have recently noticed that with v6.6.33, when the GPU is in EDF mode, the completion of a CPU task will put *both* GPU tasks into "waiting to run" and start a new CPU task and 2 new CUDA tasks. The only way to avoid all this seems to be to stay out of EDF mode on the CUDA side, which I do manage to do most of the time with my 3-day cache.

F.

I think that is exactly what happened. It looked like the first "waiting to run" was proper: a single task with a later deadline was suspended. As you say, the next time (minutes later) a GPU task (typically a shorty) started, it caused two running GPU tasks to revert to "waiting to run". After that, suspension happened in multiples of two. Keep in mind we are running two GPUs.

Let us not forget that Questor is experiencing this with a single GPU. So he must be getting singles of "waiting to run" as it happens.

It does NOT happen when I run ONLY CPU or ONLY GPU. I can beat the heck out of such a setup and it never fails. As soon as I add a task to the other side of the computer, it eventually has this issue.

Fred, is there an imbalance in cache size between your 608 vs. 603+AP queue on your system? I mean, do you have an intentional difference in cache size, achieved by filling up one, then going back to a shorter queue for running? On mine there is a full 10 days of AP for the CPU, and half that duration of MB for CUDA. I'm wondering if BOINC is confused by the cache size contrast?

Note: Don't anyone get too upset about the big AP cache here - a v6.6.31 "waiting to run" crash took most of my hard drive with it last week. I finally got it running long enough to retrieve the 'lost' AP units yesterday.

Bob
ID: 905783

Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 905790 - Posted: 10 Jun 2009, 14:38:40 UTC - in response to Message 905783.  

Fred, is there an imbalance in cache size between your 608 vs. 603+AP queue on your system? I mean, do you have an intentional difference in cache size, achieved by filling up one, then going back to a shorter queue for running? On mine there is a full 10 days of AP for the CPU, and half that duration of MB for CUDA. I'm wondering if BOINC is confused by the cache size contrast?

I work on the principle that the minimum turn-round time specified for MB tasks is 7 days, so with a 3-day cache I am rarely going to run into EDF. I don't try to micro-manage the cache - I just tell it 3 days (with a 0.1-day connect interval) and leave Boinc to it.
You have an (approx.) 5-day cache for MB. If your cache is filled up when your DCF is at or near its minimum value and then the DCF is doubled by a task taking longer than expected (I have watched this happen), then any 7-day-return task that was toward the end of your queue is in deadline trouble and you are into EDF. Running a 3 (or less) day cache gives much more headroom in this respect, and I have not run out of work at any time recently, even with the major network problems we have experienced. I am pretty sure that halving your cache size would virtually eliminate the "waiting to run"s.
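A minimal sketch of the arithmetic Fred describes, with illustrative numbers only (the real BOINC scheduler is far more involved; this shows just how a DCF jump can push a full cache past a 7-day deadline):

```python
# Illustrative sketch: BOINC estimates remaining work as
# raw_estimate * DCF (Duration Correction Factor), and a task gets
# EDF (earliest deadline first) treatment when its estimated finish
# would miss its deadline.

def estimated_days(cache_days, dcf):
    """Days of estimated work in the cache after DCF correction."""
    return cache_days * dcf

def tail_task_in_trouble(cache_days, dcf, deadline_days=7.0):
    """The last task queued finishes after everything ahead of it."""
    return estimated_days(cache_days, dcf) > deadline_days

# A 5-day cache survives DCF = 1.0 but not a doubling to 2.0:
print(tail_task_in_trouble(5.0, 1.0))  # False (5 estimated days < 7)
print(tail_task_in_trouble(5.0, 2.0))  # True (10 estimated days > 7)

# A 3-day cache keeps headroom even after the same DCF doubling:
print(tail_task_in_trouble(3.0, 2.0))  # False (6 estimated days < 7)
```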

F.
ID: 905790

Questor
Volunteer tester
Joined: 3 Sep 04
Posts: 471
Credit: 230,506,401
RAC: 157
United Kingdom
Message 905794 - Posted: 10 Jun 2009, 15:00:24 UTC

Interestingly, I had been micromanaging some of my caches - why can't I just leave well enough alone?

A while back I had gone through a patch where there weren't many MB tasks coming through and eventually I ended up with a buffer setting of 10 days.

I then ended up with a huge number of AP tasks which I was worried about not completing on time, so I deselected the preference to avoid downloading any more.

On a couple of machines my DCF had also gone way out for some reason. I foolishly corrected this but didn't reduce the buffer back down and ended up with an equally ridiculously large number of MBs.

I've since got everything about right but haven't yet reduced the buffer back down to a sensible level - so I am going to do that tonight.

The only tweak I've got at the moment is to have the APv5 preference selected with no MB but "do other tasks if none available" - attempting to chew up a few AP abort/returns when they turn up, to clear them out of the way - so I am really only getting MBs anyway.

The other factor is that I have been rebranding VLAR tasks to the CPU to avoid the problems with them, which will have skewed the ratio of CPU/GPU tasks I have.
(I haven't rebranded any back the other way to compensate).

It's a wonder the poor thing is working at all really :-)

GPU Users Group
ID: 905794

Bob Mahoney Design
Joined: 4 Apr 04
Posts: 178
Credit: 9,205,632
RAC: 0
United States
Message 905800 - Posted: 10 Jun 2009, 15:21:37 UTC - in response to Message 905790.  

Fred, is there an imbalance in cache size between your 608 vs. 603+AP queue on your system? I mean, do you have an intentional difference in cache size, achieved by filling up one, then going back to a shorter queue for running? On mine there is a full 10 days of AP for the CPU, and half that duration of MB for CUDA. I'm wondering if BOINC is confused by the cache size contrast?

I work on the principle that the minimum turn-round time specified for MB tasks is 7 days, so with a 3-day cache I am rarely going to run into EDF. I don't try to micro-manage the cache - I just tell it 3 days (with a 0.1-day connect interval) and leave Boinc to it.
You have an (approx.) 5-day cache for MB. If your cache is filled up when your DCF is at or near its minimum value and then the DCF is doubled by a task taking longer than expected (I have watched this happen), then any 7-day-return task that was toward the end of your queue is in deadline trouble and you are into EDF. Running a 3 (or less) day cache gives much more headroom in this respect, and I have not run out of work at any time recently, even with the major network problems we have experienced. I am pretty sure that halving your cache size would virtually eliminate the "waiting to run"s.

F.

That is it! That is exactly what I just observed. I turned the host back on for 5 minutes and observed the following:

1. An AP task completed.
2. This caused the CUDA tasks to nearly triple their "To completion" estimates, from 8 min to 22 min.
3. EDF mode arrived, and CUDA task hijacking began.

Yesterday, after restoring 10+ days of AP, and running ONLY AP (no CUDA), my system put the 8 running AP tasks into "Running high priority" with no ill effect. The problem occurs only with EDF on CUDA, possibly only while running CUDA in conjunction with AP or 603 on the CPU.

Troubling is the fact that completion of an AP task on the CPU can so heavily influence the "time to completion" estimate for CUDA tasks. If that is how the current BOINC works, it does not seem appropriate.

Bob
ID: 905800

Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 905805 - Posted: 10 Jun 2009, 15:40:33 UTC - in response to Message 905800.  

Troubling is the fact that completion of an AP task on the CPU can so heavily influence the "time to completion" estimate for CUDA tasks. If that is how the current BOINC works, it does not seem appropriate.

It is how BOINC works, and it is not appropriate, but the developers seem to have pushed that fix to a later revision.

Gruß,
Gundolf
Computer sind nicht alles im Leben. (Kleiner Scherz)

SETI@home classic workunits 3,758
SETI@home classic CPU time 66,520 hours
ID: 905805

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14681
Credit: 200,643,578
RAC: 874
United Kingdom
Message 905807 - Posted: 10 Jun 2009, 15:44:04 UTC

1) AP tasks influencing the anticipated running times of CUDA tasks is a design flaw in BOINC which we're going to have to live with until at least BOINC v6.10.

The worst effects can be mitigated by careful fine-tuning of the FLOPs figures in app_info.xml: if you get it right (carefully balanced for the performance bias of your particular hardware), you can avoid EDF almost entirely.
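For reference, the entries Richard means live in the anonymous-platform app_info.xml file; a fragment might look roughly like this (the element names follow the app_info.xml format of that era, but the <flops> value shown is purely illustrative and must be tuned to your own hardware):

```xml
<!-- app_info.xml fragment (illustrative values only) -->
<app_version>
    <app_name>setiathome_enhanced</app_name>
    <version_num>608</version_num>
    <!-- Claimed speed in FLOPS; adjusting this rebalances the
         estimated run times of GPU tasks against CPU tasks. -->
    <flops>25000000000</flops>
    <coproc>
        <type>CUDA</type>
        <count>1</count>
    </coproc>
</app_version>
```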

2) Except that then, you wouldn't have discovered the mis-behaviour of CUDA under stress! From what I'm reading, there are problems:

a) When there are two or more CUDA devices in a single host (which rules me out for testing, sadly)

b) When a combination of cache size/DCF/deadlines brings 'shorties' forward in EDF - which to my mind should be flagged as 'High Priority', but doesn't seem to be.

I think someone - not me, I'm single GPU only - is going to have to do some logging with <coproc_debug> (see Client configuration), try and work out what's happening, and report the analysis and logs to boinc_alpha.
ID: 905807

Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 905835 - Posted: 10 Jun 2009, 16:56:45 UTC - in response to Message 905800.  

Troubling is the fact that completion of an AP task on the CPU can so heavily influence the "time to completion" estimate for CUDA tasks. If that is how the current BOINC works, it does not seem appropriate.

Bob

With only a single DCF controlling both, this interaction is unavoidable. What bothers me is why completion of a CPU WU pre-empts both running (EDF) CUDA tasks and starts 2 more. I guess I'll have to try the logging route and publish the output, since I doubt I will be able to make head or tail of it.
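The single-DCF coupling Fred describes can be sketched as follows (the update rule here is a deliberate simplification, not the client's actual DCF logic; all names and numbers are illustrative):

```python
# Simplified model of a single project-wide Duration Correction Factor
# (DCF). The real BOINC client raises DCF quickly and lowers it slowly;
# the point here is only that ONE factor scales the estimates of BOTH
# CPU and GPU tasks from the same project.

class Project:
    def __init__(self):
        self.dcf = 1.0

    def task_finished(self, estimated_hours, actual_hours):
        # Crude stand-in for the client's update: jump straight to the
        # observed ratio when a task runs longer than estimated.
        ratio = actual_hours / estimated_hours
        if ratio > self.dcf:
            self.dcf = ratio

    def estimate(self, raw_hours):
        return raw_hours * self.dcf

seti = Project()
cuda_raw = 8 / 60  # an 8-minute raw CUDA task estimate, in hours

print(round(seti.estimate(cuda_raw) * 60))  # 8 minutes before the AP task

# An AstroPulse CPU task overruns its estimate by 2.7x ...
seti.task_finished(estimated_hours=10.0, actual_hours=27.0)

# ... and the CUDA estimates nearly triple, much as Bob observed.
print(round(seti.estimate(cuda_raw) * 60))  # ~22 minutes
```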

F.
ID: 905835

Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 905912 - Posted: 10 Jun 2009, 20:01:50 UTC - in response to Message 905807.  

Now I am *really* confused.

I set up the cc_config file with the <cpu_sched>, <cpu_sched_debug>, <coproc_debug> and <sched_op_debug> flags as a starter. Then I forced EDF by changing the Connect Interval (local prefs) from 0.1 to 4.0 days, leaving Extra Days at 3.0.
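For anyone wanting to reproduce this, the flags Fred lists go in cc_config.xml in the BOINC data directory, roughly like this:

```xml
<!-- cc_config.xml with the scheduler/coproc logging flags enabled -->
<cc_config>
    <log_flags>
        <cpu_sched>1</cpu_sched>
        <cpu_sched_debug>1</cpu_sched_debug>
        <coproc_debug>1</coproc_debug>
        <sched_op_debug>1</sched_op_debug>
    </log_flags>
</cc_config>
```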

Great - 2 CUDAs paused ("Waiting to run") and 2 new ones started as expected (the 4 CPU cores running MB ploughed on regardless).

BUT: after that, everything worked as it should... When one EDF CUDA WU finished, another started and the partner EDF CUDA WU continued to completion. When a CPU MB WU completed and uploaded, there was no effect on the CUDA WUs in flight. And when I gave up and restored the original settings (including removing the config flags) and allowed the pre-empted CUDA WUs to complete, they did so without error (no -5!!).

The only difference on my host at the moment is that I am running down my cache so that I can release tasks that were lost in the abortive attempt to load v6.6.34 (without checking the Alpha postings first!!) but I am not sure how that would affect this since I was able to force CUDA EDF mode.

Unless there are any other ideas forthcoming, I will leave the flags (switched OFF) in my cc_config file until I see this happen spontaneously again and then try to capture the debug info. I will try this exercise again after I have cleared out the garbage and rebuilt my cache. Certainly the last time I experienced the problem was when I re-branded some 400 non-VLAR 603's to push them onto the GPU in the hope of snagging a few AP's for the CPU cores so I will probably try that method of forcing EDF again when the cache is full.

F.

ID: 905912

Bob Mahoney Design
Joined: 4 Apr 04
Posts: 178
Credit: 9,205,632
RAC: 0
United States
Message 905949 - Posted: 10 Jun 2009, 21:29:07 UTC - in response to Message 905912.  

Fred said: ... When one EDF CUDA WU finished another started and the partner EDF CUDA WU continued to completion. When a CPU MB WU completed and uploaded, there was no effect on the CUDA WU's in flight. And when I gave up and restored the original settings (including removing the config flags) and allowed the pre-empted CUDA WU's to complete, they did so without error (no -5!!).

The only difference on my host at the moment is that I am running down my cache..

The only difference I can point out with my situation is as follows:

Whenever the "Waiting to run" problem happened, I had a queue of waiting CPU tasks that was way beyond my "Additional work buffer" setting.

On one computer I had 5 days of CPU tasks waiting, "Additional work buffer" set to 1.6 days.

The other computer had 10+ days of CPU tasks waiting, "Additional work buffer" set to 4.5 days.

Perhaps BOINC (or whatever) only gets confused when the CPU task queue holds extra days of work in comparison to the GPU queue?

Bob
ID: 905949

Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 905970 - Posted: 10 Jun 2009, 22:03:28 UTC - in response to Message 905949.  

Perhaps BOINC (or whatever) only gets confused when the CPU task queue holds extra days of work in comparison to the GPU queue?

Bob

Hmmm... Doesn't quite gel with my last paragraph. Last time this happened to me, the box was chuntering away quite happily and stably (so I assume there was about 3 days of CPU work and 3 days of GPU work on board) when I decided to tinker. I found that nearly 400 of the CPU tasks (603's) were non-VLAR, so I re-branded them as 608's to run on the GPU. Since there were no VLAR CUDA tasks in the cache at the time, there was no transfer back in the other direction. So the CPU cache was now nearly 400 tasks (probably equating to about 1.25 days of crunching) short of 3 days' worth. Obviously the nearly 400 extra tasks were too much for the GPU queue and tipped it into EDF, but the CPU cache should have been down below 2 days at this point.

Still, we're trying to knit fog until we can get some debug facts so...

F.
ID: 905970

tullio
Volunteer tester
Joined: 9 Apr 04
Posts: 8797
Credit: 2,930,782
RAC: 1
Italy
Message 906093 - Posted: 11 Jun 2009, 5:45:53 UTC

I've downloaded BOINC 6.6.36 on my Linux box running SuSE Linux 10.3 and it does not work. A library is missing. I had to go back to 6.6.31, which works, although it says it is 6.6.29.
Tullio
ID: 906093

Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 906127 - Posted: 11 Jun 2009, 10:45:47 UTC - in response to Message 906093.  

A library is missing.

Which one? Have you checked if that library is on your system? If it isn't, have you tried to install it?
ID: 906127

tullio
Volunteer tester
Joined: 9 Apr 04
Posts: 8797
Credit: 2,930,782
RAC: 1
Italy
Message 906129 - Posted: 11 Jun 2009, 10:59:15 UTC - in response to Message 906127.  

A library is missing.

Which one? Have you checked if that library is on your system? If it isn't, have you tried to install it?

No, I just reinstalled 6.6.31. But I suspect it is a library used by the manager for graphics purposes, not by the client itself. Maybe I shall try a reinstall next time I shut down BOINC, because CPDN Beta freezes it after each 60-minute period allowed to it. I have 5 other projects running and all function properly.

ID: 906129

Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 906140 - Posted: 11 Jun 2009, 11:46:43 UTC - in response to Message 906129.  
Last modified: 11 Jun 2009, 11:48:44 UTC

Else reinstall 6.6.36, then run:

ldd boincmgr

and you'll see what it needs and what is missing. Then install that library file. It might be the new sqlite3 library. As far as I know, it won't come with the installer; it expects those libraries to be on your system already.

(else ldd boinc if there's nothing wrong with the manager's libraries)
ID: 906140

tullio
Volunteer tester
Joined: 9 Apr 04
Posts: 8797
Credit: 2,930,782
RAC: 1
Italy
Message 906151 - Posted: 11 Jun 2009, 12:32:14 UTC
Last modified: 11 Jun 2009, 13:10:46 UTC

The missing library is libpcre.so.3. I installed pcre from the SuSE repository but it gives me libpcre.so.0.0.1, so I shall reinstall 6.6.31.
Tullio
I've compiled the latest release of pcre (7.9) and it also gives me libpcre.so.0, linked to libpcre.so.0.0.1.
ID: 906151

Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 906169 - Posted: 11 Jun 2009, 13:39:18 UTC - in response to Message 906151.  

OK, someone else reported the exact same problem, so I reported it to the developers. Why they released these versions as recommended, without any testing, is really beyond me.
ID: 906169