Message boards : Number crunching : BOINC v6.6.31 available
Questor · Joined: 3 Sep 04 · Posts: 471 · Credit: 230,506,401 · RAC: 157
Interesting that your tasks actually report an out-of-memory error and fall back to the CPU. An example of mine is 1242853008, with Cuda error 'cudaMemcpy(PowerSpectrumSumMax, dev_PowerSpectrumSumMax, cudaAcc_NumDataPoints / fftlen * sizeof(*dev_PowerSpectrumSumMax), cudaMemcpyDeviceToHost)' in file 'c:/sw/gpgpu/seti/seti_boinc/client/cuda/cudaAcc_summax.cu' in line 160 : unspecified launch failure. I was assuming the failure on the CUDA memory copy was an out-of-memory problem. If I rerun the task after a restart of BOINC, it runs successfully.

GPU Users Group
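As an aside, here is a minimal sketch of the kind of return-code check that produces messages like the one above; the function and buffer names are illustrative, not the actual SETI@home CUDA source:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: copy a device buffer back to the host and distinguish a genuine
// out-of-memory condition from the generic "unspecified launch failure" that
// a broken earlier kernel launch can leave behind. Names are illustrative.
bool copy_power_spectrum(float* host_buf, const float* dev_buf, size_t n) {
    cudaError_t err = cudaMemcpy(host_buf, dev_buf, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err == cudaSuccess) return true;

    if (err == cudaErrorMemoryAllocation) {
        fprintf(stderr, "CUDA out of memory - falling back to CPU path\n");
    } else {
        // e.g. cudaErrorLaunchFailure: a preceding kernel died, so every later
        // runtime call reports "unspecified launch failure" until a reset.
        fprintf(stderr, "Cuda error in cudaMemcpy: %s\n",
                cudaGetErrorString(err));
    }
    return false;
}
```

An "unspecified launch failure" at a memcpy usually means an earlier kernel launch had already failed, which would explain why the same task runs cleanly after BOINC is restarted and the CUDA context is recreated.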
Bob Mahoney Design · Joined: 4 Apr 04 · Posts: 178 · Credit: 9,205,632 · RAC: 0
Waiting to Run issue: I have a system here that created 12 "waiting to run" tasks last night (after which it crashed). I rebooted it, aborted the 12 hung tasks, and it has just created 3 more in the past few minutes. If it is not too tough to do, I'll volunteer it for debugging. (Thanks to everyone who responded so far. No obvious pattern has emerged regarding this problem.)

Bob
Bob Mahoney Design · Joined: 4 Apr 04 · Posts: 178 · Credit: 9,205,632 · RAC: 0
Waiting to Run issue: Hoo hah! It just created another seven! This thing is a "waiting to run" factory!

Bob

(Edit: changed one to two. Oh yeah.) (Edit again: changed two to seven. Wow.) (Last edit: I'm shutting this computer off. Ready for debugging.)
Fred W · Joined: 13 Jun 99 · Posts: 2524 · Credit: 11,954,210 · RAC: 0
Waiting to Run issue: @Bob, if it's anything like mine (running v6.6.28, .31, or .33), then once the GPU is in EDF mode (and it doesn't show "high priority" in BOINC Manager), for every 2 tasks that start, one will be left in "waiting to run" when the next 2 start, until it comes out of EDF mode. I have recently noticed that with v6.6.33, when the GPU is in EDF mode, the completion of a CPU task will put *both* GPU tasks into "waiting to run" and start a new CPU task and 2 new CUDA tasks. The only way to avoid all this seems to be to stay out of EDF mode on the CUDA side, which I do manage to do most of the time with my 3-day cache.

F.
Bob Mahoney Design · Joined: 4 Apr 04 · Posts: 178 · Credit: 9,205,632 · RAC: 0
> If it's anything like mine (running v6.6.28, .31, or .33), then once the GPU is in EDF mode (and it doesn't show "high priority" in BOINC Manager), for every 2 tasks that start, one will be left in "waiting to run" when the next 2 start, until it comes out of EDF mode. I have recently noticed that with v6.6.33, when the GPU is in EDF mode, the completion of a CPU task will put *both* GPU tasks into "waiting to run" and start a new CPU task and 2 new CUDA tasks. The only way to avoid all this seems to be to stay out of EDF mode on the CUDA side, which I do manage to do most of the time with my 3-day cache.

I think that is exactly what happened. It looked like the first "waiting to run" was proper: a later deadline, a single task suspended. As you say, the next time (minutes later) a GPU task (typically a shorty) started, it caused two running GPU tasks to revert to "waiting to run". After that, suspension happened in multiples of two. Keep in mind we are running two GPUs. Let us not forget that Questor is experiencing this with a single GPU, so he must be getting singles of "waiting to run" as it happens.

It does NOT happen when I run ONLY CPU or ONLY GPU. I can beat the heck out of such a setup and it never fails. As soon as I add a task to the other side of the computer, it eventually has this issue.

Fred, is there an imbalance in cache size between your 608 vs. 603+AP queues on your system? I mean, do you have an intentional difference in cache size, achieved by filling up one, then going back to a shorter queue for running? On mine there is a full 10 days of AP for the CPU, and half that duration of MB for CUDA. I'm wondering if BOINC is confused by the cache size contrast?

Note: Don't anyone get too upset about the big AP cache here - a 6.6.31 "waiting to run" crash took most of my hard drive with it last week. I finally got it running long enough to retrieve the 'lost' AP units yesterday.

Bob
Fred W · Joined: 13 Jun 99 · Posts: 2524 · Credit: 11,954,210 · RAC: 0
> Fred, is there an imbalance in cache size between your 608 vs. 603+AP queues on your system? I mean, do you have an intentional difference in cache size, achieved by filling up one, then going back to a shorter queue for running? On mine there is a full 10 days of AP for the CPU, and half that duration of MB for CUDA. I'm wondering if BOINC is confused by the cache size contrast?

I work on the principle that the minimum turn-round time specified for MB tasks is 7 days, so with a 3-day cache I am rarely going to run into EDF. I don't try to micro-manage the cache - I just tell it 3 days (with a 0.1-day connect interval) and leave BOINC to it.

You have an (approx.) 5-day cache for MB. If your cache is filled up when your DCF is at or near its minimum value and then the DCF is doubled by a task taking longer than expected (I have watched this happen), then any 7-day-return task that was toward the end of your queue is in deadline trouble and you are into EDF. Running a 3-day (or smaller) cache gives much more headroom in this respect, and I have not run out of work at any time recently, even with the major network problems we have experienced. I am pretty sure that halving your cache size would virtually eliminate the "waiting to run"s.

F.
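A back-of-the-envelope illustration of Fred's point (a deliberately simplified model - BOINC's real estimates also depend on benchmark figures and resource share - and the numbers are made up):

```cpp
#include <cstdio>

// Simplified model: the estimated completion of the last task in the queue is
// roughly (days of cached work) * DCF after the re-estimate. If that exceeds
// the 7-day MB deadline, the client drops into EDF (earliest deadline first).
int main() {
    const double deadline_days = 7.0;    // typical MB turn-round limit
    const double caches[] = {3.0, 5.0};  // days of work fetched while DCF was 1.0
    const double dcf_after = 2.0;        // DCF doubles after one slow task

    for (double cache : caches) {
        double last_task_days = cache * dcf_after;  // re-estimated queue depth
        printf("%.0f-day cache -> last task now estimated at %.0f days: %s\n",
               cache, last_task_days,
               last_task_days > deadline_days ? "EDF triggered" : "still OK");
    }
    return 0;
}
```

With these figures the 3-day cache survives a DCF doubling with a day to spare, while the 5-day cache blows straight past the 7-day deadline.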
Questor · Joined: 3 Sep 04 · Posts: 471 · Credit: 230,506,401 · RAC: 157
Interestingly, I had been micromanaging some of my caches - why can't I just leave well alone? A while back I had gone through a patch where there weren't many MB tasks coming through, and eventually I ended up with a buffer setting of 10 days. I then ended up with a huge number of AP tasks which I was worried about not completing on time, so I deselected the preference to avoid downloading any more. On a couple of machines my DCF had also gone way out for some reason. I foolishly corrected this but didn't reduce the buffer back down, and ended up with an equally ridiculously large number of MBs. I've since got everything about right but haven't yet reduced the buffer back down to a sensible level - so I am going to do that tonight.

The only tweak I've got at the moment is to have the APv5 preference selected with no MB but "do other tasks if none available" - attempting to chew up a few AP abort/returns when they turn up, to clear them out of the way - so I am really only getting MBs anyway. The other factor is that I have been rebranding VLAR tasks to the CPU to avoid the problems with them, which will have skewed the ratio of CPU/GPU tasks I have. (I haven't rebranded any back the other way to compensate.) It's a wonder the poor thing is working at all really :-)

GPU Users Group
Bob Mahoney Design · Joined: 4 Apr 04 · Posts: 178 · Credit: 9,205,632 · RAC: 0
> Fred, is there an imbalance in cache size between your 608 vs. 603+AP queues on your system? I mean, do you have an intentional difference in cache size, achieved by filling up one, then going back to a shorter queue for running? On mine there is a full 10 days of AP for the CPU, and half that duration of MB for CUDA. I'm wondering if BOINC is confused by the cache size contrast?

That is it! That is exactly what I just observed. I turned the host back on for 5 minutes and observed the following:

1. An AP task completed.
2. This caused the CUDA tasks to nearly triple their "To completion" estimates, from 8 min to 22 min.
3. EDF mode arrived, and CUDA task hijacking began.

Yesterday, after restoring 10+ days of AP and running ONLY AP (no CUDA), my system put the 8 running AP tasks into "Running, high priority" with no ill effect. The problem is only with EDF on CUDA, possibly only while running CUDA in conjunction with AP or 603 on the CPU.

What is troubling is that completion of an AP task on the CPU can so heavily influence the "time to completion" estimate for CUDA tasks. If that is how the current BOINC works, that does not seem appropriate.

Bob
Gundolf Jahn · Joined: 19 Sep 00 · Posts: 3184 · Credit: 446,358 · RAC: 0
> What is troubling is that completion of an AP task on the CPU can so heavily influence the "time to completion" estimate for CUDA tasks. If that is how the current BOINC works, that does not seem appropriate.

It is how BOINC works, and it is not appropriate, but the developers seem to have pushed that fix to a later revision.

Regards, Gundolf
(Computers aren't everything in life. Just a little joke.)
SETI@home classic workunits 3,758 · SETI@home classic CPU time 66,520 hours
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14681 · Credit: 200,643,578 · RAC: 874
1) AP tasks influencing the anticipated running times of CUDA tasks is a design flaw in BOINC which we're going to have to live with until at least BOINC v6.10. The worst effects can be mitigated by careful fine-tuning of the FLOPs figures in app_info.xml: if you get it right (carefully balanced for the performance bias of your particular hardware), you can avoid EDF almost entirely.

2) Except that then, you wouldn't have discovered the misbehaviour of CUDA under stress! From what I'm reading, there are problems:

a) when there are two or more CUDA devices in a single host (which rules me out for testing, sadly);
b) when a combination of cache size/DCF/deadlines brings 'shorties' forward in EDF - which to my mind should be flagged as 'High Priority', but doesn't seem to be.

I think someone - not me, I'm single-GPU only - is going to have to do some logging with <coproc_debug> (see Client configuration), try to work out what's happening, and report the analysis and logs to boinc_alpha.
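For anyone who hasn't tried the FLOPs tuning Richard mentions: the figure lives inside each <app_version> block of app_info.xml. The fragment below is only a sketch - the flops value is a placeholder to be replaced with a figure measured on your own host, and the executable name is illustrative:

```xml
<app_version>
    <app_name>setiathome_enhanced</app_name>
    <version_num>608</version_num>
    <!-- Estimated speed of this app on this host, in FLOPS. Raising the CUDA
         figure relative to the CPU apps shortens the CUDA time estimates and
         helps keep the GPU queue out of EDF. Placeholder value only. -->
    <flops>25000000000</flops>
    <coproc>
        <type>CUDA</type>
        <count>1</count>
    </coproc>
    <file_ref>
        <!-- illustrative file name - use the one from your own app_info.xml -->
        <file_name>setiathome_6.08_windows_intelx86__cuda.exe</file_name>
        <main_program/>
    </file_ref>
</app_version>
```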
Fred W · Joined: 13 Jun 99 · Posts: 2524 · Credit: 11,954,210 · RAC: 0
> What is troubling is that completion of an AP task on the CPU can so heavily influence the "time to completion" estimate for CUDA tasks. If that is how the current BOINC works, that does not seem appropriate.

With only a single DCF controlling both, this interaction is unavoidable. What bothers me is why completion of a CPU WU pre-empts both running (EDF) CUDA tasks and starts 2 more. I guess I'll have to try the logging route and publish the output, since I doubt I will be able to make head or tail of it.

F.
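A rough sketch of why the single DCF couples the two queues. The update rule here is a simplification of BOINC's actual behaviour (it jumps up quickly when a task overruns and drifts down slowly otherwise), and the 8-to-22-minute figures are borrowed from Bob's post above:

```cpp
#include <cstdio>

// Simplified shared-DCF model: one correction factor scales the "to completion"
// estimates of BOTH CPU (AP) and GPU (CUDA) tasks on the host.
struct Host {
    double dcf = 1.0;

    // When any task finishes, compare actual vs. estimated runtime.
    void task_finished(double actual_h, double estimated_h) {
        double ratio = actual_h / estimated_h;
        if (ratio > dcf)
            dcf = ratio;                    // rises quickly on an overrun
        else
            dcf = 0.9 * dcf + 0.1 * ratio;  // drifts down slowly otherwise
    }

    double cuda_estimate_min(double base_min) const { return base_min * dcf; }
};

int main() {
    Host host;
    printf("CUDA estimate before: %.0f min\n", host.cuda_estimate_min(8.0));

    // An AP task on the CPU takes ~2.75x longer than estimated...
    host.task_finished(/*actual_h=*/22.0, /*estimated_h=*/8.0);

    // ...and every CUDA estimate is rescaled by the same shared DCF.
    printf("CUDA estimate after:  %.0f min\n", host.cuda_estimate_min(8.0));
    return 0;
}
```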
Fred W · Joined: 13 Jun 99 · Posts: 2524 · Credit: 11,954,210 · RAC: 0
Now I am *really* confused. I set up the cc_config file with the <cpu_sched>, <cpu_sched_debug>, <coproc_debug> and <sched_op_debug> flags as a starter. Then I forced EDF by changing Connect Interval (local prefs) from 0.1 to 4.0 days, leaving Extra Days at 3.0. Great - 2 CUDAs paused ("Waiting to run") and 2 new ones started, as expected (the 4 CPU cores running MB ploughed on regardless).

BUT: after that, everything worked as it should. When one EDF CUDA WU finished, another started and the partner EDF CUDA WU continued to completion. When a CPU MB WU completed and uploaded, there was no effect on the CUDA WUs in flight. And when I gave up and restored the original settings (including removing the config flags) and allowed the pre-empted CUDA WUs to complete, they did so without error (no -5!!).

The only difference on my host at the moment is that I am running down my cache so that I can release tasks that were lost in the abortive attempt to load v6.6.34 (without checking the Alpha postings first!!), but I am not sure how that would affect this, since I was able to force CUDA EDF mode.

Unless there are any other ideas forthcoming, I will leave the flags (switched OFF) in my cc_config file until I see this happen spontaneously again, and then try to capture the debug info. I will try this exercise again after I have cleared out the garbage and rebuilt my cache. Certainly the last time I experienced the problem was when I re-branded some 400 non-VLAR 603s to push them onto the GPU, in the hope of snagging a few APs for the CPU cores, so I will probably try that method of forcing EDF again when the cache is full.

F.
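For anyone wanting to capture the same output, a cc_config.xml along these lines (in the BOINC data directory, re-read at startup or via the Manager's "Read config file") should enable the flags Fred lists; set a flag to 0 to switch it off while leaving it in place:

```xml
<cc_config>
    <log_flags>
        <!-- task start/pre-empt decisions and the reasoning behind them -->
        <cpu_sched>1</cpu_sched>
        <cpu_sched_debug>1</cpu_sched_debug>
        <!-- coprocessor (CUDA) assignment debugging -->
        <coproc_debug>1</coproc_debug>
        <!-- scheduler RPC (work fetch / report) details -->
        <sched_op_debug>1</sched_op_debug>
    </log_flags>
</cc_config>
```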
Bob Mahoney Design · Joined: 4 Apr 04 · Posts: 178 · Credit: 9,205,632 · RAC: 0
Fred said:

> ... When one EDF CUDA WU finished, another started and the partner EDF CUDA WU continued to completion. When a CPU MB WU completed and uploaded, there was no effect on the CUDA WUs in flight. And when I gave up and restored the original settings (including removing the config flags) and allowed the pre-empted CUDA WUs to complete, they did so without error (no -5!!).

The only difference I can point out with my situation is as follows: whenever the "waiting to run" problem happened, I had a queue of waiting CPU tasks that was way beyond my "Additional work buffer" setting. On one computer I had 5 days of CPU tasks waiting, with "Additional work buffer" set to 1.6 days. The other computer had 10+ days of CPU tasks waiting, with "Additional work buffer" set to 4.5 days.

Perhaps BOINC (or whatever) only gets confused when the CPU task queue is full of extra days of work in comparison to the GPU queue?

Bob
Fred W · Joined: 13 Jun 99 · Posts: 2524 · Credit: 11,954,210 · RAC: 0
> Perhaps BOINC (or whatever) only gets confused when the CPU task queue is full of extra days of work in comparison to the GPU queue?

Hmmm... that doesn't quite gel with my last paragraph. Last time this happened to me, the box was chuntering away quite happily and stably (so I assume there was about 3 days of CPU work and 3 days of GPU work on board) when I decided to tinker. I found that nearly 400 of the CPU tasks (603s) were non-VLAR, so I re-branded them as 608s to run on the GPU. Since there were no VLAR CUDA tasks in the cache at the time, there was no transfer back in the other direction. So the CPU cache was now nearly 400 tasks (probably equating to about 1.25 days of crunching) short of 3 days' worth. Obviously the nearly 400 extra tasks were too much for the GPU queue and tipped it into EDF, but the CPU cache should have been down below 2 days at this point.

Still, we're trying to knit fog until we can get some debug facts, so...

F.
tullio · Joined: 9 Apr 04 · Posts: 8797 · Credit: 2,930,782 · RAC: 1
I've downloaded BOINC 6.6.36 on my Linux box running SuSE Linux 10.3 and it does not work: a library is missing. I had to go back to 6.6.31, which works, although it says it is 6.6.29.

Tullio
Jord · Joined: 9 Jun 99 · Posts: 15184 · Credit: 4,362,181 · RAC: 3
> A library is missing.

Which one? Have you checked if that library is on your system? If it isn't, have you tried to install it?
tullio · Joined: 9 Apr 04 · Posts: 8797 · Credit: 2,930,782 · RAC: 1
> A library is missing.

No, I just reinstalled 6.6.31. But I suspect it is a library used by the manager for graphics purposes, not by the client itself. Maybe I shall try a reinstall next time I shut down BOINC, because CPDN Beta freezes it after each 60-minute period allowed to it. I have 5 other projects running and all function properly.
Jord · Joined: 9 Jun 99 · Posts: 15184 · Credit: 4,362,181 · RAC: 3
Else reinstall 6.6.36, then run `ldd boincmgr` and you'll see what it needs and what is missing. Then install that library file. It might be the new sqlite3 library. As far as I know, it won't come with the installer; it expects those libraries to be on your system already. (Or `ldd boinc` if there's nothing wrong with the manager's libraries.)
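A quick way to see only the unresolved entries; the sample output is illustrative and assumes the missing libpcre that turns up in the next post:

```
$ ldd boincmgr | grep "not found"
        libpcre.so.3 => not found
```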
tullio · Joined: 9 Apr 04 · Posts: 8797 · Credit: 2,930,782 · RAC: 1
The missing library is libpcre.so.3. I installed pcre from the SuSE repository but it gives me libpcre.so.0.0.1, so I shall reinstall 6.6.31.

Tullio

(Edit: I've compiled the latest release of pcre (7.9) and it also gives me libpcre.so.0, linked to libpcre.so.0.0.1.)
Jord · Joined: 9 Jun 99 · Posts: 15184 · Credit: 4,362,181 · RAC: 3
OK, someone else reported the exact same problem, so I reported it to the developers. Why they released these versions as recommended without any testing is really beyond me.