GBT ('guppi') .vlar tasks will be sent to GPUs, what do you think about this?
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
I was finally able to compile Petri's code in Yosemite, but I'm back to getting a number of these "SIGBUS: bus error" crashes. It's a little faster with ToolKit 7.5; it's just strange that I'm getting what appears to be a BOINC error. The task finishes and the results are printed, then it gives an error:
...
setiathome v8 enhanced x41p_zi r3452-64, Cuda 7.50 special
Compiled with NVCC 7.5.
Modifications done by petri33.
Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is : 0.008975
Sigma 372
Sigma > GaussTOffsetStop: 372 > -308
Thread call stack limit is: 1k
cudaAcc_free() called...
cudaAcc_free() running...
cudaAcc_free() PulseFind freed...
cudaAcc_free() Gaussfit freed...
cudaAcc_free() AutoCorrelation freed...
1,2,3,4,5,6,7,8,9,10,10,11,12,cudaAcc_free() DONE. 13
Flopcounter: 31398940702413.851562
Spike count: 1
Autocorr count: 1
Pulse count: 5
Triplet count: 3
Gaussian count: 0
SIGBUS: bus error
Crashed executable name: setiathome_x41p_zi_x86_64-apple-darwin_cuda75
Machine type Intel 80486 (64-bit executable)
System version: Macintosh OS 10.10.5 build 14F1713
Wed May 11 10:59:49 2016
...
I've switched back to BOINC 7.4.36, but I think I've been here and done this before. I've been using <no_priority_change> for years.
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Assuming you changed maxregisters up to 64 as you stated elsewhere, try dialling it back to 32. The long pulsefinds in both baseline and Petri's code are greedy, and will dominate in guppi vlars, potentially tripping up OS failsafes. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Gianfranco Lizzio Joined: 5 May 99 Posts: 39 Credit: 28,049,113 RAC: 87
Quote: "Assuming you changed maxregisters up to 64 as you stated elsewhere, try dialling it back to 32."
I'm using maxrregcount=64 in El Capitan without any problem. Gianfranco I don't want to believe, I want to know!
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
I think part of the compiling problem I had before came from trying it with maxregisters=32; I believe the code is written for maxregisters=64. I don't have this problem using maxregisters=64 with the ToolKit 6.5 app compiled in ML. Also, the problem isn't in the science app itself: the app finishes and the results are printed before the SIGBUS error.
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Quote: "Assuming you changed maxregisters up to 64 as you stated elsewhere, try dialling it back to 32."
Hmmmm, TBar's build/BOINC suspicions remain then :). I have no other ideas.
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Quote: "...Also, it doesn't have a problem with the Science App, the App is Finished, and the results printed before the SIGBUS error."
IOW, it's dying after the boinc_finish() call? Or being killed by the client (if there's any way to tell the difference)?
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
Quote: "...the App is Finished, and the results printed before the SIGBUS error."
All I know is that the last 30 seconds seem to take much longer than 30 seconds ;-) So far, no errors since going back to 7.4.36.
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
One thing that comes to mind is a change at some point in the shared memory arrangement. This may not affect Gianfranco or me, but may (possibly) affect you. In app_info, IIRC, you may need to include an explicit <api_version> tag containing the API version number you built with. That's one item that could depend on both the API and client version you chose (though I didn't have to add the entry on my system).
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874
Quote: "...Using app_info somewhere, iirc, you may need to include an explicit <api_version> tag containing the api version number you built with."
In reality, it doesn't have to be the exact build version: the only tests are for BOINC/API 6.0 (PID instead of heartbeat) and 7.5 (something to do with a Bitcoin Utopia command line). There was also the shared memory segment issue (the old SpyHill bug), which was specific to multi-core, multi-instance Macs, but that would blow up at the start of the run, not at the end. It would be good practice to get into the habit of using <api_version> tags in app_info.xml, just in case somebody starts using one for real and we miss it, but I doubt they're implicated here.
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
Quote: "...Using app_info somewhere, iirc, you may need to include an explicit <api_version> tag containing the api version number you built with."
Hmmm, that doesn't seem to work very well for me. Adding the line <api_version>7.7.0</api_version> results in many trashed tasks:
Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is : 0.006956
Sigma 513
Sigma > GaussTOffsetStop: 513 > -449
plan autocorr R2C batched FFT failed 5
Not enough VRAM for Autocorrelations...
setiathome_CUDA: CUDA runtime ERROR in device memory allocation, attempt 1 of 6
cudaAcc_free() called...
cudaAcc_free() running...
cudaAcc_free() PulseFind freed...
cudaAcc_free() Gaussfit freed...
cudaAcc_free() AutoCorrelation freed...
1,2,3,4,5,6,7,8,9,10,10,11,12,cudaAcc_free() DONE. 13
waiting 5 seconds...
Reinitialising Cuda Device...
Cuda error 'Couldn't get cuda device count ' in file 'cuda/cudaAcceleration.cu' in line 161 : invalid resource handle.
</stderr_txt>
...
Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is : 0.005559
Cuda error 'cudaMalloc((void**) &dev_WorkData' in file 'cuda/cudaAcceleration.cu' in line 439 : out of memory.
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156
Quote: "plan autocorr R2C batched FFT failed 5 Not enough VRAM for Autocorrelations... setiathome_CUDA: CUDA runtime ERROR in device memory allocation, attempt 1 of 6"
It has failed at the start... Sometimes a reboooooot helps. GPU memory can get into a fragmented state: there is plenty of RAM available, but not in one contiguous block. How about running the older build? Among the top hosts there is one running it happily and without any extra inconclusives. (I have trashed my version and need to revert because of too many inconclusives, no errors though.) To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874
Quote: "Hmmm, that doesn't seem to work very well for me. Adding the line <api_version>7.7.0</api_version> results in Many trashed tasks;"
That might mess with the command line passed to your application at startup:

    if (!app_version->api_version_at_least(7, 5)) {
        // Only for apps that declare an API version older than 7.5:
        // rt != 0 means the app version uses a GPU (coprocessor) resource.
        int rt = app_version->gpu_usage.rsc_type;
        if (rt) {
            coproc_cmdline(rt, result, app_version->gpu_usage.usage, cmdline, sizeof(cmdline));
        }
    }

Try knocking it back to 7.3.0 and see if that changes anything. But you should be getting info about the device to use from init_data.xml these days, not from the command line at all.
Brent Norman Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835
@Mark: 2 hours seems a bit long for those cards. I ran both my computers with 3 tasks each on 750 Tis (started at the same time) and ended up with run times of 1:29 to 1:39. A bit strange, since I was getting 1:05 with 2 tasks. My mbcuda.cfg:
processpriority = abovenormal
pfblockspersm = 16
pfperiodsperlaunch = 400
kittyman Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004
@Mark Well, with work keeping me busy, I have not done any optimizing yet. This weekend, I'll have to see if I can find time to make sure I am running the right apps and add some opti parameters. Hopefully that will get the kitties back on the right track. I'll be posting in the team forum for some help and tips. Thanks for the input. Meow. "Time is simply the mechanism that keeps everything from happening all at once."
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
Well, rebooting didn't help. I had to remove the line <api_version>7.7.0</api_version> to get it to stop immediately trashing all the GPU tasks. Changing the line to <api_version>7.3.0</api_version> allows it to start again. The current build uses the code in Jason's folder; it seems to be quite a bit faster than the exact same code compiled in ML with ToolKit 6.5. It would be nice if it weren't for the Crash After Finish. I might try building it again with boinc-master 7.5, since the app compiled in ML with ToolKit 6.5 was using boinc-master 7.5.
Adding <api_version>7.3.0</api_version> didn't help. I still get the Crash After Finish with 1 out of 3 tasks.
Why does the validator give some tasks an Invalid when they are run a second time? It was only reported once. As you can see, the results are the same: http://setiathome.berkeley.edu/workunit.php?wuid=2155978136
The way I see it, the results you report are the ones that should be used, not the ones caught up in a mass trashing and later run and reported.
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156
Quote: "Running Raistmer's SoG on my Titan X's"
Those have angle ranges (AR) of 0.009, 0.01 and 0.009. Other reporters may have ARs of around 0.005 or so. Nice times, though (three under 30).
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156
Just as someone said: let the creditscrew settle now for a week or three. I was looking at the credit granted for host http://setiathome.berkeley.edu/results.php?hostid=7939003&offset=0&show_names=0&state=4&appid= and saw a huge difference in credit granted depending on which kind of host the task validated against.
Zalster Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242
Quote: "Running Raistmer's SoG on my Titan X's"
Are you dividing by 3, or is that the exact time? 3 in 54 minutes for me averages 18 minutes each.
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156
Quote: "Running Raistmer's SoG on my Titan X's"
I clicked the links and saw about 29 min for each of the three. So, compared with running one at a time, that works out to about 10 min each. EDIT: but only if they ran on the same GPU simultaneously.
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156
Quote: "Running Raistmer's SoG on my Titan X's"
Yes, but the times they were uploaded are not the same. They were somewhat interlaced. (They weren't running together the whole time, and there may have been other kinds of tasks running at the same time.)
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.