Modified SETI MB CUDA + opt AP package for full GPU utilization

Message boards : Number crunching : Modified SETI MB CUDA + opt AP package for full GPU utilization

Morten Ross
Volunteer tester
Joined: 30 Apr 01
Posts: 183
Credit: 385,664,915
RAC: 0
Norway
Message 856887 - Posted: 23 Jan 2009, 18:43:52 UTC - in response to Message 856826.  

Hi,

I have downgraded from 181.22 to 181.20 (from file version 7.15.11.8122 to 7.15.11.8120), but there is no change: boincmgr must still be restarted before the hanging WU gets processed...

The WU hangs always occur at the very beginning of processing: after about 17 seconds of CPU time (not elapsed time), and between 0.000% and 0.500% progress.

Morten


Monitoring this configuration I see that MB_6.08_mod_VLAR_kill_CUDA.exe has 100% CPU utilization, so this module seems to be the culprit now.

Morten

Morten Ross
ID: 856887
Morten Ross
Volunteer tester
Joined: 30 Apr 01
Posts: 183
Credit: 385,664,915
RAC: 0
Norway
Message 856902 - Posted: 23 Jan 2009, 19:38:37 UTC - in response to Message 856877.  
Last modified: 23 Jan 2009, 19:40:11 UTC

Well, it seems your host is a good candidate for Maik's script. It handles such hang situations AFAIK. Look for it here or on the Lunatics site.


As I understand it, Maik's script is for terminating the app when a WU is CPU-idle. In my scenario the WU is not progressing while the app is using 100% CPU.

The workaround is to close boincmgr and restart it; the same WU is then processed properly by the app, so no WU has to be terminated.

I have tested Maik's script in this scenario and nothing happens, as the script's logic does not detect this erroneous situation.

I think this one needs further investigation in your code.
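The difference between the two failure modes can be sketched in Python. This is a hypothetical illustration, assuming a watchdog that samples a task's CPU usage and fraction done over a fixed window; the names and thresholds are invented here and are not taken from Maik's script. A watchdog that only reacts to IDLE_STALL will never fire on the BUSY_STALL case described above.

```python
from enum import Enum


class TaskState(Enum):
    PROGRESSING = "progressing"
    IDLE_STALL = "idle_stall"    # no progress and almost no CPU use
    BUSY_STALL = "busy_stall"    # no progress while burning ~100% of a core
    UNDECIDED = "undecided"      # no progress, CPU use somewhere in between


def classify_window(cpu_fraction: float, progress_delta: float,
                    idle_cpu_max: float = 0.05,
                    busy_cpu_min: float = 0.90) -> TaskState:
    """Classify one sampling window of a running task (hypothetical sketch).

    cpu_fraction   -- CPU seconds used in the window / wall-clock seconds
                      (1.0 means one core fully busy)
    progress_delta -- change in fraction done over the window
    The thresholds are placeholders, not values from any real watchdog.
    """
    if progress_delta > 0:
        return TaskState.PROGRESSING
    if cpu_fraction <= idle_cpu_max:
        return TaskState.IDLE_STALL
    if cpu_fraction >= busy_cpu_min:
        return TaskState.BUSY_STALL
    return TaskState.UNDECIDED


def idle_only_watchdog_fires(state: TaskState) -> bool:
    """An idle-only check, as the script is understood in this post."""
    return state is TaskState.IDLE_STALL


# The hang reported here: the app is at 100% CPU but fraction done is stuck.
hang = classify_window(cpu_fraction=1.0, progress_delta=0.0)
print(hang, idle_only_watchdog_fires(hang))
```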

Morten
Morten Ross
ID: 856902
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 856912 - Posted: 23 Jan 2009, 20:03:07 UTC - in response to Message 856884.  

Have you tried Boinc 6.6.2 yet?

No. I've read the reports and it seems too unstable. I don't need it right now; maybe I'll try it later.


I am wondering whether the present VLAR kill feature in your V7 application is actually necessary, because when I was running BOINC 6.4.5 earlier with the stock Berkeley 6.08-CUDA app and nVidia 180.60 drivers, stock 6.08-CUDA successfully crunched MB WUs with a very low angle range (VLAR).

There are two V7 versions available: one with VLAR processing (slow, but without errors, I hope) and another with VLAR autokill enabled.
So the answer is yes, it's actually needed, because it raises overall performance. And no, you aren't required to use the VLAR kill version; you can use V7 without the autokill mod. It behaves like stock 6.08 but does nice things with priorities and reports memory availability. Maybe something more in future versions.
Everyone can choose what he or she likes more ;)
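The choice between the two builds boils down to a single branch. Below is a minimal Python sketch, assuming the build knows the WU's angle range before crunching starts. The function name is hypothetical, the VLAR threshold is left as a parameter because its real value is not given in this thread, and -6 is the exit code reported for killed VLARs in the next question.

```python
from typing import Optional

# Exit code that killed VLAR WUs are reported with in this thread.
VLAR_KILL_EXIT_CODE = -6


def vlar_exit_code(angle_range: float, vlar_threshold: float,
                   autokill_build: bool) -> Optional[int]:
    """Return the exit code to quit with, or None to crunch the WU (sketch).

    angle_range    -- the WU's angle range, read before processing starts
    vlar_threshold -- hypothetical cut-off; the real builds' value is not shown here
    autokill_build -- True for the "VLAR autokill" V7 build, False for plain V7
    """
    is_vlar = angle_range < vlar_threshold
    if is_vlar and autokill_build:
        return VLAR_KILL_EXIT_CODE  # give up at once instead of a slow GPU run
    return None  # plain V7, or not a VLAR: process it like stock 6.08
```

With plain V7 the function always returns None, which is why that build behaves like stock 6.08 on VLARs, just slowly.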


Another question: When your V7 application detects that a WU angle range is below your VLAR limit, V7 terminates that WU with an exit code -6 so that WU gets listed as a Client Error/Compute Error.

No, V7 doesn't do that. V7 with the autokill mod does. Look at the download locations again: do you see two versions?


Could you instead terminate VLAR WUs with an error code that reports it as "Client Aborted"? That would help differentiate such WUs from WUs that terminate with error code -9 overflow (e.g. #spikes=30, or #spikes=15+#peaks=15). Just a thought....

They are already clearly differentiated from overflows, because a -9 overflow is not a computation error at all. And the BAD_HEADER code is a true error code: such a result will never be validated (even if both results carry this error). If you provide the code for a "client aborted" exit code, I could change it. I become incredibly lazy when I don't understand the reason for the proposed work ;)
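Raistmer's point is that the two codes already land in different buckets. Here is a minimal Python sketch of that mapping, using only the two status codes discussed in this exchange (the thread pairs -6 with BAD_HEADER). The function and bucket names are hypothetical, and this does not reproduce how the project's validator actually works.

```python
RESULT_OVERFLOW = -9  # "-9 overflow": too many signals found (e.g. #spikes=30)
BAD_HEADER = -6       # code the autokill build exits with for a VLAR


def classify_status(status: int) -> str:
    """Bucket a status code the way this thread describes it (sketch)."""
    if status == RESULT_OVERFLOW:
        # Not a computation error: a normal, reportable result.
        return "normal result (overflow)"
    if status == BAD_HEADER:
        # A true error: never validated, even if both results carry it.
        return "computation error"
    return "other"
```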
ID: 856912
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 856913 - Posted: 23 Jan 2009, 20:08:31 UTC - in response to Message 856887.  


Monitoring this configuration I see that MB_6.08_mod_VLAR_kill_CUDA.exe has 100% CPU utilization, so this module seems to be the culprit now.

A link to the result and its stderr, please?
When it commits hara-kiri it should not use the CPU; it should just die. In all other cases it behaves like 6.08.
There is only one case in which you can see 100% CPU usage: when, due to a low-memory condition (or maybe another reason), it falls back to CPU processing. Then it uses the CPU for computations and behaves like stock 6.03.
ID: 856913
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 856915 - Posted: 23 Jan 2009, 20:09:56 UTC - in response to Message 856902.  

Well, it seems your host is a good candidate for Maik's script. It handles such hang situations AFAIK. Look for it here or on the Lunatics site.


As I understand it, Maik's script is for terminating the app when a WU is CPU-idle. In my scenario the WU is not progressing while the app is using 100% CPU.

The workaround is to close boincmgr and restart it; the same WU is then processed properly by the app, so no WU has to be terminated.

I have tested Maik's script in this scenario and nothing happens, as the script's logic does not detect this erroneous situation.

I think this one needs further investigation in your code.

Morten


I need at least stderr.txt for this task.

ID: 856915
popandbob
Volunteer tester
Joined: 19 Mar 05
Posts: 551
Credit: 4,673,015
RAC: 0
Canada
Message 857052 - Posted: 24 Jan 2009, 3:58:51 UTC - in response to Message 856887.  



Monitoring this configuration I see that MB_6.08_mod_VLAR_kill_CUDA.exe has 100% CPU utilization, so this module seems to be the culprit now.

Morten


For me, the WUs run at 100% CPU for the first ~25 seconds, then they start processing... Was it in this start-up phase?


Do you Good Search for Seti@Home? http://www.goodsearch.com/?charityid=888957
Or Good Shop? http://www.goodshop.com/?charityid=888957
ID: 857052
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 857101 - Posted: 24 Jan 2009, 7:27:59 UTC - in response to Message 857052.  



Monitoring this configuration I see that MB_6.08_mod_VLAR_kill_CUDA.exe has 100% CPU utilization, so this module seems to be the culprit now.

Morten


For me, the WUs run at 100% CPU for the first ~25 seconds, then they start processing... Was it in this start-up phase?

Data decoding and GPU feeding, I suppose. This behavior has been there since the very first CUDA MB release; nothing new in that. But 100% CPU usage for more than ~1 min is something that needs to be analysed.
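The rule of thumb from these two posts (about 25 s of 100% CPU at start-up is normal; more than ~1 min without progress needs a look) can be turned into a check on sampled data. This is a minimal Python sketch, assuming you log the app's CPU usage and fraction done at regular intervals. The Sample fields and the 90% busy threshold are hypothetical; the 60 s limit comes from the post above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    elapsed_s: float      # wall-clock seconds since the task started
    cpu_percent: float    # app CPU usage, percent of one core
    fraction_done: float  # reported progress, 0.0 to 1.0


def seconds_busy_without_progress(samples: List[Sample],
                                  busy_percent: float = 90.0) -> float:
    """Length of the trailing run of samples that are busy but not progressing."""
    if not samples:
        return 0.0
    last = samples[-1]
    start = last.elapsed_s
    for s in reversed(samples):
        if s.cpu_percent < busy_percent or s.fraction_done != last.fraction_done:
            break  # run ends: CPU dropped or progress differed
        start = s.elapsed_s
    return last.elapsed_s - start


def needs_analysis(samples: List[Sample], limit_s: float = 60.0) -> bool:
    """True once the busy, no-progress stretch exceeds limit_s (~1 min)."""
    return seconds_busy_without_progress(samples) > limit_s
```

A start-up phase like popandbob's stays under the limit, while a task whose fraction done stays frozen at 100% CPU gets flagged after about a minute.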
ID: 857101
Morten Ross
Volunteer tester
Joined: 30 Apr 01
Posts: 183
Credit: 385,664,915
RAC: 0
Norway
Message 857136 - Posted: 24 Jan 2009, 10:29:54 UTC - in response to Message 857052.  
Last modified: 24 Jan 2009, 10:32:56 UTC



Monitoring this configuration I see that MB_6.08_mod_VLAR_kill_CUDA.exe has 100% CPU utilization, so this module seems to be the culprit now.

Morten



For me, the WUs run at 100% CPU for the first ~25 seconds, then they start processing... Was it in this start-up phase?



It's always in the start-up phase, and then the application gets stuck at 100% CPU utilization. There are no errors logged in the application's stderr.txt or online, as seen here: http://setiathome.berkeley.edu/result.php?resultid=1129307192. This WU was at first not progressing, with the app at 100% CPU; then boincmgr was restarted and the WU data was dumped once again (that's why it's listed twice). But there are no errors, as you can see, because the WU completed successfully the second time around.

If the application is killed, the WU is also failing, so that is not an option.

Morten
Morten Ross
ID: 857136
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 857212 - Posted: 24 Jan 2009, 16:26:54 UTC - in response to Message 857136.  

Really strange. Try to reproduce this with stock 6.08, and if it shows the same behavior, report it in the Q&A section; that's the most direct way to pass a bug to the devs.
If 6.08 works OK, well, you could continue with it and not use my current build.
ID: 857212
Morten Ross
Volunteer tester
Joined: 30 Apr 01
Posts: 183
Credit: 385,664,915
RAC: 0
Norway
Message 857236 - Posted: 24 Jan 2009, 17:36:57 UTC - in response to Message 857212.  

Hi,

I downloaded setiathome_6.08_windows_intelx86__cuda.exe from the fanout server, renamed all references to your V7 app in app_info to setiathome_6.08_windows_intelx86__cuda.exe, and started BOINC. This failed: only AP WUs were processed and MB WUs failed, so I had to roll back...

What must the app_info contain in order to successfully use the stock 6.08-app?

Morten
Morten Ross
ID: 857236
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 857240 - Posted: 24 Jan 2009, 17:54:34 UTC
Last modified: 24 Jan 2009, 18:03:24 UTC

Funny you should ask that - hang on a mo - have a look at message 857243.
ID: 857240
Morten Ross
Volunteer tester
Joined: 30 Apr 01
Posts: 183
Credit: 385,664,915
RAC: 0
Norway
Message 857279 - Posted: 24 Jan 2009, 19:22:17 UTC - in response to Message 857240.  

Hi,

Excellent initiative!

Unfortunately the result is the same, I'm afraid. Comparing the app_info files, I found only one difference between yours and mine:

Mine:
<version_num>528</version_num>
<plan_class>cuda</plan_class>
<avg_ncpus>0.040000</avg_ncpus>
<max_ncpus>0.040000</max_ncpus>

Yours:
<version_num>528</version_num>
<plan_class>cuda</plan_class>
<avg_ncpus>0.114729</avg_ncpus>
<max_ncpus>0.114729</max_ncpus>

Apart from that, your DLL files are newer.
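When two app_info files differ only in a few tags, a small script can list the differences instead of eyeballing them. This is a minimal Python sketch using the standard library's xml.etree.ElementTree on the two fragments quoted above, each wrapped in an <app_version> element as in app_info.xml. The helper names are hypothetical, and a real app_info.xml may hold several <app_version> blocks, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional, Tuple

MINE = """<app_version>
<version_num>528</version_num>
<plan_class>cuda</plan_class>
<avg_ncpus>0.040000</avg_ncpus>
<max_ncpus>0.040000</max_ncpus>
</app_version>"""

YOURS = """<app_version>
<version_num>528</version_num>
<plan_class>cuda</plan_class>
<avg_ncpus>0.114729</avg_ncpus>
<max_ncpus>0.114729</max_ncpus>
</app_version>"""


def fields(fragment: str) -> Dict[str, str]:
    """Map each direct child tag of the fragment to its stripped text."""
    root = ET.fromstring(fragment)
    return {child.tag: (child.text or "").strip() for child in root}


def diff_fields(a: str, b: str) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {tag: (value_in_a, value_in_b)} for every tag that differs."""
    fa, fb = fields(a), fields(b)
    return {tag: (fa.get(tag), fb.get(tag))
            for tag in sorted(fa.keys() | fb.keys())
            if fa.get(tag) != fb.get(tag)}


for tag, (mine, yours) in diff_fields(MINE, YOURS).items():
    print(f"{tag}: mine={mine} yours={yours}")
```

Run on the two fragments, it reports only avg_ncpus and max_ncpus (0.040000 vs 0.114729), matching the comparison above.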

Morten

Morten Ross
ID: 857279
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 857283 - Posted: 24 Jan 2009, 19:38:11 UTC - in response to Message 857279.  

The app and the DLL files are as downloaded from the SETI servers - any variation in datestamp will be simply due to the different times we downloaded them.

I have installed the packages direct from my own webspace (well, the AP SSE3 package - but there's very little difference) on to two machines which didn't previously have CUDA-capable graphics cards. No problems encountered - both AP and CUDA started to run exactly as expected.

Which probably directs the focus onto your CUDA card, and the NVidia drivers you're using. Or has Raistmer already gone over all of that with you?
ID: 857283
Morten Ross
Volunteer tester
Joined: 30 Apr 01
Posts: 183
Credit: 385,664,915
RAC: 0
Norway
Message 857327 - Posted: 24 Jan 2009, 20:29:46 UTC - in response to Message 857283.  

Hi,

I've gone through a thorough version-testing and have arrived at Cuda 181.20 and BOINC 6.4.5 (which Raistmer is running) as the best combo.

Prior to this I ran GPUGRID for a while and there was not a single hitch, so this is S@H-specific.

I have suggested to Raistmer that a debug version of V7 be compiled, in order to collect more information at the time of the problem.

Have you been successful in running the "SETI CUDA v6.08 - AP SSE3" package on Windows Vista x64? I have a stable Vista 32-bit setup, so this is probably x64-related.

I am considering demoting the BOINC installation to 32-bit......

Morten
Morten Ross
ID: 857327
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 857335 - Posted: 24 Jan 2009, 20:58:08 UTC

I'm using the same Cuda 181.20 and BOINC v6.4.5 combo, so nothing helps us there.

All my rigs are 32-bit XP, so I can't help there: but I agree the suspicion lies in the 64-bit or Vista area.
ID: 857335
Morten Ross
Volunteer tester
Joined: 30 Apr 01
Posts: 183
Credit: 385,664,915
RAC: 0
Norway
Message 857570 - Posted: 25 Jan 2009, 10:16:50 UTC - in response to Message 857335.  

Hi,

I might have been a bit quick to judge: boincmgr was hanging, I saw no setiathome_6.08_windows_intelx86__cuda.exe in the task list, assumed it was a no-go and rolled back immediately.

I wanted to give it another, more thorough try before demoting to 32-bit BOINC, and this time it is working. I'm not sure what is different; a reboot is the only difference from the previous two attempts that I can think of.

Nevertheless - same behaviour, so I am now going to demote the x64 to 32-bit BOINC and see if that makes a difference.

Morten
Morten Ross
ID: 857570
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 857590 - Posted: 25 Jan 2009, 11:24:02 UTC - in response to Message 857570.  

You can also try the new version. V8 uses a completely different approach to GPU handling, with no BOINC involvement. Maybe it will run on your host...
ID: 857590
Morten Ross
Volunteer tester
Joined: 30 Apr 01
Posts: 183
Credit: 385,664,915
RAC: 0
Norway
Message 857597 - Posted: 25 Jan 2009, 12:07:37 UTC - in response to Message 857590.  

Hi,

Great approach!

I am currently running BOINC 32-bit to see if this changes the issue...

Morten
Morten Ross
ID: 857597
Morten Ross
Volunteer tester
Joined: 30 Apr 01
Posts: 183
Credit: 385,664,915
RAC: 0
Norway
Message 857672 - Posted: 25 Jan 2009, 16:10:39 UTC - in response to Message 857590.  
Last modified: 25 Jan 2009, 16:11:30 UTC

Hi,

BOINC 32-bit tested, and same happens....

I've now tested V8, and the same happens: the CUDA app is using 100% CPU and the WU is not progressing beyond 0.026%.

Morten
Morten Ross
ID: 857672
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 857674 - Posted: 25 Jan 2009, 16:19:00 UTC - in response to Message 857672.  

well.... only x64 Vista remains...
ID: 857674

 
©2026 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.