OpenCL NV MultiBeam v8 SoG edition for Windows

Message boards : Number crunching : OpenCL NV MultiBeam v8 SoG edition for Windows

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6324
Credit: 106,370,077
RAC: 121
Russia
Message 1795572 - Posted: 12 Jun 2016, 8:45:34 UTC - in response to Message 1795504.  
Last modified: 12 Jun 2016, 9:17:20 UTC


The slowdown at low AR comes from having only one full PoT to process.
That effectively sends all the work to one SM/SMX unit.

More precisely, as I explained in a post on Lunatics (sadly, the site is down now), there are 8 independent arrays to search - 8 PoT arrays. Depending on the workgroup size you choose, all of them can indeed go to a single CU (in OpenCL terms), which corresponds to an SM/SMX in NV/CUDA terms. One can artificially distribute them across 8 different CUs with appropriate workgroup limits, but of course this will underload each of the CUs.
Another way is to unroll some of the periods to provide more data to process in parallel.
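A minimal host-side sketch of that workgroup-limit idea (plain OpenCL C; the kernel, the 2-D geometry and the 64-work-item cap are illustrative assumptions, not the actual MultiBeam code):

#include <CL/cl.h>

/* Sketch: treat the 8 independent PoT arrays as rows of a 2-D NDRange.
   With no local-size cap the runtime may keep the whole (small) job on one
   CU (= SM/SMX); forcing a small workgroup size yields more workgroups, so
   the hardware scheduler can spread the 8 arrays over 8 CUs - each of them
   underloaded, as noted above. */
void enqueue_pot_search(cl_command_queue q, cl_kernel pot_kernel,
                        size_t pot_len)              /* points per PoT */
{
    size_t global[2] = { pot_len, 8 };   /* 8 independent PoT arrays   */
    size_t local[2]  = { 64, 1 };        /* assumed 64-item group cap  */
    clEnqueueNDRangeKernel(q, pot_kernel, 2, NULL,
                           global, local, 0, NULL, NULL);
}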


My errors probably come from calculating the average from an artificially shortened PoT. A pre-calculated average is something I'll try tomorrow or in the next few days,

I do average pre-calculation in the Triplet search. It looked like a good idea at the time. Now, if full decoupling of the signal search is needed, it adds one more dependency to get rid of. But because it's not the only obstacle to doing the full PoT on the GPU, I haven't touched it yet.


then getting rid of the loop running from LastP to FirstP and replacing it with grid.z and a parameter.

Not sure that's really possible. Such an unroll costs memory for the arrays it has to hold.
Initially I did a fixed-size x32 unroll over periods. Currently it's configurable, but still far below the total number of periods. The maximum total number of periods can be estimated as 2/3*(1024*1024/8) ≈ 87,381. One needs a corresponding amount of memory to hold that many separate (though slightly shortened on the first iteration) arrays.
Maybe doable with 4 GB GPUs? Worth calculating.
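"Worth calculating" - a small stand-alone C check. The period count comes straight from the estimate above; the average unrolled-array length is not given in the post, so the candidates below are placeholders that merely show where a 4 GB budget breaks:

#include <stdio.h>

int main(void)
{
    /* max total periods ~ 2/3 * (1024*1024/8) ~ 87,381 */
    const size_t max_periods = 2UL * (1024 * 1024 / 8) / 3;
    /* hypothetical average lengths (in floats) of the held arrays */
    const size_t avg_len[] = { 1024, 4096, 16384 };

    printf("max periods: %zu\n", max_periods);
    for (int i = 0; i < 3; i++) {
        double gib = (double)max_periods * avg_len[i] * sizeof(float)
                   / (1024.0 * 1024.0 * 1024.0);
        printf("avg len %5zu floats -> %6.2f GiB\n", avg_len[i], gib);
    }
    return 0;
}

Under these assumptions the full unroll fits a 4 GB card only if the arrays average a few thousand floats: 0.33 GiB at 1,024, 1.33 GiB at 4,096, but already 5.33 GiB at 16,384.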


Earlier I tried 8 streams and other things to make more work run in parallel. It is hard to keep track of found results and report them in the same order as the CPU version does. You have figured out a way to do that with SoG!!

A few queues (again, the OpenCL term corresponding to CUDA streams) per single PoT search looks like an increase in overhead. One could instead try a few PoT searches in separate queues (not too big an increase in memory footprint; partially implemented) or even a few icfft iterations simultaneously (that would be quite a big rework of the existing code and, unfortunately, a sharp increase in memory footprint).
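A hedged sketch of the "few PoT searches in separate queues" variant: one context, several in-order command queues (OpenCL's analogue of CUDA streams). Everything beyond the OpenCL API itself - the per-queue kernel instances, the PoT length - is hypothetical:

#include <CL/cl.h>

#define N_QUEUES 4

/* One kernel instance (with its own buffers) per queue, so the concurrent
   searches don't race on kernel arguments. */
void run_pot_searches(cl_context ctx, cl_device_id dev,
                      cl_kernel pot_kernel[N_QUEUES])
{
    cl_command_queue q[N_QUEUES];
    size_t global = 131072;              /* placeholder PoT length */
    cl_int err;

    for (int i = 0; i < N_QUEUES; i++)
        q[i] = clCreateCommandQueue(ctx, dev, 0, &err);

    /* independent searches; the device is free to overlap them */
    for (int i = 0; i < N_QUEUES; i++)
        clEnqueueNDRangeKernel(q[i], pot_kernel[i], 1, NULL,
                               &global, NULL, 0, NULL, NULL);

    for (int i = 0; i < N_QUEUES; i++) {
        clFinish(q[i]);
        clReleaseCommandQueue(q[i]);
    }
}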
Regarding exact signal order: for a non-overflowed task it's irrelevant as long as there are no false positives/negatives. For an overflowed task it is a real issue. Ironically, those are the "noisy" tasks that will need separate treatment at the post-processing stage anyway. I decided to sacrifice absolute signal ordering and just try to keep the differences as small as possible, to reduce the number of mismatched overflows. Also, the ordering issue seems to have existed even in the original CUDA code (though to quite a small degree). So I would recommend concentrating on false positives/negatives more than on signal ordering.
EDIT: BTW, this is quite different from the AstroPulse situation, where signals (for example, in the FFA) are updated by a proximity-establishing algorithm. There, if such signal updates arrive in the wrong order, even a non-overflow task will end up with wrong final signals. It took a lot of time and some tricks in the code to keep all the signals found in parallel in order. One of the released builds even grew to huge memory sizes because of this.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1795572
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14472
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1795584 - Posted: 12 Jun 2016, 10:23:22 UTC - in response to Message 1795483.  

Unfortunately there is no other tool for such a task.
Both plan classes define exactly the same subset of hosts. So the bigger number will represent the domination of one of the apps within that subset. And an app can dominate either because it's faster or because of BOINC's imperfection (wrong "best" selection). That's the noise I expect.

Having thought about it overnight, I think I can lower my flag of caution.

The GFlops figures on the applications page are - as individual numbers - well dodgy, but the relative numbers for two applications deployed on the same day (and as you say, for the same subset of hosts) should be a useful comparator.

The alarm bells that went off in my head last night related more to the other recent stats pages (CPU models, GPU models): it was the CPU list which started out with some very bad maths, but we got that corrected. Also, I think it's the cpu/gpu lists which must include anonymous platform data: I don't see how that could be done for the applications page. Sorry about the red herring.
ID: 1795584
Grumpy Swede (I stand with Ukraine)
Volunteer tester
Joined: 1 Nov 08
Posts: 8921
Credit: 49,849,242
RAC: 65
Sweden
Message 1795596 - Posted: 12 Jun 2016, 12:14:26 UTC - in response to Message 1795465.  

OK, I just wondered because I noticed that there is a MB8_win_x86_SSE3_OpenCL_ATi_APU_r3430_SoG.exe, for the ATI APU, on your download site, but no SoG for the iGPU.

Yes, but it didn't show as big an advantage over non-SoG as it did for NV.
If the much-higher-volume main project statistics show the same, I'll consider dropping that build.

EDIT: current values:
Windows/x86 8.12 (opencl_atiapu_sah) 19 May 2016, 16:32:07 UTC 4,961 GigaFLOPS
Windows/x86 8.12 (opencl_atiapu_SoG) 19 May 2016, 16:32:07 UTC 5,564 GigaFLOPS

SoG is a little better, but probably within the noise range.

OK, it's just that I have this nagging "feeling" that SoG would make a difference even with the iGPU. Not that it's doing badly at all with the non-SoG version, as can be seen in this example from Beta:
Task 24037397
ID: 1795596
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1795776 - Posted: 12 Jun 2016, 20:51:27 UTC - in response to Message 1795572.  

The slowdown at low AR comes from having only one full PoT to process. [...]

More precisely, as I explained in a post on Lunatics, there are 8 independent arrays to search - 8 PoT arrays. [...]


Thank You for the detailed explanation. I'll read it again and again, at least three times or until I get everything that is in it. Thank You.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1795776
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1795782 - Posted: 12 Jun 2016, 21:14:43 UTC - in response to Message 1795776.  

The slowdown at low AR comes from having only one full PoT to process. [...]

More precisely, as I explained in a post on Lunatics, there are 8 independent arrays to search - 8 PoT arrays. [...]

Thank You for the detailed explanation. [...]

I found that the GUPPIs speed up quite a bit if you throw registers at them.
Note this task was run with maxrregcount=32: http://setiathome.berkeley.edu/result.php?resultid=4979667790
Run time: 22 min 36 sec
CPU time: 22 min 19 sec
This task was run with the app set to maxrregcount=128: http://setiathome.berkeley.edu/result.php?resultid=4979681158
Run time: 8 min 22 sec
CPU time: 8 min 14 sec
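For reference, on the CUDA side that register cap is nvcc's documented --maxrregcount option; a sketch of the two builds compared above (source and output file names are hypothetical):

nvcc --maxrregcount=32  -o mb8_cuda_r32  cuda_kernels.cu   # the slow case
nvcc --maxrregcount=128 -o mb8_cuda_r128 cuda_kernels.cu   # the fast case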
ID: 1795782
Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5516
Credit: 528,817,460
RAC: 242
United States
Message 1795788 - Posted: 12 Jun 2016, 21:58:27 UTC - in response to Message 1795782.  

TBar, Is that 1 at a time or multiples on the same card?
ID: 1795788
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1795797 - Posted: 12 Jun 2016, 22:20:38 UTC - in response to Message 1795788.  

TBar, Is that 1 at a time or multiples on the same card?

Those apps are using Petri's code, which runs one task at a time in streams.
Modifications done by petri33.
ID: 1795797
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6324
Credit: 106,370,077
RAC: 121
Russia
Message 1795805 - Posted: 12 Jun 2016, 22:35:04 UTC - in response to Message 1795782.  
Last modified: 12 Jun 2016, 22:36:53 UTC

This task was run with the app set to maxrregcount=128: http://setiathome.berkeley.edu/result.php?resultid=4979681158
Run time: 8 min 22 sec
CPU time: 8 min 14 sec

I'm afraid this particular task is not a good example, at least right now:

Name blc2_2bit_guppi_57451_64017_HIP116936_0007.21158.416.17.26.209.vlar_1
Workunit 2182225561
Created 11 Jun 2016, 7:20:47 UTC
Sent 11 Jun 2016, 13:55:11 UTC
Report deadline 3 Aug 2016, 18:54:53 UTC
Received 12 Jun 2016, 4:51:58 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 6796479
Run time 8 min 22 sec
CPU time 8 min 14 sec
Validate state Checked, but no consensus yet
Credit 0.00
Device peak FLOPS 2,022.14 GFLOPS
Application version SETI@home v8
Anonymous platform (NVIDIA GPU)
Peak working set size 186.86 MB
Peak swap size 27,254.51 MB
Peak disk usage 0.03 MB

That is, inconclusive: "Checked, but no consensus yet."
Maybe the wingman's result will ultimately be the invalid one, but it's better to give a clearly valid task as an example.
Also, there are no CUDA compiler options that can be passed to the OpenCL compiler. At least that was so in older CUDA SDKs; I need to check the new one. Strange, because passing options to the compiler is supported by the OpenCL standard.
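For what it's worth, clBuildProgram() does take an options string per the OpenCL standard, and NVIDIA's OpenCL has historically accepted the vendor-specific -cl-nv-maxrregcount=<N> option as the analogue of nvcc's --maxrregcount - though whether a given driver honours it is exactly the kind of thing that needs checking. A hedged sketch:

#include <CL/cl.h>

/* Try the NVIDIA-specific register cap (mirroring TBar's fast build);
   fall back to default options if the platform rejects the vendor flag. */
cl_int build_with_reg_cap(cl_program prog, cl_device_id dev)
{
    cl_int err = clBuildProgram(prog, 1, &dev,
                                "-cl-nv-maxrregcount=128", NULL, NULL);
    if (err != CL_SUCCESS)              /* e.g. a non-NVIDIA platform */
        err = clBuildProgram(prog, 1, &dev, "", NULL, NULL);
    return err;
}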
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1795805
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1795810 - Posted: 12 Jun 2016, 22:46:31 UTC - in response to Message 1795805.  
Last modified: 12 Jun 2016, 22:53:06 UTC

Petri's latest mod results in the guppis being off by at least 1 pulse count, even though the previous code gave the correct results. I'm waiting for the next version. The difference from maxrregcount is valid, though. It has been noticed in previous builds for some time now; I believe I even posted about it over at Beta a while ago. If you set the compiler to maxrregcount=32, the app will be much slower. Unfortunately, GPUs below compute capability 3.2 can't use any setting above 32; the compiler will just ignore the setting for those GPUs and use 32.
ID: 1795810
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6324
Credit: 106,370,077
RAC: 121
Russia
Message 1795820 - Posted: 12 Jun 2016, 23:41:58 UTC - in response to Message 1795810.  

The PulseFind code has quite a lot of variables, so keeping them from spilling is a good thing. With 128 registers per thread it would be possible to do some of the last iterations purely inside registers, which would give a very good speedup on those iterations.
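A toy illustration (plain C, hypothetical names) of doing the last iterations purely inside registers: once the remaining fold is short, the partial sums live in scalar locals, which a GPU compiler can keep in registers when the per-thread budget (e.g. maxrregcount=128) allows, instead of re-reading memory on every iteration:

/* Fold the final 8 points of a PoT down to 1 without touching memory
   after the initial loads; scalars map to registers, budget permitting. */
static float fold_tail_in_registers(const float *pot)
{
    float r0 = pot[0], r1 = pot[1], r2 = pot[2], r3 = pot[3],
          r4 = pot[4], r5 = pot[5], r6 = pot[6], r7 = pot[7];

    r0 += r4; r1 += r5; r2 += r6; r3 += r7;   /* length 8 -> 4 */
    r0 += r2; r1 += r3;                       /* length 4 -> 2 */
    return r0 + r1;                           /* length 2 -> 1 */
}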
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1795820
Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5516
Credit: 528,817,460
RAC: 242
United States
Message 1796236 - Posted: 15 Jun 2016, 2:21:37 UTC

Raistmer,

Just noticed you removed r3430 SoG from your downloads. That version works best for me. Fortunately, I still had the zip on another computer.
ID: 1796236
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13139
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1796255 - Posted: 15 Jun 2016, 4:52:30 UTC - in response to Message 1796236.  

I have also noticed that the r3430 app is faster than the new "improved" r3472 app: in general about 50-100 seconds faster on non-VLAR (0.41 AR typical) tasks, and about 100 seconds faster on VLAR BLC2 GUPPIs. I haven't switched over my dedicated cruncher, partly due to that and partly because I dumped all my 8.00 tasks when I updated to the r3472 app and didn't notice the change in the app version.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1796255
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6324
Credit: 106,370,077
RAC: 121
Russia
Message 1796315 - Posted: 15 Jun 2016, 9:57:10 UTC - in response to Message 1796236.  
Last modified: 15 Jun 2016, 10:01:07 UTC

Raistmer,

Just noticed you removed r3430 SoG from your downloads. That version works best for me. Fortunately, I still had the zip on another computer.

It was replaced with a bug-fixed SoG build.
It should be similar to r3430, but without the hang with the -use_sleep option.

What relative slowdown do you see (that is, not the absolute number of seconds, but 100% * deltaT(slowdown) / elapsed_time)? For example, a 100-second slowdown on a 1,000-second task is 10%.

BTW, such things currently go unnoticed because I can't immediately run the NV SoG build at all. So all performance-change observations are on you... (It would also be good to see formal offline KWSN-made benchmarks with full-length tasks.)
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1796315
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6324
Credit: 106,370,077
RAC: 121
Russia
Message 1796316 - Posted: 15 Jun 2016, 10:02:40 UTC - in response to Message 1796255.  

I have also noticed that the r3430 app is faster than the new "improved" r3472 app: in general about 50-100 seconds faster on non-VLAR (0.41 AR typical) tasks, and about 100 seconds faster on VLAR BLC2 GUPPIs. I haven't switched over my dedicated cruncher, partly due to that and partly because I dumped all my 8.00 tasks when I updated to the r3472 app and didn't notice the change in the app version.

Same question: what is the relative slowdown?
100 seconds out of 300 total (33%) is one thing; 100 seconds out of 100 ks of elapsed time (0.1%) is quite another.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1796316
Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5516
Credit: 528,817,460
RAC: 242
United States
Message 1796348 - Posted: 15 Jun 2016, 13:17:24 UTC - in response to Message 1796315.  

What relative slowdown do you see (that is, not the absolute number of seconds, but 100% * deltaT(slowdown) / elapsed_time)? For example, a 100-second slowdown on a 1,000-second task is 10%.

BTW, such things currently go unnoticed because I can't immediately run the NV SoG build at all. So all performance-change observations are on you... (It would also be good to see formal offline KWSN-made benchmarks with full-length tasks.)



Well, since I was over at Beta testing, I haven't had a chance to try the new installer yet.

That, and blc2, 3, and 5 and blc7, 8 all have different run times.

It's harder to get a true "average".

Probably later today.
ID: 1796348
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13139
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1796410 - Posted: 15 Jun 2016, 17:23:44 UTC - in response to Message 1796316.  


Same question: what is the relative slowdown?
100 seconds out of 300 total (33%) is one thing; 100 seconds out of 100 ks of elapsed time (0.1%) is quite another.

I was just comparing a dozen or so completed tasks in my lists for both machines. r3430 non-VLAR (AR 0.41) tasks completed in an average of around 700 seconds; r3430 VLAR blc2 GUPPI tasks averaged about 1200 seconds. All tasks were run two-up on each card in both machines.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1796410
Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5516
Credit: 528,817,460
RAC: 242
United States
Message 1796412 - Posted: 15 Jun 2016, 17:40:45 UTC - in response to Message 1796410.  
Last modified: 15 Jun 2016, 17:44:15 UTC


Same question: what is the relative slowdown?
100 seconds out of 300 total (33%) is one thing; 100 seconds out of 100 ks of elapsed time (0.1%) is quite another.

I was just comparing a dozen or so completed tasks in my lists for both machines. r3430 non-VLAR (AR 0.41) tasks completed in an average of around 700 seconds; r3430 VLAR blc2 GUPPI tasks averaged about 1200 seconds. All tasks were run two-up on each card in both machines.


Best to compare guppi to guppi and non-guppi to non-guppi.

Before this current revision (r3471?) there was another one between it and r3430, if I remember correctly. That version slowed down GUPPI processing, so that is why I reverted back to r3430. Tonight I will give this new one a try and see if we can get comparables.

Or maybe it was the non-SoG version of r3430; hard to remember all the different ones, lol..
ID: 1796412
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13139
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1796414 - Posted: 15 Jun 2016, 17:51:54 UTC - in response to Message 1796412.  


Best to compare guppi to guppi and non-guppi to non-guppi.

Before this current revision (r3471?) there was another one between it and r3430, if I remember correctly. That version slowed down GUPPI processing, so that is why I reverted back to r3430. Tonight I will give this new one a try and see if we can get comparables.

Or maybe it was the non-SoG version of r3430; hard to remember all the different ones, lol..

I was comparing the runtimes of non-guppi to non-guppi and guppi to guppi between the original r3430 and r3472 apps. I was running the interim SoG r3430 app that didn't have SoG in its filename even though it was a SoG app. Lots of confusion there, as was evident in the forum threads.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1796414
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 64840
Credit: 55,293,173
RAC: 49
United States
Message 1796423 - Posted: 15 Jun 2016, 19:37:12 UTC

I am very happy with r3472. I like SoG; this one works for me, at the very least.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1796423