OpenCL NV MultiBeam v8 SoG edition for Windows

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13399
Credit: 208,696,464
RAC: 304
Australia
Message 1795305 - Posted: 11 Jun 2016, 6:32:00 UTC - in response to Message 1795303.  
Last modified: 11 Jun 2016, 6:32:41 UTC

Just wondering if any of the SoG stuff has hit a Lunatics build yet? Assuming no ...


Check out the "Open Beta test: SoG for NVidia, Lunatics v0.45" thread.
Grant
Darwin NT
ID: 1795305 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1849
Credit: 268,616,081
RAC: 1,349
United States
Message 1795307 - Posted: 11 Jun 2016, 6:41:33 UTC - in response to Message 1795305.  

Thanks, Grant ...
ID: 1795307 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6324
Credit: 106,370,077
RAC: 121
Russia
Message 1795321 - Posted: 11 Jun 2016, 8:38:26 UTC - in response to Message 1795312.  

I think the confusion arises from the fast-fix build, which IS a SoG build but has no "SoG" suffix in the exe file name. It will soon be replaced by a build with a proper rev number and the corresponding suffixes.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1795321 · Report as offensive
Rasputin42
Volunteer tester

Joined: 25 Jul 08
Posts: 412
Credit: 5,834,661
RAC: 0
United States
Message 1795326 - Posted: 11 Jun 2016, 9:07:14 UTC
Last modified: 11 Jun 2016, 9:07:27 UTC

Ah, the fast-fix build.


Was that it?

https://setiathome.berkeley.edu/forum_thread.php?id=79629&postid=1793393#1793393

That was done to stop low_perf gpus from "hanging".
ID: 1795326 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6324
Credit: 106,370,077
RAC: 121
Russia
Message 1795327 - Posted: 11 Jun 2016, 9:10:19 UTC - in response to Message 1795326.  

Ah, the fast-fix build.


Was that it?

https://setiathome.berkeley.edu/forum_thread.php?id=79629&postid=1793393#1793393

That was done to stop low_perf gpus from "hanging".


Mostly to allow -use_sleep to work along with SoG.
But yes, low-performance GPUs were affected in the default config. Also, the default sleep for them was reduced from 5 ms to 1 ms.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1795327 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6324
Credit: 106,370,077
RAC: 121
Russia
Message 1795330 - Posted: 11 Jun 2016, 9:50:10 UTC - in response to Message 1795329.  

It was demonstrated that -use_sleep doesn't work for some tasks.
The reason: one of the queues was not flushed manually, and the NV runtime never flushed it without a direct command. Whether that is a peculiarity of a particular driver version or a feature of the NV runtime as a whole, I don't know.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1795330 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 33593
Credit: 79,922,639
RAC: 80
Germany
Message 1795353 - Posted: 11 Jun 2016, 12:47:04 UTC - in response to Message 1795348.  

Another question, Raistmer: is there no SoG version for Intel GPUs, just the (opencl_intel_gpu_sah) non-SoG (I guess) version?

I'm running it as stock on Beta now.


Nope, no SoG version for iGPU atm.


With each crime and every kindness we birth our future.
ID: 1795353 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6324
Credit: 106,370,077
RAC: 121
Russia
Message 1795379 - Posted: 11 Jun 2016, 15:05:56 UTC

The iGPU OpenCL build uses a forced sync after almost every OpenCL API call.
This made it possible to reduce CPU usage (well, it's a third way of dealing with the sync weirdness that the vendors' OpenCL runtimes offer). So there is almost no real sense in making a SoG build for it. An iGPU is a hybrid device, just as an AMD APU is: it shares memory between the CPU and GPU parts. So the main advantage of SoG, reducing communication between the CPU and a discrete GPU, will add almost nothing here (at least in theory).
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1795379 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6324
Credit: 106,370,077
RAC: 121
Russia
Message 1795465 - Posted: 11 Jun 2016, 21:55:49 UTC - in response to Message 1795430.  
Last modified: 11 Jun 2016, 21:58:34 UTC

OK, I just wondered because I noticed that there is a MB8_win_x86_SSE3_OpenCL_ATi_APU_r3430_SoG.exe, for the ATI APU, on your download site, but no SoG for the iGPU.

Yes, but it did not show as big an advantage over non-SoG as it did for NV.
If the much larger main-project statistics show the same, I'll consider dropping that build.

EDIT: current values:
Windows/x86 8.12 (opencl_atiapu_sah) 19 May 2016, 16:32:07 UTC 4,961 GigaFLOPS
Windows/x86 8.12 (opencl_atiapu_SoG) 19 May 2016, 16:32:07 UTC 5,564 GigaFLOPS

SoG is a little better, but probably within the noise range.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1795465 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14532
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1795473 - Posted: 11 Jun 2016, 22:13:39 UTC - in response to Message 1795465.  

Yes but it did not show as big advantage over non-SoG as for NV.
If much higher main project statistics will show the same I'll consider to drop that build.

EDIT: current values:
Windows/x86 8.12 (opencl_atiapu_sah) 19 May 2016, 16:32:07 UTC 4,961 GigaFLOPS
Windows/x86 8.12 (opencl_atiapu_SoG) 19 May 2016, 16:32:07 UTC 5,564 GigaFLOPS

SoG better a little but probably inside noise range.

I'm a little worried about using that table for that purpose. I think those numbers are totals across all hosts running the app: so the equality of total could be caused by twice the number of hosts, running at half the speed (and we wouldn't know which way round the imbalance worked).

It may be valid mathematically, because if the table contains results only from stock apps where the server does the allocating, we would expect fewer hosts to still be running the slower app, and the totals should diverge quite quickly. But David did say he was going to include anonymous platform too - let me do some codewalking in the morning. Just a caution flag for now.
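
To make the concern concrete, here is a minimal Python sketch. The host counts and per-host rates are made up for illustration and do not come from the project's tables; the only point is that a total cannot tell "many slow hosts" apart from "few fast hosts".

```python
def total_gflops(host_count, gflops_per_host):
    """Aggregate rate as a table of totals would show it (hypothetical model)."""
    return host_count * gflops_per_host

# Hypothetical numbers: the slow app runs on twice as many hosts at half the speed.
slow_app_total = total_gflops(host_count=200, gflops_per_host=25.0)
fast_app_total = total_gflops(host_count=100, gflops_per_host=50.0)

# Both totals are identical, so the totals alone say nothing about per-host speed.
print(slow_app_total, fast_app_total)
```

If the table also carried host counts, dividing each total by its count would remove the ambiguity.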
ID: 1795473 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6324
Credit: 106,370,077
RAC: 121
Russia
Message 1795483 - Posted: 11 Jun 2016, 22:58:04 UTC - in response to Message 1795473.  

Unfortunately there is no other tool for such a task.
Both plans define exactly the same subset of hosts. So the bigger number will represent the domination of one of the apps in that subset. An app can dominate either because it's faster or because of BOINC's imperfection (wrong selection of the best app). That's the noise I expect.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1795483 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6324
Credit: 106,370,077
RAC: 121
Russia
Message 1795486 - Posted: 11 Jun 2016, 23:06:09 UTC
Last modified: 11 Jun 2016, 23:07:15 UTC

What is interesting in this respect is the behavior of SoG/non-SoG on the Linux platform:
Linux/x86_64 8.10 (opencl_nvidia_sah) 18 May 2016, 1:10:51 UTC 1,460 GigaFLOPS
Linux/x86_64 8.10 (opencl_nvidia_SoG) 18 May 2016, 1:10:51 UTC 1,718 GigaFLOPS

SoG is only marginally faster, while on Windows it leads by roughly a factor of two.

Can anyone running SoG on Linux post its typical stderr header?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1795486 · Report as offensive
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1795499 - Posted: 11 Jun 2016, 23:59:09 UTC - in response to Message 1795486.  

What is interesting in this respect is the behavior of SoG/non-SoG on the Linux platform:
Linux/x86_64 8.10 (opencl_nvidia_sah) 18 May 2016, 1:10:51 UTC 1,460 GigaFLOPS
Linux/x86_64 8.10 (opencl_nvidia_SoG) 18 May 2016, 1:10:51 UTC 1,718 GigaFLOPS

SoG is only marginally faster, while on Windows it leads by roughly a factor of two.

Can anyone running SoG on Linux post its typical stderr header?


Something like this?
Name	blc2_2bit_guppi_57451_63021_HIP116936_OFF_0004.12280.0.17.26.210.vlar_1
Workunit	2182371096
Created	11 Jun 2016, 11:22:35 UTC
Sent	11 Jun 2016, 17:50:31 UTC
Report deadline	3 Aug 2016, 22:50:13 UTC
Received	11 Jun 2016, 22:47:22 UTC
Server state	Over
Outcome	Success
Client state	Done
Exit status	0 (0x0)
Computer ID	7475713
Run time	4 min 57 sec
CPU time	54 sec
Validate state	Valid
Credit	146.01
Device peak FLOPS	13,313.28 GFLOPS
Application version	SETI@home v8
Anonymous platform (NVIDIA GPU)


or this (from the same unit)

Stderr output

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 3 CUDA device(s):
  Device 1: Graphics Device, 8113 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 20 
     pciBusID = 2, pciSlotID = 0
  Device 2: GeForce GTX 980, 4036 MiB, regsPerBlock 65536
     computeCap 5.2, multiProcs 16 
     pciBusID = 1, pciSlotID = 0
  Device 3: GeForce GTX 980, 4037 MiB, regsPerBlock 65536
     computeCap 5.2, multiProcs 16 
     pciBusID = 3, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 2
setiathome_CUDA: CUDA Device 2 specified, checking...
   Device 2: GeForce GTX 980 is okay
SETI@home using CUDA accelerated device GeForce GTX 980

setiathome v8 enhanced x41p_zi, Cuda 7.50 special
Compiled with NVCC 7.5, using 6.5 libraries. Modifications done by petri33.



Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.008659


or something else, since I'm not running OpenCL.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1795499 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6324
Credit: 106,370,077
RAC: 121
Russia
Message 1795501 - Posted: 12 Jun 2016, 0:15:05 UTC - in response to Message 1795499.  



or something else, since I'm not running OpenCL.

Because SoG is a modification of the OpenCL build, obviously something else.

Have you considered putting that build on beta?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1795501 · Report as offensive
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1795504 - Posted: 12 Jun 2016, 0:26:33 UTC - in response to Message 1795501.  



or something else, since I'm not running OpenCL.

Cause SoG is modification of OpenCL build - obviously something else.

Did you consider to put that build on beta ?


It is still an alpha, but I have sent the code to JasonG and TBar for more serious testing.

The slowdown in low ar comes from having only one full pot to process. That kind of makes all the work go to one SM/SMX unit. My errors are probably from calculating the average from an artificially shortened pot. A pre-calculated avg is something I'll try tomorrow or in the next few days, then getting rid of the loop going from LastP to FirstP and replacing it with grid.z and a parameter. Earlier I tried 8 streams and other stuff to make more work parallel. It is hard to keep track of found results and to report them in the same order as the CPU version does. You have figured out a way to do that with SoG!!
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1795504 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13399
Credit: 208,696,464
RAC: 304
Australia
Message 1795532 - Posted: 12 Jun 2016, 2:15:21 UTC - in response to Message 1795473.  

Yes but it did not show as big advantage over non-SoG as for NV.
If much higher main project statistics will show the same I'll consider to drop that build.

EDIT: current values:
Windows/x86 8.12 (opencl_atiapu_sah) 19 May 2016, 16:32:07 UTC 4,961 GigaFLOPS
Windows/x86 8.12 (opencl_atiapu_SoG) 19 May 2016, 16:32:07 UTC 5,564 GigaFLOPS

SoG better a little but probably inside noise range.

I'm a little worried about using that table for that purpose. I think those numbers are totals across all hosts running the app: so the equality of total could be caused by twice the number of hosts, running at half the speed (and we wouldn't know which way round the imbalance worked).

It may be valid mathematically, because if the table contains results only from stock apps where the server does the allocating, we would expect fewer hosts to still be running the slower app, and the totals should diverge quite quickly. But David did say he was going to include anonymous platform too - let me do some codewalking in the morning. Just a caution flag for now.


Are those values related to the Average processing rate in a system's Application details page?

My 2 systems.
                                 System1      System2
Average processing rate (GFLOPS)  56.51        71.9  
Avg Turnaround time (Days)         1.35         1.47 
Approx WUs/hour                    3.086        2.834



System1 produces more work per hour, yet its APR is significantly less.
Grant
Darwin NT
ID: 1795532 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6324
Credit: 106,370,077
RAC: 121
Russia
Message 1795572 - Posted: 12 Jun 2016, 8:45:34 UTC - in response to Message 1795504.  
Last modified: 12 Jun 2016, 9:17:20 UTC

.

The slowdown in low ar comes from having a (only) one full pot to process.
That kind of makes all work to go to one SM/SMX unit.

More precisely, as I explained in a post on Lunatics (sadly the site is down now), there are 8 independent arrays to search: 8 PoT arrays. Depending on the workgroup size you choose, all of them can indeed go to a single CU (in OpenCL terms), which corresponds to an SM/SMX in NV/CUDA terms. One can artificially distribute them across 8 different CUs with appropriate workgroup limits, but this will of course underload each CU.
Another way is to unroll some of the periods to provide more data to process in parallel.


My errors are probably from calculating the average from an artificially shortened pot. A pre-calculated avg is something I'll try tomorrow or in the next days,

I do avg pre-calculation in Triplet search. It looked like a good idea before. Now, if full decoupling of the signal search is needed, it adds one more dependency to get rid of. But because it's not the only obstacle to full PoT search on the GPU, I haven't touched it yet.


then getting rid of the loop going from LastP to FirstP and replacing that with grid.z and a parameter.

I'm not sure it's really possible. Such an unroll needs memory to hold the arrays.
Initially I did a fixed-size x32 unroll for periods. Currently it is configurable, but still much smaller than the total number of periods. The maximum total number of periods can be estimated as 2/3*(1024*1024/8). One needs a corresponding amount of memory to hold that number of separate (though slightly shortened on the first iteration) arrays.
Maybe doable with 4GB GPUs? Worth calculating.
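
A rough way to run that calculation, as a Python sketch under loudly stated assumptions: 1024*1024/8 is taken to be the length of one PoT array, every unrolled array is kept at that full length (so this is a pessimistic upper bound, since the arrays are actually somewhat shortened), and each element is a 4-byte float. None of these sizes are confirmed by the app's code.

```python
def max_periods(pot_len):
    """Estimate from the post: 2/3 of the PoT length (integer part)."""
    return (2 * pot_len) // 3

def unroll_bytes(pot_len, bytes_per_element=4):
    """Upper bound: one full-length array per period (assumption, not measured)."""
    return max_periods(pot_len) * pot_len * bytes_per_element

POT_LEN = 1024 * 1024 // 8          # 131072 elements, assumed length of one PoT array
periods = max_periods(POT_LEN)      # 87381
gib = unroll_bytes(POT_LEN) / 2**30 # bytes converted to GiB

print(periods, round(gib, 1))
```

Under these pessimistic assumptions the bound comes out far above 4 GiB; how close the real footprint gets to it depends on how much the arrays shrink, which this sketch does not model.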


Earlier I tried with 8 streams and other stuff to make more work parallel. It is hard to keep track of found results and to report them in the same order as the CPU version does. You have figured out a way to do that with SoG!!

A few queues (again, an OpenCL queue corresponds to a CUDA stream) per single PoT search looks like an increase in overhead. One could try a few PoT searches in separate queues (not too big an increase in memory footprint; partially implemented) or even a few icfft iterations simultaneously (that would be quite a big rework of the existing code and, unfortunately, a sharp increase in memory footprint).
Regarding the particular signal order: for a non-overflowed task it's irrelevant as long as there are no false positives/negatives. For an overflowed task it is a real issue. Ironically, those are the "noisy" ones that will need separate treatment at the postprocessing stage anyway. I decided to sacrifice absolute signal ordering and just try to keep the differences as small as possible, to reduce the number of mismatched overflows. Also, it seems the ordering issue existed even in the original CUDA code (though to quite a small degree). So I would recommend concentrating on false positives/negatives more than on signal ordering.
EDIT: BTW, this differs quite a lot from the AstroPulse situation, where signals (for example, in FFA) are updated by a proximity-establishing algorithm. There, if such signal updates come in the wrong order, even a non-overflow task will have wrong final signals. It took a lot of time and some tricks in the code to keep all signals found in parallel in order. One of the released builds even grew to some huge memory sizes because of this.
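
Why ordering only bites on overflowed tasks can be shown with a toy Python model. Everything here is hypothetical: the signal tuples, the limit of 30 reported signals, and the exact-equality comparison (a real validator would compare with tolerances) are illustration only.

```python
from collections import Counter

MAX_SIGNALS = 30  # hypothetical overflow limit, for illustration only

def reported(signals, limit=MAX_SIGNALS):
    """Only the first `limit` signals found are reported; the rest are cut off."""
    return signals[:limit]

def same_result(a, b):
    """Compare two reported signal lists while ignoring their order."""
    return Counter(reported(a)) == Counter(reported(b))

cpu_order = [("pulse", i) for i in range(40)]  # signals in CPU discovery order
gpu_order = list(reversed(cpu_order))          # same signals, found in another order

# 20 signals, below the limit: same set either way, so ordering is irrelevant.
print(same_result(cpu_order[:20], gpu_order[-20:]))
# 40 signals, above the limit: each side truncates a different subset.
print(same_result(cpu_order, gpu_order))
```

The first comparison matches and the second does not, which is the mismatched-overflow case described above.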
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1795572 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14532
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1795584 - Posted: 12 Jun 2016, 10:23:22 UTC - in response to Message 1795483.  

Unfortunately there is no other tool for such task.
Both plans define exactly the same subset of hosts. So, bigger number will represent domination of one of apps in that subset. And app can dominate either because it's faster or because of BOINC's imperfection (wrong best selection). That's noise I expect to be.

Having thought about it overnight, I think I can lower my flag of caution.

The GFlops figures on the applications page are - as individual numbers - well dodgy, but the relative numbers for two applications deployed on the same day (and as you say, for the same subset of hosts) should be a useful comparator.

The alarm bells that went off in my head last night related more to the other recent stats pages (CPU models, GPU models): it was the CPU list which started out with some very bad maths, but we got that corrected. Also, I think it's the cpu/gpu lists which must include anonymous platform data: I don't see how that could be done for the applications page. Sorry about the red herring.
ID: 1795584 · Report as offensive
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1795776 - Posted: 12 Jun 2016, 20:51:27 UTC - in response to Message 1795572.  

Thank You for a detailed explanation. I'll read it again and again at least three times or until I get all that is in it. Thank You.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1795776 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1795782 - Posted: 12 Jun 2016, 21:14:43 UTC - in response to Message 1795776.  

I found that the GUPPIs speed up quite a bit if you throw registers at them.
Note this task was run with maxrregcount=32; http://setiathome.berkeley.edu/result.php?resultid=4979667790
Run time: 22 min 36 sec
CPU time: 22 min 19 sec
This task was run with the App set to maxrregcount=128; http://setiathome.berkeley.edu/result.php?resultid=4979681158
Run time: 8 min 22 sec
CPU time: 8 min 14 sec
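
For a quick sense of scale, the two run times above can be turned into a speedup with a few lines of Python. This is just arithmetic on the two reported results; it assumes the two tasks are comparable, which two single runs cannot prove.

```python
def to_seconds(minutes, seconds):
    """Convert a 'min sec' run time to seconds."""
    return minutes * 60 + seconds

run_32 = to_seconds(22, 36)   # maxrregcount=32 run time
run_128 = to_seconds(8, 22)   # maxrregcount=128 run time

speedup = run_32 / run_128
print(f"{speedup:.2f}x")
```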
ID: 1795782 · Report as offensive


 