AVX Extensions - Ongoing development?

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1087684 - Posted: 17 Mar 2011, 5:14:44 UTC - in response to Message 1087597.  

Blocks in green are the 'done stuff'


Argh......feelin' stupid now.......
That's why you are a developer and I'm just a dumb-ass cruncher......


At least you didn't mistakenly depict yourself as a hatless smurf ::/
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1087684
aad

Joined: 3 Apr 99
Posts: 101
Credit: 204,131,099
RAC: 26
Netherlands
Message 1087822 - Posted: 17 Mar 2011, 17:40:03 UTC - in response to Message 1087684.  
Last modified: 17 Mar 2011, 17:42:08 UTC

At least you didn't mistakenly depict yourself as a hatless smurf ::/


No.....headless... or maybe clueless....... ;-)

ID: 1087822
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1087859 - Posted: 17 Mar 2011, 20:23:37 UTC

Here's the link from Crunch3r: http://board.mpits.net/viewtopic.php?f=19&t=47.
He can't post it himself because he doesn't have the required RAC, but he asked me to add this comment to the link:
"
It's completely untested (I don't have a Sandy Bridge CPU), so there's no guarantee that it will work at all.

It would be best if someone could try it in standalone mode with a short WU.
"
ID: 1087859
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1087863 - Posted: 17 Mar 2011, 20:50:40 UTC - in response to Message 1087859.  
Last modified: 17 Mar 2011, 20:58:45 UTC

Here's the link from Crunch3r: http://board.mpits.net/viewtopic.php?f=19&t=47.
He can't post it himself because he doesn't have the required RAC, but he asked me to add this comment to the link:
"
It's completely untested (I don't have a Sandy Bridge CPU), so there's no guarantee that it will work at all.

It would be best if someone could try it in standalone mode with a short WU.
"


Nice information about the new Intel compiler features too :). Sadly I don't have that one, so it rules me out from production releases using that one for now.

Hope it works, as it'll save a lot of hand effort.

Jason

[Edit:] Oh wait, the updates for my existing licences ARE there. Just goes to show I should check there more often.
ID: 1087863
hbomber
Volunteer tester

Joined: 2 May 01
Posts: 437
Credit: 50,852,854
RAC: 0
Bulgaria
Message 1087920 - Posted: 18 Mar 2011, 0:46:14 UTC

I'll definitely give it a try tomorrow. I have large sets of WUs (4000+) saved for such purposes.
ID: 1087920
hbomber
Volunteer tester

Joined: 2 May 01
Posts: 437
Credit: 50,852,854
RAC: 0
Bulgaria
Message 1088764 - Posted: 20 Mar 2011, 9:51:07 UTC

Disappointing. Twice as slow as SSE4.1.
ID: 1088764
-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1088779 - Posted: 20 Mar 2011, 10:46:11 UTC - in response to Message 1088764.  

Disappointing. Twice as slow as SSE4.1.


Hmm, well, that's disappointing. It's just getting started though; it will get better.
Traveling through space at ~67,000mph!
ID: 1088779
hbomber
Volunteer tester

Joined: 2 May 01
Posts: 437
Credit: 50,852,854
RAC: 0
Bulgaria
Message 1088803 - Posted: 20 Mar 2011, 14:43:13 UTC

Yeah, me too. In the beginning I got excited looking at the numbers speeding up like it was a GPU, but it didn't last long, because they turned out to be shorties (AR 2.xxx to 3.xxx). WUs finished in 25 minutes with the AVX binary and in 15-16 minutes with the SSE4.1 binary (2500K, running at 4600 MHz at the time).
Although, it was quite obvious from the beginning: the processor cores barely reached 60 deg. C, while under normal operating conditions with SSE4.1 they are close to, but below, 65 degrees. Not generating heat generally means not much load is being put on the CPU.
I guess screenshots are not necessary. They remained on the test HDD, which is out of the box now.
ID: 1088803
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1088807 - Posted: 20 Mar 2011, 15:16:33 UTC - in response to Message 1088779.  
Last modified: 20 Mar 2011, 15:17:19 UTC

Disappointing. Twice as slow as SSE4.1.


Hmm, well, that's disappointing. It's just getting started though; it will get better.


Oh dear,
Well, it was a long shot, and I'm glad Crunch3r's had a first crack at it. I'll continue plugging away at my end on the larger picture. I have some Cuda & Stock V7 tasks to take care of before frying my brain re-engineering & hand-rewriting Alex's pulsefinding for AVX. At least it's very good confirmation that this part is likely to warrant extra effort down the road.

Has anyone let Crunch3r know, in case he has some more tricks up his sleeve? Both the SSSE3x & SSE4.1 builds were highly optimised with Intel's tools on top of hand vectorisation, so it's possible he hasn't done that part yet & just wants to know if it produces valid results.

Jason
ID: 1088807
hbomber
Volunteer tester

Joined: 2 May 01
Posts: 437
Credit: 50,852,854
RAC: 0
Bulgaria
Message 1088811 - Posted: 20 Mar 2011, 15:37:55 UTC

I ran the VS2010 executable; the VS2005 one wasn't able to run at all. I installed the VS2005 redistributable package, but I'm still getting an error stating that the WU files cannot be found.
It would be nice if someone else could confirm my observations and perhaps run the VS2005 executable.
ID: 1088811
Profile Joel

Joined: 31 Oct 08
Posts: 104
Credit: 4,838,348
RAC: 13
United States
Message 1089729 - Posted: 23 Mar 2011, 20:41:05 UTC
Last modified: 23 Mar 2011, 20:50:42 UTC

Jason, thanks for the roadmap. Great to know what you guys are up to. I'd help if I could, but it'll be a few years before I manage to squeeze in enough study of computer science to be of much use. My coding skills are too narrowly focused on statistics in my field to help in this arena.

We know what green, yellow and orange are on the chart, but could you maybe elucidate a little bit what the grey bracket pointing to Urs indicates (Urs, feel free to jump in if you are reading ;)) ? Is the AVX/v7/CUDA optimization for Linux/Mac a "longer term" goal? Dare I even ask about CUDA on Mac?

Thanks, and good luck
ID: 1089729
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1090014 - Posted: 24 Mar 2011, 16:28:30 UTC - in response to Message 1089729.  
Last modified: 24 Mar 2011, 16:29:02 UTC

Jason, thanks for the roadmap. Great to know what you guys are up to. I'd help if I could, but it'll be a few years before I manage to squeeze in enough study of computer science to be of much use. My coding skills are too narrowly focused on statistics in my field to help in this arena.

We know what green, yellow and orange are on the chart, but could you maybe elucidate a little bit what the grey bracket pointing to Urs indicates (Urs, feel free to jump in if you are reading ;)) ? Is the AVX/v7/CUDA optimization for Linux/Mac a "longer term" goal? Dare I even ask about CUDA on Mac?

Thanks, and good luck


Thanks Joel,
Thankfully I've had some help on the Linux/Cuda side recently, which has accelerated both refinement of the Cuda application a bit, as well as getting the already developed optimisations in (Thanks Aaron!).

The interrelationships between development are a bit more complex than can be shown on a simple roadmap, but it does try to capture them in a very general way.

It isn't a linear process, but in general:
'Opt1' Cuda refinements (Unit tests, done & closed) -> V7 work (Stock CPU) -> Cuda Pulsefinding (Fixes VLAR, 'Opt2' under commencement) -> AVX Pulsefinding ( Refactoring AK code )

V7 transition fits in there as well, and each of the processes feeds the others. With the awesome work from the Linux guys going on, Urs & Aaron, I've no doubt that Linux (& possibly Mac) enhanced v7 optimised clients, along with Cuda enhanced V7, would be available if not at the same time as Windows clients, possibly even intentionally earlier to sandbox potential flaws to a relatively skilled subset of users. That sort of situation would be justified by the level of prototype code (i.e. unproven, unlike original AKv8), and allow tight control.

Jason
ID: 1090014
Profile Crunch3r
Volunteer tester
Joined: 15 Apr 99
Posts: 1546
Credit: 3,438,823
RAC: 0
Germany
Message 1094958 - Posted: 8 Apr 2011, 23:47:25 UTC - in response to Message 1090014.  

So since the last "test" was a complete failure and no one ran it in standalone mode to get some reasonable numbers...

I guess it's safe to assume that Intel's claim that:

- Automatically converts SSE intrinsics to AVX-128
- Automatically converts SSE inline assembly to AVX-128
- Most apps written with intrinsics need only recompile.
- There is a straight forward porting of existing Intel SSE to Intel AVX 256 with Intel libraries, Intel® Integrated Performance Primitives (Intel® IPP), etc.


Is not true after all...


Join BOINC United now!
ID: 1094958
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1095113 - Posted: 9 Apr 2011, 3:48:45 UTC - in response to Message 1094958.  

So since the last "test" was a complete failure and no one ran it in standalone mode to get some reasonable numbers...

I guess it's safe to assume that Intel's claim that:

- Automatically converts SSE intrinsics to AVX-128
- Automatically converts SSE inline assembly to AVX-128
- Most apps written with intrinsics need only recompile.
- There is a straight forward porting of existing Intel SSE to Intel AVX 256 with Intel libraries, Intel® Integrated Performance Primitives (Intel® IPP), etc.


Is not true after all...


Hiya Crunch3r,
Wish I had a SB myself to help out with that. I had to read the claims over and over to realise they were mostly talking about changing 128 bit SSE instructions to 128 bit AVX instructions. The more I look, the more I'm convinced Alex's pulsefinding will need considerable rewriting (by hand) to take advantage of the 256 bit AVX ones.
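
A minimal sketch of the distinction drawn above, not taken from the AK code: the first routine uses 128-bit SSE intrinsics, and rebuilding it with AVX code generation enabled would, per the quoted claims, only re-encode it as AVX-128 (still 4 floats per step). Getting 8 floats per step needs the separately written 256-bit routine. The names `sum_sq_sse`, `sum_sq_avx256` and `sum_sq` are hypothetical, and GCC or Clang on x86-64 is assumed (for `__attribute__((target("avx")))` and `__builtin_cpu_supports`).

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* 128-bit SSE intrinsics: 4 floats per iteration. Rebuilding this with AVX
   code generation enabled keeps it 4 wide; only the encoding changes. */
static float sum_sq_sse(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(x + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(v, v));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float s = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; ++i)          /* scalar tail */
        s += x[i] * x[i];
    return s;
}

/* Hand-written 256-bit AVX: 8 floats per iteration. */
__attribute__((target("avx")))
static float sum_sq_avx256(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(x + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(v, v));
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float s = 0.0f;
    for (int k = 0; k < 8; ++k)
        s += lanes[k];
    for (; i < n; ++i)          /* scalar tail */
        s += x[i] * x[i];
    return s;
}

/* Runtime dispatch so the sketch also runs on pre-AVX hosts. */
static float sum_sq(const float *x, size_t n)
{
    return __builtin_cpu_supports("avx") ? sum_sq_avx256(x, n)
                                         : sum_sq_sse(x, n);
}
```

Calling `sum_sq(x, n)` takes the 256-bit path only when the host reports AVX, so the sketch should also run on pre-Sandy Bridge CPUs.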

Referring to your comments in another thread, yeah those AKv8b builds (minor cosmetic polishing of AKv8) have stood for a long time, mostly due to being reliable & fast on their target architectures. They were heavily polished with PGO at the time, so writing off the auto AVX conversion completely in rebuilds might be a bit premature for the simpler parts of code.

I'm not sure what new techniques Raistmer's trying out with the Ati stuff, but over in my Cuda corner I've got a few tricks up my sleeve to pull out down the line, that may eventually filter through to CPU builds.

As far as I'm concerned, for anyone that wants to contribute effort there's still plenty of work to do, and the newer hardware is going to justify some major changes.

Jason
ID: 1095113
Profile Crunch3r
Volunteer tester
Joined: 15 Apr 99
Posts: 1546
Credit: 3,438,823
RAC: 0
Germany
Message 1095137 - Posted: 9 Apr 2011, 4:25:07 UTC - in response to Message 1095113.  
Last modified: 9 Apr 2011, 4:29:07 UTC


Wish I had a SB myself to help out with that.


Yeah, me too... I guess I need to complain about that to Intel... I need one now!!! (INTEL, my shipping address is already in your DB... ask the guys at http://www.intel.com/performance/!)



I had to read the claims over and over to realise they were mostly talking about changing 128 bit SSE instructions to 128 bit AVX instructions. The more I look, the more I'm convinced Alex's pulsefinding will need considerable rewriting (by hand) to take advantage of the 256 bit AVX ones.


Well, if the compiler is not able to convert the code (as claimed) to 128 bit AVX instructions, then there's no way that a simple recompile will do. Mixing 128 bit instructions with some tuned 256 bit code will make it even worse. At least that's what I've read and heard. So basically all hand vectorized code needs to be rewritten to make use of 256 bit AVX extensions...
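
One commonly cited reason for the mixing penalty mentioned above is the state transition between 256-bit AVX code and legacy-encoded (non-VEX) SSE code on Sandy Bridge; Intel's guidance is to issue VZEROUPPER before leaving 256-bit code. A hedged sketch, assuming GCC or Clang on x86-64; `scale_avx256` and `scale` are hypothetical names, and compilers may already insert the instruction on their own:

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Scale an array in place with 256-bit AVX, then clear the upper halves of
   the YMM registers so that legacy SSE code called afterwards does not pay
   for a state transition. */
__attribute__((target("avx")))
static void scale_avx256(float *x, size_t n, float k)
{
    assert(x != NULL || n == 0);
    const __m256 vk = _mm256_set1_ps(k);
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(x + i, _mm256_mul_ps(_mm256_loadu_ps(x + i), vk));
    for (; i < n; ++i)          /* scalar tail */
        x[i] *= k;
    _mm256_zeroupper();         /* emits VZEROUPPER */
}

/* Dispatcher with a plain C fallback for hosts without AVX. */
static void scale(float *x, size_t n, float k)
{
    if (__builtin_cpu_supports("avx")) {
        scale_avx256(x, n, k);
        return;
    }
    for (size_t i = 0; i < n; ++i)
        x[i] *= k;
}
```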

At least for us, there's a way to emulate 256 bit avx on older CPUs using a special header file (similar to SSEPlus), to verify that the code gives valid results ( this is documented somewhere...)

ID: 1095137
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1095139 - Posted: 9 Apr 2011, 4:28:53 UTC - in response to Message 1095137.  
Last modified: 9 Apr 2011, 5:03:25 UTC

At least for us, there's a way to emulate 256 bit avx on older CPUs using a special header file (similar to SSEPlus), to verify that the code gives valid results ( this is documented somewhere...)


Yeah, I'm sure I read about that with older ICC, haven't come across it with the newer stuff, will keep looking.

[Edit:] Found it. http://software.intel.com/en-us/articles/avx-emulation-header-file
ID: 1095139
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1095559 - Posted: 10 Apr 2011, 3:28:14 UTC - in response to Message 1095139.  

At least for us, there's a way to emulate 256 bit avx on older CPUs using a special header file (similar to SSEPlus), to verify that the code gives valid results ( this is documented somewhere...)


Yeah, I'm sure I read about that with older ICC, haven't come across it with the newer stuff, will keep looking.

[Edit:] Found it. http://software.intel.com/en-us/articles/avx-emulation-header-file

But note the requirement:
SSE4.2 support in your development environment as well as hardware is required in order to use the AVX emulation header file.


OTOH, the SDE (Intel Software Development Emulator) will supposedly work on a P4...
                                                                Joe
ID: 1095559
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1095583 - Posted: 10 Apr 2011, 5:02:55 UTC - in response to Message 1095559.  
Last modified: 10 Apr 2011, 5:08:23 UTC

At least for us, there's a way to emulate 256 bit avx on older CPUs using a special header file (similar to SSEPlus), to verify that the code gives valid results ( this is documented somewhere...)


Yeah, I'm sure I read about that with older ICC, haven't come across it with the newer stuff, will keep looking.

[Edit:] Found it. http://software.intel.com/en-us/articles/avx-emulation-header-file

But note the requirement:
SSE4.2 support in your development environment as well as hardware is required in order to use the AVX emulation header file.


OTOH, the SDE (Intel Software Development Emulator) will supposedly work on a P4...
                                                                Joe


Yeah, was looking at that too. Initially I'd probably just modify the header file to not use any 4.2 (I probably wouldn't need any of the AVX instructions that would require SSE4.2 instructions in their base). The emulator will hopefully be great for seeing if a pure AVX binary will work, though the header file approach would be better for seeing how existing portions translate & perform with 256 bit strides without an emulation layer.

As far as AVX is concerned, I'm still at the tool/technique gathering stage at this time. As far as algorithm understanding goes, I feel that through the Cuda research I've mastered enough of the chirp & power spectrum pipeline to be able to rewrite the first parts of the application from scratch (if I had to, but thankfully don't), but there's still some way to go with the pulsefinding/PoT.

Once the PoT is moulding to whatever I want on Cuda, If alignment specific forms of the pulse folding are still needed for CPU I'd rather write a macro generator to simulate Alex, rather than work them all out with the whisky & whiteboard technique... but that's still an option if needed.
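
As a rough illustration of the macro generator idea, and only that: the sketch below assumes a fold step reduces to adding two rows of the PoT array, and generates one variant per load flavour for the second row. `DEFINE_FOLD2` and the generated function names are hypothetical and are not Alex's routines; GCC or Clang on x86-64 is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical generator: emits one two-row fold (out = a + b) per load
   flavour used for the second row. */
#define DEFINE_FOLD2(NAME, LOAD_B)                                         \
    static void NAME(const float *a, const float *b, float *out, size_t n) \
    {                                                                      \
        assert(a != NULL && b != NULL && out != NULL);                     \
        size_t i = 0;                                                      \
        for (; i + 4 <= n; i += 4)                                         \
            _mm_storeu_ps(out + i,                                         \
                          _mm_add_ps(_mm_loadu_ps(a + i), LOAD_B(b + i))); \
        for (; i < n; ++i) /* scalar tail */                               \
            out[i] = a[i] + b[i];                                          \
    }

/* Second row known to be 16-byte aligned. */
DEFINE_FOLD2(fold2_b_aligned, _mm_load_ps)
/* Second row at an arbitrary offset. */
DEFINE_FOLD2(fold2_b_unaligned, _mm_loadu_ps)
```

The aligned variant requires `b` to be 16-byte aligned; the unaligned variant accepts any offset, which is the case that arises when the fold period is not a multiple of 4 floats.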

Jason
ID: 1095583
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1101584 - Posted: 29 Apr 2011, 3:05:24 UTC

First steps toward some usable AVX code, see my post at Lunatics. It's a quick test which I hope will have some positive outcome on Win7 SP1 hosts with AVX capable CPU. If so, it shouldn't be difficult to port the test to Linux or go directly to including the functions in builds for S@H v7 or Enhanced.
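
For a test like this to take the AVX path only on capable hosts, a build needs some runtime check. A minimal sketch, assuming GCC or Clang on x86 (`__builtin_cpu_supports` is their builtin; as far as I know it also accounts for OS support for the wider registers, but that is worth verifying per toolchain). The enum and function names are hypothetical and not from the Lunatics test:

```c
#include <assert.h>

/* Hypothetical code paths, ordered from narrowest to widest. */
enum code_path { PATH_GENERIC, PATH_SSSE3, PATH_SSE41, PATH_AVX };

/* Pick the widest instruction set the host reports. */
static enum code_path pick_code_path(void)
{
    if (__builtin_cpu_supports("avx"))
        return PATH_AVX;
    if (__builtin_cpu_supports("sse4.1"))
        return PATH_SSE41;
    if (__builtin_cpu_supports("ssse3"))
        return PATH_SSSE3;
    return PATH_GENERIC;
}

/* Human-readable name, e.g. for a line in the task's stderr output. */
static const char *code_path_name(enum code_path p)
{
    switch (p) {
    case PATH_AVX:     return "avx";
    case PATH_SSE41:   return "sse4.1";
    case PATH_SSSE3:   return "ssse3";
    case PATH_GENERIC: return "generic";
    }
    assert(!"unknown code path");
    return "unknown";
}
```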
                                                                   Joe
ID: 1101584
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1102351 - Posted: 1 May 2011, 4:08:52 UTC

My thanks to Dave for running the test on an AVX capable system, there's perhaps some progress. I've attached a revised test to http://lunatics.kwsn.net/1-discussion-forum/avx-optimized-app-development.msg37370.html#msg37370.
                                                                 Joe
ID: 1102351