AVX Extensions - Ongoing development?
Author | Message |
---|---|
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Blocks in green are the 'done stuff' At least you didn't mistakenly depict yourself as a hatless smurf ::/ "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
aad Joined: 3 Apr 99 Posts: 101 Credit: 204,131,099 RAC: 26 |
"At least you didn't mistakenly depict yourself as a hatless smurf ::/" No.....headless... or maybe clueless....... ;-) |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Here's the link from Crunch3r: http://board.mpits.net/viewtopic.php?f=19&t=47. He can't post it himself because he doesn't have the required RAC, but he asked me to add this comment to the link: "It's completely untested (I don't have a Sandy Bridge CPU), so no guarantee that it will work at all." |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"Here the link from Crunch3r: http://board.mpits.net/viewtopic.php?f=19&t=47." Nice information about the new Intel compiler features too :). Sadly I don't have that one, so it rules me out from production releases using it for now. Hope it works, as it'll save a lot of hand effort. Jason [Edit:] Oh wait, the updates for my existing licences ARE there. Just goes to show I should check there more often. |
hbomber Joined: 2 May 01 Posts: 437 Credit: 50,852,854 RAC: 0 |
I'll definitely give it a try tomorrow. I have large sets of WUs (4000+) saved for such purposes. |
hbomber Joined: 2 May 01 Posts: 437 Credit: 50,852,854 RAC: 0 |
Disappointing. Twice slower than SSE4.1. |
-BeNt- Joined: 17 Oct 99 Posts: 1234 Credit: 10,116,112 RAC: 0 |
"Disappointing. Twice slower than SSE4.1." Hmm, well, that's disappointing. It's just getting started, though; it will get better. Traveling through space at ~67,000mph! |
hbomber Joined: 2 May 01 Posts: 437 Credit: 50,852,854 RAC: 0 |
Yeah, me too. In the beginning I got excited watching the numbers speed up like it was a GPU, but that didn't last long, because they turned out to be shorties (2.xxx - 3.xxx AR). WUs finished in 25 minutes with the AVX binary and in 15-16 minutes with the SSE4.1 binary (2500K, running at 4600 MHz at the time). It was quite obvious from the beginning, though: the processor cores barely reached 60 deg. C, while under normal operating conditions with SSE4.1 they sit close to, but below, 65 degrees. Not generating heat generally means not much load is being put on the CPU. I guess screenshots are not necessary. They remained on the test HDD, which is out of the box now. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"Disappointing. Twice slower than SSE4.1." Oh dear. Well, it was a long shot, and I'm glad Crunch3r's had a first crack at it. I'll continue plugging away at my end on the larger picture. I have some Cuda & Stock V7 tasks to take care of before frying my brain re-engineering & hand-rewriting Alex's pulsefinding for AVX. At least it's very good confirmation that this part is likely to warrant extra effort down the road. Has anyone let Crunch3r know, in case he has some more tricks up his sleeve? Both ssse3x & sse4.1 builds were highly optimised with Intel's tools on top of hand vectorisation, so it's possible he hasn't done that part yet & just wants to know if it produces valid results. Jason |
hbomber Joined: 2 May 01 Posts: 437 Credit: 50,852,854 RAC: 0 |
I ran the VS2010 executable; the VS2005 one wouldn't run at all. I installed the VS2005 redistributable package, but I still get an error stating that the WU files cannot be found. It would be nice if someone else could confirm my observations and perhaps run the VS2005 executable. |
Joel Joined: 31 Oct 08 Posts: 104 Credit: 4,838,348 RAC: 13 |
Jason, thanks for the roadmap. Great to know what you guys are up to. I'd help if I could, but it'll be a few years before I manage to squeeze in enough study of computer science to be of much use. My coding skills are too narrowly focused on statistics in my field to help in this arena. We know what green, yellow and orange are on the chart, but could you maybe elucidate a little bit what the grey bracket pointing to Urs indicates (Urs, feel free to jump in if you are reading ;)) ? Is the AVX/v7/CUDA optimization for Linux/Mac a "longer term" goal? Dare I even ask about CUDA on Mac? Thanks, and good luck |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"Jason, thanks for the roadmap. Great to know what you guys are up to. I'd help if I could, but it'll be a few years before I manage to squeeze in enough study of computer science to be of much use. My coding skills are too narrowly focused on statistics in my field to help in this arena." Thanks Joel. Thankfully I've had some help on the Linux/Cuda side recently, which has accelerated both refinement of the Cuda application a bit, as well as getting the already developed optimisations in (Thanks Aaron!). The interrelationships between development are a bit more complex than can be shown on a simple roadmap, but it does try to capture them in a very general way. It isn't a linear process, but in general: 'Opt1' Cuda refinements (Unit tests, done & closed) -> V7 work (Stock CPU) -> Cuda Pulsefinding (Fixes VLAR, 'Opt2' under commencement) -> AVX Pulsefinding (Refactoring AK code). V7 transition fits in there as well, and each of the processes feeds the others. With the awesome work from the Linux guys going on, Urs & Aaron, I've no doubt that Linux (& possibly Mac) enhanced v7 optimised clients, along with Cuda enhanced V7, would be available if not at the same time as Windows clients, possibly even intentionally earlier, to sandbox potential flaws to a relatively skilled subset of users. That sort of situation would be justified by the level of prototype code (i.e. unproven, unlike original AKv8), and would allow tight control. Jason |
Crunch3r Joined: 15 Apr 99 Posts: 1546 Credit: 3,438,823 RAC: 0 |
So since the last "test" was a complete failure and no one ran it in standalone mode to get some reasonable numbers... I guess it's safe to assume that Intel's claim that the compiler "Automatically converts SSE intrinsics to AVX-128" is not true after all... Join BOINC United now! |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"So since the last 'test' was a complete failure and no one ran it in standalone mode to get some reasonable numbers..." Hiya Crunch3r, wish I had a SB myself to help out with that. I had to read the claims over and over to realise they were mostly talking about changing 128 bit SSE instructions to 128 bit AVX instructions. The more I look, the more I'm convinced Alex's pulsefinding will need considerable rewriting (by hand) to take advantage of the 256 bit AVX ones. Referring to your comments in another thread, yeah, those AKv8b builds (minor cosmetic polishing of AKv8) have stood for a long time, mostly due to being reliable & fast on their target architectures. They were heavily polished with PGO at the time, so writing off the auto AVX conversion completely in rebuilds might be a bit premature for the simpler parts of code. I'm not sure what new techniques Raistmer's trying out with the Ati stuff, but over in my Cuda corner I've got a few tricks up my sleeve to pull out down the line, that may eventually filter through to CPU builds. As far as I'm concerned, for anyone who wants to contribute effort there's still plenty of work to do, and the newer hardware is going to justify some major changes. Jason |
Crunch3r Joined: 15 Apr 99 Posts: 1546 Credit: 3,438,823 RAC: 0 |
Yeah, me too... I guess I need to complain about that to Intel... I need one now!!! (INTEL, my shipping address is already in your DB... ask the guys at http://www.intel.com/performance/!) Well, if the compiler is not able to convert the code (as claimed) to 128 bit AVX instructions, then there's no way that a simple recompile will do. Mixing 128 bit instructions with some tuned 256 bit code will make it even worse. At least that's what I've read and heard. So basically all hand vectorized code needs to be rewritten to make use of 256 bit AVX extensions... At least for us, there's a way to emulate 256 bit AVX on older CPUs using a special header file (similar to SSEPlus), to verify that the code gives valid results (this is documented somewhere...) |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"At least for us, there's a way to emulate 256 bit avx on older CPUs using a special header file (similar to SSEPlus), to verify that the code gives valid results ( this is documented somewhere...)" Yeah, I'm sure I read about that with older ICC. I haven't come across it with the newer stuff, but will keep looking. [Edit:] Found it. http://software.intel.com/en-us/articles/avx-emulation-header-file |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
"At least for us, there's a way to emulate 256 bit avx on older CPUs using a special header file (similar to SSEPlus), to verify that the code gives valid results ( this is documented somewhere...)" But note the requirement: SSE4.2 support in your development environment as well as hardware is required in order to use the AVX emulation header file. OTOH, the SDE (Intel Software Development Emulator) will supposedly work on a P4... Joe |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"At least for us, there's a way to emulate 256 bit avx on older CPUs using a special header file (similar to SSEPlus), to verify that the code gives valid results ( this is documented somewhere...)" Yeah, was looking at that too. Initially I'd probably just modify the header file to not use any 4.2 (I probably wouldn't need any of the AVX instructions that would require SSE4.2 instructions in their base). The emulator will hopefully be great for seeing if a pure AVX binary will work, though the header file approach would be better for seeing how existing portions translate & perform to 256 bit strides without an emulation layer. As far as AVX is concerned, I'm still at tool/technique gathering at this time. As far as algorithm understanding goes, I feel that through the Cuda research I've mastered enough of the chirp & power spectrum pipeline to be able to rewrite the first parts of the application from scratch (if I had to, but thankfully don't), but there's still some way to go with the pulsefinding/PoT. Once the PoT is moulding to whatever I want on Cuda, if alignment specific forms of the pulse folding are still needed for CPU, I'd rather write a macro generator to simulate Alex than work them all out with the whisky & whiteboard technique... but that's still an option if needed. Jason |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
First steps toward some usable AVX code, see my post at Lunatics. It's a quick test which I hope will have some positive outcome on Win7 SP1 hosts with AVX capable CPU. If so, it shouldn't be difficult to port the test to Linux or go directly to including the functions in builds for S@H v7 or Enhanced. Joe |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
My thanks to Dave for running the test on an AVX capable system; there's perhaps some progress. I've attached a revised test to http://lunatics.kwsn.net/1-discussion-forum/avx-optimized-app-development.msg37370.html#msg37370. Joe |
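
Editor's illustration of a point made earlier in the thread: code hand-vectorized for 128 bit registers (4 floats per step) does not become 256 bit code (8 floats per step) through a recompile, because the loop stride, the number of whole vectors, and the leftover tail all change. The sketch below is plain Python that only imitates lane-wise accumulation; it is not SETI@home or AKv8 code, it uses no real SSE/AVX intrinsics, and the names `lane_sum` and `blocks_and_tail` are hypothetical. It also mirrors the validation idea mentioned above: run the same data through both widths and check that the results agree.

```python
def blocks_and_tail(n, lanes):
    """Return (whole vector blocks, leftover elements) for n items at a given lane width."""
    return n // lanes, n % lanes


def lane_sum(values, lanes):
    """Sum `values` with `lanes` independent accumulators, the way a SIMD loop would."""
    acc = [0.0] * lanes
    full = len(values) - len(values) % lanes  # elements covered by whole vectors
    for start in range(0, full, lanes):       # one iteration per "vector load"
        for lane in range(lanes):
            acc[lane] += values[start + lane]
    tail = sum(values[full:])                  # scalar cleanup for the remainder
    return sum(acc) + tail                     # horizontal add across lanes


if __name__ == "__main__":
    data = [float(i) for i in range(19)]
    for width in (4, 8):                       # 4 ~ 128 bit floats, 8 ~ 256 bit floats
        print(width, blocks_and_tail(len(data), width), lane_sum(data, width))
```

The test data are small whole numbers so both widths give exactly the same sum; with real signal data the different accumulation order can change the last bits of a floating-point result, which is one reason a rewritten kernel would need checking against a reference rather than being assumed equivalent.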