Cool VL Viewer forum
http://sldev.free.fr/forum/

Some more AVX/SSE2 for llface.cpp
http://sldev.free.fr/forum/viewtopic.php?f=10&t=2149

Author:  kathrine [ 2021-03-23 01:12:36 ]
Post subject:  Some more AVX/SSE2 for llface.cpp

Hi Henri,

the Visual Studio Profiler showed that llface.cpp was still a little hot, so I added some more AVX2/SSE2 code.
It is mostly identical to the previous one, just for the planar projection case.

Kathrine

P.S. I also looked at ll_memcpy_nonaliased_aligned_16() in llmemory.h and wondered whether it is still any faster than the VS2017 memcpy() or the one in recent glibc, especially when AVX2 is available with the streaming instructions. It would probably be worth just throwing it out, replacing it with memcpy(), and seeing what happens. It did not look any worse in the profiler on Windows.

Attachments:
face.patch.gz [1.49 KiB]

Author:  Henri Beauchamp [ 2021-03-23 13:40:57 ]
Post subject:  Re: Some more AVX/SSE2 for llface.cpp

kathrine wrote:
    the Visual Studio Profiler showed that llface.cpp was still a little hot, so I added some more AVX2/SSE2 code.
    It is mostly identical to the previous one, just for the planar projection case.

Thank you! Added to the next release. :D

Quote:
    P.S. I also looked at ll_memcpy_nonaliased_aligned_16() in llmemory.h and wondered whether it is still any faster than the VS2017 memcpy() or the one in recent glibc, especially when AVX2 is available with the streaming instructions. It would probably be worth just throwing it out, replacing it with memcpy(), and seeing what happens. It did not look any worse in the profiler on Windows.

Replacing it with memcpy() seems to make the code slightly slower here (glibc v2.31 with AVX2 optimizations on): 720-730 fps against 730-740 fps in my skybox. Besides, I must ensure the code runs well when not optimized for AVX (the official builds are SSE2-only), *and* not everyone has a glibc with proper optimizations, *and* using AVX2 optimizations in glibc could well handicap Intel users whose CPUs (e.g. Coffee Lake, when the appropriate BIOS override for the "AVX offset" is not set) lower their operating frequency as soon as a single AVX instruction gets executed. Until LL's ll_memcpy_nonaliased_aligned_16() is proven slower than memcpy(), I'll keep it.
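
For readers unfamiliar with this kind of routine, here is a minimal sketch of a 16-byte-aligned, non-overlapping copy written with SSE2 intrinsics. It is not the viewer's ll_memcpy_nonaliased_aligned_16(): the function name copy_aligned_16_sketch(), the macro SKETCH_COPY_HAVE_SSE2 and the stated preconditions are assumptions made for illustration only. On targets without SSE2 it simply falls back to memcpy().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define SKETCH_COPY_HAVE_SSE2 1
#else
#define SKETCH_COPY_HAVE_SSE2 0
#endif

// Hypothetical stand-in for an aligned, non-aliased copy routine.
// Assumed preconditions: dst and src are 16-byte aligned, the two
// buffers do not overlap, and bytes is a multiple of 16.
inline void copy_aligned_16_sketch(void* dst, const void* src, std::size_t bytes)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    assert(bytes % 16 == 0);
#if SKETCH_COPY_HAVE_SSE2
    const __m128i* s = static_cast<const __m128i*>(src);
    __m128i* d = static_cast<__m128i*>(dst);
    // One aligned 128-bit load and one aligned 128-bit store per iteration.
    for (std::size_t i = 0, n = bytes / 16; i < n; ++i)
    {
        _mm_store_si128(d + i, _mm_load_si128(s + i));
    }
#else
    // No SSE2 available: let the C library do the work.
    std::memcpy(dst, src, bytes);
#endif
}
```

Timing such a routine against memcpy() on the buffer sizes the viewer actually copies would be needed to decide which one wins on a given CPU and C library.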

Author:  kathrine [ 2021-03-23 18:24:15 ]
Post subject:  Re: Some more AVX/SSE2 for llface.cpp

Sounds good.

Your frame rates are about 10x as high as mine (AMD Vega 56, Ryzen 2700X, ..., crappy OpenGL on Windows; I never get above 90 fps, with EEP and shadows disabled), so you can probably see the memcpy() difference better. I might give it a try with an AVX2 version using non-cached streaming stores, but my first attempt wasn't really worth it. According to my profile, it was just 0.2% of CPU time anyway, even with memcpy().

Kathrine
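
To illustrate what "non-cached streaming stores" means here, the following is a minimal sketch of a copy loop that writes with non-temporal stores. It deliberately uses the 128-bit SSE2 intrinsic _mm_stream_si128() instead of AVX2 so that it builds without extra compiler flags; the 256-bit counterpart would be _mm256_stream_si256(), which needs 32-byte alignment. The function name copy_stream_16_sketch(), the macro SKETCH_STREAM_HAVE_SSE2 and the preconditions are assumptions for illustration, not code from the patch or the viewer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in _mm_sfence)
#define SKETCH_STREAM_HAVE_SSE2 1
#else
#define SKETCH_STREAM_HAVE_SSE2 0
#endif

// Hypothetical copy using non-temporal (cache-bypassing) stores.
// Assumed preconditions: dst and src are 16-byte aligned, the two
// buffers do not overlap, and bytes is a multiple of 16.
inline void copy_stream_16_sketch(void* dst, const void* src, std::size_t bytes)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    assert(bytes % 16 == 0);
#if SKETCH_STREAM_HAVE_SSE2
    const __m128i* s = static_cast<const __m128i*>(src);
    __m128i* d = static_cast<__m128i*>(dst);
    for (std::size_t i = 0, n = bytes / 16; i < n; ++i)
    {
        // Regular aligned load, non-temporal store that avoids
        // filling the cache with the destination data.
        _mm_stream_si128(d + i, _mm_load_si128(s + i));
    }
    // Order the streaming stores before any later stores.
    _mm_sfence();
#else
    // No SSE2 available: fall back to the C library.
    std::memcpy(dst, src, bytes);
#endif
}
```

Streaming stores tend to help only for large buffers that are not read back soon afterwards, which may be one reason a first attempt shows little gain.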

All times are UTC