Cool VL Viewer forum - View topic - Some more AVX/SSE2 for llface.cpp


kathrine

Joined: 2011-10-07 10:39:20
Posts: 215

Some more AVX/SSE2 for llface.cpp

Hi Henri,

The Visual Studio Profiler showed that llface.cpp was still a little hot, so I added some more AVX2/SSE2 code.
It is mostly identical to the previous one, just for the planar projection case.

Kathrine

P.S. I also looked at ll_memcpy_nonaliased_aligned_16() in llmemory.h and wondered whether it is still any faster than the VS2017 memcpy() or the one in recent glibc, especially when AVX2 is available with the streaming (non-temporal) store instructions. It would probably be worth simply throwing it out, replacing it with memcpy(), and seeing what happens. On Windows, memcpy() did not look any worse in the profiler.


	Attachments: face.patch.gz [1.49 KiB]

2021-03-23 01:12:36

Henri Beauchamp

Joined: 2009-03-17 18:42:51
Posts: 6043

Re: Some more AVX/SSE2 for llface.cpp


	kathrine wrote: The Visual Studio Profiler showed that llface.cpp was still a little hot, so I added some more AVX2/SSE2 code. It is mostly identical to the previous one, just for the planar projection case.

Thank you! Added to the next release.


	Quote: P.S. I also looked at ll_memcpy_nonaliased_aligned_16() in llmemory.h and wondered whether it is still any faster than the VS2017 memcpy() or the one in recent glibc, especially when AVX2 is available with the streaming (non-temporal) store instructions. It would probably be worth simply throwing it out, replacing it with memcpy(), and seeing what happens. On Windows, memcpy() did not look any worse in the profiler.

Replacing it with memcpy() seems to make the code slightly slower here (glibc v2.31 with AVX2 optimizations on): 720-730 fps with memcpy() against 730-740 fps with the current code, in my skybox... Besides, I must ensure the code runs well when not optimized for AVX (the official builds are SSE2-only), *and* not everyone has a glibc with proper optimizations, *and* AVX2 optimizations in glibc could well handicap Intel users on CPUs that lower their operating frequency as soon as a single AVX instruction gets executed (e.g. Coffee Lake, when the appropriate BIOS override for the "AVX offset" is not set)... Until LL's ll_memcpy_nonaliased_aligned_16() is proven slower than memcpy(), I'll keep it...
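
For anyone who wants to repeat this kind of comparison outside the viewer, here is a minimal sketch of the general idea: a plain SSE2 copy for 16-byte aligned, non-overlapping buffers, plus a small timing helper. This is an assumption-laden illustration and not the actual ll_memcpy_nonaliased_aligned_16() from llmemory.h; the names sketch_copy_aligned_16() and seconds_per_copy() are hypothetical. A synthetic loop like this will not necessarily reflect in-viewer frame rates.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Use SSE2 intrinsics only where the compiler targets them; else fall back.
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

// Hypothetical stand-in, NOT the viewer's ll_memcpy_nonaliased_aligned_16().
// Assumed preconditions: dst and src are 16-byte aligned, do not overlap,
// and bytes is a multiple of 16.
inline void sketch_copy_aligned_16(void* dst, const void* src, std::size_t bytes)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    assert(bytes % 16 == 0);
#if SKETCH_HAVE_SSE2
    __m128i* d = static_cast<__m128i*>(dst);
    const __m128i* s = static_cast<const __m128i*>(src);
    for (std::size_t i = 0, n = bytes / 16; i < n; ++i)
    {
        // Aligned 128-bit load followed by an aligned 128-bit store.
        _mm_store_si128(d + i, _mm_load_si128(s + i));
    }
#else
    std::memcpy(dst, src, bytes);
#endif
}

// Hypothetical timing helper: average wall-clock seconds per call of copy().
// An optimizer may shorten such loops, so treat the numbers as rough hints.
template <typename CopyFn>
double seconds_per_copy(CopyFn copy, void* dst, const void* src,
                        std::size_t bytes, int reps)
{
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
    {
        copy(dst, src, bytes);
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count() / reps;
}
```

To compare against the library routine, the same helper can be called with a lambda wrapping std::memcpy() on the same buffers, with both variants run on the build flags and CPUs that actually matter (SSE2-only builds included).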

2021-03-23 13:40:57

kathrine

Joined: 2011-10-07 10:39:20
Posts: 215

Re: Some more AVX/SSE2 for llface.cpp

Sounds good.

Your frame rates are about 10x as high as mine (AMD Vega 56, Ryzen 2700X, ..., crappy OpenGL on Windows; I never get above 90 fps, with EEP and shadows disabled), so you can probably see the memcpy() difference better than I can. I might try an AVX2 version with non-cached streaming store ops, but my first attempt wasn't really worth it. According to my profile, the copy was just 0.2% of CPU time anyway, even with memcpy().

Kathrine
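
As a rough illustration of what a copy with non-cached (non-temporal) stores can look like, here is a minimal sketch. It is not the attempt mentioned above, and sketch_stream_copy() is a hypothetical name. It assumes 32-byte aligned, non-overlapping buffers whose size is a multiple of 32. The 256-bit path is only compiled when the build targets AVX or later (the two 256-bit intrinsics used here need AVX, so an AVX2 build qualifies); otherwise it falls back to SSE2 streaming stores, or to memcpy() on other targets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__AVX__)
#include <immintrin.h>
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_STREAM_SSE2 1
#endif

// Hypothetical helper, for illustration only.
// Assumed preconditions: dst and src are 32-byte aligned, do not overlap,
// and bytes is a multiple of 32.
inline void sketch_stream_copy(void* dst, const void* src, std::size_t bytes)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 32 == 0);
    assert(bytes % 32 == 0);
#if defined(__AVX__)
    __m256i* d = static_cast<__m256i*>(dst);
    const __m256i* s = static_cast<const __m256i*>(src);
    for (std::size_t i = 0, n = bytes / 32; i < n; ++i)
    {
        // Regular aligned load, non-temporal store that bypasses the cache.
        _mm256_stream_si256(d + i, _mm256_load_si256(s + i));
    }
    _mm_sfence(); // Make the streamed stores globally visible in order.
#elif defined(SKETCH_STREAM_SSE2)
    __m128i* d = static_cast<__m128i*>(dst);
    const __m128i* s = static_cast<const __m128i*>(src);
    for (std::size_t i = 0, n = bytes / 16; i < n; ++i)
    {
        _mm_stream_si128(d + i, _mm_load_si128(s + i));
    }
    _mm_sfence();
#else
    std::memcpy(dst, src, bytes);
#endif
}
```

Streaming stores tend to help mainly for large buffers that will not be read back soon; for small copies that are used again right away they can easily be slower than cached stores, which may be why a quick attempt does not show a gain.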

2021-03-23 18:24:15
