Cool VL Viewer forum

Some more AVX/SSE2 for llface.cpp 

Joined: 2011-10-07 10:39:20
Posts: 140
Hi Henri,

The Visual Studio profiler showed that llface.cpp was still a little hot, so I added some more AVX2/SSE2 code.
It is mostly identical to the previous one, just for the planar projection case.

Kathrine

P.S. I also looked at ll_memcpy_nonaliased_aligned_16() in llmemory.h and wondered whether it is still any faster than the VS2017 memcpy() or the one in recent glibc, especially when AVX2 is available with the streaming instructions. It would probably be worth just replacing it with memcpy() to see what happens. It did not look any worse in the profiler on Windows.
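
For readers unfamiliar with this kind of helper, here is a minimal sketch of a 16-byte-aligned, non-aliased copy written with SSE2 intrinsics. It is not the viewer's actual ll_memcpy_nonaliased_aligned_16(): the name copy_aligned_16 is hypothetical, and the sketch assumes, for simplicity, that the size is a multiple of 16 bytes. On targets without SSE2 it simply falls back to memcpy().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

// Hypothetical helper, for illustration only.
// Preconditions (assumptions of this sketch):
//  - dst and src are both 16-byte aligned,
//  - the two buffers do not overlap,
//  - bytes is a multiple of 16.
inline void copy_aligned_16(void* dst, const void* src, std::size_t bytes)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    assert(bytes % 16 == 0);
#if SKETCH_HAVE_SSE2
    __m128i* d = static_cast<__m128i*>(dst);
    const __m128i* s = static_cast<const __m128i*>(src);
    // One aligned 128-bit load and one aligned 128-bit store per iteration.
    for (std::size_t i = 0, n = bytes / 16; i < n; ++i)
    {
        _mm_store_si128(d + i, _mm_load_si128(s + i));
    }
#else
    // No SSE2 on this target: let the C library do the work.
    std::memcpy(dst, src, bytes);
#endif
}
```

Whether a loop like this beats the C library's memcpy() depends on the compiler, the libc and the CPU, which is exactly the question raised above; only a profiler or a benchmark can settle it.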


Attachments:
face.patch.gz [1.49 KiB]
2021-03-23 01:12:36

Joined: 2009-03-17 18:42:51
Posts: 4751
kathrine wrote:
The Visual Studio profiler showed that llface.cpp was still a little hot, so I added some more AVX2/SSE2 code.
It is mostly identical to the previous one, just for the planar projection case.
Thank you! Added to the next release. :D

Quote:
P.S. I also looked at ll_memcpy_nonaliased_aligned_16() in llmemory.h and wondered whether it is still any faster than the VS2017 memcpy() or the one in recent glibc, especially when AVX2 is available with the streaming instructions. It would probably be worth just replacing it with memcpy() to see what happens. It did not look any worse in the profiler on Windows.
Replacing it with memcpy() seems to make the code slightly slower here (glibc v2.31 with AVX2 optimizations on): 720-730 fps with memcpy() against 730-740 fps with the existing function, in my skybox. Besides, I must ensure the code runs well when not optimized for AVX (the official builds are SSE2-only), *and* not everyone has a glibc with proper optimizations, *and* AVX2 optimizations in glibc could well handicap Intel users whose CPUs lower their operating frequency as soon as a single AVX instruction gets executed (e.g. Coffee Lake, unless the appropriate BIOS override for the "AVX offset" is set). Until LL's ll_memcpy_nonaliased_aligned_16() is proven slower than memcpy(), I'll keep it.
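
The figures above are in-viewer frame rates. For anyone who wants to compare copy routines in isolation, a rough timing harness could look like the sketch below. The name time_copy and its parameters are hypothetical, and a micro-benchmark like this is only a first indication: it says nothing about cache behaviour or AVX frequency effects inside the real viewer.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical harness, for illustration only: calls copy_fn(dst, src, bytes)
// `iterations` times and returns the elapsed wall-clock time in seconds.
// Assumes bytes > 0.
template <typename CopyFn>
double time_copy(CopyFn copy_fn, std::size_t bytes, int iterations)
{
    std::vector<unsigned char> src(bytes, 0x5A);
    std::vector<unsigned char> dst(bytes, 0);
    volatile unsigned char sink = 0;

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        // Change the input each round and read the output back, to
        // discourage the compiler from optimizing the copies away.
        src[0] = static_cast<unsigned char>(i);
        copy_fn(dst.data(), src.data(), bytes);
        sink = dst[bytes - 1];
    }
    const auto stop = std::chrono::steady_clock::now();

    (void)sink;
    return std::chrono::duration<double>(stop - start).count();
}
```

It can be called with a lambda that wraps memcpy(), or any other routine with the same (dst, src, bytes) shape, for example time_copy([](void* d, const void* s, std::size_t n) { std::memcpy(d, s, n); }, 65536, 100000). Note that std::vector does not guarantee 16-byte alignment, so a routine that requires aligned buffers would need aligned storage instead.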


2021-03-23 13:40:57

Joined: 2011-10-07 10:39:20
Posts: 140
Sounds good.

Your framerates are about 10x as high as mine (AMD Vega 56, Ryzen 2700X, ..., crappy OpenGL on Windows; I never get above 90 fps, with EEP and shadows disabled), so you can probably see the memcpy() difference better. I might give it a try with an AVX2 version using non-cached streaming stores, but my first attempt wasn't really worth it. According to my profile, it was just 0.2% of CPU time anyway, even with memcpy().

Kathrine
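
As an illustration of what a copy with non-cached (non-temporal) streaming stores can look like, here is a minimal sketch. It is not the attempt mentioned above: the name stream_copy_aligned_32 is hypothetical, the sketch assumes 32-byte-aligned, non-overlapping buffers whose size is a multiple of 32, and it only uses the 256-bit path when the compiler targets AVX2 (e.g. -mavx2 or /arch:AVX2), falling back to memcpy() otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__AVX2__)
#include <immintrin.h>  // 256-bit load/stream intrinsics and _mm_sfence()
#endif

// Hypothetical helper, for illustration only.
// Preconditions (assumptions of this sketch):
//  - dst and src are both 32-byte aligned,
//  - the two buffers do not overlap,
//  - bytes is a multiple of 32.
inline void stream_copy_aligned_32(void* dst, const void* src, std::size_t bytes)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 32 == 0);
    assert(bytes % 32 == 0);
#if defined(__AVX2__)
    __m256i* d = static_cast<__m256i*>(dst);
    const __m256i* s = static_cast<const __m256i*>(src);
    for (std::size_t i = 0, n = bytes / 32; i < n; ++i)
    {
        // Aligned load, then a store that bypasses the cache hierarchy.
        _mm256_stream_si256(d + i, _mm256_load_si256(s + i));
    }
    // Non-temporal stores are weakly ordered; fence before others read dst.
    _mm_sfence();
#else
    // Build does not target AVX2: use the C library instead.
    std::memcpy(dst, src, bytes);
#endif
}
```

Streaming stores tend to pay off only for large buffers that are not read again soon. For small copies whose destination is used right away, they can be slower than ordinary stores, which would be consistent with a first attempt not being worth it.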


2021-03-23 18:24:15