Cool VL Viewer forum

AVX2 Version of parts of tcd_decode_tile for openjpeg 

Joined: 2011-10-07 10:39:20
Posts: 181
Next try at a useful patch.

I noticed that CMAKE_C_FLAGS_* isn't set on Windows, unlike on Linux (in 00-Common.cmake), so /arch:AVX2 isn't used for libopenjpeg and other C-only libraries. That should probably also be fixed to get the proper optimization flags.

This is also not in the hot path for rendering, but faster JPEG2000 decoding can't hurt.

Code:
*** linden/indra/libopenjpeg/tcd.c   Mon Apr 22 22:15:55 2019
--- linden/indra/libopenjpeg/tcd.c   Sun Oct  4 02:33:58 2020
***************
*** 33,38 ****
--- 33,43 ----
  #define _ISOC99_SOURCE /* lrintf is C99 */
  #include "opj_includes.h"
 
+ #if defined(__AVX2__)
+ #include <immintrin.h>
+ #include <stdint.h>
+ #endif
+
  void tcd_dump(FILE *fd, opj_tcd_t *tcd, opj_tcd_image_t * img) {
     int tileno, compno, resno, bandno, precno;//, cblkno;
 
***************
*** 1564,1569 ****
--- 1569,1599 ----
              }
           }
        }else{
+ #if defined(__AVX2__)
+          __m256i adjustv = _mm256_set1_epi32(adjust);
+          __m256i minv = _mm256_set1_epi32(min);
+          __m256i maxv = _mm256_set1_epi32(max);
+          for (j = res->y0; j < res->y1; ++j) {
+             // handle chunks of 8
+             for (i = res->x0; i + 8 < res->x1; i += 8) {
+                // lets do 8 per chunk
+                float* start = &(((float*)tilec->data)[i - res->x0 + (j - res->y0) * tw]);
+                __m256i tmp = _mm256_cvtps_epi32(_mm256_loadu_ps(start));
+                tmp = _mm256_add_epi32(tmp, adjustv);
+                // int_clamp vectorized...
+                tmp = _mm256_min_epi32(_mm256_max_epi32(tmp, minv), maxv);
+                int32_t* target = &(imagec->data[(i - offset_x) + (j - offset_y) * w]);
+                _mm256_storeu_si256((__m256i*)target, tmp);
+             }
+             // handle the rest of the row
+             for (; i < res->x1; ++i) {
+                float tmp = ((float*)tilec->data)[i - res->x0 + (j - res->y0) * tw];
+                int v = lrintf(tmp);
+                v += adjust;
+                imagec->data[(i - offset_x) + (j - offset_y) * w] = int_clamp(v, min, max);
+             }
+          }
+ #else
           for (j = res->y0; j < res->y1; ++j) {
              for (i = res->x0; i < res->x1; ++i) {
                 float tmp = ((float*)tilec->data)[i - res->x0 + (j - res->y0) * tw];
***************
*** 1572,1577 ****
--- 1602,1608 ----
                 imagec->data[(i - offset_x) + (j - offset_y) * w] = int_clamp(v, min, max);
              }
           }
+ #endif
        }
        opj_aligned_free(tilec->data);
     }


2020-10-04 08:43:47

Joined: 2009-03-17 18:42:51
Posts: 5523
kathrine wrote:
Next try at a useful patch.
Keep them coming ! :P

Quote:
I noticed that CMAKE_C_FLAGS_* isn't set on Windows, unlike on Linux (in 00-Common.cmake), so /arch:AVX2 isn't used for libopenjpeg and other C-only libraries. That should probably also be fixed to get the proper optimization flags.
Wow, good catch indeed! Thankfully, "only" OpenJPEG was affected, but it will still be a big plus for Windows builds!
I also added to 00-BuildOptions.cmake:
Code:
# Compilation/optimization options: uncomment to enable (may also be passed as
# boolean defines to cmake: see Variables.cmake). Mainly relevant to Windows
# and macOS builds (for Linux, simply use the corresponding options in the
# buildlinux.sh script, which will pass the appropriate boolean defines to
# cmake).
#set(USELTO ON)
#set(USEAVX ON)
#set(USEAVX2 ON)
So that it is easier for people to build custom, optimized builds on Windows and macOS...

Quote:
This is also not in the hot path for rendering, but faster JPEG2000 decoding can't hurt.
I'll test it (mostly under Linux), many thanks! If I do not see any ill effects, your patch will be adopted for the next releases. :)


2020-10-04 10:10:46

Joined: 2011-10-07 10:39:20
Posts: 181
Well, the default C_FLAGS for release builds include /O2, so it is not that bad, and for 64-bit builds that includes SSE, but all the extra flags got lost.

Anyway, I hope it is useful. It might even be possible to tweak the patch a bit to use "i + 8 <= end" instead of the strict <. I am not sure whether I miss the last 8 items of the loop if x aligns to 8 exactly.
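
To illustrate the loop-bound question, here is a minimal sketch. The helper `vector_covered` is hypothetical and not part of the patch; it only counts how many elements of a row the 8-wide loop would handle under each condition. With the strict `<`, a row whose length is an exact multiple of 8 leaves its last 8 elements to the scalar remainder loop, so nothing is skipped, it is merely processed more slowly. With `<=`, the vector loop handles them as well.

```c
#include <assert.h>

/*
 * Hypothetical helper (not in the patch): returns how many elements of the
 * row [x0, x1) are processed by the 8-wide loop. When `inclusive` is zero
 * the loop condition is "i + 8 < x1" (as posted); otherwise "i + 8 <= x1".
 * Whatever is left over would be handled by the scalar remainder loop.
 */
static int vector_covered(int x0, int x1, int inclusive) {
    int i;
    assert(x0 <= x1);
    for (i = x0; inclusive ? (i + 8 <= x1) : (i + 8 < x1); i += 8) {
        /* the real loop converts 8 floats here */
    }
    return i - x0;
}
```

For a 16-element row, the strict condition vectorizes only the first 8 elements and the inclusive one vectorizes all 16; in both cases the remainder loop picks up everything that is left, so the output is the same.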

This was also found by staring at the VC performance analyzer stats, and yes, that thing can restrict stats to single threads, but the render loop wastes most of its time inside AMD's atrociously slow OpenGL. Only some minor stuff is left to optimize there, the state sort and some other smaller parts, but nothing major. This part was the hotspot in the decoder thread (well, lrintf() was, actually), so I vectorized that inner loop to get rid of it and make it faster.

Kathrine


2020-10-04 10:32:53

Joined: 2009-03-17 18:42:51
Posts: 5523
kathrine wrote:
It might even be possible to tweak the patch a bit to use "i + 8 <= end" instead of the strict <. I am not sure whether I miss the last 8 items of the loop if x aligns to 8 exactly.
Yep, the equality case is totally fine (using it right now).

Quote:
This was also found by staring at the VC performance analyzer stats, and yes, that thing can restrict stats to single threads, but the render loop wastes most of its time inside AMD's atrociously slow OpenGL.
Well, you could perhaps subtract the time spent in driver calls from the total and see where the hot spots are in what remains.

Quote:
This part was the hotspot in the decoder thread (well, lrintf() was, actually), so I vectorized that inner loop to get rid of it and make it faster.
Nice job! It works flawlessly for me so far (tested with a long flying avatar session over mainland, to stress the texture decoder).

An SSE2 version would be nice to have as well, since SSE2 is what most people will use the viewer with (it's what official builds use, since they must stay compatible with anyone's hardware). :P
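
As a rough idea of what such an SSE2 variant could look like, here is a minimal sketch, not the code that ended up in the viewer. The function `convert_row_sse2` and its helpers are hypothetical names, and the row is passed as plain pointers rather than through the `tilec`/`imagec` structures of the patch. SSE2 has no `_mm_min_epi32`/`_mm_max_epi32` (those arrived with SSE4.1), so the clamp is built from a compare mask instead. The scalar tail uses `_mm_cvtss_si32` in place of `lrintf()` when SSE2 is available; both round according to the current rounding mode.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#else
#include <math.h>
#endif

/* Round using the current rounding mode, like lrintf() does. */
static int32_t round_to_int(float x) {
#if defined(__SSE2__)
    return _mm_cvtss_si32(_mm_set_ss(x));
#else
    return (int32_t)lrintf(x);
#endif
}

static int32_t clamp_int(int32_t v, int32_t lo, int32_t hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Hypothetical helper: dst[i] = clamp(round(src[i]) + adjust, lo, hi). */
static void convert_row_sse2(const float *src, int32_t *dst, int n,
                             int32_t adjust, int32_t lo, int32_t hi) {
    int i = 0;
    assert(n >= 0 && lo <= hi);
#if defined(__SSE2__)
    const __m128i adjustv = _mm_set1_epi32(adjust);
    const __m128i lov = _mm_set1_epi32(lo);
    const __m128i hiv = _mm_set1_epi32(hi);
    /* handle chunks of 4; "<=" so an exact multiple of 4 stays vectorized */
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_cvtps_epi32(_mm_loadu_ps(src + i));
        __m128i m;
        v = _mm_add_epi32(v, adjustv);
        /* v = max(v, lo): keep v where v > lo, else take lo */
        m = _mm_cmpgt_epi32(v, lov);
        v = _mm_or_si128(_mm_and_si128(m, v), _mm_andnot_si128(m, lov));
        /* v = min(v, hi): keep v where hi > v, else take hi */
        m = _mm_cmpgt_epi32(hiv, v);
        v = _mm_or_si128(_mm_and_si128(m, v), _mm_andnot_si128(m, hiv));
        _mm_storeu_si128((__m128i *)(dst + i), v);
    }
#endif
    /* handle the rest of the row (or all of it without SSE2) */
    for (; i < n; ++i) {
        dst[i] = clamp_int(round_to_int(src[i]) + adjust, lo, hi);
    }
}
```

Processing 4 values per iteration instead of 8 halves the width of the AVX2 loop, but it still avoids a `lrintf()` call per sample, which was the reported hotspot.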


2020-10-04 16:41:51

Joined: 2011-10-07 10:39:20
Posts: 181
Is there any specific reason the viewer still sits on the rather old (2011) OpenJPEG 1.4?
It seems the newer OpenJPEG 2.3.1 has a partly different API, but claims to be much faster (though still slower than KDU) and memory hungry, and has quite a few SSE/AVX2-optimized parts.

Kathrine


2020-10-06 22:09:51

Joined: 2009-03-17 18:42:51
Posts: 5523
kathrine wrote:
Is there any specific reason the viewer still sits on the rather old (2011) OpenJPEG 1.4?
Yep. Newer versions fail to properly decode SL textures (something to do with texture LODs). SL viewers are among the rare programs that use the JPEG2000 feature allowing a texture to be only partially decoded (meaning it is not fully downloaded and is shown at a lower resolution/level of detail), so such an issue does not show up in other uses of OpenJPEG, but it is crucial to SL.
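
To give a feel for what "lower resolution" means here, a minimal sketch of the size arithmetic, assuming an image whose origin is at (0,0): each discarded JPEG2000 resolution level halves the width and height, rounding up. The helper names below are hypothetical and this is not viewer or OpenJPEG code (in OpenJPEG 1.x the number of discarded levels is, as far as I know, what the `cp_reduce` decoder parameter selects).

```c
#include <assert.h>

/* Ceiling division of a by 2^b. */
static int ceildivpow2(int a, int b) {
    return (a + (1 << b) - 1) >> b;
}

/*
 * Hypothetical helper: size of one image dimension after discarding
 * `discard` resolution levels, for an image starting at coordinate 0.
 */
static int reduced_size(int full_size, int discard) {
    assert(full_size >= 0 && discard >= 0 && discard < 16);
    return ceildivpow2(full_size, discard);
}
```

For example, a 1024-pixel-wide texture decoded with two levels discarded comes out 256 pixels wide, which is why a partially downloaded texture can already be shown at a coarser level of detail.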


2020-10-07 08:32:10