Cool VL Viewer forum

Improving LLVCacheLRU::updateScores() with OpenMP 

Joined: 2011-10-07 10:39:20
Posts: 155
Hi Henri,

I ran the Visual Studio Profiler on a viewer run and found that updateScores() eats around 5% of CPU time.

It can be parallelized quite simply with OpenMP, as illustrated below (use /openmp with Visual Studio).

Are you interested in a proper patch for this?

Code:
   void updateScores()
   {
      LLVCacheVertexData** data_iter = mCache + MaxSizeVertexCache;
      LLVCacheVertexData** end_data = mCache + MaxSizeVertexCache + 3;

      while (data_iter != end_data)
      {
         LLVCacheVertexData* data = *data_iter++;
         // Trailing 3 vertices aren't actually in the cache for scoring
         // purposes
         if (data)
         {
            data->mCacheTag = -1;
         }
      }

      data_iter = mCache;
      end_data = mCache + MaxSizeVertexCache;

      while (data_iter != end_data)
      {
         // Update scores of vertices in cache
         LLVCacheVertexData* data = *data_iter++;
         if (data)
         {
            data->mScore = find_vertex_score(*data);
         }
      }

      mBestTriangle = NULL;
      // Update triangle scores
      data_iter = mCache;
      end_data = mCache + MaxSizeVertexCache + 3;

      
      #pragma omp parallel
      {
         LLVCacheTriangleData* threadBestTriangle = NULL;

         while (data_iter != end_data)
         {
            LLVCacheVertexData* data = *data_iter++;
            if (data)
            {
               int64_t pos = 0;
               int64_t end = data->mTriangles.size();
               #pragma omp for
               for (pos = 0; pos < end; pos++)
               {
                  LLVCacheTriangleData* tri = data->mTriangles[pos];
                  if (tri && tri->mActive &&
                     tri->mVertex[0] && tri->mVertex[1] && tri->mVertex[2])
                  {
                     tri->mScore = tri->mVertex[0]->mScore;
                     tri->mScore += tri->mVertex[1]->mScore;
                     tri->mScore += tri->mVertex[2]->mScore;

                     if (!threadBestTriangle ||
                        threadBestTriangle->mScore < tri->mScore)
                     {
                        threadBestTriangle = tri;
                     }
                  }
               }
            }
         }
         
         // find global maximum
         #pragma omp critical
         {
            if (!mBestTriangle || mBestTriangle->mScore < threadBestTriangle->mScore) {
               mBestTriangle = threadBestTriangle;
            }
         }


      }
      // Knock trailing 3 vertices off the cache
      data_iter = mCache + MaxSizeVertexCache;
      end_data = mCache + MaxSizeVertexCache + 3;
      while (data_iter != end_data)
      {
         LLVCacheVertexData* data = *data_iter;
         if (data)
         {
            llassert(data->mCacheTag == -1);
            *data_iter = NULL;
         }
         ++data_iter;
      }
   }
};


2020-10-03 14:30:25

Joined: 2009-03-17 18:42:51
Posts: 5022
kathrine wrote:
I ran the Visual Studio Profiler on a viewer run and found that updateScores() eats around 5% of CPU time.
Wow !... :shock: It's pretty... surprising. I mean the mesh vertex cache optimization is supposed to be run once per mesh LOD (whenever that LOD is downloaded for the mesh), so while it will definitely eat up some CPU time on rezzing (and on LOD changes, when the corresponding LODs have not yet been cached), it should not do so continuously... There's a debug setting (RenderOptimizeMeshVertexCache) that you can set to FALSE to entirely skip this vertex cache optimization code. Switching it off and on here in real time, I don't see noticeable frame rate drops...

Quote:
It can be parallelized quite simply with OpenMP, as illustrated below (use /openmp with Visual Studio).
I already experimented with OpenMP (under Linux) and the viewer in the past, but did not find interesting code paths that could be optimized that way (aside from legacy cloud generation/updating, but the resulting gain is totally negligible).

Quote:
Are you interested in a proper patch for this?
Well, the code you gave here will be enough for me to give it a try, at least. The newly opened experimental branch is a good candidate for such patches. So, if you find more stuff that can be optimized via OpenMP (and does not cause crashes because of the race conditions it would introduce, which is the tricky part with OpenMP), you are welcome to post them here !


2020-10-03 14:51:26

Joined: 2011-10-07 10:39:20
Posts: 155
Ok, that might explain a bit.

I profiled for an hour and that part was when jumping to HBC with lots of AVs around and a clean cache, so lots of texture fetches etc.

I will have a look at some quieter spots, but the threading optimization looks useful anyway. It's a bit sad that VC only has OpenMP 2.0, but that's better than nothing for a start. That's actually the reason for the int64_t stuff, as VC cannot simply use the iterators there.
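
For readers wondering about the int64_t remark: OpenMP 2.0 only accepts a signed integer loop variable in its canonical `for` form, so a loop over a container has to be written with an index instead of iterators. A minimal sketch of that rewrite follows; `sum_scores` and its `scores` argument are hypothetical names used for illustration and are not viewer code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the viewer: sums a vector with an OpenMP
// worksharing loop. The loop variable is a signed integer, as OpenMP 2.0
// requires, so the container is indexed instead of iterated.
inline double sum_scores(const std::vector<double>& scores)
{
    const int64_t end = static_cast<int64_t>(scores.size());
    double total = 0.0;
    // When OpenMP is not enabled, the pragma is ignored and the loop simply
    // runs serially.
    #pragma omp parallel for reduction(+ : total)
    for (int64_t pos = 0; pos < end; ++pos)
    {
        total += scores[static_cast<std::size_t>(pos)];
    }
    return total;
}
```

Later OpenMP versions (3.0 and up) also accept unsigned types and random access iterators as loop variables, which is why this workaround is only needed with the older specification.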


2020-10-03 15:12:21

Joined: 2009-03-17 18:42:51
Posts: 5022
OK, reusing my former OpenMP tests, I came up with the attached patch (to apply to the experimental branch sources) which should allow compiling your proposed code for Linux, Windows and even macOS (for Windows and macOS, uncomment "#set(OPENMP ON)" in indra/cmake/00-BuildOptions.cmake, and for Linux simply add the "-o" option to the buildlinux.sh command line).

Sadly, the result under Linux (with gcc 10.2.0 as the compiler) is a hanging mesh repository (it stays stuck forever when it attempts the first mesh cache optimization)...

You may review the patch to see if I made any mistake, but aside from the cmake/linking stuff, and a couple minor changes (coding standard-related), it's the code you posted here. There might be some issue with threading (the mesh repository being itself a thread: not too sure how OpenMP deals with threads under Linux; I'd have to investigate).

EDIT: same results when compiling with llvm/clang 10.0.1.
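
One possible explanation for the hang, offered as a reading of the code posted at the top of this thread and not as a confirmed diagnosis: `data_iter` is declared outside the `#pragma omp parallel` region, so it is shared and every thread advances it concurrently, while `#pragma omp for` sits inside the `while` loop. OpenMP requires all threads of a team to encounter the same worksharing constructs in the same order. If the threads go round the `while` loop a different number of times, some of them wait forever at the implicit barrier that ends the `omp for`. The sketch below shows one layout that avoids this by keeping the outer index private to each thread. `Triangle`, `Vertex` and `best_triangle` are hypothetical stand-ins, not the viewer's classes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the viewer's cache structures.
struct Triangle { float score = 0.f; bool active = true; };
struct Vertex { std::vector<Triangle*> triangles; };

// Returns the highest-scoring active triangle, or nullptr if there is none.
inline Triangle* best_triangle(const std::vector<Vertex*>& cache)
{
    Triangle* best = nullptr;
    #pragma omp parallel
    {
        Triangle* thread_best = nullptr;
        // 'i' is declared inside the parallel region, so each thread has its
        // own copy and all threads reach the same 'omp for' constructs the
        // same number of times.
        for (std::size_t i = 0; i < cache.size(); ++i)
        {
            Vertex* v = cache[i];
            if (!v) continue; // same decision in every thread (read-only data)
            const int64_t end = static_cast<int64_t>(v->triangles.size());
            #pragma omp for nowait
            for (int64_t pos = 0; pos < end; ++pos)
            {
                Triangle* tri = v->triangles[static_cast<std::size_t>(pos)];
                if (tri && tri->active &&
                    (!thread_best || thread_best->score < tri->score))
                {
                    thread_best = tri;
                }
            }
        }
        // Merge the per-thread results; a thread that saw no triangle at all
        // must not be dereferenced.
        #pragma omp critical
        {
            if (thread_best && (!best || best->score < thread_best->score))
            {
                best = thread_best;
            }
        }
    }
    return best;
}
```

The null check on `thread_best` in the critical section matters as well: a thread that was handed no active triangle would otherwise dereference a null pointer during the merge.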


Attachments:
omp-patch.txt.gz [3.46 KiB]
2020-10-03 16:34:38

Joined: 2011-10-07 10:39:20
Posts: 155
Will have a look.

Probably messed something up on my side.

Kathrine


2020-10-03 18:05:08

Joined: 2009-03-17 18:42:51
Posts: 5022
kathrine wrote:
Will have a look.

Probably messed something up on my side.
I may be wrong, but I get the feeling that the issue stems from using OpenMP within a method that is itself called from a child thread of the program (and not from its main thread). Normally, OpenMP would be used from the main thread (the whole point of OpenMP being to avoid dealing with threads yourself). Maybe it works under Windows and not under Linux...

Also, regarding this particular mesh optimizing method, I don't think the result you got while benchmarking it is very relevant (at least on multi-core CPUs). Why ? Because you apparently (if I understood correctly what you wrote) measured the CPU load, but not the relative time slice it takes in the main thread (which would be close to zero and due only to mutex lock overhead and such, since this method is called from the mesh repository thread).
On a single-core CPU, your "5%" would definitely be a relevant measure and would deserve optimization, but on a multi-core CPU, and considering this method is called in a child thread, it does not really impact the main thread loop run time (i.e. the rendering itself, the object list and UI refreshes, plus all the sundry synchronization tasks between threads).
Granted, spreading the mesh optimization over more cores would definitely lower the rezzing time of meshes (so your work is not at all useless), even if only by a small amount (relative to the download time of the said meshes).

I do not know how the VS profiler works, but if possible, it would be worth running it only on the main viewer thread to highlight the hot spots that do count for the user.


2020-10-03 19:43:53

Joined: 2011-10-07 10:39:20
Posts: 155
Yeah, maybe not worth it, as it is not on the main loop.

But faster rezzing is nice too, if it works out.
It seems I messed up my copy & paste and posted an older, broken version.

This one actually works and does something useful.
Code:
   void updateScores()
   {
      LLVCacheVertexData** data_iter = mCache + MaxSizeVertexCache;
      LLVCacheVertexData** end_data = mCache + MaxSizeVertexCache + 3;

      while (data_iter != end_data)
      {
         LLVCacheVertexData* data = *data_iter++;
         // Trailing 3 vertices aren't actually in the cache for scoring
         // purposes
         if (data)
         {
            data->mCacheTag = -1;
         }
      }

      data_iter = mCache;
      end_data = mCache + MaxSizeVertexCache;

      while (data_iter != end_data)
      {
         // Update scores of vertices in cache
         LLVCacheVertexData* data = *data_iter++;
         if (data)
         {
            data->mScore = find_vertex_score(*data);
         }
      }

      mBestTriangle = NULL;
      // Update triangle scores
      data_iter = mCache;
      end_data = mCache + MaxSizeVertexCache + 3;
      
      while (data_iter != end_data)
      {
         LLVCacheVertexData* data = *data_iter++;
         
         if (data)
         {
            int64_t end = data->mTriangles.size();

#pragma omp parallel num_threads(4)
            {
               LLVCacheTriangleData* threadBestTriangle = NULL;
#pragma omp for nowait schedule(dynamic, 500)
               for (int64_t pos = 0; pos < end; pos++)
               {
                  LLVCacheTriangleData* tri = data->mTriangles[pos];
                  if (tri && tri->mActive &&
                     tri->mVertex[0] && tri->mVertex[1] && tri->mVertex[2])
                  {
                     tri->mScore = tri->mVertex[0]->mScore;
                     tri->mScore += tri->mVertex[1]->mScore;
                     tri->mScore += tri->mVertex[2]->mScore;

                     if (!threadBestTriangle || threadBestTriangle->mScore < tri->mScore)
                     {
                        threadBestTriangle = tri;
                     }
                  }
               }
               // find global maximum
#pragma omp critical
               {
                  if (threadBestTriangle && (!mBestTriangle || mBestTriangle->mScore < threadBestTriangle->mScore)) {
                     mBestTriangle = threadBestTriangle;
                  }
               }
            }
         }
      }
         

      // Knock trailing 3 vertices off the cache
      data_iter = mCache + MaxSizeVertexCache;
      end_data = mCache + MaxSizeVertexCache + 3;
      while (data_iter != end_data)
      {
         LLVCacheVertexData* data = *data_iter;
         if (data)
         {
            llassert(data->mCacheTag == -1);
            *data_iter = NULL;
         }
         ++data_iter;
      }
   }
};
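
The reduction pattern in this version (each thread keeps its own best candidate, then the candidates are merged inside `#pragma omp critical`, skipping threads that found nothing) can be tried in isolation. The sketch below is a minimal standalone illustration of that pattern on a plain vector; `beats` and `index_of_best` are hypothetical names. The tie-break on the lowest index is an addition of this sketch and is not in the posted code: without it, equal scores can yield a different winner than the serial loop, depending on how iterations are distributed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: true if candidate index 'a' should replace the current
// best index 'b' (where b < 0 means "no best yet"). Ties go to the lower index.
inline bool beats(const std::vector<float>& s, int64_t a, int64_t b)
{
    if (b < 0) return true;
    const float sa = s[static_cast<std::size_t>(a)];
    const float sb = s[static_cast<std::size_t>(b)];
    if (sa != sb) return sa > sb;
    return a < b;
}

// Hypothetical helper: index of the largest score, or -1 for an empty input.
inline int64_t index_of_best(const std::vector<float>& scores)
{
    const int64_t end = static_cast<int64_t>(scores.size());
    int64_t best = -1;
    #pragma omp parallel
    {
        int64_t local_best = -1; // private to each thread
        #pragma omp for nowait
        for (int64_t pos = 0; pos < end; ++pos)
        {
            if (beats(scores, pos, local_best))
            {
                local_best = pos;
            }
        }
        // Merge step: one thread at a time, and only for threads that
        // actually found a candidate.
        #pragma omp critical
        {
            if (local_best >= 0 && beats(scores, local_best, best))
            {
                best = local_best;
            }
        }
    }
    return best;
}
```

Compiled without OpenMP, the pragmas are ignored and the function degrades to the plain serial scan, which makes it easy to compare both builds on the same input.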


2020-10-03 20:52:00

Joined: 2009-03-17 18:42:51
Posts: 5022
Got a magnificent crash under Linux with this version:
Code:
0   com.secondlife.indra.viewer   0x14f46ca LLAppViewerLinux::handleSyncCrashTrace() + 298
1   com.secondlife.indra.viewer   0x1dff439 default_unix_signal_handler(int, siginfo_t*, void*) + 233
2   unknown   0x7ffff55c2900 /lib64/libpthread.so.0(+0x13900) [0x7ffff55c2900]
3   com.secondlife.indra.viewer   0x1dddccf LLVCacheLRU::updateScores() [clone ._omp_fn.0] + 95
4   unknown   0x7ffff7deaac9 /usr/lib64/libomp.so(+0x8dac9) [0x7ffff7deaac9]
5   unknown   0x7ffff7dfe2c3 /usr/lib64/libomp.so(__kmp_invoke_microtask+0x93) [0x7ffff7dfe2c3]
6   unknown   0x7ffff7d9c9ad /usr/lib64/libomp.so(+0x3f9ad) [0x7ffff7d9c9ad]
7   unknown   0x7ffff7d9ba19 /usr/lib64/libomp.so(+0x3ea19) [0x7ffff7d9ba19]
8   unknown   0x7ffff7de99db /usr/lib64/libomp.so(+0x8c9db) [0x7ffff7de99db]
9   unknown   0x7ffff55b7ded /lib64/libpthread.so.0(+0x8ded) [0x7ffff55b7ded]
10  unknown   0x7ffff54ec06f /lib64/libc.so.6(clone+0x3f) [0x7ffff54ec06f]

I still have the feeling that threading via OpenMP within a pthread thread is "wrong" (under Linux... or macOS !)... Probably a pthread reentrancy issue (as you can see in the stack trace, two pthread calls get nested).

EDIT: just found the answer confirming my inference on the OpenMP forum. So, sadly, I must reject this patch. :cry:


2020-10-03 21:58:46

Joined: 2011-10-07 10:39:20
Posts: 155
Ah well, happens. Maybe my next patch will work better.


2020-10-04 08:36:52