About that SSIM and AVX/SSE
Checking the performance of Rust with a naive implementation vs. intrinsics-powered code in the image-compare crate.
Prelude - why one should not even think of it
Just as a little intro: I think that "Premature optimization is the root of all evil" (D. Knuth) is still as valid as it was decades ago. Therefore, when I started implementing the image-compare crate, I wrote the algorithms as plain and short as possible, favoring legibility over performance in all places. This is also related to the YAGNI ("You aren't gonna need it") principle: we don't have a problem yet, so we don't fix anything. After adding more and more tests and using the code productively on full-HD and 4k images, I actually ran into the expected performance bottlenecks.
But what if I need to do it anyway?
There's a rule of thumb in programming to optimize the levels from top to bottom, meaning that a function you don't call often does not need to be super duper optimized to keep the program within its performance requirements (NFRs, or whatever you might call them). Sometimes this can be achieved by:
- using a different architecture / design, or
- doing some caching, or even
- caching the results for a grid of values and interpolating in between.

These methods are usually superior to starting by optimizing hot paths.
In the case of comparing images, caching, cache tables and interpolation are not possible. But since modern CPUs usually have many cores, we can leverage multithreading to speed up the computation. With tools like rayon this is super easy to achieve, so I went for it.
In the simplest cases, iter was replaced by par_iter, and a few little structural changes were needed to iterate over multiple buffers at once with izip! from itertools.
In the end the conversion was easy and the performance increase was incredible.
Memory Layout considerations
In image-compare I split all channels of RGB images beforehand anyway, to try to optimize cache utilization. In GPGPU circles this discussion is usually called AoS vs SoA (Array of Structs vs Struct of Arrays, see e.g. Wikipedia), where Struct of Arrays produces a memory layout favorable for processing, even though the code may end up a bit less intuitive. It also allows us to do some fancy stuff like in-memory compression using algorithms like LZW. In databases, this is one of the reasons why column-oriented storage is so hip nowadays. The memory layout would now be great for stream-processing the image on a GPU. It's not optimal for processing in windows though, and we will fix this up in the next part of the series.
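The channel split is essentially an AoS-to-SoA transform. A minimal sketch of the idea - the Planes struct and the function name are made up for illustration, not taken from the crate:

```rust
/// One contiguous buffer per channel (SoA) instead of interleaved RGB (AoS).
struct Planes {
    r: Vec<u8>,
    g: Vec<u8>,
    b: Vec<u8>,
}

/// Split interleaved RGB bytes into three channel planes.
fn split_channels(interleaved: &[u8]) -> Planes {
    assert_eq!(interleaved.len() % 3, 0);
    let n = interleaved.len() / 3;
    let mut planes = Planes {
        r: Vec::with_capacity(n),
        g: Vec::with_capacity(n),
        b: Vec::with_capacity(n),
    };
    for px in interleaved.chunks_exact(3) {
        planes.r.push(px[0]);
        planes.g.push(px[1]);
        planes.b.push(px[2]);
    }
    planes
}

fn main() {
    let img = [10, 20, 30, 11, 21, 31]; // two RGB pixels
    let p = split_channels(&img);
    assert_eq!(p.r, [10, 11]);
    assert_eq!(p.g, [20, 21]);
    assert_eq!(p.b, [30, 31]);
}
```

Each plane is now a flat, contiguous slice, which is exactly what both the cache and vector units like.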
Going at it with SIMD
Following the memory layout, there has to be a better way than two-dimensional for-loops or iterators over memory-adjacent elements. Naturally, vector extensions like SSE and AVX came to my mind. Since it's my spare time and I am interested in performance anyway, I gave it a shot, even though I knew that it's hard to beat modern compiler optimizations. Let's look at the algorithm for SSIM: for any 8x8 pixel window, the score can be written as follows (taken from image-compare's ssim.rs):
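As a stand-in for the actual block from ssim.rs, here is a scalar sketch following the standard SSIM definition. The constant names K1, K2, L and the function signature are illustrative assumptions and may differ from the real code:

```rust
// SSIM(x, y) = ((2*mu_x*mu_y + c1) * (2*cov_xy + c2))
//            / ((mu_x^2 + mu_y^2 + c1) * (var_x + var_y + c2))
const K1: f64 = 0.01;
const K2: f64 = 0.03;
const L: f64 = u8::MAX as f64; // dynamic range of 8-bit pixels

/// Per-window SSIM score from the window statistics.
fn ssim_for_window(mu_x: f64, mu_y: f64, var_x: f64, var_y: f64, cov_xy: f64) -> f64 {
    let c1 = (K1 * L).powi(2);
    let c2 = (K2 * L).powi(2);
    ((2. * mu_x * mu_y + c1) * (2. * cov_xy + c2))
        / ((mu_x.powi(2) + mu_y.powi(2) + c1) * (var_x + var_y + c2))
}

fn main() {
    // two identical windows must score exactly 1.0
    let s = ssim_for_window(100.0, 100.0, 25.0, 25.0, 25.0);
    assert!((s - 1.0).abs() < 1e-12);
    println!("{s}");
}
```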
So we would need to implement a mean, a variance and a covariance in SIMD code in Rust.
Unfortunately, we don't have portable SIMD in stable Rust just yet, so I decided to use intrinsics to get a feeling for the performance difference.
To load 8 pixels of type u8 we can use the following function:
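A sketch of such a load using AVX2 intrinsics, widening 8 u8 pixels into the 8 f32 lanes of a __m256. The function name and exact shape are assumptions; the real function in image-compare may differ:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Load 8 pixels of a row and widen them to 8 f32 lanes.
/// Note: _mm_loadu_si128 reads 16 bytes, so the slice needs 8 bytes of tail.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn load_row(row: &[u8]) -> __m256 {
    assert!(row.len() >= 16);
    let bytes = _mm_loadu_si128(row.as_ptr() as *const __m128i); // 16 x u8
    let ints = _mm256_cvtepu8_epi32(bytes); // low 8 u8 -> 8 i32, rest dropped
    _mm256_cvtepi32_ps(ints) // 8 i32 -> 8 f32
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn demo(row: &[u8]) -> [f32; 8] {
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), load_row(row));
    lanes
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") {
        let row = [1u8, 2, 3, 4, 5, 6, 7, 8, 0, 0, 0, 0, 0, 0, 0, 0];
        let lanes = unsafe { demo(&row) };
        assert_eq!(lanes, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
        println!("lanes: {lanes:?}");
    }
}
```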
We actually load 16 pixels and discard the bottom half - problematic :/ We should really utilize them too by processing two adjacent windows at once, but let's continue here for now. We also need some code to do a horizontal sum over a vector, see sauce:
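One common pattern for that horizontal sum: fold the upper 128-bit lane onto the lower one, then reduce the remaining four lanes with two horizontal adds. This is a sketch of the technique, not necessarily the exact code behind the link:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Horizontal sum of the 8 f32 lanes of an AVX register.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn hsum256(v: __m256) -> f32 {
    // fold the upper 128-bit lane onto the lower one
    let folded = _mm_add_ps(_mm256_castps256_ps128(v), _mm256_extractf128_ps(v, 1));
    // two horizontal adds reduce four lanes to one
    let folded = _mm_hadd_ps(folded, folded);
    let folded = _mm_hadd_ps(folded, folded);
    _mm_cvtss_f32(folded)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn demo() -> f32 {
    hsum256(_mm256_set_ps(8., 7., 6., 5., 4., 3., 2., 1.))
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx") {
        let sum = unsafe { demo() };
        assert_eq!(sum, 36.0); // 1 + 2 + ... + 8
        println!("sum = {sum}");
    }
}
```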
Now we can conveniently write down the mean, the variance and the covariance. The variance, you ask? The separate variance was introduced - again, premature optimization at its finest - to avoid the overhead of reading from the same memory address twice.
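In scalar form, the quantities the SIMD version accumulates per window look like this: a single pass collects the sums of x, y, x² and x·y, from which mean, variance and covariance follow. Names and signature are illustrative, not from ssim.rs:

```rust
/// One-pass mean, variance and covariance over a window.
/// var_y follows the same pattern as var_x and is omitted for brevity.
fn window_stats(x: &[f32], y: &[f32]) -> (f32, f32, f32) {
    let n = x.len() as f32;
    let (mut sx, mut sy, mut sxx, mut sxy) = (0.0, 0.0, 0.0, 0.0);
    for (&a, &b) in x.iter().zip(y) {
        sx += a;
        sy += b;
        sxx += a * a; // reuses `a` while it is already loaded
        sxy += a * b;
    }
    let (mean_x, mean_y) = (sx / n, sy / n);
    let var_x = sxx / n - mean_x * mean_x;
    let cov_xy = sxy / n - mean_x * mean_y;
    (mean_x, var_x, cov_xy)
}

fn main() {
    let x = [1.0, 2.0, 3.0, 4.0];
    let y = [2.0, 4.0, 6.0, 8.0];
    let (m, v, c) = window_stats(&x, &y);
    assert_eq!((m, v, c), (2.5, 1.25, 2.5));
    println!("mean {m}, var {v}, cov {c}");
}
```

The SIMD variant does the same accumulation, just eight lanes at a time, with hsum at the very end.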
The resulting code is:
- Four times the line count of the iterator code
- Difficult to understand
- Not portable at all - it only runs on AVX-enabled x86 processors
- Unpredictable in performance across different CPUs
- As unsafe as it gets
But the worst part is:
- It's also unintuitive as hell => everybody will break something when trying to fix stuff here.
So it had better be quick, right? We do 8 elements at a time - a complete row of the window - instead of that nasty nested loop.
In short: We expect performance greatness for our endeavour!
Let's run it!
We can use a simple test program to profile this and switch between the SIMD and the plain CPU variant:

```rust
use ssim_simple;

const RUN_COUNT: usize = 2000;
```
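A minimal sketch of what such a timing harness can look like. ssim_naive and ssim_simd are hypothetical stand-ins for the two implementations; the real test program in ssim_simple differs:

```rust
use std::time::Instant;

const RUN_COUNT: usize = 2000;

// Hypothetical stand-ins for the two implementations under test;
// the real functions live in the ssim_simple crate.
fn ssim_naive(data: &[u8]) -> f64 {
    data.iter().map(|&p| p as f64).sum::<f64>() / data.len() as f64
}
fn ssim_simd(data: &[u8]) -> f64 {
    ssim_naive(data)
}

fn main() {
    let image = vec![127u8; 64 * 64];
    for (name, f) in [
        ("naive", ssim_naive as fn(&[u8]) -> f64),
        ("simd", ssim_simd as fn(&[u8]) -> f64),
    ] {
        let start = Instant::now();
        let mut checksum = 0.0; // keep the result alive so nothing is optimized away
        for _ in 0..RUN_COUNT {
            checksum += f(&image);
        }
        println!("{name}: {:?} (checksum {checksum})", start.elapsed());
    }
}
```

Accumulating a checksum across iterations is a cheap way to stop the optimizer from deleting the benchmarked work entirely.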
We start by comparing debug builds. On my Ryzen 5 notebook, 20 runs of the naive debug implementation take 10.6s, while the SIMD implementation takes 2.5s. Impressive! We achieved a factor of four. But let's see what comes out of a fully optimized release build, since that's what counts in the end. For the release build we increase the iteration count to 1000: SIMD takes 7.5s (+-0.2s) while the plain CPU implementation takes 10.5s (+-0.1s). Now this may or may not be worth it - given the added complexity, it's not as clear anymore. Let's remeasure the numbers on my desktop PC, a Ryzen 7 3800X.
The naive implementation takes 16.5s (+-0.1s) for 2000 iterations, while the SIMD implementation takes 19.0s (+-0.1s). This is quite the surprise, but I am sure one could find the reason in chapter 21 of Agner's bible (Agner Fog's microarchitecture manual) if we dug deeper.
| PC | Build | Runs | Naive | SIMD |
|---|---|---|---|---|
| Ryzen 5 laptop | debug | 20 | 10.6 s | 2.5 s |
| Ryzen 5 laptop | release | 1000 | 10.5 s | 7.5 s |
| Ryzen 7 desktop | release | 2000 | 16.5 s | 19.0 s |
I am really interested in the root cause for these differences, but frankly can't find the time currently to do the required deep-dive.
My take-home messages
- After LLVM's optimizations (which are obviously incredible) there's no factor of 4-10, only something around 30% - and on one machine even a slowdown.
- It's difficult to predict how well intrinsics code will perform on different processors.
In the next part of this series, we will try to optimize the memory layout for the processing windows and see what we can gain from there. Stay tuned!