The Iron Code


'Blazingly fast' part 1: Pure naive rust vs intrinsics

written on 2022-05-10 17:51

About that SSIM and AVX/SSE

Comparing the performance of a naive Rust implementation vs intrinsics-powered code in the image-compare crate.

Prelude - why one should not even think of it

Just as a little intro: I think that "Premature optimization is the root of all evil" (D. Knuth) is still as valid as it was decades ago. Therefore, when I started implementing the image-compare crate, I wrote the algorithms as plainly and concisely as possible, favoring legibility over performance in all places. This is also related to the YAGNI (You ain't gonna need it) principle - we don't have a problem yet, so we don't fix anything. After adding more and more tests and using the code productively on full-HD and 4k images, I actually ran into the performance bottlenecks I had expected.

But what if I need to do it anyway?

There's a rule of thumb in programming to optimize the levels from top to bottom, meaning that a function you don't call often does not need to be super duper optimized to avoid impacting program performance / NFRs or whatever you might call it. Sometimes this can be achieved by:

  • using a different architecture / design,
  • doing some caching, or even
  • caching the results for a grid of values and interpolating in between.

These methods are usually superior to starting out by optimizing hot paths.
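The last idea - caching on a grid and interpolating in between - can be sketched as follows (hypothetical helper names and plain linear interpolation, not code from the crate):

```rust
// Precompute an expensive function on a regular grid, then answer
// queries by linear interpolation between the two neighboring samples.
fn build_grid(f: impl Fn(f64) -> f64, x0: f64, step: f64, n: usize) -> Vec<f64> {
    (0..n).map(|i| f(x0 + i as f64 * step)).collect()
}

fn lerp_lookup(grid: &[f64], x0: f64, step: f64, x: f64) -> f64 {
    let t = (x - x0) / step;
    // Clamp so x values at the upper edge still have a right neighbor.
    let i = (t.floor() as usize).min(grid.len() - 2);
    let frac = t - i as f64;
    grid[i] * (1.0 - frac) + grid[i + 1] * frac
}

fn main() {
    // Sample x² at 0, 1, 2, 3, 4 and interpolate in between.
    let grid = build_grid(|x| x * x, 0.0, 1.0, 5);
    println!("{}", lerp_lookup(&grid, 0.0, 1.0, 1.5)); // 2.5 (exact value: 2.25)
}
```

The interpolation error is the price paid for never recomputing the function on the hot path.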

In the case of comparing images, caching, cache-tables and interpolation are not possible. But since modern CPUs usually have many cores, we can leverage multithreading to speed up the computation. With tools like rayon this is super easy to achieve, so I went for it. In the simplest cases, iter was replaced by par_iter, and a few little things in the structure needed to be changed to use multiple iterators with izip! from itertools. In the end the conversion was easy and the performance increase was incredible.
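As a sketch of what such a conversion looks like (a hypothetical per-pixel metric, not the crate's actual code): the serial iterator below turns parallel by adding `use rayon::prelude::*;` and swapping `iter` for `par_iter`:

```rust
// Hypothetical per-pixel metric: mean squared error between two buffers.
fn mean_squared_error(a: &[u8], b: &[u8]) -> f64 {
    let sum: f64 = a
        .iter() // with rayon: `use rayon::prelude::*;` and `.par_iter()`
        .zip(b.iter()) // rayon: `.zip(b.par_iter())`
        .map(|(&x, &y)| {
            let d = x as f64 - y as f64;
            d * d
        })
        .sum();
    sum / a.len() as f64
}

fn main() {
    let a = vec![0u8; 64];
    let b = vec![2u8; 64];
    println!("{}", mean_squared_error(&a, &b)); // 4
}
```

Because every pixel pair is independent, the parallel version needs no locking; rayon splits the ranges across its worker threads and sums the partial results.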

Memory Layout considerations

In image-compare I split all channels of RGB images beforehand anyway to try to optimize cache utilization. In GPGPU this discussion is usually called AoS vs SoA (Array of Structs vs Struct of Arrays, see e.g. wikipedia), where Struct of Arrays produces a memory layout that is favorable for processing, even though the code may end up a bit less intuitive. It also allows us to do some fancy stuff like in-memory compression using algorithms like LZW. In databases, this is one of the reasons why column-oriented storage is so hip nowadays. The memory layout would now be great for stream-processing the image on a GPU. It's not optimal for processing in windows though, and we will fix this up in the next part of the series.
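A minimal illustration of the two layouts (hypothetical type names, not the crate's internals):

```rust
// AoS: pixels interleaved, the way image buffers usually arrive.
struct RgbPixel {
    r: u8,
    g: u8,
    b: u8,
}

// SoA: one contiguous plane per channel - each channel can now be
// streamed linearly, which is what caches and SIMD units like.
struct RgbPlanes {
    r: Vec<u8>,
    g: Vec<u8>,
    b: Vec<u8>,
}

fn split_channels(pixels: &[RgbPixel]) -> RgbPlanes {
    RgbPlanes {
        r: pixels.iter().map(|p| p.r).collect(),
        g: pixels.iter().map(|p| p.g).collect(),
        b: pixels.iter().map(|p| p.b).collect(),
    }
}

fn main() {
    let pixels = [RgbPixel { r: 1, g: 2, b: 3 }, RgbPixel { r: 4, g: 5, b: 6 }];
    let planes = split_channels(&pixels);
    println!("{:?} {:?} {:?}", planes.r, planes.g, planes.b); // [1, 4] [2, 5] [3, 6]
}
```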

Going at it with SIMD

Following the memory layout, there has to be a better way than two-dimensional for-loops or iterators over memory-adjacent elements. Naturally, vector extensions like SSE and AVX came to my mind. Let's look at the algorithm for SSIM. Since it's my spare time and I am interested in performance anyway, I gave it a shot even though I knew that it's hard to beat modern compiler optimizations. For any 8x8 pixel window, the SSIM can be written as (taken from image-compare's source):

fn ssim_for_window(first: &GrayImage, second: &GrayImage, window: &Window) -> f64 {
    let mean_x = mean(first, window);
    let mean_y = mean(second, window);
    let variance_x = covariance(first, mean_x, first, mean_x, window);
    let variance_y = covariance(second, mean_y, second, mean_y, window);
    let covariance = covariance(first, mean_x, second, mean_y, window);
    let numerator = (2. * mean_x * mean_y + C1) * (2. * covariance + C2);
    let denominator = (mean_x.powi(2) + mean_y.powi(2) + C1) * (variance_x + variance_y + C2);
    numerator / denominator
}

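For reference, here is a scalar sketch of what the helpers compute, written over plain slices (the crate's real `mean` and `covariance` take a `GrayImage` and a `Window`, and whether the normalization divides by N or N-1 may differ; `C1` and `C2` above are the usual SSIM stabilization constants):

```rust
// Mean of a window's pixel values.
fn mean(values: &[f64]) -> f64 {
    values.iter().sum::<f64>() / values.len() as f64
}

// Covariance of two windows around their means; called with the same
// slice and mean twice, it yields the variance.
fn covariance(xs: &[f64], mean_x: f64, ys: &[f64], mean_y: f64) -> f64 {
    xs.iter()
        .zip(ys)
        .map(|(x, y)| (x - mean_x) * (y - mean_y))
        .sum::<f64>()
        / xs.len() as f64
}

fn main() {
    let xs = [1.0, 2.0, 3.0, 4.0];
    let m = mean(&xs);
    println!("{} {}", m, covariance(&xs, m, &xs, m)); // 2.5 1.25
}
```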
So we would need to implement a mean, a variance and a covariance in SIMD code in Rust. Unfortunately, portable SIMD is not in stable Rust just yet, so I decided to use intrinsics to get a feeling for the performance difference. To load 8 pixels of type u8 we can use the following function:

fn load_as_float(px: &[u8]) -> __m256 {
    unsafe {
        let row = _mm_loadu_si128(px.as_ptr() as *const _);
        let row_wide = _mm256_cvtepu8_epi32(row);
        _mm256_cvtepi32_ps(row_wide)
    }
}

We actually load 16 pixels and discard the upper half - problematic :/ We should really utilize them too, by processing two adjacent windows at once. But let's continue here for now. We need some code to do a horizontal sum over a vector, see sauce:

fn sum_reduce(rows: __m256) -> f32 {
    unsafe {
        let hi_quad = _mm256_extractf128_ps::<1>(rows);
        let lo_quad = _mm256_castps256_ps128(rows);
        let sum_quad = _mm_add_ps(lo_quad, hi_quad);
        let lo_dual = sum_quad;
        let hi_dual = _mm_movehl_ps(sum_quad, sum_quad);
        let sum_dual = _mm_add_ps(lo_dual, hi_dual);
        let lo = sum_dual;
        let hi = _mm_shuffle_ps::<0x1>(sum_dual, sum_dual);
        let sum = _mm_add_ss(lo, hi);
        _mm_cvtss_f32(sum)
    }
}

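This shuffle dance is easier to follow next to its scalar equivalent - the same three folding steps written out on a plain array (illustration only, not part of the crate):

```rust
// What sum_reduce computes: the horizontal sum of eight f32 lanes,
// folded in halves just like the intrinsics version.
fn sum_reduce_scalar(lanes: [f32; 8]) -> f32 {
    // Step 1: add the high four lanes onto the low four
    // (_mm256_extractf128_ps + _mm_add_ps).
    let quad = [
        lanes[0] + lanes[4],
        lanes[1] + lanes[5],
        lanes[2] + lanes[6],
        lanes[3] + lanes[7],
    ];
    // Step 2: add the high pair onto the low pair (_mm_movehl_ps + _mm_add_ps).
    let dual = [quad[0] + quad[2], quad[1] + quad[3]];
    // Step 3: add the last two lanes (_mm_shuffle_ps + _mm_add_ss).
    dual[0] + dual[1]
}

fn main() {
    println!("{}", sum_reduce_scalar([1., 2., 3., 4., 5., 6., 7., 8.])); // 36
}
```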
Now we can conveniently write down the mean, the variance and the covariance. The variance, you ask? The separate variance was introduced - again, premature optimization at its finest - to avoid the overhead of reading from the same memory address twice.

fn mean_simd(image: &GrayImage, window: &Window) -> f64 {
    if window.width() != 8 {
        return mean(image, window);
    }
    let mut sum = 0.0;
    for row in 0..window.height() {
        let row = row + window.top_left.1;
        let row_floats = load_as_float(&image.get_pixel(window.top_left.0, row).0);
        let result = sum_reduce(row_floats);
        sum += result;
    }
    sum as f64 / window.area() as f64
}

fn variance_simd(image_x: &GrayImage, mean_x: f64, window: &Window) -> f64 {
    if window.width() != 8 {
        return covariance(image_x, mean_x, image_x, mean_x, window);
    }
    let mut sum: f64 = 0.0;
    let mean_x_f32 = mean_x as f32;

    for row in 0..window.height() {
        unsafe {
            let row = row + window.top_left.1;
            let row_floats_x = load_as_float(&image_x.get_pixel(window.top_left.0, row).0);
            let row_floats_mean_x = _mm256_set1_ps(mean_x_f32);
            let diffs_x = _mm256_sub_ps(row_floats_x, row_floats_mean_x);
            let var = _mm256_mul_ps(diffs_x, diffs_x);
            let var_sum = sum_reduce(var);
            sum += var_sum as f64;
        }
    }
    sum / window.area() as f64
}

fn covariance_simd(
    image_x: &GrayImage,
    mean_x: f64,
    image_y: &GrayImage,
    mean_y: f64,
    window: &Window,
) -> f64 {
    if window.width() != 8 {
        return covariance(image_x, mean_x, image_y, mean_y, window);
    }
    let mut sum: f64 = 0.0;
    let mean_x_f32 = mean_x as f32;
    let mean_y_f32 = mean_y as f32;
    for row in 0..window.height() {
        unsafe {
            let row = row + window.top_left.1;
            let row_floats_x = load_as_float(&image_x.get_pixel(window.top_left.0, row).0);
            let row_floats_mean_x = _mm256_set1_ps(mean_x_f32);
            let row_floats_y = load_as_float(&image_y.get_pixel(window.top_left.0, row).0);
            let row_floats_mean_y = _mm256_set1_ps(mean_y_f32);

            let diffs_x = _mm256_sub_ps(row_floats_x, row_floats_mean_x);
            let diffs_y = _mm256_sub_ps(row_floats_y, row_floats_mean_y);
            let cov = _mm256_mul_ps(diffs_x, diffs_y);
            let cov_sum = sum_reduce(cov);
            sum += cov_sum as f64;
        }
    }
    sum / window.area() as f64
}

The resulting code is:

  • Four times the lines of the iterator code
  • Difficult to understand
  • Not portable at all, only runs on AVX enabled x86 processors
  • Unpredictable in terms of performance across different CPUs
  • As unsafe as it gets

But the worst part is:

  • It's also unintuitive like hell => everybody will break something when trying to fix stuff here.

So it had better be quick, right? We do 8 elements at a time, a complete row of the window, instead of that nasty nested loop.

In short: We expect performance greatness for our endeavour!

Let's run it!

We can just use a simple test program to profile this and switch between the simd and the cpu variant:

use image_compare::ssim_simple;
const RUN_COUNT: usize = 2000;
fn main() {
    let image_one = image::open("tests/data/pad_gaprao.png")
        .expect("Could not find test-image");
    (0..RUN_COUNT).for_each(|_| {
        ssim_simple(&image_one, &image_one).unwrap();
    });
}

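If no external profiler is at hand, the loop could also be timed in-process with `std::time::Instant` - a minimal sketch (the `time_runs` helper is hypothetical, not part of the crate):

```rust
use std::time::{Duration, Instant};

// Run a closure `runs` times and report the elapsed wall-clock time.
fn time_runs<F: FnMut()>(runs: usize, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..runs {
        f();
    }
    start.elapsed()
}

fn main() {
    // Stand-in workload; in the real harness this would call ssim_simple.
    let mut counter = 0u64;
    let elapsed = time_runs(1000, || counter += 1);
    println!("{} runs in {:?}", counter, elapsed);
}
```

For serious numbers a benchmarking harness like criterion would be preferable, since it handles warm-up and statistical noise; for a rough factor-of-x comparison, wall-clock timing is good enough.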
So we start by comparing debug builds. On my Ryzen 5 notebook, the naive debug implementation takes 10.6s for 20 runs while the simd implementation takes 2.5s. Impressive! We achieved a factor of four. But let's see what comes out of a fully optimized release build, since that's what counts in the end. For the release build we increase the iterations to 1000: simd takes 7.5s (+-0.2s) while the plain cpu implementation takes 10.5s (+-0.1s). Now this may or may not be worth it - it's not as clear anymore given the added complexity. Let's remeasure the numbers on my desktop pc, a Ryzen 7 3800x.

The naive implementation takes 16.5s (+-0.1s) for 2000 iterations while the simd implementation takes 19.0s (+-0.1s). This is quite the surprise, but I am sure one could find the reason in chapter 21 of Agner's bible if we dug deeper.

  build         naive    simd
  laptop dbg    10.6s    2.5s
  laptop rel    10.5s    7.5s
  desktop rel   16.5s    19.0s

I am really interested in the root cause for these differences, but frankly can't find the time currently to do the required deep-dive.

My take home messages

  • After LLVM's optimizations (which are obviously incredible) there's no factor of 4-10, only something around 30%.
  • It's difficult to predict what good intrinsics code has to look like for different processors.

In the next part of this series, we will try to optimize the memory layout for the processing windows and see what we can gain from there. Stay tuned!