The Iron Code


'Blazingly fast' part 2: Doing some memory optimizations

written on 2022-07-08 21:20

Prelude - why one probably also shouldn't think of it today

When you talk to some older guys in software development (do I have to consider myself an old guy at 35 already?), you can often hear a lot of cool 'war stories' about how they eked out a few cycles here and there. Many of them will tell you that today, it's more about memory optimization and cache misses. So let's try that today, after our SIMD approach failed so miserably in part 1 of the series.

How is this memory stuff working, yo?

What you have to know for these kinds of optimizations is that memory is usually read in lines of a size specific to the hardware you are operating on. When you try to read a single-byte char, what gets loaded into the cache is usually at least a complete cache line with your byte inside. This means that any operation on the following bytes up to the end of the cache line will not cause another read from main memory; the request will be a cache hit, not a cache miss. Therefore, it would be really cool to utilize most of the data which is already in the cache after the first fetch. How large is such a cache line? It depends on your hardware. Zen 2 processors such as the one in my desktop PC have 64-byte cache lines. Operating on a grayscale image with a single byte per pixel means we can get up to 63 pixels for free. Sounds promising, let's get started!
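A tiny illustration of the effect (my own sketch, not code from the repository): summing the same grayscale buffer in row order touches each 64-byte line once, while walking it column by column pulls in a new line on almost every access as soon as the row stride exceeds the line size.

// Hypothetical illustration: same data, two traversal orders.
// `width` and `height` describe a grayscale image stored row-major.
fn sum_row_major(pixels: &[u8], width: usize, height: usize) -> u64 {
    let mut sum = 0u64;
    for y in 0..height {
        for x in 0..width {
            // consecutive x values are adjacent bytes -> mostly cache hits
            sum += pixels[y * width + x] as u64;
        }
    }
    sum
}

fn sum_column_major(pixels: &[u8], width: usize, height: usize) -> u64 {
    let mut sum = 0u64;
    for x in 0..width {
        for y in 0..height {
            // consecutive y values are `width` bytes apart -> likely a new
            // cache line on every access once width exceeds the line size
            sum += pixels[y * width + x] as u64;
        }
    }
    sum
}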

And how do images and the algorithm work?

Once again, we start off with the most naive implementation (still the one which is live on master). We skip parallelization for now, since it will be way easier to understand single-threaded. We are going to use the SSIM algorithm again, which operates on windows / tiles of 8x8 = 64 pixels. So in principle a whole window fits into my cache line, but will it? No! The standard way of putting images into memory is a one-dimensional array of pixels - the rows are simply concatenated. What does this mean for the memory accesses? It's not ideal. We are fetching 64 pixels of the first row yet only processing 8. I will be overdramatic now, for the sake of storytelling... Modern CPUs have several layers of caches with increasing sizes and are overall very smart about keeping data in them, so the effects will be less dramatic in practice ;) . On each access, we are wasting 56 of the 64 fetched pixels. That is so much that we should easily be able to eke out a nice performance gain by reordering the memory first. Let's develop an acceleration structure for this purpose and linearize the problem.
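To make the waste concrete, here is a small sketch (again my own, not from the post) of where the pixels of one 8x8 window actually live in a row-major buffer: each of the eight window rows starts a full image width after the previous one, so each row triggers its own cache-line fetch of which only 8 bytes belong to the window.

// Byte offsets at which the eight rows of an 8x8 window start,
// for window coordinates (wx, wy) in a row-major grayscale image.
fn window_row_starts(wx: usize, wy: usize, width: usize) -> [usize; 8] {
    let mut starts = [0usize; 8];
    for row in 0..8 {
        // start of window row `row` in the flat pixel buffer
        starts[row] = (wy * 8 + row) * width + wx * 8;
    }
    starts
}

// For width = 1920 the eight rows start 1920 bytes apart, so each one sits
// in a different 64-byte cache line, and only 8 of those 64 bytes are
// pixels this window actually needs.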

Introducing: LinearAccelerator

We will need a little more code here, since we want to prepare for thread safety in the multithreaded version later on. First of all, we need some structure to access a slice from multiple threads at once. If - as in this benchmark - there is only one thread, it should not produce too much overhead. I will show the code here for the struct called UnsafeSlice; I found this somewhere on the Internet and can't remember the original author, but will gladly give credit if I happen to find out:

use std::cell::UnsafeCell;

#[derive(Copy, Clone)]
pub struct UnsafeSlice<'a, T> {
    slice: &'a [UnsafeCell<T>],
}
// Safety: callers must guarantee that no two threads ever write the same index.
unsafe impl<'a, T: Send + Sync> Send for UnsafeSlice<'a, T> {}
unsafe impl<'a, T: Send + Sync> Sync for UnsafeSlice<'a, T> {}

impl<'a, T> UnsafeSlice<'a, T> {
    pub fn new(slice: &'a mut [T]) -> Self {
        // reinterpret the exclusive slice as a slice of UnsafeCells so it can be shared
        let ptr = slice as *mut [T] as *const [UnsafeCell<T>];
        Self {
            slice: unsafe { &*ptr },
        }
    }
    /// Safety: only one thread may write to index `i` at any given time.
    pub unsafe fn write(&self, i: usize, value: T) {
        let ptr = self.slice[i].get();
        *ptr = value;
    }
}
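A hypothetical usage sketch (not from the post): as long as every index is written by at most one thread, handing out copies of the UnsafeSlice to rayon tasks is sound, which is exactly the property the reordering pass below relies on.

use rayon::prelude::*;

fn unsafe_slice_demo() {
    let mut data = vec![0u8; 16];
    let slice = UnsafeSlice::new(data.as_mut_slice());
    // every task writes a distinct index, so no two threads ever alias the same element
    (0..16usize).into_par_iter().for_each(|i| unsafe {
        slice.write(i, i as u8);
    });
    assert_eq!(data[15], 15);
}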

We will need to iterate over all the pixels of the image anyway to reorder them, so we might just as well compute the sum of each window / tile in this pass to save some cycles later on. Also, we are using an AtomicU16 to store the sum; it's almost no overhead in the single-threaded case and works multithreaded, too. The 16-bit integer is another performance optimization, because adding 16-bit ints is way faster than converting two u8 to f64 and adding them (an 8x8 window sums to at most 64 * 255 = 16320, which comfortably fits in a u16). As in part 1 of this series, we are going to cheat a lot for our optimized versions, to make them extra fast :).

Let's introduce a new struct for this: WindowCache

use std::sync::atomic::{AtomicU16, Ordering};

pub struct WindowCache {
    // the window tile
    pub window: Window,
    // the offset of the linearized data in the UnsafeSlice
    pub data_offset: usize,
    // the sum of all pixels
    pub sum: AtomicU16,
}

impl WindowCache {
    pub fn mean(&self) -> f64 {
        self.sum.load(Ordering::Relaxed) as f64 / self.window.area() as f64
    }

    pub fn variance(&self, mean: f64, data: &[u8]) -> f64 {
        (self.data_offset..self.data_offset + self.window.area() as usize)
            .map(|i| (data[i] as f64 - mean).powi(2))
            .sum::<f64>()
            / self.window.area() as f64
    }

    pub fn covariance(&self, mean: f64, data: &[u8], other_mean: f64, other_data: &[u8]) -> f64 {
        (self.data_offset..self.data_offset + self.window.area() as usize)
            .map(|i| (data[i] as f64 - mean) * (other_data[i] as f64 - other_mean))
            .sum::<f64>()
            / self.window.area() as f64
    }
}
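With mean, variance and covariance per window in place, the per-window SSIM score follows the usual formula. The helper below is my own sketch, not code from the repository; ssim_window and the constants follow the standard SSIM definition with K1 = 0.01, K2 = 0.03 and a dynamic range of 255.

// Hypothetical helper: SSIM for one pair of corresponding windows,
// reading pixel data from the two linearized buffers.
fn ssim_window(a: &WindowCache, b: &WindowCache, data_a: &[u8], data_b: &[u8]) -> f64 {
    const C1: f64 = (0.01 * 255.0) * (0.01 * 255.0);
    const C2: f64 = (0.03 * 255.0) * (0.03 * 255.0);

    let mean_a = a.mean();
    let mean_b = b.mean();
    let var_a = a.variance(mean_a, data_a);
    let var_b = b.variance(mean_b, data_b);
    let cov = a.covariance(mean_a, data_a, mean_b, data_b);

    ((2.0 * mean_a * mean_b + C1) * (2.0 * cov + C2))
        / ((mean_a * mean_a + mean_b * mean_b + C1) * (var_a + var_b + C2))
}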

The acceleration structure LinearAccelerator holds everything together.
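The struct definition itself isn't spelled out in the post; inferred from the constructor, it presumably just bundles the per-window caches and the reordered pixel buffer:

pub struct LinearAccelerator {
    // one cache entry per 8x8 window, in row-major window order
    pub windows: Vec<WindowCache>,
    // all pixels, reordered so that each window's bytes are contiguous
    pub buffer: Vec<u8>,
}

The constructor does the heavy lifting: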

impl LinearAccelerator {
    pub fn mk_cached_subdivision(image: &GrayImage) -> LinearAccelerator {
        // target buffer for the reordered pixels, one byte per pixel
        let mut buffer = vec![0u8; (image.width() * image.height()) as usize];
        let uslice = UnsafeSlice::new(buffer.as_mut_slice());
        let windows: Vec<Window> =
            Window::from_image(image).subdivide_by_offset(DEFAULT_WINDOW_SIZE);
        // assign each window a contiguous region in the linear buffer
        let offset = AtomicUsize::new(0);
        let window_caches: Vec<_> = windows
            .into_iter()
            .map(|window| {
                let current_offset = offset.fetch_add(window.area() as usize, Ordering::Relaxed);
                WindowCache {
                    window,
                    data_offset: current_offset,
                    sum: AtomicU16::new(0),
                }
            })
            .collect();
        // number of windows per row, rounding up for a partial window at the right edge
        let mut window_cols = image.width() / DEFAULT_WINDOW_SIZE;
        if image.width() % DEFAULT_WINDOW_SIZE > 0 {
            window_cols += 1;
        }

        // single pass over the image in row order: accumulate the per-window sums
        // and copy every pixel into its window's contiguous region
        image.enumerate_rows().for_each(|(_, row)| {
            for (x, y, p) in row {
                let window_x = x / DEFAULT_WINDOW_SIZE;
                let window_y = y / DEFAULT_WINDOW_SIZE;
                let idx = window_y * window_cols + window_x;
                let item = unsafe { window_caches.get_unchecked(idx as usize) };
                item.sum.fetch_add(p[0] as u16, Ordering::Relaxed);
                // position of this pixel inside its (possibly truncated) window
                let window_width =
                    (image.width() - window_x * DEFAULT_WINDOW_SIZE).min(DEFAULT_WINDOW_SIZE);
                let local = (y % DEFAULT_WINDOW_SIZE) * window_width + x % DEFAULT_WINDOW_SIZE;
                unsafe {
                    uslice.write(item.data_offset + local as usize, p[0]);
                }
            }
        });

        LinearAccelerator {
            windows: window_caches,
            buffer,
        }
    }
}
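For context, here is how I would expect the second pass to tie things together. Again, this is only a sketch under the assumption that the ssim_window helper from above is available and that both images share the same dimensions:

// Hypothetical driver: build both accelerators, then score window pairs.
fn ssim_linear(a: &GrayImage, b: &GrayImage) -> f64 {
    let acc_a = LinearAccelerator::mk_cached_subdivision(a);
    let acc_b = LinearAccelerator::mk_cached_subdivision(b);
    let total: f64 = acc_a
        .windows
        .iter()
        .zip(acc_b.windows.iter())
        .map(|(wa, wb)| ssim_window(wa, wb, &acc_a.buffer, &acc_b.buffer))
        .sum();
    // mean SSIM over all windows
    total / acc_a.windows.len() as f64
}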

This is great in multiple ways: the first pass produces good cache hits, since we iterate the rows, which are contiguous in memory. Also, we get the sums (and therefore the means) for free in this pass, which comes in handy later. The second pass for each window will then also maximize cache usage. Let's benchmark this!

Results

Version      Time     Instructions
Standard     9.2 s    101 bn
Optimized    4.8 s    32 bn

So the optimized version, despite reordering the image completely, is still a factor of two faster! That's very good, so we should check that version in on master... or should we?

A new challenger has entered the stage: Multithreading

I can't think of a single PC or laptop with fewer than four cores nowadays. As we already mentioned in part 1 of this series, the rule of thumb is to optimize at the highest architectural level, and that is also the level where you want to parallelize.

For our naive implementation, we would parallelize over the windows / tiles, which is pretty simple: let's just include rayon and par_iter() them. For the acceleration structure it's not that obvious. We want to generate the caches for both images at the same time, and we want to process each row in parallel. Rayon's par_bridge() comes in handy in these two places, and we're done.
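Roughly, the two parallelization points could look like this. This is only a sketch; score_window and process_pixel are hypothetical stand-ins for the real per-window and per-pixel work.

use rayon::prelude::*;

// Naive version: hand the window list to rayon directly.
fn naive_parallel(windows: &[Window], image_a: &GrayImage, image_b: &GrayImage) -> Vec<f64> {
    windows
        .par_iter()
        .map(|window| score_window(window, image_a, image_b))
        .collect()
}

// Accelerated version: bridge the image crate's row iterator into rayon,
// so the reordering pass handles rows in parallel.
fn reorder_rows_parallel(image: &GrayImage) {
    image.enumerate_rows().par_bridge().for_each(|(_, row)| {
        for (x, y, p) in row {
            process_pixel(x, y, p[0]);
        }
    });
}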

Results

So let's run this again on my Ryzen processor and compare the results:

Version           Time      Instructions
Standard (MT)     6.0 s     498 tn
Optimized (MT)    11.4 s    190 tn

The naive implementation smokes our FuLly-HiGh-Per4mance branch yet again. How could this happen? Well, if we compare the instruction counts of the runs, we see roughly a factor of 3 in single-threaded mode and 2.5 in multi-threaded mode, so our optimized algorithm worked as intended.

The difference is that our premise of 56 out of 64 pixels being unused and missed is no longer true in the multithreaded case. If 8 threads are processing 8 row-adjacent windows at the same time, a single cache line of 64 pixels can keep all of them busy with maximum cache hits. In the optimized case, one cache line contains a single window of a single thread. The second thread will have to 'wait' and fetch its own window data before it can even start processing. The result is that we are slower than the naive implementation again. Even cutting all those corners, like using smaller data types for the sum and calculating the mean on the first pass, will not make up for this. This issue gets more important the more threads you have and the longer your cache lines are. GPUs, with thread counts going totally bonkers, are ideal candidates for fuckups here.

My take-home messages

  • Again: take advice on must-do optimizations with a few grams of salt
  • Don't optimize without benchmarks indicating a need
  • Know your target architecture and thread count
  • Try the "let's just TBB / rayon this high-level loop" approach; it's not only the easiest, it often yields excellent results

So, I think I am done with performance optimizations for now. The most naive version on master comes out on top in almost all metrics anyway, so let's keep the code that is most legible, easiest to maintain, most compact, and still keeps up with the manually optimized versions in most cases :)

Thanks for reading and for joining the journey!

Tagged:
General
Programming