This check was wrong; 1 << 0 = 1. The intent was to skip allocating a 1x1
matrix by assuming it is a constant 0.5. But as written, the check never
actually executed.
This did not affect the runtime performance, nor did it leak memory; but it
did mean we didn't hit the intended `assert`.
Since we only need 8 bytes to store the dither matrix pointer, we actually
still have 8 bytes left-over. That means we could either store the 8-bit
row offset directly, or alternatively compute a 16-bit pointer offsets.
I have chosen to do the former for the C backend, in the interest of
simplicity.
The one downside of this approach is that it would fail on hypothetical
128-bit platforms; although I seriously hope that this code does not live
long enough to see the need for 128-bit addressable memory.
This will serve as a reference for the SIMD backends to come. That said,
with auto-vectorization enabled, the performance of this is not atrocious.
It easily beats the old C code and sometimes even the old SIMD.
In theory, we can dramatically speed it up by using GCC vectors instead of
arrays, but the performance gains from this are too dependent on exact GCC
versions and flags, so it practice it's not a substitute for a SIMD
implementation.