Faster zlib/DEFLATE decompression on the Apple M1 (and x86)
last updated: Oct 20, 2023
https://dougallj.wordpress.com/2022/08/20/faster-zlib-deflate-decompression-on-the-apple-m1-and-x86/
Most of the speedup just comes from reading and applying Fabian Giesen’s posts on Huffman decoding (including his intro to dataflow graphs).
The goal was to use roughly the following variant-4-style loop to decode and unconditionally refill with eight-cycle latency on Apple M1...
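To make the quote concrete, here is a minimal sketch of what a "variant-4-style" decode-and-refill loop can look like. It follows the branchless refill that Fabian Giesen's bit-reading posts call variant 4, as I understand it; it is not the code from the linked post and makes no claim about cycle counts. The `BitReader` struct, the helper names, and the three-symbol toy prefix code in `TABLE` are all made up for illustration, and the input is assumed to be padded so that an 8-byte load past the last real byte stays in bounds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bit reader state (names are illustrative only). */
typedef struct {
    const uint8_t *ptr;  /* next input byte not yet counted in `count` */
    uint64_t bits;       /* unread bits, least significant bit first */
    unsigned count;      /* number of bits in `bits` known to be valid */
} BitReader;

/* Portable little-endian 64-bit load; compilers usually turn this into one load. */
static uint64_t load64_le(const uint8_t *p) {
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Branchless refill: afterwards `count` is in [56, 63].
 * Always reads 8 bytes at `ptr`, so the input must be padded. */
static void refill(BitReader *br) {
    br->bits |= load64_le(br->ptr) << br->count;
    br->ptr += (63 - br->count) >> 3;  /* whole bytes that now fit */
    br->count |= 56;
}

static void consume(BitReader *br, unsigned n) {
    assert(n <= br->count);
    br->bits >>= n;
    br->count -= n;
}

/* Toy prefix code, first bit read is the low bit:
 *   'a' = 0      (1 bit)
 *   'b' = 1,0    (2 bits)
 *   'c' = 1,1    (2 bits)
 * Each entry is (symbol << 8) | code length, indexed by the low 2 bits. */
static const uint16_t TABLE[4] = {
    ('a' << 8) | 1,  /* 0b00 */
    ('b' << 8) | 2,  /* 0b01 */
    ('a' << 8) | 1,  /* 0b10 */
    ('c' << 8) | 2,  /* 0b11 */
};

/* Decode `nsyms` symbols, refilling unconditionally on every iteration. */
static void decode(const uint8_t *src, uint8_t *dst, size_t nsyms) {
    BitReader br = { src, 0, 0 };
    for (size_t i = 0; i < nsyms; i++) {
        refill(&br);
        uint16_t e = TABLE[br.bits & 3];
        dst[i] = (uint8_t)(e >> 8);
        consume(&br, e & 0xff);
    }
}
```

Calling `decode` on a zero-padded buffer whose first byte is `0x1A` and asking for four symbols should give `abca`. The point of the shape is that the refill has no branch and the table entry carries the length, so the loop-carried dependency is just load, table lookup, shift; a real DEFLATE decoder additionally needs much larger tables, length/distance handling, and bounds checks.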
A very detailed analysis of how to speed up DEFLATE decoding (which also urges you to use zstd instead).