Interesting Failure #4 - Not speeding up Libsamplerate in 2008
My initial motivation for fooling with libsamplerate was that I was mixing down a HUGE 12 hour, 12 track, 96kHz recording for a portable machine that only supported 44.1kHz output, so I was constantly using libsamplerate to mix down and convert dozens of songs to something I could actually hear, and spending 40 minutes per iteration twiddling my thumbs waiting for that to happen.
I also had a hardware audio server project coming up that looked like it might use the then-new Arm VFP instructions for samplerate conversion, and I wanted to learn how to vectorize code on that processor. Not having an ARM926 on hand yet, I started experimenting with the Intel/AMD SSE2 and SSE3 extensions and various compilers (gcc, icc) and their options - to see what I could do on that chip series to improve matters.
It also bugged me that the default samplerate converter for pulseaudio - now a de facto standard - is speex-float-1, whose S/N ratio I have never been able to find documented.
You CAN hear the difference between speex-float-1 and libsamplerate's medium to high modes, but unless you are a passionate listener to high quality (24 bit) music and video at non-44.1kHz sample rates, it's difficult. The first thing I do upon getting a new Linux box is switch it from speex-float-1 to libsamplerate's medium quality... I listen to a lot of music. The added computational overhead just about flattens an Atom-class processor, but bigger processors do ok.
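For anyone who wants to do the same, it is a one line change to PulseAudio's daemon.conf (the exact value names vary by version; pulseaudio --dump-resample-methods lists what your build supports):

    ; /etc/pulse/daemon.conf (or ~/.pulse/daemon.conf for a per-user setting)
    resample-method = src-sinc-medium-quality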
So there I was, with too much time on my hands between mixdowns... With full source code to my problem children...
When I oprofiled the code, bells went off in my head - something like 99% of the mixdown time was being spent in a pair of fairly small routines that looked very optimizable at first glance. In particular, they weren't using SSE at all! Even though the hardware could push two doubles through the pipe at once, the C code didn't compile down to anything even close to that...
So I rewrote those core routines using the SSE intrinsics and got them working in both x86 and x86_64 modes. The generated code got much denser, though the average instruction length grew.
The initial results were encouraging, especially on x86_64. By taking better advantage of all the registers available, AND pushing two doubles through the pipeline at a time, and eliminating or parallelizing several float->double conversions, I got anywhere from a 14% to 50% speedup on my hardware. The x86_64 version also scaled up to more channels far better than the x86 version did, showing that one of the big problems in the original code was register pressure.
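To give a flavor of the approach (a from-scratch sketch, not libsamplerate's actual inner loop - the function and variable names here are all mine), the core of the win was widening incoming floats two at a time and accumulating two doubles per SSE2 instruction:

    /* Sketch only: a dot product of double precision filter coefficients
     * against single precision input, two lanes at a time with SSE2. */
    #include <emmintrin.h>   /* SSE2 intrinsics */

    static double dot2_sse2(const double *coeff, const float *input, int len)
    {
        __m128d acc = _mm_setzero_pd();
        int i = 0;
        for (; i + 2 <= len; i += 2) {
            /* load two floats and widen both to doubles in one step */
            __m128i raw = _mm_loadl_epi64((const __m128i *)(input + i));
            __m128d in  = _mm_cvtps_pd(_mm_castsi128_ps(raw));
            __m128d co  = _mm_loadu_pd(coeff + i);
            acc = _mm_add_pd(acc, _mm_mul_pd(in, co));  /* two lanes per pass */
        }
        double partial[2];
        _mm_storeu_pd(partial, acc);
        double sum = partial[0] + partial[1];
        for (; i < len; i++)                /* scalar tail for odd lengths */
            sum += (double)input[i] * coeff[i];
        return sum;
    }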
I also cut the C code size by a considerable amount with some pre-processor magic. I was HAPPY, and announced my initial results on the mailing list... and started work on making something more generic and general purpose than my initial (two channel only) hacks. Doing a CUDA version even looked plausible and I was dying to try working in that...
In digging into the generated assembly I found that the gcc compilers would arbitrarily reschedule software prefetching from where I thought it should be to where it did little good.
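For those who haven't played with software prefetching: it is just a hint intrinsic dropped into the loop, and its value depends entirely on where it ends up relative to the loads it is supposed to cover. A toy example of the sort of placement I was fighting over (the lookahead distance here is an arbitrary, hypothetical choice):

    #include <xmmintrin.h>   /* _mm_prefetch */

    #define LOOKAHEAD 16     /* hypothetical prefetch distance, in doubles */

    static double sum_with_prefetch(const double *table, int len)
    {
        double sum = 0.0;
        for (int i = 0; i < len; i++) {
            /* ask for a cache line well ahead of the current read;
             * in my case gcc felt free to move this hint elsewhere */
            _mm_prefetch((const char *)(table + i + LOOKAHEAD), _MM_HINT_T0);
            sum += table[i];
        }
        return sum;
    }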
So I went to work on making the code better, reading instruction tables and clock counts, and generally having a merry time researching the darkest details of the x86_64 implementations from both Intel and AMD. I remember that instruction decoding was fascinating - I learnt one hell of a lot about how the various levels of SSE actually worked, about the schism in instruction set extensions that occurred between AMD and Intel after SSE3, and about how difficult the C intrinsics were to actually use in a language that had no native 128-bit type or conception of vector arithmetic. In frustration, I found myself reminiscing about APL one day...
Ultimately I produced a pure x86_64 assembly version which did up to 8 channels of samplerate conversion and met most of my requirements - among many other things, the average instruction length was shorter, which fit the decode windows better - but it was difficult to maintain, debug, and link to, especially as it worked best as inline assembly rather than a function call, and it didn't outperform my initial attempts by very much.
So, as my final shot at the problem, I rewrote the pure assembly into something that could be inlined using gcc's inline assembly extensions. It used up the entire x86_64 register set (this always makes an assembly language programmer feel warm and fuzzy - and most people would be shocked at how few registers most subroutines actually use), BUT it wouldn't compile, because gcc didn't have enough virtual registers defined for it.
I made one small patch to the compiler to fix that, and filed a bug on it (which the gcc developers decided to ignore).
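To give some idea of the style involved (a tiny made-up fragment, nothing like the full 8 channel routine): gcc's extended asm lets you request SSE registers with the "x" constraint and leave the actual register allocation to the compiler, and it was that allocation that fell over on my big routine.

    #include <emmintrin.h>

    /* Made-up fragment: multiply-accumulate two packed doubles using
     * gcc extended asm with SSE ("x") register constraints. */
    static inline __m128d madd_pd(__m128d acc, __m128d a, __m128d b)
    {
        __asm__ ("mulpd %2, %1\n\t"      /* a   = a * b   */
                 "addpd %1, %0"          /* acc = acc + a */
                 : "+x" (acc), "+x" (a)
                 : "x" (b));
        return acc;
    }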
Then I started testing with more processors than I'd had on hand at the beginning.
A lot of oprofiling later...
On the then-new Atom processor, in 64 bit mode, the optimized SSE code was actually SLOWER than the unoptimized code. I puzzled over this for a while; it wasn't until I noticed that an old Opteron also lost performance when doubles were used at the highest quality samplerate conversion setting that I realized the source of the problem.
The size of the lookup table basically wiped out the L2 and L3 cache on any but the largest Intel processors available at the time. The large stride did terrible things to hardware prefetching, too. My hand-placed software prefetches worked well on only a few processors.
So I'd goofed (at least for older and cheaper hardware) by optimizing for floating point pipelining when memory bandwidth was what actually dominated the algorithm's performance. I easily forgive myself for this - it was impossible to "know" it without doing all the work I did delving into the algorithm.
While a CUDA version would still be interesting to try, most CUDA hardware available then (and much of it now!) doesn't have a lot of support for double precision floating point math... And I think memory bandwidth would still be a problem.
And... building a version with sufficient internal abstractions to take advantage of the dozens of different CPU types - much less CUDA! - that the core loops would have to be optimized for proved hard, too. It required far more object oriented coding in C than I cared to do (icc has a built-in mechanism that makes supporting different instruction set extensions easier; I would like it if gcc had the same capability), plus some tricky and hard to maintain cpu recognition routines.
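What I wanted was roughly the sketch below - pick the inner loop once at startup through a function pointer - without writing the cpuid grovelling by hand. All the names here are hypothetical; newer versions of gcc grew __builtin_cpu_supports(), which handles the detection part.

    #include <stdio.h>

    typedef double (*inner_loop_fn)(const double *a, const double *b, int n);

    /* plain C fallback */
    static double inner_generic(const double *a, const double *b, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

    /* imagine the SSE2 intrinsics version from earlier here */
    static double inner_sse2(const double *a, const double *b, int n)
    {
        return inner_generic(a, b, n);   /* stand-in for the sketch */
    }

    /* pick an implementation once, at startup */
    static inner_loop_fn pick_inner_loop(void)
    {
        __builtin_cpu_init();
        return __builtin_cpu_supports("sse2") ? inner_sse2 : inner_generic;
    }

    int main(void)
    {
        double a[4] = { 1, 2, 3, 4 }, b[4] = { 4, 3, 2, 1 };
        printf("%g\n", pick_inner_loop()(a, b, 4));
        return 0;
    }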
Finally, I found that when I compiled the original algorithm on x86_64 for the normal x86 register set rather than SSE2, it performed almost identically to my extensively hand-optimized code.
This was discouraging in the utmost. I gave up.
The memory bandwidth limitation would be even worse on the Arm. Although the VFP floating point unit made using libsamplerate feasible there in the first place, I felt that it would perform badly on that chip at almost any sample rate conversion setting. So in the end I dropped that portion of the project, too.
As I write this, that hardware project is still x86 based, and we ended up using a DAC capable of sample rates all the way up to 192kHz, natively.
I bought a 3x faster Intel box, finished the massive recording project, and switched to recording most of my source material at 44.1kHz, 32 bit. It would be interesting to play with the code and compilers again once things like the VFP extension become more common and Arms get more cache... Maybe the compilers have improved some...
But I'm also considerably more deaf than I was when I started caring about audio quality and find it harder and harder to care as overall cpu performance has doubled again since I started working on this. If anyone cares, I'll dig up the bzr tree.
Labels: audio, failures