MarkF Senior Heliman Location: Palo Alto, CA
QPSK Architecture, Part 1.1?
G'Day!
[Warning! This post is way too long, fairly complex, and will probably be of little interest to most folks. I include it here only to help those few folks who might be interested in seeing some of the thought processes that you sometimes go through while creating a software-defined radio project. Note, too, that I haven't even tried these ideas out yet - you're seeing an iterative design process occur at the same time I'm learning; some ideas will work, and some won't! Please accept my apologies if this is just too far out!]
Well, I was hoping to write part II, but it turns out that we're not ready yet. In a nutshell, the problem is CPU performance. As time progresses, I'm discovering that this QPSK project is totally unlike the receiver I just completed at work. It's not worth explaining in detail, but believe it or not, we'll be using radically different algorithms for all eight of the processing steps listed above! Of these, filtering is the most problematic. On the big QPSK receiver, filtering was performed in hardware, but if possible, I'd prefer to perform it in software here for cost effectiveness. While I was going to delay further discussion on this for a while, it's just too big of a concern to ignore.
As hinted at previously, the front-end square root raised cosine filter is currently a "128-tap FIR filter", meaning that it takes 128 multiplications and 127 additions per sample. With 96,000 samples/second, that's 12,288,000 multiply-accumulate operations (MACs) per second. Unfortunately, ARM CPUs aren't very good at performing MACs - in this application, it'll take about 10 CPU clock cycles/MAC. That means that filtering alone would require a 123 MHz ARM CPU, but we're only using a 60 MHz CPU! Clearly, we need some drastic efficiency improvements somewhere.
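For anyone who wants to check the arithmetic, here is a quick back-of-the-envelope sketch in Python. It is not part of the project's code; the constant names are made up for illustration, and the 10 cycles/MAC figure is just the rough estimate quoted above, not a measurement.

```python
# Back-of-the-envelope CPU budget for the brute-force 128-tap SRRC filter.
TAPS = 128                  # MACs per output sample for a 128-tap FIR
SAMPLE_RATE_HZ = 96_000     # A/D and D/A sample rate
CYCLES_PER_MAC = 10         # rough ARM estimate from the post, not measured
CPU_CLOCK_HZ = 60_000_000   # the 60 MHz CPU actually in use

macs_per_second = TAPS * SAMPLE_RATE_HZ
required_clock_hz = macs_per_second * CYCLES_PER_MAC
cpu_load = required_clock_hz / CPU_CLOCK_HZ

print(f"{macs_per_second:,} MACs/second")          # 12,288,000 MACs/second
print(f"{required_clock_hz / 1e6:.1f} MHz needed")  # 122.9 MHz needed
print(f"{cpu_load:.0%} of the 60 MHz CPU")          # 205% of the 60 MHz CPU
```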
One tempting way out would be to use a DSP chip. DSPs in the $8 price range can perform about 800M MACs/second (in particular, the Analog Devices Blackfin is a great candidate here). Moreover, with glueless interfaces to 96 KHz A/D and D/As, DMA, etc, we could just ignore the complexity of the problem and crunch right through it! This is worth keeping in mind, but before we change horses mid-race, let's first see if it's possible to figure out a way to work smarter, instead of harder.
As I mentioned before, the first attempt at solving this didn't work: changing the SRRC filter into polyphase form did reduce CPU requirements, but it didn't measure up from an RF performance perspective. There's another way we might approach this, and to understand it, we need to talk about "sample rate conversion", which is simply the process of changing from one sampling rate to another.
Start with the basics. For modulation or demodulation, we just need one sample/symbol, or 6,000 samples/second. For symbol timing recovery, and most of the other functions in the receiver, we'll need two samples/symbol, or 12,000 samples/second. At the digital/analog interface, though, we've got that pesky 96,000 samples/second. It'd be nice if we could figure out a way to do as little work as possible at the 96,000 samples/second rate, and do everything else except modulation and demodulation at 12,000 samples/second (mod and demod will always be done at 6,000 samples/second).
Enter "interpolation" and "decimation". Interpolation is the process of converting from a lower sample rate to a higher sample rate, and decimation is the process of moving from a higher sample rate to a lower sample rate. Now, we could just duplicate samples to upconvert, or drop samples to downconvert, but as it turns out, quite counter-intuitively, that adds "spectral images" and "aliasing" components that muck up the output signal and deliver terrible RF performance. Instead, the right thing to do is to use a filtering operation in either direction; the appropriate filters to do that are unsurprisingly called an "interpolation filter" and a "decimation filter". While there are dozens of different approaches to creating these filters, they essentially are all variations on a basic low-pass filter.
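To make that concrete, here is a minimal textbook-style sketch in Python of both operations: zero-stuff and low-pass to interpolate, low-pass and drop samples to decimate. This is only an illustration under assumptions, not the filters this project will use. The function names are hypothetical, and the Hamming-windowed sinc is simply one easy way to get a low-pass filter.

```python
import math

def lowpass_taps(num_taps, cutoff):
    """Hamming-windowed sinc low-pass; cutoff is a fraction of the sample rate (0 to 0.5)."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]      # normalize to unity gain at DC

def fir(samples, taps):
    """Direct-form FIR filter: one MAC per tap per output sample."""
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for k, t in enumerate(taps):
            if i - k >= 0:
                acc += t * samples[i - k]
        out.append(acc)
    return out

def interpolate(samples, factor, taps):
    """Insert factor-1 zeros after each sample, then low-pass to remove spectral images."""
    stuffed = []
    for s in samples:
        stuffed.append(s * factor)       # scale up to make up for the inserted zeros
        stuffed.extend([0.0] * (factor - 1))
    return fir(stuffed, taps)

def decimate(samples, factor, taps):
    """Low-pass first to prevent aliasing, then keep every factor-th sample."""
    return fir(samples, taps)[::factor]

# 12,000 -> 96,000 samples/second is a factor of 8; cut off at half the lower rate.
taps = lowpass_taps(65, 1 / 16)
up = interpolate([1.0] * 40, 8, taps)
down = decimate([1.0] * 200, 8, taps)
```

This version wastes most of its multiplies on the stuffed zeros (or on outputs that get thrown away), so it only shows the principle; a practical version would skip that work.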
The idea is this. Instead of trying to create a 96 KHz polyphase SRRC filter in one step, what about splitting it into two steps? In step one, we'll square-root raised cosine filter the transmit data at two samples/symbol, or 12,000 samples/second. Next, we could use an interpolation filter to convert those 12,000 samples/second to the output 96,000 samples/second rate.
Wait, though, how could using two filters make life easier than using one filter? Well, if the 2 sample/symbol SRRC filter required, say, 32 MACs, and the interpolation filter required perhaps 10 MACs, then we'd need a total of just 32/(96/12) + 10 = 14 MACs/output sample component (for each of I & Q), which would be about a 4.5X efficiency improvement over the 128 MACs/output sample required for the brute-force approach! Will it work? I don't know yet, but I will let you know once I've tried it!
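The two-stage count is easy to play with in a few lines of Python. This is just a sketch: the helper name is hypothetical, and the 32 and 10 tap counts are the guesses from the paragraph above, not designed filters.

```python
def macs_per_output_sample(srrc_taps, interp_taps, in_rate_hz, out_rate_hz):
    """MACs per output sample: SRRC stage at in_rate_hz, interpolator at out_rate_hz.

    The SRRC stage only runs once per input sample, so its cost is spread
    over out_rate_hz / in_rate_hz output samples.
    """
    ratio = out_rate_hz / in_rate_hz
    return srrc_taps / ratio + interp_taps

two_stage = macs_per_output_sample(32, 10, 12_000, 96_000)
print(two_stage)                          # 14.0
print(f"{128 / (2 * two_stage):.2f}")     # 4.57
```

One possible reading of the 4.5X figure is that it counts both I and Q (2 x 14 = 28 MACs) against the 128; that reading is an assumption, not something stated above.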
Another interesting possible simplification comes to mind. I'd mentioned before that the easiest way to combine I & Q into a single carrier was at 4X the I.F. frequency (thanks to that 90 degree delay in Q), but maybe we can try another approach here, too. This one's a bit funkier, but it might be worth trying, as well. Basic [Nyquist] theory says that to transmit a 24 KHz carrier in a bandwidth-limited channel, we only need 48K samples/second, not 96K samples/second. Since we'll already be using an interpolator to generate the output samples (assuming, that is, that the previous step works!), we could in fact add a 90 degree offset to Q almost "for free" by simply telling the interpolator to add an additional 1/2 sample to its delay target for the Q samples.
What's that again? Some kinds of interpolators work by synthesizing a new output sample value in-between two input samples; and the amount of time delay is expressed as a fraction between 0.0 (which would output the first sample), and 1.0 (which would output the second sample). With this kind of interpolator, we could just add 0.5 to the Q delay to easily add a 90 degree offset to Q!
[Almost. The 90 degree Q delay refers to a 90 degree delay at the 24 KHz carrier/I.F. frequency, not at the 2 sample/symbol input frequency. So, assuming that we're interpolating from 2 samples/symbol (12 KHz) to 8 samples/symbol (48 KHz), then we'd actually tell the interpolator to delay Q by an additional 0.5/((8/2)/2) = 0.25, or, 1/4th of an input sample.]
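Here is a minimal Python sketch of the "fraction between two samples" idea. It uses plain linear interpolation purely as a stand-in; that choice is an assumption, as are the function names, and a real interpolator for this job would almost certainly be a proper filter. The point is only that the extra Q delay is a single number added to the delay target, so either the 0.5 or the 0.25 figure drops straight in.

```python
def interpolate_between(first, second, fraction):
    """Linear interpolation: fraction 0.0 returns first, 1.0 returns second."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    return first + fraction * (second - first)

def sample_at(samples, position):
    """Value at a fractional index into samples (hypothetical helper)."""
    index = int(position)           # whole-sample part of the delay target
    fraction = position - index     # fractional part, 0.0 <= fraction < 1.0
    if fraction == 0.0:
        return samples[index]
    return interpolate_between(samples[index], samples[index + 1], fraction)

i_in = [0.0, 1.0, 0.0, -1.0, 0.0]
q_in = [1.0, 0.0, -1.0, 0.0, 1.0]

q_offset = 0.25                           # extra delay for Q, in input samples
i_out = sample_at(i_in, 1.0)              # lands exactly on a sample: 1.0
q_out = sample_at(q_in, 1.0 + q_offset)   # a quarter of the way from 0.0 to -1.0: -0.25
```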
If this second approach does indeed work, then assuming the same filter lengths as before, we'd need 32 / (48/12) + 10 MACs = 18 MACs/output sample, but that's at 48 KHz instead of 96 KHz, so we'd have the equivalent of just 9 MACs/96 KHz output sample component, for each of I & Q.
[Yet another minor glitch to consider. When we combine the I & Q waveforms onto the I.F. carrier, we do so by first multiplying I by cos(), and then Q by sin(), and then adding them together. At 4X the I.F., cos() is conveniently 1, 0, -1, 0; while sin() is 0, 1, 0, -1. So at 4X I.F., the multiplies disappear, and we just alternately output I & Q, negating I & Q every other sample pair. By moving to a 2X I.F., it'll become necessary to multiply the Q samples by the sin() of a 90/(96/48) = 45 degree offset first. Thus, the total complexity for this approach becomes the equivalent of 9.5 MACs/96 KHz output component, for each of I & Q.]
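The 4X I.F. trick in that note can be sketched in Python as below. Both function names are hypothetical and this is an illustration, not the project's mixer: the first version does the explicit cos()/sin() multiplies at four samples per I.F. cycle, and the second produces the same output with no multiplies at all.

```python
import math

def mix_general(i_samples, q_samples):
    """Combine I and Q at 4 samples per I.F. cycle with explicit cos()/sin() multiplies."""
    out = []
    for n, (i, q) in enumerate(zip(i_samples, q_samples)):
        phase = 2 * math.pi * n / 4      # 90 degrees per sample
        out.append(i * math.cos(phase) + q * math.sin(phase))
    return out

def mix_4x(i_samples, q_samples):
    """Same result without multiplies: output I, Q, -I, -Q, repeating."""
    out = []
    for n, (i, q) in enumerate(zip(i_samples, q_samples)):
        step = n % 4
        if step == 0:
            out.append(i)       # cos = 1,  sin = 0
        elif step == 1:
            out.append(q)       # cos = 0,  sin = 1
        elif step == 2:
            out.append(-i)      # cos = -1, sin = 0
        else:
            out.append(-q)      # cos = 0,  sin = -1
    return out

i_data = [0.5, -0.25, 1.0, 0.75, -1.0, 0.1, 0.2, 0.3]
q_data = [0.3, 0.9, -0.6, 0.4, 0.8, -0.7, 0.05, -0.2]
slow = mix_general(i_data, q_data)
fast = mix_4x(i_data, q_data)
```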
If I cross my fingers and all of this actually works, we'd have a reduction from 200% of the CPU for filtering, down to just 30% (48000 * (18+19) * 10 / 60000000) of the CPU! As I'm writing this, it sounds too good to be true, but the only way to know if either of these approaches will work is to go and code them up. Hopefully, I'll get the chance to check this out within a couple of days. Once I do, I'll pass on the results; good, bad, or ugly!
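The 30% figure can be checked with the same kind of quick Python sketch. The constant names are made up for illustration, and the tap counts and 10 cycles/MAC are still the estimates from above, not measurements.

```python
CPU_CLOCK_HZ = 60_000_000
CYCLES_PER_MAC = 10        # rough ARM estimate, not measured
OUTPUT_RATE_HZ = 48_000    # the 2X I.F. output rate
I_MACS = 18                # 32 / (48/12) + 10
Q_MACS = 19                # same as I, plus one multiply by sin()

cycles_per_second = OUTPUT_RATE_HZ * (I_MACS + Q_MACS) * CYCLES_PER_MAC
load = cycles_per_second / CPU_CLOCK_HZ
print(f"{load:.1%} of the CPU")   # 29.6% of the CPU
```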
Cheers!
MarkF
CORRECTION: I'd initially neglected to include I & Q in the CPU overhead computation, and had guessed it would be 15% instead of 30%. This is now fixed above.