README for fft-arm version 0.01 (c)JDB [jdb@lartmaker.nl], 20040418. Disclaimer: It has been three years since I touched this code. Some parts are a tad hazy, so bear with me. *** Strongly suggested reading If you want to do anything at all with this code other than using the existing FFT sizes, I'd suggest getting "Numerical Recipes in C" from Cambridge University Press. Buy it. Seriously. I bought it when I was a starving undergrad, and still use it regularly. Any reasonably stocked technical bookstore should have it. Most libraries have it, especially college libraries. Failing that, go to http://www.library.cornell.edu/nr/bookcpdf.html and check out chapters 12 and 13 (and THEN go and buy the book). "Fast algorithms for digital signal processing" by Richard E. Blahut, published by Addison-Wesley Publishing Company. The best book I could find on FFTs, although a bit heavy on the math side. *** What is it ? This code is a very minimal set of functions for radix 4/5 complex fixed point in-place FFT routines, optimized for the DEC/Intel StrongARM. Let's take that sentence apart. Minimal: All that's supported as of now are FFTs with size 20, 64 and 80. It just so happens that I wanted to do 80-point FFTs (for a 64-carrier OFDM system with 16 guard carriers, if you must know). The other sizes were convenient stepping stones. radix 4/5: It is a common misconception that FFTs only work on array sizes which are a power of two. Fast Fourier Transforms can be applied to arrays of any size, although sizes which can be factorized into small numbers work best. Two routines are at the core of the code, one to do a radix-4 FFT and one to do a radix-5 FFT, with the restriction that the radix-5 FFT can only appear as the last stage. By chaining multiple stages of these FFT kernels together, any transform of size 4^m * 5^n, where m>=0 and 0<=n<=1, can be handled. Adding a radix-2 stage so that all transforms of size 2^m * 5^n can be offered is left as an excersize to the reader (patches gladly accepted). complex: An FFT is in principle a transform from an array of complex numbers to an array of complex numbers. If your input data is real, see "Numerical Recipes" for ways arround this. fixed point: Most existing FFT libraries assume floating point data types. This only works well on a processor with a (fast) FPU, something the ARM processors lack. Going for fixed point data adds some complexity, mostly because of overflow in the intermediate stages (google for "fixed point overflow" if this means nothing to you). The fft-arm code is designed so that 12-bit numbers plus sign can be handled without overflow. in-place: the input array is overwritten by the output of the FFT. optimized for StrongARM: The core code (which lives in butterfly.c) may look weird, but it is written in this format so that a C-compiler for ARM can directly translate each line into a single instruction, with optimal scheduling for StrongARM. It is optimized to make maximum use of the full ARM register set (to minimize loads/stores). Since it's all C code, a smart compiler can tune the scheduling for newer processors (like XScale), and you can even test the functionality of the code on another architecture (like an x86 Linux box). *** Random hints Sorry, no makefile yet (patces gladly accepted). Compile for ARM with: arm-linux-gcc -O3 -march=armv4 -mtune=strongarm1100 -fomit-frame-pointer -o ffttest-arm radix4fft.c testfft.c testmain.c -lm Test on your native platform with: gcc -O3 -o ffttest-native radix4fft.c testfft.c testmain.c -lm These FFTs are normalized, which means that the energy of the input array is the same as the energy of the output array (Parseval's theorem, if memory serves me well). This normalization is required to avoid fixed point overflow, and IMHO it's generally a Good Thing anyway. As with all FFT routines, the output array is normally not in-order, but rather bit reversed (or more severely shuffled, in the case of a radix-5 transform). The routine reorder_generic() in radix4fft.c fixes this, at the expense of some CPU time. If you don't care about the output ordering you can forego this step by removing reorder_generic (or changing the #if 1 to #if 0). Want an inverse FFT ? If you look closely at the formula for the FFT, you'll see that the only difference between an FFT and its inverse is the sign in the exponent of e. This means that *by reordering the output alone* you can turn an FFT into its inverse. Again, it's been over three years since I wrote this code, but have a look at TransTable in testmain.c for more hints. My gut feeling is that you need the "In-place IFFT table". Need more than 12 bits of resolution ? Keep in mind that the FFT is a linear process, so you may be able to split the input data into high/low words and add it together afterwards. Then again, the quantization noise introduced by fixed point twiddle factors may drown out your low words. Try it and let me know. Lastly, these routines used to be over three times as fast as the ones in the Intel Performance Primitives library. However, that too was over three years ago, I no longer have access to the IPP libs, and maybe Intel have improved their code. Maybe.