PhD defense for Xiaojun Wang
Monday, December 10, 2007

Title: Variable Precision Floating-Point Divide and Square Root for Efficient FPGA Implementation of Image and Signal Processing Algorithms

Abstract: Field Programmable Gate Arrays (FPGAs) have become an increasingly popular way to develop cost-effective custom hardware due to their flexibility and fast time to market. For applications where the data has a large dynamic range, floating-point arithmetic is desirable due to the inherent limitations of fixed-point arithmetic. Moreover, optimal reconfigurable hardware implementations may require arbitrary floating-point formats that do not necessarily conform to IEEE-specified sizes in order to make the best use of available hardware resources. Division and square root are important operators in many digital signal processing (DSP) applications, including matrix inversion, vector normalization, and Cholesky decomposition. We present variable precision floating-point divide and square root implementations on FPGAs. The operators support many different floating-point formats, including the IEEE standard formats. Both modules demonstrate a good tradeoff among area, latency, and throughput. They are also fully pipelined to aid the designer in implementing fast, complex, pipelined designs. To demonstrate the usefulness of the floating-point divide and square root operators, two applications are investigated. First, we use the floating-point divide to implement the mean updating step for K-means clustering in FPGA hardware, allowing the entire application to run on the FPGA. This frees the host to work on other tasks concurrently with K-means clustering, thus allowing the user to exploit the coarse-grained parallelism available in the system. The second application is QR decomposition using Givens rotations.
QR decomposition is a key step in many DSP applications, including sonar beamforming, channel equalization, and 3G wireless communication. Our implementation uses a truly two-dimensional systolic array architecture, so latency scales well for larger matrices. Unlike previous work that avoids divide and square root operations by using special operations such as CORDIC (COordinate Rotation by DIgital Computer) or special number systems such as the logarithmic number system (LNS), we implement the Givens rotations algorithm using our floating-point divide and square root. The QR module is fully pipelined with a throughput of over 130 MHz for the IEEE single-precision floating-point format.
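
To make concrete where divide and square root enter QR decomposition by Givens rotations, here is a minimal software sketch in Python. It is a sequential reference model under stated assumptions, not the systolic array hardware design described in the abstract: the function names `givens` and `qr_givens` are hypothetical, the arithmetic is ordinary double precision rather than a variable precision format, and the rotation is computed by the textbook formula without overflow safeguards.

```python
import math

def givens(a, b):
    """Return (c, s) so that [[c, s], [-s, c]] applied to (a, b) gives (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0              # nothing to zero out
    r = math.sqrt(a * a + b * b)     # the one square root per rotation
    return a / r, b / r              # the two divisions per rotation

def qr_givens(A):
    """Factor A (a list of rows, m >= n) into Q, R with A = Q R (illustrative only)."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Zero the entries below the diagonal in column j, bottom to top.
        for i in range(m - 1, j, -1):
            c, s = givens(R[j][j], R[i][j])
            # Apply the rotation to rows j and i of R.
            for k in range(n):
                rj, ri = R[j][k], R[i][k]
                R[j][k] = c * rj + s * ri
                R[i][k] = -s * rj + c * ri
            # Accumulate the transposed rotation into columns j and i of Q.
            for k in range(m):
                qj, qi = Q[k][j], Q[k][i]
                Q[k][j] = c * qj + s * qi
                Q[k][i] = -s * qj + c * qi
    return Q, R
```

Calling `Q, R = qr_givens([[6.0, 5.0, 0.0], [5.0, 1.0, 4.0], [0.0, 4.0, 3.0]])` yields an upper triangular `R` and an orthogonal `Q` whose product reproduces the input up to rounding. Each rotation costs one square root and two divisions, which are the operations that CORDIC-based or LNS-based designs avoid.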