Matching Items (6)

150476-Thumbnail Image.png

Multidimensional DFT IP generators for FPGA platforms

Description

Multidimensional (MD) discrete Fourier transform (DFT) is a key kernel algorithm in many signal processing applications, such as radar imaging and medical imaging. Traditionally, a two-dimensional (2-D) DFT is computed

Multidimensional (MD) discrete Fourier transform (DFT) is a key kernel algorithm in many signal processing applications, such as radar imaging and medical imaging. Traditionally, a two-dimensional (2-D) DFT is computed using Row-Column (RC) decomposition, where one-dimensional (1-D) DFTs are computed along the rows followed by 1-D DFTs along the columns. However, architectures based on RC decomposition are not efficient for large input size data which have to be stored in external memories based Synchronous Dynamic RAM (SDRAM). In this dissertation, first an efficient architecture to implement 2-D DFT for large-sized input data is proposed. This architecture achieves very high throughput by exploiting the inherent parallelism due to a novel 2-D decomposition and by utilizing the row-wise burst access pattern of the SDRAM external memory. In addition, an automatic IP generator is provided for mapping this architecture onto a reconfigurable platform of Xilinx Virtex-5 devices. For a 2048x2048 input size, the proposed architecture is 1.96 times faster than RC decomposition based implementation under the same memory constraints, and also outperforms other existing implementations. While the proposed 2-D DFT IP can achieve high performance, its output is bit-reversed. For systems where the output is required to be in natural order, use of this DFT IP would result in timing overhead. To solve this problem, a new bandwidth-efficient MD DFT IP that is transpose-free and produces outputs in natural order is proposed. It is based on a novel decomposition algorithm that takes into account the output order, FPGA resources, and the characteristics of off-chip memory access. An IP generator is designed and integrated into an in-house FPGA development platform, AlgoFLEX, for easy verification and fast integration. The corresponding 2-D and 3-D DFT architectures are ported onto the BEE3 board and their performance measured and analyzed. The results shows that the architecture can maintain the maximum memory bandwidth throughout the whole procedure while avoiding matrix transpose operations used in most other MD DFT implementations. The proposed architecture has also been ported onto the Xilinx ML605 board. When clocked at 100 MHz, 2048x2048 images with complex single-precision can be processed in less than 27 ms. Finally, transpose-free imaging flows for range-Doppler algorithm (RDA) and chirp-scaling algorithm (CSA) in SAR imaging are proposed. The corresponding implementations take advantage of the memory access patterns designed for the MD DFT IP and have superior timing performance. The RDA and CSA flows are mapped onto a unified architecture which is implemented on an FPGA platform. When clocked at 100MHz, the RDA and CSA computations with data size 4096x4096 can be completed in 323ms and 162ms, respectively. This implementation outperforms existing SAR image accelerators based on FPGA and GPU.

Contributors

Agent

Created

Date Created
  • 2012

155154-Thumbnail Image.png

FPGA accelerator architecture for Q-learning and its applications in space exploration rovers

Description

Achieving human level intelligence is a long-term goal for many Artificial Intelligence (AI) researchers. Recent developments in combining deep learning and reinforcement learning helped us to move a step forward

Achieving human level intelligence is a long-term goal for many Artificial Intelligence (AI) researchers. Recent developments in combining deep learning and reinforcement learning helped us to move a step forward in achieving this goal. Reinforcement learning using a delayed reward mechanism is an approach to machine intelligence which studies decision making with control and how a decision making agent can learn to act optimally in an environment-unaware conditions.

Q-learning is one of the model-free reinforcement directed learning strategies which uses temporal differences to estimate the performances of state-action pairs called Q values. A simple implementation of Q-learning algorithm can be done using a Q table memory to store and update the Q values. However, with an increase in state space data due to a complex environment, and with an increase in possible number of actions an agent can perform, Q table reaches its space limit and would be difficult to scale well. Q-learning with neural networks eliminates the use of Q table by approximating the Q function using neural networks.

Autonomous agents need to develop cognitive properties and become self-adaptive to be deployable in any environment. Reinforcement learning with Q-learning have been very efficient in solving such problems. However, embedded systems like space rovers and autonomous robots rarely implement such techniques due to the constraints faced like processing power, chip area, convergence rate and cost of the chip. These problems present a need for a portable, low power, area efficient hardware accelerator to accelerate the process of such learning.

This problem is targeted by implementing a hardware schematic architecture for Q-learning using Artificial Neural networks. This architecture exploits the massive parallelism provided by neural network with a dedicated fine grain parallelism provided by a Field Programmable Gate Array (FPGA) thereby processing the Q values at a high throughput. Mars exploration rovers currently use Xilinx-Space-grade FPGA devices for image processing, pyrotechnic operation control and obstacle avoidance. The hardware resource consumption for the architecture has been synthesized considering Xilinx Virtex7 FPGA as the target device.

Contributors

Agent

Created

Date Created
  • 2016

157727-Thumbnail Image.png

Highly multiplexed superconducting detectors and readout electronics for balloon-borne and ground-based far-infrared imaging and polarimetry

Description

This dissertation details the development of an open source, frequency domain multiplexed (FDM) readout for large-format arrays of superconducting lumped-element kinetic inductance detectors (LEKIDs). The system architecture is designed to

This dissertation details the development of an open source, frequency domain multiplexed (FDM) readout for large-format arrays of superconducting lumped-element kinetic inductance detectors (LEKIDs). The system architecture is designed to meet the requirements of current and next generation balloon-borne and ground-based submillimeter (sub-mm), far-infrared (FIR) and millimeter-wave (mm-wave) astronomical cameras, whose science goals will soon drive the pixel counts of sub-mm detector arrays from the kilopixel to the megapixel regime. The in-flight performance of the readout system was verified during the summer, 2018 flight of ASI's OLIMPO balloon-borne telescope, from Svalbard, Norway. This was the first flight for both LEKID detectors and their associated readout electronics. In winter 2019/2020, the system will fly on NASA's long-duration Balloon Borne Large Aperture Submillimeter Telescope (BLAST-TNG), a sub-mm polarimeter which will map the polarized thermal emission from cosmic dust at 250, 350 and 500 microns (spatial resolution of 30", 41" and 59"). It is also a core system in several upcoming ground based mm-wave instruments which will soon observe at the 50 m Large Millimeter Telescope (e.g., TolTEC, SuperSpec, MUSCAT), at Sierra Negra, Mexico.

The design and verification of the FPGA firmware, software and electronics which make up the system are described in detail. Primary system requirements are derived from the science objectives of BLAST-TNG, and discussed in the context of relevant size, weight, power and cost (SWaP-C) considerations for balloon platforms. The system was used to characterize the instrumental performance of the BLAST-TNG receiver and detector arrays in the lead-up to the 2019/2020 flight attempt from McMurdo Station, Antarctica. The results of this characterization are interpreted by applying a parametric software model of a LEKID detector to the measured data in order to estimate important system parameters, including the optical efficiency, optical passbands and sensitivity.

The role that magnetic fields (B-fields) play in shaping structures on various scales in the interstellar medium is one of the central areas of research which is carried out by sub-mm/FIR observatories. The Davis-Chandrasekhar-Fermi Method (DCFM) is applied to a BLASTPol 2012 map (smoothed to 5') of the inner ~1.25 deg2 of the Carina Nebula Complex (CNC, NGC 3372) in order to estimate the strength of the B-field in the plane-of-the-sky (B-pos). The resulting map contains estimates of B-pos along several thousand sightlines through the CNC. This data analysis pipeline will be used to process maps of the CNC and other science targets which will be produced during the upcoming BLAST-TNG flight. A target selection survey of five nearby external galaxies which will be mapped during the flight is also presented.

Contributors

Agent

Created

Date Created
  • 2019

151971-Thumbnail Image.png

Efficient Bayesian tracking of multiple sources of neural activity: algorithms and real-time FPGA implementation

Description

Electrical neural activity detection and tracking have many applications in medical research and brain computer interface technologies. In this thesis, we focus on the development of advanced signal processing algorithms

Electrical neural activity detection and tracking have many applications in medical research and brain computer interface technologies. In this thesis, we focus on the development of advanced signal processing algorithms to track neural activity and on the mapping of these algorithms onto hardware to enable real-time tracking. At the heart of these algorithms is particle filtering (PF), a sequential Monte Carlo technique used to estimate the unknown parameters of dynamic systems. First, we analyze the bottlenecks in existing PF algorithms, and we propose a new parallel PF (PPF) algorithm based on the independent Metropolis-Hastings (IMH) algorithm. We show that the proposed PPF-IMH algorithm improves the root mean-squared error (RMSE) estimation performance, and we demonstrate that a parallel implementation of the algorithm results in significant reduction in inter-processor communication. We apply our implementation on a Xilinx Virtex-5 field programmable gate array (FPGA) platform to demonstrate that, for a one-dimensional problem, the PPF-IMH architecture with four processing elements and 1,000 particles can process input samples at 170 kHz by using less than 5% FPGA resources. We also apply the proposed PPF-IMH to waveform-agile sensing to achieve real-time tracking of dynamic targets with high RMSE tracking performance. We next integrate the PPF-IMH algorithm to track the dynamic parameters in neural sensing when the number of neural dipole sources is known. We analyze the computational complexity of a PF based method and propose the use of multiple particle filtering (MPF) to reduce the complexity. We demonstrate the improved performance of MPF using numerical simulations with both synthetic and real data. We also propose an FPGA implementation of the MPF algorithm and show that the implementation supports real-time tracking. For the more realistic scenario of automatically estimating an unknown number of time-varying neural dipole sources, we propose a new approach based on the probability hypothesis density filtering (PHDF) algorithm. The PHDF is implemented using particle filtering (PF-PHDF), and it is applied in a closed-loop to first estimate the number of dipole sources and then their corresponding amplitude, location and orientation parameters. We demonstrate the improved tracking performance of the proposed PF-PHDF algorithm and map it onto a Xilinx Virtex-5 FPGA platform to show its real-time implementation potential. Finally, we propose the use of sensor scheduling and compressive sensing techniques to reduce the number of active sensors, and thus overall power consumption, of electroencephalography (EEG) systems. We propose an efficient sensor scheduling algorithm which adaptively configures EEG sensors at each measurement time interval to reduce the number of sensors needed for accurate tracking. We combine the sensor scheduling method with PF-PHDF and implement the system on an FPGA platform to achieve real-time tracking. We also investigate the sparsity of EEG signals and integrate compressive sensing with PF to estimate neural activity. Simulation results show that both sensor scheduling and compressive sensing based methods achieve comparable tracking performance with significantly reduced number of sensors.

Contributors

Agent

Created

Date Created
  • 2013

152527-Thumbnail Image.png

FPGA-based implementation of QR decomposition

Description

This thesis report aims at introducing the background of QR decomposition and its application. QR decomposition using Givens rotations is a efficient method to prevent directly matrix inverse in solving

This thesis report aims at introducing the background of QR decomposition and its application. QR decomposition using Givens rotations is a efficient method to prevent directly matrix inverse in solving least square minimization problem, which is a typical approach for weight calculation in adaptive beamforming. Furthermore, this thesis introduces Givens rotations algorithm and two general VLSI (very large scale integrated circuit) architectures namely triangular systolic array and linear systolic array for numerically QR decomposition. To fulfill the goal, a 4 input channels triangular systolic array with 16 bits fixed-point format and a 5 input channels linear systolic array are implemented on FPGA (Field programmable gate array). The final result shows that the estimated clock frequencies of 65 MHz and 135 MHz on post-place and route static timing report could be achieved using Xilinx Virtex 6 xc6vlx240t chip. Meanwhile, this report proposes a new method to test the dynamic range of QR-D. The dynamic range of the both architectures can be achieved around 110dB.

Contributors

Agent

Created

Date Created
  • 2014

154862-Thumbnail Image.png

Data path implementation for a spatially programmable architecture customized for image processing applications

Description

The last decade has witnessed a paradigm shift in computing platforms, from laptops and servers to mobile devices like smartphones and tablets. These devices host an immense variety of applications

The last decade has witnessed a paradigm shift in computing platforms, from laptops and servers to mobile devices like smartphones and tablets. These devices host an immense variety of applications many of which are computationally expensive and thus are power hungry. As most of these mobile platforms are powered by batteries, energy efficiency has become one of the most critical aspects of such devices. Thus, the energy cost of the fundamental arithmetic operations executed in these applications has to be reduced. As voltage scaling has effectively ended, the energy efficiency of integrated circuits has ceased to improve within successive generations of transistors. This resulted in widespread use of Application Specific Integrated Circuits (ASIC), which provide incredible energy efficiency. However, these are not flexible and have high non-recurring engineering (NRE) cost. Alternatively, Field Programmable Gate Arrays (FPGA) offer flexibility to implement any application, but at the cost of higher area and energy compared to ASIC.

In this work, a spatially programmable architecture customized for image processing applications is proposed. The intent is to bridge the efficiency gap between ASICs and FPGAs, by offering FPGA-like flexibility and ASIC-like energy efficiency. This architecture minimizes the energy overheads in FPGAs, which result from the use of fine-grained programming style and global interconnect. It is flexible compared to an ASIC and can accommodate multiple applications.

The main contribution of the thesis is the feasibility analysis of the data path of this architecture, customized for image processing applications. The data path is implemented at the register transfer level (RTL), and the synthesis results are obtained in 45nm technology cell library from a leading foundry. The results of image-processing applications demonstrate that this architecture is within a factor of 10x of the energy and area efficiency of ASIC implementations.

Contributors

Agent

Created

Date Created
  • 2016