Matching Items (21)
Description
This thesis introduces the background of QR decomposition and its applications. QR decomposition using Givens rotations is an efficient way to avoid direct matrix inversion when solving the least-squares minimization problem, which is a typical approach to weight calculation in adaptive beamforming. The thesis also introduces the Givens rotations algorithm and two general VLSI (very large scale integrated circuit) architectures for numerical QR decomposition, namely the triangular systolic array and the linear systolic array. To this end, a 4-input-channel triangular systolic array with a 16-bit fixed-point format and a 5-input-channel linear systolic array are implemented on an FPGA (field-programmable gate array). The post-place-and-route static timing reports show that estimated clock frequencies of 65 MHz and 135 MHz can be achieved on a Xilinx Virtex-6 xc6vlx240t chip. The report also proposes a new method to test the dynamic range of QR decomposition (QR-D). Both architectures achieve a dynamic range of around 110 dB.
Contributors: Yu, Hanguang (Author) / Bliss, Daniel W (Thesis advisor) / Ying, Lei (Committee member) / Chakrabarti, Chaitali (Committee member) / Arizona State University (Publisher)
Created: 2014
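The idea in the abstract above, solving a least-squares problem through Givens rotations instead of a matrix inverse, can be sketched in software. This is a minimal floating-point illustration, not the thesis's fixed-point systolic-array design; the function name `givens_qr_solve` is hypothetical, and the matrix is assumed to have full column rank.

```python
import math

def givens_qr_solve(A, b):
    """Least-squares solve of A x ~= b using Givens rotations.

    Illustrative software sketch only (hypothetical helper, not the
    thesis's hardware). Assumes A has full column rank.
    """
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]  # working copy, becomes upper triangular
    y = b[:]                   # right-hand side, rotated alongside R
    for j in range(n):
        for i in range(j + 1, m):
            r = math.hypot(R[j][j], R[i][j])
            if r == 0.0:
                continue  # nothing to eliminate
            c, s = R[j][j] / r, R[i][j] / r
            # Rotate rows j and i so that R[i][j] becomes zero.
            for k in range(j, n):
                rj, ri = R[j][k], R[i][k]
                R[j][k] = c * rj + s * ri
                R[i][k] = -s * rj + c * ri
            yj, yi = y[j], y[i]
            y[j] = c * yj + s * yi
            y[i] = -s * yj + c * yi
    # Back substitution on the upper-triangular n x n block.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = y[i] - sum(R[i][k] * x[k] for k in range(i + 1, n))
        x[i] = acc / R[i][i]
    return x

# Fit a line through three points: intercept and slope.
weights = givens_qr_solve([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]], [1.0, 2.0, 3.0])
```

Each rotation touches only two rows, which is what makes the method map naturally onto an array of small processing cells.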
Description
Multidimensional (MD) discrete Fourier transform (DFT) is a key kernel algorithm in many signal processing applications, such as radar imaging and medical imaging. Traditionally, a two-dimensional (2-D) DFT is computed using Row-Column (RC) decomposition, where one-dimensional (1-D) DFTs are computed along the rows followed by 1-D DFTs along the columns. However, architectures based on RC decomposition are not efficient for large input data, which have to be stored in external memories based on Synchronous Dynamic RAM (SDRAM). In this dissertation, first an efficient architecture to implement 2-D DFT for large-sized input data is proposed. This architecture achieves very high throughput by exploiting the inherent parallelism due to a novel 2-D decomposition and by utilizing the row-wise burst access pattern of the SDRAM external memory. In addition, an automatic IP generator is provided for mapping this architecture onto a reconfigurable platform of Xilinx Virtex-5 devices. For a 2048x2048 input size, the proposed architecture is 1.96 times faster than an RC-decomposition-based implementation under the same memory constraints, and also outperforms other existing implementations. While the proposed 2-D DFT IP can achieve high performance, its output is bit-reversed. For systems where the output is required to be in natural order, use of this DFT IP would result in timing overhead. To solve this problem, a new bandwidth-efficient MD DFT IP that is transpose-free and produces outputs in natural order is proposed. It is based on a novel decomposition algorithm that takes into account the output order, FPGA resources, and the characteristics of off-chip memory access. An IP generator is designed and integrated into an in-house FPGA development platform, AlgoFLEX, for easy verification and fast integration. The corresponding 2-D and 3-D DFT architectures are ported onto the BEE3 board and their performance is measured and analyzed. The results show that the architecture can maintain the maximum memory bandwidth throughout the whole procedure while avoiding the matrix transpose operations used in most other MD DFT implementations. The proposed architecture has also been ported onto the Xilinx ML605 board. When clocked at 100 MHz, 2048x2048 images with complex single-precision data can be processed in less than 27 ms. Finally, transpose-free imaging flows for the range-Doppler algorithm (RDA) and chirp-scaling algorithm (CSA) in SAR imaging are proposed. The corresponding implementations take advantage of the memory access patterns designed for the MD DFT IP and have superior timing performance. The RDA and CSA flows are mapped onto a unified architecture which is implemented on an FPGA platform. When clocked at 100 MHz, the RDA and CSA computations with data size 4096x4096 can be completed in 323 ms and 162 ms, respectively. This implementation outperforms existing SAR image accelerators based on FPGA and GPU.
Contributors: Yu, Chi-Li (Author) / Chakrabarti, Chaitali (Thesis advisor) / Papandreou-Suppappola, Antonia (Committee member) / Karam, Lina (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)
Created: 2012
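The baseline Row-Column decomposition described in the abstract above can be sketched as follows. This shows only the traditional RC method that the dissertation improves on, not its proposed decomposition; the function names are hypothetical, and a direct O(N^2) 1-D DFT is used for clarity rather than an FFT.

```python
import cmath

def dft_1d(x):
    """Direct 1-D DFT, kept simple for illustration."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def dft_2d_row_column(image):
    """2-D DFT by Row-Column decomposition (hypothetical helper name)."""
    # Pass 1: 1-D DFT along every row (contiguous access).
    rows_done = [dft_1d(row) for row in image]
    # Pass 2: 1-D DFT along every column. zip(*...) acts as a transpose,
    # which is the strided access that is costly with external SDRAM.
    cols_done = [dft_1d(list(col)) for col in zip(*rows_done)]
    # Transpose back so that result[u][v] is indexed as row, column.
    return [list(r) for r in zip(*cols_done)]

spectrum = dft_2d_row_column([[1, 2], [3, 4]])
```

The two `zip(*...)` calls make the cost of the column pass visible: in software they are cheap, but for data held in burst-oriented external memory they correspond to the transposes the abstract aims to avoid.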
Description
The rapid improvement in computation capability has made deep convolutional neural networks (CNNs) a great success in recent years, significantly improving accuracy on many computer vision tasks. During the inference phase, many applications demand low-latency processing of a single image under strict power consumption requirements. This reduces the efficiency of GPUs and other general-purpose platforms and creates opportunities for specialized acceleration hardware, e.g. FPGAs, whose digital circuits can be customized for deep learning inference. However, deploying CNNs on portable and embedded systems is still challenging due to large data volume, intensive computation, varying algorithm structures, and frequent memory accesses. This dissertation proposes a complete design methodology and framework to accelerate the inference process of various CNN algorithms on FPGA hardware with high performance, efficiency and flexibility.

As convolution contributes most operations in CNNs, the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply and accumulate (MAC) operations with four levels of loops. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g. memory access) of the CNN accelerator based on multiple design variables. An efficient dataflow and hardware architecture of CNN acceleration are proposed to minimize the data communication while maximizing the resource utilization to achieve high performance.

Although great performance and efficiency can be achieved by customizing the FPGA hardware for each CNN model, this requires significant effort and expertise and leads to long development times, which makes it difficult to keep up with the rapid development of CNN algorithms. In this work, we present an RTL-level CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, in order to enable high-level fast prototyping of CNNs from software to FPGA while keeping the benefits of low-level hardware optimization. First, a general-purpose library of RTL modules is developed to model different operations at each layer. The integration and dataflow of physical modules are predefined in the top-level system template and reconfigured during compilation for a given CNN algorithm. The runtime control of layer-by-layer sequential computation is managed by the proposed execution schedule so that even highly irregular and complex network topologies, e.g. GoogLeNet and ResNet, can be compiled. The proposed methodology is demonstrated with various CNN algorithms, e.g. NiN, VGG, GoogLeNet and ResNet, on two different standalone FPGAs, achieving state-of-the-art performance.

Based on the optimized acceleration strategy, there are still a lot of design options, e.g. the degree and dimension of computation parallelism, the size of on-chip buffers, and the external memory bandwidth, which impact the utilization of computation resources and data communication efficiency, and finally affect the performance and energy consumption of the accelerator. The large design space of the accelerator makes it impractical to explore the optimal design choice during the real implementation phase. Therefore, a performance model is proposed in this work to quantitatively estimate the accelerator performance and resource utilization. By this means, the performance bottleneck and design bound can be identified and the optimal design option can be explored early in the design phase.
Contributors: Ma, Yufei (Author) / Vrudhula, Sarma (Thesis advisor) / Seo, Jae-Sun (Thesis advisor) / Cao, Yu (Committee member) / Barnaby, Hugh (Committee member) / Arizona State University (Publisher)
Created: 2018
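The abstract above describes convolution as multiply-and-accumulate (MAC) operations nested in four levels of loops. One common way to group those levels is sketched below; the grouping, the function name `conv_layer`, and the stride-1, no-padding setup are assumptions made for illustration and are not taken from the dissertation.

```python
def conv_layer(inputs, weights):
    """Direct convolution as nested MAC loops (illustrative sketch).

    inputs[c][y][x]       : input feature maps
    weights[o][c][ky][kx] : one kernel per (output map, input map) pair
    Assumes stride 1 and no padding. Returns (outputs, mac_count).
    """
    n_if, n_iy, n_ix = len(inputs), len(inputs[0]), len(inputs[0][0])
    n_of = len(weights)
    n_ky, n_kx = len(weights[0][0]), len(weights[0][0][0])
    n_oy, n_ox = n_iy - n_ky + 1, n_ix - n_kx + 1
    outputs = [[[0.0] * n_ox for _ in range(n_oy)] for _ in range(n_of)]
    macs = 0
    for o in range(n_of):                    # level 4: output feature maps
        for y in range(n_oy):                # level 3: scan output rows
            for x in range(n_ox):            #          and output columns
                for c in range(n_if):        # level 2: input feature maps
                    for ky in range(n_ky):   # level 1: kernel window rows
                        for kx in range(n_kx):  #       and columns
                            outputs[o][y][x] += (
                                inputs[c][y + ky][x + kx] * weights[o][c][ky][kx]
                            )
                            macs += 1
    return outputs, macs

feature_map = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
kernel = [[[[1, 1], [1, 1]]]]
result, mac_count = conv_layer(feature_map, kernel)
```

How these loops are unrolled, tiled, and reordered determines data reuse and memory traffic, which is the kind of design variable the abstract refers to.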
Description
Error correcting systems have put increasing demands on system designers, both due to increasing error correcting requirements and higher throughput targets. These requirements have led to greater silicon area and power consumption and have forced system designers to make trade-offs in Error Correcting Code (ECC) functionality. Solutions to increase the efficiency of ECC systems are very important to system designers and have become a heavily researched area.

Many such systems incorporate the Bose-Chaudhuri-Hocquenghem (BCH) method of error correcting in a multi-channel configuration. BCH is a commonly used code because of its configurability, low storage overhead, and low decoding requirements when compared to other codes. Multi-channel configurations are popular with system designers because they offer a straightforward way to increase bandwidth. The ECC hardware is duplicated for each channel and the throughput increases linearly with the number of channels. The combination of these two technologies provides a configurable and high throughput ECC architecture.

This research proposes a new method to optimize a BCH error correction decoder in multi-channel configurations. In this thesis, I examine how error frequency affects the utilization of BCH hardware. Rather than implementing each decoder as a single pipeline of independent decoding stages, the channels are considered together and served by a pool of decoding stages. Modified hardware blocks for handling common cases are included, and the pool is sized to keep the resulting decrease in performance acceptably small.
Contributors: Dill, Russell (Author) / Shrivastava, Aviral (Thesis advisor) / Oh, Hyunok (Committee member) / Sen, Arunabha (Committee member) / Arizona State University (Publisher)
Created: 2015
Description
The reduced availability of 3He is a motivation for developing alternative neutron detectors. 6Li-enriched CLYC (Cs2LiYCl6), a scintillator, is a promising candidate to replace 3He. The neutron and gamma ray signals from CLYC have different shapes due to the slower decay of neutron pulses. Well-known pulse shape discrimination techniques include the charge comparison method, the pulse gradient method and the frequency gradient method. In the work presented here, we have applied a normalized cross correlation (NCC) approach to real neutron and gamma ray pulses produced by exposing CLYC scintillators to a mixed radiation environment generated by 137Cs, 22Na, 57Co and 252Cf/AmBe at different event rates. The cross correlation analysis produces distinctive results for measured neutron pulses and gamma ray pulses when they are cross correlated with reference neutron and/or gamma templates. NCC produces good separation between neutrons and gamma rays at low (< 100 kHz) to mid (< 200 kHz) event rates. However, the separation disappears at high event rates (> 200 kHz) because of pileup, noise and baseline shift. This is also confirmed by the pulse shape discrimination (PSD) plots and the figure of merit (FOM) of NCC. The FOM is close to 3, which is good, at low event rates, but it rolls off significantly as the event rate increases and reaches 1 at high event rates. Future efforts are required to reduce the noise by using a better hardware system, to remove pileup, and to detect the NCC shapes of neutrons and gamma rays using advanced techniques.
Contributors: Chandhran, Premkumar (Author) / Holbert, Keith E. (Thesis advisor) / Spanias, Andreas (Committee member) / Ogras, Umit Y. (Committee member) / Arizona State University (Publisher)
Created: 2015
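The template-matching idea in the abstract above can be illustrated with a zero-lag normalized cross correlation between a measured pulse and two reference templates. This is a rough sketch under stated assumptions: the pulses are synthetic single-exponential decays with made-up decay constants, the helper names are hypothetical, and the thesis's actual NCC procedure and data are not reproduced here.

```python
import math

def ncc(pulse, template):
    """Zero-lag normalized cross correlation of two equal-length pulses."""
    n = len(pulse)
    mean_p = sum(pulse) / n
    mean_t = sum(template) / n
    num = sum((p - mean_p) * (t - mean_t) for p, t in zip(pulse, template))
    den = math.sqrt(
        sum((p - mean_p) ** 2 for p in pulse)
        * sum((t - mean_t) ** 2 for t in template)
    )
    return num / den if den else 0.0

def synthetic_pulse(decay, length=200):
    """Hypothetical stand-in for a detector pulse: a pure exponential tail."""
    return [math.exp(-i / decay) for i in range(length)]

def classify(pulse, neutron_template, gamma_template):
    """Label a pulse by whichever template it correlates with more strongly."""
    if ncc(pulse, neutron_template) > ncc(pulse, gamma_template):
        return "neutron"
    return "gamma"

# Assumed decay constants (in samples): neutron pulses decay more slowly.
neutron_ref = synthetic_pulse(60.0)
gamma_ref = synthetic_pulse(15.0)
label = classify(synthetic_pulse(55.0), neutron_ref, gamma_ref)
```

Pileup and baseline shift distort the pulse shape itself, so a shape-based score like this one would be expected to degrade at high event rates, consistent with the trend the abstract reports.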
Description
Improving energy efficiency has always been the prime objective of custom and automated digital circuit design techniques. As a result, a multitude of methods to reduce power without sacrificing performance have been proposed. However, as the field of design automation has matured over the last few decades, there have been no new automated design techniques that can provide considerable improvements in circuit power, leakage and area. Although emerging nano-devices are expected to replace the existing MOSFET devices, they are far from being as mature as semiconductor devices, and their full potential and promises are many years away from being practical.

The research described in this dissertation consists of four main parts. First is a new circuit architecture of a differential threshold logic flipflop called PNAND. The PNAND gate is an edge-triggered multi-input sequential cell whose next state function is a threshold function of its inputs. Second, a new approach, called hybridization, that replaces flipflops and parts of their logic cones with PNAND cells is described. The resulting hybrid circuit, which consists of conventional logic cells and PNANDs, is shown to have significantly less power consumption, smaller area, less standby power and less power variation.

Third, a new architecture of a field programmable array, called field programmable threshold logic array (FPTLA), in which the standard lookup table (LUT) is replaced by a PNAND, is described. Using the well-known FPGA modeling tool VPR, the FPTLA is shown to have as much as 50% lower energy-delay product than a conventional FPGA.

Fourth, a novel clock skewing technique that makes use of the completion detection feature of the differential mode flipflops is described. This clock skewing method improves the area and power of the ASIC circuits by increasing slack on timing paths. An additional advantage of this method is the elimination of hold time violation on given short paths.

Several circuit design methodologies such as retiming and asynchronous circuit design can use the proposed threshold logic gate effectively. Therefore, the use of threshold logic flipflops in conventional design methodologies opens new avenues of research towards more energy-efficient circuits.
Contributors: Kulkarni, Niranjan (Author) / Vrudhula, Sarma (Thesis advisor) / Colbourn, Charles (Committee member) / Seo, Jae-Sun (Committee member) / Yu, Shimeng (Committee member) / Arizona State University (Publisher)
Created: 2015
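The abstract above describes a sequential cell whose next state is a threshold function of its inputs. A behavioral sketch of that idea follows: a threshold function outputs 1 when the weighted sum of its binary inputs reaches a threshold, and the cell captures that value on each clock edge. The class and function names, weights, and threshold are hypothetical; this models behavior only and says nothing about the PNAND circuit itself.

```python
def threshold_function(inputs, weights, threshold):
    """Return 1 if the weighted sum of binary inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

class ThresholdFlipFlop:
    """Behavioral model of an edge-triggered threshold cell (illustrative only)."""

    def __init__(self, weights, threshold):
        self.weights = weights
        self.threshold = threshold
        self.q = 0  # stored state, assumed to reset to 0

    def clock_edge(self, inputs):
        # On the active clock edge, the next state is the threshold
        # function of the current inputs.
        self.q = threshold_function(inputs, self.weights, self.threshold)
        return self.q

# Weights [1, 1, 1] with threshold 2 give a 3-input majority function.
cell = ThresholdFlipFlop([1, 1, 1], 2)
state_after_edge = cell.clock_edge([1, 0, 1])
```

With the same weights, a threshold of 3 yields a 3-input AND and a threshold of 1 yields a 3-input OR, which hints at how one such cell can absorb part of the logic cone in front of a flipflop.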
Description
Digital systems increasingly pervade the everyday lives of humans. The security of these systems is a concern due to the sensitive data stored in them. The physically unclonable function (PUF) implemented in hardware provides a way to protect these systems. Static random-access memories (SRAMs) are designed and used as a strong PUF to generate random numbers unique to the manufactured integrated circuit (IC).

Digital systems are important to the technological improvements in space exploration. Space exploration requires radiation-hardened microprocessors that minimize functional disruptions in the presence of radiation. The highly efficient radiation-hardened microprocessor for enabling spacecraft (HERMES) is a radiation-hardened microprocessor design with performance comparable to commercially available designs. These designs are manufactured using a foundry complementary metal-oxide-semiconductor (CMOS) 55-nm triple-well process. This thesis presents the post-silicon validation results of the HERMES and the PUF mode of the SRAM across process corners.

Chapter 1 gives an overview of the blocks implemented on test chip 25. It also describes the pre-silicon functional verification methodology used for the test chip. Chapter 2 discusses the post-silicon testing setup of test chip 25 and the validation of the setup. Chapter 3 describes the architecture and the test bench of the HERMES along with its testing results. Chapter 4 discusses the test bench and the Perl scripts used to test the SRAM along with its testing results. Chapter 5 gives a summary of the post-silicon validation results of the HERMES and the PUF mode of the SRAM.
Contributors: Medapuram, Sai Bharadwaj (Author) / Clark, Lawrence T (Thesis advisor) / Allee, David R. (Committee member) / Brunhaver, John S (Committee member) / Arizona State University (Publisher)
Created: 2017
Description
The information era has brought about many technological advancements in the past few decades, and that has led to an exponential increase in the creation of digital images and videos. Digital images constantly go through some image processing algorithm for reasons such as compression, transmission, and storage. Data is lost during this process, which leaves us with a degraded image. Hence, to ensure minimal degradation of images, quality assessment has become mandatory. Image quality assessment (IQA) has been researched and developed over the last several decades to predict a quality score in a manner that agrees with human judgments of quality. Modern IQA algorithms are quite effective in terms of prediction accuracy, but their development has not focused on improving computational performance. The existing serial implementation requires a relatively large run-time, on the order of seconds for a single frame. Hardware acceleration using field-programmable gate arrays (FPGAs) provides a reconfigurable computing fabric that can be tailored to a broad range of applications. Usually, programming FPGAs has required expertise in hardware description languages (HDLs) or high-level synthesis (HLS) tools. OpenCL is an open standard for cross-platform, parallel programming of heterogeneous systems; together with the Altera OpenCL SDK, it enables developers to use an FPGA's potential without extensive hardware knowledge. Hence, this thesis focuses on accelerating the computationally intensive part of the most apparent distortion (MAD) algorithm on an FPGA using OpenCL. The results are compared with a CPU implementation to evaluate performance and efficiency gains.
Contributors: Gunavelu Mohan, Aswin (Author) / Sohoni, Sohum (Thesis advisor) / Ren, Fengbo (Thesis advisor) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created: 2017
Description
The Balloon-borne Large Aperture Submillimeter Telescope - The Next Generation (BLAST-TNG) was designed to map the polarized emission from dust in star-forming regions of our galaxy. The dust is thought to trace magnetic fields and thus inform us of the role that they play in star formation. BLAST-TNG improves upon the previous generation of balloon-borne sub-mm polarimeters by increasing the number of detectors by over an order of magnitude. Kinetic Inductance Detectors, a novel detector technology that is naturally multiplexed, have been developed as an elegant solution to the challenge of packing cryogenic focal plane arrays with detectors. To read out the multiplexed arrays, custom firmware and control software were developed for the ROACH2 FPGA-based system. On January 6, 2020, the telescope was launched on a high-altitude balloon from Antarctica and flew for approximately 15 hours in the mid-stratosphere. During this time various calibration tasks were carried out, such as atmospheric skydips, the mapping of a sub-mm source, and the flashing of an internal calibration lamp. A mechanical failure shortened the flight so that only calibration scans were performed. In this dissertation I present my analysis of the in-flight calibration data, leading to measures of the overall telescope sensitivity and detector performance. These results show that kinetic inductance detectors are a viable candidate for future space-based sub-mm telescopes. In parallel, the fields of digital communications and radar signal processing have spawned the development of the Radio Frequency System On a Chip (RFSoC). This product by Xilinx incorporates a fabric of reconfigurable logic, ARM microprocessors, and high-speed digitizers all into one chip. The system specs provide an improvement in every category of size, weight, power, and bandwidth. This is naturally the desired platform for the next generation of far-infrared telescopes, which are pushing the limits of detector counts. I present the development of one of the first frequency-multiplexed detector readouts on the RFSoC platform. Alternative firmware designs implemented on the RFSoC are also discussed. The firmware work presented will be used in part or in full for multiple current and upcoming far-infrared telescopes.
Contributors: Sinclair, Adrian Kai (Author) / Mauskopf, Philip D (Thesis advisor) / Borthakur, Sanchayeeta (Committee member) / Groppi, Christopher (Committee member) / Jacobs, Daniel (Committee member) / Hubmayr, Johannes (Committee member) / Arizona State University (Publisher)
Created: 2021
Description
Machine learning is a powerful tool for processing and understanding the vast amounts of data produced by sensors every day. Machine learning has found use in a wide variety of fields, from making medical predictions through correlations invisible to the human eye to classifying images in computer vision applications. A wide range of machine learning algorithms have been developed to attempt to solve these problems, each with different accuracy, throughput, and energy efficiency. However, even after they are trained, these algorithms require substantial computation to make a prediction. General-purpose CPUs are not well optimized for this task, so other hardware solutions have been developed over time, including the use of a GPU, FPGA, or ASIC.

This project considers FPGA implementations of the feedforward pass of a multilayer perceptron (MLP) and a convolutional neural network (CNN). While FPGAs provide significant performance improvements, they come at a substantial financial cost. We explore the options for implementing these algorithms on a smaller budget. We successfully implement a multilayer perceptron that identifies handwritten digits from the MNIST dataset on a student-level DE10-Lite FPGA with a test accuracy of 91.99%. We also apply our trained network to external image data loaded through a webcam and a Raspberry Pi, but we observe lower test accuracy on these images. Later, we consider the requirements for implementing a more elaborate convolutional neural network on the same FPGA. The study deems the CNN implementation feasible in terms of memory requirements and basic architecture. We suggest that the CNN implementation on the same FPGA is worthy of further exploration.
Contributors: Lythgoe, Zachary James (Author) / Allee, David (Thesis director) / Hartin, Olin (Committee member) / Electrical Engineering Program (Contributor) / Barrett, The Honors College (Contributor)
Created: 2019-12
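The feedforward pass of a multilayer perceptron, which the abstract above maps onto an FPGA, reduces to repeated matrix-vector products followed by an activation. The sketch below is a software illustration only: the ReLU activation, the tiny hand-picked weights, and the function names are assumptions, not details from the project.

```python
def dense(x, weights, biases):
    """One fully connected layer: weights[j] holds the weights of neuron j."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, biases)
    ]

def relu(values):
    """Assumed hidden-layer activation; the project's choice is not stated."""
    return [max(0.0, v) for v in values]

def mlp_predict(x, layers):
    """Run the feedforward pass and return the index of the largest output."""
    for index, (weights, biases) in enumerate(layers):
        x = dense(x, weights, biases)
        if index < len(layers) - 1:  # no activation on the output layer
            x = relu(x)
    return max(range(len(x)), key=x.__getitem__)

# Hypothetical two-layer network with two inputs and two output classes.
toy_layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
]
predicted_class = mlp_predict([2.0, 1.0], toy_layers)
```

For an MNIST-sized network the same loops run over 784 inputs and ten outputs, so weight storage and the number of multipliers available are the natural constraints on a small FPGA.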