Matching Items (21)
Description
Convolutional neural networks (CNNs) achieve high accuracy on large datasets but require significant computation and storage for training and testing. While many applications demand low-latency and energy-efficient processing of images, deploying these complex algorithms on hardware is a challenging task. This dissertation first presents a compiler-based CNN training accelerator using DDR3 and HBM2 memory. An optimized RTL library is implemented to perform training-specific tasks, and an RTL compiler is developed to generate FPGA-synthesizable RTL based on user-defined constraints. High Bandwidth Memory (HBM) provides efficient off-chip communication and improves training performance. The impact of HBM2 on CNN training workloads is analyzed and comprehensively compared with DDR3. For training ResNet-20/VGG-like CNNs on the CIFAR-10 dataset, the proposed CNN training accelerator on a Stratix-10 GX FPGA (DDR3) demonstrates 479 GOPS performance, and on a Stratix-10 MX FPGA (HBM) shows a 4.5×/9.7× energy-efficiency improvement compared to a Tesla V100 GPU. Next, an FPGA online learning accelerator is presented. Adopting model segmentation techniques from Progressive Segmented Training (PST), the online learning accelerator achieves a 4.2× reduction in training latency. Furthermore, this dissertation presents an 8-bit floating-point (FP8) training processor which implements (1) highly parallel tensor cores that maintain high PE utilization, (2) hardware-efficient channel gating for dynamic output activation sparsity, (3) dynamic weight sparsity based on group Lasso, and (4) gradient skipping based on FP prediction error. The 28nm prototype chip demonstrates significant improvements in FLOPs reduction (7.3×), energy efficiency (16.4 TFLOPS/W), and overall training latency speedup (4.7×) for both supervised and self-supervised training tasks. In addition to the training accelerators, this dissertation also presents CNN inference accelerators on ASIC (FixyNN) and FPGA (FixyFPGA). FixyNN consists of a fixed-weight feature extractor that generates ubiquitous CNN features and a conventional programmable CNN accelerator. In the fixed-weight feature extractor, the network weights are hard-coded into hardware and used as fixed operands for multiplication. Experimental results demonstrate that FixyNN can achieve very high energy efficiency of up to 26.6 TOPS/W, and FixyFPGA achieves 2.34× higher GOPS on ImageNet classification. In summary, this dissertation comprehensively discusses novel architectures for high-performance and energy-efficient ASIC/FPGA CNN inference/training accelerators.
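To make the group-Lasso-based weight sparsity concrete, below is a minimal PyTorch-style sketch of how a channel-wise group-Lasso penalty can be added to a training loss so that entire output channels are driven toward zero and can be skipped in hardware. The layer shape, penalty weight, and dummy task loss are illustrative assumptions, not the FP8 processor's actual implementation.

```python
import torch

def group_lasso_penalty(weight: torch.Tensor) -> torch.Tensor:
    """Group-Lasso penalty over the output channels of a conv weight
    (shape: out_ch x in_ch x kH x kW). Channels whose L2 norm is driven
    to zero can be skipped entirely by the accelerator."""
    return weight.flatten(1).norm(dim=1).sum()

# Illustrative training step: task loss + lambda * group-Lasso term.
conv = torch.nn.Conv2d(16, 32, 3)
x = torch.randn(8, 16, 28, 28)
loss = conv(x).pow(2).mean() + 1e-3 * group_lasso_penalty(conv.weight)
loss.backward()
```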
Contributors: Kolala Venkataramaniah, Shreyas (Author) / Seo, Jae-Sun (Thesis advisor) / Cao, Yu (Committee member) / Chakrabarti, Chaitali (Committee member) / Fan, Deliang (Committee member) / Arizona State University (Publisher)
Created: 2022
Description
Computer vision is becoming an essential component of embedded system applications such as smartphones, wearables, autonomous systems and the internet-of-things (IoT). These applications are generally deployed into environments with limited energy, memory bandwidth and computational resources. This trend is driving the development of energy-efficient image processing solutions from sensing to computation. In this thesis, different alternatives are explored to implement energy-efficient computer vision systems. First, I present a field programmable gate array (FPGA) implementation of an adaptive subsampling algorithm for region-of-interest (ROI)-based object tracking. By implementing the computationally intensive sections of this algorithm on an FPGA, I aim to offload computing from energy-inefficient graphics processing units (GPUs) and/or general-purpose central processing units (CPUs). I also present a working system executing this algorithm at near real-time latency, implemented on a standalone embedded device. Secondly, I present a neural network-based pipeline to improve the performance of event-based cameras in non-ideal optical conditions. Event-based cameras, or dynamic vision sensors (DVS), are bio-inspired sensors that measure logarithmic per-pixel brightness changes in a scene. Their advantages include high dynamic range, low latency and ultra-low power consumption when compared to standard frame-based cameras. Several tasks have been proposed to take advantage of these novel sensors, but they rely on perfectly calibrated optical lenses that are in focus. In this work I propose a method to reconstruct events captured with an out-of-focus event camera so they can be fed into an intensity reconstruction task. The network is trained with a dataset generated by simulating defocus blur in sequences from object tracking datasets such as LaSOT and OTB100. I also test the generalization performance of this network on scenes captured with a DAVIS event-based sensor equipped with an out-of-focus lens.
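As a concrete illustration of the ROI-driven idea, here is a toy NumPy sketch that keeps full resolution inside a tracked region while coarsely subsampling the background. It is a stand-in under assumed parameters (background stride, ROI format), not the thesis's actual adaptive subsampling algorithm or its FPGA datapath.

```python
import numpy as np

def adaptive_subsample(frame: np.ndarray, roi: tuple[int, int, int, int],
                       bg_stride: int = 4) -> np.ndarray:
    """Toy ROI-adaptive subsampling: keep full resolution inside the
    tracked ROI (x, y, w, h) and coarsely subsample the background,
    reducing the pixel bandwidth the rest of the pipeline must handle."""
    x, y, w, h = roi
    out = frame[::bg_stride, ::bg_stride].repeat(bg_stride, 0).repeat(bg_stride, 1)
    out = out[:frame.shape[0], :frame.shape[1]]          # trim overshoot
    out[y:y + h, x:x + w] = frame[y:y + h, x:x + w]      # full-res ROI
    return out

frame = np.random.randint(0, 256, (240, 320), dtype=np.uint8)
sparse = adaptive_subsample(frame, roi=(100, 60, 64, 64))
```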
Contributors: Torres Muro, Victor Isaac (Author) / Jayasuriya, Suren (Thesis advisor) / Spanias, Andreas (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created: 2022
Description
The Discrete Fourier Transform (DFT) is a mathematical operation utilized in various signal processing applications, including astronomy and digital communications (satellite, cellphone, radar, etc.), to separate signals at different frequencies. A DFT performed on a signal by itself suffers from inter-channel leakage. For an ultrasensitive application like radio astronomy, it is important to minimize frequency sidelobes. To achieve this, the Polyphase Filterbank (PFB) technique is used, which shapes the bin response of the DFT toward a rectangular function and suppresses out-of-band crosstalk. This helps achieve the Signal-to-Noise Ratio (SNR) required for astronomy measurements. In practice, a 2^N-point DFT can be efficiently implemented on Digital Signal Processing (DSP) hardware via the popular Fast Fourier Transform (FFT) algorithm. Hence, 2^N-tap filters are commonly used in the filterbank stage before the FFT. At present, Field Programmable Gate Arrays (FPGAs) and Application Specific Integrated Circuits (ASICs) from different vendors (e.g. Xilinx, Altera, Microsemi, etc.) are available which offer high performance. The Xilinx Radio-Frequency System-on-Chip (RFSoC) is the latest such platform, offering radio-frequency (RF) signal capture/generation capability on the same chip. This thesis describes the characterization of the Analog-to-Digital Converter (ADC) available on the Xilinx ZCU111 RFSoC platform, the detailed design steps of a Critically-Sampled PFB, and the testing and debugging of a Weighted OverLap and Add (WOLA) PFB, to examine the feasibility of implementation on custom ASICs for future space missions. The design and testing of an analog Printed Circuit Board (PCB) circuit for biasing cryogenic detectors and readout components are also presented.
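For intuition, a minimal NumPy sketch of a critically-sampled PFB front end follows: a windowed prototype filter is applied over several blocks of input, the blocks are summed, and an FFT yields the channelized output. The prototype filter, tap count, and channel count are illustrative assumptions, not the ZCU111 design parameters.

```python
import numpy as np

def critically_sampled_pfb(x: np.ndarray, n_chan: int, n_taps: int = 8):
    """Minimal critically-sampled polyphase filterbank: a sinc-based
    prototype filter spans n_taps blocks of n_chan samples; the weighted
    blocks are summed, then an FFT produces the channel outputs. This
    suppresses the sidelobes a plain FFT would leak between channels."""
    win = (np.sinc(np.linspace(-n_taps / 2, n_taps / 2, n_chan * n_taps))
           * np.hamming(n_chan * n_taps))
    n_frames = (len(x) - n_chan * n_taps) // n_chan + 1
    out = np.empty((n_frames, n_chan), dtype=complex)
    for i in range(n_frames):
        seg = x[i * n_chan:i * n_chan + n_chan * n_taps] * win
        out[i] = np.fft.fft(seg.reshape(n_taps, n_chan).sum(axis=0))
    return out

tone = np.cos(2 * np.pi * 37.2 / 512 * np.arange(8192))
spectra = critically_sampled_pfb(tone, n_chan=512)
```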
Contributors: Biswas, Raj (Author) / Mauskopf, Philip (Thesis advisor) / Bliss, Daniel (Thesis advisor) / Hooks, Tracee J (Committee member) / Groppi, Christopher (Committee member) / Zeinolabedinzadeh, Saeed (Committee member) / Arizona State University (Publisher)
Created: 2022
Description
Quantum entanglement, a phenomenon first introduced in the realm of quantum mechanics by the famous Einstein-Podolsky-Rosen (EPR) paradox, has intrigued physicists and philosophers alike for nearly a century. Its implications for the nature of reality, particularly its apparent violation of local realism, have sparked intense debate and spurred numerous experimental investigations. This thesis presents a comprehensive examination of quantum entanglement with a focus on probing its non-local aspects. Central to this thesis is the development of a detailed project document outlining a proposed experimental approach to investigate the non-local nature of quantum entanglement. Drawing upon recent advancements in quantum technology, including the manipulation and control of entangled particles, the proposed experiment aims to rigorously test the predictions of quantum mechanics against the framework of local realism. The experimental setup involves the generation of entangled particle pairs, such as photons or ions, followed by the precise manipulation of their quantum states. By implementing a series of carefully designed measurements on spatially separated entangled particles, the experiment seeks to discern correlations that defy explanation within a local realistic framework.
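As a worked example of the kind of correlation the proposed experiment targets, the short script below evaluates the standard CHSH combination for a spin singlet using the quantum-mechanical prediction E(a, b) = -cos(a - b); at the optimal analyzer angles it reaches 2√2, exceeding the local-realism bound of 2. This is textbook material for illustration, not the thesis's experimental protocol.

```python
import numpy as np

# CHSH check for a spin singlet: quantum mechanics predicts the
# correlation E(a, b) = -cos(a - b) for analyzer angles a and b.
# Local realism bounds |S| <= 2; the angles below give |S| = 2*sqrt(2).
E = lambda a, b: -np.cos(a - b)
a1, a2, b1, b2 = 0.0, np.pi / 2, np.pi / 4, 3 * np.pi / 4
S = E(a1, b1) - E(a1, b2) + E(a2, b1) + E(a2, b2)
print(abs(S))  # ~2.828 > 2: the CHSH inequality is violated
```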
Contributors: Wasserbeck, Noah (Author) / Lukens, Joseph (Thesis director) / Arenz, Christian (Committee member) / Barrett, The Honors College (Contributor) / Electrical Engineering Program (Contributor) / School of Mathematical and Statistical Sciences (Contributor)
Created: 2024-05
Description
Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance in numerous applications such as computer vision, natural language processing, and robotics. The advancement of High-Performance Computing systems equipped with dedicated hardware accelerators has also paved the way towards the success of compute-intensive CNNs. Graphics Processing Units (GPUs), with massive processing capability, have been of general interest for the acceleration of CNNs. Recently, Field Programmable Gate Arrays (FPGAs) have shown promise for CNN acceleration, since they offer high performance while also being reconfigurable to support the evolution of CNNs. This work focuses on a design methodology to accelerate CNNs on FPGAs with low inference latency and high throughput, which are crucial for scenarios like self-driving cars and video surveillance. It also includes optimizations which reduce resource utilization by a large margin with a small degradation in performance, making the design suitable for low-end FPGA devices as well.

FPGA accelerators often suffer from limited main memory bandwidth. Also, highly parallel designs with large resource utilization often end up achieving low operating frequency due to poor routing. This work employs data fetch and buffer mechanisms, designed specifically for the memory access pattern of CNNs, that overlap computation with memory access. It proposes a novel arrangement of the systolic processing element array to achieve high frequency and consume fewer resources than existing works. Support has also been extended to more complicated CNNs for video processing. On an Intel Arria 10 GX1150, the design operates at a frequency as high as 258 MHz and performs a single inference of VGG-16 and C3D in 23.5 ms and 45.6 ms respectively. For VGG-16 and C3D the design offers a throughput of 66.1 and 23.98 inferences/s respectively. This design can outperform other FPGA 2D CNN accelerators by up to 9.7× and 3D CNN accelerators by up to 2.7×.
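The overlap of computation with memory access can be sketched in software as classic ping-pong (double) buffering: while one tile is being processed, the next tile's fetch is already in flight. The Python below is an illustrative model of that scheduling idea, with stand-in fetch/compute functions rather than the accelerator's actual RTL.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def fetch(i):                      # stand-in for a DMA burst from DRAM
    return np.full((64, 64), i, dtype=np.float32)

def compute(tile):                 # stand-in for the systolic array
    return float(tile.sum())

# Ping-pong schedule: prefetch tile i+1 while tile i is computed,
# hiding memory latency behind useful work.
results = []
with ThreadPoolExecutor(max_workers=1) as dma:
    pending = dma.submit(fetch, 0)
    for i in range(1, 8):
        tile = pending.result()            # wait for buffer A
        pending = dma.submit(fetch, i)     # start filling buffer B
        results.append(compute(tile))      # compute overlaps the fetch
    results.append(compute(pending.result()))
```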
Contributors: Ravi, Pravin Kumar (Author) / Zhao, Ming (Thesis advisor) / Li, Baoxin (Committee member) / Ren, Fengbo (Committee member) / Arizona State University (Publisher)
Created: 2020
Description
QR decomposition (QRD) of a matrix is one of the most common linear algebra operations, used for the decomposition of a square or non-square matrix. It has a wide range of applications, especially in Multiple Input-Multiple Output (MIMO) communication systems. Unfortunately, it has high computation complexity: for a matrix of size n×n, QRD has O(n³) complexity, and back substitution, which is used to solve a system of linear equations, has O(n²) complexity. Thus, as the matrix size increases, the hardware resource requirement for QRD and back substitution increases significantly. This thesis presents the design and implementation of a flexible QRD and back substitution accelerator using a folded architecture. It can support matrix sizes of 4×4, 8×8, 12×12, 16×16, and 20×20 with low hardware resource requirements.

The proposed architecture is based on the systolic array implementation of the Givens algorithm for QRD. It is built with three different types of computation blocks which are connected in a 2-D array structure. These blocks are controlled by a scheduler which facilitates reusability of the blocks to perform computation for any input matrix size which is a multiple of 4. These blocks are designed using two basic programming elements which support both the forward and backward paths to compute matrix R in QRD and column-matrix X in back substitution computation.

The proposed architecture has been mapped to a Xilinx Zynq UltraScale+ FPGA (Field Programmable Gate Array), the ZCU102. All inputs are complex, with a precision of 40 bits (38 fractional bits and 1 sign bit). The architecture can be clocked at 50 MHz. The synthesis results of the folded architecture for different matrix sizes are presented. The results show that the folded architecture can support QRD and back substitution for inputs of large sizes which otherwise cannot fit on an FPGA when implemented using a flat architecture. The memory sizes required for different matrix sizes are also presented.
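A short reference model of the two operations the accelerator folds onto shared hardware may help: Givens-rotation QRD, in which each rotation zeroes one subdiagonal entry of R, followed by O(n²) back substitution. This floating-point NumPy sketch illustrates the algorithms only; the real-valued data and matrix size are assumptions, not the 40-bit complex fixed-point datapath.

```python
import numpy as np

def givens_qr(A: np.ndarray):
    """QR decomposition by Givens rotations, the algorithm the systolic
    array implements: each 2x2 rotation zeroes one subdiagonal entry."""
    m, n = A.shape
    Q, R = np.eye(m), A.astype(float).copy()
    for j in range(n):
        for i in range(m - 1, j, -1):
            r = np.hypot(R[i - 1, j], R[i, j])
            if r == 0.0:
                continue
            c, s = R[i - 1, j] / r, R[i, j] / r
            G = np.array([[c, s], [-s, c]])
            R[[i - 1, i], :] = G @ R[[i - 1, i], :]   # rotate rows of R
            Q[:, [i - 1, i]] = Q[:, [i - 1, i]] @ G.T  # accumulate Q
    return Q, R

def back_substitution(R, y):
    """Solve Rx = y for upper-triangular R: the O(n^2) step in the text."""
    x = np.zeros_like(y, dtype=float)
    for i in range(len(y) - 1, -1, -1):
        x[i] = (y[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

A = np.random.randn(4, 4)
Q, R = givens_qr(A)
x = back_substitution(R, Q.T @ np.random.randn(4))
```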
Contributors: Kanagala, Srimayee (Author) / Chakrabarti, Chaitali (Thesis advisor) / Bliss, Daniel (Committee member) / Cao, Yu (Kevin) (Committee member) / Arizona State University (Publisher)
Created: 2020
Description
The rapid growth of Internet-of-things (IoT) and artificial intelligence applications has called forth a new computing paradigm: edge computing. Edge computing applications, such as video surveillance, autonomous driving, and augmented reality, are highly computationally intensive and require real-time processing. Current edge systems are typically based on commodity general-purpose hardware such as Central Processing Units (CPUs) and Graphics Processing Units (GPUs), which are mainly designed for large, non-time-sensitive jobs in the cloud and do not match the needs of edge workloads. These systems are also usually power-hungry and are not suitable for resource-constrained edge deployments. Such application-hardware mismatch calls forth a new computing backbone to support high-bandwidth, low-latency, and energy-efficient requirements. The new system should also be able to support a variety of edge applications with different characteristics. This thesis addresses the above challenges by studying the use of Field Programmable Gate Array (FPGA)-based computing systems for accelerating edge workloads, from three critical angles. First, it investigates the feasibility of FPGAs for edge computing, in comparison to conventional CPUs and GPUs. Second, it studies the acceleration of common algorithmic characteristics, identified as loop patterns, using FPGAs, and develops a benchmark tool for analyzing the performance of these patterns on different accelerators. Third, it designs a new edge computing platform using multiple clustered FPGAs to provide high-bandwidth and low-latency acceleration of convolutional neural networks (CNNs) widely used in edge applications. Finally, it studies the acceleration of an emerging class of neural networks, randomly-wired neural networks, on the multi-FPGA platform. The experimental results from this work show that the new generation of workloads requires rethinking the current edge-computing architecture. First, through the acceleration of common loops, it demonstrates that FPGAs can outperform GPUs on specific loop types by up to 14×. Second, it shows the linear scalability of multi-FPGA platforms in accelerating neural networks. Third, it demonstrates the superiority of the new scheduler in optimally placing randomly-wired neural networks on multi-FPGA platforms, with 81.1× better throughput than the available scheduling mechanisms.
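To illustrate the irregular graphs the multi-FPGA scheduler must handle, the sketch below generates a small randomly-wired DAG (nodes as conv stages, edges as feature-map dependencies) and computes ASAP levels, i.e. groups of nodes with no mutual dependencies that could run concurrently on different FPGAs. The graph model and simple leveling are illustrative assumptions, not the thesis's scheduler.

```python
import random

def random_dag(n: int = 10, p: float = 0.35, seed: int = 1):
    """Erdos-Renyi-style DAG like those in randomly-wired NNs: an edge
    i -> j (with i < j) carries a feature map between conv nodes."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]

def asap_levels(n, edges):
    """ASAP levels: nodes sharing a level have no dependency between
    them, so a scheduler may place them on different FPGAs in parallel."""
    level = [0] * n
    for i, j in edges:      # edge list is already in topological order
        level[j] = max(level[j], level[i] + 1)
    return level

edges = random_dag()
print(asap_levels(10, edges))
```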
Contributors: Biookaghazadeh, Saman (Author) / Zhao, Ming (Thesis advisor) / Ren, Fengbo (Thesis advisor) / Li, Baoxin (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created: 2021
Description
Machine learning is a powerful tool for processing and understanding the vast amounts of data produced by sensors every day. Machine learning has found use in a wide variety of fields, from making medical predictions through correlations invisible to the human eye to classifying images in computer vision applications. A wide range of machine learning algorithms has been developed to attempt to solve these problems, each with different tradeoffs in accuracy, throughput, and energy efficiency. However, even after they are trained, these algorithms require substantial computation to make a prediction. General-purpose CPUs are not well-optimized for this task, so other hardware solutions have been developed over time, including the use of GPUs, FPGAs, or ASICs.

This project considers FPGA implementations of MLP and CNN feedforward passes. While FPGAs provide significant performance improvements, they come at a substantial financial cost. We explore options for implementing these algorithms on a smaller budget. We successfully implement a multilayer perceptron that identifies handwritten digits from the MNIST dataset on a student-level DE10-Lite FPGA with a test accuracy of 91.99%. We also apply our trained network to external image data loaded through a webcam and a Raspberry Pi, but we observe lower test accuracy on these images. Later, we consider the requirements necessary to implement a more elaborate convolutional neural network on the same FPGA. The study deems the CNN implementation feasible by the criteria of memory requirements and basic architecture, and we suggest it is worthy of further exploration.
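As a sketch of what such an FPGA feedforward datapath computes, the NumPy snippet below runs an MLP forward pass in fixed-point arithmetic, with weights and activations scaled to integers and products rescaled by a right shift. The layer sizes, Q-format, and random weights are illustrative assumptions, not the trained DE10-Lite design.

```python
import numpy as np

def fixed_point_mlp(x, w1, b1, w2, b2, frac_bits: int = 8):
    """MLP feedforward in the style of an FPGA fixed-point datapath:
    quantize to integers, multiply-accumulate, then rescale by a shift."""
    scale = 1 << frac_bits
    q = lambda a: np.round(a * scale).astype(np.int32)
    h = np.maximum(((q(x) @ q(w1)) >> frac_bits) + q(b1), 0)  # ReLU layer
    logits = ((h @ q(w2)) >> frac_bits) + q(b2)
    return int(np.argmax(logits))

rng = np.random.default_rng(0)
x = rng.random(784)                        # flattened 28x28 MNIST image
w1, b1 = rng.standard_normal((784, 32)) * 0.05, np.zeros(32)
w2, b2 = rng.standard_normal((32, 10)) * 0.05, np.zeros(10)
print(fixed_point_mlp(x, w1, b1, w2, b2))  # predicted digit class
```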
Contributors: Lythgoe, Zachary James (Author) / Allee, David (Thesis director) / Hartin, Olin (Committee member) / Electrical Engineering Program (Contributor) / Barrett, The Honors College (Contributor)
Created: 2019-12
Description
As the Internet of Things continues to expand, not only must our computing power grow alongside it, but our very approach must evolve. While the recent trend has been to centralize our computing resources in the cloud, it now looks beneficial to push more computing power towards the "edge" with so-called edge computing, reducing the immense strain on cloud servers and the latency experienced by IoT devices. A new computing paradigm also brings new opportunities for innovation, and one such innovation could be the use of FPGAs as edge servers. In this research project, I learn the design flow for developing OpenCL kernels and custom FPGA BSPs. Using these tools, I investigate the viability of using FPGAs as standalone edge computing devices. I conclude that, although the technology is a great fit, the current necessity for dynamically reprogrammable FPGAs to be closely coupled with a host CPU holds them back from this purpose. I propose a modification to the architecture of the Intel Arria 10 GX that would allow it to be decoupled from its host CPU, allowing it to truly serve as a viable edge computing solution.
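For readers unfamiliar with the OpenCL development flow mentioned above, here is a minimal host-plus-kernel sketch (using the pyopencl bindings) showing the context/queue/program/buffer/launch sequence. Note that on Intel FPGAs the program is normally loaded from an offline-compiled binary rather than built from source at runtime; this generic vector-add is illustrative only and is not the project's BSP code.

```python
import numpy as np
import pyopencl as cl

# Kernel source: one work-item per element of the output vector.
src = """
__kernel void vadd(__global const float *a, __global const float *b,
                   __global float *c) {
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}
"""
ctx = cl.create_some_context()            # pick an available device
queue = cl.CommandQueue(ctx)
prog = cl.Program(ctx, src).build()       # FPGA flow would load a binary

a = np.random.rand(1024).astype(np.float32)
b = np.random.rand(1024).astype(np.float32)
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

prog.vadd(queue, a.shape, None, a_buf, b_buf, c_buf)  # launch kernel
out = np.empty_like(a)
cl.enqueue_copy(queue, out, c_buf)        # read the result back
```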
Contributors: Barth, Brandon Albert (Author) / Ren, Fengbo (Thesis director) / Vrudhula, Sarma (Committee member) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)
Created: 2019-05
Description
Edge computing is an emerging field that improves upon cloud computing by moving the service from a centralized server to several decentralized servers that are closer to the end user, decreasing latency, bandwidth, and cost requirements. Field programmable gate array (FPGA) devices are highly reconfigurable and excel at highly parallelized tasks, making them popular in many applications including digital signal processing and cryptography, and a great candidate for edge computation. The purpose of this project was to explore existing board support packages (BSPs) for the Arria 10 GX FPGA and propose a BSP design with multiple partial reconfiguration regions to better support the use of FPGAs in edge computing. In this project, the general OpenCL development flow was studied, the OpenCL workflow for Altera/Intel FPGAs was researched, the reference OpenCL BSP was explored to understand the connections between its modules, and a customized BSP with two partial reconfiguration regions was proposed. The existing BSP was explored using the Intel Quartus Prime software suite, and block diagrams for the existing and proposed designs were created using Microsoft Visio.
Contributors: Lam, Evan (Author) / Ren, Fengbo (Thesis director) / Vrudhula, Sarma (Committee member) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)
Created: 2019-05