Matching Items (5)
Description
The rapid improvement in computation capability has made deep convolutional neural networks (CNNs) a great success in recent years on many computer vision tasks with significantly improved accuracy. During the inference phase, many applications demand low-latency processing of a single image under strict power consumption requirements, which reduces the efficiency of GPUs and other general-purpose platforms and creates opportunities for dedicated acceleration hardware, e.g. FPGAs, where the digital circuit is customized for deep learning inference. However, deploying CNNs on portable and embedded systems remains challenging due to the large data volume, intensive computation, varying algorithm structures, and frequent memory accesses. This dissertation proposes a complete design methodology and framework to accelerate the inference process of various CNN algorithms on FPGA hardware with high performance, efficiency, and flexibility.

As convolution contributes most of the operations in CNNs, the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply-and-accumulate (MAC) operations nested within four levels of loops. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit data reuse or manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g. memory access) of the CNN accelerator over multiple design variables. An efficient dataflow and hardware architecture for CNN acceleration are proposed to minimize data communication while maximizing resource utilization, achieving high performance.
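The four loop levels referenced above can be sketched as a naive software model. The loop ordering, variable names, and dimensions here are illustrative assumptions for exposition, not the dissertation's actual dataflow:

```python
import numpy as np

def conv_loops(ifmap, weights):
    """Convolution as MAC operations in four levels of loops (stride 1, no padding).

    Loop-1: kernel window, Loop-2: input channels,
    Loop-3: spatial positions, Loop-4: output channels.
    ifmap:   (C_in, H, W)         input feature maps
    weights: (C_out, C_in, K, K)  convolution kernels
    """
    c_in, h, w = ifmap.shape
    c_out, _, k, _ = weights.shape
    out = np.zeros((c_out, h - k + 1, w - k + 1))
    for m in range(c_out):                       # Loop-4: output channels
        for y in range(h - k + 1):               # Loop-3: output rows
            for x in range(w - k + 1):           # Loop-3: output columns
                for c in range(c_in):            # Loop-2: input channels
                    for ky in range(k):          # Loop-1: kernel window
                        for kx in range(k):
                            out[m, y, x] += ifmap[c, y + ky, x + kx] * weights[m, c, ky, kx]
    return out
```

Interchanging, unrolling, or tiling these loops changes which data are reused on-chip, which is precisely the design space a quantitative loop analysis explores.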

Although great performance and efficiency can be achieved by customizing the FPGA hardware for each CNN model, significant effort and expertise are required, leading to long development times that make it difficult to keep pace with the rapid evolution of CNN algorithms. In this work, we present an RTL-level CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, enabling fast high-level prototyping of CNNs from software to FPGA while retaining the benefits of low-level hardware optimization. First, a general-purpose library of RTL modules is developed to model the different operations at each layer. The integration and dataflow of the physical modules are predefined in a top-level system template and reconfigured during compilation for a given CNN algorithm. The runtime control of layer-by-layer sequential computation is managed by the proposed execution schedule, so that even highly irregular and complex network topologies, e.g. GoogLeNet and ResNet, can be compiled. The proposed methodology is demonstrated with various CNN algorithms, e.g. NiN, VGG, GoogLeNet, and ResNet, on two different standalone FPGAs, achieving state-of-the-art performance.

Even with the optimized acceleration strategy, many design options remain, e.g. the degree and dimension of computation parallelism, the size of on-chip buffers, and the external memory bandwidth, which affect the utilization of computation resources and the efficiency of data communication, and ultimately the performance and energy consumption of the accelerator. The large design space of the accelerator makes it impractical to explore the optimal design choice during the actual implementation phase. Therefore, a performance model is proposed in this work to quantitatively estimate accelerator performance and resource utilization. By this means, the performance bottleneck and design bounds can be identified, and the optimal design option can be explored early in the design phase.
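A first-order, roofline-style version of such a performance model can be sketched as follows. The parameter names and the max() latency bound are illustrative assumptions, not the model proposed in the dissertation:

```python
def estimate_layer(macs, bytes_moved, pe_count, bw_bytes_per_cycle):
    """Estimate cycles for one CNN layer: latency is bounded by whichever
    of computation (limited by the number of parallel MAC units) or
    external memory traffic (limited by bandwidth) takes longer."""
    compute_cycles = macs / pe_count
    memory_cycles = bytes_moved / bw_bytes_per_cycle
    return {
        "cycles": max(compute_cycles, memory_cycles),
        "bound": "compute" if compute_cycles >= memory_cycles else "memory",
    }
```

Sweeping parameters such as `pe_count`, buffer sizes, and bandwidth through a model like this before implementation reveals whether a candidate design point is compute-bound or memory-bound.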
Contributors: Ma, Yufei (Author) / Vrudhula, Sarma (Thesis advisor) / Seo, Jae-Sun (Thesis advisor) / Cao, Yu (Committee member) / Barnaby, Hugh (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
Proton beam therapy has been proven effective for cancer treatment. Protons deposit their energy completely inside the patient, rendering this treatment superior to other types of radiotherapy based on photons or electrons. This same characteristic makes quality assurance critical, driving the need for detectors capable of direct beam positioning and fluence measurement. This work showcases a flexible and scalable data acquisition system for a multi-channel, segmented-readout parallel-plate ionization chamber instrument for proton beam fluence and position detection. The data acquisition system was designed with readily available, modern, off-the-shelf hardware components, including an FPGA with an embedded CPU in the same package. The undemanding detector signal bandwidth allows the system to avoid ASICs and their associated costs and lead times. The data acquisition system is demonstrated experimentally for a 96-readout-channel detector, resolving sub-millisecond beam characteristics and beam reconstruction. The system demonstrated scalability up to 1064 readout channels, the limiting factors being FPGA I/O availability as well as amplification and sampling power consumption.
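For a segmented-readout chamber like the one described, beam position along one axis is commonly reconstructed as a charge-weighted centroid over the strips. This sketch is a generic illustration of that technique; the strip pitch and channel layout are assumptions, not the instrument's actual geometry:

```python
def beam_centroid(charges, pitch_mm=1.0):
    """Center-of-gravity beam position estimate from per-strip integrated
    charges (one readout channel per strip along one axis)."""
    total = sum(charges)
    if total == 0:
        return None  # no charge collected on this axis
    return pitch_mm * sum(i * q for i, q in enumerate(charges)) / total
```

Applied per axis and per sampling interval, such a centroid yields the time-resolved beam position that the abstract's beam-reconstruction results refer to.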
Contributors: Acuna Briceno, Rafael Andres (Author) / Barnaby, Hugh (Thesis advisor) / Brunhaver, John (Committee member) / Blyth, David (Committee member) / Arizona State University (Publisher)
Created: 2021
Description
Quantum entanglement, a phenomenon first introduced in the realm of quantum mechanics by the famous Einstein-Podolsky-Rosen (EPR) paradox, has intrigued physicists and philosophers alike for nearly a century. Its implications for the nature of reality, particularly its apparent violation of local realism, have sparked intense debate and spurred numerous experimental investigations. This thesis presents a comprehensive examination of quantum entanglement with a focus on probing its non-local aspects. Central to this thesis is the development of a detailed project document outlining a proposed experimental approach to investigate the non-local nature of quantum entanglement. Drawing upon recent advancements in quantum technology, including the manipulation and control of entangled particles, the proposed experiment aims to rigorously test the predictions of quantum mechanics against the framework of local realism. The experimental setup involves the generation of entangled particle pairs, such as photons or ions, followed by the precise manipulation of their quantum states. By implementing a series of carefully designed measurements on spatially separated entangled particles, the experiment seeks to discern correlations that defy explanation within a local realistic framework.
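The canonical statistic for such a test is the CHSH combination, which any local realistic theory bounds by |S| ≤ 2 while quantum mechanics reaches 2√2. The photon-polarization correlation and the angles below are standard textbook choices, not necessarily the settings of the proposed experiment:

```python
import math

def chsh(E, a, a2, b, b2):
    """CHSH combination S = E(a,b) - E(a,b') + E(a',b) + E(a',b')
    for a correlation function E and two measurement settings per side."""
    return E(a, b) - E(a, b2) + E(a2, b) + E(a2, b2)

def qm_correlation(a, b):
    """Quantum-mechanical correlation for polarization-entangled photon pairs."""
    return math.cos(2 * (a - b))

deg = math.pi / 180
S = chsh(qm_correlation, 0 * deg, 45 * deg, 22.5 * deg, 67.5 * deg)
# Local realism requires |S| <= 2; these settings give S = 2*sqrt(2).
```

An experiment of the kind proposed estimates each E(a, b) from coincidence counts at spatially separated detectors; a measured S above 2 is the signature of non-locality.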
Contributors: Wasserbeck, Noah (Author) / Lukens, Joseph (Thesis director) / Arenz, Christian (Committee member) / Barrett, The Honors College (Contributor) / Electrical Engineering Program (Contributor) / School of Mathematical and Statistical Sciences (Contributor)
Created: 2024-05
Description
Machine learning is a powerful tool for processing and understanding the vast amounts of data produced by sensors every day. It has found use in a wide variety of fields, from making medical predictions through correlations invisible to the human eye to classifying images in computer vision applications. A wide range of machine learning algorithms have been developed to solve these problems, each with different trade-offs in accuracy, throughput, and energy efficiency. However, even after they are trained, these algorithms require substantial computation to make a prediction. General-purpose CPUs are not well optimized for this task, so other hardware solutions have developed over time, including GPUs, FPGAs, and ASICs.

This project considers FPGA implementations of MLP and CNN feedforward inference. While FPGAs provide significant performance improvements, they come at a substantial financial cost, so we explore options for implementing these algorithms on a smaller budget. We successfully implement a multilayer perceptron that identifies handwritten digits from the MNIST dataset on a student-level DE10-Lite FPGA with a test accuracy of 91.99%. We also apply our trained network to external image data loaded through a webcam and a Raspberry Pi, but we observe lower accuracy on these images. Finally, we consider the requirements necessary to implement a more elaborate convolutional neural network on the same FPGA. The study deems the CNN implementation feasible with respect to memory requirements and basic architecture, and we suggest that the CNN implementation on the same FPGA is worthy of further exploration.
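The feedforward computation such an FPGA implementation must realize reduces to repeated matrix-vector MACs followed by activations. This software sketch is illustrative only; the layer sizes and the ReLU activation are assumptions, not the thesis's exact network:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlp_predict(x, layers):
    """Feedforward pass of a multilayer perceptron.

    x:      flattened input, e.g. a 784-element MNIST image
    layers: list of (W, b) pairs holding pre-trained weights and biases
    """
    h = x
    for W, b in layers[:-1]:
        h = relu(W @ h + b)               # hidden layers
    W, b = layers[-1]
    return int(np.argmax(W @ h + b))      # predicted digit class
```

On an FPGA, each `W @ h` becomes an array of parallel MAC units, and the weights live in on-chip memory, which is why total memory footprint is a key feasibility criterion.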
Contributors: Lythgoe, Zachary James (Author) / Allee, David (Thesis director) / Hartin, Olin (Committee member) / Electrical Engineering Program (Contributor) / Barrett, The Honors College (Contributor)
Created: 2019-12
Description
Precise Position, Navigation, and Timing (PNT) is necessary for the functioning of many critical infrastructure sectors relied upon by millions every day. Precise timing in particular is primarily provided through the Global Positioning System (GPS) and its constellation of satellites, each of which houses multiple atomic clocks. Without precise timing, utilities such as the internet, the power grid, navigational systems, and financial systems would cease operation. Because oscillator devices experience frequency drift during operation, many systems rely on the precise time provided by GPS to maintain synchronization across the globe. However, GPS signals are particularly susceptible to disruption, both intentional and unintentional, due to their space-based, low-power, and unencrypted nature. For these reasons, there is a need for a system that can provide an accurate timing reference, disciplined by a GPS signal, that also maintains its nominal frequency during periods of intermittent GPS availability. This project considers an accurate timing reference deployed on a Field Programmable Gate Array (FPGA) and disciplined by a GPS module. The objective is to implement a timing reference on a DE10-Lite FPGA disciplined by the 1 Pulse-Per-Second (PPS) output of an MTK3333 GPS module. When a signal lock is achieved with GPS, the MTK3333 delivers a pulse to the FPGA on the leading edge of every second. The FPGA aligns a digital oscillator to this PPS reference, providing a disciplined 10 MHz output signal that is maintained during intermittent GPS availability. The developed solution is evaluated using an oscilloscope and a frequency counter disciplined by an atomic clock. The findings deem the software solution acceptable, with more work needed to debug the hardware solution.
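The disciplining step can be modeled as a simple trim loop: count local clock cycles between PPS edges, compare against the nominal count for one second, and adjust the divider that derives the 10 MHz output. The 50 MHz nominal clock and the loop gain below are assumptions for illustration, not the project's actual control law:

```python
def discipline(nominal_count, pps_measurements, gain=0.5):
    """One trim update per second.  Each entry in pps_measurements is the
    number of local clock cycles counted between consecutive PPS edges,
    or None when no PPS edge arrived (GPS outage -> holdover)."""
    trim = 0.0
    history = []
    for measured in pps_measurements:
        if measured is not None:
            error = measured - nominal_count  # cycles of drift this second
            trim += gain * error              # integrate toward zero error
        history.append(trim)                  # holdover keeps the last trim
    return history

# e.g. a 50 MHz local clock running 10 cycles/s fast, then losing GPS lock:
trims = discipline(50_000_000, [50_000_010, 50_000_010, None, None])
```

The holdover branch captures the key requirement: when PPS vanishes, the output keeps the last known correction rather than drifting back to the raw oscillator frequency.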
Contributors: Witthus, Alexander (Author) / Allee, David (Thesis director) / Hartin, Olin (Committee member) / Barrett, The Honors College (Contributor) / Electrical Engineering Program (Contributor)
Created: 2022-05