Search Content

Algorithm and Hardware Design for Efficient Deep Learning Inference

Description

Deep learning (DL) has proved itself be one of the most important developements till date with far reaching impacts in numerous fields like robotics, computer vision, surveillance, speech processing, machine translation, finance, etc. They are now widely used for countless applications because of their ability to generalize real world data,…

Deep learning (DL) has proved itself be one of the most important developements till date with far reaching impacts in numerous fields like robotics, computer vision, surveillance, speech processing, machine translation, finance, etc. They are now widely used for countless applications because of their ability to generalize real world data, robustness to noise in previously unseen data and high inference accuracy. With the ability to learn useful features from raw sensor data, deep learning algorithms have out-performed tradinal AI algorithms and pushed the boundaries of what can be achieved with AI. In this work, we demonstrate the power of deep learning by developing a neural network to automatically detect cough instances from audio recorded in un-constrained environments. For this, 24 hours long recordings from 9 dierent patients is collected and carefully labeled by medical personel. A pre-processing algorithm is proposed to convert event based cough dataset to a more informative dataset with start and end of coughs and also introduce data augmentation for regularizing the training procedure. The proposed neural network achieves 92.3% leave-one-out accuracy on data captured in real world.

Deep neural networks are composed of multiple layers that are compute/memory intensive. This makes it difficult to execute these algorithms real-time with low power consumption using existing general purpose computers. In this work, we propose hardware accelerators for a traditional AI algorithm based on random forest trees and two representative deep convolutional neural networks (AlexNet and VGG). With the proposed acceleration techniques, ~ 30x performance improvement was achieved compared to CPU for random forest trees. For deep CNNS, we demonstrate that much higher performance can be achieved with architecture space exploration using any optimization algorithms with system level performance and area models for hardware primitives as inputs and goal of minimizing latency with given resource constraints. With this method, ~30GOPs performance was achieved for Stratix V FPGA boards.

Hardware acceleration of DL algorithms alone is not always the most ecient way and sucient to achieve desired performance. There is a huge headroom available for performance improvement provided the algorithms are designed keeping in mind the hardware limitations and bottlenecks. This work achieves hardware-software co-optimization for Non-Maximal Suppression (NMS) algorithm. Using the proposed algorithmic changes and hardware architecture

With CMOS scaling coming to an end and increasing memory bandwidth bottlenecks, CMOS based system might not scale enough to accommodate requirements of more complicated and deeper neural networks in future. In this work, we explore RRAM crossbars and arrays as compact, high performing and energy efficient alternative to CMOS accelerators for deep learning training and inference. We propose and implement RRAM periphery read and write circuits and achieved ~3000x performance improvement in online dictionary learning compared to CPU.

This work also examines the realistic RRAM devices and their non-idealities. We do an in-depth study of the effects of RRAM non-idealities on inference accuracy when a pretrained model is mapped to RRAM based accelerators. To mitigate this issue, we propose Random Sparse Adaptation (RSA), a novel scheme aimed at tuning the model to take care of the faults of the RRAM array on which it is mapped. Our proposed method can achieve inference accuracy much higher than what traditional Read-Verify-Write (R-V-W) method could achieve. RSA can also recover lost inference accuracy 100x ~ 1000x faster compared to R-V-W. Using 32-bit high precision RSA cells, we achieved ~10% higher accuracy using fautly RRAM arrays compared to what can be achieved by mapping a deep network to an 32 level RRAM array with no variations.

ContributorsMohanty, Abinash (Author) / Cao, Yu (Thesis advisor) / Seo, Jae-Sun (Committee member) / Vrudhula, Sarma (Committee member) / Chakrabarti, Chaitali (Committee member) / Arizona State University (Publisher)

Created2018

Transformers: An Architectural Analysis, Survey and Implementation

Description

The aim of this project is to understand the basic algorithmic components of the transformer deep learning architecture. At a high level, a transformer is a machine learning model based off of a recurrent neural network that adopts a self-attention mechanism, which can weigh significant parts of sequential input data…

The aim of this project is to understand the basic algorithmic components of the transformer deep learning architecture. At a high level, a transformer is a machine learning model based off of a recurrent neural network that adopts a self-attention mechanism, which can weigh significant parts of sequential input data which is very useful for solving problems in natural language processing and computer vision. There are other approaches to solving these problems which have been implemented in the past (i.e., convolutional neural networks and recurrent neural networks), but these architectures introduce the issue of the vanishing gradient problem when an input becomes too long (which essentially means the network loses its memory and halts learning) and have a slow training time in general. The transformer architecture’s features enable a much better “memory” and a faster training time, which makes it a more optimal architecture in solving problems. Most of this project will be spent producing a survey that captures the current state of research on the transformer, and any background material to understand it. First, I will do a keyword search of the most well cited and up-to-date peer reviewed publications on transformers to understand them conceptually. Next, I will investigate any necessary programming frameworks that will be required to implement the architecture. I will use this to implement a simplified version of the architecture or follow an easy to use guide or tutorial in implementing the architecture. Once the programming aspect of the architecture is understood, I will then Implement a transformer based on the academic paper “Attention is All You Need”. I will then slightly tweak this model using my understanding of the architecture to improve performance. Once finished, the details (i.e., successes, failures, process and inner workings) of the implementation will be evaluated and reported, as well as the fundamental concepts surveyed. The motivation behind this project is to explore the rapidly growing area of AI algorithms, and the transformer algorithm in particular was chosen because it is a major milestone for engineering with AI and software. Since their introduction, transformers have provided a very effective way of solving natural language processing, which has allowed any related applications to succeed with high speed while maintaining accuracy. Since then, this type of model can be applied to more cutting edge natural language processing applications, such as extracting semantic information from a text description and generating an image to satisfy it.

ContributorsCereghini, Nicola (Author) / Acuna, Ruben (Thesis director) / Bansal, Ajay (Committee member) / Barrett, The Honors College (Contributor) / Software Engineering (Contributor)

Created2023-05

A Neural Network Model for a Tutoring Companion Supporting Students in a Programming with Java Course

Description

Feedback represents a vital component of the learning process and is especially important for Computer Science students. With class sizes that are often large, it can be challenging to provide individualized feedback to students. Consistent, constructive, supportive feedback through a tutoring companion can scaffold the learning process for students.

This work…

Feedback represents a vital component of the learning process and is especially important for Computer Science students. With class sizes that are often large, it can be challenging to provide individualized feedback to students. Consistent, constructive, supportive feedback through a tutoring companion can scaffold the learning process for students.

This work contributes to the construction of a tutoring companion designed to provide this feedback to students. It aims to bridge the gap between the messages the compiler delivers, and the support required for a novice student to understand the problem and fix their code. Particularly, it provides support for students learning about recursion in a beginning university Java programming course. Besides also providing affective support, a tutoring companion could be more effective when it is embedded into the environment that the student is already using, instead of an additional tool for the student to learn. The proposed Tutoring Companion is embedded into the Eclipse Integrated Development Environment (IDE).

This thesis focuses on the reasoning model for the Tutoring Companion and is developed using the techniques of a neural network. While a student uses the IDE, the Tutoring Companion collects 16 data points, including the presence of certain key words, cyclomatic complexity, and error messages from the compiler, every time it detects an event, such as a run attempt, debug attempt, or a request for help, in the IDE. This data is used as inputs to the neural network. The neural network produces a correlating single output code for the feedback to be provided to the student, which is displayed in the IDE.

The effectiveness of the approach is examined among 38 Computer Science students who solve a programming assignment while the Tutoring Companion assists them. Data is collected from these interactions, including all inputs and outputs for the neural network, and students are surveyed regarding their experience. Results suggest that students feel supported while working with the Companion and promising potential for using a neural network with an embedded companion in the future. Challenges in developing an embedded companion are discussed, as well as opportunities for future work.

ContributorsDay, Melissa (Author) / Gonzalez-Sanchez, Javier (Thesis advisor) / Bansal, Ajay (Committee member) / Mehlhase, Alexandra (Committee member) / Arizona State University (Publisher)

Created2019

Energy-Efficient ASIC Accelerators for Machine/Deep Learning Algorithms

Description

While machine/deep learning algorithms have been successfully used in many practical applications including object detection and image/video classification, accurate, fast, and low-power hardware implementations of such algorithms are still a challenging task, especially for mobile systems such as Internet of Things, autonomous vehicles, and smart drones.

This work presents an energy-efficient…

While machine/deep learning algorithms have been successfully used in many practical applications including object detection and image/video classification, accurate, fast, and low-power hardware implementations of such algorithms are still a challenging task, especially for mobile systems such as Internet of Things, autonomous vehicles, and smart drones.

This work presents an energy-efficient programmable application-specific integrated circuit (ASIC) accelerator for object detection. The proposed ASIC supports multi-class (face/traffic sign/car license plate/pedestrian), many-object (up to 50) in one image with different sizes (6 down-/11 up-scaling), and high accuracy (87% for face detection datasets). The proposed accelerator is composed of an integral channel detector with 2,000 classifiers for five rigid boosted templates to make a strong object detection. By jointly optimizing the algorithm and efficient hardware architecture, the prototype chip implemented in 65nm demonstrates real-time object detection of 20-50 frames/s with 22.5-181.7mW (0.54-1.75nJ/pixel) at 0.58-1.1V supply.

In this work, to reduce computation without accuracy degradation, an energy-efficient deep convolutional neural network (DCNN) accelerator is proposed based on a novel conditional computing scheme and integrates convolution with subsequent max-pooling operations. This way, the total number of bit-wise convolutions could be reduced by ~2x, without affecting the output feature values. This work also has been developing an optimized dataflow that exploits sparsity, maximizes data re-use and minimizes off-chip memory access, which can improve upon existing hardware works. The total off-chip memory access can be saved by 2.12x. Preliminary results of the proposed DCNN accelerator achieved a peak 7.35 TOPS/W for VGG-16 by post-layout simulation results in 40nm.

A number of recent efforts have attempted to design custom inference engine based on various approaches, including the systolic architecture, near memory processing, and in-meomry computing concept. This work evaluates a comprehensive comparison of these various approaches in a unified framework. This work also presents the proposed energy-efficient in-memory computing accelerator for deep neural networks (DNNs) by integrating many instances of in-memory computing macros with an ensemble of peripheral digital circuits, which supports configurable multibit activations and large-scale DNNs seamlessly while substantially improving the chip-level energy-efficiency. Proposed accelerator is fully designed in 65nm, demonstrating ultralow energy consumption for DNNs.

ContributorsKim, Minkyu (Author) / Seo, Jae-Sun (Thesis advisor) / Cao, Yu Kevin (Committee member) / Vrudhula, Sarma (Committee member) / Ogras, Umit Y. (Committee member) / Arizona State University (Publisher)

Created2019

Reduced Order Models and Approximations for Hardware Acceleration of Neural Networks

Description

Many real-world engineering problems require simulations to evaluate the design objectives and constraints. Often, due to the complexity of the system model, simulations can be prohibitive in terms of computation time. One approach to overcome this issue is to construct a surrogate model, which approximates the original model. The focus…

Many real-world engineering problems require simulations to evaluate the design objectives and constraints. Often, due to the complexity of the system model, simulations can be prohibitive in terms of computation time. One approach to overcome this issue is to construct a surrogate model, which approximates the original model. The focus of this work is on the data-driven surrogate models, in which empirical approximations of the output are performed given the input parameters. Recently neural networks (NN) have re-emerged as a popular method for constructing data-driven surrogate models. Although, NNs have achieved excellent accuracy and are widely used, they pose their own challenges. This work addresses two common challenges, the need for: (1) hardware acceleration and (2) uncertainty quantification (UQ) in the presence of input variability. The high demand in the inference phase of deep NNs in cloud servers/edge devices calls for the design of low power custom hardware accelerators. The first part of this work describes the design of an energy-efficient long short-term memory (LSTM) accelerator. The overarching goal is to aggressively reduce the power consumption and area of the LSTM components using approximate computing, and then use architectural level techniques to boost the performance. The proposed design is synthesized and placed and routed as an application-specific integrated circuit (ASIC). The results demonstrate that this accelerator is 1.2X and 3.6X more energy-efficient and area-efficient than the baseline LSTM. In the second part of this work, a robust framework is developed based on an alternate data-driven surrogate model referred to as polynomial chaos expansion (PCE) for addressing UQ. In contrast to many existing approaches, no assumptions are made on the elements of the function space and UQ is a function of the expansion coefficients. Moreover, the sensitivity of the output with respect to any subset of the input variables can be computed analytically by post-processing the PCE coefficients. This provides a systematic and incremental method to pruning or changing the order of the model. This framework is evaluated on several real-world applications from different domains and is extended for classification tasks as well.

ContributorsAzari, Elham (Author) / Vrudhula, Sarma (Thesis advisor) / Fainekos, Georgios (Committee member) / Ren, Fengbo (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)

Created2021

Filtering by

Algorithm and Hardware Design for Efficient Deep Learning Inference

Transformers: An Architectural Analysis, Survey and Implementation

A Neural Network Model for a Tutoring Companion Supporting Students in a Programming with Java Course

Energy-Efficient ASIC Accelerators for Machine/Deep Learning Algorithms

Reduced Order Models and Approximations for Hardware Acceleration of Neural Networks