Matching Items (49)
Description
Vision processing on traditional architectures is inefficient due to energy-expensive off-chip data movements. Many researchers advocate pushing processing close to the sensor to substantially reduce data movements. However, continuous near-sensor processing raises the sensor temperature, impairing the fidelity of imaging/vision tasks.

The work characterizes the thermal implications of using 3D stacked image sensors with near-sensor vision processing units. The characterization reveals that near-sensor processing reduces system power but degrades image quality. For reasonable image fidelity, the sensor temperature needs to stay below a threshold, situationally determined by application needs. Fortunately, the characterization also identifies opportunities -- unique to the needs of near-sensor processing -- to regulate temperature based on dynamic visual task requirements and rapidly increase capture quality on demand.

Based on the characterization, the work proposes and investigates two thermal management strategies -- stop-capture-go and seasonal migration -- for imaging-aware thermal management. The work presents parameters that govern the policy decisions and explores the trade-offs between system power and policy overhead. The work's evaluation shows that the novel dynamic thermal management strategies can unlock the energy-efficiency potential of near-sensor processing with minimal performance impact, without compromising image fidelity.
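
The stop-capture-go policy lends itself to a simple control loop. Below is a minimal Python sketch of that idea, assuming hypothetical read_temp/capture_frame/process_frame hooks and illustrative threshold values; it is not the thesis implementation.

```python
import time

T_MAX = 65.0     # fidelity-driven ceiling (deg C); illustrative, set by application needs
T_RESUME = 55.0  # hysteresis point at which capture restarts (illustrative)

def stop_capture_go(read_temp, capture_frame, process_frame):
    """Capture and process frames, stalling while the sensor runs hot."""
    while True:
        if read_temp() > T_MAX:
            # Halt capture and near-sensor compute until the stack cools;
            # a short stall buys back image fidelity.
            while read_temp() > T_RESUME:
                time.sleep(0.01)
        process_frame(capture_frame())
```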
Contributors: Kodukula, Venkatesh (Author) / LiKamWa, Robert (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Brunhaver, John (Committee member) / Arizona State University (Publisher)
Created: 2019
Description
Artificial Neural Networks (ANNs) have become a forerunner in the field of Artificial Intelligence. Innovations in ANNs have led to groundbreaking technological advances like self-driving vehicles, medical diagnosis, speech processing, personal assistants and many more. These were inspired by the evolution and working of our brains. Similar to how our brain evolved using a combination of epigenetics and live stimulus, ANNs require training to learn patterns. The training usually requires a lot of computation and memory accesses. To realize these systems in real embedded hardware, many energy/power/performance issues need to be solved. The purpose of this research is to focus on methods to study the data movement requirements of a generic neural network, along with the energy associated with it, and to suggest some ways to improve the design.

Many methods have suggested ways to optimize using a mix of computation and data-movement solutions without affecting task accuracy, but these methods lack a computation model to calculate the energy and depend on mere back-of-the-envelope calculations. We realized that there is a need for a generic quantitative analysis of memory-access energy which helps in better architectural exploration. We show that the present architectural tools are either incompatible or too slow and that we need a better analytical method to estimate data-movement energy. We also propose a simplistic yet effective approach that is robust and expandable by users to support various systems.
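
To make the kind of quantitative analysis argued for above concrete, the sketch below estimates data-movement energy for a single fully-connected layer from per-access energy costs. The per-access energies and the function name are illustrative assumptions, not values from this work.

```python
# Back-of-the-envelope data-movement energy for one fully-connected layer.
ENERGY_PJ = {"sram": 5.0, "dram": 640.0}  # assumed pJ per 32-bit access

def layer_data_movement_energy_pj(n_in, n_out, weight_level="dram"):
    """Energy (pJ) to stream all weights once and activations once."""
    weight_accesses = n_in * n_out   # one read per weight
    act_accesses = n_in + n_out      # read inputs, write outputs
    return (weight_accesses * ENERGY_PJ[weight_level]
            + act_accesses * ENERGY_PJ["sram"])

# e.g. an MNIST-sized 784x256 layer with weights resident in DRAM:
print(layer_data_movement_energy_pj(784, 256) / 1e6, "uJ")
```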
Contributors: Chowdary, Hidayatullah (Author) / Cao, Yu (Thesis advisor) / Seo, Jae-Sun (Committee member) / Chakrabarti, Chaitali (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
Hardware implementation of deep neural networks is gaining significant importance nowadays. Deep neural networks are mathematical models that use learning algorithms inspired by the brain. Numerous deep learning algorithms such as multi-layer perceptrons (MLP) have demonstrated human-level recognition accuracy in image and speech classification tasks. These networks are built from multiple layers of processing elements called neurons, with several connections between them called synapses. Hence, they involve operations that exhibit a high level of parallelism, making them computationally and memory intensive. Constrained by computing resources and memory, most applications require a neural network which utilizes less energy. Energy-efficient implementation of these computationally intense algorithms on neuromorphic hardware demands a lot of architectural optimizations. One such optimization is reducing the network size using compression, and several studies have investigated compression by introducing element-wise or row-/column-/block-wise sparsity via pruning and regularization. Additionally, numerous recent works have concentrated on reducing the precision of activations and weights, some reducing it to a single bit. However, combining various sparsity structures with binarized or very-low-precision (2-3 bit) neural networks has not been comprehensively explored. Output activations in these deep neural network algorithms are typically non-binary, making it difficult to exploit sparsity.

On the other hand, biologically realistic models like spiking neural networks (SNN) closely mimic the operations in biological nervous systems and explore new avenues for brain-like cognitive computing. These networks deal with binary spikes, and they can exploit input-dependent sparsity or redundancy to dynamically scale the amount of computation, in turn leading to energy-efficient hardware implementation.

This work discusses a configurable spiking neuromorphic architecture that supports multiple hidden layers exploiting hardware reuse. It also presents design techniques for minimum-area/-energy DNN hardware with minimal degradation in accuracy. Area, performance and energy results of the DNN and SNN hardware are reported for the MNIST dataset. The neuromorphic hardware designed for the SNN algorithm in 28nm CMOS demonstrates high classification accuracy (>98% on MNIST) and low energy (51.4-773 nJ per classification). The optimized DNN hardware designed in 40nm CMOS, which combines 8X structured compression and 3-bit weight precision, shows 98.4% accuracy at 33 nJ per classification.
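
As an illustration of how binary spikes expose input-dependent sparsity, here is a minimal NumPy sketch of one timestep of a leaky integrate-and-fire layer; the neuron model, parameter values and names are generic assumptions, not the architecture described above.

```python
import numpy as np

def lif_layer_step(spikes_in, weights, v_mem, v_th=1.0, leak=0.9):
    """One timestep of a leaky integrate-and-fire layer.

    Because inputs are binary spikes, the weighted sum reduces to adding up
    the weight rows of active inputs -- hardware can skip silent neurons.
    """
    active = np.flatnonzero(spikes_in)              # only spiking inputs matter
    v_mem = leak * v_mem + weights[active].sum(axis=0)
    spikes_out = (v_mem >= v_th).astype(np.uint8)
    v_mem[spikes_out == 1] = 0.0                    # reset fired neurons
    return spikes_out, v_mem
```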
Contributors: Kolala Venkataramanaiah, Shreyas (Author) / Seo, Jae-Sun (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
With the end of Dennard scaling and Moore's law, architects have moved towards heterogeneous designs consisting of specialized cores to achieve higher performance and energy efficiency for a target application domain. Applications of linear algebra are ubiquitous in the fields of scientific computing, machine learning, statistics, etc., with matrix computations being fundamental to these linear algebra based solutions. The design of multiple dense (or sparse) matrix computation routines on the same platform is quite challenging. Added to the complexity is the fact that dense and sparse matrix computations have large differences in their storage and access patterns and are difficult to optimize on the same architecture. This thesis addresses this challenge and introduces a reconfigurable accelerator that supports both dense and sparse matrix computations efficiently.

The reconfigurable architecture has been optimized to execute the following linear algebra routines: GEMV (Dense General Matrix Vector Multiplication), GEMM (Dense General Matrix Matrix Multiplication), TRSM (Triangular Matrix Solver), LU Decomposition, Matrix Inverse, SpMV (Sparse Matrix Vector Multiplication) and SpMM (Sparse Matrix Matrix Multiplication). It is a multicore architecture where each core consists of a 2D array of processing elements (PEs).

The 2D array of PEs is of size 4x4 and is scheduled to perform 4x4 matrix updates efficiently. A sequence of such updates is used to solve a larger problem inside a core. A novel partitioned block compressed sparse data structure (PBCSC/PBCSR) is used to perform sparse kernel updates. Scalable partitioning and mapping schemes are presented that map input matrices of any given size to the multicore architecture. Design trade-offs related to the PE array dimension, the size of local memory inside a core, and the bandwidth between on-chip memories and the cores are presented, and an optimal core configuration is developed from this analysis. Synthesis results using a 7nm PDK show that the proposed accelerator can achieve a performance of up to 32 GOPS using a single core.
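
The PBCSC/PBCSR structure is specific to this thesis, but the block-granular sparse update it enables can be illustrated with standard block-CSR. The NumPy sketch below performs SpMV with dense 4x4 blocks, mirroring the 4x4 matrix updates the PE array is scheduled on; the storage layout and names here are generic assumptions, not the proposed format.

```python
import numpy as np

def bcsr_spmv(block_vals, block_cols, row_ptr, x, b=4):
    """y = A @ x for A stored in block-CSR with dense b x b blocks.

    block_vals: (num_blocks, b, b) dense blocks
    block_cols: block-column index of each stored block
    row_ptr:    start offset of each block-row in block_vals/block_cols
    """
    n_block_rows = len(row_ptr) - 1
    y = np.zeros(n_block_rows * b)
    for i in range(n_block_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = block_cols[k]
            # One b x b matrix-vector update: the granularity of the PE array.
            y[i*b:(i+1)*b] += block_vals[k] @ x[j*b:(j+1)*b]
    return y
```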
Contributors: Animesh, Saurabh (Author) / Chakrabarti, Chaitali (Thesis advisor) / Brunhaver, John (Committee member) / Ren, Fengbo (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
In recent years, the conventional convolutional neural network (CNN) has achieved outstanding performance in image and speech processing applications. Unfortunately, the pooling operation in the CNN discards spatial information, which is an important attribute in many applications. The recently proposed capsule network retains spatial information and improves the capabilities of the traditional CNN. It uses capsules to describe features in multiple dimensions and dynamic routing to increase the statistical stability of the network.

In this work, we first use the capsule network for the overlapping digit recognition problem. We evaluate the performance of the network with respect to recognition accuracy, convergence and training time per epoch. We show that the capsule network achieves higher accuracy when the training set size is small. When the training set size is larger, the capsule network and the conventional CNN have comparable recognition accuracy. The training time per epoch for the capsule network is longer than that of the conventional CNN because of the dynamic routing algorithm. An analysis of the GPU timing shows that adjusting the capsule structure can help decrease the time complexity of the dynamic routing algorithm significantly.

Next, we design a capsule network for speech recognition, specifically, overlapping word recognition. We use both the capsule network and a conventional CNN to recognize 2 overlapping words in speech files created from 5 word classes. We show that the capsule network achieves a considerably higher recognition accuracy (96.92%) compared to the conventional CNN (85.19%). Our results show that the capsule network recognizes overlapping words by recognizing each individual word in the speech. We also verify the scalability of the capsule network by increasing the number of word classes from 5 to 10. The capsule network still shows a high recognition accuracy of 95.42% in the case of 10 words, while the accuracy of the conventional CNN decreases sharply to 73.18%.
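
For reference, the dynamic routing step at the heart of the capsule network (routing-by-agreement, following Sabour et al.) can be sketched in a few lines of NumPy; this is a generic rendition, not the exact configuration used in these experiments.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Non-linearity that keeps a capsule vector's length in [0, 1)."""
    norm2 = (s * s).sum(axis=axis, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def dynamic_routing(u_hat, iters=3):
    """Route predictions u_hat of shape (n_in, n_out, dim) to output capsules."""
    b = np.zeros(u_hat.shape[:2])                            # routing logits
    for _ in range(iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True) # coupling coeffs
        s = (c[..., None] * u_hat).sum(axis=0)               # weighted sum
        v = squash(s)                                        # (n_out, dim)
        b += (u_hat * v[None]).sum(axis=-1)                  # agreement update
    return v
```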
Contributors: Xiong, Yan (Author) / Chakrabarti, Chaitali (Thesis advisor) / Berisha, Visar (Thesis advisor) / Weng, Yang (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
Speech recognition and keyword detection are becoming increasingly popular applications for mobile systems. While deep neural network (DNN) implementations of these systems have very good performance, they have large memory and compute resource requirements, making their implementation on a mobile device quite challenging. In this thesis, techniques to reduce the memory and computation cost of keyword detection and speech recognition networks (or DNNs) are presented.

The first technique is based on representing all weights and biases by a small number of bits and mapping all nodal computations into fixed-point ones with minimal degradation in accuracy. Experiments conducted on the Resource Management (RM) database show that for the keyword detection neural network, representing the weights by 5 bits results in a 6-fold reduction in memory compared to a floating-point implementation with very little loss in performance. Similarly, for the speech recognition neural network, representing the weights by 6 bits results in a 5-fold reduction in memory while maintaining an error rate similar to a floating-point implementation. Additional reduction in memory is achieved by a technique called weight pruning, where the weights are classified as sensitive and insensitive and the sensitive weights are represented with higher precision. A combination of these two techniques helps reduce the memory footprint by 81-84% for the speech recognition and keyword detection networks, respectively.

Further reduction in memory size is achieved by judiciously dropping connections for large blocks of weights. The corresponding technique, termed coarse-grain sparsification, introduces hardware-aware sparsity during DNN training, which leads to efficient weight memory compression and a significant reduction in the number of computations during classification without loss of accuracy. Keyword detection and speech recognition DNNs trained with 75% of the weights dropped and classified with 5-6 bit weight precision effectively reduce the weight memory requirement by ~95% compared to a fully-connected network with double precision, while showing similar performance in keyword detection accuracy and word error rate.
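
A minimal sketch of the first technique, uniform fixed-point weight quantization, is given below; it omits the sensitive/insensitive weight split described above, and the symmetric scaling scheme is an assumption made for illustration.

```python
import numpy as np

def quantize_weights(w, bits=5):
    """Uniform symmetric quantization of a weight matrix to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 15 for 5 bits
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale                  # dequantize as q * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_weights(w, bits=5)
print("max abs quantization error:", np.abs(w - q * s).max())
```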
Contributors: Arunachalam, Sairam (Author) / Chakrabarti, Chaitali (Thesis advisor) / Seo, Jae-Sun (Thesis advisor) / Cao, Yu (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
Historically, wireless communication devices have been developed to process one specific waveform. In contrast, a modern cellular phone supports multiple waveforms corresponding to LTE, WCDMA (3G) and 2G standards. The selection of the network is controlled by software running on a general-purpose processor, not by the user. Now, instead of selecting from a set of complete radios as in software-controlled radio, what if the software could select the building blocks based on the user's needs? This is the new software-defined flexible radio, which would enable users to construct wireless systems that fit their needs, rather than forcing them to choose from a small set of pre-existing protocols.

To develop and implement flexible protocols, flexible hardware very similar to a Software Defined Radio (SDR) is required. In this thesis, the Intel T2200 board is chosen as the SDR platform. It is a heterogeneous platform with an ARM processor, a CEVA DSP and several accelerators. A wide range of protocols is mapped onto this platform and their performance evaluated. These include two OFDM-based protocols (WiFi-Lite-A, WiFi-Lite-B), one DFT-spread-OFDM-based protocol (SCFDM-Lite) and one single-carrier-based protocol (SC-Lite). The transmitter and receiver blocks of the different protocols are first mapped onto the ARM in the T2200 board. The timing results show that the IFFT, FFT and Viterbi decoder blocks take most of the transmitter and receiver execution time, so in the next step these are mapped onto the CEVA DSP. Mapping onto the CEVA DSP resulted in significant execution time savings: 60% for WiFi-Lite-A, 64% for WiFi-Lite-B, and 71.5% for SCFDM-Lite. No savings are reported for SC-Lite since it was not mapped onto the CEVA DSP.

Significant reduction in execution time is achieved for the WiFi-Lite-A and WiFi-Lite-B protocols by implementing the entire transmitter and receiver chains on the CEVA DSP. For instance, for WiFi-Lite-A, the savings were as large as 90%. Such large savings arise because the entire transmitter or receiver chain is implemented on CEVA and the timing overhead due to ARM-CEVA communication is completely eliminated. Finally, over-the-air testing was done for the WiFi-Lite-A and WiFi-Lite-B protocols. Data was sent over the air using one Intel T2200 WBS board and received using another Intel T2200 WBS board. The received frames were decoded with no errors, thereby validating the over-the-air communication.
Contributors: Chagari, Vamsi Reddy (Author) / Chakrabarti, Chaitali (Thesis advisor) / Lee, Hyunseok (Committee member) / Ogras, Umit Y. (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
Many real-time vision applications require accurate estimation of optical flow. This problem is quite challenging due to extremely high computation and memory requirements. This thesis focuses on designing low complexity dense optical flow algorithms.

First, a new method for optical flow that is based on Semi-Global Matching (SGM), a popular dynamic programming algorithm for stereo vision, is presented. In SGM, the disparity of each pixel is calculated by aggregating local matching costs over the entire image to resolve local ambiguity in texture-less and occluded regions. The proposed method, Neighbor-Guided Semi-Global Matching (NG-fSGM), achieves significantly lower complexity than SGM by 1) operating on a subset of the search space that has been aggressively pruned based on neighboring pixels' information, 2) using a simple cost aggregation function, and 3) approximating the aggregated cost array and embedding pixel-wise matching cost computation and flow computation in aggregation. Evaluation on the Middlebury benchmark suite showed that, compared to a prior SGM extension for optical flow, the proposed basic NG-fSGM provides robust optical flow with a 0.53% accuracy improvement, a 40x reduction in the number of operations and a 6x reduction in memory size. To further reduce the complexity, a sparse-to-dense flow estimation method is proposed. The number of operations and memory size are reduced by 68% and 47%, respectively, with only 0.42% accuracy degradation, compared to the basic NG-fSGM.

A parallel block-based version of NG-fSGM is also proposed. The image is divided into overlapping blocks and the blocks are processed in parallel to improve throughput, latency and power efficiency. To minimize the amount of overlap among blocks with minimal effect on the accuracy, temporal information is used to estimate a flow map that guides flow vector selections for pixels along block boundaries. The proposed block-based NG-fSGM achieves significant reduction in complexity with only 0.51% accuracy degradation compared to the basic NG-fSGM.
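
For context, the SGM-style cost aggregation that NG-fSGM prunes can be sketched along a single scanline as follows; this dense version is the classic recurrence with placeholder penalty values, not the pruned NG-fSGM variant.

```python
import numpy as np

def sgm_aggregate_1d(cost, p1=1.0, p2=8.0):
    """Classic SGM cost aggregation along one scanline direction.

    cost: (width, labels) pixel-wise matching costs. NG-fSGM additionally
    prunes the per-pixel label set using neighbors' best flow vectors,
    which this dense sketch omits.
    """
    L = np.empty_like(cost)
    L[0] = cost[0]
    for x in range(1, cost.shape[0]):
        prev = L[x - 1]
        # +-1 label moves pay p1 (wrap-around at label ends kept for brevity).
        near = np.minimum(np.roll(prev, 1), np.roll(prev, -1)) + p1
        far = prev.min() + p2                  # arbitrary label jump pays p2
        L[x] = cost[x] + np.minimum(np.minimum(prev, near), far) - prev.min()
    return L
```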
Contributors: Xiang, Jiang (Author) / Chakrabarti, Chaitali (Thesis advisor) / Karam, Lina (Committee member) / Kim, Hun Seok (Committee member) / Arizona State University (Publisher)
Created: 2017
Description
When one considers the current state of wireless communications, it becomes clear that it is both absolutely amazing and something of a mess. Present communications standards are the result of local optimizations over time that have led to a confusing set of suboptimal and fragile wireless standards. Starting from a clean sheet of paper, the Bliss Laboratory for Information, Signals, and Systems (BLISS) is considering a fluid set of communications standards co-optimized with flexible but power-efficient computational implementations that will enable the next revolution of wireless communications. The main aim is to enable much higher data rates, as well as much lower data rates with correspondingly lower power consumption, as the needs of the users vary.

The thesis mainly looks at the different sections of the work done to prime the development of the protocol development engine. It discusses channel modeling and the system integration of receiver and channel noise. It also proposes a Carrier-Sense Multiple Access (CSMA) Media Access Control (MAC) layer protocol implementation for the Wi-Fi (Wireless Fidelity) protocol. This work also describes the Graphical User Interface (GUI), which is a part of the Protocol Development Kit (PDK) - a combination of the Protocol Recommendation Engine (PRE) and a simulation package to aid the development of protocols. It also sheds light on the Automatic Dependent Surveillance - Broadcast (ADS-B) radio protocol, which will eventually replace radar as Air Traffic Control's (ATC) primary tool for separating aircraft.

All the algorithms used in this thesis to define radio operation were in principle defined by mathematical descriptions; however, to test and implement these algorithms, they had to be converted to a computer language. There were multiple phases of this conversion. In the first phase, the implementation of these algorithms was done in Matrix Laboratory (MATLAB). To aid this development, basic radio finite state machines and radio algorithmic tools were provided.
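
As a flavor of the proposed CSMA MAC behavior, here is a minimal carrier-sense/backoff sketch (in Python rather than the MATLAB used in the thesis); the hooks, slot timing and backoff policy are illustrative assumptions, not the PDK's actual state machine.

```python
import random
import time

SLOT_S = 9e-6  # illustrative slot duration in seconds

def csma_send(channel_busy, transmit, cw_max=32):
    """Carrier-sense with binary exponential backoff (illustrative only)."""
    cw = 2                                    # contention window, in slots
    while True:
        if not channel_busy():
            transmit()                        # channel sensed idle: go
            return
        # Busy: wait a random number of slots, widen the window, re-sense.
        time.sleep(random.randrange(cw) * SLOT_S)
        cw = min(cw * 2, cw_max)
```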
Contributors: Rupakula, Venkata Sai Karteek (Author) / Bliss, Daniel W (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / McGiffen, Tom (Committee member) / Arizona State University (Publisher)
Created: 2017
Description
This thesis addresses two problems in digital baseband design of wireless communication systems, namely, those in Internet of Things (IoT) terminals that support long range communications and those in full-duplex systems that are designed for high spectral efficiency.

IoT terminals for long range communications are typically based on Orthogonal Frequency-Division Multiple Access (OFDMA) and spread spectrum technologies. In order to design an efficient baseband architecture for such terminals, the workload profiles of both systems are analyzed. Since the frame detection unit has by far the highest computational load, a simple architecture that uses only a scalar datapath is proposed. To optimize for low energy consumption, application-specific instructions that minimize register accesses and address generation units for streamlined memory access are introduced. Two parameters, namely, the correlation window size and the threshold value, affect the detection probability, the false alarm probability and hence the energy consumption. Next, energy-optimal operation settings for the correlation window size and threshold value are derived for different channel conditions. For both good and bad channel conditions, if the target signal detection probability is greater than 0.9, the baseband processor has the lowest energy when the frame detection algorithm uses the longest correlation window and the highest threshold value.

A full-duplex system has high spectral efficiency but suffers from self-interference. Part of the interference can be cancelled digitally using equalization techniques. The cancellation performance and computation complexity of the competing equalization algorithms, namely, Least Mean Square (LMS), Normalized LMS (NLMS), Recursive Least Square (RLS) and feedback equalizers based on LMS, NLMS and RLS, are analyzed, and a trade-off between performance and complexity is established. The NLMS linear equalizer is found to be suitable for resource-constrained mobile devices, and the NLMS decision feedback equalizer is more appropriate for base stations, which are not energy constrained.
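
To make the equalizer trade-off concrete, below is a minimal NumPy sketch of an NLMS linear canceller adapting against the known transmit signal; the tap count, step size and names are illustrative assumptions, not the evaluated configuration.

```python
import numpy as np

def nlms_cancel(x, d, taps=32, mu=0.5, eps=1e-8):
    """NLMS linear canceller.

    Adapts w so that w applied to a sliding window of the known transmit
    signal x tracks the self-interference in the received signal d;
    returns the residual e = d - y.
    """
    w = np.zeros(taps)
    e = np.zeros(len(d))
    for n in range(taps, len(d)):
        xw = x[n - taps:n][::-1]               # most recent sample first
        y = w @ xw                             # interference estimate
        e[n] = d[n] - y
        w += mu * e[n] * xw / (xw @ xw + eps)  # normalized step
    return e
```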
Contributors: Wu, Shunyao (Author) / Chakrabarti, Chaitali (Thesis advisor) / Papandreou-Suppappola, Antonia (Committee member) / Lee, Hyunseok (Committee member) / Arizona State University (Publisher)
Created: 2017