Search Content

End-to-End Performance Benchmarking Tool for High-Speed Memory Access in Deep Learning

Description

Due to high DRAM access latency and energy, several convolutional neural network(CNN) accelerators face performance and energy efficiency challenges, which are critical for embedded implementations. As these applications exploit larger datasets, memory accesses of these emerging applications are increasing. As a result, it is difficult to predict the combined…

Due to high DRAM access latency and energy, several convolutional neural network(CNN) accelerators face performance and energy efficiency challenges, which are critical for embedded implementations. As these applications exploit larger datasets, memory accesses of these emerging applications are increasing. As a result, it is difficult to predict the combined dynamic random access memory (DRAM) workload behavior, which can sabotage memory optimizations in software. To understand the impact of external memory access on CNN accelerators which reduces the high DRAMaccess latency and energy, simulators such as RAMULATOR and VAMPIRE have been proposed in prior work. In this work, we utilize these simulators to benchmark external memory access in CNN accelerators. Experiments are performed generating trace files based on the number of parameters and data precision and also using trace file generated for CNN Accelerator Altera Arria 10 GX 1150 FPGA data to complete the end to end workflow using the mentioned simulators. Besides that, certain modifications were made in the default VAMPIRE code to implement certain functionalities such as PREA(Precharge All) and REF(Refresh). Then, precalculated energies were computed for DDR3, DDR4, and HBM based on the micron model to mention it in the dram specification file inputted to the VAMPIRE tool. An experimental study was performed and a comparison is made between DDR3, DDR4, and HBM, it was proved that DDR4 is nearly 31% more energy-efficient than DDR3 and HBMis 54% energy-efficient than DDR3. Performed modeling and experimental analysis on a large set of data and then split it into a set of data and compared the results of the small sets multiplied with the number of sets and the large data set and concluded that the results were nearly the same. Finally, a GUI is developed by wrapping both the simulators. GUI provides user-friendly access and can analyze the parameters without much prior knowledge and understanding of the working.

ContributorsPannala, Manvitha (Author) / Cao, Yu (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created2021

Computational Imaging for Energy-Efficient Cameras: Adaptive ROI-based Object Tracking and Optically Defocused Event-based Sensing.

Description

Computer vision is becoming an essential component of embedded system applications such as smartphones, wearables, autonomous systems and internet-of-things (IoT). These applications are generally deployed into environments with limited energy, memory bandwidth and computational resources. This trend is driving the development of energy-effi cient image processing solutions from sensing to…

Computer vision is becoming an essential component of embedded system applications such as smartphones, wearables, autonomous systems and internet-of-things (IoT). These applications are generally deployed into environments with limited energy, memory bandwidth and computational resources. This trend is driving the development of energy-effi cient image processing solutions from sensing to computation. In this thesis, diff erent alternatives are explored to implement energy-efficient computer vision systems. First, I present a fi eld programmable gate array (FPGA) implementation of an adaptive subsampling algorithm for region-of-interest (ROI) -based object tracking. By implementing the computationally intensive sections of this algorithm on an FPGA, I aim to offl oad computing resources from energy-ineffi cient graphics processing units (GPUs) and/or general-purpose central processing units (CPUs). I also present a working system executing this algorithm in near real-time latency implemented on a standalone embedded device. Secondly, I present a neural network-based pipeline to improve the performance of event-based cameras in non-ideal optical conditions. Event-based cameras or dynamic vision sensors (DVS) are bio-inspired sensors that measure logarithmic per-pixel brightness changes in a scene. Their advantages include high dynamic range, low latency and ultra-low power when compared to standard frame-based cameras. Several tasks have been proposed to take advantage of these novel sensors but they rely on perfectly calibrated optical lenses that are in-focus. In this work I propose a methodto reconstruct events captured with an out-of-focus event-camera so they can be fed into an intensity reconstruction task. The network is trained with a dataset generated by simulating defocus blur in sequences from object tracking datasets such as LaSOT and OTB100. I also test the generalization performance of this network in scenes captured with a DAVIS event-based sensor equipped with an out-of-focus lens.

ContributorsTorres Muro, Victor Isaac (Author) / Jayasuriya, Suren (Thesis advisor) / Spanias, Andreas (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created2022

Energy Efficient Hardware Design of Neural Networks

Description

Hardware implementation of deep neural networks is earning significant importance nowadays. Deep neural networks are mathematical models that use learning algorithms inspired by the brain. Numerous deep learning algorithms such as multi-layer perceptrons (MLP) have demonstrated human-level recognition accuracy in image and speech classification tasks. Multiple layers of processing elements…

Hardware implementation of deep neural networks is earning significant importance nowadays. Deep neural networks are mathematical models that use learning algorithms inspired by the brain. Numerous deep learning algorithms such as multi-layer perceptrons (MLP) have demonstrated human-level recognition accuracy in image and speech classification tasks. Multiple layers of processing elements called neurons with several connections between them called synapses are used to build these networks. Hence, it involves operations that exhibit a high level of parallelism making it computationally and memory intensive. Constrained by computing resources and memory, most of the applications require a neural network which utilizes less energy. Energy efficient implementation of these computationally intense algorithms on neuromorphic hardware demands a lot of architectural optimizations. One of these optimizations would be the reduction in the network size using compression and several studies investigated compression by introducing element-wise or row-/column-/block-wise sparsity via pruning and regularization. Additionally, numerous recent works have concentrated on reducing the precision of activations and weights with some reducing to a single bit. However, combining various sparsity structures with binarized or very-low-precision (2-3 bit) neural networks have not been comprehensively explored. Output activations in these deep neural network algorithms are habitually non-binary making it difficult to exploit sparsity. On the other hand, biologically realistic models like spiking neural networks (SNN) closely mimic the operations in biological nervous systems and explore new avenues for brain-like cognitive computing. These networks deal with binary spikes, and they can exploit the input-dependent sparsity or redundancy to dynamically scale the amount of computation in turn leading to energy-efficient hardware implementation. This work discusses configurable spiking neuromorphic architecture that supports multiple hidden layers exploiting hardware reuse. It also presents design techniques for minimum-area/-energy DNN hardware with minimal degradation in accuracy. Area, performance and energy results of these DNN and SNN hardware is reported for the MNIST dataset. The Neuromorphic hardware designed for SNN algorithm in 28nm CMOS demonstrates high classification accuracy (>98% on MNIST) and low energy (51.4 - 773 (nJ) per classification). The optimized DNN hardware designed in 40nm CMOS that combines 8X structured compression and 3-bit weight precision showed 98.4% accuracy at 33 (nJ) per classification.

ContributorsKolala Venkataramanaiah, Shreyas (Author) / Seo, Jae-Sun (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)

Created2018

Joint Optimization of Quantization and Structured Sparsity for Compressed Deep Neural Networks

Description

Deep neural networks (DNN) have shown tremendous success in various cognitive tasks, such as image classification, speech recognition, etc. However, their usage on resource-constrained edge devices has been limited due to high computation and large memory requirement.

To overcome these challenges, recent works have extensively investigated model compression techniques such…

Deep neural networks (DNN) have shown tremendous success in various cognitive tasks, such as image classification, speech recognition, etc. However, their usage on resource-constrained edge devices has been limited due to high computation and large memory requirement.

To overcome these challenges, recent works have extensively investigated model compression techniques such as element-wise sparsity, structured sparsity and quantization. While most of these works have applied these compression techniques in isolation, there have been very few studies on application of quantization and structured sparsity together on a DNN model.

This thesis co-optimizes structured sparsity and quantization constraints on DNN models during training. Specifically, it obtains optimal setting of 2-bit weight and 2-bit activation coupled with 4X structured compression by performing combined exploration of quantization and structured compression settings. The optimal DNN model achieves 50X weight memory reduction compared to floating-point uncompressed DNN. This memory saving is significant since applying only structured sparsity constraints achieves 2X memory savings and only quantization constraints achieves 16X memory savings. The algorithm has been validated on both high and low capacity DNNs and on wide-sparse and deep-sparse DNN models. Experiments demonstrated that deep-sparse DNN outperforms shallow-dense DNN with varying level of memory savings depending on DNN precision and sparsity levels. This work further proposed a Pareto-optimal approach to systematically extract optimal DNN models from a huge set of sparse and dense DNN models. The resulting 11 optimal designs were further evaluated by considering overall DNN memory which includes activation memory and weight memory. It was found that there is only a small change in the memory footprint of the optimal designs corresponding to the low sparsity DNNs. However, activation memory cannot be ignored for high sparsity DNNs.

ContributorsSrivastava, Gaurav (Author) / Seo, Jae-Sun (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Berisha, Visar (Committee member) / Arizona State University (Publisher)

Created2018

Wide input common-mode range fully integrated low-dropout voltage regulators

Description

The modern era of consumer electronics is dominated by compact, portable, affordable smartphones and wearable computing devices. Power management integrated circuits (PMICs) play a crucial role in on-chip power management, extending battery life and efficiency of integrated analog, radio-frequency (RF), and mixed-signal cores. Low-dropout (LDO) regulators are commonly used to…

The modern era of consumer electronics is dominated by compact, portable, affordable smartphones and wearable computing devices. Power management integrated circuits (PMICs) play a crucial role in on-chip power management, extending battery life and efficiency of integrated analog, radio-frequency (RF), and mixed-signal cores. Low-dropout (LDO) regulators are commonly used to provide clean supply for low voltage integrated circuits, where point-of-load regulation is important. In System-On-Chip (SoC) applications, digital circuits can change their mode of operation regularly at a very high speed, imposing various load transient conditions for the regulator. These quick changes of load create a glitch in LDO output voltage, which hamper performance of the digital circuits unfavorably. For an LDO designer, minimizing output voltage variation and speeding up voltage glitch settling is an important task.

The presented research introduces two fully integrated LDO voltage regulators for SoC applications. N-type Metal-Oxide-Semiconductor (NMOS) power transistor based operation achieves high bandwidth owing to the source follower configuration of the regulation loop. A low input impedance and high output impedance error amplifier ensures wide regulation loop bandwidth and high gain. Current-reused dynamic biasing technique has been employed to increase slew-rate at the gate of power transistor during full-load variations, by a factor of two. Three design variations for a 1-1.8 V, 50 mA NMOS LDO voltage regulator have been implemented in a 180 nm Mixed-mode/RF process. The whole LDO core consumes 0.130 mA of nominal quiescent ground current at 50 mA load and occupies 0.21 mm x mm. LDO has a dropout voltage of 200 mV and is able to recover in 30 ns from a 65 mV of undershoot for 0-50 pF of on-chip load capacitance.

ContributorsDesai, Chirag (Author) / Kiaei, Sayfe (Thesis advisor) / Bakkaloglu, Bertan (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created2016

Mixed-mode adaptive ripple canceller for switching regulators

Description

State of art modern System-On-Chip architectures often require very low noise supplies without overhead on high efficiencies. Low noise supplies are especially important in noise sensitive analog blocks such as high precision Analog-to-Digital Converters, Phase Locked Loops etc., and analog signal processing blocks. Switching regulators, while providing high efficiency power…

State of art modern System-On-Chip architectures often require very low noise supplies without overhead on high efficiencies. Low noise supplies are especially important in noise sensitive analog blocks such as high precision Analog-to-Digital Converters, Phase Locked Loops etc., and analog signal processing blocks. Switching regulators, while providing high efficiency power conversion suffer from inherent ripple on their output. A typical solution for high efficiency low noise supply is to cascade switching regulators with Low Dropout linear regulators (LDO) which generate inherently quiet supplies. The switching frequencies of switching regulators keep scaling to higher values in order to reduce the sizes of the passive inductor and capacitors at the output of switching regulators. This poses a challenge for existing solutions of switching regulators followed by LDO since the Power Supply Rejection (PSR) of LDOs are band-limited. In order to achieve high PSR over a wideband, the penalty would be to increase the quiescent power consumed to increase the bandwidth of the LDO and increase in solution area of the LDO. Hence, an alternative to the existing approach is required which improves the ripple cancellation at the output of switching regulator while overcoming the deficiencies of the LDO.

This research focuses on developing an innovative technique to cancel the ripple at the output of switching regulator which is scalable across a wide range of switching frequencies. The proposed technique consists of a primary ripple canceller and an auxiliary ripple canceller, both of which facilitate in the generation of a quiet supply and help to attenuate the ripple at the output of buck converter by over 22dB. These techniques can be applied to any DC-DC converter and are scalable across frequency, load current, output voltage as compared to LDO without significant overhead on efficiency or area. The proposed technique also presents a fully integrated solution without the need of additional off-chip components which, considering the push for full-integration of Power Management Integrated Circuits, is a big advantage over using LDOs.

ContributorsJoshi, Kishan (Author) / Bakkaloglu, Bertan (Thesis advisor) / Garrity, Douglas (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created2016

The feasibility of domain specific compilation for spatially programmable architectures

Description

Integrated circuits must be energy efficient. This efficiency affects all aspects of chip design, from the battery life of embedded devices to thermal heating on high performance servers. As technology scaling slows, future generations of transistors will lack the energy efficiency gains as it has had in previous generations. Therefore,…

Integrated circuits must be energy efficient. This efficiency affects all aspects of chip design, from the battery life of embedded devices to thermal heating on high performance servers. As technology scaling slows, future generations of transistors will lack the energy efficiency gains as it has had in previous generations. Therefore, other sources of energy efficiency will be much more important. Many computations have the potential to be executed for extreme energy efficiency but are not instigated because the platforms they run on are not optimized for efficient execution. ASICs improve energy efficiency by reducing flexibility and leveraging the properties of a specific computation. However, ASICs are fixed in function and therefore have incredible opportunity cost. FPGAs offer a reconfigurable solution but are 25x less energy efficient than ASIC implementation. Spatially programmable architectures (SPAs) are similar in design and structure to ASICs and FPGAs but are able bridge the ASIC-FPGA energy efficiency gap by trading flexibility for efficiency. However, SPAs are difficult to program because they do not share the same programming model as normal architectures that execute in time. This work addresses compiler challenges for coarse grained, locally interconnected SPA for domain efficiency (SPADE). A novel SPADE topology, called the wave pipeline, is introduced that is designed for the image signal processing domain that is both efficient and simple to compile to. A compiler for the wave pipeline is created that solves for maximum energy and area efficiency using low complexity, greedy methods. The wave pipeline topology and compiler allow for us to investigate and experiment with image signal processing applications to prove the feasibility of SPADE compilers.

ContributorsMackay, Curtis (Author) / Brunhaver, John (Thesis advisor) / Karam, Lina J (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created2016

Emerging neural coincidences in rats agranular medial and agranular lateral cortices during learning of a directional choice task

Description

To uncover the neural correlates to go-directed behavior, single unit action potentials are considered fundamental computing units and have been examined by different analytical methodologies under a broad set of hypotheses. Using a behaving rat performing a directional choice learning task, we aim to study changes in rat's cortical neural…

To uncover the neural correlates to go-directed behavior, single unit action potentials are considered fundamental computing units and have been examined by different analytical methodologies under a broad set of hypotheses. Using a behaving rat performing a directional choice learning task, we aim to study changes in rat's cortical neural patterns while he improved his task performance accuracy from chance to 80% or higher. Specifically, simultaneous multi-channel single unit neural recordings from the rat's agranular medial (AGm) and Agranular lateral (AGl) cortices were analyzed using joint peristimulus time histogram (JPSTHs), which effectively unveils firing coincidences in neural action potentials. My results based on data from six rats revealed that coincidences of pair-wise neural action potentials are higher when rats were performing the task than they were not at the learning stage, and this trend abated after the rats learned the task. Another finding is that the coincidences at the learning stage are stronger than that when the rats learned the task especially when they were performing the task. Therefore, this coincidence measure is the highest when the rats were performing the task at the learning stage. This may suggest that neural coincidences play a role in the coordination and communication among populations of neurons engaged in a purposeful act. Additionally, attention and working memory may have contributed to the modulation of neural coincidences during the designed task.

ContributorsCheng, Bing (Author) / Si, Jennie (Thesis advisor) / Chae, Junseok (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created2014

Automated place and route methodologies for multi-project test chips

Description

This work describes the development of automated flows to generate pad rings, mixed signal power grids, and mega cells in a multi-project test chip. There were three major design flows that were created to create the test chip. The first was the pad ring which was used as the staring…

This work describes the development of automated flows to generate pad rings, mixed signal power grids, and mega cells in a multi-project test chip. There were three major design flows that were created to create the test chip. The first was the pad ring which was used as the staring block for creating the test chip. This flow put all of the signals for the chip in the order that was wanted along the outside of the die along with creation of the power ring that is used to supply the chip with a robust power source.

The second flow that was created was used to put together a flash block that is based off of a XILIX XCFXXP. This flow was somewhat similar to how the pad ring flow worked except that optimizations and a clock tree was added into the flow. There was a couple of design redoes due to timing and orientation constraints.

Finally, the last flow that was created was the top level flow which is where all of the components are combined together to create a finished test chip ready for fabrication. The main components that were used were the finished flash block, HERMES, test structures, and a clock instance along with the pad ring flow for the creation of the pad ring and power ring.

Also discussed is some work that was done on a previous multi-project test chip. The work that was done was the creation of power gaters that were used like switches to turn the power on and off for some flash modules. To control the power gaters the functionality change of some pad drivers was done so that they output a higher voltage than what is seen in the core of the chip.

ContributorsLieb, Christopher (Author) / Clark, Lawrence (Thesis advisor) / Holbert, Keith E. (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created2015

A low power digital controller for DC-DC converter applications with integrated PFM mode detector

Description

Switching Converters (SC) are an excellent choice for hand held devices due to their high power conversion efficiency. However, they suffer from two major drawbacks. The first drawback is that their dynamic response is sensitive to variations in inductor (L) and capacitor (C) values. A cost effective solution is implemented…

Switching Converters (SC) are an excellent choice for hand held devices due to their high power conversion efficiency. However, they suffer from two major drawbacks. The first drawback is that their dynamic response is sensitive to variations in inductor (L) and capacitor (C) values. A cost effective solution is implemented by designing a programmable digital controller. Despite variations in L and C values, the target dynamic response can be achieved by computing and programming the filter coefficients for a particular L and C. Besides, digital controllers have higher immunity to environmental changes such as temperature and aging of components. The second drawback of SCs is their poor efficiency during low load conditions if operated in Pulse Width Modulation (PWM) mode. However, if operated in Pulse Frequency Modulation (PFM) mode, better efficiency numbers can be achieved. A mostly-digital way of detecting PFM mode is implemented. Besides, a slow serial interface to program the chip, and a high speed serial interface to characterize mixed signal blocks as well as to ship data in or out for debug purposes are designed. The chip is taped out in 0.18µm IBM's radiation hardened CMOS process technology. A test board is built with the chip, external power FETs and driver IC. At the time of this writing, PWM operation, PFM detection, transitions between PWM and PFM, and both serial interfaces are validated on the test board.

ContributorsMumma Reddy, Abhiram (Author) / Bakkaloglu, Bertan (Thesis advisor) / Ogras, Umit Y. (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created2014

Filtering by