Matching Items (58)

Description
Over the past few decades, silicon complementary metal-oxide-semiconductor (CMOS) technology has been aggressively scaled down to achieve higher performance, higher density, and lower power consumption. As device dimensions approach their fundamental physical limits, there is an increasing demand for emerging devices whose operating principles differ from those of conventional CMOS. In recent years, many efforts have been devoted to the research of next-generation emerging non-volatile memory (eNVM) technologies, such as resistive random access memory (RRAM) and phase change memory (PCM), to replace conventional digital memories (e.g. SRAM) in implementing synapses in large-scale neuromorphic computing systems.

Being compact and essentially “analog”, these eNVM devices arranged in a crossbar array can compute vector-matrix multiplication in parallel, significantly speeding up machine/deep learning algorithms. However, non-ideal eNVM device and array properties may hamper the learning accuracy. To quantify their impact, the sparse coding algorithm was used as a starting point: strategies to remedy the accuracy loss were proposed, and the circuit-level design trade-offs were analyzed. At the architecture level, a parallel “pseudo-crossbar” array was presented to prevent the write disturbance issue. Peripheral circuits to support various parallel array architectures were also designed. One key component is the read circuit, which employs the principle of the integrate-and-fire neuron model to convert the analog column current to a digital output. However, this read circuit is not area-efficient, so it was proposed to replace it with a compact two-terminal oscillation neuron device that exhibits a metal-insulator transition.
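The parallel analog computation described here can be illustrated with a toy numerical model: each eNVM cell stores a conductance, input voltages drive the rows, and by Ohm's and Kirchhoff's laws each column current is one element of the vector-matrix product. A minimal sketch, assuming ideal linear devices and illustrative values:

```python
import numpy as np

# Toy model of an eNVM crossbar: weights are stored as cell conductances
# (siemens), inputs are applied as row voltages (volts), and each column
# wire sums the cell currents I = G * V (Kirchhoff's current law).
rows, cols = 4, 3
rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(rows, cols))   # conductance matrix
v_in = np.array([0.1, 0.2, 0.0, 0.3])            # read voltages per row

# All column currents appear in one "step" -- this is the parallel
# vector-matrix multiplication that speeds up the learning algorithms.
i_out = v_in @ G                                  # column currents (amps)
print(i_out)
```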

To facilitate this design exploration, a circuit-level macro simulator, “NeuroSim”, was developed in C++ to estimate the area, latency, energy, and leakage power of various neuromorphic architectures. NeuroSim provides a wide variety of design options at the circuit/device level, and it can be used alone or as a supporting module that provides circuit-level performance estimation for neural network algorithms. A 2-layer multilayer perceptron (MLP) simulator integrated with NeuroSim was demonstrated to evaluate both learning accuracy and circuit-level performance metrics for online learning and offline classification, as well as to study the impact of eNVM reliability issues, such as data retention and write endurance, on learning performance.
Contributors: Chen, Pai-Yu (Author) / Yu, Shimeng (Thesis advisor) / Cao, Yu (Committee member) / Seo, Jae-Sun (Committee member) / Chakrabarti, Chaitali (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
The aging mechanism in devices is prone to uncertainties due to dynamic stress conditions. In AMS circuits, these can lead to momentary fluctuations in circuit voltage that may be missed by a compact model and hence cause unpredictable failure. Firstly, multiple aging effects in devices may have underlying correlations: the generation of new traps during TDDB may significantly accelerate BTI, since these traps are close to the dielectric-Si interface in scaled technologies. Secondly, prevalent reliability analysis lacks a direct validation of the lifetime of devices and circuits. The BTI aging mechanism causes gradual degradation of the device, shifting the threshold voltage and increasing the failure rate. In 28nm HKMG technology, the contribution of BTI to NMOS degradation has become significant at high temperature compared to channel hot carrier (CHC) effects. This requires revising the end-of-lifetime (EOL) calculation based on the contributions of individual aging effects, especially in feedback loops. Conventionally, aging in devices is extrapolated from short-term measurements, but this practice results in unreliable prediction of EOL caused by variability in initial parameters and stress conditions. To mitigate the extrapolation issues and improve predictability, this work provides a new approach to test a device to EOL in a fast and controllable manner. The contributions of this thesis include: (1) based on the stochastic trapping/de-trapping mechanism, new compact BTI models are developed and verified with 14nm FinFET and 28nm HKMG data; these models are implemented in circuit simulation, illustrating a significant increase in failure rate due to accelerated BTI; (2) a model to predict accelerated aging under special conditions such as feedback loops and stacked inverters; (3) a feedback-loop-based test methodology called Adaptive Accelerated Aging (AAA) that can generate accurate aging data until EOL; (4) simulation and experimental data for the models, and test setups for multiple stress conditions, including those achieving EOL in one hour for a device as well as for a ring oscillator (RO) circuit, for validation of the proposed methodology; and (5) scaling of these models to find a guard band for VLSI circuit design that captures realistic aging impact.
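The extrapolation problem this thesis targets is easy to see with the commonly used power-law form of BTI degradation, ΔVth = A·t^n (A and n fitted from stress data, with n typically around 0.1–0.2): a small error in the fitted exponent moves the projected EOL by orders of magnitude. A sketch with assumed parameters, not measured 28nm data:

```python
# Power-law BTI model: delta_Vth(t) = A * t**n, with A and n fitted from
# stress measurements. The values below are assumptions for illustration.
A, n = 2.0e-3, 0.15            # V / s**n, dimensionless exponent
VTH_FAIL = 0.05                # Vth shift (V) defining end of life (EOL)
YEAR = 3.15e7                  # seconds per year

def delta_vth(t_seconds, a=A, exp=n):
    return a * t_seconds ** exp

print(f"shift after 10 years: {1e3 * delta_vth(10 * YEAR):.1f} mV")

# Extrapolated EOL solves A * t**n = VTH_FAIL for t. A small error in the
# fitted exponent swings the projection by orders of magnitude -- the
# motivation for measuring devices to EOL (the AAA approach) instead.
for exp in (0.13, 0.15, 0.17):
    t_eol = (VTH_FAIL / A) ** (1.0 / exp)
    print(f"n = {exp:.2f} -> projected EOL = {t_eol / YEAR:9.2g} years")
```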
Contributors: Patra, Devyani (Author) / Cao, Yu (Thesis advisor) / Barnaby, Hugh (Thesis advisor) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created: 2017
Description
Artificial Neural Networks (ANNs) have become a forerunner in the field of Artificial Intelligence. Innovations in ANNs have led to groundbreaking technological advances such as self-driving vehicles, medical diagnosis, speech processing, personal assistants, and many more. These were inspired by the evolution and working of our brains. Similar to how our brain evolved using a combination of epigenetics and live stimuli, ANNs require training to learn patterns. The training usually requires a lot of computation and memory accesses. To realize these systems in real embedded hardware, many energy/power/performance issues need to be solved. The purpose of this research is to study the data movement requirements of generic neural networks, along with the energy associated with them, and to suggest ways to improve the design. Many methods have suggested ways to optimize using a mix of computation and data movement solutions without affecting task accuracy, but these methods lack a computation model to calculate the energy and depend on mere back-of-the-envelope calculations. We realized that there is a need for a generic quantitative analysis of memory access energy that helps in better architectural exploration. We show that present architectural tools are either incompatible or too slow, and that a better analytical method is needed to estimate data movement energy. We also propose a simplistic yet effective approach that is robust and expandable by users to support various systems.
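As an example of the kind of quantitative analysis being argued for, a first-order energy model for one convolutional layer counts MACs and memory accesses at each level of the hierarchy and weights them with per-access energy costs. The constants below are rough order-of-magnitude placeholders, not the thesis's calibrated numbers:

```python
# Back-of-the-envelope energy model for one conv layer: count operations and
# memory accesses, then weight by per-access energy. Constants are rough
# placeholder values (order of magnitude only), not measured numbers.
E_MAC  = 1.0      # pJ per multiply-accumulate
E_SRAM = 5.0      # pJ per 16-bit on-chip SRAM access
E_DRAM = 640.0    # pJ per 16-bit off-chip DRAM access

def conv_layer_energy_uJ(H, W, C_in, C_out, K, reuse=8):
    """reuse = average MACs served per SRAM fetch (dataflow-dependent)."""
    macs = H * W * C_out * C_in * K * K
    sram_accesses = macs / reuse                        # filtered by reuse
    dram_accesses = H * W * C_in + K * K * C_in * C_out + H * W * C_out
    return (macs * E_MAC + sram_accesses * E_SRAM + dram_accesses * E_DRAM) / 1e6

# Example: a VGG-like 3x3 layer. Increasing on-chip data reuse directly cuts
# SRAM energy, showing why data movement, not arithmetic, often dominates.
for r in (4, 8, 16):
    print(f"reuse={r:2d}: {conv_layer_energy_uJ(56, 56, 128, 128, 3, reuse=r):.0f} uJ")
```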
Contributors: Chowdary, Hidayatullah (Author) / Cao, Yu (Thesis advisor) / Seo, Jae-Sun (Committee member) / Chakrabarti, Chaitali (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
Semiconductor memory is a key component of computing systems. Beyond conventional memory and data storage applications, this dissertation explores both mainstream and eNVM memory technologies for radiation environments, hardware security systems, and machine learning applications.

In radiation environments, e.g. aerospace, memory devices face various energetic particles. The strike of these energetic particles can generate electron-hole pairs (directly or indirectly) as they pass through the semiconductor device, resulting in photo-induced current that may change the memory state. First, the trend of radiation effects in mainstream memory technologies with technology node scaling is reviewed. Then, single event effects in oxide-based resistive random access memory (RRAM), one of the eNVM technologies, are investigated from the circuit level to the system level.

Physical Unclonable Functions (PUFs) have been widely investigated as a promising hardware security primitive, employing the inherent randomness in a physical system (e.g. the intrinsic semiconductor manufacturing variability). In this dissertation, two RRAM-based PUF implementations are proposed, for cryptographic key generation (weak PUF) and device authentication (strong PUF), respectively. The performance of the RRAM PUFs is evaluated with experiments and simulation. The impact of non-ideal circuit effects on the performance of the PUFs is also investigated, and optimization strategies are proposed to mitigate these effects. In addition, the resistance against modeling and machine learning attacks is analyzed.
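The weak-PUF concept can be sketched in a few lines: pair nominally identical RRAM cells and let intrinsic resistance variability decide each key bit, then measure uniqueness as the inter-device Hamming distance (ideally near 50%). This toy model simulates the variability with a lognormal spread, an assumption chosen only for illustration:

```python
import numpy as np

def rram_puf_key(n_bits, rng):
    """Toy weak PUF: each key bit compares two nominally identical RRAM
    cells; manufacturing variability (lognormal spread, assumed here)
    decides the outcome."""
    r_a = rng.lognormal(mean=np.log(10e3), sigma=0.3, size=n_bits)
    r_b = rng.lognormal(mean=np.log(10e3), sigma=0.3, size=n_bits)
    return (r_a > r_b).astype(int)

rng = np.random.default_rng(42)
key1 = rram_puf_key(128, rng)   # "chip 1"
key2 = rram_puf_key(128, rng)   # "chip 2"

# Uniqueness: inter-chip Hamming distance should approach 50% of key length.
hd = np.count_nonzero(key1 != key2)
print(f"inter-chip Hamming distance: {hd}/128 ({100 * hd / 128:.0f}%)")
```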

Deep neural networks (DNNs) have shown remarkable improvements in various intelligent applications such as image classification, speech classification, and object localization and detection, and increasing efforts have been devoted to developing hardware accelerators. In this dissertation, two compute-in-memory (CIM) hardware accelerator designs, based on SRAM and eNVM technologies, are proposed for two binary neural networks, the hybrid BNN (HBNN) and the XNOR-BNN, respectively, targeting hardware-resource-limited platforms such as edge devices. These designs feature high throughput, scalability, low latency, and high energy efficiency. Finally, we have successfully taped out and validated the proposed designs with SRAM technology in TSMC 65 nm.
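The arithmetic that makes XNOR-BNNs a natural fit for compute-in-memory reduces each dot product to an XNOR plus a population count: with weights and activations in {-1, +1} encoded as bits, dot(x, w) = 2·popcount(xnor(x, w)) − n. A small self-checking sketch:

```python
# Binary dot product as used in XNOR-BNNs: with activations and weights in
# {-1, +1} encoded as {0, 1}, dot(x, w) == 2 * popcount(~(x ^ w)) - n.
# In a CIM array the XNOR happens inside each bitcell and the popcount is
# the summation along a column; here it is emulated with integer bit tricks.
def binary_dot(x_bits: int, w_bits: int, n: int) -> int:
    xnor = ~(x_bits ^ w_bits) & ((1 << n) - 1)   # 1 where signs agree
    return 2 * bin(xnor).count("1") - n

# Check against the arithmetic definition on +-1 vectors.
x = [+1, -1, -1, +1, +1, -1, +1, +1]
w = [+1, +1, -1, -1, +1, -1, -1, +1]
encode = lambda v: sum(1 << i for i, s in enumerate(v) if s > 0)
assert binary_dot(encode(x), encode(w), 8) == sum(a * b for a, b in zip(x, w))
print(binary_dot(encode(x), encode(w), 8))   # -> 2
```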

Overall, this dissertation paves the path for new applications of memory technologies toward secure and energy-efficient artificial intelligence systems.
Contributors: Liu, Rui (Author) / Yu, Shimeng (Thesis advisor, Committee member) / Cao, Yu (Committee member) / Barnaby, Hugh (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
The rapid improvement in computation capability has made deep convolutional neural networks (CNNs) a great success in recent years on many computer vision tasks, with significantly improved accuracy. During the inference phase, many applications demand low-latency processing of a single image under strict power consumption requirements, which reduces the efficiency of GPUs and other general-purpose platforms and brings opportunities for specific acceleration hardware, e.g. FPGAs, where the digital circuit is customized for the deep learning algorithm inference. However, deploying CNNs on portable and embedded systems is still challenging due to the large data volume, intensive computation, varying algorithm structures, and frequent memory accesses. This dissertation proposes a complete design methodology and framework to accelerate the inference process of various CNN algorithms on FPGA hardware with high performance, efficiency, and flexibility.

As convolution contributes most of the operations in CNNs, the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply-and-accumulate (MAC) operations with four levels of loops. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit data reuse or manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g. memory access) of the CNN accelerator based on multiple design variables. An efficient dataflow and hardware architecture for CNN acceleration are proposed that minimize data communication while maximizing resource utilization to achieve high performance.
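For reference, the four loop levels in question are the kernel window, the input feature maps, the positions within a feature map, and the output feature maps. A naive software rendering of the nest is below; hardware dataflows correspond to choices of ordering, tiling, and unrolling of exactly these loops:

```python
import numpy as np

def conv_four_loops(inp, weights):
    """Naive convolution showing the four loop levels: output feature maps
    (no), spatial positions (y, x), input feature maps (ni), and the KxK
    kernel window (ky, kx)."""
    C_in, H, W = inp.shape
    C_out, _, K, _ = weights.shape
    out = np.zeros((C_out, H - K + 1, W - K + 1))
    for no in range(C_out):                  # Loop 4: output feature maps
        for y in range(H - K + 1):           # Loop 3: positions in the map
            for x in range(W - K + 1):
                for ni in range(C_in):       # Loop 2: input feature maps
                    for ky in range(K):      # Loop 1: kernel window
                        for kx in range(K):
                            out[no, y, x] += (inp[ni, y + ky, x + kx]
                                              * weights[no, ni, ky, kx])
    return out

rng = np.random.default_rng(1)
a, w = rng.standard_normal((3, 8, 8)), rng.standard_normal((4, 3, 3, 3))
print(conv_four_loops(a, w).shape)   # -> (4, 6, 6)
```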

Although great performance and efficiency can be achieved by customizing the FPGA hardware for each CNN model, significant effort and expertise are required, leading to long development times that make it difficult to keep up with the rapid development of CNN algorithms. In this work, we present an RTL-level CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, enabling fast high-level prototyping of CNNs from software to FPGA while keeping the benefits of low-level hardware optimization. First, a general-purpose library of RTL modules is developed to model different operations at each layer. The integration and dataflow of the physical modules are predefined in a top-level system template and reconfigured during compilation for a given CNN algorithm. The runtime control of layer-by-layer sequential computation is managed by the proposed execution schedule, so that even highly irregular and complex network topologies, e.g. GoogLeNet and ResNet, can be compiled. The proposed methodology is demonstrated with various CNN algorithms, e.g. NiN, VGG, GoogLeNet, and ResNet, on two different standalone FPGAs, achieving state-of-the-art performance.

Based on the optimized acceleration strategy, there are still many design options, e.g. the degree and dimension of computation parallelism, the size of on-chip buffers, and the external memory bandwidth, which impact the utilization of computation resources and the efficiency of data communication, and ultimately the performance and energy consumption of the accelerator. The large design space makes it impractical to explore the optimal design choice during the real implementation phase. Therefore, a performance model is proposed in this work to quantitatively estimate accelerator performance and resource utilization. By this means, the performance bottleneck and design bound can be identified, and the optimal design option can be explored early in the design phase.
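A minimal version of such a model treats each layer as either compute-bound or memory-bound, in the spirit of a roofline analysis. The dissertation's model is far more detailed; this sketch, with assumed design parameters, only conveys how a design bound is identified before implementation:

```python
# Minimal roofline-style estimate: a layer's latency is bounded either by
# the arithmetic throughput of the parallel MAC array or by external memory
# bandwidth. Parameters are illustrative design knobs, not measured data.
def layer_latency_ms(macs, bytes_moved, n_macs=1024, freq_hz=200e6, bw_Bps=6.4e9):
    t_compute = macs / (n_macs * freq_hz)   # all MAC units busy every cycle
    t_memory = bytes_moved / bw_Bps         # streaming from external memory
    bound = "compute" if t_compute >= t_memory else "memory"
    return 1e3 * max(t_compute, t_memory), bound

# A large conv layer (compute-heavy) vs. a fully-connected layer (memory-heavy):
print(layer_latency_ms(macs=462e6, bytes_moved=2e6))   # ~2.3 ms, compute-bound
print(layer_latency_ms(macs=4e6, bytes_moved=8e6))     # ~1.3 ms, memory-bound
```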
Contributors: Ma, Yufei (Author) / Vrudhula, Sarma (Thesis advisor) / Seo, Jae-Sun (Thesis advisor) / Cao, Yu (Committee member) / Barnaby, Hugh (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
Resistive Random Access Memory (ReRAM) is an emerging non-volatile memory technology with attractive attributes, including excellent scalability (< 10 nm), low programming voltage (< 3 V), fast switching speed (< 10 ns), high OFF/ON ratio (> 10), good endurance (up to 10^12 cycles), and great compatibility with silicon CMOS technology [1]. However, ReRAM suffers from larger write latency and energy, as well as reliability issues, compared to Dynamic Random Access Memory (DRAM). To improve the energy efficiency, latency, and reliability of ReRAM storage systems, a low-cost cross-layer approach that spans the device, circuit, architecture, and system levels is proposed.

For the 1T1R 2D ReRAM system, the effect of both retention and endurance errors on ReRAM reliability is considered. The proposed approach is to design circuit-level and architecture-level techniques that significantly reduce the raw bit error rate, and then to employ low-cost error control coding (ECC) to achieve the desired lifetime.
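The interplay between raw bit error rate (BER) and ECC strength can be made concrete: for a code correcting up to t errors per n-bit block, the uncorrectable-block probability is a binomial tail, so circuit/architecture techniques that lower the raw BER let a much cheaper code reach the same lifetime target. A sketch with illustrative numbers only:

```python
from math import comb

def p_uncorrectable(n_bits, t_correct, raw_ber):
    """Probability that an n-bit ECC block has more than t errors
    (binomial tail, assuming independent bit errors)."""
    p_ok = sum(comb(n_bits, k) * raw_ber**k * (1 - raw_ber)**(n_bits - k)
               for k in range(t_correct + 1))
    return 1.0 - p_ok

# For a 72-bit block: lowering raw BER 10x (via circuit/architecture
# techniques) lets a weaker, cheaper code hit the same failure rate
# that a stronger t=3 code needs at the higher BER.
for ber in (1e-3, 1e-4):
    print(f"raw BER={ber:g}: t=1 -> {p_uncorrectable(72, 1, ber):.2e}, "
          f"t=3 -> {p_uncorrectable(72, 3, ber):.2e}")
```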

For the 1S1R 2D ReRAM system, a cross-point array with “multi-bit per access” per subarray is designed for high energy efficiency and good reliability. The errors due to cell-level as well as array-level variations are analyzed, and a low-cost scheme that maintains reliability and latency with low energy consumption is proposed.

For the 1S1R 3D ReRAM system, access schemes that activate multiple subarrays, with multiple layers per subarray, are used to achieve high energy efficiency by activating fewer subarrays, while good reliability is achieved through innovative data organization.

Finally, a novel ReRAM-based accelerator design is proposed to support multiple Convolutional Neural Network (CNN) topologies, including VGGNet, AlexNet, and ResNet. The multi-tiled architecture consists of 9 processing elements per tile, where each tile implements the dot product operation using ReRAM as the computation unit. The processing elements operate in a systolic fashion, thereby maximizing input feature map reuse and minimizing interconnection cost. System-level evaluation on several network benchmarks shows that the proposed architecture can improve computation efficiency and energy efficiency compared to a state-of-the-art ReRAM-based accelerator.
Contributors: Mao, Manqing (Author) / Chakrabarti, Chaitali (Thesis advisor) / Yu, Shimeng (Committee member) / Cao, Yu (Committee member) / Ogras, Umit (Committee member) / Arizona State University (Publisher)
Created: 2019
Description
Deep learning (DL) has proved itself to be one of the most important developments to date, with far-reaching impacts in numerous fields such as robotics, computer vision, surveillance, speech processing, machine translation, and finance. Deep learning algorithms are now widely used for countless applications because of their ability to generalize real-world data, their robustness to noise in previously unseen data, and their high inference accuracy. With the ability to learn useful features from raw sensor data, they have outperformed traditional AI algorithms and pushed the boundaries of what can be achieved with AI. In this work, we demonstrate the power of deep learning by developing a neural network to automatically detect cough instances from audio recorded in unconstrained environments. For this, 24-hour-long recordings from 9 different patients were collected and carefully labeled by medical personnel. A pre-processing algorithm is proposed to convert the event-based cough dataset into a more informative dataset with start and end times of coughs, and data augmentation is introduced to regularize the training procedure. The proposed neural network achieves 92.3% leave-one-out accuracy on data captured in the real world.
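A sketch of what such pre-processing might look like: event-level labels become fixed-length windows with explicit boundaries, and noise injection provides augmentation. The function names, window sizes, and label format here are hypothetical stand-ins, not the thesis's actual pipeline:

```python
import numpy as np

def events_to_frames(n_samples, events, sr=16000, win=0.4, hop=0.1):
    """Hypothetical sketch: convert (start_sec, end_sec) cough events into
    fixed-length labeled windows, giving the network explicit boundaries."""
    win_n, hop_n = int(win * sr), int(hop * sr)
    frames = []
    for s in range(0, n_samples - win_n, hop_n):
        e = s + win_n
        # Label 1 if the window overlaps any annotated cough event.
        label = int(any(s / sr < ev_end and e / sr > ev_start
                        for ev_start, ev_end in events))
        frames.append((s, e, label))
    return frames

def augment(x, rng, noise_db=-20):
    """Simple augmentation for regularization: add scaled white noise."""
    noise = rng.standard_normal(len(x)) * (10 ** (noise_db / 20)) * np.std(x)
    return x + noise

rng = np.random.default_rng(0)
audio = rng.standard_normal(16000 * 3)                 # 3 s of fake audio
noisy = augment(audio, rng)                            # augmented copy
print(events_to_frames(len(audio), [(1.0, 1.5)])[:3])  # first few windows
```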

Deep neural networks are composed of multiple layers that are compute/memory intensive, which makes it difficult to execute these algorithms in real time with low power consumption on existing general-purpose computers. In this work, we propose hardware accelerators for a traditional AI algorithm based on random forests and for two representative deep convolutional neural networks (AlexNet and VGG). With the proposed acceleration techniques, ~30x performance improvement was achieved over a CPU for random forests. For deep CNNs, we demonstrate that much higher performance can be achieved through architecture space exploration, using optimization algorithms that take system-level performance and area models of hardware primitives as inputs, with the goal of minimizing latency under given resource constraints. With this method, ~30 GOPS performance was achieved on Stratix V FPGA boards.

Hardware acceleration of DL algorithms alone is not always the most efficient approach, nor is it always sufficient to achieve the desired performance. There is huge headroom for performance improvement when algorithms are designed with hardware limitations and bottlenecks in mind. This work achieves hardware-software co-optimization of the Non-Maximal Suppression (NMS) algorithm through the proposed algorithmic changes and hardware architecture.

With CMOS scaling coming to an end and memory bandwidth bottlenecks increasing, CMOS-based systems might not scale enough to accommodate the requirements of more complicated and deeper neural networks in the future. In this work, we explore RRAM crossbars and arrays as a compact, high-performing, and energy-efficient alternative to CMOS accelerators for deep learning training and inference. We propose and implement RRAM periphery read and write circuits and achieve ~3000x performance improvement in online dictionary learning compared to a CPU.

This work also examines realistic RRAM devices and their non-idealities. We perform an in-depth study of the effects of RRAM non-idealities on inference accuracy when a pretrained model is mapped to RRAM-based accelerators. To mitigate this issue, we propose Random Sparse Adaptation (RSA), a novel scheme aimed at tuning the model to compensate for the faults of the RRAM array onto which it is mapped. Our proposed method achieves inference accuracy much higher than the traditional Read-Verify-Write (R-V-W) method, and RSA can also recover lost inference accuracy 100x to 1000x faster than R-V-W. Using 32-bit high-precision RSA cells, we achieved ~10% higher accuracy with faulty RRAM arrays than can be achieved by mapping a deep network to a 32-level RRAM array with no variations.
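The structure of an RSA-style fix can be shown in miniature: map weights onto cells with variation, then tune only a small random subset backed by accurate cells to cancel part of the output error. This toy uses a least-squares proxy on one linear layer; the actual RSA retrains against the network loss, which recovers far more accuracy than this stand-in suggests:

```python
import numpy as np

rng = np.random.default_rng(7)
W = rng.standard_normal((32, 32))          # trained weight matrix
x = rng.standard_normal((32, 8))           # small calibration batch
y_ref = W @ x                              # fault-free layer outputs

# Mapping onto RRAM perturbs every cell (toy lognormal variation, assumed).
W_faulty = W * rng.lognormal(0.0, 0.3, W.shape)

# RSA-style fix (illustrative): a small random subset of positions is backed
# by accurate cells; tune only those entries to reduce the output error.
mask = rng.random(W.shape) < 0.10
W_rsa = W_faulty.copy()
for i in range(W.shape[0]):                    # tune each output row
    cols = np.flatnonzero(mask[i])
    if cols.size == 0:
        continue
    resid = y_ref[i] - W_faulty[i] @ x         # error of the faulty row
    delta, *_ = np.linalg.lstsq(x[cols].T, resid, rcond=None)
    W_rsa[i, cols] += delta

err = lambda M: np.linalg.norm(M @ x - y_ref) / np.linalg.norm(y_ref)
print(f"faulty: {err(W_faulty):.3f}  after sparse tuning: {err(W_rsa):.3f}")
```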
Contributors: Mohanty, Abinash (Author) / Cao, Yu (Thesis advisor) / Seo, Jae-Sun (Committee member) / Vrudhula, Sarma (Committee member) / Chakrabarti, Chaitali (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
Over the decades, scientists have scaled devices to ever smaller feature sizes for ever better performance of complementary metal-oxide-semiconductor (CMOS) technology, to meet the requirements on speed, complexity, circuit density, power consumption, and ultimately cost imposed by many advanced applications. However, these ultra-scaled CMOS devices also bring drawbacks. Aging due to bias temperature instability (BTI) and hot carrier injection (HCI) is the dominant cause of functional failure in large-scale logic circuits. The aging phenomena, on top of process variations, translate into complexity and reduced design margin for circuits. Such issues call for “design for reliability”. To increase overall design efficiency, it is important to (i) study the impact of aging at the circuit level along with the transistor-level understanding, (ii) calibrate the theoretical findings with measurement data, and (iii) implement tools that analyze the impact of BTI and HCI reliability on circuit timing at each stage of the VLSI design process. In this work, post-silicon measurements of a 28nm HK-MG technology are performed to study the effect of aging on the frequency degradation of digital circuits. A novel voltage-controlled ring oscillator (VCO) structure, developed by the NIMO research group, is used to determine the effect of aging mechanisms such as NBTI, PBTI, and SILC on circuit parameters. An accelerated aging mechanism is proposed to avoid the time-consuming measurement process and the extrapolation of data to end of life; instead of predicting the circuit behavior, one can measure it within a short period of time. Finally, to bridge the gap between device-level models and circuit-level aging analysis, a System Level Reliability Analysis Flow (SyRA), developed by the NIMO group, is implemented for a TSMC 65nm industrial-level design to achieve one-step reliability prediction for digital design.
Contributors: Bansal, Ankita (Author) / Cao, Yu (Thesis advisor) / Seo, Jae-Sun (Committee member) / Barnaby, Hugh (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
Vision is the ability to see and interpret any visual stimulus. It is one of the most fundamental and complex tasks the brain performs, and its complexity can be understood from the fact that close to 50% of the human brain is dedicated to vision. The brain receives an overwhelming amount of sensory information from the retina – estimated at up to 100 Mbps per optic nerve. Parallel processing of the entire visual field in real time is likely impossible for even the most sophisticated brains due to the high computational complexity of the task [1]. Yet, organisms can efficiently process this information to parse complex scenes in real time. This amazing feat of nature relies on selective attention, which allows the brain to filter sensory information and select only a small subset of it for further processing.

Today, computer vision has become ubiquitous in our society, with applications in image understanding, medicine, drones, self-driving cars, and many more. With the advent of GPUs and the availability of huge datasets like ImageNet, Convolutional Neural Networks (CNNs) have come to play a very important role in solving computer vision tasks, e.g. object detection. However, the size of the networks becomes prohibitive when higher accuracies are needed, which in turn demands more hardware. This hinders the application of CNNs to mobile platforms and stops them from hitting the real-time mark. The computational efficiency of a computer vision task, like object detection, can be enhanced by adopting a selective attention mechanism in the algorithm. In this work, this idea is explored by using the Visual Proto Object Saliency algorithm [1] to crop out the areas of an image without relevant objects before a computationally intensive network like the Faster R-CNN [2] processes it.
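In outline, the attention mechanism gates the detector as follows; `attention_crop` and its parameters are hypothetical stand-ins for the proto-object saliency front end:

```python
import numpy as np

def attention_crop(image, saliency, thresh=0.5, pad=16):
    """Hypothetical sketch of the pipeline: keep only the bounding region of
    salient pixels (plus padding) before running the expensive detector."""
    ys, xs = np.nonzero(saliency > thresh * saliency.max())
    if ys.size == 0:
        return image                               # nothing salient: keep all
    y0, y1 = max(ys.min() - pad, 0), min(ys.max() + pad, image.shape[0])
    x0, x1 = max(xs.min() - pad, 0), min(xs.max() + pad, image.shape[1])
    return image[y0:y1, x0:x1]

# With saliency concentrated in one region, the detector sees a far smaller
# image, cutting the CNN's computation roughly with the cropped area.
img = np.zeros((480, 640, 3))
sal = np.zeros((480, 640)); sal[100:200, 300:420] = 1.0
print(img.shape, "->", attention_crop(img, sal).shape)
```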
Contributors: Gorthy, Sai Rama Srivatsava (Author) / Cao, Yu (Thesis advisor) / Seo, Jae-Sun (Committee member) / Vrudhula, Sarma (Committee member) / Arizona State University (Publisher)
Created: 2017
Description
Machine learning technology has made many incredible achievements in recent years, rivalling or exceeding human performance on many intellectual tasks, including image recognition, face detection, and the game of Go. Many machine learning algorithms require huge amounts of computation, such as the multiplication of large matrices. As silicon technology has scaled into the sub-14nm regime, simply scaling down the device can no longer provide enough speed-up. New device technologies and system architectures are needed to improve computing capacity. Designing hardware specific to machine learning is in high demand, and efforts must be made on the joint design and optimization of both hardware and algorithms.

For machine learning acceleration, traditional SRAM- and DRAM-based systems suffer from low capacity, high latency, and high standby power. Instead, emerging memories, such as Phase Change Random Access Memory (PRAM), Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM), and Resistive Random Access Memory (RRAM), are promising candidates providing low standby power, high data density, fast access, and excellent scalability. This dissertation proposes a hierarchical memory modeling framework and models PRAM and STT-MRAM at four different levels of abstraction. With the proposed models, various simulations are conducted to investigate performance, optimization, variability, reliability, and scalability.

Emerging memory devices such as RRAM can work as a 2-D crosspoint array to speed up the multiplication and accumulation in machine learning algorithms. This dissertation proposes a new parallel programming scheme to achieve in-memory learning with an RRAM crosspoint array. The programming circuitry is designed and simulated in TSMC 65nm technology, showing a 900x speedup for the dictionary learning task compared to the CPU performance.
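The source of that parallelism is the outer-product structure of many learning updates: ΔW = η·x·eᵀ can be applied to the whole array at once by pulsing rows according to x and columns according to e, rather than writing cells one at a time. A toy model with idealized linear cells (real devices need carefully shaped pulse schemes to approximate the product):

```python
import numpy as np

# Many learning rules update weights with an outer product: dW = lr * x e^T.
# On a crosspoint array this maps to one parallel step: rows are pulsed
# according to x, columns according to e, and every cell (i, j) integrates
# the coincidence x[i] * e[j] -- no cell-by-cell write sequence is needed.
rng = np.random.default_rng(3)
G = rng.uniform(0.4, 0.6, size=(64, 64))   # normalized cell conductances
x = rng.standard_normal(64)                 # input driven on rows
e = rng.standard_normal(64)                 # error driven on columns
lr = 0.01

G += lr * np.outer(x, e)                    # one array-wide update step
print(f"{G.size} weight updates applied in a single parallel step")
```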

From the algorithm perspective, inspired by the high accuracy and low power of the brain, this dissertation proposes a bio-plausible feedforward-inhibition spiking neural network with a Spike-Rate-Dependent Plasticity (SRDP) learning rule. It achieves more than 95% accuracy on the MNIST dataset, which is comparable to the sparse coding algorithm but requires far fewer computations. The role of inhibition in this network is systematically studied and shown to improve the hardware efficiency of learning.
Contributors: Xu, Zihan (Author) / Cao, Yu (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Seo, Jae-Sun (Committee member) / Yu, Shimeng (Committee member) / Arizona State University (Publisher)
Created: 2017