Search Content

Semiconductor Memory Applications in Radiation Environment, Hardware Security and Machine Learning System

Description

Semiconductor memory is a key component of the computing systems. Beyond the conventional memory and data storage applications, in this dissertation, both mainstream and eNVM memory technologies are explored for radiation environment, hardware security system and machine learning applications.

In the radiation environment, e.g. aerospace, the memory devices face different…

Semiconductor memory is a key component of the computing systems. Beyond the conventional memory and data storage applications, in this dissertation, both mainstream and eNVM memory technologies are explored for radiation environment, hardware security system and machine learning applications.

In the radiation environment, e.g. aerospace, the memory devices face different energetic particles. The strike of these energetic particles can generate electron-hole pairs (directly or indirectly) as they pass through the semiconductor device, resulting in photo-induced current, and may change the memory state. First, the trend of radiation effects of the mainstream memory technologies with technology node scaling is reviewed. Then, single event effects of the oxide based resistive switching random memory (RRAM), one of eNVM technologies, is investigated from the circuit-level to the system level.

Physical Unclonable Function (PUF) has been widely investigated as a promising hardware security primitive, which employs the inherent randomness in a physical system (e.g. the intrinsic semiconductor manufacturing variability). In the dissertation, two RRAM-based PUF implementations are proposed for cryptographic key generation (weak PUF) and device authentication (strong PUF), respectively. The performance of the RRAM PUFs are evaluated with experiment and simulation. The impact of non-ideal circuit effects on the performance of the PUFs is also investigated and optimization strategies are proposed to solve the non-ideal effects. Besides, the security resistance against modeling and machine learning attacks is analyzed as well.

Deep neural networks (DNNs) have shown remarkable improvements in various intelligent applications such as image classification, speech classification and object localization and detection. Increasing efforts have been devoted to develop hardware accelerators. In this dissertation, two types of compute-in-memory (CIM) based hardware accelerator designs with SRAM and eNVM technologies are proposed for two binary neural networks, i.e. hybrid BNN (HBNN) and XNOR-BNN, respectively, which are explored for the hardware resource-limited platforms, e.g. edge devices.. These designs feature with high the throughput, scalability, low latency and high energy efficiency. Finally, we have successfully taped-out and validated the proposed designs with SRAM technology in TSMC 65 nm.

Overall, this dissertation paves the paths for memory technologies’ new applications towards the secure and energy-efficient artificial intelligence system.

ContributorsLiu, Rui (Author) / Yu, Shimeng (Thesis advisor, Committee member) / Cao, Yu (Committee member) / Barnaby, Hugh (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created2018

On-chip learning and inference acceleration of sparse representations

Description

The past decade has seen a tremendous surge in running machine learning (ML) functions on mobile devices, from mere novelty applications to now indispensable features for the next generation of devices.

While the mobile platform capabilities range widely, long battery life and reliability are common design concerns that are crucial to…

The past decade has seen a tremendous surge in running machine learning (ML) functions on mobile devices, from mere novelty applications to now indispensable features for the next generation of devices.

While the mobile platform capabilities range widely, long battery life and reliability are common design concerns that are crucial to remain competitive.

Consequently, state-of-the-art mobile platforms have become highly heterogeneous by combining a powerful CPUs with GPUs to accelerate the computation of deep neural networks (DNNs), which are the most common structures to perform ML operations.

But traditional von Neumann architectures are not optimized for the high memory bandwidth and massively parallel computation demands required by DNNs.

Hence, propelling research into non-von Neumann architectures to support the demands of DNNs.

The re-imagining of computer architectures to perform efficient DNN computations requires focusing on the prohibitive demands presented by DNNs and alleviating them. The two central challenges for efficient computation are (1) large memory storage and movement due to weights of the DNN and (2) massively parallel multiplications to compute the DNN output.

Introducing sparsity into the DNNs, where certain percentage of either the weights or the outputs of the DNN are zero, greatly helps with both challenges. This along with algorithm-hardware co-design to compress the DNNs is demonstrated to provide efficient solutions to greatly reduce the power consumption of hardware that compute DNNs. Additionally, exploring emerging technologies such as non-volatile memories and 3-D stacking of silicon in conjunction with algorithm-hardware co-design architectures will pave the way for the next generation of mobile devices.

Towards the objectives stated above, our specific contributions include (a) an architecture based on resistive crosspoint array that can update all values stored and compute matrix vector multiplication in parallel within a single cycle, (b) a framework of training DNNs with a block-wise sparsity to drastically reduce memory storage and total number of computations required to compute the output of DNNs, (c) the exploration of hardware implementations of sparse DNNs and architectural guidelines to reduce power consumption for the implementations in monolithic 3D integrated circuits, and (d) a prototype chip in 65nm CMOS accelerator for long-short term memory networks trained with the proposed block-wise sparsity scheme.

ContributorsKadetotad, Deepak Vinayak (Author) / Seo, Jae-Sun (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Vrudhula, Sarma (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)

Created2019

Algorithm and Hardware Design for Efficient Deep Learning Inference

Description

Deep learning (DL) has proved itself be one of the most important developements till date with far reaching impacts in numerous fields like robotics, computer vision, surveillance, speech processing, machine translation, finance, etc. They are now widely used for countless applications because of their ability to generalize real world data,…

Deep learning (DL) has proved itself be one of the most important developements till date with far reaching impacts in numerous fields like robotics, computer vision, surveillance, speech processing, machine translation, finance, etc. They are now widely used for countless applications because of their ability to generalize real world data, robustness to noise in previously unseen data and high inference accuracy. With the ability to learn useful features from raw sensor data, deep learning algorithms have out-performed tradinal AI algorithms and pushed the boundaries of what can be achieved with AI. In this work, we demonstrate the power of deep learning by developing a neural network to automatically detect cough instances from audio recorded in un-constrained environments. For this, 24 hours long recordings from 9 dierent patients is collected and carefully labeled by medical personel. A pre-processing algorithm is proposed to convert event based cough dataset to a more informative dataset with start and end of coughs and also introduce data augmentation for regularizing the training procedure. The proposed neural network achieves 92.3% leave-one-out accuracy on data captured in real world.

Deep neural networks are composed of multiple layers that are compute/memory intensive. This makes it difficult to execute these algorithms real-time with low power consumption using existing general purpose computers. In this work, we propose hardware accelerators for a traditional AI algorithm based on random forest trees and two representative deep convolutional neural networks (AlexNet and VGG). With the proposed acceleration techniques, ~ 30x performance improvement was achieved compared to CPU for random forest trees. For deep CNNS, we demonstrate that much higher performance can be achieved with architecture space exploration using any optimization algorithms with system level performance and area models for hardware primitives as inputs and goal of minimizing latency with given resource constraints. With this method, ~30GOPs performance was achieved for Stratix V FPGA boards.

Hardware acceleration of DL algorithms alone is not always the most ecient way and sucient to achieve desired performance. There is a huge headroom available for performance improvement provided the algorithms are designed keeping in mind the hardware limitations and bottlenecks. This work achieves hardware-software co-optimization for Non-Maximal Suppression (NMS) algorithm. Using the proposed algorithmic changes and hardware architecture

With CMOS scaling coming to an end and increasing memory bandwidth bottlenecks, CMOS based system might not scale enough to accommodate requirements of more complicated and deeper neural networks in future. In this work, we explore RRAM crossbars and arrays as compact, high performing and energy efficient alternative to CMOS accelerators for deep learning training and inference. We propose and implement RRAM periphery read and write circuits and achieved ~3000x performance improvement in online dictionary learning compared to CPU.

This work also examines the realistic RRAM devices and their non-idealities. We do an in-depth study of the effects of RRAM non-idealities on inference accuracy when a pretrained model is mapped to RRAM based accelerators. To mitigate this issue, we propose Random Sparse Adaptation (RSA), a novel scheme aimed at tuning the model to take care of the faults of the RRAM array on which it is mapped. Our proposed method can achieve inference accuracy much higher than what traditional Read-Verify-Write (R-V-W) method could achieve. RSA can also recover lost inference accuracy 100x ~ 1000x faster compared to R-V-W. Using 32-bit high precision RSA cells, we achieved ~10% higher accuracy using fautly RRAM arrays compared to what can be achieved by mapping a deep network to an 32 level RRAM array with no variations.

ContributorsMohanty, Abinash (Author) / Cao, Yu (Thesis advisor) / Seo, Jae-Sun (Committee member) / Vrudhula, Sarma (Committee member) / Chakrabarti, Chaitali (Committee member) / Arizona State University (Publisher)

Created2018

Design and Optimization of Resistive RAM-based Storage and Computing Systems

Description

The Resistive Random Access Memory (ReRAM) is an emerging non-volatile memory

technology because of its attractive attributes, including excellent scalability (< 10 nm), low

programming voltage (< 3 V), fast switching speed (< 10 ns), high OFF/ON ratio (> 10),

good endurance (up to 1012 cycles) and great compatibility with silicon CMOS technology…

The Resistive Random Access Memory (ReRAM) is an emerging non-volatile memory

technology because of its attractive attributes, including excellent scalability (< 10 nm), low

programming voltage (< 3 V), fast switching speed (< 10 ns), high OFF/ON ratio (> 10),

good endurance (up to 1012 cycles) and great compatibility with silicon CMOS technology [1].

However, ReRAM suffers from larger write latency, energy and reliability issue compared to

Dynamic Random Access Memory (DRAM). To improve the energy-efficiency, latency efficiency and reliability of ReRAM storage systems, a low cost cross-layer approach that spans device, circuit, architecture and system levels is proposed.

For 1T1R 2D ReRAM system, the effect of both retention and endurance errors on

ReRAM reliability is considered. Proposed approach is to design circuit-level and architecture-level techniques to reduce raw Bit Error Rate significantly and then employ low cost Error Control Coding to achieve the desired lifetime.

For 1S1R 2D ReRAM system, a cross-point array with “multi-bit per access” per subarray

is designed for high energy-efficiency and good reliability. The errors due to cell-level as well

as array-level variations are analyzed and a low cost scheme to maintain reliability and latency

with low energy consumption is proposed.

For 1S1R 3D ReRAM system, access schemes which activate multiple subarrays with

multiple layers in a subarray are used to achieve high energy efficiency through activating fewer

subarray, and good reliability is achieved through innovative data organization.

Finally, a novel ReRAM-based accelerator design is proposed to support multiple

Convolutional Neural Networks (CNN) topologies including VGGNet, AlexNet and ResNet.

The multi-tiled architecture consists of 9 processing elements per tile, where each tile

implements the dot product operation using ReRAM as computation unit. The processing

elements operate in a systolic fashion, thereby maximizing input feature map reuse and

minimizing interconnection cost. The system-level evaluation on several network benchmarks

show that the proposed architecture can improve computation efficiency and energy efficiency

compared to a state-of-the-art ReRAM-based accelerator.

ContributorsMao, Manqing (Author) / Chakrabariti, Chaitali (Thesis advisor) / Yu, Shimeng (Committee member) / Cao, Yu (Committee member) / Orgas, Umit (Committee member) / Arizona State University (Publisher)

Created2019

Reliability issues and design solutions in advanced CMOS design

Description

Over decades, scientists have been scaling devices to increasingly smaller feature sizes for ever better performance of complementary metal-oxide semiconductor (CMOS) technology to meet requirements on speed, complexity, circuit density, power consumption and ultimately cost required by many advanced applications. However, going to these ultra-scaled CMOS devices also brings some…

Over decades, scientists have been scaling devices to increasingly smaller feature sizes for ever better performance of complementary metal-oxide semiconductor (CMOS) technology to meet requirements on speed, complexity, circuit density, power consumption and ultimately cost required by many advanced applications. However, going to these ultra-scaled CMOS devices also brings some drawbacks. Aging due to bias-temperature-instability (BTI) and Hot carrier injection (HCI) is the dominant cause of functional failure in large scale logic circuits. The aging phenomena, on top of process variations, translate into complexity and reduced design margin for circuits. Such issues call for “Design for Reliability”. In order to increase the overall design efficiency, it is important to (i) study the impact of aging on circuit level along with the transistor level understanding (ii) calibrate the theoretical findings with measurement data (iii) implementing tools that analyze the impact of BTI and HCI reliability on circuit timing into VLSI design process at each stage. In this work, post silicon measurements of a 28nm HK-MG technology are done to study the effect of aging on Frequency Degradation of digital circuits. A novel voltage controlled ring oscillator (VCO) structure, developed by NIMO research group is used to determine the effect of aging mechanisms like NBTI, PBTI and SILC on circuit parameters. Accelerated aging mechanism is proposed to avoid the time consuming measurement process and extrapolation of data to the end of life thus instead of predicting the circuit behavior, one can measure it, within a short period of time. Finally, to bridge the gap between device level models and circuit level aging analysis, a System Level Reliability Analysis Flow (SyRA) developed by NIMO group, is implemented for a TSMC 65nm industrial level design to achieve one-step reliability prediction for digital design.

ContributorsBansal, Ankita (Author) / Cao, Yu (Thesis advisor) / Seo, Jae sun (Committee member) / Barnaby, Hugh (Committee member) / Arizona State University (Publisher)

Created2016

Radiation hardened clock design

Description

Clock generation and distribution are essential to CMOS microchips, providing synchronization to external devices and between internal sequential logic. Clocks in microprocessors are highly vulnerable to single event effects and designing reliable energy efficient clock networks for mission critical applications is a major challenge. This dissertation studies the basics of…

Clock generation and distribution are essential to CMOS microchips, providing synchronization to external devices and between internal sequential logic. Clocks in microprocessors are highly vulnerable to single event effects and designing reliable energy efficient clock networks for mission critical applications is a major challenge. This dissertation studies the basics of radiation hardening, essentials of clock design and impact of particle strikes on clocks in detail and presents design techniques for hardening complete clock systems in digital ICs.

Since the sequential elements play a key role in deciding the robustness of any clocking strategy, hardened-by-design implementations of triple-mode redundant (TMR) pulse clocked latches and physical design methodologies for using TMR master-slave flip-flops in application specific ICs (ASICs) are proposed. A novel temporal pulse clocked latch design for low power radiation hardened applications is also proposed. Techniques for designing custom RHBD clock distribution networks (clock spines) and ASIC clock trees for a radiation hardened microprocessor using standard CAD tools are presented. A framework for analyzing the vulnerabilities of clock trees in general, and study the parameters that contribute the most to the tree’s failure, including impact on controlled latches is provided. This is then used to design an integrated temporally redundant clock tree and pulse clocked flip-flop based clocking scheme that is robust to single event transients (SETs) and single event upsets (SEUs). Subsequently, designing robust clock delay lines for use in double data rate (DDRx) memory applications is studied in detail. Several modules of the proposed radiation hardened all-digital delay locked loop are designed and studied. Many of the circuits proposed in this entire body of work have been implemented and tested on a standard low-power 90-nm process.

ContributorsChellappa, Srivatsan (Author) / Clark, Lawrence T (Thesis advisor) / Holbert, Keith E. (Committee member) / Cao, Yu (Committee member) / Ogras, Umit Y. (Committee member) / Arizona State University (Publisher)

Created2015

RRAM-based PUF: design and applications in cryptography

Description

The recent flurry of security breaches have raised serious concerns about the security of data communication and storage. A promising way to enhance the security of the system is through physical root of trust, such as, through use of physical unclonable functions (PUF). PUF leverages the inherent randomness in physical…

The recent flurry of security breaches have raised serious concerns about the security of data communication and storage. A promising way to enhance the security of the system is through physical root of trust, such as, through use of physical unclonable functions (PUF). PUF leverages the inherent randomness in physical systems to provide device specific authentication and encryption.

In this thesis, first the design of a highly reliable resistive random access memory (RRAM) PUF is presented. Compared to existing 1 cell/bit RRAM, here the sum of the read-out currents of multiple RRAM cells are used for generating one response bit. This method statistically minimizes any early-lifetime failure due to RRAM retention degradation at high temperature or under voltage stress. Using a device model that was calibrated using IMEC HfOx RRAM experimental data, it was shown that an 8 cells/bit architecture achieves 99.9999% reliability for a lifetime >10 years at 125℃ . Also, the hardware area overhead of the proposed 8 cells/bit RRAM PUF architecture was smaller than 1 cell/bit RRAM PUF that requires error correction coding to achieve the same reliability.

Next, a basic security primitive is presented, where the RRAM PUF is embedded in the cryptographic module, SHA-256. This architecture is referred to as Embedded PUF or EPUF. EPUF has a security advantage over SHA-256 as it never exposes the PUF response to the outside world. Instead, in each round, the PUF response is used to change a few bits of the message word to produce a unique message digest for each IC. The use of EPUF as a key generation module for AES is also shown. The hardware area requirement for SHA-256 and AES-128 is then analyzed using synthesis results based on TSMC 65nm library. It is shown that the area overhead of 8 cells/bit RRAM PUF is only 1.08% of the SHA-256 module and 0.04% of the AES-128 module. The security analysis of the PUF based systems is also presented. It is shown that the EPUF-based systems are resistant towards standard attacks on PUFs, and that the security of the cryptographic modules is not compromised.

ContributorsShrivastava, Ayush (Author) / Chakrabarti, Chaitali (Thesis advisor) / Yu, Shimeng (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)

Created2015

Reconfigurable architectures and systems for IoT applications

Description

Internet of Things (IoT) has become a popular topic in industry over the recent years, which describes an ecosystem of internet-connected devices or things that enrich the everyday life by improving our productivity and efficiency. The primary components of the IoT ecosystem are hardware, software and services. While the software…

Internet of Things (IoT) has become a popular topic in industry over the recent years, which describes an ecosystem of internet-connected devices or things that enrich the everyday life by improving our productivity and efficiency. The primary components of the IoT ecosystem are hardware, software and services. While the software and services of IoT system focus on data collection and processing to make decisions, the underlying hardware is responsible for sensing the information, preprocess and transmit it to the servers. Since the IoT ecosystem is still in infancy, there is a great need for rapid prototyping platforms that would help accelerate the hardware design process. However, depending on the target IoT application, different sensors are required to sense the signals such as heart-rate, temperature, pressure, acceleration, etc., and there is a great need for reconfigurable platforms that can prototype different sensor interfacing circuits.

This thesis primarily focuses on two important hardware aspects of an IoT system: (a) an FPAA based reconfigurable sensing front-end system and (b) an FPGA based reconfigurable processing system. To enable reconfiguration capability for any sensor type, Programmable ANalog Device Array (PANDA), a transistor-level analog reconfigurable platform is proposed. CAD tools required for implementation of front-end circuits on the platform are also developed. To demonstrate the capability of the platform on silicon, a small-scale array of 24×25 PANDA cells is fabricated in 65nm technology. Several analog circuit building blocks including amplifiers, bias circuits and filters are prototyped on the platform, which demonstrates the effectiveness of the platform for rapid prototyping IoT sensor interfaces.

IoT systems typically use machine learning algorithms that run on the servers to process the data in order to make decisions. Recently, embedded processors are being used to preprocess the data at the energy-constrained sensor node or at IoT gateway, which saves considerable energy for transmission and bandwidth. Using conventional CPU based systems for implementing the machine learning algorithms is not energy-efficient. Hence an FPGA based hardware accelerator is proposed and an optimization methodology is developed to maximize throughput of any convolutional neural network (CNN) based machine learning algorithm on a resource-constrained FPGA.

ContributorsSuda, Naveen (Author) / Cao, Yu (Thesis advisor) / Bakkaloglu, Bertan (Committee member) / Ozev, Sule (Committee member) / Yu, Shimeng (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created2016

Low-overhead built-in self-test for advanced RF transceiver architectures

Description

Due to high level of integration in RF System on Chip (SOC), the test access points are limited to the baseband and RF inputs/outputs of the system. This limited access poses a big challenge particularly for advanced RF architectures where calibration of internal parameters is necessary and ensure proper operation.…

Due to high level of integration in RF System on Chip (SOC), the test access points are limited to the baseband and RF inputs/outputs of the system. This limited access poses a big challenge particularly for advanced RF architectures where calibration of internal parameters is necessary and ensure proper operation. Therefore low-overhead built-in Self-Test (BIST) solution for advanced RF transceiver is proposed. In this dissertation. Firstly, comprehensive BIST solution for RF polar transceivers using on-chip resources is presented. In the receiver, phase and gain mismatches degrade sensitivity and error vector magnitude (EVM). In the transmitter, delay skew between the envelope and phase signals and the finite envelope bandwidth can create intermodulation distortion (IMD) that leads to violation of spectral mask requirements. Characterization and calibration of these parameters with analytical model would reduce the test time and cost considerably. Hence, a technique to measure and calibrate impairments of the polar transceiver in the loop-back mode is proposed.

Secondly, robust amplitude measurement technique for RF BIST application and BIST circuits for loop-back connection are discussed. Test techniques using analytical model are explained and BIST circuits are introduced.

Next, a self-compensating built-in self-test solution for RF Phased Array Mismatch is proposed. In the proposed method, a sinusoidal test signal with unknown amplitude is applied to the inputs of two adjacent phased array elements and measure the baseband output signal after down-conversion. Mathematical modeling of the circuit impairments and phased array behavior indicates that by using two distinct input amplitudes, both of which can remain unknown, it is possible to measure the important parameters of the phased array, such as gain and phase mismatch. In addition, proposed BIST system is designed and fabricated using IBM 180nm process and a prototype four-element phased-array PCB is also designed and fabricated for verifying the proposed method.

Finally, process independent gain measurement via BIST/DUT co-design is explained. Design methodology how to reduce performance impact significantly is discussed.

Simulation and hardware measurements results for the proposed techniques show that the proposed technique can characterize the targeted impairments accurately.

ContributorsJeong, Jae Woong (Author) / Ozev, Sule (Thesis advisor) / Kitchen, Jennifer (Committee member) / Cao, Yu (Committee member) / Ogras, Umit Y. (Committee member) / Arizona State University (Publisher)

Created2015

Novel rail clamp architectures and their systematic design

Description

Rail clamp circuits are widely used for electrostatic discharge (ESD) protection in semiconductor products today. A step-by-step design procedure for the traditional RC and single-inverter-based rail clamp circuit and the design, simulation, implementation, and operation of two novel rail clamp circuits are described for use in the ESD protection of…

Rail clamp circuits are widely used for electrostatic discharge (ESD) protection in semiconductor products today. A step-by-step design procedure for the traditional RC and single-inverter-based rail clamp circuit and the design, simulation, implementation, and operation of two novel rail clamp circuits are described for use in the ESD protection of complementary metal-oxide-semiconductor (CMOS) circuits. The step-by-step design procedure for the traditional circuit is technology-node independent, can be fully automated, and aims to achieve a minimal area design that meets specified leakage and ESD specifications under all valid process, voltage, and temperature (PVT) conditions. The first novel rail clamp circuit presented employs a comparator inside the traditional circuit to reduce the value of the time constant needed. The second circuit uses a dynamic time constant approach in which the value of the time constant is dynamically adjusted after the clamp is triggered. Important metrics for the two new circuits such as ESD performance, latch-on immunity, clamp recovery time, supply noise immunity, fastest power-on time supported, and area are evaluated over an industry-standard PVT space using SPICE simulations and measurements on a fabricated 40 nm test chip.

ContributorsVenkatasubramanian, Ramachandran (Author) / Ozev, Sule (Thesis advisor) / Bakkaloglu, Bertan (Committee member) / Cao, Yu (Committee member) / Kitchen, Jennifer (Committee member) / Arizona State University (Publisher)

Created2016

Filtering by