This collection includes most of the ASU Theses and Dissertations from 2011 to present. ASU Theses and Dissertations are available in downloadable PDF format; however, a small percentage of items are under embargo. Information about the dissertations/theses includes degree information, committee members, an abstract, and supporting data or media.

In addition to the electronic theses found in the ASU Digital Repository, ASU Theses and Dissertations can be found in the ASU Library Catalog.

Dissertations and Theses granted by Arizona State University are archived and made available through a joint effort of the ASU Graduate College and the ASU Libraries. For more information or questions about this collection, visit the Digital Repository ETD Library Guide or contact the ASU Graduate College at gradformat@asu.edu.

Displaying 21 - 30 of 47
Description
Machine learning technology has achieved many incredible results in recent years, rivalling or exceeding human performance in intellectual tasks including image recognition, face detection, and the game of Go. Many machine learning algorithms require huge amounts of computation, such as the multiplication of large matrices. As silicon technology has scaled into the sub-14 nm regime, simply scaling down the device can no longer provide enough speed-up. New device technologies and system architectures are needed to improve computing capacity. Designing dedicated hardware for machine learning is in high demand, and effort must be devoted to the joint design and optimization of both hardware and algorithms.

For machine learning acceleration, traditional SRAM- and DRAM-based systems suffer from low capacity, high latency, and high standby power. Emerging memories, such as Phase Change Random Access Memory (PRAM), Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM), and Resistive Random Access Memory (RRAM), are promising candidates that provide low standby power, high data density, fast access, and excellent scalability. This dissertation proposes a hierarchical memory modeling framework and models PRAM and STT-MRAM at four different levels of abstraction. With the proposed models, various simulations are conducted to investigate performance, optimization, variability, reliability, and scalability.

Emerging memory devices such as RRAM can be organized as a 2-D crosspoint array to speed up the multiplication and accumulation in machine learning algorithms. This dissertation proposes a new parallel programming scheme to achieve in-memory learning with an RRAM crosspoint array. The programming circuitry is designed and simulated in TSMC 65 nm technology, showing a 900X speedup for the dictionary learning task compared to CPU performance.
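
As an illustration of how a crosspoint array performs multiply-accumulate operations in general (a minimal sketch, not this dissertation's specific circuit), the snippet below models each bit-line current as the dot product of the input voltages and the cell conductances; the array size and values are arbitrary assumptions.

```python
import numpy as np

# Minimal sketch of analog multiply-accumulate in a crosspoint array
# (illustrative only; sizes and values are arbitrary assumptions).
rng = np.random.default_rng(0)

rows, cols = 64, 32
G = rng.uniform(1e-6, 1e-4, size=(rows, cols))   # cell conductances (S)
V = rng.uniform(0.0, 0.2, size=rows)             # input voltages on the word lines (V)

# By Kirchhoff's current law, each bit-line current sums V_i * G_ij over all rows,
# so the whole matrix-vector product is computed in one parallel read.
I = V @ G                                        # bit-line currents (A)
print(I.shape)                                   # (32,) -> one accumulated current per column
```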

From the algorithm perspective, inspired by the high accuracy and low power of the brain, this dissertation proposes a bio-plausible feedforward inhibition spiking neural network with a Spike-Rate-Dependent Plasticity (SRDP) learning rule. It achieves more than 95% accuracy on the MNIST dataset, which is comparable to the sparse coding algorithm but requires far fewer computations. The role of inhibition in this network is systematically studied and shown to improve hardware efficiency in learning.
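
For readers unfamiliar with rate-dependent plasticity, the toy update below shows one common way a rate-based Hebbian rule can be written, where weights grow when the presynaptic rate is high and shrink otherwise, gated by postsynaptic activity. The threshold and learning rate are made-up values, and this is not the dissertation's exact SRDP rule.

```python
import numpy as np

def rate_plasticity_step(w, pre_rate, post_rate, lr=0.01, theta=5.0):
    """Toy rate-based Hebbian update (illustrative; not the thesis' SRDP rule).

    w         : (n_pre, n_post) synaptic weights
    pre_rate  : (n_pre,)  presynaptic firing rates in Hz
    post_rate : (n_post,) postsynaptic firing rates in Hz
    theta     : assumed rate threshold separating potentiation from depression
    """
    # Potentiate synapses whose presynaptic rate exceeds theta, depress the rest,
    # scaled by how strongly each postsynaptic neuron is firing.
    dw = lr * np.outer(np.where(pre_rate > theta, 1.0, -1.0), post_rate)
    return np.clip(w + dw, 0.0, 1.0)   # keep weights bounded and non-negative

w = np.full((784, 100), 0.5)                               # e.g., MNIST pixels -> 100 neurons
w = rate_plasticity_step(w, np.random.rand(784) * 10, np.random.rand(100) * 20)
```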
Contributors: Xu, Zihan (Author) / Cao, Yu (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Seo, Jae-Sun (Committee member) / Yu, Shimeng (Committee member) / Arizona State University (Publisher)
Created: 2017
Description
Static random-access memories (SRAMs) are an integral part of modern systems, serving as caches and data memories, and can occupy one-third of the design area. This work presents an embedded low-power SRAM on a triple-well process that allows body-biasing control. In addition to normal-mode operation, the design is embedded with a Physical Unclonable Function (PUF) [Suh07] mode and a Sense Amplifier Test (SA Test) mode. With the PUF-mode structures, fabrication and environmental mismatches in the bit cells are used to generate unique identification bits. These bits are fixed and known as the preferred state of an SRAM bit cell. The direct-access test structure is a measurement unit for offset-voltage analysis of the sense amplifiers. These designs are manufactured using a foundry bulk CMOS 55 nm low-power (LP) process. The SRAM bit cell and peripheral circuit design are discussed in detail, and for certain cases circuit simulation analysis is performed with random variations embedded in the SPICE models. Further, post-silicon testing results are discussed for normal SRAM operation and the special test modes. The silicon and circuit simulation results for the various tests are presented.
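
To illustrate the general idea behind an SRAM-based PUF (a generic sketch, not this design's circuitry), the snippet below derives identification bits by majority-voting each cell's preferred power-up state over several reads; the read function is a hypothetical stand-in for hardware access, and the flip probability is an assumption.

```python
import numpy as np

N_CELLS = 256
_mismatch_rng = np.random.default_rng(42)
PREFERRED = _mismatch_rng.integers(0, 2, N_CELLS)    # each cell's mismatch-defined bias

def read_powerup_states(noise_seed):
    """Hypothetical stand-in for one hardware read of the SRAM power-up values."""
    rng = np.random.default_rng(noise_seed)
    flips = rng.random(N_CELLS) < 0.05               # a few weakly biased cells flip due to noise
    return PREFERRED ^ flips.astype(int)

# Majority-vote several power-up reads so noisy cells do not corrupt the ID bits
# (a common PUF enrollment approach).
reads = np.array([read_powerup_states(s) for s in range(9)])
puf_id = (reads.sum(axis=0) > len(reads) // 2).astype(int)
print(puf_id[:16])
```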
Contributors: Dosi, Ankita (Author) / Clark, Lawrence (Thesis advisor) / Seo, Jae-Sun (Committee member) / Brunhaver, John (Committee member) / Arizona State University (Publisher)
Created: 2017
Description
The information era has brought about many technological advancements in the past few decades, which have led to an exponential increase in the creation of digital images and videos. Nearly all digital images pass through some image processing algorithm for reasons such as compression, transmission, or storage. Data loss during this process leaves us with a degraded image, so quality assessment has become a necessary requirement to ensure minimal degradation. Image Quality Assessment (IQA) has been researched and developed over the last several decades to predict quality scores that agree with human judgments of quality. Modern IQA algorithms are quite effective in terms of prediction accuracy, but their development has not focused on improving computational performance; the existing serial implementation requires a relatively large run-time, on the order of seconds for a single frame. Hardware acceleration using field-programmable gate arrays (FPGAs) provides a reconfigurable computing fabric that can be tailored to a broad range of applications, but programming FPGAs has usually required expertise in hardware description languages (HDLs) or high-level synthesis (HLS) tools. OpenCL is an open standard for cross-platform, parallel programming of heterogeneous systems; together with the Altera OpenCL SDK, it enables developers to exploit an FPGA's potential without extensive hardware knowledge. Hence, this thesis focuses on accelerating the computationally intensive part of the most apparent distortion (MAD) algorithm on an FPGA using OpenCL. The results are compared with a CPU implementation to evaluate performance and efficiency gains.
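
As a rough illustration of the kind of data-parallel work that dominates run-time in full-reference IQA metrics such as MAD (an assumed stand-in, not the thesis' OpenCL kernel), the snippet below computes an independent error statistic per image block; the block size and statistic are arbitrary choices.

```python
import numpy as np

def blockwise_local_stats(ref, dist, block=16):
    """Illustrative stand-in for the heavy, data-parallel portion of a
    full-reference IQA metric: one error statistic per image block.
    (Not the thesis' OpenCL kernel; block size and statistic are assumptions.)
    """
    h, w = ref.shape
    stats = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            err = ref[y:y+block, x:x+block] - dist[y:y+block, x:x+block]
            stats.append(err.var())          # each block is independent of the others
    return np.array(stats)

# The independence of the blocks is what makes this style of computation map
# naturally onto the parallel work-items of an OpenCL kernel running on an FPGA.
ref = np.random.rand(512, 512).astype(np.float32)
dist = ref + 0.05 * np.random.randn(512, 512).astype(np.float32)
print(blockwise_local_stats(ref, dist).mean())
```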
Contributors: Gunavelu Mohan, Aswin (Author) / Sohoni, Sohum (Thesis advisor) / Ren, Fengbo (Thesis advisor) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created: 2017
Description
Achieving human-level intelligence is a long-term goal for many Artificial Intelligence (AI) researchers. Recent developments in combining deep learning and reinforcement learning have helped us move a step forward toward this goal. Reinforcement learning using a delayed reward mechanism is an approach to machine intelligence that studies decision making and control, and how a decision-making agent can learn to act optimally without prior knowledge of its environment.

Q-learning is a model-free reinforcement learning strategy that uses temporal differences to estimate the values of state-action pairs, called Q values. A simple implementation of the Q-learning algorithm uses a Q table in memory to store and update the Q values. However, as the state space grows with a complex environment and the number of possible actions an agent can perform increases, the Q table reaches its space limit and does not scale well. Q-learning with neural networks eliminates the Q table by approximating the Q function with a neural network.
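
For context, the sketch below shows the standard tabular Q-learning update described here (the generic textbook rule, not the thesis' hardware implementation); the state/action sizes and hyperparameters are placeholders.

```python
import numpy as np

n_states, n_actions = 64, 4
Q = np.zeros((n_states, n_actions))          # the Q table this abstract refers to
alpha, gamma, eps = 0.1, 0.99, 0.1           # placeholder hyperparameters

def q_update(s, a, r, s_next):
    """Standard temporal-difference update for one observed transition."""
    td_target = r + gamma * Q[s_next].max()  # bootstrapped estimate of the return
    Q[s, a] += alpha * (td_target - Q[s, a])

def choose_action(s):
    """Epsilon-greedy exploration over the current Q estimates."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(Q[s].argmax())

# Replacing Q with a neural network that maps states to action values removes
# the table's memory limit, which is the approach accelerated in this thesis.
```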

Autonomous agents need to develop cognitive properties and become self-adaptive to be deployable in any environment. Reinforcement learning with Q-learning has been very effective in solving such problems. However, embedded systems such as space rovers and autonomous robots rarely implement these techniques because of constraints on processing power, chip area, convergence rate, and chip cost. These problems present a need for a portable, low-power, area-efficient hardware accelerator to speed up such learning.

This problem is addressed by implementing a hardware architecture for Q-learning using artificial neural networks. The architecture exploits the massive parallelism of neural networks together with the dedicated fine-grained parallelism of a Field Programmable Gate Array (FPGA), thereby processing Q values at high throughput. Mars exploration rovers currently use Xilinx space-grade FPGA devices for image processing, pyrotechnic operation control, and obstacle avoidance. The hardware resource consumption of the architecture has been evaluated by synthesizing it with a Xilinx Virtex-7 FPGA as the target device.
Contributors: Gankidi, Pranay Reddy (Author) / Thangavelautham, Jekanthan (Thesis advisor) / Ren, Fengbo (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
Switching regulators have several advantages over linear regulators, but their drawback is ripple voltage on the output. Previous approaches use an LDO following a buck converter, or a multi-phase buck converter, to reduce the output voltage ripple. However, these two solutions also have obvious drawbacks and limitations.

In this thesis, a novel mixed-signal adaptive ripple cancellation technique is presented. The idea is to generate an artificial ripple current with the same amplitude as the inductor current ripple but opposite phase, with highly linear tracking behavior. Generating the artificial triangular current requires the duty-cycle information and the amplitude of the inductor current ripple. The duty-cycle information is obtained by sensing the switching node SW, and feedback is used to regulate the amplitude of the artificial ripple current. The artificial ripple current cancels the inductor current ripple, resulting in a very low-ripple output current flowing to the load. In top-level simulation, 19.3 dB of ripple rejection is achieved.
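
The toy calculation below illustrates the cancellation principle described here: a triangular ripple plus an anti-phase replica whose amplitude is slightly mis-tracked, with the residual expressed as a rejection ratio in dB. The waveform, amplitudes, and the 2% tracking error are arbitrary assumptions, not values from this design.

```python
import numpy as np

fsw, duty, amp = 2e6, 0.4, 0.5                   # switching freq (Hz), duty cycle, ripple amplitude (A)
t = np.linspace(0, 4 / fsw, 4000, endpoint=False)

def triangle(t, fsw, duty, amp):
    """Idealized triangular inductor current ripple (rises for duty*T, falls afterwards)."""
    phase = (t * fsw) % 1.0
    rise = phase / duty
    fall = 1.0 - (phase - duty) / (1.0 - duty)
    return amp * (np.where(phase < duty, rise, fall) - 0.5)

i_ripple = triangle(t, fsw, duty, amp)
i_cancel = -triangle(t, fsw, duty, amp * 0.98)   # anti-phase replica with a 2% amplitude error

residual = i_ripple + i_cancel
rejection_db = 20 * np.log10(np.ptp(i_ripple) / np.ptp(residual))
print(f"ripple rejection ~ {rejection_db:.1f} dB")   # ~34 dB for a 2% tracking error
```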
Contributors: Yang, Zhe (Author) / Bakkaloglu, Bertan (Thesis advisor) / Seo, Jae-Sun (Committee member) / Lei, Qin (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
The development of portable electronic systems has been a fundamental factor in the emergence of new applications, including ubiquitous smart devices and self-driving vehicles. Power-Management Integrated Circuits (PMICs), which are a key component of such systems, must maintain high efficiency and reliability for the final system to be appealing from a size and cost perspective. As technology advances, such portable systems require high output currents at low voltages from their PMICs, leading to thermal reliability concerns. The reliability and power integrity of PMICs in such systems also degrade when they are operated in harsh environments. This dissertation presents solutions to two such reliability problems.

The first part of this work presents a scalable, daisy-chain solution for parallelizing multiple low-dropout linear (LDO) regulators to increase the total output current at low voltages. This printed-circuit-board (PCB) friendly approach achieves output current sharing without the need for any off-chip active or passive components or matched PCB traces, thus reducing the overall system cost. Fully integrated current sensing based on dynamic element matching eliminates the need for any off-chip current-sensing components. Current-sharing accuracies of 2.613% and 2.789% for output voltages of 3 V and 1 V, respectively, and an output current of 2 A per LDO are measured for the parallel LDO system implemented in a 0.18 μm process. Thermal images demonstrate that the parallel LDO system achieves thermal equilibrium and stable, reliable operation.

The remainder of the thesis deals with time-domain switching regulators for high-reliability applications. A time-domain buck and boost controller, with time as the processing variable, is developed for use in harsh environments. The controller features adaptive on-time/off-time generation for a quasi-constant switching frequency and a time-domain comparator that implements current-mode hysteretic control. A triple-redundant bandgap reference is also developed to mitigate the effects of radiation. Measurement results are presented for a buck and boost converter with a common controller IC implemented in a 0.18 μm process and an external power stage. The converter achieves a peak efficiency of 92.22% as a buck for an output current of 5 A and an output voltage of 5 V. Similarly, the converter achieves an efficiency of 95.97% as a boost for an output current of 1.25 A and an output voltage of 30.4 V.
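
As a hedged illustration of what a current-sharing figure like 2.613% could mean (the exact metric definition used in the dissertation is not stated here), the snippet below computes the worst-case deviation of each LDO's output current from the ideal equal share, using made-up example currents.

```python
# Illustrative current-sharing calculation (the metric definition is an assumption,
# and the per-LDO currents below are made-up example values, not measured data).
measured_currents = [2.00, 2.05, 1.97, 1.98]          # amps from four parallel LDOs

ideal_share = sum(measured_currents) / len(measured_currents)
worst_error = max(abs(i - ideal_share) for i in measured_currents) / ideal_share

print(f"worst-case current-sharing error: {worst_error * 100:.3f}%")
```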
Contributors: Talele, Bhushan (Author) / Bakkaloglu, Bertan (Thesis advisor) / Garrity, Douglas (Committee member) / Seo, Jae-Sun (Committee member) / Kitchen, Jennifer (Committee member) / Arizona State University (Publisher)
Created: 2021
Description
In recent years, Artificial Intelligence (AI) models (e.g., Deep Neural Networks (DNNs), Transformers) have shown great success in real-world applications due to their superior performance in various cognitive tasks. The impressive performance achieved by AI models normally comes at the cost of enormous model size and high computational complexity, which significantly hampers their implementation on resource-limited Cyber-Physical Systems (CPS), Internet-of-Things (IoT), or Edge systems due to their tightly constrained energy, computing, size, and memory budgets. Thus, the urgent demand for enhancing the efficiency of DNNs has drawn significant research interest across various communities. Motivated by these concerns, this doctoral research has mainly focused on Enabling Deep Learning at Edge: From Efficient and Dynamic Inference to On-Device Learning. Specifically, from the inference perspective, this dissertation begins by investigating a hardware-friendly model compression method that effectively reduces the size of an AI model while simultaneously achieving improved speed on edge devices. Additionally, because different edge devices have diverse resource constraints, this dissertation further explores dynamic inference, which allows real-time tuning of the inference model's size, computation, and latency to accommodate the limitations of each edge device. Regarding efficient on-device learning, this dissertation starts by analyzing memory usage during transfer-learning training. Based on this analysis, a novel framework called "Reprogramming Network" (Rep-Net) is introduced that offers a fresh perspective on the on-device transfer learning problem. Rep-Net enables on-device transfer learning by directly learning to reprogram the intermediate features of a pre-trained model. Lastly, this dissertation studies an efficient continual learning algorithm that facilitates learning multiple tasks without the risk of forgetting previously acquired knowledge. In practice, through the exploration of task correlation, an interesting phenomenon is observed: with a self-supervised pre-trained model, the intermediate features are highly correlated between tasks. Building upon this observation, a novel approach called progressive task-correlated layer freezing is proposed that gradually freezes the subset of layers with the highest correlation ratios for each task, leading to improved training efficiency.
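
The sketch below gives one plausible reading of the layer-freezing idea described above: measure how correlated each layer's features are across tasks and freeze the most correlated layers first. The correlation measure, model structure, and freezing fraction are assumptions for illustration, not the dissertation's exact procedure.

```python
import numpy as np

def layer_correlation(feats_task_a, feats_task_b):
    """Correlation between two tasks' intermediate features at one layer
    (flattened Pearson correlation; the actual metric is an assumption)."""
    a, b = feats_task_a.ravel(), feats_task_b.ravel()
    return float(np.corrcoef(a, b)[0, 1])

def layers_to_freeze(features_a, features_b, freeze_fraction=0.5):
    """Pick the layers whose features are most correlated across the two tasks."""
    scores = {name: layer_correlation(features_a[name], features_b[name])
              for name in features_a}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[: int(len(ranked) * freeze_fraction)]

# Toy features for 4 layers on two tasks (random stand-ins for real activations).
rng = np.random.default_rng(1)
feats_a = {f"layer{i}": rng.standard_normal((32, 64)) for i in range(4)}
feats_b = {k: v + 0.1 * rng.standard_normal(v.shape) for k, v in feats_a.items()}

print(layers_to_freeze(feats_a, feats_b))   # freeze these layers; train only the rest
```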
Contributors: Yang, Li (Author) / Fan, Deliang (Thesis advisor) / Seo, Jae-Sun (Committee member) / Zhang, Junshan (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)
Created: 2023
Description
Resistive random-access memory (RRAM), or the memristor, is an emerging technology used in neuromorphic computing to overcome the traditional von Neumann bottleneck by merging the processing and memory units. Two-dimensional (2D) materials with non-volatile switching behavior can be used as the switching layer of RRAMs, exhibiting superior behavior compared to conventional oxide-based RRAMs. The use of 2D materials allows scaling the resistive switching layer thickness to sub-nanometer dimensions, enabling devices to operate with low switching voltages and high programming speeds and offering large improvements in efficiency and performance as well as ultra-dense integration. This dissertation presents an extensive study of linear and logistic regression algorithms implemented with 1-transistor-1-resistor (1T1R) memristor crossbar arrays. For this task, a simulation platform is used that combines circuit-level simulations of 1T1R crossbars with a physics-based RRAM model to elucidate the impact of device variability on algorithm accuracy, convergence rate, and precision. Moreover, a smart pulsing strategy is proposed for the practical implementation of synaptic weight updates that can accelerate training in real crossbar architectures. Next, this dissertation reports the hardware implementation of analog dot-product operations on arrays of 2D hexagonal boron nitride (h-BN) memristors. This extends beyond previous work that studied isolated device characteristics toward the application of analog neural network accelerators based on 2D memristor arrays. Wafer-level fabrication of the memristor arrays is enabled by large-area transfer of CVD-grown few-layer h-BN films. The dot-product operation shows excellent linearity and repeatability with low read energy consumption, and minimal error and deviation over various measurement cycles. Moreover, the successful implementation of stochastic linear and logistic regression algorithms in 2D h-BN memristor hardware is presented for the classification of noisy images. Additionally, the electrical performance of the novel 2D h-BN memristor for SNN applications is extensively investigated. Then, using the experimental behavior of the h-BN memristor as the artificial synapse, an unsupervised spiking neural network (SNN) is simulated for an image classification task. A novel and simple Spike-Timing-Dependent-Plasticity (STDP)-based dropout technique is presented to enhance the recognition performance of the h-BN memristor-based SNN.
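
To make the "analog dot product with device variability" idea above concrete, the sketch below uses a common (though here assumed) mapping of signed regression weights onto pairs of conductances and adds random conductance variation to mimic device-to-device variability; the sizes and percentages are arbitrary, not the dissertation's measured values.

```python
import numpy as np

rng = np.random.default_rng(7)

def to_conductance_pair(w, g_min=1e-6, g_max=1e-4):
    """Map signed weights onto (G+, G-) conductance pairs, a common crossbar
    encoding (an assumption here; the dissertation's mapping may differ)."""
    scale = (g_max - g_min) / np.abs(w).max()
    g_pos = g_min + scale * np.clip(w, 0, None)
    g_neg = g_min + scale * np.clip(-w, 0, None)
    return g_pos, g_neg

w = rng.standard_normal((16, 4))              # trained regression weights (toy values)
g_pos, g_neg = to_conductance_pair(w)

# Device-to-device variability modeled as a few percent of random conductance spread.
g_pos_real = g_pos * (1 + 0.05 * rng.standard_normal(g_pos.shape))
g_neg_real = g_neg * (1 + 0.05 * rng.standard_normal(g_neg.shape))

x = rng.random(16)                            # input vector applied as read voltages
y_ideal = x @ w
y_analog = x @ g_pos_real - x @ g_neg_real    # differential bit-line currents

# The analog result is in conductance units, so compare the trend via correlation.
print(np.corrcoef(y_ideal, y_analog)[0, 1])
```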
Contributors: Afshari, Sahra (Author) / Sanchez Esqueda, Ivan (Thesis advisor) / Barnaby, Hugh J (Committee member) / Seo, Jae-Sun (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)
Created: 2023
Description
With the push for integration, a slew of modern switching power-management circuits operate at higher switching frequencies in order to reduce passive filter sizes. While these switching regulators provide power conversion at high efficiency, their output is prone to ripple due to the inherent switching behavior, so they use linear low-dropout regulators (LDOs) downstream to provide clean supplies. Typically, these LDOs have good power supply rejection (PSR) at lower frequencies, but this degrades at higher frequencies, and some residual ripple is still manifested on the output. Because of this, high PSR over a wide rejection frequency band is becoming a critical requirement for linear low-dropout regulators used in complex systems-on-chip (SoCs).

Typical LDOs achieve high PSR within their loop bandwidth; however, their supply rejection performance degrades with reduced loop gain outside the loop bandwidth. LDOs with external filtering capacitors may also exhibit spectral peaking in their PSR response, causing excess system-level supply noise. This work presents an LDO design approach that achieves a PSR higher than 68 dB up to 2 MHz over a wide range of loads up to 250 mA. The wide PSR bandwidth is achieved using a current-mode feedforward ripple canceller (CFFRC) amplifier, which provides up to 25 dB of PSR improvement. The feedforward path gain is inherently matched to the forward gain of the LDO and does not require calibration. The LDO has a fast load-transient response with a recovery time of 6.1 μs and a quiescent current of 5.6 μA. For a full load transition, the LDO settles with overshoot and undershoot voltages below 27.6 mV and 36.36 mV, respectively. The LDO is designed and fabricated in a 180 nm bipolar/CMOS/DMOS (BCD) technology. The CFFRC amplifier helps achieve low quiescent power due to its inherent current-mode nature, eliminating the need for supply-ripple summing amplifiers and adaptive biasing.
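
For readers unfamiliar with the PSR figures quoted above, the small calculation below converts them to linear attenuation using the usual 20·log10 convention; the example supply ripple amplitude is made up and is not from this work.

```python
import math

def psr_attenuation(psr_db):
    """Convert a PSR figure in dB to a linear attenuation factor (20*log10 convention)."""
    return 10 ** (psr_db / 20)

supply_ripple_mv = 100                       # assumed input ripple amplitude, not a thesis value
for psr_db in (68, 68 - 25):                 # with and without the ~25 dB CFFRC improvement
    out_uv = supply_ripple_mv * 1e3 / psr_attenuation(psr_db)
    print(f"PSR {psr_db} dB: 100 mV supply ripple -> ~{out_uv:.0f} uV at the LDO output")
```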
Contributors: Joshi, Kishan (Author) / Bakkaloglu, Bertan (Thesis advisor) / Garrity, Douglas (Committee member) / Seo, Jae-Sun (Committee member) / Kitchen, Jennifer (Committee member) / Arizona State University (Publisher)
Created: 2020
Description
Modern-day automobiles are becoming more connected and reliant on wireless connectivity. Thus, automotive electronics can be both a cause of and highly sensitive to electromagnetic interference (EMI), and the consequences of failure can be fatal. Technology advancements in engineering have brought several features into the automotive field, but at the expense of electromagnetic compatibility issues. Automotive EMC problems result from the emissions of electronic assemblies inside a vehicle and from the susceptibility of the electronics when exposed to external EMI sources; in both cases, they can cause unintended changes in the operation of the automotive system. Robustness to EMI is therefore one of the primary design aspects of state-of-the-art automotive ICs such as System Basis Chips (SBCs), which provide a wide range of analog, power-regulation, and digital functions on the same die. One of the primary sources of conducted EMI on the Local Interconnect Network (LIN) driver output is the noise of an integrated switching DC-DC regulator coupling through the parasitic substrate capacitance of the SBC. In this dissertation, an adaptive active EMI cancellation technique is presented that cancels the switching noise of the DC-DC regulator on the LIN driver output to ensure electromagnetic compatibility (EMC). The proposed active EMI cancellation circuit synthesizes a phase-synchronized cancellation pulse, which is then injected onto the LIN driver output using an on-chip tunable capacitor array to cancel the switching noise injected via the substrate. The proposed EMI reduction technique can track and cancel substrate noise independent of process technology and device parasitics, input voltage, duty cycle, and loading conditions of the DC-DC switching regulator. The EMI cancellation system is designed and fabricated in a 180 nm Bipolar-CMOS-DMOS (BCD) process with the integrated power stage of a DC-DC buck regulator switching at 2 MHz, along with an automotive LIN driver. The EMI cancellation circuit occupies an area of 0.7 mm², which is less than 3% of the overall area of a standard SBC, consumes 12.5 mW of power, and achieves a 25 dB reduction of conducted EMI in the LIN driver output's power spectrum at the switching frequency and its harmonics.
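
As a rough behavioral sketch of the adaptive cancellation idea (not this chip's circuit), the loop below sweeps a hypothetical capacitor-array code to minimize the residual noise amplitude at the switching frequency, which is one way to picture "tracking" the coupled substrate noise; all amplitudes and the code range are made-up assumptions.

```python
import numpy as np

FSW = 2e6                                        # switching frequency (Hz), as in the abstract
t = np.arange(0, 50 / FSW, 1 / (FSW * 64))       # 50 switching periods, 64 samples each

# Behavioral stand-ins: a coupled substrate-noise tone at fsw and a code-controlled
# anti-phase injection (made-up amplitudes; not the real injection circuit).
coupled = 0.0207 * np.sin(2 * np.pi * FSW * t)

def residual_amplitude(code):
    injected = -code * 0.001 * np.sin(2 * np.pi * FSW * t)
    return np.abs(coupled + injected).max()

best_code = min(range(32), key=residual_amplitude)    # exhaustive search over capacitor codes
reduction_db = 20 * np.log10(np.abs(coupled).max() / residual_amplitude(best_code))
print(best_code, f"{reduction_db:.1f} dB")            # code 21 leaves only the quantization residue
```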
Contributors: Ray, Abhishek (Author) / Bakkaloglu, Bertan (Thesis advisor) / Garrity, Douglas (Committee member) / Kitchen, Jennifer (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created: 2023