Search Content

Efficient test strategies for Analog/RF circuits

Description

Test cost has become a significant portion of device cost and a bottleneck in high volume manufacturing. Increasing integration density and shrinking feature sizes increased test time/cost and reduce observability. Test engineers have to put a tremendous effort in order to maintain test cost within an acceptable budget. Unfortunately, there…

Test cost has become a significant portion of device cost and a bottleneck in high volume manufacturing. Increasing integration density and shrinking feature sizes increased test time/cost and reduce observability. Test engineers have to put a tremendous effort in order to maintain test cost within an acceptable budget. Unfortunately, there is not a single straightforward solution to the problem. Products that are tested have several application domains and distinct customer profiles. Some products are required to operate for long periods of time while others are required to be low cost and optimized for low cost. Multitude of constraints and goals make it impossible to find a single solution that work for all cases. Hence, test development/optimization is typically design/circuit dependent and even process specific. Therefore, test optimization cannot be performed using a single test approach, but necessitates a diversity of approaches. This works aims at addressing test cost minimization and test quality improvement at various levels. In the first chapter of the work, we investigate pre-silicon strategies, such as design for test and pre-silicon statistical simulation optimization. In the second chapter, we investigate efficient post-silicon test strategies, such as adaptive test, adaptive multi-site test, outlier analysis, and process shift detection/tracking.

ContributorsYilmaz, Ender (Author) / Ozev, Sule (Thesis advisor) / Bakkaloglu, Bertan (Committee member) / Cao, Yu (Committee member) / Christen, Jennifer Blain (Committee member) / Arizona State University (Publisher)

Created2012

Integrated inductors with micro-patterned magnetic thin films for RF and power applications

Description

With increasing demand for System on Chip (SoC) and System in Package (SiP) design in computer and communication technologies, integrated inductor which is an essential passive component has been widely used in numerous integrated circuits (ICs) such as in voltage regulators and RF circuits. In this work, soft ferromagnetic core…

With increasing demand for System on Chip (SoC) and System in Package (SiP) design in computer and communication technologies, integrated inductor which is an essential passive component has been widely used in numerous integrated circuits (ICs) such as in voltage regulators and RF circuits. In this work, soft ferromagnetic core material, amorphous Co-Zr-Ta-B, was incorporated into on-chip and in-package inductors in order to scale down inductors and improve inductors performance in both inductance density and quality factor. With two layers of 500 nm Co-Zr-Ta-B films a 3.5X increase in inductance and a 3.9X increase in quality factor over inductors without magnetic films were measured at frequencies as high as 1 GHz. By laminating technology, up to 9.1X increase in inductance and more than 5X increase in quality factor (Q) were obtained from stripline inductors incorporated with 50 nm by 10 laminated films with a peak Q at 300 MHz. It was also demonstrated that this peak Q can be pushed towards high frequency as far as 1GHz by a combination of patterning magnetic films into fine bars and laminations. The role of magnetic vias in magnetic flux and eddy current control was investigated by both simulation and experiment using different patterning techniques and by altering the magnetic via width. Finger-shaped magnetic vias were designed and integrated into on-chip RF inductors improving the frequency of peak quality factor from 400 MHz to 800 MHz without sacrificing inductance enhancement. Eddy current and magnetic flux density in different areas of magnetic vias were analyzed by HFSS 3D EM simulation. With optimized magnetic vias, high frequency response of up to 2 GHz was achieved. Furthermore, the effect of applied magnetic field on on-chip inductors was investigated for high power applications. It was observed that as applied magnetic field along the hard axis (HA) increases, inductance maintains similar value initially at low fields, but decreases at larger fields until the magnetic films become saturated. The high frequency quality factor showed an opposite trend which is correlated to the reduction of ferromagnetic resonant absorption in the magnetic film. In addition, experiments showed that this field-dependent inductance change varied with different patterned magnetic film structures, including bars/slots and fingers structures. Magnetic properties of Co-Zr-Ta-B films on standard organic package substrates including ABF and polyimide were also characterized. Effects of substrate roughness and stress were analyzed and simulated which provide strategies for integrating Co-Zr-Ta-B into package inductors and improving inductors performance. Stripline and spiral inductors with Co-Zr-Ta-B films were fabricated on both ABF and polyimide substrates. Maximum 90% inductance increase in hundreds MHz frequency range were achieved in stripline inductors which are suitable for power delivery applications. Spiral inductors with Co-Zr-Ta-B films showed 18% inductance increase with quality factor of 4 at frequency up to 3 GHz.

ContributorsWu, Hao (Author) / Yu, Hongbin (Thesis advisor) / Bakkaloglu, Bertan (Committee member) / Cao, Yu (Committee member) / Chickamenahalli, Shamala (Committee member) / Arizona State University (Publisher)

Created2013

Path selection based branching for coarse grained reconfigurable arrays

Description

Coarse Grain Reconfigurable Arrays (CGRAs) are promising accelerators capable of

achieving high performance at low power consumption. While CGRAs can efficiently

accelerate loop kernels, accelerating loops with control flow (loops with if-then-else

structures) is quite challenging. Techniques that handle control flow execution in

CGRAs generally use predication. Such techniques execute both branches of an

if-then-else…

Coarse Grain Reconfigurable Arrays (CGRAs) are promising accelerators capable of

achieving high performance at low power consumption. While CGRAs can efficiently

accelerate loop kernels, accelerating loops with control flow (loops with if-then-else

structures) is quite challenging. Techniques that handle control flow execution in

CGRAs generally use predication. Such techniques execute both branches of an

if-then-else structure and select outcome of either branch to commit based on the

result of the conditional. This results in poor utilization of CGRA s computational

resources. Dual-issue scheme which is the state of the art technique for control flow

fetches instructions from both paths of the branch and selects one to execute at

runtime based on the result of the conditional. This technique has an overhead in

instruction fetch bandwidth. In this thesis, to improve performance of control flow

execution in CGRAs, I propose a solution in which the result of the conditional

expression that decides the branch outcome is communicated to the instruction fetch

unit to selectively issue instructions from the path taken by the branch at run time.

Experimental results show that my solution can achieve 34.6% better performance

and 52.1% improvement in energy efficiency on an average compared to state of the

art dual issue scheme without imposing any overhead in instruction fetch bandwidth.

ContributorsRajendran Radhika, Shri Hari (Author) / Shrivastava, Aviral (Thesis advisor) / Christen, Jennifer Blain (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)

Created2014

Built-in-self test of transmitter I/Q mismatch and nonlinearities using self-mixing envelope detector

Description

Built-in-Self-Test (BiST) for transmitters is a desirable choice since it eliminates the reliance on expensive instrumentation to do RF signal analysis. Existing on-chip resources, such as power or envelope detectors, or small additional circuitry can be used for BiST purposes. However, due to limited bandwidth, measurement of complex specifications, such…

Built-in-Self-Test (BiST) for transmitters is a desirable choice since it eliminates the reliance on expensive instrumentation to do RF signal analysis. Existing on-chip resources, such as power or envelope detectors, or small additional circuitry can be used for BiST purposes. However, due to limited bandwidth, measurement of complex specifications, such as IQ imbalance, is challenging. In this work, a BiST technique to compute transmitter IQ imbalances using measurements out of a self-mixing envelope detector is proposed. Both the linear and non linear parameters of the RF transmitter path are extracted successfully. We first derive an analytical expression for the output signal. Using this expression, we devise test signals to isolate the effects of gain and phase imbalance, DC offsets, time skews and system nonlinearity from other parameters of the system. Once isolated, these parameters are calculated easily with a few mathematical operations. Simulations and hardware measurements show that the technique can provide accurate characterization of IQ imbalances. One of the glaring advantages of this method is that, the impairments are extracted from analyzing the response at baseband frequency and thereby eliminating the need of high frequency ATE (Automated Test Equipment).

ContributorsByregowda, Srinath (Author) / Ozev, Sule (Thesis advisor) / Cao, Yu (Committee member) / Yu, Hongbin (Committee member) / Arizona State University (Publisher)

Created2012

Low-overhead built-in self-test for advanced RF transceiver architectures

Description

Due to high level of integration in RF System on Chip (SOC), the test access points are limited to the baseband and RF inputs/outputs of the system. This limited access poses a big challenge particularly for advanced RF architectures where calibration of internal parameters is necessary and ensure proper operation.…

Due to high level of integration in RF System on Chip (SOC), the test access points are limited to the baseband and RF inputs/outputs of the system. This limited access poses a big challenge particularly for advanced RF architectures where calibration of internal parameters is necessary and ensure proper operation. Therefore low-overhead built-in Self-Test (BIST) solution for advanced RF transceiver is proposed. In this dissertation. Firstly, comprehensive BIST solution for RF polar transceivers using on-chip resources is presented. In the receiver, phase and gain mismatches degrade sensitivity and error vector magnitude (EVM). In the transmitter, delay skew between the envelope and phase signals and the finite envelope bandwidth can create intermodulation distortion (IMD) that leads to violation of spectral mask requirements. Characterization and calibration of these parameters with analytical model would reduce the test time and cost considerably. Hence, a technique to measure and calibrate impairments of the polar transceiver in the loop-back mode is proposed.

Secondly, robust amplitude measurement technique for RF BIST application and BIST circuits for loop-back connection are discussed. Test techniques using analytical model are explained and BIST circuits are introduced.

Next, a self-compensating built-in self-test solution for RF Phased Array Mismatch is proposed. In the proposed method, a sinusoidal test signal with unknown amplitude is applied to the inputs of two adjacent phased array elements and measure the baseband output signal after down-conversion. Mathematical modeling of the circuit impairments and phased array behavior indicates that by using two distinct input amplitudes, both of which can remain unknown, it is possible to measure the important parameters of the phased array, such as gain and phase mismatch. In addition, proposed BIST system is designed and fabricated using IBM 180nm process and a prototype four-element phased-array PCB is also designed and fabricated for verifying the proposed method.

Finally, process independent gain measurement via BIST/DUT co-design is explained. Design methodology how to reduce performance impact significantly is discussed.

Simulation and hardware measurements results for the proposed techniques show that the proposed technique can characterize the targeted impairments accurately.

ContributorsJeong, Jae Woong (Author) / Ozev, Sule (Thesis advisor) / Kitchen, Jennifer (Committee member) / Cao, Yu (Committee member) / Ogras, Umit Y. (Committee member) / Arizona State University (Publisher)

Created2015

Embedding Logic and Non-volatile Devices in CMOS Digital Circuits for Improving Energy Efficiency

Description

Static CMOS logic has remained the dominant design style of digital systems for

more than four decades due to its robustness and near zero standby current. Static

CMOS logic circuits consist of a network of combinational logic cells and clocked sequential

elements, such as latches and flip-flops that are used for sequencing computations

over…

Static CMOS logic has remained the dominant design style of digital systems for

more than four decades due to its robustness and near zero standby current. Static

CMOS logic circuits consist of a network of combinational logic cells and clocked sequential

elements, such as latches and flip-flops that are used for sequencing computations

over time. The majority of the digital design techniques to reduce power, area, and

leakage over the past four decades have focused almost entirely on optimizing the

combinational logic. This work explores alternate architectures for the flip-flops for

improving the overall circuit performance, power and area. It consists of three main

sections.

First, is the design of a multi-input configurable flip-flop structure with embedded

logic. A conventional D-type flip-flop may be viewed as realizing an identity function,

in which the output is simply the value of the input sampled at the clock edge. In

contrast, the proposed multi-input flip-flop, named PNAND, can be configured to

realize one of a family of Boolean functions called threshold functions. In essence,

the PNAND is a circuit implementation of the well-known binary perceptron. Unlike

other reconfigurable circuits, a PNAND can be configured by simply changing the

assignment of signals to its inputs. Using a standard cell library of such gates, a technology

mapping algorithm can be applied to transform a given netlist into one with

an optimal mixture of conventional logic gates and threshold gates. This approach

was used to fabricate a 32-bit Wallace Tree multiplier and a 32-bit booth multiplier

in 65nm LP technology. Simulation and chip measurements show more than 30%

improvement in dynamic power and more than 20% reduction in core area.

The functional yield of the PNAND reduces with geometry and voltage scaling.

The second part of this research investigates the use of two mechanisms to improve

the robustness of the PNAND circuit architecture. One is the use of forward and reverse body biases to change the device threshold and the other is the use of RRAM

devices for low voltage operation.

The third part of this research focused on the design of flip-flops with non-volatile

storage. Spin-transfer torque magnetic tunnel junctions (STT-MTJ) are integrated

with both conventional D-flipflop and the PNAND circuits to implement non-volatile

logic (NVL). These non-volatile storage enhanced flip-flops are able to save the state of

system locally when a power interruption occurs. However, manufacturing variations

in the STT-MTJs and in the CMOS transistors significantly reduce the yield, leading

to an overly pessimistic design and consequently, higher energy consumption. A

detailed analysis of the design trade-offs in the driver circuitry for performing backup

and restore, and a novel method to design the energy optimal driver for a given yield is

presented. Efficient designs of two nonvolatile flip-flop (NVFF) circuits are presented,

in which the backup time is determined on a per-chip basis, resulting in minimizing

the energy wastage and satisfying the yield constraint. To achieve a yield of 98%,

the conventional approach would have to expend nearly 5X more energy than the

minimum required, whereas the proposed tunable approach expends only 26% more

energy than the minimum. A non-volatile threshold gate architecture NV-TLFF are

designed with the same backup and restore circuitry in 65nm technology. The embedded

logic in NV-TLFF compensates performance overhead of NVL. This leads to the

possibility of zero-overhead non-volatile datapath circuits. An 8-bit multiply-and-

accumulate (MAC) unit is designed to demonstrate the performance benefits of the

proposed architecture. Based on the results of HSPICE simulations, the MAC circuit

with the proposed NV-TLFF cells is shown to consume at least 20% less power and

area as compared to the circuit designed with conventional DFFs, without sacrificing

any performance.

ContributorsYang, Jinghua (Author) / Vrudhula, Sarma (Thesis advisor) / Barnaby, Hugh (Committee member) / Cao, Yu (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created2018

Energy Efficient Hardware Design of Neural Networks

Description

Hardware implementation of deep neural networks is earning significant importance nowadays. Deep neural networks are mathematical models that use learning algorithms inspired by the brain. Numerous deep learning algorithms such as multi-layer perceptrons (MLP) have demonstrated human-level recognition accuracy in image and speech classification tasks. Multiple layers of processing elements…

Hardware implementation of deep neural networks is earning significant importance nowadays. Deep neural networks are mathematical models that use learning algorithms inspired by the brain. Numerous deep learning algorithms such as multi-layer perceptrons (MLP) have demonstrated human-level recognition accuracy in image and speech classification tasks. Multiple layers of processing elements called neurons with several connections between them called synapses are used to build these networks. Hence, it involves operations that exhibit a high level of parallelism making it computationally and memory intensive. Constrained by computing resources and memory, most of the applications require a neural network which utilizes less energy. Energy efficient implementation of these computationally intense algorithms on neuromorphic hardware demands a lot of architectural optimizations. One of these optimizations would be the reduction in the network size using compression and several studies investigated compression by introducing element-wise or row-/column-/block-wise sparsity via pruning and regularization. Additionally, numerous recent works have concentrated on reducing the precision of activations and weights with some reducing to a single bit. However, combining various sparsity structures with binarized or very-low-precision (2-3 bit) neural networks have not been comprehensively explored. Output activations in these deep neural network algorithms are habitually non-binary making it difficult to exploit sparsity. On the other hand, biologically realistic models like spiking neural networks (SNN) closely mimic the operations in biological nervous systems and explore new avenues for brain-like cognitive computing. These networks deal with binary spikes, and they can exploit the input-dependent sparsity or redundancy to dynamically scale the amount of computation in turn leading to energy-efficient hardware implementation. This work discusses configurable spiking neuromorphic architecture that supports multiple hidden layers exploiting hardware reuse. It also presents design techniques for minimum-area/-energy DNN hardware with minimal degradation in accuracy. Area, performance and energy results of these DNN and SNN hardware is reported for the MNIST dataset. The Neuromorphic hardware designed for SNN algorithm in 28nm CMOS demonstrates high classification accuracy (>98% on MNIST) and low energy (51.4 - 773 (nJ) per classification). The optimized DNN hardware designed in 40nm CMOS that combines 8X structured compression and 3-bit weight precision showed 98.4% accuracy at 33 (nJ) per classification.

ContributorsKolala Venkataramanaiah, Shreyas (Author) / Seo, Jae-Sun (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)

Created2018

Hardware Acceleration of Deep Convolutional Neural Networks on FPGA

Description

The rapid improvement in computation capability has made deep convolutional neural networks (CNNs) a great success in recent years on many computer vision tasks with significantly improved accuracy. During the inference phase, many applications demand low latency processing of one image with strict power consumption requirement, which reduces the efficiency…

The rapid improvement in computation capability has made deep convolutional neural networks (CNNs) a great success in recent years on many computer vision tasks with significantly improved accuracy. During the inference phase, many applications demand low latency processing of one image with strict power consumption requirement, which reduces the efficiency of GPU and other general-purpose platform, bringing opportunities for specific acceleration hardware, e.g. FPGA, by customizing the digital circuit specific for the deep learning algorithm inference. However, deploying CNNs on portable and embedded systems is still challenging due to large data volume, intensive computation, varying algorithm structures, and frequent memory accesses. This dissertation proposes a complete design methodology and framework to accelerate the inference process of various CNN algorithms on FPGA hardware with high performance, efficiency and flexibility.

As convolution contributes most operations in CNNs, the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply and accumulate (MAC) operations with four levels of loops. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g. memory access) of the CNN accelerator based on multiple design variables. An efficient dataflow and hardware architecture of CNN acceleration are proposed to minimize the data communication while maximizing the resource utilization to achieve high performance.

Although great performance and efficiency can be achieved by customizing the FPGA hardware for each CNN model, significant efforts and expertise are required leading to long development time, which makes it difficult to catch up with the rapid development of CNN algorithms. In this work, we present an RTL-level CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, in order to enable high-level fast prototyping of CNNs from software to FPGA and still keep the benefits of low-level hardware optimization. First, a general-purpose library of RTL modules is developed to model different operations at each layer. The integration and dataflow of physical modules are predefined in the top-level system template and reconfigured during compilation for a given CNN algorithm. The runtime control of layer-by-layer sequential computation is managed by the proposed execution schedule so that even highly irregular and complex network topology, e.g. GoogLeNet and ResNet, can be compiled. The proposed methodology is demonstrated with various CNN algorithms, e.g. NiN, VGG, GoogLeNet and ResNet, on two different standalone FPGAs achieving state-of-the art performance.

Based on the optimized acceleration strategy, there are still a lot of design options, e.g. the degree and dimension of computation parallelism, the size of on-chip buffers, and the external memory bandwidth, which impact the utilization of computation resources and data communication efficiency, and finally affect the performance and energy consumption of the accelerator. The large design space of the accelerator makes it impractical to explore the optimal design choice during the real implementation phase. Therefore, a performance model is proposed in this work to quantitatively estimate the accelerator performance and resource utilization. By this means, the performance bottleneck and design bound can be identified and the optimal design option can be explored early in the design phase.

ContributorsMa, Yufei (Author) / Vrudhula, Sarma (Thesis advisor) / Seo, Jae-Sun (Thesis advisor) / Cao, Yu (Committee member) / Barnaby, Hugh (Committee member) / Arizona State University (Publisher)

Created2018

Enabling Deep Learning at Edge: From Efficient and Dynamic Inference to On-Device Learning

Description

In recent years, Artificial Intelligence (AI) (e.g., Deep Neural Networks (DNNs), Transformer) has shown great success in real-world applications due to its superior performance in various cognitive tasks. The impressive performance achieved by AI models normally accompanies the cost of enormous model size and high computational complexity, which significantly hampers…

In recent years, Artificial Intelligence (AI) (e.g., Deep Neural Networks (DNNs), Transformer) has shown great success in real-world applications due to its superior performance in various cognitive tasks. The impressive performance achieved by AI models normally accompanies the cost of enormous model size and high computational complexity, which significantly hampers their implementation on resource-limited Cyber-Physical Systems (CPS), Internet-of-Things (IoT), or Edge systems due to their tightly constrained energy, computing, size, and memory budget. Thus, the urgent demand for enhancing the \textbf{Efficiency} of DNN has drawn significant research interests across various communities. Motivated by the aforementioned concerns, this doctoral research has been mainly focusing on Enabling Deep Learning at Edge: From Efficient and Dynamic Inference to On-Device Learning. Specifically, from the inference perspective, this dissertation begins by investigating a hardware-friendly model compression method that effectively reduces the size of AI model while simultaneously achieving improved speed on edge devices. Additionally, due to the fact that diverse resource constraints of different edge devices, this dissertation further explores dynamic inference, which allows for real-time tuning of inference model size, computation, and latency to accommodate the limitations of each edge device. Regarding efficient on-device learning, this dissertation starts by analyzing memory usage during transfer learning training. Based on this analysis, a novel framework called "Reprogramming Network'' (Rep-Net) is introduced that offers a fresh perspective on the on-device transfer learning problem. The Rep-Net enables on-device transferlearning by directly learning to reprogram the intermediate features of a pre-trained model. Lastly, this dissertation studies an efficient continual learning algorithm that facilitates learning multiple tasks without the risk of forgetting previously acquired knowledge. In practice, through the exploration of task correlation, an interesting phenomenon is observed that the intermediate features are highly correlated between tasks with the self-supervised pre-trained model. Building upon this observation, a novel approach called progressive task-correlated layer freezing is proposed to gradually freeze a subset of layers with the highest correlation ratios for each task leading to training efficiency.

ContributorsYang, Li (Author) / Fan, Deliang (Thesis advisor) / Seo, Jae-Sun (Committee member) / Zhang, Junshan (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)

Created2023

Energy Efficient ASIC/FPGA Neural Network Accelerators

Description

Convolutional neural networks(CNNs) achieve high accuracy on large datasets but requires significant computation and storage requirement for training/testing. While many applications demand low latency and energy-efficient processing of the images, deploying these complex algorithms on the hardware is a challenging task. This dissertation first presents a compiler-based CNN training accelerator…

Convolutional neural networks(CNNs) achieve high accuracy on large datasets but requires significant computation and storage requirement for training/testing. While many applications demand low latency and energy-efficient processing of the images, deploying these complex algorithms on the hardware is a challenging task. This dissertation first presents a compiler-based CNN training accelerator using DDR3 and HBM2 memory. An optimized RTL library is implemented to perform training-specific tasks and an RTL compiler is developed to generate FPGA-synthesizable RTL based on user-defined constraints. High Bandwidth Memory(HBM) provides efficient off-chip communication and improves the training performance. The impact of HBM2 on CNN training workloads is analyzed and compressively compared with DDR3. For training ResNet-20/VGG-like CNNs for the CIFAR-10 dataset, the proposed CNN training accelerator on Stratix-10 GX FPGA(DDR3) demonstrates 479 GOPS performance, and on Stratix-10 MX FPGA(HBM) shows 4.5/9.7 X energy-efficiency improvement compared to Tesla V100 GPU. Next, the FPGA online learning accelerator is presented. Adopting model segmentation techniques from Progressive Segmented Training(PST), the online learning accelerator achieved a 4.2X reduction in training latency. Furthermore, this dissertation presents an 8-bit floating-point (FP8) training processor which implements (1) Highly parallel tensor cores that maintain high PE utilization, (2) Hardware-efficient channel gating for dynamic output activation sparsity (3) Dynamic weight sparsity based on group Lasso (4) Gradient skipping based on FP prediction error. The 28nm prototype chip demonstrates significant improvements in FLOPs reduction (7.3×), energy efficiency (16.4 TFLOPS/W), and overall training latency speedup (4.7×) for both supervised training and self-supervised training tasks. In addition to the training accelerators, this dissertation also presents a CNN inference accelerator on ASIC(FixyNN) and FPGA(FixyFPGA). FixyNN consists of a fixed-weight feature extractor that generates ubiquitous CNN features and a conventional programmable CNN accelerator. In the fixed-weight feature extractor, the network weights are hard-coded into hardware and used as a fixed operand for the multiplication. Experimental results demonstrate FixyNN can achieve very high energy efficiencies up to 26.6 TOPS/W, and FixyFPGA achieves $2.34\times$ higher GOPS on ImageNet classification. In summary, this dissertation comprehensively discusses novel architectures of high-performance and energy-efficient ASIC/FPGA CNN inference/training accelerators.

ContributorsKolala Venkataramaniah, Shreyas (Author) / Seo, Jae-Sun (Thesis advisor) / Cao, Yu (Committee member) / Chakrabarti, Chaitali (Committee member) / Fan, Deliang (Committee member) / Arizona State University (Publisher)

Created2022

Filtering by