Matching Items (11)
Filtering by

Clear all filters

153334-Thumbnail Image.png
Description
Three dimensional (3-D) ultrasound is safe, inexpensive, and has been shown to drastically improve system ease-of-use, diagnostic efficiency, and patient throughput. However, its high computational complexity and resulting high power consumption has precluded its use in hand-held applications.

In this dissertation, algorithm-architecture co-design techniques that aim to make hand-held 3-D ultrasound

Three dimensional (3-D) ultrasound is safe, inexpensive, and has been shown to drastically improve system ease-of-use, diagnostic efficiency, and patient throughput. However, its high computational complexity and resulting high power consumption has precluded its use in hand-held applications.

In this dissertation, algorithm-architecture co-design techniques that aim to make hand-held 3-D ultrasound a reality are presented. First, image enhancement methods to improve signal-to-noise ratio (SNR) are proposed. These include virtual source firing techniques and a low overhead digital front-end architecture using orthogonal chirps and orthogonal Golay codes.

Second, algorithm-architecture co-design techniques to reduce the power consumption of 3-D SAU imaging systems is presented. These include (i) a subaperture multiplexing strategy and the corresponding apodization method to alleviate the signal bandwidth bottleneck, and (ii) a highly efficient iterative delay calculation method to eliminate complex operations such as multiplications, divisions and square-root in delay calculation during beamforming. These techniques were used to define Sonic Millip3De, a 3-D die stacked architecture for digital beamforming in SAU systems. Sonic Millip3De produces 3-D high resolution images at 2 frames per second with system power consumption of 15W in 45nm technology.

Third, a new beamforming method based on separable delay decomposition is proposed to reduce the computational complexity of the beamforming unit in an SAU system. The method is based on minimizing the root-mean-square error (RMSE) due to delay decomposition. It reduces the beamforming complexity of a SAU system by 19x while providing high image fidelity that is comparable to non-separable beamforming. The resulting modified Sonic Millip3De architecture supports a frame rate of 32 volumes per second while maintaining power consumption of 15W in 45nm technology.

Next a 3-D plane-wave imaging system that utilizes both separable beamforming and coherent compounding is presented. The resulting system has computational complexity comparable to that of a non-separable non-compounding baseline system while significantly improving contrast-to-noise ratio and SNR. The modified Sonic Millip3De architecture is now capable of generating high resolution images at 1000 volumes per second with 9-fire-angle compounding.
ContributorsYang, Ming (Author) / Chakrabarti, Chaitali (Thesis advisor) / Papandreou-Suppappola, Antonia (Committee member) / Karam, Lina (Committee member) / Frakes, David (Committee member) / Ogras, Umit Y. (Committee member) / Arizona State University (Publisher)
Created2015
156189-Thumbnail Image.png
Description
Static CMOS logic has remained the dominant design style of digital systems for

more than four decades due to its robustness and near zero standby current. Static

CMOS logic circuits consist of a network of combinational logic cells and clocked sequential

elements, such as latches and flip-flops that are used for sequencing computations

over

Static CMOS logic has remained the dominant design style of digital systems for

more than four decades due to its robustness and near zero standby current. Static

CMOS logic circuits consist of a network of combinational logic cells and clocked sequential

elements, such as latches and flip-flops that are used for sequencing computations

over time. The majority of the digital design techniques to reduce power, area, and

leakage over the past four decades have focused almost entirely on optimizing the

combinational logic. This work explores alternate architectures for the flip-flops for

improving the overall circuit performance, power and area. It consists of three main

sections.

First, is the design of a multi-input configurable flip-flop structure with embedded

logic. A conventional D-type flip-flop may be viewed as realizing an identity function,

in which the output is simply the value of the input sampled at the clock edge. In

contrast, the proposed multi-input flip-flop, named PNAND, can be configured to

realize one of a family of Boolean functions called threshold functions. In essence,

the PNAND is a circuit implementation of the well-known binary perceptron. Unlike

other reconfigurable circuits, a PNAND can be configured by simply changing the

assignment of signals to its inputs. Using a standard cell library of such gates, a technology

mapping algorithm can be applied to transform a given netlist into one with

an optimal mixture of conventional logic gates and threshold gates. This approach

was used to fabricate a 32-bit Wallace Tree multiplier and a 32-bit booth multiplier

in 65nm LP technology. Simulation and chip measurements show more than 30%

improvement in dynamic power and more than 20% reduction in core area.

The functional yield of the PNAND reduces with geometry and voltage scaling.

The second part of this research investigates the use of two mechanisms to improve

the robustness of the PNAND circuit architecture. One is the use of forward and reverse body biases to change the device threshold and the other is the use of RRAM

devices for low voltage operation.

The third part of this research focused on the design of flip-flops with non-volatile

storage. Spin-transfer torque magnetic tunnel junctions (STT-MTJ) are integrated

with both conventional D-flipflop and the PNAND circuits to implement non-volatile

logic (NVL). These non-volatile storage enhanced flip-flops are able to save the state of

system locally when a power interruption occurs. However, manufacturing variations

in the STT-MTJs and in the CMOS transistors significantly reduce the yield, leading

to an overly pessimistic design and consequently, higher energy consumption. A

detailed analysis of the design trade-offs in the driver circuitry for performing backup

and restore, and a novel method to design the energy optimal driver for a given yield is

presented. Efficient designs of two nonvolatile flip-flop (NVFF) circuits are presented,

in which the backup time is determined on a per-chip basis, resulting in minimizing

the energy wastage and satisfying the yield constraint. To achieve a yield of 98%,

the conventional approach would have to expend nearly 5X more energy than the

minimum required, whereas the proposed tunable approach expends only 26% more

energy than the minimum. A non-volatile threshold gate architecture NV-TLFF are

designed with the same backup and restore circuitry in 65nm technology. The embedded

logic in NV-TLFF compensates performance overhead of NVL. This leads to the

possibility of zero-overhead non-volatile datapath circuits. An 8-bit multiply-and-

accumulate (MAC) unit is designed to demonstrate the performance benefits of the

proposed architecture. Based on the results of HSPICE simulations, the MAC circuit

with the proposed NV-TLFF cells is shown to consume at least 20% less power and

area as compared to the circuit designed with conventional DFFs, without sacrificing

any performance.
ContributorsYang, Jinghua (Author) / Vrudhula, Sarma (Thesis advisor) / Barnaby, Hugh (Committee member) / Cao, Yu (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created2018
156845-Thumbnail Image.png
Description
The rapid improvement in computation capability has made deep convolutional neural networks (CNNs) a great success in recent years on many computer vision tasks with significantly improved accuracy. During the inference phase, many applications demand low latency processing of one image with strict power consumption requirement, which reduces the efficiency

The rapid improvement in computation capability has made deep convolutional neural networks (CNNs) a great success in recent years on many computer vision tasks with significantly improved accuracy. During the inference phase, many applications demand low latency processing of one image with strict power consumption requirement, which reduces the efficiency of GPU and other general-purpose platform, bringing opportunities for specific acceleration hardware, e.g. FPGA, by customizing the digital circuit specific for the deep learning algorithm inference. However, deploying CNNs on portable and embedded systems is still challenging due to large data volume, intensive computation, varying algorithm structures, and frequent memory accesses. This dissertation proposes a complete design methodology and framework to accelerate the inference process of various CNN algorithms on FPGA hardware with high performance, efficiency and flexibility.

As convolution contributes most operations in CNNs, the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply and accumulate (MAC) operations with four levels of loops. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g. memory access) of the CNN accelerator based on multiple design variables. An efficient dataflow and hardware architecture of CNN acceleration are proposed to minimize the data communication while maximizing the resource utilization to achieve high performance.

Although great performance and efficiency can be achieved by customizing the FPGA hardware for each CNN model, significant efforts and expertise are required leading to long development time, which makes it difficult to catch up with the rapid development of CNN algorithms. In this work, we present an RTL-level CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, in order to enable high-level fast prototyping of CNNs from software to FPGA and still keep the benefits of low-level hardware optimization. First, a general-purpose library of RTL modules is developed to model different operations at each layer. The integration and dataflow of physical modules are predefined in the top-level system template and reconfigured during compilation for a given CNN algorithm. The runtime control of layer-by-layer sequential computation is managed by the proposed execution schedule so that even highly irregular and complex network topology, e.g. GoogLeNet and ResNet, can be compiled. The proposed methodology is demonstrated with various CNN algorithms, e.g. NiN, VGG, GoogLeNet and ResNet, on two different standalone FPGAs achieving state-of-the art performance.

Based on the optimized acceleration strategy, there are still a lot of design options, e.g. the degree and dimension of computation parallelism, the size of on-chip buffers, and the external memory bandwidth, which impact the utilization of computation resources and data communication efficiency, and finally affect the performance and energy consumption of the accelerator. The large design space of the accelerator makes it impractical to explore the optimal design choice during the real implementation phase. Therefore, a performance model is proposed in this work to quantitatively estimate the accelerator performance and resource utilization. By this means, the performance bottleneck and design bound can be identified and the optimal design option can be explored early in the design phase.
ContributorsMa, Yufei (Author) / Vrudhula, Sarma (Thesis advisor) / Seo, Jae-Sun (Thesis advisor) / Cao, Yu (Committee member) / Barnaby, Hugh (Committee member) / Arizona State University (Publisher)
Created2018
187820-Thumbnail Image.png
Description
With the advent of new advanced analysis tools and access to related published data, it is getting more difficult for data owners to suppress private information from published data while still providing useful information. This dual problem of providing useful, accurate information and protecting it at the same time has

With the advent of new advanced analysis tools and access to related published data, it is getting more difficult for data owners to suppress private information from published data while still providing useful information. This dual problem of providing useful, accurate information and protecting it at the same time has been challenging, especially in healthcare. The data owners lack an automated resource that provides layers of protection on a published dataset with validated statistical values for usability. Differential privacy (DP) has gained a lot of attention in the past few years as a solution to the above-mentioned dual problem. DP is defined as a statistical anonymity model that can protect the data from adversarial observation while still providing intended usage. This dissertation introduces a novel DP protection mechanism called Inexact Data Cloning (IDC), which simultaneously protects and preserves information in published data while conveying source data intent. IDC preserves the privacy of the records by converting the raw data records into clonesets. The clonesets then pass through a classifier that removes potential compromising clonesets, filtering only good inexact cloneset. The mechanism of IDC is dependent on a set of privacy protection metrics called differential privacy protection metrics (DPPM), which represents the overall protection level. IDC uses two novel performance values, differential privacy protection score (DPPS) and clone classifier selection percentage (CCSP), to estimate the privacy level of protected data. In support of using IDC as a viable data security product, a software tool chain prototype, differential privacy protection architecture (DPPA), was developed to utilize the IDC. DPPA used the engineering security mechanism of IDC. DPPA is a hub which facilitates a market for data DP security mechanisms. DPPA works by incorporating standalone IDC mechanisms and provides automation, IDC protected published datasets and statistically verified IDC dataset diagnostic report. DPPA is currently doing functional, and operational benchmark processes that quantifies the DP protection of a given published dataset. The DPPA tool was recently used to test a couple of health datasets. The test results further validate the IDC mechanism as being feasible.
Contributorsthomas, zelpha (Author) / Bliss, Daniel W (Thesis advisor) / Papandreou-Suppappola, Antonia (Committee member) / Banerjee, Ayan (Committee member) / Shrivastava, Aviral (Committee member) / Arizona State University (Publisher)
Created2023
189353-Thumbnail Image.png
Description
In recent years, Artificial Intelligence (AI) (e.g., Deep Neural Networks (DNNs), Transformer) has shown great success in real-world applications due to its superior performance in various cognitive tasks. The impressive performance achieved by AI models normally accompanies the cost of enormous model size and high computational complexity, which significantly hampers

In recent years, Artificial Intelligence (AI) (e.g., Deep Neural Networks (DNNs), Transformer) has shown great success in real-world applications due to its superior performance in various cognitive tasks. The impressive performance achieved by AI models normally accompanies the cost of enormous model size and high computational complexity, which significantly hampers their implementation on resource-limited Cyber-Physical Systems (CPS), Internet-of-Things (IoT), or Edge systems due to their tightly constrained energy, computing, size, and memory budget. Thus, the urgent demand for enhancing the \textbf{Efficiency} of DNN has drawn significant research interests across various communities. Motivated by the aforementioned concerns, this doctoral research has been mainly focusing on Enabling Deep Learning at Edge: From Efficient and Dynamic Inference to On-Device Learning. Specifically, from the inference perspective, this dissertation begins by investigating a hardware-friendly model compression method that effectively reduces the size of AI model while simultaneously achieving improved speed on edge devices. Additionally, due to the fact that diverse resource constraints of different edge devices, this dissertation further explores dynamic inference, which allows for real-time tuning of inference model size, computation, and latency to accommodate the limitations of each edge device. Regarding efficient on-device learning, this dissertation starts by analyzing memory usage during transfer learning training. Based on this analysis, a novel framework called "Reprogramming Network'' (Rep-Net) is introduced that offers a fresh perspective on the on-device transfer learning problem. The Rep-Net enables on-device transferlearning by directly learning to reprogram the intermediate features of a pre-trained model. Lastly, this dissertation studies an efficient continual learning algorithm that facilitates learning multiple tasks without the risk of forgetting previously acquired knowledge. In practice, through the exploration of task correlation, an interesting phenomenon is observed that the intermediate features are highly correlated between tasks with the self-supervised pre-trained model. Building upon this observation, a novel approach called progressive task-correlated layer freezing is proposed to gradually freeze a subset of layers with the highest correlation ratios for each task leading to training efficiency.
ContributorsYang, Li (Author) / Fan, Deliang (Thesis advisor) / Seo, Jae-Sun (Committee member) / Zhang, Junshan (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)
Created2023
156306-Thumbnail Image.png
Description
Software-defined radio provides users with a low-cost and flexible platform for implementing and studying advanced communications and remote sensing applications. Two such applications include unmanned aerial system-to-ground communications channel and joint sensing and communication systems. In this work, these applications are studied.

In the first part, unmanned aerial system-to-ground communications

Software-defined radio provides users with a low-cost and flexible platform for implementing and studying advanced communications and remote sensing applications. Two such applications include unmanned aerial system-to-ground communications channel and joint sensing and communication systems. In this work, these applications are studied.

In the first part, unmanned aerial system-to-ground communications channel models are derived from empirical data collected from software-defined radio transceivers in residential and mountainous desert environments using a small (< 20 kg) unmanned aerial system during low-altitude flight (< 130 m). The Kullback-Leibler divergence measure was employed to characterize model mismatch from the empirical data. Using this measure the derived models accurately describe the underlying data.

In the second part, an experimental joint sensing and communications system is implemented using a network of software-defined radio transceivers. A novel co-design receiver architecture is presented and demonstrated within a three-node joint multiple access system topology consisting of an independent radar and communications transmitter along with a joint radar and communications receiver. The receiver tracks an emulated target moving along a predefined path and simultaneously decodes a communications message. Experimental system performance bounds are characterized jointly using the communications channel capacity and novel estimation information rate.
ContributorsGutierrez, Richard (Author) / Bliss, Daniel W (Thesis advisor) / Papandreou-Suppappola, Antonia (Committee member) / Ogras, Umit Y. (Committee member) / Tepedelenlioğlu, Cihan (Committee member) / Arizona State University (Publisher)
Created2018
171744-Thumbnail Image.png
Description
Convolutional neural networks(CNNs) achieve high accuracy on large datasets but requires significant computation and storage requirement for training/testing. While many applications demand low latency and energy-efficient processing of the images, deploying these complex algorithms on the hardware is a challenging task. This dissertation first presents a compiler-based CNN training accelerator

Convolutional neural networks(CNNs) achieve high accuracy on large datasets but requires significant computation and storage requirement for training/testing. While many applications demand low latency and energy-efficient processing of the images, deploying these complex algorithms on the hardware is a challenging task. This dissertation first presents a compiler-based CNN training accelerator using DDR3 and HBM2 memory. An optimized RTL library is implemented to perform training-specific tasks and an RTL compiler is developed to generate FPGA-synthesizable RTL based on user-defined constraints. High Bandwidth Memory(HBM) provides efficient off-chip communication and improves the training performance. The impact of HBM2 on CNN training workloads is analyzed and compressively compared with DDR3. For training ResNet-20/VGG-like CNNs for the CIFAR-10 dataset, the proposed CNN training accelerator on Stratix-10 GX FPGA(DDR3) demonstrates 479 GOPS performance, and on Stratix-10 MX FPGA(HBM) shows 4.5/9.7 X energy-efficiency improvement compared to Tesla V100 GPU. Next, the FPGA online learning accelerator is presented. Adopting model segmentation techniques from Progressive Segmented Training(PST), the online learning accelerator achieved a 4.2X reduction in training latency. Furthermore, this dissertation presents an 8-bit floating-point (FP8) training processor which implements (1) Highly parallel tensor cores that maintain high PE utilization, (2) Hardware-efficient channel gating for dynamic output activation sparsity (3) Dynamic weight sparsity based on group Lasso (4) Gradient skipping based on FP prediction error. The 28nm prototype chip demonstrates significant improvements in FLOPs reduction (7.3×), energy efficiency (16.4 TFLOPS/W), and overall training latency speedup (4.7×) for both supervised training and self-supervised training tasks. In addition to the training accelerators, this dissertation also presents a CNN inference accelerator on ASIC(FixyNN) and FPGA(FixyFPGA). FixyNN consists of a fixed-weight feature extractor that generates ubiquitous CNN features and a conventional programmable CNN accelerator. In the fixed-weight feature extractor, the network weights are hard-coded into hardware and used as a fixed operand for the multiplication. Experimental results demonstrate FixyNN can achieve very high energy efficiencies up to 26.6 TOPS/W, and FixyFPGA achieves $2.34\times$ higher GOPS on ImageNet classification. In summary, this dissertation comprehensively discusses novel architectures of high-performance and energy-efficient ASIC/FPGA CNN inference/training accelerators.
ContributorsKolala Venkataramaniah, Shreyas (Author) / Seo, Jae-Sun (Thesis advisor) / Cao, Yu (Committee member) / Chakrabarti, Chaitali (Committee member) / Fan, Deliang (Committee member) / Arizona State University (Publisher)
Created2022
171895-Thumbnail Image.png
Description
Adversarial threats of deep learning are increasingly becoming a concern due to the ubiquitous deployment of deep neural networks(DNNs) in many security-sensitive domains. Among the existing threats, adversarial weight perturbation is an emerging class of threats that attempts to perturb the weight parameters of DNNs to breach security and privacy.In

Adversarial threats of deep learning are increasingly becoming a concern due to the ubiquitous deployment of deep neural networks(DNNs) in many security-sensitive domains. Among the existing threats, adversarial weight perturbation is an emerging class of threats that attempts to perturb the weight parameters of DNNs to breach security and privacy.In this thesis, the first weight perturbation attack introduced is called Bit-Flip Attack (BFA), which can maliciously flip a small number of bits within a computer’s main memory system storing the DNN weight parameter to achieve malicious objectives. Our developed algorithm can achieve three specific attack objectives: I) Un-targeted accuracy degradation attack, ii) Targeted attack, & iii) Trojan attack. Moreover, BFA utilizes the rowhammer technique to demonstrate the bit-flip attack in an actual computer prototype. While the bit-flip attack is conducted in a white-box setting, the subsequent contribution of this thesis is to develop another novel weight perturbation attack in a black-box setting. Consequently, this thesis discusses a new study of DNN model vulnerabilities in a multi-tenant Field Programmable Gate Array (FPGA) cloud under a strict black-box framework. This newly developed attack framework injects faults in the malicious tenant by duplicating specific DNN weight packages during data transmission between off-chip memory and on-chip buffer of a victim FPGA. The proposed attack is also experimentally validated in a multi-tenant cloud FPGA prototype. In the final part, the focus shifts toward deep learning model privacy, popularly known as model extraction, that can steal partial DNN weight parameters remotely with the aid of a memory side-channel attack. In addition, a novel training algorithm is designed to utilize the partially leaked DNN weight bit information, making the model extraction attack more effective. The algorithm effectively leverages the partial leaked bit information and generates a substitute prototype of the victim model with almost identical performance to the victim.
ContributorsRakin, Adnan Siraj (Author) / Fan, Deliang (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Seo, Jae-Sun (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)
Created2022
171654-Thumbnail Image.png
Description
The advancement and marked increase in the use of computing devices in health care for large scale and personal medical use has transformed the field of medicine and health care into a data rich domain. This surge in the availability of data has allowed domain experts to investigate, study and

The advancement and marked increase in the use of computing devices in health care for large scale and personal medical use has transformed the field of medicine and health care into a data rich domain. This surge in the availability of data has allowed domain experts to investigate, study and discover inherent patterns in diseases from new perspectives and in turn, further the field of medicine. Storage and analysis of this data in real time aids in enhancing the response time and efficiency of doctors and health care specialists. However, due to the time critical nature of most life- threatening diseases, there is a growing need to make informed decisions prior to the occurrence of any fatal outcome. Alongside time sensitivity, analyzing data specific to diseases and their effects on an individual basis leads to more efficient prognosis and rapid deployment of cures. The primary challenge in addressing both of these issues arises from the time varying and time sensitive nature of the data being studied and in the ability to successfully predict anomalous events using only observed data.This dissertation introduces adaptive machine learning algorithms that aid in the prediction of anomalous situations arising due to abnormalities present in patients diagnosed with certain types of diseases. Emphasis is given to the adaptation and development of algorithms based on an individual basis to further the accuracy of all predictions made. The main objectives are to learn the underlying representation of the data using empirical methods and enhance it using domain knowledge. The learned model is then utilized as a guide for statistical machine learning methods to predict the occurrence of anomalous events in the near future. Further enhancement of the learned model is achieved by means of tuning the objective function of the algorithm to incorporate domain knowledge. Along with anomaly forecasting using multi-modal data, this dissertation also investigates the use of univariate time series data towards the prediction of onset of diseases using Bayesian nonparametrics.
ContributorsDas, Subhasish (Author) / Gupta, Sandeep K.S. (Thesis advisor) / Banerjee, Ayan (Committee member) / Indic, Premananda (Committee member) / Papandreou-Suppappola, Antonia (Committee member) / Arizona State University (Publisher)
Created2022
157998-Thumbnail Image.png
Description
The marked increase in the inflow of remotely sensed data from satellites have trans- formed the Earth and Space Sciences to a data rich domain creating a rich repository for domain experts to analyze. These observations shed light on a diverse array of disciplines ranging from monitoring Earth system components

The marked increase in the inflow of remotely sensed data from satellites have trans- formed the Earth and Space Sciences to a data rich domain creating a rich repository for domain experts to analyze. These observations shed light on a diverse array of disciplines ranging from monitoring Earth system components to planetary explo- ration by highlighting the expected trend and patterns in the data. However, the complexity of these patterns from local to global scales, coupled with the volume of this ever-growing repository necessitates advanced techniques to sequentially process the datasets to determine the underlying trends. Such techniques essentially model the observations to learn characteristic parameters of data-generating processes and highlight anomalous planetary surface observations to help domain scientists for making informed decisions. The primary challenge in defining such models arises due to the spatio-temporal variability of these processes.

This dissertation introduces models of multispectral satellite observations that sequentially learn the expected trend from the data by extracting salient features of planetary surface observations. The main objectives are to learn the temporal variability for modeling dynamic processes and to build representations of features of interest that is learned over the lifespan of an instrument. The estimated model parameters are then exploited in detecting anomalies due to changes in land surface reflectance as well as novelties in planetary surface landforms. A model switching approach is proposed that allows the selection of the best matched representation given the observations that is designed to account for rate of time-variability in land surface. The estimated parameters are exploited to design a change detector, analyze the separability of change events, and form an expert-guided representation of planetary landforms for prioritizing the retrieval of scientifically relevant observations with both onboard and post-downlink applications.
ContributorsChakraborty, Srija (Author) / Papandreou-Suppappola, Antonia (Thesis advisor) / Christensen, Philip R. (Philip Russel) (Thesis advisor) / Richmond, Christ (Committee member) / Maurer, Alexander (Committee member) / Arizona State University (Publisher)
Created2019