Matching Items (22)
Filtering by

Clear all filters

153033-Thumbnail Image.png
Description
Coarse Grain Reconfigurable Arrays (CGRAs) are promising accelerators capable of

achieving high performance at low power consumption. While CGRAs can efficiently

accelerate loop kernels, accelerating loops with control flow (loops with if-then-else

structures) is quite challenging. Techniques that handle control flow execution in

CGRAs generally use predication. Such techniques execute both branches of an

if-then-else

Coarse Grain Reconfigurable Arrays (CGRAs) are promising accelerators capable of

achieving high performance at low power consumption. While CGRAs can efficiently

accelerate loop kernels, accelerating loops with control flow (loops with if-then-else

structures) is quite challenging. Techniques that handle control flow execution in

CGRAs generally use predication. Such techniques execute both branches of an

if-then-else structure and select outcome of either branch to commit based on the

result of the conditional. This results in poor utilization of CGRA s computational

resources. Dual-issue scheme which is the state of the art technique for control flow

fetches instructions from both paths of the branch and selects one to execute at

runtime based on the result of the conditional. This technique has an overhead in

instruction fetch bandwidth. In this thesis, to improve performance of control flow

execution in CGRAs, I propose a solution in which the result of the conditional

expression that decides the branch outcome is communicated to the instruction fetch

unit to selectively issue instructions from the path taken by the branch at run time.

Experimental results show that my solution can achieve 34.6% better performance

and 52.1% improvement in energy efficiency on an average compared to state of the

art dual issue scheme without imposing any overhead in instruction fetch bandwidth.
ContributorsRajendran Radhika, Shri Hari (Author) / Shrivastava, Aviral (Thesis advisor) / Christen, Jennifer Blain (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)
Created2014
156189-Thumbnail Image.png
Description
Static CMOS logic has remained the dominant design style of digital systems for

more than four decades due to its robustness and near zero standby current. Static

CMOS logic circuits consist of a network of combinational logic cells and clocked sequential

elements, such as latches and flip-flops that are used for sequencing computations

over

Static CMOS logic has remained the dominant design style of digital systems for

more than four decades due to its robustness and near zero standby current. Static

CMOS logic circuits consist of a network of combinational logic cells and clocked sequential

elements, such as latches and flip-flops that are used for sequencing computations

over time. The majority of the digital design techniques to reduce power, area, and

leakage over the past four decades have focused almost entirely on optimizing the

combinational logic. This work explores alternate architectures for the flip-flops for

improving the overall circuit performance, power and area. It consists of three main

sections.

First, is the design of a multi-input configurable flip-flop structure with embedded

logic. A conventional D-type flip-flop may be viewed as realizing an identity function,

in which the output is simply the value of the input sampled at the clock edge. In

contrast, the proposed multi-input flip-flop, named PNAND, can be configured to

realize one of a family of Boolean functions called threshold functions. In essence,

the PNAND is a circuit implementation of the well-known binary perceptron. Unlike

other reconfigurable circuits, a PNAND can be configured by simply changing the

assignment of signals to its inputs. Using a standard cell library of such gates, a technology

mapping algorithm can be applied to transform a given netlist into one with

an optimal mixture of conventional logic gates and threshold gates. This approach

was used to fabricate a 32-bit Wallace Tree multiplier and a 32-bit booth multiplier

in 65nm LP technology. Simulation and chip measurements show more than 30%

improvement in dynamic power and more than 20% reduction in core area.

The functional yield of the PNAND reduces with geometry and voltage scaling.

The second part of this research investigates the use of two mechanisms to improve

the robustness of the PNAND circuit architecture. One is the use of forward and reverse body biases to change the device threshold and the other is the use of RRAM

devices for low voltage operation.

The third part of this research focused on the design of flip-flops with non-volatile

storage. Spin-transfer torque magnetic tunnel junctions (STT-MTJ) are integrated

with both conventional D-flipflop and the PNAND circuits to implement non-volatile

logic (NVL). These non-volatile storage enhanced flip-flops are able to save the state of

system locally when a power interruption occurs. However, manufacturing variations

in the STT-MTJs and in the CMOS transistors significantly reduce the yield, leading

to an overly pessimistic design and consequently, higher energy consumption. A

detailed analysis of the design trade-offs in the driver circuitry for performing backup

and restore, and a novel method to design the energy optimal driver for a given yield is

presented. Efficient designs of two nonvolatile flip-flop (NVFF) circuits are presented,

in which the backup time is determined on a per-chip basis, resulting in minimizing

the energy wastage and satisfying the yield constraint. To achieve a yield of 98%,

the conventional approach would have to expend nearly 5X more energy than the

minimum required, whereas the proposed tunable approach expends only 26% more

energy than the minimum. A non-volatile threshold gate architecture NV-TLFF are

designed with the same backup and restore circuitry in 65nm technology. The embedded

logic in NV-TLFF compensates performance overhead of NVL. This leads to the

possibility of zero-overhead non-volatile datapath circuits. An 8-bit multiply-and-

accumulate (MAC) unit is designed to demonstrate the performance benefits of the

proposed architecture. Based on the results of HSPICE simulations, the MAC circuit

with the proposed NV-TLFF cells is shown to consume at least 20% less power and

area as compared to the circuit designed with conventional DFFs, without sacrificing

any performance.
ContributorsYang, Jinghua (Author) / Vrudhula, Sarma (Thesis advisor) / Barnaby, Hugh (Committee member) / Cao, Yu (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created2018
156822-Thumbnail Image.png
Description
Hardware implementation of deep neural networks is earning significant importance nowadays. Deep neural networks are mathematical models that use learning algorithms inspired by the brain. Numerous deep learning algorithms such as multi-layer perceptrons (MLP) have demonstrated human-level recognition accuracy in image and speech classification tasks. Multiple layers of processing elements

Hardware implementation of deep neural networks is earning significant importance nowadays. Deep neural networks are mathematical models that use learning algorithms inspired by the brain. Numerous deep learning algorithms such as multi-layer perceptrons (MLP) have demonstrated human-level recognition accuracy in image and speech classification tasks. Multiple layers of processing elements called neurons with several connections between them called synapses are used to build these networks. Hence, it involves operations that exhibit a high level of parallelism making it computationally and memory intensive. Constrained by computing resources and memory, most of the applications require a neural network which utilizes less energy. Energy efficient implementation of these computationally intense algorithms on neuromorphic hardware demands a lot of architectural optimizations. One of these optimizations would be the reduction in the network size using compression and several studies investigated compression by introducing element-wise or row-/column-/block-wise sparsity via pruning and regularization. Additionally, numerous recent works have concentrated on reducing the precision of activations and weights with some reducing to a single bit. However, combining various sparsity structures with binarized or very-low-precision (2-3 bit) neural networks have not been comprehensively explored. Output activations in these deep neural network algorithms are habitually non-binary making it difficult to exploit sparsity. On the other hand, biologically realistic models like spiking neural networks (SNN) closely mimic the operations in biological nervous systems and explore new avenues for brain-like cognitive computing. These networks deal with binary spikes, and they can exploit the input-dependent sparsity or redundancy to dynamically scale the amount of computation in turn leading to energy-efficient hardware implementation. This work discusses configurable spiking neuromorphic architecture that supports multiple hidden layers exploiting hardware reuse. It also presents design techniques for minimum-area/-energy DNN hardware with minimal degradation in accuracy. Area, performance and energy results of these DNN and SNN hardware is reported for the MNIST dataset. The Neuromorphic hardware designed for SNN algorithm in 28nm CMOS demonstrates high classification accuracy (>98% on MNIST) and low energy (51.4 - 773 (nJ) per classification). The optimized DNN hardware designed in 40nm CMOS that combines 8X structured compression and 3-bit weight precision showed 98.4% accuracy at 33 (nJ) per classification.
ContributorsKolala Venkataramanaiah, Shreyas (Author) / Seo, Jae-Sun (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)
Created2018
156796-Thumbnail Image.png
Description
Mobile devices have penetrated into every aspect of modern world. For one thing, they are becoming ubiquitous in daily life. For the other thing, they are storing more and more data, including sensitive data. Therefore, security and privacy of mobile devices are indispensable. This dissertation consists of five parts: two

Mobile devices have penetrated into every aspect of modern world. For one thing, they are becoming ubiquitous in daily life. For the other thing, they are storing more and more data, including sensitive data. Therefore, security and privacy of mobile devices are indispensable. This dissertation consists of five parts: two authentication schemes, two attacks, and one countermeasure related to security and privacy of mobile devices.

Specifically, in Chapter 1, I give an overview the challenges and existing solutions in these areas. In Chapter 2, a novel authentication scheme is presented, which is based on a user’s tapping or sliding on the touchscreen of a mobile device. In Chapter 3, I focus on mobile app fingerprinting and propose a method based on analyzing the power profiles of targeted mobile devices. In Chapter 4, I mainly explore a novel liveness detection method for face authentication on mobile devices. In Chapter 5, I investigate a novel keystroke inference attack on mobile devices based on user eye movements. In Chapter 6, a novel authentication scheme is proposed, based on detecting a user’s finger gesture through acoustic sensing. In Chapter 7, I discuss the future work.
ContributorsChen, Yimin (Author) / Zhang, Yanchao (Thesis advisor) / Zhang, Junshan (Committee member) / Reisslein, Martin (Committee member) / Ying, Lei (Committee member) / Arizona State University (Publisher)
Created2018
156845-Thumbnail Image.png
Description
The rapid improvement in computation capability has made deep convolutional neural networks (CNNs) a great success in recent years on many computer vision tasks with significantly improved accuracy. During the inference phase, many applications demand low latency processing of one image with strict power consumption requirement, which reduces the efficiency

The rapid improvement in computation capability has made deep convolutional neural networks (CNNs) a great success in recent years on many computer vision tasks with significantly improved accuracy. During the inference phase, many applications demand low latency processing of one image with strict power consumption requirement, which reduces the efficiency of GPU and other general-purpose platform, bringing opportunities for specific acceleration hardware, e.g. FPGA, by customizing the digital circuit specific for the deep learning algorithm inference. However, deploying CNNs on portable and embedded systems is still challenging due to large data volume, intensive computation, varying algorithm structures, and frequent memory accesses. This dissertation proposes a complete design methodology and framework to accelerate the inference process of various CNN algorithms on FPGA hardware with high performance, efficiency and flexibility.

As convolution contributes most operations in CNNs, the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply and accumulate (MAC) operations with four levels of loops. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g. memory access) of the CNN accelerator based on multiple design variables. An efficient dataflow and hardware architecture of CNN acceleration are proposed to minimize the data communication while maximizing the resource utilization to achieve high performance.

Although great performance and efficiency can be achieved by customizing the FPGA hardware for each CNN model, significant efforts and expertise are required leading to long development time, which makes it difficult to catch up with the rapid development of CNN algorithms. In this work, we present an RTL-level CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, in order to enable high-level fast prototyping of CNNs from software to FPGA and still keep the benefits of low-level hardware optimization. First, a general-purpose library of RTL modules is developed to model different operations at each layer. The integration and dataflow of physical modules are predefined in the top-level system template and reconfigured during compilation for a given CNN algorithm. The runtime control of layer-by-layer sequential computation is managed by the proposed execution schedule so that even highly irregular and complex network topology, e.g. GoogLeNet and ResNet, can be compiled. The proposed methodology is demonstrated with various CNN algorithms, e.g. NiN, VGG, GoogLeNet and ResNet, on two different standalone FPGAs achieving state-of-the art performance.

Based on the optimized acceleration strategy, there are still a lot of design options, e.g. the degree and dimension of computation parallelism, the size of on-chip buffers, and the external memory bandwidth, which impact the utilization of computation resources and data communication efficiency, and finally affect the performance and energy consumption of the accelerator. The large design space of the accelerator makes it impractical to explore the optimal design choice during the real implementation phase. Therefore, a performance model is proposed in this work to quantitatively estimate the accelerator performance and resource utilization. By this means, the performance bottleneck and design bound can be identified and the optimal design option can be explored early in the design phase.
ContributorsMa, Yufei (Author) / Vrudhula, Sarma (Thesis advisor) / Seo, Jae-Sun (Thesis advisor) / Cao, Yu (Committee member) / Barnaby, Hugh (Committee member) / Arizona State University (Publisher)
Created2018
153686-Thumbnail Image.png
Description
A principal goal of this dissertation is to study wireless network design and optimization with the focus on two perspectives: 1) socially-aware mobile networking and computing; 2) security and privacy in wireless networking. Under this common theme, this dissertation can be broadly organized into three parts.

The first part studies socially-aware

A principal goal of this dissertation is to study wireless network design and optimization with the focus on two perspectives: 1) socially-aware mobile networking and computing; 2) security and privacy in wireless networking. Under this common theme, this dissertation can be broadly organized into three parts.

The first part studies socially-aware mobile networking and computing. First, it studies random access control and power control under a social group utility maximization (SGUM) framework. The socially-aware Nash equilibria (SNEs) are derived and analyzed. Then, it studies mobile crowdsensing under an incentive mechanism that exploits social trust assisted reciprocity (STAR). The efficacy of the STAR mechanism is thoroughly investigated. Next, it studies mobile users' data usage behaviors under the impact of social services and the wireless operator's pricing. Based on a two-stage Stackelberg game formulation, the user demand equilibrium (UDE) is analyzed in Stage II and the optimal pricing strategy is developed in Stage I. Last, it studies opportunistic cooperative networking under an optimal stopping framework with two-level decision-making. For both cases with or without dedicated relays, the optimal relaying strategies are derived and analyzed.

The second part studies radar sensor network coverage for physical security. First, it studies placement of bistatic radar (BR) sensor networks for barrier coverage. The optimality of line-based placement is analyzed, and the optimal placement of BRs on a line segment is characterized. Then, it studies the coverage of radar sensor networks that exploits the Doppler effect. Based on a Doppler coverage model, an efficient method is devised to characterize Doppler-covered regions and an algorithm is developed to find the minimum radar density required for Doppler coverage.

The third part studies cyber security and privacy in socially-aware networking and computing. First, it studies random access control, cooperative jamming, and spectrum access under an extended SGUM framework that incorporates negative social ties. The SNEs are derived and analyzed. Then, it studies pseudonym change for personalized location privacy under the SGUM framework. The SNEs are analyzed and an efficient algorithm is developed to find an SNE with desirable properties.
ContributorsGong, Xiaowen (Author) / Zhang, Junshan (Thesis advisor) / Cochran, Douglas (Committee member) / Ying, Lei (Committee member) / Zhang, Yanchao (Committee member) / Arizona State University (Publisher)
Created2015
155244-Thumbnail Image.png
Description
Mobile devices are penetrating everyday life. According to a recent Cisco report [10], the number of mobile connected devices such as smartphones, tablets, laptops, eReaders, and Machine-to-Machine (M2M) modules will hit 11.6 billion by 2021, exceeding the world's projected population at that time (7.8 billion). The rapid development of mobile

Mobile devices are penetrating everyday life. According to a recent Cisco report [10], the number of mobile connected devices such as smartphones, tablets, laptops, eReaders, and Machine-to-Machine (M2M) modules will hit 11.6 billion by 2021, exceeding the world's projected population at that time (7.8 billion). The rapid development of mobile devices has brought a number of emerging security and privacy issues in mobile computing. This dissertation aims to address a number of challenging security and privacy issues in mobile computing.

This dissertation makes fivefold contributions. The first and second parts study the security and privacy issues in Device-to-Device communications. Specifically, the first part develops a novel scheme to enable a new way of trust relationship called spatiotemporal matching in a privacy-preserving and efficient fashion. To enhance the secure communication among mobile users, the second part proposes a game-theoretical framework to stimulate the cooperative shared secret key generation among mobile users. The third and fourth parts investigate the security and privacy issues in mobile crowdsourcing. In particular, the third part presents a secure and privacy-preserving mobile crowdsourcing system which strikes a good balance among object security, user privacy, and system efficiency. The fourth part demonstrates a differentially private distributed stream monitoring system via mobile crowdsourcing. Finally, the fifth part proposes VISIBLE, a novel video-assisted keystroke inference framework that allows an attacker to infer a tablet user's typed inputs on the touchscreen by recording and analyzing the video of the tablet backside during the user's input process. Besides, some potential countermeasures to this attack are also discussed. This dissertation sheds the light on the state-of-the-art security and privacy issues in mobile computing.
ContributorsSun, Jingchao (Author) / Zhang, Yanchao (Thesis advisor) / Zhang, Junshan (Committee member) / Ying, Lei (Committee member) / Ahn, Gail-Joon (Committee member) / Arizona State University (Publisher)
Created2017
168293-Thumbnail Image.png
Description
Edge networks pose unique challenges for machine learning and network management. The primary objective of this dissertation is to study deep learning and adaptive control aspects of edge networks and to address some of the unique challenges therein. This dissertation explores four particular problems of interest at the intersection of

Edge networks pose unique challenges for machine learning and network management. The primary objective of this dissertation is to study deep learning and adaptive control aspects of edge networks and to address some of the unique challenges therein. This dissertation explores four particular problems of interest at the intersection of edge intelligence, deep learning and network management. The first problem explores the learning of generative models in edge learning setting. Since the learning tasks in similar environments share model similarity, it is plausible to leverage pre-trained generative models from other edge nodes. Appealing to optimal transport theory tailored towards Wasserstein-1 generative adversarial networks, this part aims to develop a framework which systematically optimizes the generative model learning performance using local data at the edge node while exploiting the adaptive coalescence of pre-trained generative models from other nodes. In the second part, a many-to-one wireless architecture for federated learning at the network edge, where multiple edge devices collaboratively train a model using local data, is considered. The unreliable nature of wireless connectivity, togetherwith the constraints in computing resources at edge devices, dictates that the local updates at edge devices should be carefully crafted and compressed to match the wireless communication resources available and should work in concert with the receiver. Therefore, a Stochastic Gradient Descent based bandlimited coordinate descent algorithm is designed for such settings. The third part explores the adaptive traffic engineering algorithms in a dynamic network environment. The ages of traffic measurements exhibit significant variation due to asynchronization and random communication delays between routers and controllers. Inspired by the software defined networking architecture, a controller-assisted distributed routing scheme with recursive link weight reconfigurations, accounting for the impact of measurement ages and routing instability, is devised. The final part focuses on developing a federated learning based framework for traffic reshaping of electric vehicle (EV) charging. The absence of private EV owner information and scattered EV charging data among charging stations motivates the utilization of a federated learning approach. Federated learning algorithms are devised to minimize peak EV charging demand both spatially and temporarily, while maximizing the charging station profit.
ContributorsDedeoglu, Mehmet (Author) / Zhang, Junshan (Thesis advisor) / Kosut, Oliver (Committee member) / Zhang, Yanchao (Committee member) / Fan, Deliang (Committee member) / Arizona State University (Publisher)
Created2021
189353-Thumbnail Image.png
Description
In recent years, Artificial Intelligence (AI) (e.g., Deep Neural Networks (DNNs), Transformer) has shown great success in real-world applications due to its superior performance in various cognitive tasks. The impressive performance achieved by AI models normally accompanies the cost of enormous model size and high computational complexity, which significantly hampers

In recent years, Artificial Intelligence (AI) (e.g., Deep Neural Networks (DNNs), Transformer) has shown great success in real-world applications due to its superior performance in various cognitive tasks. The impressive performance achieved by AI models normally accompanies the cost of enormous model size and high computational complexity, which significantly hampers their implementation on resource-limited Cyber-Physical Systems (CPS), Internet-of-Things (IoT), or Edge systems due to their tightly constrained energy, computing, size, and memory budget. Thus, the urgent demand for enhancing the \textbf{Efficiency} of DNN has drawn significant research interests across various communities. Motivated by the aforementioned concerns, this doctoral research has been mainly focusing on Enabling Deep Learning at Edge: From Efficient and Dynamic Inference to On-Device Learning. Specifically, from the inference perspective, this dissertation begins by investigating a hardware-friendly model compression method that effectively reduces the size of AI model while simultaneously achieving improved speed on edge devices. Additionally, due to the fact that diverse resource constraints of different edge devices, this dissertation further explores dynamic inference, which allows for real-time tuning of inference model size, computation, and latency to accommodate the limitations of each edge device. Regarding efficient on-device learning, this dissertation starts by analyzing memory usage during transfer learning training. Based on this analysis, a novel framework called "Reprogramming Network'' (Rep-Net) is introduced that offers a fresh perspective on the on-device transfer learning problem. The Rep-Net enables on-device transferlearning by directly learning to reprogram the intermediate features of a pre-trained model. Lastly, this dissertation studies an efficient continual learning algorithm that facilitates learning multiple tasks without the risk of forgetting previously acquired knowledge. In practice, through the exploration of task correlation, an interesting phenomenon is observed that the intermediate features are highly correlated between tasks with the self-supervised pre-trained model. Building upon this observation, a novel approach called progressive task-correlated layer freezing is proposed to gradually freeze a subset of layers with the highest correlation ratios for each task leading to training efficiency.
ContributorsYang, Li (Author) / Fan, Deliang (Thesis advisor) / Seo, Jae-Sun (Committee member) / Zhang, Junshan (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)
Created2023
156819-Thumbnail Image.png
Description
Internet of Things (IoT) is emerging as part of the infrastructures for advancing a large variety of applications involving connections of many intelligent devices, leading to smart communities. Due to the severe limitation of the computing resources of IoT devices, it is common to offload tasks of various applications requiring

Internet of Things (IoT) is emerging as part of the infrastructures for advancing a large variety of applications involving connections of many intelligent devices, leading to smart communities. Due to the severe limitation of the computing resources of IoT devices, it is common to offload tasks of various applications requiring substantial computing resources to computing systems with sufficient computing resources, such as servers, cloud systems, and/or data centers for processing. However, this offloading method suffers from both high latency and network congestion in the IoT infrastructures.

Recently edge computing has emerged to reduce the negative impacts of tasks offloading to remote computing systems. As edge computing is in close proximity to IoT devices, it can reduce the latency of task offloading and reduce network congestion. Yet, edge computing has its drawbacks, such as the limited computing resources of some edge computing devices and the unbalanced loads among these devices. In order to effectively explore the potential of edge computing to support IoT applications, it is necessary to have efficient task management and load balancing in edge computing networks.

In this dissertation research, an approach is presented to periodically distributing tasks within the edge computing network while satisfying the quality-of-service (QoS) requirements of tasks. The QoS requirements include task completion deadline and security requirement. The approach aims to maximize the number of tasks that can be accommodated in the edge computing network, with consideration of tasks’ priorities. The goal is achieved through the joint optimization of the computing resource allocation and network bandwidth provisioning. Evaluation results show the improvement of the approach in increasing the number of tasks that can be accommodated in the edge computing network and the efficiency in resource utilization.
ContributorsSong, Yaozhong (Author) / Yau, Sik-Sang (Thesis advisor) / Huang, Dijiang (Committee member) / Sarjoughian, Hessam S. (Committee member) / Zhang, Yanchao (Committee member) / Arizona State University (Publisher)
Created2018