Matching Items (20)
Filtering by
- All Subjects: Computer Engineering
- Creators: Vrudhula, Sarma
Description
With increasing transistor volume and reducing feature size, it has become a major design constraint to reduce power consumption also. This has given rise to aggressive architectural changes for on-chip power management and rapid development to energy efficient hardware accelerators. Accordingly, the objective of this research work is to facilitate software developers to leverage these hardware techniques and improve energy efficiency of the system. To achieve this, I propose two solutions for Linux kernel: Optimal use of these architectural enhancements to achieve greater energy efficiency requires accurate modeling of processor power consumption. Though there are many models available in literature to model processor power consumption, there is a lack of such models to capture power consumption at the task-level. Task-level energy models are a requirement for an operating system (OS) to perform real-time power management as OS time multiplexes tasks to enable sharing of hardware resources. I propose a detailed design methodology for constructing an architecture agnostic task-level power model and incorporating it into a modern operating system to build an online task-level power profiler. The profiler is implemented inside the latest Linux kernel and validated for Intel Sandy Bridge processor. It has a negligible overhead of less than 1\% hardware resource consumption. The profiler power prediction was demonstrated for various application benchmarks from SPEC to PARSEC with less than 4\% error. I also demonstrate the importance of the proposed profiler for emerging architectural techniques through use case scenarios, which include heterogeneous computing and fine grained per-core DVFS. Along with architectural enhancement in general purpose processors to improve energy efficiency, hardware accelerators like Coarse Grain reconfigurable architecture (CGRA) are gaining popularity. Unlike vector processors, which rely on data parallelism, CGRA can provide greater flexibility and compiler level control making it more suitable for present SoC environment. To provide streamline development environment for CGRA, I propose a flexible framework in Linux to do design space exploration for CGRA. With accurate and flexible hardware models, fine grained integration with accurate architectural simulator, and Linux memory management and DMA support, a user can carry out limitless experiments on CGRA in full system environment.
ContributorsDesai, Digant Pareshkumar (Author) / Vrudhula, Sarma (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Wu, Carole-Jean (Committee member) / Arizona State University (Publisher)
Created2013
Description
Multicore processors have proliferated in nearly all forms of computing, from servers, desktop, to smartphones. The primary reason for this large adoption of multicore processors is due to its ability to overcome the power-wall by providing higher performance at a lower power consumption rate. With multi-cores, there is increased need for dynamic energy management (DEM), much more than for single-core processors, as DEM for multi-cores is no more a mechanism just to ensure that a processor is kept under specified temperature limits, but also a set of techniques that manage various processor controls like dynamic voltage and frequency scaling (DVFS), task migration, fan speed, etc. to achieve a stated objective. The objectives span a wide range from maximizing throughput, minimizing power consumption, reducing peak temperature, maximizing energy efficiency, maximizing processor reliability, and so on, along with much more wider constraints of temperature, power, timing, and reliability constraints. Thus DEM can be very complex and challenging to achieve. Since often times many DEMs operate together on a single processor, there is a need to unify various DEM techniques. This dissertation address such a need. In this work, a framework for DEM is proposed that provides a unifying processor model that includes processor power, thermal, timing, and reliability models, supports various DEM control mechanisms, many different objective functions along with equally diverse constraint specifications. Using the framework, a range of novel solutions is derived for instances of DEM problems, that include maximizing processor performance, energy efficiency, or minimizing power consumption, peak temperature under constraints of maximum temperature, memory reliability and task deadlines. Finally, a robust closed-loop controller to implement the above solutions on a real processor platform with a very low operational overhead is proposed. Along with the controller design, a model identification methodology for obtaining the required power and thermal models for the controller is also discussed. The controller is architecture independent and hence easily portable across many platforms. The controller has been successfully deployed on Intel Sandy Bridge processor and the use of the controller has increased the energy efficiency of the processor by over 30%
ContributorsHanumaiah, Vinay (Author) / Vrudhula, Sarma (Thesis advisor) / Chatha, Karamvir (Committee member) / Chakrabarti, Chaitali (Committee member) / Rodriguez, Armando (Committee member) / Askin, Ronald (Committee member) / Arizona State University (Publisher)
Created2013
Description
With the growth of IT products and sophisticated software in various operating systems, I observe that security risks in systems are skyrocketing constantly. Consequently, Security Assessment is now considered as one of primary security mechanisms to measure assurance of systems since systems that are not compliant with security requirements may lead adversaries to access critical information by circumventing security practices. In order to ensure security, considerable efforts have been spent to develop security regulations by facilitating security best-practices. Applying shared security standards to the system is critical to understand vulnerabilities and prevent well-known threats from exploiting vulnerabilities. However, many end users tend to change configurations of their systems without paying attention to the security. Hence, it is not straightforward to protect systems from being changed by unconscious users in a timely manner. Detecting the installation of harmful applications is not sufficient since attackers may exploit risky software as well as commonly used software. In addition, checking the assurance of security configurations periodically is disadvantageous in terms of time and cost due to zero-day attacks and the timing attacks that can leverage the window between each security checks. Therefore, event-driven monitoring approach is critical to continuously assess security of a target system without ignoring a particular window between security checks and lessen the burden of exhausted task to inspect the entire configurations in the system. Furthermore, the system should be able to generate a vulnerability report for any change initiated by a user if such changes refer to the requirements in the standards and turn out to be vulnerable. Assessing various systems in distributed environments also requires to consistently applying standards to each environment. Such a uniformed consistent assessment is important because the way of assessment approach for detecting security vulnerabilities may vary across applications and operating systems. In this thesis, I introduce an automated event-driven security assessment framework to overcome and accommodate the aforementioned issues. I also discuss the implementation details that are based on the commercial-off-the-self technologies and testbed being established to evaluate approach. Besides, I describe evaluation results that demonstrate the effectiveness and practicality of the approaches.
ContributorsSeo, Jeong-Jin (Author) / Ahn, Gail-Joon (Thesis advisor) / Yau, Stephen S. (Committee member) / Lee, Joohyung (Committee member) / Arizona State University (Publisher)
Created2014
Description
Access control is necessary for information assurance in many of today's applications such as banking and electronic health record. Access control breaches are critical security problems that can result from unintended and improper implementation of security policies. Security testing can help identify security vulnerabilities early and avoid unexpected expensive cost in handling breaches for security architects and security engineers. The process of security testing which involves creating tests that effectively examine vulnerabilities is a challenging task. Role-Based Access Control (RBAC) has been widely adopted to support fine-grained access control. However, in practice, due to its complexity including role management, role hierarchy with hundreds of roles, and their associated privileges and users, systematically testing RBAC systems is crucial to ensure the security in various domains ranging from cyber-infrastructure to mission-critical applications. In this thesis, we introduce i) a security testing technique for RBAC systems considering the principle of maximum privileges, the structure of the role hierarchy, and a new security test coverage criterion; ii) a MTBDD (Multi-Terminal Binary Decision Diagram) based representation of RBAC security policy including RHMTBDD (Role Hierarchy MTBDD) to efficiently generate effective positive and negative security test cases; and iii) a security testing framework which takes an XACML-based RBAC security policy as an input, parses it into a RHMTBDD representation and then generates positive and negative test cases. We also demonstrate the efficacy of our approach through case studies.
ContributorsGupta, Poonam (Author) / Ahn, Gail-Joon (Thesis advisor) / Collofello, James (Committee member) / Huang, Dijiang (Committee member) / Arizona State University (Publisher)
Created2014
Description
Stream computing has emerged as an importantmodel of computation for embedded system applications particularly in the multimedia and network processing domains. In recent past several programming languages and embedded multi-core processors have been proposed for streaming applications. This thesis examines the execution and dynamic scheduling of stream programs on embedded multi-core processors. The thesis addresses the problem in the context of a multi-tasking environment with a time varying allocation of processing elements for a particular streaming application. As a solution the thesis proposes a two step approach where the stream program is compiled to gather key application information, and to generate re-targetable code. A light weight dynamic scheduler incorporates the second stage of the approach. The dynamic scheduler utilizes the static information and available resources to assign or partition the application across the multi-core architecture. The objective of the dynamic scheduler is to maximize the throughput of the application, and it is sensitive to the resource (processing elements, scratch-pad memory, DMA bandwidth) constraints imposed by the target architecture. We evaluate the proposed approach by compiling and scheduling benchmark stream programs on a representative embedded multi-core processor. We present experimental results that evaluate the quality of the solutions generated by the proposed approach by comparisons with existing techniques.
ContributorsLee, Haeseung (Author) / Chatha, Karamvir (Thesis advisor) / Vrudhula, Sarma (Committee member) / Chakrabarti, Chaitali (Committee member) / Wu, Carole-Jean (Committee member) / Arizona State University (Publisher)
Created2013
Description
Users often join an online social networking (OSN) site, like Facebook, to remain social, by either staying connected with friends or expanding social networks. On an OSN site, users generally share variety of personal information which is often expected to be visible to their friends, but sometimes vulnerable to unwarranted access from others. The recent study suggests that many personal attributes, including religious and political affiliations, sexual orientation, relationship status, age, and gender, are predictable using users' personal data from an OSN site. The majority of users want to remain socially active, and protect their personal data at the same time. This tension leads to a user's vulnerability, allowing privacy attacks which can cause physical and emotional distress to a user, sometimes with dire consequences. For example, stalkers can make use of personal information available on an OSN site to their personal gain. This dissertation aims to systematically study a user vulnerability against such privacy attacks.
A user vulnerability can be managed in three steps: (1) identifying, (2) measuring and (3) reducing a user vulnerability. Researchers have long been identifying vulnerabilities arising from user's personal data, including user names, demographic attributes, lists of friends, wall posts and associated interactions, multimedia data such as photos, audios and videos, and tagging of friends. Hence, this research first proposes a way to measure and reduce a user vulnerability to protect such personal data. This dissertation also proposes an algorithm to minimize a user's vulnerability while maximizing their social utility values.
To address these vulnerability concerns, social networking sites like Facebook usually let their users to adjust their profile settings so as to make some of their data invisible. However, users sometimes interact with others using unprotected posts (e.g., posts from a ``Facebook page\footnote{The term ''Facebook page`` refers to the page which are commonly dedicated for businesses, brands and organizations to share their stories and connect with people.}''). Such interactions help users to become more social and are publicly accessible to everyone. Thus, visibilities of these interactions are beyond the control of their profile settings. I explore such unprotected interactions so that users' are well aware of these new vulnerabilities and adopt measures to mitigate them further. In particular, {\em are users' personal attributes predictable using only the unprotected interactions}? To answer this question, I address a novel problem of predictability of users' personal attributes with unprotected interactions. The extreme sparsity patterns in users' unprotected interactions pose a serious challenge. Therefore, I approach to mitigating the data sparsity challenge by designing a novel attribute prediction framework using only the unprotected interactions. Experimental results on Facebook dataset demonstrates that the proposed framework can predict users' personal attributes.
A user vulnerability can be managed in three steps: (1) identifying, (2) measuring and (3) reducing a user vulnerability. Researchers have long been identifying vulnerabilities arising from user's personal data, including user names, demographic attributes, lists of friends, wall posts and associated interactions, multimedia data such as photos, audios and videos, and tagging of friends. Hence, this research first proposes a way to measure and reduce a user vulnerability to protect such personal data. This dissertation also proposes an algorithm to minimize a user's vulnerability while maximizing their social utility values.
To address these vulnerability concerns, social networking sites like Facebook usually let their users to adjust their profile settings so as to make some of their data invisible. However, users sometimes interact with others using unprotected posts (e.g., posts from a ``Facebook page\footnote{The term ''Facebook page`` refers to the page which are commonly dedicated for businesses, brands and organizations to share their stories and connect with people.}''). Such interactions help users to become more social and are publicly accessible to everyone. Thus, visibilities of these interactions are beyond the control of their profile settings. I explore such unprotected interactions so that users' are well aware of these new vulnerabilities and adopt measures to mitigate them further. In particular, {\em are users' personal attributes predictable using only the unprotected interactions}? To answer this question, I address a novel problem of predictability of users' personal attributes with unprotected interactions. The extreme sparsity patterns in users' unprotected interactions pose a serious challenge. Therefore, I approach to mitigating the data sparsity challenge by designing a novel attribute prediction framework using only the unprotected interactions. Experimental results on Facebook dataset demonstrates that the proposed framework can predict users' personal attributes.
ContributorsGundecha, Pritam S (Author) / Liu, Huan (Thesis advisor) / Ahn, Gail-Joon (Committee member) / Ye, Jieping (Committee member) / Barbier, Geoffrey (Committee member) / Arizona State University (Publisher)
Created2015
Description
Static CMOS logic has remained the dominant design style of digital systems for
more than four decades due to its robustness and near zero standby current. Static
CMOS logic circuits consist of a network of combinational logic cells and clocked sequential
elements, such as latches and flip-flops that are used for sequencing computations
over time. The majority of the digital design techniques to reduce power, area, and
leakage over the past four decades have focused almost entirely on optimizing the
combinational logic. This work explores alternate architectures for the flip-flops for
improving the overall circuit performance, power and area. It consists of three main
sections.
First, is the design of a multi-input configurable flip-flop structure with embedded
logic. A conventional D-type flip-flop may be viewed as realizing an identity function,
in which the output is simply the value of the input sampled at the clock edge. In
contrast, the proposed multi-input flip-flop, named PNAND, can be configured to
realize one of a family of Boolean functions called threshold functions. In essence,
the PNAND is a circuit implementation of the well-known binary perceptron. Unlike
other reconfigurable circuits, a PNAND can be configured by simply changing the
assignment of signals to its inputs. Using a standard cell library of such gates, a technology
mapping algorithm can be applied to transform a given netlist into one with
an optimal mixture of conventional logic gates and threshold gates. This approach
was used to fabricate a 32-bit Wallace Tree multiplier and a 32-bit booth multiplier
in 65nm LP technology. Simulation and chip measurements show more than 30%
improvement in dynamic power and more than 20% reduction in core area.
The functional yield of the PNAND reduces with geometry and voltage scaling.
The second part of this research investigates the use of two mechanisms to improve
the robustness of the PNAND circuit architecture. One is the use of forward and reverse body biases to change the device threshold and the other is the use of RRAM
devices for low voltage operation.
The third part of this research focused on the design of flip-flops with non-volatile
storage. Spin-transfer torque magnetic tunnel junctions (STT-MTJ) are integrated
with both conventional D-flipflop and the PNAND circuits to implement non-volatile
logic (NVL). These non-volatile storage enhanced flip-flops are able to save the state of
system locally when a power interruption occurs. However, manufacturing variations
in the STT-MTJs and in the CMOS transistors significantly reduce the yield, leading
to an overly pessimistic design and consequently, higher energy consumption. A
detailed analysis of the design trade-offs in the driver circuitry for performing backup
and restore, and a novel method to design the energy optimal driver for a given yield is
presented. Efficient designs of two nonvolatile flip-flop (NVFF) circuits are presented,
in which the backup time is determined on a per-chip basis, resulting in minimizing
the energy wastage and satisfying the yield constraint. To achieve a yield of 98%,
the conventional approach would have to expend nearly 5X more energy than the
minimum required, whereas the proposed tunable approach expends only 26% more
energy than the minimum. A non-volatile threshold gate architecture NV-TLFF are
designed with the same backup and restore circuitry in 65nm technology. The embedded
logic in NV-TLFF compensates performance overhead of NVL. This leads to the
possibility of zero-overhead non-volatile datapath circuits. An 8-bit multiply-and-
accumulate (MAC) unit is designed to demonstrate the performance benefits of the
proposed architecture. Based on the results of HSPICE simulations, the MAC circuit
with the proposed NV-TLFF cells is shown to consume at least 20% less power and
area as compared to the circuit designed with conventional DFFs, without sacrificing
any performance.
more than four decades due to its robustness and near zero standby current. Static
CMOS logic circuits consist of a network of combinational logic cells and clocked sequential
elements, such as latches and flip-flops that are used for sequencing computations
over time. The majority of the digital design techniques to reduce power, area, and
leakage over the past four decades have focused almost entirely on optimizing the
combinational logic. This work explores alternate architectures for the flip-flops for
improving the overall circuit performance, power and area. It consists of three main
sections.
First, is the design of a multi-input configurable flip-flop structure with embedded
logic. A conventional D-type flip-flop may be viewed as realizing an identity function,
in which the output is simply the value of the input sampled at the clock edge. In
contrast, the proposed multi-input flip-flop, named PNAND, can be configured to
realize one of a family of Boolean functions called threshold functions. In essence,
the PNAND is a circuit implementation of the well-known binary perceptron. Unlike
other reconfigurable circuits, a PNAND can be configured by simply changing the
assignment of signals to its inputs. Using a standard cell library of such gates, a technology
mapping algorithm can be applied to transform a given netlist into one with
an optimal mixture of conventional logic gates and threshold gates. This approach
was used to fabricate a 32-bit Wallace Tree multiplier and a 32-bit booth multiplier
in 65nm LP technology. Simulation and chip measurements show more than 30%
improvement in dynamic power and more than 20% reduction in core area.
The functional yield of the PNAND reduces with geometry and voltage scaling.
The second part of this research investigates the use of two mechanisms to improve
the robustness of the PNAND circuit architecture. One is the use of forward and reverse body biases to change the device threshold and the other is the use of RRAM
devices for low voltage operation.
The third part of this research focused on the design of flip-flops with non-volatile
storage. Spin-transfer torque magnetic tunnel junctions (STT-MTJ) are integrated
with both conventional D-flipflop and the PNAND circuits to implement non-volatile
logic (NVL). These non-volatile storage enhanced flip-flops are able to save the state of
system locally when a power interruption occurs. However, manufacturing variations
in the STT-MTJs and in the CMOS transistors significantly reduce the yield, leading
to an overly pessimistic design and consequently, higher energy consumption. A
detailed analysis of the design trade-offs in the driver circuitry for performing backup
and restore, and a novel method to design the energy optimal driver for a given yield is
presented. Efficient designs of two nonvolatile flip-flop (NVFF) circuits are presented,
in which the backup time is determined on a per-chip basis, resulting in minimizing
the energy wastage and satisfying the yield constraint. To achieve a yield of 98%,
the conventional approach would have to expend nearly 5X more energy than the
minimum required, whereas the proposed tunable approach expends only 26% more
energy than the minimum. A non-volatile threshold gate architecture NV-TLFF are
designed with the same backup and restore circuitry in 65nm technology. The embedded
logic in NV-TLFF compensates performance overhead of NVL. This leads to the
possibility of zero-overhead non-volatile datapath circuits. An 8-bit multiply-and-
accumulate (MAC) unit is designed to demonstrate the performance benefits of the
proposed architecture. Based on the results of HSPICE simulations, the MAC circuit
with the proposed NV-TLFF cells is shown to consume at least 20% less power and
area as compared to the circuit designed with conventional DFFs, without sacrificing
any performance.
ContributorsYang, Jinghua (Author) / Vrudhula, Sarma (Thesis advisor) / Barnaby, Hugh (Committee member) / Cao, Yu (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created2018
Description
Reasoning about the activities of cyber threat actors is critical to defend against cyber
attacks. However, this task is difficult for a variety of reasons. In simple terms, it is difficult
to determine who the attacker is, what the desired goals are of the attacker, and how they will
carry out their attacks. These three questions essentially entail understanding the attacker’s
use of deception, the capabilities available, and the intent of launching the attack. These
three issues are highly inter-related. If an adversary can hide their intent, they can better
deceive a defender. If an adversary’s capabilities are not well understood, then determining
what their goals are becomes difficult as the defender is uncertain if they have the necessary
tools to accomplish them. However, the understanding of these aspects are also mutually
supportive. If we have a clear picture of capabilities, intent can better be deciphered. If we
understand intent and capabilities, a defender may be able to see through deception schemes.
In this dissertation, I present three pieces of work to tackle these questions to obtain
a better understanding of cyber threats. First, we introduce a new reasoning framework
to address deception. We evaluate the framework by building a dataset from DEFCON
capture-the-flag exercise to identify the person or group responsible for a cyber attack.
We demonstrate that the framework not only handles cases of deception but also provides
transparent decision making in identifying the threat actor. The second task uses a cognitive
learning model to determine the intent – goals of the threat actor on the target system.
The third task looks at understanding the capabilities of threat actors to target systems by
identifying at-risk systems from hacker discussions on darkweb websites. To achieve this
task we gather discussions from more than 300 darkweb websites relating to malicious
hacking.
attacks. However, this task is difficult for a variety of reasons. In simple terms, it is difficult
to determine who the attacker is, what the desired goals are of the attacker, and how they will
carry out their attacks. These three questions essentially entail understanding the attacker’s
use of deception, the capabilities available, and the intent of launching the attack. These
three issues are highly inter-related. If an adversary can hide their intent, they can better
deceive a defender. If an adversary’s capabilities are not well understood, then determining
what their goals are becomes difficult as the defender is uncertain if they have the necessary
tools to accomplish them. However, the understanding of these aspects are also mutually
supportive. If we have a clear picture of capabilities, intent can better be deciphered. If we
understand intent and capabilities, a defender may be able to see through deception schemes.
In this dissertation, I present three pieces of work to tackle these questions to obtain
a better understanding of cyber threats. First, we introduce a new reasoning framework
to address deception. We evaluate the framework by building a dataset from DEFCON
capture-the-flag exercise to identify the person or group responsible for a cyber attack.
We demonstrate that the framework not only handles cases of deception but also provides
transparent decision making in identifying the threat actor. The second task uses a cognitive
learning model to determine the intent – goals of the threat actor on the target system.
The third task looks at understanding the capabilities of threat actors to target systems by
identifying at-risk systems from hacker discussions on darkweb websites. To achieve this
task we gather discussions from more than 300 darkweb websites relating to malicious
hacking.
ContributorsNunes, Eric (Author) / Shakarian, Paulo (Thesis advisor) / Ahn, Gail-Joon (Committee member) / Baral, Chitta (Committee member) / Cooke, Nancy J. (Committee member) / Arizona State University (Publisher)
Created2018
Description
The rapid improvement in computation capability has made deep convolutional neural networks (CNNs) a great success in recent years on many computer vision tasks with significantly improved accuracy. During the inference phase, many applications demand low latency processing of one image with strict power consumption requirement, which reduces the efficiency of GPU and other general-purpose platform, bringing opportunities for specific acceleration hardware, e.g. FPGA, by customizing the digital circuit specific for the deep learning algorithm inference. However, deploying CNNs on portable and embedded systems is still challenging due to large data volume, intensive computation, varying algorithm structures, and frequent memory accesses. This dissertation proposes a complete design methodology and framework to accelerate the inference process of various CNN algorithms on FPGA hardware with high performance, efficiency and flexibility.
As convolution contributes most operations in CNNs, the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply and accumulate (MAC) operations with four levels of loops. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g. memory access) of the CNN accelerator based on multiple design variables. An efficient dataflow and hardware architecture of CNN acceleration are proposed to minimize the data communication while maximizing the resource utilization to achieve high performance.
Although great performance and efficiency can be achieved by customizing the FPGA hardware for each CNN model, significant efforts and expertise are required leading to long development time, which makes it difficult to catch up with the rapid development of CNN algorithms. In this work, we present an RTL-level CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, in order to enable high-level fast prototyping of CNNs from software to FPGA and still keep the benefits of low-level hardware optimization. First, a general-purpose library of RTL modules is developed to model different operations at each layer. The integration and dataflow of physical modules are predefined in the top-level system template and reconfigured during compilation for a given CNN algorithm. The runtime control of layer-by-layer sequential computation is managed by the proposed execution schedule so that even highly irregular and complex network topology, e.g. GoogLeNet and ResNet, can be compiled. The proposed methodology is demonstrated with various CNN algorithms, e.g. NiN, VGG, GoogLeNet and ResNet, on two different standalone FPGAs achieving state-of-the art performance.
Based on the optimized acceleration strategy, there are still a lot of design options, e.g. the degree and dimension of computation parallelism, the size of on-chip buffers, and the external memory bandwidth, which impact the utilization of computation resources and data communication efficiency, and finally affect the performance and energy consumption of the accelerator. The large design space of the accelerator makes it impractical to explore the optimal design choice during the real implementation phase. Therefore, a performance model is proposed in this work to quantitatively estimate the accelerator performance and resource utilization. By this means, the performance bottleneck and design bound can be identified and the optimal design option can be explored early in the design phase.
As convolution contributes most operations in CNNs, the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply and accumulate (MAC) operations with four levels of loops. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g. memory access) of the CNN accelerator based on multiple design variables. An efficient dataflow and hardware architecture of CNN acceleration are proposed to minimize the data communication while maximizing the resource utilization to achieve high performance.
Although great performance and efficiency can be achieved by customizing the FPGA hardware for each CNN model, significant efforts and expertise are required leading to long development time, which makes it difficult to catch up with the rapid development of CNN algorithms. In this work, we present an RTL-level CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, in order to enable high-level fast prototyping of CNNs from software to FPGA and still keep the benefits of low-level hardware optimization. First, a general-purpose library of RTL modules is developed to model different operations at each layer. The integration and dataflow of physical modules are predefined in the top-level system template and reconfigured during compilation for a given CNN algorithm. The runtime control of layer-by-layer sequential computation is managed by the proposed execution schedule so that even highly irregular and complex network topology, e.g. GoogLeNet and ResNet, can be compiled. The proposed methodology is demonstrated with various CNN algorithms, e.g. NiN, VGG, GoogLeNet and ResNet, on two different standalone FPGAs achieving state-of-the art performance.
Based on the optimized acceleration strategy, there are still a lot of design options, e.g. the degree and dimension of computation parallelism, the size of on-chip buffers, and the external memory bandwidth, which impact the utilization of computation resources and data communication efficiency, and finally affect the performance and energy consumption of the accelerator. The large design space of the accelerator makes it impractical to explore the optimal design choice during the real implementation phase. Therefore, a performance model is proposed in this work to quantitatively estimate the accelerator performance and resource utilization. By this means, the performance bottleneck and design bound can be identified and the optimal design option can be explored early in the design phase.
ContributorsMa, Yufei (Author) / Vrudhula, Sarma (Thesis advisor) / Seo, Jae-Sun (Thesis advisor) / Cao, Yu (Committee member) / Barnaby, Hugh (Committee member) / Arizona State University (Publisher)
Created2018
Description
With the software-defined networking trend growing, several network virtualization controllers have been developed in recent years. These controllers, also called network hypervisors, attempt to manage physical SDN based networks so that multiple tenants can safely share the same forwarding plane hardware without risk of being affected by or affecting other tenants. However, many areas remain unexplored by current network hypervisor implementations. This thesis presents and evaluates some of the features offered by network hypervisors, such as full header space availability, isolation, and transparent traffic forwarding capabilities for tenants. Flow setup time and throughput are also measured and compared among different network hypervisors. Three different network hypervisors are evaluated: FlowVisor, VeRTIGO and OpenVirteX. These virtualization tools are assessed with experiments conducted on three different testbeds: an emulated Mininet scenario, a physical single-switch testbed, and also a remote GENI testbed. The results indicate that network hypervisors bring SDN flexibility to network virtualization, making it easier for network administrators to define with precision how the network is sliced and divided among tenants. This increased flexibility, however, may come with the cost of decreased performance, and also brings additional risks of interoperability due to a lack of standardization of virtualization methods.
ContributorsStall Rechia, Felipe (Author) / Syrotiuk, Violet R. (Thesis advisor) / Ahn, Gail-Joon (Committee member) / Huang, Dijiang (Committee member) / Arizona State University (Publisher)
Created2016