Search Content

Topics in power and performance optimization of embedded systems

Description

The ubiquity of embedded computational systems has exploded in recent years impacting everything from hand-held computers and automotive driver assistance to battlefield command and control and autonomous systems. Typical embedded computing systems are characterized by highly resource constrained operating environments. In particular, limited energy resources constrain performance in embedded systems…

The ubiquity of embedded computational systems has exploded in recent years impacting everything from hand-held computers and automotive driver assistance to battlefield command and control and autonomous systems. Typical embedded computing systems are characterized by highly resource constrained operating environments. In particular, limited energy resources constrain performance in embedded systems often reliant on independent fuel or battery supplies. Ultimately, mitigating energy consumption without sacrificing performance in these systems is paramount. In this work power/performance optimization emphasizing prevailing data centric applications including video and signal processing is addressed for energy constrained embedded systems. Frameworks are presented which exchange quality of service (QoS) for reduced power consumption enabling power aware energy management. Power aware systems provide users with tools for precisely managing available energy resources in light of user priorities, extending availability when QoS can be sacrificed. Specifically, power aware management tools for next generation bistable electrophoretic displays and the state of the art H.264 video codec are introduced. The multiprocessor system on chip (MPSoC) paradigm is examined in the context of next generation many-core hand-held computing devices. MPSoC architectures promise to breach the power/performance wall prohibiting advancement of complex high performance single core architectures. Several many-core distributed memory MPSoC architectures are commercially available, while the tools necessary to effectively tap their enormous potential remain largely open for discovery. Adaptable scalability in many-core systems is addressed through a scalable high performance multicore H.264 video decoder implemented on the representative Cell Broadband Engine (CBE) architecture. The resulting agile performance scalable system enables efficient adaptive power optimization via decoding-rate driven sleep and voltage/frequency state management. The significant problem of mapping applications onto these architectures is additionally addressed from the perspective of instruction mapping for limited distributed memory architectures with a code overlay generator implemented on the CBE. Finally runtime scheduling and mapping of scalable applications in multitasking environments is addressed through the introduction of a lightweight work partitioning framework targeting streaming applications with low latency and near optimal throughput demonstrated on the CBE.

ContributorsBaker, Michael (Author) / Chatha, Karam S. (Thesis advisor) / Raupp, Gregory B. (Committee member) / Vrudhula, Sarma B. K. (Committee member) / Shrivastava, Aviral (Committee member) / Arizona State University (Publisher)

Created2011

Modeling and implementation of threshold logic circuits and architectures

Description

Threshold logic has long been studied as a means of achieving higher performance and lower power dissipation, providing improvements by condensing simple logic gates into more complex primitives, effectively reducing gate count, pipeline depth, and number of interconnects. This work proposes a new physical implementation of threshold logic, the threshold…

Threshold logic has long been studied as a means of achieving higher performance and lower power dissipation, providing improvements by condensing simple logic gates into more complex primitives, effectively reducing gate count, pipeline depth, and number of interconnects. This work proposes a new physical implementation of threshold logic, the threshold logic latch (TLL), which overcomes the difficulties observed in previous work, particularly with respect to gate reliability in the presence of noise and process variations. Simple but effective models were created to assess the delay, power, and noise margin of TLL gates for the purpose of determining the physical parameters and assignment of input signals that achieves the lowest delay subject to constraints on power and reliability. From these models, an optimized library of standard TLL cells was developed to supplement a commercial library of static CMOS gates. The new cells were then demonstrated on a number of automatically synthesized, placed, and routed designs. A two-stage 2's complement integer multiplier designed with CMOS and TLL gates utilized 19.5% less area, 28.0% less active power, and 61.5% less leakage power than an equivalent design with the same performance using only static CMOS gates. Additionally, a two-stage 32-instruction 4-way issue queue designed with CMOS and TLL gates utilized 30.6% less area, 31.0% less active power, and 58.9% less leakage power than an equivalent design with the same performance using only static CMOS gates.

ContributorsLeshner, Samuel (Author) / Vrudhula, Sarma (Thesis advisor) / Chatha, Karamvir (Committee member) / Clark, Lawrence (Committee member) / Shrivastava, Aviral (Committee member) / Arizona State University (Publisher)

Created2010

Self-aware, Adaptive General-purpose Computing Systems

Description

With the breakdown of Dennard scaling, computer architects can no longer rely on integrated circuit energy efficiency to scale with transistor density, and must under-clock or power-gate parts of their designs in order to fit within given power budgets. Hardware accelerators may improve energy efficiency of some compute-intensive tasks, but…

With the breakdown of Dennard scaling, computer architects can no longer rely on integrated circuit energy efficiency to scale with transistor density, and must under-clock or power-gate parts of their designs in order to fit within given power budgets. Hardware accelerators may improve energy efficiency of some compute-intensive tasks, but as more tasks are accelerated, the general-purpose portions of workloads account for a larger share of execution time while also leaving less instruction, data, or task-level parallelism to exploit. Adaptive computing systems have potential to address these challenges by modifying their behavior at runtime. Adaptation requires runtime decision-making, which can be performed both in hardware and software. While software-based decision-making is more flexible and can execute higher complexity operations compared to hardware, it also incurs a significant latency and power overhead. Hardware designs are more limited in the space of decisions they can make, but have direct access to their own internal microarchitectural states and can make faster decisions, allowing for better-informed adaptation and extracting previously unobtainable performance and security benefits. In this dissertation I study (i) the viability and trade-offs of general-purpose adaptive systems, (ii) the difficulty and complexity of making adaptation decisions, and (iii) how time spent in the observation-analysis-adaptation cycle affects adaptation benefits. I introduce techniques for (a) modeling and understanding high performance computing systems and microarchitecture, (b) enabling hardware learning and decision-making through low-latency networks, and (c) on securing hardware designs using runtime decision-making. I propose an always-awake and active learning `hardware nervous system' pervasive throughout the chip that can reason about the individual hardware module performance, energy usage, and security. I present the design and implementation of (1) a reference architecture and (2) a microarchitecture-aware static binary instrumentation tool. Finally, I provide results showing (1) that runtime adaptation is a necessary to continue improving performance on general-purpose tasks, (2) that significant performance loss and performance variation happens under the ISA-level, and is unobservable without hardware support, and (3) that hardware must possess decision-making and ‘self-awareness’ capabilities at the microarchitecture level in order to efficiently use its own faculties.

ContributorsIsakov, Mihailo (Author) / Kinsy, Michel (Thesis advisor) / Shrivastava, Aviral (Committee member) / Rudd, Kevin (Committee member) / Gadepally, Vijay (Committee member) / Arizona State University (Publisher)

Created2022

Eleatic: Secure Architecture Across the Edge-to-Cloud Continuum

Description

Many companies face pressure to deploy flexible compute infrastructures to manage their operations. However, the current developments in cloud and edge computing have created a data processing asymmetry challenge. On the edge, workloads frequently require low-latency responses, contend with connectivity and bandwidth instabilities, may require privacy guarantees, and may perform…

Many companies face pressure to deploy flexible compute infrastructures to manage their operations. However, the current developments in cloud and edge computing have created a data processing asymmetry challenge. On the edge, workloads frequently require low-latency responses, contend with connectivity and bandwidth instabilities, may require privacy guarantees, and may perform under limited or high-variance compute resources. In the cloud, workloads tolerate longer latency, expect highly available infrastructure, access high-performance compute resources, and have more power available, but may be further from where the processing results are needed. This compute asymmetry challenge requires a new computational paradigm. In this work, I advance a new computing architecture model, called the Continuum Computing Architecture (CCA), and validate this model with a candidate architecture. CCA is a unifying edge-fog-cloud computing model that provides the following capabilities: (i) a continuum of compute that spans from network-connected edge devices to the cloud – with very low power consumption to high-performance compute; (ii) same architecture with different micro-architectures along this compute continuum – a single RISC-V instruction set architecture with reconfigurable processing units; (iii) portability across all scales – the same program can be run across the continuum with different latencies and power utilizations; and (iv) secure shared memory features are fully-supported – physical memories along the continuum are abstracted to allow edge and cloud to share data in a transparent fashion. The validating architecture has three micro-architectures. The edge micro-architecture, Parmenides, targets accelerator-based edge processing system-on-chips (SoCs). Parmenides includes security features to protect the SoC in uncontrolled environments while adapting its power usage and processing to ambient events. The fog and cloud micro-architectures, Melissus and Zeno, must support application data distribution across the memory of many compute nodes to achieve the desired scale and performance. As a solution, I introduce the Eleatic Memory Model (EMM): a global shared memory architecture with hardware-supported global memory access permissions. All memory accesses are made with a Namespace-based capability scheme that supports improved scalability and memory security. The CCA model addresses several memory-centric security challenges including the misuse of resources, risk to application and data integrity, as well as concerns over authorization and confidentiality.

ContributorsEhret, Alan (Author) / Kinsy, Michel A (Thesis advisor) / Vrudhula, Sarma (Committee member) / Shrivastava, Aviral (Committee member) / Rudd, Kevin (Committee member) / Gettings, Karen (Committee member) / Arizona State University (Publisher)

Created2022

FPGA-Based Edge-Computing Acceleration

Description

The rapid growth of Internet-of-things (IoT) and artificial intelligence applications have called forth a new computing paradigm--edge computing. Edge computing applications, such as video surveillance, autonomous driving, and augmented reality, are highly computationally intensive and require real-time processing. Current edge systems are typically based on commodity general-purpose hardware such as…

The rapid growth of Internet-of-things (IoT) and artificial intelligence applications have called forth a new computing paradigm--edge computing. Edge computing applications, such as video surveillance, autonomous driving, and augmented reality, are highly computationally intensive and require real-time processing. Current edge systems are typically based on commodity general-purpose hardware such as Central Processing Units (CPUs) and Graphical Processing Units (GPUs) , which are mainly designed for large, non-time-sensitive jobs in the cloud and do not match the needs of the edge workloads. Also, these systems are usually power hungry and are not suitable for resource-constrained edge deployments. Such application-hardware mismatch calls forth a new computing backbone to support the high-bandwidth, low-latency, and energy-efficient requirements. Also, the new system should be able to support a variety of edge applications with different characteristics. This thesis addresses the above challenges by studying the use of Field Programmable Gate Array (FPGA) -based computing systems for accelerating the edge workloads, from three critical angles. First, it investigates the feasibility of FPGAs for edge computing, in comparison to conventional CPUs and GPUs. Second, it studies the acceleration of common algorithmic characteristics, identified as loop patterns, using FPGAs, and develops a benchmark tool for analyzing the performance of these patterns on different accelerators. Third, it designs a new edge computing platform using multiple clustered FPGAs to provide high-bandwidth and low-latency acceleration of convolutional neural networks (CNNs) widely used in edge applications. Finally, it studies the acceleration of the emerging neural networks, randomly-wired neural networks, on the multi-FPGA platform. The experimental results from this work show that the new generation of workloads requires rethinking the current edge-computing architecture. First, through the acceleration of common loops, it demonstrates that FPGAs can outperform GPUs in specific loops types up to 14 times. Second, it shows the linear scalability of multi-FPGA platforms in accelerating neural networks. Third, it demonstrates the superiority of the new scheduler to optimally place randomly-wired neural networks on multi-FPGA platforms with 81.1 times better throughput than the available scheduling mechanisms.

ContributorsBiookaghazadeh, Saman (Author) / Zhao, Ming (Thesis advisor) / Ren, Fengbo (Thesis advisor) / Li, Baoxin (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created2021

A Methodology and Formalism to Handle Timing Uncertainties in Cyber-Physical Systems

Description

Uncertainty is intrinsic in Cyber-Physical Systems since they interact with human and work with both analog and digital worlds. Since even minute deviation from the real values can make catastrophe in a safety-critical application, considering uncertainties in CPS behavior is essential. On the other side, time is a…

Uncertainty is intrinsic in Cyber-Physical Systems since they interact with human and work with both analog and digital worlds. Since even minute deviation from the real values can make catastrophe in a safety-critical application, considering uncertainties in CPS behavior is essential. On the other side, time is a foundational aspect of Cyber-Physical Systems (CPS). Correct timing of system events is critical to optimize responsiveness to the environment, in terms of timeliness, accuracy, and precision in the knowledge, measurement, prediction, and control of CPS behavior. In order to design a more resilient and reliable CPS, first and foremost, there should be a way to specify the timing constraints that a constructed Cyber-Physical System must meet with considering existing uncertainties. Only then, we can seek systematic approaches to check if all timing constraints are being met, and develop correct-by-construction methodologies. In this regard, Timestamp Temporal Logic (TTL) is developed to specify the timing constraints on a distributed CPS. By TTL designers can specify the timing requirements that a CPS must satisfy in a succinct and intuitive manner and express the tolerable error as a part of the language. The proposed deduction system on TTL (TTL reasoning system) gives the ability to check the consistency among expresses system specifications and simplify them to be implemented on FPGA for run-time verification. Regarding CPS run-time verification, Timestamp-based Monitoring Approach(TMA) has been designed that can hook up to a CPS and take its timing specifications in TTL and verify if the timing constraints are being met with considering existing uncertainties in the system. TMA does not need to compute whether the constraint is being met at each and every instance of time but it re-evaluates constraint only when there is an event that can affect the outcome. This enables it to perform online timing monitoring of CPS for less computation and resources. Furthermore, the minimum design parameters of the timing CPS that are required to enable testing the timing of CPS are defined in this dissertation

ContributorsMehrabian, Mohammadreza (Author) / Shrivastava, Aviral (Thesis advisor) / Ren, Fengbo (Committee member) / Sarjoughian, Hessam (Committee member) / Derler, Patricia (Committee member) / Arizona State University (Publisher)

Created2021

Reduced Order Models and Approximations for Hardware Acceleration of Neural Networks

Description

Many real-world engineering problems require simulations to evaluate the design objectives and constraints. Often, due to the complexity of the system model, simulations can be prohibitive in terms of computation time. One approach to overcome this issue is to construct a surrogate model, which approximates the original model. The focus…

Many real-world engineering problems require simulations to evaluate the design objectives and constraints. Often, due to the complexity of the system model, simulations can be prohibitive in terms of computation time. One approach to overcome this issue is to construct a surrogate model, which approximates the original model. The focus of this work is on the data-driven surrogate models, in which empirical approximations of the output are performed given the input parameters. Recently neural networks (NN) have re-emerged as a popular method for constructing data-driven surrogate models. Although, NNs have achieved excellent accuracy and are widely used, they pose their own challenges. This work addresses two common challenges, the need for: (1) hardware acceleration and (2) uncertainty quantification (UQ) in the presence of input variability. The high demand in the inference phase of deep NNs in cloud servers/edge devices calls for the design of low power custom hardware accelerators. The first part of this work describes the design of an energy-efficient long short-term memory (LSTM) accelerator. The overarching goal is to aggressively reduce the power consumption and area of the LSTM components using approximate computing, and then use architectural level techniques to boost the performance. The proposed design is synthesized and placed and routed as an application-specific integrated circuit (ASIC). The results demonstrate that this accelerator is 1.2X and 3.6X more energy-efficient and area-efficient than the baseline LSTM. In the second part of this work, a robust framework is developed based on an alternate data-driven surrogate model referred to as polynomial chaos expansion (PCE) for addressing UQ. In contrast to many existing approaches, no assumptions are made on the elements of the function space and UQ is a function of the expansion coefficients. Moreover, the sensitivity of the output with respect to any subset of the input variables can be computed analytically by post-processing the PCE coefficients. This provides a systematic and incremental method to pruning or changing the order of the model. This framework is evaluated on several real-world applications from different domains and is extended for classification tasks as well.

ContributorsAzari, Elham (Author) / Vrudhula, Sarma (Thesis advisor) / Fainekos, Georgios (Committee member) / Ren, Fengbo (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)

Created2021

Compiler Design for Accelerating Applications on Coarse-Grained Reconfigurable Architectures

Description

Coarse-Grained Reconfigurable Arrays (CGRAs) are emerging accelerators that promise low-power acceleration of compute-intensive loops in applications. The acceleration achieved by CGRA relies on the efficient mapping of the compute-intensive loops by the CGRA compiler onto the CGRA. The CGRA mapping problem, being NP-complete, is performed in a two-step process, scheduling,…

Coarse-Grained Reconfigurable Arrays (CGRAs) are emerging accelerators that promise low-power acceleration of compute-intensive loops in applications. The acceleration achieved by CGRA relies on the efficient mapping of the compute-intensive loops by the CGRA compiler onto the CGRA. The CGRA mapping problem, being NP-complete, is performed in a two-step process, scheduling, and mapping. The scheduling algorithm allocates timeslots to the nodes of the DFG, and the mapping algorithm maps the scheduled nodes onto the PEs of the CGRA. On a mapping failure, the initiation interval (II) is increased, and a new schedule is obtained for the increased II. Most previous mapping techniques use the Iterative Modulo Scheduling algorithm (IMS) to find a schedule for a given II. Since IMS generates a resource-constrained ASAP (as-soon-as-possible) scheduling, even with increased II, it tends to generate a similar schedule that is not mappable and does not explore the schedule space effectively. The problems encountered by IMS-based scheduling algorithms are explored and an improved randomized scheduling algorithm for scheduling of the application loop to be accelerated is proposed. When encountering a mapping failure for a given schedule, existing mapping algorithms either exit and retry the mapping anew, or recursively remove the previously mapped node to find a valid mapping (backtrack).Abandoning the mapping is extreme, but even backtracking may not be the best choice, since the root of the problem may not be the previous node. The challenges in existing algorithms are systematically analyzed and a failure-aware mapping algorithm is presented. The loops in general-purpose applications are often complicated loops, i.e., loops with perfect and imperfect nests and loops with nested if-then-else's (conditionals). The existing hardware-software solutions to execute branches and conditions are inefficient. A co-design approach that efficiently executes complicated loops on CGRA is proposed. The compiler transforms complex loops, maps them to the CGRA, and lays them out in the memory in a specific manner, such that the hardware can fetch and execute the instructions from the right path at runtime. Finally, a CGRA compilation simulator open-source framework is presented. This open-source CGRA simulation framework is based on LLVM and gem5 to extract the loop, map them onto the CGRA architecture, and execute them as a co-processor to an ARM CPU.

ContributorsBalasubramanian, Mahesh (Author) / Shrivastava, Aviral (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Ren, Fengbo (Committee member) / Pozzi, Laura (Committee member) / Arizona State University (Publisher)

Created2021

Differential Privacy Protection via Inexact Data Cloning

Description

With the advent of new advanced analysis tools and access to related published data, it is getting more difficult for data owners to suppress private information from published data while still providing useful information. This dual problem of providing useful, accurate information and protecting it at the same time has…

With the advent of new advanced analysis tools and access to related published data, it is getting more difficult for data owners to suppress private information from published data while still providing useful information. This dual problem of providing useful, accurate information and protecting it at the same time has been challenging, especially in healthcare. The data owners lack an automated resource that provides layers of protection on a published dataset with validated statistical values for usability. Differential privacy (DP) has gained a lot of attention in the past few years as a solution to the above-mentioned dual problem. DP is defined as a statistical anonymity model that can protect the data from adversarial observation while still providing intended usage. This dissertation introduces a novel DP protection mechanism called Inexact Data Cloning (IDC), which simultaneously protects and preserves information in published data while conveying source data intent. IDC preserves the privacy of the records by converting the raw data records into clonesets. The clonesets then pass through a classifier that removes potential compromising clonesets, filtering only good inexact cloneset. The mechanism of IDC is dependent on a set of privacy protection metrics called differential privacy protection metrics (DPPM), which represents the overall protection level. IDC uses two novel performance values, differential privacy protection score (DPPS) and clone classifier selection percentage (CCSP), to estimate the privacy level of protected data. In support of using IDC as a viable data security product, a software tool chain prototype, differential privacy protection architecture (DPPA), was developed to utilize the IDC. DPPA used the engineering security mechanism of IDC. DPPA is a hub which facilitates a market for data DP security mechanisms. DPPA works by incorporating standalone IDC mechanisms and provides automation, IDC protected published datasets and statistically verified IDC dataset diagnostic report. DPPA is currently doing functional, and operational benchmark processes that quantifies the DP protection of a given published dataset. The DPPA tool was recently used to test a couple of health datasets. The test results further validate the IDC mechanism as being feasible.

Contributorsthomas, zelpha (Author) / Bliss, Daniel W (Thesis advisor) / Papandreou-Suppappola, Antonia (Committee member) / Banerjee, Ayan (Committee member) / Shrivastava, Aviral (Committee member) / Arizona State University (Publisher)

Created2023

Reconfigurable High-Performance Computing of Sparse Linear Algebra

Description

This thesis presents novel software/hardware co-design methodologies aimed at accelerating sparse linear algebra applications within the realm of High-Performance Computing (HPC). The motivation stems from the limitations of conventional CPU- and GPU-based solutions for sparse linear algebra, which are hindered by fixed hardware architecture and memory hierarchy, frequent off-chip memory…

This thesis presents novel software/hardware co-design methodologies aimed at accelerating sparse linear algebra applications within the realm of High-Performance Computing (HPC). The motivation stems from the limitations of conventional CPU- and GPU-based solutions for sparse linear algebra, which are hindered by fixed hardware architecture and memory hierarchy, frequent off-chip memory access, and high energy consumption. In response, this work explores the deployment of Field-Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs) to overcome these challenges through their customized nature, offering performance and energy efficiency gains. The scope of the thesis is divided into three main parts: firstly, it introduces a framework that combines an FPGA computational kernel with a novel scheduling algorithm running on a host processor for accelerating the supernodal multifrontal algorithm for sparse Cholesky factorization. This approach minimizes off-chip memory access and on-chip memory requirements by efficiently managing data dependencies and enhancing data locality. Secondly, it presents FSpGEMM, an OpenCL-based framework for accelerating general sparse matrix-matrix multiplication on FPGAs. FSpGEMM exploits a new compressed sparse vector format (CSV) and a custom buffering scheme tailored to Gustavson's algorithm, significantly improving computational performance by optimizing memory access patterns. Additionally, a row reordering technique is utilized to increase the data reuse enabled by the CSV format. Lastly, the thesis proposes an ASIC design for Sparse Tensor Core, which utilizes a Hardware Merge Sorter to increase parallelism in processing units without compromising operating frequency, offering a high-speed solution for sparse linear algebra operations. In summary, the thesis addresses the challenges of implementing sparse linear algebra algorithms on FPGAs and ASICs, such as the complexity of data dependencies and the need for efficient memory management. By proposing solutions that enhance computational performance, reduce energy consumption, and improve the usability of FPGAs and ASICs in HPC infrastructures, this work contributes to computational science, offering a pathway toward more efficient and sustainable computing for complex, data-intensive applications.

ContributorsBank Tavakoli, Erfan (Author) / Ren, Fengbo (Thesis advisor) / Arizona State University (Publisher)

Created2024

Filtering by