Matching Items (23)
Filtering by

Clear all filters

151527-Thumbnail Image.png
Description
Rapid technology scaling, the main driver of the power and performance improvements of computing solutions, has also rendered our computing systems extremely susceptible to transient errors called soft errors. Among the arsenal of techniques to protect computation from soft errors, Control Flow Checking (CFC) based techniques have gained a reputation

Rapid technology scaling, the main driver of the power and performance improvements of computing solutions, has also rendered our computing systems extremely susceptible to transient errors called soft errors. Among the arsenal of techniques to protect computation from soft errors, Control Flow Checking (CFC) based techniques have gained a reputation of effective, yet low-cost protection mechanism. The basic idea is that, there is a high probability that a soft-fault in program execution will eventually alter the control flow of the program. Therefore just by making sure that the control flow of the program is correct, significant protection can be achieved. More than a dozen techniques for CFC have been developed over the last several decades, ranging from hardware techniques, software techniques, and hardware-software hybrid techniques as well. Our analysis shows that existing CFC techniques are not only ineffective in protecting from soft errors, but cause additional power and performance overheads. For this analysis, we develop and validate a simulation based experimental setup to accurately and quantitatively estimate the architectural vulnerability of a program execution on a processor micro-architecture. We model the protection achieved by various state-of-the-art CFC techniques in this quantitative vulnerability estimation setup, and find out that software only CFC protection schemes (CFCSS, CFCSS+NA, CEDA) increase system vulnerability by 18% to 21% with 17% to 38% performance overhead. Hybrid CFC protection (CFEDC) increases vulnerability by 5%, while the vulnerability remains almost the same for hardware only CFC protection (CFCET); notwithstanding the hardware overheads of design cost, area, and power incurred in the hardware modifications required for their implementations.
ContributorsRhisheekesan, Abhishek (Author) / Shrivastava, Aviral (Thesis advisor) / Colbourn, Charles Joseph (Committee member) / Wu, Carole-Jean (Committee member) / Arizona State University (Publisher)
Created2013
151851-Thumbnail Image.png
Description
In this thesis we deal with the problem of temporal logic robustness estimation. We present a dynamic programming algorithm for the robust estimation problem of Metric Temporal Logic (MTL) formulas regarding a finite trace of time stated sequence. This algorithm not only tests if the MTL specification is satisfied by

In this thesis we deal with the problem of temporal logic robustness estimation. We present a dynamic programming algorithm for the robust estimation problem of Metric Temporal Logic (MTL) formulas regarding a finite trace of time stated sequence. This algorithm not only tests if the MTL specification is satisfied by the given input which is a finite system trajectory, but also quantifies to what extend does the sequence satisfies or violates the MTL specification. The implementation of the algorithm is the DP-TALIRO toolbox for MATLAB. Currently it is used as the temporal logic robust computing engine of S-TALIRO which is a tool for MATLAB searching for trajectories of minimal robustness in Simulink/ Stateflow. DP-TALIRO is expected to have near linear running time and constant memory requirement depending on the structure of the MTL formula. DP-TALIRO toolbox also integrates new features not supported in its ancestor FW-TALIRO such as parameter replacement, most related iteration and most related predicate. A derivative of DP-TALIRO which is DP-T-TALIRO is also addressed in this thesis which applies dynamic programming algorithm for time robustness computation. We test the running time of DP-TALIRO and compare it with FW-TALIRO. Finally, we present an application where DP-TALIRO is used as the robustness computation core of S-TALIRO for a parameter estimation problem.
ContributorsYang, Hengyi (Author) / Fainekos, Georgios (Thesis advisor) / Sarjoughian, Hessam S. (Committee member) / Shrivastava, Aviral (Committee member) / Arizona State University (Publisher)
Created2013
152778-Thumbnail Image.png
Description
Software has a great impact on the energy efficiency of any computing system--it can manage the components of a system efficiently or inefficiently. The impact of software is amplified in the context of a wearable computing system used for activity recognition. The design space this platform opens up is immense

Software has a great impact on the energy efficiency of any computing system--it can manage the components of a system efficiently or inefficiently. The impact of software is amplified in the context of a wearable computing system used for activity recognition. The design space this platform opens up is immense and encompasses sensors, feature calculations, activity classification algorithms, sleep schedules, and transmission protocols. Design choices in each of these areas impact energy use, overall accuracy, and usefulness of the system. This thesis explores methods software can influence the trade-off between energy consumption and system accuracy. In general the more energy a system consumes the more accurate will be. We explore how finding the transitions between human activities is able to reduce the energy consumption of such systems without reducing much accuracy. We introduce the Log-likelihood Ratio Test as a method to detect transitions, and explore how choices of sensor, feature calculations, and parameters concerning time segmentation affect the accuracy of this method. We discovered an approximate 5X increase in energy efficiency could be achieved with only a 5% decrease in accuracy. We also address how a system's sleep mode, in which the processor enters a low-power state and sensors are turned off, affects a wearable computing platform that does activity recognition. We discuss the energy trade-offs in each stage of the activity recognition process. We find that careful analysis of these parameters can result in great increases in energy efficiency if small compromises in overall accuracy can be tolerated. We call this the ``Great Compromise.'' We found a 6X increase in efficiency with a 7% decrease in accuracy. We then consider how wireless transmission of data affects the overall energy efficiency of a wearable computing platform. We find that design decisions such as feature calculations and grouping size have a great impact on the energy consumption of the system because of the amount of data that is stored and transmitted. For example, storing and transmitting vector-based features such as FFT or DCT do not compress the signal and would use more energy than storing and transmitting the raw signal. The effect of grouping size on energy consumption depends on the feature. For scalar features energy consumption is proportional in the inverse of grouping size, so it's reduced as grouping size goes up. For features that depend on the grouping size, such as FFT, energy increases with the logarithm of grouping size, so energy consumption increases slowly as grouping size increases. We find that compressing data through activity classification and transition detection significantly reduces energy consumption and that the energy consumed for the classification overhead is negligible compared to the energy savings from data compression. We provide mathematical models of energy usage and data generation, and test our ideas using a mobile computing platform, the Texas Instruments Chronos watch.
ContributorsBoyd, Jeffrey Michael (Author) / Sundaram, Hari (Thesis advisor) / Li, Baoxin (Thesis advisor) / Shrivastava, Aviral (Committee member) / Turaga, Pavan (Committee member) / Arizona State University (Publisher)
Created2014
152905-Thumbnail Image.png
Description
Coarse-Grained Reconfigurable Architectures (CGRA) are a promising fabric for improving the performance and power-efficiency of computing devices. CGRAs are composed of components that are well-optimized to execute loops and rotating register file is an example of such a component present in CGRAs. Due to the rotating nature of register indexes

Coarse-Grained Reconfigurable Architectures (CGRA) are a promising fabric for improving the performance and power-efficiency of computing devices. CGRAs are composed of components that are well-optimized to execute loops and rotating register file is an example of such a component present in CGRAs. Due to the rotating nature of register indexes in rotating register file, it is very challenging, if at all possible, to hold and properly index memory addresses (pointers) and static values. In this Thesis, different structures for CGRA register files are investigated. Those structures are experimentally compared in terms of performance of mapped applications, design frequency, and area. It is shown that a register file that can logically be partitioned into rotating and non-rotating regions is an excellent choice because it imposes the minimum restriction on underlying CGRA mapping algorithm while resulting in efficient resource utilization.
ContributorsSaluja, Dipal (Author) / Shrivastava, Aviral (Thesis advisor) / Lee, Yann-Hang (Committee member) / Wu, Carole-Jean (Committee member) / Arizona State University (Publisher)
Created2014
153089-Thumbnail Image.png
Description
A benchmark suite that is representative of the programs a processor typically executes is necessary to understand a processor's performance or energy consumption characteristics. The first contribution of this work addresses this need for mobile platforms with MobileBench, a selection of representative smartphone applications. In smartphones, like any other

A benchmark suite that is representative of the programs a processor typically executes is necessary to understand a processor's performance or energy consumption characteristics. The first contribution of this work addresses this need for mobile platforms with MobileBench, a selection of representative smartphone applications. In smartphones, like any other portable computing systems, energy is a limited resource. Based on the energy characterization of a commercial widely-used smartphone, application cores are found to consume a significant part of the total energy consumption of the device. With this insight, the subsequent part of this thesis focuses on the portion of energy that is spent to move data from the memory system to the application core's internal registers. The primary motivation for this work comes from the relatively higher power consumption associated with a data movement instruction compared to that of an arithmetic instruction. The data movement energy cost is worsened esp. in a System on Chip (SoC) because the amount of data received and exchanged in a SoC based smartphone increases at an explosive rate. A detailed investigation is performed to quantify the impact of data movement

on the overall energy consumption of a smartphone device. To aid this study, microbenchmarks that generate desired data movement patterns between different levels of the memory hierarchy are designed. Energy costs of data movement are then computed by measuring the instantaneous power consumption of the device when the micro benchmarks are executed. This work makes an extensive use of hardware performance counters to validate the memory access behavior of microbenchmarks and to characterize the energy consumed in moving data. Finally, the calculated energy costs of data movement are used to characterize the portion of energy that MobileBench applications spend in moving data. The results of this study show that a significant 35% of the total device energy is spent in data movement alone. Energy is an increasingly important criteria in the context of designing architectures for future smartphones and this thesis offers insights into data movement energy consumption.
ContributorsPandiyan, Dhinakaran (Author) / Wu, Carole-Jean (Thesis advisor) / Shrivastava, Aviral (Committee member) / Lee, Yann-Hang (Committee member) / Arizona State University (Publisher)
Created2014
153033-Thumbnail Image.png
Description
Coarse Grain Reconfigurable Arrays (CGRAs) are promising accelerators capable of

achieving high performance at low power consumption. While CGRAs can efficiently

accelerate loop kernels, accelerating loops with control flow (loops with if-then-else

structures) is quite challenging. Techniques that handle control flow execution in

CGRAs generally use predication. Such techniques execute both branches of an

if-then-else

Coarse Grain Reconfigurable Arrays (CGRAs) are promising accelerators capable of

achieving high performance at low power consumption. While CGRAs can efficiently

accelerate loop kernels, accelerating loops with control flow (loops with if-then-else

structures) is quite challenging. Techniques that handle control flow execution in

CGRAs generally use predication. Such techniques execute both branches of an

if-then-else structure and select outcome of either branch to commit based on the

result of the conditional. This results in poor utilization of CGRA s computational

resources. Dual-issue scheme which is the state of the art technique for control flow

fetches instructions from both paths of the branch and selects one to execute at

runtime based on the result of the conditional. This technique has an overhead in

instruction fetch bandwidth. In this thesis, to improve performance of control flow

execution in CGRAs, I propose a solution in which the result of the conditional

expression that decides the branch outcome is communicated to the instruction fetch

unit to selectively issue instructions from the path taken by the branch at run time.

Experimental results show that my solution can achieve 34.6% better performance

and 52.1% improvement in energy efficiency on an average compared to state of the

art dual issue scheme without imposing any overhead in instruction fetch bandwidth.
ContributorsRajendran Radhika, Shri Hari (Author) / Shrivastava, Aviral (Thesis advisor) / Christen, Jennifer Blain (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)
Created2014
153193-Thumbnail Image.png
Description
As the number of cores per chip increases, maintaining cache coherence becomes prohibitive for both power and performance. Non Coherent Cache (NCC) architectures do away with hardware-based cache coherence, but they become difficult to program. Some existing architectures provide a middle ground by providing some shared memory in the hardware.

As the number of cores per chip increases, maintaining cache coherence becomes prohibitive for both power and performance. Non Coherent Cache (NCC) architectures do away with hardware-based cache coherence, but they become difficult to program. Some existing architectures provide a middle ground by providing some shared memory in the hardware. Specifically, the 48-core Intel Single-chip Cloud Computer (SCC) provides some off-chip (DRAM) shared memory some on-chip (SRAM) shared memory. We call such architectures Hybrid Shared Memory, or HSM, manycore architectures. However, how to efficiently execute multi-threaded programs on HSM architectures is an open problem. To be able to execute a multi-threaded program correctly on HSM architectures, the compiler must: i) identify all the shared data and map it to the shared memory, and ii) map the frequently accessed shared data to the on-chip shared memory. This work presents a source-to-source translator written using CETUS that identifies a conservative superset of all the shared data in a multi-threaded application and maps it to the shared memory such that it enables execution on HSM architectures.
ContributorsRawat, Tushar (Author) / Shrivastava, Aviral (Thesis advisor) / Dasgupta, Partha (Committee member) / Fainekos, Georgios (Committee member) / Arizona State University (Publisher)
Created2014
150544-Thumbnail Image.png
Description
Limited Local Memory (LLM) multicore architectures are promising powerefficient architectures will scalable memory hierarchy. In LLM multicores, each core can access only a small local memory. Accesses to a large shared global memory can only be made explicitly through Direct Memory Access (DMA) operations. Standard Template Library (STL) is a

Limited Local Memory (LLM) multicore architectures are promising powerefficient architectures will scalable memory hierarchy. In LLM multicores, each core can access only a small local memory. Accesses to a large shared global memory can only be made explicitly through Direct Memory Access (DMA) operations. Standard Template Library (STL) is a powerful programming tool and is widely used for software development. STLs provide dynamic data structures, algorithms, and iterators for vector, deque (double-ended queue), list, map (red-black tree), etc. Since the size of the local memory is limited in the cores of the LLM architecture, and data transfer is not automatically supported by hardware cache or OS, the usage of current STL implementation on LLM multicores is limited. Specifically, there is a hard limitation on the amount of data they can handle. In this article, we propose and implement a framework which manages the STL container classes on the local memory of LLM multicore architecture. Our proposal removes the data size limitation of the STL, and therefore improves the programmability on LLM multicore architectures with little change to the original program. Our implementation results in only about 12%-17% increase in static library code size and reasonable runtime overheads.
ContributorsLu, Di (Author) / Shrivastava, Aviral (Thesis advisor) / Chatha, Karamvir (Committee member) / Dasgupta, Partha (Committee member) / Arizona State University (Publisher)
Created2012
150486-Thumbnail Image.png
Description
The use of energy-harvesting in a wireless sensor network (WSN) is essential for situations where it is either difficult or not cost effective to access the network's nodes to replace the batteries. In this paper, the problems involved in controlling an active sensor network that is powered both by batteries

The use of energy-harvesting in a wireless sensor network (WSN) is essential for situations where it is either difficult or not cost effective to access the network's nodes to replace the batteries. In this paper, the problems involved in controlling an active sensor network that is powered both by batteries and solar energy are investigated. The objective is to develop control strategies to maximize the quality of coverage (QoC), which is defined as the minimum number of targets that must be covered and reported over a 24 hour period. Assuming a time varying solar profile, the problem is to optimally control the sensing range of each sensor so as to maximize the QoC while maintaining connectivity throughout the network. Implicit in the solution is the dynamic allocation of solar energy during the day to sensing and to recharging the battery so that a minimum coverage is guaranteed even during the night, when only the batteries can supply energy to the sensors. This problem turns out to be a non-linear optimal control problem of high complexity. Based on novel and useful observations, a method is presented to solve it as a series of quasiconvex (unimodal) optimization problems which not only ensures a maximum QoC, but also maintains connectivity throughout the network. The runtime of the proposed solution is 60X less than a naive but optimal method which is based on dynamic programming, while the peak error of the solution is less than 8%. Unlike the dynamic programming method, the proposed method is scalable to large networks consisting of hundreds of sensors and targets. The solution method enables a designer to explore the optimal configuration of network design. This paper offers many insights in the design of energy-harvesting networks, which result in minimum network setup cost through determination of optimal configuration of number of sensors, sensing beam width, and the sampling time.
ContributorsGaudette, Benjamin (Author) / Vrudhula, Sarma (Thesis advisor) / Shrivastava, Aviral (Committee member) / Sen, Arunabha (Committee member) / Arizona State University (Publisher)
Created2012
150460-Thumbnail Image.png
Description
Performance improvements have largely followed Moore's Law due to the help from technology scaling. In order to continue improving performance, power-efficiency must be reduced. Better technology has improved power-efficiency, but this has a limit. Multi-core architectures have been shown to be an additional aid to this crusade of increased power-efficiency.

Performance improvements have largely followed Moore's Law due to the help from technology scaling. In order to continue improving performance, power-efficiency must be reduced. Better technology has improved power-efficiency, but this has a limit. Multi-core architectures have been shown to be an additional aid to this crusade of increased power-efficiency. Accelerators are growing in popularity as the next means of achieving power-efficient performance. Accelerators such as Intel SSE are ideal, but prove difficult to program. FPGAs, on the other hand, are less efficient due to their fine-grained reconfigurability. A middle ground is found in CGRAs, which are highly power-efficient, but largely programmable accelerators. Power-efficiencies of 100s of GOPs/W have been estimated, more than 2 orders of magnitude greater than current processors. Currently, CGRAs are limited in their applicability due to their ability to only accelerate a single thread at a time. This limitation becomes especially apparent as multi-core/multi-threaded processors have moved into the mainstream. This limitation is removed by enabling multi-threading on CGRAs through a software-oriented approach. The key capability in this solution is enabling quick run-time transformation of schedules to execute on targeted portions of the CGRA. This allows the CGRA to be shared among multiple threads simultaneously. Analysis shows that enabling multi-threading has very small costs but provides very large benefits (less than 1% single-threaded performance loss but nearly 300% CGRA throughput increase). By increasing dynamism of CGRA scheduling, system performance is shown to increase overall system performance of an optimized system by almost 350% over that of a single-threaded CGRA and nearly 20x faster than the same system with no CGRA in a highly threaded environment.
ContributorsPager, Jared (Author) / Shrivastava, Aviral (Thesis advisor) / Gupta, Sandeep (Committee member) / Speyer, Gil (Committee member) / Arizona State University (Publisher)
Created2011