Matching Items (53)
Filtering by

Clear all filters

161988-Thumbnail Image.png
Description
Autonomous Vehicles (AV) are inevitable entities in future mobility systems thatdemand safety and adaptability as two critical factors in replacing/assisting human drivers. Safety arises in defining, standardizing, quantifying, and monitoring requirements for all autonomous components. Adaptability, on the other hand, involves efficient handling of uncertainty and inconsistencies in models and data. First, I

Autonomous Vehicles (AV) are inevitable entities in future mobility systems thatdemand safety and adaptability as two critical factors in replacing/assisting human drivers. Safety arises in defining, standardizing, quantifying, and monitoring requirements for all autonomous components. Adaptability, on the other hand, involves efficient handling of uncertainty and inconsistencies in models and data. First, I address safety by presenting a search-based test-case generation framework that can be used in training and testing deep-learning components of AV. Next, to address adaptability, I propose a framework based on multi-valued linear temporal logic syntax and semantics that allows autonomous agents to perform model-checking on systems with uncertainties. The search-based test-case generation framework provides safety assurance guarantees through formalizing and monitoring Responsibility Sensitive Safety (RSS) rules. I use the RSS rules in signal temporal logic as qualification specifications for monitoring and screening the quality of generated test-drive scenarios. Furthermore, to extend the existing temporal-based formal languages’ expressivity, I propose a new spatio-temporal perception logic that enables formalizing qualification specifications for perception systems. All-in-one, my test-generation framework can be used for reasoning about the quality of perception, prediction, and decision-making components in AV. Finally, my efforts resulted in publicly available software. One is an offline monitoring algorithm based on the proposed logic to reason about the quality of perception systems. The other is an optimal planner (model checker) that accepts mission specifications and model descriptions in the form of multi-valued logic and multi-valued sets, respectively. My monitoring framework is distributed with the publicly available S-TaLiRo and Sim-ATAV tools.
ContributorsHekmatnejad, Mohammad (Author) / Fainekos, Georgios (Thesis advisor) / Deshmukh, Jyotirmoy V (Committee member) / Karam, Lina (Committee member) / Pedrielli, Giulia (Committee member) / Shrivastava, Aviral (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)
Created2021
168306-Thumbnail Image.png
Description
Coarse-Grained Reconfigurable Arrays (CGRAs) are emerging accelerators that promise low-power acceleration of compute-intensive loops in applications. The acceleration achieved by CGRA relies on the efficient mapping of the compute-intensive loops by the CGRA compiler onto the CGRA. The CGRA mapping problem, being NP-complete, is performed in a two-step process, scheduling,

Coarse-Grained Reconfigurable Arrays (CGRAs) are emerging accelerators that promise low-power acceleration of compute-intensive loops in applications. The acceleration achieved by CGRA relies on the efficient mapping of the compute-intensive loops by the CGRA compiler onto the CGRA. The CGRA mapping problem, being NP-complete, is performed in a two-step process, scheduling, and mapping. The scheduling algorithm allocates timeslots to the nodes of the DFG, and the mapping algorithm maps the scheduled nodes onto the PEs of the CGRA. On a mapping failure, the initiation interval (II) is increased, and a new schedule is obtained for the increased II. Most previous mapping techniques use the Iterative Modulo Scheduling algorithm (IMS) to find a schedule for a given II. Since IMS generates a resource-constrained ASAP (as-soon-as-possible) scheduling, even with increased II, it tends to generate a similar schedule that is not mappable and does not explore the schedule space effectively. The problems encountered by IMS-based scheduling algorithms are explored and an improved randomized scheduling algorithm for scheduling of the application loop to be accelerated is proposed. When encountering a mapping failure for a given schedule, existing mapping algorithms either exit and retry the mapping anew, or recursively remove the previously mapped node to find a valid mapping (backtrack).Abandoning the mapping is extreme, but even backtracking may not be the best choice, since the root of the problem may not be the previous node. The challenges in existing algorithms are systematically analyzed and a failure-aware mapping algorithm is presented. The loops in general-purpose applications are often complicated loops, i.e., loops with perfect and imperfect nests and loops with nested if-then-else's (conditionals). The existing hardware-software solutions to execute branches and conditions are inefficient. A co-design approach that efficiently executes complicated loops on CGRA is proposed. The compiler transforms complex loops, maps them to the CGRA, and lays them out in the memory in a specific manner, such that the hardware can fetch and execute the instructions from the right path at runtime. Finally, a CGRA compilation simulator open-source framework is presented. This open-source CGRA simulation framework is based on LLVM and gem5 to extract the loop, map them onto the CGRA architecture, and execute them as a co-processor to an ARM CPU.
ContributorsBalasubramanian, Mahesh (Author) / Shrivastava, Aviral (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Ren, Fengbo (Committee member) / Pozzi, Laura (Committee member) / Arizona State University (Publisher)
Created2021
187820-Thumbnail Image.png
Description
With the advent of new advanced analysis tools and access to related published data, it is getting more difficult for data owners to suppress private information from published data while still providing useful information. This dual problem of providing useful, accurate information and protecting it at the same time has

With the advent of new advanced analysis tools and access to related published data, it is getting more difficult for data owners to suppress private information from published data while still providing useful information. This dual problem of providing useful, accurate information and protecting it at the same time has been challenging, especially in healthcare. The data owners lack an automated resource that provides layers of protection on a published dataset with validated statistical values for usability. Differential privacy (DP) has gained a lot of attention in the past few years as a solution to the above-mentioned dual problem. DP is defined as a statistical anonymity model that can protect the data from adversarial observation while still providing intended usage. This dissertation introduces a novel DP protection mechanism called Inexact Data Cloning (IDC), which simultaneously protects and preserves information in published data while conveying source data intent. IDC preserves the privacy of the records by converting the raw data records into clonesets. The clonesets then pass through a classifier that removes potential compromising clonesets, filtering only good inexact cloneset. The mechanism of IDC is dependent on a set of privacy protection metrics called differential privacy protection metrics (DPPM), which represents the overall protection level. IDC uses two novel performance values, differential privacy protection score (DPPS) and clone classifier selection percentage (CCSP), to estimate the privacy level of protected data. In support of using IDC as a viable data security product, a software tool chain prototype, differential privacy protection architecture (DPPA), was developed to utilize the IDC. DPPA used the engineering security mechanism of IDC. DPPA is a hub which facilitates a market for data DP security mechanisms. DPPA works by incorporating standalone IDC mechanisms and provides automation, IDC protected published datasets and statistically verified IDC dataset diagnostic report. DPPA is currently doing functional, and operational benchmark processes that quantifies the DP protection of a given published dataset. The DPPA tool was recently used to test a couple of health datasets. The test results further validate the IDC mechanism as being feasible.
Contributorsthomas, zelpha (Author) / Bliss, Daniel W (Thesis advisor) / Papandreou-Suppappola, Antonia (Committee member) / Banerjee, Ayan (Committee member) / Shrivastava, Aviral (Committee member) / Arizona State University (Publisher)
Created2023
187470-Thumbnail Image.png
Description
Among the many challenges facing circuit designers in deep sub-micron technologies, power, performance, area (PPA) and process variations are perhaps the most critical. Since existing strategies for reducing power and boosting the performance of the circuit designs have already matured to saturation, it is necessary to explore alternate unconventional strategies.

Among the many challenges facing circuit designers in deep sub-micron technologies, power, performance, area (PPA) and process variations are perhaps the most critical. Since existing strategies for reducing power and boosting the performance of the circuit designs have already matured to saturation, it is necessary to explore alternate unconventional strategies. This investigation focuses on using perceptrons to enhance PPA in digital circuits and starts by constructing the perceptron using a combination of complementary metal-oxide-semiconductor (CMOS) and flash technology. The use of flash enables the perceptron to have a variable delay and functionality, making them robust to process, voltage, and temperature variations. By replacing parts of an application-specific integrated circuit (ASIC) with these perceptrons, improvements of up to 30% in the area and 20% in power can be achieved without affecting performance. Furthermore, the ability to vary the delay of a perceptron enables circuit designers to fix setup and hold-time violations post-fabrication, while reprogramming the functionality enables the obfuscation of the circuits. The study extends to field-programmable gate arrays (FPGAs), showing that traditional FPGA architectures can also achieve improved PPA by replacing some Look-Up-Tables (LUTs) with perceptrons. Considering that replacing parts of traditional digital circuits provides significant improvements in PPA, a natural extension was to see whether circuits built dedicatedly using perceptrons as its compute unit would lead to improvements in energy efficiency. This was demonstrated by developing perceptron-based compute elements and constructing an architecture using these elements for Quantized Neural Network acceleration. The resulting circuit delivered up to 50 times more energy efficiency compared to a CMOS-based accelerator without using standard low-power techniques such as voltage scaling and approximate computing.
ContributorsWagle, Ankit (Author) / Vrudhula, Sarma (Thesis advisor) / Khatri, Sunil (Committee member) / Shrivastava, Aviral (Committee member) / Seo, Jae-Sun (Committee member) / Ren, Fengbo (Committee member) / Arizona State University (Publisher)
Created2023
156829-Thumbnail Image.png
Description
Advances in semiconductor technology have brought computer-based systems intovirtually all aspects of human life. This unprecedented integration of semiconductor based systems in our lives has significantly increased the domain and the number

of safety-critical applications – application with unacceptable consequences of failure. Software-level error resilience schemes are attractive because they can

Advances in semiconductor technology have brought computer-based systems intovirtually all aspects of human life. This unprecedented integration of semiconductor based systems in our lives has significantly increased the domain and the number

of safety-critical applications – application with unacceptable consequences of failure. Software-level error resilience schemes are attractive because they can provide commercial-off-the-shelf microprocessors with adaptive and scalable reliability.

Among all software-level error resilience solutions, in-application instruction replication based approaches have been widely used and are deemed to be the most effective. However, existing instruction-based replication schemes only protect some part of computations i.e. arithmetic and logical instructions and leave the rest as unprotected. To improve the efficacy of instruction-level redundancy-based approaches, we developed several error detection and error correction schemes. nZDC (near Zero silent

Data Corruption) is an instruction duplication scheme which protects the execution of whole application. Rather than detecting errors on register operands of memory and control flow operations, nZDC checks the results of such operations. nZDC en

sures the correct execution of memory write instruction by reloading stored value and checking it against redundantly computed value. nZDC also introduces a novel control flow checking mechanism which replicates compare and branch instructions and

detects both wrong direction branches as well as unwanted jumps. Fault injection experiments show that nZDC can improve the error coverage of the state-of-the-art schemes by more than 10x, without incurring any more performance penalty. Further

more, we introduced two error recovery solutions. InCheck is our backward recovery solution which makes light-weighted error-free checkpoints at the basic block granularity. In the case of error, InCheck reverts the program execution to the beginning of last executed basic block and resumes the execution by the aid of preserved in formation. NEMESIS is our forward recovery scheme which runs three versions of computation and detects errors by checking the results of all memory write and branch

operations. In the case of a mismatch, NEMESIS diagnosis routine decides if the error is recoverable. If yes, NEMESIS recovery routine reverts the effect of error from the program state and resumes program normal execution from the error detection

point.
ContributorsDidehban, Moslem (Author) / Shrivastava, Aviral (Thesis advisor) / Wu, Carole-Jean (Committee member) / Clark, Lawrence (Committee member) / Mahlke, Scott (Committee member) / Arizona State University (Publisher)
Created2018
156791-Thumbnail Image.png
Description
General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs

General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions.

Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLCs while being cognizant of its interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse, or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%.

Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache that performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC that considers the GPU’s latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications.

Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain performance scalability of GPG- PUs in the face of slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future.

In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors.
ContributorsArunkumar, Akhil (Author) / Wu, Carole-Jean (Thesis advisor) / Shrivastava, Aviral (Committee member) / Lee, Yann-Hang (Committee member) / Bolotin, Evgeny (Committee member) / Arizona State University (Publisher)
Created2018
156948-Thumbnail Image.png
Description
The Internet of Things ecosystem has spawned a wide variety of embedded real-time systems that complicate the identification and resolution of bugs in software. The methods of concurrent checkpoint provide a means to monitor the application state with the ability to replay the execution on like hardware and software,

The Internet of Things ecosystem has spawned a wide variety of embedded real-time systems that complicate the identification and resolution of bugs in software. The methods of concurrent checkpoint provide a means to monitor the application state with the ability to replay the execution on like hardware and software, without holding off and delaying the execution of application threads. In this thesis, it is accomplished by monitoring physical memory of the application using a soft-dirty page tracker and measuring the various types of overhead when employing concurrent checkpointing. The solution presented is an advancement of the Checkpoint and Replay In Userspace (CRIU) thereby eliminating the large stalls and parasitic operation for each successive checkpoint. Impact and performance is measured using the Parsec 3.0 Benchmark suite and 4.11.12-rt16+ Linux kernel on a MinnowBoard Turbot Quad-Core board.
ContributorsPrinke, Michael L (Author) / Lee, Yann-Hang (Thesis advisor) / Shrivastava, Aviral (Committee member) / Zhao, Ming (Committee member) / Arizona State University (Publisher)
Created2018
157100-Thumbnail Image.png
Description
One of the main goals of computer architecture design is to improve performance without much increase in the power consumption. It cannot be achieved by adding increasingly complex intelligent schemes in the hardware, since they will become increasingly less power-efficient. Therefore, parallelism comes up as the solution. In fact, the

One of the main goals of computer architecture design is to improve performance without much increase in the power consumption. It cannot be achieved by adding increasingly complex intelligent schemes in the hardware, since they will become increasingly less power-efficient. Therefore, parallelism comes up as the solution. In fact, the irrevocable trend of computer design in near future is still to keep increasing the number of cores while reducing the operating frequency. However, it is not easy to scale number of cores. One important challenge is that existing cores consume too much power. Another challenge is that cache-based memory hierarchy poses a serious limitation due to the rapidly increasing demand of area and power for coherence maintenance.

In this dissertation, opportunities to resolve the aforementioned issues were explored in two aspects.

Firstly, the possibility of removing hardware cache altogether, and replacing it with scratchpad memory with software management was explored. Scratchpad memory consumes much less power than caches. However, as data management logic is completely shifted to Software, how to reduce software overhead is challenging. This thesis presents techniques to manage scratchpad memory judiciously by exploiting application semantics and knowledge of data access patterns, thereby enabling optimization of data movement across the memory hierarchy. Experimental results show that the optimization was able to reduce stack data management overhead by 13X, produce better code mapping in more than 80% of the case, and improve performance by 83% in heap management.

Secondly, the possibility of using software branch hinting to replace hardware branch prediction to completely eliminate power consumption on corresponding hardware components was explored. As branch predictor is removed from hardware, software logic is responsible for reducing branch penalty. Techniques to minimize the branch penalty by optimizing branch hint placement were proposed, which can reduce branch penalty by 35.4% over the state-of-the-art.
ContributorsLu, Jing (Author) / Shrivastava, Aviral (Thesis advisor) / Sarjoughian, Hessam S. (Committee member) / Wu, Carole-Jean (Committee member) / Doupe, Adam (Committee member) / Arizona State University (Publisher)
Created2019
153968-Thumbnail Image.png
Description
The holy grail of computer hardware across all market segments has been to sustain performance improvement at the same pace as silicon technology scales. As the technology scales and the size of transistors shrinks, the power consumption and energy usage per transistor decrease. On the other hand, the transistor density

The holy grail of computer hardware across all market segments has been to sustain performance improvement at the same pace as silicon technology scales. As the technology scales and the size of transistors shrinks, the power consumption and energy usage per transistor decrease. On the other hand, the transistor density increases significantly by technology scaling. Due to technology factors, the reduction in power consumption per transistor is not sufficient to offset the increase in power consumption per unit area. Therefore, to improve performance, increasing energy-efficiency must be addressed at all design levels from circuit level to application and algorithm levels.

At architectural level, one promising approach is to populate the system with hardware accelerators each optimized for a specific task. One drawback of hardware accelerators is that they are not programmable. Therefore, their utilization can be low as they perform one specific function. Using software programmable accelerators is an alternative approach to achieve high energy-efficiency and programmability. Due to intrinsic characteristics of software accelerators, they can exploit both instruction level parallelism and data level parallelism.

Coarse-Grained Reconfigurable Architecture (CGRA) is a software programmable accelerator consists of a number of word-level functional units. Motivated by promising characteristics of software programmable accelerators, the potentials of CGRAs in future computing platforms is studied and an end-to-end CGRA research framework is developed. This framework consists of three different aspects: CGRA architectural design, integration in a computing system, and CGRA compiler. First, the design and implementation of a CGRA and its instruction set is presented. This design is then modeled in a cycle accurate system simulator. The simulation platform enables us to investigate several problems associated with a CGRA when it is deployed as an accelerator in a computing system. Next, the problem of mapping a compute intensive region of a program to CGRAs is formulated. From this formulation, several efficient algorithms are developed which effectively utilize CGRA scarce resources very well to minimize the running time of input applications. Finally, these mapping algorithms are integrated in a compiler framework to construct a compiler for CGRA
ContributorsHamzeh, Mahdi (Author) / Vrudhula, Sarma (Thesis advisor) / Gopalakrishnan, Kailash (Committee member) / Shrivastava, Aviral (Committee member) / Wu, Carole-Jean (Committee member) / Arizona State University (Publisher)
Created2015
154657-Thumbnail Image.png
Description
Several decades of transistor technology scaling has brought the threat of soft errors to modern embedded processors. Several techniques have been proposed to protect these systems from soft errors. However, their effectiveness in protecting the computation cannot be ascertained without accurate and quantitative estimation of system reliability. Vulnerability -- a

Several decades of transistor technology scaling has brought the threat of soft errors to modern embedded processors. Several techniques have been proposed to protect these systems from soft errors. However, their effectiveness in protecting the computation cannot be ascertained without accurate and quantitative estimation of system reliability. Vulnerability -- a metric that defines the probability of system-failure (reliability) through analytical models -- is the most effective mechanism for our current estimation and early design space exploration needs. Previous vulnerability estimation tools are based around the Sim-Alpha simulator which has been to shown to have several limitations. In this thesis, I present gemV: an accurate and comprehensive vulnerability estimation tool based on gem5. Gem5 is a popular cycle-accurate micro-architectural simulator that can model several different processor models in close to real hardware form. GemV can be used for fast and early design space exploration and also evaluate the protection afforded by commodity processors. gemV is comprehensive, since it models almost all sequential components of the processor. gemV is accurate because of fine-grain vulnerability tracking, accurate vulnerability modeling of squashed instructions, and accurate vulnerability modeling of shared data structures in gem5. gemV has been thoroughly validated against extensive fault injection experiments and achieves a 97\% accuracy with 95\% confidence. A micro-architect can use gemV to discover micro-architectural variants of a processor that minimize vulnerability for allowed performance penalty. A software developer can use gemV to explore the performance-vulnerability trade-off by choosing different algorithms and compiler optimizations, while the system designer can use gemV to explore the performance-vulnerability trade-offs of choosing different Insruction Set Architectures (ISA).
ContributorsTanikella, Srinivas Karthik (Author) / Shrivastava, Aviral (Thesis advisor) / Bazzi, Rida (Committee member) / Wu, Carole-Jean (Committee member) / Arizona State University (Publisher)
Created2016