Matching Items (35)
Description
Ever-shrinking time to market, along with short product lifetimes, has created a need to shorten microprocessor design time. Verification of the design and its analysis are two major components of this design cycle. Design validation techniques can be broadly classified into two major categories: simulation-based approaches and formal techniques. Simulation-based microprocessor validation involves running millions of cycles using random or pseudo-random tests and allows verification of the register transfer level (RTL) model against an architectural model, i.e., that the processor executes instructions as required. The validation effort involves either model checking against a high-level description or simulation of the design against the RTL implementation. Formal techniques exhaustively analyze parts of the design but do not verify the RTL against the architecture specification. The focus of this work is to implement a fully automated validation environment for a MIPS-based radiation-hardened microprocessor using simulation-based approaches. The basic framework uses the classical validation approach in which the design to be validated is described in a hardware description language (HDL) such as VHDL or Verilog. To implement a simulation-based approach, a number of random or pseudo-random tests are generated. The output of the HDL-based design is compared against the output obtained from a "perfect" model implementing similar functionality; a mismatch in the results indicates a bug in the HDL-based design. The environment is designed so that it can support validation during different stages of the design cycle, and it incorporates the changes needed to support the architectural modifications introduced by radiation hardening. How the validation environment is built depends heavily on the specification of the perfect model used for comparison. This work implements the validation environment using two MIPS simulators as the reference model. Two bugs in the RTL model have been discovered through this simulation-based validation environment.
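As a rough illustration of the comparison step this validation approach relies on (not the actual environment built in this work), the sketch below runs one randomly generated test on a tiny "golden" model and on a stand-in for the RTL simulation, then reports any mismatch in the final register state. The three-instruction ISA, the injected bug, and all names here are invented for illustration.

```python
# Minimal sketch of simulation-based validation: run the same random test on a
# reference ("golden") model and on a stand-in for the RTL simulation, then diff
# the final architectural state. The tiny ISA and the injected bug are illustrative.
import random

def golden_model(program, regs):
    """Reference behavior: executes each instruction as the architecture requires."""
    for op, rd, rs, rt in program:
        if op == "add":
            regs[rd] = (regs[rs] + regs[rt]) & 0xFFFFFFFF
        elif op == "sub":
            regs[rd] = (regs[rs] - regs[rt]) & 0xFFFFFFFF
    return regs

def rtl_model(program, regs):
    """Stand-in for the HDL simulation; the injected 'bug' swaps sub operands."""
    for op, rd, rs, rt in program:
        if op == "add":
            regs[rd] = (regs[rs] + regs[rt]) & 0xFFFFFFFF
        elif op == "sub":
            regs[rd] = (regs[rt] - regs[rs]) & 0xFFFFFFFF   # operands swapped: a bug
    return regs

def random_test(length=20, nregs=8):
    """Generate a pseudo-random instruction sequence, as a test generator would."""
    return [(random.choice(["add", "sub"]),
             random.randrange(nregs), random.randrange(nregs), random.randrange(nregs))
            for _ in range(length)]

test = random_test()
init = [random.randrange(1 << 16) for _ in range(8)]      # random initial register file
golden = golden_model(test, list(init))
rtl = rtl_model(test, list(init))
mismatches = [(r, g, d) for r, (g, d) in enumerate(zip(golden, rtl)) if g != d]
print("mismatching registers (reg, golden, rtl):", mismatches)
```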
Contributors: Sharma, Abhishek (Author) / Clark, Lawrence (Thesis advisor) / Holbert, Keith E. (Committee member) / Shrivastava, Aviral (Committee member) / Arizona State University (Publisher)
Created: 2011
Description
Performance improvements have largely followed Moore's Law with the help of technology scaling. To continue improving performance, power efficiency must be improved. Better technology has improved power efficiency, but this has a limit. Multi-core architectures have been shown to be an additional aid in this pursuit of increased power efficiency. Accelerators are growing in popularity as the next means of achieving power-efficient performance. Accelerators such as Intel SSE are ideal but prove difficult to program. FPGAs, on the other hand, are less efficient due to their fine-grained reconfigurability. A middle ground is found in coarse-grained reconfigurable architectures (CGRAs), which are highly power-efficient yet largely programmable accelerators. Power efficiencies of hundreds of GOPs/W have been estimated, more than two orders of magnitude greater than those of current processors. Currently, CGRAs are limited in their applicability because they can accelerate only a single thread at a time. This limitation becomes especially apparent as multi-core/multi-threaded processors move into the mainstream. This work removes the limitation by enabling multi-threading on CGRAs through a software-oriented approach. The key capability in this solution is quick run-time transformation of schedules so that they execute on targeted portions of the CGRA, allowing the CGRA to be shared among multiple threads simultaneously. Analysis shows that enabling multi-threading has very small costs but provides very large benefits: less than 1% single-threaded performance loss but nearly 300% CGRA throughput increase. By increasing the dynamism of CGRA scheduling, the overall performance of an optimized system increases by almost 350% over that of a single-threaded CGRA and is nearly 20x that of the same system with no CGRA in a highly threaded environment.
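One way to picture the run-time schedule transformation described above is as a relocation of each operation's (row, column, cycle) slot into the sub-region of the CGRA granted to one thread, stretching the initiation interval when fewer rows are available. The sketch below is a deliberately naive fold, not the scheduler from this work; the array sizes and example schedule are invented.

```python
# Illustrative remapping of a modulo schedule onto a sub-region of a CGRA so that
# several threads can share the fabric. Each operation occupies a (row, col, cycle)
# slot on the full array; the transform relocates it into the rows granted to one
# thread, folding extra rows into later cycles and growing the initiation interval.

def remap_schedule(schedule, full_rows, region_rows, region_row_offset, ii):
    """Relocate a schedule for a full_rows-tall CGRA into a region_rows-tall slice."""
    fold = -(-full_rows // region_rows)          # ceiling division: how many folds needed
    new_ii = ii * fold                           # conservative new initiation interval
    new_schedule = {}
    for op, (row, col, cycle) in schedule.items():
        new_row = region_row_offset + row % region_rows
        new_cycle = cycle * fold + row // region_rows
        new_schedule[op] = (new_row, col, new_cycle)
    return new_schedule, new_ii

# Example: a schedule for a 4x4 array squeezed into rows 2-3 for a second thread.
orig = {"load_a": (0, 0, 0), "mul": (1, 1, 0), "add": (2, 2, 1), "store": (3, 3, 1)}
remapped, new_ii = remap_schedule(orig, full_rows=4, region_rows=2,
                                  region_row_offset=2, ii=2)
print(remapped, "new II:", new_ii)
```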
Contributors: Pager, Jared (Author) / Shrivastava, Aviral (Thesis advisor) / Gupta, Sandeep (Committee member) / Speyer, Gil (Committee member) / Arizona State University (Publisher)
Created: 2011
Description
Limited Local Memory (LLM) multicore architectures are promising power-efficient architectures with a scalable memory hierarchy. In LLM multicores, each core can access only a small local memory; accesses to a large shared global memory can be made only explicitly, through Direct Memory Access (DMA) operations. The Standard Template Library (STL) is a powerful programming tool that is widely used for software development. It provides dynamic data structures, algorithms, and iterators such as vector, deque (double-ended queue), list, and map (red-black tree). Because the local memory of each core is small, and data transfer is not automatically supported by a hardware cache or the OS, the usefulness of current STL implementations on LLM multicores is limited; specifically, there is a hard limit on the amount of data they can handle. In this work, we propose and implement a framework that manages the STL container classes in the local memory of an LLM multicore architecture. Our proposal removes the data-size limitation of the STL and therefore improves programmability on LLM multicore architectures with little change to the original program. Our implementation results in only about a 12%-17% increase in static library code size and reasonable runtime overheads.
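The core idea of keeping only a window of a container resident in the small local memory, and moving the rest to global memory through explicit transfers, can be sketched roughly as follows. The real framework manages C++ STL containers on LLM hardware; this Python analogue uses dictionaries as stand-ins for local and global memory, and ordinary assignments as stand-ins for DMA operations.

```python
# Rough sketch of a vector-like container that keeps at most a few blocks resident
# in "local memory" and evicts older blocks to "global memory" through explicit
# transfers, mimicking DMA on an LLM core. Block size and capacities are invented.

BLOCK = 4  # elements per transferred block (illustrative)

class ManagedVector:
    """A vector whose data lives in global memory; only a few blocks stay local."""
    def __init__(self, resident_blocks=2):
        self.global_mem = {}     # block id -> list, stands in for shared global memory
        self.local_mem = {}      # block id -> list, stands in for the core's local memory
        self.resident_blocks = resident_blocks
        self.size = 0

    def _fetch(self, bid):
        """Make block `bid` resident, evicting another block via a 'DMA' if needed."""
        if bid in self.local_mem:
            return
        if len(self.local_mem) >= self.resident_blocks:
            victim, data = self.local_mem.popitem()
            self.global_mem[victim] = data               # stand-in for dma_put(victim)
        self.local_mem[bid] = self.global_mem.pop(bid, [])  # stand-in for dma_get(bid)

    def push_back(self, value):
        bid = self.size // BLOCK
        self._fetch(bid)
        self.local_mem[bid].append(value)
        self.size += 1

    def __getitem__(self, i):
        bid = i // BLOCK
        self._fetch(bid)
        return self.local_mem[bid][i % BLOCK]

v = ManagedVector()
for x in range(20):          # 20 elements exceed the 2 blocks * 4 elements kept locally
    v.push_back(x)
print(v[3], v[17])           # transparent access triggers block transfers as needed
```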
Contributors: Lu, Di (Author) / Shrivastava, Aviral (Thesis advisor) / Chatha, Karamvir (Committee member) / Dasgupta, Partha (Committee member) / Arizona State University (Publisher)
Created: 2012
Description
The ubiquity of embedded computational systems has exploded in recent years, impacting everything from hand-held computers and automotive driver assistance to battlefield command and control and autonomous systems. Typical embedded computing systems are characterized by highly resource-constrained operating environments. In particular, limited energy resources constrain performance in embedded systems often reliant on independent fuel or battery supplies. Ultimately, mitigating energy consumption without sacrificing performance in these systems is paramount. In this work, power/performance optimization for energy-constrained embedded systems is addressed, with an emphasis on prevailing data-centric applications including video and signal processing. Frameworks are presented that exchange quality of service (QoS) for reduced power consumption, enabling power-aware energy management. Power-aware systems provide users with tools for precisely managing available energy resources in light of user priorities, extending availability when QoS can be sacrificed. Specifically, power-aware management tools for next-generation bistable electrophoretic displays and the state-of-the-art H.264 video codec are introduced. The multiprocessor system-on-chip (MPSoC) paradigm is examined in the context of next-generation many-core hand-held computing devices. MPSoC architectures promise to breach the power/performance wall prohibiting advancement of complex high-performance single-core architectures. Several many-core distributed-memory MPSoC architectures are commercially available, while the tools necessary to effectively tap their enormous potential remain largely open for discovery. Adaptable scalability in many-core systems is addressed through a scalable, high-performance multicore H.264 video decoder implemented on the representative Cell Broadband Engine (CBE) architecture. The resulting agile, performance-scalable system enables efficient adaptive power optimization via decoding-rate-driven sleep and voltage/frequency state management. The significant problem of mapping applications onto these architectures is additionally addressed from the perspective of instruction mapping for limited distributed-memory architectures, with a code overlay generator implemented on the CBE. Finally, runtime scheduling and mapping of scalable applications in multitasking environments are addressed through a lightweight work-partitioning framework targeting streaming applications, demonstrated on the CBE with low latency and near-optimal throughput.
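The decoding-rate-driven voltage/frequency management mentioned above can be caricatured as a simple governor that picks the lowest-power state whose decode rate still covers the display rate. The frequency/rate table and headroom factor below are invented for illustration and are not measurements from the CBE decoder.

```python
# Toy decode-rate-driven frequency governor: choose the lowest frequency state whose
# expected decode rate still meets the target display rate with some headroom.
# The (frequency, decode rate) table is assumed; a real system would measure rates online.

FREQ_STATES = [(1.0, 18.0), (1.6, 27.0), (2.4, 38.0), (3.2, 52.0)]  # (GHz, frames/s), invented

def pick_state(target_fps, headroom=1.1):
    """Return the lowest-power state whose decode rate covers target_fps with headroom."""
    for freq, fps in FREQ_STATES:                 # ordered from lowest to highest power
        if fps >= target_fps * headroom:
            return freq
    return FREQ_STATES[-1][0]                     # saturate at the fastest state

for target in (15, 24, 30, 60):
    print(f"target {target} fps -> run decoder at {pick_state(target)} GHz")
```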
Contributors: Baker, Michael (Author) / Chatha, Karam S. (Thesis advisor) / Raupp, Gregory B. (Committee member) / Vrudhula, Sarma B. K. (Committee member) / Shrivastava, Aviral (Committee member) / Arizona State University (Publisher)
Created: 2011
Description
Threshold logic has long been studied as a means of achieving higher performance and lower power dissipation, providing improvements by condensing simple logic gates into more complex primitives, effectively reducing gate count, pipeline depth, and number of interconnects. This work proposes a new physical implementation of threshold logic, the threshold logic latch (TLL), which overcomes the difficulties observed in previous work, particularly with respect to gate reliability in the presence of noise and process variations. Simple but effective models were created to assess the delay, power, and noise margin of TLL gates for the purpose of determining the physical parameters and assignment of input signals that achieves the lowest delay subject to constraints on power and reliability. From these models, an optimized library of standard TLL cells was developed to supplement a commercial library of static CMOS gates. The new cells were then demonstrated on a number of automatically synthesized, placed, and routed designs. A two-stage 2's complement integer multiplier designed with CMOS and TLL gates utilized 19.5% less area, 28.0% less active power, and 61.5% less leakage power than an equivalent design with the same performance using only static CMOS gates. Additionally, a two-stage 32-instruction 4-way issue queue designed with CMOS and TLL gates utilized 30.6% less area, 31.0% less active power, and 58.9% less leakage power than an equivalent design with the same performance using only static CMOS gates.
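For readers unfamiliar with threshold logic, a threshold gate fires when a weighted sum of its Boolean inputs reaches a threshold, which is how a single gate can condense a small network of simple gates. The weights and threshold below are one illustrative assignment, not the TLL parameters optimized in this work.

```python
# A threshold gate computes f(x) = 1 if sum(w_i * x_i) >= T else 0. One such gate can
# replace several simple CMOS gates; e.g. f(a,b,c) = a AND (b OR c) with weights
# (2, 1, 1) and threshold 3. The assertion checks the equivalence over all inputs.
from itertools import product

def threshold_gate(inputs, weights, threshold):
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

for a, b, c in product((0, 1), repeat=3):
    tg = threshold_gate((a, b, c), weights=(2, 1, 1), threshold=3)
    ref = a & (b | c)                              # the small gate network being condensed
    assert tg == ref
    print(a, b, c, "->", tg)
```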
Contributors: Leshner, Samuel (Author) / Vrudhula, Sarma (Thesis advisor) / Chatha, Karamvir (Committee member) / Clark, Lawrence (Committee member) / Shrivastava, Aviral (Committee member) / Arizona State University (Publisher)
Created: 2010
Description
With the breakdown of Dennard scaling, computer architects can no longer rely on integrated circuit energy efficiency to scale with transistor density, and must under-clock or power-gate parts of their designs in order to fit within given power budgets. Hardware accelerators may improve the energy efficiency of some compute-intensive tasks, but as more tasks are accelerated, the general-purpose portions of workloads account for a larger share of execution time while also leaving less instruction-, data-, or task-level parallelism to exploit. Adaptive computing systems have the potential to address these challenges by modifying their behavior at runtime. Adaptation requires runtime decision-making, which can be performed both in hardware and in software. While software-based decision-making is more flexible and can execute higher-complexity operations than hardware, it also incurs significant latency and power overhead. Hardware designs are more limited in the space of decisions they can make, but have direct access to their own internal microarchitectural state and can make faster decisions, allowing for better-informed adaptation and extracting previously unobtainable performance and security benefits. In this dissertation I study (i) the viability and trade-offs of general-purpose adaptive systems, (ii) the difficulty and complexity of making adaptation decisions, and (iii) how time spent in the observation-analysis-adaptation cycle affects adaptation benefits. I introduce techniques for (a) modeling and understanding high-performance computing systems and microarchitecture, (b) enabling hardware learning and decision-making through low-latency networks, and (c) securing hardware designs using runtime decision-making. I propose an always-awake, actively learning 'hardware nervous system' pervasive throughout the chip that can reason about the performance, energy usage, and security of individual hardware modules. I present the design and implementation of (1) a reference architecture and (2) a microarchitecture-aware static binary instrumentation tool. Finally, I provide results showing (1) that runtime adaptation is necessary to continue improving performance on general-purpose tasks, (2) that significant performance loss and performance variation happen below the ISA level and are unobservable without hardware support, and (3) that hardware must possess decision-making and 'self-awareness' capabilities at the microarchitecture level in order to efficiently use its own faculties.
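The observation-analysis-adaptation cycle can be pictured, very loosely, as the loop sketched below: read a module's counters, apply a lightweight decision rule, and actuate a knob. The counter names, decision rule, and knobs are all invented; in the dissertation this reasoning happens in hardware at microarchitectural latency rather than in software.

```python
# Caricature of an observation-analysis-adaptation cycle for one hardware module.
# Counters, thresholds, and adaptation knobs are invented for illustration only.
import random

def observe():
    """Stand-in for reading a module's performance and energy counters."""
    return {"ipc": random.uniform(0.3, 2.0), "power_w": random.uniform(0.5, 3.0)}

def analyze(sample, power_budget_w=2.0):
    """Tiny decision rule: trade issue width for power when over budget."""
    if sample["power_w"] > power_budget_w:
        return "narrow_issue"          # hypothetical adaptation knob
    if sample["ipc"] < 0.5:
        return "boost_prefetch"        # hypothetical adaptation knob
    return "hold"

def adapt(action):
    print("adapt:", action)

for _ in range(5):                      # the real loop would be always-on in hardware
    adapt(analyze(observe()))
```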
Contributors: Isakov, Mihailo (Author) / Kinsy, Michel (Thesis advisor) / Shrivastava, Aviral (Committee member) / Rudd, Kevin (Committee member) / Gadepally, Vijay (Committee member) / Arizona State University (Publisher)
Created: 2022
Description
5G Millimeter Wave (mmWave) technology holds great promise for Connected Autonomous Vehicles (CAVs) due to its ability to achieve data rates in the Gbps range. However, mmWave suffers from high beamforming overhead and requires line of sight (LOS) to maintain a strong connection. These drawbacks become apparent in Vehicle-to-Infrastructure (V2I) scenarios, where CAVs connect to roadside units (RSUs). Because vehicles are dynamic, there is a large potential for link blockages, which in turn are detrimental to the connected applications running on the vehicle, such as cooperative perception and remote driver takeover. Existing RSU selection schemes base their decisions on signal strength and vehicle trajectory alone, which is not enough to prevent the blockage of links. Most recent CAV motion-planning algorithms routinely use other vehicles' near-future plans, obtained either by explicit communication among vehicles or by prediction. In this thesis, I make use of this knowledge of other vehicles' near-future path plans to further improve the RSU association mechanism for CAVs. The resulting scheme, B-AWARE, solves the RSU association problem by converting it into a shortest-path problem whose objective is to maximize total communication bandwidth. Evaluations of B-AWARE in simulation using Simulation of Urban MObility (SUMO) and the Digital twin for self-dRiving Intelligent VEhicles (DRIVE) on 12 highway and city-street scenarios with varying traffic density and RSU placements show that B-AWARE improves the potential data rate by 1.05x in the average case and 1.28x in the best case versus the state of the art. More impressively, B-AWARE reduces the time spent with no connection by 48% in the average case and 251% in the best case compared to state-of-the-art methods. This is partly a result of B-AWARE eliminating almost 100% of blockage occurrences in simulation.
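The conversion of RSU association into a best-path problem can be sketched as dynamic programming over (time step, RSU) states, rewarding the bandwidth predicted along the vehicle's planned path and penalizing handovers. The bandwidth table and handover penalty below are invented and are not B-AWARE's actual model.

```python
# Sketch of RSU association as a best-path problem over (time step, RSU) states.
# bw[t][r] is the predicted bandwidth if the vehicle is associated with RSU r at step t
# (0 when the planned paths put the link in blockage). We maximize total bandwidth
# minus a handover penalty; all numbers are invented for illustration.

def best_association(bw, handover_penalty=5.0):
    steps, n_rsu = len(bw), len(bw[0])
    score = [bw[0][r] for r in range(n_rsu)]            # best score ending at (step 0, RSU r)
    back = [[None] * n_rsu for _ in range(steps)]
    for t in range(1, steps):
        new = []
        for r in range(n_rsu):
            cand = [(score[p] - (handover_penalty if p != r else 0.0), p)
                    for p in range(n_rsu)]
            best_prev_score, best_prev = max(cand)
            back[t][r] = best_prev
            new.append(best_prev_score + bw[t][r])
        score = new
    # Recover the association sequence from the backpointers.
    r = max(range(n_rsu), key=lambda i: score[i])
    path = [r]
    for t in range(steps - 1, 0, -1):
        r = back[t][r]
        path.append(r)
    return list(reversed(path)), max(score)

bw = [[40, 10], [35, 12], [0, 30], [0, 45], [20, 25]]   # invented per-step predictions
print(best_association(bw))                              # -> ([0, 0, 1, 1, 1], 170.0)
```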
Contributors: Szeto, Matthew (Author) / Shrivastava, Aviral (Thesis advisor) / LiKamWa, Robert (Committee member) / Meuth, Ryan (Committee member) / Arizona State University (Publisher)
Created: 2023
Description
Recent advances in autonomous vehicle (AV) technologies have ensured that autonomous driving will soon be present in real-world traffic. Despite the potential of AVs, many studies have shown that traffic accidents in hybrid traffic environments (where both AVs and human-driven vehicles (HVs) are present) are inevitable because of the unpredictability of human-driven vehicles. Given that eliminating accidents is impossible, an achievable goal is to design AVs so that they will not be blamed for any accident in which they are involved. This work proposes BlaFT, a Blame-Free motion planning algorithm in hybrid Traffic. BlaFT is designed to be compatible with HVs and other AVs, and will not be blamed for accidents in a structured road environment. The work also proves that no accidents will happen if all AVs use the BlaFT motion planner, and that in hybrid traffic an AV using BlaFT will be blame-free even if it is involved in a collision. The work instantiated scores of BlaFT and HV vehicles in an urban roadscape loop in the Simulation of Urban MObility (SUMO) simulator, ran the simulation for several hours, and observed that as the percentage of BlaFT vehicles increases, the traffic becomes safer. Adding BlaFT vehicles to HVs also increases the efficiency of traffic as a whole by up to 34%.
Contributors: Park, Sanggu (Author) / Shrivastava, Aviral (Thesis advisor) / Wang, Ruoyu (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)
Created: 2022
Description
Many companies face pressure to deploy flexible compute infrastructures to manage their operations. However, current developments in cloud and edge computing have created a data-processing asymmetry challenge. On the edge, workloads frequently require low-latency responses, contend with connectivity and bandwidth instabilities, may require privacy guarantees, and may perform under limited or high-variance compute resources. In the cloud, workloads tolerate longer latency, expect highly available infrastructure, access high-performance compute resources, and have more power available, but may be further from where the processing results are needed. This compute asymmetry challenge requires a new computational paradigm. In this work, I advance a new computing architecture model, called the Continuum Computing Architecture (CCA), and validate this model with a candidate architecture. CCA is a unifying edge-fog-cloud computing model that provides the following capabilities: (i) a continuum of compute that spans from network-connected edge devices to the cloud, ranging from very low power consumption to high-performance compute; (ii) the same architecture with different micro-architectures along this compute continuum, namely a single RISC-V instruction set architecture with reconfigurable processing units; (iii) portability across all scales, so the same program can be run across the continuum with different latencies and power utilizations; and (iv) fully supported secure shared-memory features, whereby physical memories along the continuum are abstracted to allow edge and cloud to share data in a transparent fashion. The validating architecture has three micro-architectures. The edge micro-architecture, Parmenides, targets accelerator-based edge processing systems-on-chip (SoCs). Parmenides includes security features to protect the SoC in uncontrolled environments while adapting its power usage and processing to ambient events. The fog and cloud micro-architectures, Melissus and Zeno, must support application data distribution across the memory of many compute nodes to achieve the desired scale and performance. As a solution, I introduce the Eleatic Memory Model (EMM): a global shared-memory architecture with hardware-supported global memory access permissions. All memory accesses are made with a Namespace-based capability scheme that supports improved scalability and memory security. The CCA model addresses several memory-centric security challenges, including the misuse of resources, risks to application and data integrity, and concerns over authorization and confidentiality.
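The Namespace-based capability idea in the Eleatic Memory Model can be illustrated, purely as a software analogy, by validating every load and store against a capability that names a namespace, an address range, and permissions. The fields and behavior below are assumptions for illustration, not the EMM hardware interface.

```python
# Software analogy of a capability-checked global memory access (all fields invented).
# Each capability names a namespace, an address range within it, and permissions;
# every load/store is validated against a capability before it is performed.
from dataclasses import dataclass

@dataclass(frozen=True)
class Capability:
    namespace: str
    base: int
    length: int
    perms: str          # subset of "rw"

class GlobalMemory:
    def __init__(self):
        self.mem = {}   # (namespace, addr) -> value

    def access(self, cap, addr, value=None):
        """Perform a checked load (value is None) or store."""
        op = "w" if value is not None else "r"
        if op not in cap.perms or not (cap.base <= addr < cap.base + cap.length):
            raise PermissionError(f"capability violation: {op} @ {cap.namespace}:{addr:#x}")
        key = (cap.namespace, addr)
        if value is None:
            return self.mem.get(key, 0)
        self.mem[key] = value

gm = GlobalMemory()
cap = Capability(namespace="sensor_buffers", base=0x1000, length=0x100, perms="rw")
gm.access(cap, 0x1008, value=42)            # permitted store
print(gm.access(cap, 0x1008))               # permitted load -> 42
ro = Capability("sensor_buffers", 0x1000, 0x100, perms="r")
try:
    gm.access(ro, 0x1008, value=7)          # rejected: capability lacks write permission
except PermissionError as e:
    print(e)
```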
Contributors: Ehret, Alan (Author) / Kinsy, Michel A (Thesis advisor) / Vrudhula, Sarma (Committee member) / Shrivastava, Aviral (Committee member) / Rudd, Kevin (Committee member) / Gettings, Karen (Committee member) / Arizona State University (Publisher)
Created: 2022
Description
Coarse-grain reconfigurable architectures (CGRAs) have shown significant improvements as hardware accelerators while demanding low power. Such acceleration derives from instruction-level parallelism and is exploited by many techniques. Modulo scheduling is a popular software-pipelining technique that provides an efficient heuristic for accelerating loops, the repetitive regions of an application. Existing modulo scheduling algorithms suffer from loop-exit problems that limit CGRA acceleration to loops with a known trip count and no exit statements. Another notable limitation is the early-exit problem, where loops can terminate only after certain iterations once the CGRA moves to the kernel stage. To circumvent these obstacles, COMSAT introduces a modified modulo scheduling technique that acts as an external module and can be applied to any existing scheduling/mapping algorithm with minimal hardware changes. Experiments on the MiBench and Rodinia benchmark suites show that COMSAT achieves an average speedup of 3x over whole benchmarks and 10x in kernel regions. Without COMSAT, only 25% of the targeted loops could have been accelerated, reducing the benchmark and kernel speedups to 1.25x and 3.63x, respectively.
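The early-exit difficulty that COMSAT addresses can be seen with a little arithmetic: in a modulo-scheduled loop with initiation interval II and S pipeline stages, a new iteration is issued every II cycles, so by the time an iteration's exit condition resolves, several later iterations are already in flight and must be squashed or otherwise handled. The toy model below only counts those iterations; the parameters are illustrative and this is not COMSAT's mechanism.

```python
# Toy view of why early exits complicate modulo scheduling: with initiation interval
# II and S pipeline stages, iteration k issues at cycle k*II and its exit resolves
# roughly S*II cycles later, by which time later iterations have already been issued.

def in_flight_at_exit(exit_iter, ii=2, stages=4):
    def issue_cycle(k):
        return k * ii                                   # cycle at which iteration k issues
    exit_cycle = issue_cycle(exit_iter) + stages * ii - 1   # cycle the exit resolves
    speculative = [k for k in range(exit_iter + 1, exit_iter + stages + 1)
                   if issue_cycle(k) <= exit_cycle]
    return exit_cycle, speculative

cycle, squashed = in_flight_at_exit(exit_iter=5)
print(f"exit resolves at cycle {cycle}; iterations {squashed} were issued speculatively")
```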
Contributors: Ta, Vinh (Author) / Shrivastava, Aviral (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Kinsy, Michel (Committee member) / Arizona State University (Publisher)
Created: 2022