Matching Items (8)
Filtering by

Clear all filters

153033-Thumbnail Image.png
Description
Coarse Grain Reconfigurable Arrays (CGRAs) are promising accelerators capable of

achieving high performance at low power consumption. While CGRAs can efficiently

accelerate loop kernels, accelerating loops with control flow (loops with if-then-else

structures) is quite challenging. Techniques that handle control flow execution in

CGRAs generally use predication. Such techniques execute both branches of an

if-then-else

Coarse Grain Reconfigurable Arrays (CGRAs) are promising accelerators capable of

achieving high performance at low power consumption. While CGRAs can efficiently

accelerate loop kernels, accelerating loops with control flow (loops with if-then-else

structures) is quite challenging. Techniques that handle control flow execution in

CGRAs generally use predication. Such techniques execute both branches of an

if-then-else structure and select outcome of either branch to commit based on the

result of the conditional. This results in poor utilization of CGRA s computational

resources. Dual-issue scheme which is the state of the art technique for control flow

fetches instructions from both paths of the branch and selects one to execute at

runtime based on the result of the conditional. This technique has an overhead in

instruction fetch bandwidth. In this thesis, to improve performance of control flow

execution in CGRAs, I propose a solution in which the result of the conditional

expression that decides the branch outcome is communicated to the instruction fetch

unit to selectively issue instructions from the path taken by the branch at run time.

Experimental results show that my solution can achieve 34.6% better performance

and 52.1% improvement in energy efficiency on an average compared to state of the

art dual issue scheme without imposing any overhead in instruction fetch bandwidth.
ContributorsRajendran Radhika, Shri Hari (Author) / Shrivastava, Aviral (Thesis advisor) / Christen, Jennifer Blain (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)
Created2014
150197-Thumbnail Image.png
Description
Ever reducing time to market, along with short product lifetimes, has created a need to shorten the microprocessor design time. Verification of the design and its analysis are two major components of this design cycle. Design validation techniques can be broadly classified into two major categories: simulation based approaches and

Ever reducing time to market, along with short product lifetimes, has created a need to shorten the microprocessor design time. Verification of the design and its analysis are two major components of this design cycle. Design validation techniques can be broadly classified into two major categories: simulation based approaches and formal techniques. Simulation based microprocessor validation involves running millions of cycles using random or pseudo random tests and allows verification of the register transfer level (RTL) model against an architectural model, i.e., that the processor executes instructions as required. The validation effort involves model checking to a high level description or simulation of the design against the RTL implementation. Formal techniques exhaustively analyze parts of the design but, do not verify RTL against the architecture specification. The focus of this work is to implement a fully automated validation environment for a MIPS based radiation hardened microprocessor using simulation based approaches. The basic framework uses the classical validation approach in which the design to be validated is described in a Hardware Definition Language (HDL) such as VHDL or Verilog. To implement a simulation based approach a number of random or pseudo random tests are generated. The output of the HDL based design is compared against the one obtained from a "perfect" model implementing similar functionality, a mismatch in the results would thus indicate a bug in the HDL based design. Effort is made to design the environment in such a manner that it can support validation during different stages of the design cycle. The validation environment includes appropriate changes so as to support architecture changes which are introduced because of radiation hardening. The manner in which the validation environment is build is highly dependent on the specifications of the perfect model used for comparisons. This work implements the validation environment for two MIPS simulators as the reference model. Two bugs have been discovered in the RTL model, using simulation based approaches through the validation environment.
ContributorsSharma, Abhishek (Author) / Clark, Lawrence (Thesis advisor) / Holbert, Keith E. (Committee member) / Shrivastava, Aviral (Committee member) / Arizona State University (Publisher)
Created2011
150524-Thumbnail Image.png
Description
Network-on-Chip (NoC) architectures have emerged as the solution to the on-chip communication challenges of multi-core embedded processor architectures. Design space exploration and performance evaluation of a NoC design requires fast simulation infrastructure. Simulation of register transfer level model of NoC is too slow for any meaningful design space exploration. One

Network-on-Chip (NoC) architectures have emerged as the solution to the on-chip communication challenges of multi-core embedded processor architectures. Design space exploration and performance evaluation of a NoC design requires fast simulation infrastructure. Simulation of register transfer level model of NoC is too slow for any meaningful design space exploration. One of the solutions to reduce the speed of simulation is to increase the level of abstraction. SystemC TLM2.0 provides the capability to model hardware design at higher levels of abstraction with trade-off of simulation speed and accuracy. In this thesis, SystemC TLM2.0 models of NoC routers are developed at three levels of abstraction namely loosely-timed, approximately-timed, and cycle accurate. Simulation speed and accuracy of these three models are evaluated by a case study of a 4x4 mesh NoC.
ContributorsArlagadda Narasimharaju, Jyothi Swaroop (Author) / Chatha, Karamvir S (Thesis advisor) / Sen, Arunabha (Committee member) / Shrivastava, Aviral (Committee member) / Arizona State University (Publisher)
Created2012
149788-Thumbnail Image.png
Description
Residue number systems have gained significant importance in the field of high-speed digital signal processing due to their carry-free nature and speed-up provided by parallelism. The critical aspect in the application of RNS is the selection of the moduli set and the design of the conversion units. There have been

Residue number systems have gained significant importance in the field of high-speed digital signal processing due to their carry-free nature and speed-up provided by parallelism. The critical aspect in the application of RNS is the selection of the moduli set and the design of the conversion units. There have been several RNS moduli sets proposed for the implementation of digital filters. However, some are unbalanced and some do not provide the required dynamic range. This thesis addresses the drawbacks of existing RNS moduli sets and proposes a new moduli set for efficient implementation of FIR filters. An efficient VLSI implementation model has been derived for the design of a reverse converter from RNS to the conventional two's complement representation. This model facilitates the realization of a reverse converter for better performance with less hardware complexity when compared with the reverse converter designs of the existing balanced 4-moduli sets. Experimental results comparing multiply and accumulate units using RNS that are implemented using the proposed four-moduli set with the state-of-the-art balanced four-moduli sets, show large improvements in area (46%) and power (43%) reduction for various dynamic ranges. RNS FIR filters using the proposed moduli-set and existing balanced 4-moduli set are implemented in RTL and compared for chip area and power and observed 20% improvements. This thesis also presents threshold logic implementation of the reverse converter.
ContributorsChalivendra, Gayathri (Author) / Vrudhula, Sarma (Thesis advisor) / Shrivastava, Aviral (Committee member) / Bakkaloglu, Bertan (Committee member) / Arizona State University (Publisher)
Created2011
155034-Thumbnail Image.png
Description
The availability of a wide range of general purpose as well as accelerator cores on

modern smartphones means that a significant number of applications can be executed

on a smartphone simultaneously, resulting in an ever increasing demand on the memory

subsystem. While the increased computation capability is intended for improving

user experience, memory requests

The availability of a wide range of general purpose as well as accelerator cores on

modern smartphones means that a significant number of applications can be executed

on a smartphone simultaneously, resulting in an ever increasing demand on the memory

subsystem. While the increased computation capability is intended for improving

user experience, memory requests from each concurrent application exhibit unique

memory access patterns as well as specific timing constraints. If not considered, this

could lead to significant memory contention and result in lowered user experience.

This work first analyzes the impact of memory degradation caused by the interference

at the memory system for a broad range of commonly-used smartphone applications.

The real system characterization results show that smartphone applications,

such as web browsing and media playback, suffer significant performance degradation.

This is caused by shared resource contention at the application processor’s last-level

cache, the communication fabric, and the main memory.

Based on the detailed characterization results, rest of this thesis focuses on the

design of an effective memory interference mitigation technique. Since web browsing,

being one of the most commonly-used smartphone applications and represents many

html-based smartphone applications, my thesis focuses on meeting the performance

requirement of a web browser on a smartphone in the presence of background processes

and co-scheduled applications. My thesis proposes a light-weight user space frequency

governor to mitigate the degradation caused by interfering applications, by predicting

the performance and power consumption of web browsing. The governor selects an

optimal energy-efficient frequency setting periodically by using the statically-trained

performance and power models with dynamically-varying architecture and system

conditions, such as the memory access intensity of background processes and/or coscheduled applications, and temperature of cores. The governor has been extensively evaluated on a Nexus 5 smartphone over a diverse range of mobile workloads. By

operating at the most energy-efficient frequency setting in the presence of interference,

energy efficiency is improved by as much as 35% and with an average of 18% compared

to the existing interactive governor, while maintaining the satisfactory performance

of web page loading under 3 seconds.
ContributorsShingari, Davesh (Author) / Wu, Carole-Jean (Thesis advisor) / Vrudhula, Sarma (Committee member) / Shrivastava, Aviral (Committee member) / Arizona State University (Publisher)
Created2016
155040-Thumbnail Image.png
Description
Soft errors are considered as a key reliability challenge for sub-nano scale transistors. An ideal solution for such a challenge should ultimately eliminate the effect of soft errors from the microprocessor. While forward recovery techniques achieve fast recovery from errors by simply voting out the wrong values, they incur the

Soft errors are considered as a key reliability challenge for sub-nano scale transistors. An ideal solution for such a challenge should ultimately eliminate the effect of soft errors from the microprocessor. While forward recovery techniques achieve fast recovery from errors by simply voting out the wrong values, they incur the overhead of three copies execution. Backward recovery techniques only need two copies of execution, but suffer from check-pointing overhead.

In this work I explored the efficiency of integrating check-pointing into the application and the effectiveness of recovery that can be performed upon it. After evaluating the available fine-grained approaches to perform recovery, I am introducing InCheck, an in-application recovery scheme that can be integrated into instruction-duplication based techniques, thus providing a fast error recovery. The proposed technique makes light-weight checkpoints at the basic-block granularity, and uses them for recovery purposes.

To evaluate the effectiveness of the proposed technique, 10,000 fault injection experiments were performed on different hardware components of a modern ARM in-order simulated processor. InCheck was able to recover from all detected errors by replaying about 20 instructions, however, the state of the art recovery scheme failed more than 200 times.
ContributorsLokam, Sai Ram Dheeraj (Author) / Shrivastava, Aviral (Thesis advisor) / Clark, Lawrence T (Committee member) / Mubayi, Anuj (Committee member) / Arizona State University (Publisher)
Created2016
155058-Thumbnail Image.png
Description
Coarse-grained Reconfigurable Arrays (CGRAs) are promising accelerators capable

of accelerating even non-parallel loops and loops with low trip-counts. One challenge

in compiling for CGRAs is to manage both recurring and nonrecurring variables in

the register file (RF) of the CGRA. Although prior works have managed recurring

variables via rotating RF, they access the nonrecurring

Coarse-grained Reconfigurable Arrays (CGRAs) are promising accelerators capable

of accelerating even non-parallel loops and loops with low trip-counts. One challenge

in compiling for CGRAs is to manage both recurring and nonrecurring variables in

the register file (RF) of the CGRA. Although prior works have managed recurring

variables via rotating RF, they access the nonrecurring variables through either a

global RF or from a constant memory. The former does not scale well, and the latter

degrades the mapping quality. This work proposes a hardware-software codesign

approach in order to manage all the variables in a local nonrotating RF. Hardware

provides modulo addition based indexing mechanism to enable correct addressing

of recurring variables in a nonrotating RF. The compiler determines the number of

registers required for each recurring variable and configures the boundary between the

registers used for recurring and nonrecurring variables. The compiler also pre-loads

the read-only variables and constants into the local registers in the prologue of the

schedule. Synthesis and place-and-route results of the previous and the proposed RF

design show that proposed solution achieves 17% better cycle time. Experiments of

mapping several important and performance-critical loops collected from MiBench

show proposed approach improves performance (through better mapping) by 18%,

compared to using constant memory.
ContributorsDave, Shail (Author) / Shrivastava, Aviral (Thesis advisor) / Ren, Fengbo (Committee member) / Ogras, Umit Y. (Committee member) / Arizona State University (Publisher)
Created2016
187599-Thumbnail Image.png
Description
5G Millimeter Wave (mmWave) technology holds great promise for Connected Autonomous Vehicles (CAVs) due to its ability to achieve data rates in the Gbps range. However, mmWave suffers high beamforming overhead and requirement of line of sight (LOS) to maintain a strong connection. For Vehicle-to-Infrastructure (V2I) scenarios, where CAVs connect

5G Millimeter Wave (mmWave) technology holds great promise for Connected Autonomous Vehicles (CAVs) due to its ability to achieve data rates in the Gbps range. However, mmWave suffers high beamforming overhead and requirement of line of sight (LOS) to maintain a strong connection. For Vehicle-to-Infrastructure (V2I) scenarios, where CAVs connect to roadside units (RSUs), these drawbacks become apparent. Because vehicles are dynamic, there is a large potential for link blockages, which in turn is detrimental to the connected applications running on the vehicle, such as cooperative perception and remote driver takeover. Existing RSU selection schemes base their decisions on signal strength and vehicle trajectory alone, which is not enough to prevent the blockage of links. Most recent CAVs motion planning algorithms routinely use other vehicle's near-future plans, either by explicit communication among vehicles, or by prediction. In this thesis, I make use of this knowledge (of the other vehicle's near future path plans) to further improve the RSU association mechanism for CAVs. I solve the RSU association problem by converting it to a shortest path problem with the objective to maximize the total communication bandwidth. Evaluations of B-AWARE in simulation using Simulated Urban Mobility (SUMO) and Digital twin for self-dRiving Intelligent VEhicles (DRIVE) on 12 highway and city street scenarios with varying traffic density and RSU placements show that B-AWARE results in a 1.05x improvement of the potential datarate in the average case and 1.28x in the best case vs. the state of the art. But more impressively, B-AWARE reduces the time spent with no connection by 48% in the average case and 251% in the best case as compared to the state-of-the-art methods. This is partly a result of B-AWARE reducing almost 100% of blockage occurrences in simulation.
ContributorsSzeto, Matthew (Author) / Shrivastava, Aviral (Thesis advisor) / LiKamWa, Robert (Committee member) / Meuth, Ryan (Committee member) / Arizona State University (Publisher)
Created2023