Matching Items (17)
Filtering by

Clear all filters

154862-Thumbnail Image.png
Description
The last decade has witnessed a paradigm shift in computing platforms, from laptops and servers to mobile devices like smartphones and tablets. These devices host an immense variety of applications many of which are computationally expensive and thus are power hungry. As most of these mobile platforms are powered by

The last decade has witnessed a paradigm shift in computing platforms, from laptops and servers to mobile devices like smartphones and tablets. These devices host an immense variety of applications many of which are computationally expensive and thus are power hungry. As most of these mobile platforms are powered by batteries, energy efficiency has become one of the most critical aspects of such devices. Thus, the energy cost of the fundamental arithmetic operations executed in these applications has to be reduced. As voltage scaling has effectively ended, the energy efficiency of integrated circuits has ceased to improve within successive generations of transistors. This resulted in widespread use of Application Specific Integrated Circuits (ASIC), which provide incredible energy efficiency. However, these are not flexible and have high non-recurring engineering (NRE) cost. Alternatively, Field Programmable Gate Arrays (FPGA) offer flexibility to implement any application, but at the cost of higher area and energy compared to ASIC.

In this work, a spatially programmable architecture customized for image processing applications is proposed. The intent is to bridge the efficiency gap between ASICs and FPGAs, by offering FPGA-like flexibility and ASIC-like energy efficiency. This architecture minimizes the energy overheads in FPGAs, which result from the use of fine-grained programming style and global interconnect. It is flexible compared to an ASIC and can accommodate multiple applications.

The main contribution of the thesis is the feasibility analysis of the data path of this architecture, customized for image processing applications. The data path is implemented at the register transfer level (RTL), and the synthesis results are obtained in 45nm technology cell library from a leading foundry. The results of image-processing applications demonstrate that this architecture is within a factor of 10x of the energy and area efficiency of ASIC implementations.
ContributorsSatapathy, Saktiswarup (Author) / Brunhaver, John (Thesis advisor) / Clark, Lawrence T (Committee member) / Ren, Fengbo (Committee member) / Arizona State University (Publisher)
Created2016
155058-Thumbnail Image.png
Description
Coarse-grained Reconfigurable Arrays (CGRAs) are promising accelerators capable

of accelerating even non-parallel loops and loops with low trip-counts. One challenge

in compiling for CGRAs is to manage both recurring and nonrecurring variables in

the register file (RF) of the CGRA. Although prior works have managed recurring

variables via rotating RF, they access the nonrecurring

Coarse-grained Reconfigurable Arrays (CGRAs) are promising accelerators capable

of accelerating even non-parallel loops and loops with low trip-counts. One challenge

in compiling for CGRAs is to manage both recurring and nonrecurring variables in

the register file (RF) of the CGRA. Although prior works have managed recurring

variables via rotating RF, they access the nonrecurring variables through either a

global RF or from a constant memory. The former does not scale well, and the latter

degrades the mapping quality. This work proposes a hardware-software codesign

approach in order to manage all the variables in a local nonrotating RF. Hardware

provides modulo addition based indexing mechanism to enable correct addressing

of recurring variables in a nonrotating RF. The compiler determines the number of

registers required for each recurring variable and configures the boundary between the

registers used for recurring and nonrecurring variables. The compiler also pre-loads

the read-only variables and constants into the local registers in the prologue of the

schedule. Synthesis and place-and-route results of the previous and the proposed RF

design show that proposed solution achieves 17% better cycle time. Experiments of

mapping several important and performance-critical loops collected from MiBench

show proposed approach improves performance (through better mapping) by 18%,

compared to using constant memory.
ContributorsDave, Shail (Author) / Shrivastava, Aviral (Thesis advisor) / Ren, Fengbo (Committee member) / Ogras, Umit Y. (Committee member) / Arizona State University (Publisher)
Created2016
155831-Thumbnail Image.png
Description
With the massive multithreading execution feature, graphics processing units (GPUs) have been widely deployed to accelerate general-purpose parallel workloads (GPGPUs). However, using GPUs to accelerate computation does not always gain good performance improvement. This is mainly due to three inefficiencies in modern GPU and system architectures.

First, not all parallel threads

With the massive multithreading execution feature, graphics processing units (GPUs) have been widely deployed to accelerate general-purpose parallel workloads (GPGPUs). However, using GPUs to accelerate computation does not always gain good performance improvement. This is mainly due to three inefficiencies in modern GPU and system architectures.

First, not all parallel threads have a uniform amount of workload to fully utilize GPU’s computation ability, leading to a sub-optimal performance problem, called warp criticality. To mitigate the degree of warp criticality, I propose a Criticality-Aware Warp Acceleration mechanism, called CAWA. CAWA predicts and accelerates the critical warp execution by allocating larger execution time slices and additional cache resources to the critical warp. The evaluation result shows that with CAWA, GPUs can achieve an average of 1.23x speedup.

Second, the shared cache storage in GPUs is often insufficient to accommodate demands of the large number of concurrent threads. As a result, cache thrashing is commonly experienced in GPU’s cache memories, particularly in the L1 data caches. To alleviate the cache contention and thrashing problem, I develop an instruction aware Control Loop Based Adaptive Bypassing algorithm, called Ctrl-C. Ctrl-C learns the cache reuse behavior and bypasses a portion of memory requests with the help of feedback control loops. The evaluation result shows that Ctrl-C can effectively improve cache utilization in GPUs and achieve an average of 1.42x speedup for cache sensitive GPGPU workloads.

Finally, GPU workloads and the co-located processes running on the host chip multiprocessor (CMP) in a heterogeneous system setup can contend for memory resources in multiple levels, resulting in significant performance degradation. To maximize the system throughput and balance the performance degradation of all co-located applications, I design a scalable performance degradation predictor specifically for heterogeneous systems, called HeteroPDP. HeteroPDP predicts the application execution time and schedules OpenCL workloads to run on different devices based on the optimization goal. The evaluation result shows HeteroPDP can improve the system fairness from 24% to 65% when an OpenCL application is co-located with other processes, and gain an additional 50% speedup compared with always offloading the OpenCL workload to GPUs.

In summary, this dissertation aims to provide insights for the future microarchitecture and system architecture designs by identifying, analyzing, and addressing three critical performance problems in modern GPUs.
ContributorsLee, Shin-Ying (Author) / Wu, Carole-Jean (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Ren, Fengbo (Committee member) / Shrivastava, Aviral (Committee member) / Arizona State University (Publisher)
Created2017
155791-Thumbnail Image.png
Description
Caches pose a serious limitation in scaling many-core architectures since the demand of area and power for maintaining cache coherence increases rapidly with the number of cores. Scratch-Pad Memories (SPMs) provide a cheaper and lower power alternative that can be used to build a more scalable many-core architecture. The trade-off

Caches pose a serious limitation in scaling many-core architectures since the demand of area and power for maintaining cache coherence increases rapidly with the number of cores. Scratch-Pad Memories (SPMs) provide a cheaper and lower power alternative that can be used to build a more scalable many-core architecture. The trade-off of substituting SPMs for caches is however that the data must be explicitly managed in software. Heap management on SPM poses a major challenge due to the highly dynamic nature of of heap data access. Most existing heap management techniques implement a software caching scheme on SPM, emulating the behavior of hardware caches. The state-of-the-art heap management scheme implements a 4-way set-associative software cache on SPM for a single program running with one thread on one core. While the technique works correctly, it suffers from signifcant performance overhead. This paper presents a series of compiler-based efficient heap management approaches that reduces heap management overhead through several optimization techniques. Experimental results on benchmarks from MiBenchGuthaus et al. (2001) executed on an SMM processor modeled in gem5Binkert et al. (2011) demonstrate that our approach (implemented in llvm v3.8Lattner and Adve (2004)) can improve execution time by 80% on average compared to the previous state-of-the-art.
ContributorsLin, Jinn-Pean (Author) / Shrivastava, Aviral (Thesis advisor) / Ren, Fengbo (Committee member) / Ogras, Umit Y. (Committee member) / Arizona State University (Publisher)
Created2017
155631-Thumbnail Image.png
Description
The information era has brought about many technological advancements in the past

few decades, and that has led to an exponential increase in the creation of digital images and

videos. Constantly, all digital images go through some image processing algorithm for

various reasons like compression, transmission, storage, etc. There is data loss during

The information era has brought about many technological advancements in the past

few decades, and that has led to an exponential increase in the creation of digital images and

videos. Constantly, all digital images go through some image processing algorithm for

various reasons like compression, transmission, storage, etc. There is data loss during this

process which leaves us with a degraded image. Hence, to ensure minimal degradation of

images, the requirement for quality assessment has become mandatory. Image Quality

Assessment (IQA) has been researched and developed over the last several decades to

predict the quality score in a manner that agrees with human judgments of quality. Modern

image quality assessment (IQA) algorithms are quite effective at prediction accuracy, and

their development has not focused on improving computational performance. The existing

serial implementation requires a relatively large run-time on the order of seconds for a single

frame. Hardware acceleration using Field programmable gate arrays (FPGAs) provides

reconfigurable computing fabric that can be tailored for a broad range of applications.

Usually, programming FPGAs has required expertise in hardware descriptive languages

(HDLs) or high-level synthesis (HLS) tool. OpenCL is an open standard for cross-platform,

parallel programming of heterogeneous systems along with Altera OpenCL SDK, enabling

developers to use FPGA's potential without extensive hardware knowledge. Hence, this

thesis focuses on accelerating the computationally intensive part of the most apparent

distortion (MAD) algorithm on FPGA using OpenCL. The results are compared with CPU

implementation to evaluate performance and efficiency gains.
ContributorsGunavelu Mohan, Aswin (Author) / Sohoni, Sohum (Thesis advisor) / Ren, Fengbo (Thesis advisor) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created2017
155154-Thumbnail Image.png
Description
Achieving human level intelligence is a long-term goal for many Artificial Intelligence (AI) researchers. Recent developments in combining deep learning and reinforcement learning helped us to move a step forward in achieving this goal. Reinforcement learning using a delayed reward mechanism is an approach to machine intelligence which studies decision

Achieving human level intelligence is a long-term goal for many Artificial Intelligence (AI) researchers. Recent developments in combining deep learning and reinforcement learning helped us to move a step forward in achieving this goal. Reinforcement learning using a delayed reward mechanism is an approach to machine intelligence which studies decision making with control and how a decision making agent can learn to act optimally in an environment-unaware conditions.

Q-learning is one of the model-free reinforcement directed learning strategies which uses temporal differences to estimate the performances of state-action pairs called Q values. A simple implementation of Q-learning algorithm can be done using a Q table memory to store and update the Q values. However, with an increase in state space data due to a complex environment, and with an increase in possible number of actions an agent can perform, Q table reaches its space limit and would be difficult to scale well. Q-learning with neural networks eliminates the use of Q table by approximating the Q function using neural networks.

Autonomous agents need to develop cognitive properties and become self-adaptive to be deployable in any environment. Reinforcement learning with Q-learning have been very efficient in solving such problems. However, embedded systems like space rovers and autonomous robots rarely implement such techniques due to the constraints faced like processing power, chip area, convergence rate and cost of the chip. These problems present a need for a portable, low power, area efficient hardware accelerator to accelerate the process of such learning.

This problem is targeted by implementing a hardware schematic architecture for Q-learning using Artificial Neural networks. This architecture exploits the massive parallelism provided by neural network with a dedicated fine grain parallelism provided by a Field Programmable Gate Array (FPGA) thereby processing the Q values at a high throughput. Mars exploration rovers currently use Xilinx-Space-grade FPGA devices for image processing, pyrotechnic operation control and obstacle avoidance. The hardware resource consumption for the architecture has been synthesized considering Xilinx Virtex7 FPGA as the target device.
ContributorsGankidi, Pranay Reddy (Author) / Thangavelautham, Jekanthan (Thesis advisor) / Ren, Fengbo (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created2016
168306-Thumbnail Image.png
Description
Coarse-Grained Reconfigurable Arrays (CGRAs) are emerging accelerators that promise low-power acceleration of compute-intensive loops in applications. The acceleration achieved by CGRA relies on the efficient mapping of the compute-intensive loops by the CGRA compiler onto the CGRA. The CGRA mapping problem, being NP-complete, is performed in a two-step process, scheduling,

Coarse-Grained Reconfigurable Arrays (CGRAs) are emerging accelerators that promise low-power acceleration of compute-intensive loops in applications. The acceleration achieved by CGRA relies on the efficient mapping of the compute-intensive loops by the CGRA compiler onto the CGRA. The CGRA mapping problem, being NP-complete, is performed in a two-step process, scheduling, and mapping. The scheduling algorithm allocates timeslots to the nodes of the DFG, and the mapping algorithm maps the scheduled nodes onto the PEs of the CGRA. On a mapping failure, the initiation interval (II) is increased, and a new schedule is obtained for the increased II. Most previous mapping techniques use the Iterative Modulo Scheduling algorithm (IMS) to find a schedule for a given II. Since IMS generates a resource-constrained ASAP (as-soon-as-possible) scheduling, even with increased II, it tends to generate a similar schedule that is not mappable and does not explore the schedule space effectively. The problems encountered by IMS-based scheduling algorithms are explored and an improved randomized scheduling algorithm for scheduling of the application loop to be accelerated is proposed. When encountering a mapping failure for a given schedule, existing mapping algorithms either exit and retry the mapping anew, or recursively remove the previously mapped node to find a valid mapping (backtrack).Abandoning the mapping is extreme, but even backtracking may not be the best choice, since the root of the problem may not be the previous node. The challenges in existing algorithms are systematically analyzed and a failure-aware mapping algorithm is presented. The loops in general-purpose applications are often complicated loops, i.e., loops with perfect and imperfect nests and loops with nested if-then-else's (conditionals). The existing hardware-software solutions to execute branches and conditions are inefficient. A co-design approach that efficiently executes complicated loops on CGRA is proposed. The compiler transforms complex loops, maps them to the CGRA, and lays them out in the memory in a specific manner, such that the hardware can fetch and execute the instructions from the right path at runtime. Finally, a CGRA compilation simulator open-source framework is presented. This open-source CGRA simulation framework is based on LLVM and gem5 to extract the loop, map them onto the CGRA architecture, and execute them as a co-processor to an ARM CPU.
ContributorsBalasubramanian, Mahesh (Author) / Shrivastava, Aviral (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Ren, Fengbo (Committee member) / Pozzi, Laura (Committee member) / Arizona State University (Publisher)
Created2021
154335-Thumbnail Image.png
Description
Concurrency bugs are one of the most notorious software bugs and are very difficult to manifest. Significant work has been done on detection of atomicity violations bugs for high performance systems but there is not much work related to detect these bugs for embedded systems. Although criteria to claim existence

Concurrency bugs are one of the most notorious software bugs and are very difficult to manifest. Significant work has been done on detection of atomicity violations bugs for high performance systems but there is not much work related to detect these bugs for embedded systems. Although criteria to claim existence of bugs remains same, approach changes a bit for embedded systems. The main focus of this research is to develop a systemic methodology to address the issue from embedded systems perspective. A framework is developed which predicts the access interleaving patterns that may violate atomicity using memory references of shared variables and provides support to force and analyze these schedules for any output change, system fault or change in execution path.
ContributorsPatel, Jay (Author) / Lee, Yann-Hang (Thesis advisor) / Ren, Fengbo (Committee member) / Srivastava, Aviral (Committee member) / Arizona State University (Publisher)
Created2016
157870-Thumbnail Image.png
Description
With the exponential growth in video content over the period of the last few years, analysis of videos is becoming more crucial for many applications such as self-driving cars, healthcare, and traffic management. Most of these video analysis application uses deep learning algorithms such as convolution neural networks (CNN) because

With the exponential growth in video content over the period of the last few years, analysis of videos is becoming more crucial for many applications such as self-driving cars, healthcare, and traffic management. Most of these video analysis application uses deep learning algorithms such as convolution neural networks (CNN) because of their high accuracy in object detection. Thus enhancing the performance of CNN models become crucial for video analysis. CNN models are computationally-expensive operations and often require high-end graphics processing units (GPUs) for acceleration. However, for real-time applications in an energy-thermal constrained environment such as traffic management, GPUs are less preferred because of their high power consumption, limited energy efficiency. They are challenging to fit in a small place.

To enable real-time video analytics in emerging large scale Internet of things (IoT) applications, the computation must happen at the network edge (near the cameras) in a distributed fashion. Thus, edge computing must be adopted. Recent studies have shown that field-programmable gate arrays (FPGAs) are highly suitable for edge computing due to their architecture adaptiveness, high computational throughput for streaming processing, and high energy efficiency.

This thesis presents a generic OpenCL-defined CNN accelerator architecture optimized for FPGA-based real-time video analytics on edge. The proposed CNN OpenCL kernel adopts a highly pipelined and parallelized 1-D systolic array architecture, which explores both spatial and temporal parallelism for energy efficiency CNN acceleration on FPGAs. The large fan-in and fan-out of computational units to the memory interface are identified as the limiting factor in existing designs that causes scalability issues, and solutions are proposed to resolve the issue with compiler automation. The proposed CNN kernel is highly scalable and parameterized by three architecture parameters, namely pe_num, reuse_fac, and vec_fac, which can be adapted to achieve 100% utilization of the coarse-grained computation resources (e.g., DSP blocks) for a given FPGA. The proposed CNN kernel is generic and can be used to accelerate a wide range of CNN models without recompiling the FPGA kernel hardware. The performance of Alexnet, Resnet-50, Retinanet, and Light-weight Retinanet has been measured by the proposed CNN kernel on Intel Arria 10 GX1150 FPGA. The measurement result shows that the proposed CNN kernel, when mapped with 100% utilization of computation resources, can achieve a latency of 11ms, 84ms, 1614.9ms, and 990.34ms for Alexnet, Resnet-50, Retinanet, and Light-weight Retinanet respectively when the input feature maps and weights are represented using 32-bit floating-point data type.
ContributorsDua, Akshay (Author) / Ren, Fengbo (Thesis advisor) / Ogras, Umit Y. (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created2019
158643-Thumbnail Image.png
Description
A Single Event Transient (SET) is a transient voltage pulse induced by an ionizing radiation particle striking a combinational logic node in a circuit. The probability of a storage element capturing the transient pulse depends on the width of the pulse. Measuring the rate of occurrence and the distribution of

A Single Event Transient (SET) is a transient voltage pulse induced by an ionizing radiation particle striking a combinational logic node in a circuit. The probability of a storage element capturing the transient pulse depends on the width of the pulse. Measuring the rate of occurrence and the distribution of SET pulse widths is essential to understand the likelihood of soft errors and to develop cost-effective mitigation schemes. Existing research measures the pulse width of SETs in bulk Complementary Metal-Oxide-Semiconductor (CMOS) and Silicon On Insulator (SOI) technologies, but not on Fin Field-Effect Transistors (FinFETs). This thesis focuses on developing a test structure on the FinFET process to generate, propagate, and separate SETs and build a time-to-digital converter to measure the pulse width of SET.



The proposed SET test structure statistically separates SETs generated at NMOS and PMOS based on the difference in restoring current. It consists of N-collection devices to collect events at NMOS and P-collection devices to collect events at PMOS. The events that occur in PMOS of the N-collection device and NMOS of the P-collection device are false events. The logic gates of the collection devices are skewed to perform pulse expansion so that a minimally sustained SET propagates without getting suppressed by the contamination delay. A symmetric tree structure with an S-R latch event detector localizes the location of the SET. The Cartesian coordinates-based pulse injection structure injects external pulses at specific nodes to perform instrumentation and calibrate the measurement. A thermometer-encoded chain (vernier chain) with mismatched delay paths measures the width of the SET.

For low Linear Energy Transfer (LET) tests, the false events are entirely masked and do not propagate since the amount of charge that has to be deposited for successful event propagation is significantly high. In the case of high LET tests, the actual events and false events propagate, but they can be separated based on the SET location and the width of the output event. The vernier chain has a high measurement resolution of ~3.5ps, which aids in separating the events.
ContributorsShreedharan, Sanjay (Author) / Brunhaver, John (Thesis advisor) / Clark, Lawrence (Committee member) / Sanchez Esqueda, Ivan (Committee member) / Arizona State University (Publisher)
Created2020