Matching Items (16)

Description
Generating real-world content for VR is challenging because it must be captured and processed at high resolution and high frame rates. The content needs to represent a truly immersive experience, where the user can look around in a 360-degree view and perceive the depth of the scene. Existing solutions only capture the content and offload the compute load to a server. But offloading large amounts of raw camera feeds incurs high latency and poses difficulties for real-time applications. By capturing and computing on the edge, we can closely integrate the systems and optimize for low latency. However, moving traditional stitching algorithms to a battery-constrained device requires at least a three-orders-of-magnitude reduction in power. We believe that close integration of the capture and compute stages will lead to reduced overall system power.

We approach the problem by building a hardware prototype and characterizing the end-to-end system bottlenecks in power and performance. The prototype has six IMX274 cameras and uses an Nvidia Jetson TX2 development board for capture and computation. We found that capture is bottlenecked by sensor power and data rates across interfaces, whereas compute is limited by the total number of computations per frame. Our characterization shows that redundant capture and redundant computations lead to high power, a huge memory footprint, and high latency. Existing systems lack hardware-software co-design, leading to excessive data transfers across the interfaces and expensive computations within the individual subsystems. Finally, we propose mechanisms to optimize the system for low power and low latency. We emphasize the importance of co-designing the different subsystems to reduce and reuse data. For example, reusing the motion vectors of the ISP stage reduces the memory footprint of the stereo correspondence stage. Our estimates show that pipelining and parallelization on a custom FPGA can achieve real-time stitching.
Contributors: Gunnam, Sridhar (Author) / LiKamWa, Robert (Thesis advisor) / Turaga, Pavan (Committee member) / Jayasuriya, Suren (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions.

Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLC while being cognizant of its interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%.
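
As a rough illustration of the cost-aware eviction idea described above, and not the ReMAP algorithm itself, the following sketch picks a victim from one cache set by preferring lines predicted to have no reuse and otherwise choosing the line that is cheapest to fetch again from memory. The `CacheLine` fields and the `choose_victim` helper are hypothetical names introduced only for this example; how reuse and memory cost are actually predicted is not specified here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CacheLine:
    tag: int
    predicted_reuse: bool  # hypothetical: output of some reuse predictor
    refetch_cost: int      # hypothetical: estimated cycles to fetch the line again


def choose_victim(cache_set: List[CacheLine]) -> CacheLine:
    """Pick the line to evict from one cache set (illustrative policy only)."""
    # Lines with no predicted reuse can be evicted without a future miss penalty.
    no_reuse = [line for line in cache_set if not line.predicted_reuse]
    candidates = no_reuse if no_reuse else cache_set
    # Among the candidates, evict the line that is cheapest to bring back.
    return min(candidates, key=lambda line: line.refetch_cost)


if __name__ == "__main__":
    lines = [CacheLine(1, True, 50), CacheLine(2, True, 200), CacheLine(3, False, 300)]
    print(choose_victim(lines).tag)
```

The design choice shown is only the ordering of the two criteria: reuse is checked first, and memory cost breaks ties among the remaining candidates.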

Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache that performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC that considers the GPU’s latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications.

Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain performance scalability of GPGPUs in the face of slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future.

In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors.
Contributors: Arunkumar, Akhil (Author) / Wu, Carole-Jean (Thesis advisor) / Shrivastava, Aviral (Committee member) / Lee, Yann-Hang (Committee member) / Bolotin, Evgeny (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
Advances in semiconductor technology have brought computer-based systems into virtually all aspects of human life. This unprecedented integration of semiconductor-based systems in our lives has significantly increased the domain and the number of safety-critical applications, that is, applications with unacceptable consequences of failure. Software-level error resilience schemes are attractive because they can provide commercial-off-the-shelf microprocessors with adaptive and scalable reliability.

Among all software-level error resilience solutions, in-application instruction replication based approaches have been widely used and are deemed to be the most effective. However, existing instruction-based replication schemes protect only part of the computation, i.e., arithmetic and logical instructions, and leave the rest unprotected. To improve the efficacy of instruction-level redundancy-based approaches, we developed several error detection and error correction schemes. nZDC (near Zero silent Data Corruption) is an instruction duplication scheme that protects the execution of the whole application. Rather than detecting errors on register operands of memory and control flow operations, nZDC checks the results of such operations. nZDC ensures the correct execution of a memory write instruction by reloading the stored value and checking it against the redundantly computed value. nZDC also introduces a novel control flow checking mechanism that replicates compare and branch instructions and detects both wrong-direction branches and unwanted jumps. Fault injection experiments show that nZDC can improve the error coverage of state-of-the-art schemes by more than 10x without incurring any additional performance penalty. Furthermore, we introduced two error recovery solutions. InCheck is our backward recovery solution, which makes lightweight error-free checkpoints at basic block granularity. In the case of an error, InCheck reverts program execution to the beginning of the last executed basic block and resumes execution with the aid of the preserved information. NEMESIS is our forward recovery scheme, which runs three versions of the computation and detects errors by checking the results of all memory write and branch operations. In the case of a mismatch, the NEMESIS diagnosis routine decides whether the error is recoverable. If so, the NEMESIS recovery routine reverts the effect of the error from the program state and resumes normal program execution from the error detection point.
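
The store check described above (write the value, reload it, and compare it against a redundantly computed copy) can be sketched at a high level as follows. This is a simplified software model under stated assumptions, not the compiler transformation used by nZDC: memory is modeled as a Python dictionary, and `checked_store` and `SilentDataCorruption` are hypothetical names introduced for illustration.

```python
class SilentDataCorruption(Exception):
    """Raised when a reloaded value disagrees with the redundant copy."""


def checked_store(memory: dict, address: int, master_value: int, shadow_value: int) -> None:
    """Store master_value, then verify the stored result against shadow_value.

    master_value and shadow_value stand for the same result computed twice
    by duplicated instruction streams (an assumption of this sketch).
    """
    memory[address] = master_value
    # Check the result of the store, not just its operands: reload what was
    # actually written and compare it with the redundantly computed value.
    reloaded = memory[address]
    if reloaded != shadow_value:
        raise SilentDataCorruption(
            f"address {address:#x}: stored {reloaded}, expected {shadow_value}"
        )


if __name__ == "__main__":
    mem = {}
    checked_store(mem, 0x10, 7, 7)       # both copies agree, store succeeds
    try:
        checked_store(mem, 0x14, 7, 8)   # copies disagree, error is detected
    except SilentDataCorruption as exc:
        print("detected:", exc)
```

Checking the reloaded value rather than the operands means that a fault affecting either the computed value or the store itself shows up as a mismatch.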
Contributors: Didehban, Moslem (Author) / Shrivastava, Aviral (Thesis advisor) / Wu, Carole-Jean (Committee member) / Clark, Lawrence (Committee member) / Mahlke, Scott (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
The availability of a wide range of general-purpose as well as accelerator cores on modern smartphones means that a significant number of applications can be executed on a smartphone simultaneously, resulting in an ever-increasing demand on the memory subsystem. While the increased computation capability is intended to improve user experience, memory requests from each concurrent application exhibit unique memory access patterns as well as specific timing constraints. If not considered, this could lead to significant memory contention and result in a degraded user experience.

This work first analyzes the impact of the degradation caused by interference at the memory system for a broad range of commonly used smartphone applications. The real-system characterization results show that smartphone applications, such as web browsing and media playback, suffer significant performance degradation. This is caused by shared resource contention at the application processor’s last-level cache, the communication fabric, and the main memory.

Based on the detailed characterization results, the rest of this thesis focuses on the design of an effective memory interference mitigation technique. Since web browsing is one of the most commonly used smartphone applications and represents many HTML-based smartphone applications, my thesis focuses on meeting the performance requirement of a web browser on a smartphone in the presence of background processes and co-scheduled applications. My thesis proposes a lightweight user-space frequency governor that mitigates the degradation caused by interfering applications by predicting the performance and power consumption of web browsing. The governor periodically selects an optimal energy-efficient frequency setting by using statically trained performance and power models together with dynamically varying architecture and system conditions, such as the memory access intensity of background processes and/or co-scheduled applications and the temperature of the cores. The governor has been extensively evaluated on a Nexus 5 smartphone over a diverse range of mobile workloads. By operating at the most energy-efficient frequency setting in the presence of interference, energy efficiency is improved by as much as 35% and by an average of 18% compared to the existing interactive governor, while maintaining satisfactory performance, with web pages loading in under 3 seconds.
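
The selection step of such a governor can be sketched as a small constrained search: among the available frequencies, keep those whose predicted page-load time meets the deadline, then pick the one with the lowest predicted energy. The sketch below is only an illustration under assumptions; `select_frequency`, the toy time and power models, and the frequency list are hypothetical and are not taken from the thesis.

```python
from typing import Callable, Sequence


def select_frequency(
    freqs_mhz: Sequence[int],
    predict_time_s: Callable[[int], float],
    predict_power_w: Callable[[int], float],
    deadline_s: float = 3.0,
) -> int:
    """Return the most energy-efficient frequency that meets the deadline."""
    feasible = [f for f in freqs_mhz if predict_time_s(f) <= deadline_s]
    if not feasible:
        # No setting meets the deadline: fall back to the fastest one.
        return max(freqs_mhz)
    # Energy = predicted power x predicted time.
    return min(feasible, key=lambda f: predict_time_s(f) * predict_power_w(f))


if __name__ == "__main__":
    # Toy models (assumptions): load time shrinks with frequency,
    # power grows roughly cubically with frequency.
    time_model = lambda f: 3000.0 / f
    power_model = lambda f: 0.5 + (f / 1000.0) ** 3
    print(select_frequency([300, 1000, 1500, 2000], time_model, power_model))
```

In a real governor the two models would also take runtime conditions such as background memory intensity and core temperature as inputs; here they depend on frequency only to keep the example short.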
Contributors: Shingari, Davesh (Author) / Wu, Carole-Jean (Thesis advisor) / Vrudhula, Sarma (Committee member) / Shrivastava, Aviral (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
Cyber-physical systems and hard real-time systems have strict timing constraints that specify deadlines by which tasks must finish their execution. Missing a deadline can cause unexpected outcomes or endanger human lives in safety-critical applications, such as automotive or aeronautical systems. It is, therefore, of utmost importance to obtain and optimize a safe upper bound of each task’s execution time, or the worst-case execution time (WCET), to guarantee the absence of any missed deadline. Unfortunately, conventional microarchitectural components, such as caches and branch predictors, are only optimized for average-case performance and often make WCET analysis complicated and pessimistic. Caches especially have a large impact on the worst-case performance due to the expensive off-chip memory accesses involved in cache miss handling. In this regard, software-controlled scratchpad memories (SPMs) have become a promising alternative to caches. An SPM is a raw SRAM, controlled only by executing data movement instructions explicitly at runtime, and such explicit control facilitates static analyses to obtain safe and tight upper bounds of WCETs. SPM management techniques, used in compilers targeting an SPM-based processor, determine how to use a given SPM space by deciding where to insert data movement instructions and what operations to perform at those program locations. This dissertation presents several management techniques for program code and stack data, which aim to optimize the WCETs of a given program. The proposed code management techniques include optimal allocation algorithms and a polynomial-time heuristic for allocating functions to the SPM space, with or without the use of abstraction of SPM regions, and a heuristic for splitting functions into smaller partitions.
The proposed stack data management technique, on the other hand, finds an optimal set of program locations to evict and restore stack frames to avoid stack overflows, when the call stack resides in a size-limited SPM. In the evaluation, the WCETs of various benchmarks including real-world automotive applications are statically calculated for SPMs and caches in several different memory configurations.
Contributors: Kim, Yooseong (Author) / Shrivastava, Aviral (Thesis advisor) / Broman, David (Committee member) / Fainekos, Georgios (Committee member) / Wu, Carole-Jean (Committee member) / Arizona State University (Publisher)
Created: 2017
Description
With their massive multithreading execution capability, graphics processing units (GPUs) have been widely deployed to accelerate general-purpose parallel workloads (GPGPUs). However, using GPUs to accelerate computation does not always yield a good performance improvement. This is mainly due to three inefficiencies in modern GPU and system architectures.

First, not all parallel threads have a uniform amount of workload to fully utilize GPU’s computation ability, leading to a sub-optimal performance problem, called warp criticality. To mitigate the degree of warp criticality, I propose a Criticality-Aware Warp Acceleration mechanism, called CAWA. CAWA predicts and accelerates the critical warp execution by allocating larger execution time slices and additional cache resources to the critical warp. The evaluation result shows that with CAWA, GPUs can achieve an average of 1.23x speedup.

Second, the shared cache storage in GPUs is often insufficient to accommodate the demands of the large number of concurrent threads. As a result, cache thrashing is commonly experienced in the GPU’s cache memories, particularly in the L1 data caches. To alleviate the cache contention and thrashing problem, I develop an instruction-aware Control Loop Based Adaptive Bypassing algorithm, called Ctrl-C. Ctrl-C learns the cache reuse behavior and bypasses a portion of memory requests with the help of feedback control loops. The evaluation result shows that Ctrl-C can effectively improve cache utilization in GPUs and achieve an average of 1.42x speedup for cache-sensitive GPGPU workloads.
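
To make the feedback-loop idea concrete, the sketch below shows one generic way a bypass ratio could be adjusted from an observed cache hit rate using a proportional controller. It is not the Ctrl-C algorithm: the `update_bypass_ratio` function, the target hit rate, and the gain are all assumptions chosen for illustration.

```python
def update_bypass_ratio(
    bypass_ratio: float,
    observed_hit_rate: float,
    target_hit_rate: float = 0.5,  # assumed set point
    gain: float = 0.5,             # assumed proportional gain
) -> float:
    """One step of a proportional feedback loop on the bypass ratio.

    A hit rate below the target suggests thrashing, so more requests are
    bypassed; a hit rate above the target lets more requests use the cache.
    """
    error = target_hit_rate - observed_hit_rate
    new_ratio = bypass_ratio + gain * error
    # The ratio is a fraction of requests, so clamp it to [0, 1].
    return min(1.0, max(0.0, new_ratio))


if __name__ == "__main__":
    ratio = 0.2
    for hit_rate in (0.1, 0.3, 0.6):
        ratio = update_bypass_ratio(ratio, hit_rate)
        print(round(ratio, 3))
```

An instruction-aware variant would keep a separate ratio per memory instruction rather than one global value; that extension is left out here.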

Finally, GPU workloads and the co-located processes running on the host chip multiprocessor (CMP) in a heterogeneous system setup can contend for memory resources at multiple levels, resulting in significant performance degradation. To maximize the system throughput and balance the performance degradation of all co-located applications, I design a scalable performance degradation predictor specifically for heterogeneous systems, called HeteroPDP. HeteroPDP predicts the application execution time and schedules OpenCL workloads to run on different devices based on the optimization goal. The evaluation result shows that HeteroPDP can improve the system fairness from 24% to 65% when an OpenCL application is co-located with other processes, and gain an additional 50% speedup compared with always offloading the OpenCL workload to GPUs.

In summary, this dissertation aims to provide insights for the future microarchitecture and system architecture designs by identifying, analyzing, and addressing three critical performance problems in modern GPUs.
Contributors: Lee, Shin-Ying (Author) / Wu, Carole-Jean (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Ren, Fengbo (Committee member) / Shrivastava, Aviral (Committee member) / Arizona State University (Publisher)
Created: 2017
Description
Deep neural network-based methods have been shown to achieve outstanding performance on object detection and classification tasks. Deep neural networks follow the "deeper model with deeper confidence" belief to gain higher recognition accuracy. However, reducing these networks' computational costs remains a challenge, which impedes their deployment on embedded devices. For instance, the intersection management of Connected Autonomous Vehicles (CAVs) requires running computationally intensive object recognition algorithms on low-power traffic cameras. This dissertation studies the effect of a dynamic hardware and software approach to address this issue. Characteristics of real-world applications can facilitate this dynamic adjustment and reduce the computation. Specifically, this dissertation starts with a dynamic hardware approach that adjusts itself based on the difficulty of the input and extracts deeper features if needed. Next, an adaptive learning mechanism is studied that uses features extracted from previous inputs to improve system performance. Finally, a system (ARGOS) is proposed and evaluated that can run on embedded systems while maintaining the desired accuracy. This system adopts shallow features at inference time, but it can switch to deep features if a higher accuracy is desired. To improve performance, ARGOS distills the temporal knowledge from deep features to the shallow system. Moreover, ARGOS further reduces the computation by focusing on regions of interest. Response time and mean average precision are used to evaluate the proposed ARGOS system.
Contributors: Farhadi, Mohammad (Author) / Yang, Yezhou (Thesis advisor) / Vrudhula, Sarma (Committee member) / Wu, Carole-Jean (Committee member) / Ren, Yi (Committee member) / Arizona State University (Publisher)
Created: 2022
Description
User satisfaction is pivotal to the success of mobile applications. At the same time, it is imperative to maximize the energy efficiency of the mobile device to make the best use of its limited energy source while maintaining the necessary levels of user satisfaction. However, this is complicated by user interactions, numerous shared resources, and network conditions that introduce substantial uncertainty into the mobile device's performance and power characteristics. In this dissertation, a new approach is presented to characterize and control mobile devices that accurately models these uncertainties. The proposed modeling framework is a completely data-driven approach to predicting power and performance. The approach makes no assumptions about the distributions of the underlying sources of uncertainty and is capable of predicting power and performance with over 93% accuracy.

Using this data-driven prediction framework, a closed-loop solution to the DEM problem is derived to maximize the energy efficiency of the mobile device subject to various thermal, reliability and deadline constraints. The design of the controller imposes minimal operational overhead and is able to tune the performance and power prediction models to changing system conditions. The proposed controller is implemented on a real mobile platform, the Google Pixel smartphone, and demonstrates a 19% improvement in energy efficiency over the standard frequency governor implemented on all Android devices.
Contributors: Gaudette, Benjamin David (Author) / Vrudhula, Sarma (Thesis advisor) / Wu, Carole-Jean (Thesis advisor) / Fainekos, Georgios (Committee member) / Shrivastava, Aviral (Committee member) / Arizona State University (Publisher)
Created: 2017
Description
Heterogeneous SoCs are in development that marry multiple architectural patterns together. In order for software to run on such a platform, it must be broken down into its constituent parts (kernels) and scheduled for execution on the hardware. Although this can be done by hand, it would be arduous and time-consuming; rather, a tool should be developed that analyzes the source binary, extracts the kernels, schedules the kernels, and optimizes the scheduled kernels for their target component. This dissertation proposes a decidable kernel definition that enables an algorithmic approach to detecting kernels in arbitrary programs. This definition is built upon four constraints that can be tested using basic graph theory. In addition, two algorithms are proposed that successfully extract kernels based upon runtime information. The first utilizes dynamic traces, which are generated using a collection of novel optimizations. The second utilizes a simple affinity matrix, which has no runtime overhead during program execution. Finally, a Dense Neural Network is proposed that is capable of detecting a kernel's archetype based upon only the composition of the source program and the number of times individual basic blocks execute. The contributions proposed in this dissertation provide the necessary infrastructure to perform a litany of other optimizations on kernels. By detecting kernels algorithmically, any program can be analyzed and optimized with techniques that have heretofore required that kernels be written in a compatible form. Computational kernels can be extracted from any program with no constraints. The innovations described here will form the foundation for automated kernel optimization, helping optimize the code of the future.
Contributors: Uhrie, Richard Lawrence (Author) / Brunhaver, John (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Shrivastava, Aviral (Committee member) / Wu, Carole-Jean (Committee member) / Arizona State University (Publisher)
Created: 2021
Description
With the rapid development of both hardware and software, mobile devices, with their advantages in mobility, interactivity, and privacy, have enabled various applications, including social networking, mixed reality, entertainment, and authentication. In diverse forms such as smartphones, glasses, and watches, the number of mobile devices is expected to increase by 1 billion per year in the future. These devices not only generate and exchange small data such as GPS data, but also large data including videos and point clouds. Such massive visual data presents many challenges for processing on mobile devices. First, continuously capturing and processing high-resolution visual data is energy-intensive, which can drain the battery of a mobile device very quickly. Second, data offloading for edge or cloud computing is helpful, but users are afraid that their privacy can be exposed to malicious developers. Third, interactivity and user experience are degraded if mobile devices cannot process large-scale visual data, such as off-device high-precision point clouds, in real time. To deal with these challenges, this work presents three solutions towards fine-grained control of visual data in mobile systems, revolving around two core ideas: enabling resolution-based tradeoffs and adopting a split-process design to protect visual data. In particular, this work introduces: (1) the Banner media framework, which removes resolution reconfiguration latency in the operating system to enable seamless dynamic resolution-based tradeoffs; (2) the LensCap split-process application development framework, which protects users' visual privacy against malicious data collection in cloud-based Augmented Reality (AR) applications by isolating the visual processing in a distinct process; and (3) a novel voxel grid schema that enables adaptive sampling at the edge device, so that point clouds can be sampled flexibly for interactive 3D vision use cases across mobile devices and mobile networks. The evaluation in several mobile environments demonstrates that, by controlling visual data at a fine granularity, energy efficiency can be improved by 49% when switching between resolutions, visual privacy can be protected through the split-process design with negligible overhead, and point clouds can be delivered at a high throughput meeting various requirements. Thus, this work can enable more continuous mobile vision applications for the future of a new reality.
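
For readers unfamiliar with voxel-grid sampling, the sketch below shows the generic technique: points are bucketed into cubic cells and each occupied cell is replaced by the centroid of its points, so the cell size acts as a knob that trades point-cloud density for data volume. This is the textbook form of voxel-grid downsampling, not the schema proposed in the thesis; `voxel_downsample` and its parameters are hypothetical names for this example.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def voxel_downsample(points: List[Point], voxel_size: float) -> List[Point]:
    """Replace all points inside each occupied voxel with their centroid."""
    buckets: Dict[Tuple[int, int, int], List[Point]] = defaultdict(list)
    for x, y, z in points:
        # Integer cell index of the voxel containing this point.
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        buckets[key].append((x, y, z))

    sampled = []
    for cell_points in buckets.values():
        n = len(cell_points)
        sampled.append(
            (
                sum(p[0] for p in cell_points) / n,
                sum(p[1] for p in cell_points) / n,
                sum(p[2] for p in cell_points) / n,
            )
        )
    return sampled


if __name__ == "__main__":
    cloud = [(0.1, 0.1, 0.1), (0.3, 0.3, 0.3), (1.2, 0.0, 0.0)]
    print(len(voxel_downsample(cloud, 1.0)))  # coarse grid, fewer points
    print(len(voxel_downsample(cloud, 0.2)))  # fine grid, more points
```

An adaptive sampler could choose `voxel_size` per request, for example a larger cell when bandwidth is scarce, which is the kind of tradeoff the abstract alludes to.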
Contributors: Hu, Jinhan (Author) / LiKamWa, Robert (Thesis advisor) / Wu, Carole-Jean (Committee member) / Doupe, Adam (Committee member) / Jayasuriya, Suren (Committee member) / Arizona State University (Publisher)
Created: 2022