Matching Items (8)

Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors

Description

General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions.

Addressing the important last-level cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that, in order to maximize system performance, it is essential to manage the LLC while being cognizant of its interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%.

In GPGPUs, it is the L1 data cache rather than the LLC that has a pronounced impact on performance, because it acts as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. First, I propose ID-Cache, which performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC, which takes the GPU’s latency tolerance into account and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications.

Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain the performance scalability of GPGPUs in the face of a slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for it, the MCM-GPU achieves within 7% of the performance of a similar but hypothetical monolithic-die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck of future MCM-GPUs.

In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors.

Date Created
  • 2018

An analysis of the memory bottleneck and cache performance of most apparent distortion image quality assessment algorithm on GPU

Description

When digital images are transmitted over a network or stored on disk, they are processed, as part of the relevant standard, for efficient storage and bandwidth use. This processing introduces some amount of distortion or artifacts in the image, which creates the need for quality assessment. Subjective image quality assessment is expensive, time-consuming, and influenced by the subject's perception. Hence, there is a need for mathematical models that can predict the outcome of such quality evaluation. With the advent of the information era and exponential growth in image/video generation and consumption, automated quality assessment has become a necessity for assessing degradation. The last few decades have seen research on automated image quality assessment (IQA) algorithms gain prominence. However, the focus has been on achieving better prediction accuracy, not on improving computational performance. As a result, existing serial implementations take a long time to process a single frame. In the last 5 years, research on general-purpose graphics processing unit (GPGPU) based implementations of IQA algorithms has shown promising results for single images. Still, these implementations are not efficient enough for deployment in real-world applications, especially for live video at high resolution. Hence, this thesis proposes that microarchitecture-conscious coding on a graphics processing unit (GPU), combined with a detailed understanding of the IQA algorithm, can yield non-trivial speedups without compromising quality prediction accuracy. This document focuses on the microarchitectural analysis of the most apparent distortion (MAD) algorithm. The results are analyzed in depth and one of the major bottlenecks is identified. With knowledge of the underlying microarchitecture, the implementation is restructured, thereby resolving the bottleneck and improving performance.

Date Created
  • 2016

GPGPU based implementation of BLIINDS-II NR-IQA

Description

The technological advances of the past few decades have made possible the creation and consumption of digital visual content at an explosive rate. Consequently, there is a need for efficient quality monitoring systems to ensure minimal degradation of images and videos during processing operations such as compression, transmission, and storage. Objective Image Quality Assessment (IQA) algorithms have been developed that predict quality scores which match well with human subjective quality assessment. However, a lot of research remains to be done before IQA algorithms can be deployed in real-world systems. The long runtime for a single image frame is a major hurdle. Graphics Processing Units (GPUs), equipped with a massive number of computational cores, provide an opportunity to accelerate IQA algorithms by performing computations in parallel. Indeed, General-Purpose Graphics Processing Unit (GPGPU) techniques have been applied to a few Full Reference IQA algorithms. We present a GPGPU implementation of Blind Image Integrity Notator using DCT Statistics (BLIINDS-II), which falls under the No Reference IQA algorithm paradigm. We achieve a speedup of more than 30x over the previous CPU version of this algorithm. We test our implementation using various distorted images from the CSIQ database and present the performance trends observed. We achieve a very consistent performance of around 9 milliseconds per distorted image, which makes it possible to process over 100 images per second (100 fps).

Date Created
  • 2016

Exploring latent structure in data: algorithms and implementations

Description

Feature representation of raw data is one of the most important components of a machine learning system. Traditionally, features are hand-crafted by domain experts, which is often a time-consuming process. Furthermore, such features do not generalize well to unseen data and novel tasks. Recently, there have been many efforts to generate data-driven representations using clustering and sparse models. This dissertation focuses on building data-driven unsupervised models for analyzing raw data and developing efficient feature representations.

Simultaneous segmentation and feature extraction approaches for silicon-pores sensor data are considered. Aggregating data into a matrix and performing low rank and sparse matrix decompositions with additional smoothness constraints are proposed to solve this problem. Comparison of several variants of the approaches and results for signal de-noising and translocation/trapping event extraction are presented. Algorithms to improve transform-domain features for ion-channel time-series signals based on matrix completion are presented. The improved features achieve better performance in classification tasks and in reducing the false alarm rates when applied to analyte detection.

Developing representations for multimedia is an important and challenging problem with applications ranging from scene recognition, multi-media retrieval and personal life-logging systems to field robot navigation. In this dissertation, we present a new framework for feature extraction for challenging natural environment sounds. Proposed features outperform traditional spectral features on challenging environmental sound datasets. Several algorithms are proposed that perform supervised tasks such as recognition and tag annotation. Ensemble methods are proposed to improve the tag annotation process.

To facilitate the use of large datasets, fast implementations are developed for sparse coding, the key component in our algorithms. Several strategies are proposed to speed up the Orthogonal Matching Pursuit algorithm using CUDA kernels on a GPU. Implementations are also developed for a large-scale image retrieval system. Image-based "exact search" and "visually similar search" are performed using the sparse codes of image patches. Results demonstrate a large speedup over CPU implementations together with good retrieval performance.
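For readers unfamiliar with Orthogonal Matching Pursuit, the following is a minimal serial sketch in plain Python, not the dissertation's CUDA implementation. The function names (`dot`, `solve`, `omp`) and the list-of-lists dictionary layout are assumptions made for illustration. Each iteration picks the atom most correlated with the residual, then refits all selected coefficients by least squares via the normal equations.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def solve(A, b):
    """Solve a small square system A x = b by Gaussian elimination
    with partial pivoting (A is assumed non-singular)."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def omp(atoms, signal, sparsity):
    """Hypothetical helper: atoms is a list of unit-norm vectors.
    Returns {atom index: coefficient} with at most `sparsity` entries."""
    residual = list(signal)
    support, coeffs = [], []
    for _ in range(sparsity):
        # Greedy step: atom with the largest correlation to the residual.
        scores = [abs(dot(a, residual)) for a in atoms]
        candidates = [i for i in range(len(atoms)) if i not in support]
        best = max(candidates, key=lambda i: scores[i])
        if scores[best] < 1e-12:
            break  # residual is already (numerically) explained
        support.append(best)
        # Orthogonal step: least-squares refit on the selected atoms.
        gram = [[dot(atoms[i], atoms[j]) for j in support] for i in support]
        rhs = [dot(atoms[i], signal) for i in support]
        coeffs = solve(gram, rhs)
        residual = [
            s - sum(c * atoms[i][k] for c, i in zip(coeffs, support))
            for k, s in enumerate(signal)
        ]
    return dict(zip(support, coeffs))
```

The correlation scores and the Gram-matrix entries are independent inner products, which is one reason the algorithm is a natural candidate for GPU acceleration.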

Date Created
  • 2014

A GPU accelerated discontinuous Galerkin conservative level set method for simulating atomization

Description

This dissertation describes a process for interface capturing via an arbitrary-order, nearly quadrature free, discontinuous Galerkin (DG) scheme for the conservative level set method (Olsson et al., 2005, 2008). The DG numerical method is utilized to solve both advection and reinitialization, and executed on a refined level set grid (Herrmann, 2008) for effective use of processing power. Computation is executed in parallel utilizing both CPU and GPU architectures to make the method feasible at high order. Finally, a sparse data structure is implemented to take full advantage of parallelism on the GPU, where performance relies on well-managed memory operations.

With solution variables projected into a kth order polynomial basis, a k+1 order convergence rate is found for both advection and reinitialization tests using the method of manufactured solutions. Other standard test cases, such as Zalesak's disk and deformation of columns and spheres in periodic vortices are also performed, showing several orders of magnitude improvement over traditional WENO level set methods. These tests also show the impact of reinitialization, which often increases shape and volume errors as a result of level set scalar trapping by normal vectors calculated from the local level set field.
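A convergence rate of this kind is typically estimated from the errors measured on two grid resolutions. The sketch below shows that standard calculation; the function name `observed_order` and the sample numbers in the usage note are illustrative assumptions, not data from the dissertation.

```python
import math


def observed_order(h_coarse, err_coarse, h_fine, err_fine):
    """Observed order of accuracy p, assuming the error behaves like
    err ~ C * h**p on both grids (h is the grid spacing)."""
    return math.log(err_coarse / err_fine) / math.log(h_coarse / h_fine)
```

For example, if halving the grid spacing reduces the error by a factor of 8, the observed order is 3, which is what a k+1 rate would give for a k = 2 polynomial basis.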

Accelerating advection via GPU hardware is found to provide a 30x speedup when comparing a serial run on a 2.0 GHz Intel Xeon E5-2620 CPU with an NVIDIA Tesla K20 GPU, with speedup factors increasing with polynomial degree until shared memory is filled. A similar algorithm is implemented for reinitialization, which relies more heavily on shared and global memory; as a result, it fills them more quickly and produces a smaller speedup of 18x.

Date Created
  • 2015

The C++ implementation of refined level set grid (RLSG) method

Description

In this thesis, a FORTRAN code is rewritten in C++ with an object-oriented approach. There are several reasons for this rewrite. The first reason is to establish a basis for GPU programming. To write programs that utilize GPU hardware, CUDA or OpenCL is used, and these support only C and C++. FORTRAN has a feature that lets its programs call C/C++ functions. FORTRAN sends the relevant data to C/C++, which in turn sends that data to OpenCL. Although this approach works, it makes the code messy and bulky and, in the end, more difficult to deal with. Moreover, there is a slight performance decrease from the additional data copy. This is the motivation for having the code written entirely in C++, to make it more uniform, efficient, and clean. The second reason is the object-oriented nature of C++. The “abstraction”, “inheritance”, and “run-time polymorphism” features of C++ provide some form of classes and objects, the ability to build new abstractions, and some form of run-time binding, respectively. In recent years, some popular codes that were initially written in FORTRAN have been rewritten in C++. One of these is LAMMPS.

In this code, the level set equation is solved by the RLSG method to track the interface in two-phase flow. In gas/fluid flows, surface tension is important and exists only at the interface. Therefore, the location and some geometric features of the interface need to be evaluated, which can be achieved by solving the level set equation.

Date Created
  • 2015

A fast fluid simulator using smoothed-particle hydrodynamics

Description

This document presents a new implementation of the Smoothed Particle Hydrodynamics algorithm using DirectX 11 and DirectCompute. The main goal of this document is to present an alternative solution to the widely studied problem of fluid simulation. Most other solutions have been implemented using the NVIDIA CUDA framework; the solution proposed here instead uses Microsoft's API for general-purpose computing on graphics processing units. The implementation allows for the simulation of a large number of particles in a real-time scenario. The solution uses the Smoothed Particle Hydrodynamics algorithm to calculate the forces within the fluid; this algorithm provides a Lagrangian approach for discretizing the Navier-Stokes equations into a set of particles. Our solution uses DirectCompute compute shaders to evaluate each particle, exploiting the multithreading and multi-core capabilities of the GPU to increase overall performance. The document then describes a method for extracting the fluid surface using the Marching Cubes method and the programmable interfaces exposed by the DirectX pipeline. In particular, it presents a method for using the Geometry Shader Stage to generate the triangle mesh as defined by the Marching Cubes method. The implementation results show the ability to simulate over 64K particles at 900 frames per second without the surface reconstruction steps, and at 400 frames per second with the Marching Cubes steps included.
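As a rough illustration of the per-particle work in Smoothed Particle Hydrodynamics, the sketch below computes particle densities by summing kernel-weighted neighbor masses. It is a serial Python sketch, not the thesis's DirectCompute shader. The poly6 smoothing kernel is a common choice in real-time SPH and is assumed here; the abstract does not say which kernel the thesis uses, and the names `poly6` and `densities` are hypothetical.

```python
import math


def poly6(r, h):
    """Poly6 smoothing kernel in 3D: 315 / (64 pi h^9) * (h^2 - r^2)^3
    for 0 <= r <= h, and zero outside the support radius h."""
    if r > h:
        return 0.0
    return 315.0 / (64.0 * math.pi * h ** 9) * (h * h - r * r) ** 3


def densities(positions, mass, h):
    """Density at each particle: rho_i = sum_j mass * W(|x_i - x_j|, h).
    Brute-force O(n^2) loop; each outer iteration is independent,
    so one GPU thread per particle is a natural mapping."""
    rho = []
    for xi in positions:
        total = 0.0
        for xj in positions:
            r = math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))
            total += mass * poly6(r, h)
        rho.append(total)
    return rho
```

Practical implementations usually replace the inner loop with a spatial grid lookup so that only particles within the support radius are visited.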

Date Created
  • 2012

Analysis of hardware usage of shuffle instruction based performance optimization in the Blinds-II image quality assessment algorithm

Description

With the advent of GPGPU, many applications are being accelerated using the CUDA programming paradigm. Speedups of around 10x-100x can be achieved simply by porting the application to the GPU and running the parallel portion of the code on its multi-core SIMT (single instruction, multiple thread) architecture. For optimal performance, however, it is necessary to make sure that all GPU resources are used efficiently and that latencies in the application are minimized. To this end, it is essential to monitor the hardware usage of the algorithm and thereby diagnose the compute and memory bottlenecks in the implementation. In this thesis, we analyze how a CUDA implementation of the BLIINDS-II algorithm maps onto the underlying GPU hardware, and propose a Kepler-architecture-specific solution that uses the shuffle instruction, via the CUB library, to tackle the two major bottlenecks in the algorithm. Experiments were conducted to demonstrate the advantage of using the shuffle instruction in the algorithm over using only shared memory as a buffer to global memory. With the new implementation of the BLIINDS-II algorithm using the CUB library, a speedup of around 13.7% was achieved.
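To illustrate the data movement behind a shuffle-based reduction, the sketch below simulates in plain Python how the 32 lanes of a warp can sum their values by exchanging registers directly, without a shared-memory buffer. It is a conceptual model only, not the thesis's CUB-based code. The helper `shfl_down` is a hypothetical stand-in that mimics the shuffle-down behavior of lane i reading the value held by lane i + delta, with lanes whose source is out of range keeping their own value.

```python
WARP_SIZE = 32


def shfl_down(values, delta):
    """Simulated shuffle-down: lane i receives the value of lane i + delta.
    Lanes whose source lane would fall outside the warp keep their own value."""
    n = len(values)
    return [values[i + delta] if i + delta < n else values[i] for i in range(n)]


def warp_reduce_sum(values):
    """Tree reduction over one warp in log2(32) = 5 shuffle steps.
    After the last step, lane 0 holds the sum of all 32 inputs."""
    assert len(values) == WARP_SIZE
    lanes = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        received = shfl_down(lanes, offset)
        # Every lane adds what it received, as all lanes execute in lockstep.
        lanes = [own + other for own, other in zip(lanes, received)]
        offset //= 2
    return lanes[0]
```

Only lane 0 is guaranteed to hold the full sum at the end; the upper lanes accumulate partial or repeated terms and are normally ignored.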

Date Created
  • 2017