Search Content

An analysis of the memory bottleneck and cache performance of most apparent distortion image quality assessment algorithm on GPU

Description

As digital images are transmitted over the network or stored on a disk, image processing is done as part of the standard for efficient storage and bandwidth. This causes some amount of distortion or artifacts in the image which demands the need for quality assessment. Subjective image quality assessment is…

As digital images are transmitted over the network or stored on a disk, image processing is done as part of the standard for efficient storage and bandwidth. This causes some amount of distortion or artifacts in the image which demands the need for quality assessment. Subjective image quality assessment is expensive, time consuming and influenced by the subject's perception. Hence, there is a need for developing mathematical models that are capable of predicting the quality evaluation. With the advent of the information era and an exponential growth in image/video generation and consumption, the requirement for automated quality assessment has become mandatory to assess the degradation. The last few decades have seen research on automated image quality assessment (IQA) algorithms gaining prominence. However, the focus has been on achieving better predication accuracy, and not on improving computational performance. As a result, existing serial implementations require a lot of time in processing a single frame. In the last 5 years, research on general-purpose graphic processing unit (GPGPU) based image quality assessment (IQA) algorithm implementation has shown promising results for single images. Still, the implementations are not efficient enough for deployment in real world applications, especially for live videos at high resolution. Hence, in this thesis, it is proposed that microarchitecture-conscious coding on a graphics processing unit (GPU) combined with detailed understanding of the image quality assessment (IQA) algorithm can result in non-trivial speedups without compromising quality prediction accuracy. This document focusses on the microarchitectural analysis of the most apparent distortion (MAD) algorithm. The results are analyzed in-depth and one of the major bottlenecks is identified. With the knowledge of underlying microarchitecture, the implementation is restructured thereby resolving the bottleneck and improving the performance.

ContributorsKannan, Vignesh (Author) / Sohoni, Sohum (Thesis advisor) / Ren, Fengbo (Committee member) / Sayeed, Mohamed (Committee member) / Arizona State University (Publisher)

Created2016

Fido Tracker

Description

Millions of pets go missing every year and this project has the purpose of offering a pet GPS tracking solution to aid in this issue. An Arduino microcontroller was combined with a GPS module and GSM module to create the hardware of the device, which was then connected to a…

Millions of pets go missing every year and this project has the purpose of offering a pet GPS tracking solution to aid in this issue. An Arduino microcontroller was combined with a GPS module and GSM module to create the hardware of the device, which was then connected to a mobile application that was developed for the explicit purpose of this project. Amazon Web Services was used to significantly bring down the cost of connecting the hardware to the mobile app. Upon the completion of the project, a prototype pet GPS tracking device and mobile application were developed, and instructions were given so that any user could re-create the same solution for their own purposes.

ContributorsKiaei, Ariana (Author) / Ren, Fengbo (Thesis director) / Abraham, Seth (Committee member) / Computer Science and Engineering Program (Contributor) / Electrical Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created2021-05

Compiler Design for Accelerating Applications on Coarse-Grained Reconfigurable Architectures

Description

Coarse-Grained Reconfigurable Arrays (CGRAs) are emerging accelerators that promise low-power acceleration of compute-intensive loops in applications. The acceleration achieved by CGRA relies on the efficient mapping of the compute-intensive loops by the CGRA compiler onto the CGRA. The CGRA mapping problem, being NP-complete, is performed in a two-step process, scheduling,…

Coarse-Grained Reconfigurable Arrays (CGRAs) are emerging accelerators that promise low-power acceleration of compute-intensive loops in applications. The acceleration achieved by CGRA relies on the efficient mapping of the compute-intensive loops by the CGRA compiler onto the CGRA. The CGRA mapping problem, being NP-complete, is performed in a two-step process, scheduling, and mapping. The scheduling algorithm allocates timeslots to the nodes of the DFG, and the mapping algorithm maps the scheduled nodes onto the PEs of the CGRA. On a mapping failure, the initiation interval (II) is increased, and a new schedule is obtained for the increased II. Most previous mapping techniques use the Iterative Modulo Scheduling algorithm (IMS) to find a schedule for a given II. Since IMS generates a resource-constrained ASAP (as-soon-as-possible) scheduling, even with increased II, it tends to generate a similar schedule that is not mappable and does not explore the schedule space effectively. The problems encountered by IMS-based scheduling algorithms are explored and an improved randomized scheduling algorithm for scheduling of the application loop to be accelerated is proposed. When encountering a mapping failure for a given schedule, existing mapping algorithms either exit and retry the mapping anew, or recursively remove the previously mapped node to find a valid mapping (backtrack).Abandoning the mapping is extreme, but even backtracking may not be the best choice, since the root of the problem may not be the previous node. The challenges in existing algorithms are systematically analyzed and a failure-aware mapping algorithm is presented. The loops in general-purpose applications are often complicated loops, i.e., loops with perfect and imperfect nests and loops with nested if-then-else's (conditionals). The existing hardware-software solutions to execute branches and conditions are inefficient. A co-design approach that efficiently executes complicated loops on CGRA is proposed. The compiler transforms complex loops, maps them to the CGRA, and lays them out in the memory in a specific manner, such that the hardware can fetch and execute the instructions from the right path at runtime. Finally, a CGRA compilation simulator open-source framework is presented. This open-source CGRA simulation framework is based on LLVM and gem5 to extract the loop, map them onto the CGRA architecture, and execute them as a co-processor to an ARM CPU.

ContributorsBalasubramanian, Mahesh (Author) / Shrivastava, Aviral (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Ren, Fengbo (Committee member) / Pozzi, Laura (Committee member) / Arizona State University (Publisher)

Created2021

A Methodology and Formalism to Handle Timing Uncertainties in Cyber-Physical Systems

Description

Uncertainty is intrinsic in Cyber-Physical Systems since they interact with human and work with both analog and digital worlds. Since even minute deviation from the real values can make catastrophe in a safety-critical application, considering uncertainties in CPS behavior is essential. On the other side, time is a…

Uncertainty is intrinsic in Cyber-Physical Systems since they interact with human and work with both analog and digital worlds. Since even minute deviation from the real values can make catastrophe in a safety-critical application, considering uncertainties in CPS behavior is essential. On the other side, time is a foundational aspect of Cyber-Physical Systems (CPS). Correct timing of system events is critical to optimize responsiveness to the environment, in terms of timeliness, accuracy, and precision in the knowledge, measurement, prediction, and control of CPS behavior. In order to design a more resilient and reliable CPS, first and foremost, there should be a way to specify the timing constraints that a constructed Cyber-Physical System must meet with considering existing uncertainties. Only then, we can seek systematic approaches to check if all timing constraints are being met, and develop correct-by-construction methodologies. In this regard, Timestamp Temporal Logic (TTL) is developed to specify the timing constraints on a distributed CPS. By TTL designers can specify the timing requirements that a CPS must satisfy in a succinct and intuitive manner and express the tolerable error as a part of the language. The proposed deduction system on TTL (TTL reasoning system) gives the ability to check the consistency among expresses system specifications and simplify them to be implemented on FPGA for run-time verification. Regarding CPS run-time verification, Timestamp-based Monitoring Approach(TMA) has been designed that can hook up to a CPS and take its timing specifications in TTL and verify if the timing constraints are being met with considering existing uncertainties in the system. TMA does not need to compute whether the constraint is being met at each and every instance of time but it re-evaluates constraint only when there is an event that can affect the outcome. This enables it to perform online timing monitoring of CPS for less computation and resources. Furthermore, the minimum design parameters of the timing CPS that are required to enable testing the timing of CPS are defined in this dissertation

ContributorsMehrabian, Mohammadreza (Author) / Shrivastava, Aviral (Thesis advisor) / Ren, Fengbo (Committee member) / Sarjoughian, Hessam (Committee member) / Derler, Patricia (Committee member) / Arizona State University (Publisher)

Created2021

Vision-inspired Representation and Learning for Data-driven Signal Processing

Description

In the era of data explosion, massive data is generated from various sources at an unprecedented speed. The ever-growing amount of data reveals enormous opportunities for developing novel data-driven solutions to unsolved problems. In recent years, benefiting from numerous public datasets and advances in deep learning, data-driven approaches in the…

In the era of data explosion, massive data is generated from various sources at an unprecedented speed. The ever-growing amount of data reveals enormous opportunities for developing novel data-driven solutions to unsolved problems. In recent years, benefiting from numerous public datasets and advances in deep learning, data-driven approaches in the computer vision domain have demonstrated superior performance with high adaptability on various data and tasks. Meanwhile, signal processing has long been dominated by techniques derived from rigorous mathematical models built upon prior knowledge of signals. Due to the lack of adaptability to real data and applications, model-based methods often suffer from performance degradation and engineering difficulties. In this dissertation, multiple signal processing problems are studied from vision-inspired data representation and learning perspectives to address the major limitation on adaptability. Corresponding data-driven solutions are proposed to achieve significantly improved performance over conventional solutions. Specifically, in the compressive sensing domain, an open-source image compressive sensing toolbox and benchmark to standardize the implementation and evaluation of reconstruction methods are first proposed. Then a plug-and-play compression ratio adapter is proposed to enable the adaptability of end-to-end data-driven reconstruction methods to variable compression ratios. Lastly, the problem of transfer learning from images to bioelectric signals is experimentally studied to demonstrate the improved performance of data-driven reconstruction. In the image subsampling domain, task-adaptive data-driven image subsampling is studied to reduce data redundancy and retain information of interest simultaneously. In the semiconductor analysis domain, the data-driven automatic error detection problem is studied in the context of integrated circuit segmentation for the first time. In the light detection and ranging(LiDAR) camera calibration domain, the calibration accuracy degradation problem in low-resolution LiDAR scenarios is addressed with data-driven techniques.

ContributorsZhang, Zhikang (Author) / Ren, Fengbo (Thesis advisor) / Li, Baoxin (Committee member) / Turaga, Pavan (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)

Created2023

Neuron-based Digital and Mixed-signal Circuit Design: From ASIC to SIMD Processors

Description

Among the many challenges facing circuit designers in deep sub-micron technologies, power, performance, area (PPA) and process variations are perhaps the most critical. Since existing strategies for reducing power and boosting the performance of the circuit designs have already matured to saturation, it is necessary to explore alternate unconventional strategies.…

Among the many challenges facing circuit designers in deep sub-micron technologies, power, performance, area (PPA) and process variations are perhaps the most critical. Since existing strategies for reducing power and boosting the performance of the circuit designs have already matured to saturation, it is necessary to explore alternate unconventional strategies. This investigation focuses on using perceptrons to enhance PPA in digital circuits and starts by constructing the perceptron using a combination of complementary metal-oxide-semiconductor (CMOS) and flash technology. The use of flash enables the perceptron to have a variable delay and functionality, making them robust to process, voltage, and temperature variations. By replacing parts of an application-specific integrated circuit (ASIC) with these perceptrons, improvements of up to 30% in the area and 20% in power can be achieved without affecting performance. Furthermore, the ability to vary the delay of a perceptron enables circuit designers to fix setup and hold-time violations post-fabrication, while reprogramming the functionality enables the obfuscation of the circuits. The study extends to field-programmable gate arrays (FPGAs), showing that traditional FPGA architectures can also achieve improved PPA by replacing some Look-Up-Tables (LUTs) with perceptrons. Considering that replacing parts of traditional digital circuits provides significant improvements in PPA, a natural extension was to see whether circuits built dedicatedly using perceptrons as its compute unit would lead to improvements in energy efficiency. This was demonstrated by developing perceptron-based compute elements and constructing an architecture using these elements for Quantized Neural Network acceleration. The resulting circuit delivered up to 50 times more energy efficiency compared to a CMOS-based accelerator without using standard low-power techniques such as voltage scaling and approximate computing.

ContributorsWagle, Ankit (Author) / Vrudhula, Sarma (Thesis advisor) / Khatri, Sunil (Committee member) / Shrivastava, Aviral (Committee member) / Seo, Jae-Sun (Committee member) / Ren, Fengbo (Committee member) / Arizona State University (Publisher)

Created2023

Optimizing Memory and Storage Disaggregation for Data-intensive Systems

Description

Data-intensive systems such as big data and large machine learning (ML) systems experience serious scalability challenges due to the ever-increasing data demand from ML and analytics applications and the resource fragmentation caused by conventional monolithic server architecture. Memory and storage disaggregation emerges as a pivotal technology to address these challenges…

Data-intensive systems such as big data and large machine learning (ML) systems experience serious scalability challenges due to the ever-increasing data demand from ML and analytics applications and the resource fragmentation caused by conventional monolithic server architecture. Memory and storage disaggregation emerges as a pivotal technology to address these challenges by decoupling memory and storage resources from individual servers and managing and provisioning them to applications as a shared resource pool. This dissertation investigates several important aspects of memory and storage disaggregation and proposes novel solutions to support data-intensive applications.First, caching is a fundamental way to utilize disaggregated storage, but building a large disaggregated cache is challenging because the commonly-used fix-sized cache block allocation scheme is unable to provide good cache performance with low memory overhead for diverse cloud workloads with vastly different I/O patterns. The dissertation proposes a novel adaptive cache block allocation approach that dynamically adjusts cache block sizes based on changing I/O patterns. This approach significantly improves I/O performance while reducing memory usage, outperforming traditional fixed-size cache systems in diverse cloud workloads. Evaluation shows that it improves read latency by 20% and write latency by 9%. It also reduces the amount of I/O traffic to cloud block storage by up to 74% while achieving up to 41% memory savings with only 2 ms. Second, large ML applications such as large language model (LLM) inference are memory demanding, but to support them using disaggregated memory brings challenges to memory management since disaggregated memory has higher memory access latency compared to local memory. The dissertation proposes latency-aware memory aggregation which cautiously distributes memory accesses to minimize the latency gap between local and disaggregated memory. It also proposes NUMA-aligned tensor parallelism to further improve the computing efficiency. With these optimizations, LLM inference achieves substantial speedups. For example, first token latency improves by 61%, and end-to-end latency improves by 43% for a LLM inference task which uses a model of 66 billion parameters when the batch size is 8. Finally, to address the cost, power consumption, and volatility of DRAM, the dissertation proposes to incorporate flash memory into memory pools within the disaggregation framework. By establishing a tiered memory architecture which combines fast-tier local DRAM with slow-tier DRAM and flash memory in the memory pool and effectively migrates data based on hotness across memory tiers, this approach not only reduces expenses but also maintains the overall performance and scalability of data-intensive systems. For example, with 50% saving in memory cost, the performance degradation of training ResNet50 on ImageNet dataset is only 2.68%. Together, these contributions systematically optimize the use of memory and storage disaggregation to deliver more efficient, scalable, and cost-effective systems for supporting the data explosion in today’s and future computing systems.

ContributorsYang, Qirui (Author) / Zhao, Ming (Thesis advisor) / Shrivastava, Aviral (Committee member) / Ren, Fengbo (Committee member) / Zou, Jia (Committee member) / Arizona State University (Publisher)

Created2024

Dynamic analysis of multithreaded embedded software to expose atomicity violations

Description

Concurrency bugs are one of the most notorious software bugs and are very difficult to manifest. Significant work has been done on detection of atomicity violations bugs for high performance systems but there is not much work related to detect these bugs for embedded systems. Although criteria to claim existence…

Concurrency bugs are one of the most notorious software bugs and are very difficult to manifest. Significant work has been done on detection of atomicity violations bugs for high performance systems but there is not much work related to detect these bugs for embedded systems. Although criteria to claim existence of bugs remains same, approach changes a bit for embedded systems. The main focus of this research is to develop a systemic methodology to address the issue from embedded systems perspective. A framework is developed which predicts the access interleaving patterns that may violate atomicity using memory references of shared variables and provides support to force and analyze these schedules for any output change, system fault or change in execution path.

ContributorsPatel, Jay (Author) / Lee, Yann-Hang (Thesis advisor) / Ren, Fengbo (Committee member) / Srivastava, Aviral (Committee member) / Arizona State University (Publisher)

Created2016

Hardware Acceleration of Video analytics on FPGA using OpenCL

Description

With the exponential growth in video content over the period of the last few years, analysis of videos is becoming more crucial for many applications such as self-driving cars, healthcare, and traffic management. Most of these video analysis application uses deep learning algorithms such as convolution neural networks (CNN) because…

With the exponential growth in video content over the period of the last few years, analysis of videos is becoming more crucial for many applications such as self-driving cars, healthcare, and traffic management. Most of these video analysis application uses deep learning algorithms such as convolution neural networks (CNN) because of their high accuracy in object detection. Thus enhancing the performance of CNN models become crucial for video analysis. CNN models are computationally-expensive operations and often require high-end graphics processing units (GPUs) for acceleration. However, for real-time applications in an energy-thermal constrained environment such as traffic management, GPUs are less preferred because of their high power consumption, limited energy efficiency. They are challenging to fit in a small place.

To enable real-time video analytics in emerging large scale Internet of things (IoT) applications, the computation must happen at the network edge (near the cameras) in a distributed fashion. Thus, edge computing must be adopted. Recent studies have shown that field-programmable gate arrays (FPGAs) are highly suitable for edge computing due to their architecture adaptiveness, high computational throughput for streaming processing, and high energy efficiency.

This thesis presents a generic OpenCL-defined CNN accelerator architecture optimized for FPGA-based real-time video analytics on edge. The proposed CNN OpenCL kernel adopts a highly pipelined and parallelized 1-D systolic array architecture, which explores both spatial and temporal parallelism for energy efficiency CNN acceleration on FPGAs. The large fan-in and fan-out of computational units to the memory interface are identified as the limiting factor in existing designs that causes scalability issues, and solutions are proposed to resolve the issue with compiler automation. The proposed CNN kernel is highly scalable and parameterized by three architecture parameters, namely pe_num, reuse_fac, and vec_fac, which can be adapted to achieve 100% utilization of the coarse-grained computation resources (e.g., DSP blocks) for a given FPGA. The proposed CNN kernel is generic and can be used to accelerate a wide range of CNN models without recompiling the FPGA kernel hardware. The performance of Alexnet, Resnet-50, Retinanet, and Light-weight Retinanet has been measured by the proposed CNN kernel on Intel Arria 10 GX1150 FPGA. The measurement result shows that the proposed CNN kernel, when mapped with 100% utilization of computation resources, can achieve a latency of 11ms, 84ms, 1614.9ms, and 990.34ms for Alexnet, Resnet-50, Retinanet, and Light-weight Retinanet respectively when the input feature maps and weights are represented using 32-bit floating-point data type.

ContributorsDua, Akshay (Author) / Ren, Fengbo (Thesis advisor) / Ogras, Umit Y. (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created2019

FPGA Acceleration of CNNs Using OpenCL

Description

Convolutional Neural Network (CNN) has achieved state-of-the-art performance in numerous applications like computer vision, natural language processing, robotics etc. The advancement of High-Performance Computing systems equipped with dedicated hardware accelerators has also paved the way towards the success of compute intensive CNNs. Graphics Processing Units (GPUs), with massive processing capability,…

Convolutional Neural Network (CNN) has achieved state-of-the-art performance in numerous applications like computer vision, natural language processing, robotics etc. The advancement of High-Performance Computing systems equipped with dedicated hardware accelerators has also paved the way towards the success of compute intensive CNNs. Graphics Processing Units (GPUs), with massive processing capability, have been of general interest for the acceleration of CNNs. Recently, Field Programmable Gate Arrays (FPGAs) have been promising in CNN acceleration since they offer high performance while also being re-configurable to support the evolution of CNNs. This work focuses on a design methodology to accelerate CNNs on FPGA with low inference latency and high-throughput which are crucial for scenarios like self-driving cars, video surveillance etc. It also includes optimizations which reduce the resource utilization by a large margin with a small degradation in performance thus making the design suitable for low-end FPGA devices as well.

FPGA accelerators often suffer due to the limited main memory bandwidth. Also, highly parallel designs with large resource utilization often end up achieving low operating frequency due to poor routing. This work employs data fetch and buffer mechanisms, designed specifically for the memory access pattern of CNNs, that overlap computation with memory access. This work proposes a novel arrangement of the systolic processing element array to achieve high frequency and consume less resources than the existing works. Also, support has been extended to more complicated CNNs to do video processing. On Intel Arria 10 GX1150, the design operates at a frequency as high as 258MHz and performs single inference of VGG-16 and C3D in 23.5ms and 45.6ms respectively. For VGG-16 and C3D the design offers a throughput of 66.1 and 23.98 inferences/s respectively. This design can outperform other FPGA 2D CNN accelerators by up to 9.7 times and 3D CNN accelerators by up to 2.7 times.

ContributorsRavi, Pravin Kumar (Author) / Zhao, Ming (Thesis advisor) / Li, Baoxin (Committee member) / Ren, Fengbo (Committee member) / Arizona State University (Publisher)

Created2020