Study of Knowledge Transfer Techniques For Deep Learning on Edge Devices

Description

With the emergence of the edge computing paradigm, many applications such as image recognition and augmented reality need to perform machine learning (ML) and artificial intelligence (AI) tasks on edge devices. Most AI and ML models are large and computationally heavy, whereas edge devices are usually equipped with limited computational and storage resources. Such models can be compressed and reduced in order to be placed on edge devices, but they may lose capability and may not generalize or perform as well as large models. Recent work has used knowledge transfer techniques to transfer information from a large network (termed the teacher) to a small one (termed the student) in order to improve the performance of the latter. This approach seems promising for learning on edge devices, but a thorough investigation of its effectiveness is lacking.

The purpose of this work is to provide an extensive study of the performance of knowledge transfer, in terms of both accuracy and convergence speed, across different student-teacher architectures, datasets, and techniques for transferring knowledge from teacher to student.

A good performance improvement is obtained by transferring knowledge from both the intermediate layers and the last layer of the teacher to a shallower student. However, other architectures and transfer techniques do not fare as well, and some of them even have a negative impact on performance. For example, a smaller and shorter network trained with knowledge transfer on Caltech 101 achieved a significant accuracy improvement of 7.36% and converged 16 times faster than the same network trained without knowledge transfer. On the other hand, a smaller network that is thinner than the teacher network performed worse, with an accuracy drop of 9.48% on Caltech 101, even with knowledge transfer.
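As a rough illustration of transferring knowledge from both the last layer and an intermediate layer, the sketch below combines a soft-target term with a feature-matching ("hint") term. The abstract does not state which loss functions the thesis used, so the temperature-softened KL divergence, the mean-squared-error hint term, and the names `soft_target_loss`, `hint_loss`, `transfer_loss`, `temperature`, and `hint_weight` are all assumptions made for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(student_logits, teacher_logits, temperature=4.0):
    """Last-layer transfer (assumed form): KL(teacher || student) on softened outputs."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hint_loss(student_features, teacher_features):
    """Intermediate-layer transfer (assumed form): mean squared error between features."""
    if len(student_features) != len(teacher_features):
        raise ValueError("feature vectors must have the same length")
    n = len(student_features)
    return sum((s - t) ** 2 for s, t in zip(student_features, teacher_features)) / n


def transfer_loss(student_logits, teacher_logits,
                  student_features, teacher_features,
                  temperature=4.0, hint_weight=0.5):
    """Weighted sum of both terms; hint_weight is a hypothetical hyperparameter."""
    return (soft_target_loss(student_logits, teacher_logits, temperature)
            + hint_weight * hint_loss(student_features, teacher_features))
```

In a real training loop this term would normally be added to the ordinary classification loss on the ground-truth labels, and the student's intermediate features would first be projected to the teacher's feature size if the two differ.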

Contributors: Sistla, Ragini (Author) / Zhao, Ming (Thesis advisor, Committee member) / Li, Baoxin (Committee member) / Tong, Hanghang (Committee member) / Arizona State University (Publisher)

Created: 2018

FPGA Acceleration of CNNs Using OpenCL

Description

Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance in numerous applications such as computer vision, natural language processing, and robotics. The advancement of High-Performance Computing systems equipped with dedicated hardware accelerators has also paved the way for the success of compute-intensive CNNs. Graphics Processing Units (GPUs), with their massive processing capability, have been of general interest for the acceleration of CNNs. Recently, Field Programmable Gate Arrays (FPGAs) have shown promise in CNN acceleration, since they offer high performance while also being reconfigurable to support the evolution of CNNs. This work focuses on a design methodology for accelerating CNNs on FPGAs with low inference latency and high throughput, which are crucial for scenarios such as self-driving cars and video surveillance. It also includes optimizations that reduce resource utilization by a large margin with a small degradation in performance, making the design suitable for low-end FPGA devices as well.

FPGA accelerators often suffer from limited main memory bandwidth. In addition, highly parallel designs with large resource utilization often end up with a low operating frequency due to poor routing. This work employs data fetch and buffer mechanisms, designed specifically for the memory access pattern of CNNs, that overlap computation with memory access. It proposes a novel arrangement of the systolic processing element array that achieves a high frequency and consumes fewer resources than existing works. Support has also been extended to more complex CNNs for video processing. On an Intel Arria 10 GX1150, the design operates at a frequency as high as 258 MHz and performs a single inference of VGG-16 and C3D in 23.5 ms and 45.6 ms, respectively. For VGG-16 and C3D, the design offers a throughput of 66.1 and 23.98 inferences/s, respectively. This design can outperform other FPGA 2D CNN accelerators by up to 9.7 times and 3D CNN accelerators by up to 2.7 times.
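The reported latency and throughput figures can be compared with a little arithmetic. The sketch below computes the rate that strictly back-to-back single inferences would give and the ratio of the reported throughput to that rate. The helper names `single_stream_rate` and `overlap_factor` are hypothetical, and the abstract does not say how throughput was measured, so a ratio above 1 only suggests (and does not confirm) that several inferences are batched or overlapped.

```python
def single_stream_rate(latency_ms):
    """Inferences per second if inferences ran strictly one after another."""
    return 1000.0 / latency_ms


def overlap_factor(throughput_per_s, latency_ms):
    """Ratio of reported throughput to the back-to-back single-inference rate."""
    return throughput_per_s / single_stream_rate(latency_ms)


# (single-inference latency in ms, throughput in inferences/s) from the abstract
reported = {"VGG-16": (23.5, 66.1), "C3D": (45.6, 23.98)}

for name, (latency_ms, throughput) in reported.items():
    rate = single_stream_rate(latency_ms)
    factor = overlap_factor(throughput, latency_ms)
    print(f"{name}: {rate:.2f} inf/s back-to-back, reported {throughput} inf/s, "
          f"ratio {factor:.2f}")
```

For VGG-16 the ratio is roughly 1.55 and for C3D roughly 1.09, so under this reading the throughput figures were likely obtained in a different mode from the single-inference latencies.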

Contributors: Ravi, Pravin Kumar (Author) / Zhao, Ming (Thesis advisor) / Li, Baoxin (Committee member) / Ren, Fengbo (Committee member) / Arizona State University (Publisher)

Created: 2020

FPGA-Based Edge-Computing Acceleration

Description

The rapid growth of Internet-of-Things (IoT) and artificial intelligence applications has called forth a new computing paradigm: edge computing. Edge computing applications, such as video surveillance, autonomous driving, and augmented reality, are highly computationally intensive and require real-time processing. Current edge systems are typically based on commodity general-purpose hardware such as Central Processing Units (CPUs) and Graphics Processing Units (GPUs), which are mainly designed for large, non-time-sensitive jobs in the cloud and do not match the needs of edge workloads. These systems are also usually power hungry and unsuitable for resource-constrained edge deployments. Such application-hardware mismatch calls for a new computing backbone to support the high-bandwidth, low-latency, and energy-efficient requirements. The new system should also be able to support a variety of edge applications with different characteristics. This thesis addresses the above challenges by studying the use of Field Programmable Gate Array (FPGA)-based computing systems for accelerating edge workloads, from three critical angles. First, it investigates the feasibility of FPGAs for edge computing, in comparison to conventional CPUs and GPUs. Second, it studies the acceleration of common algorithmic characteristics, identified as loop patterns, using FPGAs, and develops a benchmark tool for analyzing the performance of these patterns on different accelerators. Third, it designs a new edge computing platform using multiple clustered FPGAs to provide high-bandwidth and low-latency acceleration of convolutional neural networks (CNNs) widely used in edge applications. Finally, it studies the acceleration of an emerging class of neural networks, randomly-wired neural networks, on the multi-FPGA platform.

The experimental results from this work show that the new generation of workloads requires rethinking the current edge-computing architecture. First, through the acceleration of common loops, it demonstrates that FPGAs can outperform GPUs by up to 14 times on specific loop types. Second, it shows the linear scalability of multi-FPGA platforms in accelerating neural networks. Third, it demonstrates the superiority of the new scheduler in optimally placing randomly-wired neural networks on multi-FPGA platforms, with 81.1 times better throughput than the available scheduling mechanisms.

Contributors: Biookaghazadeh, Saman (Author) / Zhao, Ming (Thesis advisor) / Ren, Fengbo (Thesis advisor) / Li, Baoxin (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created: 2021

The Necessity of Error Correction In The Quantum World

Description

Quantum computers promise a future in which computationally difficult problems can be solved exponentially faster than on the classical computers in use today. While there is tremendous research and development in the creation of quantum computers, a fundamental challenge exists in the quantum world. Because of the fragility of quantum states, error correction methods have been developed since 1995 to tackle this problem. Since the birth of the idea that these powerful computers can process numbers beyond the limits of current computers, several mathematical error-correcting codes have emerged that could potentially provide the required stability in the fragile, error-prone quantum world. While there has been a multitude of possible solutions, no single error-correcting code is the key to solving the problem. Almost every solution presented has carried with it a limiting factor or an issue that prevents it from becoming the breakthrough that is desperately needed.

This paper gives an introduction to the quantum world and explains why error-correcting topologies are needed. It then introduces one recent topology that could be added to the list of possible solutions to this central problem. Rather than focusing on the mathematical frameworks, the paper introduces the main concepts so that most readers, even those outside the field of computer science, can understand what the main problem is and how this topology attempts to solve it.

Contributors: Ahmed, Umer (Author) / Colbourn, Charles (Thesis director) / Zhao, Ming (Committee member) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created: 2020-05

A Study on Resources Utilization of Deep Learning Workloads

Description

Deep learning and AI have attracted tremendous attention in the last decade. The substantial accuracy improvement achieved by neural networks in common tasks such as image classification and speech recognition has made deep learning a replacement for many conventional machine learning techniques. Training deep neural networks requires a lot of data, and therefore vast amounts of computing resources to process the data and train the model. The most obvious solution to this problem is to reduce the time it takes to train deep neural networks.
AI and deep learning workloads are different from conventional cloud and mobile workloads with respect to (1) computational intensity, (2) I/O characteristics, and (3) communication pattern. While there is a considerable amount of research activity on the theoretical aspects of AI and deep learning algorithms that run with greater efficiency, there are only a few studies on the infrastructural impact of deep learning workloads on computing and storage resources in distributed systems.
It is typical to use a heterogeneous mixture of CPU and GPU devices to train a neural network. Google Brain has developed a reinforcement learning model that can place training operations across a heterogeneous cluster, though it has only been tested with local devices in a single cluster. This study explores the method's capabilities and attempts to apply it to a cluster with nodes spread across a network.

Contributors: Nguyen, Andrew Hoang (Author) / Zhao, Ming (Thesis director) / Biookaghazadeh, Saman (Committee member) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created: 2019-05
