Description
Artificial Intelligence (AI) and Machine Learning (ML) techniques have come a long way since their inception and have been used to build intelligent systems for a wide range of applications in everyday life. However, they are very computation-intensive and require the transfer of large volumes of data from memory to the computation units. This memory access time constitutes a significant part of the computational latency and is a performance bottleneck. To address this limitation and the ever-growing demand for implementation in hand-held and edge devices, in-memory computing (IMC) based AI/ML hardware accelerators have emerged. First, the dissertation presents an IMC static random access memory (SRAM) based hardware modeling and optimization framework. A unified systematic study closely models the IMC hardware and investigates how a number of design variables and non-idealities (e.g., device mismatch and ADC quantization) affect the Deep Neural Network (DNN) accuracy of the IMC design. The framework allows co-optimized selection of different design variables, accounting for sources of noise in the IMC hardware, and robust implementation of a high-accuracy DNN. Next, it presents a kNN hardware accelerator in 65nm Complementary Metal-Oxide-Semiconductor (CMOS) technology. The accelerator combines an IMC SRAM developed for binarized deep neural networks with digital hardware that performs top-k sorting. The simulated k-Nearest Neighbor accelerator processes up to 17.9 million query vectors per second while consuming 11.8 mW, demonstrating a >4.8× energy-efficiency improvement over prior works. This dissertation also presents a novel floating-point precision IMC (FP-IMC) macro with a hybrid architecture that configurably supports two Floating Point (FP) precisions. Implementing FP-precision multiply-and-accumulate (MAC) has been a challenge owing to its complexity. The design is implemented in 28nm CMOS and taped out, with the chip demonstrating 12.1 TFLOPS/W and 66.1 TFLOPS/W for 8-bit Floating Point (FP8) and Block Floating Point (BF8), respectively. Finally, another iteration of the FP design is presented that is modeled to support multiple precision modes from FP8 up to FP32. Two approaches to the architectural design are compared, illustrating the trade-off between throughput and area overhead. The simulated design shows a 2.1× improvement in normalized energy efficiency compared to the on-chip implementation of the FP-IMC.
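As a rough illustration of the kind of non-ideality modeling such a framework performs, the Python sketch below injects Gaussian device mismatch and uniform ADC quantization into a single binary IMC SRAM column MAC. The mismatch magnitude, ADC resolution, and full-scale range are illustrative assumptions, not values from the dissertation.

```python
import numpy as np

def imc_mac_with_nonidealities(x, w, adc_bits=5, mismatch_sigma=0.02,
                               rng=np.random.default_rng(0)):
    """Model one IMC SRAM column MAC with device mismatch and ADC quantization.

    x: binary input activations (0/1); w: binarized weights (+1/-1).
    All parameter values here are illustrative assumptions.
    """
    ideal = np.dot(x, w)  # ideal digital MAC result for comparison
    # Device mismatch: each cell's analog contribution varies slightly.
    per_cell = x * w * (1.0 + rng.normal(0.0, mismatch_sigma, size=w.shape))
    analog = per_cell.sum()
    # ADC quantization: clip and quantize the analog bit-line sum.
    full_scale = len(w)            # worst-case |sum| bound
    levels = 2 ** adc_bits
    step = 2 * full_scale / (levels - 1)
    quantized = np.clip(np.round(analog / step) * step, -full_scale, full_scale)
    return ideal, quantized

x = np.random.default_rng(1).integers(0, 2, size=256)
w = np.where(np.random.default_rng(2).integers(0, 2, size=256) == 0, -1, 1)
ideal, noisy = imc_mac_with_nonidealities(x, w)
print(f"ideal MAC: {ideal}, after mismatch + 5-bit ADC: {noisy:.1f}")
```

Sweeping adc_bits and mismatch_sigma in a loop over a full DNN layer, as opposed to a single column, is how one would estimate the accuracy impact of each design variable before committing to silicon.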
Contributors: Saikia, Jyotishman (Author) / Seo, Jae-Sun (Thesis advisor) / Chakrabarti, Chaitali (Thesis advisor) / Fan, Deliang (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)
Created: 2023
Description
This thesis addresses the problems of (a) scheduling multiple streaming jobs with soft deadline constraints to minimize the risk/energy consumption in heterogeneous Systems-on-Chip (SoCs), and (b) training a neural network model with high accuracy and low training time using split federated learning (SFL) with heterogeneous clients. Designing a scheduler for heterogeneous SoCs built with different types of processing elements (PEs) is quite challenging, especially when it has to balance the conflicting requirements of low energy consumption, low risk, and high throughput for jobs that stream in randomly at run time. Two probabilistic deadline-aware schedulers are designed for heterogeneous SoCs for such jobs with soft deadline constraints, with the goals of optimizing job-level risk and energy efficiency. The key idea of the probabilistic scheduler is to calculate the task-to-PE allocation probabilities when a job arrives in the system. This allocation probability, generated by a manually designed or neural network (NN) based allocation function, is used to compute the intra-job and inter-job contentions and derive the task-level slack. Tasks are then allocated to the PEs that can complete them within the task-level slack with minimum risk or minimum energy consumption.

SFL is an edge-friendly decentralized NN training scheme in which the model is split and only a small client-side model is trained on the clients. The communication overhead in SFL is significant, since the intermediate activations and gradients of every sample are transmitted in every epoch. Two communication-reduction methods are proposed: loss-aware selective updating to reduce the number of training epochs, and a bottleneck layer (BL) to reduce the feature size. Next, the SFL system is trained with heterogeneous clients that have different data rates and operate on non-IID data. The communication time of clients in the low-end group with slow data rates dominates the training time. To reduce the training time without significantly sacrificing accuracy, HeteroSFL is built with HetBL and bidirectional knowledge sharing (BDKS). HetBL compresses data by different factors in the low- and high-end groups using narrow and wide bottleneck layers, respectively. BDKS is proposed to mitigate the label distribution skew across different groups, and it can also be applied in Federated Learning to address label distribution skew.
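To make the bottleneck-layer idea concrete, the following sketch shows a 1×1-convolution bottleneck at the split point of SFL, with a narrow (high-compression) variant for the low-end client group and a wide (low-compression) variant for the high-end group, in the spirit of HetBL. The channel counts and compression factors are hypothetical assumptions; the thesis' actual layer shapes may differ.

```python
import torch
import torch.nn as nn

class BottleneckLayer(nn.Module):
    """1x1-conv bottleneck at the split point of split federated learning.

    The encoder compresses client-side activations before transmission to the
    server; the decoder expands them back on the server side. Compression
    factors per client group are illustrative, not from the thesis.
    """
    def __init__(self, channels, compression_factor):
        super().__init__()
        bottleneck = max(1, channels // compression_factor)
        self.encode = nn.Conv2d(channels, bottleneck, kernel_size=1)  # client side
        self.decode = nn.Conv2d(bottleneck, channels, kernel_size=1)  # server side

    def forward(self, activations):
        compressed = self.encode(activations)  # this tensor crosses the network
        return self.decode(compressed)

# Low-end clients compress more aggressively than high-end clients.
narrow = BottleneckLayer(channels=64, compression_factor=16)  # low-end group
wide = BottleneckLayer(channels=64, compression_factor=4)     # high-end group
x = torch.randn(8, 64, 16, 16)
print(narrow(x).shape, wide(x).shape)
```

The narrow variant transmits 4 channels instead of 64 per activation map, which is where the communication-time savings for slow clients would come from.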
Contributors: Chen, Xing (Author) / Chakrabarti, Chaitali (Thesis advisor, Committee member) / Ogras, Umit (Committee member) / Fan, Deliang (Committee member) / Zhang, Jeff (Committee member) / Arizona State University (Publisher)
Created: 2023