Design Space Exploration for SODA (Software-Defined Accelerator)

Description

SODA (Software-Defined Accelerator) is a framework that simplifies the generation of hardware accelerators from high-level languages such as Python, using the Multi-Level Intermediate Representation (MLIR) compiler framework. Neural network models built with Python libraries such as TensorFlow or PyTorch can be synthesized into chip designs through the SODA framework. This thesis explores loop and memory optimization techniques using the SODA framework to enhance computational efficiency in neural network hardware design. Loop transformations such as permutation, tiling, and unrolling are applied to optimize performance, while memory access is improved through Temporary Buffer Allocation and Alloca Buffer Promotion. To handle larger neural networks within hardware limits, dimensional folding is used, involving computation tiling and estimating the required number of tiles and clock cycles. A Design Space Exploration (DSE) framework was developed to explore the above-mentioned optimizations on neural network layers such as 2D Convolution, 2D Depth-wise Convolution, and Fully Connected layers. By analyzing performance metrics, heuristics are derived that narrow the exploration space, reducing the time taken to reach optimal configurations. Results demonstrate that hardware design for large neural networks can be significantly improved through advanced loop transformations within the SODA framework. This study enables the deployment of complex neural networks in hardware-constrained environments, contributing to more efficient hardware synthesis processes and providing valuable insights into affine-level loop optimization strategies.
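
To make the dimensional-folding idea concrete, the following minimal Python sketch estimates how many tiles and clock cycles a folded 2D convolution output would need. The tile sizes and the fixed cycles-per-tile figure are illustrative assumptions, not values or APIs from the SODA framework.

    import math

    def fold_conv2d(out_h, out_w, out_ch, tile_h, tile_w, tile_ch, cycles_per_tile):
        """Estimate tile count and clock cycles for a dimensionally folded 2D convolution.

        The output volume is split into tile_h x tile_w x tile_ch tiles, and each
        tile is assumed to take a fixed number of cycles on the accelerator.
        """
        tiles = (math.ceil(out_h / tile_h)
                 * math.ceil(out_w / tile_w)
                 * math.ceil(out_ch / tile_ch))
        return tiles, tiles * cycles_per_tile

    # Example: a 56x56x64 convolution output folded into 14x14x16 tiles.
    tiles, cycles = fold_conv2d(56, 56, 64, 14, 14, 16, cycles_per_tile=4096)
    print(tiles, cycles)  # 64 tiles, 262144 cycles under this simple model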

Details

Contributors
Date Created
2024
Resource Type
Language
  • eng
Note
  • Partial requirement for: M.S., Arizona State University, 2024
  • Field of study: Electrical Engineering

Additional Information

English
Extent
  • 117 pages
Open Access
Peer-reviewed

Efficient Implementation of Interpolation on Domain Adaptive Processor

Description

Interpolation is a crucial technique in signal processing that goes well beyond its conventional use in audio and image processing. Linear interpolation is a simple method of estimating a value between two known points by assuming a straight-line relationship between them. Its applications span various domains, including channel estimation, Doppler shift estimation, synthetic aperture radar (SAR), pulse compression, adaptive filtering, and radar and sonar signal processing. The Domain Adaptive Processor (DAP) is a systolic array processor developed by researchers at the University of Michigan to meet the needs of signal processing workloads. It consists of 64 Processing Elements (PEs) connected in a 2D array. Each PE is subdivided into smaller units (sub-PEs), which can load, store, and carry out mathematical operations for both real and complex numbers, making DAP globally homogeneous and locally heterogeneous. The hardware code for the DAP implementation was configured using a combination of handwritten Comma Separated Values (.csv) files and an auto-code generation tool. Researchers at ASU developed this tool to manage instruction redundancies and implement instruction-level parallelism for generating Very Long Instruction Word (VLIW) code for the DAP. The thesis presents two implementations of linear interpolation, a single-PE version and a multi-PE version. Each was evaluated and compared based on its latency.
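
As a point of reference for the algorithm itself, linear interpolation between known points (x0, y0) and (x1, y1) evaluates y0 + (y1 - y0)(x - x0)/(x1 - x0). The short Python sketch below shows this scalar computation; it is only a functional reference, not the single-PE or multi-PE VLIW mapping described in the thesis.

    def lerp(x0, y0, x1, y1, x):
        """Estimate the value at x assuming a straight line through (x0, y0) and (x1, y1)."""
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

    # Example: the midpoint between samples (0, 2.0) and (1, 4.0) is 3.0.
    print(lerp(0.0, 2.0, 1.0, 4.0, 0.5))
    # The same formula also works for complex-valued samples, as handled by the DAP sub-PEs.
    print(lerp(0.0, 1 + 1j, 1.0, 3 + 5j, 0.25))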

Details

Contributors
Date Created
2024
Resource Type
Language
  • eng
Note
  • Partial requirement for: M.S., Arizona State University, 2024
  • Field of study: Computer Engineering

Additional Information

English
Extent
  • 66 pages
Open Access
Peer-reviewed

Towards Developing Multi-Function Wireless Receiver Systems Using Heterogenous SoCs: Simulation, Implementation and Evaluation

Description

This dissertation describes the hardware implementation and analysis of a multi-function wireless receiver system to perform Joint Radar Communication on the Domain-focused Advanced Software-reconfigurable Heterogeneous (DASH) system-on-chip (SoC). As the problem of spectral congestion becomes more chronic and widespread, electromagnetic radio frequency (RF) based systems are emerging as a viable solution to this problem. RF spectral convergence enables better utilization of available spectrum by reducing wastage and enabling multi-function tasks. There is a need for high-performance processors that can operate at low power while being flexible and easy to program. Coarse-scale heterogeneous processors such as the DASH SoC have been developed to demonstrate the feasibility of such architectures. The DASH SoC addresses the domain requirements of spectral convergence by offering high throughput rates with minimal energy expenditure. It also offers a high degree of programmability, which enables adaptability to the domain of real-time joint communications and radar systems for applications such as spectral situational awareness, autonomous vehicular navigation systems, and positioning, navigation, and timing (PNT). The hardware implementation of a joint radar-communications receiver on the DASH SoC validates these claims. Task scheduling is optimized by using an imitation-learning-based runtime scheduler implemented on the DASH SoC platform. Modern applications such as 5G New Radio (5GNR) have varying frame processing requirements, which are also implemented to demonstrate compatibility with current applications. General Matrix Multiply (GEMM) is an integral kernel within the Joint Radar and Communications (JRC) receiver; an estimation model is developed to study the characteristics of the GEMM implementation on the domain adaptive processor (DAP) accelerator core within the DASH SoC.
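
As an illustration of what such an estimation model might look like, the Python sketch below gives a first-order cycle count for a GEMM kernel mapped onto a 2D array of processing elements. The 8x8 array size and one multiply-accumulate per PE per cycle are hypothetical assumptions that ignore data movement; this is not the actual DAP timing model developed in the dissertation.

    import math

    def gemm_cycles(m, n, k, pe_rows=8, pe_cols=8, macs_per_pe_per_cycle=1):
        """First-order compute-cycle estimate for an (m x k) by (k x n) GEMM on a 2D PE array.

        Assumes the m*n*k multiply-accumulates are spread evenly over the PE grid
        and ignores load/store and inter-PE communication overhead.
        """
        total_macs = m * n * k
        macs_per_cycle = pe_rows * pe_cols * macs_per_pe_per_cycle
        return math.ceil(total_macs / macs_per_cycle)

    # Example: a 64x64x64 GEMM on an 8x8 array needs 4096 cycles of pure compute.
    print(gemm_cycles(64, 64, 64))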

Details

Contributors
Date Created
2024
Resource Type
Language
  • eng
Note
  • Partial requirement for: Ph.D., Arizona State University, 2024
  • Field of study: Electrical Engineering

Additional Information

English
Extent
  • 141 pages
Open Access
Peer-reviewed

DSP-MLIR: A Compiler for Digital Signal Processing in MLIR

Description

Traditional Digital Signal Processing (DSP) compilers operate at a low level, such as C and assembly. Lowering from a high-level representation to low-level ones often results in a loss of high-level information that is critical for optimizations based on DSP-specific theorems and properties. Consequently, traditional DSP compilers miss many optimization opportunities available at a high level. The MLIR (Multi-level Intermediate Representation) framework, an emerging multi-level compiler infrastructure, supports specifying optimizations at a higher level. This thesis introduces a DSP Dialect within the MLIR framework to perform domain-specific optimizations based on DSP theorems and properties at a high level. Additionally, a domain-specific language (DSL) is proposed to facilitate the development of DSP applications. Experimental results demonstrate that domain-specific optimizations of the DSP dialect can improve the execution time of DSP applications by up to 10x compared to implementations at the C and affine levels.
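
As an example of the kind of rewrite that DSP-specific theorems enable at a high level (and that is lost once code is lowered), two cascaded FIR filters can be fused into one by convolving their coefficient vectors, since convolution is associative for LTI systems. The NumPy sketch below demonstrates the equivalence; it is a stand-alone illustration, not the DSP dialect's actual rewrite patterns.

    import numpy as np

    def cascade_fir(h1, h2):
        """Fuse two cascaded FIR filters into one by convolving their coefficients.

        For LTI systems (x * h1) * h2 == x * (h1 * h2), so the fusion can be done
        once ahead of time instead of filtering the signal twice.
        """
        return np.convolve(h1, h2)

    x = np.random.randn(256)
    h1 = np.array([0.25, 0.5, 0.25])
    h2 = np.array([1.0, -1.0])

    two_pass = np.convolve(np.convolve(x, h1), h2)  # run the two filters back to back
    one_pass = np.convolve(x, cascade_fir(h1, h2))  # run the single fused filter
    print(np.allclose(two_pass, one_pass))          # True: the outputs match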

Details

Contributors
Date Created
2024
Resource Type
Language
  • eng
Note
  • Partial requirement for: M.S., Arizona State University, 2024
  • Field of study: Computer Science

Additional Information

English
Extent
  • 34 pages
Open Access
Peer-reviewed

Cyclebite: Towards A Machine Understanding of Computation

Description

Modern and future computer architectures will feature hyper-efficient and low-latency processing elements dedicated to a specific purpose. In a process known as pre-silicon design, computer architects design a customized piece of hardware, called a domain-specific architecture (DSA), to support the high performance and efficiency of a target group of computer programs. Each program contains sections of code called "tasks" whose characteristics may be exploited for hyper-efficient and low-latency execution on specialized processors called processing elements (PEs). Architects design PEs toward those tasks, enabling DSAs to deliver order-of-magnitude performance and energy-efficiency advantages over general-purpose, scalar processors. After the pre-silicon design phase of a DSA comes the post-silicon design phase, where application designers map other and novel computer programs onto that DSA for high performance. To realize the performance and energy benefits of that DSA, each computer program must exploit the PEs of that DSA. Unfortunately, refactoring applications to exploit the PEs of a DSA is difficult. Application and architectural experts (whose expertise is in short supply) must manually structure each application toward their target DSA. Furthermore, open-source and legacy code, called code-from-the-wild (CFTW), often lacks the structure required to compile applications toward DSAs, requiring manual refactoring and creating a barrier between DSAs and the broader software engineering community. Finally, manual application structuring techniques do not scale to novel applications and architectures. This thesis argues that automating the process of structuring computer programs for transformation and optimization toward DSAs is essential to the widespread adoption and utilization of DSAs. Existing methods to extract structure from CFTW both capture unimportant code and miss code that is critical for acceleration because they rely on code frequency. Further, they rely on static information to find communication patterns between acceleration candidates. To address the challenge of localizing the acceleration candidates and finding their communication with each other, this thesis presents Cyclebite. Cyclebite extracts coarse-grained tasks (CGTs) from CFTW by detecting CGT candidates in the dynamic execution profile of the application. Then, Cyclebite localizes the instances of each CGT candidate, measuring its utilization and observing its communication patterns with other task candidates. Each CGT candidate whose utilization is adequate is designated as a task, and its communication with other tasks forms producer-consumer relationships. Cyclebite exports a producer-consumer task graph (or simply task graph) for the application: a directed acyclic graph where each node is a CGT and each directed edge is communication between CGTs, pointing from producer to consumer. I show that Cyclebite both finds important code that state-of-the-art (SoA) structuring techniques miss and rejects unimportant code that SoA structuring techniques erroneously include in the task graph. I propose a CGT analysis and export tool, called Cyclebite-Template, that extracts the parallel execution pattern of each CGT in the exported Cyclebite task graph and exports the task graph into a domain-specific language, which facilitates its transformation and optimization toward a target DSA.
Cyclebite-Template works on a low-level representation of each CGT, making its analysis agnostic to the syntactic definition of each task, while SoA task templates use higher-level representations that are not robust to differing syntactic task definitions. Further, Cyclebite-Template supports the transformation and optimization of non-polyhedral tasks by extracting each task's canonical expression (which implies its parallel execution pattern), while SoA task templates only support polyhedral tasks. I show that Cyclebite-Template accurately extracts the parallel pattern from Cyclebite tasks, and that Cyclebite task graphs exported with Cyclebite-Template achieve speedups in line with or exceeding those of SoA task templates.
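
A producer-consumer task graph of the kind Cyclebite exports can be pictured as a small directed acyclic graph whose nodes are coarse-grained tasks and whose edges point from producer to consumer. The Python sketch below is a minimal illustrative data structure with hypothetical task names; it is not Cyclebite's export format or its domain-specific language.

    from dataclasses import dataclass, field

    @dataclass
    class TaskGraph:
        """Directed acyclic graph of coarse-grained tasks (CGTs).

        Each edge (producer, consumer) records that one task produces data
        that another task consumes.
        """
        edges: set = field(default_factory=set)

        def add_edge(self, producer, consumer):
            self.edges.add((producer, consumer))

        def consumers(self, task):
            return [c for (p, c) in self.edges if p == task]

    # Hypothetical three-task pipeline: convolution -> activation -> pooling.
    g = TaskGraph()
    g.add_edge("conv2d", "relu")
    g.add_edge("relu", "maxpool")
    print(g.consumers("conv2d"))  # ['relu']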

Details

Contributors
Date Created
2024
Embargo Release Date
Resource Type
Language
  • eng
Note
  • Partial requirement for: Ph.D., Arizona State University, 2024
  • Field of study: Computer Engineering

Additional Information

English
Extent
  • 163 pages
Open Access
Peer-reviewed

Efficient Implementations of Matrix Inversion on 2D Array Architecture

Description

Matrix inversion is one of the most common operations in many signal-processing algorithms. It plays a vital role in Multiple-Input and Multiple-Output (MIMO) systems, beamforming, image recovery, phased-array radar and sonar, 3G wireless communication, and Worldwide Interoperability for Microwave Access (WiMAX). Performing the inversion of large matrices accurately is challenging due to its large computational cost. The thesis presents efficient implementations of matrix inversion using the Gauss-Jordan method and the Gram-Schmidt method, which consists of computing the QR decomposition of the matrix followed by backward elimination and matrix multiplications. The targeted matrix size is 4 × 4, with complex values for the Gauss-Jordan method and real values for the Gram-Schmidt method. Both methods can be easily scaled to handle larger matrices with the same data flow. The two matrix inversion methods are mapped onto a Domain Adaptive Processor (DAP) developed by researchers at the University of Michigan to meet the needs of signal processing workloads. It consists of 64 Processing Elements (PEs) connected in a 2D array. Each PE is further divided into sub-PEs capable of loading, storing, and performing mathematical computations for real and complex numbers. The hardware code for the DAP implementation was configured using a combination of handcrafted Comma Separated Values (.csv) files and an auto-code generation tool. This tool was developed by researchers at ASU to handle instruction redundancies and implement instruction-level parallelism to generate VLIW code for the DAP. Both implementations were evaluated and compared based on latency, throughput, and precision. The thesis also covers scaling requirements for a set of input ranges and different routing methods. These are necessary for both algorithms to achieve high throughput with minimal instruction count.
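
For readers unfamiliar with the Gram-Schmidt route, the NumPy sketch below inverts a small real matrix by computing a classical Gram-Schmidt QR decomposition, inverting the upper-triangular factor by backward elimination, and multiplying by the transpose of Q (since A = QR implies A^-1 = R^-1 Q^T). The 4x4 test matrix is arbitrary, and this scalar reference code says nothing about the DAP mapping, scaling, or routing discussed in the thesis.

    import numpy as np

    def gram_schmidt_qr(A):
        """Classical Gram-Schmidt QR factorization of a square matrix A."""
        n = A.shape[0]
        Q = np.zeros((n, n))
        R = np.zeros((n, n))
        for j in range(n):
            v = A[:, j].astype(float).copy()
            for i in range(j):
                R[i, j] = Q[:, i] @ A[:, j]
                v -= R[i, j] * Q[:, i]
            R[j, j] = np.linalg.norm(v)
            Q[:, j] = v / R[j, j]
        return Q, R

    def invert_upper(R):
        """Invert an upper-triangular matrix column by column via backward elimination."""
        n = R.shape[0]
        X = np.zeros((n, n))
        for j in range(n):
            e = np.zeros(n)
            e[j] = 1.0
            for i in range(n - 1, -1, -1):
                X[i, j] = (e[i] - R[i, i + 1:] @ X[i + 1:, j]) / R[i, i]
        return X

    # Arbitrary nonsingular 4x4 real test matrix.
    A = np.array([[4., 1., 2., 0.],
                  [1., 3., 0., 1.],
                  [2., 0., 5., 2.],
                  [0., 1., 2., 6.]])
    Q, R = gram_schmidt_qr(A)
    A_inv = invert_upper(R) @ Q.T             # A = Q R  =>  A^-1 = R^-1 Q^T
    print(np.allclose(A @ A_inv, np.eye(4)))  # True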

Details

Contributors
Date Created
2024
Topical Subject
Resource Type
Language
  • eng
Note
  • Partial requirement for: M.S., Arizona State University, 2024
  • Field of study: Electrical Engineering

Additional Information

English
Extent
  • 62 pages
Open Access
Peer-reviewed

Continual Learning with Novelty Detection: Algorithms and Applications to Image and Microelectronic Design

Description

Machine learning techniques have found extensive application in dynamic fields like drones, self-driving vehicles, surveillance, and more. Their effectiveness stems from meticulously crafted deep neural networks (DNNs), extensive data gathering efforts, and resource-intensive model training processes. However, due to the unpredictable nature of the environment, these systems will inevitably encounter input samples that deviate from the distribution of their original training data, resulting in instability and performance degradation. To effectively detect the emergence of out-of-distribution (OOD) data, this dissertation first proposes a novel, self-supervised approach that evaluates the Mahalanobis distance between the in-distribution (ID) and OOD data in gradient space. A binary classifier is then introduced to guide label selection for the gradient calculation, which further boosts detection performance. Next, to continuously adapt new OOD data into the existing knowledge base, a unified framework for novelty detection and continual learning is proposed. The binary classifier, trained to distinguish OOD data from ID data, is connected sequentially with the pre-trained model to form an “N + 1” classifier, where “N” represents prior knowledge containing N classes and “1” refers to the newly arrived OOD class. This continual learning process continues as “N+1+1+1+...”, assimilating the knowledge of each new OOD instance into the system. Finally, this dissertation demonstrates the practical implementation of novelty detection and continual learning within the domain of thermal analysis. To rapidly address the impact of voids in thermal interface material (TIM), a continuous adaptation approach is proposed, which integrates trainable nodes into the graph at the locations where abnormal thermal behaviors are detected. With minimal training overhead, the model can quickly adapt to the changes caused by the defects and regenerate accurate thermal predictions. In summary, this dissertation proposes several algorithms and practical applications in continual learning aimed at enhancing the stability and adaptability of the system. All proposed algorithms are validated through extensive experiments conducted on benchmark datasets such as CIFAR-10, CIFAR-100, and TinyImageNet for continual learning, and on real thermal data for thermal analysis.
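
To illustrate the Mahalanobis-distance scoring at the core of the detector, the NumPy sketch below fits a mean and covariance to in-distribution vectors and scores new vectors by their distance to that distribution. The synthetic 16-dimensional vectors merely stand in for the gradient-space features used in the dissertation, and the label-selection classifier is omitted.

    import numpy as np

    def fit_id_statistics(features):
        """Estimate the mean and inverse (regularized) covariance of ID feature vectors."""
        mu = features.mean(axis=0)
        cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
        return mu, np.linalg.inv(cov)

    def mahalanobis_score(x, mu, cov_inv):
        """Mahalanobis distance of a single vector x from the ID distribution."""
        d = x - mu
        return float(np.sqrt(d @ cov_inv @ d))

    rng = np.random.default_rng(0)
    id_vecs = rng.normal(0.0, 1.0, size=(1000, 16))  # stand-in for ID gradient features
    ood_vec = rng.normal(5.0, 1.0, size=16)          # shifted distribution, far from ID
    mu, cov_inv = fit_id_statistics(id_vecs)
    print(mahalanobis_score(id_vecs[0], mu, cov_inv))  # small: close to the ID data
    print(mahalanobis_score(ood_vec, mu, cov_inv))     # large: flagged as OOD by a threshold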

Details

Contributors
Date Created
2024
Resource Type
Language
  • eng
Note
  • Partial requirement for: Ph.D., Arizona State University, 2024
  • Field of study: Computer Engineering

Additional Information

English
Extent
  • 102 pages
Open Access
Peer-reviewed

Sensing for Wireless Communication: From Theory to Reality

Description

Millimeter-wave (mmWave) and sub-terahertz (sub-THz) systems aim to utilize the large bandwidth available at these frequencies. This has the potential to enable several future applications that require high data rates, such as autonomous vehicles and digital twins. These systems, however, have several challenges that need to be addressed to realize their gains in practice. First, they need to deploy large antenna arrays and use narrow beams to guarantee sufficient receive power. Adjusting the narrow beams of the large antenna arrays incurs massive beam training overhead. Second, the sensitivity to blockages is a key challenge for mmWave and THz networks. Since these networks mainly rely on line-of-sight (LOS) links, sudden link blockages highly threaten the reliability of the networks. Further, when the LOS link is blocked, the network typically needs to hand off the user to another LOS basestation, which may incur critical time latency, especially if a search over a large codebook of narrow beams is needed. A promising way to tackle both these challenges lies in leveraging additional side information such as visual, LiDAR, radar, and position data. These sensors provide rich information about the wireless environment, which can be utilized for fast beam and blockage prediction. This dissertation presents a machine-learning framework for sensing-aided beam and blockage prediction. In particular, for beam prediction, this work proposes to utilize visual and positional data to predict the optimal beam indices. For the first time, this work investigates the sensing-aided beam prediction task in a real-world vehicle-to-infrastructure and drone communication scenario. Similarly, for blockage prediction, this dissertation proposes a multi-modal wireless communication solution that utilizes bimodal machine learning to perform proactive blockage prediction and user hand-off. Evaluations on both real-world and synthetic datasets illustrate the promising performance of the proposed solutions and highlight their potential for next-generation communication and sensing systems.
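
As a deliberately simple baseline for position-aided beam prediction (not the machine-learning models evaluated in the dissertation), the Python sketch below quantizes user positions onto a coarse grid and records, per cell, the beam index that was most frequently optimal in the training data. The grid size, the 16-beam codebook, and the toy data are hypothetical.

    import numpy as np

    def build_lut(positions, best_beams, grid=10):
        """Map coarse position cells to the beam index most often optimal in that cell."""
        votes = {}
        cells = np.floor(positions * grid).astype(int)
        for cell, beam in zip(map(tuple, cells), best_beams):
            votes.setdefault(cell, []).append(int(beam))
        return {cell: np.bincount(beams).argmax() for cell, beams in votes.items()}

    def predict_beam(lut, position, grid=10, default=0):
        """Look up the beam for a new position; fall back to a default for unseen cells."""
        cell = tuple(np.floor(position * grid).astype(int))
        return lut.get(cell, default)

    # Toy data: the optimal beam is roughly determined by the normalized x-coordinate.
    rng = np.random.default_rng(1)
    pos = rng.random((500, 2))              # normalized (x, y) user positions
    beams = (pos[:, 0] * 16).astype(int)    # optimal index in a 16-beam codebook
    lut = build_lut(pos, beams)
    print(predict_beam(lut, np.array([0.52, 0.30])))  # most often beam 8 for this cell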

Details

Contributors
Date Created
2024
Embargo Release Date
Resource Type
Language
  • eng
Note
  • Partial requirement for: Ph.D., Arizona State University, 2024
  • Field of study: Electrical Engineering

Additional Information

English
Extent
  • 287 pages
Open Access
Peer-reviewed

Increasing the Efficiency of Heterogeneous System Operation: from Scheduling to Implementation of Split Federated Learning

Description

This thesis addresses the problems of (a) scheduling multiple streaming jobs with soft deadline constraints to minimize the risk/energy consumption in heterogeneous Systems-on-chip (SoCs), and (b) training a neural network model with high accuracy and low training time using split federated learning (SFL) with heterogeneous clients. Designing a scheduler for heterogeneous SoCs built with different types of processing elements (PEs) is quite challenging, especially when it has to balance the conflicting requirements of low energy consumption, low risk, and high throughput for randomly streaming jobs at run time. Two probabilistic deadline-aware schedulers are designed for heterogeneous SoCs for such jobs with soft deadline constraints, with the goals of optimizing job-level risk and energy efficiency. The key idea of the probabilistic scheduler is to calculate the task-to-PE allocation probabilities when a job arrives in the system. This allocation probability, generated by a manually designed or neural network (NN) based allocation function, is used to compute the intra-job and inter-job contentions to derive the task-level slack. The tasks are allocated to the PEs that can complete the task within the task-level slack with minimum risk or minimum energy consumption. SFL is an edge-friendly decentralized NN training scheme, where the model is split and only a small client-side model is trained on the clients. The communication overhead in SFL is significant since the intermediate activations and gradients of every sample are transmitted in every epoch. Two communication reduction methods have been proposed, namely, loss-aware selective updating to reduce the number of training epochs and a bottleneck layer (BL) to reduce the feature size. Next, the SFL system is trained with heterogeneous clients having different data rates and operating on non-IID data. The communication time of clients in the low-end group, with slow data rates, dominates the training time. To reduce the training time without sacrificing accuracy significantly, HeteroSFL is built with HetBL and bidirectional knowledge sharing (BDKS). HetBL compresses data with different factors in the low- and high-end groups using narrow and wide bottleneck layers, respectively. BDKS is proposed to mitigate the label distribution skew across different groups. BDKS can also be applied in Federated Learning to address label distribution skew.
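
The allocation step of such a deadline-aware scheduler can be sketched as follows: among the processing elements that can finish a task within its task-level slack, pick the one with the lowest energy. The Python snippet below uses hypothetical PE names and per-cycle cost figures and omits the contention and probability calculations, so it is only a toy policy, not the proposed scheduler.

    def allocate_task(task_cycles, slack, pes):
        """Pick a PE that finishes the task within its slack at minimum energy.

        pes maps a PE name to (time_per_cycle, energy_per_cycle). Among PEs whose
        execution time fits the task-level slack, the lowest-energy one is chosen;
        if none fits, fall back to the fastest PE.
        """
        feasible = {name: (t, e) for name, (t, e) in pes.items()
                    if task_cycles * t <= slack}
        if feasible:
            return min(feasible, key=lambda name: task_cycles * feasible[name][1])
        return min(pes, key=lambda name: task_cycles * pes[name][0])

    # Hypothetical PEs: (time per cycle, energy per cycle), in arbitrary units.
    pes = {"big_core": (1.0, 3.0), "little_core": (4.0, 1.0), "accelerator": (0.5, 0.8)}
    print(allocate_task(task_cycles=100, slack=500, pes=pes))  # 'accelerator'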

Details

Contributors
Date Created
2023
Embargo Release Date
Topical Subject
Resource Type
Language
  • eng
Note
  • Partial requirement for: Ph.D., Arizona State University, 2023
  • Field of study: Electrical Engineering

Additional Information

English
Extent
  • 218 pages
Open Access
Peer-reviewed

In-Memory Computing Based Hardware Accelerators Advancing the Implementation of AI/ML Tasks in Edge Devices

Description

Artificial Intelligence (AI) and Machine Learning (ML) techniques have come a long way since their inception and have been used to build intelligent systems for a wide range of applications in everyday life. However, they are very computation-intensive and require the transfer of large volumes of data from memory to the computation units. This memory access time constitutes a significant part of the computational latency and is a performance bottleneck. To address this limitation and the ever-growing demand for implementation in hand-held and edge devices, in-memory computing (IMC) based AI/ML hardware accelerators have emerged. First, the dissertation presents an IMC static random access memory (SRAM) based hardware modeling and optimization framework. A unified systematic study closely models the IMC hardware and investigates how a number of design variables and non-idealities (e.g., device mismatch and ADC quantization) affect the Deep Neural Network (DNN) accuracy of the IMC design. The framework allows co-optimized selection of different design variables accounting for sources of noise in IMC hardware and robust implementation of a high-accuracy DNN. Next, it presents a kNN hardware accelerator in 65nm Complementary Metal-Oxide-Semiconductor (CMOS) technology. The accelerator combines an IMC SRAM that is developed for binarized deep neural networks with other digital hardware that performs top-k sorting. The simulated k-Nearest-Neighbor accelerator design processes up to 17.9 million query vectors per second while consuming 11.8 mW, demonstrating a >4.8× energy-efficiency improvement over prior works. This dissertation also presents a novel floating-point precision IMC (FP-IMC) macro with a hybrid architecture that configurably supports two Floating Point (FP) precisions. Implementing FP-precision multiply-accumulate (MAC) operations has been a challenge owing to their complexity. The design is implemented in 28nm CMOS and taped out, with the chip demonstrating 12.1 TFLOPS/W and 66.1 TFLOPS/W for 8-bit Floating Point (FP8) and Block Floating Point (BF8), respectively. Finally, another iteration of the FP design is presented that is modeled to support multiple precision modes from FP8 up to FP32. Two approaches to the architectural design were compared, illustrating the throughput-area overhead trade-off. The simulated design shows a 2.1× improvement in normalized energy-efficiency compared to the on-chip implementation of the FP-IMC.
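
To give a flavor of the ADC-quantization non-ideality that the modeling framework studies, the NumPy sketch below computes one IMC column's binary multiply-accumulate result exactly and then quantizes it to a limited number of ADC levels. The column length, full-scale choice, and bit widths are illustrative assumptions, not the fabricated design's parameters.

    import numpy as np

    def imc_column_mac(inputs, weights, adc_bits=4, full_scale=None):
        """Model one IMC column: exact binary dot product, then ADC quantization.

        inputs and weights are 0/1 vectors. The exact partial sum is quantized to
        2**adc_bits - 1 steps over the column's full-scale range, which is the
        kind of non-ideality the modeling framework sweeps.
        """
        exact = float(np.dot(inputs, weights))
        if full_scale is None:
            full_scale = len(inputs)           # worst-case partial sum for the column
        step = full_scale / (2 ** adc_bits - 1)
        quantized = round(exact / step) * step
        return exact, quantized

    rng = np.random.default_rng(2)
    x = rng.integers(0, 2, 256)   # binarized activations
    w = rng.integers(0, 2, 256)   # binarized weights on one column
    print(imc_column_mac(x, w, adc_bits=4))  # coarse ADC: visible quantization error
    print(imc_column_mac(x, w, adc_bits=8))  # finer ADC: the error shrinks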

Details

Contributors
Date Created
2023
Resource Type
Language
  • eng
Note
  • Partial requirement for: Ph.D., Arizona State University, 2023
  • Field of study: Electrical Engineering

Additional Information

English
Extent
  • 104 pages
Open Access
Peer-reviewed