Search Content

Accelerating Linear Algebra and Machine Learning Kernels on a Massively Parallel Reconfigurable Architecture

Description

This thesis presents efficient implementations of several linear algebra kernels, machine learning kernels and a neural network based recommender systems engine onto a massively parallel reconfigurable architecture, Transformer. The linear algebra kernels include Triangular Matrix Solver (TRSM), LU Decomposition (LUD), QR Decomposition (QRD), and Matrix Inversion. The machine learning kernels…

This thesis presents efficient implementations of several linear algebra kernels, machine learning kernels and a neural network based recommender systems engine onto a massively parallel reconfigurable architecture, Transformer. The linear algebra kernels include Triangular Matrix Solver (TRSM), LU Decomposition (LUD), QR Decomposition (QRD), and Matrix Inversion. The machine learning kernels include an LSTM (Long Short Term Memory) cell, and a GRU (gated Recurrent Unit) cell used in recurrent neural networks. The neural network based recommender systems engine consists of multiple kernels including fully connected layers, embedding layer, 1-D batchnorm, Adam optimizer, etc.

Transformer is a massively parallel reconfigurable multicore architecture designed at the University of Michigan. The Transformer configuration considered here is 4 tiles and 16 General Processing Elements (GPEs) per tile. It supports a two level cache hierarchy where the L1 and L2 caches can operate in shared (S) or private (P) modes. The architecture was modeled using Gem5 and cycle accurate simulations were done to evaluate the performance in terms of execution times, giga-operations per second per Watt (GOPS/W), and giga-floating-point-operations per second per Watt (GFLOPS/W).

This thesis shows that for linear algebra kernels, each kernel achieves high performance for a certain cache mode and that this cache mode can change when the matrix size changes. For instance, for smaller matrix sizes, L1P, L2P cache mode is best for TRSM, while L1S, L2S is the best cache mode for LUD, and L1P, L2S is the best for QRD. For each kernel, the optimal cache mode changes when the matrix size is increased. For instance, for TRSM, the L1P, L2P cache mode is best for smaller matrix sizes ($N=64, 128, 256, 512$) and it changes to L1S, L2P for larger matrix sizes ($N=1024$). For machine learning kernels, L1P, L2P is the best cache mode for all network parameter sizes.

Gem5 simulations show that the peak performance for TRSM, LUD, QRD and Matrix Inverse in the 14nm node is 97.5, 59.4, 133.0 and 83.05 GFLOPS/W, respectively. For LSTM and GRU, the peak performance is 44.06 and 69.3 GFLOPS/W.

The neural network based recommender system was implemented in L1S, L2S cache mode. It includes a forward pass and a backward pass and is significantly more complex in terms of both computational complexity and data movement. The most computationally intensive block is the fully connected layer followed by Adam optimizer. The overall performance of the recommender systems engine is 54.55 GFLOPS/W and 169.12 GOPS/W.

ContributorsSoorishetty, Anuraag (Author) / Chakrabarti, Chaitali (Thesis advisor) / Kim, Hun Seok (Committee member) / LiKamWa, Robert (Committee member) / Arizona State University (Publisher)

Created2019

RNS-Based NTT Polynomial Multiplier for Lattice-Based Cryptography

Description

Lattice-based Cryptography is an up and coming field of cryptography that utilizes the difficulty of lattice problems to design lattice-based cryptosystems that are resistant to quantum attacks and applicable to Fully Homomorphic Encryption schemes (FHE). In this thesis, the parallelization of the Residue Number System (RNS) and algorithmic efficiency of…

Lattice-based Cryptography is an up and coming field of cryptography that utilizes the difficulty of lattice problems to design lattice-based cryptosystems that are resistant to quantum attacks and applicable to Fully Homomorphic Encryption schemes (FHE). In this thesis, the parallelization of the Residue Number System (RNS) and algorithmic efficiency of the Number Theoretic Transform (NTT) are combined to tackle the most significant bottleneck of polynomial ring multiplication with the hardware design of an optimized RNS-based NTT polynomial multiplier. The design utilizes Negative Wrapped Convolution, the NTT, RNS Montgomery reduction with Bajard and Shenoy extensions, and optimized modular 32-bit channel arithmetic for nine RNS channels to accomplish an RNS polynomial multiplication. In addition to a full software implementation of the whole system, a pipelined and optimized RNS-based NTT unit with 4 RNS butterflies is implemented on the Xilinx Artix-7 FPGA(xc7a200tlffg1156-2L) for size and delay estimates. The hardware implementation achieves an operating frequency of 47.043 MHz and utilizes 13239 LUT's, 4010 FF's, and 330 DSP blocks, allowing for multiple simultaneously operating NTT units depending on FGPA size constraints.

ContributorsBrist, Logan Alan (Author) / Chakrabarti, Chaitali (Thesis advisor) / Papandreou-Suppappola, Antonia (Committee member) / Bliss, Daniel (Committee member) / Arizona State University (Publisher)

Created2020

Accelerator for Flexible QR Decomposition and Back Substitution

Description

QR decomposition (QRD) of a matrix is one of the most common linear algebra operationsused for the decomposition of a square
on-square matrix. It has a wide range
of applications especially in Multiple Input-Multiple Output (MIMO) communication
systems. Unfortunately it has high computation complexity { for matrix size of nxn,
QRD has O(n3) complexity…

QR decomposition (QRD) of a matrix is one of the most common linear algebra operationsused for the decomposition of a square
on-square matrix. It has a wide range
of applications especially in Multiple Input-Multiple Output (MIMO) communication
systems. Unfortunately it has high computation complexity { for matrix size of nxn,
QRD has O(n3) complexity and back substitution, which is used to solve a system
of linear equations, has O(n2) complexity. Thus, as the matrix size increases, the
hardware resource requirement for QRD and back substitution increases signicantly.
This thesis presents the design and implementation of a
exible QRD and back substitution accelerator using a folded architecture. It can support matrix sizes of
4x4, 8x8, 12x12, 16x16, and 20x20 with low hardware resource requirement.
The proposed architecture is based on the systolic array implementation of the
Givens algorithm for QRD. It is built with three dierent types of computation blocks
which are connected in a 2-D array structure. These blocks are controlled by a
scheduler which facilitates reusability of the blocks to perform computation for any
input matrix size which is a multiple of 4. These blocks are designed using two
basic programming elements which support both the forward and backward paths to
compute matrix R in QRD and column-matrix X in back substitution computation.
The proposed architecture has been mapped to Xilinx Zynq Ultrascale+ FPGA
(Field Programmable Gate Array), ZCU102. All inputs are complex with precision
of 40 bits (38 fractional bits and 1 signed bit). The architecture can be clocked at
50 MHz. The synthesis results of the folded architecture for dierent matrix sizes
are presented. The results show that the folded architecture can support QRD and
back substitution for inputs of large sizes which otherwise cannot t on an FPGA
when implemented using a
at architecture. The memory sizes required for dierent
matrix sizes are also presented.

ContributorsKanagala, Srimayee (Author) / Chakrabarti, Chaitali (Thesis advisor) / Bliss, Daniel (Committee member) / Cao, Yu (Kevin) (Committee member) / Arizona State University (Publisher)

Created2020

Automatic Computational Domain Detection

Description

Heterogenous SoCs are in development that marry multiple architectural patterns together. In order for software to be run on such a platform, it must be broken down into its constituent parts, kernels, and scheduled for execution on the hardware. Although this can be done by hand, it would be arduous…

Heterogenous SoCs are in development that marry multiple architectural patterns together. In order for software to be run on such a platform, it must be broken down into its constituent parts, kernels, and scheduled for execution on the hardware. Although this can be done by hand, it would be arduous and time consuming; rather, a tool should be developed that analyzes the source binary, extracts the kernels, schedules the kernels, and optimizes the scheduled kernels for their target component. This dissertation proposes a decidable kernel definition that enables an algorithmic approach to detecting kernels from arbitrary programs. This definition is built upon four constraints that can be tested using basic graph theory. In addition, two algorithms are proposed that successfully extract kernels based upon runtime information. The first utilizes dynamic traces, which are generated using a collection of novel optimizations. The second utilizes a simple affinity matrix, which has no runtime overhead during program execution. Finally, a Dense Neural Network is proposed that is capable of detecting a kernel's archetype based upon only the composition of the source program and the number of times individual basic blocks execute. The contributions proposed in this dissertation provide the necessary infrastructure to perform a litany of other optimizations on kernels. By detecting kernels algorithmically, any program can be analyzed and optimized with techniques that have heretofore required kernels be written in a compatible form. Computational kernels can be extracted from any program with no constraints. The innovations describes here will form the foundation for automated kernel optimization in the future, helping optimize the code of the future.

ContributorsUhrie, Richard Lawrence (Author) / Brunhaver, John (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Shrivastiva, Aviral (Committee member) / Wu, Carole-Jean (Committee member) / Arizona State University (Publisher)

Created2021

Energy Modeling of Machine Learning Algorithms on General Purpose Hardware

Description

Articial Neural Network(ANN) has become a for-bearer in the field of Articial Intel-

ligence. The innovations in ANN has led to ground breaking technological advances

like self-driving vehicles,medical diagnosis,speech Processing,personal assistants and

many more. These were inspired by evolution and working of our brains. Similar

to how our brain evolved using a combination of…

Articial Neural Network(ANN) has become a for-bearer in the field of Articial Intel-

ligence. The innovations in ANN has led to ground breaking technological advances

like self-driving vehicles,medical diagnosis,speech Processing,personal assistants and

many more. These were inspired by evolution and working of our brains. Similar

to how our brain evolved using a combination of epigenetics and live stimulus,ANN

require training to learn patterns.The training usually requires a lot of computation

and memory accesses. To realize these systems in real embedded hardware many

Energy/Power/Performance issues needs to be solved. The purpose of this research

is to focus on methods to study data movement requirement for generic Neural Net-

work along with the energy associated with it and suggest some ways to improve the

design.Many methods have suggested ways to optimize using mix of computation and

data movement solutions without affecting task accuracy. But these methods lack a

computation model to calculate the energy and depend on mere back of the envelope calculation. We realized that there is a need for a generic quantitative analysis

for memory access energy which helps in better architectural exploration. We show

that the present architectural tools are either incompatible or too slow and we need

a better analytical method to estimate data movement energy. We also propose a

simplistic yet effective approach that is robust and expandable by users to support

various systems.

ContributorsChowdary, Hidayatullah (Author) / Cao, Yu (Thesis advisor) / Seo, JaeSun (Committee member) / Chakrabarti, Chaitali (Committee member) / Arizona State University (Publisher)

Created2018

Filtering by

Accelerating Linear Algebra and Machine Learning Kernels on a Massively Parallel Reconfigurable Architecture

RNS-Based NTT Polynomial Multiplier for Lattice-Based Cryptography

Accelerator for Flexible QR Decomposition and Back Substitution

Automatic Computational Domain Detection

Energy Modeling of Machine Learning Algorithms on General Purpose Hardware