Matching Items (90)

Description
QR decomposition (QRD) of a matrix is one of the most common linear algebra operations, used for the decomposition of a square or non-square matrix. It has a wide range of applications, especially in Multiple Input-Multiple Output (MIMO) communication systems. Unfortunately, it has high computation complexity: for a matrix of size n x n, QRD has O(n³) complexity, and back substitution, which is used to solve a system of linear equations, has O(n²) complexity. Thus, as the matrix size increases, the hardware resource requirement for QRD and back substitution increases significantly. This thesis presents the design and implementation of a flexible QRD and back substitution accelerator using a folded architecture. It can support matrix sizes of 4x4, 8x8, 12x12, 16x16, and 20x20 with low hardware resource requirement.

The proposed architecture is based on the systolic array implementation of the Givens algorithm for QRD. It is built with three different types of computation blocks which are connected in a 2-D array structure. These blocks are controlled by a scheduler which facilitates reusability of the blocks to perform the computation for any input matrix size which is a multiple of 4. The blocks are designed using two basic programming elements which support both the forward and backward paths to compute matrix R in QRD and column-matrix X in back substitution.

The proposed architecture has been mapped to a Xilinx Zynq UltraScale+ FPGA (Field Programmable Gate Array), the ZCU102. All inputs are complex with a precision of 40 bits (38 fractional bits and 1 sign bit). The architecture can be clocked at 50 MHz. Synthesis results of the folded architecture for different matrix sizes are presented. The results show that the folded architecture can support QRD and back substitution for inputs of large sizes which otherwise cannot fit on an FPGA when implemented using a flat architecture. The memory sizes required for different matrix sizes are also presented.
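For reference, the sketch below shows the two computations the accelerator implements, Givens-rotation QRD and back substitution, in plain NumPy. It is a minimal software model only: it does not reproduce the folded systolic array, the scheduler, or the 40-bit fixed-point format, and the function names are ours.

```python
# Minimal sketch: Givens-rotation QRD (O(n^3)) and back substitution (O(n^2)).
import numpy as np

def givens_qrd(A):
    """Return Q, R with A = Q @ R, zeroing subdiagonal entries one at a time
    with complex Givens rotations."""
    A = np.array(A, dtype=complex)
    m, n = A.shape
    Q, R = np.eye(m, dtype=complex), A.copy()
    for j in range(n):
        for i in range(m - 1, j, -1):
            a, b = R[i - 1, j], R[i, j]
            r = np.hypot(abs(a), abs(b))
            if r == 0:
                continue
            c, s = a / r, b / r
            G = np.eye(m, dtype=complex)           # rotation acting on rows i-1, i
            G[i - 1, i - 1], G[i - 1, i] = np.conj(c), np.conj(s)
            G[i, i - 1], G[i, i] = -s, c
            R = G @ R
            Q = Q @ G.conj().T
    return Q, R

def back_substitution(R, y):
    """Solve R x = y for upper-triangular R."""
    n = len(y)
    x = np.zeros(n, dtype=complex)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

A = np.random.randn(4, 4) + 1j * np.random.randn(4, 4)
b = np.random.randn(4) + 1j * np.random.randn(4)
Q, R = givens_qrd(A)
x = back_substitution(R, Q.conj().T @ b)   # solve A x = b via QRD
print(np.allclose(A @ x, b))
```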
Contributors: Kanagala, Srimayee (Author) / Chakrabarti, Chaitali (Thesis advisor) / Bliss, Daniel (Committee member) / Cao, Yu (Kevin) (Committee member) / Arizona State University (Publisher)
Created: 2020
Description
Lattice-based cryptography is an up-and-coming field of cryptography that utilizes the difficulty of lattice problems to design lattice-based cryptosystems that are resistant to quantum attacks and applicable to Fully Homomorphic Encryption (FHE) schemes. In this thesis, the parallelization of the Residue Number System (RNS) and the algorithmic efficiency of the Number Theoretic Transform (NTT) are combined to tackle the most significant bottleneck, polynomial ring multiplication, with the hardware design of an optimized RNS-based NTT polynomial multiplier. The design utilizes Negative Wrapped Convolution, the NTT, RNS Montgomery reduction with the Bajard and Shenoy extensions, and optimized modular 32-bit channel arithmetic for nine RNS channels to accomplish an RNS polynomial multiplication. In addition to a full software implementation of the whole system, a pipelined and optimized RNS-based NTT unit with 4 RNS butterflies is implemented on a Xilinx Artix-7 FPGA (xc7a200tlffg1156-2L) for size and delay estimates. The hardware implementation achieves an operating frequency of 47.043 MHz and utilizes 13239 LUTs, 4010 FFs, and 330 DSP blocks, allowing for multiple simultaneously operating NTT units depending on FPGA size constraints.
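For context, here is a minimal software sketch of the core operation being accelerated: negacyclic (negative wrapped) polynomial multiplication in Z_q[X]/(X^n + 1) via a radix-2 NTT. The toy parameters (n = 8, q = 17, psi = 3) are chosen only so that q ≡ 1 (mod 2n); the thesis's 32-bit RNS channels, Montgomery reduction, and Bajard/Shenoy base extensions are not modeled here.

```python
# Negative wrapped convolution with a recursive radix-2 NTT (toy parameters).
def ntt(a, omega, q):
    """Number theoretic transform, Cooley-Tukey decimation in time."""
    n = len(a)
    if n == 1:
        return list(a)
    even = ntt(a[0::2], omega * omega % q, q)
    odd = ntt(a[1::2], omega * omega % q, q)
    out, w = [0] * n, 1
    for k in range(n // 2):
        t = w * odd[k] % q
        out[k] = (even[k] + t) % q
        out[k + n // 2] = (even[k] - t) % q
        w = w * omega % q
    return out

def negacyclic_mul(a, b, n, q, psi):
    """Multiply a, b in Z_q[X]/(X^n + 1); psi is a primitive 2n-th root of unity."""
    omega, psi_inv, n_inv = psi * psi % q, pow(psi, -1, q), pow(n, -1, q)
    a_hat = [a[i] * pow(psi, i, q) % q for i in range(n)]   # pre-twist by psi^i
    b_hat = [b[i] * pow(psi, i, q) % q for i in range(n)]
    A, B = ntt(a_hat, omega, q), ntt(b_hat, omega, q)
    C = [x * y % q for x, y in zip(A, B)]                   # point-wise product
    c_hat = [x * n_inv % q for x in ntt(C, pow(omega, -1, q), q)]  # inverse NTT
    return [c_hat[i] * pow(psi_inv, i, q) % q for i in range(n)]   # post-twist

# Check against schoolbook multiplication reduced modulo X^n + 1.
n, q, psi = 8, 17, 3
a, b = [3, 1, 4, 1, 5, 9, 2, 6], [2, 7, 1, 8, 2, 8, 1, 8]
ref = [0] * n
for i in range(n):
    for j in range(n):
        sign = 1 if i + j < n else -1
        ref[(i + j) % n] = (ref[(i + j) % n] + sign * a[i] * b[j]) % q
print(negacyclic_mul(a, b, n, q, psi) == ref)
```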
Contributors: Brist, Logan Alan (Author) / Chakrabarti, Chaitali (Thesis advisor) / Papandreou-Suppappola, Antonia (Committee member) / Bliss, Daniel (Committee member) / Arizona State University (Publisher)
Created: 2020
Description
As device and voltage scaling cease, ever-increasing performance targets can only be achieved through the design of parallel, heterogeneous architectures. The workloads targeted by these domain-specific architectures must be designed to leverage the strengths of the platform: a task that has proven to be extremely difficult and expensive.
Machine learning has the potential to automate this process by understanding the features of computation that optimize device utilization and throughput.
Unfortunately, applications of this technique have utilized small data-sets and specific feature extraction, limiting the impact of their contributions.

To address this problem, I present Dash-Database: a repository of C and C++ programs for software-defined radio applications and neighboring fields; a methodology for structuring the features of computation using kernels; and a set of evaluation metrics to standardize computation data sets. Dash-Database contributes a general data set that supports machine understanding of computation and standardizes the input corpus utilized for machine learning of computation, where currently only a small set of benchmarks and features is being used.
I present an evaluation of Dash-Database using three novel metrics: breadth, depth, and richness; and compare its results to a data set largely representative of those used in prior work, indicating a 5x increase in breadth, a 40x increase in depth, and a rich set of sample features.
Using Dash-Database, the broader community can work toward a general machine understanding of computation that can automate the design of workloads for domain-specific computation.
Contributors: Willis, Benjamin Roy (Author) / Brunhaver, John S (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Shrivastava, Aviral (Committee member) / Arizona State University (Publisher)
Created: 2020
Description
The following document describes the hardware implementation and analysis of Temporal Interference Mitigation using High-Level Synthesis. As the problem of spectral congestion becomes more chronic and widespread, electromagnetic radio frequency (RF) based systems are emerging as a viable solution to this problem. Among the existing RF methods, cooperation-based systems have been a solution to a host of congestion problems. One of the most important elements of an RF receiver is its spatially adaptive part. Temporal mitigation is a vital technique employed at the receiver for signal recovery and propagation further along the radar chain.

The computationally intensive parts of temporal mitigation are identified and hardware accelerated. The hardware implementation is based on a sequential approach, with optimizations applied to the individual components for better performance.

An extensive analysis using a range of fixed-point data types is performed to find the optimal data type.

Finally, a hybrid combination of data types for the different components of temporal mitigation is proposed based on the results of this analysis.
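The abstract does not give the exact fixed-point sweep, so the snippet below only illustrates the general procedure one might follow before committing to HLS ap_fixed-like formats: quantize a complex reference signal at several fractional bit widths and compare the error against a double-precision reference. The signal model and error metric are placeholders, not the thesis's analysis.

```python
# Illustrative fixed-point precision sweep (fractional rounding only;
# integer-part overflow/saturation is not modeled).
import numpy as np

def quantize(x, frac_bits):
    """Round to a fixed-point grid with the given number of fractional bits."""
    scale = 2.0 ** frac_bits
    return np.round(x * scale) / scale

rng = np.random.default_rng(0)
reference = rng.standard_normal(4096) + 1j * rng.standard_normal(4096)

for frac_bits in (8, 12, 16, 20, 24):
    q = quantize(reference.real, frac_bits) + 1j * quantize(reference.imag, frac_bits)
    err = np.linalg.norm(q - reference) / np.linalg.norm(reference)
    print(f"frac bits {frac_bits:2d}: relative error {err:.2e}, "
          f"SQNR {-20 * np.log10(err):5.1f} dB")
```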
Contributors: Siddiqui, Saquib Ahmad (Author) / Bliss, Daniel (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Ogras, Umit Y. (Committee member) / Jayasuriya, Suren (Committee member) / Arizona State University (Publisher)
Created: 2020
Description
Detecting areas of change between two synthetic aperture radar (SAR) images of the same scene, taken at different times, is generally performed using two approaches. Non-coherent change detection is performed using the sample variance ratio detector and displays good performance in detecting areas of significant change. Coherent change detection can be implemented using the classical coherence estimator, which does better at detecting subtle changes, like vehicle tracks. A two-stage detector was proposed by Cha et al., where the sample variance ratio forms the first stage and the second stage comprises Berger's alternative coherence estimator.

A modification to the first stage of the two-stage detector is proposed in this study, which significantly simplifies the analysis of this detector. Cha et al. have used a heuristic approach to determine the thresholds for this two-stage detector. In this study, the probability density function for the modified two-stage detector is derived, and using this probability density function, an approach for determining the thresholds for this two-dimensional detection problem has been proposed. The proposed method of threshold selection reveals an interesting behavior shown by the two-stage detector. With the help of theoretical receiver operating characteristic analysis, it is shown that the two-stage detector gives better detection performance than the other three detectors. However, Berger's estimator proves to be a simpler alternative, since it gives only slightly poorer performance than the two-stage detector. All four detectors have also been implemented on a SAR data set, and it is shown that the two-stage detector and Berger's estimator generate images in which the areas showing change are easily visible.
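To make the two per-pixel statistics concrete, here is a small sketch that computes the local sample variance (power) ratio and the classical sample coherence for two co-registered complex SAR images. Berger's alternative estimator, the modified two-stage detector, and the derived thresholds are not reproduced; the window size and the thresholds shown are arbitrary illustration values.

```python
# Classical per-pixel change statistics over a sliding window.
import numpy as np
from scipy.ndimage import uniform_filter

def local_sum(x, win):
    """Sliding-window sum via a uniform (box) filter on a real-valued array."""
    return uniform_filter(x, size=win, mode="nearest") * win * win

def change_statistics(f, g, win=5):
    pf = local_sum(np.abs(f) ** 2, win)                # local power, image 1
    pg = local_sum(np.abs(g) ** 2, win)                # local power, image 2
    cross = (local_sum((f * np.conj(g)).real, win) ** 2
             + local_sum((f * np.conj(g)).imag, win) ** 2)
    variance_ratio = np.maximum(pf, pg) / np.minimum(pf, pg)   # non-coherent stage
    coherence = np.sqrt(cross / (pf * pg))                     # classical coherence
    return variance_ratio, coherence

rng = np.random.default_rng(1)
shape = (128, 128)
f = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
g = 0.9 * f + 0.1 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
g[40:60, 40:60] = rng.standard_normal((20, 20)) + 1j * rng.standard_normal((20, 20))

vr, coh = change_statistics(f, g)
changed = (vr > 1.5) | (coh < 0.7)      # illustrative thresholds only
print(changed[45:55, 45:55].mean(), changed[:20, :20].mean())
```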
Contributors: Bondre, Akshay Sunil (Author) / Richmond, Christ D (Thesis advisor) / Papandreou-Suppappola, Antonia (Committee member) / Bliss, Daniel W (Committee member) / Arizona State University (Publisher)
Created: 2020
Description
Heterogeneous SoCs are in development that marry multiple architectural patterns together. In order for software to be run on such a platform, it must be broken down into its constituent parts, kernels, and scheduled for execution on the hardware. Although this can be done by hand, it would be arduous and time-consuming; rather, a tool should be developed that analyzes the source binary, extracts the kernels, schedules the kernels, and optimizes the scheduled kernels for their target component. This dissertation proposes a decidable kernel definition that enables an algorithmic approach to detecting kernels from arbitrary programs. This definition is built upon four constraints that can be tested using basic graph theory. In addition, two algorithms are proposed that successfully extract kernels based upon runtime information. The first utilizes dynamic traces, which are generated using a collection of novel optimizations. The second utilizes a simple affinity matrix, which has no runtime overhead during program execution. Finally, a Dense Neural Network is proposed that is capable of detecting a kernel's archetype based upon only the composition of the source program and the number of times individual basic blocks execute. The contributions proposed in this dissertation provide the necessary infrastructure to perform a litany of other optimizations on kernels. By detecting kernels algorithmically, any program can be analyzed and optimized with techniques that have heretofore required kernels to be written in a compatible form. Computational kernels can be extracted from any program with no constraints. The innovations described here will form the foundation for automated kernel optimization in the future, helping optimize the code of the future.
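The dissertation's affinity matrix is not defined in this abstract, so the following is only a hypothetical illustration of the general idea: score how often pairs of basic blocks execute close together in a dynamic trace, so that tightly looping regions (kernel candidates) show up as dense blocks in the matrix. The block names, window size, and normalization below are all invented for the example.

```python
# Hypothetical basic-block affinity matrix from a dynamic trace.
from collections import Counter
from itertools import combinations
import numpy as np

def affinity_matrix(trace, blocks, window=4):
    index = {b: i for i, b in enumerate(blocks)}
    counts = Counter()
    for start in range(len(trace) - window + 1):
        # count each unordered pair of distinct blocks seen inside the window
        for a, b in combinations(set(trace[start:start + window]), 2):
            counts[frozenset((a, b))] += 1
    A = np.zeros((len(blocks), len(blocks)))
    for pair, c in counts.items():
        a, b = tuple(pair)
        A[index[a], index[b]] = A[index[b], index[a]] = c
    return A / max(A.max(), 1)          # normalize to [0, 1]

# Toy trace: blocks B2/B3 form a hot loop, B0/B1/B4 are straight-line code.
trace = ["B0", "B1"] + ["B2", "B3"] * 50 + ["B4"]
blocks = ["B0", "B1", "B2", "B3", "B4"]
print(affinity_matrix(trace, blocks, window=4).round(2))
```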
Contributors: Uhrie, Richard Lawrence (Author) / Brunhaver, John (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Shrivastava, Aviral (Committee member) / Wu, Carole-Jean (Committee member) / Arizona State University (Publisher)
Created: 2021
Description
Due to high DRAM access latency and energy, several convolutional neural network (CNN) accelerators face performance and energy efficiency challenges, which are critical for embedded implementations. As these applications exploit larger datasets, the memory accesses of these emerging applications are increasing. As a result, it is difficult to predict the combined dynamic random access memory (DRAM) workload behavior, which can sabotage memory optimizations in software. To understand the impact of external memory access on CNN accelerators, and to reduce the high DRAM access latency and energy, simulators such as RAMULATOR and VAMPIRE have been proposed in prior work. In this work, we utilize these simulators to benchmark external memory access in CNN accelerators. Experiments are performed by generating trace files based on the number of parameters and the data precision, and also by using a trace file generated from CNN accelerator data for the Altera Arria 10 GX 1150 FPGA, to complete the end-to-end workflow using the mentioned simulators. In addition, certain modifications were made to the default VAMPIRE code to implement functionalities such as PREA (Precharge All) and REF (Refresh). Precalculated energies were then computed for DDR3, DDR4, and HBM based on the Micron model and entered in the DRAM specification file provided to the VAMPIRE tool. An experimental study was performed comparing DDR3, DDR4, and HBM; it was found that DDR4 is nearly 31% more energy-efficient than DDR3 and HBM is 54% more energy-efficient than DDR3. Modeling and experimental analysis were also performed on a large data set that was then split into smaller sets; the results of the small sets, multiplied by the number of sets, were nearly the same as those of the large data set. Finally, a GUI is developed by wrapping both simulators. The GUI provides user-friendly access and allows the parameters to be analyzed without much prior knowledge of the tools' inner workings.
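As an illustration of the trace-generation step, the sketch below turns a layer's parameter count and data precision into a stream of cache-line read requests, one per line. The "<hex address> R" line format is only an approximation of a DRAM-simulator trace (check RAMULATOR's documentation for the exact format it expects), and the base address, cache-line size, and layer shape are made-up values.

```python
# Approximate DRAM read trace for streaming a CNN layer's weights.
def write_weight_read_trace(path, num_params, bytes_per_param,
                            base_addr=0x1000_0000, line_bytes=64):
    total_bytes = num_params * bytes_per_param
    num_lines = (total_bytes + line_bytes - 1) // line_bytes   # round up
    with open(path, "w") as f:
        for i in range(num_lines):
            f.write(f"0x{base_addr + i * line_bytes:x} R\n")
    return num_lines

# Example: a 3x3 conv layer with 128 input and 256 output channels, int8 weights.
n_params = 3 * 3 * 128 * 256
print(write_weight_read_trace("conv_weights.trace", n_params, bytes_per_param=1))
```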
Contributors: Pannala, Manvitha (Author) / Cao, Yu (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created: 2021
Description
With the formation of next-generation wireless communication, a growing number of new applications such as the internet of things, autonomous cars, and drones are crowding the unlicensed spectrum. Licensed networks such as LTE also come to the unlicensed spectrum to better provide high-capacity content at low cost. However, LTE was not designed to share spectrum with others. A cooperation center for these networks is costly because they possess heterogeneous properties and anyone can enter and leave the spectrum without restriction, so its design would be challenging. Since it is infeasible to incorporate potentially infinite scenarios in one unified design, an alternative solution is to let each network learn its own coexistence policy. Previous solutions only work in fixed scenarios. In this work, we present a reinforcement learning algorithm to cope with the coexistence between Wi-Fi and LTE-LAA agents in the 5 GHz unlicensed spectrum. The coexistence problem was modeled as a Dec-POMDP, and a Bayesian approach was adopted for policy learning with a nonparametric prior to accommodate the uncertainty of the policy for different agents. A fairness measure was introduced into the reward function to encourage fair sharing between agents. We turned the reinforcement learning problem into an optimization problem by treating the value function as a likelihood and using variational inference for posterior approximation. Simulation results demonstrate that this algorithm can reach high value with compact policy representations and stays computationally efficient when applied to the agent set.
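The abstract does not define the fairness measure, so the snippet below is only one plausible construction: a reward that mixes normalized total throughput with Jain's fairness index over the per-agent (Wi-Fi / LTE-LAA) throughputs. The weighting and normalization are assumptions, not the thesis's design.

```python
# Illustrative fairness-aware reward for spectrum coexistence.
import numpy as np

def jain_fairness(throughputs):
    """Jain's fairness index: 1.0 when all agents get equal throughput."""
    x = np.asarray(throughputs, dtype=float)
    return (x.sum() ** 2) / (len(x) * np.sum(x ** 2) + 1e-12)

def coexistence_reward(throughputs, max_total, fairness_weight=0.5):
    total = np.sum(throughputs) / max_total          # normalized sum throughput
    return (1 - fairness_weight) * total + fairness_weight * jain_fairness(throughputs)

print(coexistence_reward([30.0, 28.0], max_total=60.0))   # fair sharing
print(coexistence_reward([55.0, 3.0], max_total=60.0))    # one agent starves
```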
Contributors: SHIH, PO-KAN (Author) / Moraffah, Bahman (Thesis advisor) / Papandreou-Suppappola, Antonia (Thesis advisor) / Dasarathy, Gautam (Committee member) / Shih, YiChang (Committee member) / Arizona State University (Publisher)
Created: 2021
Description
The work presented in this manuscript has the overarching theme of radiation. The two forms of radiation of interest are neutrons, i.e. nuclear radiation, and electric fields. The ability to detect such forms of radiation has significant security implications that could also be extended to very practical industrial applications. The goal is therefore to detect, and even image, such radiation sources.

The method to do so revolves around the concept of building large-area sensor arrays. By covering a large area, we can increase the probability of detection and gather more data to build a more complete and clearer view of the environment. Large-area circuitry can be achieved cost-effectively by leveraging the thin-film transistor process of the display industry. With the production of displays increasing with the explosion of mobile devices and continued growth in sales of flat panel monitors and televisions, the cost to build a unit continues to decrease.

Using a thin-film process also allows for flexible electronics, which could be taken advantage of in-house at the Flexible Electronics and Display Center. Flexible electronics implies new form factors and applications that would not otherwise be possible with their single crystal counterparts. To be able to effectively use thin-film technology, novel ways of overcoming the drawbacks of the thin-film process, namely its lower performance, must be developed.

The two deliverable devices that underwent development are a preamplifier used in an active pixel sensor for neutron detection and a passive electric field imaging array. This thesis will cover the theory and process behind realizing these devices.
Contributors: Chung, Hugh E (Author) / Allee, David R. (Thesis advisor) / Papandreou-Suppappola, Antonia (Committee member) / Holbert, Keith E. (Committee member) / Arizona State University (Publisher)
Created: 2015
Description
Researchers have observed that the frequencies of leading digits in many man-made and naturally occurring datasets follow a logarithmic curve, with numbers whose leading digit is 1 accounting for about 30% of all numbers in the dataset and numbers whose leading digit is 9 accounting for about 5%. This phenomenon, known as Benford's Law, is highly repeatable and appears in lists of numbers from electricity bills, stock prices, tax returns, house prices, death rates, lengths of rivers, and naturally occurring images. This paper will demonstrate that human speech spectra also follow Benford's Law. This observation is used to motivate a new set of features that can be efficiently extracted from speech, and it is demonstrated that these features can be used to classify speech as human or synthetic.
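To make the feature extraction concrete, here is a brief sketch of the idea: take magnitude spectra of short frames, collect each value's leading digit, and compare the empirical digit frequencies with the Benford prediction log10(1 + 1/d). The divergence used below is just one possible feature; the thesis's exact feature set and classifier are not reproduced, and the input signal is a synthetic stand-in for real speech.

```python
# Compare leading-digit statistics of frame spectra against Benford's Law.
import numpy as np

BENFORD = np.log10(1 + 1 / np.arange(1, 10))        # P(leading digit = d), d = 1..9

def leading_digit(x):
    """Leading decimal digit of each nonzero magnitude."""
    x = np.abs(x[np.abs(x) > 0])
    return (x / 10 ** np.floor(np.log10(x))).astype(int)

def benford_feature(signal, frame=512, hop=256):
    frames = [signal[i:i + frame] for i in range(0, len(signal) - frame, hop)]
    mags = np.abs(np.fft.rfft(np.array(frames) * np.hanning(frame), axis=1))
    digits = leading_digit(mags.ravel())
    freq = np.bincount(digits, minlength=10)[1:10] / len(digits)
    # KL divergence between empirical digit frequencies and the Benford curve
    return np.sum(freq * np.log((freq + 1e-12) / BENFORD))

rng = np.random.default_rng(0)
speech_like = np.cumsum(rng.standard_normal(16000))   # stand-in for a real recording
print(benford_feature(speech_like))
```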
Contributors: Hsu, Leo (Author) / Berisha, Visar (Thesis advisor) / Spanias, Andreas (Committee member) / Papandreou-Suppappola, Antonia (Committee member) / Arizona State University (Publisher)
Created: 2022