2026-05-24T16:20:58Zhttps://keep.lib.asu.edu/oai/request

oai:keep.lib.asu.edu:node-1569622024-12-20T18:25:12Zoai_pmh:alloai_pmh:repo_items

156962 https://hdl.handle.net/2286/R.I.51737 http://rightsstatements.org/vocab/InC/1.0/ 2018 79 pages Masters Thesis Academic theses Text eng Animesh, Saurabh Chakrabarti, Chaitali Brunhaver, John Ren, Fengbo Arizona State University Masters Thesis Computer Engineering 2018 With the end of Dennard scaling and Moore's law, architects have moved towards heterogeneous designs consisting of specialized cores to achieve higher performance and energy efficiency for a target application domain. Applications of linear algebra are ubiquitous in the field of scientific computing, machine learning, statistics, etc. with matrix computations being fundamental to these linear algebra based solutions. Design of multiple dense (or sparse) matrix computation routines on the same platform is quite challenging. Added to the complexity is the fact that dense and sparse matrix computations have large differences in their storage and access patterns and are difficult to optimize on the same architecture. This thesis addresses this challenge and introduces a reconfigurable accelerator that supports both dense and sparse matrix computations efficiently. The reconfigurable architecture has been optimized to execute the following linear algebra routines: GEMV (Dense General Matrix Vector Multiplication), GEMM (Dense General Matrix Matrix Multiplication), TRSM (Triangular Matrix Solver), LU Decomposition, Matrix Inverse, SpMV (Sparse Matrix Vector Multiplication), SpMM (Sparse Matrix Matrix Multiplication). It is a multicore architecture where each core consists of a 2D array of processing elements (PE). The 2D array of PEs is of size 4x4 and is scheduled to perform 4x4 sized matrix updates efficiently. A sequence of such updates is used to solve a larger problem inside a core. A novel partitioned block compressed sparse data structure (PBCSC/PBCSR) is used to perform sparse kernel updates. Scalable partitioning and mapping schemes are presented that map input matrices of any given size to the multicore architecture. Design trade-offs related to the PE array dimension, size of local memory inside a core and the bandwidth between on-chip memories and the cores have been presented. An optimal core configuration is developed from this analysis. Synthesis results using a 7nm PDK show that the proposed accelerator can achieve a performance of upto 32 GOPS using a single core. Electrical Engineering Accelerator Algebras, Linear matrix Multicore Reconfigurable sparse Algorithm Architecture Co-design for Dense and Sparse Matrix Computations