Search Content

Matching Items (2)

Self-aware, Adaptive General-purpose Computing Systems

Description

With the breakdown of Dennard scaling, computer architects can no longer rely on integrated circuit energy efficiency to scale with transistor density, and must under-clock or power-gate parts of their designs in order to fit within given power budgets. Hardware accelerators may improve energy efficiency of some compute-intensive tasks, but as more tasks are accelerated, the general-purpose portions of workloads account for a larger share of execution time while also leaving less instruction, data, or task-level parallelism to exploit. Adaptive computing systems have potential to address these challenges by modifying their behavior at runtime. Adaptation requires runtime decision-making, which can be performed both in hardware and software. While software-based decision-making is more flexible and can execute higher complexity operations compared to hardware, it also incurs a significant latency and power overhead. Hardware designs are more limited in the space of decisions they can make, but have direct access to their own internal microarchitectural states and can make faster decisions, allowing for better-informed adaptation and extracting previously unobtainable performance and security benefits. In this dissertation I study (i) the viability and trade-offs of general-purpose adaptive systems, (ii) the difficulty and complexity of making adaptation decisions, and (iii) how time spent in the observation-analysis-adaptation cycle affects adaptation benefits. I introduce techniques for (a) modeling and understanding high performance computing systems and microarchitecture, (b) enabling hardware learning and decision-making through low-latency networks, and (c) on securing hardware designs using runtime decision-making. I propose an always-awake and active learning `hardware nervous system' pervasive throughout the chip that can reason about the individual hardware module performance, energy usage, and security. I present the design and implementation of (1) a reference architecture and (2) a microarchitecture-aware static binary instrumentation tool. Finally, I provide results showing (1) that runtime adaptation is a necessary to continue improving performance on general-purpose tasks, (2) that significant performance loss and performance variation happens under the ISA-level, and is unobservable without hardware support, and (3) that hardware must possess decision-making and ‘self-awareness’ capabilities at the microarchitecture level in order to efficiently use its own faculties.

ContributorsIsakov, Mihailo (Author) / Kinsy, Michel (Thesis advisor) / Shrivastava, Aviral (Committee member) / Rudd, Kevin (Committee member) / Gadepally, Vijay (Committee member) / Arizona State University (Publisher)

Created2022

A Parallel Adaptive Mesh Refinement Library for Cartesian Meshes

Description

This dissertation introduces FARCOM (Fortran Adaptive Refiner for Cartesian Orthogonal Meshes), a new general library for adaptive mesh refinement (AMR) based on an unstructured hexahedral mesh framework. As a result of the underlying unstructured formulation, the refinement and coarsening operators of the library operate on a single-cell basis and perform in-situ replacement of old mesh elements. This approach allows for h-refinement without the memory and computational expense of calculating masked coarse grid cells, as is done in traditional patch-based AMR approaches, and enables unstructured flow solvers to have access to the automated domain generation capabilities usually only found in tree AMR formulations.

The library is written to let the user determine where to refine and coarsen through custom refinement selector functions for static mesh generation and dynamic mesh refinement, and can handle smooth fields (such as level sets) or localized markers (e.g. density gradients). The library was parallelized with the use of the Zoltan graph-partitioning library, which provides interfaces to both a graph partitioner (PT-Scotch) and a partitioner based on Hilbert space-filling curves. The partitioned adjacency graph, mesh data, and solution variable data is then packed and distributed across all MPI ranks in the simulation, which then regenerate the mesh, generate domain decomposition ghost cells, and create communication caches.

Scalability runs were performed using a Leveque wave propagation scheme for solving the Euler equations. The results of simulations on up to 1536 cores indicate that the parallel performance is highly dependent on the graph partitioner being used, and differences between the partitioners were analyzed. FARCOM is found to have better performance if each MPI rank has more than 60,000 cells.

ContributorsBallesteros, Carlos Alberto (Author) / Herrmann, Marcus (Thesis advisor) / Adrian, Ronald (Committee member) / Chen, Kangping (Committee member) / Huang, Huei-Ping (Committee member) / Lopez, Juan (Committee member) / Arizona State University (Publisher)

Created2019