Description
Modern and future computer architectures will feature hyper-efficient and low-latency processing elements dedicated to a specific purpose.In a process known as pre-silicon design, computer architects design a customized piece of hardware, called a domain-specific architecture (DSA), to support the high

Modern and future computer architectures will feature hyper-efficient and low-latency processing elements dedicated to a specific purpose.In a process known as pre-silicon design, computer architects design a customized piece of hardware, called a domain-specific architecture (DSA), to support the high performance and efficiency of a target group of computer programs. Each program contains sections of code called "tasks" whose characteristics may be exploited for hyper-efficient and low-latency execution on specialized processors called processing elements (PE). Architects design PEs toward those tasks, enabling DSAs to deliver magnitudes of performance and energy-efficiency advantages over general-purpose, scalar processors. After the pre-silicon design phase of a DSA is the post-silicon design phase, where application designers map other and novel computer programs for high performance on that DSA. To realize the performance and energy benefits of that DSA, each computer program must realize the PEs of that DSA. Unfortunately, refactoring applications to exploit the PEs of a DSA is difficult. Application and architectural experts (whose expertise is in low supply) must manually structure each application toward their target DSA. Furthermore, open-source and legacy code called code-from-the-wild (CFTW), often lack the structure required to compile applications toward DSAs, requiring manual refactoring and creating a barrier between DSAs and the broader software engineering community. Finally, manual application structuring techniques do not scale to novel applications and architectures. This thesis argues that automating the process of structuring computer programs for transformation and optimization toward DSAs is essential to the widespread adoption and utilization of DSAs. Existing methods to extract the structure from CFTW both capture unimportant code and miss critical code for acceleration because they rely on code frequency.Further, they rely on static information to find communication patterns between them. To address the challenge of localizing the acceleration candidates and finding their communication among each other, this thesis presents Cyclebite. Cyclebite extracts coarse-grained tasks (CGTs) from CFTW by detecting CGT candidates in the dynamic execution profile of the application. Then, Cyclebite localizes the instances of each CGT candidate, which measures its utilization and observes its communication patterns with other task candidates. Each CGT candidate whose utilization is adequate is designated as a task, and its communication with other tasks forms producer-consumer relationships. Cyclebite exports a produce-consume task graph (or simply task graph) for the application - a directed acyclic graph where each node is a CGT and directed edges are communication between CGTs, pointing from producer to consumer. I show that Cyclebite finds both important code that state-of-the-art (SoA) structuring techniques miss and rejects unimportant code that SoA structuring techniques erroneously include in the task graph. I propose a CGT analysis and export tool, called Cyclebite-Template, that extracts the parallel execution pattern of each CGT in the exported Cyclebite task graph, and exports the task graph into a domain-specific language, which facilitates its transformation and optimization towards a target DSA. Cyclebite-Template works on a low-level representation of each CGT, making its analysis agnostic to the syntactic definition of each task, while SoA task templates use higher-level representations that are not robust to differing syntactic task definitions. Further, Cyclebite-Template supports the transformation and optimization of non-polyhedral tasks by extracting each task's canonical expression (which implies its parallel execution pattern), while SoA task templates only support polyhedral tasks. I show that Cyclebite-Template accurately extracts the parallel pattern from Cyclebite tasks, and Cyclebite task graphs exported with Cyclebite-Template achieve speedups inline or exceeding those of SoA task templates.
Reuse Permissions
  • 1.13 MB application/pdf

    Download restricted until 2026-08-01.

    Details

    Title
    • Cyclebite: Towards A Machine Understanding of Computation
    Contributors
    Date Created
    2024
    Resource Type
  • Text
  • Collections this item is in
    Note
    • Partial requirement for: Ph.D., Arizona State University, 2024
    • Field of study: Computer Engineering

    Machine-readable links