Matching Items (2)
Description
With increasing transistor counts and shrinking feature sizes, reducing power consumption has become a major design constraint. This has driven aggressive architectural changes for on-chip power management and the rapid development of energy-efficient hardware accelerators. Accordingly, the objective of this research is to enable software developers to leverage these hardware techniques and improve the energy efficiency of the system. To achieve this, I propose two solutions for the Linux kernel.

First, optimal use of these architectural enhancements requires accurate modeling of processor power consumption. Although many models in the literature capture processor power consumption, few capture it at the task level. Task-level energy models are a prerequisite for an operating system (OS) to perform real-time power management, since the OS time-multiplexes tasks to share hardware resources. I propose a detailed design methodology for constructing an architecture-agnostic task-level power model and incorporating it into a modern operating system to build an online task-level power profiler. The profiler is implemented inside the latest Linux kernel and validated on an Intel Sandy Bridge processor. It has a negligible overhead of less than 1% hardware resource consumption, and its power predictions were demonstrated on application benchmarks from SPEC and PARSEC with less than 4% error. I also demonstrate the importance of the proposed profiler for emerging architectural techniques through use-case scenarios, including heterogeneous computing and fine-grained per-core DVFS.

Second, alongside architectural enhancements in general-purpose processors, hardware accelerators such as the coarse-grained reconfigurable architecture (CGRA) are gaining popularity.
Unlike vector processors, which rely on data parallelism, a CGRA provides greater flexibility and compiler-level control, making it more suitable for the present SoC environment. To provide a streamlined development environment for CGRA, I propose a flexible framework in Linux for CGRA design space exploration. With accurate and flexible hardware models, fine-grained integration with an accurate architectural simulator, and Linux memory management and DMA support, a user can carry out virtually limitless experiments on a CGRA in a full-system environment.
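The task-level power accounting described above can be illustrated with a linear model over per-task performance-counter rates. This is a minimal sketch, not the thesis's actual model: the counter names, coefficients, and idle-power constant below are hypothetical placeholders standing in for values that would be fitted to real hardware.

```python
# Sketch of a task-level power model: power is estimated as a linear
# combination of per-task performance-counter rates plus a static term.
# Counter names and coefficients are illustrative, not from the thesis.

# Hypothetical fitted coefficients (watts per event/second).
COEFFS = {"instructions": 2.0e-9, "llc_misses": 1.5e-8, "branches": 5.0e-10}
IDLE_POWER_W = 4.0  # hypothetical static/idle component in watts

def task_power(counter_deltas: dict, interval_s: float) -> float:
    """Estimate a task's average power over one scheduling interval.

    counter_deltas: event counts accumulated while the task ran.
    interval_s: CPU time the task received in the interval, in seconds.
    """
    power = IDLE_POWER_W
    for event, delta in counter_deltas.items():
        rate = delta / interval_s            # events per second
        power += COEFFS.get(event, 0.0) * rate
    return power

def task_energy(counter_deltas: dict, interval_s: float) -> float:
    """Energy (joules) attributed to the task for the interval."""
    return task_power(counter_deltas, interval_s) * interval_s
```

In an OS setting, the counter deltas would be snapshotted at each context switch, so energy is attributed to exactly the task that was running; for example, `task_power({"instructions": 1e9, "llc_misses": 1e6, "branches": 2e8}, 1.0)` yields 6.115 W under these illustrative coefficients.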
Contributors: Desai, Digant Pareshkumar (Author) / Vrudhula, Sarma (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Wu, Carole-Jean (Committee member) / Arizona State University (Publisher)
Created: 2013
Description
Graphs are one of the key data structures for many real-world computing applications such as machine learning, social networks, and genomics. The main challenges of graph processing include the difficulty of parallelizing the workload, which results in workload imbalance, poor memory locality, and a very large number of memory accesses. This makes large-scale graph processing very expensive.
This thesis presents the implementation of a select set of graph kernels on a multi-core architecture, Transmuter. The kernels are Breadth-First Search (BFS), PageRank (PR), and Single-Source Shortest Path (SSSP). Transmuter is a multi-tiled architecture with 4 tiles and 16 general processing elements (GPEs) per tile, supporting a two-level cache hierarchy. All graph processing kernels were implemented on Transmuter using the gem5 architectural simulator.
The key pre-processing steps for improving performance are static partitioning by destination and balancing the workload among the processing cores. Results obtained by processing partitioned graphs versus un-partitioned graphs show almost a 3x improvement in performance. The choice of data structure also plays an important role in the amount of storage consumed and the amount of synchronization required in a parallel implementation; here, the compressed sparse column (CSC) format was used. BFS and SSSP are frontier-based algorithms, where a frontier represents the subset of vertices that are active during the current iteration; they were implemented using a Boolean frontier array data structure. PR is an iterative algorithm in which all vertices are active at all times.
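The combination of CSC storage and a Boolean frontier array can be sketched as a pull-style BFS: because CSC indexes edges by destination, each unvisited vertex checks whether any of its in-neighbors is in the current frontier. This is a minimal single-threaded illustration of the data structures named above, not the Transmuter implementation itself.

```python
# Pull-style BFS sketch over a graph in compressed sparse column (CSC)
# form: for each vertex v, col_ptr[v]:col_ptr[v+1] indexes into row_idx,
# which lists the sources of edges *into* v. A Boolean frontier array
# marks the vertices active in the current iteration.

def bfs_csc(col_ptr, row_idx, n, source):
    """Return BFS levels from `source`; -1 marks unreachable vertices."""
    level = [-1] * n
    frontier = [False] * n          # Boolean frontier array
    frontier[source] = True
    level[source] = 0
    depth = 0
    while any(frontier):
        next_frontier = [False] * n
        for v in range(n):
            if level[v] != -1:
                continue            # already visited
            # v joins the next frontier if any in-neighbor is active.
            for e in range(col_ptr[v], col_ptr[v + 1]):
                if frontier[row_idx[e]]:
                    next_frontier[v] = True
                    level[v] = depth + 1
                    break
        frontier = next_frontier
        depth += 1
    return level
```

For a small directed graph with edges 0→1, 0→2, 1→3, 2→3 (so `col_ptr = [0, 0, 1, 2, 4]`, `row_idx = [0, 0, 1, 2]`), `bfs_csc(col_ptr, row_idx, 4, 0)` returns `[0, 1, 1, 2]`. Partitioning by destination maps naturally onto this layout, since each core can own a contiguous range of columns.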
The performance of the different Transmuter implementations at the 14nm node was evaluated on metrics such as power consumption (W), giga-operations per second (GOPS), GOPS/W, and L1/L2 cache misses. The GOPS/W numbers for graphs with 10k nodes and 10k edges are 33 for BFS, 477 for PR, and 10 for SSSP. Frontier-based algorithms have much lower GOPS/W than iterative algorithms such as PR, because all nodes in PageRank are active at all points in time. For all three kernel implementations, the L1 cache miss rates are quite low while the L2 cache hit rates are high.
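The contrast drawn above, every vertex active in every PR iteration versus a sparse frontier in BFS/SSSP, can be sketched with a pull-style PageRank over the same CSC layout. The damping factor and iteration count are illustrative defaults, not values from the thesis.

```python
# Pull-style PageRank sketch over CSC: each vertex v gathers rank from
# its in-neighbors on every iteration -- all vertices are active at all
# times, unlike the frontier-based BFS/SSSP kernels. Assumes no vertex
# has out-degree zero. Damping factor d and iteration count are
# illustrative choices.

def pagerank_csc(col_ptr, row_idx, out_deg, n, d=0.85, iters=20):
    rank = [1.0 / n] * n
    for _ in range(iters):
        new_rank = [0.0] * n
        for v in range(n):                    # every vertex, every iteration
            acc = 0.0
            for e in range(col_ptr[v], col_ptr[v + 1]):
                u = row_idx[e]                # in-neighbor of v
                acc += rank[u] / out_deg[u]
            new_rank[v] = (1.0 - d) / n + d * acc
        rank = new_rank
    return rank
```

Because the inner gather touches every column on every iteration, the work per pass is fixed and dense, which is consistent with the higher sustained GOPS/W reported for PR relative to the frontier-based kernels.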
Contributors: Renganathan, Srinidhi (Author) / Chakrabarti, Chaitali (Thesis advisor) / Shrivastava, Aviral (Committee member) / Mudge, Trevor (Committee member) / Arizona State University (Publisher)
Created: 2019