Matching Items (2)

157687-Thumbnail Image.png

Implementation of Graph Kernels on Multi core Architecture

Description

Graphs are one of the key data structures for many real-world computing applica-

tions such as machine learning, social networks, genomics etc. The main challenges of

graph processing include diculty in parallelizing

Graphs are one of the key data structures for many real-world computing applica-

tions such as machine learning, social networks, genomics etc. The main challenges of

graph processing include diculty in parallelizing the workload that results in work-

load imbalance, poor memory locality and very large number of memory accesses.

This causes large-scale graph processing to be very expensive.

This thesis presents implementation of a select set of graph kernels on a multi-core

architecture, Transmuter. The kernels are Breadth-First Search (BFS), Page Rank

(PR), and Single Source Shortest Path (SSSP). Transmuter is a multi-tiled architec-

ture with 4 tiles and 16 general processing elements (GPE) per tile that supports a

two level cache hierarchy. All graph processing kernels have been implemented on

Transmuter using Gem5 architectural simulator.

The key pre-processing steps in improving the performance are static partition-

ing by destination and balancing the workload among the processing cores. Results

obtained by processing graphs that are partitioned against un-partitioned graphs

show almost 3x improvement in performance. Choice of data structure also plays an

important role in the amount of storage space consumed and the amount of synchro-

nization required in a parallel implementation. Here the compressed sparse column

data format was used. BFS and SSSP are frontier-based algorithms where a frontier

represents a subset of vertices that are active during the current iteration. They

were implemented using the Boolean frontier array data structure. PR is an iterative

algorithm where all vertices are active at all times.

The performance of the dierent Transmuter implementations for the 14nm node

were evaluated based on metrics such as power consumption (Watt), Giga Operations

Per Second(GOPS), GOPS/Watt and L1/L2 cache misses. GOPS/W numbers for

graphs with 10k nodes and 10k edges is 33 for BFS, 477 for PR and 10 for SSSP.

i

Frontier-based algorithms have much lower GOPS/W compared to iterative algo-

rithms such as PR. This is because all nodes in Page Rank are active at all points

in time. For all three kernel implementations, the L1 cache miss rates are quite low

while the L2 cache hit rates are high.

Contributors

Agent

Created

Date Created
  • 2019

155283-Thumbnail Image.png

Designing Low Cost Error Correction Schemes for Improving Memory Reliability

Description

Memory systems are becoming increasingly error-prone, and thus guaranteeing their reliability is a major challenge. In this dissertation, new techniques to improve the reliability of both 2D and 3D dynamic

Memory systems are becoming increasingly error-prone, and thus guaranteeing their reliability is a major challenge. In this dissertation, new techniques to improve the reliability of both 2D and 3D dynamic random access memory (DRAM) systems are presented. The proposed schemes have higher reliability than current systems but with lower power, better performance and lower hardware cost.

First, a low overhead solution that improves the reliability of commodity DRAM systems with no change in the existing memory architecture is presented. Specifically, five erasure and error correction (E-ECC) schemes are proposed that provide at least Chipkill-Correct protection for x4 (Schemes 1, 2 and 3), x8 (Scheme 4) and x16 (Scheme 5) DRAM systems. All schemes have superior error correction performance due to the use of strong symbol-based codes. In addition, the use of erasure codes extends the lifetime of the 2D DRAM systems.

Next, two error correction schemes are presented for 3D DRAM memory systems. The first scheme is a rate-adaptive, two-tiered error correction scheme (RATT-ECC) that provides strong reliability (10^10x) reduction in raw FIT rate) for an HBM-like 3D DRAM system that services CPU applications. The rate-adaptive feature of RATT-ECC enables permanent bank failures to be handled through sparing. It can also be used to significantly reduce the refresh power consumption without decreasing the reliability and timing performance.

The second scheme is a two-tiered error correction scheme (Config-ECC) that supports different sized accesses in GPU applications with strong reliability. It addresses the mismatch between data access size and fixed sized ECC scheme by designing a product code based flexible scheme. Config-ECC is built around a core unit designed for 32B access with a simple extension to support 64B and 128B accesses. Compared to fixed 32B and 64B ECC schemes, Config-ECC reduces the failure in time (FIT) rate by 200x and 20x, respectively. It also reduces the memory energy by 17% (in the dynamic mode) and 21% (in the static mode) compared to a state-of-the-art fixed 64B ECC scheme.

Contributors

Agent

Created

Date Created
  • 2017