Matching Items (2)
Description
Data centers connect a large number of servers, which require IO and switches with low power and low delay. Virtualization of IO and network is crucial for these servers, which run virtual processes for computing, storage, and apps. We propose using the PCI Express (PCIe) protocol and a new PCIe switch fabric for IO and switch virtualization. The switch fabric has little data buffering, which allows it to connect up to 512 physical 10 Gb/s PCIe 2.0 lanes. The switch is scalable, with adapters running multiple adaptation protocols such as Ethernet over PCIe, PCIe over Internet, or Fibre Channel over Ethernet. Such adaptation protocols allow integration of the IO often required for disjoint data center applications such as storage and networking. The novel switch fabric, based on space-time carrier sensing, facilitates high-bandwidth, low-power, and low-delay multi-protocol switching. To achieve terabit switching, both time (high transmission speed) and space (multistage interconnection network) technologies are required. In this paper, we present the design of a Clos network of multistage crossbar switches with up to 256 lanes for PCIe systems. The switch core consists of 48 16x16 crossbar sub-switches. We also propose a new output contention resolution algorithm that uses an out-of-band Request-To-Send (RTS) / Clear-To-Send (CTS) protocol before PCIe packets are sent through the switch fabric. Preliminary power and delay estimates are provided.
Contributors: Luo, Haojun (Author) / Hui, Joseph (Thesis advisor) / Song, Hongjiang (Committee member) / Reisslein, Martin (Committee member) / Zhang, Yanchao (Committee member) / Arizona State University (Publisher)
Created: 2013
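
The figures in the description above (48 crossbars of size 16x16, up to 256 lanes) are consistent with a symmetric three-stage Clos network that has 16 crossbars per stage. The sketch below checks that arithmetic and models one RTS/CTS arbitration round. It is an illustration only: the three-stage layout, the function names `clos_dimensions` and `arbitrate`, and the lowest-numbered-input tie-break are assumptions, not details taken from the thesis.

```python
def clos_dimensions(radix: int, stages: int = 3) -> tuple[int, int]:
    """Return (external_ports, total_crossbars) for a symmetric Clos network.

    Assumption: every stage holds `radix` crossbars, each radix x radix.
    """
    ports = radix * radix        # radix ingress crossbars, radix ports each
    crossbars = stages * radix   # same crossbar count in every stage
    return ports, crossbars


def arbitrate(requests: dict[int, int]) -> dict[int, int]:
    """Model one out-of-band RTS/CTS round.

    `requests` maps input port -> requested output port (the RTS messages).
    The result maps output port -> granted input port (the CTS messages).
    Tie-break (hypothetical): the lowest-numbered input wins an output.
    """
    grants: dict[int, int] = {}
    for inp in sorted(requests):
        out = requests[inp]
        if out not in grants:    # output still free in this round
            grants[out] = inp
    return grants


if __name__ == "__main__":
    print(clos_dimensions(16))             # (256, 48)
    print(arbitrate({0: 5, 1: 5, 2: 7}))   # {5: 0, 7: 2}
```

In this model, inputs 0 and 1 both request output 5; only input 0 receives a CTS, so input 1 would hold its packet and retry in a later round rather than collide inside the fabric.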
Description
With the advent of GPGPU, many applications are being accelerated using the CUDA programming paradigm. Speedups of around 10x to 100x can be achieved by simply porting the application to the GPU and running the parallel portion of the code on its multi-core SIMT (single instruction, multiple thread) architecture. For optimal performance, however, it is necessary to ensure that all GPU resources are used efficiently and that latencies in the application are minimized. To do so, it is essential to monitor the hardware usage of the algorithm and thereby diagnose the compute and memory bottlenecks in the implementation. In this thesis, we analyze the mapping of a CUDA implementation of the BLIINDS-II algorithm onto the underlying GPU hardware and propose a Kepler-architecture-specific solution that uses the shuffle instruction via the CUB library to tackle the two major bottlenecks in the algorithm. Experiments were conducted to demonstrate the advantage of using the shuffle instruction over using only shared memory as a buffer to global memory. With the new implementation of the BLIINDS-II algorithm using the CUB library, a speedup of around 13.7% was achieved.
Contributors: Wadekar, Ameya (Author) / Sohoni, Sohum (Thesis advisor) / Aukes, Daniel (Committee member) / Redkar, Sangram (Committee member) / Arizona State University (Publisher)
Created: 2017
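
The shuffle instruction mentioned in the description above lets threads in a warp read each other's registers directly, without staging data in shared memory. The sketch below is a plain Python model of the data movement in a shuffle-down warp reduction. It is not CUDA or CUB code and not the thesis implementation; `shfl_down` and `warp_reduce_sum` are hypothetical names, and a 32-lane warp is assumed.

```python
WARP_SIZE = 32  # assumed warp width


def shfl_down(values: list[float], offset: int) -> list[float]:
    """Model a shuffle-down: lane i reads the register of lane i + offset.

    Lanes whose source would fall outside the warp keep their own value.
    """
    n = len(values)
    return [values[i + offset] if i + offset < n else values[i]
            for i in range(n)]


def warp_reduce_sum(values: list[float]) -> float:
    """Sum one warp's values in log2(WARP_SIZE) shuffle steps.

    After the final step, lane 0 holds the sum of all lanes.
    """
    assert len(values) == WARP_SIZE
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        vals = [a + b for a, b in zip(vals, shifted)]  # each lane adds its partner
        offset //= 2
    return vals[0]


if __name__ == "__main__":
    print(warp_reduce_sum(list(range(32))))  # 496
```

The reduction finishes in five exchange steps for 32 lanes, and no step writes to or reads from a shared buffer, which is the property the description contrasts with a shared-memory-only approach.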