Matching Items (3)
- Creators: Aukes, Daniel
- Creators: Frakes, David
- Creators: Hui, Joseph
Data centers connect a larger number of servers requiring IO and switches with low power and delay. Virtualization of IO and network is crucial for these servers, which run virtual processes for computing, storage, and apps. We propose using the PCI Express (PCIe) protocol and a new PCIe switch fabric for IO and switch virtualization. The switch fabric has little data buffering, allowing up to 512 physical 10 Gb/s PCIe2.0 lanes to be connected via a switch fabric. The switch is scalable with adapters running multiple adaptation protocols, such as Ethernet over PCIe, PCIe over Internet, or FibreChannel over Ethernet. Such adaptation protocols allow integration of IO often required for disjoint datacenter applications such as storage and networking. The novel switch fabric based on space-time carrier sensing facilitates high bandwidth, low power, and low delay multi-protocol switching. To achieve Terabit switching, both time (high transmission speed) and space (multi-stage interconnection network) technologies are required. In this paper, we present the design of an up to 256 lanes Clos-network of multistage crossbar switch fabric for PCIe system. The switch core consists of 48 16x16 crossbar sub-switches. We also propose a new output contention resolution algorithm utilizing an out-of-band protocol of Request-To-Send (RTS), Clear-To-Send (CTS) before sending PCIe packets through the switch fabric. Preliminary power and delay estimates are provided.
Magnetic Resonance Imaging using spiral trajectories has many advantages in speed, efficiency in data-acquistion and robustness to motion and flow related artifacts. The increase in sampling speed, however, requires high performance of the gradient system. Hardware inaccuracies from system delays and eddy currents can cause spatial and temporal distortions in the encoding gradient waveforms. This causes sampling discrepancies between the actual and the ideal k-space trajectory. Reconstruction assuming an ideal trajectory can result in shading and blurring artifacts in spiral images. Current methods to estimate such hardware errors require many modifications to the pulse sequence, phantom measurements or specialized hardware. This work presents a new method to estimate time-varying system delays for spiral-based trajectories. It requires a minor modification of a conventional stack-of-spirals sequence and analyzes data collected on three orthogonal cylinders. The method is fast, robust to off-resonance effects, requires no phantom measurements or specialized hardware and estimate variable system delays for the three gradient channels over the data-sampling period. The initial results are presented for acquired phantom and in-vivo data, which show a substantial reduction in the artifacts and improvement in the image quality.
Analysis of hardware usage of shuffle instruction based performance optimization in the Blinds-II image quality assessment algorithm
With the advent of GPGPU, many applications are being accelerated by using CUDA programing paradigm. We are able to achieve around 10x -100x speedups by simply porting the application on to the GPU and running the parallel chunk of code on its multi cored SIMT (Single instruction multiple thread) architecture. But for optimal performance it is necessary to make sure that all the GPU resources are efficiently used, and the latencies in the application are minimized. For this, it is essential to monitor the Hardware usage of the algorithm and thus diagnose the compute and memory bottlenecks in the implementation. In the following thesis, we will be analyzing the mapping of CUDA implementation of BLIINDS-II algorithm on the underlying GPU hardware, and come up with a Kepler architecture specific solution of using shuffle instruction via CUB library to tackle the two major bottlenecks in the algorithm. Experiments were conducted to convey the advantage of using shuffle instru3ction in algorithm over only using shared memory as a buffer to global memory. With the new implementation of BLIINDS-II algorithm using CUB library, a speedup of around 13.7% was achieved.