Matching Items (3)
Description

Fundamental limits of fixed-to-variable (F-V) and variable-to-fixed (V-F) length universal source coding at short blocklengths are characterized. For F-V length coding, the Type Size (TS) code has previously been shown to be optimal up to the third-order rate for universal compression of all memoryless sources over finite alphabets. The TS code assigns sequences, ordered by their type class sizes, to binary strings ordered lexicographically.
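
The ordering rule stated above lends itself to a small illustration. The following is a minimal Python sketch, not the thesis's construction: the binary alphabet, the blocklength, and the use of a one-to-one code starting from the empty codeword are illustrative assumptions.

```python
from collections import Counter
from itertools import product
from math import factorial

def type_class_size(seq):
    """Size of the type class of seq: the number of sequences sharing
    its empirical distribution (a multinomial coefficient)."""
    counts = Counter(seq)
    size = factorial(len(seq))
    for c in counts.values():
        size //= factorial(c)
    return size

def binary_strings():
    """Yield binary strings ordered by length, then lexicographically:
    '', '0', '1', '00', '01', ..."""
    yield ""
    length = 1
    while True:
        for bits in product("01", repeat=length):
            yield "".join(bits)
        length += 1

def type_size_code(alphabet, n):
    """Toy Type Size code: sequences sorted by ascending type class
    size are paired with binary strings in lexicographic order."""
    seqs = sorted(product(alphabet, repeat=n), key=type_class_size)
    return {seq: cw for seq, cw in zip(seqs, binary_strings())}

book = type_size_code("ab", 3)
# 'aaa' and 'bbb' sit in type classes of size 1, so they receive the
# shortest codewords; the size-3 classes come afterwards.
print(repr(book["a", "a", "a"]), repr(book["a", "a", "b"]))
```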

The universal F-V coding problem for the class of first-order stationary, irreducible, and aperiodic Markov sources is considered first. The third-order coding rate of the TS code for this Markov class is derived. A converse on the third-order coding rate for the general class of F-V codes is presented, which establishes the optimality of the TS code for such Markov sources.

This type class approach is then generalized to compression of parametric sources. A natural scheme is to define two sequences to be in the same type class if and only if they are equiprobable under every model in the parametric class. This natural approach, however, is shown to be suboptimal. A variation of the Type Size code is introduced, in which type classes are defined based on neighborhoods of minimal sufficient statistics. Asymptotics of the overflow rate of this variation are derived, and a converse result establishes its optimality up to the third-order term. These results are derived for parametric families of i.i.d. sources as well as Markov sources.
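
To make the sufficient-statistic partition concrete, here is a cartoon Python sketch for the i.i.d. Bernoulli family, where the minimal sufficient statistic is simply the number of ones; the quantization width used to form "neighborhoods" is an illustrative assumption, not the thesis's construction.

```python
from itertools import product

def sufficient_statistic(seq):
    """For an i.i.d. Bernoulli(p) family the minimal sufficient statistic
    is the number of ones: p**k * (1-p)**(n-k) depends on seq only via k,
    so sequences with equal k are equiprobable under every parameter."""
    return sum(seq)

def type_classes(n, width=1):
    """Partition {0,1}^n by quantized sufficient statistic. width=1
    recovers the exact equiprobability classes; width>1 merges adjacent
    statistics into one 'neighborhood' class, in the spirit of the
    modified Type Size code described above."""
    classes = {}
    for seq in product((0, 1), repeat=n):
        key = sufficient_statistic(seq) // width
        classes.setdefault(key, []).append(seq)
    return classes

for key, members in sorted(type_classes(4, width=2).items()):
    print(key, len(members))
```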

Finally, universal V-F length coding of the class of parametric sources is considered in the short-blocklength regime. The proposed dictionary, which is used to parse the source output stream, consists of sequences at the boundary of the transition from low to high quantized type complexity, hence the name Type Complexity (TC) code. For a large enough dictionary, the $\epsilon$-coding rate of the TC code is derived, and a converse result shows its optimality up to the third-order term.
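
The boundary-crossing idea can be sketched as follows. In this toy Python version, type complexity is proxied by the log of the type class size, and the threshold and maximum phrase length are assumptions; the thesis's quantized type complexity and dictionary construction are its own.

```python
from collections import Counter
from math import factorial, log2

def type_complexity(seq):
    """Proxy for type complexity: log2 of the type class size,
    i.e. how many permutations share seq's empirical type."""
    counts = Counter(seq)
    size = factorial(len(seq))
    for c in counts.values():
        size //= factorial(c)
    return log2(size)

def tc_dictionary(alphabet, threshold, max_len=12):
    """Grow strings one symbol at a time; a string enters the dictionary
    the first time its complexity crosses the threshold (or it hits
    max_len). The result is a complete prefix-free parsing dictionary."""
    dictionary, frontier = [], [()]
    while frontier:
        nxt = []
        for prefix in frontier:
            for a in alphabet:
                s = prefix + (a,)
                if type_complexity(s) >= threshold or len(s) >= max_len:
                    dictionary.append(s)
                else:
                    nxt.append(s)
        frontier = nxt
    return dictionary

def parse(stream, dictionary):
    """Greedy V-F parse: accumulate symbols until the buffer matches a
    dictionary entry, emit it, and start over (any tail is dropped)."""
    entries = set(dictionary)
    out, buf = [], ()
    for sym in stream:
        buf += (sym,)
        if buf in entries:
            out.append(buf)
            buf = ()
    return out

d = tc_dictionary((0, 1), threshold=3.0)
print(len(d), parse((0, 1, 1, 0, 0, 0, 1, 1, 1, 0), d))
```

Each parsed phrase would then be mapped to a fixed-length index into the dictionary, which is what makes the scheme variable-to-fixed.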
Contributors: Iri, Nematollah (Author) / Kosut, Oliver (Thesis advisor) / Bliss, Daniel (Committee member) / Sankar, Lalitha (Committee member) / Zhang, Junshan (Committee member) / Arizona State University (Publisher)
Created: 2018
Description

With the rising data output and falling costs of Next Generation Sequencing technologies, research into data compression is crucial to maintaining storage efficiency and costs. High-throughput sequencers such as the HiSeq X Ten can produce up to 1.8 terabases of data per run, and such large storage demands are even more important to consider for institutions that rely on their own servers rather than large data centers (cloud storage) [1]. Compression algorithms aim to reduce the space taken up by large genomic datasets by encoding the most frequently occurring symbols with the shortest bit codewords and by reordering the data to make it easier to encode. Depending on the probability distribution of the symbols in the dataset or the structure of the data, choosing the wrong algorithm could result in a compressed file larger than the original, or in a poorly compressed file that wastes time and space [2]. To test efficiency among compression algorithms for each file type, 37 open-source compression algorithms were used to compress six types of genomic datasets (FASTA, VCF, BCF, GFF, GTF, and SAM) and were evaluated on compression speed, decompression speed, compression ratio, and file size using the benchmarking tool lzbench. Compressors that outperformed the popular bioinformatics compressor Gzip (zlib -6) were evaluated against one another by ratio and speed for each file type and across the geometric means of all file types. Compressors that exhibited fast compression and decompression speeds were also evaluated by transmission time through variable-speed internet pipes in scenarios where the file was compressed only once or compressed multiple times.
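
A minimal Python stand-in for this methodology, assuming only stdlib codecs rather than the 37 compressors the study ran through lzbench, and a hypothetical input filename:

```python
import bz2, lzma, time, zlib

# Three stdlib codecs stand in for the study's 37 open-source
# compressors; the metrics mirror lzbench's: ratio plus
# compression and decompression time.
CODECS = {
    "zlib-6": (lambda d: zlib.compress(d, 6), zlib.decompress),
    "bz2-9":  (lambda d: bz2.compress(d, 9), bz2.decompress),
    "lzma":   (lzma.compress, lzma.decompress),
}

def benchmark(path):
    data = open(path, "rb").read()
    for name, (comp, decomp) in CODECS.items():
        t0 = time.perf_counter(); blob = comp(data)
        t1 = time.perf_counter(); back = decomp(blob)
        t2 = time.perf_counter()
        assert back == data, "lossless round-trip failed"
        print(f"{name:8s} ratio={len(data) / len(blob):6.2f} "
              f"comp={t1 - t0:6.3f}s decomp={t2 - t1:6.3f}s")

benchmark("example.fasta")  # hypothetical input file
```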
Contributors: Howell, Abigail (Author) / Cartwright, Reed (Thesis director) / Wilson Sayres, Melissa (Committee member) / Taylor, Jay (Committee member) / Barrett, The Honors College (Contributor)
Created: 2017-05
Description

Traditionally, databases have been categorized as either row-oriented or column-oriented. Row-oriented databases store each row of a table contiguously on disk, whereas column-oriented databases store each column contiguously on disk. In recent years, columnar database management systems have become increasingly popular because deep, narrow queries run faster on them. Column-oriented databases are therefore highly optimized for analytical (OLAP) workloads (Mike Freedman 2019), which is why they are frequently used in business intelligence (BI), data warehouses, and similar settings that involve large data sets, intensive queries, and aggregate computation. As data sizes keep growing, efficient compression becomes an important consideration for these databases, both to optimize storage and to improve query performance. Since column-oriented databases store data of the same type contiguously, most modern compression techniques achieve better compression ratios on them than on row-oriented databases. This thesis introduces SA128, a multi-stage compression technique for column-oriented databases that performs a column-wise compression followed by a table-wide compression of database tables. In the first stage, SA128 analyzes the characteristics of the data (such as its type and distribution) and determines which combination of lossless compression algorithms would yield the best compression ratio. In the second stage, SA128 uses an entropy-coding technique such as rANS (Duda, J., 2013) to further improve the compression ratio.
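
A rough Python sketch of the two-stage shape described above, with two loud assumptions: the real SA128 selects codecs analytically from the column's type and distribution rather than by brute-force trial as done here, and zlib stands in for the rANS stage because the stdlib has no rANS coder.

```python
import bz2, lzma, pickle, zlib

# Candidate lossless codecs for the per-column stage.
CANDIDATES = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}

def compress_column(values):
    """Stage 1 (sketch): serialize one column and keep whichever
    candidate codec yields the smallest output. SA128 itself chooses
    the codec combination from the column's characteristics instead
    of trying each one."""
    raw = pickle.dumps(values)
    name, blob = min(((n, c(raw)) for n, c in CANDIDATES.items()),
                     key=lambda pair: len(pair[1]))
    return name, blob

def compress_table(columns):
    """Stage 2 (sketch): bundle the per-column outputs and apply one
    table-wide entropy pass. SA128 uses rANS here; zlib is a stand-in."""
    stage1 = [compress_column(col) for col in columns]
    return zlib.compress(pickle.dumps(stage1))

table = [[1, 2, 2, 3, 3, 3], ["x", "y", "x", "y", "x", "y"]]
print(len(compress_table(table)), "bytes")
```

The key design point the sketch preserves is that each column is compressed independently with the codec that suits it, so homogeneous, low-entropy columns compress far better than they would interleaved row-wise.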
Contributors: Anand, Sukhpreet Singh (Author) / Bansal, Ajay (Thesis advisor) / Heinrichs, Robert R (Committee member) / Gonzalez-Sanchez, Javier (Committee member) / Arizona State University (Publisher)
Created: 2021