Profiling of Indel Phases in Coding Regions

Zhu, Ziqi

Advances in sequencing technology have generated an enormous amount of data over the past decade. Equally advanced computational methods are needed to conduct comparative and functional genomic studies on these datasets, in particular tools that appropriately interpret indels within an evolutionary framework. The evolutionary history of indels is complex and often involves repetitive genomic regions, which makes identification, alignment, and annotation difficult. While previous studies have found that indel lengths in both DNA and proteins obey a power law, probabilistic models of indel evolution have rarely been explored because of their computational complexity. In my research, I first explore an application of an expectation-maximization algorithm for maximum-likelihood training of a codon substitution model. I demonstrate the training accuracy of expectation-maximization on my substitution model, then apply the algorithm to a published dataset of 90 species pairs and find a negative correlation between branch length and the non-synonymous selection coefficient. Second, because current codon-aware models can only place gaps between codons, which leads to misaligned sequences, I develop a post-alignment fixation method that profiles each indel event into one of three phases according to its codon position. By comparing the proportions of the indel phases, I find that the mouse-rat species pair is under purifying selection. I also demonstrate the power of my sliding-window method by comparing the post-alignment and original gap positions. Third, I create an indel-phase Moore machine that includes the indel rates of the three phases, length distributions, and codon substitution models. I then design a Gillespie simulation that is capable of generating true sequence alignments. Next, I develop an importance sampling method within the expectation-maximization algorithm that can successfully train the indel-phase model and infer accurate parameter estimates from alignments. Finally, I extend the indel phase analysis to the dataset of 90 species pairs across three alignment methods: the Mafft+sw method developed in chapter 3, the coati-sampling method applied in chapter 4, and the coati-max method. I also explore a non-linear relationship between dN/dS and the Zn/(Zn+Zs) ratio across the 90 species pairs.
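
To make the notion of an indel phase concrete, the following is a minimal sketch and not the method developed in the thesis. It assumes the common convention that a phase-0 indel starts between two codons, a phase-1 indel starts after the first codon position, and a phase-2 indel starts after the second, with the reference sequence beginning in frame. The function name `indel_phases` and the tuple layout are hypothetical.

```python
from typing import List, Tuple


def indel_phases(ref: str, qry: str, gap: str = "-") -> List[Tuple[str, int, int, int]]:
    """Return (kind, alignment_column, length, phase) for each gap run.

    Assumption: phase = (reference nucleotides before the gap) mod 3,
    with the reference coding sequence starting at codon position 1.
    """
    if len(ref) != len(qry):
        raise ValueError("aligned sequences must have equal length")

    events = []
    ref_pos = 0  # reference nucleotides consumed so far
    col = 0
    n = len(ref)
    while col < n:
        r, q = ref[col], qry[col]
        if r == gap and q == gap:
            col += 1  # all-gap column carries no information
            continue
        if r == gap or q == gap:
            # Gap in the reference = insertion; gap in the query = deletion.
            kind = "insertion" if r == gap else "deletion"
            gapped, other = (ref, qry) if r == gap else (qry, ref)
            start = col
            phase = ref_pos % 3
            while col < n and gapped[col] == gap and other[col] != gap:
                if kind == "deletion":
                    ref_pos += 1  # deleted bases still exist in the reference
                col += 1
            events.append((kind, start, col - start, phase))
        else:
            ref_pos += 1
            col += 1
    return events


if __name__ == "__main__":
    # Toy alignment (made up for illustration): a deletion after codon position 1.
    print(indel_phases("ATGAAACCCGGG", "ATGA---CCGGG"))
```

Under this convention, a gap that a codon-aware aligner forces onto a codon boundary is always reported as phase 0, which is why a post-alignment correction step matters before phase proportions are compared.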
