{"id":720,"date":"2023-03-15T14:30:04","date_gmt":"2023-03-15T07:30:04","guid":{"rendered":"https:\/\/conf.icgbio.ru\/bgrs98\/?page_id=720"},"modified":"2023-04-11T14:35:25","modified_gmt":"2023-04-11T07:35:25","slug":"056_identifying-dna-and-protein-patterns-with-statistically-significant-alignment-matrices","status":"publish","type":"page","link":"https:\/\/conf.icgbio.ru\/bgrs98\/abstracts\/abstract-list\/056_identifying-dna-and-protein-patterns-with-statistically-significant-alignment-matrices\/","title":{"rendered":"IDENTIFYING DNA AND PROTEIN PATTERNS WITH STATISTICALLY SIGNIFICANT ALIGNMENT MATRICES"},"content":{"rendered":"<p><a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/abstracts\/authors-index\/#hertz\">HERTZ GERALD Z.<\/a><sup>+<\/sup>,\u00a0<a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/abstracts\/authors-index\/#stormo\">STORMO GARY D.<\/a><\/p>\n<p>Department of Molecular, Cellular, and Developmental Biology; University of Colorado; Campus Box 347; Boulder, CO 80309; USA; hertz or e-mail:\u00a0<a href=\"mailto:stormo@colorado.edu\" target=\"_blank\" rel=\"noopener\">stormo@colorado.edu<\/a><\/p>\n<p><sup>+<\/sup>Corresponding author<\/p>\n<p><a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/abstracts\/keywords-index\/\">Keywords<\/a>: patterns, statistically significant alignment matrix, information content<\/p>\n<p>&nbsp;<\/p>\n<p>We have been developing computational methods for identifying consensus patterns in sets of DNA or protein sequences that are functionally related according to previous experiments. For example, functionally related DNA sequences can be coordinately regulated promoters or DNA sequences that are bound by a common protein. Our methods align the functionally related sequences to identify the consensus pattern. Once a pattern is identified, it can be used to predict related domains in other sequences.<\/p>\n<p>We describe consensus patterns with alignment matrices that summarize a sequence alignment. 
We use various matrix models that range from simple ungapped patterns to complex patterns containing gaps (i.e., insertions and deletions) and adjacent correlations. Alignment matrices are initially compared using a log-likelihood score we call information content. However, the ultimate statistical significance of an alignment is the product of the p-value of its information content and the number of alignments that can be created from the data. Previously, we were able to obtain good estimates of the number of alignments that can be created from the data. Here, we introduce a method for accurately estimating the p-value of the information content of an alignment matrix.<\/p>\n<p><b>1. Describing sequence alignments with matrices<\/b><\/p>\n<p>&nbsp;<\/p>\n<p><a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image1.gif\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" class=\"alignnone wp-image-728 size-full\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image1.gif\" alt=\"\" width=\"176\" height=\"184\" \/><\/a><\/p>\n<p>Figure 1. An example of a matrix summarizing a DNA sequence alignment not containing gaps. On the top is an alignment of four 6-mers. Below is a matrix containing the number of times that the indicated letter is observed at the indicated position of this alignment.<\/p>\n<p>&nbsp;<\/p>\n<p>Functionally related DNA and protein sequences are generally expected to share some common sequence elements. For example, a DNA-binding protein is expected to bind related DNA sequences. The pattern shared by a set of functionally related sequences is commonly identified during the process of aligning the sequences to maximize sequence conservation. The oldest method for describing a sequence alignment is the consensus sequence, which contains the most highly conserved letter (i.e., base for DNA or amino acid for protein) at each position of the alignment. 
However, most alignments are not limited to just a single letter at each position. At some positions of an alignment any letter may be permissible, although some letters may be much more frequent than others. A more complete description of an alignment is the alignment matrix, which, in its simplest form, lists the occurrence of each letter at each position of the alignment (Figure 1).<sup>1<\/sup>\u00a0However, alignment matrices can also describe more complex patterns that contain gaps (i.e. sequences contain insertions and deletions relative to each other) or in which different positions are correlated with each other (Figure 2).<sup>2<\/sup><\/p>\n<p>&nbsp;<\/p>\n<p align=\"CENTER\"><a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image2.gif\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" class=\"alignnone wp-image-729 size-full\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image2.gif\" alt=\"\" width=\"214\" height=\"294\" \/><\/a><\/p>\n<p>Figure 2. An example of a matrix summarizing a DNA sequence alignment that contains gaps. On the top is an alignment of four sequences: the first and last are 6-mers, and the middle two are 4-mers. Below is a matrix whose first five rows represent our simplest matrix for alignments containing gaps. The first four rows contain the number of times that the indicated letter is observed at the indicated position of this alignment, and the fifth row contains the number of times that a gap is observed at the indicated position. 
A more complex matrix representation includes gap-letter correlations and contains the four lower rows showing the number of times that a letter (<a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_l.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-767\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_l.gif\" alt=\"\" width=\"12\" height=\"18\" \/><\/a>) or a gap (- ) is preceded by a letter or a gap.<\/p>\n<p><b>2. Measuring sequence conservation with information content.<\/b><\/p>\n<p><b><i>2.1 Information content of a simple alignment<\/i><\/b><\/p>\n<p>We score the significance of a sequence alignment by what we call the information content of the alignment matrix.<sup>1,2<\/sup> The higher the information content, the rarer the pattern described by the alignment. The following is the formula for determining the information content <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_i.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-722\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_i.gif\" alt=\"\" width=\"13\" height=\"17\" \/><\/a>\u00a0of a simple alignment matrix (Figure 1):<\/p>\n<table border=\"0\" width=\"100%\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td width=\"90%\">\n<p align=\"center\"><a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image3.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-730\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image3.gif\" alt=\"\" width=\"127\" height=\"48\" \/><\/a>, where<\/p>\n<\/td>\n<td width=\"10%\">(1)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<ul>\n<li><a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_i.gif\"><img loading=\"lazy\" 
class=\"alignnone size-full wp-image-727\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_i.gif\" alt=\"\" width=\"9\" height=\"17\" \/><\/a>\u00a0refers to the rows of the matrix (e.g., the bases A,C,G,T for a DNA alignment),<\/li>\n<li><a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_j.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-766\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_j.gif\" alt=\"\" width=\"13\" height=\"19\" \/><\/a>\u00a0refers to the columns of the matrix (i.e., the position within the pattern),<\/li>\n<li><a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_a.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-721\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_a.gif\" alt=\"\" width=\"15\" height=\"17\" \/><\/a>\u00a0is the total number of letters in the sequence alphabet (4 for DNA and 20 for protein),<\/li>\n<li><a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_l.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-723\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_l.gif\" alt=\"\" width=\"14\" height=\"17\" \/><\/a>\u00a0is the total number of columns in the matrix,<\/li>\n<li><a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_p_i.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-768\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_p_i.gif\" alt=\"\" width=\"18\" height=\"24\" \/><\/a>\u00a0is the <i>a priori<\/i> probability of letter <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_i.gif\"><img loading=\"lazy\" 
class=\"alignnone size-full wp-image-727\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_i.gif\" alt=\"\" width=\"9\" height=\"17\" \/><\/a>&#8211; this might be the genomic frequency in the particular organism or the observed frequency in the particular data set and<\/li>\n<li><a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_f_i_j.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-726\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_f_i_j.gif\" alt=\"\" width=\"24\" height=\"25\" \/><\/a> is the frequency that letter <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_i.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-727\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_i.gif\" alt=\"\" width=\"9\" height=\"17\" \/><\/a> occurs at position <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_j.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-766\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_j.gif\" alt=\"\" width=\"13\" height=\"19\" \/><\/a> so that <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image4.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-731\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image4.gif\" alt=\"\" width=\"78\" height=\"30\" \/><\/a>.<\/li>\n<\/ul>\n<p>The minimum value of <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_i.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-722\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_i.gif\" alt=\"\" width=\"13\" 
height=\"17\" \/><\/a> is 0. An alignment having this minimum value is consistent with a pattern that any sequence might match. When <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_i.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-722\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_i.gif\" alt=\"\" width=\"13\" height=\"17\" \/><\/a>\u00a0is at its maximum value, only a single sequence will be consistent with the pattern.<\/p>\n<p>This formula for information content assumes that each position of the alignment is independent and that each column of an arbitrary alignment of random sequences follows a multinomial distribution. The most conventional derivation of formula 1 is to simply normalize the negative of the common log-likelihood statistic (described in introductory statistics books) by the number of sequences in the alignment. Alternatively, the formula can be shown to be a large-deviation rate constant.<sup>3<\/sup><\/p>\n<p><i><b>2.2 Information content of alignments containing gaps<\/b><\/i><\/p>\n<p>The information content <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image5.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-732\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image5.gif\" alt=\"\" width=\"18\" height=\"25\" \/><\/a> of an alignment containing gaps but no correlations (first 5 rows of the matrix in Figure 2) includes an additional term <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image6.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-733\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image6.gif\" alt=\"\" width=\"26\" height=\"25\" \/><\/a>, which is the frequency of a gap occurring at position <a 
href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_j.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-766\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_j.gif\" alt=\"\" width=\"13\" height=\"19\" \/><\/a> of the alignment, so that <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image7.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-734\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image7.gif\" alt=\"\" width=\"117\" height=\"30\" \/><\/a>. The formula for <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image5.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-732\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image5.gif\" alt=\"\" width=\"18\" height=\"25\" \/><\/a>\u00a0is:<\/p>\n<table border=\"0\" width=\"100%\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td width=\"90%\"><a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image8.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-735\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image8.gif\" alt=\"\" width=\"227\" height=\"50\" \/><\/a><\/td>\n<td width=\"10%\">(2)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Notice that the formula 1 can be derived from formula 2 by setting <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image9.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-736\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image9.gif\" alt=\"\" width=\"53\" height=\"25\" \/><\/a> because <a 
href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image10.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-737\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image10.gif\" alt=\"\" width=\"90\" height=\"30\" \/><\/a>. Also notice that any column in which <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image11.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-738\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image11.gif\" alt=\"\" width=\"50\" height=\"25\" \/><\/a> (i.e., the corresponding position never contains any letters) contributes nothing to the overall value of <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image5.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-732\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image5.gif\" alt=\"\" width=\"18\" height=\"25\" \/><\/a>. Thus, any number of meaningless positions containing only gaps can be added to an alignment without altering its information content.<\/p>\n<p>There is no\u00a0<i>a priori<\/i>\u00a0probability for gaps since they can be freely introduced. For example, the\u00a0<i>a priori<\/i>\u00a0probability of a sequence is not altered by inserting gaps into its alignment. However, because gaps can be freely inserted into the alignment, the number of permissible alignments is greatly increased. The derivations mentioned in section 2.1 for the information content of alignments not containing gaps can be generalized to alignments containing gaps.<\/p>\n<p><i><b>2.3 Information content of alignments containing adjacent correlations<\/b><\/i><\/p>\n<p>Gaps are frequently clustered in adjacent positions of an alignment. 
In extreme cases of clustering, a pattern containing gaps can be described as multiple ungapped conserved domains separated by variable length unconserved spacer sequences. Information for clustering gaps is contained in the correlations between adjacent letters and gaps (the last 4 rows of the matrix in Figure 2). The information content <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image12.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-739\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image12.gif\" alt=\"\" width=\"22\" height=\"25\" \/><\/a>\u00a0for such an alignment is<\/p>\n<p><a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image13.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-740\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image13.gif\" alt=\"\" width=\"554\" height=\"53\" \/><\/a>,<\/p>\n<p>in which <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image14.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-741\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image14.gif\" alt=\"\" width=\"25\" height=\"25\" \/><\/a> is the frequency of a letter instead of a gap occurring at position <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_j.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-766\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_j.gif\" alt=\"\" width=\"13\" height=\"19\" \/><\/a>; and <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image15.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-742\" 
src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image15.gif\" alt=\"\" width=\"30\" height=\"25\" \/><\/a>, <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image16.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-743\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image16.gif\" alt=\"\" width=\"31\" height=\"25\" \/><\/a>, <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image17.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-744\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image17.gif\" alt=\"\" width=\"31\" height=\"25\" \/><\/a>, <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image18.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-745\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image18.gif\" alt=\"\" width=\"31\" height=\"25\" \/><\/a> are the frequencies of adjacent gaps and letters. 
For example, <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image16.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-743\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image16.gif\" alt=\"\" width=\"31\" height=\"25\" \/><\/a> is the frequency of a gap at position <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_j.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-766\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_j.gif\" alt=\"\" width=\"13\" height=\"19\" \/><\/a> following a letter at position <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image19.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-746\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image19.gif\" alt=\"\" width=\"31\" height=\"21\" \/><\/a>.<\/p>\n<p>The bracketed term in the summation is commonly called mutual information and represents the information content due to the adjacent correlations between the occurrences of gaps and letters. The mutual information is at its minimum value of zero when the frequencies of adjacent letters and gaps are independent &#8211; i.e., the four fractions are each equal to one. This last alignment model is very similar to the most common hidden Markov models used for determining and describing protein sequence alignments.<\/p>\n<p><b>3. Determining the p-value of the information content<\/b><\/p>\n<p>Information content is a good measure for comparing alignments having the same width and the same number of sequences. 
However, a more meaningful measure is the p-value of the information content, i.e., the probability of observing an information content greater than or equal to a specified value, given the width of the alignment and the number of sequences in the alignment. The p-value also allows the comparison of alignments having different widths and containing different numbers of sequences. When calculating the p-value, our null model is that the distribution of letters in each alignment column is an independent multinomial distribution in which the probability of observing each letter is the\u00a0<i>a priori\u00a0<\/i>probability of that letter. We calculate the p-value of an information content <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image20.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-747\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image20.gif\" alt=\"\" width=\"15\" height=\"22\" \/><\/a>\u00a0using a technique from large-deviation statistics.<sup>3<\/sup> Let <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image21.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-748\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image21.gif\" alt=\"\" width=\"33\" height=\"22\" \/><\/a> be the probability of observing an information content <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_i.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-722\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_i.gif\" alt=\"\" width=\"13\" height=\"17\" \/><\/a>. 
Define a new probability <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image22.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-749\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image22.gif\" alt=\"\" width=\"37\" height=\"25\" \/><\/a>\u00a0as<\/p>\n<table border=\"0\" width=\"100%\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td width=\"90%\"><a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image23.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-750\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image23.gif\" alt=\"\" width=\"105\" height=\"46\" \/><\/a>, where<\/td>\n<td width=\"10%\">(4)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image24.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-751\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image24.gif\" alt=\"\" width=\"122\" height=\"35\" \/><\/a> is the moment generating function for <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image21.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-748\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image21.gif\" alt=\"\" width=\"33\" height=\"22\" \/><\/a> and ensures that <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image25.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-752\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image25.gif\" alt=\"\" width=\"77\" height=\"35\" \/><\/a>; <a 
href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image26.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-753\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image26.gif\" alt=\"\" width=\"13\" height=\"17\" \/><\/a> is chosen such that the average of <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_i.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-722\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_i.gif\" alt=\"\" width=\"13\" height=\"17\" \/><\/a> for <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image27.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-754\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image27.gif\" alt=\"\" width=\"18\" height=\"25\" \/><\/a> is <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image20.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-747\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image20.gif\" alt=\"\" width=\"15\" height=\"22\" \/><\/a>: <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image28.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-755\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image28.gif\" alt=\"\" width=\"154\" height=\"35\" \/><\/a>; and the variance of <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_i.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-722\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_i.gif\" alt=\"\" width=\"13\" 
height=\"17\" \/><\/a> for <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image29.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-756\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image29.gif\" alt=\"\" width=\"18\" height=\"24\" \/><\/a> is <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image30.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-757\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image30.gif\" alt=\"\" width=\"198\" height=\"37\" \/><\/a>. Formula 4 can be rearranged so that<\/p>\n<p><a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image31.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-758\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image31.gif\" alt=\"\" width=\"179\" height=\"45\" \/><\/a>, and<\/p>\n<table border=\"0\" width=\"100%\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td width=\"90%\"><a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image32.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-759\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image32.gif\" alt=\"\" width=\"227\" height=\"48\" \/><\/a><\/td>\n<td width=\"10%\">(5)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Formula 5 became useful for estimating p-value once we realized that <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image33.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-760\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image33.gif\" alt=\"\" width=\"39\" height=\"22\" \/><\/a>, <a 
href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image34.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-761\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image34.gif\" alt=\"\" width=\"42\" height=\"22\" \/><\/a>, and <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image35.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-762\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image35.gif\" alt=\"\" width=\"48\" height=\"22\" \/><\/a> can be determined in time <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image36.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-763\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image36.gif\" alt=\"\" width=\"55\" height=\"24\" \/><\/a> for the alignment models described in sections 2.1 and 2.2. Thus, <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image26.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-753\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image26.gif\" alt=\"\" width=\"13\" height=\"17\" \/><\/a>\u00a0can be determined numerically, the bracketed portion of formula 5 can be determined exactly, and the summation can be approximated by a normal distribution.<\/p>\n<p>An accurate estimate of the p-value is necessary because the ultimate statistical significance of an alignment is the product of the p-value and the typically huge number of alignments that can be generated with any set of data. 
For example, given <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_n.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-724\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_n.gif\" alt=\"\" width=\"18\" height=\"18\" \/><\/a> sequences of length <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_q.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-725\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_q.gif\" alt=\"\" width=\"15\" height=\"21\" \/><\/a>, the number of alignments of width <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_l.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-723\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_big_l.gif\" alt=\"\" width=\"14\" height=\"17\" \/><\/a> and having a single contribution from <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image37.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-764\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image37.gif\" alt=\"\" width=\"24\" height=\"22\" \/><\/a> of the sequences is <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image38.gif\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-765\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis56_Image38.gif\" alt=\"\" width=\"175\" height=\"45\" \/><\/a>. Alternative formulas for counting alignments also exist for when each sequence is not limited to contributing at most once to any single alignment.<\/p>\n<p><b>5. 
Conclusion<\/b><\/p>\n<p>We have described how alignment matrices and information content can describe and score simple ungapped alignments and complex alignments having gaps and adjacent correlations. We also described how to determine the statistical significance of an alignment and introduced a new method for determining the p-value of alignments in which each column is considered independent. Extending this method for estimating p-value to our more complex alignment models that incorporate adjacent correlations is in progress.<\/p>\n<p><b>Acknowledgments<\/b><\/p>\n<p>This work was supported by Public Health Service grant HG-00249 from the National Institutes of Health.<\/p>\n<p><b>References<\/b><\/p>\n<ol>\n<li>G. Z. Hertz, G. W. Hartzell III, and G. D. Stormo, &#8220;Identification of Consensus Patterns in Unaligned DNA Sequences Known to be Functionally Related&#8221; Comput. Appl. Biosci. 6, 81\u201392 (1990)<\/li>\n<li>G. Z. Hertz and G. D. Stormo, &#8220;Identification of Consensus Patterns in Unaligned DNA and Protein Sequences: A Large-Deviation Statistical Basis for Penalizing Gaps&#8221; In Proceedings of the Third International Conference on Bioinformatics and Genome Research, H. A. Lim and C. R. Cantor, editors, pages 201\u2013216 (World Scientific Publishing Co., Ltd., Singapore, 1995)<\/li>\n<li>J. A. Bucklew, &#8220;Large Deviation Techniques in Decision, Simulation, and Estimation&#8221; (John Wiley and Sons, Inc., New York, 1990)<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>HERTZ GERALD Z.+,\u00a0STORMO GARY D. 
Department of Molecular, Cellular, and Developmental Biology; University of Colorado; Campus Box 347; Boulder, CO 80309; USA; hertz or e-mail:\u00a0stormo@colorado.edu +Corresponding author Keywords: patterns, statistically significant alignment matrix, information context &nbsp; We have been developing &hellip; <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/abstracts\/abstract-list\/056_identifying-dna-and-protein-patterns-with-statistically-significant-alignment-matrices\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":13,"featured_media":0,"parent":97,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":[],"_links":{"self":[{"href":"https:\/\/conf.icgbio.ru\/bgrs98\/wp-json\/wp\/v2\/pages\/720"}],"collection":[{"href":"https:\/\/conf.icgbio.ru\/bgrs98\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/conf.icgbio.ru\/bgrs98\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/conf.icgbio.ru\/bgrs98\/wp-json\/wp\/v2\/users\/13"}],"replies":[{"embeddable":true,"href":"https:\/\/conf.icgbio.ru\/bgrs98\/wp-json\/wp\/v2\/comments?post=720"}],"version-history":[{"count":6,"href":"https:\/\/conf.icgbio.ru\/bgrs98\/wp-json\/wp\/v2\/pages\/720\/revisions"}],"predecessor-version":[{"id":1393,"href":"https:\/\/conf.icgbio.ru\/bgrs98\/wp-json\/wp\/v2\/pages\/720\/revisions\/1393"}],"up":[{"embeddable":true,"href":"https:\/\/conf.icgbio.ru\/bgrs98\/wp-json\/wp\/v2\/pages\/97"}],"wp:attachment":[{"href":"https:\/\/conf.icgbio.ru\/bgrs98\/wp-json\/wp\/v2\/media?parent=720"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}