ON THE STATISTICAL SIGNIFICANCE OF PAIRWISE GLOBAL ALIGNMENTS OF NUCLEOTIDE SEQUENCES
Keywords:
substitution matrix, affine gaps, Monte Carlo method, extreme value distribution, Markov chains, Benjamini-Hochberg procedure
DOI:
https://doi.org/10.17654/0973514323004
Abstract
Global alignments, generally performed on two sequences, are valuable indicators of evolutionary relatedness. The score distributions of pairwise global alignments are therefore of interest for evaluating the statistical significance of those alignments. This paper shows how that significance can be measured with the Monte Carlo method of repeated random sampling, by calculating p-values from the cumulative distributions of optimal scores. The null model is either generated synthetically from random nucleotide sequences or compiled from real nucleotide sequences drawn from online genomic repositories. The analysis covers several scoring schemes: because affine gap penalties are more widely used than linear and other gap models, two affine gap models are considered, and both uniform and non-uniform substitution matrices are analyzed. To validate the results, a realistic null model is used in which real nucleotide sequences are picked at random from online genomic repositories. Of the three extreme value distributions (EVDs) considered, the Gumbel distribution is found to best describe the alignment score distributions in all cases examined in the study. An analysis of the alignment score distributions after normalization yields the same result.
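The procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the scoring values (match 5, mismatch -4, gap open 10, gap extend 1), the gap cost convention (a gap of length k costs gap_open + (k - 1) * gap_extend), the uniform i.i.d. synthetic null model, the add-one correction to the empirical p-value, and the method-of-moments Gumbel fit are all assumptions chosen for illustration.

```python
import math
import random
import statistics

NEG = float("-inf")

def global_affine_score(a, b, match=5, mismatch=-4, gap_open=10, gap_extend=1):
    """Optimal global alignment score with affine gaps (Gotoh recurrences).
    Assumed convention: a gap of length k costs gap_open + (k - 1) * gap_extend."""
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap; Y: b[j-1] aligned to a gap
    prev_M, prev_X, prev_Y = [NEG] * (m + 1), [NEG] * (m + 1), [NEG] * (m + 1)
    prev_M[0] = 0
    for j in range(1, m + 1):
        prev_Y[j] = -(gap_open + (j - 1) * gap_extend)
    for i in range(1, n + 1):
        cur_M, cur_X, cur_Y = [NEG] * (m + 1), [NEG] * (m + 1), [NEG] * (m + 1)
        cur_X[0] = -(gap_open + (i - 1) * gap_extend)
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur_M[j] = max(prev_M[j - 1], prev_X[j - 1], prev_Y[j - 1]) + s
            cur_X[j] = max(prev_M[j] - gap_open, prev_Y[j] - gap_open,
                           prev_X[j] - gap_extend)
            cur_Y[j] = max(cur_M[j - 1] - gap_open, cur_X[j - 1] - gap_open,
                           cur_Y[j - 1] - gap_extend)
        prev_M, prev_X, prev_Y = cur_M, cur_X, cur_Y
    return max(prev_M[m], prev_X[m], prev_Y[m])

def random_dna(length, rng):
    """Synthetic null sequence: uniform, independent nucleotides (an assumption)."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def monte_carlo_p_value(a, b, trials=200, seed=0, **scoring):
    """Empirical p-value: share of null scores at least as large as the observed one."""
    rng = random.Random(seed)
    observed = global_affine_score(a, b, **scoring)
    null_scores = [
        global_affine_score(random_dna(len(a), rng), random_dna(len(b), rng), **scoring)
        for _ in range(trials)
    ]
    exceed = sum(s >= observed for s in null_scores)
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (exceed + 1) / (trials + 1), null_scores

def gumbel_tail_p(score, null_scores):
    """Upper-tail p-value from a Gumbel fitted by the method of moments."""
    beta = statistics.stdev(null_scores) * math.sqrt(6) / math.pi
    mu = statistics.mean(null_scores) - 0.5772156649 * beta  # Euler-Mascheroni constant
    return -math.expm1(-math.exp(-(score - mu) / beta))

if __name__ == "__main__":
    x = "ACGTTGCAAGGCTTAACCGGTACGATCGAT"
    y = "ACGTTGCTAGGCTAACCGGTACGTTCGAT"
    obs, p_emp, null = monte_carlo_p_value(x, y)
    print(obs, p_emp, gumbel_tail_p(obs, null))
```

The empirical p-value cannot resolve probabilities below 1 / (trials + 1), which is one reason a fitted extreme value distribution is useful: the Gumbel tail extrapolates beyond the sampled range. A realistic null model would replace random_dna with sequences sampled from a genomic database.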
Received: September 20, 2022
Accepted: November 26, 2022
License
Copyright (c) 2023 JP Journal of Biostatistics

This work is licensed under a Creative Commons Attribution 4.0 International License.
_________________________
Attribution: Credit Pushpa Publishing House as the original publisher, including title and author(s) if applicable.
Non-Commercial Use: For non-commercial purposes only. No commercial activities without explicit permission.
No Derivatives: Modifying or creating derivative works not allowed without written permission.
Contact Pushpa Publishing House for more information or permissions.