2.4 Dynamic programming with more complex models
Alignment with affine gap scores
2.5 Heuristic alignment algorithms
2.6 Linear space alignments
2.7 Significance of scores
The Bayesian approach: model comparison
The classical approach: the extreme value distribution
Why use the alignment score as the test statistic?
2.8 Deriving score parameters from alignment data
3 Markov chains and hidden Markov models
Modelling the beginning and end of sequences
Using Markov chains for discrimination
Formal definition of an HMM
Most probable state path: the Viterbi algorithm
The backward algorithm and posterior state probabilities
3.3 Parameter estimation for HMMs
Estimation when the state sequence is known
Estimation when paths are unknown: Baum–Welch and Viterbi training
Modelling of labelled sequences
Discriminative estimation
3.5 More complex Markov chains
Finding prokaryotic genes
Inhomogeneous Markov chains
3.6 Numerical stability of HMM algorithms
4 Pairwise alignment using HMMs
The most probable path is the optimal FSA alignment
A pair HMM for local alignment
4.2 The full probability of x and y, summing over all paths
Probabilistic sampling of alignments
Finding distinct suboptimal alignments
4.4 The posterior probability that x_i is aligned to y_j
The expected accuracy of an alignment
4.5 Pair HMMs versus FSAs for searching
5 Profile HMMs for sequence families
5.1 Ungapped score matrices
5.2 Adding insert and delete states to obtain profile HMMs
Profile HMMs generalise pairwise alignment
5.3 Deriving profile HMMs from multiple alignments
Non-probabilistic profiles
Basic profile HMM parameterisation
5.4 Searching with profile HMMs
Alternatives to log-odds scoring
5.5 Profile HMM variants for non-global alignments
5.6 More on estimation of probabilities
Substitution matrix mixtures
Deriving P(b|a) from an arbitrary matrix
Estimation based on an ancestor
Testing the pseudocount methods
5.7 Optimal model construction
MAP match–insert assignment
5.8 Weighting training sequences
Simple weighting schemes derived from a tree
Root weights from Gaussian parameters
Maximum discrimination weights
6 Multiple sequence alignment methods
6.1 What a multiple alignment means
6.2 Scoring a multiple alignment
6.3 Multidimensional dynamic programming
6.4 Progressive alignment methods
Feng–Doolittle progressive multiple alignment
Iterative refinement methods
6.5 Multiple alignment by profile HMM training
Multiple alignment with a known profile HMM
Overview of profile HMM training from unaligned sequences
Baum–Welch expectation maximisation
Theoretical basis of simulated annealing
Noise injection during Baum–Welch HMM re-estimation
Simulated annealing Viterbi estimation of HMMs
Comparison to Gibbs sampling
Adaptively modifying model architecture; model surgery
7 Building phylogenetic trees
Counting and labelling trees
7.3 Making a tree from pairwise distances
Clustering methods: UPGMA
Molecular clocks and the ultrametric property of distances
Additivity and neighbour-joining
Selecting labelled branching patterns by branch and bound
7.5 Assessing the trees: the bootstrap
7.6 Simultaneous alignment and phylogeny
Sankoff & Cedergren’s gap-substitution algorithm
Hein’s affine cost algorithm
Limitations of Hein’s model
7.8 Appendix: proof of neighbour-joining theorem
8 Probabilistic approaches to phylogeny
Overview of the probabilistic approach to phylogeny
8.2 Probabilistic models of evolution
8.3 Calculating the likelihood for ungapped alignments
The case of two sequences
The likelihood for an arbitrary number of sequences
Reversibility and independence of root position
8.4 Using the likelihood for inference
Maximising the likelihood
Sampling from the posterior distribution
A proposal distribution for phylogenetic trees
Other phylogenetic uses of sampling
8.5 Towards more realistic evolutionary models
Allowing different rates at different sites
Evolutionary models with gaps
Evaluating different probabilistic models
8.6 Comparison of probabilistic and non-probabilistic methods
A probabilistic interpretation of parsimony
Maximum likelihood and pairwise distance methods
A probabilistic interpretation of Sankoff & Cedergren
Interpreting Hein’s algorithm probabilistically
9 Transformational grammars
9.1 Transformational grammars
Definition of a transformational grammar
Deterministic vs. nondeterministic automata
What a regular grammar can’t do
9.3 Context-free grammars
9.4 Context-sensitive grammars
Unrestricted grammars and Turing machines
Stochastic context-sensitive or unrestricted grammars
Hidden Markov models are stochastic regular grammars
9.6 Stochastic context-free grammars for sequence modelling
Normal forms for stochastic context-free grammars
Parameter re-estimation by expectation maximisation
The CYK alignment algorithm
Summary of SCFG algorithms
10 RNA structure analysis
Terminology of RNA secondary structure
RNA sequence evolution is constrained by structure
Inferring structure by comparative sequence analysis
10.2 RNA secondary structure prediction
Base pair maximisation and the Nussinov folding algorithm
An SCFG version of the Nussinov algorithm
Energy minimisation and the Zuker folding algorithm
Base pair confidence estimates
10.3 Covariance models: SCFG-based RNA profiles
SCFG models of ungapped RNA alignments
Design of covariance models
Construction of a CM from an RNA alignment
Scoring: inside algorithm
Outside algorithm for CMs
Inside–outside expectation maximisation for CM parameters
Database searching: the CYK algorithm
Structural alignment: CYK algorithm with traceback
‘Automated’ comparative sequence analysis using CMs
An example of a practical application of CM algorithms
11 Background on probability
11.1 Probability distributions
The binomial distribution
The Gaussian distribution
The multinomial distribution
The Dirichlet distribution
The extreme value distribution
Relative entropy and mutual information
The posterior probability distribution
Sampling by transformation from a uniform distribution
Sampling from a Dirichlet by rejection
Sampling with the Metropolis algorithm
11.5 Estimation of probabilities from counts
EM explanation of the Baum–Welch algorithm