"Bioinformatics: Introduction and Methods" ---- Markov Models ---- Lecture Notes (7)

wxw060709 · Published 2019-09-12 11:14:56

Chapter 4: Markov Models

4.4 Student Class Presentation 1

Example 1: Was she happy? (A very entertaining example.)
hidden_states = (Happy, Unhappy)
observations = (Kiss, Beat, Do nothing)
Solved with the Viterbi algorithm.
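
The Viterbi decoding for Example 1 can be sketched as follows. The notes give only the state and observation sets, so every probability below (start_p, trans_p, emit_p) is an illustrative assumption, not a value from the lecture; the function name viterbi is likewise just a label for this sketch.

```python
import math

states = ("Happy", "Unhappy")
observations = ("Kiss", "Beat", "Do nothing")

# ASSUMPTION: all probabilities are made up for illustration only.
start_p = {"Happy": 0.5, "Unhappy": 0.5}
trans_p = {
    "Happy": {"Happy": 0.8, "Unhappy": 0.2},
    "Unhappy": {"Happy": 0.3, "Unhappy": 0.7},
}
emit_p = {
    "Happy": {"Kiss": 0.6, "Beat": 0.1, "Do nothing": 0.3},
    "Unhappy": {"Kiss": 0.1, "Beat": 0.5, "Do nothing": 0.4},
}

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (log probability, most probable hidden-state path) for obs."""
    # score[s] = best log probability of any path that ends in state s
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    backpointers = []
    for o in obs[1:]:
        new_score, pointer = {}, {}
        for s in states:
            # Best predecessor of s at this step
            best_prev = max(states, key=lambda r: score[r] + math.log(trans_p[r][s]))
            new_score[s] = (score[best_prev]
                            + math.log(trans_p[best_prev][s])
                            + math.log(emit_p[s][o]))
            pointer[s] = best_prev
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return score[last], path

log_p, path = viterbi(["Kiss", "Do nothing", "Beat", "Beat"],
                      states, start_p, trans_p, emit_p)
print(path, math.exp(log_p))
```

Working in log space avoids numerical underflow on long observation sequences; with other assumed probabilities the decoded path will of course differ.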
Example 2: 5' splice site recognition
hidden_states = (E, 5, I), i.e. exon, 5' splice site, intron
observations = (A, C, G, T)
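
Because a sequence in Example 2 must pass through states E, 5, I in that order, each candidate position of the 5' splice site corresponds to exactly one state path, and the paths can simply be scored one by one. The sketch below does that. The emission and transition values, the toy sequence, and the names path_log_prob and splice_site_posterior are all assumptions chosen for illustration; the notes do not give the actual parameters.

```python
import math

NEG_INF = float("-inf")

def log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

# ASSUMPTION: toy parameters, not taken from the lecture.
emit = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # exon: uniform
    "5": {"A": 0.05, "C": 0.0, "G": 0.95, "T": 0.0},    # splice site: mostly G
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},      # intron: A/T rich
}
trans = {
    ("E", "E"): 0.9, ("E", "5"): 0.1,
    ("5", "I"): 1.0,
    ("I", "I"): 0.9, ("I", "End"): 0.1,
}

def path_log_prob(seq, site):
    """Log probability of seq along the path E...E 5 I...I with '5' at index site."""
    path = ["E"] * site + ["5"] + ["I"] * (len(seq) - site - 1)
    total = 0.0  # Start -> E is taken to have probability 1
    for i, (state, base) in enumerate(zip(path, seq)):
        total += log(emit[state][base])
        nxt = path[i + 1] if i + 1 < len(path) else "End"
        total += log(trans[(state, nxt)])
    return total

def splice_site_posterior(seq):
    """Posterior probability of each candidate splice-site index."""
    # Require at least one exon base before and one intron base after the site
    scores = {k: path_log_prob(seq, k) for k in range(1, len(seq) - 1)}
    top = max(scores.values())
    norm = top + math.log(sum(math.exp(v - top) for v in scores.values()))
    return {k: math.exp(v - norm) for k, v in scores.items()}

seq = "CTGACGTAAGTA"  # hypothetical toy sequence
posterior = splice_site_posterior(seq)
best_site = max(posterior, key=posterior.get)
print(best_site, seq[best_site], round(posterior[best_site], 3))
```

The highest-scoring path is what Viterbi would return for this model, while the posterior shows how much confidence that single best path deserves relative to the alternatives.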
Example 3: Coding region
hidden_states = (Sta1, Sta2, Sta3, Cod1, Cod2, Cod3, Sto1, Sto2, Sto3)
observations = (A, C, G, T)
Example 4: Prokaryotic gene
hidden_states = (intergenic,
Sta‐8, Sta‐7, Sta‐6, Sta‐5, Sta‐4, Sta‐3, Sta‐2, Sta‐1, Sta1, Sta2, Sta3, Sta4, Sta5, Sta6,
Cod1, Cod2, Cod3,
Sto‐3, Sto‐2, Sto‐1, Sto1, Sto2, Sto3, Sto4, Sto5, Sto6, Sto7, Sto8, Sto9, Sto10, Sto11)
observations = (A, C, G, T)
Example 5: Eukaryotic gene 1
Example 6: Eukaryotic gene 2

4.5 Student Class Presentation 2

What is Pfam? Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. It contains information about protein domains and families.
Pfam entries are classified in one of four ways:

Family--A collection of related protein regions
Domain--A structural unit
Repeat--A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
Motif--A short unit found outside globular domains

Pfam includes Pfam-A and Pfam-B
Pfam-A: Pfam-A is the manually curated portion of the database and contains over 10,000 entries. For each entry, a protein sequence alignment and a hidden Markov model are stored.
Pfam-B: Because the entries in Pfam-A do not cover all known proteins, an automatically generated supplement is provided called Pfam-B. Pfam-B contains a large number of small families derived from clusters produced by an algorithm called ADDA. Although of lower quality, Pfam-B families can be useful when no Pfam-A families are found.
In general, "the Pfam database" refers to Pfam-A.
Functions of Pfam, for each family in Pfam one can:

pHMM Generation
Principles of a pHMM: Tokens: amino acid sequence; States: insertion, deletion, match; Column: probability of each residue at that site
pHMM parameters are derived from a training set
Parameters of a pHMM: transition probabilities and emission probabilities
Training set: curated, highly representative, relatively conserved sequences of a family
pHMM parameters are adjusted to include all the members of the family. They can be:
Estimated directly from a multiple alignment.
Estimated from unaligned sequences using an expectation-maximization procedure.
Finally, an HMM logo is generated from the pHMM.
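
As a rough illustration of estimating pHMM parameters directly from a multiple alignment, the sketch below derives match-state emission probabilities from column counts. The toy alignment, the rule that a column with fewer than 50% gaps becomes a match state, the pseudocount of 1, and the function names are all assumptions made for this example; real tools use more refined weighting and priors, and transition probabilities (not shown) would be counted in a similar way.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: a tiny made-up alignment; "-" marks a gap.
alignment = ["ACD-E", "ACDGE", "A-D-E", "TCD-E"]

def match_columns(alignment, max_gap_fraction=0.5):
    """Indices of columns treated as match states (gap fraction below the cutoff)."""
    n_seqs = len(alignment)
    cols = []
    for j in range(len(alignment[0])):
        gaps = sum(1 for row in alignment if row[j] == "-")
        if gaps / n_seqs < max_gap_fraction:
            cols.append(j)
    return cols

def emission_probabilities(alignment, cols, pseudocount=1.0):
    """One emission distribution per match column, smoothed with a pseudocount."""
    emissions = []
    for j in cols:
        counts = Counter(row[j] for row in alignment if row[j] != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        emissions.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return emissions

match_cols = match_columns(alignment)
emissions = emission_probabilities(alignment, match_cols)
for col, dist in zip(match_cols, emissions):
    top = max(dist, key=dist.get)
    print(f"column {col}: most probable residue {top} (p = {dist[top]:.3f})")
```

The pseudocount keeps residues that never appear in a column from getting probability zero, which would otherwise exclude any family member carrying that residue.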
Why is Pfam reliable?

Pfam doesn't give the 3D-structure of the target protein
Pfam only gives the function of specific domains, but doesn't describe the function of the whole protein
Pfam doesn't give the basic properties of the target protein, such as pI, solubility, etc.

4.6 Student Class Presentation 3

Three fundamental problems, given a model M = M(w):
Evaluation: given one sequence O = O1O2..., calculate P(O|w)
Decoding: given multiple sequences Oa, Ob, ..., choose the state path S = q1q2... that best explains the observed sequences O
Learning: adjust the parameters w to maximize P(O|w), using observed sequences to train the model
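
The evaluation problem, computing P(O|w), is commonly solved with the forward algorithm, which sums over all hidden-state paths instead of keeping only the best one as Viterbi does. Below is a minimal sketch on a two-state Happy/Unhappy model; all probabilities are assumed for illustration and do not come from the lecture.

```python
states = ("Happy", "Unhappy")

# ASSUMPTION: illustrative parameters w = (start_p, trans_p, emit_p).
start_p = {"Happy": 0.5, "Unhappy": 0.5}
trans_p = {
    "Happy": {"Happy": 0.8, "Unhappy": 0.2},
    "Unhappy": {"Happy": 0.3, "Unhappy": 0.7},
}
emit_p = {
    "Happy": {"Kiss": 0.6, "Beat": 0.1, "Do nothing": 0.3},
    "Unhappy": {"Kiss": 0.1, "Beat": 0.5, "Do nothing": 0.4},
}

def forward(obs, states, start_p, trans_p, emit_p):
    """Return P(O|w), summed over all hidden-state paths."""
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit_p[s][o] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())

p_obs = forward(["Kiss", "Beat"], states, start_p, trans_p, emit_p)
print(p_obs)
```

For long sequences the alpha values shrink toward zero, so practical implementations rescale them at each step or work in log space. The learning problem is typically handled by the Baum-Welch procedure, which builds on these same forward quantities.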

Self-adaptive to the target sequence; does not rely on a priori experience.
Performs better when combined with other biological methods. ---- Easily revised with structure data.
Functions as a flexible method under different conditions. ---- Adjustable to meet variable requirements.