Chapter 4  Markov Models

4.4 Student Presentation 1

  • Example1: Was she happy? (A very entertaining example.) hidden_states = (Happy, Unhappy)
    observations = (Kiss, Beat, Do nothing)
  • Viterbi algorithm
  • Example2: 5’ splice site recognition: hidden_states = (E, 5, I)
    observations = (A, C, G, T)
  • Example3: Coding region: hidden_states = (Sta1, Sta2, Sta3, Cod1, Cod2, Cod3, Sto1, Sto2, Sto3)
    observations = (A, C, G, T)
  • Example4: Prokaryotic gene: hidden_states = (intergenic,
    Sta‐8, Sta‐7, Sta‐6, Sta‐5, Sta‐4, Sta‐3, Sta‐2, Sta‐1, Sta1, Sta2, Sta3, Sta4, Sta5, Sta6,
    Cod1, Cod2, Cod3,
    Sto‐3, Sto‐2, Sto‐1, Sto1, Sto2, Sto3, Sto4, Sto5, Sto6, Sto7, Sto8, Sto9, Sto10, Sto11)
    observations = (A, C, G, T)
  • Example5: Eukaryotic gene 1
  • Example6: Eukaryotic gene 2
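
The Viterbi algorithm listed above can be illustrated with the states and observations of Example1. The following is only a minimal sketch: every probability in it is an invented value (the presentation outline gives none), and the function name `viterbi` is a hypothetical helper, not something taken from the slides.

```python
import math

# Hidden states from Example1.
states = ("Happy", "Unhappy")

# ASSUMPTION: every probability below is invented for illustration only.
start_p = {"Happy": 0.5, "Unhappy": 0.5}
trans_p = {
    "Happy": {"Happy": 0.8, "Unhappy": 0.2},
    "Unhappy": {"Happy": 0.3, "Unhappy": 0.7},
}
emit_p = {
    "Happy": {"Kiss": 0.6, "Beat": 0.1, "Do nothing": 0.3},
    "Unhappy": {"Kiss": 0.1, "Beat": 0.5, "Do nothing": 0.4},
}


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most probable hidden state path, its log-probability)."""
    # score[s] = best log-probability of any path that ends in state s.
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    backptrs = []
    for o in obs[1:]:
        new_score, ptr = {}, {}
        for s in states:
            # Best predecessor of s at this position.
            prev = max(states, key=lambda r: score[r] + math.log(trans_p[r][s]))
            new_score[s] = (
                score[prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][o])
            )
            ptr[s] = prev
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]


path, logp = viterbi(["Kiss", "Kiss", "Beat"], states, start_p, trans_p, emit_p)
print(path, math.exp(logp))
```

With these made-up numbers the most probable path for (Kiss, Kiss, Beat) is Happy, Happy, Unhappy. Log-probabilities are used so that long observation sequences do not underflow.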

4.5 Student Presentation 2

  • What is Pfam? Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The Pfam database contains information about protein domains and families.
  • Pfam entries are classified in one of four ways:
  1. Family--A collection of related protein regions
  2. Domain--A structural unit
  3. Repeat--A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
  4. Motif--A short unit found outside globular domains
  • Pfam includes Pfam-A and Pfam-B
  • Pfam-A: Pfam-A is the manually curated portion of the database that contains over 10,000 entries. For each entry, a protein sequence alignment and a hidden Markov model are stored.
  • Pfam-B: Because the entries in Pfam-A do not cover all known proteins, an automatically generated supplement called Pfam-B is provided. Pfam-B contains a large number of small families derived from clusters produced by an algorithm called ADDA. Although of lower quality, Pfam-B families can be useful when no Pfam-A families are found.
  • In general, "the Pfam database" refers to Pfam-A.
  • Functions of Pfam. For each family in Pfam, one can:
  1. Look at multiple alignments
  2. View protein domain architectures
  3. Examine species distribution
  4. Follow links to other databases
  5. View known protein structures
  • pHMM Generation
  • Principles of pHMM: Tokens: amino acid sequence; States: insertion, deletion, match; Column: probability of residues at each site
  • pHMM parameters are derived from a training set
  • Parameters of pHMM: transition probability & emission probability
  • Training set: curated, highly representative, relatively conserved sequences of a family
  • pHMM parameters are adjusted to include all the members of the family.
  • Parameters can be estimated directly from a multiple alignment.
  • Parameters can also be estimated from unaligned sequences using an expectation-maximization procedure.
  • Finally, a pHMM logo is generated from the pHMM.
  • Why is Pfam reliable?
  1. When the pHMMs are generated, all the data are manually curated.
  2. Using a pHMM for prediction is reliable.
  3. Pfam only shows results with high significance.
  • http://pfam.xpfam.org/
  • Limitations
  1. Pfam doesn't give the 3D-structure of the target protein
  2. Pfam only gives the function of specific domains, but doesn't describe the function of the whole protein
  3. Pfam doesn't give the basic properties of the target, including pI, solubility, etc.
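
The point above that pHMM parameters can be estimated directly from a multiple alignment can be made concrete for the emission probabilities. The sketch below makes several assumptions: the alignment is a toy example invented for illustration, the alphabet is reduced to five residues, every column is treated as a match column, gaps are simply skipped, and a pseudocount of 1 per residue is used for smoothing. The function name `match_emissions` is hypothetical. Transition probabilities are not covered here.

```python
from collections import Counter

# ASSUMPTION: a toy alignment invented for illustration; '-' marks a gap.
alignment = [
    "ACD-",
    "ACDE",
    "AGDE",
    "A-DE",
]
# ASSUMPTION: a reduced alphabet keeps the output short; real proteins use 20 residues.
alphabet = "ACDEG"


def match_emissions(alignment, alphabet, pseudocount=1.0):
    """Estimate per-column emission probabilities with pseudocount smoothing."""
    n_cols = len(alignment[0])
    profile = []
    for col in range(n_cols):
        # Count residues in this column, skipping gaps.
        counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return profile


profile = match_emissions(alignment, alphabet)
for i, col in enumerate(profile, start=1):
    best = max(col, key=col.get)
    print(f"column {i}: most likely residue {best} (p = {col[best]:.2f})")
```

The pseudocount keeps residues that never appear in a column from getting probability zero, which matters when the training set is small.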

4.6 Student Presentation 3

  • Three fundamental problems, given a model M=M(w):
  1. Evaluation: for one sequence 'O=O1O2...', calculate P(O|w)
  2. Decoding: for multiple sequences 'Oa/Ob...', choose S=q1q2... that best explains the observed sequences O
  3. Learning: adjust the parameters to maximize P(O|w), using observed sequences to train the model
  • Feasibility: Biological Meaning
  1. Self-adaptive to the target sequence. Does not rely on prior experience.
  2. Performs better when combined with other biological methods. ----Easily revised with structure data.
  3. Functions as a flexible method under different conditions. ----Adjustable to meet variable requirements.
  • Shortcomings:
  1. The algorithm does not guarantee a globally optimal solution.
  2. The training process is limited by sample size.
  3. The choice of match states is arbitrary.
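
The evaluation problem listed above, calculating P(O|w) for one observed sequence, is commonly solved with the forward algorithm. The following is a minimal sketch under stated assumptions: the two-state model, its state names, its symbols 'a' and 'b', and all of its probabilities are invented, and the function name `forward` is hypothetical.

```python
# ASSUMPTION: a two-state toy model w with invented parameters.
states = ("S1", "S2")
start_p = {"S1": 0.6, "S2": 0.4}
trans_p = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
emit_p = {"S1": {"a": 0.5, "b": 0.5}, "S2": {"a": 0.1, "b": 0.9}}


def forward(obs, states, start_p, trans_p, emit_p):
    """Return P(O | w), summed over all hidden state paths."""
    # alpha[s] = P(O1..Ot, q_t = s | w)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit_p[s][o] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


print(forward(["a", "b"], states, start_p, trans_p, emit_p))
```

The recursion is the same as in Viterbi decoding with the maximum replaced by a sum. This version multiplies raw probabilities, so for long sequences a scaled or log-space variant would be needed to avoid underflow.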

4.7 Student Practice

  • Finding CNVs with HMM
  • What is copy-number variation (CNV)?
  • CNVs: copies of large segments that occur at the chromosomal scale.
  • Problem definition: identifying repeated sequences (CNVs) in a long DNA sequence.
  • Step1: Hidden states: Is a CNV; Is not a CNV
  • Step2: Matrix: Transition Matrix; Create Matrix
  • Step3: Training Set
  • Step4: Dynamic Programming
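
The outline does not say how the matrices are filled in from the training set. One common option, shown here purely as an assumption, is to count adjacent state pairs in a labelled training sequence. In this sketch the label string is made up, 'N' stands for "Is not a CNV" and 'C' for "Is a CNV", and the function name `estimate_transitions` is hypothetical.

```python
from collections import Counter

# ASSUMPTION: a made-up training label string; 'N' = not a CNV, 'C' = is a CNV.
labels = "NNNNCCCNNNCCNNNN"
states = "NC"


def estimate_transitions(labels, states, pseudocount=1.0):
    """Estimate transition probabilities by counting adjacent label pairs."""
    pair_counts = Counter(zip(labels, labels[1:]))
    matrix = {}
    for a in states:
        row_total = sum(pair_counts[(a, b)] for b in states) + pseudocount * len(states)
        matrix[a] = {
            b: (pair_counts[(a, b)] + pseudocount) / row_total for b in states
        }
    return matrix


trans = estimate_transitions(labels, states)
for a in states:
    print(a, {b: round(p, 3) for b, p in trans[a].items()})
```

The resulting matrix, together with an emission matrix estimated in a similar way, would then be passed to the dynamic programming step (Step4) to label a new sequence.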