Running PhyloPhlAn directly on binning data


左左左左左左左 · Published 2024-11-20 11:26:05

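As is well known, PhyloPhlAn accepts either DNA or protein sequences as input. However, the commonly used phylophlan database, with its 400 universal markers, is a protein database, so running it directly on DNA sequences takes some extra work, →_→ which is really annoying.

The solution is based on a comment by the software's author on GitHub.

According to the author's method, there are a few key points to watch: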

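1. The long option --force_nucleotides must be used in both stages: when generating the config file and when running the program;

## This option forces the use of DNA sequences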
 --force_nucleotides   If specified, force PhyloPhlAn to use nucleotide
                        sequences for the phylogenetic analysis, even in the
                        case of a amino acids database (default: False)

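2. Set the config file options and arguments correctly
When the input is DNA and it has to be mapped against a protein database, the correct settings are:
-d a: -d is the type of the database the input is mapped to, so here it is amino acid
--db_aa diamond: how the protein database is indexed, so the option is aa and the argument is diamond
--map_dna diamond: the input is DNA, so the mapping option is dna; the argument has to be a tool that can map DNA against amino acids, and diamond is chosen here as well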

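3. Other minor issues

① Binning usually produces .fa files directly, while phylophlan by default recognizes .fna and .faa, so the run needs --genome_extension .fa to declare the extension
② The generated config file needs to be copied to the path phylophlan expects

Here is my upstream workflow for building a tree from multiple markers together, using the -d phylophlan database: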
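
Before choosing a value for --genome_extension, it can help to check which extensions the bins directory actually contains. The following is a minimal sketch; `count_by_extension` is a hypothetical helper name and not part of PhyloPhlAn, and it assumes the bins sit directly under one directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PhyloPhlAn): tally files by extension
# in a bins directory, to decide what to pass to --genome_extension.
count_by_extension() {
    local dir="$1"
    # Regular files directly under $dir that contain a dot in their name;
    # keep only the text from the last dot onward, then count each extension.
    find "$dir" -maxdepth 1 -type f -name '*.*' \
        | sed 's/.*\././' \
        | sort | uniq -c \
        | awk '{print $2, $1}'
}

# Usage (the path is a placeholder):
#   count_by_extension /your/bins/dir
```

If the output lists only .fa, then --genome_extension .fa is the value to pass, as described above.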

Config file

## If the working directory is not the one containing phylophlan_write_config_file,
## the full path to the script has to be given
python /usr/local/miniconda3/envs/phylophlan/bin/phylophlan_write_config_file -o /yourpath/supermatrix_aa.cfg -d a --db_aa diamond --map_dna diamond --msa mafft --trim trimal --tree1 fasttree --tree2 raxml --force_nucleotides

Copy the config file to the expected path

## The expected path can be read from the error message printed by the run step below
cp /yourpath/supermatrix_aa.cfg /usr/local/miniconda3/envs/phylophlan/lib/python3.9/site-packages/phylophlan/phylophlan_configs/supermatrix_aa.cfg

Run the program

## Species level
phylophlan -i /your/bins/dir --genome_extension .fa -d phylophlan --diversity low -f /yourpath/supermatrix_aa.cfg --force_nucleotides
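
Since --force_nucleotides is easy to forget in one of the two stages, it can be convenient to keep both commands in one place. This is only a sketch: `print_phylophlan_cmds` is a hypothetical helper that prints the two commands from this post without running them, and the paths are the same placeholders used above.

```shell
#!/usr/bin/env bash
# Placeholder paths from the post; replace them with real ones.
BINS_DIR="/your/bins/dir"
CFG="/yourpath/supermatrix_aa.cfg"

# Hypothetical helper: prints both commands (it does not execute them),
# so it is easy to see that --force_nucleotides appears in each stage.
print_phylophlan_cmds() {
    local bins="$1" cfg="$2"
    echo "phylophlan_write_config_file -o $cfg -d a --db_aa diamond --map_dna diamond --msa mafft --trim trimal --tree1 fasttree --tree2 raxml --force_nucleotides"
    echo "phylophlan -i $bins --genome_extension .fa -d phylophlan --diversity low -f $cfg --force_nucleotides"
}

print_phylophlan_cmds "$BINS_DIR" "$CFG"
```

Printing first and pasting afterwards keeps the config path identical in both commands.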

Part of the run log:

Mapping "metawrap_70_5_bins_phylophlan/tmp/clean_dna/bin.82.fa"
"bin.82.b6o.bkp" generated in 943s
Mapping "metawrap_70_5_bins_phylophlan/tmp/clean_dna/bin.104.fa"
"bin.104.b6o.bkp" generated in 885s
Mapping "metawrap_70_5_bins_phylophlan/tmp/clean_dna/bin.21.fa"
"bin.21.b6o.bkp" generated in 1473s
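
To get a feel for the mapping cost, the "generated in ...s" lines can be summed up. A minimal sketch follows; `sum_mapping_seconds` is a hypothetical helper name, and it assumes the log lines look exactly like the ones above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a PhyloPhlAn log on stdin and sums the
# per-file times from lines of the form: "<file>" generated in <N>s
sum_mapping_seconds() {
    awk '/generated in [0-9]+s/ {
             t = $NF              # last field, e.g. "943s"
             sub(/s$/, "", t)     # strip the trailing "s"
             total += t
             n++
         }
         END {
             printf "%d files, %d s total, %.0f s mean\n", n, total, (n ? total / n : 0)
         }'
}

# The three timings shown in the log above.
sum_mapping_seconds <<'EOF'
"bin.82.b6o.bkp" generated in 943s
"bin.104.b6o.bkp" generated in 885s
"bin.21.b6o.bkp" generated in 1473s
EOF
```

For these three bins the sum is 3301 s, roughly 55 minutes of mapping time. This is the summed per-file time, which may be more than the wall-clock time if several bins are mapped in parallel.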

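As you can see, running PhyloPhlAn directly on DNA sequences works, but it is not fast; if anything it is rather slow, →_→ and with a large number of bins it still takes quite a lot of time. If you have already run functional annotation and have the .faa files, you can use them to speed up the phylogenetic analysis here. See you at the next problem.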