2.1 Mothur: Processing 16S Paired-End Sequencing Data

My boss got hold of a pile of 16S high-throughput gut-microbiome data from who knows where, dumped it on me with a list of requirements, and told me to analyze it. I've never been good at saying no, so in the spirit of "if you don't know how, grit your teeth and figure it out", I began a curious journey of fumbling through mothur before heading home.

Question:

Previously I always used qiime to process this kind of data: identify reads by barcode, label each sequence, and drop low-quality fragments, with no read merging. But this batch, from a sequencing company in Suzhou, came with neither barcodes nor primers. So three tasks were new to me:

  1. Removing low-quality fragments;
  2. Merging the paired ends;
  3. Labeling each sequence.

Back to business: let's see how to get this done with mothur's help, combined with qiime.

一. Installing mothur

It comes preinstalled with bio-linux, so I skipped this part; if I ever install it from scratch, I'll fill this section in.

Download: http://www.mothur.org/wiki/Download_mothur

二. Run modes

2.1 Interactive mode

 Type mothur in a terminal to enter the interactive prompt:
 mothur >
 mothur > align.seqs(help)
 mothur > quit

2.2 Batch mode 批量运行模式

 http://www.mothur.org/wiki/Batch_mode
 Write the commands into a file, e.g. batchfile:
 --------------------------------batchfile--------------------------------------
 cluster(phylip=98_sq_phylip_amazon.dist, cutoff=0.1)
 collect.single()
 rarefaction.single()
 -------------------------------------------------------------------------------
 Run it with: ../mothur/mothur batchfile

2.3 Command line mode 命令行模式

 http://www.mothur.org/wiki/Command_line_mode
 mothur "#read.dist(phylip=98_sq_phylip_amazon.dist, cutoff=0.1); cluster(); collect.single()"
 Commands are separated by semicolons, the whole string is wrapped in double quotes, and the text inside the quotes begins with #.

三. Analysis workflow

A catalogue of analyses: http://www.mothur.org/wiki/Analysis_examples

Below we follow the MiSeq SOP (standard operating procedure) as our example: http://www.mothur.org/wiki/MiSeq_SOP. What follows is a translation of that page; if you don't mind the extra effort, read the English original.

Out of respect for the mothur team's work: if you use their tools and like them, remember to cite their paper: Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD. (2013): Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Applied and Environmental Microbiology. 79(17):5112-20.

四. Worked example

1. Decompress

cd /sam/meta/changdao/fastq;
gzip -d *.gz;

2. Filtering the raw paired reads with trimmomatic

2.1 First determine whether the quality scores are Phred33 or Phred64

http://hugo.qinqianshan.com/fastq-format-phred33-64/

perl /sam/usefulscript/fastq_phred_decide.pl M008_R1.fasta
 #####Negative value appear 8322 times in 225584 base quality score, it should be Phred33
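The idea behind that detection script can be sketched in a few lines of Python (a hypothetical stand-in; fastq_phred_decide.pl above is the author's own helper): Phred+33 quality characters start at ASCII 33 ('!'), Phred+64 at ASCII 64 ('@'), so any quality character whose ASCII code is below 64 rules out Phred+64.

```python
def guess_phred(quality_lines):
    """Return 'Phred33' or 'Phred64' from an iterable of FASTQ quality strings."""
    negatives = 0
    total = 0
    for line in quality_lines:
        for ch in line.strip():
            total += 1
            if ord(ch) - 64 < 0:      # a negative Phred+64 score is impossible
                negatives += 1
    return "Phred33" if negatives > 0 else "Phred64"

# Example: ';' is ASCII 59, below 64, so this must be Phred+33.
print(guess_phred([";;CCFFFFHHHH"]))  # -> Phred33
```

This mirrors the logic behind the script's message: any "negative value" under the Phred+64 offset means the file is Phred+33.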

2.2 trimmomatic usage:

http://hugo.qinqianshan.com/trimmomatic-filtration-of-raw-reads/
 # Trim bases with quality below 25 from both ends, drop 4-base windows with average quality below 25, minimum length 150
 ### nohup.out holds the output

3. Merging the paired ends by their overlap

First create a file named stability.files with the following content:

 F3D0 F3D0_S188_L001_R1_001.fastq F3D0_S188_L001_R2_001.fastq
 F3D141 F3D141_S207_L001_R1_001.fastq F3D141_S207_L001_R2_001.fastq
 F3D142 F3D142_S208_L001_R1_001.fastq F3D142_S208_L001_R2_001.fastq
 F3D143 F3D143_S209_L001_R1_001.fastq F3D143_S209_L001_R2_001.fastq
 F3D144 F3D144_S210_L001_R1_001.fastq F3D144_S210_L001_R2_001.fastq
 In this file the first column is the sample name, the second the forward-read fastq, and the third the reverse-read fastq. This is also the only file that has to be edited by hand.

Now let's merge the paired reads of all the samples with the make.contigs command, which takes stability.files as input. How it works: the two reads of each pair are aligned and the shared region is examined. Where the reads disagree: if a base in one read lines up against a gap in the other, the base is kept only if its quality is above 25; if both reads have a base at the position, the base with the higher quality wins, but only if its quality score is more than 6 points higher than the other's; if the difference is 6 or less, the consensus base is called N. processors is the thread count; my machine actually has 12, haha.
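The base-calling rule just described can be sketched like this (my reading of the rule as stated, not mothur's actual source code):

```python
def consensus_base(b1, q1, b2, q2):
    """Consensus of one overlap column; '-' marks a gap, q is the Phred quality."""
    if b1 == "-":
        return b2 if q2 > 25 else "-"   # base opposite a gap: need quality > 25
    if b2 == "-":
        return b1 if q1 > 25 else "-"
    if b1 == b2:
        return b1                        # reads agree: keep the base
    if abs(q1 - q2) > 6:                 # disagree: is there a clear winner?
        return b1 if q1 > q2 else b2
    return "N"                           # too close to call

print(consensus_base("A", 38, "C", 20))  # -> A
print(consensus_base("A", 30, "C", 28))  # -> N
```

Applied column by column across the overlap, this yields the merged contig.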

mothur > make.contigs(file=stability.files, processors=8)
 # Important, don't overlook this: the program needs the libreadline.so.6 shared library at runtime.
 The command reports:
 Group count:
 F3D0 7793
 F3D1 5869
 F3D141 5958
 F3D142 3183
 F3D143 3178
 F3D144 4827
 F3D145 7377
 F3D146 5021
 F3D147 17070
 F3D148 12405
 F3D149 13083
 F3D150 5509
 F3D2 19620
 F3D3 6758
 F3D5 4448
 F3D6 7989
 F3D7 5129
 F3D8 5294
 F3D9 7070
 Mock 4779
 Total of all groups is 152360
 If you later see this warning:
 [WARNING]: your sequence names contained ':'. I changed them to '_' to avoid problems in your downstream analysis.
 This happens because sequence names usually look like "M00967:43:000000000-A3JHG:1:1101:18327:1699" and contain many colons, but downstream tree files break on ':', so it is replaced with '_'.

Files produced:

 stability.trim.contigs.fasta
 stability.contigs.groups: which sample each sequence belongs to
 stability.contigs.report (tells you something about the contig assembly for each read)
 contigs.report file:
 1st: contig accession #
 2nd: length of contig
 3rd: length of overlapping region
 4th: overlap start position
 5th: overlap end position
 6th: # of mismatches
 7th: # of Ns

 Inspect the merged result:

mothur > summary.seqs(fasta=stability.trim.contigs.fasta)
 summary.seqs (http://www.mothur.org/wiki/Summary.seqs) operates on just the fasta file and reports the following:
 Start End NBases Ambigs Polymer NumSeqs
 Minimum: 1 422 422 0 4 1
 2.5%-tile: 1 436 436 0 4 3
 25%-tile: 1 507 507 1 5 25
 Median: 1 530 530 3 5 50
 75%-tile: 1 961 961 6 6 74
 97.5%-tile: 1 973 973 15 8 96
 Maximum: 1 978 978 20 9 98
 Mean: 1 678.235 678.235 4.54082 5.44898
 # of Seqs: 98
 These sequences are not yet aligned, so every one starts at position 1; the lengths run from 422 to 978 bases. Ambigs is the count of ambiguous bases. And what is Polymer? It is short for homopolymer, a stretch consisting of a single repeated base; the value reported is the length of the longest such run in the sequence.
 File produced:
 stability.trim.contigs.summary
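To my understanding, the Polymer column is the length of the longest homopolymer run, which is easy to compute; a quick sketch:

```python
def longest_homopolymer(seq):
    """Length of the longest run of identical bases in a sequence."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

print(longest_homopolymer("ACGTTTTTGA"))  # -> 5
```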

4. Screening the contigs

We have seen the contigs' basic statistics; next, remove the poorly assembled sequences.

 mothur > screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, maxambig=0, maxlength=275)

The command above deletes sequences containing any ambiguous bases and sequences longer than 275 bp. You can also pass in the summary.seqs output directly; the following command does the same thing but runs faster.

 mothur > screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, summary=stability.trim.contigs.summary, maxambig=0, maxlength=275)
 Based on the earlier summary, set the maximum length, minimum length, and the other parameters here.

 Usage of screen.seqs (http://www.mothur.org/wiki/Screen.seqs):
 start and end
 mothur > screen.seqs(fasta=sogin.unique.align, start=4655) # for aligned sequences; removes those that start after position 4655
 mothur > screen.seqs(fasta=sogin.unique.align, end=4928)
 mothur > screen.seqs(fasta=sogin.unique.align, start=4655, end=4928)
 mothur > summary.seqs(fasta=sogin.unique.good.align)
 Start End NBases Ambigs Polymer
 Minimum: 4615 4928 50 0 2
 2.5%-tile: 4652 4928 57 0 3
 25%-tile: 4655 4928 60 0 3
 Median: 4655 4928 62 0 4
 75%-tile: 4655 4928 65 0 4
 97.5%-tile: 4655 4938 72 0 6
 Maximum: 4655 5014 100 0 31
 # of Seqs: 20670
 minlength and maxlength control the sequence length

 mothur > screen.seqs(fasta=sogin.unique.align, minlength=50, maxlength=70)
 mothur > summary.seqs(fasta=sogin.unique.good.align)
 Start End NBases Ambigs Polymer
 Minimum: 109 222 50 0 2
 2.5%-tile: 4648 4875 57 0 3
 25%-tile: 4655 4928 60 0 3
 Median: 4655 4928 62 0 4
 75%-tile: 4655 4928 64 0 4
 97.5%-tile: 4655 4939 69 0 6
 Maximum: 6785 6850 70 0 9
 # of Seqs: 20160
 maxambig: the number of Ns allowed
 mothur > screen.seqs(fasta=sogin.unique.align, maxambig=0)
 maxhomop: the longest homopolymer run allowed
 mothur > screen.seqs(fasta=sogin.unique.align, maxhomop=8)
 mothur > summary.seqs(fasta=sogin.unique.good.align)
 Start End NBases Ambigs Polymer
 Minimum: 109 222 40 0 2
 2.5%-tile: 4648 4875 57 0 3
 25%-tile: 4655 4928 60 0 3
 Median: 4655 4928 62 0 4
 75%-tile: 4655 4928 65 0 4
 97.5%-tile: 4655 4939 73 0 6
 Maximum: 6793 6850 100 0 8
 # of Seqs: 21896
 maxn: the maximum number of Ns allowed. How does this differ from maxambig?
 Check the official site for the other parameters.
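The screen.seqs filters above boil down to simple per-sequence checks; a Python sketch (parameter names mirror mothur's, but this is an illustration, not mothur's implementation):

```python
def passes_screen(seq, minlength=0, maxlength=10**9, maxambig=10**9, maxhomop=10**9):
    """True if a sequence survives the length/ambiguity/homopolymer filters."""
    ambigs = sum(1 for b in seq if b not in "ACGTacgt-.")   # ambiguous bases (N, etc.)
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):                     # longest homopolymer run
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    length = sum(1 for b in seq if b not in "-.")           # ungapped length
    return (minlength <= length <= maxlength
            and ambigs <= maxambig and best <= maxhomop)

print(passes_screen("ACGTNACGT", maxambig=0))      # -> False (one N)
print(passes_screen("ACGGGGGGGGT", maxhomop=8))    # -> True (run of 8)
```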

5. get.current(): check the currently active files

mothur > get.current()
 Current files saved by mothur:
 fasta=stability.trim.contigs.good.fasta
 group=stability.contigs.good.groups
 processors=8
 We can see the files and parameters carried over from the previous step.
 mothur > summary.seqs(fasta=stability.trim.contigs.good.fasta)
 mothur > summary.seqs(fasta=current)
 mothur > summary.seqs()
 As shown above, commands can then be written in shorthand.

# Improves efficiency

6. Unique sequences

mothur > unique.seqs(fasta=stability.trim.contigs.good.fasta)
 unique.seqs (http://www.mothur.org/wiki/Unique.seqs): identical sequences are treated as duplicates and collapsed into one.
 Two files are produced:
 stability.trim.contigs.good.names: the first column is the representative's name, the second the sequences collapsed onto it
 stability.trim.contigs.good.unique.fasta
 count.seqs
 mothur > count.seqs(name=stability.trim.contigs.good.names, group=stability.contigs.good.groups)
 count.seqs (http://www.mothur.org/wiki/Count.seqs)
 The resulting file, stability.trim.contigs.good.count_table, looks like this:
 Representative_Sequence total NMV1 NMV2 NMV3 NMV4 NMV5 NMV6
 M01519_40_000000000-AA6DB_1_1101_18483_2428 1 1 0 0 0 0 0
 M01519_40_000000000-AA6DB_1_1101_22903_3700 4 4 0 0 0 0 0
 M01519_40_000000000-AA6DB_1_1101_7884_4343 5 4 1 0 0 0 0
 M01519_40_000000000-AA6DB_1_1101_18202_5158 1 1 0 0 0 0 0
 M01519_40_000000000-AA6DB_1_1101_23520_5214 1 1 0 0 0 0 0
 In effect this combines the group file to tabulate how many times each unique sequence appears in each sample. A classy command. The file it produces is what later commands take as the count parameter.
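What count.seqs does can be pictured as a join of the names file with the groups file; a toy sketch with made-up sequence and sample names:

```python
names = {"seqA": ["seqA", "seqB", "seqC"],   # representative -> identical reads
         "seqD": ["seqD"]}
groups = {"seqA": "F3D0", "seqB": "F3D0", "seqC": "F3D1", "seqD": "F3D1"}

samples = sorted(set(groups.values()))
table = {}                                   # rep -> {sample: count}
for rep, dups in names.items():
    row = dict.fromkeys(samples, 0)
    for d in dups:
        row[groups[d]] += 1                  # tally each duplicate in its sample
    table[rep] = row

print("Representative_Sequence", "total", *samples)
for rep, dups in names.items():
    print(rep, len(dups), *(table[rep][s] for s in samples))
# Representative_Sequence total F3D0 F3D1
# seqA 3 2 1
# seqD 1 0 1
```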

7. pcr.seqs: trimming the reference database (http://www.mothur.org/wiki/Pcr.seqs)

mothur > pcr.seqs(fasta=silva.bacteria.fasta, start=11894, end=25319, keepdots=F, processors=8)
 This trims the reference down to the V4 region.

 system: run shell commands without leaving mothur
 mothur > system(mv silva.bacteria.pcr.fasta silva.v4.fasta)
 mothur > system(copy esophagus.unique.bad.accnos esophagus.bad.accnos) # Windows
 mothur > system(say -v Vicki joe smells) # Mac

8. align.seqs: aligning the sequences (http://www.mothur.org/wiki/Align.seqs)

mothur > align.seqs(fasta=stability.trim.contigs.good.unique.fasta, reference=silva.v4.fasta)

The available reference alignments (http://www.mothur.org/wiki/Alignment_database) include greengenes (http://www.mothur.org/wiki/Greengenes-formatted_databases) and SILVA (http://www.mothur.org/wiki/Silva_reference_files). The official site considers SILVA the better choice, so let's take a look at it.

We naturally pick the latest version, Release 119. http://blog.mothur.org/2014/08/08/SILVA-v119-reference-files/ explains how SILVA was adapted for mothur. Three versions were produced; which one should we choose?

  1. Full length sequences and taxonomy references (Silva.nr_v119.tgz, http://www.mothur.org/w/images/2/27/Silva.nr_v119.tgz): 137,879 bacterial, 3,155 archaeal, and 12,273 eukaryotic sequences; usable for both alignment and classification (7.2 G uncompressed, 249 M compressed).
  2. Recreated SEED database (Silva.seed_v119.tgz, http://www.mothur.org/w/images/5/56/Silva.seed_v119.tgz): 12,244 bacterial, 207 archaeal, and 2,558 eukaryotic sequences; 700 M uncompressed, 25 M compressed.
  3. Silva-based alignment of template file for chimera.slayer (Silva.gold.bacteria.zip, http://www.mothur.org/w/images/f/f1/Silva.gold.bacteria.zip): 5,181 sequences.

So how do I choose among the three?

 If you have enough RAM, use the silva.nr_v119.align database with align.seqs. If you only care about the V4 region, you can do this:
 mothur "#pcr.seqs(fasta=silva.nr_v119.align, start=11894, end=25319, keepdots=F, processors=8); unique.seqs()"

to shrink the database. You can also use get.lineage to extract just the bacterial sequences, but that only trims about 10% of the data. You can also run filter.seqs with vertical=T, though that can still cause trouble if there are insertions.

Of course, you can also simply align against silva.seed_v119.align. For classification, I strongly recommend running pcr.seqs on silva.nr_v119.align first and then using silva.nr_v119.align together with silva.nr_v119.tax.

Running unique.seqs on the result is not recommended.

After aligning against the reference, screen again: based on the summary output, drop the sequences that aligned poorly (watch the start and end positions in particular).

 mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table)
 Start End NBases Ambigs Polymer NumSeqs
 Minimum: 1250 10693 250 0 3 1
 2.5%-tile: 1968 11550 252 0 3 3227
 25%-tile: 1968 11550 252 0 4 32265
 Median: 1968 11550 252 0 4 64530
 75%-tile: 1968 11550 253 0 5 96794
 97.5%-tile: 1968 11550 253 0 6 125832
 Maximum: 1982 13400 270 0 12 129058
 Mean: 1967.99 11550 252.462 0 4.36663
 # of unique seqs: 16477
 total # of seqs: 129058
 mothur > screen.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table, summary=stability.trim.contigs.good.unique.summary, start=1968, end=11550, maxhomop=8)
 This removes low-quality fragments once more: set start to the value at the 97.5%-tile and end to the value at the 2.5%-tile, meaning every sequence must start no later than start and extend at least to end. (Study the summary carefully for this step.)

What does this step do? (http://www.mothur.org/wiki/Filter.seqs) It trims the alignment: vertical=T removes columns that are gaps in every sequence, and trump=. removes any column containing a '.' in at least one sequence, so that all sequences cover the same alignment coordinates.

mothur > filter.seqs(fasta=stability.trim.contigs.good.unique.good.align, vertical=T, trump=.)

Because the previous step trimmed the ends, some sequences may now be identical, so run unique.seqs again:

mothur > unique.seqs(fasta=stability.trim.contigs.good.unique.good.filter.fasta, count=stability.trim.contigs.good.good.count_table)

9. pre.cluster (http://www.mothur.org/wiki/Pre.cluster)

mothur > pre.cluster(fasta=stability.trim.contigs.good.unique.good.filter.unique.fasta, count=stability.trim.contigs.good.unique.good.filter.count_table, diffs=2)
 This uses a pseudo-single linkage algorithm to remove sequences that are likely sequencing errors. The rationale: abundant sequences are more likely to generate erroneous sequences than rare sequences. The sequences are sorted by abundance, then we walk through the list looking for rarer sequences that are within some threshold (here diffs=2 mismatches) of a more abundant one; those rare sequences are merged into the abundant ones. In plain words: using abundance information, allow a threshold of base differences and collapse rare variants onto abundant sequences.
 While it runs, three columns are printed: the number of sequences processed, the number retained, and the number merged away (as below):
 0 21821 86
 100 20286 1621
 200 19824 2083
 ...
 21700 16380 5527
 21800 16377 5530
 21900 16376 5531
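A toy version of the pre.cluster idea (illustrative only; mothur's real implementation is more careful about ties and alignment):

```python
def precluster(counts, diffs=2):
    """counts: {sequence: abundance}; returns {representative: abundance}."""
    order = sorted(counts, key=counts.get, reverse=True)   # most abundant first
    reps = {}
    for seq in order:
        for rep in reps:
            # merge a rarer sequence into the first representative
            # within `diffs` mismatches of it
            if len(seq) == len(rep) and \
               sum(a != b for a, b in zip(seq, rep)) <= diffs:
                reps[rep] += counts[seq]                   # absorb the rare variant
                break
        else:
            reps[seq] = counts[seq]                        # new representative
    return reps

print(precluster({"ACGTACGT": 90, "ACGTACGA": 5, "TTTTTTTT": 10}))
# -> {'ACGTACGT': 95, 'TTTTTTTT': 10}
```

The one-mismatch variant ACGTACGA is folded into the abundant ACGTACGT, while the unrelated sequence stays separate.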

10. Removing chimeras

chimera.uchime: http://www.mothur.org/wiki/Chimera.uchime
 mothur > chimera.uchime(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)
 This command needs a fasta file plus either a name file (or count file) or a reference file.
 chimera.uchime(fasta=stool.trim.unique.good.align, reference=silva.gold.align) compares against a reference database. Alternatively, set reference=self to use your own abundant sequences as the reference:
 mothur > chimera.uchime(fasta=stool.trim.unique.good.align, name=stool.trim.good.names)
 If you also supply a group file, each sample is checked against its own abundant sequences:
 mothur > chimera.uchime(fasta=stool.trim.unique.good.align, name=stool.trim.good.names, group=stool.good.groups)
 You can also supply a count file directly, which serves as name plus group combined:
 mothur > make.table(name=stool.trim.good.names, group=stool.good.groups)
 mothur > chimera.uchime(fasta=stool.trim.unique.good.align, count=stool.trim.good.count_table)
 What dereplicate does: "If the dereplicate parameter is false, then if one group finds the sequence to be chimeric, then all groups find it to be chimeric, default=f."

Remove the chimeric sequences from the fasta file by their names:

 Remove.seqs(http://www.mothur.org/wiki/Remove.seqs)
 mothur > remove.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, accnos=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.accnos)
 classify.seqs (http://www.mothur.org/wiki/Classify.seqs)
 mothur > classify.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table, reference=trainset9_032012.pds.fasta, taxonomy=trainset9_032012.pds.tax, cutoff=80)
 This step assigns each sequence a taxonomic classification against a reference. The available methods are the k-nearest neighbor consensus and the Wang approach. The reference taxonomies (http://www.mothur.org/wiki/Taxonomy_outline) come in four flavors: SILVA reference files, RDP reference files, greengenes reference files, and PR2 reference files. I chose SILVA.

11. remove.lineage (http://www.mothur.org/wiki/Remove.lineage)

mothur > remove.lineage(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)
 This drops the chloroplast, mitochondrial, unknown, archaeal, and eukaryotic sequences.

12. Assessing the error rate (I skipped this step; it requires their mock community dataset)

Extract the sequences of the "Mock" sample:

 mothur > get.groups(count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.count_table, fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, groups=Mock)
 Compute the errors:
 mothur > seq.error(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta, reference=HMP_MOCK.v35.fasta, aligned=F)
 Then work out the error rate in R:
 R> s <- read.table(file="stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.error.summary", header=T)
 R> ct <- read.table(file="stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.pick.count_table", header=T)
 R> rownames(s) <- s[,1]
 R> rownames(ct) <- ct[,1]
 R> no.chim <- s$numparents==1
 R> s.good <- s[no.chim,]
 R> query <- rownames(s.good)
 R> ct.good <- ct[as.character(query),]
 R> s.good[,1]==ct.good[,1]
 R> sum(ct.good$Mock * s.good$mismatches)/sum(ct.good$Mock * s.good$total)
 [1] 6.510802e-05
 The error rate is remarkably low; now we can cluster:
 mothur > dist.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta, cutoff=0.20)
 mothur > cluster(column=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.dist, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.pick.count_table)
 mothur > make.shared(list=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.an.unique_list.list, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.pick.count_table, label=0.03)
 mothur > rarefaction.single(shared=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.an.unique_list.shared)

13. Clustering into OTUs

For a small dataset, we can cluster like this:

dist.seqs (http://www.mothur.org/wiki/Dist.seqs) and cluster (http://www.mothur.org/wiki/Cluster)

mothur > dist.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta, cutoff=0.20)
mothur > cluster(column=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.dist, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.pick.count_table)


Alternatively, use cluster.split (http://www.mothur.org/wiki/Cluster.split). The command below first bins the sequences by taxonomy and then clusters each bin into OTUs, which is faster and more memory-efficient. taxlevel=4 means the split happens at the Order level.

mothur > cluster.split(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.pick.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.15)

This command generates three files: .list, .rabund, and .sabund.

The clustering methods are:

  1. Nearest neighbor: each of the sequences within an OTU is at most X% distant from the most similar sequence in the OTU.
  2. Furthest neighbor: all of the sequences within an OTU are at most X% distant from all of the other sequences within the OTU.
  3. Average neighbor: a middle ground between the other two algorithms. This is the default.

Before clustering, the file is first split into several parts. Three inputs can drive the split, chosen with splitmethod=distance, classify, or fasta: a distance file, the taxonomy, or the taxonomy plus a fasta file. The default is distance.

Splitting by distance: supply a distance file (.dist).

mothur > cluster.split(phylip=98_lt_phylip_amazon.dist) 
or: mothur > cluster.split(column=96_lt_column_amazon.dist, name=amazon.names)
There is also a large parameter, which "allows you to indicate that your distance matrix is too large to fit in RAM"; it is off by default. What it actually does internally, I'm not sure.
mothur > cluster.split(phylip=98_lt_phylip_amazon.dist, large=T)

Splitting by taxonomy: this method needs a distance file and a taxonomy file; set splitmethod=classify and use taxlevel to pick the rank at which to split.

mothur > classify.seqs(fasta=amazon.fasta, template=silva.nogap.fasta, taxonomy=silva.bacteria.silva.tax, probs=f) 
mothur > cluster.split(column=96_lt_column_amazon.dist, name=amazon.names, taxonomy=amazon.silva.taxonomy, splitmethod=classify) 

Splitting by taxonomy with a fasta file: needs a fasta file, a names file, and a taxonomy file.

mothur > classify.seqs(fasta=amazon.fasta, template=silva.nogap.fasta, taxonomy=silva.bacteria.silva.tax, probs=f) 
mothur > cluster.split(fasta=amazon.fasta, name=amazon1.names, taxonomy=amazon.silva.taxonomy, splitmethod=fasta) 

A few important parameters:

taxonomy: used only when splitmethod=classify or fasta.
taxlevel: defaults to 3; 4 means the Order level.
cluster: whether to cluster after splitting; defaults to T. If your disk is too small, you can first run
cluster.split(fasta=yourFasta, taxonomy=yourTax, count=yourCount, taxlevel=3, cluster=f, processors=8)
and then: cluster.split(file=yourFile, processors=4). The first step can use all your processors for the split; the second should use about half.
file: as above, used when cluster=f.
mothur > cluster.split(fasta=final.fasta, taxonomy=final.taxonomy, count=final.count_table, taxlevel=3, cluster=f, processors=4)
mothur > cluster.split(file=final.file, processors=2)
name: the first column is the representative's name, the second the names of the duplicate sequences collapsed onto it.
count: similar to name, except that the second column holds counts of duplicates.
method: which clustering algorithm to use; the default is method=average; furthest and nearest are also available.
cutoff: Without setting cutoff to 0.10 this command would have generated 4560 distances (i.e. 96x95/2 = 4560). 
With the cutoff only 56 distances are saved. 
The savings can be substantial when there are a large number of distances.
The actual cutoff used by the command is 0.005 higher than the value that is set to allow for rounding in the clustering steps.
So the cutoff here is really the threshold for which pairwise distances are kept in the saved matrix, not the distance actually used for clustering.
The remaining parameters are omitted.

Discussion: this step generates a matrix of pairwise distances. I had about 1 GB of filtered sequences (roughly 1,000,000 reads). 1,000,000 x 1,000,000 pairs means a matrix on the order of 10^12 rows; with my long sequence names, at about 30 characters per row, that is roughly 30 TB of disk, terrifying, and that is before counting the temp files. My little server could never cope with files that size.
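That back-of-envelope arithmetic, spelled out (the numbers come from the text above: ~1,000,000 reads and ~30 bytes per line of the distance file):

```python
n = 1_000_000
lines = n * n                      # the full n x n matrix described above
bytes_total = lines * 30           # ~30 bytes per line
print(bytes_total / 1e12, "TB")    # -> 30.0 TB
# Only n*(n-1)/2 pairs are actually needed, i.e. roughly half of that,
# but either way it is far beyond a small server's disk.
```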

Here are some strategies:

  1. Classify first, only classify (cluster=f as mentioned above), then cluster afterwards.

  2. Split at the family level, i.e. taxlevel=5 (the class level may still not be fine enough).

  3. Set cutoff, the threshold above which distances are discarded. I used 0.10; it could be even smaller, since it is only the threshold for saving results, not the clustering threshold, and it saves disk space.

  4. Shorten every sequence name with a script.

  5. Or just switch to qiime: a different, more streamlined algorithm that consumes far fewer resources.

    mothur > cluster(column=150121.unique.dist, name=150121.names)
    Output File Names:
    150121.unique.an.sabund
    150121.unique.an.rabund
    150121.unique.an.list

Why are some of the distances above missing?

Missing distances: perhaps the second most commonly asked question is why there isn't a line for distance 0.XX. If you notice the previous example, the distances jump from 0.003 to 0.006. Where are 0.004 and 0.005? mothur only outputs data if the clustering has been updated for a distance. So if you don't have data at your favorite distance, that means that nothing changed between the previous distance and the next one. Therefore, if you want OTU data for a distance of 0.005 in this case, you would use the data from 0.003. In other words, when the matrix doesn't change between labels, the labels are merged, which is why the label column looks discontinuous. Next we want to know how many sequences are in each OTU from each group, and we can do this using the make.shared command.

Here we tell mothur that we're really only interested in the 0.03 cutoff level:

mothur > make.shared(list=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.an.unique_list.list, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.pick.count_table, label=0.03)

We probably also want to know the taxonomy for each of our OTUs. 
We can get the consensus taxonomy for each OTU using the classify.otu command: 
mothur > classify.otu(list=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.an.unique_list.list,count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.pick.taxonomy, label=0.03) 
Opening stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.an.unique_list.0.03.cons.taxonomy


 you'll see something that looks like... 
OTU Size Taxonomy
Otu001 17 Bacteria(100);"Verrucomicrobia"(100);Verrucomicrobiae(100);Verrucomicrobiales(100);Verrucomicrobiaceae(100);Akkermansia(100);
Otu002 1 Bacteria(100);"Proteobacteria"(100);Gammaproteobacteria(100);Aeromonadales(100);Aeromonadaceae(100);Aeromonas(100);
Otu003 6 Bacteria(100);"Proteobacteria"(100);Betaproteobacteria(100);Neisseriales(100);Neisseriaceae(100);Neisseria(100);
Otu004 1 Bacteria(100);"Proteobacteria"(100);Gammaproteobacteria(100);unclassified(100);unclassified(100);unclassified(100);
Otu005 1 Bacteria(100);"Proteobacteria"(100);Gammaproteobacteria(100);Xanthomonadales(100);Xanthomonadaceae(100);Stenotrophomonas(100);
Otu006 598 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Ruminococcaceae(100);unclassified(100);
Otu007 513 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);unclassified(100);
Otu008 1442 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);unclassified(100);
...

I tried all the strategies above, but the generated data were still enormous (10 TB), so I had no choice but to continue the analysis with qiime. When picking OTUs, qiime's cd-hit algorithm cuts the disk usage considerably.

Removing the gaps from the fasta
sam@sam-Precision-WorkStation-T7500[send] head -n 3 bb.filter.unique.precluster.pick.pick.fasta
>M01519_56_000000000-ABNR1_1_1101_26846_9215
G-G-G-A-G-GC-A-GC-A-G-T-G-G-G-G-A-A–TA-TTGC-A-C-AA-T-G-G–GG–GA-A-A-C-CC-TG-A-TGCAGC-GACGCCGCGT-G-G-A-G–GA-A-G-A-A–G-G-TC———-TT-CG—————GA-TTG-T-A–AA-C-T-CC—–TG-TT-G-T–T-GGG—-G–A-A–G—A————————T———–AA——————————-T-G-A-C-G—–G-T-A-C-CC—A-A-C-A-A-G—G–AA–GT-G–AC-G–GC-TAA-C–T-A-C-G-T-G-CCA-G-C–A-GC-CG-C—-GG–TA-AA–AC-GT-AG-GTC–ACA-A-G-C-GT—T-GT-C-CGG-AA-TT-A-C-T–GG-GT-GT–A—AA-GG-GA-GC—–G-CA-G-G-C-G–G–G-AA-G-A-C-AAG-TT-G-G-A-A-G–TG–A-AA-TC-T-A-TG-G-G–CT-C-AA-C-C-C-A-T-A-A-A-C-T-G-C-T-T-T-C–AA-A-ACT-G-T–TT–T-T-C-T-T-GA-GT-A-G-TG—-CA-G-A–G-G-T-A–GG-C—-GG-A-ATT-C-C-C-G-GTGT-A-GCGGT-G-G-AA-TGCGTAGATA-TC–G-GG-A-G-G-A-ACA-CC-AG-T-GGC-GAA-G–G-C–G-G–C-C-T-A—CTG-G–GC-A-C-C-A-A-CT-GA-CG-CT-G-A-GG-CT-CG-AAA-G-T-G-TG-GG-T-AG-C-AAA-CA–GG-AT-TA-G-ATA–C-C-C-T-G-GTA-G
>M01519_56_000000000-ABNR1_1_1101_17904_17467
 Strip the gap character (-); note that this also rewrites the sequence names:
 sed 's/-//g' bb.filter.unique.precluster.pick.pick.fasta>good.fasta
 Pick OTUs with qiime:
 /usr/lib/qiime/bin/pick_otus.py -i good.fasta -m cdhit -o cdhit_picked_otus/ -n 100
 This produces good_otus.txt.
 Shorten the sequence names:
 sed 's/01519_//g' bb.filter.unique.precluster.uchime.pick.pick.count_table|sed 's/000000000-//g'>aaa.count
 awk '{print $1}' aaa.count>ab.count
 sed '1d' ab.count>ad.count
 sam@sam-Precision-WorkStation-T7500[cdhit_picked_otus] wc ad.count [11:13PM]
 1126716 1126716 32123505 ad.count
 sed '1i\unique\t1126716' ad.count>ae.count
 perl onesingle.pl ae.count ah.count # collapse into a single line
 Build the list file:
 sed 's/01519_//g' good_otus.txt|sed 's/000000000//g'>aaa.dist
 perl no_tab.pl aaa.dist ae.dist # strip the many trailing tabs
 awk '{for(i=2;i<=NF;i++) if(i!=NF){printf $i" "}else{print $i} }' ae.dist>ac.dist
 sed 's/ /,/g' ac.dist>ab.dist
 sam@sam-Precision-WorkStation-T7500[cdhit_picked_otus] wc ab.dist [ 8:33AM]
 3231 1085105 30
 sed '1i\0.03\t3231' ab.dist>ad.dist
 perl onesingle.pl ad.dist ah.dist # collapse into a single line
 cat ah.count ah.dist>aa.list
 Then use mothur to count the number of reads per OTU per sample (no way around it: since unique.seqs was run in mothur above, this step has to be done there too):
 mothur > make.shared(list=aa.list, count=aaa.count, label=0.03)
 0.03
 Output File Names:
 aa.shared
 sed 's/01519_//g' bb.filter.unique.precluster.pick.nr_v119.wang.pick.taxonomy|sed 's/000000000-//g'>aa.taxonomy
 mothur>classify.otu(list=aa.list, count=aaa.count, taxonomy=aa.taxonomy, label=0.03)
 Output File Names:
 aa.0.03.cons.taxonomy
 aa.0.03.cons.tax.summary
 Alpha diversity analysis:
 mothur > summary.single(shared=aa.shared, calc=sobs-ace-chao-shannon-coverage);
 Output File Names:
 aa.groups.summary
 ## get.oturep(column=ad.dist, count=aa.count, fasta=good.fasta, list=aa.list, label=0.03)
 Without a proper dist file, mothur's get.oturep cannot be used to extract the representative sequences.
 sed 's/01519_//g' good.fasta |sed 's/000000000//g'>better.fasta
 Verified: the OTUs come out of qiime in the same order as out of mothur; the only difference is the leading 0.
 Extract rep.fasta:
 Open aaa.dist in excel, delete the 0s, and append the OTU count to the end of column 1 # qiime numbers OTUs from 0, mothur from 1
 /usr/lib/qiime/bin/pick_rep_set.py -i aaa.dist -f better.fasta -o rep.fna
 sed 's/;/\t/g' aa.0.03.cons.taxonomy|sed 's/(/\t/g'|sed 's/)//g' >taxonomy
 sed 's/\t/;/g' taxonomy >abb
 sed 's/;;/;/g' abb>abbb
 perl /sam/usefulscript/rdpformat3.pl taxonomy abd;

What follows is the phylogenetic, alpha, beta, and other diversity analyses.

In short, mothur can also run the entire pipeline in one shot:

$ ./mothur "#make.contigs(file=stability.files, processors=8); screen.seqs(fasta=current, maxambig=0, maxlength=275); unique.seqs(); count.seqs(name=current, group=current); align.seqs(fasta=current, reference=silva.v4.fasta); screen.seqs(fasta=current, count=current, start=1968, end=11550, maxhomop=8); filter.seqs(fasta=current, vertical=T, trump=.); unique.seqs(fasta=current, count=current); pre.cluster(fasta=current, count=current, diffs=2); chimera.uchime(fasta=current, count=current, dereplicate=t); remove.seqs(fasta=current, accnos=current); classify.seqs(fasta=current, count=current, reference=trainset9_032012.pds.fasta, taxonomy=trainset9_032012.pds.tax, cutoff=80); remove.lineage(fasta=current, count=current, taxonomy=current, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota); remove.groups(count=current, fasta=current, taxonomy=current, groups=Mock); cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=4, cutoff=0.15); make.shared(list=current, count=current, label=0.03); classify.otu(list=current, count=current, taxonomy=current, label=0.03); phylotype(taxonomy=current); make.shared(list=current, count=current, label=1); classify.otu(list=current, count=current, taxonomy=current, label=1);"

References:

An interesting blog
