COO-I-4. Binning

Overview

Teaching: 0 min
Exercises: 60 min
Questions
  • How are contigs clustered into bins in a metagenomic assembly?

Objectives
  • Gain experience with binning software output

  • Interpret completeness and contamination scores associated with metagenomic bins

Binning

Next, we need to group these contigs into bins. Several binning tools are available for this. MetaWRAP includes a binning pipeline that runs multiple binning tools and evaluates their results jointly.

The runtime of MetaWRAP depends strongly on the complexity of the input assembly. Here, again, you would have to wait for a large fraction of the COO to obtain these results. To avoid this, MetaWRAP has already been run with the binning tools CONCOCT, MaxBin2 and MetaBAT2. Again, fetch the results from the appropriate folder:

$ cd ~/GenomeBioinformatics/Block1/COO-I/Results/
$ mkdir 05_binning
$ cd 05_binning
$ ln -s ~/data_bb3bcg20/Block1/COOI/Intermediate_files/Binning/metawrap/ .
$ cd metawrap

Exercise: How many bins were generated by CONCOCT, MaxBin2 and MetaBAT2?

Solution

ls -l *_bins/

for f in *_bins; do echo $f; ls $f/*.fa | wc -l; done
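The same loop pattern can be extended to look inside each bin. The sketch below counts the contigs (FASTA header lines) in every bin file of a directory, which is useful later when asking whether a bin consists of a single contig. The function name contigs_per_bin is made up for this illustration, and the sketch assumes that bin files end in ".fa", as in the command above.

```shell
# Count sequences (FASTA header lines starting with ">") in each bin file
# of a directory. Assumes bin files carry the ".fa" extension.
contigs_per_bin() {
  for fasta in "$1"/*.fa; do
    [ -e "$fasta" ] || continue            # skip if the glob matched nothing
    printf '%s\t%s\n' "$(basename "$fasta")" "$(grep -c '^>' "$fasta")"
  done
}

# Usage on the linked results:
# for d in *_bins; do echo "$d"; contigs_per_bin "$d"; done
```

Each output line holds a bin file name and its contig count, separated by a tab.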

Assessing bin quality

MetaWRAP assesses the quality of each bin using CheckM. It does this by comparing the contigs within each bin to a comprehensive database of marker genes from known reference genomes, and from this it estimates the completeness, contamination, and strain heterogeneity of each bin. Completeness indicates the proportion of marker genes that are present in the bin. Since these marker genes are meant to be essential genes, this score is often used as a proxy for how much of the genome has been recovered. Contamination measures the presence of sequences from other organisms within the bin, highlighting potential inaccuracies in the binning process. Strain heterogeneity assesses the genetic variability within a bin, which is crucial for distinguishing closely related strains.

The files with suffix “.stats” include CheckM scores, plus a predicted taxonomic assignment of each bin based on similarity to reference sequences. Compare the results of the three binning tools.
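To make the two main scores concrete, the sketch below computes simplified versions of them from a list of marker-gene copy counts. It is an illustration only, not CheckM's actual algorithm (CheckM works with lineage-specific sets of co-located markers). The function name marker_scores, the marker names and the counts are all made up.

```shell
# Simplified illustration; NOT CheckM's exact calculation.
# Input (stdin): two whitespace-separated columns,
# marker name and number of copies found in the bin.
marker_scores() {
  awk '
    { total++ }
    $2 >= 1 { present++ }              # marker found at least once
    $2 > 1  { extra += $2 - 1 }        # additional copies hint at contamination
    END {
      if (total == 0) { print "no markers"; exit 1 }
      printf "completeness=%.1f contamination=%.1f\n",
             100 * present / total, 100 * extra / total
    }'
}

# Hypothetical counts for five marker genes
printf '%s\n' "markerA 1" "markerB 1" "markerC 0" "markerD 2" "markerE 1" | marker_scores
# prints: completeness=80.0 contamination=20.0
```

Four of the five markers are present (80% completeness), and one marker occurs twice, giving one extra copy out of five markers (20% contamination).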

Exercise: What is the expected quality of the obtained bins?

  • Which software provided more high-quality bins?

  • Do you expect that any of these MAGs is represented by a complete, single-contig assembly?

Note: for an easier visualisation of tab-delimited .tsv files in the command line, use the less wrapper coltab.sh:

$ ~/data_bb3bcg20/bin/scripts/coltab.sh $file

Solution

for f in *stats; do echo $f; cat $f; echo; done | ~/data_bb3bcg20/bin/scripts/coltab.sh

All three tools recover 2 bins with completeness scores above 99% and very low contamination scores. These correspond to bins classified by CheckM within the genus Clostridium (bacterium) and the phylum Euryarchaeota (archaea). In all three cases, the Euryarchaeota genome is formed by one contig that spans all or almost all of the bin (N50 is equal to, or almost equal to, the total size). Besides these two bins, all three binning tools recover additional bins, but with relatively poor completeness scores.
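Reading the tables by eye works for a handful of bins, but the same selection can be scripted. The sketch below lists bins that pass a completeness and a contamination threshold. It assumes the .stats files are tab-delimited with a header row containing columns named "bin", "completeness" and "contamination"; check this against your own files. The function name high_quality_bins and the default thresholds (90 and 5) are illustrative choices, not values defined by the course.

```shell
# List bins at or above a completeness threshold and at or below a
# contamination threshold. Columns are located by header name, so the
# column order in the table does not matter.
high_quality_bins() {
  awk -F'\t' -v min_comp="${2:-90}" -v max_cont="${3:-5}" '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i   # map header names to column numbers
      next
    }
    $col["completeness"] + 0 >= min_comp && $col["contamination"] + 0 <= max_cont {
      print $col["bin"] "\t" $col["completeness"] "\t" $col["contamination"]
    }' "$1"
}

# Usage on the linked results:
# for f in *.stats; do echo "$f"; high_quality_bins "$f" 90 5; done
```

Looking columns up by name rather than by position keeps the sketch usable if a table carries extra columns.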

Bin refinement

Bin refinement in metagenomics is an advisable step aimed at improving the accuracy and completeness of genome bins generated during the initial binning process. Despite the advancements in binning algorithms, the resulting bins may still contain fragmented or misclassified sequences, leading to inaccuracies in downstream analysis. Bin refinement may involve several strategies, including manual curation, re-binning, and leveraging complementary data sources such as metatranscriptomic or single-cell sequencing data. Here, we take advantage of the bin refinement pipeline in MetaWRAP to improve our initial results.

MetaWRAP takes the bins produced by each binning tool, together with the joint sets formed from every combination of these results (7 sets in total). By comparing these sets, it consolidates the bins, retaining appropriate contigs that were recovered by some binning efforts but not by all. Here, MetaWRAP keeps only the resulting bins with over 70% completeness and under 10% contamination/redundancy, as reflected in the name of the output file (metawrap_70_10_bins).

Exercise: Which bins, and from which binning tools, does MetaWRAP select as the best results?
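The count of 7 sets follows from taking every non-empty combination of three inputs (2^3 - 1 = 7). The sketch below enumerates these combinations for any list of labels. The function name bin_set_combinations and the labels A, B and C are illustrative, and the sketch only lists combinations; it does not reproduce how MetaWRAP builds or scores the sets.

```shell
# Print every non-empty combination of the labels given as arguments.
bin_set_combinations() {
  echo "$@" | awk '{
    # Each value of mask selects a subset: bit i-1 set means label i is included.
    for (mask = 1; mask < 2 ^ NF; mask++) {
      combo = ""
      for (i = 1; i <= NF; i++)
        if (int(mask / 2 ^ (i - 1)) % 2 == 1) combo = combo $i
      print combo
    }
  }'
}

bin_set_combinations A B C
```

For three labels this prints A, B, AB, C, AC, BC and ABC, one per line.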

cd ~/GenomeBioinformatics/Block1/COO-I/Results

mkdir 06_binRefinement; cd 06_binRefinement

ln -s ~/data_bb3bcg20/Block1/COOI/Intermediate_files/Bin_refinement/metawrap_bin_refinement/ .

cd metawrap_bin_refinement

Solution

~/data_bb3bcg20/bin/scripts/coltab.sh metawrap_70_10_bins.stat

MetaWRAP chose a solution with 4 bins: 2 from binner A (CONCOCT) and 2 from binner C (MaxBin2).

Let’s get one final insight: how deeply was each of the microbes sequenced? For this analysis, we use the bins selected by MetaWRAP in the last step. Minimap2, a fast read-mapping tool, was used to map the long reads to each bin. The results are stored in BAM files, a common binary read-mapping format that records the mapping location and mapping statistics of each read in the input file. Let’s analyse the results using samtools.

In your Results folder, create a directory called “07_depthAnalysis” and move into it. Then link the BAM files into your working directory and analyse sequencing depth using samtools:

cd ~/GenomeBioinformatics/Block1/COO-I/Results
mkdir 07_depthAnalysis; cd 07_depthAnalysis
ln -s ~/data_bb3bcg20/Block1/COOI/Intermediate_files/Bin_refinement/minimap2/ .
cd minimap2
ls -l

for f in *.bam; do
  echo "$f"
  samtools index "$f"
  samtools depth "$f" > "$f.depth"
  perl ~/data_bb3bcg20/bin/scripts/stats.pl <(cut -f3 "$f.depth")
  echo
  echo
done

This code iterates over all ‘.bam’ files generated by minimap2 and analyses them with two samtools commands. First, samtools index generates an index file for each BAM file (a BAM index, or BAI file), which allows quick access to read-mapping files. Then, samtools depth reports the sequencing depth, or vertical coverage, at each position of the reference sequence (by default, positions with zero depth are omitted; the -a option includes them). Finally, we pass the third column of the generated file, which holds the depth values, to the statistics script stats.pl to extract general patterns of coverage depth for each mapping.
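If the course script stats.pl is not at hand, the main summary values can be approximated with standard tools. The sketch below computes the number of positions, the mean and the median from the third column of a samtools depth output file. The function name depth_summary is made up here, and its output format will not match that of stats.pl.

```shell
# Summarise the depth column (third field) of `samtools depth` output.
depth_summary() {
  cut -f3 "$1" | sort -n | awk '
    { d[NR] = $1; sum += $1 }          # values arrive sorted, so d[] is ordered
    END {
      if (NR == 0) { print "no positions"; exit 1 }
      # Median: middle value, or mean of the two middle values for even counts
      median = (NR % 2) ? d[(NR + 1) / 2] : (d[NR / 2] + d[NR / 2 + 1]) / 2
      printf "positions=%d mean=%.2f median=%.1f\n", NR, sum / NR, median
    }'
}

# Usage (file name is illustrative):
# depth_summary sample.bam.depth
```

Because samtools depth omits zero-depth positions by default, these values describe covered positions only, unless the depth file was produced with -a.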

How does mapping depth vary in these bins? What do you think this may mean?

Solution

Median mapping depth is higher for bins 2 and 3 (29-30X), somewhat lower for bin 1 (16X) and low for bin 4 (4X). This may represent different abundances in the sampled microbial community.
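One rough way to turn these depths into relative abundances is to express each bin's median depth as a share of the summed depths. The sketch below does this with the values from the solution above, rounding the depth of bins 2 and 3 to 30X. The function name depth_share is made up. The result is only a crude estimate of relative genome copy number among the four binned organisms; it ignores unbinned sequences.

```shell
# Convert per-bin depths into percentage shares of the total depth.
# Input (stdin): two whitespace-separated columns, bin name and depth.
depth_share() {
  awk '
    { name[NR] = $1; depth[NR] = $2; total += $2 }
    END {
      if (total == 0) { print "total depth is zero"; exit 1 }
      for (i = 1; i <= NR; i++)
        printf "%s\t%.1f%%\n", name[i], 100 * depth[i] / total
    }'
}

# Median depths taken from the solution above (bins 2 and 3 rounded to 30X)
printf '%s\n' "bin1 16" "bin2 30" "bin3 30" "bin4 4" | depth_share
```

With these inputs the shares are 20.0%, 37.5%, 37.5% and 5.0% for bins 1 to 4.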

Preliminary assessment

So far, you have obtained a sample from the environment, enriched it, and sequenced it. Using bioinformatics, we have now assessed the quality of the sequencing reads, assembled them using two different tools, binned the resulting contigs and analysed the quality of the resulting Metagenome-Assembled Genomes (MAGs). You will continue the analysis of this dataset in COO-II.

What is your current assessment of the sample you obtained and further enriched? How many different species do you expect to coexist in it? How diverse is this group of organisms?

Key Points