Block 3 - COOI: Recent ploidy changes in the Saccharomycotina (part 3)

Overview

Teaching: 0 min
Exercises: 60 min

Questions

How can you determine changes in ploidy using next-generation sequencing data?

Objectives

Generate allele frequency plots from short-read data

Analyze and interpret allele frequencies to estimate ploidies

Recent ploidy changes in the Saccharomycotina

Baker's yeast belongs to the Saccharomycotina subphylum, a taxonomic group that contains many species colloquially referred to as budding yeasts. Budding yeasts mainly thrive as saprophytes, but some are important pathogens of plants and animals, including humans. Furthermore, many members of this subphylum are commonly exploited in agriculture, food production, or biotechnology. Interestingly, many of these important species are hybrids formed by the fusion of two genetically distinct parents. Besides the doubling of the number of chromosomes by hybridization (also referred to as allopolyploidy), the genome of a single species can also double (autopolyploidy). Changes in ploidy, due to either allo- or autopolyploidy, can generate important variation and are thought to play an important role in fueling adaptive genome evolution.

3. Identify ploidy variation using allele frequencies

Besides analyzing k-mer profiles, we can also assess ploidy variation using allele frequencies at biallelic single-nucleotide polymorphisms (SNPs). This approach assumes that the two alleles at a biallelic SNP occur at different ratios for different ploidy levels, that is, 0.5/0.5 in diploids, 0.33/0.67 in triploids, and a mixture of 0.25/0.75 and 0.5/0.5 in tetraploids (a tetraploid can carry one, two, or three copies of an allele at a heterozygous site). In this episode, we will detect and analyze ploidy differences in three S. cerevisiae strains.
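The ratios listed above follow from counting how many of the chromosome copies can carry one allele at a heterozygous site. A minimal sketch of that arithmetic is given below; the function name expected_allele_frequencies is hypothetical and only serves as an illustration.

```python
from fractions import Fraction


def expected_allele_frequencies(ploidy):
    """Return the allele frequencies possible at a heterozygous biallelic site.

    With `ploidy` chromosome copies, an allele can sit on 1 .. ploidy-1 copies,
    so its expected frequency is k / ploidy for each such k.
    """
    return sorted({Fraction(k, ploidy) for k in range(1, ploidy)})


# Print the expected peaks for diploids, triploids, and tetraploids.
for ploidy in (2, 3, 4):
    peaks = expected_allele_frequencies(ploidy)
    print(ploidy, [str(p) for p in peaks])
```

For a tetraploid, the frequencies 1/4 and 3/4 belong to the same sites (one allele on one copy, the other on three), while 1/2 comes from sites where each allele sits on two copies.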

To determine the allele frequencies for the three strains, we need a .bam file that contains the mapping of the next-generation sequencing reads to the S. cerevisiae S288C reference genome assembly. Note: we will show you in COOIII how to map sequencing reads to a reference genome assembly.

Use commands such as cd, ls, and pwd to locate the mapped next-generation sequencing data from S. cerevisiae in the data storage folder (~/data_bb3bcg20/Block3/COOI/intermediate/). Then create a symbolic link to the data in your own folder with the ln -s command.

To analyze the allele frequencies based on the bam files, we will use nQuire. nQuire extracts the necessary information about the allele frequencies at biallelic SNP sites directly from the bam file and stores this information in a binary output file.

$ nQuire create -b ScereCBS2919.bam -o ScereCBS2919

The binary nQuire output file contains information on the individual allele frequencies. We can visualize the distribution of allele frequencies with the nQuire histo command.

Exercise

Describe the distribution for each strain. What does the distribution suggest about the ploidy of each of the three S. cerevisiae strains?

Solution

nQuire histo ScereCBS2919.bin

The distributions are clearly unimodal, bimodal, or trimodal, indicating diploid, triploid, and tetraploid yeast strains, respectively; e.g. CBS2919 is a triploid strain.

Instead of visualizing the distribution on the command line using nQuire histo, we can also use nQuire view to extract the base counts per biallelic SNP as a text file and then use this information to generate our own visualization in Python or R. This is useful, for example, for presentations or publications, and it gives you greater control over the data.

The text file generated by nQuire view contains the coverage (total number of reads), and the number of reads for both alternative bases (biallelic sites) at a given position in the genome. This information is sufficient to generate a similar histogram as using nQuire histo.

Exercise

Extract the information from the binary file using nQuire view and use Python/R to generate a histogram that shows the frequency distribution for each base (major, minor).

Solution

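# run in the shell: write the base counts to a text file
nQuire view ScereCBS2919.bin > ScereCBS2919.base.txt

# plot with Python
import matplotlib
matplotlib.use('Agg')  # non-GUI backend, so the script also runs without a display
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# columns: 0 = coverage, 1 and 2 = read counts of the two bases
bases = pd.read_csv("ScereCBS2919.base.txt", sep="\t", header=None)
# histplot replaces distplot, which is deprecated in recent seaborn versions
sns.histplot(bases[1] / bases[0], kde=True, label="minor")
sns.histplot(bases[2] / bases[0], kde=True, label="major")
plt.xlabel("allele frequency")
plt.legend()
plt.savefig('ScereCBS2919ploidy.png')
plt.close()  # close the figure to free memory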
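To inspect the frequencies without any plotting library, the same table can also be binned by hand. The sketch below assumes the column layout described above (coverage, then the read counts of the two bases, separated by tabs); the example rows are made up for illustration and the function names are hypothetical.

```python
def allele_frequencies(line):
    """Return both allele frequencies for one line: coverage, count_a, count_b."""
    coverage, count_a, count_b = (int(x) for x in line.split("\t")[:3])
    return count_a / coverage, count_b / coverage


def bin_counts(freqs, n_bins=10):
    """Count how many frequencies fall into each of n_bins equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for f in freqs:
        idx = min(int(f * n_bins), n_bins - 1)  # a frequency of 1.0 goes in the last bin
        counts[idx] += 1
    return counts


# Made-up example rows (not real data): coverage, count_a, count_b
rows = ["30\t10\t20", "40\t13\t27", "24\t8\t16"]
freqs = [f for row in rows for f in allele_frequencies(row)]
hist = bin_counts(freqs)
print(hist)
```

In these example rows, the frequencies cluster around 0.33 and 0.67, which is the pattern expected for a triploid.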

nQuire uses Gaussian mixture models to describe the histogram as a mixture of different distributions (one to three, depending on the ploidy). Within a log-likelihood framework, nQuire compares the likelihood of three fixed models (diploid, triploid, and tetraploid) with that of a free model to indicate which of the three fits the data best. nQuire lrdmodel performs this analysis and returns the log-likelihood for each of the four models as well as the delta-log-likelihood for each ploidy; the fixed model with the smallest delta-log-likelihood is closest to the free model and is therefore the best-supported ploidy.
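The idea of scoring fixed mixture models can be illustrated with a small sketch. This is not nQuire's implementation: it assumes equal component weights and a fixed standard deviation of 0.05, it omits the free model, and it uses simulated triploid-like frequencies instead of real data. All names in it are hypothetical.

```python
import math
import random

# Expected allele-frequency peaks for each fixed model.
FIXED_MEANS = {
    "diploid": [1 / 2],
    "triploid": [1 / 3, 2 / 3],
    "tetraploid": [1 / 4, 1 / 2, 3 / 4],
}


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def log_likelihood(freqs, means, sd=0.05):
    """Log-likelihood of freqs under an equal-weight Gaussian mixture (assumed sd)."""
    weight = 1 / len(means)
    total = 0.0
    for f in freqs:
        density = sum(weight * normal_pdf(f, m, sd) for m in means)
        total += math.log(density + 1e-300)  # guard against log(0)
    return total


def best_fixed_model(freqs):
    """Score each fixed model and return the best name plus all scores."""
    scores = {name: log_likelihood(freqs, means) for name, means in FIXED_MEANS.items()}
    return max(scores, key=scores.get), scores


# Simulated triploid-like data: frequencies scattered around 1/3 and 2/3.
rng = random.Random(0)
freqs = [rng.gauss(rng.choice([1 / 3, 2 / 3]), 0.04) for _ in range(2000)]
best, scores = best_fixed_model(freqs)
print(best)
```

In nQuire itself, each fixed model is compared against a free model fitted to the data, and the resulting delta-log-likelihoods are reported; the sketch only shows why data with peaks near 0.33 and 0.67 score highest under the triploid model.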

Exercise

Prepare and execute nQuire lrdmodel for each strain.

Which of the delta-log-likelihoods is the smallest for each strain?

Based on the nQuire analyses, which of the three strains is diploid, triploid, or tetraploid?

How does this correlate with your results in COOI (part 2) (smudgeplot)?

Solution

nQuire lrdmodel ScereCBS7837.bin #-> diploid

nQuire lrdmodel ScereCBS2919.bin #-> triploid

nQuire lrdmodel ScereCBS9564.bin #-> tetraploid

For ScereCBS7837 and ScereCBS2919, which were analysed in part 2, these predictions match very well.

Key Points

You can work with next-generation sequencing data

You can detect (recent) ploidy changes in fungi using next-generation sequencing data

Genome Bioinformatics - Computer practicals
