Block 3 - COOI: Recent ploidy changes in the Saccharomycotina (part 3)
Overview
Teaching: 0 min
Exercises: 60 min
Questions
How can you determine changes in ploidy using next-generation sequencing data?
Objectives
Generate allele frequency plots from short-read data
Analyze and interpret allele frequencies to estimate ploidies
Recent ploidy changes in the Saccharomycotina
Baker's yeast belongs to the Saccharomycotina subphylum, a taxonomic group that contains many species colloquially referred to as budding yeasts. Most budding yeasts live as saprophytes, but some are important pathogens of plants and animals, including humans. Furthermore, many members of this subphylum are commonly exploited in agriculture, food production, or biotechnology. Interestingly, many of these important species are hybrids formed by the fusion of two genetically distinct parents. Besides the doubling of the number of chromosomes by hybridization (also referred to as allopolyploidy), the genome of a single species can also double (autopolyploidy). Changes in ploidy, whether due to allo- or autopolyploidy, can generate important variation and are thought to play an important role in fueling adaptive genome evolution.
3. Identify ploidy variation using allele frequencies
In addition to analyzing k-mer profiles, we can also assess ploidy variation using allele frequencies at biallelic single-nucleotide polymorphisms (SNPs). This approach assumes that the two alleles at heterozygous biallelic SNPs occur at ratios that depend on the ploidy level, that is, 0.5/0.5 in diploids, 0.33/0.67 in triploids, and a mixture of 0.25/0.75 and 0.5/0.5 in tetraploids. In this episode, we will detect and analyze ploidy differences in three S. cerevisiae strains.
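The expected ratios follow from simple counting: with n chromosome copies, an allele carried on k of them (1 ≤ k < n) should appear in roughly k/n of the reads. A minimal Python sketch of this reasoning is given below; the function name is ours and is only meant for illustration.

```python
from fractions import Fraction

def expected_allele_frequencies(ploidy):
    """Return the allele frequencies expected at heterozygous biallelic SNPs.

    With `ploidy` chromosome copies, an allele present on k copies
    (1 <= k < ploidy) is expected at frequency k / ploidy.
    """
    return sorted({Fraction(k, ploidy) for k in range(1, ploidy)})

for ploidy in (2, 3, 4):
    freqs = ", ".join(f"{float(f):.2f}" for f in expected_allele_frequencies(ploidy))
    print(f"ploidy {ploidy}: {freqs}")
```

In real data these values show up as peaks in a histogram rather than as exact numbers, because the read counts at each site are subject to sampling noise.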
To determine the allele frequencies for the three strains, we need a .bam file that contains the mapping of the next-generation sequencing reads to the S. cerevisiae S288C reference genome assembly. Note: we will show you in COOIII how to map sequencing reads to a reference genome assembly.
Use commands such as cd, ls, and pwd to locate the mapped next-generation sequencing data from S. cerevisiae in the data storage folder (~/data_bb3bcg20/Block3/COOI/intermediate/). Then create a symbolic link to the data in your own folder with the ln -s command.
To analyze the allele frequencies based on the bam files, we will use nQuire. nQuire extracts the necessary information about the allele frequencies at biallelic SNP sites directly from the bam file and stores this information in a binary output file.
$ nQuire create -b ScereCBS2919.bam -o ScereCBS2919
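The same call has to be repeated for each of the three strains. As a rough sketch, the commands could be assembled in Python as shown below. It assumes that the bam files of all three strains follow the same naming pattern as ScereCBS2919.bam (an assumption; check your linked files), and it only runs nQuire when the program and the bam file are actually found.

```python
import shutil
import subprocess
from pathlib import Path

# Strain names as used in this episode; the <strain>.bam naming is an assumption.
STRAINS = ["ScereCBS7837", "ScereCBS2919", "ScereCBS9564"]

def nquire_create_command(strain):
    """Build the `nQuire create` call for one strain (same form as above)."""
    return ["nQuire", "create", "-b", f"{strain}.bam", "-o", strain]

commands = [nquire_create_command(s) for s in STRAINS]
nquire_available = shutil.which("nQuire") is not None

for strain, cmd in zip(STRAINS, commands):
    print(" ".join(cmd))
    # Only execute when nQuire is installed and the bam file is present;
    # otherwise the command is just printed for inspection.
    if nquire_available and Path(f"{strain}.bam").exists():
        subprocess.run(cmd, check=True)
```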
The binary nQuire output file contains information on the individual allele frequencies. We can visualize the distribution of allele frequencies by using the nQuire histo command.
Exercise
Describe the distribution for each strain. What does the distribution suggest with respect to the ploidy of each of the three S. cerevisiae strains?
Solution
nQuire histo ScereCBS2919.bin
The distributions are clearly unimodal, bimodal, and trimodal, indicating diploid, triploid, and tetraploid yeast strains, respectively; for example, CBS2919 is a triploid strain.
Instead of visualizing the distribution on the command line using nQuire histo, we can also use nQuire view to extract the base counts per biallelic SNP as a text file and then use this information to generate our own visualization in Python or R. This is useful, for example, for presentations or publications, and it gives you greater control over the data.
The text file generated by nQuire view contains the coverage (total number of reads) and the number of reads for each of the two alternative bases (biallelic sites) at a given position in the genome. This information is sufficient to generate a histogram similar to the one produced by nQuire histo.
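To make the link between read counts and allele frequencies explicit, here is a small sketch in plain Python. The three rows are made-up numbers in the column layout described above (coverage, reads for the first base, reads for the second base); they are not taken from any of the strains, and the function name is ours.

```python
import io

# Made-up example rows: coverage, reads for base 1, reads for base 2.
example = io.StringIO("30\t10\t20\n45\t15\t30\n28\t9\t19\n")

def allele_frequencies(handle):
    """Yield (freq_base1, freq_base2) for every site with non-zero coverage."""
    for line in handle:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        coverage, count1, count2 = (int(x) for x in fields[:3])
        if coverage > 0:
            yield count1 / coverage, count2 / coverage

freqs = list(allele_frequencies(example))
for f1, f2 in freqs:
    print(f"{f1:.2f}\t{f2:.2f}")
```

For a real file, replace the io.StringIO object with an opened file handle; a histogram of all resulting frequencies then gives the distribution discussed above.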
Exercise
Extract the information from the binary file using nQuire view and use Python/R to generate a histogram that shows the frequency distribution for each base (major, minor).
Solution
nQuire view ScereCBS2919.bin > ScereCBS2919.base.txt
# plot with Python
import matplotlib
matplotlib.use('Agg')  # Non-GUI backend
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# columns: 0 = coverage, 1 and 2 = read counts for the two bases
bases = pd.read_csv("ScereCBS2919.base.txt", sep="\t", header=None)
sns.distplot(bases[1] / bases[0])  # minor allele
sns.distplot(bases[2] / bases[0])  # major allele
plt.savefig('ScereCBS2919ploidy.png')
plt.close()  # Close to free memory
nQuire uses Gaussian mixture models to describe the histogram as a mixture of distributions (one to three, depending on the ploidy). Within a log-likelihood framework, nQuire compares the likelihood of the fixed models (diploid, triploid, and tetraploid) with that of a free model to indicate which of the three fits the data best. nQuire lrdmodel performs this analysis and returns the log-likelihood for each of the four models as well as the delta-log-likelihood for each ploidy. The fixed model with the smallest delta-log-likelihood is closest to the free model and is therefore the best-supported ploidy.
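The idea of comparing fixed mixture models can be illustrated with a deliberately simplified sketch. This is not nQuire's implementation: it uses simulated triploid-like frequencies, assumes equal mixture weights and one fixed standard deviation (0.05) for all components, and leaves out the free model, so it simply reports which fixed model has the highest log-likelihood.

```python
import math
import random

# Expected allele-frequency peaks per ploidy (taken from the text above).
FIXED_MEANS = {
    "diploid": [0.5],
    "triploid": [1 / 3, 2 / 3],
    "tetraploid": [0.25, 0.5, 0.75],
}

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def mixture_log_likelihood(freqs, means, sd=0.05):
    """Log-likelihood of freqs under an equal-weight Gaussian mixture.

    Equal weights and a shared, fixed sd are simplifying assumptions.
    """
    weight = 1 / len(means)
    total = 0.0
    for x in freqs:
        density = sum(weight * gaussian_pdf(x, m, sd) for m in means)
        total += math.log(density)
    return total

# Simulated triploid-like data: frequencies scattered around 1/3 and 2/3.
rng = random.Random(42)
simulated = [rng.gauss(rng.choice([1 / 3, 2 / 3]), 0.05) for _ in range(2000)]

scores = {name: mixture_log_likelihood(simulated, means)
          for name, means in FIXED_MEANS.items()}
best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}\t{score:.1f}")
print("best fixed model:", best)
```

nQuire additionally fits a free model and reports, for each ploidy, the difference between the free and the fixed log-likelihood; the ranking of the fixed models is the same in both views.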
Exercise
Prepare and execute nQuire lrdmodel for each strain, then answer the following questions.
- Which of the delta-log-likelihoods is the smallest?
- Based on the nQuire analyses, which of the three strains is diploid, triploid, or tetraploid?
- How does this correlate with your results in COOI (part 2) (smudgeplot)?
Solution
- nQuire lrdmodel ScereCBS7837.bin #-> diploid
- nQuire lrdmodel ScereCBS2919.bin #-> triploid
- nQuire lrdmodel ScereCBS9564.bin #-> tetraploid
For ScereCBS7837 and ScereCBS2919, which were analysed in part 2, these predictions match very well.
Key Points
You can work with next-generation sequencing data
You can detect (recent) ploidy changes in fungi using next-generation sequencing data