Genome assembly information
Overview
Teaching: 0 min
Exercises: 20 min
Questions
How can you obtain and interpret genome assembly information from NCBI?
Objectives
Practice using the command line to download files from the internet and examine their content
Interpret a genome assembly overview obtained from the internet
Genome assembly information
In our exercises, we will examine the quality of publicly available Z. tritici genome assemblies, which we obtain from NCBI. Unfortunately, generating a de novo eukaryotic genome assembly is beyond the scope of the exercises; in contrast to bacterial genomes, the assembly of eukaryotic genomes requires significantly more computational power and time.
The genome assemblies of some Z. tritici isolates downloaded from NCBI GenBank can be found in data/fungalgenomics_seidl/assemblies/ (state: 2018). New genome assemblies are continuously being generated and deposited at NCBI.
To get an overview of the number of genome assemblies currently at NCBI, we can download a summary file from NCBI with information on all genomes belonging to fungi.
$ wget https://ftp.ncbi.nlm.nih.gov/genomes/genbank/fungi/assembly_summary.txt
Exercise
How many fungal genomes are deposited in the database today?
Solution
cat assembly_summary.txt | grep -v "^#" | wc -l
Inspect the assembly_summary file further. You can look at README_assembly_summary.txt for a description of the columns.
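One way to see which column holds which information is to print the header line with one numbered column name per line. The sketch below runs on a small hand-written stand-in file (the accession and names are invented for illustration), so it works without the download. It assumes that the column names are on the last line starting with a # and that the organism name column is called organism_name; check both against your own copy of the file.

```shell
# Hand-written stand-in for the top of assembly_summary.txt (invented content)
demo=$(mktemp)
{
  printf '#   See the README for a description of the columns in this file.\n'
  printf '#assembly_accession\tbioproject\tbiosample\twgs_master\trefseq_category\ttaxid\tspecies_taxid\torganism_name\n'
  printf 'GCA_000000001.1\tPRJNA1\tSAMN1\t\tna\t1\t1\tExample fungus strain X\n'
} > "$demo"

# Take the last comment line (assumed to be the header), put every
# tab-separated column name on its own line, and number the lines
columns=$(grep "^#" "$demo" | tail -n 1 | tr '\t' '\n' | cat -n)
echo "$columns"

# Look up the position of a single column by its exact name
name_col=$(grep "^#" "$demo" | tail -n 1 | tr '\t' '\n' | grep -n -x "organism_name" | cut -d: -f1)
echo "organism_name is column $name_col"

rm -f "$demo"
```

To run this on the downloaded file, replace "$demo" with assembly_summary.txt and skip the lines that create and remove the stand-in file.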
Exercise
What does the term ‘reference genome’ refer to?
Solution
Reference genome: For each defined species with genome assemblies included in RefSeq, one genome is designated as the “reference”, with rare exceptions. This set of reference genomes is a compact, normalized, and taxonomically diverse view of the RefSeq collection that can be used for the taxonomic identification and characterization of novel sequences. For eukaryotes, a reference genome is computationally or manually selected from among the best genomes available for each species, using the following selection hierarchy:
i) The genome is not overtly contaminated or from an unverified species.
ii) The genome is not overtly partial.
iii) Genomes included in the current RefSeq set are preferred if available for the species. A particular genome is manually selected for RefSeq annotation and considers community usage and genome quality. For species with more than one RefSeq genome (e.g., human), one is manually selected as the reference genome. In exceptional cases there may be more than one reference genome selected for some species (e.g., dog and dingo, two subspecies of Canis lupus).
iv) Highest contig N50, binned based on rounded log10 values.
v) Genomes with gapless chromosomes are preferred.
vi) Genomes with a full set of assembled chromosomes are preferred.
vii) Genomes with at least one assembled chromosome are preferred.
viii) Genomes with at least some sequences assigned to chromosomes (aka unlocalized scaffolds) are preferred.
ix) Highest unbinned N50 value (either contig or scaffold).
We can now try to find out for how many different species there is at least one genome sequenced. We can do this using the command line.
$ cat assembly_summary.txt | grep -v "^#" | cut -f8 | cut -f1,2 -d" " | sort -u | wc -l
With grep you can extract the lines matching a specific expression. The -v option inverts the match, i.e., it returns all lines that do not match the expression. In our case, we exclude lines starting with a # because these are comment lines. We then extract the eighth column with cut -f8 (why? Tip: take a look at the information stored in each column) and retain only the first two space-separated entries with cut -f1,2 -d" ". Lastly, sort -u makes the results unique and wc -l counts the number of remaining lines.
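To see what each step of this pipeline does, it can help to run it on a handful of rows where the answer is easy to check by eye. The following is a minimal sketch on hand-written rows: the accessions acc1 to acc4 and the "." placeholders are invented, and only column 8 carries meaningful content.

```shell
# Hand-written rows for illustration; columns 2-7 are "." placeholders
sample=$(mktemp)
{
  printf '# a comment line, removed by grep -v\n'
  printf 'acc1\t.\t.\t.\t.\t.\t.\tZymoseptoria tritici IPO323\n'
  printf 'acc2\t.\t.\t.\t.\t.\t.\tZymoseptoria tritici ST99CH_1A5\n'
  printf 'acc3\t.\t.\t.\t.\t.\t.\tZymoseptoria brevis Zb18110\n'
  printf 'acc4\t.\t.\t.\t.\t.\t.\tFusarium oxysporum\n'
} > "$sample"

# After the second cut only "Genus species" is left, so the strain names are gone
grep -v "^#" "$sample" | cut -f8 | cut -f1,2 -d" "

# Same pipeline as in the lesson: deduplicate and count the species
n_species=$(grep -v "^#" "$sample" | cut -f8 | cut -f1,2 -d" " | sort -u | wc -l)
echo "$n_species"   # 3 for these four rows: the two Z. tritici strains collapse into one species

rm -f "$sample"
```

Removing the last one or two commands from the pipeline and looking at the intermediate output is a quick way to check that each step behaves as expected before running it on the full file.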
Exercise
Can you modify the code to obtain the fungal genus for which the most genome assemblies have been produced so far? Explain the logic of the code.
Solution
cat assembly_summary.txt | grep -v "^#" | cut -f8 | cut -f1 -d" " | sort | uniq -c | sort -k1,1nr | head
The logic is similar to the command above, but we only keep the genus (the first entry in the second cut). uniq -c then counts how often each genus occurs, and the final sort orders these counts numerically in descending order, so that the genus with the most assemblies appears at the top of the list.
We can now use very similar code to obtain the number of Zymoseptoria genome assemblies currently available.
$ cat assembly_summary.txt | grep -v "^#" | cut -f8 | cut -f1,2 -d" " | sort | uniq -c | sort -k1,1nr | grep "Zymoseptoria"
Exercise
Can you think of a command that would give you the number of complete (chromosome-level) genome assemblies?
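One possible approach, given as a sketch and not as an official solution: tally the values of the assembly level column and then count the levels of interest. This assumes that the assembly level is stored in column 12 and uses the values "Complete Genome", "Chromosome", "Scaffold" and "Contig"; verify both in README_assembly_summary.txt. Whether "Chromosome" and "Complete Genome" should be counted together is a judgment call. The rows below are invented placeholders so that the example runs without the download.

```shell
# Invented rows: column 8 is a placeholder name, column 12 the assumed assembly level
sample=$(mktemp)
{
  printf '# header line, skipped\n'
  printf 'a1\t.\t.\t.\t.\t.\t.\tGenus one\t.\t.\t.\tComplete Genome\n'
  printf 'a2\t.\t.\t.\t.\t.\t.\tGenus two\t.\t.\t.\tChromosome\n'
  printf 'a3\t.\t.\t.\t.\t.\t.\tGenus three\t.\t.\t.\tScaffold\n'
  printf 'a4\t.\t.\t.\t.\t.\t.\tGenus four\t.\t.\t.\tContig\n'
  printf 'a5\t.\t.\t.\t.\t.\t.\tGenus five\t.\t.\t.\tChromosome\n'
} > "$sample"

# Overview: how many assemblies are there per assembly level?
levels=$(grep -v "^#" "$sample" | cut -f12 | sort | uniq -c | sort -k1,1nr)
echo "$levels"

# Count only the assemblies at chromosome level or better
n_chrom=$(grep -v "^#" "$sample" | cut -f12 | grep -c -E "^(Chromosome|Complete Genome)$")
echo "$n_chrom"

rm -f "$sample"
```

On the real data, replace "$sample" with assembly_summary.txt. Looking at the overview first shows which level names actually occur in the file before you decide what to count.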
Key Points
You can obtain assembly information from NCBI
You can subset this information and determine the number of genome assemblies