PHA4GE Bioinformatics Pipelines & Visualization Working Group
Libuit KG, Park D, van Heusden P, Neher R, Kapsak CJ, Southgate J, Bridges D, Mboowa G, Lunn S, Langhorst B
Overview
Genomic analysis of SARS-CoV-2 (SC2) samples is an increasingly critical function for public health laboratories around the world. Integrating the appropriate bioinformatics solutions to support this work, however, can be an overwhelming challenge.
In an attempt to assist this integration process, the Bioinformatics Pipelines and Visualization Working Group of the Public Health Alliance for Genomic Epidemiology (PHA4GE) has drafted this living document to help define the major bioinformatics challenges for SC2 genomic analysis and suggest various open-source and freely available bioinformatics resources to address them.
Please note that the bioinformatics resources listed in this document are simply an attempt to highlight the most accessible solutions in the opinion of our working group and in no way represent a comprehensive list of all available SC2 bioinformatics resources. If this document fails to include a valuable public health resource or in some way mischaracterizes a resource mentioned, we encourage community collaboration through pull requests and/or GitHub issues.
Bioinformatics Challenges for Public Health
The PHA4GE Bioinformatics Pipelines and Visualization Working Group has defined four key public health bioinformatics challenges for genomic analysis of SC2 samples:
- Generating consensus assemblies from PCR tiling NGS data: Tiled amplicon sequencing, for example through the ARTIC V3 protocol, is the most commonly adopted method for generating SC2 sequencing data. These sequencing experiments generate thousands of amplicon reads that represent fragments of the original SC2 genome present in a sample. As a result, one of the initial bioinformatics challenges laboratories face is the assembly of PCR tiling NGS data into a contiguous SC2 genome from which powerful public health insights can be derived, such as lineage typing and genomic epidemiology studies that help inform public health decision making.
- Submitting raw sequence data (fastq), consensus assemblies (fasta), and relevant sample metadata to internationally-accessible databases: Sharing of sample read and assembly data through internationally accessible databases allows insights to be drawn about how the virus is spreading and mutating across the globe; the more freely available these data are to international researchers and public health scientists, the stronger our decision making can be.
- Screening sequenced SC2 samples for variants of concern: The detection of certain genetic variants of the SARS-CoV-2 virus may have a significant impact on the decisions of public health officials. Thus, the ability to accurately and reliably screen for variants of interest (VoI) and variants of concern (VoC), such as B.1.1.7 (Alpha) or B.1.617.2 (Delta), is a critical component of the bioinformatics analysis of SC2 genomes.
- Performing phylogenetic analysis of SC2 datasets: Genetic relatedness as inferred through phylogenetic analysis of SC2 datasets can be a powerful proxy for epidemiological associations that help resolve transmission networks, enable real-time surveillance, provide insights into the variation of SC2 samples over time, and support local outbreak investigations.
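To make the first challenge more concrete, the core of a reference-based consensus call can be sketched as a per-position majority vote over reads that have already been mapped to the reference and primer-trimmed. This is a minimal illustration only, not the method of any pipeline listed in this document: the function name `call_consensus`, the `(start, sequence)` read representation, and the `min_depth` default are assumptions made for the example, and indels, base qualities, and ambiguity codes are ignored.

```python
from collections import Counter


def call_consensus(reference, aligned_reads, min_depth=10):
    """Majority-vote consensus from reads already mapped to `reference`.

    aligned_reads: list of (start, sequence) tuples with a 0-based start
    position on the reference and no insertions or deletions (a
    simplifying assumption for this sketch).
    Positions covered by fewer than `min_depth` reads are masked with N.
    """
    # One base counter per reference position.
    columns = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq.upper()):
            pos = start + offset
            if 0 <= pos < len(reference) and base in "ACGT":
                columns[pos][base] += 1

    consensus = []
    for counts in columns:
        depth = sum(counts.values())
        if depth < min_depth:
            consensus.append("N")  # too little evidence to call a base
        else:
            consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


# Toy example: three reads support a T->A change at position 3,
# and the last two reference positions have no coverage.
toy_reference = "ACGTACGT"
toy_reads = [(0, "ACGA")] * 3 + [(0, "ACGT")] + [(4, "AC")] * 2
print(call_consensus(toy_reference, toy_reads, min_depth=2))  # ACGAACNN
```

Masking low-coverage positions with N rather than falling back to the reference base keeps missing data visible in downstream lineage typing and phylogenetic analysis.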
Open-Access/Source Bioinformatics Solutions & Resources
1. Generating consensus assemblies from PCR tiling NGS data
The bioinformatics resources listed below are open-source pipelines that run on general-purpose, containerized workflow infrastructure to generate consensus SC2 assemblies from PCR tiling NGS data. While some parameters and modules may differ slightly, each pipeline will perform read mapping to the Wuhan-1 reference genome, remove primer regions from the mapped read data, and generate a consensus assembly based on conserved and variant positions identified in the resulting alignment. These resources have been organized into three categories: Terra and Galaxy Workflows, Web-Accessible Software as a Service (SaaS) Solutions, and Command-Line Interface (CLI) tools and are listed in no particular order.
Terra and Galaxy Workflows
- Broad viral-ngs
- Brief Description: The viral-ngs workflow collection contains many tools for viral analysis. Its consensus genome caller, assemble_refbased, should work for any low-diversity microbial genome and is appropriate for viruses stemming from a single point-source outbreak, such as SARS-CoV-2. It accepts Illumina paired, single, or mixed reads, as well as ONT reads, and handles metagenomic or amplicon-based reads with primer trimming.
- Developed/supported by: Broad Institute Viral Genomics
- Documentation: Technical documentation (ReadTheDocs)
- User base: H3Africa West African sites (RUN, KGH, UCAD)
- Workflow language: WDL
- Web/Cloud GUI Platforms: Terra, DNAnexus
- CLI Platforms: Cromwell (local HPC, cloud), miniWDL
- Theiagen’s Public Health Viral Genomics WDL Workflows
- Brief Description: Theiagen’s Public Health Viral Genomics WDL Workflows include four separate WDL workflows (Titan_Illumina_PE, Titan_Illumina_SE, Titan_ClearLabs, and Titan_ONT) that process NGS read data from four different sequencing approaches (Illumina paired-end, Illumina single-end, Clear Labs, and Oxford Nanopore Technology (ONT)) to generate consensus assemblies, produce relevant quality-control metrics for both the input read data and the generated assembly, and assign each sample a lineage and clade designation using Pangolin and NextClade, respectively.
- Developed/supported by: Theiagen Genomics
- Documentation: Technical documentation (ReadTheDocs), step-by-step protocols (Protocols.io), and video tutorials (YouTube Playlist)
- User base: US PHLs
- Workflow language: WDL
- Web/Cloud GUI Platforms: Terra
- CLI Platforms: Cromwell (local HPC, cloud), miniWDL
- COVID-19 Galaxy Workflows
- Brief Description: Several Galaxy workflows for performing SC2 consensus genome assembly are available, including a Galaxy workflow for the analysis of SARS-CoV-2 data.
- Workflow language: Galaxy
- Developed/supported by: usegalaxy.eu (https://covid19.galaxyproject.org/artic/)
- Web/Cloud GUI Platforms: usegalaxy.*
- Documentation: SARS-CoV-2 Data Analysis and Monitoring with Galaxy
- Sequencing technologies supported: Illumina metagenomic sequencing, Illumina and Oxford Nanopore ARTIC amplicon sequencing
- Developed/supported by: ARIES/Istituto Superiore di Sanità
- Web/Cloud GUI Platforms: ARIES Galaxy (https://aries.iss.it/u/arnold-knijn/w/sars-cov-2recovery31)
- Documentation: bioRxiv
- Sequencing technologies supported: Illumina, Ion Torrent and Oxford Nanopore ARTIC amplicon sequencing
- Developed/supported by: usegalaxy.eu (https://covid19.galaxyproject.org/artic/)
Web-Accessible SaaS Solutions
Command-line interface (CLI) Tools
2. Submitting raw sequence data (fastq), consensus assemblies (fasta), and relevant sample metadata to internationally-accessible databases
Below is a list of resources developed to assist in the preparation and submission of raw NGS read data (fastq files), SC2 consensus assemblies (fasta files), and contextual sample metadata to internationally-accessible databases such as NCBI, ENA, and GISAID. We have also included a list of bioinformatics software designed to assess the quality of SC2 data; we recommend the use of such software prior to submission to avoid the inadvertent sharing of poor quality, contaminated, or otherwise misleading SC2 data. Additional information regarding the interpretation of read and assembly quality metrics for SC2 data will be made available as a separate document.
Recommended SC2 Sample Metadata Specifications
- PHA4GE Contextual Data Specifications
- Database Target(s): GISAID, ENA, SRA, Genbank
- Brief Description: A SARS-CoV-2 contextual data specification based on harmonizable, publicly available, community standards. The specification is implementable via a collection template, as well as an array of protocols and tools to support the harmonization and submission of sequence data and contextual information to public repositories.
- Developed/supported by: PHA4GE
- Documentation: Technical documentation (GitHub README)
- User base: Global public health community
- Protocols: NCBI Submission, ENA Submission, & GISAID Submission
Bioinformatics Solutions to Prepare and/or Submit SC2 Sample Data
- Galaxy ENA Submission Plugin
- Database Target(s): ENA
- Brief Description: Galaxy plugin for direct submission to the European Nucleotide Archive database
- Developed/supported by: Galaxy IUC (Intergalactic Utilities Commission)
- Documentation: https://github.com/ELIXIR-Belgium/ena-upload-container
- User base: European PHLs
- Workflow language: Galaxy
- Web/Cloud GUI Platforms: GalaxyProject
- Broad viral-ngs (Terra workflows described above)
- Database Target(s): GISAID, GenBank, & SRA
- Theiagen’s Public Health Viral Genomics WDL Workflows (Terra workflows described above)
- Database Target(s): GISAID & GenBank (SRA submission in development)
- EDGE COVID-19 (SaaS solution described above)
- Database Target(s): GISAID, GenBank, & SRA
Bioinformatics Solutions to Assess Data Quality Prior to Submission
- VADR – Viral Annotation DefineR
- Brief Description: VADR is a suite of CLI tools for classifying and analyzing sequences homologous to a set of reference models of viral genomes or gene families. For SC2, laboratories have used VADR to identify samples with potentially mis-assembled genomes that are likely to be rejected by an internationally-accessible database.
- Developed/supported by: NCBI
- Documentation: Technical Documentation (GitHub Wiki)
- User base: NCBI GenBank & US PHLs
- Accessibility: Local install or the StaPH-B Docker Image
- Broad viral-ngs (Terra workflows described above; includes VADR)
- Titan Workflows for Genomic Characterization (Terra workflows described above; includes VADR)
- COVID-19 Galaxy Workflows (Galaxy resources described above)
- IDSeq (CZ BioHub) (SaaS solution described above)
- EDGE COVID-19 (SaaS solution described above)
- SIGNAL (SARS-CoV-2 Illumina GeNome Assembly Line; CanCOGeN) (CLI tool described above)
- ARTIC nCOV19 (ARTIC Network; Connor-lab) (CLI tool described above)
- StaPH-B ToolKit (CLI tool described above; VADR included in the Cecret workflow)
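As a rough illustration of the kind of pre-submission check recommended in this section, the sketch below computes two basic assembly metrics (length and fraction of ambiguous N bases) for each record in a FASTA file. It is not a substitute for VADR or the workflows listed above. The function names and the pass/fail thresholds (`max_n_fraction`, `min_length_fraction`) are hypothetical values chosen for the example, not recommendations from this working group or from any database; the default expected length of 29,903 nt is the length of the Wuhan-Hu-1 reference genome.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            # Use the first whitespace-delimited token as the record name.
            name = header.split()[0] if header else "unnamed"
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def assembly_qc(sequence, expected_length=29903,
                max_n_fraction=0.10, min_length_fraction=0.90):
    """Return simple QC metrics for one consensus assembly.

    The thresholds are illustrative assumptions only.
    """
    length = len(sequence)
    n_count = sequence.count("N")
    n_fraction = n_count / length if length else 1.0
    passed = (length >= min_length_fraction * expected_length
              and n_fraction <= max_n_fraction)
    return {"length": length, "n_count": n_count,
            "n_fraction": n_fraction, "pass": passed}


# Toy usage with a shortened expected length so the example is readable.
toy_fasta = ">sample1 toy\nACGT\nNNNN\n>sample2\nACGTACGT\n"
for sample, seq in read_fasta(toy_fasta).items():
    print(sample, assembly_qc(seq, expected_length=8))
```

In practice a laboratory would read the FASTA text from its own consensus file and tune the thresholds to the acceptance criteria of the target database.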
3. Screening sequenced SC2 samples for variants of concern & general lineage typing
These tools either assign a clade or lineage descriptor to consensus sequences or provide databases for lookup of information on variants in the SARS-CoV-2 genome. As variants of concern are listed by their lineage descriptor (typically PANGO lineage or sometimes Nextclade clades), these tools help identify variants of concern.
Bioinformatics tools for SC2 lineage or clade assignment
- Pangolin (Phylogenetic Assignment of Named Global Outbreak LINeages)
- Brief Description: Tool developed to implement the dynamic nomenclature of SARS-CoV-2 lineages, known as the Pango nomenclature. It allows a user to assign the most likely lineage (PANGO lineage) to SARS-CoV-2 query sequences.
- Developed/supported by: Pangolin Network
- Documentation: Technical Documentation (Pangolin Website), publication (Nature Microbiology)
- User base: Global Public Health Community
- Accessibility: Web application & CLI tool
- Bioinformatics workflows that incorporate Pango lineage assignments:
- Datapipe
- Brief Description: Performs alignment and variant calling, assigns lineages with pangolin and VOC/VUI with scorpio and cleans up geography metadata.
- Developed/supported by: Virus Group (University of Edinburgh)
- User-interface: command-line tool, nextflow pipeline
- User base: COG-UK
- Broad viral-ngs (Terra workflows described above)
- Theiagen’s Public Health Viral Genomics WDL Workflows (Terra workflows described above)
- COVID-19 Galaxy Workflows (Galaxy resources described above)
- IDSeq (SaaS solution described above)
- EDGE COVID-19 (SaaS solution described above)
- SIGNAL (SARS-CoV-2 Illumina GeNome Assembly Line; CanCOGeN) (CLI tool described above)
- StaPH-B ToolKit (CLI tool described above)
- NextClade
- Brief Description: Tool that identifies differences between your sequences and a reference sequence used by Nextstrain, uses these differences to assign your sequences to clades, and reports potential sequence quality issues in your data
- User-interface: Web application & CLI tool
- Help/community/discussion: discussion.nextstrain.org
- Bioinformatics workflows that incorporate NextClade clade assignments:
- Broad viral-ngs (Terra workflows described above)
- Theiagen’s Public Health Viral Genomics WDL Workflows (Terra workflows described above)
- COVID-19 Galaxy Workflows (Galaxy resources described above)
- IDSeq (SaaS solution described above)
- StaPH-B ToolKit (CLI tool described above)
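Once lineages have been assigned, screening for variants of concern amounts to comparing each sample's lineage against a watchlist. The sketch below assumes a CSV lineage report with `taxon` and `lineage` columns and uses the two lineages named earlier in this document (B.1.1.7 and B.1.617.2) as an example watchlist. The function name and watchlist are illustrative, and the match is exact: aliased sublineage names (for example AY.* for Delta) are not resolved here and would need an up-to-date alias table.

```python
import csv
import io

# Example watchlist only; real VoC/VoI lists change over time.
VARIANTS_OF_CONCERN = {
    "B.1.1.7": "Alpha",
    "B.1.617.2": "Delta",
}


def screen_lineages(report_csv_text, watchlist=VARIANTS_OF_CONCERN):
    """Return (taxon, lineage, label) for samples whose lineage is on the watchlist.

    Assumes the CSV text has 'taxon' and 'lineage' columns.
    """
    flagged = []
    reader = csv.DictReader(io.StringIO(report_csv_text))
    for row in reader:
        lineage = (row.get("lineage") or "").strip()
        if lineage in watchlist:
            flagged.append((row.get("taxon", ""), lineage, watchlist[lineage]))
    return flagged


toy_report = "taxon,lineage\nsample1,B.1.1.7\nsample2,B.1\nsample3,B.1.617.2\n"
for taxon, lineage, label in screen_lineages(toy_report):
    print(f"{taxon}: {lineage} ({label})")
```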
Public Health Resources that Track & Visualize SC2 Variants Over Time
- PANGO cov-lineages
- Brief Description: Tracks the global prevalence of PANGO lineages
- Developed/supported by: Pangolin Network
- CoVariants
- Brief Description: Tracks the global prevalence of Nextclade-annotated lineages
- Developed/supported by: NextStrain Team
- Outbreak.info
- CoV-GLUE
- Brief Description: CoV-GLUE contains a database of amino acid replacements, insertions and deletions which have been observed in GISAID hCoV-19 sequences sampled from the pandemic, along with epidemiological information including PANGO lineage prevalence.
- Developed/supported by: COG-UK
- 2019nCoVR
- Brief Description: 2019nCoVR features comprehensive integration of genomic and proteomic sequences as well as their metadata information from GISAID, NCBI, NMDC and CNCB/NGDC. It also incorporates a wide range of relevant information including scientific literature, news, and popular articles for science dissemination, and provides visualization functionalities for genome variation analysis results based on all collected SARS-CoV-2 strains.
- Developed/supported by: China National Center for Bioinformation (CNCB)
- CoVizu
- Brief Description: CoVizu is an open source project endeavouring to visualize the global diversity of SARS-CoV-2 genomes, which are provided by the GISAID Initiative.
- Developed/supported by: Poon Laboratory of Western University
- Annotation of SARS-2 Coronavirus Genome (Observable)
- Brief Description: Annotation of variation in the genome with some notes on what is known about the various amino acids
- Developed/supported by: Delphine Lariviere (Penn State University)
Bioinformatics Tools to Track & Visualize Your Own SC2 Variants Over Time
- KRISP R-scripts
- Brief Description: Open-source repository containing all the code, data and information needed to reproduce the analyses for the African genomic epidemiology manuscript.
- Developed/supported by: Emmanuel James San (University of KwaZulu-Natal)
- Documentation: Technical Documentation (GitHub README), publication (Nature Medicine)
- Accessibility: RCL-Scripts
- GISAID Processing
- Brief Description: Open-source repository containing Python scripts to process GISAID data into frequency graphs
- Developed/supported by: Peter van Heusden (University of the Western Cape)
- Documentation: Technical Documentation (GitHub README)
- Accessibility: Python-Scripts
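Tracking your own variants over time generally starts from a table of collection dates and assigned lineages. The following sketch is not taken from either repository listed above; it simply shows one way to turn such records into per-week lineage frequencies that could then be plotted. The function name and the `(ISO date string, lineage)` input format are assumptions made for the example.

```python
from collections import Counter, defaultdict
from datetime import date


def weekly_lineage_frequencies(records):
    """Compute lineage frequencies per ISO week.

    records: iterable of (collection_date, lineage) pairs, where
    collection_date is an ISO 8601 string such as '2021-01-04'.
    Returns {(iso_year, iso_week): {lineage: fraction_of_samples}}.
    """
    counts = defaultdict(Counter)
    for iso_date, lineage in records:
        iso_year, iso_week, _ = date.fromisoformat(iso_date).isocalendar()
        counts[(iso_year, iso_week)][lineage] += 1

    frequencies = {}
    for week in sorted(counts):
        total = sum(counts[week].values())
        frequencies[week] = {lin: n / total for lin, n in counts[week].items()}
    return frequencies


toy_records = [
    ("2021-01-04", "B.1.1.7"),
    ("2021-01-05", "B.1"),
    ("2021-01-06", "B.1.1.7"),
    ("2021-01-07", "B.1.1.7"),
    ("2021-01-11", "B.1.1.7"),
]
for week, freqs in weekly_lineage_frequencies(toy_records).items():
    print(week, freqs)
```

Grouping by ISO week smooths out day-to-day sampling noise; weeks with very few samples will still give unstable frequencies and are best shown together with their sample counts.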
4. Performing phylogenetic analysis of SC2 datasets
The tools listed below perform phylogenetic analyses of different complexity, ranging from web-apps to command-line tools that need to run on HPC facilities. The selected tools are integrated with visualization features that facilitate the interrogation of the results, but beware that such inferences might be uncertain and often require careful interpretation.
Public Health Resources Performing Global SC2 Phylogenetic Analysis
- NextStrain
- Brief Description: Nextstrain is an open-source project to harness the scientific and public health potential of pathogen genome data.
- Developed/supported by: Fred Hutch/Basel (Nextstrain team)
- User base: USA based groups
- Documentation: docs
- Help/community/discussion: discussion.nextstrain.org
- Implementations for compute steps (“augur”):
- nextstrain/ncov snakemake pipeline
- Description: The authoritative implementation of the Nextstrain “augur” pipeline that takes genomes and metadata to trees and visualizations.
- Developed/supported by: Fred Hutch/Basel (Nextstrain team)
- Workflow language: Snakemake
- Broad viral-ngs (Terra workflows described above)
- Theiagen’s Public Health Viral Genomics WDL Workflows (Terra workflows described above)
- Microreact
- Brief Description: Open data visualization and sharing for genomic epidemiology
- Developed/supported by: Centre for Genomic Pathogen Surveillance (CGPS)
- User base: COG-UK, New Zealand, etc
- User-interface: Web application / centrally hosted service
Offlineable Browser-Based Web Applications
- Auspice
- Brief Description: Allows interactive exploration of phylogenomic datasets by simply dragging & dropping them onto this page.
- Developed/supported by: Fred Hutch/Basel (Nextstrain team)
- Documentation: Technical documentation (GitHub README), NextStrain discussion Forum
- User-interface: offlineable browser-based web app
- MicrobeTrace
- Brief Description: The Visualization Multitool for Molecular Epidemiology and Bioinformatics
- Developed/supported by: US CDC
- Documentation: https://github.com/CDCgov/MicrobeTrace
- User-interface: offlineable browser-based web app
- UShER
- Brief Description: Places user provided sequences on very large reference trees, extracts the relevant subtree, and provides a visualization
- Developed/supported by: UCSC
- User-interface: offlineable browser-based web app
Command-line interface (CLI) Tools
- Grinch
- Brief Description: Generates reports for the international distribution of PANGO lineages that can be viewed in a web browser.
- Developed/supported by: PANGO, cov-lineages
- User-interface: command-line tool
- Phylopipe
- Brief Description: Generates a downsampled global tree using FastTree and updates it daily using UShER, cleans and annotates the tree; can be run on output from Datapipe.
- Developed/supported by: Virus Group (University of Edinburgh)
- User-interface: command-line tool, nextflow pipeline
- User base: COG-UK
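The tools in this section infer full phylogenies; as a much simpler illustration of the idea of genetic relatedness behind them, the sketch below counts pairwise nucleotide differences between consensus sequences that have already been aligned to the same coordinates. This is a distance table, not a tree, and it is not how any of the listed tools work internally. The function names are hypothetical, and positions where either sequence has an N or another ambiguous character are ignored.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences differ.

    Only positions where both sequences have an unambiguous base
    (A, C, G or T) are compared.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a in "ACGT" and b in "ACGT"
    )


def pairwise_distances(aligned):
    """Return {(name_1, name_2): distance} for every pair of sequences."""
    return {
        (n1, n2): snp_distance(aligned[n1], aligned[n2])
        for n1, n2 in combinations(sorted(aligned), 2)
    }


toy_alignment = {
    "sampleA": "ACGTACGT",
    "sampleB": "ACGTACGA",
    "sampleC": "ANGTTCGA",
}
for pair, distance in pairwise_distances(toy_alignment).items():
    print(pair, distance)
```

Small pairwise distances are compatible with, but do not prove, a direct epidemiological link, which is why the caution about careful interpretation at the start of this section applies here as well.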