
Bioinformatics Solutions for SARS-CoV-2 Genomic Analysis

PHA4GE Bioinformatics Pipelines & Visualization Working Group
Libuit KG, Park D, van Heusden P, Neher R, Kapsak CJ, Southgate J, Bridges D, Mboowa G, Lunn S, Langhorst B

Current Version

Overview

Genomic analysis of SARS-CoV-2 (SC2) samples is an increasingly critical function for public health laboratories around the world. Integrating the appropriate bioinformatics solutions to support this work, however, can be an overwhelming challenge.

In an attempt to assist this integration process, the Bioinformatics Pipelines and Visualization Working Group of the Public Health Alliance for Genomic Epidemiology (PHA4GE) has drafted this living document to help define the major bioinformatics challenges for SC2 genomic analysis and suggest various open-source and freely available bioinformatics resources to address them.

Please note that the bioinformatics resources listed in this document reflect our working group's view of the most accessible solutions and in no way represent a comprehensive list of all available SC2 bioinformatics resources. If this document fails to include a valuable public health resource or in some way mischaracterizes a resource mentioned, we encourage community collaboration through pull requests and/or GitHub issues.

Bioinformatics Challenges for Public Health

The PHA4GE Bioinformatics Pipelines and Visualization Working Group has defined four key public health bioinformatics challenges for genomic analysis of SC2 samples:

  1. Generating consensus assemblies from PCR tiling NGS data: Tiled amplicon sequencing, for example through the ARTIC V3 protocol, is the most commonly adopted method for generating SC2 sequencing data. These sequencing experiments generate thousands of amplicon reads that represent fragments of the original SC2 genome present in a sample. As a result, one of the first bioinformatics challenges laboratories face is assembling PCR tiling NGS data into a contiguous SC2 genome from which public health insights can be derived, such as lineage typing and genomic epidemiology studies that help inform public health decision making.
  2. Submitting raw sequence data (fastq), consensus assemblies (fasta), and relevant sample metadata to internationally-accessible databases: Sharing of sample read and assembly data through internationally accessible databases allows insights to be drawn about how the virus is spreading and mutating across the globe; the more freely available these data are to international researchers and public health scientists, the stronger our decision making can be.
  3. Screening sequenced SC2 samples for variants of concern: The detection of certain genetic variants of the SARS-CoV-2 virus may have a significant impact on the decisions of public health officials. Thus, the ability to accurately and reliably screen for variants of interest (VoI) and variants of concern (VoC), such as B.1.1.7 (Alpha) or B.1.617.2 (Delta), is a critical component of the bioinformatics analysis of SC2 genomes.
  4. Performing phylogenetic analysis of SC2 datasets: Genetic relatedness, as inferred through phylogenetic analysis of SC2 datasets, can be a powerful proxy for epidemiological associations that help resolve transmission networks, enable real-time surveillance, provide insight into how SC2 samples vary over time, and support local outbreak investigations.

Open-Access/Source Bioinformatics Solutions & Resources

1. Generating consensus assemblies from PCR tiling NGS data

The bioinformatics resources listed below are open-source pipelines that run on general-purpose, containerized workflow infrastructure to generate consensus SC2 assemblies from PCR tiling NGS data. While some parameters and modules may differ slightly, each pipeline will perform read mapping to the Wuhan-1 reference genome, remove primer regions from the mapped read data, and generate a consensus assembly based on conserved and variant positions identified in the resulting alignment. These resources have been organized into three categories: Terra and Galaxy Workflows, Web-Accessible Software as a Service (SaaS) Solutions, and Command-Line Interface (CLI) tools and are listed in no particular order.

Terra and Galaxy Workflows

  • Broad viral-ngs
    • Brief Description: The viral-ngs workflow collection contains many tools for viral analysis. Its consensus genome caller, assemble_refbased, should work for any low-diversity microbial genome and is appropriate for viruses stemming from a single point-source outbreak, such as SARS-CoV-2. It accepts Illumina paired, single, or mixed reads, as well as ONT reads, and handles metagenomic or amplicon-based reads with primer trimming.
    • Developed/supported by: Broad Institute Viral Genomics
    • Documentation: Technical documentation (ReadTheDocs)
    • User base: H3Africa West African sites (RUN, KGH, UCAD)
    • Workflow language: WDL
      • Web/Cloud GUI Platforms: Terra, DNAnexus
      • CLI Platforms: Cromwell (local HPC, cloud), miniWDL
  • Theiagen’s Public Health Viral Genomics WDL Workflows
    • Brief Description: Theiagen’s Public Health Viral Genomics WDL Workflows include four separate WDL workflows (Titan_Illumina_PE, Titan_Illumina_SE, Titan_ClearLabs, and Titan_ONT) that process NGS read data from four different sequencing approaches (Illumina paired-end, Illumina single-end, Clear Labs, and Oxford Nanopore Technologies (ONT)). Each workflow generates a consensus assembly, produces relevant quality-control metrics for both the input read data and the generated assembly, and assigns each sample a lineage and a clade designation using Pangolin and Nextclade, respectively.
    • Developed/supported by: Theiagen Genomics
    • Documentation: Technical documentation (ReadTheDocs), step-by-step protocols (Protocols.io), and video tutorials (YouTube Playlist)
    • User base: US PHLs
    • Workflow language: WDL
      • Web/Cloud GUI Platforms: Terra
      • CLI Platforms: Cromwell (local HPC, cloud), miniWDL
  • COVID-19 Galaxy Workflows
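
The consensus step that the pipelines above share (after read mapping and primer trimming) can be illustrated with a minimal sketch. It is not the method of any listed pipeline: the function name `call_consensus`, the pileup layout, and the thresholds `min_depth` and `min_fraction` are all assumptions chosen for illustration, and insertions and deletions are ignored.

```python
from collections import Counter


def call_consensus(reference, pileup, min_depth=10, min_fraction=0.6):
    """Return a consensus string with the same length as `reference`.

    `pileup` maps a 0-based reference position to a Counter of the bases
    observed at that position in the primer-trimmed alignment.
    Positions with fewer than `min_depth` reads, or where no single base
    reaches `min_fraction` of the reads, are masked as 'N'.
    (Both thresholds are illustrative assumptions.)
    """
    consensus = []
    for pos in range(len(reference)):
        counts = pileup.get(pos, Counter())
        depth = sum(counts.values())
        if depth == 0 or depth < min_depth:
            consensus.append("N")  # too little evidence to call a base
            continue
        base, count = counts.most_common(1)[0]
        # Keep the majority base only if it is well supported.
        consensus.append(base if count / depth >= min_fraction else "N")
    return "".join(consensus)


# Toy example: five reference positions with hypothetical read counts.
reference = "ACGTA"
pileup = {
    0: Counter(A=20),        # conserved position
    1: Counter(C=18, T=2),   # conserved, with minor sequencing noise
    2: Counter(T=15),        # variant position (reference has G)
    3: Counter(T=3),         # low depth, masked
    4: Counter(A=6, G=6),    # mixed signal, masked
}
print(call_consensus(reference, pileup))
```

Masking low-depth or ambiguous positions with N, rather than falling back to the reference base, avoids reporting reference sequence that was never actually observed in the sample.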

Web-Accessible SaaS Solutions

Command-line interface (CLI) Tools

2. Submitting raw sequence data (fastq), consensus assemblies (fasta), and relevant sample metadata to internationally-accessible databases

Below is a list of resources developed to assist in the preparation and submission of raw NGS read data (fastq files), SC2 consensus assemblies (fasta files), and contextual sample metadata to internationally-accessible databases such as NCBI, ENA, and GISAID. We have also included a list of bioinformatics software designed to assess the quality of SC2 data; we recommend the use of such software prior to submission to avoid the inadvertent sharing of poor quality, contaminated, or otherwise misleading SC2 data. Additional information regarding the interpretation of read and assembly quality metrics for SC2 data will be made available as a separate document.

Recommended SC2 Sample Metadata Specifications

  • PHA4GE Contextual Data Specifications
    • Database Target(s): GISAID, ENA, SRA, Genbank
    • Brief Description: A SARS-CoV-2 contextual data specification based on harmonizable, publicly available, community standards. The specification is implementable via a collection template, as well as an array of protocols and tools to support the harmonization and submission of sequence data and contextual information to public repositories.
    • Developed/supported by: PHA4GE
    • Documentation: Technical documentation (GitHub README)
    • User base: Global public health community
    • Protocols: NCBI Submission, ENA Submission, & GISAID Submission

Bioinformatics Solutions to Prepare and/or Submit SC2 Sample Data

Bioinformatics Solutions to Assess Data Quality Prior to Submission
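
As a rough illustration of the kind of check such software performs before submission, the sketch below computes basic metrics for a consensus assembly in fasta format. The function names `read_fasta` and `assembly_metrics` are hypothetical, and no pass/fail thresholds are implied; acceptable values depend on the target database and local policy.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records


def assembly_metrics(sequence):
    """Return length, count and percentage of N, and count of A/C/G/T."""
    length = len(sequence)
    n_count = sequence.count("N")
    unambiguous = sum(sequence.count(base) for base in "ACGT")
    percent_n = 100 * n_count / length if length else 0.0
    return {
        "length": length,
        "n_count": n_count,
        "percent_n": percent_n,
        "unambiguous": unambiguous,
    }


# Toy record; a real SC2 consensus is roughly 30 kb long.
fasta_text = ">sample1\nACGTNN\nNNAC\n"
for name, seq in read_fasta(fasta_text).items():
    print(name, assembly_metrics(seq))
```

A high percentage of N usually points to amplicon dropout or low coverage, which is worth investigating before the assembly is shared.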

3. Screening sequenced SC2 samples for variants of concern & general lineage typing

These tools either assign a clade or lineage descriptor to consensus sequences or provide databases for looking up information on variants in the SARS-CoV-2 genome. Because variants of concern are listed by their lineage descriptor (typically a PANGO lineage, sometimes a Nextclade clade), these tools help identify variants of concern.

Bioinformatics tools for SC2 lineage or clade assignment

Public Health Resources that Track & Visualize SC2 Variants Over Time

  • PANGO cov-lineages
    • Brief Description: Track global prevalence of PANGO lineages
    • Developed/supported by: Pangolin Network
  • Covariants
    • Brief Description: Track global prevalence of Nextclade-annotated lineages
    • Developed/supported by: Nextstrain team
  • Outbreak.info
    • Brief Description: Epidemiological info including PANGO lineage prevalence
    • Developed/supported by: Su, Wu, and Andersen labs at Scripps Research
  • CoV-GLUE
    • Brief Description: CoV-GLUE contains a database of amino acid replacements, insertions, and deletions that have been observed in GISAID hCoV-19 sequences sampled from the pandemic
    • Developed/supported by: COG-UK
  • 2019nCoVR
    • Brief Description: 2019nCoVR features comprehensive integration of genomic and proteomic sequences as well as their metadata from GISAID, NCBI, NMDC, and CNCB/NGDC. It also incorporates a wide range of relevant information, including scientific literature, news, and popular science articles, and provides visualization functionality for genome variation analysis results based on all collected SARS-CoV-2 strains.
    • Developed/supported by: China National Center for Bioinformation (CNCB)
  • CoVizu
  • Annotation of SARS-2 Coronavirus Genome (Observable)
    • Brief Description: Annotation of variation in the genome with some notes on what is known about the various amino acids
    • Developed/supported by: Delphine Lariviere (Penn State University)
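
Because variants of concern are tracked by lineage descriptor, screening a batch of samples can be as simple as comparing assigned lineages against a watch list. The sketch below assumes a CSV report with `taxon` and `lineage` columns (the column names are an assumption about the report layout), and the watch list holds only the two examples named in this document. It matches exact lineage names only, so sublineages and aliases are not handled.

```python
import csv
import io

# Watch list limited to the two examples given in this document.
WATCH_LIST = {"B.1.1.7": "Alpha", "B.1.617.2": "Delta"}


def screen_lineages(report_csv, watch_list=WATCH_LIST):
    """Return {sample: (lineage, label)} for samples on the watch list.

    `report_csv` is CSV text with at least `taxon` and `lineage`
    columns (assumed layout for this sketch).
    """
    flagged = {}
    for row in csv.DictReader(io.StringIO(report_csv)):
        lineage = row["lineage"].strip()
        if lineage in watch_list:
            flagged[row["taxon"]] = (lineage, watch_list[lineage])
    return flagged


# Hypothetical report for three samples.
report = "taxon,lineage\ns1,B.1.1.7\ns2,B.1\ns3,B.1.617.2\n"
for sample, (lineage, label) in screen_lineages(report).items():
    print(f"{sample}: {lineage} ({label})")
```

In practice the watch list needs to be kept current against the designations published by public health authorities, since the set of variants of concern changes over time.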

Bioinformatics Tools to Track & Visualize Your Own SC2 Variants Over Time

4. Performing phylogenetic analysis of SC2 datasets

The tools listed below perform phylogenetic analyses of varying complexity, ranging from web apps to command-line tools that need to run on HPC facilities. The selected tools are integrated with visualization features that facilitate the interrogation of the results, but beware that such inferences can be uncertain and often require careful interpretation.

Public Health Resources Performing Global SC2 Phylogenetic Analysis

  • Nextstrain
    • Brief Description: Nextstrain is an open-source project to harness the scientific and public health potential of pathogen genome data.
    • Developed/supported by: Fred Hutch/Basel (Nextstrain team)
    • User base: USA-based groups
    • Documentation: docs
    • Help/community/discussion: discussion.nextstrain.org
    • Implementations for compute steps (“augur”):
  • Microreact
    • Brief Description: Open data visualization and sharing for genomic epidemiology
    • Developed/supported by: Centre for Genomic Pathogen Surveillance (CGPS)
    • User base: COG-UK, New Zealand, etc.
    • User-interface: Web application / centrally hosted service

Offlineable Browser-Based Web Applications

  • Auspice
  • MicrobeTrace
    • Brief Description: The Visualization Multitool for Molecular Epidemiology and Bioinformatics
    • Developed/supported by: US CDC
    • Documentation: https://github.com/CDCgov/MicrobeTrace
    • User-interface: offlineable browser-based web app
  • UShER
    • Brief Description: Places user provided sequences on very large reference trees, extracts the relevant subtree, and provides a visualization
    • Developed/supported by: UCSC
    • User-interface: offlineable browser-based web app

Command-line interface (CLI) Tools

  • Grinch
    • Brief Description: Generates reports for the international distribution of PANGO lineages that can be viewed in a web browser.
    • Developed/supported by: PANGO, cov-lineages
    • User-interface: command-line tool
  • Phylopipe
    • Brief Description: Generates a downsampled global tree using FastTree and updates it daily using UShER, cleans and annotates the tree; can be run on output from Datapipe.
    • Developed/supported by: Virus Group (University of Edinburgh)
    • User-interface: command-line tool, Nextflow pipeline
    • User base: COG-UK
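
To make the idea of genetic relatedness concrete, the sketch below counts pairwise nucleotide differences between aligned consensus sequences. This is not a phylogenetic inference and is not how any of the tools above work internally; it is only a simple distance measure, with hypothetical function names `snp_distance` and `distance_matrix`, assuming the sequences are already aligned to the same length.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences differ.

    Positions where either sequence has anything other than A, C, G,
    or T (for example N or a gap) are skipped, not counted as
    differences.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in valid and b in valid and a != b
    )


def distance_matrix(aligned):
    """Return {(name_x, name_y): distance} for every pair of samples."""
    return {
        (x, y): snp_distance(aligned[x], aligned[y])
        for x, y in combinations(sorted(aligned), 2)
    }


# Toy alignment of three hypothetical samples.
aligned = {
    "a": "ACGTACGT",
    "b": "ACGTACGA",
    "c": "ANGTTCC-",
}
for pair, dist in distance_matrix(aligned).items():
    print(pair, dist)
```

Small pairwise distances are consistent with, but do not prove, an epidemiological link; tree-based tools such as those listed above add the evolutionary context needed to interpret them.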
