
Getting the right information to the right people: The PHA4GE SARS-CoV-2 contextual data specification


The SARS-CoV-2 pandemic has impacted lives and economies all over the world, with more than 34.8 million cases and over 1 million deaths globally in early October 2020. Sequencing and bioinformatic analyses of viral genomes have already yielded many insights into the origin and spread of the disease, thanks to the ever-increasing amounts of data shared with public repositories like GISAID and the INSDC. Good-quality genomics contextual data (sample metadata, lab/epidemiological/clinical data, methods and metrics) are critical for interpreting sequence data and informing decision making based on results, as well as answering biological questions about the virus and the disease. Contextual data elements such as sample collection dates and geographical locations, patient age, gender, health outcomes, pre-existing conditions, symptoms and onset dates, as well as possible and known exposures, are useful for a wide variety of surveillance and other public health activities. These include characterizing lineages and clusters, identifying variants with clinical significance, and correlating genomic trends with outcomes and risk factors.


To capitalize on the potential of SARS-CoV-2 sequence data, getting the right information to the right people is critical. However, this process is often hampered by fragmented data collection and management processes. Due to the division of labour across laboratories, departments, agencies and jurisdictions, contextual data are often collected according to local needs and reporting requirements, and structured according to organization-specific data dictionaries, creating data silos and barriers to data sharing. While metadata standards exist, they are broadly scoped to cover as many use cases and pathogens as possible, and they may include fields that are subject to privacy concerns or not applicable to a pathogen of interest, or exclude fields commonly used in public health surveillance and investigations.


Many of the members of the Data Structures working group are part of large sequencing consortia (e.g. COG-UK, SPHERES, CanCOGeN, the Latin American Genomics SARS-CoV-2 Network) that have faced challenges in data harmonization and integration as a result of the barriers described above. In light of these challenges, we have developed a fit-for-purpose SARS-CoV-2 contextual data specification focused on public health needs, designed to accommodate privacy requirements while maximizing information linkage, content and interoperability across datasets and databases (https://github.com/pha4ge/SARS-CoV-2-Contextual-Data-Specification). The specification was developed by consensus among domain experts, and incorporates existing community standards to describe repository accession numbers and identifiers, sample collection and processing, host information, host exposure information, sequencing methods, bioinformatics and quality control metrics, pathogen diagnostic testing details, as well as provenance and contribution attribution.


The specification package includes:

Components of the PHA4GE SARS-CoV-2 Contextual Data Specification Package
Standardized collection template
Pick lists: standardized
Reference guide: field labels, definitions, guidance, expected values, required vs optional fields
SOP: how to use template, find new terms, highlights practical/ethical/privacy issues
Field mapping to existing standards: highlights alignment and gaps
JSON schema: machine readable version for incorporation into different applications
7 public repository submission protocols (GISAID, NCBI, EMBL-EBI) on protocols.io
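
To illustrate how required fields and pick lists can be checked by software, here is a minimal Python sketch. The field names and pick-list values are hypothetical placeholders, not taken from the specification; the reference guide and JSON schema in the package define the real ones.

```python
# Hypothetical field names and pick-list values, for illustration only.
# The specification's reference guide and JSON schema define the real ones.
REQUIRED_FIELDS = {"specimen_id", "collection_date", "country"}
PICK_LISTS = {
    "host_gender": {"Female", "Male", "Not Provided"},
}


def validate_record(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    # A required field is a problem if it is absent or empty.
    for field in sorted(REQUIRED_FIELDS):
        if not record.get(field):
            problems.append(f"missing required field: {field}")
    # A pick-list field is only checked when a value was supplied.
    for field, allowed in PICK_LISTS.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field}: {value!r} is not in the pick list")
    return problems


record = {
    "specimen_id": "sample-001",
    "collection_date": "2020-10-01",
    "host_gender": "F",
}
issues = validate_record(record)
for issue in issues:
    print(issue)
```

In practice, a JSON Schema validator run against the published schema would replace this hand-written check; the sketch only shows the kind of rule the template and pick lists are meant to enforce.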


The collection template enables vital information to be collated in a single location, and harmonized across various sources using established principles to improve machine-amenability. Different subsets of the harmonized data can be 1) shared with public repositories e.g. GISAID and INSDC using the PHA4GE protocols, 2) shared with trusted partners e.g. national sequencing consortia, public health partners, and 3) kept private and retained locally with the potential for sharing in the future for particular surveillance or research activities. How, and how much of, the specification is used is ultimately at the discretion of the user. To date, versions of the specification are being implemented in the CanCOGeN (Canada) and SPHERES (USA) SARS-CoV-2 sequencing initiatives, the AusTrakka (Australia) national data sharing platform, by the Global Emerging Pathogens Treatment Consortium (Africa), and in the Baobab LIMS at the South African National Bioinformatics Institute (SANBI).
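
The three sharing options described above could be implemented along the following lines. This is a sketch under stated assumptions: the field names and their tier assignments are hypothetical, and each organization would set its own according to its privacy requirements and data sharing agreements.

```python
# Hypothetical mapping of fields to sharing tiers, for illustration only.
# Each organization would define its own according to its policies.
SHARING_TIERS = {
    "specimen_id": "public",
    "collection_date": "public",
    "country": "public",
    "host_age": "trusted",
    "symptom_onset_date": "trusted",
    "exposure_details": "private",
}


def split_by_tier(record, tiers=SHARING_TIERS, default="private"):
    """Partition one harmonized record into public, trusted and private subsets.

    Fields with no tier assignment fall into `default`, so anything
    unclassified stays local unless someone explicitly decides otherwise.
    """
    subsets = {"public": {}, "trusted": {}, "private": {}}
    for field, value in record.items():
        subsets[tiers.get(field, default)][field] = value
    return subsets


record = {
    "specimen_id": "sample-001",
    "collection_date": "2020-10-01",
    "country": "Canada",
    "host_age": 54,
    "exposure_details": "household contact",
    "local_case_number": "X-123",
}
subsets = split_by_tier(record)
print(subsets["public"])
```

Here each field belongs to exactly one tier, so an export for a trusted partner would combine the public and trusted subsets. Defaulting unclassified fields to private is a deliberately conservative choice.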


As countries around the world prepare for new waves of infections throughout the pandemic, a unique opportunity for harmonization in data collection exists. With this specification we have endeavored to create a mechanism for promoting consistent, standardized contextual data collection that can be applied in such a way that community-based data sharing efforts are not excessively burdened. We hope that, given sufficient uptake, this specification will enhance the reusability of collected data, enabling national and international agencies to accelerate the understanding of SARS-CoV-2 epidemiology and biology. Furthermore, the framework for SARS-CoV-2 presented in this work can also be used to build a roadmap for dealing with future public health crises. 


To learn more about the specification and how to get started using it, read our recent preprint or listen to our interview on the Micro Binfie podcast.


To learn more about other activities of the PHA4GE Data Structures workgroup, check out our webpage https://pha4ge.org/work-group/data-structures/.

