How the sausage gets made: a conversation with the Data Structures Working Group about developing, testing and implementing the PHA4GE Wastewater Contextual Data Specification Package

June 30, 2024

Wastewater genomics data is increasingly being used in a wide variety of surveillance programs and research to support public health decision making and responses, e.g. detection and monitoring of emerging and existing pathogens, detection and characterization of antimicrobial resistance (AMR) determinants, and quantification of infectious disease genotypes and assessment of their prevalence. Wastewater genomic surveillance of pathogens requires high-quality sequence data, along with well-structured contextual data (e.g. sample, laboratory, clinical, epidemiological, environmental and methods information) to enable the interpretation of sequence data. Furthermore, wastewater genomic surveillance requires integration of diverse data from various sources, agencies, and systems – all of which can be structured in a variety of ways – posing challenges for data harmonization and meaningful interpretation. By structuring contextual data using data standards, this information can be more easily understood and used by both humans and computers and can be more easily reused for different types of analyses.

How data standards are built impacts their interoperability, utility and usability. General knowledge of data standards development principles and practices is often localized within data standards communities of practice, limiting the ability of public health institutions to develop their own interoperable specifications. In this article, we hope to provide the community with a view into the methods and best practices for data standards development used by PHA4GE. Specifically, we interviewed Emma Griffiths (Chair, Data Structures Working Group (DSWG); researcher, Simon Fraser University, Vancouver, Canada), Charlie Barclay (ontology developer and curator, Simon Fraser University, Vancouver, Canada), and Jillian Paull (PhD student, Broad Institute of MIT and Harvard), who are leading the development of the PHA4GE Wastewater Contextual Data Specification Package. They outline 9 steps to help make the process of developing data standards clearer to the general public health practitioner.

The package consists of a data standard (standardized fields/terms/formats for structuring contextual data from samples to sequences), tools to help public health labs put the standard into practice, supporting materials such as curation protocols and reference guides, and worked examples to demonstrate how to use the standard in real-world situations. The package is publicly available and can be found at https://github.com/pha4ge/Wastewater_Contextual_Data_Specification.

9 (Supposedly) Easy Steps for Data Specification Design*

*Recipe requires experience, training, and expertise

Step 1: Define the goal and scope of the specification

The key to designing a contextual data specification is to first clearly define the scope of the specification. Scoping involves defining the types of information that need to be captured and for what purpose the information will be used, who will be responsible for collecting and analyzing the information, and how it will be stored and shared, all of which may come out of a data needs assessment. The scope includes things like which pathogens or phenomena are being investigated (e.g. particular bacteria or viruses, antimicrobial resistance, virulence, other phenotypes); whether the data being captured is intended for private use, sharing with trusted partners and/or public sharing; and the different use cases and users of the data. Data needs may include sample types and their settings, sampling strategies, sample processing, upstream activities that can affect samples (including experimental interventions being tested), sample measurements and conditions, geographical regions, sequencing assay types, sequencing targets (including amplicons, informative regions), library preparation (including any enrichment methods, amplicon schemes, CRISPR), sequencing instruments (chemistry, flow cell versions), bioinformatics processing (quality filtering, trimming; read-mapping; including software names and version numbers, reference database names and versions; quality control methods), data provenance and contributions (organization names tasked with collecting samples and/or measurements, sequencing and/or bioinformatics analysis, phenotypic characterization; dates and locations, data steward contact information), and more.

All of these data elements were included in the PHA4GE wastewater specification, along with tracking of different identifiers to establish chains of custody, and the names of and links to reports containing derived results. The specification is intended for future-proofing data for private (lab internal) use, sharing among networks, and public sharing – different subsets of the information could be shared depending on permissions and organizational sharing policies. The specification is scoped to include methodological information for different sequencing assay types (e.g. culture and WGS, metagenomics, amplicon sequencing), as well as different widely used use cases (e.g. AMR surveillance, SARS-CoV-2 surveillance, and surveillance of different pathogens, i.e. “pathogen agnostic” surveillance).

Step 2: Make a list of relevant stakeholders and conduct consultations

The best way to understand data needs and data use is to talk to the people generating the data and using it. To ensure that the PHA4GE package was fit for a variety of public health purposes, we conducted consultations with >60 individuals involved in different wastewater-related public health activities including wastewater sanitation and wastewater surveillance of various pathogens and resistances (e.g. CDC, APHL, UKHSA, PHAC, Africa PGI, WHO), as well as disease modellers, standards developers, and more.

Step 3: Review existing standards

In addition to consultations, existing resources and standards/specifications should be reviewed. Often, standards are created by different authoritative sources for different purposes or to answer different questions. To help make data reusable, it is a best practice to reuse existing standards as much as possible (where appropriate). After review, it is good practice to identify gaps or unmet needs which can be addressed by the new specification. Resources and standards that were explored prior to the development of the PHA4GE wastewater specification included:

o The Global Sewage Project

o US CDC National Wastewater Surveillance System Data Dictionary

o Public Health Environmental Surveillance Open Data Model

o NCBI Biosample Template for SARS-CoV-2 WW

o NCBI SRA Template

o POLIS Environmental Surveillance Sites Template

o Genomics Standards Consortium (GSC) MIxS checklists

o ENA Data Standards: GSC MIxS wastewater sludge; ENA sewage checklist

o DDBJ Data Standards: GSC MIMS.me.wastewater; SARS-CoV-2.wwsurv

o GISAID clinical/wastewater metadata standards

o Results from PATH/EQUALS survey of best standards for wastewater surveillance

o OBO Foundry ontologies

o ISO 23418

Step 4: Develop a draft schema

Once stakeholders have been consulted about data needs and use cases, existing resources have been reviewed and gaps have been identified, the next step is to collect the desired fields and terms and structure them according to a community supported structure or framework. PHA4GE implements a modular, interoperable, extensible framework based on ISO 23418 (Microbiology of the food chain — Whole genome sequencing for typing and genomic characterization of bacteria — General requirements and guidance). Modules contain standardized fields and terms that are grouped thematically and sourced from community-based ontologies. The framework can be customized by simply adding or removing modules, or by enriching or depleting modules for particular fields and terms. Dates are specified according to ISO 8601 prescribed formats. This structure was used to build the PHA4GE SARS-CoV-2 specification during the pandemic. The Wastewater specification’s “Sequencing information” module has been enriched by the addition of PHA4GE’s list of characterized tiling PCR amplicon schemes, and its “Bioinformatics and QC” module has been enriched by the inclusion of PHA4GE QC tags. New fields and terms created for this specification were generated in accordance with best semantic practices and made publicly available via different ontology look-up services (e.g. EBI’s OLS). Members of the project team have more than 10 years’ experience in developing ontologies and specifications for public health. Communication with other communities of practice is also often necessary and is good practice e.g. ontology community (OBO Foundry), international public health organizations (e.g. WHO’s IPSN), genomics and bioinformatics organizations (GA4GH, Elixir), data sharing platforms (e.g. INSDC), ethics and data governance working groups, and more.
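To make the idea of a modular framework more concrete, here is a minimal sketch in Python. It is not the PHA4GE implementation: the module names, field names and helper functions are all hypothetical, chosen only to show how a template might be assembled by adding or removing modules and how a full calendar date could be checked against the ISO 8601 extended format (YYYY-MM-DD). ISO 8601 also allows reduced-precision dates such as YYYY-MM, which this sketch deliberately does not handle.

```python
import re
from datetime import date

# Hypothetical modules. Field names are illustrative only and are
# NOT copied from the PHA4GE Wastewater specification.
MODULES = {
    "sample_collection": ["sample_id", "sample_collection_date", "geo_loc_country"],
    "sequencing": ["sequencing_instrument", "amplicon_scheme"],
    "bioinformatics_qc": ["read_mapping_software", "qc_tag"],
}

# Shape check for the ISO 8601 extended calendar date format.
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def build_template(module_names):
    """Return the fields of the selected modules, in the order requested."""
    fields = []
    for name in module_names:
        fields.extend(MODULES[name])
    return fields


def is_iso8601_date(value):
    """Return True only for a real calendar date written as YYYY-MM-DD."""
    if not isinstance(value, str) or not DATE_PATTERN.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)  # rejects impossible dates such as 2024-02-30
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    # A template for a lab that does not need the bioinformatics module.
    print(build_template(["sample_collection", "sequencing"]))
    print(is_iso8601_date("2024-06-30"), is_iso8601_date("30/06/2024"))
```

Keeping modules as independent lists means a use case-specific template is just a selection of modules, which mirrors the "add or remove modules" customization described above.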

Step 5: Testing, testing, and more testing

Testing a specification is critical to ensure its utility and usability in real-world settings. Often, testing provides input about how to streamline implementation or augment existing vocabulary. The PHA4GE Wastewater specification was shared among stakeholders for input and feedback.

The PHA4GE Wastewater specification was also tested by labs in lower resourced settings via a series of PHA4GE subgrants. Ten labs were awarded a subgrant, while other labs that volunteered to test the specification participated in the testing exercise in exchange for academic credit (authorship) on the manuscript.

Step 6: Provide a clear path to implementation via tooling and support materials

Data specifications are files encoding fields, terms, definitions, guidance, examples, as well as data types, rules, patterns, max/mins etc., in different formats (JSON, YAML, tsv, LinkML). One cannot simply hand a public health practitioner a specification file and expect them to know what to do with it. Users need tools to operationalize specifications, and they need protocols (SOPs) to put the tools and vocabulary into practice. Use case-specific subsets of the PHA4GE Wastewater specification have been encoded as fillable templates (AMR, SARS-CoV-2, Pathogen Agnostic) in an application called the DataHarmonizer – an easy-to-use, JavaScript-based, locally downloadable data curation and validation tool. DataHarmonizer operating instructions, field and term reference guides, as well as a curation SOP highlighting different ethical, privacy and practical considerations, are all made publicly available via PHA4GE’s Wastewater specification GitHub repository. A series of worked examples showing how the templates can be used to capture contextual data in different scenarios, with attention to key fields and terms, is also available.
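As a rough illustration of what "operationalizing" a specification means, the Python sketch below checks one record against a small set of rules of the kinds mentioned above: required fields, patterns, picklists and min/max ranges. It is only a sketch under stated assumptions. The field names, rule keys and the `validate_record` function are hypothetical, and this is not how the DataHarmonizer itself is implemented.

```python
import re

# Hypothetical rules, written as the kind of dictionary one might load
# from a JSON or YAML specification file. Not the PHA4GE field list.
SPEC = {
    "sample_id": {"required": True, "pattern": r"[A-Za-z0-9_-]+"},
    "purpose_of_sampling": {"required": True, "allowed": ["surveillance", "research"]},
    "water_temperature_c": {"required": False, "type": "number", "min": 0.0, "max": 60.0},
}


def validate_record(record, spec=SPEC):
    """Return a list of human-readable problems found in one record."""
    errors = []
    for field, rule in spec.items():
        value = record.get(field)
        if value is None or value == "":
            if rule.get("required"):
                errors.append(f"{field}: missing required value")
            continue
        if "pattern" in rule and not re.fullmatch(rule["pattern"], str(value)):
            errors.append(f"{field}: does not match the expected pattern")
        if "allowed" in rule and value not in rule["allowed"]:
            errors.append(f"{field}: not a term from the picklist")
        if rule.get("type") == "number":
            try:
                number = float(value)
            except (TypeError, ValueError):
                errors.append(f"{field}: not a number")
                continue
            low = rule.get("min", float("-inf"))
            high = rule.get("max", float("inf"))
            if not low <= number <= high:
                errors.append(f"{field}: outside the range {low} to {high}")
    return errors


if __name__ == "__main__":
    record = {"sample_id": "WW 001", "purpose_of_sampling": "monitoring"}
    for problem in validate_record(record):
        print(problem)
```

Returning a list of messages, rather than stopping at the first failure, lets a curator fix every problem in a row in one pass, which is the kind of feedback a curation tool needs to give.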

Step 7: Gather examples of real-world implementations, and enable data interchange between systems

In addition to testing, it is useful to document implementation of standards in real-world public health systems as proof-of-concept, as well as models for how it can be achieved by others in the community. One thing we can be sure of, besides death and taxes, is that there will never be one standard to rule them all. As discussed, different standards are developed for different needs and to answer different kinds of questions. Furthermore, significant investments in organization-specific systems can make it difficult for labs to restructure their processes and data management systems. As such, it is often necessary to develop interchange formats based on mapping fields and terms across schemas/specifications to enable interoperability. PHA4GE has previously developed export formats from the PHA4GE SARS-CoV-2 template in the DataHarmonizer to create submission-ready forms for INSDC and GISAID. Similarly, export formats from the DataHarmonizer’s wastewater template series are being constructed to enable submission to NCBI BioSample and SRA, as well as ENA’s BioSample and Experimental metadata, in accordance with the PHA4GE INSDC-compliant Data Object Model (DOM). Automation of mapping and data transformations can help improve data sharing. PHA4GE also provides specifications in machine-amenable formats like JSON that enable software developers to integrate them into systems and to develop further tooling. PHA4GE welcomes and encourages all community development of tools based on standardized inputs/outputs.
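The mapping of fields and terms across schemas can be pictured as a pair of lookup tables. The Python sketch below is a simplified illustration, not one of the actual PHA4GE export formats: the source and target field names, the term mapping and the `export_record` function are all hypothetical. It renames fields, translates picklist terms where a mapping exists, and reports any source fields that have no counterpart in the target schema.

```python
# Hypothetical field mapping: source field name -> target field name.
# These names are illustrative and do not reproduce any real submission format.
FIELD_MAP = {
    "sample_collection_date": "collection_date",
    "specimen_type": "sample_type",
}

# Hypothetical term mapping, keyed by source field name.
TERM_MAP = {
    "specimen_type": {"raw sewage": "wastewater"},
}


def export_record(record, field_map=FIELD_MAP, term_map=TERM_MAP):
    """Translate one record into the target schema.

    Returns (exported_record, unmapped_fields) so that fields with no
    counterpart are reported to the curator instead of silently dropped.
    """
    exported = {}
    unmapped = []
    for field, value in record.items():
        target = field_map.get(field)
        if target is None:
            unmapped.append(field)
            continue
        # Use the mapped term if one exists, otherwise pass the value through.
        exported[target] = term_map.get(field, {}).get(value, value)
    return exported, unmapped


if __name__ == "__main__":
    source = {
        "sample_collection_date": "2024-06-30",
        "specimen_type": "raw sewage",
        "internal_note": "rerun requested",
    }
    print(export_record(source))
```

Keeping the mappings as data rather than code means a new target format only needs a new pair of tables, which is one way the automation mentioned above could be kept maintainable.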

Step 8: Update the standard as data needs evolve

Data standards must be maintained over time to ensure their longevity and utility. Furthermore, data needs change over time as pathogens or tracking systems evolve. PHA4GE commits to updating and maintaining all its standards, which are made publicly and freely available via GitHub. Changes are version controlled, and to encourage transparent discussions about data needs, community discussions and requests are openly tracked via IssueTracker. Furthermore, PHA4GE enables direct community requests for field or term updates/additions via GitHub New Term Requests – a standardized form for requesting changes or new additions of vocabulary either singly or in bulk.

Step 9: Encourage community participation in standards development

Consensus is key to everything we do in PHA4GE. As such, all project team members have an equal voice and decisions are made by consensus according to the PHA4GE Charter. Achieving community consensus requires community participation. There are many ways to be involved in data standards projects. Interested individuals can join PHA4GE and be involved in different roles within the project from providing technical expertise, to contributing feedback during testing, to reviewing manuscripts, to sharing thoughts/ideas during group meeting discussions, to taking learnings back to home organizations. More information describing past, ongoing, and future planned projects (including links to GitHub repositories and published manuscripts) can be found on the DSWG projects page.

The Wastewater team is currently wrapping up subgrant work with labs in lower resourced settings to test the data standard for utility across geographical regions and sampling strategies. Subgrantees are providing input with regard to use cases (e.g. antimicrobial resistance surveillance, SARS-CoV-2 surveillance, and surveillance of many other pathogens such as cholera, enterics and influenza) and recommendations for improvements based on their expertise. These suggestions are being integrated into the specification package, and all participants will be co-authors on a joint manuscript in preparation.

Emma Griffiths, Charlie Barclay, Jillian Paull, Tracey Calvert-Joshua and Rangarirai Matima
