
Transforming data for easy integration, reproducibility and sharing

Data Structures Working Group Chair
Dr. Emma Griffiths


How data is structured, organized, managed and stored greatly impacts how it can be used and integrated with existing data. Standardized data structures and interchange formats are critical to the development of an open software ecosystem that will empower the microbial genomics community to analyze and govern their own data. The PHA4GE Data Structures working group comprises 16 active members who are researchers, bioinformaticians and domain experts from North America and Europe, representing public health agencies, research institutions and several large public databases. As one of PHA4GE’s Technical working groups, the Data Structures group focuses on the development, adaptation and standardization of data models for microbial sequence data, contextual metadata, analytical results, and workflow metrics. Through the adoption of these data models we hope to improve the transparency, interoperability, and reproducibility of public health sequencing workflows.


One of the most critical barriers in public health genomics and bioinformatics is the lack of interoperability between datasets, tools, and systems, which inhibits the exchange, comparison, analysis, and consistent interpretation of data. This lack of interoperability also creates data silos, increases the need for work-arounds, and can reduce the efficiency of public health responses. The broad and consistent use of data standards facilitates interoperability by ensuring universal use of well-understood terms, formats, and data structures. As part of our initial landscape review of current challenges in the open-source public health genomics ecosystem, we identified several key gaps that could be addressed via pilot projects creating and implementing new data standards. As such, we are currently developing a gene detection specification standard for harmonizing the reporting of results, using antimicrobial resistance genes as a proof of concept. Additionally, we are developing a SARS-CoV-2 contextual data specification to support data management, harmonization, and sharing during the current pandemic.


Antimicrobial resistance (AMR) is a global health problem that contributes to tens of thousands of deaths per year. A number of widely used AMR gene detection tools are currently available, which differ in terms of their inputs, functionality (including parameters and reference databases), and outputs. Differences in the meaning, structure and range of values in the outputs of these tools can make comparing and interpreting results difficult for public health practitioners and researchers. To address these issues, we are developing a standardized AMR gene detection output specification to better harmonize AMR detection results across tools and resources and to improve interoperability. To support the specification, we have mapped the outputs of different tools to the standard and are developing Biopython-compatible parsers that will transform the variable outputs to the PHA4GE standard. We have also created a fully automated pipeline that will run arbitrary microbial genome datasets through almost all currently available species-agnostic AMR gene detection tools. This pipeline can be used in tandem with the parsers and the standard to better enable comparisons of data and to support benchmarking.
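
As a rough illustration of what mapping tool outputs to a common standard can involve, the sketch below harmonizes the reports of two made-up tools into one record type. Everything in it is an assumption for illustration: the tool names (tool_a, tool_b), their column headers, the GeneHit fields and the harmonize function are hypothetical and are not taken from the PHA4GE specification or its parsers. It shows two of the differences described above: different column names for the same concept, and different value ranges (percentages versus fractions).

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class GeneHit:
    """Hypothetical harmonized record; field names are illustrative only."""
    sample_id: str
    gene_symbol: str
    percent_identity: float
    percent_coverage: float
    analysis_tool: str


# Hypothetical report layouts for two made-up tools. Each layout maps the
# harmonized field names to that tool's own column headers, and records
# whether identity/coverage are reported as percentages (scale 1) or as
# fractions between 0 and 1 (scale 100).
TOOL_LAYOUTS = {
    "tool_a": {
        "delimiter": "\t",
        "columns": {
            "sample_id": "FILE",
            "gene_symbol": "GENE",
            "percent_identity": "%IDENTITY",
            "percent_coverage": "%COVERAGE",
        },
        "percent_scale": 1.0,
    },
    "tool_b": {
        "delimiter": ",",
        "columns": {
            "sample_id": "isolate",
            "gene_symbol": "resistance_gene",
            "percent_identity": "identity",
            "percent_coverage": "coverage",
        },
        "percent_scale": 100.0,
    },
}


def harmonize(report_text, tool):
    """Parse one tool's delimited report into a list of GeneHit records."""
    layout = TOOL_LAYOUTS[tool]
    cols = layout["columns"]
    scale = layout["percent_scale"]
    reader = csv.DictReader(io.StringIO(report_text), delimiter=layout["delimiter"])
    return [
        GeneHit(
            sample_id=row[cols["sample_id"]],
            gene_symbol=row[cols["gene_symbol"]],
            # Rescale so every record expresses identity and coverage in percent.
            percent_identity=round(float(row[cols["percent_identity"]]) * scale, 2),
            percent_coverage=round(float(row[cols["percent_coverage"]]) * scale, 2),
            analysis_tool=tool,
        )
        for row in reader
    ]


# Made-up example reports describing the same gene detection.
TOOL_A_REPORT = "FILE\tGENE\t%IDENTITY\t%COVERAGE\nisolate1.fasta\tblaTEM-1\t99.8\t100.0\n"
TOOL_B_REPORT = "isolate,resistance_gene,identity,coverage\nisolate1,blaTEM-1,0.998,1.0\n"

if __name__ == "__main__":
    for hit in harmonize(TOOL_A_REPORT, "tool_a") + harmonize(TOOL_B_REPORT, "tool_b"):
        print(hit)
```

Once both reports share the same fields and units, records from different tools can be compared directly, which is the kind of comparison and benchmarking the parsers and pipeline are intended to support.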


Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has been implicated in over 10 million cases of COVID-19 and 500,000 deaths worldwide. A key tool for understanding SARS-CoV-2 epidemiology has been analysis of viral genome sequence data, which has helped elucidate the spread and evolution of the virus at global, national and local scales, in addition to being used to develop diagnostic tests, treatments, and vaccines. Contextual data (also known as metadata) is essential for interpreting sequence data, answering biological questions, and informing public health decision making. Generic contextual data standards are scoped to cover as many use cases and pathogens as possible, and so can include fields of information that are not applicable to SARS-CoV-2 or that may be subject to privacy concerns, or exclude fields commonly used in public health surveillance and investigations. In the face of the current pandemic, we identified a clear and present need for a fit-for-purpose, open-source SARS-CoV-2 contextual data standard. As such, we have developed a SARS-CoV-2 contextual data specification based on publicly available community standards to support data management, harmonization, and sharing. The specification is implemented via a collection template, as well as an array of protocols and tools that we have created to support the harmonization and submission of sequence data and contextual information to public repositories. Well-structured, rich contextual data adds value, promotes reuse, and enables aggregation and integration of disparate data sets. Adoption of the proposed standard and practices by public health practitioners and researchers will better enable interoperability between datasets and systems, improve the consistency and utility of generated data, and ultimately facilitate novel insights and discoveries in SARS-CoV-2 and COVID-19. A manuscript is in preparation and is expected to be released soon as a preprint, prior to publication.
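
To make the idea of a collection template more concrete, the following minimal sketch checks a contextual data record against a small set of rules: required fields, a date format, and a controlled list of allowed values. The field names, the picklist values and the validate_record function are hypothetical and chosen only for illustration; they are not the fields or tools of the PHA4GE SARS-CoV-2 specification.

```python
from datetime import date

# Hypothetical template: field name -> validation rules. A real
# specification would define many more fields and richer vocabularies.
TEMPLATE = {
    "specimen_id": {"required": True},
    "collection_date": {"required": True, "type": "date"},
    "geo_country": {"required": True},
    "host": {"required": False, "allowed": {"Human", "Animal", "Environment"}},
}


def validate_record(record):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field, rules in TEMPLATE.items():
        value = record.get(field, "").strip()
        if not value:
            if rules.get("required"):
                problems.append(f"{field}: missing required value")
            continue
        if rules.get("type") == "date":
            try:
                date.fromisoformat(value)
            except ValueError:
                problems.append(f"{field}: not an ISO 8601 date (YYYY-MM-DD)")
        allowed = rules.get("allowed")
        if allowed is not None and value not in allowed:
            problems.append(f"{field}: '{value}' is not an allowed value")
    # Flag fields the template does not know about, so that free-form
    # additions are noticed instead of silently passing through.
    for field in record:
        if field not in TEMPLATE:
            problems.append(f"{field}: not defined in the template")
    return problems


if __name__ == "__main__":
    good = {
        "specimen_id": "S-001",
        "collection_date": "2020-03-15",
        "geo_country": "Canada",
        "host": "Human",
    }
    bad = {
        "specimen_id": "S-002",
        "collection_date": "15/03/2020",
        "host": "human being",
        "notes": "free text",
    }
    print(validate_record(good))
    print(validate_record(bad))
```

Checking records against shared rules before submission is one way a template can keep contextual data consistent enough to aggregate across laboratories and repositories.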


For more information regarding the activities and priorities of the PHA4GE Data Structures Working Group visit:

https://pha4ge.org/data-structures/
