Standardization, not perspiration: #TeamDataStructures partners with organizations around the world to implement data standards to build interoperability

November 15, 2021

“Data harmonization should be 10% perspiration, 90% standardization.”

~Data Structures Working Group

A “data standard” is a technical specification that describes how data should be stored or exchanged for the consistent collection and interoperability of that data across different systems, sources, and users. The Data Structures working group (DS WG) has been very busy this quarter working with the community to improve and implement different PHA4GE data standards, and to train users around the world.

At the beginning of October, the PHA4GE DS WG was pleased to partner with the UK’s Bioinformatics Cloud Infrastructure platform, CLIMB BIG DATA, as well as Europe’s JPIAMR (Joint Programming Initiative on Antimicrobial Resistance) to host the 7th Microbial Bioinformatics Virtual Hackathon. The goal of the hackathon was to bring international bioinformatics researchers, scientists, and clinicians together to collaborate on improving, building, and extending bioinformatics tools and methods for the AMR community. The three-day hackathon attracted over 78 participants from 32 different countries. Projects included testing and expanding PHA4GE’s SNV variant detection standard for AMR (with special focus on its application to harmonizing Mycobacterium tuberculosis (Mtb) mutation-based resistance information), the alignment of AMR databases (programmatic merging and deduplicating), the creation of standardised benchmarking datasets (genomic, metagenomic, assembled, unassembled), the use of uncorrected long-read data for AMR analyses, and GPU-enabled AMR calling and analysis. Learn more about some of the tools and resources created during the hackathon by visiting https://github.com/AMR-Hackathon-2021 and https://github.com/pha4ge/hAMRonization.

This community effort yielded many impressive results. Most notable for the DS WG were the development of parsers for harmonizing the outputs of different Mtb tools (e.g. TBProfiler, Mykrobe) for easier data sharing and comparisons of analytical results, a layman’s “translator” for the widely used (and sometimes tricky to interpret) sequence variant nomenclature HGVS, and the integration of ChamrDb (https://gitlab.com/antunderwood/chamredb/-/tree/master) into PHA4GE’s hAMRonization package (https://github.com/pha4ge/hAMRonization). ChamrDb is a tool for matching predicted genes across different widely used databases to better enable harmonization of AMR gene/variant prediction tool outputs. Many thanks to the hackathon participants who contributed to these achievements!
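To give a feel for what "harmonizing the outputs of different tools" means in practice, here is a minimal Python sketch. It is not the hAMRonization API and it does not reproduce the real output formats of TBProfiler or Mykrobe: the two input layouts ("tool A" and "tool B"), their field names, and the HarmonizedResult record are all hypothetical, chosen only to show the idea of parsing differently shaped reports into one common structure so that results can be compared directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarmonizedResult:
    """A hypothetical common record for a mutation-based resistance call."""
    sample_id: str
    gene: str
    mutation: str
    drug: str
    source_tool: str


def parse_tool_a(row):
    # Hypothetical tool A reports gene and mutation joined as "gene_mutation".
    gene, mutation = row["variant"].split("_", 1)
    return HarmonizedResult(row["sample"], gene, mutation, row["drug"].lower(), "tool_a")


def parse_tool_b(row):
    # Hypothetical tool B reports gene and mutation in separate fields.
    return HarmonizedResult(row["id"], row["locus"], row["change"],
                            row["antibiotic"].lower(), "tool_b")


def same_call(a, b):
    """Two results agree if everything except the reporting tool matches."""
    return (a.sample_id, a.gene, a.mutation, a.drug) == \
           (b.sample_id, b.gene, b.mutation, b.drug)


# Illustrative input rows (made-up samples, not real tool output).
result_a = parse_tool_a({"sample": "S1", "variant": "rpoB_S450L", "drug": "Rifampicin"})
result_b = parse_tool_b({"id": "S1", "locus": "rpoB", "change": "S450L",
                         "antibiotic": "rifampicin"})

print(same_call(result_a, result_b))
```

Once both reports are in the same structure, comparing analytical results becomes a simple equality check rather than a manual reading of two differently formatted files.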

Applications for the hackathon suggested that many people were interested in AMR bioinformatics training. To accommodate this need, a virtual Joint PHA4GE-CLIMB BIG DATA-JPIAMR AMR Bioinformatics Workshop was held on October 15, which attracted over 200 participants from around the globe. The half-day webinar-style event featured four talks. Dr. Kara Tsang (London School of Hygiene & Tropical Medicine, UK) provided an overview of existing AMR-related databases and resources including the Comprehensive Antimicrobial Resistance Database (CARD), NCBI’s National Database of Antibiotic Resistant Organisms (NDARO), the Center for Genomic Epidemiology’s ResFinder database, and the Pathosystems Resource Integration Center (PATRIC). Dr. Michael Feldgarden (National Center for Biotechnology Information, USA) provided an overview of the theory and use of bioinformatics tools to detect AMR genes from genomes (e.g., AMRFinderPlus). Ines Mendes (Instituto de Medicina Molecular, PT) described PHA4GE’s AMR gene/mutation detection data standard, as well as the hAMRonization program designed to help put standards into practice in public health settings. Dr. Finlay Maguire (Dalhousie University, CA) demonstrated how databases and tools can work together in workflows to efficiently generate AMR gene reports from bacterial genomic reads (https://github.com/fmaguire/amr_training_workshop_practical). Those who missed the event need not worry, as all of the talks are freely available online for streaming (https://www.youtube.com/watch?v=DNsF8U4EsIY&list=PLwfIvG-RsIuTp6zDaBhcDVC7OthvF5Hpx). Slide decks are also available for reference and training purposes on the PHA4GE website (https://pha4ge.org/amr-workshop-2021/). Special thanks to Drs. Finlay Maguire, Andrew Page, and Lisa Marchioretto, and the rest of the Steering Committee for all their hard work organizing the hackathon and workshop.

Also in October, members of PHA4GE (Drs. Duncan MacCannell, Danny Park, and Emma Griffiths) met with the GA4GH community to present our activities and explore opportunities for collaboration at the GA4GH Connect meeting (Oct 12). The Global Alliance for Genomics and Health (GA4GH) is a policy-framing and technical standards-setting organization, seeking to enable responsible genomic data sharing within a human rights framework (https://www.ga4gh.org/). GA4GH and PHA4GE will continue to align efforts and work together to implement data standards in the future.

An important part of putting standards into practice is training public health personnel to use them. The PHA4GE DS WG was pleased to partner with the Africa Pathogen Genomics Initiative (Africa PGI)’s NGS Academy (https://africacdc.org/institutes/ipg/) to offer a workshop on SARS-CoV-2 contextual data curation and stewardship. Contextual data for public health microbial genomics consists of all the clinical, epidemiological, and lab data, as well as the sample metadata, methods information, and quality control metrics, that enable the interpretation of sequence data for decision making. This information is often encoded in different systems and spreadsheets using different fields, terms, and formats. Different public repositories have different submission requirements, and different public health agencies have different reporting requirements. Participants learned about the PHA4GE contextual data specification (https://github.com/pha4ge/SARS-CoV-2-Contextual-Data-Specification) and how this information can be standardized to improve interoperability and analyses, as well as different privacy and practical considerations for data sharing.
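As a small illustration of the "different fields, terms, and formats" problem, the Python sketch below renames lab-specific spreadsheet columns to a single set of harmonized field names. Everything in it is an assumption made for illustration: the column names, the harmonized names, and the FIELD_MAP lookup are hypothetical and are not taken from the PHA4GE specification, which should be consulted for the actual fields and accepted values.

```python
# Hypothetical mapping from lab-specific column names to harmonized field names.
# These names are illustrative only and are not taken from the PHA4GE specification.
FIELD_MAP = {
    "Sample ID": "sample_id",
    "sampleName": "sample_id",
    "Collection Date": "collection_date",
    "date_collected": "collection_date",
    "Country": "country",
    "geo_loc": "country",
}


def harmonize_record(record):
    """Rename known fields to harmonized names; keep unknown fields aside for review."""
    harmonized = {}
    unmapped = {}
    for field, value in record.items():
        target = FIELD_MAP.get(field)
        if target is None:
            unmapped[field] = value
        else:
            harmonized[target] = value
    return harmonized, unmapped


# Two made-up records describing samples with different column conventions.
lab_a = {"Sample ID": "S1", "Collection Date": "2021-10-01", "Country": "Kenya"}
lab_b = {"sampleName": "S2", "date_collected": "2021-10-02", "geo_loc": "Ghana",
         "notes": "rerun"}

harmonized_a, unmapped_a = harmonize_record(lab_a)
harmonized_b, unmapped_b = harmonize_record(lab_b)

# After harmonization both records expose the same set of field names.
print(sorted(harmonized_a) == sorted(harmonized_b))
print(unmapped_b)
```

Returning unmapped fields separately, rather than silently dropping them, leaves a curator free to decide whether each one belongs in the standard, needs a new mapping, or should be withheld for privacy reasons.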

The workshop was very timely, as the variability in global data standards for SARS-CoV-2 contextual data can be difficult to navigate. As this topic is of broad interest, the PHA4GE DS WG will also be participating in a SARS-CoV-2 data standards workshop hosted by ELIXIR, an intergovernmental organisation that coordinates European compute resources, including databases, software tools, training materials, cloud storage and more (https://elixir-europe.org/about-us). The PHA4GE DS WG will present different data standards (e.g. GISAID, ENA, WHO) and how to map between them. We will also describe how the PHA4GE SARS-CoV-2 contextual data specification and associated tools and protocols can provide data structures to help better standardize the information going into different repositories. The workshop is scheduled for Nov 16 (3 pm CET), and those interested can register for the workshop here.

To learn more about our activities, and/or how to join, check out our website.
