SPHN Quality Control Tools

21.07.2022

Ensuring data quality is of the utmost importance for personalized health research, but validating data can be a cumbersome and time-consuming process. In the Semantic Web, the Shapes Constraint Language (SHACL) is used to validate RDF graph data against a given schema. However, writing such validation rules can be slow and difficult if you are not familiar with Semantic Web languages. To support quality control and compliance with the SPHN data specifications at the data provider level, the SPHN Data Coordination Center, together with its collaborators from HUG, BioMedIT and Trivadis Part of Accenture, has developed two quality control tools, the SHACLer and the Quality Check tool, which simplify the quality control process before data is delivered to the SPHN projects.

The SHACLer tool

The SHACLer is a Python-based tool that automatically builds SHACL rules for validating whether the structure of a dataset follows the criteria defined in an SPHN-compliant RDF schema. It takes an RDF schema file as input and generates a file with SHACL rules. The generated SHACL file can then be used on any triplestore that supports SHACL validation (e.g. GraphDB, Apache Jena) to check the compliance of data with an SPHN schema. Alternatively, the Quality Check tool (see below) can be used. The SHACLer can also generate project-specific SHACL rules. The generated checks cover, among other things, cardinalities and property value restrictions. This ensures a harmonized data delivery from data providers.
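To make the idea of generated cardinality and value restrictions more concrete, here is a minimal sketch of how a schema description might be turned into a SHACL node shape in Turtle. It is not the SHACLer's actual implementation: the `PropertyRule` class, the `build_node_shape` function, the `ex:` namespace and the example class and property names are all hypothetical. Only the `sh:` terms (`sh:NodeShape`, `sh:targetClass`, `sh:property`, `sh:path`, `sh:minCount`, `sh:maxCount`, `sh:class`) come from the SHACL vocabulary.

```python
from dataclasses import dataclass
from typing import List, Optional

# Prefix declarations for the generated Turtle. The ex: namespace is a placeholder.
PREFIXES = (
    "@prefix sh: <http://www.w3.org/ns/shacl#> .\n"
    "@prefix ex: <http://example.org/schema#> .\n"
)


@dataclass
class PropertyRule:
    """One property restriction from a (hypothetical) simplified schema description."""
    path: str                          # prefixed property name, e.g. "ex:hasSubject"
    min_count: Optional[int] = None    # minimum cardinality, if restricted
    max_count: Optional[int] = None    # maximum cardinality, if restricted
    value_class: Optional[str] = None  # class the property value must belong to


def build_node_shape(target_class: str, rules: List[PropertyRule]) -> str:
    """Render a SHACL node shape for a prefixed class name as Turtle text."""
    shape_name = target_class.split(":", 1)[1] + "Shape"
    header = [
        f"ex:{shape_name} a sh:NodeShape ;",
        f"    sh:targetClass {target_class} ;",
    ]
    blocks = []
    for rule in rules:
        parts = [f"sh:path {rule.path}"]
        if rule.min_count is not None:
            parts.append(f"sh:minCount {rule.min_count}")
        if rule.max_count is not None:
            parts.append(f"sh:maxCount {rule.max_count}")
        if rule.value_class is not None:
            parts.append(f"sh:class {rule.value_class}")
        blocks.append("    sh:property [ " + " ; ".join(parts) + " ]")
    return "\n".join(header) + "\n" + " ;\n".join(blocks) + " .\n"


if __name__ == "__main__":
    # Hypothetical rule: every ex:Measurement has exactly one ex:Patient as subject.
    rules = [PropertyRule("ex:hasSubject", min_count=1, max_count=1,
                          value_class="ex:Patient")]
    print(PREFIXES + build_node_shape("ex:Measurement", rules))
```

The printed Turtle could in principle be loaded into a triplestore that supports SHACL validation. A real generator would read the restrictions from the RDF schema file rather than from hand-written rule objects.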

The Quality Check tool

The Quality Check (QC) tool is Java-based and designed to facilitate the validation of SPHN RDF data at the data provider level. It takes four inputs: the SHACL file generated by the SHACLer, a set of statistical queries written in the SPARQL Protocol and RDF Query Language (SPARQL), the SPHN RDF schema and the data itself. From these it generates a human-friendly report with information about the conformance of the data to the schema and some basic statistics about the data (i.e. number of patients, patients per hospital, data referring to concepts that are not part of the schema). The tool can be run directly at the clinical data warehouse before the data is sent to the data users. It has no limits on transaction size, so hundreds of millions of triples can be checked, depending on the resources of the machine, which facilitates the validation of big data.
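The statistics mentioned above can be illustrated with a small sketch. It uses plain Python over an in-memory list of triples as a stand-in for SPARQL queries against a triplestore, so it shows the kind of figures such a report contains, not how the QC tool computes them. The `summarize` function, the `ex:` names and the `ex:hasProvider` property are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as prefixed names

RDF_TYPE = "rdf:type"


def summarize(triples: List[Triple], schema_classes: Set[str]) -> Dict[str, object]:
    """Compute basic statistics comparable to those in a quality report."""
    # All subjects typed as patients (ex:Patient is a placeholder class name).
    patients = {s for s, p, o in triples if p == RDF_TYPE and o == "ex:Patient"}
    # Patients per hospital, via the hypothetical ex:hasProvider property.
    per_hospital = Counter(
        o for s, p, o in triples if p == "ex:hasProvider" and s in patients
    )
    # Classes used in the data that the schema does not define.
    used_classes = {o for s, p, o in triples if p == RDF_TYPE}
    return {
        "patients": len(patients),
        "patients_per_hospital": dict(per_hospital),
        "classes_not_in_schema": sorted(used_classes - schema_classes),
    }


if __name__ == "__main__":
    data = [
        ("ex:p1", RDF_TYPE, "ex:Patient"),
        ("ex:p1", "ex:hasProvider", "ex:HospitalA"),
        ("ex:p2", RDF_TYPE, "ex:Patient"),
        ("ex:p2", "ex:hasProvider", "ex:HospitalB"),
        ("ex:x1", RDF_TYPE, "ex:UnknownConcept"),
    ]
    print(summarize(data, schema_classes={"ex:Patient"}))
```

An in-memory list like this would not scale to hundreds of millions of triples. It is only meant to show what each statistic measures.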

Availability
