The challenges of data integration and synchronization

June 17, 2016

As a heads-up to SEMANTiCS 2016, we invited several experts from Linked Enterprise Data Services (LEDS), a “Wachstumskern” project supported by the German Federal Ministry of Education and Research (BMBF), to talk about their work and visions. They will share their insights into natural language processing, e-commerce, e-government, data integration and quality assurance right here. So stay tuned.

As a research associate in the working group "Agile Knowledge Engineering and Semantic Web" (AKSW) at Leipzig University, Natanael Arndt has been actively involved in research on data integration and Linked Data for many years. He has worked with the Leipzig University Library on managing electronic resources (e-papers, e-books, databases) using Linked Data. Within the LEDS project he currently leads the key area "Management of Background Knowledge", which addresses co-evolution as well as the enrichment of internal data with knowledge from the Web of Data and its management.

When Natanael is not researching semantic technologies, he spends his spare time biking, jogging or helping migrants take their first steps in our technological world. His most recent passion is geocaching.

What is the scientific status when it comes to synchronization and integration of semantic data?

Basically, we are still talking about the old problems that, in terms of computers, have been around since the beginning of the Internet and have not been completely solved yet. 15 years ago, Tim Berners-Lee and Dan Connolly pinpointed the problem: “One of the most stubborn problems in practical computing is that of synchronizing calendars and address books between different devices. Various combinations of device and program, from the same or different manufacturers, produce very strange results on not-so-rare occasions.”1

Even the rise of cloud computing hasn’t resolved the problem; if anything, a solution has become even more urgent. Google, for example, sidesteps the problem in its contact data management: when conflicts arise, it simply duplicates the data.

Besides that general use case, there are many specific scenarios in which data must be integrated and which need customized solutions. Certainly, some practical solutions already exist, e.g. ETL processes in data warehouses. A more recent approach to managing and querying heterogeneous data is the data lake. Semantic technologies already play an important role in ontology matching, and I think they will become even more relevant at the instance level too.

Which methodological and technical limitations still have to be overcome?

The problem of synchronization and integration can be divided into three parts: "converting the syntax", "understanding the semantics" and the actual "synchronizing of datasets" (integration) (see Berners-Lee and Connolly: Delta).

When we speak of semantic data, we at least have RDF as a common syntax, and the semantics can be defined in ontologies.

With regard to the synchronization of semantic data, we talk about three aspects: changelogs (evolution or versioning), the transfer of those changelogs between the participants and their databases, and the integration of the changes on either side, that is, merging data while maintaining data consistency.
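
The changelog aspect can be illustrated with a minimal sketch. It assumes that a dataset version is simply a set of ground triples (no blank nodes, which make comparing RDF graphs considerably harder), and the names `Delta`, `diff` and `apply_delta` as well as the `ex:` identifiers are hypothetical, chosen only for this illustration:

```python
from typing import FrozenSet, Iterable, NamedTuple, Tuple

# A triple is (subject, predicate, object); plain strings stand in for
# IRIs and literals. Assumption: no blank nodes occur in the data.
Triple = Tuple[str, str, str]


class Delta(NamedTuple):
    """One changelog entry: what was added and removed between two versions."""
    added: FrozenSet[Triple]
    removed: FrozenSet[Triple]


def diff(old: Iterable[Triple], new: Iterable[Triple]) -> Delta:
    """Compute the changelog that turns `old` into `new`."""
    old_set, new_set = frozenset(old), frozenset(new)
    return Delta(added=new_set - old_set, removed=old_set - new_set)


def apply_delta(graph: Iterable[Triple], delta: Delta) -> FrozenSet[Triple]:
    """Integrate a received changelog into a local copy of the dataset."""
    return (frozenset(graph) - delta.removed) | delta.added


# Version 1 and version 2 of a small address book (hypothetical data).
v1 = {
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:phone", "+49 341 000000"),
}
v2 = {
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:phone", "+49 341 111111"),
}

delta = diff(v1, v2)              # the changelog to be transferred
replica = apply_delta(v1, delta)  # a participant holding v1 catches up
```

Only the delta has to travel between the participants, not the whole dataset. What this sketch leaves out is exactly the hard part named above: keeping the data consistent when both sides have changed it.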

I think that when searching for solutions we need to keep the various aspects of synchronization apart. In particular, the distinction between structural integration and semantic integration brings a lot of clarity. In addition, the integration (and linking) of data is a complex field that cannot always be handled automatically but requires manual work.

There are many proposed solutions for the individual aspects. Our current task is to assess the usability of these approaches and to integrate them into a common system that takes all aspects of the problem into account.

Which aspects are currently of burning interest to researchers?

For the last ten years, software engineering has increasingly relied on distributed version control systems such as Git and Mercurial. For Open Source projects in particular, but also for companies with globally distributed developers, this has been a great improvement in synchronizing source code directories. These systems use a well-functioning combination of successive versions or patches, along with pull and push methods for transmitting the changes. Since these methods and techniques are so fundamentally simple and so successful in software engineering, our currently preferred strategy is to apply them to semantic databases (see “Distributed Collaboration on RDF Datasets Using Git: Towards the Quit Store” by Natanael Arndt, Norman Radtke, Michael Martin).
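
To give an idea of what carrying such version control techniques over to RDF could look like, here is a minimal sketch of a three-way merge on sets of triples. It is an illustration under simplifying assumptions (ground triples only, set semantics) and not the method of the cited paper; the function names, the `ex:` identifiers and the notion of a "functional" predicate check are hypothetical:

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as plain strings


def three_way_merge(base: Iterable[Triple], ours: Iterable[Triple],
                    theirs: Iterable[Triple]) -> FrozenSet[Triple]:
    """Merge two versions that both descend from the common ancestor `base`."""
    b, o, t = frozenset(base), frozenset(ours), frozenset(theirs)
    added = (o - b) | (t - b)    # additions made on either side
    removed = (b - o) | (b - t)  # deletions made on either side
    return (b - removed) | added


def functional_violations(graph: Iterable[Triple],
                          functional_predicates: Set[str]
                          ) -> Dict[Tuple[str, str], Set[str]]:
    """Report subjects with more than one value for a single-valued predicate."""
    values: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for s, p, obj in graph:
        if p in functional_predicates:
            values[(s, p)].add(obj)
    return {key: objs for key, objs in values.items() if len(objs) > 1}


# Both participants start from the same version and change the same e-mail address.
base = {("ex:alice", "ex:name", "Alice"),
        ("ex:alice", "ex:mbox", "mailto:a@old.example")}
ours = {("ex:alice", "ex:name", "Alice"),
        ("ex:alice", "ex:mbox", "mailto:a@work.example")}
theirs = {("ex:alice", "ex:name", "Alice"),
          ("ex:alice", "ex:mbox", "mailto:a@home.example")}

merged = three_way_merge(base, ours, theirs)
conflicts = functional_violations(merged, {"ex:mbox"})
```

On the level of triples the merge always succeeds, but the result contains both new addresses. The second function makes that visible: structurally the data is merged, semantically a decision is still required, which is the kind of case that may need manual curation.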

How does your work in LEDS contribute to tackling these research challenges?

During the project we will try different approaches in each area (versioning, transfer of changes between stakeholders, and merging of datasets while maintaining consistency) and will evaluate their practicability. In addition, we’ll address access control and data curation as well as the systemic integration and orchestration of the individual components and their scalability.


Further reading

  • Structured Feedback: A Distributed Protocol for Feedback and Patches on the Web of Data by Natanael Arndt, Kurt Junghanns, Roy Meissner, Philipp Frischmuth, Norman Radtke, Marvin Frommhold and Michael Martin in Proceedings of the Workshop on Linked Data on the Web co-located with the 25th International World Wide Web Conference (WWW 2016)
  • Publish and Subscribe for RDF in Enterprise Value Networks by Marvin Frommhold, Natanael Arndt, Sebastian Tramp (né Dietzold) and Niklas Petersen in Proceedings of the Workshop on Linked Data on the Web co-located with the 25th International World Wide Web Conference (WWW 2016)

What to expect from Natanael at SEMANTiCS 2016

  • Distributed Collaboration on RDF Datasets Using Git: Towards the Quit Store by Natanael Arndt, Norman Radtke, Michael Martin
  • Towards Versioning of Arbitrary RDF Data by Marvin Frommhold, Ruben Navarro Piris, Natanael Arndt, Sebastian Tramp, Niklas Petersen and Michael Martin

1 Tim Berners-Lee and Dan Connolly: “Delta: an ontology for the distribution of differences between RDF graphs”

Partners

LEDS is a joint research project addressing the evolution of classic enterprise IT infrastructure to semantically linked data services. The research partners are Leipzig University and TU Chemnitz, together with the semantic technology providers Netresearch, Ontos, brox IT-Solutions, Lecos and eccenca.

brox IT-Solutions GmbH

Leipzig University

Ontos GmbH

TU Chemnitz

Netresearch GmbH & Co. KG

Lecos GmbH

eccenca GmbH
