29. June 2022 By Peter Kiss
What is Gaia-X?
The need for data
"Data is the new oil”.
This phrase, coined by the British mathematician Clive Humby, has by now become a commonplace. Data-driven operation and innovation are gaining ground in every aspect of the economy and of people's private lives. Big Data, Data Mining, Machine Learning and Artificial Intelligence are terms that are becoming increasingly impossible to ignore in any business.
Being data-driven naturally involves technologies that build on a significant amount of data. The more powerful and/or more complicated the statistical models we want to use, the more data we need to enable them to model reality in an acceptably precise and, at the same time, sufficiently generalized manner.
In many cases, however, the stakeholders who could exploit such data-driven methods do not possess sufficient data. This insufficiency can take two basic forms: one may lack the necessary amount of data, or one may simply not have the right kind of data. For example, if we want to train a model that can differentiate between pictures of dogs and cats, our own data might be images that we took with the camera of our mobile phone. These will most probably not be enough to train an artificial neural network, which is nowadays considered to be the best tool for such a problem and which, in general, needs many thousands, if not millions, of images (data points) to learn to solve this task. In another case, when a factory wants to optimize its logistics and keep its assembly lines busy all the time, it is most probably not enough to put various sensors everywhere and have the most detailed view of the activities happening within the company; it also needs to know when it can expect deliveries from its suppliers and when its partners can take over its products.
When we do not have the amount of data we need for building our models, we still have a couple of options:
- Transfer learning: We can acquire models that have been trained on more or less similar data. Assuming that they have already learned the features most important for our own data, we can simply fine-tune them with our own, limited amount of data.
- Federated learning: The other possibility is to find partners who are, in a sense, willing to "lend" statistics of their data. That is, they help the training by "suggesting" refinements for the model based on the data they own, without exposing the data itself.
- Data sharing: The third option is the most straightforward and probably the most difficult to achieve, namely to find partners who are willing to share their data.
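To make the federated learning option more concrete, here is a minimal sketch of one common scheme, federated averaging, on a toy linear model. Everything in it is an assumption made for illustration: the function names (`local_update`, `federated_round`, `train`), the learning rate, and the tiny synthetic datasets are hypothetical and not taken from any Gaia-X or IDS specification. The point is only that partners send back refined model parameters, never their raw data.

```python
from typing import List, Tuple

# One partner's private data: (x, y) pairs that never leave the partner.
Dataset = List[Tuple[float, float]]


def local_update(w: float, b: float, data: Dataset,
                 lr: float = 0.02, epochs: int = 20) -> Tuple[float, float]:
    """Refine the shared model y = w*x + b with gradient descent on local data."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def federated_round(w: float, b: float, partners: List[Dataset]) -> Tuple[float, float]:
    """Collect each partner's suggested parameters and average them by data size."""
    total = sum(len(data) for data in partners)
    new_w, new_b = 0.0, 0.0
    for data in partners:
        local_w, local_b = local_update(w, b, data)
        new_w += local_w * len(data) / total
        new_b += local_b * len(data) / total
    return new_w, new_b


def train(partners: List[Dataset], rounds: int = 200) -> Tuple[float, float]:
    """Run several rounds; only model parameters cross organizational borders."""
    w, b = 0.0, 0.0
    for _ in range(rounds):
        w, b = federated_round(w, b, partners)
    return w, b


# Synthetic example: both partners hold samples of the same relation y = 2x + 1.
partner_a: Dataset = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
partner_b: Dataset = [(3.0, 7.0), (4.0, 9.0)]
w, b = train([partner_a, partner_b])
print(f"learned model: y = {w:.3f} * x + {b:.3f}")
```

In a real setting the coordinator and the partners would run on different machines, and the exchanged parameters themselves may need protection, which this sketch does not address.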
In this post I try to give a rather high-level overview of Gaia-X, a European initiative for building an ecosystem that supports the above forms of data-based collaboration.
In 2021, 6 of the 10 companies with the biggest market capitalization were so-called platform providers.
The way a platform provider operates differs from traditional pipeline-driven business models: it works with two- or multi-sided markets based on the data value chain. (For example, when Facebook offers you services such as following what happens with your friends, you allow it, in turn, to capitalize on the data that you expose to it.)
They offer platform services which, together with external companies and their open innovation model, add up to an ecosystem.
Thus platforms are not only enablers of individual business models or ecosystems but also infrastructures for economies, and as such they may serve many domains.
From the perspective of data, these platforms have 3 main components in common (Figure 1, from the presentation "GAIA-X and IDS" by Prof. Dr.-Ing. Boris Otto at the IDSA Summit 2021):
- Service offerings - creating value from the raw data using services of the platform provider, which involves external companies as well
- Platform - publishing the data through integration and aggregation
- Data access - collecting, preprocessing and storing the data
In relation to these 3 layers of the platform economy, the question any company should ask is where it stands in this platform model. At the national level (or the level of the European Union), the question is where our companies stand and how they could generate more value for our society. Possibly due to its still fragmented markets, the EU does not have tech giants such as Meta (Facebook) or Amazon that could host such an ecosystem. This becomes rather problematic when we take into consideration the strict data protection rules of the EU. On the one hand, the giants try to comply with GDPR by keeping European customers' data in data centers located on the continent. On the other hand, the European Union encourages and supports the creation of Europe's own "giant", which would in fact be a federation of the various infrastructure, software and data resources of the participating companies, respecting the European values of self-determination, privacy, transparency, security and fair competition. This peer-to-peer ecosystem is to be created under the name Gaia-X.
European Data Strategy
The European Strategy for Data aims at creating a single market for data, in which data and various services can be shared and exchanged across sectors freely, efficiently and securely within the EU.
The goal of the strategy is to create federated data ecosystems based on shared policies and rules within certain application domains, such as agriculture, the automotive industry, mobility, healthcare and so on. (These are often called data spaces.)
Naturally, many actors are active in multiple ecosystems at the same time, and there are no clear boundaries between the domains. The best way to achieve these goals might therefore be the creation of a general, open ecosystem characterized by mutual trust between participants. Such an ecosystem can be implemented through a soft infrastructure that specifies legal, operational and functional agreements as well as technical standards for data sovereignty and platform interoperability.
The most important feature of the planned infrastructure is that individuals and organizations will regain control over the usage of their data, and users will be able to access data in a secure, transparent, trusted, easy and unified fashion.
IDSA and Gaia-X
To specify the high-level architecture of such a federated platform provider, Gaia-X can build to a great extent on the principles of International Data Spaces (IDS), developed by the International Data Spaces Association (IDSA). International Data Spaces can be understood as a subset of the capabilities that Gaia-X is planned to have, and as a standardization effort to create a soft infrastructure for a peer-to-peer marketplace for data sources and data-related services, built on the principles of data sovereignty, interoperability, trust and governance.
From a technical perspective, a Data Space can be seen as a data integration concept that does not require common database schemas or physical data integration, but is instead based on distributed data stores and integration on an “as needed” basis on a semantic level.
Thus, a Data Space can be viewed as a standardized soft infrastructure for sharing data. The soft infrastructure is a peer-to-peer system of Connectors, or secure IoT gateways, that are connected via federation services. Apart from the general standardization efforts, perhaps the most important part of the development of IDSs is the creation of these connectors, which, beyond the basic requirements of interoperability and trust, can also implement sovereignty. That is, they are able to enforce data usage policies. These policies should in principle be able to express any kind of constraint one might impose, for example:
- data can be read only once, that is, it must be removed from the data user's system immediately after it is read;
- data can be seen only through aggregates;
- data may be used only after specific anonymization methods have been applied.
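As an illustration only, the three example constraints above could be modelled roughly as follows. This is a toy sketch under stated assumptions: `UsagePolicy`, `Connector` and `PolicyViolation` are hypothetical names invented here, not part of any IDS connector implementation, and enforcement inside an ordinary Python process is of course not tamper-proof.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


@dataclass
class UsagePolicy:
    """Hypothetical policy flags mirroring the three example constraints."""
    max_reads: Optional[int] = None                 # 1 means "read only once"
    aggregates_only: bool = False                   # raw rows are never released
    anonymizer: Optional[Callable[[Row], Row]] = None  # applied before release


class PolicyViolation(Exception):
    """Raised when a requested use is not covered by the policy."""


class Connector:
    """Toy stand-in for the policy-enforcing component on the data user's side."""

    def __init__(self, rows: List[Row], policy: UsagePolicy):
        self._rows = list(rows)
        self._policy = policy
        self._reads = 0

    def _count_read(self) -> None:
        limit = self._policy.max_reads
        if limit is not None and self._reads >= limit:
            raise PolicyViolation("read allowance is used up")
        self._reads += 1

    def read_rows(self) -> List[Row]:
        if self._policy.aggregates_only:
            raise PolicyViolation("only aggregates may be read")
        self._count_read()
        anonymize = self._policy.anonymizer or (lambda row: dict(row))
        released = [anonymize(row) for row in self._rows]
        limit = self._policy.max_reads
        if limit is not None and self._reads >= limit:
            self._rows = []  # drop the local copy once the allowance is used up
        return released

    def read_mean(self, field: str) -> float:
        self._count_read()
        return mean(row[field] for row in self._rows)


# Hypothetical sample data and an anonymizer that strips the "name" field.
rows = [{"name": "Ann", "temp": 20.0}, {"name": "Bob", "temp": 22.0}]
drop_name = lambda row: {k: v for k, v in row.items() if k != "name"}

read_once = Connector(rows, UsagePolicy(max_reads=1, anonymizer=drop_name))
print(read_once.read_rows())

aggregates = Connector(rows, UsagePolicy(aggregates_only=True))
print(aggregates.read_mean("temp"))
```

The sketch only shows how such policies can be expressed and checked. Guaranteeing that the data user cannot bypass the check is the hard part.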
As one might expect, these requirements are rather challenging to meet. Most probably the solution is to use fully separated trusted computation environments (TCE) on both ends of the communication channel, in which policies can be enforced, potentially with the help of external data services (for example, for anonymization) that are also present in the IDS ecosystem.
For this peer-to-peer ecosystem based on IDSs, the following federation services have been designed:
- Identity and trust. All participants need digital identities to ensure trustworthiness, for which we need a common trust anchor. This is provided by bodies for dynamic attribute provisioning or, in other words, by granting verified credentials.
- Federated Catalogue. To find resources we need a "list" of available datasets and services: a catalogue that stores their metadata. To make it easily searchable and usable by machines as well, we need, at least for now, to constrain this metadata to specific vocabularies and ontologies.
- Sovereign data exchange. The heart of data spaces is the data exchange itself. By design, the most important feature of data sharing in IDSs is data sovereignty. At the same time, to create a market for data, we need some accounting of who used what data and how, and how much they should pay for it. This task is assigned to a clearing house.
- Compliance. Finally, to ensure the fair and seamless operation of the ecosystem, we also need a certification body that issues certifications for the stakeholders and decides in case of disputes.
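The interplay of the first two services can be sketched in a few lines. The example below is a simplified assumption, not the Gaia-X or IDS catalogue API: the class `FederatedCatalogue`, the `Offering` fields and the small controlled vocabulary are all hypothetical. It shows a catalogue that accepts entries only from participants with a verified identity, and only when their metadata stays within an agreed vocabulary, which is what makes machine search straightforward.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# Hypothetical controlled vocabulary; a real data space would rely on agreed ontologies.
ALLOWED_KINDS: Set[str] = {"dataset", "service"}
ALLOWED_DOMAINS: Set[str] = {"agriculture", "automotive", "mobility", "healthcare"}


@dataclass(frozen=True)
class Offering:
    """Metadata describing one dataset or service; the data itself stays with the provider."""
    provider: str
    title: str
    kind: str
    domain: str


class FederatedCatalogue:
    def __init__(self, verified_providers: Set[str]):
        # Stand-in for the identity and trust service: who holds verified credentials.
        self._verified = set(verified_providers)
        self._entries: List[Offering] = []

    def register(self, offering: Offering) -> None:
        if offering.provider not in self._verified:
            raise ValueError(f"{offering.provider} has no verified identity")
        if offering.kind not in ALLOWED_KINDS or offering.domain not in ALLOWED_DOMAINS:
            raise ValueError("metadata uses terms outside the agreed vocabulary")
        self._entries.append(offering)

    def search(self, kind: Optional[str] = None,
               domain: Optional[str] = None) -> List[Offering]:
        """Return entries matching every filter that was given."""
        return [
            o for o in self._entries
            if (kind is None or o.kind == kind)
            and (domain is None or o.domain == domain)
        ]


# Hypothetical participants and offerings.
catalogue = FederatedCatalogue(verified_providers={"acme-farms", "city-transit"})
catalogue.register(Offering("acme-farms", "Soil moisture readings", "dataset", "agriculture"))
catalogue.register(Offering("city-transit", "Route planner API", "service", "mobility"))
print(catalogue.search(domain="mobility"))
```

Because only metadata is registered, the catalogue can be shared widely while access to the underlying data remains subject to each provider's usage policies.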
On top of these services, Gaia-X adds a federation of cloud providers, that is, infrastructure providers who adopt the Gaia-X standards and can host services in the ecosystem in a completely portable way, thus preventing vendor lock-in.
To sum up, Gaia-X can be viewed as a decentralized platform that tries to put users in the center. Everyone can adopt the Gaia-X standard and offer a Gaia-X compliant service. Companies that join the network and provide data make up the bottom layer (Data access). Building on this, IDSs allow search, aggregation and transformation of data based on the principles mentioned above (Platform), while the top layer, the services, whether software or infrastructure services, can be deployed on the level of Gaia-X (Service offerings).