A one-year-long consultation and prototyping phase has resulted in concrete plans for establishing data subsetting capabilities in the Blue-Cloud ecosystem, and for implementing that vision by deploying a series of data lakes at the Blue-Cloud VRE. These data lakes cover data repositories and data collections in support of VRE users and developers of the WorkBenches. After consideration and positive tests on selected BDI data collections, the Blue-Cloud Scientific and Technical Committee (STC) decided at its latest meeting, held on 5-6 March 2024 in Amsterdam, The Netherlands, to embrace the Beacon technology, as developed by MARIS, for deploying data lakes at the Blue-Cloud VRE. These data lakes will arrange data provision for WorkBenches 1 and 2 and also for VLabs.
The analysis and the decisions about the way forward have been documented in Deliverable D2.4 - BDI sub-setting APIs and Data Lakes – Concept and Specifications Report, which was released in early May 2024. It gives details about the envisioned data lake configurations and about the Beacon technology that is being adopted for powering the data lakes. Furthermore, it describes the implementation planning and related actions.
As a follow-up, the Beacon software has been configured and Beacon instances have been deployed, prioritised by the data requirements of WorkBenches 1 (T&S) and 2 (eutrophication). In total, 8 so-called monolithic Beacon instances have been deployed for different data collections and BDIs. The term 'monolithic' indicates that a Beacon instance concerns a single data collection or BDI. After initial configuration at a MARIS server, all 8 instances have been deployed operationally at the Blue-Cloud Virtual Research Environment (VRE). All have been integrated with the D4Science federated AAI service, through which access is arranged. Each Beacon instance has also been provided with a Jupyter notebook that sits on top of the Beacon API and makes it easier for users to interact with the instance. The notebooks already contain several queries, and users can adapt them for their own purposes. A Beacon website (https://beacon.maris.nl/) is available for more information, including detailed documentation (https://maris-development.github.io/beacon/) for end-users on how to work with Beacon.
After considerable testing and feedback rounds on the monolithic Beacon instances by members of the WorkBench teams, two integrated Beacon instances have been initiated as a next step. These merge data set collections from several monolithic Beacon instances. The challenge is to deliver harmonised collections, federated from monolithic instances that have different data models. For that purpose, the WorkBench 1 and 2 teams have formulated a core metadata and data profile, which is populated by extracting and mapping from the monolithic instances. The mapping applies not only to syntax but also to semantics. Harmonisation is applied not only to parameters and units, but also to a core set of metadata that the WorkBench teams have selected as necessary for harmonisation and for their subsequent actions. The merging makes use of the federation capabilities of the Beacon technology, while the semantic mapping makes use of the Semantic Analyser system of NOC-BODC. Any necessary unit conversion rules can be retrieved and applied using factors and offsets as listed.
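To illustrate the unit harmonisation step, the following is a minimal sketch of applying a conversion rule expressed as a factor and an offset. It assumes the common linear convention, converted = value × factor + offset. The rule table, the unit labels and the names `ConversionRule`, `RULES` and `harmonise` are hypothetical and serve illustration only; they are not taken from Beacon or from the Semantic Analyser, and the actual rules would be retrieved from the listed conversion entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConversionRule:
    """Linear unit conversion: converted = value * factor + offset (assumed convention)."""
    factor: float
    offset: float = 0.0

    def apply(self, value: float) -> float:
        return value * self.factor + self.offset


# Hypothetical rule table keyed by (source unit, target unit).
# In practice such rules would be looked up rather than hard-coded.
RULES = {
    ("K", "degC"): ConversionRule(factor=1.0, offset=-273.15),
    ("mg/l", "ug/l"): ConversionRule(factor=1000.0),
}


def harmonise(value: float, source_unit: str, target_unit: str) -> float:
    """Convert a value to the target unit of the harmonised profile."""
    if source_unit == target_unit:
        return value  # already in the harmonised unit
    try:
        rule = RULES[(source_unit, target_unit)]
    except KeyError:
        raise KeyError(
            f"no conversion rule from {source_unit!r} to {target_unit!r}"
        ) from None
    return rule.apply(value)
```

In this sketch, a value whose unit has no matching rule raises an error instead of passing through unchanged, so that unconverted values cannot silently enter a harmonised collection.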
The considerable progress made is documented in this deliverable, D2.5 - Established BDI sub-setting APIs and Data Lakes – Documentation Report. It gives information about the Beacon technology and about the activities undertaken towards the deployment of the 8 monolithic and 2 merged Beacon instances. It also gives an outlook on the next steps, describing how the merged Beacon instances will be used by the WorkBench teams as part of their analytical pipelines. Further activities around Beacon are required to achieve fully operational and robust data provision workflows, on which the WorkBench teams can build their targeted analyses for generating high-quality EOV collections. These activities are specified in this Deliverable and will be finalised in the third project year.