Platform for Data Management

The Know-Center Data Management Platform (KDP) will form the central data access point for the EDIAQI project and will be continually enhanced and expanded as the project advances. The KDP is designed to provide a secure, scalable, and robust system for managing and sharing data. It is built on the iRODS data management system, which provides distributed and replicated storage. Metalnx serves as the graphical user interface for intuitive, user-friendly sharing and searching of data. The project team at Know-Center developed the KDP with the FAIR principles in mind and a strong focus on reproducible research.

Platform Architecture

The design of the platform allows the handling of two different functionalities:

  1. Data Discovery & Management - Handles the organization and storage of data
  2. Data Analytics - Focuses on extracting insights and trends from data

Together, these two functionalities make the KDP a combined data management and analytics platform built from several components. It gives researchers an efficient and secure way to manage and analyse their data while also facilitating collaboration and discovery.

1. Data Discovery & Management

At the heart of the KDP is the iRODS data management system, which provides a unified interface for accessing, managing, and sharing data. iRODS has two server roles: the catalog provider, which hosts the data catalog with information about file names, locations, and permissions, and the catalog consumer, which retrieves metadata about data objects and collections from the catalog provider. Although it consists of multiple servers, the platform appears to users as a single virtual file system.
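The idea of a catalog that hides multiple servers behind one logical namespace can be illustrated with a small toy model. The sketch below is not the iRODS API. The class names, zone name, host names, and paths are all hypothetical and only show how a logical path might map to one or more physical replicas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    server: str         # hypothetical host name of a storage server
    physical_path: str  # hypothetical location on that server


class ToyCatalog:
    """Illustrative stand-in for a data catalog: logical path -> replicas."""

    def __init__(self):
        self._entries = {}

    def register(self, logical_path, replica):
        # Record one more physical copy of the same logical data object.
        self._entries.setdefault(logical_path, []).append(replica)

    def resolve(self, logical_path):
        # Return every replica recorded for a logical path.
        try:
            return list(self._entries[logical_path])
        except KeyError:
            raise FileNotFoundError(logical_path) from None

    def list_collection(self, collection):
        # List the logical paths that sit directly inside a collection.
        prefix = collection.rstrip("/") + "/"
        return sorted(
            path for path in self._entries
            if path.startswith(prefix) and "/" not in path[len(prefix):]
        )


catalog = ToyCatalog()
catalog.register("/kdpZone/home/alice/pm25.csv",
                 Replica("storage-a.example.org", "/vault/a/0001"))
catalog.register("/kdpZone/home/alice/pm25.csv",
                 Replica("storage-b.example.org", "/vault/b/0917"))
catalog.register("/kdpZone/home/alice/co2.csv",
                 Replica("storage-b.example.org", "/vault/b/0918"))

# The user only sees logical paths; the servers stay hidden.
print(catalog.list_collection("/kdpZone/home/alice"))
print([r.server for r in catalog.resolve("/kdpZone/home/alice/pm25.csv")])
```

In this toy model the user works only with paths such as `/kdpZone/home/alice/pm25.csv`, while the catalog decides which server holds each copy.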

To provide a user-friendly interface for data management tasks, the platform uses Metalnx as the graphical user interface. Users can upload, download, and share data, as well as set permissions and manage metadata. Metalnx is highly customizable with the ability to add custom metadata fields and workflows to fit the specific needs of the platform.

Elasticsearch is used for data discovery, enabling users to search and browse data collections based on metadata and other attributes, so that they can quickly find and access the data they need for their research. Authentication is handled by Keycloak to ensure secure access to the KDP. Each data discovery and management component typically uses a PostgreSQL database to store information such as user accounts, metadata on data collections, and configuration.
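A metadata-based search of this kind can be expressed as an Elasticsearch query body. The following is a minimal sketch that only builds the request body. The field names (`description`, `metadata.*`) and the example values are assumptions, since the actual index mapping used by the KDP is not described here.

```python
import json


def build_search_body(text, metadata_filters, size=10):
    """Build an Elasticsearch bool query from free text and exact metadata values.

    The field names are hypothetical; a real index may use different ones.
    """
    # Free-text part: scored full-text match, or match everything if no text is given.
    must = [{"match": {"description": text}}] if text else [{"match_all": {}}]
    # Metadata part: exact-value filters that do not affect relevance scoring.
    filters = [
        {"term": {f"metadata.{key}": value}}
        for key, value in sorted(metadata_filters.items())
    ]
    return {"query": {"bool": {"must": must, "filter": filters}}, "size": size}


body = build_search_body(
    "indoor air quality",
    {"pollutant": "PM2.5", "city": "Graz"},  # hypothetical metadata attributes
)
print(json.dumps(body, indent=2))
```

A body like this would be sent to the `_search` endpoint of the relevant index. Placing metadata conditions under `filter` rather than `must` keeps them as exact yes/no constraints, while the free-text part still determines ranking.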

2. Data Analytics

The analytics part of the platform exposes JupyterHub or Apache Zeppelin (depending on the required programming languages) to support collaboration and data analysis. JupyterHub is a web-based platform that provides a shared environment for running Jupyter notebooks. It allows partners and researchers to analyse and work with shared data, and results and findings can then be shared among team members and beyond. Apache Zeppelin, by contrast, has a stronger focus on data analytics, with built-in support for data sources such as Apache Spark, SQL databases, and NoSQL databases.

EDIAQI partners who prefer to keep data on their own servers can set up their own iRODS zone and use secure federation with the KDP. The iRODS provider of the KDP then accepts the partner server as an additional iRODS consumer. This option enables secure sharing of and access to data across multiple iRODS zones as if they were part of a single virtual file system. Authentication and authorization protocols ensure that data is accessible only to authorized users within their respective zones, so each partner retains control over access and security.
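How zones keep control while sharing one namespace can be sketched with two iRODS naming conventions: a logical path starts with the zone name, and a user from another zone is written as `user#zone`. The code below is an illustrative toy, not the iRODS permission model. The zone names, user names, and the in-memory access list are hypothetical.

```python
LOCAL_ZONE = "kdpZone"  # hypothetical name for the KDP zone


def zone_of(logical_path):
    """Return the zone name, which is the first component of a logical path."""
    if not logical_path.startswith("/"):
        raise ValueError(f"not an absolute logical path: {logical_path!r}")
    parts = logical_path.split("/")
    if not parts[1]:
        raise ValueError(f"no zone in path: {logical_path!r}")
    return parts[1]


def is_remote(logical_path, local_zone=LOCAL_ZONE):
    """True if the path belongs to a federated zone rather than the local one."""
    return zone_of(logical_path) != local_zone


def qualified_user(user, zone):
    """Zone-qualified user name in the user#zone form."""
    return f"{user}#{zone}"


# Hypothetical access list kept by the zone that owns the data.
access = {
    "/partnerZone/home/bob/sensors.csv": {
        qualified_user("bob", "partnerZone"),
        qualified_user("alice", "kdpZone"),  # a KDP user granted access by the partner
    },
}


def can_read(user, user_zone, logical_path):
    """Check the owning zone's access list for the zone-qualified user."""
    return qualified_user(user, user_zone) in access.get(logical_path, set())


path = "/partnerZone/home/bob/sensors.csv"
print(zone_of(path), is_remote(path))
print(can_read("alice", "kdpZone", path), can_read("carol", "kdpZone", path))
```

In this sketch the partner zone decides which zone-qualified users appear in its access list, which mirrors the point that each zone maintains control over its own data.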