Generic Metadata Framework – how to use it in a project?
Creating a cloud solution for data analysis is always a time-consuming and complex process, whether it is based on a modern data warehouse or a data lakehouse. The implementation of such projects can be described with a simple acronym: ISASA.
ISASA defines the steps that must be taken to build a solution. These steps include:
I – Ingest – data collection
S – Store – data storage
A – Analyse – data analysis
S – Surface – presentation of the prepared data
A – Act – 4xM Make Me More Money
The first four steps are assigned to the team that builds the solution, while the final step – Act – is the Client’s responsibility. Based on the results of data analysis (e.g. reports, dashboards), the Client makes decisions and carries out tasks aimed at improving business operations. The first four steps cover both strictly technical tasks that require no domain knowledge and tasks that cannot be performed without domain knowledge, such as model building.
The Generic Metadata Framework – why does it work?
The Generic Metadata Framework automates tasks that recur in every data analysis project, such as data collection and data lake creation. Thanks to this, the creator of a solution can focus on the essence of a given problem, that is, on building a model that corresponds to the pre-defined business needs.
The Generic Metadata Framework helps the solution creator concentrate on the business aspects of the solution. Moreover, it simplifies and automates the following (a metadata sketch follows the list):
- processes of data loading (supporting both full and incremental data loading),
- building data lakes (defining the structure, data partitioning),
- initial data processing (transformation of input data),
- building delta lakes (defining the structure, data partitioning),
- creating data warehouses (defining the model in the views).
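To give a flavour of the metadata-driven approach, the sketch below shows how a single source table might be described for the loading process. The field names are purely illustrative assumptions, not the framework’s actual metadata schema.

```python
# Hypothetical metadata entry describing one source table for the framework.
# All field names and values are illustrative only.
source_table_metadata = {
    "source_system": "erp_sqlserver",       # logical name of the source
    "schema": "dbo",
    "table": "SalesOrders",
    "load_mode": "incremental",              # "full" or "incremental"
    "incremental_column": "ModifiedDate",    # timestamp / change-tracking column
    "partition_columns": ["load_year"],      # how data is partitioned in the lake
    "target_zone": "raw",                    # landing zone in the data lake
}
```

Entries like this one are what allow the same generic pipelines to be reused for every table, instead of hand-coding a separate pipeline per source.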
Main characteristics:
- Flexibility at the level of architecture – creating data analysis solutions based on modern data warehouses and data lakehouses.
- Automation of recurring tasks, such as data collection.
- Complete architecture that enables the creation of scalable solutions, comprising security, monitoring, data governance, etc.
- Built from components available in Azure.
- Flexibility at the level of access to data – easy integration with Power BI.
- Extensibility – the framework can be easily extended, for instance, by supporting new types of data sources.
The diagram below presents the areas covered by the framework:
In the first step, data is collected from data sources, including on-premises environments, and saved in a data lake (Azure Data Lake Storage Gen2) in its native format. The data sources supported at this point are based on SQL Server, with mechanisms that allow incremental data loading by means of change tracking and timestamps. The configuration itself is built from metadata collected from the source system.
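A minimal sketch of such a timestamp-based incremental load is shown below. It is not the framework’s actual implementation; the connection details, column names and lake paths are assumptions, and the watermark would in practice come from the metadata store rather than a literal.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ingest-sketch").getOrCreate()

# In the framework this value would be read from the metadata store.
last_watermark = "2023-01-01 00:00:00"

# Pull only rows changed since the last load (timestamp-based increment).
query = f"(SELECT * FROM dbo.SalesOrders WHERE ModifiedDate > '{last_watermark}') AS src"

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://onprem-sql;databaseName=erp")
    .option("dbtable", query)
    .option("user", "<user>")
    .option("password", "<secret>")
    .load()
)

# Land the increment in the raw zone of the data lake, partitioned by year.
(
    df.withColumn("load_year", F.year("ModifiedDate"))
    .write.mode("append")
    .partitionBy("load_year")
    .parquet("abfss://raw@datalake.dfs.core.windows.net/erp/SalesOrders")
)
```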
In the next step, the collected data is preprocessed: the history of data changes is built (SCD Type 2) and saved in the data lake in the Delta format. The process runs on Spark in Azure Databricks, based on the configuration and metadata prepared in the previous step.
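The sketch below illustrates the general idea of building SCD Type 2 history with a Delta Lake merge. It is a deliberately simplified assumption of how such a step could look, not the framework’s actual logic; table paths, key and column names are hypothetical, and `spark` is the session provided by Databricks.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

curated_path = "abfss://curated@datalake.dfs.core.windows.net/erp/SalesOrders"

# The increment landed in the raw zone by the previous step (hypothetical path).
updates = spark.read.parquet("abfss://raw@datalake.dfs.core.windows.net/erp/SalesOrders")

target = DeltaTable.forPath(spark, curated_path)

# Close the currently valid version of every row that has changed.
(
    target.alias("t")
    .merge(updates.alias("s"), "t.OrderID = s.OrderID AND t.is_current = true")
    .whenMatchedUpdate(set={
        "is_current": F.lit(False),
        "valid_to": F.current_timestamp(),
    })
    .execute()
)

# Append the new versions as the current rows.
(
    updates
    .withColumn("is_current", F.lit(True))
    .withColumn("valid_from", F.current_timestamp())
    .withColumn("valid_to", F.lit(None).cast("timestamp"))
    .write.format("delta").mode("append")
    .save(curated_path)
)
```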
The following step is model building. The data is sourced from the tables in the so-called curated zone, while the model itself is saved as views on Spark together with additional configuration, which defines, for example, how the model is to be fed and whether SCD Type 1 or SCD Type 2 is to be used.
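As an illustration, a model entity can be exposed as a Spark SQL view over the curated tables. The database, table and view names below are assumptions for the sake of the example, and the curated tables are assumed to be registered in the metastore.

```python
# Hedged sketch: one dimension of the model exposed as a view
# over the SCD 2 history built in the previous step.
spark.sql("""
    CREATE OR REPLACE VIEW gold.dim_customer AS
    SELECT CustomerID,
           CustomerName,
           valid_from,
           valid_to,
           is_current
    FROM curated.customers
""")
```

Keeping the model as views means the model definition stays declarative and versionable, while the feeding strategy (SCD 1 vs SCD 2) remains a matter of configuration.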
The diagram below shows the data flow in a solution based on the Generic Metadata Framework.
The concept of zones used in the Generic Metadata Framework
The concept of zones used in the Generic Metadata Framework is very similar to the approach promoted by Databricks (the bronze, silver and gold zones of the medallion architecture). Thanks to the division into zones, data can be separated not only at the logical level but also at the physical level (dedicated containers in the data lake). It also precisely defines both the input and the output of each stage of data processing. Data from each zone can be accessed through Azure Synapse Serverless.
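For example, data in the curated zone can be queried from a Synapse Serverless SQL endpoint. The sketch below shows one way to do that from Python; the workspace name, storage account and folder path are assumptions, not the framework’s defaults.

```python
import pyodbc

# Illustrative connection to a Synapse Serverless SQL endpoint.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;"
)

# Query the Delta table in the curated zone directly from the lake.
sql = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://datalake.dfs.core.windows.net/curated/erp/SalesOrders/',
    FORMAT = 'DELTA'
) AS rows;
"""

for row in conn.execute(sql):
    print(row)
```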
The Generic Metadata Framework is built of four modules:
- Data Loader – responsible for loading data from data sources and saving data in a data lake.
- Data Preprocessor – responsible for initial data processing: building the delta lake.
- Data Lakehouse – responsible for creating and feeding the data model.
- Synapse Integrator – responsible for transferring data from the data lakehouse to Azure Synapse Dedicated Pool.
Module 3 can work independently of the other modules, while Module 4 is optional: in other words, it is possible to build a solution with the Generic Metadata Framework that does not use an Azure Synapse Dedicated Pool. The Generic Metadata Framework also supports building solutions based on the data mesh approach by providing a self-serve platform.
Summary
The Generic Metadata Framework provides the infrastructure necessary to build solutions and automates recurring processes, such as data collection, which considerably reduces the duration of a migration project.
It guarantees great flexibility in terms of access to data as well as scalability.