The MBI Framework: A Case Study in Metadata-Driven Data Warehouse Automation (circa 2008)

1. Abstract

This paper details the MBI (Meta Base Integration) framework, an internally developed, metadata-driven system implemented around 2008 to address challenges in enterprise data warehousing. Faced with the need for faster development cycles, improved consistency, and enhanced maintainability, the MBI framework provided a comprehensive solution for automating the generation and management of a multi-layered data warehouse architecture. Key features include a PL/SQL-based generation engine utilizing standardized templates managed through a layered API (Definitions, Control, Execution, Utilities), a structured task execution framework, and the automated creation of both Data Vault (CDV, CDU) and presentation (PKB) layers. This system represents a significant, early example of applying Data Warehouse Automation (DWA) principles to achieve "conformed, controlled, condensed code" across the data warehouse lifecycle.

2. Introduction

The mid-to-late 2000s presented significant challenges for enterprise data warehousing. Traditional development approaches often resulted in lengthy project timelines, high development and maintenance costs, difficulties in enforcing standards, and challenges ensuring consistency across complex environments. Extract, Transform, Load (ETL) processes were frequently complex, resource-intensive, and difficult to adapt to evolving business requirements. Concepts like Data Warehouse Automation (DWA) and metadata-driven development were emerging as potential solutions, promising accelerated delivery, improved quality, and greater agility through the use of reusable patterns and automated code generation.

Against this backdrop, the MBI (Meta Base Integration) framework was developed internally around 2008 as a strategic solution to these challenges. The fundamental premise of the MBI API was "to enable people to make available code in a conformed and controlled manner," aiming to reduce duplication, increase conformance to standards, and enhance the ability to improve code over time. It provided a comprehensive, metadata-driven platform built primarily using Oracle PL/SQL to automate the creation, deployment, and management of data warehouse components across a multi-layered architecture.

This paper presents a detailed architectural overview of the MBI system, examining its core framework components, its implementation of a layered architecture including Data Vault and presentation layers, and analyzing the features that positioned it as an advanced system for its time. We will explore the design principles, the metadata-driven generation engine, the task execution framework, and the resulting data warehouse structure based on technical documentation and code artifacts from the system.

3. MBI System Architecture

[Figure: MBI system architecture and process flow diagram]


The MBI system implements a multi-layered data warehousing architecture designed to systematically ingest, integrate, conform, and present data for analytical and export purposes. The overall process flow, depicted in the architecture diagram, reveals distinct stages connected by standardized "Generic loader" processes, indicative of the MBI framework's automated data movement capabilities.

3.1 Staging Layer (Owner_oxt, OWK) At the base of the architecture lies the Staging layer. While specific details of Owner_oxt and OWK are limited in the provided documents, their position in the flow diagram suggests they handle the initial ingestion of data from various source systems (which, judging by the CDV/CDU view definitions, potentially include SAP, SBL, SMS, SYL, VIS, DTL, and EAN). This layer likely performs minimal transformations, focusing on capturing raw data before it enters the main integration processes. Components like the Sig loader might represent specific ingestion mechanisms.

3.2 Conformed Data Vault Layer (Owner_cdv) Data moves from the Staging layer into the Conformed Data Vault (CDV), implemented within the Owner_cdv schema. This layer serves as the primary integration point, designed using Data Vault modeling principles. Its key characteristics, evident from the schema definition, include:

  • Hubs (_H tables): Storing unique business keys hashed for consistency (e.g., CDV_CUSTOMER_SMS_H, CDV_ADDRESS_SMS_H, CDV_CONTRACT_SMS_H).
  • Links (_L tables): Capturing relationships between business keys (Hubs) over time (e.g., CDV_CUSTOMER_ADDRESS_SMS_L, CDV_CONTRACT_PERIODAMMNT_SMS_L).
  • Satellites (_S tables): Holding descriptive attributes associated with Hubs or Links, tracking historical changes (e.g., CDV_CUSTOMER_SMS_S, CDV_ADDRESS_SMS_S). Each satellite typically includes hashed attribute values (M_HASH_DETAILS) and validity periods (M_START_DATE, M_END_DATE).

The diagram notes that the CDV acts as a "staging and repository centre for external links," based on a "flexible architecture matched closely to the incoming files," with "no complex transformation or logic applied," which aligns with Data Vault principles for this layer. Data population is managed via generic loading processes.
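The hashing of business keys and satellite attributes described above can be sketched as follows. This is an illustrative Python simulation, not the framework's PL/SQL code: the choice of MD5, the delimiter, and the normalization rules are assumptions, since the source documents do not specify the hash function.

```python
import hashlib

def hash_business_key(*key_parts, delimiter="|"):
    """Derive a deterministic hub hash key from one or more business-key parts.
    Parts are normalized (trimmed, upper-cased) so the same business key always
    yields the same hub key regardless of source formatting. (Normalization
    rules are assumed for illustration.)"""
    normalized = delimiter.join(str(p).strip().upper() for p in key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest().upper()

def hash_details(attributes, delimiter="|"):
    """Compute an M_HASH_DETAILS-style digest over a satellite's descriptive
    attributes; comparing digests detects change without comparing every column."""
    normalized = delimiter.join("" if v is None else str(v).strip() for v in attributes)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest().upper()

# A changed attribute produces a different digest, signalling a new satellite row.
old = hash_details(["Main Street 1", "1234 AB", "Amsterdam"])
new = hash_details(["Main Street 2", "1234 AB", "Amsterdam"])
assert old != new
```

In a load process, a row whose `hash_details` digest matches the current satellite row would be skipped, while a differing digest would close the current row (M_END_DATE) and insert a new one.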

3.3 Conformed Data Universe Layer (Owner_cdu) Following the CDV, data is further processed into the Conformed Data Universe (CDU) layer, represented by the Owner_cdu schema. This layer also adheres to Data Vault principles but likely involves applying more business rules and conforming data across different source representations present in the CDV. The architecture diagram labels this as the domain of "CONFORMED PROCESSES" and describes the CDU as holding a "Conformed set of entities populated from the CDV and internal sources". Like the CDV, it comprises:

  • Hubs (_H tables): Representing core business entities (e.g., CDU_CUSTOMER_H, CDU_ADDRESS_H).
  • Links (_L tables): Defining conformed relationships (e.g., CDU_CUSTOMER_ADDRESS_L, CDU_ADDRESS_LOCATION_L).
  • Satellites (_S tables): Containing the conformed descriptive attributes with history (e.g., CDU_CUSTOMER_S, CDU_ADDRESS_S).

As in the CDV, generic loaders handle the data flow from CDV to CDU.

 

3.4 Conformed Processes Layer (AKB, SKB) Moving downstream from the CDU, the process flow diagram depicts layers labeled AKB and SKB under the heading "CONFORMED PROCESSES". While specific schema definitions for these layers were not provided in the reviewed materials, their position suggests they serve as intermediate, potentially subject-oriented data marts or further refined analytical layers. They consume data from Owner_cdu via generic loaders, implying they structure the conformed Data Vault information into formats more readily consumable for specific business intelligence or reporting needs before data reaches the final presentation layer.

3.5 Presentation / Export Layer (Owner_pkb) The Owner_pkb schema represents the primary presentation layer, as shown in the "EXPORT PROCESSES" section of the flow diagram. Based on the schema definition and its position in the architecture, this layer transforms data sourced from the upstream layers (specifically shown flowing from SKB) into structures optimized for end-user querying, reporting, and data export. Key aspects include:

  • Dimensional / OBT Structures: Tables like PKB_CUSTOMER, PKB_ADDRESS, PKB_CONTRACT, etc., represent core business entities, likely serving as dimensions or forming part of a One Big Table (OBT) structure. Relationship tables (e.g., PKB_CUSTOMER_LOCATION, PKB_CONTRACT_LOCATION) connect these entities, often centered around a PKB_LOCATION table, facilitating analysis.
  • Data Derivation: Views within PKB (e.g., PKB_CU_V, PKB_CO_V, PKB_ADLO_V) clearly demonstrate the transformation logic applied to data sourced from CDU views (PKB_CDU_..._V) to populate the PKB tables.
  • MBI Management: PKB tables contain the standard MBI metadata columns (M_SOURCE, M_RUN_ID_CREATED, etc.), indicating generation and management by the MBI framework. Supporting packages like pkb_utils and pkb_gather_stats further confirm this.
  • External Interaction (KBD): A notable feature of this layer is its interaction with a remote database, evidenced by numerous synonyms pointing to REMOTE_KBD@PKB_KBD and a dedicated package for remote index management (pkb_kbd_index_management). This suggests PKB serves as a source for, or interacts with, an external customer-focused system (Klantbeeld Database).

3.6 Export and Cache Finally, the process flow diagram shows data moving from Owner_pkb through an "Export process" and potentially into a "Data cache". This represents the final delivery mechanism, where data structured in PKB is extracted for downstream applications, external partners, or perhaps loaded into a cache for performance optimization of specific reporting tools. The pkb_dl_sylivan_bic package provides an example of such an export process, writing data to files.

3.7 Architectural Summary In summary, the MBI system architecture represents a structured, multi-stage data flow: from initial staging (OXT/OWK), through integrated Data Vault layers (CDV, CDU) providing a historical and auditable core, into potentially subject-oriented conformed layers (AKB/SKB), and finally culminating in a presentation layer (PKB) optimized for consumption and export. The consistent use of generic loaders between layers highlights the automated, framework-driven nature of the data integration process managed by the MBI API.

4. The MBI API Framework (OWNER_MBI)

The core engine driving the automation and standardization within the architecture is the MBI API framework, implemented primarily within the OWNER_MBI schema. This framework embodies the principles of "Do it once, Do it in one place, Conform to standards, Make it available, Make it improvable, Enable control". Its fundamental purpose is to provide a controlled and consistent method for generating and executing code, particularly DDL and potentially other script types, based on predefined, parameterized templates and metadata. The API is structured into distinct functional layers, ensuring separation of concerns and promoting reusability.

4.1 Definitions Layer (MBI_PKG_API_DEFN_XXX) This layer serves as the repository for code templates or "frameworks". Instead of storing templates in tables, the definitions are implemented as functions within dedicated PL/SQL packages (e.g., MBI_PKG_API_DEFN_DDL, MBI_PKG_API_DEFN_DD, MBI_PKG_API_DEFN_MTV). Each function (e.g., MBI_FTN_API_DEFN_INDEX, MBI_FTN_API_DEFN_PK_CONSTRAINT) returns a CLOB containing the parameterized code string. Parameters within the templates are denoted using a consistent syntax, such as <<PARAMETER_NAME>>. Creating new definitions involves taking a standard script, identifying variable elements, and replacing them with these placeholders.

4.2 Control Layer (MBI_PKG_API_CONTROL) The Control layer acts as the engine for processing the definition templates. Its primary function, encapsulated within MBI_FTN_API_CONTROL, is to take a requested definition name (p_defn) and a collection of parameter names and values (name_val_tab of type mbi_types.param_tt), retrieve the corresponding template, and substitute the placeholders with the provided values. This involves sub-functions:

  • MBI_FTN_API_READ: Dynamically executes a call to the appropriate definition function (identified by p_defn) to retrieve the template CLOB.
  • MBI_FTN_API_REPLACE: Iterates through the provided parameter collection (name_val_tab) and replaces each <<parameter_name>> placeholder in the template CLOB with its corresponding value. The result of the Control layer is a fully formed, executable code string (or potentially multiple strings if handled by Utilities).
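The read-and-replace mechanism can be illustrated with a small sketch. The template text and parameter names below are hypothetical, but the <<PARAMETER_NAME>> placeholder syntax matches the convention described above; the real implementation works on CLOBs in PL/SQL rather than Python strings.

```python
# Hypothetical definition function: returns a parameterized DDL template,
# mirroring how MBI_FTN_API_DEFN_INDEX returns a CLOB with <<...>> placeholders.
def defn_index():
    return ("CREATE INDEX <<INDEX_NAME>> ON <<OWNER>>.<<TABLE_NAME>> "
            "(<<COLUMN_LIST>>) TABLESPACE <<TABLESPACE>>")

def api_replace(template, name_val_tab):
    """Substitute each <<name>> placeholder with its value, in the spirit of
    MBI_FTN_API_REPLACE iterating over the name/value collection."""
    for name, value in name_val_tab.items():
        template = template.replace(f"<<{name}>>", value)
    return template

script = api_replace(defn_index(), {
    "INDEX_NAME": "CDV_CUSTOMER_SMS_H_PK",   # example values only
    "OWNER": "OWNER_CDV",
    "TABLE_NAME": "CDV_CUSTOMER_SMS_H",
    "COLUMN_LIST": "M_HASH_KEY",
    "TABLESPACE": "TS_CDV_IDX",
})
```

The result is a fully formed DDL statement with no remaining placeholders, ready to hand to the Execution layer.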

4.3 Execution Layer (MBI_PKG_API_EXECUTE, MBI_API) This layer is responsible for invoking the code generation process and executing the resulting scripts.

  • MBI_PRC_API_REQUEST: This is the primary callable procedure for initiating code generation and execution. It accepts the definition name and parameters. Internally, it often calls MBI_PKG_API_CONTROL.MBI_PRC_API_ROUTE which, based on the definition name, might invoke a higher-level Utility function or directly call MBI_FTN_API_CONTROL to generate the script(s).
  • MBI_PRC_API_EXECUTE: This procedure takes the generated script(s) (as a collection of mbi_stats_ot objects) and executes them using EXECUTE IMMEDIATE. Crucially, it respects a debug parameter (p_exec_debug) which allows the generated code to be outputted (e.g., via dbms_output) instead of executed. Execution is performed with the privileges of the calling user, and only scripts generated via the defined templates and control mechanisms can be executed, providing a layer of security. The simplified MBI_API.REQUEST procedure likely acts as a direct wrapper around MBI_PKG_API_EXECUTE.MBI_PRC_API_REQUEST.
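The execute-or-debug behaviour of MBI_PRC_API_EXECUTE can be sketched as follows. The executor callback stands in for EXECUTE IMMEDIATE and is an assumption of this sketch, not the framework's actual interface.

```python
def api_execute(scripts, exec_debug=False, executor=print):
    """Execute each generated script, or just output it when the debug flag
    is set -- mirroring the p_exec_debug switch described for
    MBI_PRC_API_EXECUTE."""
    results = []
    for script in scripts:
        if exec_debug:
            print(script)             # stand-in for dbms_output
            results.append("DEBUG")
        else:
            executor(script)          # stand-in for EXECUTE IMMEDIATE
            results.append("EXECUTED")
    return results
```

Running with exec_debug=True lets a developer inspect the generated DDL before anything touches the database, which is the review-before-run workflow the debug parameter enables.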

4.4 Utilities Layer (MBI_PKG_API_UTILS_XXX) While the Definition, Control, and Execution layers provide the core generation mechanism, the Utilities layer offers a crucial level of abstraction and encapsulates higher-level logic and standards. Implemented as procedures within specific utility packages (e.g., MBI_PKG_API_UTILS_DDL, MBI_PKG_API_UTILS_MTV, MBI_PKG_API_UTILS_DD), these utilities serve several key purposes:

  • Parameter Derivation: Utilities often take minimal input parameters and derive others required by the underlying definitions. This is frequently achieved by querying metadata from the Oracle data dictionary (e.g., using cursors over all_tab_comments to find a table alias, or all_tab_columns to get key attributes) or applying standard naming conventions (e.g., deriving tablespace or constraint names based on the owner and table alias). This ensures consistency and reduces the burden on the calling process.
  • Orchestration: They manage multi-step processes that require invoking multiple definitions in sequence. The example of creating a primary key (MBI_PRC_API_CRT_PK) illustrates this well, where the utility first calls the definition to create the unique index and then calls the definition to create the primary key constraint using that index.
  • Encapsulation of Standards: Complex organizational standards (e.g., for naming conventions, tablespace allocation, required indexes for specific table types) are embedded within the utility procedures, ensuring all generated objects conform without requiring developers to remember every detail.
  • Bulk Operations: Utilities often incorporate cursors over the data dictionary to apply standardized actions across multiple database objects that fit a certain pattern (e.g., creating primary keys for all tables matching a specific naming convention like XXX_D% within a schema).
  • Complex Task Management: Extended utilities can orchestrate highly complex tasks involving numerous steps and dependencies, such as the mbi_prc_api_crt_tech_indexes procedure which aims to correctly index and constrain an entire star schema based on metadata analysis.
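The two-step primary-key utility described above can be sketched as follows. The metadata lookup is mocked with a dictionary (a real utility would query all_tab_comments and all_tab_columns), and the naming conventions and definition names are illustrative assumptions rather than the documented interface.

```python
# Mocked data-dictionary metadata; a real utility would query
# all_tab_comments / all_tab_columns instead.
METADATA = {
    ("OWNER_CDV", "CDV_CUSTOMER_SMS_H"): {"alias": "CUSMS", "key_cols": ["M_HASH_KEY"]},
}

def crt_pk(owner, table_name, request):
    """Orchestrate PK creation as MBI_PRC_API_CRT_PK is described to:
    derive names from metadata, create the unique index, then the constraint."""
    meta = METADATA[(owner, table_name)]
    index_name = f"{meta['alias']}_PK_I"        # assumed naming convention
    constraint_name = f"{meta['alias']}_PK"     # assumed naming convention
    cols = ", ".join(meta["key_cols"])
    # Step 1: request the unique-index definition
    request("UNIQUE_INDEX", {"OWNER": owner, "TABLE_NAME": table_name,
                             "INDEX_NAME": index_name, "COLUMN_LIST": cols})
    # Step 2: request the primary-key constraint using that index
    request("PK_CONSTRAINT", {"OWNER": owner, "TABLE_NAME": table_name,
                              "CONSTRAINT_NAME": constraint_name,
                              "INDEX_NAME": index_name})

calls = []
crt_pk("OWNER_CDV", "CDV_CUSTOMER_SMS_H",
       lambda defn, params: calls.append(defn))
```

The caller supplies only the owner and table name; everything else is derived, which is exactly the burden-reduction the Utilities layer provides.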

4.5 Overall API Flow The typical flow when using the MBI API involves a calling process (e.g., a deployment script, a scheduled task) invoking either a specific Utility procedure (MBI_PKG_API_UTILS_XXX.ProcedureName) or directly calling the Execution layer's request procedure (MBI_PKG_API_EXECUTE.MBI_PRC_API_REQUEST) with a definition name and necessary parameters.

  1. If a Utility is called, it may perform metadata lookups, derive additional parameters, and make one or more calls to MBI_PKG_API_EXECUTE.MBI_PRC_API_REQUEST.
  2. MBI_PRC_API_REQUEST (directly or via the Utility) uses MBI_PKG_API_CONTROL.MBI_PRC_API_ROUTE to determine the correct definition(s) or further utility actions required.
  3. The Control layer (MBI_FTN_API_CONTROL) retrieves the template(s) (MBI_FTN_API_READ) and substitutes parameters (MBI_FTN_API_REPLACE).
  4. The resulting executable script(s) are passed back to the Execution layer (MBI_PRC_API_EXECUTE).
  5. MBI_PRC_API_EXECUTE either executes the script(s) using EXECUTE IMMEDIATE or outputs them, based on the debug flag.

This layered, metadata-driven approach provides a powerful, flexible, and controlled mechanism for automating code generation and management within the data warehouse environment.

5. Implementation Aspects

The MBI framework and the associated data warehouse architecture were implemented primarily using Oracle database technologies, with PL/SQL forming the backbone of the generation engine, task execution, and loading processes. Several key implementation choices characterize the system:

5.1 Technology Stack: The core logic resides within Oracle PL/SQL packages stored within the OWNER_MBI schema. This includes the API framework (Definitions, Control, Execution, Utilities), the task control and logging mechanisms, and the standardized data loaders. The data warehouse layers (CDV, CDU, PKB) are implemented as standard Oracle schemas containing tables, views, sequences, and other objects generated or managed by the framework.

5.2 Metadata Utilization: A cornerstone of the MBI framework is its extensive use of metadata. Rather than relying solely on parameters passed during execution, the Utility packages (MBI_PKG_API_UTILS_XXX) actively query the Oracle data dictionary to derive necessary information and enforce standards. Examples include:

  • Using all_tab_comments to retrieve predefined table aliases for constructing standard object names.
  • Querying all_tab_columns to identify primary key attributes or columns matching specific patterns (e.g., '%_ID') for indexing or constraint creation.
  • Checking all_indexes to determine the next available index number according to naming conventions.

The implementation favors explicit cursors defined within the PL/SQL packages over predefined views for accessing this metadata, potentially for reasons of performance or code obfuscation. This metadata-driven approach allows utilities to operate with minimal explicit input, adapting dynamically based on the existing database structure and predefined standards.
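Such convention-driven derivation, for example finding the next free index number for a table alias, might look like the sketch below. The <ALIAS>_I<NN> suffix convention is assumed purely for illustration; in the framework this scan would be a cursor over all_indexes.

```python
import re

def next_index_name(alias, existing_indexes):
    """Derive the next index name for a table alias under an assumed
    <ALIAS>_I<NN> convention, as a utility would after scanning all_indexes."""
    pattern = re.compile(rf"^{re.escape(alias)}_I(\d+)$")
    numbers = [int(m.group(1)) for idx in existing_indexes
               if (m := pattern.match(idx))]
    return f"{alias}_I{max(numbers, default=0) + 1:02d}"
```

Embedding the convention in one function means every generated index follows it, with no opportunity for a developer to improvise a name.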

5.3 Task Execution Framework: The MBI system includes a robust task execution framework built around several repository tables and control packages within OWNER_MBI:

  • Repository: MBI_PROGRAMMA defines executable PL/SQL procedures, MBI_TAAK defines specific instances or configurations of these programs, and MBI_TAAK_PARAMETER / MBI_SYSTEEM_PARAMETER store default and system-wide parameters.
  • Execution Control: mbi_taak_control.start_taak initiates a task run, creating a record in MBI_TAAK_RUN, performing pre-run checks (e.g., checking for existing active runs, start-after-error flags), dumping parameters to MBI_TAAK_RUN_PARAMETER, executing the target program defined in MBI_PROGRAMMA, and logging the final status. Batch execution (start_batch_serial) is also supported.
  • Parameter Handling: The mbi_taak_parameters package provides functions (get_parameter_waarde, get_parameter_waarden) to retrieve parameters for a specific task run, respecting the precedence of run-time vs. task-default vs. system-default values.
  • Logging: mbi_taak_logging offers procedures (log_melding, log_error, log_debug_melding) to record execution progress, warnings, and errors into MBI_TAAK_LOG and MBI_ERROR_LOG, including detailed stack traces in debug mode.
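The run-time over task-default over system-default precedence that get_parameter_waarde applies can be sketched as follows; the parameter names used in the example are hypothetical.

```python
def get_parameter_value(name, run_params, task_params, system_params):
    """Resolve a task parameter with the precedence described above:
    run-time value first, then the task default, then the system default."""
    for scope in (run_params, task_params, system_params):
        if name in scope:
            return scope[name]
    raise KeyError(f"parameter {name!r} not defined at any level")

# Example scopes, from MBI_SYSTEEM_PARAMETER down to a single run's overrides.
system = {"COMMIT_SIZE": "10000", "DEBUG": "N"}
task = {"COMMIT_SIZE": "5000"}
run = {"DEBUG": "Y"}
```

A task thus inherits sensible system-wide defaults while any individual run can override exactly the values it needs.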

5.4 Standardized Data Loading (mbi_loader): The framework appears to include a dedicated package, mbi_loader, designed to handle standardized data loading patterns between layers, likely powering the "Generic loader" components shown in the architecture diagram. This package dynamically constructs SQL statements (INSERT, MERGE, DELETE) based on task parameters defining the source (SRC_), target (TGT_), load method (LOAD_METHOD), and specific column handling rules (e.g., AUDIT_COLS, DATE_COLS, MATCH_COLS, DEL_COLUMNS). It leverages helper functions to determine column lists and build appropriate clauses, ensuring consistency in how data is moved and integrated.
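A sketch of how such a generic loader might assemble a statement from its task parameters follows. The parameter names echo the SRC_/TGT_/LOAD_METHOD/MATCH_COLS scheme described above, but the exact SQL shape and helper logic are assumptions of this illustration, not mbi_loader's actual output.

```python
def build_load_statement(params, src_columns):
    """Assemble an INSERT or MERGE skeleton from loader task parameters,
    in the spirit of mbi_loader's dynamic SQL construction."""
    src, tgt = params["SRC_TABLE"], params["TGT_TABLE"]
    cols = ", ".join(src_columns)
    if params["LOAD_METHOD"] == "INSERT":
        return f"INSERT INTO {tgt} ({cols}) SELECT {cols} FROM {src}"
    if params["LOAD_METHOD"] == "MERGE":
        match = " AND ".join(f"t.{c} = s.{c}" for c in params["MATCH_COLS"])
        sets = ", ".join(f"t.{c} = s.{c}" for c in src_columns
                         if c not in params["MATCH_COLS"])
        return (f"MERGE INTO {tgt} t USING {src} s ON ({match}) "
                f"WHEN MATCHED THEN UPDATE SET {sets} "
                f"WHEN NOT MATCHED THEN INSERT ({cols}) "
                f"VALUES ({', '.join('s.' + c for c in src_columns)})")
    raise ValueError(f"unsupported LOAD_METHOD {params['LOAD_METHOD']!r}")

stmt = build_load_statement(
    {"SRC_TABLE": "OWNER_CDV.CDV_CUSTOMER_SMS_S",
     "TGT_TABLE": "OWNER_CDU.CDU_CUSTOMER_S",
     "LOAD_METHOD": "MERGE", "MATCH_COLS": ["M_HASH_KEY"]},
    ["M_HASH_KEY", "M_HASH_DETAILS", "M_START_DATE"])
```

Because every inter-layer load is generated from the same parameter scheme, the framework guarantees that data movement between CDV, CDU, and PKB follows one consistent pattern.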

5.5 Extensibility: The framework was designed with extensibility in mind. Adding support for generating new types of objects primarily involves creating a new parameterized template function in the appropriate DEFN package and, typically, a corresponding procedure in a UTILS package to handle parameter derivation and orchestration logic, following the examples provided in the documentation.

6. Advanced Features (Contextualized for 2008)

Evaluating the MBI framework and its associated data warehouse architecture within the context of data warehousing practices prevalent around 2008 reveals several features that can be considered advanced for their time. While concepts like Data Warehouse Automation (DWA), metadata-driven ETL, and alternative modeling techniques like Data Vault existed, their comprehensive and integrated implementation within a custom-built framework like MBI appears sophisticated.

6.1 Comprehensive Metadata-Driven Generation: The MBI framework's reliance on metadata was not merely for parameterization but extended to the automated generation of database objects and process logic across multiple architectural layers.

  • DDL Automation: The system automatically generated DDL for tables, primary key indexes, constraints, and other indexes based on standards embedded in Utility packages and driven by metadata lookups. This level of automation reduced manual effort and enforced consistency, a key goal of DWA.
  • Multi-Layer Generation: Significantly, the framework managed generation not just for staging or a single target model, but across the distinct Data Vault (CDV, CDU) and Presentation (PKB) layers. Dynamically generating potentially different presentation models (star schemas, OBTs) in PKB from the upstream vault represented a particularly advanced capability, moving beyond simple ETL automation.
  • Metadata Integration: Unlike systems where metadata might reside in disparate locations, the MBI framework leveraged the Oracle data dictionary extensively and integrated parameter definitions within its task repository (MBI_TAAK_PARAMETER, MBI_SYSTEEM_PARAMETER), creating a more unified metadata environment to drive processes. While integrated metadata systems were recognized as important, achieving this in a custom framework was notable.

6.2 Structured PL/SQL Generation Framework: Building a comprehensive DWA framework entirely within Oracle PL/SQL was a distinct implementation choice. While commercial ETL tools and emerging open-source options existed, the MBI API's layered design (Definitions, Control, Execution, Utilities) provided a highly structured, internally controlled environment for code generation. This offered deep customization possibilities and potentially tight integration with the Oracle database, although it required specific PL/SQL expertise. The structured nature contrasts with potentially more ad-hoc scripting approaches sometimes used for automation.

6.3 Early and Automated Data Vault Adoption: In 2008, Data Vault was still gaining traction compared to traditional Kimball and Inmon methodologies. Choosing Data Vault as the core modeling paradigm for the integration layers (CDV and CDU) was itself a forward-looking decision, aiming to leverage its strengths in flexibility, scalability, and auditability. Critically, the MBI framework did not just adopt Data Vault conceptually but provided mechanisms to automate the creation and loading of its core structures (Hubs, Links, Satellites) based on metadata and templates, addressing one of the key challenges often associated with Data Vault implementation – the volume of objects to be created and managed.

6.4 Integrated Task Execution and Control: The framework included a built-in task repository and execution engine (mbi_taak_control, mbi_taak_logging, associated tables). This provided capabilities for scheduling (implied via task definitions), parameterization, dependency management (through batch definitions), status tracking, and logging within the same environment used for code generation. This level of integration offered better control and visibility compared to potentially fragmented approaches relying on separate scripting, scheduling, and logging tools.

While individual concepts like metadata usage or Data Vault existed in 2008, the MBI system's integration of these ideas into a comprehensive, internally developed PL/SQL framework for automating generation across multiple architectural layers, including the presentation layer, represents a notably advanced and ambitious implementation for its time.

7. Conclusion

The MBI framework, developed circa 2008, stands as a testament to the potential of applying metadata-driven principles and Data Warehouse Automation techniques using an internally developed, cohesive platform. This case study, reconstructed from technical documentation and code artifacts, reveals a sophisticated system designed to address prevalent data warehousing challenges of its time.

The system successfully implemented a multi-layered architecture, progressing data from staging through integrated Data Vault layers (Owner_cdv, Owner_cdu) and into a flexible presentation layer (Owner_pkb) suitable for diverse analytical needs. This structure, particularly the early and automated adoption of Data Vault modeling, provided a scalable and auditable core.

The engine driving this architecture, the MBI API framework (Owner_mbi), demonstrated significant strengths:

  • Comprehensive Automation: It extended beyond typical ETL automation to include metadata-driven generation of DDL (tables, indexes, constraints) across multiple layers.
  • Standardization and Consistency: By embedding standards within reusable Utility packages and utilizing parameterized templates, the framework ensured consistency and adherence to architectural principles, fulfilling its goal of "conformed, controlled, condensed code".
  • Integrated Control: The built-in task repository, execution engine, and logging mechanisms provided a unified environment for managing complex data warehouse processes.

Considered within the context of 2008, the MBI system's integrated approach, its scope of automation (especially including presentation layer generation), and its implementation of Data Vault position it as an advanced, forward-looking solution. While commercial DWA and ETL tools were evolving concurrently, the MBI framework represents a powerful example of leveraging internal expertise and a metadata-centric philosophy within an Oracle PL/SQL environment to achieve significant automation and control.

Potential considerations or trade-offs for such an approach include the reliance on specialized PL/SQL skills for maintenance and extension, and the need to continuously evolve the internal framework in parallel with commercial tool advancements. However, the MBI system serves as a valuable case study illustrating how ambitious DWA goals could be realized early on through a well-structured, metadata-driven, custom-built framework.

8. References

  • [Internal Document 1: MBI API (2).doc]
  • [Internal Document 2: owner_mbi.txt]
  • [Internal Document 3: HUB_API.sql]
  • [Internal Document 4: owner_cdv.txt]
  • [Internal Document 5: owner_cdu.txt]

 

This must be one of the first fully automated EDWs to incorporate Data Vault, and it includes many features adopted years later in Data Vault 2.0.
