Data vault modeling: Everything you need to know (2024)

Data vault is an agile data modeling technique and architecture, specifically designed for building scalable enterprise data warehouses.

First conceptualized in the 1990s by Dan Linstedt, the Data Vault methodology separates a source system's data structure from its attributes. Instead of conveying business logic through facts, dimensions, or extensively normalized tables, a Data Vault employs a direct approach, transferring data from source systems to a small set of specifically designed tables.

In recent years, data vaults have made a comeback thanks to the rising popularity of data lakehouses. Data lakehouses are a new type of data platform that combines elements of both data lakes and data warehouses. They typically store both raw data and transformed analytic models and tables, using schema-on-read to avoid needing upfront schema definitions.

Data vault modeling is well suited to the lakehouse methodology since the data vault model is easily extensible and ETL changes are easily implemented.

This guide will provide an overview of data vault: its concepts, advantages, key considerations, best practices, and tooling.


What is data vault modeling?

At its core, the data vault model uses a simple three-layer architecture consisting of hubs, links, and satellites. Here’s a summary of each:

  • Hubs store unique business keys, essentially the unique identifiers for business concepts or objects.
  • Links represent relationships between the unique business keys stored in hubs. Links establish the connections or associations between different business objects.
  • Satellites hold all the descriptive data or attributes related to hubs or links, like the textual descriptions, timestamps, or numerical values. They capture the context, details, and history of business keys and their relationships.

In summary, a hub identifies a business concept, a link maps out how that concept relates to others, and a satellite provides rich details about either the concept or its relationships.

Hubs, links, and satellites: The building blocks of data vault modeling

Hubs

Hubs represent the core business concepts by storing unique business keys, and they act as the foundational entities upon which the entire Data Vault structure is built. An example of a hub might be a Customer table that contains the unique identifier for each customer entity.

At a minimum, a Hub typically contains:

  • A surrogate key, which is a system-generated unique identifier.
  • The natural business key, which is the actual unique identifier from the source system.
  • Load date or load-end date timestamps indicating when the key was first loaded and possibly when it was superseded.

Hubs are highly stable as they capture the unchanging essence of a business concept. This means they aren't frequently modified even when other aspects of the data model change.
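
To make the hub layout concrete, here is a minimal sketch in Python of what a hub row might look like. The class name, field names, and the record source column are illustrative assumptions, not a prescribed implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class CustomerHub:
    """One row in a hypothetical Customer hub."""
    hub_customer_key: str  # surrogate key (hash-based or sequence-based)
    customer_id: str       # natural business key from the source system
    load_date: datetime    # when the business key was first loaded
    record_source: str     # originating source system, useful for auditing


# Registering a customer business key for the first time.
row = CustomerHub(
    hub_customer_key="hk_customer_0001",  # placeholder surrogate key
    customer_id="CUST-1001",
    load_date=datetime.now(timezone.utc),
    record_source="crm",
)
```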

Links

Links model the relationships or associations between business concepts (hubs). They encapsulate many-to-many relationships between keys, expressing how business keys relate to each other across the organization. An example of a link might be a CustomerOrder table that links customer_id to order_id, representing the relationship between customers and orders.

A link contains:

  • A surrogate key for the link itself (the CustomerOrder ID).
  • Surrogate keys from the related Hubs to represent the relationship (customer_id, order_id).
  • Load date or load-end date timestamps capturing when the relationship was first and last recognized.

Links are dynamic, mirroring the evolving nature of business relationships. They can quickly adjust to reflect changes in the way different business concepts are interrelated.
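
Continuing the same illustrative Python sketch, a link row carries only keys and load metadata. The names below are assumptions for the CustomerOrder example above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CustomerOrderLink:
    """One row in a hypothetical CustomerOrder link relating customers to orders."""
    link_customer_order_key: str  # surrogate key for the relationship itself
    hub_customer_key: str         # surrogate key of the related Customer hub row
    hub_order_key: str            # surrogate key of the related Order hub row
    load_date: datetime           # when the relationship was first observed
    record_source: str            # where the relationship was observed
```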

Satellites

Satellites store the contextual, descriptive, and temporal data associated with hubs or links. This is where the "meat" of the business information resides, such as attributes, textual descriptions, and numerical values. An example of a satellite might be a CustomerDetails table that contains descriptive attributes about each customer, such as first_name, last_name, email, and address, with customer_id as a foreign key mapping back to the Customer hub.

A satellite contains:

  • A surrogate key that relates back to the hub or link
  • Descriptive attributes about the business key (from hubs) or the relationship (from links)
  • Load date and load-end date timestamps for each record, which help in tracking changes to attribute values over time
  • An optional record source indicating where each piece of descriptive data came from, which is useful in auditing scenarios

Satellites are highly volatile compared to hubs and links. They capture all the changes and variations in business data attributes, allowing a historical and audit-friendly view of the data's evolution.
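
A satellite row from the same illustrative sketch might look like the following. The load_end_date stays empty (None) while a version is current and is set when a newer version supersedes it; the attribute names mirror the CustomerDetails example above and are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class CustomerDetailsSatellite:
    """One historized version of descriptive attributes attached to the Customer hub."""
    hub_customer_key: str              # parent key pointing back to the Customer hub
    load_date: datetime                # when this version of the attributes arrived
    load_end_date: Optional[datetime]  # None while this is the current version
    record_source: str                 # source system the attributes came from
    first_name: str
    last_name: str
    email: str
    address: str
```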

Data vault metadata management and scalability considerations

In the context of data vault, metadata provides the essential descriptive information about the various elements like hubs, links, and satellites, as well as about the lineage, load processes, and business context. The types of data vault metadata are as follows:

  1. Structural metadata includes definitions of tables, columns, data types, keys, indexes, constraints, and relationships. It provides insight into the schema design, the relationships between entities, and the general layout of the data within the data vault.
  2. Operational metadata includes data load timestamps, ETL job logs, transformation rules, source system identifiers, and data quality metrics. It ensures traceability and auditability, enabling teams to understand data lineage, transformation logic, and source-to-target mappings, and to troubleshoot data issues.
  3. Business metadata captures business-centric definitions, rules, and context of the data elements in the data vault. It ensures that data can be interpreted, contextualized, and utilized effectively by business users, bridging the gap between technical data structures and business semantics.

Scalability considerations

Metadata repositories are specialized storage systems (databases or platforms) that house and manage all the metadata components. Repositories contain tables, views, APIs, or services to capture, access, and interact with various metadata types. They provide a single point of reference for both technical teams and business users. Here's what to consider in terms of scalability:

  • Volume and complexity: As a data vault amasses a large volume of metadata, ensure the metadata store can scale both vertically and horizontally.
  • Load and transformation metadata: As ETL processes and logic evolve, metadata accumulates quickly; use efficient storage structures such as columnar storage and optimized indexing.
  • Versioning and auditability: Implement a versioning mechanism for metadata, allowing rollback, comparison, and audit of changes.

Advantages of data vault

Data vault offers several benefits across multiple areas, depending on the size, context, and goals of the organization. In the following paragraphs, we'll walk through some of them.

Scalable architecture

At the heart of data vault's design lies its modular structure, built upon hubs, links, and satellites. This granular and modular setup allows for the efficient scaling of data structures as the data grows, both in volume and complexity.

Adaptive to changes

Data vault is designed to absorb changes seamlessly. Whether a business undergoes shifts in its processes, introduces new data sources, or modifies existing ones, the Data Vault model can easily adapt without major overhauls or disruptions to the existing system.

Agile data integration and faster time-to-market

  • Incremental data loading: The model supports the integration of new data incrementally. This means that as new data sources or attributes emerge, they can be added without a complete redesign of the warehouse (see the loading sketch after this list).
  • Parallelization: The inherent separation between different types of data (business keys, relationships, attributes) allows for parallel loading processes. This parallelism ensures faster data ingestion and integration.
  • Reduced dependencies: The compartmentalized structure means that changes in one part of the model don't mandate changes in other sections. This decoupling fosters an environment for agile development and faster deployments, significantly reducing the time-to-market for new features or data sources.
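
As a rough illustration of the incremental-loading point above, the sketch below appends a new satellite version only when the incoming attributes actually differ from the current version, using a hash diff. The function names, and the in-memory list of dictionaries standing in for a satellite table, are assumptions rather than a reference implementation.

```python
import hashlib
from datetime import datetime, timezone


def hash_diff(attributes: dict) -> str:
    """Fingerprint of the descriptive attributes, used to detect changes."""
    payload = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def load_satellite_increment(satellite: list, hub_key: str,
                             attributes: dict, record_source: str) -> None:
    """Append a new satellite version only if the attributes changed."""
    now = datetime.now(timezone.utc)
    current = [row for row in satellite
               if row["hub_key"] == hub_key and row["load_end_date"] is None]
    incoming = hash_diff(attributes)
    if current and current[0]["hash_diff"] == incoming:
        return  # nothing changed, so no new row is written
    if current:
        current[0]["load_end_date"] = now  # close off the superseded version
    satellite.append({
        "hub_key": hub_key,
        "load_date": now,
        "load_end_date": None,
        "hash_diff": incoming,
        "record_source": record_source,
        **attributes,
    })
```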

Support for historical data tracking and auditing requirements

  • Immutable data storage: Data vault inherently stores historical changes, particularly within Satellites. This ensures that the system maintains an immutable record of all data versions over time.
  • Auditing and compliance: The ability to track historical data in an immutable fashion supports various auditing and regulatory compliance needs. Organizations can reliably produce data snapshots from any point in time, meeting stringent data retention requirements.
  • Time-variant data: Data vault's structures inherently support time-variant data, capturing precisely when specific changes or additions were made and providing a robust foundation for temporal data analyses (see the point-in-time sketch after this list).
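
To illustrate the time-variant point above, here is a minimal sketch of reconstructing the attribute version that was valid at a given timestamp, assuming satellite rows represented as dictionaries with one load_date and load_end_date per version, as in the loading sketch earlier:

```python
from datetime import datetime
from typing import Optional


def attributes_as_of(satellite: list, hub_key: str, as_of: datetime) -> Optional[dict]:
    """Return the satellite version that was current at the given point in time."""
    for row in satellite:
        started = row["load_date"] <= as_of
        still_open = row["load_end_date"] is None or as_of < row["load_end_date"]
        if row["hub_key"] == hub_key and started and still_open:
            return row
    return None  # the business key had no recorded attributes at that time
```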

Enhanced data lineage

  • Transparent data flow: The data vault methodology inherently documents how data moves from source systems to consumption layers. This clear pathway ensures that the origin and the various transformations data undergoes are well-tracked.
  • Building trust: With clear data lineage, data consumers can place more trust in the data, fully understanding where it originates and how it's been processed.
  • Impact analysis: A well-defined lineage allows for effective impact analyses. Teams can understand the repercussions of changes in source systems, transformations, or business logic, ensuring informed decision-making.

Top five use cases for data vault

Below are the primary use cases where the Data Vault approach shines:

  1. Large-scale data integration: Merges data from diverse systems, for example during company mergers or when integrating legacy platforms with new ones, using hubs, links, and satellites for seamless integration.
  2. Evolving data sources: Adapts to rapid changes in source system schemas, making it ideal for startups or tech-driven firms thanks to its incremental modeling flexibility.
  3. Historical data analysis: Tracks and stores data changes over time, crucial for sectors like finance or healthcare where trend analysis and compliance are paramount.
  4. Data warehousing in agile environments: Supports iterative data warehousing in agile methodologies, allowing continuous integration and delivery through its modular design.
  5. Enhanced data governance and lineage: Maintains clear data origins and transformations, essential for industries requiring transparency in data flow and impact analysis.

Data vault best practices

Done correctly, your data vault architecture is a foundation that can support business users and drive business value. Below are some key best practices to follow when implementing a data vault data model:

  • Business keys identification: Pinpoint the unique identifiers in source systems that will form the foundation of your hubs.
  • Raw data layer: Maintain a "staging area" for unaltered source data for validation and transformation checks.
  • Data lineage: Prioritize comprehensive tracking of data's origin and transformations for transparency and compliance.
  • Satellite design: Structure Satellites for efficient capture and querying of historical changes.
  • Soft links: Use them judiciously, reserving them for truly complex scenarios to avoid performance issues.
  • Historical tracking: Standardize mechanisms in Satellites for uniform date and source tracking.
  • Optimized reporting: Build structures like denormalized views atop the Data Vault for efficient querying and reporting (see the sketch after this list).
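
As a sketch of the "optimized reporting" practice, the function below flattens a hub and the current version of its satellite into denormalized, dimension-style records. The dictionary-based row shapes follow the earlier loading sketch and are assumptions, not a prescribed design.

```python
def build_customer_view(hub_rows: list, satellite_rows: list) -> list:
    """Join each hub row to its current satellite version for easy reporting."""
    current = {
        row["hub_key"]: row
        for row in satellite_rows
        if row["load_end_date"] is None  # keep only the latest version per key
    }
    view = []
    for hub in hub_rows:
        details = current.get(hub["hub_key"], {})
        view.append({
            "customer_id": hub["customer_id"],
            "first_name": details.get("first_name"),
            "last_name": details.get("last_name"),
            "email": details.get("email"),
        })
    return view
```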

Tooling

Proper tooling can significantly simplify the complexity of implementing and managing a Data Vault project. Many data integration tools now provide capabilities to support data vault style modeling.

  • Oracle: Supports data vault concepts in Oracle SQL Developer Data Modeler. Provides data modeling, lineage, and governance features.
  • Databricks: Supports data vault concepts on the Databricks Lakehouse Platform.
  • Azure Data Factory: Mapping data flows enables implementing Data Vault patterns. Built-in data profiling helps identify hub entities.
  • Erwin: Models, maps, and automates the creation, population, and maintenance of data vault solutions on Snowflake.
  • VaultSpeed: Automates the creation of data vaults.
  • Talend: An ETL tool that also offers components for managing hub, link, and satellite tables, business keys, and relationship life cycles.
  • WhereScape RED: Automates development of data vault models and ETL processing. Provides templates to accelerate development.

Final thoughts

While data vault offers many benefits, it also introduces complexities to consider when implementing. It requires consistency in key definitions across source systems. It can also result in complex link table relationships to map.

If you're going to implement a data vault framework, you'll need extensive metadata management. Your framework may also increase ETL processing time due to added data points.

Understanding these nuances upfront lets you balance the trade-offs against a traditional model. As data complexity grows, organizations need more adaptable modeling approaches. By combining standardization, modularity, and detailed historical tracking, a data vault provides a flexible way to structure enterprise data and serve changing analytics needs.

For more ways to optimize your data management, take Bigeye for a spin.


FAQs

What is the Data Vault model approach?

A data vault enterprise data warehouse provides both a single version of the facts and a single source of truth. The modeling method is designed to be resilient to change in the business environment the stored data comes from, by explicitly separating structural information from descriptive attributes.

Is Data Vault still relevant?

The biggest advantage of having a data vault in place is its adaptability to change. If your source architecture is prone to changes, such as the addition or deletion of columns, new tables, or new/altered relationships, you should definitely implement a data vault.

What problems does Data Vault solve?

Traditional data warehousing solutions may require significant time and effort to adapt to such changes. Data vault, with its agile framework, allows for greater flexibility and easier change management. It provides the ability to adapt to new business requirements without having to redesign the entire data model.

What is the difference between hub and satellite in Data Vault?

The data vault has three types of entities: hubs, links, and satellites. Hubs represent core business concepts, links represent relationships between hubs, and satellites store information about hubs and relationships between them.

What is the primary key in Data Vault?

The primary key is either a hashed value of the business key or a sequence number (surrogate key). In Data Vault 2.0 [4], primary keys based on sequence numbers are replaced by hash-based primary keys. The load date indicates the date and time when the business key initially arrived in the hub.
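
As a hedged illustration of the hash-based keys described above: a surrogate key can be derived deterministically from the business key, so independent loads compute the same key without a shared sequence generator. The normalization rules and the choice of MD5 below are assumptions, not part of the standard.

```python
import hashlib


def hub_hash_key(business_key: str) -> str:
    """Deterministic surrogate key: normalize the business key, then hash it."""
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# The same business key always yields the same surrogate key, which is what
# lets hubs, links, and satellites be loaded independently and in parallel.
assert hub_hash_key(" cust-1001 ") == hub_hash_key("CUST-1001")
```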

What are the disadvantages of Data Vault?

The Data Vault has a few drawbacks. Afterwards, you still have to "work back" to a dimensional data model for consumption. More knowledge is required: Data Vault introduces a third modeling technique, which employees must master bit by bit. Finally, there is no integrity guarantee: the data in the Data Vault lacks integrity and is not always correct.

What are the criticisms of Data Vault?

One of the biggest criticisms of data vault is its complexity. The model consists of many different types of tables (Hubs, Links, Satellites), and the relationships between them can become quite complex, especially in large systems.

What are the different types of data vaults?

Data vaults have 3 types of entities: Hubs, Links, and Satellites.

What is the difference between Data Vault and dimensional modeling?

Dimensional modeling and data vault modeling have some similarities, such as their modular components which can be reused and extended. Dimensional modeling uses facts and dimensions, while data vault modeling uses hubs, links, and satellites.

Is Data Vault normalized?

The data vault model is based on normalization and separation of classes of data. In this particular case, the business keys (hubs) are considered a different class than the relationships (links).

What are the advantages of Data Vault over dimensional modeling?

Lineage and audit: Because Data Vault includes metadata identifying the source systems, it is easier to support data lineage. Unlike the dimensional design approach, in which data is cleaned before loading, Data Vault changes are always incremental, and results are never lost, which provides an automatic audit trail.

What is the Data Vault 2.0 methodology?

Data Vault 2.0 is a database modeling method published in 2013. It was designed to overcome many of the shortcomings of data warehouses created using relational modeling (3NF) or star schemas (dimensional modeling). Specifically, it was designed to be scalable and to handle very large amounts of data.

Is Data Vault a data lake?

Data Vault is a combination of dimensional modeling and third normal form [7] and supports agile project management and use-case-independent modeling [8, 9]. Because it is a simple and flexible modeling technique, Data Vault qualifies for data modeling in data lakes [5].

Is Data Vault 3NF?

"The Data Vault is a detail-oriented, historical tracking, and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema."

What are the three approaches to data modeling?

There are three types of data models: dimensional, relational, and entity-relational. These models follow three approaches: conceptual, logical, and physical. Other data models, such as network, hierarchical, object-oriented, and multi-value, also exist but are largely obsolete.

What is the Data Vault logical model?

The Data Vault model is a conceptual and logical data model using table structures. Data Vault represents entities, relationships between entities, and additional context data in three different table types: hubs, links, and satellites. Hubs represent business objects in Data Vault.

What is the difference between a Data Vault and a data warehouse?

Data vaults store raw data as-is without applying business rules. Data transformation happens on-demand, and the results are available for viewing in a department-specific data mart. While a traditional data warehouse structure relies on extensive data pre-processing, the data vault model takes a more agile approach.

Which is a benefit of a Data Vault?

Flexibility in data storage: the Data Vault provides flexibility in data storage in a number of ways. New sources and entities can be added easily without modifying the existing structure, and even incorrect or incomplete data is stored rather than discarded.
