Our Data Vault expert, Pasi, tells you what you need to know about Data Vault 2.0 – and answers questions along the way!
The first thing that comes to mind from ‘Data Vault’ is a way of modelling a data warehouse. It is that, but Data Vault 2.0 is much more: it is a method, an architecture and a model that together make it possible for a company to store data, examine it historically, produce information based on that data and distribute the information to the people who need it. Data Vault was developed by Dan Linstedt.
The Data Vault 2.0 method is based on commonly established process, development, project management and application development methods: CMMI level 5 (optimisation), PMP, Six Sigma, TQM and Agile. The method is platform-independent and supports both relational (RDBMS) and NoSQL (“non-relational”) environments such as Hadoop. It enables hybrid systems in which parts of the data warehouse reside in relational environments and parts in NoSQL environments.
The basic elements of the Data Vault 2.0 method are consistency, repeatability and an exploitable operating model. The multilayer architecture and the related modelling methods provide the means for flexible, scalable and fault-tolerant data warehouse development and delivery following agile development practices. Data traceability is kept in mind throughout the process, and new data sources can be brought in without disturbing the existing system and its functionality.
In the Data Vault 2.0 method, data warehousing and information production are separated from each other. This makes it possible to automate the phases related to data warehousing. Data is loaded and stored as raw data, in the format in which it is received from the source system. The data is enriched with information that eases traceability and the later phases in which information is fetched from the warehouse for reporting. Master Data is utilised when the information architecture of the warehouse is built (e.g. “same as” references) – it is also used to enrich and unify the reported information.
Utilising Data Vault 2.0
The Data Vault system is based on business concepts. In practice, the best way to identify the essential business concepts is through facilitated workshops where the customer’s personnel together define the important concepts/entities (e.g. customer, product) and how they relate to each other in the business. Each concept receives an identifier (business key) that uniquely identifies the data in a source system. The identifier can be a combination of several fields. What matters is the consistency of the identifier: it must remain the same throughout the source system.
Normally the identifiers are formed to have a clear meaning for the users of the source system (e.g. customer number, product number). Technical surrogate keys are usually poor identifiers because they are generated by the system without representing anything and, in the worst case, can change due to a change in the source system, a data restoration (backup) or other situations. Depending on the source system, however, a surrogate key can in fact be the best (or only) identifier available.
Structure of the Data Vault
The Data Vault data model is a combination of three types of data structures:
Hub – the unique identifiers (e.g. customer numbers) of a concept (e.g. customer)
Satellite – the descriptive information (attributes) of a concept, stored as received from the source system
Link – relations between concepts, e.g. references from customers to their orders
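As a rough sketch, the three structure types could be represented like this (an illustrative Python sketch; the column names such as `record_source` follow common Data Vault conventions, but the exact fields here are assumptions, not prescribed by the method):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class HubCustomer:
    """Hub: the unique business keys of the 'customer' concept."""
    customer_hash_key: str   # hash calculated from the business key
    customer_number: str     # the business key itself
    load_datetime: datetime
    record_source: str

@dataclass
class SatCustomerDetails:
    """Satellite: descriptive attributes of a customer, versioned over time."""
    customer_hash_key: str   # connects to the hub via the same hash key
    load_datetime: datetime  # differentiates versions of the same customer
    name: str
    address: str
    record_source: str

@dataclass
class LinkCustomerOrder:
    """Link: a relation between a customer and an order."""
    link_hash_key: str       # hash of the combined business keys
    customer_hash_key: str
    order_hash_key: str
    load_datetime: datetime
    record_source: str
```

Note how the hub holds nothing but the identifier, the satellite holds the changing attributes, and the link holds only the relation – this decentralisation is what lets new sources be added without touching existing structures.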
The implementation of the data warehouse is based on the uniquely identifying business keys and on the decentralisation of the information. Hash keys are calculated from the identifiers for the stored data. When the data is distributed into Hub, Satellite and Link structures, rows belonging to the same entity are connected by matching hash key values. When, for example, a customer’s data changes in the source system, the hash key calculated from the same identifier (e.g. customer number) connects the new versions of the information to the earlier versions – always. Multiple versions of the same data are differentiated from each other by the load time of the data.
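For illustration, a hash key calculation could look like this (a minimal Python sketch; MD5 is a commonly used choice in Data Vault 2.0, and the normalisation rules shown – trimming and upper-casing, with a delimiter for composite keys – are assumptions that simply need to be applied consistently everywhere):

```python
import hashlib

def hash_key(*business_key_parts: str) -> str:
    """Calculate a deterministic hash key from one or more business key parts.

    The parts are normalised (trimmed, upper-cased) and joined with a
    delimiter so that composite business keys hash consistently.
    """
    normalised = ";".join(part.strip().upper() for part in business_key_parts)
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()

# The same business key always produces the same hash key,
# regardless of when or where it is calculated:
customer_hk = hash_key("C-1001")            # e.g. for a hub row
link_hk = hash_key("C-1001", "O-5001")      # composite key, e.g. for a link row
```

Because the function depends only on the business key itself, every load process can calculate it independently and still arrive at the same value – which is exactly what makes the connections work.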
Data Vault 2.0 is an INSERT ONLY method. This enables faster update routines in the warehouse and minimises errors in the loading logic. In the update phase, the information from the source system is always compared against the latest data loaded into the warehouse. When data is fetched from the warehouse for reporting, the version picked is always the latest one. If information needs to be inspected at different points in time, the load times can be used to form snapshot-like structures for reporting (PIT, Point-In-Time structures) that hold direct reference keys to the data versions that existed at the chosen point in time.
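The insert-only update logic can be sketched roughly as follows (an illustrative Python sketch: the satellite is a plain list of rows, and comparing a hash of the attribute values – a “hash diff” – is one common way to detect whether anything changed):

```python
import hashlib
from datetime import datetime

satellite = []  # insert-only: rows are only ever appended, never updated

def hash_diff(attributes: dict) -> str:
    """Hash of the attribute values, used to detect changed data."""
    joined = ";".join(str(v) for _, v in sorted(attributes.items()))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def load(hash_key: str, attributes: dict, load_dt: datetime) -> None:
    """Insert a new version only if the data differs from the latest version."""
    versions = [r for r in satellite if r["hash_key"] == hash_key]
    latest = max(versions, key=lambda r: r["load_dt"], default=None)
    if latest is None or latest["hash_diff"] != hash_diff(attributes):
        satellite.append({"hash_key": hash_key,
                          "load_dt": load_dt,
                          "hash_diff": hash_diff(attributes),
                          **attributes})

def latest_version(hash_key: str) -> dict:
    """For reporting: always pick the latest loaded version."""
    versions = [r for r in satellite if r["hash_key"] == hash_key]
    return max(versions, key=lambda r: r["load_dt"])
```

An unchanged row is simply skipped, a changed row becomes a new version, and nothing is ever overwritten – so the full history stays queryable for PIT-style snapshots.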
Usage of hash keys:
- Enables direct referencing
- Simplifies and speeds up the loading and querying of the data
- Enables parallel loads (including MPP, Massively Parallel Processing), because
- Hash keys can always be calculated from the source system’s data without references to other information
- Separately loaded data sets can be connected to each other after the parallel load (e.g. MPP) using the hash keys
- Enables connecting data between relational and NoSQL environments (same hash keys)
Data Mart -> Information Mart
In the Data Vault 2.0 method, the discussion revolves around Information Marts instead of Data Marts. The background of this change lies in reporting: the information used there is almost always refined in one way or another – it is information derived from data. In addition to a traditional table structure, an Information Mart can also be, for example, a cube or a data set distributed to an analytics tool. With that in mind, a Data Mart containing unrefined data should rather be called a “raw mart”.
Managed Self-Service BI (SSBI)
SSBI is a data warehouse solution where:
- End users (super users, analysts) can directly access the produced information
- The information used by end users is always up to date
- End users can also produce information back to the warehouse/reporting for other end users to benefit from (write-back), for example when maintaining Master Data
- The edit and manipulation methods used by end users (provenance) are always known
- Data can be processed and used via mobile and web
- Data extraction and processing are always secure processes
- IT is always involved in attaching end users’ productions to the data warehouse (end users cannot bypass the protocols and processes defined by IT)
Data Vault 2.0 supports real-time data warehousing. The principles and procedures are the same as in traditional data warehousing, but real-time warehousing does not use a staging area. A real-time warehouse is practically always an insert-only solution, especially in situations where the volume of stored data grows exponentially (e.g. IoT environments). In such solutions, information arrives in the warehouse directly through a message-passing mechanism (message queue, ESB). The loading of the data is divided into independently paced message queues, so the extraction time does not reveal the order in which the data was created or updated in the source system. A real-time warehousing solution therefore needs the data’s creation time from the source system to determine the update order. The importance of data quality is also emphasised even further.
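The ordering problem can be illustrated with a toy Python sketch: messages arrive out of order from independently paced queues, and the source system’s creation time (a hypothetical `created_at` field here) is used to restore the true update order:

```python
from datetime import datetime

# Messages as they arrive from two independently paced queues:
# the arrival order does NOT match the order of events in the source system.
arrived = [
    {"key": "C-1001", "name": "Acme Oy",  "created_at": datetime(2024, 5, 1, 12, 30)},
    {"key": "C-1001", "name": "Acme",     "created_at": datetime(2024, 5, 1, 12, 10)},
    {"key": "C-1002", "name": "Beta Ltd", "created_at": datetime(2024, 5, 1, 12, 20)},
]

# Restore the true update order from the source system's creation time,
# not from the time the messages happened to be extracted or received:
in_source_order = sorted(arrived, key=lambda m: m["created_at"])

for message in in_source_order:
    print(message["created_at"], message["key"], message["name"])
```

Without the creation timestamp there would be no way to tell that “Acme Oy” is the newer name for customer C-1001, since it happened to arrive first.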
The method also offers tools for dealing with erroneous data and thus increasing data quality. Erroneous rows can be routed to so-called Error Marts, where they can be inspected and analysed further.
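Routing erroneous rows aside could look roughly like this (a minimal Python sketch; the validation rule and the structure names are illustrative only, not part of the method):

```python
raw_rows = [
    {"customer_number": "C-1001", "name": "Acme Oy"},
    {"customer_number": "",       "name": "Nameless"},   # missing business key
    {"customer_number": "C-1002", "name": "Beta Ltd"},
]

error_mart = []   # erroneous rows are kept for inspection and analysis
valid_rows = []   # clean rows continue into the normal load

for row in raw_rows:
    if row["customer_number"]:   # illustrative rule: a business key is required
        valid_rows.append(row)
    else:
        error_mart.append({**row, "error": "missing business key"})
```

The point is that bad rows are not silently dropped or corrected: they are stored with a reason, so the quality problem can be traced back to its source.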
If you need an expert in data warehousing or have questions regarding Data Vault: we’re happy to help – contact us!