By Thibaut De Vylder
The data vault is a practical response to the limitations of two other important approaches in the data-warehousing world.
The first approach is the enterprise data warehouse (EDW) proposed by Bill Inmon, who believes in centralising data.
The second is Ralph Kimball’s dimensional model, which proposes keeping sourced data as it is and assembling it as required to address each new demand.
We can summarise the two approaches that gave rise to data-vaulting in a table that, while not exhaustive, gives a good overview:
| Bill Inmon’s enterprise data warehouse | Ralph Kimball’s dimensional model |
|---|---|
| The data are centralised. | The sourced data are not centralised. Some are assembled according to a specific need, via an ETL*. |
| There is a twofold effort that must be made by the same team: the first involving centralisation; the second, restitution. | The additional effort, which is made when sources are being added, is increasingly important because the new assemblies/ETLs must remain consistent with the previous ones so as not to create two realities. |
| There is a significant risk that new requirements (for the addition of sources and new reports) will impact the existing requirements. | There is a real risk of inconsistency during maintenance activities (such as maintaining a rule common to all ETLs). |
| The delivery deadline is far off: all data must be sourced before any output is produced. → Not fast, and complex to go far | The deadline to meet the first requirement is tight, but those for the next ones are farther and farther off. → Fast but not far |
| → Centralisation avoids redundancy but it soon becomes complicated | → Distribution gives the illusion of speed but it quickly becomes complex to run and maintain |
*ETL (Extract, Transform and Load): a tool or procedure for copying data from a source system to a destination system.
Data-vaulting incorporates the positive aspects of both approaches (see Blog#2, The benefits of data-vaulting). It can seem unwieldy at first, because it adds tables in order to adhere to a few simple rules:
- The data structure is defined by business keys via hubs and links between these hubs
- All details are stored in satellites
These additional tables, which also make the model quite readable for business users, are a more than acceptable drawback, given the advantages the data-vaulting approach offers.
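The two rules above can be illustrated with a minimal sketch, written here in Python with an in-memory SQLite database. Everything in it is an assumption made for illustration: the table names (`hub_customer`, `hub_product`, `link_purchase`, `sat_customer_details`), the columns, the sample rows and the helper functions are hypothetical and do not come from any particular data-vault implementation.

```python
import sqlite3

# Hypothetical schema: every table and column name below is illustrative only.
SCHEMA = """
-- Hubs hold nothing but business keys (plus load metadata).
CREATE TABLE hub_customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_bk   TEXT NOT NULL UNIQUE,   -- business key
    load_ts       TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_product (
    product_id    INTEGER PRIMARY KEY,
    product_bk    TEXT NOT NULL UNIQUE,   -- business key
    load_ts       TEXT NOT NULL,
    record_source TEXT NOT NULL
);
-- A link records a relationship between hubs.
CREATE TABLE link_purchase (
    purchase_id   INTEGER PRIMARY KEY,
    customer_id   INTEGER NOT NULL REFERENCES hub_customer(customer_id),
    product_id    INTEGER NOT NULL REFERENCES hub_product(product_id),
    load_ts       TEXT NOT NULL,
    record_source TEXT NOT NULL,
    UNIQUE (customer_id, product_id)
);
-- A satellite stores the descriptive details of a hub, one row per load.
CREATE TABLE sat_customer_details (
    customer_id   INTEGER NOT NULL REFERENCES hub_customer(customer_id),
    load_ts       TEXT NOT NULL,
    name          TEXT,
    city          TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_id, load_ts)
);
"""


def build_demo_vault():
    """Create the sketch schema in memory and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO hub_customer VALUES (1, 'CUST-001', '2014-05-21', 'crm')")
    conn.execute("INSERT INTO hub_product VALUES (1, 'PROD-042', '2014-05-21', 'erp')")
    conn.execute("INSERT INTO link_purchase VALUES (1, 1, 1, '2014-05-21', 'erp')")
    # A change in the details adds a satellite row; the earlier row is kept.
    conn.execute(
        "INSERT INTO sat_customer_details VALUES (1, '2014-05-21', 'Alice', 'Brussels', 'crm')"
    )
    conn.execute(
        "INSERT INTO sat_customer_details VALUES (1, '2014-06-01', 'Alice', 'Ghent', 'crm')"
    )
    conn.commit()
    return conn


def current_customer_city(conn, customer_bk):
    """Return the most recently loaded city for a business key, or None."""
    row = conn.execute(
        """
        SELECT s.city
        FROM hub_customer AS h
        JOIN sat_customer_details AS s ON s.customer_id = h.customer_id
        WHERE h.customer_bk = ?
        ORDER BY s.load_ts DESC
        LIMIT 1
        """,
        (customer_bk,),
    ).fetchone()
    return row[0] if row else None
```

Calling `current_customer_city(build_demo_vault(), "CUST-001")` looks up the customer by its business key in the hub and reads the latest details from the satellite. The extra joins are the price of the additional tables; in return, the structure (hubs and links) stays separate from the details (satellites).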
Meeting with Bill Inmon
On 21 May 2014, I had the opportunity to go to a presentation by Emergo, which was one of the first to offer solutions based on data-vaulting 1.0. I had two main reasons for going: first, we had been using the data-vault 1.0 methodology for almost 2 years, and second, one participant had just been given top billing: Bill Inmon.
I asked him: “What do you do at this type of event when data-vaulting points up the flaws in your EDW model?” He replied with a little story: in the 1990s he had managed to model the world’s largest company, Walmart, in an EDW, and had predicted that from then on all other organisations would be simpler. “But I reached the end of the ocean.”
He also told me that at that time two things had not existed: the internet and the data it generates. We were in a world of small data. And he told me he was not responsible for the fact that his approach was still the one that was studied the most in the world and that his books were the ones that sold the most.
From a conversation between Bill Inmon, “the father of the data warehouse”, and Thibaut De Vylder at the Emergo event, 21 May 2014.