Some Concepts about Data Warehouse
The Three Key Ideas about Warehouse
The purpose of the data warehouse should be clear. In most cases, it is for the analysis of data, not for data production.1
- Subject-oriented: since data warehouses are for decision-makers, arrange them into subjects makes it much easier to access.
- Integrated: many sources are integrated for easy analysis
- Time-variant: observation time should be recorded since the data is also used to analyze the time evolution
- Nonvolatile: simply for analysis
OLTP and OLAP
- OLTP: online transaction processing
- OLAP: online analytical processing
OLTP | OLAP | |
---|---|---|
user | customer | data scientist, managers |
purpose | production | analysis |
content | everything | cleaner data |
database | entity relation model, application-oriented | star/snowflake model, subject-oriented |
history | usually no need to record the history | history is crucial |
query | short and frequent read and write | read-only and but complicated analysis |
However, OLAP and OLTP are due to tech limitations in the past and these concepts are becoming obsolete2.
Scope of Data Warehouse
- Enterprise warehouse: targeting the whole organization
- Data mart: for a specific group of people
- Virtual warehouse: views not tables
Fact and Dimension
Fact is the value of something specified by the dimension. For example, we have a table that tells us the revenue of the year 2018 for keyboards is 3 million euros.
Year | Category | Revenue |
---|---|---|
2017 | Keyboard | 2,500,000 |
2017 | Headset | 200,000 |
2018 | Keyboard | 3,000,000 |
2018 | Headset | 134,000 |
The columns Year
and Category
serve as dimensions and Revenue
serves as the fact.
The dimension tells us what we are talking about and the fact tells us the values. For the previous example, we could set up a coordinate system, with two axes, Year
and Category
. On the Year
axis, we have two values, 2017
and 2018
. On the Category
axis, we have two values Keyboard
and Headset
. If we are looking for the revenue of keyboards in 2018, then we are looking for 2018
and Headset
. It is clearly a 2 dimensional data cube.
In principle, we could have N-dimensional data cube. For example, we would like to look at the revenue of
- keyboards (
Category
), - in 2018 (
Year
), - in Cologne (
City
), - sold offline (
Channel
).
Then we have a 4-dimensional dataset. Revenue is the fact and Category
, Year
, City
and Channel
are the dimensions.
Concept Hierarchies
Concept hierarchies are crucial to OLAP operations.
We consider a data cube about revenue with the dimension Date
with values such as 2018-11-26. We would like to know the monthly revenue, the quarterly revenue, and the annual revenue. It’s straightforward to calculate those. The idea is that monthly, quarterly, and annual revenue form a hierarchy: day $\in$ month $\in$ quarter $\in$ year. This is a convention used by everyone on this planet.
In some other cases, concept hierarchies are very much designed instead of being a convention. Suppose we record the prices of houses. It would be nice to categorize the prices into different ranges of prices. Some would say, we divide them into five ranges:
- below 25 percentile
- 25 to 50 percentile
- 50 to 75 percentile
- above 75 percentile
Others would say, we divide them into 100 euros segments
- 0 to 100 euros
- 100 to 200 euros
- …
In fact, this is not always the case. Sometimes, data products can be delivered using some assist of data warehouse if data product doesn’t rely on the performance of the data source. ↩︎
Huyen2022 Huyen C. Designing machine learning systems: An iterative process for production-ready applications. Sebastopol, CA: O’Reilly Media; 2022. Available: https://books.google.com/books?hl=en&lr=&id=EzhwEAAAQBAJ&oi=fnd&pg=PP1&dq=Designing+Machine+Learning+Systems+Huyen&ots=X4CpX2Dl8f&sig=OMCmdJMMGCITWJXU5pvLqcQO1dU ↩︎
- Data Mining by Jiawei Han, Micheline Kamber, Jian Pei
- Huyen2022 Huyen C. Designing machine learning systems: An iterative process for production-ready applications. Sebastopol, CA: O’Reilly Media; 2022. Available: https://books.google.com/books?hl=en&lr=&id=EzhwEAAAQBAJ&oi=fnd&pg=PP1&dq=Designing+Machine+Learning+Systems+Huyen&ots=X4CpX2Dl8f&sig=OMCmdJMMGCITWJXU5pvLqcQO1dU
L Ma (2018). 'Some Concepts about Data Warehouse', Datumorphism, 11 April. Available at: https://datumorphism.leima.is/wiki/data-warehouse/data-warehouse-concepts/.