Some Concepts about Data Warehouse

The Three Key Ideas about Warehouse

The purpose of the data warehouse should be clear. In most cases, it is for the analysis of data, not for data production.¹

Subject-oriented: since data warehouses are for decision-makers, arrange them into subjects makes it much easier to access.
Integrated: many sources are integrated for easy analysis
Time-variant: observation time should be recorded since the data is also used to analyze the time evolution
Nonvolatile: simply for analysis

OLTP and OLAP

OLTP: online transaction processing
OLAP: online analytical processing

	OLTP	OLAP
user	customer	data scientist, managers
purpose	production	analysis
content	everything	cleaner data
database	entity relation model, application-oriented	star/snowflake model, subject-oriented
history	usually no need to record the history	history is crucial
query	short and frequent read and write	read-only and but complicated analysis

However, OLAP and OLTP are due to tech limitations in the past and these concepts are becoming obsolete².

Scope of Data Warehouse

Enterprise warehouse: targeting the whole organization
Data mart: for a specific group of people
Virtual warehouse: views not tables

Fact and Dimension

Fact is the value of something specified by the dimension. For example, we have a table that tells us the revenue of the year 2018 for keyboards is 3 million euros.

`Year`	`Category`	Revenue
2017	Keyboard	2,500,000
2017	Headset	200,000
2018	Keyboard	3,000,000
2018	Headset	134,000

The columns Year and Category serve as dimensions and Revenue serves as the fact.

The dimension tells us what we are talking about and the fact tells us the values. For the previous example, we could set up a coordinate system, with two axes, Year and Category. On the Year axis, we have two values, 2017 and 2018. On the Category axis, we have two values Keyboard and Headset. If we are looking for the revenue of keyboards in 2018, then we are looking for 2018 and Headset. It is clearly a 2 dimensional data cube.

In principle, we could have N-dimensional data cube. For example, we would like to look at the revenue of

keyboards (Category),
in 2018 (Year),
in Cologne (City),
sold offline (Channel).

Then we have a 4-dimensional dataset. Revenue is the fact and Category, Year, City and Channel are the dimensions.

Concept Hierarchies

Concept hierarchies are crucial to OLAP operations.

We consider a data cube about revenue with the dimension Date with values such as 2018-11-26. We would like to know the monthly revenue, the quarterly revenue, and the annual revenue. It’s straightforward to calculate those. The idea is that monthly, quarterly, and annual revenue form a hierarchy: day $\in$ month $\in$ quarter $\in$ year. This is a convention used by everyone on this planet.

In some other cases, concept hierarchies are very much designed instead of being a convention. Suppose we record the prices of houses. It would be nice to categorize the prices into different ranges of prices. Some would say, we divide them into five ranges:

below 25 percentile
25 to 50 percentile
50 to 75 percentile
above 75 percentile

Others would say, we divide them into 100 euros segments

0 to 100 euros
100 to 200 euros
…

In fact, this is not always the case. Sometimes, data products can be delivered using some assist of data warehouse if data product doesn’t rely on the performance of the data source. ↩︎
Huyen2022 Huyen C. Designing machine learning systems: An iterative process for production-ready applications. Sebastopol, CA: O’Reilly Media; 2022. Available: https://books.google.com/books?hl=en&lr=&id=EzhwEAAAQBAJ&oi=fnd&pg=PP1&dq=Designing+Machine+Learning+Systems+Huyen&ots=X4CpX2Dl8f&sig=OMCmdJMMGCITWJXU5pvLqcQO1dU ↩︎

Planted: 2018-11-23 by L Ma;

References: