Data Warehouse

Take care of your data and your data will show you its power.

4 Data Storage

tl;dr: Use type safe formats such as HDF5 or parquet HDF5 BCOLZ <http://bcolz.blosc.org/en/latest/>_ : not designed for multidimentional data. Zarr <https://github.com/alimanfoo/zarr>_ : works with multidimensional data and also parallel computating. Blaze ecosystem <http://blaze.pydata.org/>_ A article that compares HDF5, BCOLZ, and Zarr: To HDF5 and beyond I also recommend pandas. It is a python module that works very well with data. It even loads HDF5 out of box.

3 OLAP Operations

Some useful OLAP operations

2 Extract, Transform and Load

ETL Process ETL ETL Extract: extract data from sources Transform: transform it to proper format Load: load it to data storage infrastructure E for Extract Should not affect the source system. T for Transform Cleaning Filtering Enriching Splitting Joining L for Load Deal with sync and waiting

1 Some Concepts about Data Warehouse

The Three Key Ideas about Warehouse The purpose of the data warehouse should be clear. In most cases, it is for the analysis of data, not for data production.1 Subject-oriented: since data warehouses are for decision-makers, arrange them into subjects makes it much easier to access. Integrated: many sources are integrated for easy analysis Time-variant: observation time should be recorded since the data is also used to analyze the time evolution Nonvolatile: simply for analysis OLTP and OLAP OLTP: online transaction processing OLAP: online analytical processing OLTP OLAP user customer data scientist, managers purpose production analysis content everything cleaner data database entity relation model, application-oriented star/snowflake model, subject-oriented history usually no need to record the history history is crucial query short and frequent read and write read-only and but complicated analysis Scope of Data Warehouse Enterprise warehouse: targeting the whole organization Data mart: for a specific group of people Virtual warehouse: views not tables Fact and Dimension Fact is the value of something specified by the dimension.