Data Warehouse

Take care of your data and your data will show you its power.

4 Data Storage

Published:
Category: { Data Warehouse }
Summary: tl;dr: Use type safe formats such as HDF5 or parquet HDF5 BCOLZ <http://bcolz.blosc.org/en/latest/>_ : not designed for multidimentional data. Zarr <https://github.com/alimanfoo/zarr>_ : works with multidimensional data and also parallel computating. Blaze ecosystem <http://blaze.pydata.org/>_ A article that compares HDF5, BCOLZ, and Zarr: To HDF5 and beyond I also recommend pandas. It is a python module that works very well with data. It even loads HDF5 out of box.
Pages: 4

3 OLAP Operations

Published:
Category: { Data Warehouse }
Tags:
References: - Data Mining, Section 4.2.5, by Jiawei Han, Micheline Kamber, Jian Pei
Summary: Some useful OLAP operations
Pages: 4

2 Extract, Transform and Load

Published:
Category: { Data Warehouse }
Summary: ETL Process ETL ETL Extract: extract data from sources Transform: transform it to proper format Load: load it to data storage infrastructure E for Extract Should not affect the source system. T for Transform Cleaning Filtering Enriching Splitting Joining L for Load Deal with sync and waiting
Pages: 4

1 Some Concepts about Data Warehouse

Published:
Category: { Data Warehouse }
References: - Data Mining by Jiawei Han, Micheline Kamber, Jian Pei
Summary: The Three Key Ideas about Warehouse The purpose of the data warehouse should be clear. In most cases, it is for the analysis of data, not for data production.1 Subject-oriented: since data warehouses are for decision-makers, arrange them into subjects makes it much easier to access. Integrated: many sources are integrated for easy analysis Time-variant: observation time should be recorded since the data is also used to analyze the time evolution Nonvolatile: simply for analysis OLTP and OLAP OLTP: online transaction processing OLAP: online analytical processing OLTP OLAP user customer data scientist, managers purpose production analysis content everything cleaner data database entity relation model, application-oriented star/snowflake model, subject-oriented history usually no need to record the history history is crucial query short and frequent read and write read-only and but complicated analysis Scope of Data Warehouse Enterprise warehouse: targeting the whole organization Data mart: for a specific group of people Virtual warehouse: views not tables Fact and Dimension Fact is the value of something specified by the dimension.
Pages: 4