Scale Up

#Data Engineering #Data Storage

Many of the examples are from the book by Andreas Kretz. A link to the book is in the References section.

Scaling Up Storage

Scaling Up SQL DB

SAN: Storage Area Network

Connect multiple servers to the same DB storage to make queries faster.

Good for read-only DBs
Not convenient for updating the DB

Hadoop

Hadoop:

Distributed storage
Analysis

4 core modules:

Hadoop common
- background functionalities
HDFS
- Divide into blocks
- Distribute
MapReduce (see: Basics of MapReduce)
- Old tech
YARN
- Resource management
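
The "divide into blocks, then distribute" idea behind HDFS can be illustrated with a toy sketch in plain Python. This is not how HDFS is implemented: the function names, the node names, the tiny block size, and the round-robin placement are all assumptions made for illustration. Replication is also an addition not mentioned in the notes above; real HDFS uses far larger blocks (128 MB by default) and a rack-aware placement policy.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def distribute(num_blocks: int, nodes: list[str], replication: int = 2) -> dict[int, list[str]]:
    """Assign each block index to `replication` distinct nodes, round-robin.

    Hypothetical placement rule for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    n = len(nodes)
    return {
        idx: [nodes[(idx + r) % n] for r in range(replication)]
        for idx in range(num_blocks)
    }


# Toy input: 1000 bytes cut into 256-byte blocks, spread over 3 made-up nodes.
data = b"x" * 1000
blocks = split_into_blocks(data, block_size=256)
placement = distribute(len(blocks), ["node-a", "node-b", "node-c"])

for idx, holders in placement.items():
    print(f"block {idx} ({len(blocks[idx])} bytes) -> {holders}")
```

Because every block lives on more than one node in this sketch, a reader can still reassemble the file if a single node is lost, which is the main reason to distribute copies rather than single blocks.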

The Hadoop Ecosystem:

Many tools can be connected to Hadoop or are built on top of it
Considering the ecosystem, Hadoop is no longer only about the 4 core modules:
- Spark can also use YARN for resource management
No need to use everything in Hadoop
- Example pipeline: Data -> (Kafka -> Spark) -> DB
  - The stages in parentheses can run on Hadoop
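
The shape of such a pipeline can be sketched with stand-ins written in plain Python. Nothing here uses Kafka or Spark: the `deque` plays the role of the message buffer, `process` plays the role of the processing engine, and a `dict` plays the role of the DB. All function names and the sample records are hypothetical.

```python
from collections import deque


def ingest(records, buffer: deque) -> None:
    """Stand-in for a message queue: append raw records to a buffer."""
    for record in records:
        buffer.append(record)


def process(buffer: deque) -> dict[str, int]:
    """Stand-in for a processing engine: drain the buffer and sum amounts per user."""
    totals: dict[str, int] = {}
    while buffer:
        user, amount = buffer.popleft()
        totals[user] = totals.get(user, 0) + amount
    return totals


def store(result: dict[str, int], db: dict[str, int]) -> None:
    """Stand-in for the DB write: merge the aggregated result into a dict."""
    db.update(result)


buffer: deque = deque()
db: dict[str, int] = {}

# Data -> (buffer -> processing) -> DB
ingest([("alice", 3), ("bob", 5), ("alice", 2)], buffer)
store(process(buffer), db)
print(db)
```

Each stage only talks to its neighbour, so any single stage could be swapped for a Hadoop-based component without changing the others, which is the point of the note above.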

Adapted from https://github.com/andkret/Cookbook

Planted: 2021-05-05 by L Ma

References:

A Kretz (2019). 'The Data Engineering Cookbook'. Available at: https://github.com/andkret/Cookbook.



L Ma (2021). 'Scale Up', Datumorphism, 05 April. Available at: https://datumorphism.leima.is/wiki/data-engeering-for-data-scientist/scale-up/.