Scale Up

#Data Engineering #Data Storage

Many of the examples are from the book by Andreas Kretz. Find the link to the book in the references section.

Scaling Up Storage

Scaling Up SQL DB

SAN: Storage Area Network

Use multiple servers on top of the DB storage to make queries faster.

  • Good for read-only DBs
  • Not convenient when the DB has to be updated

Hadoop

Hadoop:

  • Distributed storage
  • Analysis

4 core modules:

  • Hadoop common
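    • Shared libraries and utilities that the other modules rely on
  • HDFS
    • Divides files into blocks
    • Distributes the blocks across the nodes of the cluster
  • MapReduce
    • Older batch-processing model, often replaced by newer engines such as Spark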
  • YARN
    • Resource management
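To make the HDFS and MapReduce bullets above more concrete, here is a minimal pure-Python sketch. It does not use any Hadoop API. The function names, the node names, the block size, and the round-robin placement rule are all illustrative assumptions, not how HDFS actually chooses nodes.

```python
from collections import Counter


def split_into_blocks(lines: list[str], block_size: int) -> list[list[str]]:
    """Cut the input into blocks of `block_size` lines (the last may be shorter)."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def place_blocks(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign every block to `replication` nodes, round-robin.

    Assumes replication <= len(nodes) so that the replicas of one block
    land on distinct nodes. Real HDFS placement is more involved.
    """
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def map_block(block: list[str]) -> Counter:
    """Map step: count the words inside a single block."""
    return Counter(word for line in block for word in line.split())


def reduce_counts(partials: list[Counter]) -> Counter:
    """Reduce step: merge the per-block counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical input data and cluster layout.
lines = ["big data", "data lake", "big cluster", "data node", "big data"]
blocks = split_into_blocks(lines, block_size=2)
placement = place_blocks(len(blocks), ["node-a", "node-b", "node-c"], replication=2)

# In a real cluster each map_block call would run on a node that holds the block.
word_counts = reduce_counts([map_block(block) for block in blocks])
```

The point of the sketch is the division of labour: the storage layer only knows about blocks and where their copies live, while the map and reduce steps work block by block and never need the whole data set in one place.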

The Hadoop Ecosystem:

  • Many tools that can be connected to Hadoop or are based on Hadoop
  • The ecosystem goes beyond the 4 core modules:
    • Spark, for example, can also run on YARN
  • There is no need to use every part of Hadoop
    • Data -> (Kafka -> Spark) -> DB
      • The part in parentheses can run on Hadoop
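The pipeline above can be sketched in plain Python to show how the stages hand data to each other. This is only an illustration under loud assumptions: the `deque` stands in for a Kafka topic, the aggregation function stands in for a Spark job, and the dictionary stands in for the DB. No real Kafka or Spark API is used, and the event format is made up.

```python
from collections import deque


def produce(events, buffer):
    """Ingest: append raw events to the buffer (stand-in for a Kafka topic)."""
    for event in events:
        buffer.append(event)


def process(buffer):
    """Processing: drain the buffer and sum values per sensor (stand-in for a Spark job)."""
    totals = {}
    while buffer:
        sensor, value = buffer.popleft()
        totals[sensor] = totals.get(sensor, 0) + value
    return totals


def store(totals, db):
    """Sink: write the aggregated results to the store (stand-in for the DB)."""
    db.update(totals)


# Hypothetical (sensor, value) events.
raw_events = [("s1", 2), ("s2", 5), ("s1", 3)]
buffer = deque()
database = {}

produce(raw_events, buffer)
store(process(buffer), database)
```

Because each stage only talks to its neighbour through the buffer or the returned result, the middle part can be swapped for a Hadoop-based setup without touching the data source or the DB.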
Adapted from https://github.com/andkret/Cookbook



Lei Ma (2021). 'Scale Up', Datumorphism, 05 April. Available at: https://datumorphism.leima.is/wiki/data-engeering-for-data-scientist/scale-up/.
