Data Processing
Batch Processing
Kretz recommends starting with batch processing and moving to streaming only if needed [Kretz2019].
Stream Processing
There are three delivery guarantees for streaming data:
- At Least Once:
  - a message is processed once or multiple times, but never dropped;
  - e.g., time-based GPS data in fleet management: if incoming records share a timestamp, we simply overwrite the existing entry, so it does not matter how many times a record is delivered or processed (see the sketch after this list).
- At Most Once:
  - it is okay to drop a message;
  - each message is processed at most once;
  - e.g., accident events: recording the same accident multiple times would be a misleading signal that accidents happen often.
- Exactly Once:
  - a message is never dropped and never duplicated;
  - e.g., banking transactions.
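One way to make at-least-once delivery safe is the idempotent write described in the GPS example above: key each record by its timestamp, so a re-delivered message simply overwrites the same entry. A minimal sketch in plain Python (all names are hypothetical):

```python
# Store keyed by timestamp: duplicates overwrite the same entry.
positions: dict[str, tuple[float, float]] = {}

def handle_gps_message(timestamp: str, lat: float, lon: float) -> None:
    # Overwriting by key makes re-delivery harmless (idempotent write).
    positions[timestamp] = (lat, lon)

# The same message delivered twice leaves the store unchanged.
handle_gps_message("2022-02-01T10:00:00Z", 52.52, 13.40)
handle_gps_message("2022-02-01T10:00:00Z", 52.52, 13.40)
print(positions)  # one entry
```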
Tools
Spark
Data processing with [[Spark]] (PySpark) is done in memory:
- e.g., data is loaded from HDFS into the workers' memory;
- input data and intermediate results stay in memory, with no disk writes in between;
- there are no rigid map-reduce stages anymore;
- analyses can be complex;
- streaming analysis is supported;
- job scheduling is supported natively.
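As a sketch of this workflow in PySpark (the HDFS path and the distance column are hypothetical): load the data once, cache it in the workers' memory, and reuse it across several actions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# Hypothetical HDFS path and schema.
trips = spark.read.csv("hdfs:///data/trips.csv", header=True, inferSchema=True)

# cache() keeps the data in the workers' memory, so the actions
# below reuse it instead of re-reading from HDFS.
trips.cache()

print(trips.count())
print(trips.filter(trips["distance"] > 10).count())
```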
RDD
Resilient Distributed Datasets (RDDs) are the core abstraction of Spark. They are a good tool for low-level optimizations [databricksblog2016].
- similar to map-reduce;
- a lower-level API;
- the older abstraction;
- DataFrames/Datasets are now preferred.
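As a sketch of what the low-level API looks like, here is the classic word count written directly against RDDs:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Word count as explicit map and reduce steps on an RDD.
counts = (
    sc.parallelize(["a b", "b c", "a a"])
    .flatMap(lambda line: line.split())   # line -> words
    .map(lambda word: (word, 1))          # word -> (word, 1)
    .reduceByKey(lambda a, b: a + b)      # sum counts per word
)
print(counts.collect())  # e.g., [('a', 3), ('b', 2), ('c', 1)]
```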
DataFrames and Datasets
DataFrames and Datasets share the same API since Apache Spark 2.0 [databricksblog2016]. They are easier to use than RDDs.
Datasets are typed, so more errors can be caught at compile time. (Typed Datasets exist in Scala and Java; PySpark offers only DataFrames.)
SparkSQL
SparkSQL is used to query DataFrames with SQL.
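For instance, a DataFrame can be registered as a temporary view and then queried directly; a minimal sketch (the data and view name are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-demo").getOrCreate()

df = spark.createDataFrame([("alice", 3), ("bob", 5)], ["name", "visits"])

# Register the DataFrame as a temporary view, then query it with SQL.
df.createOrReplaceTempView("visits")
spark.sql("SELECT name FROM visits WHERE visits > 4").show()
```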
Run Complex Models in Spark
One can run complex machine learning models directly on Spark using tools such as TensorFlow and MLlib.
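A minimal MLlib sketch, with toy data invented for illustration: assemble the feature columns into a single vector column, then fit a logistic regression on it.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Toy data, for illustration only.
df = spark.createDataFrame(
    [(0.0, 1.0, 0.0), (1.0, 0.0, 1.0), (0.5, 0.5, 1.0), (0.1, 0.9, 0.0)],
    ["x1", "x2", "label"],
)

# Collect the feature columns into one vector column, then fit.
features = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(df)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)
model.transform(features).select("label", "prediction").show()
```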
Resource management
Spark has its own resource management module, but one can also use Hadoop's YARN.
When to use Spark
If the processing is simple (summing, counting, etc.): map-reduce is good enough.
If the analytics are complex (e.g., ML) or speed is needed: use Spark.
Spark + Hadoop
e.g., use YARN to manage the physical resources.
If Spark and Hadoop are running on the same workers, it does not make sense to have multiple resource managers. In this case, use Hadoop's YARN.
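A sketch of what this looks like from PySpark, assuming the environment variable HADOOP_CONF_DIR already points at the cluster's Hadoop configuration:

```python
from pyspark.sql import SparkSession

# master("yarn") makes Spark request its executors from YARN instead of
# running its own standalone resource manager.
spark = (
    SparkSession.builder
    .appName("yarn-demo")
    .master("yarn")
    .getOrCreate()
)
```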
Use Data from Hadoop
Spark can read data directly from the local Hadoop cluster (HDFS).
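For example (the HDFS path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-demo").getOrCreate()

# Spark reads HDFS through the Hadoop client libraries, so data
# written by Hadoop jobs is directly available to Spark.
events = spark.read.parquet("hdfs:///user/me/events.parquet")
events.show()
```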
Flink
- Stateful computations
- Event-driven applications
- Stream and batch analysis
- ETL
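A minimal PyFlink sketch of a stateful streaming computation, a keyed running sum over made-up sensor events:

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A keyed running sum: Flink keeps the per-key aggregate as managed state.
events = env.from_collection([("sensor-1", 1), ("sensor-2", 1), ("sensor-1", 1)])
(
    events
    .key_by(lambda e: e[0])
    .reduce(lambda a, b: (a[0], a[1] + b[1]))
    .print()
)

env.execute("keyed-sum-demo")
```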
References
- [Kretz2019] Kretz A. The Data Engineering Cookbook; 2019.
- [databricksblog2016] Damji JJD. A Tale of Three Apache Spark APIs: RDDs vs DataFrames and Datasets. In: Databricks [Internet]. 14 Jul 2016 [cited 1 Feb 2022]. Available: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html