## Prerequisites

### Programming

Alternatives to Python:

• R
• Matlab

#### Python

Some essential libraries:

• Data
  • numpy
  • scipy
  • pandas
  • dask
• Visualization
  • matplotlib
  • seaborn
  • plotly
• and your machine learning libraries

Use virtual environments:

• virtualenv
• conda

Use notebooks:

• Jupyter

### Computer Science

Computer science theory helps people think faster. It doesn't pose direct limits on what data scientists can do, but it will definitely give them a boost.

### Math

Some basic understanding of these topics is absolutely required. More advanced levels of these topics will also be listed in detail.

• Statistics
• Linear Algebra
• Calculus
• Differential Equations

## Engineering for Data Scientists

I use the book by Andreas Kretz as a checklist [1].

## Data Storage and Retrieval

• Database Basics
• Data Files
• Query Language
• Regular Expression
• Scraping

## Statistics

### Descriptive statistics

Descriptive statistics is crucial for interpreting statistical results.

• Probability theory
  • random variable
  • probability distribution
  • pdf
  • pmf
  • Bayes
• Summary statistics
  • location
  • variation
  • correlation
• Laws
  • Law of large numbers
  • Central limit theorem
  • Law of total variance
  • much more
• Probability Estimation
  • Kernel density estimation
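
A few of the items above can be sketched with the standard library alone: summary statistics for location and variation, a Pearson correlation, and a Gaussian kernel density estimate. The sample data and the hand-picked bandwidth are assumptions for illustration; in practice a library such as scipy would usually be used.

```python
import math
import statistics

# Made-up sample data, used only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

# Location and variation.
mean_x = statistics.mean(x)
median_x = statistics.median(x)
std_x = statistics.stdev(x)  # sample standard deviation (n - 1 denominator)

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

def gaussian_kde(samples, bandwidth):
    """Return a density function built from Gaussian kernels of fixed bandwidth."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(t):
        return norm * sum(
            math.exp(-0.5 * ((t - s) / bandwidth) ** 2) for s in samples
        )

    return density

corr_xy = pearson(x, y)
density = gaussian_kde(x, bandwidth=1.0)  # bandwidth chosen by hand
```

The bandwidth controls how smooth the estimated density is: a small value follows individual samples closely, a large one blurs them together.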

### Inferential statistics

Inferential statistics brings us closer to the ultimate question about causality.

• Parameter Estimation
  • Maximum Likelihood
• Hypothesis Testing
• Inference
  • Bayesian inference
  • Confidence interval
  • Frequentist inference
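
To make maximum likelihood and confidence intervals concrete, here is a minimal sketch for a Bernoulli model. The coin-flip data are made up, and the interval uses the normal (Wald) approximation, which is only one of several possible choices and can be poor for small samples.

```python
import math

# Made-up coin-flip outcomes (1 = success); for illustration only.
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

n = len(data)
k = sum(data)

# Maximum likelihood estimate of the Bernoulli parameter p.
# The log-likelihood k*log(p) + (n - k)*log(1 - p) is maximized at p = k / n.
p_hat = k / n

def log_likelihood(p):
    """Log-likelihood of the observed data for success probability p."""
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

# Approximate 95% confidence interval (Wald / normal approximation).
z = 1.96
se = math.sqrt(p_hat * (1.0 - p_hat) / n)
ci_low, ci_high = p_hat - z * se, p_hat + z * se
```

Evaluating `log_likelihood` at values near `p_hat` is a quick way to check that the closed-form estimate really sits at the maximum.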

## EDA

Data wrangling and exploratory data analysis.

### Understand the Source of the Data

• Know the source
• Understand the data collection procedure
• Understand the limitations of the data

### Dimensionality and Numerosity Reduction

Reduce the dimension of the data:

• PCA
• SparsePCA
• ICA

Numerosity reduction:

• Parametric
  • Using model parameters to represent the data
• Non-parametric
  • Histograms
  • Clustering
  • Resampling
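
The two flavors of numerosity reduction can be sketched as follows: a parametric summary keeps only the parameters of an assumed model, while a histogram keeps only counts per bin. The data are simulated, and the normal model and the bin width of 5 are assumptions made for this example.

```python
import random
import statistics
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible
# Made-up data: 10,000 draws from a normal distribution.
data = [random.gauss(50.0, 10.0) for _ in range(10_000)]

# Parametric: keep only the parameters of an assumed model (here a normal).
params = {"mean": statistics.mean(data), "stdev": statistics.stdev(data)}

# Non-parametric: keep only histogram counts per fixed-width bin.
# Each key is the left edge of a bin of width `bin_width`.
bin_width = 5.0
histogram = Counter(int(value // bin_width) * bin_width for value in data)
```

Both representations replace ten thousand values with a handful of numbers, at the cost of whatever detail the model or the bins cannot express.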

### Data Normalization

Many models are sensitive to the scale of their inputs, so normalization is very important.

Normalization of raw data:
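
Two common choices for raw data are min-max scaling and standardization (z-scores). The sketch below implements both with the standard library; the function names and the sample column are hypothetical, and the handling of constant columns is one possible convention.

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: all z-scores are zero
    return [(v - mu) / sigma for v in values]

# Made-up feature column.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = min_max_scale(heights_cm)
z_scores = standardize(heights_cm)
```

When a model is trained, the minimum, maximum, mean, and standard deviation should come from the training data only and then be reused on new data.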

Normalization in neural networks:

• Batch normalization

Data imputation

### Binning

• Bin the sparse values
• Bin continuous data if necessary
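
Both kinds of binning can be sketched in a few lines. The helper names `bin_continuous` and `bin_sparse`, the cut points, and the sample values are all made up for illustration.

```python
from bisect import bisect_right
from collections import Counter

def bin_continuous(value, edges, labels):
    """Map a number to the label of the interval it falls in.

    `edges` are sorted cut points; `labels` has len(edges) + 1 entries.
    A value equal to an edge goes into the bin to the right of that edge.
    """
    return labels[bisect_right(edges, value)]

def bin_sparse(categories, min_count, other="other"):
    """Replace categories seen fewer than `min_count` times with `other`."""
    counts = Counter(categories)
    return [c if counts[c] >= min_count else other for c in categories]

# Made-up examples; the cut points and labels are arbitrary.
ages = [3, 17, 18, 42, 70]
age_groups = [
    bin_continuous(age, [18, 65], ["minor", "adult", "senior"]) for age in ages
]

cities = ["Berlin", "Berlin", "Paris", "Oslo", "Berlin", "Paris", "Riga"]
grouped = bin_sparse(cities, min_count=2)
```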

## Visualization

### What to show

• Relationship
• Composition
  • Compose to compare
  • Compose to calculate (the total)
  • Compose to form a distribution

### Types of Charts

Know your charts.

Source: Chart Suggestions — A Thought-Starter

### Tools

• Python
  • matplotlib
  • seaborn
  • plotnine
  • plotly
• Dashboarding
  • streamlit
  • plotly dash

## Machine Learning

### Concepts

• Features
• Estimators
• Risk
• Bias and Variance
• Overfitting, Underfitting
• Performance
  • Regression
    • R^2
  • Classification
    • F score
    • Precision
    • Recall
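
The performance measures listed above have short definitions that are easy to write out by hand. This is a minimal sketch with made-up labels and predictions; the F score shown is the F1 variant, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Made-up targets and predictions.
r2 = r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
precision, recall, f1 = precision_recall_f1([1, 0, 1, 1, 0, 1], [1, 1, 1, 0, 0, 1])
```

Precision answers "of the predicted positives, how many were right", recall answers "of the actual positives, how many were found", and F1 is their harmonic mean.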

### Frameworks

#### Supervised

##### Regression

• Linear Regression
• Polynomial Regression
• Generalized Linear Model
• Poisson Regression: for counts

##### Classification

• Logistic Regression
• SVM
• Tree
• Naive Bayes
• kNN
• Gaussian Mixture
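
Of the classifiers above, kNN is the easiest to write from scratch, so it serves here as a minimal sketch of supervised classification. The training points, labels, and the function name `knn_predict` are made up; the sketch uses Euclidean distance and a plain majority vote, and it needs Python 3.8 or later for `math.dist`.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Made-up 2D training set with two well-separated classes.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = ["a", "a", "a", "b", "b", "b"]

pred_low = knn_predict(train_points, train_labels, (1.2, 1.4))
pred_high = knn_predict(train_points, train_labels, (6.4, 6.6))
```

Because kNN compares raw distances, features on very different scales should be normalized before using it.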

## Neural Networks

Lei Ma (0001). 'Curriculum', Datumorphism, 01 April. Available at: https://datumorphism.leima.is/awesome/curriculum/.
