## Prerequisites

### Programming

Alternatives to Python:

• R
• MATLAB

#### Python

Some essential libraries:

Use virtual environments:
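
A virtual environment can be created from Python itself with the standard-library `venv` module (the command-line equivalent is `python -m venv <dir>`). The snippet below is a minimal sketch: the directory name `demo-env` and the helper `create_env` are illustrative assumptions, and `with_pip=False` is chosen only to keep the example fast and offline.

```python
import tempfile
import venv
from pathlib import Path


def create_env(env_dir: Path) -> Path:
    """Create a virtual environment at env_dir and return its config file path."""
    # with_pip=False skips installing pip; set it to True for an everyday environment.
    venv.create(env_dir, with_pip=False)
    return env_dir / "pyvenv.cfg"


# A temporary directory is used here so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    cfg = create_env(Path(tmp) / "demo-env")
    env_created = cfg.exists()
    print("environment created:", env_created)
```

In a real project the environment would live next to the code (for example in a `.venv` folder) and be activated before installing libraries.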

Use notebooks:

• Jupyter

### Computer Science

Computer science theory helps people reason faster. It does not directly limit what data scientists can do, but it gives them a clear boost.

### Math

A basic understanding of these topics is required. More advanced levels of these topics are also listed in detail.

## Engineering for Data Scientists

I use the book by Andreas Kretz as a checklist [1].

## Statistics

### Descriptive statistics

Descriptive statistics is crucial for interpreting statistical results.

• Probability theory
• Summary statistics
  • Location
  • Variation
  • Correlation
• Laws
  • Law of large numbers
  • Central limit theorem
  • Law of total variance
  • Much more
• Probability estimation
  • Kernel density estimation
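
A few of the items above can be sketched with the Python standard library alone. The example computes location and variation summaries, a Pearson correlation, and a small simulation of the law of large numbers. The function names, the sample data, and the fixed seed are assumptions made for illustration.

```python
import random
import statistics


def summarize(xs):
    """Return location and variation summaries for a sample."""
    return {
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "stdev": statistics.stdev(xs),  # sample standard deviation (divides by n - 1)
    }


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Law of large numbers: the mean of fair die rolls approaches 3.5 as n grows.
for n in (10, 1_000, 100_000):
    rolls = [rng.randint(1, 6) for _ in range(n)]
    print(n, statistics.mean(rolls))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]
print(summarize(xs))
print(pearson(xs, ys))
```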

### Inferential statistics

Inferential statistics brings us closer to the ultimate question of causality.

• Parameter Estimation
  • Maximum Likelihood
• Hypothesis Testing
• Inference
  • Bayesian inference
  • Confidence interval
  • Frequentist inference
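
As a small illustration of parameter estimation and confidence intervals, the sketch below estimates the success probability of a Bernoulli process by maximum likelihood and attaches an approximate 95% interval based on the normal approximation (often called a Wald interval). The counts (62 successes in 100 trials) and the function names are made up for the example.

```python
import math


def bernoulli_mle(successes: int, trials: int) -> float:
    """MLE of p: maximizing p^k * (1 - p)^(n - k) over p gives k / n."""
    return successes / trials


def wald_interval(p_hat: float, trials: int, z: float = 1.96):
    """Approximate confidence interval from the normal approximation.

    z = 1.96 corresponds to a two-sided 95% level.
    """
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat - half_width, p_hat + half_width


# Hypothetical data: 62 successes observed in 100 trials.
p_hat = bernoulli_mle(successes=62, trials=100)
low, high = wald_interval(p_hat, trials=100)
print(p_hat, low, high)
```

The normal approximation is only reasonable for fairly large samples and probabilities away from 0 and 1. Other interval constructions exist for small samples.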

## EDA

Data wrangling and exploratory data analysis.

### Understand the Source of the data

• Know the source
• Understand the data collection procedure
• Understand the limitations of the data

### Dimensionality and Numerosity Reduction

Reduce the dimension of the data:

• PCA
• SparsePCA
• ICA
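
To make the idea behind PCA concrete, here is a minimal pure-Python sketch for two-dimensional data. It centers the points, builds the sample covariance matrix, and finds the leading eigenvector by power iteration. This is an illustration under simplifying assumptions (2-D input, a starting vector that is not orthogonal to the leading direction), not how a library implements PCA. The function name and the data are hypothetical.

```python
import math


def first_principal_component(points, iterations=200):
    """Return (direction, variance) of the leading principal component of 2-D data."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Power iteration: repeated multiplication by the covariance matrix
    # converges to the eigenvector with the largest eigenvalue.
    vx, vy = 1.0, 1.0  # assumed not orthogonal to the leading direction
    for _ in range(iterations):
        wx = sxx * vx + sxy * vy
        wy = sxy * vx + syy * vy
        norm = math.hypot(wx, wy)
        vx, vy = wx / norm, wy / norm

    # Rayleigh quotient: the variance of the data along the direction found.
    variance = vx * (sxx * vx + sxy * vy) + vy * (sxy * vx + syy * vy)
    return (vx, vy), variance


# Hypothetical, roughly linear data.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
direction, variance = first_principal_component(points)

# Reduce each 2-D point to one coordinate along the principal direction.
# Centering is omitted here; it would only shift every score by a constant.
scores = [x * direction[0] + y * direction[1] for x, y in points]
print(direction, variance)
```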

Numerosity reduction:

• Parametric
  • Using model parameters to represent the data
• Non-parametric
  • Histograms
  • Clustering
  • Resampling
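
As an example of non-parametric numerosity reduction, a histogram replaces the raw values with a handful of bin edges and counts. The sketch below uses equal-width bins and assumes the data are not all identical. The function name and the sample values are illustrative.

```python
def histogram(values, n_bins):
    """Equal-width histogram: returns (bin_edges, counts)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin instead of overflowing.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 10]
edges, counts = histogram(values, n_bins=3)
print(edges)   # four edges describing three bins
print(counts)  # ten raw values reduced to three counts
```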

### Data Normalization

Many models are sensitive to the scale of their inputs, so normalization is an important preprocessing step.

Normalization of raw data:
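
Two common choices for raw data are min-max scaling and standardization (z-scores). The sketch below shows both with the standard library. The function names are illustrative, and both functions assume the input is not constant.

```python
import statistics


def min_max_scale(xs):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def standardize(xs):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.mean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]


raw = [2, 4, 6]
scaled = min_max_scale(raw)
z_scores = standardize(raw)
print(scaled)
print(z_scores)
```

In practice the scaling parameters (minimum, maximum, mean, standard deviation) are computed on the training data and then reused for new data.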

Normalization in neural networks:

• Batch normalization

Data imputation

### Binning

• Bin the sparse values
• Bin continuous data if necessary
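
Both kinds of binning can be sketched in a few lines. `bin_rare` groups sparse categories into a catch-all label, and `bin_continuous` maps numbers to named intervals using cut points. The function names, the threshold, the label `"other"`, and the age cut points are assumptions chosen for the example.

```python
from bisect import bisect_right
from collections import Counter


def bin_rare(labels, min_count=2, other="other"):
    """Replace categories seen fewer than min_count times with a catch-all label."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other for lab in labels]


def bin_continuous(values, cut_points, names):
    """Map each value to a named interval.

    cut_points must be sorted; names needs len(cut_points) + 1 entries.
    A value equal to a cut point falls into the interval to its right.
    """
    return [names[bisect_right(cut_points, v)] for v in values]


labels = ["a", "a", "b", "c", "a", "b"]
print(bin_rare(labels))

ages = [17, 18, 64, 65]
print(bin_continuous(ages, cut_points=[18, 65], names=["minor", "adult", "senior"]))
```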

## Visualization

### What to show

• Relationship
• Composition
  • Compose to compare
  • Compose to calculate (the total)
  • Compose to form a distribution

### Types of Charts

Other useful references:

## Machine Learning

### Concepts

• Features
• Estimators
• Risk
• Bias and Variance
• Overfitting, Underfitting
• Loss
  • Huber Loss
• Performance
  • Regression
    • R^2
  • Classification
    • F score
    • Precision
    • Recall
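
Several of the quantities listed above have short definitions that are easy to write down. The sketch below implements the Huber loss for a single residual, R^2 for regression, and precision, recall, and the F1 score for binary classification. The function names, the default `delta=1.0`, and the toy labels are assumptions for illustration.

```python
def huber_loss(residual, delta=1.0):
    """Quadratic for small residuals, linear for large ones."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1]
print(precision_recall_f1(y_true, y_pred))
print(huber_loss(0.5), huber_loss(3.0))
print(r_squared([1, 2, 3], [1, 2, 3]))
```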

### Frameworks

#### Supervised

##### Regression

• Linear Regression
• Polynomial Regression
• Generalized Linear Model
  • Poisson Regression: for counts

##### Classification

• Logistic Regression
• SVM
• Tree
• Naive Bayes
• kNN
• Gaussian Mixture
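
To give a feel for the two supervised settings, the sketch below fits a one-variable linear regression by ordinary least squares and classifies points with a plain k-nearest-neighbors vote. Both are deliberately minimal and use only the standard library. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def knn_predict(train, query, k=3):
    """Majority label among the k training points closest to query.

    train is a list of (point, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Regression: the points lie exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)

# Classification: two well-separated clusters.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_predict(train, (0.5, 0.5)), knn_predict(train, (5.5, 5.5)))
```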

## Neural Networks


L Ma (0001). 'Curriculum', Datumorphism, 01 April. Available at: https://datumorphism.leima.is/awesome/curriculum/.