Data wrangling is mostly guided by exploratory data analysis (EDA). In other words, the data cleaning process should be driven by questions from the business and stakeholders, or simply by curiosity.
There are three key components in EDA.
Polish the Questions : Clearly state the purpose of this EDA. Are we asking the right questions? Does the dataset fit in memory, or should we use distributed preprocessing? Is the dataset good enough to solve the problem? Is there anything we already know from the experts? What are the next steps after EDA?
Data Quality and Summary : Go through a checklist and report the results. Most elements on the checklist are dynamic and should be generated from the questions you need to answer.
Communicate : Talk to domain experts, stakeholders, and yourself. Do the results from EDA make sense to the experts? What do the experts want to know from the data?
This is an iterative process. One can start from any point and iterate the cycle until reaching a satisfactory result.
For data quality checks, we have a somewhat standard checklist to go through. It is not a complete checklist: in the process, more questions will pop up, and one should attend to these questions as part of the checklist as well.
The Checklist : Data Quality and Summary
Validate the data quality and generate summary statistics reports.
Rows and Columns :
What does the row mean?
What does the column mean?
How many columns?
Possible values or ranges
List the theoretical limits on the values and validate against the data.
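Validating values against their theoretical limits is easy to automate. A minimal sketch with pandas, using a made-up `age` column whose limits are assumed to be [0, 120]:

```python
import pandas as pd

# Toy dataset; the column name and limits are hypothetical.
df = pd.DataFrame({"age": [25, 40, -3, 130, 61]})

# Flag rows that violate the theoretical limits [0, 120].
out_of_range = df[~df["age"].between(0, 120)]
print(out_of_range)  # the rows with age -3 and 130
```

Reporting the violating rows, rather than silently dropping them, keeps the decision about how to handle them with the analyst.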
Types and Formats :
What does each column consist of?
Types of data
Ordinal, Nominal, Interval, Generative, etc
Is the type of the data correct?
Are the dates loaded as dates?
Are the numbers loaded as numbers?
Are they strings?
Are the financial values correct?
Are they strings or numbers? EU format, US format?
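These type checks can be sketched with pandas. The frame below is a toy example: dates and EU-formatted amounts arrive as strings and must be converted explicitly.

```python
import pandas as pd

# Toy frame; both columns load as object (strings), not dates/numbers.
df = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02"],
    "amount_eu": ["1.234,56", "7,89"],  # EU format: '.' thousands, ',' decimal
})
print(df.dtypes)

# Parse dates as dates.
df["date"] = pd.to_datetime(df["date"])

# Convert EU-formatted numbers: drop thousands separators, swap the decimal comma.
df["amount_eu"] = (
    df["amount_eu"]
    .str.replace(".", "", regex=False)
    .str.replace(",", ".", regex=False)
    .astype(float)
)
```

`pd.read_csv` can also do much of this at load time (`parse_dates=`, `decimal=","`, `thousands="."`), but checking `df.dtypes` afterwards is still worthwhile.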
Missing Values : Are there missing values in each column?
Different types of missing values
Notations of missing values are different in different datasets. Read the documentation of the dataset to find out.
Standard missing values
NaN, NaT, None, NA, NULL, ...
Represented with a specific value
-1, 0, MISSING, ...
Percentage of missing values in each column
e.g., the missingno Python package
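A sketch of the missing-value check with plain pandas (the sentinel values `-1` and `"MISSING"` are hypothetical and would come from the dataset's documentation):

```python
import numpy as np
import pandas as pd

# Toy frame with standard (NaN/None) and sentinel (-1, "MISSING") missing values.
df = pd.DataFrame({
    "height": [170.0, np.nan, 165.0, 180.0],
    "city": ["Berlin", None, "Paris", "MISSING"],
    "score": [-1, 10, 12, -1],  # -1 encodes "missing" in this toy dataset
})

# Normalise the sentinel values to NaN first, then count per column.
df = df.replace({"city": {"MISSING": np.nan}, "score": {-1: np.nan}})
missing_pct = df.isna().mean() * 100
print(missing_pct)
```

The `missingno` package builds matrix and bar visualisations on top of exactly this kind of per-column count.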
Duplications : Are there duplications of rows/columns?
Validate by yourself
Do not trust the metadata and documentation of the dataset. Duplications of fields may occur when the documentation says they are unique.
What is the generation process?
Is it a histogram analysis of another row?
Is it a linear combination of other rows?
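Both row and column duplications can be checked directly, without trusting the documentation. A sketch with a toy frame where one column silently duplicates another:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 2, 3],
    "b": [10, 20, 20, 30],
    "c": [10, 20, 20, 30],  # exact duplicate of column b
})

# Duplicated rows: validate yourself, even if the docs claim uniqueness.
dup_rows = df.duplicated().sum()

# Duplicated columns: compare columns via the transposed frame.
dup_cols = df.T.duplicated()
print(dup_rows, dup_cols[dup_cols].index.tolist())
```

For derived (rather than exact-copy) columns, a correlation matrix is a useful follow-up check.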
Visualize the distributions of the values
Know all the values
Value count bar plot
For discrete data, list all possible values and their counts
Histogram and KDE
For continuous data, use histograms or KDE plots.
A boxplot is easier to understand for business people
Gut feeling of where the data points are located
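The two visualisation strategies above reduce to value counts for discrete data and binning for continuous data. A sketch with randomly generated toy columns (plotting libraries then draw the bar plot / histogram from these numbers):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=100),  # discrete
    "height": rng.normal(170, 10, size=100),                  # continuous
})

# Discrete: list all values and their counts (feeds a bar plot).
counts = df["color"].value_counts()
print(counts)

# Continuous: bin into a histogram; KDE smooths the same information.
hist, edges = np.histogram(df["height"], bins=10)
print(hist, edges)
```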
Dispersion of the target value
Is the dispersion of the target value small enough for the algorithm to perform a good prediction?
Use summary statistics to find out the moments.
Mean, median, quartiles, mode...
range, variance, standard deviation, IQR
Kurtosis
Correlations, Similarities :
Pearson, Kendall Tau Correlation
Calculate the distance between features or rows to understand the relations between them; Euclidean distance, Mahalanobis distance, Minkowski distance, Jaccard distance, ...
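The dispersion, correlation, and distance checks above can be sketched together with pandas and numpy on synthetic data (here `y` is constructed to depend on `x`, so a strong correlation is expected):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = rng.normal(0, 1, 200)
df = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(0, 0.1, 200)})

# Location and dispersion: mean, median, quartiles, std, plus IQR and higher moments.
summary = df["x"].describe()
iqr = summary["75%"] - summary["25%"]
skew, kurt = df["x"].skew(), df["x"].kurt()

# Correlations: Pearson by default; 'kendall' and 'spearman' are also supported.
pearson = df.corr(method="pearson")
kendall = df.corr(method="kendall")

# Distance between two rows (Euclidean here); scipy.spatial.distance
# offers Mahalanobis, Minkowski, Jaccard, and more.
d = np.linalg.norm(df.iloc[0] - df.iloc[1])
print(iqr, pearson.loc["x", "y"], d)
```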
Size : How much space will the data take on our storage device?
To estimate the hardware requirements when deploying the model
Storage on Hard Drive in Different Formats
How much space will the dataset take in different formats?
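A sketch of the size estimate, comparing the in-memory footprint with two on-disk formats (serialised to in-memory buffers here so nothing is written to disk):

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(100_000), "y": np.arange(100_000) * 0.5})

# In-memory footprint.
mem_bytes = df.memory_usage(deep=True).sum()

# On-disk size depends heavily on the format: text CSV vs a binary format.
csv_bytes = len(df.to_csv(index=False).encode())
buf = io.BytesIO()
df.to_pickle(buf)
pickle_bytes = buf.getbuffer().nbytes
print(mem_bytes, csv_bytes, pickle_bytes)
```

Columnar formats such as Parquet (via `df.to_parquet`, which needs pyarrow or fastparquet installed) typically compress far better than either of these.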
Combining Data Files : One dataset may come in several files; combine them carefully.
The files should be concatenated with caution.
Check if there is an overlap between the files.
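A sketch of the cautious-concatenation step, with two toy frames standing in for files of one dataset and a hypothetical `id` key used to detect the overlap:

```python
import pandas as pd

# Two parts of one dataset (toy frames standing in for CSV reads).
part1 = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})
part2 = pd.DataFrame({"id": [3, 4, 5], "value": [30, 40, 50]})

# Check for overlap between the files before concatenating.
overlap = set(part1["id"]) & set(part2["id"])
print("overlapping ids:", overlap)

# Concatenate, then drop the overlapping records once.
combined = (
    pd.concat([part1, part2], ignore_index=True)
      .drop_duplicates(subset="id")
)
```

If overlapping ids carry *conflicting* values rather than duplicates, that is a data quality issue to resolve with the data owner, not something to drop silently.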
Exploratory Data Analysis
EDA is one of the very first steps of my data science projects.
Objectives of EDA : The questions to be answered serve as guides in EDA
Polish the Questions
Check if the questions to be answered are valid or well stated; if not, modify them or come up with new ones
Validate Data I/O Methods
Check and validate the methods to load and save the datasets
Is the Dataset Good Enough for the Problem?
Are the features/variables required for the project included?
If not, what other data should be included?
What is the General Quality of the Dataset?
Can one answer the questions semi-quantitatively using the data?
Is the dispersion of the target value small enough?
Retrieve Domain Knowledge and Anomalies
Determine the ranges and outliers of the dataset; talk to domain experts and validate the findings with them.
Propose the Next Steps
Communicate with Domain Experts :
What are the features?
Pay attention to the units
Do the results from EDA make sense to the experts?
What do the experts want to know from the data?
Workflow :
Polish the Questions
Will there be any new restrictions on the solutions?
Related to the dataset
Data Quality and Summary
Data Quality and Summary Statistics
Report Data Quality and Summary
Does the result make sense?
This is a crucial step in EDA. Use techniques such as Fermi estimates to evaluate the summary.
Consistency between the summary and expert expectations
Feature Engineering
Feature Engineering is one of the fundamental activities of data science. This is a practical outline of feature engineering in data science.
Prior Knowledge Simplifies Your Model : The more relevant prior knowledge you have, the simpler the model can be.
Prior knowledge, such as domain knowledge, can be used to
Define the problem more clearly
Filter out unnecessary features
Simplify feature engineering
e.g., combining power and time into total energy used
Locate anomalies
Encoding : Encode the features into numerical values for the model to process.
Categorical Data Encoding
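A minimal sketch of the two common cases, on a made-up `size` column: one-hot encoding for nominal data and an explicit ordered mapping for ordinal data.

```python
import pandas as pd

df = pd.DataFrame({"size": ["S", "M", "L", "M"]})

# Nominal data: one-hot encode with get_dummies.
onehot = pd.get_dummies(df["size"], prefix="size")

# Ordinal data: map categories to ordered integers instead.
ordinal = df["size"].map({"S": 0, "M": 1, "L": 2})
print(onehot.columns.tolist(), ordinal.tolist())
```

scikit-learn's `OneHotEncoder` and `OrdinalEncoder` do the same inside a pipeline.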
Disintegration
Feature Crossing : Introduce higher-order features to make the data more linearly separable.
Create $x^2$, $x^3$ from $x$
Scaling : Scale the data to different ranges.
Rescale Based on Location and Spread
Scale data into a specific range
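Both scaling strategies above are one-liners in pandas; a sketch on a toy series:

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 6.0, 8.0])

# Rescale based on location and spread (standardisation): zero mean, unit std.
standardised = (s - s.mean()) / s.std()

# Scale into a specific range, here [0, 1] (min-max scaling).
minmax = (s - s.min()) / (s.max() - s.min())
print(standardised.tolist(), minmax.tolist())
```

In a modelling pipeline, scikit-learn's `StandardScaler` / `MinMaxScaler` should be fitted on the training split only, then applied to the test split.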
Combining Features : Combine several features into one so that the new feature bears more relevant information.
Sparse Categorical Data : Some categorical values have very low counts. Combining these low-count values into one category might be helpful.
Normalization : For example, if a feature has a very high variance and we are working with a clustering method, it helps to normalize the data, e.g., with a log transform.
Using Statistical Results as Features :
Use the Average of Several Features
Extract Values from Texts :
TF-IDF
Location, Variability, Skewness and Kurtosis :
Fix the Skewness
Box-Cox transform
Feature Selection :
Remove Redundant Features
What are Redundant Features
Features that are highly correlated with, or duplicates of, other features
Only Include Useful Features
Select features using domain knowledge or feature selection algorithms.
Remove Highly Correlated Features
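A common sketch of removing highly correlated features: scan the upper triangle of the absolute correlation matrix and drop one column of each pair above a threshold. The frame below is synthetic, with `x_dup` built as a linear transform of `x` (so their correlation is 1) and the 0.95 threshold chosen arbitrarily.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
x = rng.normal(size=500)
df = pd.DataFrame({
    "x": x,
    "x_dup": x * 3 + 1,             # linear transform of x: correlation 1
    "noise": rng.normal(size=500),  # independent feature
})

# Drop one of each pair of features whose |correlation| exceeds the threshold.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
reduced = df.drop(columns=to_drop)
print("dropped:", to_drop)
```

Using only the upper triangle ensures exactly one member of each correlated pair is dropped, never both.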