#machine learning #workflow #EDA

The scope of exploratory data analysis is not universally defined. Some of the content discussed here may go beyond what is usually called EDA, but the stages of the modeling process are never fully decoupled anyway.

Data wrangling is mostly guided by exploratory data analysis (EDA). In other words, the data cleaning process should be guided mostly by questions from the business and stakeholders, or by curiosity.

There are three key components in EDA.

  • Clearly state the purpose of this EDA.
    • Are we asking the right question?
    • Does the dataset fit in memory or shall I use distributed preprocessing?
    • Is the dataset good enough to solve the problem?
    • Is there anything we already know from the experts?
    • What are the next steps after EDA?
  • Go through a checklist and report the results. Most elements on the checklist are dynamic and should be generated by the questions you need to answer.
    • Polish the Questions
    • Data Quality and Summary
  • Communicate with domain experts or stakeholders and yourself.
    • Do the results from EDA make sense to the experts?
    • What do the experts want to know from the data?

This is an iterative process. One can start from any point and iterate through the cycle until the results are satisfactory.

For data quality checks, we have a somewhat standard checklist to go through. It is not a complete checklist. During this process, more questions will pop up, and one should also attend to them as part of the checklist.

The Checklist

Data Quality and Summary

Validate the data quality and generate summary statistics reports.

  1. Rows and Columns : The questions to be answered serve as guides in EDA
    1. Rows
      1. Descriptions

        What does the row mean?

      2. Count
    2. Columns
      1. Descriptions

        What does the column mean?

      2. Count

        How many columns?

      3. Possible values or ranges

        List the theoretical limits on the values and validate against the data.

  2. Types and Formats :
    1. Data Types

      What does each column consist of?

      1. Types of data

        Nominal, Ordinal, Interval, Ratio, etc.

      2. Is the type of the data correct?
    2. Data Formats
      1. Are the dates loaded as dates?
      2. Are the numbers loaded as numbers?

        Are they strings?

      3. Are the financial values correct?

        Are they strings or numbers? EU format, US format?

  3. Missing Values : Are there missing values in each column?
    1. Different types of missing values

      Notations of missing values are different in different datasets. Read the documentation of the dataset to find out.

      1. Standard missing values

        NaN, NaT, None, NA, null, ...

      2. Represented with a specific value

        -1, 0, MISSING, ...

    2. Percentage of missing values in each column
    3. Visualizations

      e.g., the missingno Python package

  4. Duplications : Are there duplications of rows/columns?
    1. Validate by yourself

      Do not trust the metadata and documentation of the dataset. Duplications of fields may occur when the documentation says they are unique.

  5. Distributions :
    1. What is the generation process?
      1. Is it a histogram analysis of another row?
      2. Is it a linear combination of other rows?
    2. Visualize the distributions of the values

      Know all the values

      1. Value count bar plot

        For discrete data, list all possible values and their counts.

      2. Histogram and KDE

        For continuous data, use histograms or KDE.

      3. Boxplot

        Boxplots are easier for business people to understand.

      4. Scatter plot

        Get a gut feeling for where the data points are located.

      5. Contour plot
    3. Dispersion of the target value

      Is the dispersion of the target value small enough for the algorithm to perform a good prediction?

    4. Numerical Summarization

      Use summary statistics to find out the moments.

      1. Locations

        Mean, median, quartiles, mode...

      2. Spreads

        range, variance, standard deviation, IQR

      3. Skewness


      4. Kurtosis
  6. Correlations, Similarities :
    1. Pairplot
    2. Correlations

      Pearson correlation, Kendall's tau correlation

    3. Distances

      Calculate the distance between features or rows to understand the relations between them, e.g., Euclidean distance, Mahalanobis distance, Minkowski distance, Jaccard distance, ...

  7. Size : How much space will the data take on our storage device?
    1. Memory usage

      To estimate the hardware requirements when deploying the model

    2. Storage on Hard Drive in Different Formats

      How much space will the dataset take in different formats?

  8. Combining Data Files : One dataset may come in several files; combine them carefully.
    1. Concat

      The files should be concatenated with caution.

      1. Validate overlap

        Check if there is an overlap between the files.
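
Several items on this checklist (row and column counts, range validation, missing values including sentinel values, duplications, and memory usage) can be sketched in a few lines of pandas. This is only a minimal illustration, not a complete report: the toy data, the column names, the helper `quality_summary`, and the choice of `-1` as a sentinel for "missing" are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy dataset. Column names and the -1 sentinel are assumptions.
df = pd.DataFrame(
    {
        "id": [1, 2, 2, 4, 5],
        "price": [10.5, 3.0, 3.0, -1.0, 7.25],
        "city": ["Berlin", "Rome", "Rome", None, "Paris"],
    }
)


def quality_summary(frame, sentinels=(-1,)):
    """Return the dtype and the percentage of missing values per column."""
    # A value counts as missing if it is a standard missing value (NaN, None,
    # NaT) or one of the dataset-specific sentinel values.
    missing = frame.isna() | frame.isin(list(sentinels))
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "missing_pct": missing.mean() * 100,
        }
    )


summary = quality_summary(df)

# Rows and columns: how many of each?
n_rows, n_cols = df.shape

# Duplications: validate by yourself, even if "id" is documented as unique.
n_duplicate_rows = int(df.duplicated().sum())
id_is_unique = bool(df["id"].is_unique)

# Possible values or ranges: prices should not be negative once the
# sentinel value is excluded.
valid_price = df.loc[df["price"] != -1, "price"]
n_negative_prices = int((valid_price < 0).sum())

# Size: memory usage in bytes, including the contents of object columns.
memory_bytes = int(df.memory_usage(deep=True).sum())

print(summary)
print(n_rows, n_cols, n_duplicate_rows, id_is_unique, memory_bytes)
```

The sentinel values differ between datasets, so they are passed in as a parameter rather than hard-coded; the documentation of the dataset is the place to find them.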

Exploratory Data Analysis

EDA is one of the very first steps of my data science projects.

  1. Objectives of EDA : The questions to be answered serve as guides in EDA
    1. Polish the Questions

      Check if the questions to be answered are valid and well stated; if not, modify them or come up with new ones.

    2. Validate Data I/O Methods

      Check and validate the methods to load and save the datasets

    3. Is the Dataset Good Enough for the Problem?
      1. Are the features/variables required for the project included?

        If not, what other data should be included?

      2. What is the General Quality of the Dataset?
      3. Can one answer the questions semi-quantitatively using the data?

        Is the dispersion of the target value small enough?

    4. Retrieve Domain Knowledge and Anomalies

      Determine the ranges and outliers of the dataset; talk to domain experts and validate the findings with them.

    5. Propose the Next Steps
  2. Communicate with Domain Experts :
    1. What are the features?

      Pay attention to the units

    2. Do the results from EDA make sense to the experts?
    3. What do the experts want to know from the data?
  3. Workflow :
    1. Polish the Questions
      1. Will there be any new restrictions on the solutions?

        For example, restrictions related to the dataset.

    2. Data Quality and Summary
      1. Data Quality and Summary Statistics
      2. Report Data Quality and Summary
        1. Does the result make sense?

          This is a crucial step in EDA. Use techniques such as Fermi estimates to evaluate the summary.

        2. Consistencies between the summary and expert expectations

Feature Engineering

Feature engineering is one of the fundamental activities of data science. This is a practical outline of feature engineering in data science.

  1. Prior Knowledge Simplifies Your Model : The more relevant prior knowledge you have, the simpler the model can be.
    1. Applications

      Prior knowledge, such as domain knowledge, can be used to

      1. Define the problem more clearly
      2. Filter out unnecessary features
      3. Simplify feature engineering

        e.g., combining power and time into total energy used

      4. Locate anomalies
  2. Encoding : Encode the features into numerical values for the model to process.
    1. Methods
      1. Categorical Data Encoding
        1. Binary Encoding
        2. One-hot Encoding
        3. Numerical Encoding
      2. Datetime
        1. Disintegration
  3. Feature Crossing : Introduce higher-order features to make the data more linearly separable.
    1. Methods
      1. Create $x^2$, $x^3$ from $x$
  4. Scaling : Scale the data to different ranges.
    1. Methods
      1. Rescale Based on Location and Spread
      2. MinMax

        Scale data into a specific range

  5. Combining Features : Combine several features into one so that the new feature bears more relevant information.
  6. Sparse Categorical Data : Some categorical values occur only a few times. Combining these low-count values into one category might be helpful.
  7. Normalization : For example, if a feature has a very high variance and we are working on a clustering method, it is easier if we normalize the data, e.g., with a log transform.
  8. Using Statistical Results as Features :
    1. Methods
      1. Use the Average of Several Features
  9. Extract Values from Texts :
    1. Methods
      1. TF-IDF
  10. Location, Variability, Skewness and Kurtosis :
    1. Methods
      1. Fix the Skewness
        1. Box-Cox transform
  11. Feature Selection :
    1. Remove Redundant Features
      1. What are Redundant Features
        1. Noisy features
        2. Features that are highly correlated with, or duplicates of, other features
      2. Methods
        1. Only Include Useful Features

          Feature selection using domain knowledge, or feature selection algorithms.

        2. Remove Highly Correlated Features
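
A few of the methods in this outline can be sketched with pandas and NumPy. The example below is only an illustration under stated assumptions: the toy data, the column names, the minimum count of 2 for a "rare" category, the 0.95 correlation threshold, and the helper `drop_highly_correlated` are all made up for the example and are not prescribed by the outline.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; all column names are made up for illustration.
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2021-01-04 08:00", "2021-01-05 13:30",
             "2021-01-09 22:15", "2021-01-10 06:45"]
        ),
        "power_kw": [1.0, 2.0, 4.0, 3.0],
        "hours": [2.0, 1.0, 0.5, 1.0],
        "color": ["red", "red", "blue", "teal"],
    }
)

features = pd.DataFrame(index=df.index)

# Prior knowledge: combine power and time into total energy used.
features["energy_kwh"] = df["power_kw"] * df["hours"]

# Datetime disintegration: split the timestamp into components.
features["hour"] = df["timestamp"].dt.hour
features["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday is 0

# Feature crossing: add a higher-order term.
features["power_sq"] = df["power_kw"] ** 2

# MinMax scaling into the range [0, 1].
power = df["power_kw"]
features["power_minmax"] = (power - power.min()) / (power.max() - power.min())

# Sparse categorical data: merge values seen fewer than 2 times into "other".
counts = df["color"].value_counts()
rare_values = counts[counts < 2].index
color = df["color"].where(~df["color"].isin(rare_values), "other")

# One-hot encoding of the cleaned-up categorical column.
features = features.join(pd.get_dummies(color, prefix="color", dtype=int))


def drop_highly_correlated(frame, threshold=0.95):
    """Drop the later column of each pair whose |Pearson r| exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop)


print(features)
```

Dropping the later column of a correlated pair is an arbitrary tie-break; domain knowledge is usually a better guide to which of the two features to keep.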


Lei Ma (0001). 'Workflows', Datumorphism, 01 April. Available at:
