The Art of Data Science

#data science

A nice and elegant book on data science

Some Key Ideas

The Epicycle

  1. Question and question refining
  2. Exploratory data analysis (EDA)
  3. Modeling
  4. Interpretation
  5. Communication

The example of asthma in the US is a simple, clear illustration of how these activities fit together.

Types of Questions

Leek, J. T., & Peng, R. D. (2015). What is the question? Science, 347(6228), 1314–1315.

  1. Descriptive
  2. Exploratory
  3. Inferential
  4. Predictive
  5. Causal
  6. Mechanistic

A Good Question

  1. of interest to your audience
  2. not already answered in the literature
  3. plausible within your knowledge framework: any correlation you look for should already be explainable with domain knowledge.
  4. answerable: the question should be answerable with current technology, datasets, or theory.
  5. specific: quantify the measures, population, and sampling as much as possible


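  1. recall bias: concerns the responses in the sample; respondents may remember or report past events inaccurately
  2. selection bias: concerns the sampling; the sample may not be representative of the target population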
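
To make selection bias concrete, here is a minimal sketch, not taken from the book. It assumes a made-up population in which 10% of people have some condition, and a hypothetical recruitment process in which people with the condition are five times more likely to end up in the sample. All names and numbers are illustrative assumptions.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 1 = has the condition, 0 = does not (10% prevalence).
population = [1] * 100 + [0] * 900
true_prevalence = mean(population)

# Simple random sample: every person is equally likely to be chosen.
random_sample = rng.sample(population, 200)
random_estimate = mean(random_sample)

# Biased sample (assumed mechanism): people with the condition are included
# with probability 0.5, everyone else with probability 0.1.
biased_sample = [
    person for person in population
    if rng.random() < (0.5 if person == 1 else 0.1)
]
biased_estimate = mean(biased_sample)

print(f"true prevalence:        {true_prevalence:.3f}")
print(f"random-sample estimate: {random_estimate:.3f}")
print(f"biased-sample estimate: {biased_estimate:.3f}")
```

The biased estimate overstates the prevalence because of who got into the sample, not because of how they answered; that is the difference from recall bias.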

When you are asked to do something

  1. communicate with others to make sure that you can agree on a question to be answered
  2. make sure the question is a good question
  3. determine what type of question it is


Some random thoughts.

We need a knowledge database for the company

Going through the data analysis process, I found that it is often important to connect the work to existing knowledge. For example, checking existing knowledge is the key step in making sure the question has not already been answered.

In academic research, this is usually done by searching the literature. When the objective or question concerns internal data or an internal product, there is generally no public database to search.

So we need a database of data analysis questions and objectives. As the business develops, we accumulate many analyses and questions. When a question is related to other questions, it is generally a good idea to link them.
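
One way such a question database could look is sketched below. This is only an in-memory illustration under my own assumptions: the class names, the fields (`qid`, `qtype`, `answered`, `related`), and the sample questions are all hypothetical, and a real system would need persistent storage and better search.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    """One analysis question; every field name here is hypothetical."""
    qid: str
    text: str
    qtype: str                 # e.g. "descriptive", "causal"
    answered: bool = False
    related: set = field(default_factory=set)  # ids of linked questions


class QuestionDB:
    """Minimal in-memory store of questions and the links between them."""

    def __init__(self):
        self._questions = {}

    def add(self, question):
        self._questions[question.qid] = question

    def link(self, qid_a, qid_b):
        # Links are symmetric: each question records the other's id.
        self._questions[qid_a].related.add(qid_b)
        self._questions[qid_b].related.add(qid_a)

    def search(self, keyword):
        # Naive case-insensitive substring match on the question text.
        kw = keyword.lower()
        return [q for q in self._questions.values() if kw in q.text.lower()]

    def related_to(self, qid):
        return [self._questions[r] for r in sorted(self._questions[qid].related)]


db = QuestionDB()
db.add(Question("q1", "What share of users churn within 30 days?",
                "descriptive", answered=True))
db.add(Question("q2", "Does the onboarding email reduce 30-day churn?", "causal"))
db.link("q1", "q2")

# Before starting a new analysis, check whether something similar exists.
for q in db.search("churn"):
    print(q.qid, q.qtype, "answered" if q.answered else "open")
```

Searching before starting an analysis plays the role that a literature search plays in academic research, and the links record which questions build on each other.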
