Datumorphismhttps://datumorphism.leima.is/Recent content on DatumorphismHugo -- gohugo.ioen-USWed, 28 Jul 2021 00:00:00 +0000MaxEnt Modelhttps://datumorphism.leima.is/wiki/machine-learning/energy-based-model/maxent-energy-based-model/Mon, 31 May 2021 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/energy-based-model/maxent-energy-based-model/The Maximum Entropy model, aka MaxEnt model, is a fascinating generative model as it is based on a very intuitive idea from statistical physics - the Principle of Maximum Entropy.
The Idea The essence of the MaxEnt model is that the underlying probability distribution $p(x)$ of the random variables $x$ should
give the whole system the largest uncertainty, while producing reasonable observables. Uncertainty The uncertainty of the whole system is described by the Shannon entropy based on the probability distributions $p(x)$,Data Engineering for Data Scientists: Checklisthttps://datumorphism.leima.is/wiki/data-engeering-for-data-scientist/checklist/Wed, 05 May 2021 00:00:00 +0000https://datumorphism.leima.is/wiki/data-engeering-for-data-scientist/checklist/It is always good for a data scientist to understand more about data engineering, especially the blueprint of a fully productionized data platform.
There are several things to get into:
Connection to Data Sources Connect to DB Connect to Streaming Data Message Queues Connect to Website Scraping API Other Data Services Data Storage Data Storage Storing big data Data Lake Message Queues Data Processing Data Processing Processing Data is essential.Gibbs Samplinghttps://datumorphism.leima.is/wiki/monte-carlo/gibbs-sampling/Fri, 01 Jan 2021 00:00:00 +0000https://datumorphism.leima.is/wiki/monte-carlo/gibbs-sampling/Principles of Designhttps://datumorphism.leima.is/wiki/data-visualization/design/Fri, 20 Nov 2020 00:00:00 +0000https://datumorphism.leima.is/wiki/data-visualization/design/There are many principles of designing a visual representation of data. However, before we understand how data is represented visually, it would benefit us a lot if we understand the basic principles of designing on 2D surface.
Robin’s CRAP Robin Williams proposed the four elements of design:
Contrast Repetition Alignment Proximity Contrast Use some contrast to distinguish the elements of different contents.
Repetition Repeat the design of similar elements on the same page and across pages to make sure the readers learn the meaning of the design quickly.Model Selectionhttps://datumorphism.leima.is/wiki/model-selection/model-selection/Sun, 08 Nov 2020 00:00:00 +0000https://datumorphism.leima.is/wiki/model-selection/model-selection/Suppose we have a generating process that generates some numbers based on a distribution. Based on a data sample, we could reconstruct some sort of theoretical models to represent the actual generating process.
Which is a Good Model? (1) The black curve represents the generating process. The red rectangle is a very simple model that captures some major samples. The blue step-wise model captures more sample data but with more parameters.Receiver Operating Characteristics: ROChttps://datumorphism.leima.is/wiki/machine-learning/performance/roc/Wed, 13 May 2020 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/performance/roc/ROC space is the two-dimensional space spanned by True Positive Rate and False Positive Rate.
ROC Space. The color boxes are indicating the confusion matrices. Green is the fraction of true positive. Orange is the fraction of false positive. Refer to Confusion Matrix for more details.
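As a minimal sketch of how one point in ROC space is obtained, the snippet below turns the four counts of a confusion matrix into a (FPR, TPR) pair using the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The function name `roc_point` and the example counts are made up for illustration.

```python
def roc_point(tp, fp, tn, fn):
    """Return the (FPR, TPR) coordinates of one classifier in ROC space."""
    tpr = tp / (tp + fn)  # fraction of actual positives that were found
    fpr = fp / (fp + tn)  # fraction of actual negatives that were flagged
    return fpr, tpr


# Hypothetical counts: 50 actual positives and 50 actual negatives.
point = roc_point(tp=40, fp=10, tn=40, fn=10)
print(point)  # (0.2, 0.8)
```

Sweeping the decision threshold of a classifier and collecting one such point per threshold traces out the ROC curve.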
AUC: Area under Curve TPR = TP Rate FPR = FP Rate The ROC curve is defined by the relation $f(TPR, FPR)$.Tree-based Learninghttps://datumorphism.leima.is/wiki/machine-learning/tree-based/overview/Wed, 25 Dec 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/tree-based/overview/Decision tree is an easy-to-interpret method in supervised learning. Though simple, it is being used in some widely used algorithms such as random forest method.Embeddinghttps://datumorphism.leima.is/wiki/machine-learning/embedding/overview/Sun, 13 Oct 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/embedding/overview/Factorizationhttps://datumorphism.leima.is/wiki/machine-learning/factorization/overview/Mon, 17 Jun 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/factorization/overview/Feature Engineeringhttps://datumorphism.leima.is/wiki/machine-learning/feature-engineering/overview/Mon, 17 Jun 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/feature-engineering/overview/Naive Bayeshttps://datumorphism.leima.is/wiki/machine-learning/bayesian/naive-bayes/Mon, 17 Jun 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/bayesian/naive-bayes/Naive Bayesian is a classifier using Bayes' Theorem Bayes' Theorem Bayes’ Theorem is stated as $$ P(A\mid B) = \frac{P(B \mid A) P(A)}{P(B)} $$ $P(A\mid B)$: likelihood of A given B $P(A)$: marginal probability of A There is a nice tree diagram for the Bayes’ theorem on Wikipedia. Tree diagram of Bayes’ theorem with ‘naive’ assumptions.
Problems with Conditional Probability Calculation By definition, the conditional probability of event $\mathbf Y$ given features $\mathbf X$ is $$ \begin{equation} P(\mathbf Y\mid \mathbf X) = \frac{P(\mathbf Y, \mathbf X)}{ P(\mathbf X) }, \label{def-cp-y-given-x} \end{equation} $$Confusion Matrix (Contingency Table)https://datumorphism.leima.is/wiki/machine-learning/basics/confusion-matrix/Fri, 31 May 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/basics/confusion-matrix/Confusion Matrix It is much easier to understand the confusion matrix if we use a binary classification problem as an example. For example, we have a bunch of cat photos and the user labeled “cute or not” data. Now we are using the labeled data to train a cute-or-not binary classifier.
Then we apply the classifier on the test dataset and we would only find four different kinds of results.Normal Distributionhttps://datumorphism.leima.is/wiki/distributions/normal-distribution/Tue, 22 Jan 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/distributions/normal-distribution/Visualization Math The formula of normal distribution is
$$ \begin{equation} e^{ - \left( (x - \mu) / \sqrt{2} \sigma \right)^2 } \end{equation} $$
where $\mu$ controls the “center” or “peak” of the distribution and $\sigma$ tells us how “wide” or “disperse” the distribution is. Here we leave out the normalization factor $1/(\sigma\sqrt{2\pi})$.
To understand the distribution, we take some limits.
$x = \mu$ First of all, when $x = \mu$ we have
$$ e^0 = 1. $$
Notice that the squared term cannot be negative, so the argument of the exponential is never positive and the function takes its maximum at $x = \mu$.Statistical Hypothesis Testinghttps://datumorphism.leima.is/wiki/statistical-hypothesis-testing/hypothesis-testing/Sun, 20 Jan 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/statistical-hypothesis-testing/hypothesis-testing/When we have a sample of the population, we immediately calculate the mean using the sample, say the result is $\mu_0$. Of course, the population mean $\mu_p$ is unknown and probably can never be known.
This specific sample mean $\mu_0$ is nothing more than an educated guess. Then again, how do we know if this specific sample mean $\mu_0$ is a faithful representation of the population mean? In fact, this question is not limited to the mean.Why Estimation Theoryhttps://datumorphism.leima.is/wiki/statistical-estimation/why-estimation-theory/Sun, 20 Jan 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/statistical-estimation/why-estimation-theory/In statistics, we work with samples. For example, the sample mean is easily calculated. However, it is the population mean that is more valuable.
Suppose we have one sample $S_i$, which is used to calculate the mean of the sample $\mu_i$. We have two key problems to solve at this moment.
Can we use this sample mean $\mu_i$ to represent the population mean $\mu_p$? How good is our estimation? To answer these questions, we need to work out the properties of the samples themselves and work out a theory to instruct us to infer population statistics from sample statistics.What is Statisticshttps://datumorphism.leima.is/wiki/statistics/what-is-statistics/Fri, 18 Jan 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/statistics/what-is-statistics/A Case Study We have a problem.
In our lab, we found a huge amount of similar robots on a planet (physical population). To know more about the weight of these robots (statistical population), we first need to choose some of them (physical sample), then obtain the weight of them (statistical sample).
To describe the data, we could calculate the mean of the weight. We found that the mean weight is 93kg (descriptive statistics).Association Ruleshttps://datumorphism.leima.is/wiki/pattern-mining/association-rules/Sun, 06 Jan 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/pattern-mining/association-rules/Association rule is a method for pattern mining. In this article, we perform an association rule analysis of some demo data.
The Problem Defined Suppose we own a store called KIOSK. Here at KIOSK, we sell 4 different things.
Milk Croissant Coffee Fries We need to know what items are associated with each other when the customers are buying.
We have collected the following data. Beware that this small amount of data might not be enough for a real-world problem.Some Concepts about Data Warehousehttps://datumorphism.leima.is/wiki/data-warehouse/data-warehouse-concepts/Fri, 23 Nov 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/data-warehouse/data-warehouse-concepts/The Three Key Ideas about Warehouse The purpose of the data warehouse should be clear. In most cases, it is for the analysis of data, not for data production.1
Subject-oriented: since data warehouses are for decision-makers, arranging the data by subject makes it much easier to access. Integrated: many sources are integrated for easy analysis. Time-variant: observation time should be recorded since the data is also used to analyze the time evolution. Nonvolatile: simply for analysis. OLTP and OLAP OLTP: online transaction processing OLAP: online analytical processing

| | OLTP | OLAP |
|---|---|---|
| user | customer | data scientist, managers |
| purpose | production | analysis |
| content | everything | cleaner data |
| database | entity relation model, application-oriented | star/snowflake model, subject-oriented |
| history | usually no need to record the history | history is crucial |
| query | short and frequent read and write | read-only but complicated analysis |

Scope of Data Warehouse Enterprise warehouse: targeting the whole organization Data mart: for a specific group of people Virtual warehouse: views not tables Fact and Dimension Fact is the value of something specified by the dimension.Artificial Neural Networkshttps://datumorphism.leima.is/wiki/machine-learning/neural-networks/artificial-neural-networks/Mon, 19 Nov 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/neural-networks/artificial-neural-networks/Artificial neural networks work pretty well for solving some differential equations.
Universal Approximators Maxwell Stinchcombe and Halbert White proved that there are no theoretical constraints preventing feedforward networks from approximating any measurable function. In principle, one can use feedforward networks to approximate measurable functions to any accuracy.
However, the convergence slows down if we have a lot of hidden units. There is a balance between accuracy and convergence rate. More hidden units lead to slow convergence but more accuracy.Ordinary Differential Equationshttps://datumorphism.leima.is/wiki/dynamical-system/ordinary-differential-method/Mon, 19 Nov 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/dynamical-system/ordinary-differential-method/For a first order differentiation $\frac{\partial f}{\partial t}$, we might have many finite differencing methods.
Euler Method For a first-order ODE,
$$ \frac{dy}{dx} = f(x, y), $$
we can discretize the equation using a step size $\delta x$ so that the differential equation becomes
$$ \frac{y_{n+1} - y_n }{ \delta x } = f(x_n, y_n), $$
which is also written as
$$ y_{n+1} = y_n + \delta x \cdot f(x_n, y_n). $$Basics of Computationhttps://datumorphism.leima.is/wiki/computation/basics-of-computation/Thu, 13 Sep 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/computation/basics-of-computation/Storage, Precision, Error, etc To have some understanding of how the numbers are processed in computers, we have to understand how the numbers are stored first.
Computers store everything in binary form 1. Suppose we randomly get some segments in the memory, we have no idea what they stand for since we do not know the type of data they represent.
Some of the most used data types in data science areIntroduction to Node Crawler Serieshttps://datumorphism.leima.is/wiki/nodecrawler/node-crawler-introduction/Sun, 15 Jul 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/nodecrawler/node-crawler-introduction/This is a set of tutorials that will help you with your very first crawler with node.js.
The plan of this tutorial is as follows. First of all, we will write a functional crawler using node.js and dump the data into files or simply print it on screen. In the following article, we will use MongoDB as our data management system and organize our data. Then we will optimize and attack some of the pitfalls.Jupyter Notebookhttps://datumorphism.leima.is/wiki/tools/jupyter/Wed, 20 Jun 2018 15:58:49 -0400https://datumorphism.leima.is/wiki/tools/jupyter/Magics %lsmagic will show all the magics, including line magics and cell magics.
Line magics are magics that start with one %; cell magics are magics that can be used in the whole cell even with line breaks, where the cell should start with %%. %env can be used when setting environment variables inside the notebook.
%env MONGO_URI=localhost:27072 %%bash is a cell magic that allows bash commands in the cell.Regular Expression Basicshttps://datumorphism.leima.is/wiki/sugar/regular-experssions/Wed, 20 Jun 2018 15:58:49 -0400https://datumorphism.leima.is/wiki/sugar/regular-experssions/List of Keys Anchors at the beginning of line ^ import re p = re.compile('^T', re.I) line = "The email address is this this the do you see" result = p.findall(line) print(result) # ['T'] at the end of the line $ import re p = re.compile('e$', re.I) line = "The email address is this this the do you see" result = p.findall(line) print(result) # ['e'] Character Classes Printable Characters any character .Short-Time-Fourier-Transformhttps://datumorphism.leima.is/wiki/time-series/short-time-fourier-transform/Wed, 20 Jun 2018 15:58:49 -0400https://datumorphism.leima.is/wiki/time-series/short-time-fourier-transform/Short-Time-Fourier-Transform We Fourier transform the time series data using a Fourier transform, with some window function
\begin{equation} \tilde Y[n,k] = \sum_m Y[n+m] W[m] e^{-i \lambda_k m}, \end{equation}
where $\lambda_k=2\pi k/N$ and $W[m]$ is the window function at $m$.
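The sum above can be evaluated directly, term by term. The following is a minimal sketch using only the standard library, assuming the window has length $N$ and $m$ runs from $0$ to $N-1$. The function name `stft`, the `hop` parameter, the Hann window, and the test signal are illustrative choices that are not part of the original notes.

```python
import cmath
import math


def stft(y, window, hop=1):
    """Evaluate sum_m Y[n+m] W[m] exp(-i lambda_k m) for each start n and bin k."""
    n_win = len(window)
    frames = []
    for n in range(0, len(y) - n_win + 1, hop):
        frame = []
        for k in range(n_win):
            lam_k = 2 * math.pi * k / n_win  # lambda_k = 2 pi k / N
            frame.append(sum(y[n + m] * window[m] * cmath.exp(-1j * lam_k * m)
                             for m in range(n_win)))
        frames.append(frame)
    return frames


N = 16
# Periodic Hann window, one common choice for W[m].
hann = [0.5 - 0.5 * math.cos(2 * math.pi * m / N) for m in range(N)]
# Test signal: a sine with exactly 3 cycles per window length.
signal = [math.sin(2 * math.pi * 3 * t / N) for t in range(64)]

spectrum = stft(signal, hann, hop=8)
# Strongest positive-frequency bin in every frame.
peaks = [max(range(N // 2), key=lambda k: abs(frame[k])) for frame in spectrum]
print(len(spectrum), peaks)  # 7 [3, 3, 3, 3, 3, 3, 3]
```

The direct sum costs $O(N^2)$ per frame; in practice one would replace the inner loops by an FFT of the windowed segment, which computes the same quantity.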
References and Notes CourseraLinear Methodshttps://datumorphism.leima.is/wiki/machine-learning/linear/linear-methods/Fri, 25 May 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/linear/linear-methods/Solving Classification Problems with Linear Models One simple idea behind classification is to calculate the posterior probability of each class given the variables.
Suppose a dataset has features $F_\alpha$ where $\alpha = 1, 2, \cdots, K$, with corresponding class labels $G_\alpha$. The dataset provides $N$ data points, each denoted as $X_i$. The posterior of the classification is $P(G = G_\alpha \vert X = X_i)$.
A naive idea is to classify the data into two classes $m$ and $n$ using the boundary of a linear modelMachine Learning Overviewhttps://datumorphism.leima.is/wiki/machine-learning/overview/Fri, 25 May 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/overview/What is Machine Learning There are many objectives in machine learning. Two of the most applied objectives are classifications and regressions. In classifications and regression, the following four factors are relevant.
A simple framework of machine learning. The dataset $\tilde{\mathscr D}$ is first encoded by $\mathscr T$, $\mathscr D(\mathbf X, \mathbf Y) = \mathscr T(\tilde{\mathscr D})$. The dataset is fed into the model, $\bar{\mathbf Y} = f(\mathbf X;\mathbf \theta)$.Unsupervised Learninghttps://datumorphism.leima.is/wiki/machine-learning/unsupervised/overview/Fri, 25 May 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/unsupervised/overview/Unsupervised Learning!
Principal component analysis Clustering K-means Clustering Algorithm:
Assign data points to a group Iterate through until no change: Find centroid Find the point that is closest to the centroids. Assign that data point to the corresponding group of the centroids. How Many Groups
The art of choosing K. Hierarchical Clustering Bottom-up hierarchical groups can be read out from the dendrogram.Some Basic Ideas of Algorithmshttps://datumorphism.leima.is/wiki/algorithms/algorithms-basics/Tue, 20 Mar 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/algorithms/algorithms-basics/This set of notes on algorithms is not meant to be comprehensive or complete. These notes are being used as a skeleton framework. There are many useful books to learn about algorithms from a utilitarian point of view. I have listed a few in the references section.
Numerical recipes is a very comprehensive book that I used during my PhD. It covers almost all the algorithms you need for scientific computing.The C++ Languagehttps://datumorphism.leima.is/wiki/programming-languages/cpp/references/Tue, 20 Mar 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/programming-languages/cpp/references/C++!
Books The C++ Programming Language Programming Principles and Practice Using C++ The C++ Primer Lectures C++ Beginners Tutorial 1 (For Absolute Beginners) C++ Programming Introduction to C++ Coursera Course: C++ For C Programmers, Part A Top C++ Courses and Tutorials On SoloLearn: C++ Tutorial Practice SoloLearn provides this code playground that we can use to test C++ code. There is also repl.it
Libraries For solving differential equations:The Python Language: Basicshttps://datumorphism.leima.is/wiki/programming-languages/python/basics/Tue, 20 Mar 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/programming-languages/python/basics/Numbers, Arithmetics Two types of numbers exist,
int float, 15 digits, other digits are float error It is worth noting that in Python 2, we have
print(1.0/3) # will give us float numbers # 0.333333333333 while
print(1/3) # will only give us int # 0 However, this was changed in Python 3.
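For comparison, here is a short sketch of the Python 3 behaviour: `/` always performs true division, and the old integer behaviour is available through the floor-division operator `//`. The variable names are only for illustration.

```python
# Python 3: "/" is true division even for two ints.
true_quotient = 1 / 3
print(true_quotient)   # 0.3333333333333333

# "//" is floor division and reproduces the Python 2 result for ints.
floor_quotient = 1 // 3
print(floor_quotient)  # 0

# Floor division rounds towards negative infinity, not towards zero.
print(7 // 2, -7 // 2)  # 3 -4
```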
Variables, Functions, Conditions A variable name should start with either a letter or an underscore.
Variables defined inside a function are local, and there is no way to find or use them outside the function.Poisson Regressionhttps://datumorphism.leima.is/wiki/machine-learning/linear/poisson-regression/Fri, 07 May 2021 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/linear/poisson-regression/Poisson regression is a generalized linear model for count data.
To model a dataset that is generated from a Poisson distribution, we only need to model the mean $\mu$ as it is the only parameter. The simplest model we can have for some given features $X$ is a linear model. However, for count data, the effects of the predictors are often multiplicative. The next simplest model we can have isPrinciples of Colorshttps://datumorphism.leima.is/wiki/data-visualization/colors/Fri, 20 Nov 2020 00:00:00 +0000https://datumorphism.leima.is/wiki/data-visualization/colors/Basic Concepts of Colors Color Wheel and Color Sphere There are two dimensions in the color wheel:
Hue Saturation When we add another dimension, lightness, to the wheel, we have a color sphere (1, 2).
Many color systems have been invented. Color wheel and color sphere are two examples of them.Goodness-of-fithttps://datumorphism.leima.is/wiki/model-selection/goodness-of-fit/Sun, 08 Nov 2020 00:00:00 +0000https://datumorphism.leima.is/wiki/model-selection/goodness-of-fit/Does the data agree with the model?
Calculate the distance between data and model predictions. Apply Bayesian methods such as likelihood estimation: likelihood of observing the data if we assume the model; the results will be a set of fitting parameters. … Why don’t we always use goodness-of-fit as a measure of the goodness of a model?
We may experience overfitting. The model may not be intuitive. This is why we would like to balance it with parsimony using some measures of generalizability.Data Types and Level of Measurement in Machine Learninghttps://datumorphism.leima.is/wiki/machine-learning/feature-engineering/data-types/Wed, 15 Jan 2020 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/feature-engineering/data-types/Types of Data There are several debatable categorization methods of data.
The first widely spread theory, or level of measurement, is by S. Stevens. The theory categorizes data into four types, nominal, ordinal, interval, and ratio.
Other methods are proposed for other fields of research. For example, N. R. Chrisman proposed a different method for cartography. However, these are not generic enough for data science. They are more general than a specific field of research.Decision Treehttps://datumorphism.leima.is/wiki/machine-learning/tree-based/decision-tree/Wed, 25 Dec 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/tree-based/decision-tree/In this article, we will explain how decision trees work and build a tree by hand.
The code used in this article can be found in this repo. Definition of the problem We will decide whether one should go to work today. In this demo project, we consider the following features.
feature possible values health 0: feeling bad, 1: feeling good weather 0: bad weather, 1: good weather holiday 1: holiday, 0: not holiday For more compact notations, we use the abstract notation $\{0,1\}^3$ to describe a set of three features each with 0 and 1 as possible values.Bayesian Linear Regressionhttps://datumorphism.leima.is/wiki/machine-learning/bayesian/bayesian-linear-regression/Tue, 18 Jun 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/bayesian/bayesian-linear-regression/Linear Regression and Likelihood The linear estimator $y$ is
$$ \begin{equation} y^n = \beta^m X_m^{\phantom{m}n}. \label{eq-linear-model} \end{equation} $$
As usual, we have redefined our data to get rid of the intercept $\beta^0$.
In ordinary linear models, we find the error being the difference between the target $\hat y$ and the estimator $y$
$$ \epsilon = \hat y - y, $$
which is required to have a minimum absolute value.
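One common way to make this error small is to minimise the sum of squared errors. As a minimal sketch, consider the special case of a single feature and no intercept, $y = \beta x$, where the minimiser has the closed form $\beta = \sum_n x_n \hat y_n / \sum_n x_n^2$. The function name and the data below are made up for illustration.

```python
def fit_single_feature(x, y_hat):
    """Least-squares beta for the one-feature model y = beta * x (no intercept)."""
    return sum(xi * yi for xi, yi in zip(x, y_hat)) / sum(xi * xi for xi in x)


# Hypothetical data: targets y_hat scattered around 2 * x.
x = [1.0, 2.0, 3.0, 4.0]
y_hat = [2.1, 3.9, 6.2, 7.8]

beta = fit_single_feature(x, y_hat)
# Errors epsilon = y_hat - y for the fitted estimator y = beta * x.
errors = [yi - beta * xi for xi, yi in zip(x, y_hat)]

print(round(beta, 3))  # 1.99
```

At the minimum the errors are orthogonal to the feature, $\sum_n x_n \epsilon_n = 0$, which is a quick way to check a fit.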
In linear regressions, we use least squares to solve the problem.NMF: Nonnegative Matrix Factorizationhttps://datumorphism.leima.is/wiki/machine-learning/factorization/nmf/Thu, 13 Jun 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/factorization/nmf/Decomposition To make it easier to understand, we start with a data point $\mathbf P$ in a $k$-dimensional space spanned by $k$ basis vectors $\mathbf V^k$. Naturally, we could write down the component decomposition of the point using the basis vectors $\mathbf V^k$,
$$ \mathbf P = P_k \mathbf V^k. $$
This is immediately obvious to us since we have been dealing with rank 2 $(k, 1)$ basis vectors and we are talking about the $k$ coordinates for a point.Word2vechttps://datumorphism.leima.is/wiki/machine-learning/embedding/word2vec/Thu, 13 Jun 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/embedding/word2vec/Word2vec is a word embedding model that learns the probability of some words being neighbours in a sentence $p_{neighbours}(w_i, w_o)$.
Build a dataset of adjacent words. CBOW; skipgram; negative sampling; Encode the words using vectors. Build a model $f(\{\theta_i\})$ to calculate the probability of the words being neighbours and improve the parameters $\{\theta_i\}$ using the dataset.Bias-Variancehttps://datumorphism.leima.is/wiki/machine-learning/basics/bias-variance/Fri, 07 Jun 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/basics/bias-variance/Bias and Variance Suppose $f(X)$ is a perfect model that represents a “tight” model of the dataset $(X,Y)$ up to some irreducible error $\epsilon$,
$$ \begin{equation} Y = f(X) + \epsilon. \label{dataset-using-true-model} \end{equation} $$
On the other hand, we build another model using a specific method such as k-nearest neighbors, which is denoted as $k(X)$.
Why the two models?
Why are we talking about the perfect model and a model using a specific method?Types of Errors in Statistical Hypothesis Testinghttps://datumorphism.leima.is/wiki/statistical-hypothesis-testing/type-1-error-and-type-2-error/Fri, 31 May 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/statistical-hypothesis-testing/type-1-error-and-type-2-error/Type I and Type II Errors In statistical hypothesis testing, we always have a null hypothesis $H_0$ which refers to the statement to be tested. We have two possible conclusions from a hypothesis testing,
to accept the hypothesis, that is concluding that $H_0$ is true, to reject the hypothesis, that is concluding that $H_0$ is false. However, it is possible that our conclusion is not correct. There are four possible results.Amazon CloudWatch Logshttps://datumorphism.leima.is/wiki/tools/awslogs/Mon, 11 Mar 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/tools/awslogs/Why Suppose we have all kinds of pipelines written in different languages, using different tools, and located in different places. It would be frustrating to pull out the logs.
This is why we need a centralized log service, for example cloudwatch.
Sending logs to CloudWatch First of all, send your logs to awslogs. The easiest way is to use boto.
Retrieving and Analyzing Logs First of all, we need this: awslogs.Confidence Intervalhttps://datumorphism.leima.is/wiki/statistical-estimation/confidence-interval/Sun, 20 Jan 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/statistical-estimation/confidence-interval/We will use upper cases for the abstract variable and lower cases for the actual numbers.
Why is Confidence Interval Needed? Suppose I sample the population multiple times, the mean value $\mu_i$ of the sample is calculated for each sample. It is a good question to ask how different these $\mu_i$ are compared to the true mean $\mu_p$ of the population.
In this article, we would need to specify several notations.Jargonshttps://datumorphism.leima.is/wiki/statistics/jargons/Sat, 24 Nov 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/statistics/jargons/Accuracy and Precision Accuracy: the measurement compared to the truth Precision: variability of repeated measurements; the more precise, the less variations during each measurement. Accurate Inaccurate Precise Close to true value, small variations in each measurement Far from true value, small variations in each measurement Imprecise Close to true value, large variations in each measurement Far from true value, large variations in each measurement Here is an example.Extract, Transform and Loadhttps://datumorphism.leima.is/wiki/data-warehouse/extract-transform-load/Fri, 23 Nov 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/data-warehouse/extract-transform-load/ETL Process ETL
Extract: extract data from sources Transform: transform it to proper format Load: load it to data storage infrastructure E for Extract Should not affect the source system. T for Transform Cleaning Filtering Enriching Splitting Joining L for Load Deal with sync and waitingPartial Differential Equationshttps://datumorphism.leima.is/wiki/dynamical-system/partial-difference-method/Mon, 19 Nov 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/dynamical-system/partial-difference-method/Forward Time Centered Space For $\frac{d f}{d t} = - v \frac{ d f }{ dx }$, we write down the finite difference form 1
$$ \frac{f(t_{n+1}, x_i ) - f(t_n, x_i)}{ \Delta t } = - v \frac{ f(t_n, x_{i+1}) - f(t_n, x_{i-1}) }{ 2\Delta x }. $$
FTCS is an explicit method and is not stable.
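The instability can be seen numerically. The sketch below solves the finite difference form for $f(t_{n+1}, x_i)$ and applies it repeatedly to a sine wave on a periodic grid. The grid size, the velocity, the time step, and the initial condition are arbitrary choices for illustration.

```python
import math


def ftcs_step(f, v, dt, dx):
    """Advance f by one FTCS time step on a periodic grid."""
    n = len(f)
    c = v * dt / (2 * dx)
    # f[i - 1] wraps around for i == 0, and (i + 1) % n wraps at the right edge.
    return [f[i] - c * (f[(i + 1) % n] - f[i - 1]) for i in range(n)]


n_points = 32
dx = 1.0 / n_points
f = [math.sin(2 * math.pi * 4 * i * dx) for i in range(n_points)]
initial_max = max(abs(value) for value in f)

for _ in range(200):
    f = ftcs_step(f, v=1.0, dt=0.5 * dx, dx=dx)
final_max = max(abs(value) for value in f)

# The exact solution only translates the wave, so its amplitude should stay
# at 1; with FTCS the amplitude grows at every step.
print(initial_max, final_max)
```

A von Neumann analysis gives an amplification factor of magnitude $\sqrt{1 + (v \Delta t / \Delta x)^2 \sin^2(k \Delta x)}$ per step for a mode with wavenumber $k$, which is larger than 1 for any $\Delta t > 0$ unless $\sin(k \Delta x) = 0$.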
Lax Method Change the term $f(t_n, x_i)$ in FTCS to $( f(t_n, x_{i+1}) + f(t_n, x_{i-1}) )/2$ 1.Basics of Programminghttps://datumorphism.leima.is/wiki/computation/basics-of-programming/Sun, 23 Sep 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/computation/basics-of-programming/Recursive and Iterative Solving problems with iterative and recursive methods are two quite different approaches, somehow, to the same kind of problems.
Here we will calculate the factorial of $n$. We define two functions using the iterative method and the recursive method.
Run the program on Repl.it.
    def recursiveFactorial(n):
        if n == 0:
            return 1
        else:
            return n * recursiveFactorial(n - 1)

    def iterativeFactorial(n):
        ans = 1
        i = 1
        while i <= n:
            ans = ans * i
            i = i + 1
        return ans

    print(recursiveFactorial(0))
    print(iterativeFactorial(0))

Basic Node Crawlerhttps://datumorphism.leima.is/wiki/nodecrawler/basic-crawler/Sun, 15 Jul 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/nodecrawler/basic-crawler/Prerequisites Nodejs >= 8.9 Overview A model for a crawler is as follows.
A crawler requests data from the server, while the server responds with some data. Here is a graphic illustration
    +----------+                     +-----------+
    |          |  HTTP Request       |           |
    |          +---------------->    |           |
    |  Nodejs  |                     |  Servers  |
    |          <----------------+    |           |
    |          |  HTTP Response      |           |
    +----------+                     +-----------+

HTTP Requests For a good introduction of HTTP requests, please refer to this video on youtube: Explained HTTP, HTTPS, SSL/TLS API As for the first step, we need to find which url to request.Autoregressive Modelhttps://datumorphism.leima.is/wiki/time-series/autoregressive-model/Wed, 20 Jun 2018 15:58:49 -0400https://datumorphism.leima.is/wiki/time-series/autoregressive-model/Autoregressive Given a time series $\{T^i\}$, a simple predictive model can be constructed using an autoregressive model.
$$ \begin{equation} T^t = \sum_{i=1}^p \beta_i T^{t - i} + \beta^t + \beta^0. \end{equation} $$
Such a model is usually called an AR(p) model due to the fact that we are using data back in $p$ steps.
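As a minimal sketch of what "using data back in $p$ steps" means, the function below computes a one-step forecast from the last $p$ values of a series. It keeps only the lag terms $\beta_i T^{t-i}$ and a constant intercept standing in for $\beta^0$; the additional term $\beta^t$ of the equation is left out. The function name, the coefficients, and the series are made up for illustration.

```python
def ar_predict(history, betas, intercept=0.0):
    """One-step AR(p) forecast: intercept + sum_i beta_i * T^{t-i}."""
    p = len(betas)
    # The p most recent values, newest first: T^{t-1}, T^{t-2}, ..., T^{t-p}.
    lags = history[-1:-p - 1:-1]
    return intercept + sum(b * x for b, x in zip(betas, lags))


series = [1.0, 2.0, 3.0, 4.0]
# AR(2) with beta_1 = 2 and beta_2 = -1 extrapolates a straight line.
print(ar_predict(series, betas=[2.0, -1.0]))  # 5.0
```

In practice the coefficients $\beta_i$ are not chosen by hand but estimated from the series, for example by least squares on the lagged values.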
Differential Equation For simplicity we will look at an AR(1) model. Assume the time series has a step size of $dt$, our model can be rewritten asUnsupervised Learning: PCAhttps://datumorphism.leima.is/wiki/machine-learning/unsupervised/pca/Fri, 25 May 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/unsupervised/pca/We use the Einstein summation notation in this article. Principal Component Analysis (PCA) is a commonly used trick for dimensionality reduction so that the new features represent most of the variance of the data.
Representations of Dataset In theory, a dataset can be represented by a matrix if we specify the basis. However, the initial given basis is not always the most convenient one. Suppose we find a new set of basis for the dataset, the matrix representation may be simpler and easier to use.Data Structurehttps://datumorphism.leima.is/wiki/algorithms/data-structure/Tue, 20 Mar 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/algorithms/data-structure/Dealing with data structure is like dealing with your clothes. Some people randomly drop their clothes somewhere without thinking. But it takes time to retrieve a specific T-shirt. Some people spend more time folding and arranging their clothes. This process makes it easy to find a specific T-shirt. Similar to retrieving clothes, there is always a balance between the computation time (retrieving clothes) and the coding time (folding clothes).
Keywords This section serves as some kind of flashcard keywords.The C++ Language: Basicshttps://datumorphism.leima.is/wiki/programming-languages/cpp/basics/Tue, 20 Mar 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/programming-languages/cpp/basics/Make it Work Apart from the traditional way of running C++ code, Jupyter notebook has a cling kernel that makes it possible to run C++ in a Jupyter notebook. Here is the post: Interactive C++ for HPC.
Concepts Namespace Operators: assignment operators (=,+=,-=,*=,/=,%=), increment/decrement operator (++x,x++,--x,x--), relational operators (>,<,>=,<=,==,!=), logical operators (&&,||,!), left shift (<<), extraction operator (>>, or right shift), understand the operator precedence Variables: variable name starts with underscore or latin letters, Pascal case (PascalCase), Camel case (camelCase) if/else and Loops: if (condition is true ){ then something } Data Types: string (double quote), character (char, 1 byte ASCII character, using single quote), float (4 bytes, always signed), double (8 bytes, always signed), long double (8 or 16 bytes, always signed), signed or unsigned short or long int (signed long int, unsigned int) Pointers: ampersand (&) accesses the address, a pointer is a variable and thus needs to be declared using asterisk (*), can be declared to be int or double or float or char (int \*pt; int\* pt; int * pt;) Functions: overload, recursion Class: identity, attributes, method/behavior, access specifiers (private or public or protected, by default it is set to private), instantiation of object (creating object), constructor, destructor, encapsulation, scope resolution operator (TheClassYouNeed::somefunction()), selection operator (dot member selection .The Python Language: Decoratorshttps://datumorphism.leima.is/wiki/programming-languages/python/decorators/Tue, 20 Mar 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/programming-languages/python/decorators/Functions: first-class objects; can be passed around as arguments.
This tells us that functions can be passed into a function or even returned by a function. For example,
def a_decoration_function(yet_another_function):
    # wrapper adds behaviour before and after the wrapped function
    def wrapper():
        print('Before yet_another_function')
        yet_another_function()
        print('After yet_another_function')
    return wrapper

def yet_another_function():
    print('This is yet_another_function')

When we call the function returned by a_decoration_function(yet_another_function), we will have
Before yet_another_function
This is yet_another_function
After yet_another_function

So a decorator is simply a function that takes a function as an argument, adds some salt to it.A Physicist's Crash Course on Artificial Neural Networkhttps://datumorphism.leima.is/wiki/machine-learning/neural-networks/physicists-crash-course-neural-network/Sat, 02 May 2015 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/neural-networks/physicists-crash-course-neural-network/What is a Neuron What a neuron does is to respond when a stimulation is given. This response could be strong or weak or even null. If I were to draw a figure of this behavior, it would look like this.
Neuron response Using simple single neuron responses, we could compose complicated responses. To achieve that, we study the transformations of the response first.
transformations Artificial Neural Network A simple network is a collection of neurons that respond to stimulations, which could be the responses of other neurons.Logistic Regressionhttps://datumorphism.leima.is/wiki/machine-learning/linear/logistic-regression/Thu, 27 May 2021 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/linear/logistic-regression/In a classification problem, given a list of feature values $x$ and their corresponding classes $\{c_i\}$, the posterior of the classes, aka the conditional probability of the classes, is
$$ p(C=c_i\mid X=x). $$
Likelihood
The likelihood of the data is
$$ p(X=x\mid C=c_i). $$
Logistic Regression for Two Classes For two classes, the simplest model for the posterior is a linear model,
$$ \log \frac{p(C=c_1\mid X=x) }{p(C=c_2\mid X=x)} = \beta_0 + \beta_1 \cdot x, $$Measures of Generalizabilityhttps://datumorphism.leima.is/wiki/model-selection/measures-of-generalizability/Sun, 08 Nov 2020 00:00:00 +0000https://datumorphism.leima.is/wiki/model-selection/measures-of-generalizability/To measure the generalization, we define a generalization error,
$$ \begin{align} \mathcal G = \mathcal L_{P}(\hat f) - \mathcal L_E(\hat f), \end{align} $$
where $\mathcal L_{P}$ is the population loss, $\mathcal L_E$ is the empirical loss, and $\hat f$ is our model by minimizing the empirical loss.
However, we do not know the actual joint probability $p(x, y)$ of our dataset $\{x_i, y_i\}$. Thus the population loss is not known. In machine learning, we usually use cross validation Cross Validation Cross validation is a method to estimate the risk The Learning Problem The learning problem posed by Vapnik:1 Given a sample: $\{z_i\}$ in the probability space $Z$; Assuming a probability measure on the probability space $Z$; Assuming a set of functions $Q(z, \alpha)$ (e.Random Foresthttps://datumorphism.leima.is/wiki/machine-learning/tree-based/random-forest/Wed, 25 Dec 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/tree-based/random-forest/Random forest is an ensemble method based on decision trees. Instead of using one decision tree and model on all the features, the decision tree method can model on a random set of features (feature subspace) using many decision trees and make decisions by democratizing the trees.
Given a proper dataset $\mathscr D(\mathbf X, \mathbf y)$, the ensemble of trees, denoted as $\{f_i(\mathbf X)\}$, will predict an ensemble of results.Predictions Using Time Series Datahttps://datumorphism.leima.is/wiki/time-series/predictions-time-series-data/Fri, 21 Jun 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/time-series/predictions-time-series-data/General Phenological Model for Seasonality In business, time series data $f(t)$ usually carries information about trend $g(t)$ ($g$ is used since trend is usually growth), seasonalities (periodical effects) $p(t)$, holiday effects (structural effects) $s(t)$, etc. We will decompose a time series $f(t)$ into four components
$$ \begin{equation} f(t) = g(t) + p(t) + s(t) + \epsilon(t). \end{equation} $$
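To make the decomposition concrete, here is a minimal sketch that builds a synthetic series from the four components. The specific shapes (a linear trend, a weekly sine wave, a fixed boost on two hypothetical holiday dates) and all parameter values are assumptions made for illustration, not the models used in the article.

```python
import numpy as np


def trend(t, slope=0.05, intercept=10.0):
    """g(t): a linear growth trend (assumed shape)."""
    return intercept + slope * t


def seasonality(t, period=7.0, amplitude=2.0):
    """p(t): a periodic effect, here a weekly sine wave (assumed shape)."""
    return amplitude * np.sin(2 * np.pi * t / period)


def holiday_effect(t, holidays=(30, 60), boost=5.0):
    """s(t): a structural effect that is nonzero only on the listed days."""
    return boost * np.isin(t, holidays)


def compose(t, noise_std=0.5, seed=0):
    """f(t) = g(t) + p(t) + s(t) + eps(t)."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, noise_std, size=len(t))
    return trend(t) + seasonality(t) + holiday_effect(t) + eps


t = np.arange(90)
f = compose(t)
```

Setting `noise_std=0.0` recovers the sum of the three predictable components exactly, which is the part a forecasting model would try to learn.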
To train a model for the predictions, we need to write down the exact models of these three predictable components.Tensor Factorizationhttps://datumorphism.leima.is/wiki/machine-learning/factorization/tensor-factorization/Mon, 17 Jun 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/factorization/tensor-factorization/Tensors We will be talking about tensors but we will skip the introduction to tensor for now.
In this article, we follow a commonly used convention for tensors in physics, the abstract index notation. We will denote tensors as $T^{ab\cdots}_ {\phantom{ab\cdots}cd\cdots}$, where the latin indices such as $^{a}$ are simply a placeholder for the slot of this “tensor machine”. For a given basis (coordinate system), we can write down the components of this tensor $T^{\alpha\beta\cdots} _ {\phantom{\alpha\beta\cdots}\gamma\delta\cdots}$.Anscombe's quartethttps://datumorphism.leima.is/wiki/data-visualization/anscombes-quartet/Mon, 18 Mar 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/data-visualization/anscombes-quartet/Anscombe’s Quartet Anscombe’s quartet is a brilliant idea that shows the importance and convenience of visual representation of data.
Anscombe’s quartet has four datasets. The values of each dataset are shown below.
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5] y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68] x2 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.OLAP Operationshttps://datumorphism.leima.is/wiki/data-warehouse/olap-operations/Fri, 23 Nov 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/data-warehouse/olap-operations/Roll-up or Drill-up The word ‘up’ in the names refers to going up in concept hierarchies.
For example, we would like to know the revenue of the whole year. However, the record of data is
Date Revenue 2018-01-01 1023 2018-01-02 934 … … 2018-12-30 1244 2018-12-31 1302 Roll-up is performed by summing up everything of the column revenue.Finite Element Methodhttps://datumorphism.leima.is/wiki/dynamical-system/finite-element-method/Mon, 19 Nov 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/dynamical-system/finite-element-method/Differential Equations and Boundary Conditions Two Types of Boundary Conditions As an example, we have a partial differential equation
$$ \frac{d^2u}{dx^2} + f = 0, $$
which describes a 1D problem.
Dirichlet boundary condition: specify values for $u$, such as $u(0)=u_0$ and $u(L)=u_L$; Neumann boundary condition: specify values for $u_{,x}$. If we have only Neumann boundary conditions, the solution is not unique. One example is a tossed bar, which has Neumann BCs at both ends while it is still moving.Chi-square Correlation Test for Nominal Datahttps://datumorphism.leima.is/wiki/statistics/correlation-analysis-chi-square/Sun, 18 Nov 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/statistics/correlation-analysis-chi-square/In this article, we will discuss the chi-square correlation test for detecting correlations between two series.
Steps Find out all the possible values of the two nominal series A and B; Count the co-occurrences of the combinations (A, B); Calculate the expected co-occurrences of the combinations (A, B); Calculate chi-square; Determine whether the hypothesis can be rejected. Define the Series Suppose we are analyzing two series A and B.Basics of Networkhttps://datumorphism.leima.is/wiki/computation/basics-of-network/Sun, 23 Sep 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/computation/basics-of-network/HTTP Keywords Hyper Text Transfer Protocol: deliver hyper text from server to local browser etc. Based on TCP/IP Current version: HTTP/2 Server - Client Client can request through GET, HEAD, POST, PUT, DELETE, TRACE, OPTIONS, CONNECT, PATCH. Transfer anything defined by Content-Type Connectionless Protocol: doesn’t maintain the connection all the time Stateless protocol: A very nice explanation URL Keywords Uniform Resource Locator Interpret each part of this URL: http://abc.Unsupervised Learning: SVMhttps://datumorphism.leima.is/wiki/machine-learning/unsupervised/svm/Fri, 17 Aug 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/unsupervised/svm/SVM calculates a hyperplane to separate the data points into groups according to the label.
Hyperplane A hyperplane is defined to be of the following form
$$ \begin{equation} \boldsymbol{\beta} \cdot \mathbf x = \beta_0. \end{equation} $$
where $\boldsymbol\beta$ is the normal vector to the plane and is required to be constant.
It is straightforward to show that the distance $d$ from an arbitrary point $\mathbf x'$ to the hyperplane isManage Data Using MongoDBhttps://datumorphism.leima.is/wiki/nodecrawler/manage-data-using-mongodb/Wed, 18 Jul 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/nodecrawler/manage-data-using-mongodb/In most cases, databases make the management of data quite convenient. In this article, we will scrape data using the code we discussed before but write the data into MongoDB.
For installation of MongoDB, please refer to the official documentation.
The Code To write data to MongoDB using Node.js, we choose the package mongojs, which provides almost exactly the standard MongoDB syntax.
To install mongojs,
npm i mongojs --save Here is a module that can write data to MongoDB.The Python Language: Multi-Processinghttps://datumorphism.leima.is/wiki/programming-languages/python/multiprocessing/Thu, 10 May 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/programming-languages/python/multiprocessing/Python has a built-in multiprocessing module in its standard library.
One simple example of using the Pool class is the following.
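from multiprocessing import Pool

def myfunc(myfuncargs):
    'some thing here'

# run myfunc over the elements of myfuncargs with 10 worker processes
with Pool(10) as p:
    records = p.map(myfunc, myfuncargs)

However, this approach has limitations, especially around pickling. Another approach: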
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

# same interface as Pool, but backed by threads instead of processes
with ThreadPool(1) as p:
    records = p.map(myfunc, myfuncargs)

Beware that map passes the elements of the list to the function one at a time.Data Structure: Treehttps://datumorphism.leima.is/wiki/algorithms/data-structure-tree/Tue, 27 Mar 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/algorithms/data-structure-tree/mind the data structure: here comes the treeThe C++ Language: Numerical Methodshttps://datumorphism.leima.is/wiki/programming-languages/cpp/numerical/Tue, 20 Mar 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/programming-languages/cpp/numerical/Modularize The code should be designed to separate physics or model from numerical methods. Speed vectors are convenient but slow. 1 Do not copy arrays if not necessary. The example would be for a function return. Most of the time, we can pass the pointer of an array to the function and update the array itself without copying anything and no return is needed at all. inline function.GNUPlothttps://datumorphism.leima.is/wiki/tools/gnuplot/Mon, 04 Sep 2017 00:00:00 +0000https://datumorphism.leima.is/wiki/tools/gnuplot/Examples Plot .csv data. Suppose we have the following data.
-0.00999983, 0.99995 -0.0199987, 0.9998 -0.0299955, 0.99955 -0.0399893, 0.9992 -0.0499792, 0.99875 -0.059964, 0.998201 To plot the second column against the first column, we use the using parameter in gnuplot.
gnuplot -e "set terminal png; set datafile separator ',' ; plot 'complex.txt' using 1:2" | imgcat # datafile separator is not always necessary # imgcat is a script in iterm2 on macBoltzmann Machinehttps://datumorphism.leima.is/wiki/machine-learning/energy-based-model/boltzmann-machine/Sun, 27 Aug 2017 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/energy-based-model/boltzmann-machine/Boltzmann machine is much like a spin glass model in physics. In short, a Boltzmann machine is a machine that has nodes that can take values, and the nodes are connected through some weights. It is just like any other neural net but with complications and theoretical implications.
Boltzmann machine is usually used as a generative model.
Boltzmann Machine and Physics To obtain a good understanding of Boltzmann machine for a physicist, we begin with the Ising model.Restricted Boltzmann Machinehttps://datumorphism.leima.is/wiki/machine-learning/energy-based-model/restricted-boltzmann-machine/Fri, 11 Jun 2021 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/energy-based-model/restricted-boltzmann-machine/Latent variables introduce extra correlations between the nodes in a network. Introducing hidden units can also help us remove the direct connection between some nodes in a Boltzmann machine and create a restricted Boltzmann machine. A restricted Boltzmann machine requires less computation while retaining some expressive power.
Given Ising like interactions between the nodes, flipping node V1 is likely to also flip node V2 as they are connected through hidden unit H1.Deep Autoregressive Networkhttps://datumorphism.leima.is/wiki/machine-learning/neural-networks/deep-autoregressive-networks/Mon, 15 Feb 2021 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/neural-networks/deep-autoregressive-networks/There are two levels of autoregressiveness in the DARN network:
Inlayer autoregressive connections of the nodes, Intralayer autoregressive connections of nodes. The network is trained on MDL loss.Wavelet Transformhttps://datumorphism.leima.is/wiki/time-series/wavelets/Mon, 07 Dec 2020 00:00:00 +0000https://datumorphism.leima.is/wiki/time-series/wavelets/In general, given a complete set of functions $\psi(x; \tilde x)$, we can decompose a function $F(\tilde x)$
$$ F(\tilde x) = \int f(x) \psi(x;\tilde x) dx. $$
The choice of $\psi(x;\tilde x)$ gives us different properties.
Fourier Transform Fourier transform is good for stationary analysis since time is not involved in $F(\omega)$.
$$ F(\omega) = \int_{-\infty}^{\infty} f(t) e^{-i \omega t} dt $$
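A discrete analogue of this transform can be sketched with NumPy's FFT routines. The example assumes a stationary test signal (a single 5 Hz sine sampled at 100 Hz for 10 seconds); these values are arbitrary and chosen only so that the peak falls exactly on a frequency bin.

```python
import numpy as np

sampling_rate = 100.0                    # samples per second (assumed)
t = np.arange(1000) / sampling_rate      # 10 seconds of samples
signal = np.sin(2 * np.pi * 5.0 * t)     # stationary 5 Hz sine wave

# rfft computes the discrete Fourier transform of a real-valued signal
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sampling_rate)

# The frequency with the largest magnitude; it carries no time information
dominant = freqs[np.argmax(np.abs(spectrum))]
```

The spectrum tells us which frequencies are present but not when they occur, which is why a windowed variant is needed for non-stationary signals.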
Short-time Fourier Transform STFT is a Fourier transform with a moving time window $\tau$,Parsimony of Modelshttps://datumorphism.leima.is/wiki/model-selection/parsimony-of-models/Sun, 08 Nov 2020 00:00:00 +0000https://datumorphism.leima.is/wiki/model-selection/parsimony-of-models/For models with a lot of parameters, the goodness-of-fit is very likely to be very high. However, such models are also likely to generalize badly. So we need a measure of generalizability.
Here parsimony gives us a few advantages.
easy to perceive better generalizationsTerminalhttps://datumorphism.leima.is/wiki/tools/terminal/Tue, 31 Dec 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/tools/terminal/Navigating Some tips to help data scientist navigate faster in terminal.
pushd, popd and dirs pushd to register and change directories: pushd folder_name will change current directory to folder_name and register the folder folder_name in our stack. If no folder name is passed to the command, it defaults to the $HOME folder. popd to go to the last directory in the stack and remove it from the stack. In this example, popd will change the current working directory to folder_name.Statistical Sign Testhttps://datumorphism.leima.is/wiki/statistical-hypothesis-testing/sign-test/Sun, 20 Jan 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/statistical-hypothesis-testing/sign-test/We have a small dataset, but it doesn’t satisfy the t-test conditions. Then we should use as few assumptions as possible.
Wine Taste Suppose we have two bottles of wine, one of them is 300 euros while the other is 100 euros.
Now we ask the question:
Does expensive wine taste better?
We find 10 experts and give them some experiments. The result is recorded then processed into the following table.Data Storagehttps://datumorphism.leima.is/wiki/data-warehouse/data-storage/Fri, 23 Nov 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/data-warehouse/data-storage/tl;dr: Use type safe formats such as HDF5 or parquet
HDF5 BCOLZ (http://bcolz.blosc.org/en/latest/): not designed for multidimensional data. Zarr (https://github.com/alimanfoo/zarr): works with multidimensional data and also parallel computing. Blaze ecosystem (http://blaze.pydata.org/) An article that compares HDF5, BCOLZ, and Zarr: To HDF5 and beyond
I also recommend pandas. It is a Python module that works very well with data. It even loads HDF5 out of the box.Bin Size of Histogramhttps://datumorphism.leima.is/wiki/data-visualization/histogram-bin-size/Thu, 22 Nov 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/data-visualization/histogram-bin-size/Histograms are good for understanding the distribution of your data.
The Bin Size Problem As an example, we will use the following series.
[1.45,2.20,0.75,1.23,1.25,1.25,3.09,1.99,2.00,0.78,1.32,2.25,3.15,3.85,0.52,0.99,1.38,1.75,1.21,1.75] If we use bin size 1, we get this spiky chart and it is not very informative.
We could also set bin size to 2.
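The effect of the bin size can be checked numerically without plotting. The sketch below uses `numpy.histogram` on the series above and interprets "bin size" as the bin width, with bins starting at 0 and covering the range up to 4; both choices are assumptions, and the helper name `bin_counts` is hypothetical.

```python
import numpy as np

data = [1.45, 2.20, 0.75, 1.23, 1.25, 1.25, 3.09, 1.99, 2.00, 0.78,
        1.32, 2.25, 3.15, 3.85, 0.52, 0.99, 1.38, 1.75, 1.21, 1.75]


def bin_counts(values, width, start=0.0, stop=4.0):
    """Count the values falling into consecutive bins of the given width."""
    edges = np.arange(start, stop + width, width)
    counts, _ = np.histogram(values, bins=edges)
    return counts


counts_width_1 = bin_counts(data, width=1)  # four bins: [0,1), [1,2), [2,3), [3,4]
counts_width_2 = bin_counts(data, width=2)  # two bins: [0,2), [2,4]
```

Comparing the two count arrays shows how a wider bin smooths away structure that a narrower bin keeps.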
In principle, we could keep tuning the bin size until we get something pretty and informative. But that would be quite depressing.Correlation Coefficient and Covariance for Numeric Datahttps://datumorphism.leima.is/wiki/statistics/correlation-coefficient/Sun, 18 Nov 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/statistics/correlation-coefficient/Covariances Correlation coefficient is also known as Pearson’s product moment coefficient. Review of Standard Deviation For a series of data A, we have the standard deviation
$$ \sigma_A = \sqrt{ \frac{ \sum (a_i - \bar A)^2 }{ n } }, $$
where $n$ is the number of elements in series A.
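A direct translation of this formula into code may help. This is a minimal sketch using only the standard library; the function name `population_std` and the sample series are made up for illustration.

```python
import math


def population_std(series):
    """Standard deviation with n (not n - 1) in the denominator."""
    n = len(series)
    mean = sum(series) / n
    # Average of the squared deviations from the mean, then the square root
    return math.sqrt(sum((a - mean) ** 2 for a in series) / n)


sigma = population_std([2, 4, 4, 4, 5, 5, 7, 9])
```

Note that dividing by $n$ gives the population standard deviation; the sample version divides by $n - 1$ instead.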
The standard deviation is very easy to understand. It is basically the root-mean-square Euclidean distance between the data points and the average value.Basics of Databasehttps://datumorphism.leima.is/wiki/computation/basics-of-database/Wed, 03 Oct 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/computation/basics-of-database/NoSQL NoSQL = Not only SQL. The main types of NoSQL databases are
Key-value store: Amazon Dynamo, memcached, Amazon SimpleDB Column-oriented store: Google BigTable, Cassandra Graph database: Neo4j, VertexDB Document database: MongoDB Object database: ZODB Database Operations Relations Union: $A\cup B$ Intersection: $A\cap B$ $A - B$ Cartesian Product: $A \times B$ Query Union in database: will combine the data with matching common columns.Restrictions of Websiteshttps://datumorphism.leima.is/wiki/nodecrawler/restrictions/Thu, 19 Jul 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/nodecrawler/restrictions/Beware that scraping data off websites is neither always allowed nor as easy as a few lines of code. The preceding articles enable you to scrape a lot of data; however, many websites have countermeasures. In this article, we will be dealing with some of the common ones.
Request Frequency Some websites have limitations on the frequency of API requests. The solution to this is simply a brief pause after each request.Data Structure: Graphhttps://datumorphism.leima.is/wiki/algorithms/data-structure-graph/Tue, 27 Mar 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/algorithms/data-structure-graph/mind the data structure: here comes the graphThe Python Language: Performancehttps://datumorphism.leima.is/wiki/programming-languages/python/performance/Tue, 20 Mar 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/programming-languages/python/performance/Read the references for performance.
The message:
Use comprehensions Use generatorsMDL and Neural Networkshttps://datumorphism.leima.is/wiki/model-selection/mdl-and-neural-networks/Sun, 14 Feb 2021 00:00:00 +0000https://datumorphism.leima.is/wiki/model-selection/mdl-and-neural-networks/Minimum Description Length ( MDL Minimum Description Length MDL is a measure of how well a model compresses data by minimizing the combined cost of the description of the model and the misfit. ) can be used to construct a concise network. A fully connected network has great expressing power but it is easily overfitting.
One strategy is to apply constraints to the networks:
Limit the connections; Shared weights in subgroups of the network; Constrain the weights using some probability distributions.Histogramhttps://datumorphism.leima.is/wiki/data-visualization/histogram/Tue, 20 Aug 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/data-visualization/histogram/Suppose we check out the burger prices at the stores of Han im Glück, we get a list of numbers. We can arrange the numbers into bins of prices. For example, we can count the number stores that have a price between 10 to 11 euros.Mann-Whitney U Testhttps://datumorphism.leima.is/wiki/statistical-hypothesis-testing/mann-whitney-u-test/Sun, 20 Jan 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/statistical-hypothesis-testing/mann-whitney-u-test/Mann-Whitney U is good at testing heavy-tailed data.Basics of SQLhttps://datumorphism.leima.is/wiki/computation/basics-of-sql/Mon, 19 Nov 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/computation/basics-of-sql/Adding a new field to data:
Relational: requires a new column Non-Relational: just add the field to one single document, thus can be easily decentralized. Basics and Background SQL: Structured Query Language
Relational Database:
usually in tables rows are called records columns are certain types of data. Data types of rows are specified: INTEGER TEXT DATE REAL, real numbers NULL … RDBMS: Relational Database Management System, most RDBMS use SQL as the query language.Normalization Methods for Numeric Datahttps://datumorphism.leima.is/wiki/statistics/normalization-methods/Sun, 18 Nov 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/statistics/normalization-methods/Normalization of data is critical for statistical analysis and feature engineering.
Min-max Normalization This method is linear and straightforward.
Suppose we are analyzing series A, with elements $a_i$. We already know the min and max of the series, $a_{min}$ and $a_{max}$.
Now we would like to normalize the series to be within the range $[a_{min}', a_{max}']$. We simply solve the value of $a’ _ i$ in $$ \frac{(a’_i - a_{min}')}{ ( a’_{max} - a’_{min} ) } = \frac{(a_i - a_{min})}{ ( a_{max} - a_{min} ) }, $$ where everything on the right hand side is known and $a_{min}‘$ and $a_{max}‘$ are chosen as the new min and max to be scaled to.Optimizationhttps://datumorphism.leima.is/wiki/nodecrawler/optimization/Thu, 19 Jul 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/nodecrawler/optimization/In this article, we will be optimizing the crawler to get better performance.
Batch Jobs In the article about using MongoDB as data storage, we wrote the data to the database whenever we got it. In practice, this is not efficient at all. This is where batch jobs come in: it is much better to write to the database in batches.
If you recall, the code we used to write to database isGithttps://datumorphism.leima.is/wiki/tools/git/Wed, 22 Jun 2016 00:00:00 +0000https://datumorphism.leima.is/wiki/tools/git/Git Services GitHub Bitbucket GitLab Using Git with GUI There are huge amounts of git commands! There are also a lot of GUIs if you don’t like command line.
GitHub Desktop GitKraken SourceTree … Useful Commands To check all the commits related to a file, use git log -u. Try out git log -g before determining which reflog to deal with. To compare the changes with the last commit, use git diff --cached HEAD~1.Some ML Workflow Frameworkshttps://datumorphism.leima.is/wiki/tools/ml-flow-frameworks/Wed, 13 Jan 2021 00:00:00 +0000https://datumorphism.leima.is/wiki/tools/ml-flow-frameworks/Metaflow Docs
A framework for jupyter notebook data scientists.
Work locally on notebooks. Python environment management using conda. Work in the cloud with Sagemaker. Tasks Methods Comments Code Scripts/Jupyter Notebook Datastore local + S3 metaflow.S3 Compute local + AWS Batch Metadata metaflow service Metadata specifies flow executions: Flows, Runs, Steps, Tasks, and Artifacts. Scheduling AWS Step Functions Deployment AWS Demo from metaflow import FlowSpec, step class BranchFlow(FlowSpec): @step def start(self): self.Boxplothttps://datumorphism.leima.is/wiki/data-visualization/boxplots/Tue, 20 Aug 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/data-visualization/boxplots/Example The Whiskers in Boxplot They are the outlier data points.
Outliers are determined using the interquartile range (IQR, i.e., the 25th percentile to the 75th percentile). We usually use the lowest data point within 1.5 IQR below the 25th percentile or the highest data point within 1.5 IQR above the 75th percentile.Linear Regressionhttps://datumorphism.leima.is/wiki/statistics/linear-regression/Tue, 01 Jan 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/statistics/linear-regression/In this article, we will use the Einstein summation convention. For example, $$ X_{ij}\beta_ j $$ is equivalent to $$ \sum_j X_{ij}\beta_ j $$ In statistics, we have at least three categories of quantities:
data and labels abstract theoretical quantities parameters and predictions of models The convention is that quantities with $\hat {}$ are the model quantities. Sometimes we do not distinguish the abstract theoretical quantities and model quantities.Basics of MongoDBhttps://datumorphism.leima.is/wiki/computation/basics-of-mongodb/Wed, 03 Oct 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/computation/basics-of-mongodb/This MongoDB Cheatsheet is my best friend.
MongoDB Concepts Documents Collections: just like tables in SQL. Database MongoShell Some examples:
// show the databases
show dbs
// show collections
show collections
// set any database to current database
use database_name
// insert entry
db.database_name.insert( an_object_2_be_the_entry )
// read document
db.database_name.findOne({'some_field':'value_of_field'})
db.database_name.find()
// prettify
db.database_name.find().pretty()
Describing Multi-dimensional Datahttps://datumorphism.leima.is/wiki/statistics/multidimensional-data/Mon, 03 Dec 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/statistics/multidimensional-data/Descriptions of Multidimensional Data Dispersion Matrix As defined in Correlation Coefficient and Covariance for Numeric Data, covariance is about the variance of two series. This property makes it easy to generalize it to multidimensional data.
The generalized quantity is named as dispersion matrix. Suppose we have a $p$ dimensional dataset $X$,
index $x_1$ $x_2$ … $x_p$ 1 2.3 12.3 83.2 9.3 … … … … … N 3.Data Storagehttps://datumorphism.leima.is/wiki/data-engeering-for-data-scientist/data-storage/Wed, 05 May 2021 00:00:00 +0000https://datumorphism.leima.is/wiki/data-engeering-for-data-scientist/data-storage/Many of the examples are from the book by Andreas Kretz. Find the link to the book in the references section. Types of Storage and Data Here is a list 1
Files S3 Message Queues Kinesis Relational DB MySQL Postgres Non-relational DB Document Store MongoDB DocumentDB Key-Value Store HBase Redis Kretz2019 The Data Engineering Cookbook ↩︎Comparison of MLOps Frameworkshttps://datumorphism.leima.is/wiki/mlops/comparison-of-frameworks/Wed, 05 May 2021 00:00:00 +0000https://datumorphism.leima.is/wiki/mlops/comparison-of-frameworks/Metaflow Docs
A framework for jupyter notebook data scientists.
Work locally on notebooks. Python environment management using conda. Work in the cloud with Sagemaker. Tasks Methods Comments Code Scripts/Jupyter Notebook Datastore local + S3 metaflow.S3 Compute local + AWS Batch Metadata metaflow service Metadata specifies flow executions: Flows, Runs, Steps, Tasks, and Artifacts. Scheduling AWS Step Functions Deployment AWS DemoData Processinghttps://datumorphism.leima.is/wiki/data-engeering-for-data-scientist/data-processing/Wed, 05 May 2021 00:00:00 +0000https://datumorphism.leima.is/wiki/data-engeering-for-data-scientist/data-processing/Many of the examples are from the book by Andreas Kretz. Find the link to the book in the references section. Batch Process Kretz recommends starting from batch processing and moving to streaming if needed 1.
Stream Process Three methods to stream data
At Least Once: message gets processed once or multiple times never dropped e.g., time-based GPS data in fleet management, if the stream data has the same timestamp, then we just override the existing data, we do not care how many times the data is being processed or streamed.Signal Processinghttps://datumorphism.leima.is/wiki/algorithms/singal-processing/Tue, 20 Mar 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/algorithms/singal-processing/There are many fascinating ideas in signal processing.Signal Processing: Audio Basicshttps://datumorphism.leima.is/wiki/algorithms/signal-processing-audio/Thu, 29 Mar 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/algorithms/signal-processing-audio/Keywords Harmonic structure of sound Parson code of music Linear time-invariant theory Autocorrelation Noise Chirps DCT compression Discrete Fourier transform filtering convolution Linear Time-Invariant System We describe the system with $Y(t) = f(X(t))$, where $X(t)$ is the input, and $Y(t)$ is the output.
Linear: $f(a X_1(t) + b X_2(t)) = a f(X_1(t)) + b f(X_2(t))$ Time-invariant: input $X(t+\Delta t)$ will produce the shifted signal $Y(t+\Delta t)$. LTI systems are memory systems, causal, real, and stable.Basics of MapReducehttps://datumorphism.leima.is/wiki/algorithms/map-reduce/Wed, 03 Oct 2018 00:00:00 +0000https://datumorphism.leima.is/wiki/algorithms/map-reduce/Centralized servers are not efficient for big data. Querying and processing data on centralized servers would hit the bottleneck of the servers.
MapReduce is used to solve these problems of big data. The two videos are .
Map: take a series of key-value pairs and divide them into groups. Reduce: recombine the key-value pairs Check out the code challenges of MapReduce on HackerRank.Scale Uphttps://datumorphism.leima.is/wiki/data-engeering-for-data-scientist/scale-up/Wed, 05 May 2021 00:00:00 +0000https://datumorphism.leima.is/wiki/data-engeering-for-data-scientist/scale-up/Many of the examples are from the book by Andreas Kretz. Find the link to the book in the references section. Scaling Up Storage Scaling Up SQL DB SAN: Storage Area Network Use multiple servers on the DB storage to make the query faster.
Good for Read-only DB Not convenient to update DB Hadoop Hadoop:
Distributed storage Analysis 4 core modules:
Hadoop common background functionalities HDFS Divide into blocks Distribute MapReduce Old tech YARN Resource management The Hadoop Ecosystem:The log-sum-exp Trickhttps://datumorphism.leima.is/cards/machine-learning/neural-networks/log-sum-exp-trick/Wed, 28 Jul 2021 00:00:00 +0000https://datumorphism.leima.is/cards/machine-learning/neural-networks/log-sum-exp-trick/The cross entropy for a binary class is
$$ -\left[ p \ln \hat p + (1-p) \ln (1-\hat p) \right], $$
where $p$ is the probability of the label A and $\hat p$ is the predicted probability of label A. Since we have binary classes, $p$ is either 1 or 0. However, the predicted probability can take any value in $[0,1]$.
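As a minimal sketch of the binary cross entropy described above (with the conventional leading minus sign, so that the loss is non-negative), the following evaluates it for a single label. The function name `binary_cross_entropy` and the clipping constant `eps` are illustrative assumptions, not part of the original card.

```python
import math


def binary_cross_entropy(p, p_hat, eps=1e-12):
    """Binary cross entropy for a true label p in {0, 1} and a prediction p_hat in [0, 1].

    eps is a hypothetical clipping constant used only for this illustration.
    """
    # Clip the prediction away from 0 and 1 so that log() never receives 0.
    p_hat = min(max(p_hat, eps), 1.0 - eps)
    return -(p * math.log(p_hat) + (1 - p) * math.log(1.0 - p_hat))


# A confident and correct prediction gives a small loss.
print(binary_cross_entropy(1, 0.9))
# A confident and wrong prediction gives a large but finite loss thanks to clipping.
print(binary_cross_entropy(1, 0.0))
```

Without the clipping, a prediction of exactly 0 or 1 would send the logarithm to minus infinity, which is the kind of numerical issue that motivates tricks such as log-sum-exp.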
Probability
For a very simple case, $\hat p$ might be a sigmoid like expression with exponential in it,Managing path using pathlib in Pythonhttps://datumorphism.leima.is/til/programming/python/python-managing-paths-using-pathlib-is-easier/Thu, 15 Jul 2021 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-managing-paths-using-pathlib-is-easier/Since Python 3.4
pathlib is object-oriented and more elegant than os.path. For example, to get the parent folder of the current file with os.path, we need os.path.dirname(),
import os

print(f"file: {__file__}")
# file: main.py

# Using os.path
os__file_absolute_path = os.path.abspath(__file__)
print(f"Using os.path:: file absolute path: {os__file_absolute_path}")
# Using os.path:: file absolute path: /home/runner/pathlib/main.py

os__file_in_folder = os.path.dirname(os__file_absolute_path)
print(f"Using os.path:: file is in folder: {os__file_in_folder}")
# Using os.path:: file is in folder: /home/runner/pathlib

It is much easier to get the folder using pathlib.Box-Cox Transformationhttps://datumorphism.leima.is/cards/statistics/box-cox/Tue, 13 Jul 2021 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/box-cox/Box-Cox transformation is a power transformation that involves logs and powers. It transforms data so that it is approximately normally distributed.
The Box-Cox transformation is defined as
$$ y_i^{(\lambda)} = \begin{cases} \lambda ^{-1} (y_i^\lambda - 1) & \quad \text{if } \lambda \neq 0\\ \log(y_i) & \quad \text{if } \lambda = 0. \end{cases} $$
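The piecewise definition above can be sketched in a few lines of Python. The function name `box_cox` is a hypothetical helper for illustration, and the sketch assumes strictly positive inputs, since the logarithm and fractional powers are otherwise undefined.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a single positive value y with parameter lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        # The lambda = 0 branch is the logarithm.
        return math.log(y)
    # The lambda != 0 branch is the shifted and scaled power.
    return (y ** lam - 1.0) / lam


# The power branch approaches the log branch as lam goes to 0.
print(box_cox(2.0, 1e-6), box_cox(2.0, 0))
```

The two branches agree in the limit $\lambda \to 0$, which is why the logarithm is the natural choice for $\lambda = 0$.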
By selecting a proper $\lambda$, we get Gaussian-distributed data with a variable mean. The transformation takes $y$ to
$$ \rho(y^{(\lambda)}) =\frac{ \exp{\left( -(y^{(\lambda)} - \beta X)^{T} (y^{(\lambda)} - \beta X)/(2\sigma^2) \right) }}{(\sqrt{2\pi \sigma^2})^n} \prod_{i=1}^n \left\lvert \frac{d y_i^{(\lambda )}}{ dy_i } \right\rvert.The Hubbard-Stratonovich Identityhttps://datumorphism.leima.is/cards/math/hubbard-stratonovich-identity/Thu, 17 Jun 2021 00:00:00 +0000https://datumorphism.leima.is/cards/math/hubbard-stratonovich-identity/The Hubbard version of the Hubbard-Stratonovich identity is1
$$ \begin{align} \exp{\left( a^2 \right)} =& \frac{1}{\sqrt{\pi}} \int_{-\infty}^\infty \mathrm dx\, \exp{ \left( - x^2 - 2 a x \right)}\\ =& \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} \mathrm dx'\, \exp{ \left( - x'^2 + 2 a x' \right)}, \end{align} $$
where we changed the sign of $x$, i.e., $x' = -x$. The sign from $\mathrm dx = -\mathrm dx'$ is absorbed by swapping the integration limits.
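As a quick numerical sanity check of the identity $\exp(a^2) = \pi^{-1/2}\int \mathrm dx\, \exp(-x^2 - 2ax)$, the sketch below approximates the integral with a midpoint rule on a finite window. The function name `hs_rhs`, the window half-width, and the number of grid points are assumptions made for this illustration only.

```python
import math


def hs_rhs(a, half_width=12.0, n=20_000):
    """Midpoint-rule estimate of (1/sqrt(pi)) * integral of exp(-x^2 - 2 a x) dx."""
    # The integrand peaks at x = -a, so centre the finite window there.
    lo = -a - half_width
    dx = 2.0 * half_width / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        total += math.exp(-x * x - 2.0 * a * x)
    return total * dx / math.sqrt(math.pi)


for a in (0.0, 0.5, 1.5):
    # Left-hand side exp(a^2) next to the numerical right-hand side.
    print(a, math.exp(a * a), hs_rhs(a))
```

The two printed columns should agree closely, because the integrand is a shifted Gaussian that decays very fast away from $x=-a$.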
Many partition functions contain expressions like $\exp{\left( a^2/2\right)}$. Using the identity, we have
$$ \begin{align} \exp{\left( \frac{a^2}{2} \right)} =& \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} \mathrm dx\, \exp{ \left( - x^2 + \sqrt{2} a x \right)} \\ =& \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \mathrm dx'\, \exp{ \left( - \frac{x'^2}{2} + a x' \right)}, \end{align} $$Likelihoodhttps://datumorphism.leima.is/cards/statistics/likelihood/Wed, 26 May 2021 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/likelihood/For some data points $\{x_i\}$ and a model $\theta$, the likelihood of our data point $x_i$ is $p(x_i\mid \theta)$. To be more specific, the likelihood of all data points is a function of the model $\theta$,
$$ L(\theta) = \prod_i p(x_i\mid\theta). $$
It should be mentioned that this likelihood is not necessarily a pdf. As an example, we can calculate the likelihood of a Bernoulli distribution for a single event $x$,Gaussian Integralshttps://datumorphism.leima.is/cards/math/gaussian-integrals/Tue, 11 May 2021 00:00:00 +0000https://datumorphism.leima.is/cards/math/gaussian-integrals/The diagonalized case
$$ \begin{eqnarray} Z_0 &=& \int d^n z \exp\left(-\frac{1}{2} z^\mathrm{T} D z\right) \\ &=& \prod_i \int d z_i \exp\left(-\frac{1}{2} \lambda_i z_i^2\right) \\ &=& \prod_i \sqrt{\frac{2\pi}{\lambda_i}} \\ &=& \sqrt{\frac{(2\pi)^n}{\det A}}. \end{eqnarray} $$
For an arbitrary matrix $A$,
$$ Z_J = \int d^n x \exp\left(-\frac{1}{2} x^\mathrm{T} A x + J^\mathrm{T} x\right). $$
$$ \begin{eqnarray} Z_J &=& \int d^n y \exp\left(-\frac{1}{2} {y}^\mathrm{T} A y + \frac{1}{2} J^\mathrm{T}A^{-1}J\right) \\ &=& \sqrt{\frac{(2\pi)^n}{\det A}} \exp\left(\frac{1}{2} J^\mathrm{T}A^{-1}J\right).Cross Validationhttps://datumorphism.leima.is/cards/machine-learning/learning-theories/cross-validation/Thu, 06 May 2021 00:00:00 +0000https://datumorphism.leima.is/cards/machine-learning/learning-theories/cross-validation/Cross validation is a method to estimate the risk The Learning Problem The learning problem posed by Vapnik:1 Given a sample: $\{z_i\}$ in the probability space $Z$; Assuming a probability measure on the probability space $Z$; Assuming a set of functions $Q(z, \alpha)$ (e.g. loss functions), where $\alpha$ is a set of parameters; A risk functional to be minimized by tuning “the handles” $\alpha$, $R(\alpha)$. The risk functional is $$ R(\alpha) = \int Q(z, \alpha) \,\mathrm d F(z).The Learning Problemhttps://datumorphism.leima.is/cards/machine-learning/learning-theories/learning-problem/Thu, 06 May 2021 00:00:00 +0000https://datumorphism.leima.is/cards/machine-learning/learning-theories/learning-problem/The learning problem posed by Vapnik:1
Given a sample: $\{z_i\}$ in the probability space $Z$; Assuming a probability measure on the probability space $Z$; Assuming a set of functions $Q(z, \alpha)$ (e.g. loss functions), where $\alpha$ is a set of parameters; A risk functional to be minimized by tuning “the handles” $\alpha$, $R(\alpha)$. The risk functional is
$$ R(\alpha) = \int Q(z, \alpha) \,\mathrm d F(z). $$
A learning problem is the minimization of this risk.Explained Variationhttps://datumorphism.leima.is/cards/statistics/explained-variation/Wed, 05 May 2021 18:05:47 +0200https://datumorphism.leima.is/cards/statistics/explained-variation/Using Fraser information Fraser Information The Fraser information is $$ I_F(\theta) = \int g(X) \ln f(X;\theta) \, \mathrm d X. $$ When comparing two models, $\theta_0$ and $\theta_1$, the information gain is $$ \propto (I_F(\theta_1) - I_F(\theta_0)). $$ The Fraser information is closely related to Fisher information Fisher Information Fisher information measures the second moment of the model sensitivity with respect to the parameters. , Shannon information, and Kullback information KL Divergence Kullback–Leibler divergence indicates the … , we can define a relative information gain by a modelFraser Informationhttps://datumorphism.leima.is/cards/information/fraser-information/Wed, 05 May 2021 17:49:12 +0200https://datumorphism.leima.is/cards/information/fraser-information/The Fraser information is
$$ I_F(\theta) = \int g(X) \ln f(X;\theta) \, \mathrm d X. $$
When comparing two models, $\theta_0$ and $\theta_1$, the information gain is
$$ \propto (I_F(\theta_1) - I_F(\theta_0)). $$
The Fraser information is closely related to Fisher information Fisher Information Fisher information measures the second moment of the model sensitivity with respect to the parameters. , Shannon information, and Kullback information KL Divergence Kullback–Leibler divergence indicates the differences between two distributions 1.Fisher Informationhttps://datumorphism.leima.is/cards/information/fisher-information/Wed, 05 May 2021 17:49:03 +0200https://datumorphism.leima.is/cards/information/fisher-information/Given a probability density model $f(X; \theta)$ for an observable $X$, the amount of information that $X$ carries regarding the model is called Fisher information.
Given ${\theta}$, the probability of observing the value $X$, i.e., the likelihood is
$$ f(X\mid\theta). $$
To describe the suitability of a model and the observables, we can use the likelihood $f(X\mid \theta)$. One particularly interesting property is the sensitivity of the likelihood to changes in the parameter $\theta$.Evidence Lower Bound: ELBOhttps://datumorphism.leima.is/wiki/machine-learning/bayesian/elbo/Mon, 12 Apr 2021 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/bayesian/elbo/This article reuses a lot of materials from the references. Please see the references for more details on ELBO. Given a probability distribution density $p(X)$ and a latent variable $Z$, we have the marginalization of the joint probability being
$$ \int dZ p(X, Z) = p(X). $$
Using Jensen’s Inequality In many models, we are interested in the log probability density $\log p(X)$ which can be decomposed using an auxiliary density of the latent variable $q(Z)$,Jensen's Inequalityhttps://datumorphism.leima.is/cards/math/jensens-inequality/Mon, 12 Apr 2021 00:00:00 +0000https://datumorphism.leima.is/cards/math/jensens-inequality/Jensen’s inequality shows that
$$ f(\mathbb E(X)) \leq \mathbb E(f(X)) $$
for a convex function $f(\cdot)$.Valid Confidence Sets in Multiclass and Multilabel Predictionhttps://datumorphism.leima.is/wiki/machine-learning/classification/valid-confidence-sets-in-multiclass-multilabel-prediction/Thu, 08 Apr 2021 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/classification/valid-confidence-sets-in-multiclass-multilabel-prediction/Ask for valid confidence:
“Valid”: validate for test data, train data, or the generating process? “Confidence”: $P(Y \notin C(X)) \le \alpha$ To avoid too much attention on data based validation, a framework called conformal inference was proposed by Vovk et al. in 2005,
Given $n$ observations and a desired confidence level $1-\alpha$, construct confidence sets $C(x)$ using conformal methods so that the sets capture the underlying distribution: for a new pair $(X_{n+1}, Y_{n+1})$ from the same distribution, $P(Y_{n+1}\in C(X_{n+1})) \ge 1-\alpha$KL Divergencehttps://datumorphism.leima.is/wiki/machine-learning/basics/kl-divergence/Mon, 05 Apr 2021 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/basics/kl-divergence/Given two distributions $p(x)$ and $q(x)$, the Kullback-Leibler divergence is defined as
$$ D_\text{KL}(p(x) \parallel q(x) ) = \int_{-\infty}^\infty p(x) \log\left(\frac{p(x)}{q(x)}\right)\, dx = \mathbb E_{p(x)} \left[\log\left(\frac{p(x)}{q(x)}\right) \right]. $$
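For discrete distributions the integral becomes a sum, which makes the definition easy to try out. The following is a minimal sketch under that assumption; the function name `kl_divergence` and the example distributions are made up for illustration.

```python
import math


def kl_divergence(p, q):
    """Discrete KL divergence sum_i p_i * log(p_i / q_i), in nats."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            # The term 0 * log(0 / q_i) is taken to be 0 by convention.
            continue
        if q_i == 0.0:
            # p puts mass where q has none, so the divergence is infinite.
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


p = [0.5, 0.5]
q = [0.9, 0.1]
# The divergence is not symmetric in its two arguments.
print(kl_divergence(p, q), kl_divergence(q, p))
```

The example prints two different values, which illustrates that the KL divergence is not a distance in the metric sense.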
Connection to Entropy
Notice that this expression is quite similar to entropy,
$$ H(p(x)) = - \int_{-\infty}^{\infty} p(x) \log p(x) \, dx. $$
The entropy gives the lower bound on the number of bits (if we use $\log_2$) needed to encode the information.Hierarchical Classificationhttps://datumorphism.leima.is/wiki/machine-learning/classification/hierarchical-classification/Tue, 30 Mar 2021 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/classification/hierarchical-classification/Hierarchical Classification Problem Hierarchical classification involves hierarchical class labels. The hierarchical class labels may be predefined or inferred. 1
Class Taxonomy A hierarchical classification problem comes with a class taxonomy.
“IS-A” operator: $\prec$, “IS-NOT-A” operator: $\nprec$ An IS-A relationship among the labels $c_a$ of the class set $C$ satisfies:
one root $R$ in the tree, asymmetric, i.e., $c_i \prec c_j$ and $c_j\prec c_i$ can not be both true, anti-reflexive, i.Classifier Chains for Multilabel Classificationhttps://datumorphism.leima.is/wiki/machine-learning/classification/classifier-chains/Wed, 24 Mar 2021 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/classification/classifier-chains/Multi-label problem In some classification problems, we have multilabel labels to be predicted. Many different approaches are proposed to solve such problems.
Algorithm Level Develop algorithms for multilabel problems, such as
Decision trees, AdaBoost. Problem Transformation On problem or data level, we can transform the multi-label problem to one or more single label problems.
Binary Relevance Method Binary relevance method, aka BM, transforms the problem into single-label problems by training a binary classifier for each label.Binning Data Values using Pandashttps://datumorphism.leima.is/til/programming/pandas/pandas-binning-values/Wed, 10 Mar 2021 00:00:00 +0000https://datumorphism.leima.is/til/programming/pandas/pandas-binning-values/Use the pd.cut function. By default, the bins are right-closed segments of the form (a, b]. The official documentation comes with detailed examples.
If pandas is not an option, one could use numpy.digitize to find which bin the elements belong to.Deal with Rare Categories Using Pandashttps://datumorphism.leima.is/til/data/deal-with-rare-categories-using-pandas/Wed, 10 Mar 2021 00:00:00 +0000https://datumorphism.leima.is/til/data/deal-with-rare-categories-using-pandas/We will illustrate how to deal with rare categories using pandas mask.
import pandas as pd ############# # Create fake names frequent_names = list('ABC') rare_names = list('DEF') dataset = sum( [[i]*10 for i in frequent_names] + [[i]*2 for i in rare_names], [] ) # Create a series based on the names series = pd.Series(dataset) print(series) # Find the counts of the names in the series series_counts = series.value_counts() print(series_counts) # Find names that has less than 10 counts # And create a mask mask = series.ANOVAhttps://datumorphism.leima.is/wiki/statistics/anova/Sun, 07 Mar 2021 00:00:00 +0000https://datumorphism.leima.is/wiki/statistics/anova/In many problems, we have to test if several distributions associated with several groups of experiments are the same. The null hypothesis to be used is
The distributions of several groups are the same.
ANOVA tests the null hypothesis by comparing the variability between groups and within groups. If the variability between groups is significantly larger than the variability within groups, we are more confident that the distributions of different groups are different.McCulloch-Pitts Modelhttps://datumorphism.leima.is/cards/machine-learning/neural-networks/mcculloch-pitts-model/Thu, 25 Feb 2021 00:00:00 +0000https://datumorphism.leima.is/cards/machine-learning/neural-networks/mcculloch-pitts-model/The McCulloch-Pitts model maps the input $\{x_1, x_2,\cdots, x_i, \cdots, x_N \}$ into a scalar $y\in\{1,-1\}$,
$$ y = \operatorname{sign}( w\cdot x - b). $$
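A minimal sketch of this neuron in Python follows. The function name `mcculloch_pitts` is illustrative, and mapping $\operatorname{sign}(0)$ to $+1$ is an assumption made here so that the output always lies in $\{1,-1\}$.

```python
def mcculloch_pitts(x, w, b):
    """Return sign(w . x - b) as +1 or -1 for input x, weights w and shift b."""
    activation = sum(w_i * x_i for w_i, x_i in zip(w, x)) - b
    # Assumption for this sketch: a point exactly on the hyperplane maps to +1.
    return 1 if activation >= 0 else -1


# With w = (1, 1) and b = 1.5 the neuron acts like a logical AND on {0, 1} inputs.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, mcculloch_pitts(x, w=(1.0, 1.0), b=1.5))
```

Only the input (1, 1) lies on the positive side of the line $x_1 + x_2 = 1.5$, so only it is mapped to $+1$.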
Since $w\cdot x - b = 0$ is a hyperplane, the McCulloch-Pitts model separates the state space using this hyperplane. The shift $b$ determines the intercept, and $w$ decides the slope.Rosenblatt's Perceptronhttps://datumorphism.leima.is/cards/machine-learning/neural-networks/rosenblatt-perceptron/Thu, 25 Feb 2021 00:00:00 +0000https://datumorphism.leima.is/cards/machine-learning/neural-networks/rosenblatt-perceptron/Rosenblatt’s perceptron connects McCulloch-Pitts neurons in levels.
Rosenblatt proposed that we fix all the weights except those of the last neuron, which are left free.
All layers except the last one are used as a transformation of the input data $\{x_1, \cdots, x_i, \cdots, x_N\}$ into a new space $\{z_1, \cdots, z_i, \cdots, z_{N'}\}$. The classification is done on the $\{z_1, \cdots, z_i, \cdots, z_{N'}\}$ space by tuning the last neuron.ERM: Empirical Risk Minimizationhttps://datumorphism.leima.is/cards/machine-learning/learning-theories/empirical-risk-minimization/Thu, 18 Feb 2021 00:00:00 +0000https://datumorphism.leima.is/cards/machine-learning/learning-theories/empirical-risk-minimization/In a learning problem The Learning Problem The learning problem posed by Vapnik:1 Given a sample: $\{z_i\}$ in the probability space $Z$; Assuming a probability measure on the probability space $Z$; Assuming a set of functions $Q(z, \alpha)$ (e.g. loss functions), where $\alpha$ is a set of parameters; A risk functional to be minimized by tuning “the handles” $\alpha$, $R(\alpha)$. The risk functional is $$ R(\alpha) = \int Q(z, \alpha) \,\mathrm d F(z).SRM: Structural Risk Minimizationhttps://datumorphism.leima.is/cards/machine-learning/learning-theories/structural-risk-minimization/Thu, 18 Feb 2021 00:00:00 +0000https://datumorphism.leima.is/cards/machine-learning/learning-theories/structural-risk-minimization/ERM ERM: Empirical Risk Minimization In a learning problem The Learning Problem The learning problem posed by Vapnik:1 Given a sample: $\{z_i\}$ in the probability space $Z$; Assuming a probability measure on the probability space $Z$; Assuming a set of functions $Q(z, \alpha)$ (e.g. loss functions), where $\alpha$ is a set of parameters; A risk functional to be minimized by tuning “the handles” $\alpha$, $R(\alpha)$. The risk functional is $$ R(\alpha) = \int Q(z, \alpha) \,\mathrm d F(z).Coding Theory Conceptshttps://datumorphism.leima.is/cards/information/coding-theory-concepts/Wed, 17 Feb 2021 00:00:00 +0000https://datumorphism.leima.is/cards/information/coding-theory-concepts/The code function produces code words. 
The expected length of the code word is limited by the entropy from the source probability $p$.
The Shannon information content, aka self-information, is described by
$$ - \log_2 p(x=a), $$
for the case that $x=a$.
The Shannon entropy is the expected information content for the whole sequence with probability distribution $p(x)$,
$$ \mathcal H = - \sum_{x\in X} p(x) \log_2 p(x). $$
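The two quantities above, the self-information and its expectation, can be sketched for a discrete source as follows. The function names `self_information` and `shannon_entropy` are illustrative helpers, not names from the original card.

```python
import math


def self_information(p):
    """Shannon information content -log2(p) of an outcome with probability p, in bits."""
    return -math.log2(p)


def shannon_entropy(probs):
    """Expected information content of a discrete distribution, in bits."""
    # Outcomes with zero probability contribute nothing to the sum.
    return sum(p * self_information(p) for p in probs if p > 0)


# A fair coin carries one bit per toss; a biased coin carries less.
print(shannon_entropy([0.5, 0.5]))
print(shannon_entropy([0.9, 0.1]))
```

Rare outcomes have high self-information, but they are weighted by their small probability, so a heavily biased source has a low entropy.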
The Shannon source coding theorem says that $N$ samples from the source can be compressed into roughly $N\mathcal H$ bits.Empirical Losshttps://datumorphism.leima.is/cards/machine-learning/measurement/empirical-loss/Sat, 06 Feb 2021 00:00:00 +0000https://datumorphism.leima.is/cards/machine-learning/measurement/empirical-loss/Given a dataset with records $\{x_i, y_i\}$ and a model $\hat y_i = f(x_i)$, the empirical loss is calculated on all the records
$$ \begin{align} \mathcal L_{E} = \frac{1}{n} \sum_{i=1}^n d(y_i, f(x_i)), \end{align} $$
where $d(y_i, f(x_i))$ is the distance defined between $y_i$ and $f(x_i)$.Population Losshttps://datumorphism.leima.is/cards/machine-learning/measurement/population-loss/Sat, 06 Feb 2021 00:00:00 +0000https://datumorphism.leima.is/cards/machine-learning/measurement/population-loss/Given a dataset with records $\{x_i, y_i\}$ and a model $\hat y_i = f(x_i)$. Suppose we know the actual generating process of the dataset and the joint probability density distribution of all the data points is $p(x, y)$, the population loss is defined on the whole assumed population,
$$ \begin{align} \mathcal L_{P} = \mathop{\mathbb{E}}_{p(x,y)}[ d(y, f(x))], \end{align} $$
where $d(y, f(x))$ is the distance defined between $y$ and $f(x)$.Data File Formatshttps://datumorphism.leima.is/cards/machine-learning/datatypes/data-file-formats/Tue, 02 Feb 2021 00:00:00 +0000https://datumorphism.leima.is/cards/machine-learning/datatypes/data-file-formats/Data storage is diverse. For data on smaller scales, we are mostly dealing with some data files.
work_with_data_files
Efficiencies and Compressions Parquet Parquet is fast. But
Don’t use json or list of json as columns. Convert them to strings or binary objects if it is really needed.Machine as a Hologramhttps://datumorphism.leima.is/projects/hologram/Sun, 31 Jan 2021 00:00:00 +0000https://datumorphism.leima.is/projects/hologram/Tutorials on machine learning and data science productivity articlesLatent Variable Modelshttps://datumorphism.leima.is/wiki/machine-learning/bayesian/latent-variable-models/Wed, 27 Jan 2021 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/bayesian/latent-variable-models/In the view of statistics, we know everything about a physical system if we know the probability $p(\mathbf s)$ of all possible states of the physical system $\mathbf s$. Time can also be part of the state specification.
As an example, we will classify fruits into oranges and non-oranges. We will have the state vector $\mathbf s = (\text{is orange}, \text{texture } x)$. Our goal is to find the joint probability $p(\text{is orange}, x)$.Reparametrization in Expectation Samplinghttps://datumorphism.leima.is/cards/statistics/reparametrization-expectation-sampling/Wed, 20 Jan 2021 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/reparametrization-expectation-sampling/The expectation value of a function $f(z)$ over a Gaussian distribution $\mathscr N(z;\mu, \sigma)$ is equivalent to the expectation value of $f(\sigma z + \mu)$ over the standard Gaussian distribution $\mathscr N(z;\mu=0, \sigma=1)$, i.e.,
$$ {\mathbb E}_{\mathscr N(z; \mu, \sigma)} \left[ f(z) \right] = {\mathbb E}_{\mathscr N(z; 0, 1)} \left[ f(\sigma z + \mu) \right] $$
where
$$ \mathscr N(z; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left( -\frac{(z-\mu)^2}{2\sigma^2}\right). $$
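The identity can be checked with a small Monte Carlo experiment: sample $z$ directly from $\mathscr N(\mu, \sigma)$, or sample $z'$ from the standard Gaussian and evaluate $f(\sigma z' + \mu)$. This is only a sketch; the function names, the test function $f(z) = z^2$, and the values of $\mu$, $\sigma$ and the sample size are assumptions chosen for illustration.

```python
import random


def expectation_direct(f, mu, sigma, n, rng):
    """Average f(z) over n samples z drawn from N(mu, sigma)."""
    return sum(f(rng.gauss(mu, sigma)) for _ in range(n)) / n


def expectation_reparam(f, mu, sigma, n, rng):
    """Average f(sigma * z' + mu) over n samples z' drawn from N(0, 1)."""
    return sum(f(sigma * rng.gauss(0.0, 1.0) + mu) for _ in range(n)) / n


rng = random.Random(0)
mu, sigma = 1.0, 2.0
square = lambda z: z * z

# For f(z) = z^2 the exact expectation is mu^2 + sigma^2 = 5.
print(expectation_direct(square, mu, sigma, 100_000, rng))
print(expectation_reparam(square, mu, sigma, 100_000, rng))
```

Both estimates fluctuate around the same value. The reparametrized form is useful because the randomness no longer depends on $\mu$ and $\sigma$, which then enter only through the argument of $f$.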
$$ \begin{align} {\mathbb E}_{\mathscr N(z; \mu, \sigma)} \left[ f(z) \right] &= \int \mathrm d z \frac{1}{\sqrt{2\pi\sigma^2}}\exp \left( -\frac{(z-\mu)^2}{2\sigma^2}\right) f(z) \\ &= \int \mathrm dz \frac{1}{\sigma} \frac{1}{\sqrt{2\pi}} \exp \left( -\frac{1}{2} \left(\frac{z-\mu}{\sigma}\right)^2 \right) f(z) \\ &= \int \mathrm d \left( \sigma z' + \mu \right) \frac{1}{\sigma} \frac{1}{\sqrt{2\pi}} \exp \left( -\frac{1}{2} z'^2 \right) f(\sigma z' + \mu) \\ &= \int \mathrm d z' \frac{1}{\sqrt{2\pi}}\exp \left( -\frac{1}{2} z'^2 \right) f(\sigma z' + \mu) \\ &= \int \mathrm d z' \mathscr N(z'; \mu=0, \sigma=1) f(\sigma z' + \mu) \\ &= {\mathbb E}_{\mathscr N(z'; \mu=0, \sigma=1)} \left[ f(\sigma z' + \mu) \right] \end{align} $$Normalizing Flows: An Introduction and Review of Current Methodshttps://datumorphism.leima.is/reading/normalizing-flow-introduction-1908.09257/Sun, 17 Jan 2021 00:00:00 +0000https://datumorphism.leima.is/reading/normalizing-flow-introduction-1908.09257/To generate complicated distributions step by step from a simple and interpretable distribution.A Simple Machine Learning Project Frameworkhttps://datumorphism.leima.is/blog/data-science/a-simple-machine-learning-framework/Tue, 12 Jan 2021 00:00:00 +0000https://datumorphism.leima.is/blog/data-science/a-simple-machine-learning-framework/ A simple almost stateless machine learning frameworkBasics of Redishttps://datumorphism.leima.is/wiki/computation/basics-of-redis/Fri, 08 Jan 2021 00:00:00 +0000https://datumorphism.leima.is/wiki/computation/basics-of-redis/Basics Redis is:
NoSQL KeyValue In memory Data Structure Server binary safe strings lists, sets, sorted sets, hashes bitmaps, hyperloglogs Open source Redis is:
Fast Low CPU Requirement Scalable Redis can be used as:
Cache Analytics Leaderboard Queues Cookie storage Expiring data Messaging High I/O workloads API throttlings How to persist your data
Snapshot AOF: Append Only File Pros:Audiolization of Covid 19 Data in Europehttps://datumorphism.leima.is/blog/ruthless/audiolization-of-covid19-in-eu/Sun, 03 Jan 2021 00:00:00 +0000https://datumorphism.leima.is/blog/ruthless/audiolization-of-covid19-in-eu/Here is an audiolization sound track using a sample of covid19 data in Europe. The audio is the result of the audiorepr Python package I wrote.PREPhttps://datumorphism.leima.is/cards/communication/prep/Sun, 03 Jan 2021 00:00:00 +0000https://datumorphism.leima.is/cards/communication/prep/PREP PREP is a framework for making your point.
PREP: Point + Reason + Example + Point Point: Make a point; PREP is a good method. Reason: Give the reason; Because it has a clear logic. Example: Show examples; The famous XYZ did ABC then everyone was convinced. Point: State the point for a conclusion.SCQ-Ahttps://datumorphism.leima.is/cards/communication/scq-a/Sun, 03 Jan 2021 00:00:00 +0000https://datumorphism.leima.is/cards/communication/scq-a/SCQ-A SCQ-A: Situation + Conflict + Question + Answer SCQ-A is a framework for problem-solving.
Situation: background knowledge, set the stage Complications: what is happening Question: propose your hypothesis Answer: accept or reject the hypothesisWWHhttps://datumorphism.leima.is/cards/communication/wwh/Sun, 03 Jan 2021 00:00:00 +0000https://datumorphism.leima.is/cards/communication/wwh/WWH WWH: What (happened) + Why (this happened) + How (to improve)Graph Creationhttps://datumorphism.leima.is/reading/grammar-of-graphics/graph-creation/Tue, 29 Dec 2020 00:00:00 +0000https://datumorphism.leima.is/reading/grammar-of-graphics/graph-creation/Stages Three stages of making a graph:
Specification Assembly Display Specification Statistical graphic specifications are expressed in six statements
DATA: a set of data operations that create variables from datasets TRANS: variable transformations (e.g., rank) SCALE: scale transformations (e.g., log) COORD: a coordinate system (e.g., polar) ELEMENT: graphs (e.g., points) and their aesthetic attributes (e.g., color) GUIDE: one or more guides (axes, legends, etc.) Assembly Assembling a scene from a specification requires a variety of structures in order to index and link components with each other.Multiset, mset or baghttps://datumorphism.leima.is/cards/math/multiset-mset-bag/Sun, 27 Dec 2020 00:00:00 +0000https://datumorphism.leima.is/cards/math/multiset-mset-bag/A bag is a set in which duplicate elements are allowed.
An ordered bag is a list that we use in programming.Python Class Sequential Inheritancehttps://datumorphism.leima.is/til/programming/python/python-class-inheritance-sequential/Thu, 03 Dec 2020 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-class-inheritance-sequential/# An experiment on python super class Base: def __init__(self): print("Start A") print("End A") class IA(Base): def __init__(self): print("Start IA") super(IA, self).__init__() print("End IA") class IB(IA): def __init__(self): print("Start IB") super(IB, self).__init__() print("End IB") print("Experiment 1:") ib = IB()Three dots in Pythonhttps://datumorphism.leima.is/til/programming/python/python-three-dots/Thu, 03 Dec 2020 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-three-dots/Using three dots in Python:
from abc import abstractmethod class A: def __init__(self): self.name = "A" print("Init") def three_dots(self): ... @abstractmethod def abs_three_dots(self): ... def raise_it(self): raise Exception("Not yet done") a = A() print("\nthree_dots") print(a.three_dots()) print("\nabs_three_dots") print(a.abs_three_dots()) print("\nraise_it") a.raise_it() Returns
three_dots None abs_three_dots None raise_it Traceback (most recent call last): File "main.py", line 27, in <module> a.raise_it() File "main.py", line 14, in raise_it raise Exception("Not yet done") Exception: Not yet doneOrdered Member Functions of a Class in Pythonhttps://datumorphism.leima.is/til/programming/python/python-class-methods-ordered/Wed, 02 Dec 2020 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-class-methods-ordered/# References: # 1. https://stackoverflow.com/questions/48145317/can-i-add-attributes-to-class-methods-in-python from functools import wraps # Define a decorator def attributes(**attrs): """ Set attributes of member functions in a class. ``` class AGoodClass: def __init__(self): self.size = 0 @attributes(order=1) def first_good_member(self, new): return "first good member" @attributes(order=2) def second_good_member(self, new): return "second good member" ``` References: 1. https://stackoverflow.com/a/48146924/1477359 """ def decorator(f): @wraps(f) def wrapper(*args, **kwargs): return f(*args, **kwargs) for attr_name, attr_value in attrs.items(): setattr(wrapper, attr_name, attr_value) return wrapper return decorator class AGoodClass: def __init__(self): self.Postgres Optimization in JOINhttps://datumorphism.leima.is/til/data/postgres.join-begin-with-smallest-cardinality/Sat, 28 Nov 2020 11:39:21 +0100https://datumorphism.leima.is/til/data/postgres.join-begin-with-smallest-cardinality/Join tables together starting with the smallest table (table with less cardinality) speeds things up.Deal with NULL in Postgreshttps://datumorphism.leima.is/til/data/postgres.deal-with-null/Thu, 26 Nov 2020 00:00:00 +0000https://datumorphism.leima.is/til/data/postgres.deal-with-null/Please deal with null carefully.Akaike Information Criterionhttps://datumorphism.leima.is/cards/statistics/aic/Sun, 08 Nov 2020 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/aic/Suppose we have a model that describes the data generation 
process behind a dataset. The distribution by the model is denoted as $\hat f$. The actual data generation process is described by a distribution $f$.
We ask the question:
How good is the approximation using $\hat f$?
To be more precise, how much information is lost if we use our model dist $\hat f$ to substitute the actual data generation distribution $f$?Bayes Factorshttps://datumorphism.leima.is/cards/statistics/bayes-factors/Sun, 08 Nov 2020 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/bayes-factors/$$ \frac{p(\mathscr M_1|y)}{ p(\mathscr M_2|y) } = \frac{p(\mathscr M_1)}{ p(\mathscr M_2) }\frac{p(y|\mathscr M_1)}{ p(y|\mathscr M_2) } $$
Bayes factor
$$ \mathrm{BF_{12}} = \frac{m(y|\mathscr M_1)}{m(y|\mathscr M_2)} $$
$\mathrm{BF_{12}}$: how many times more likely model $\mathscr M_1$ is than $\mathscr M_2$.Bayesian Information Criterionhttps://datumorphism.leima.is/cards/statistics/bic/Sun, 08 Nov 2020 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/bic/BIC is the Bayesian information criterion; it replaces the $+2k$ term in AIC Akaike Information Criterion Suppose we have a model that describes the data generation process behind a dataset. The distribution by the model is denoted as $\hat f$. The actual data generation process is described by a distribution $f$. We ask the question: How good is the approximation using $\hat f$? To be more precise, how much information is lost if we use our model dist $\hat f$ to substitute the actual data generation distribution $f$?Fisher Information Approximationhttps://datumorphism.leima.is/cards/statistics/fia/Sun, 08 Nov 2020 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/fia/FIA is a method to describe the minimum description length ( MDL Minimum Description Length MDL is a measure of how well a model compresses data by minimizing the combined cost of the description of the model and the misfit. ) of models,
$$ \mathrm{FIA} = -\ln p(y | \hat\theta) + \frac{k}{2} \ln \frac{n}{2\pi} + \ln \int_\Theta \sqrt{ \operatorname{det}[I(\theta)] } \, d\theta $$
$I(\theta)$: Fisher information matrix of sample size 1.Kolmogorov Complexityhttps://datumorphism.leima.is/cards/statistics/kolmogorov-complexity/Sun, 08 Nov 2020 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/kolmogorov-complexity/Description of Data
The measurement of complexity is based on the observation that the compressibility of data doesn’t depend on the “language” used to describe the compression process that much. This makes it possible for us to find a universal language, such as a universal computer language, to quantify the compressibility of the data.
One intuitive idea is to use a programming language to describe the data. If we have a sequence of data,Minimum Description Lengthhttps://datumorphism.leima.is/cards/statistics/mdl/Sun, 08 Nov 2020 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/mdl/The minimum description length, aka, MDL, is based on the relations between regularity and data compression. (See Kolmogorov complexity Kolmogorov Complexity Description of Data The measurement of complexity is based on the observation that the compressibility of data doesn’t depend on the “language” used to describe the compression process that much. This makes it possible for us to find a universal language, such as a universal computer language, to quantify the compressibility of the data.Normalized Maximum Likelihoodhttps://datumorphism.leima.is/cards/statistics/nml/Sun, 08 Nov 2020 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/nml/$$ \mathrm{NML} = \frac{ p(y| \hat \theta(y)) }{ \int_X p( x| \hat \theta (x) ) dx } $$Experiments in Biologyhttps://datumorphism.leima.is/blog/ruthless/experiments-in-biology/Sun, 01 Nov 2020 00:00:00 +0000https://datumorphism.leima.is/blog/ruthless/experiments-in-biology/ Inspired by @hanlu.ioThe Science Part in Data Sciencehttps://datumorphism.leima.is/blog/ruthless/science-part-in-data-science/Sat, 31 Oct 2020 00:00:00 +0000https://datumorphism.leima.is/blog/ruthless/science-part-in-data-science/graph TD; s1(An Idea)--d1{Is this idea in the current literature?}; d1{Is this idea in the current literature?}--|Yes|b1(Fail); d1{Is this idea in the current literature?}--|No|b2[Weeks of work]; b2[Weeks of work]--b1(Fail);Conditional Probability Tablehttps://datumorphism.leima.is/cards/statistics/conditional-probability-table/Tue, 27 Oct 2020 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/conditional-probability-table/The conditional probability table, aka CPT, is used to calculate conditional probabilities from a dataset.
Given a dataset with features $\mathbf X$ and their corresponding classes $\mathbf Y$, the conditional probabilities of each class given a certain feature value can be calculated using a CPT, which in turn can be calculated using a contingency table.Pandas Groupby Does Not Guarantee Unique Content in Groupby Columnshttps://datumorphism.leima.is/til/machine-learning/pandas-groupby-caveats/Mon, 20 Apr 2020 00:00:00 +0000https://datumorphism.leima.is/til/machine-learning/pandas-groupby-caveats/Pandas Groupby Does Not Guarantee Unique Content in Groupby Columns, it also considers the datatypes. Dealing with mixed types requires additional attention.== and is in Pythonhttps://datumorphism.leima.is/til/programming/python/python-none/Wed, 01 Apr 2020 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-none/== and is are differentArcsine Distributionhttps://datumorphism.leima.is/cards/statistics/distributions/arcsine/Sat, 14 Mar 2020 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/distributions/arcsine/Arcsine Distribution The PDF is
$$ \frac{1}{\pi\sqrt{x(1-x)}} $$
for $x\in [0,1]$.
It can also be generalized to
$$ \frac{1}{\pi\sqrt{(x-a)(b-x)}} $$
for $x\in [a,b]$.
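As a quick sanity check of the generalized density, here is a minimal Python sketch. It assumes the density on $[a,b]$ is $1/(\pi\sqrt{(x-a)(b-x)})$ and uses the closed-form CDF $\frac{2}{\pi}\arcsin\sqrt{(x-a)/(b-a)}$; the function names `arcsine_pdf` and `arcsine_cdf` are illustrative and not from the original card.

```python
import math


def arcsine_pdf(x, a=0.0, b=1.0):
    """Density 1 / (pi * sqrt((x - a) * (b - x))) on the open interval (a, b)."""
    if not a < x < b:
        return 0.0
    return 1.0 / (math.pi * math.sqrt((x - a) * (b - x)))


def arcsine_cdf(x, a=0.0, b=1.0):
    """Closed-form CDF: (2 / pi) * arcsin(sqrt((x - a) / (b - a)))."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return 2.0 / math.pi * math.asin(math.sqrt((x - a) / (b - a)))


# The density is symmetric about the midpoint, so half the mass lies below it.
print(arcsine_cdf(3.5, a=2.0, b=5.0))
```

The density diverges at both endpoints, which is why the sketch returns 0 outside the open interval instead of evaluating the formula at $a$ or $b$.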
Bernoulli Distributionhttps://datumorphism.leima.is/cards/statistics/distributions/bernoulli/Sat, 14 Mar 2020 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/distributions/bernoulli/Two categories with probability $p$ and $1-p$ respectively.
For each experiment, the sample space is $\{A, B\}$. The probability for state $A$ is given by $p$ and the probability for state $B$ is given by $1-p$. The Bernoulli distribution describes the probability of $K$ results with state $s$ being $s=A$ and $N-K$ results with state $s$ being $B$ after $N$ experiments,
$$ P\left(\sum_i^N s_i = K \right) = C _ N^K p^K (1 - p)^{N-K}.Beta Distributionhttps://datumorphism.leima.is/cards/statistics/distributions/beta/Sat, 14 Mar 2020 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/distributions/beta/Beta Distribution
Binomial Distributionhttps://datumorphism.leima.is/cards/statistics/distributions/binomial/Sat, 14 Mar 2020 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/distributions/binomial/The number of successes in $n$ independent events where each trial has a success rate of $p$.
PMF:
$$ C_n^k p^k (1-p)^{n-k} $$Categorical Distributionhttps://datumorphism.leima.is/cards/statistics/distributions/categorical/Sat, 14 Mar 2020 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/distributions/categorical/By generalizing the Bernoulli distribution to $k$ states, we get a categorical distribution. The sample space is $\{s_1, s_2, \cdots, s_k\}$. The corresponding probabilities for each state are $\{p_1, p_2, \cdots, p_k\}$ with the constraint $\sum_{i=1}^k p_i = 1$.Cauchy-Lorentz Distributionhttps://datumorphism.leima.is/cards/statistics/distributions/cauchy/Sat, 14 Mar 2020 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/distributions/cauchy/Cauchy-Lorentz Distribution .. ratio of two independent normally distributed random variables with mean zero.
Source: https://en.wikipedia.org/wiki/Cauchy_distribution
Lorentz distribution is frequently used in physics.
PDF:
$$ \frac{1}{\pi\gamma} \left( \frac{\gamma^2}{ (x-x_0)^2 + \gamma^2} \right) $$
The median and mode of the Cauchy-Lorentz distribution are always $x_0$. $\gamma$ is the half width at half maximum, so the FWHM is $2\gamma$.
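The role of $\gamma$ can be checked numerically from the PDF above. The sketch below is only an illustration; `lorentz_pdf` and the parameter values are made up for the example. It evaluates the density at the peak $x_0$ and at $x_0 + \gamma$, where it has dropped to half of the peak value.

```python
import math


def lorentz_pdf(x, x0=0.0, gamma=1.0):
    """Cauchy-Lorentz density (1 / (pi * gamma)) * gamma^2 / ((x - x0)^2 + gamma^2)."""
    return (1.0 / (math.pi * gamma)) * gamma**2 / ((x - x0) ** 2 + gamma**2)


x0, gamma = 1.5, 0.7  # arbitrary illustrative values
peak = lorentz_pdf(x0, x0, gamma)          # maximum, located at the mode x0
half = lorentz_pdf(x0 + gamma, x0, gamma)  # value one gamma away from the mode
print(peak, half, half / peak)
```

Since the density is symmetric about $x_0$, the two half-maximum points sit at $x_0 \pm \gamma$, giving a full width of $2\gamma$.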
Gamma Distributionhttps://datumorphism.leima.is/cards/statistics/distributions/gamma/Sat, 14 Mar 2020 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/distributions/gamma/Gamma Distribution PDF:
$$ \frac{\beta^\alpha x^{\alpha-1} e^{-\beta x}}{\Gamma(\alpha)} $$
Diagonalize Matriceshttps://datumorphism.leima.is/cards/math/diagonalize-matrix/Wed, 11 Mar 2020 00:00:00 +0000https://datumorphism.leima.is/cards/math/diagonalize-matrix/Given a matrix $\mathbf A$, it is diagonalized using its eigenvectors.
Why are the eigenvectors needed?
Eigenvectors of a matrix $\mathbf A$ are the preferred directions. From the definition of eigenvectors,
$$ \mathbf A \mathbf x = \lambda \mathbf x, $$
we know that the matrix $\mathbf A$ only scales its eigenvectors and does not rotate them. These directions are special to the matrix $\mathbf A$.
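To make the "scaling without rotation" statement concrete, here is a small pure-Python sketch. The matrix and its eigenpairs are a hand-picked example (an assumption for illustration, not from the original card), and `matvec` is a hypothetical helper.

```python
def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


A = [[2.0, 1.0],
     [1.0, 2.0]]

# Hand-picked eigenpairs of A: (eigenvalue, eigenvector).
eigenpairs = [(3.0, [1.0, 1.0]), (1.0, [1.0, -1.0])]

for lam, x in eigenpairs:
    # A x equals lam * x: the direction of x is unchanged, only its length scales.
    print(matvec(A, x), [lam * xi for xi in x])

# A generic vector is rotated as well as scaled.
print(matvec(A, [1.0, 0.0]))
```

The last line shows the contrast: the image of $(1, 0)$ is not a multiple of $(1, 0)$, so that direction is not preserved.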
Find the eigenvectors $\mathbf x_i$ of the matrix $\mathbf A$; if the eigenvectors are degenerate, i.e., there are fewer linearly independent eigenvectors than the dimension of $\mathbf A$, the matrix is not diagonalizable.Mahalanobis Distancehttps://datumorphism.leima.is/cards/math/mahalanobis-distance/Wed, 11 Mar 2020 00:00:00 +0000https://datumorphism.leima.is/cards/math/mahalanobis-distance/Mahalanobis distance is a distance calculated using the inverse of the covariance matrix as the metric. For two vectors $\mathbf x$ and $\mathbf y$, the Mahalanobis distance is
$$ d^2 = (x_i - y_i) g_{ij} (x_j - y_j), $$
where $g_{ij} = (S^{-1})_{ij}$ and $\mathbf S$ is the covariance matrix.
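A minimal two-dimensional sketch of this distance follows. It assumes the standard form $d^2 = (x_i - y_i)\, g_{ij}\, (x_j - y_j)$ with $g = S^{-1}$; the helper names and the example covariance matrices are illustrative only, and the explicit $2\times 2$ inverse is used just to keep the example free of dependencies.

```python
import math


def inverse_2x2(S):
    """Invert a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = S
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def mahalanobis(x, y, S):
    """Mahalanobis distance between 2D vectors x and y with covariance S."""
    g = inverse_2x2(S)  # the metric g_ij = (S^-1)_ij
    diff = [xi - yi for xi, yi in zip(x, y)]
    d2 = sum(diff[i] * g[i][j] * diff[j] for i in range(2) for j in range(2))
    return math.sqrt(d2)


# With an identity covariance the metric is Euclidean.
print(mahalanobis([0.0, 0.0], [3.0, 4.0], [[1.0, 0.0], [0.0, 1.0]]))
# A larger variance along the first axis shrinks distances along that axis.
print(mahalanobis([0.0, 0.0], [2.0, 0.0], [[4.0, 0.0], [0.0, 1.0]]))
```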
The inverse covariance matrix acts as a normalization that mitigates the covariances.Covariance Matrixhttps://datumorphism.leima.is/cards/statistics/covariance-matrix/Tue, 10 Mar 2020 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/covariance-matrix/We use Einstein’s summation convention. Covariance of two discrete series $A$ and $B$ is defined as
$$ \text{Cov} ({A,B}) = \sigma_{A,B}^2 = \frac{ (a_i - \bar A) (b_i - \bar B) }{ n- 1 }, $$
where $n$ is the length of the series. The normalization factor is set to $1/(n-1)$ to mitigate the bias for small $n$.
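The definition can be written out directly. The sketch below is a plain-Python illustration with made-up data; `covariance` is a hypothetical helper name, and the implicit sum over the repeated index $i$ becomes an explicit `sum`.

```python
def covariance(a, b):
    """Sample covariance with the 1 / (n - 1) normalization."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sum over the repeated index i, then normalize by n - 1.
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


A = [1.0, 2.0, 3.0, 4.0]
B = [2.0, 4.0, 6.0, 8.0]  # B = 2 * A, so Cov(A, B) = 2 * Cov(A, A)
print(covariance(A, B), covariance(A, A))
```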
One could show that
$$ \mathrm{Cov}({A,B}) = E( AB ) - \bar A \bar B.Jackknife Resamplinghttps://datumorphism.leima.is/cards/statistics/jacknife-resampling/Sun, 26 Jan 2020 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/jacknife-resampling/Jackknife resampling is a method for estimation of the mean and higher order moments.
Given a sample $\{x_i\}$ of size $n$ drawn from the distribution $X$, jackknife resampling estimates the mean by systematically leaving out one data point at a time. This yields $n$ estimates of the mean, the $i$-th of which is
$$ \bar x_i = \frac{1}{n-1} \sum_{j\neq i} x_j. $$
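The leave-one-out means can be computed in a few lines. This is a minimal sketch with made-up data; `jackknife_means` is an illustrative name, and subtracting each point from the total is just a shortcut for summing over $j \neq i$.

```python
def jackknife_means(xs):
    """Return the n leave-one-out means of the sample xs."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two data points")
    total = sum(xs)
    # Removing x_i from the total is the same as summing over all j != i.
    return [(total - x) / (n - 1) for x in xs]


sample = [2.0, 4.0, 6.0, 8.0]
loo = jackknife_means(sample)
print(loo)
print(sum(loo) / len(loo), sum(sample) / len(sample))
```

Averaging the leave-one-out means gives back the ordinary sample mean, which is what the second `print` compares.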
The mean of the sample is
$$ \bar x = \frac{1}{n}\sum_i \bar x_i = \frac{1}{n} \sum_i \left(\frac{1}{n-1} \sum_{j\neq i} x_j\right) = \frac{1}{n}\sum_i x_i.CBOW: Continuous Bag of Wordshttps://datumorphism.leima.is/cards/machine-learning/embedding/continuous-bag-of-words/Thu, 16 Jan 2020 00:00:00 +0000https://datumorphism.leima.is/cards/machine-learning/embedding/continuous-bag-of-words/Here we encode all words presented in the corpus to demonstrate the idea of CBOW. In the real world, we might want to remove certain words such as the. We use the following quote by Ford in Westworld as an example.
I read a theory once that the human intellect is like peacock feathers. Just an extravagant display intended to attract a mate, just an elaborate mating ritual.Data Typeshttps://datumorphism.leima.is/cards/machine-learning/datatypes/data-types/Thu, 16 Jan 2020 00:00:00 +0000https://datumorphism.leima.is/cards/machine-learning/datatypes/data-types/Gini Impurityhttps://datumorphism.leima.is/cards/machine-learning/measurement/gini-impurity/Thu, 16 Jan 2020 00:00:00 +0000https://datumorphism.leima.is/cards/machine-learning/measurement/gini-impurity/The code used in this article can be found in this repo. Suppose we have a dataset $\{0,1\}^{10}$, which has 10 records and 2 possible classes of objects $\{0,1\}$ in each record.
The first example we investigate is a pure 0 dataset.
object 0 0 0 0 0 0 0 0 0 0 0 0 For such an all-0 dataset, we would like to define its impurity as 0.Information Gainhttps://datumorphism.leima.is/cards/machine-learning/measurement/information-gain/Thu, 16 Jan 2020 00:00:00 +0000https://datumorphism.leima.is/cards/machine-learning/measurement/information-gain/Information gain is a frequently used metric in calculating the gain during a split in tree-based methods.
First of all, the entropy of a dataset is defined as
$$ S = - \sum_i p_i \log p_i - \sum_i (1-p_i)\log (1-p_i), $$
where $p_i$ is the probability of a class.
The information gain is the difference between the entropy before and after a split.
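The idea can be sketched in a few lines of Python. Two assumptions are made here that the text does not spell out: the entropy uses the plain Shannon form $-\sum_i p_i \log_2 p_i$ over the class frequencies, and the entropy after a split is the size-weighted average over the child nodes, as is common in tree-based methods. The function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of the class frequencies in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = [0, 0, 1, 1]
print(information_gain(parent, [[0, 0], [1, 1]]))  # a split that separates the classes
print(information_gain(parent, [[0, 1], [0, 1]]))  # a split that separates nothing
```

A split that produces pure children recovers the full parent entropy as gain, while a split that leaves the class mix unchanged gains nothing.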
For example, in a decision tree algorithm, we would split a node. Before splitting, we assign a label $m$ to the node,Negative Samplinghttps://datumorphism.leima.is/cards/machine-learning/embedding/negative-sampling/Thu, 16 Jan 2020 00:00:00 +0000https://datumorphism.leima.is/cards/machine-learning/embedding/negative-sampling/Knowledge of CBOW or skipgram is required.
A naive approach to training a word model is to
encode input words and output words using vectors, use the input word vector to predict the output word vector, calculate the errors between predicted output word vector and real output word vector, minimize the errors. However, it is very expensive to project out the output words and calculate the error every time.PAC: Probably Approximately Correcthttps://datumorphism.leima.is/cards/machine-learning/learning-theories/pac/Thu, 16 Jan 2020 00:00:00 +0000https://datumorphism.leima.is/cards/machine-learning/learning-theories/pac/skipgram: Continuous skip-gramhttps://datumorphism.leima.is/cards/machine-learning/embedding/continuous-skip-gram/Thu, 16 Jan 2020 00:00:00 +0000https://datumorphism.leima.is/cards/machine-learning/embedding/continuous-skip-gram/We use the following quote by Ford in Westworld as an example.
I read a theory once that the human intellect is like peacock feathers. Just an extravagant display intended to attract a mate, just an elaborate mating ritual. But, of course, the peacock can barely fly. It lives in the dirt, pecking insects out of the muck, consoling itself with its great beauty.
The word intended is surrounded by extravagant display in the front and to attract after it.Improving Document Ranking with Dual Word Embeddingshttps://datumorphism.leima.is/reading/word2vec-in-out-embedding/Sat, 05 Oct 2019 00:00:00 +0000https://datumorphism.leima.is/reading/word2vec-in-out-embedding/Word2vec produces two embedding spaces, the in-embedding and out-embedding.Switch statement in Pythonhttps://datumorphism.leima.is/til/programming/python/python-switch-statement/Tue, 20 Aug 2019 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-switch-statement/Love switch statement? We can design a switch statement in Python.Python Tilde Operatorhttps://datumorphism.leima.is/til/programming/python/python-tilde-operator/Thu, 15 Aug 2019 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-tilde-operator/tilde operator may not work as you expectArrays and Dicts in MongoDBhttps://datumorphism.leima.is/til/programming/database/mongodb-array-and-dict/Wed, 14 Aug 2019 00:00:00 +0000https://datumorphism.leima.is/til/programming/database/mongodb-array-and-dict/Array of dictionaries becomes hard to update in MongoDB.eval in Python is Dangeroushttps://datumorphism.leima.is/til/programming/python/python-eval/Tue, 13 Aug 2019 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-eval/eval is powerful but really dangerousDealing with Missing Data in Machine Learninghttps://datumorphism.leima.is/wiki/machine-learning/feature-engineering/missing-data/Mon, 05 Aug 2019 00:00:00 +0000https://datumorphism.leima.is/wiki/machine-learning/feature-engineering/missing-data/How to Deal with Missing Data Remove Listwise deletion: Remove the whole record; Works if the missing values are random. Removing values causes problems in many respects. For example, we cannot just delete data when applying our models. 
Replace with most frequent value central tendency: median, mean, etc fixed value: a string etc New Category: define a new category for missing data Convert the column to a binary valued column indicating if the feature is missing or not.Kendall Tau Correlationhttps://datumorphism.leima.is/cards/statistics/kendall-correlation-coefficient/Sat, 20 Jul 2019 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/kendall-correlation-coefficient/Definition two series of data: $X$ and $Y$ cooccurance of them: $(x_i, x_j)$, and we assume that $i<j$ concordant: $x_i < x_j$ and $y_i < y_j$; $x_i > x_j$ and $y_i > y_j$; denoted as $C$ discordant: $x_i < x_j$ and $y_i > y_j$; $x_i > x_j$ and $y_i < y_j$; denoted as $D$ neither concordant nor discordant: whenever equal sign happens Kendall’s tau is defined as
$$ \begin{equation} \tau = \frac{C- D}{\text{all possible pairs of comparison}} = \frac{C- D}{n^2/2 - n/2} \end{equation} $$Bayes' Theoremhttps://datumorphism.leima.is/cards/statistics/bayes-theorem/Tue, 18 Jun 2019 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/bayes-theorem/Bayes’ Theorem is stated as
$$ P(A\mid B) = \frac{P(B \mid A) P(A)}{P(B)} $$
$P(A\mid B)$: posterior probability of A given B
$P(B\mid A)$: likelihood of B given A
$P(A)$: marginal (prior) probability of A
There is a nice tree diagram of Bayes’ theorem on Wikipedia.
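A small numeric illustration of the theorem, as a sketch only: the function name and the probabilities below are made up for the example, and $P(B)$ is expanded with the law of total probability, $P(B) = P(B\mid A)P(A) + P(B\mid \neg A)P(\neg A)$.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A | B) from the prior P(A) and the two conditional probabilities of B."""
    # Law of total probability for the evidence P(B).
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    if evidence == 0:
        raise ValueError("P(B) is zero, the posterior is undefined")
    return p_b_given_a * prior_a / evidence


# Hypothetical numbers: a rare event A and an imperfect indicator B.
p = posterior(prior_a=0.01, p_b_given_a=0.9, p_b_given_not_a=0.05)
print(p)
```

Even with a strong indicator, the posterior stays far below $P(B\mid A)$ when the prior is small, because most occurrences of $B$ come from the much larger $\neg A$ population.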
Tree diagram of Bayes’ theoremCanonical Decompositionhttps://datumorphism.leima.is/cards/math/canonical-decomposition/Tue, 18 Jun 2019 00:00:00 +0000https://datumorphism.leima.is/cards/math/canonical-decomposition/I find this slide from Christoph Freudenthaler very useful.
Canonical decomposition visualized by Christoph FreudenthalerCholesky Decompositionhttps://datumorphism.leima.is/cards/math/cholesky-decomposition/Tue, 18 Jun 2019 00:00:00 +0000https://datumorphism.leima.is/cards/math/cholesky-decomposition/$$ A = L L^T $$Khatri-Rao Producthttps://datumorphism.leima.is/cards/math/khatri-rao/Tue, 18 Jun 2019 00:00:00 +0000https://datumorphism.leima.is/cards/math/khatri-rao/$$ \mathbf{A} \ast \mathbf{B} = \left(\mathbf{A}_{ij} \otimes \mathbf{B}_{ij}\right)_{ij} $$Modes and Slices of Tensorshttps://datumorphism.leima.is/cards/math/modes-and-slices-of-tensor/Tue, 18 Jun 2019 00:00:00 +0000https://datumorphism.leima.is/cards/math/modes-and-slices-of-tensor/ Modes of a tensor Slices of a tensorPoisson Processhttps://datumorphism.leima.is/cards/statistics/poisson-process/Tue, 18 Jun 2019 00:00:00 +0000https://datumorphism.leima.is/cards/statistics/poisson-process/SVD: Singular Value Decompositionhttps://datumorphism.leima.is/cards/math/svd/Tue, 18 Jun 2019 00:00:00 +0000https://datumorphism.leima.is/cards/math/svd/Given a matrix $\mathbf X \to X_{m}^{\phantom{m}n}$, we can decompose it into three matrices
$$ X_{m}^{\phantom{m}n} = U_{m}^{\phantom{m}k} D_{k}^{\phantom{k}l} (V_{n}^{\phantom{n}l} )^{\mathrm T}, $$
where $D_{k}^{\phantom{k}l}$ is diagonal.
Here we have $\mathbf U$ being constructed by the eigenvectors of $\mathbf X \mathbf X^{\mathrm T}$, while $\mathbf V$ is being constructed by the eigenvectors of $\mathbf X^{\mathrm T} \mathbf X$ (which is also the reason we keep the transpose).
I find this slide from Christoph Freudenthaler very useful.Tucker Decompositionhttps://datumorphism.leima.is/cards/math/tucker-decomposition/Tue, 18 Jun 2019 00:00:00 +0000https://datumorphism.leima.is/cards/math/tucker-decomposition/I find this slide from Christoph Freudenthaler very useful. For the definition of mode 1/2/3 unfold, please refer to Modes and Slices of Tensors.
Tucker decomposition visualized by Christoph FreudenthalerLevenshtein Distancehttps://datumorphism.leima.is/cards/math/levenshtein-distance/Sun, 19 May 2019 00:00:00 +0000https://datumorphism.leima.is/cards/math/levenshtein-distance/Levenshtein distance calculates the number of operations needed to change one word to another by applying single-character edits (insertions, deletions or substitutions).
The reference explains this concept very well. For consistency, I extracted a paragraph from it which explains the operations in Levenshtein algorithm. The source of the following paragraph is the first reference of this article.
Levenshtein Matrix
Cell (0:1) contains red number 1. It means that we need 1 operation to transform M to an empty string.n-gramhttps://datumorphism.leima.is/cards/math/n-gram/Sun, 19 May 2019 00:00:00 +0000https://datumorphism.leima.is/cards/math/n-gram/n-gram is a method to split words into set of substring elements so that those can be used to match words.
Examples Use the following examples to get your first idea about it. I created two columns so that we could compare the n-grams of two different words side-by-side.
n in n-gram is Word One Clean Word: (( sentenceOneWords )) n-grams: (( sentenceOneWordsnGram )) Word Two Clean Word: (( sentenceTwoWords )) n-grams: (( sentenceTwoWordsnGram )) /*************************/ /** The function nGram is a copy of https://github.Add New Kernels to Jupyter Notebook in Conda Environmenthttps://datumorphism.leima.is/til/programming/jupyter-notebook-add-new-kernels-in-conda-env/Sun, 12 May 2019 00:00:00 +0000https://datumorphism.leima.is/til/programming/jupyter-notebook-add-new-kernels-in-conda-env/Python package or python module autoreloading in jupyter notebookAuto-reload Python Packages or Python Modules in Jupyter Notebookhttps://datumorphism.leima.is/til/programming/jupyter-notebook-autoreload-python-modules-or-packages/Sun, 12 May 2019 00:00:00 +0000https://datumorphism.leima.is/til/programming/jupyter-notebook-autoreload-python-modules-or-packages/Python package or python module autoreloading in jupyter notebookBigQuery Meta Tableshttps://datumorphism.leima.is/til/data/bigquery-meta-tables/Sun, 12 May 2019 00:00:00 +0000https://datumorphism.leima.is/til/data/bigquery-meta-tables/Meta tables are very useful when it comes to get bigquery table information programmatically.Calculate Moving Average Using SQL/BigQqueryhttps://datumorphism.leima.is/til/data/bigquery-moving-average/Sun, 12 May 2019 00:00:00 +0000https://datumorphism.leima.is/til/data/bigquery-moving-average/Snippet for calculating moving avg using sql/biguqeryGenerate a Column of Continuous Dates in BigQueryhttps://datumorphism.leima.is/til/data/bigquery-generate-continuous-dates-as-a-column/Sun, 12 May 2019 00:00:00 +0000https://datumorphism.leima.is/til/data/bigquery-generate-continuous-dates-as-a-column/Generate a table with a column of continuous datesGet Current User in BigQueryhttps://datumorphism.leima.is/til/data/bigquery-get-current-user/Sun, 12 May 2019 00:00:00 +0000https://datumorphism.leima.is/til/data/bigquery-get-current-user/BigQuery Current UserMaterialize 
the Query Result for Performancehttps://datumorphism.leima.is/til/data/bigquery-materialize-query-results-for-performance/Sun, 12 May 2019 00:00:00 +0000https://datumorphism.leima.is/til/data/bigquery-materialize-query-results-for-performance/Materialize the query result for multistage queries to make your query faster and lower the costs.Cosine Similarityhttps://datumorphism.leima.is/cards/math/cosine-similarity/Mon, 06 May 2019 00:00:00 +0000https://datumorphism.leima.is/cards/math/cosine-similarity/As simple as the inner product of two vectors
$$ d_{cos} = \frac{\vec A}{\vert \vec A \vert} \cdot \frac{\vec B }{ \vert \vec B \vert} $$
Examples To use cosine similarity, we have to vectorize the words first. There are many different methods to achieve this. For the purpose of illustrating cosine similarity, we use term frequency.
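A minimal sketch of cosine similarity on term-frequency vectors is given below. It assumes whitespace tokenization and lowercasing, which the text does not specify, and `cosine_similarity` is an illustrative name. Each text becomes a vector of word counts, and the similarity is the inner product of the two normalized vectors.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity of the term-frequency vectors of two texts."""
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the inner product.
    dot = sum(tf_a[w] * tf_b[w] for w in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty texts
    return dot / (norm_a * norm_b)


print(cosine_similarity("the peacock can barely fly", "the peacock can fly"))
print(cosine_similarity("a a b", "a b"))  # the repeated word changes the result
```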
Term frequency is the occurrence of the words. We do not deal with duplications so duplicate words will have some effect on the similarity.Eigenvalues and Eigenvectorshttps://datumorphism.leima.is/cards/math/eigendecomposition/Mon, 06 May 2019 00:00:00 +0000https://datumorphism.leima.is/cards/math/eigendecomposition/To find the eigenvectors $\mathbf x$ of a matrix $\mathbf A$, we construct the eigen equation
$$ \mathbf A \mathbf x = \lambda \mathbf x, $$
where $\lambda$ is the eigenvalue.
We rewrite it in the components form,
$$ \begin{equation} A_{ij} x_j = \lambda x_i. \label{eqn-eigen-decomp-def} \end{equation} $$
Mathematically speaking, it is straightforward to find the eigenvectors and eigenvalues.
Eigenvectors are Special Directions Judging from the definition in Eq.($\ref{eqn-eigen-decomp-def}$), the eigenvectors do not change direction under the operation of the matrix $\mathbf A$.Jaccard Similarityhttps://datumorphism.leima.is/cards/math/jaccard-similarity/Mon, 06 May 2019 00:00:00 +0000https://datumorphism.leima.is/cards/math/jaccard-similarity/Jaccard index is the ratio of the size of the intersection of two sets to the size of their union.
$$ J(A, B) = \frac{ \vert A \cap B \vert }{ \vert A \cup B \vert } $$
Jaccard distance $d_J(A,B)$ is defined as
$$ d_J(A,B) = 1 - J(A,B). $$
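Both quantities map directly onto Python's set operations. The sketch below is illustrative; the function names are made up, and returning 1 for two empty sets is a convention chosen here, since the ratio is undefined in that case.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention for two empty sets (assumption)
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Jaccard distance d_J = 1 - J."""
    return 1.0 - jaccard_index(a, b)


print(jaccard_index({"a", "b", "c"}, {"b", "c", "d"}))
print(jaccard_distance({"a", "b"}, {"a", "b"}))
```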
Properties If the two sets are the same, $A=B$, we have $J(A,B)=1$ or $d_J(A,B)=0$. We have maximum similarity.Term Frequency - Inverse Document Frequencyhttps://datumorphism.leima.is/cards/math/tf-idf/Mon, 06 May 2019 00:00:00 +0000https://datumorphism.leima.is/cards/math/tf-idf/The Art of Data Sciencehttps://datumorphism.leima.is/reading/art-of-data-science/Fri, 19 Apr 2019 00:00:00 +0000https://datumorphism.leima.is/reading/art-of-data-science/A nice and elegant book on data scienceAwesome Stuffhttps://datumorphism.leima.is/projects/awesome/Sun, 07 Apr 2019 00:00:00 +0000https://datumorphism.leima.is/projects/awesome/Summarizations, workflows, experiences, fails, etcBlog Postshttps://datumorphism.leima.is/projects/blog/Sun, 07 Apr 2019 00:00:00 +0000https://datumorphism.leima.is/projects/blog/My blog posts for fun.Combinationshttps://datumorphism.leima.is/cards/math/combinations/Sun, 07 Apr 2019 00:00:00 +0000https://datumorphism.leima.is/cards/math/combinations/Choose X from N is
$$ C_N^X = \frac{N!}{ X! (N-X)! } $$My Data Wikihttps://datumorphism.leima.is/projects/wiki/Sun, 07 Apr 2019 00:00:00 +0000https://datumorphism.leima.is/projects/wiki/A collection of my wiki articles related to data.My Knowledge Cardshttps://datumorphism.leima.is/projects/cards/Sun, 07 Apr 2019 00:00:00 +0000https://datumorphism.leima.is/projects/cards/A collection of my snippets of knowledgeMy Reading Noteshttps://datumorphism.leima.is/projects/reading/Sun, 07 Apr 2019 00:00:00 +0000https://datumorphism.leima.is/projects/reading/A collection of my reading notesTILhttps://datumorphism.leima.is/projects/til/Sun, 07 Apr 2019 00:00:00 +0000https://datumorphism.leima.is/projects/til/Today I LearnedHuman Graphical Perception of Quantitative Information in Data Visualizationhttps://datumorphism.leima.is/reading/graphical-perception/Sun, 17 Mar 2019 00:00:00 +0000https://datumorphism.leima.is/reading/graphical-perception/Data visualization caveatsAdd Data Files to Python Packagehttps://datumorphism.leima.is/til/programming/python/python-package-including-data-file/Wed, 13 Mar 2019 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-package-including-data-file/Add Data Files to Python Package using manifest.in and setup.pyInstalling requirements.txt in Conda Environmentshttps://datumorphism.leima.is/til/programming/python/python-anaconda-install-requirements/Wed, 13 Mar 2019 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-anaconda-install-requirements/Why is pip install -r requirements.txt not working?Information Theory and Statistical Mechanicshttps://datumorphism.leima.is/reading/statistical-physics-and-information-theory/Fri, 01 Mar 2019 00:00:00 +0000https://datumorphism.leima.is/reading/statistical-physics-and-information-theory/Max entropy principle as a method to infer distributions of statistical systemsFlatten 2D List in Pythonhttps://datumorphism.leima.is/til/programming/python/python-flatten-2d-list/Wed, 23 Jan 
2019 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-flatten-2d-list/Flatten 2D list using sumPython Datetime on Different OShttps://datumorphism.leima.is/til/programming/python/python-datetime-on-different-os/Mon, 31 Dec 2018 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-datetime-on-different-os/Python datetime on different os behaves inconsistentlyPython If on Numbershttps://datumorphism.leima.is/til/programming/python/python-if-condition-on-numbers/Mon, 31 Dec 2018 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-if-condition-on-numbers/If on int is dangerousPython Long Stringhttps://datumorphism.leima.is/til/programming/python/python-long-string/Mon, 31 Dec 2018 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-long-string/Python long string formattingPython Reliable Path to Filehttps://datumorphism.leima.is/til/programming/python/python-reliable-path/Mon, 31 Dec 2018 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-reliable-path/Find the actual path to fileVSCode on Mac Long Press Keys Not Repeatinghttps://datumorphism.leima.is/til/misc/vscode-on-mac-do-not-repeat/Mon, 31 Dec 2018 00:00:00 +0000https://datumorphism.leima.is/til/misc/vscode-on-mac-do-not-repeat/Enable your key repeat in vscode on macControlled Experimentshttps://datumorphism.leima.is/til/statistics/controlled-experiments/Tue, 04 Dec 2018 00:00:00 +0000https://datumorphism.leima.is/til/statistics/controlled-experiments/The three levels of controlled experimentsSchaum's Outline of Theories and Problems of Elements of Statistics I and IIhttps://datumorphism.leima.is/reading/elements-of-statistics/Thu, 01 Nov 2018 00:00:00 +0000https://datumorphism.leima.is/reading/elements-of-statistics/The basics and all of modern statisticsPandas with MultiProcessinghttps://datumorphism.leima.is/til/programming/pandas/pandas-parallel-multiprocessing/Sun, 09 Sep 2018 00:00:00 
+0000https://datumorphism.leima.is/til/programming/pandas/pandas-parallel-multiprocessing/Define number of processes, prs; Split dataframe into prs dataframes; Process each dataframe with one process; Merge processed dataframes into one. A piece of demo code is shown below.
from multiprocessing import Pool from multiprocessing.dummy import Pool as ThreadPool import pandas as pd # Create a dataframe to be processed df = pd.read_csv('somedata.csv').reset_index(drop=True) # Define a function to be applied to the dataframe def nice_func(name, age): return (name,age) # Apply to dataframe def apply_to_df(df_chunks): df_chunks['tupled'] = df_chunks.Beer and Life Expectancyhttps://datumorphism.leima.is/blog/ruthless/beer-and-life-expectancy/Wed, 08 Aug 2018 00:00:00 +0000https://datumorphism.leima.is/blog/ruthless/beer-and-life-expectancy/This is a post of no analysis at all. Everything in this post is meant for fun.
I moved to Germany a few weeks ago and one of the most astonishing things I noticed is that everyone drinks so much. Yet the life expectancy in Germany is pretty high. So I performed this “analysis” for fun.
Life expectancy vs beer consumption (L) per capita per year. Data obtained from wikipediaList of countries by life expectancy and List of countries by beer consumption per capita.Data Mining: Concepts and Techniqueshttps://datumorphism.leima.is/reading/data-mining/Wed, 01 Aug 2018 00:00:00 +0000https://datumorphism.leima.is/reading/data-mining/How data mining was done in the pastFitt's Lawhttps://datumorphism.leima.is/til/misc/fitts-law/Sun, 22 Jul 2018 00:00:00 +0000https://datumorphism.leima.is/til/misc/fitts-law/How fast can you move your mouse to targetCopy Scalars and Lists in Pythonhttps://datumorphism.leima.is/til/programming/python/python-copy-value-or-address/Tue, 03 Jul 2018 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-copy-value-or-address/Python copy values of scalars but addresses of listsCertificate Errors in urllibhttps://datumorphism.leima.is/til/data/python-urllib-ssl/Mon, 25 Jun 2018 00:00:00 +0000https://datumorphism.leima.is/til/data/python-urllib-ssl/Dealing with errors when scraping dataCalculated Columns in Pandashttps://datumorphism.leima.is/til/programming/pandas/pandas-new-column-from-other/Sun, 20 May 2018 00:00:00 +0000https://datumorphism.leima.is/til/programming/pandas/pandas-new-column-from-other/Create new columns in pandastree in Linuxhttps://datumorphism.leima.is/til/programming/trees/Tue, 20 Mar 2018 00:00:00 +0000https://datumorphism.leima.is/til/programming/trees/Trees in computer scienceHeap on Mac and Linuxhttps://datumorphism.leima.is/til/programming/cpp/cpp-heap-mac-linux-diff/Tue, 26 Sep 2017 00:00:00 +0000https://datumorphism.leima.is/til/programming/cpp/cpp-heap-mac-linux-diff/Some caveats about heap on mac and linuxC++ int Multiplicationhttps://datumorphism.leima.is/til/programming/cpp/cpp-int-multiply/Thu, 21 Sep 2017 00:00:00 +0000https://datumorphism.leima.is/til/programming/cpp/cpp-int-multiply/int multiplication in C++ should be processed with caution.CMake 
Usagehttps://datumorphism.leima.is/til/programming/cmake-usage/Thu, 21 Sep 2017 00:00:00 +0000https://datumorphism.leima.is/til/programming/cmake-usage/How to use CMake to generate makefilesAllocating Memory for Multidimensional Array in C++https://datumorphism.leima.is/til/programming/cpp/cpp-allocating-memory-multidimensional-array/Thu, 14 Sep 2017 00:00:00 +0000https://datumorphism.leima.is/til/programming/cpp/cpp-allocating-memory-multidimensional-array/Some caveatsC++ range-for-statementhttps://datumorphism.leima.is/til/programming/cpp/cpp-range-for-statement/Tue, 12 Sep 2017 00:00:00 +0000https://datumorphism.leima.is/til/programming/cpp/cpp-range-for-statement/In C++ we can use range-for-statementList All Folders in Linux or Machttps://datumorphism.leima.is/til/programming/linux-mac-list-all-folders/Tue, 01 Aug 2017 00:00:00 +0000https://datumorphism.leima.is/til/programming/linux-mac-list-all-folders/Using ls and tree commands to list folders onlyPython Default Parameters Tripped Me Uphttps://datumorphism.leima.is/til/programming/python/python-default-parameters-mutable/Sat, 03 Jun 2017 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-default-parameters-mutable/Python default parameters might be changed with each runSome Tests on Matplotlib Backendshttps://datumorphism.leima.is/til/programming/matplotlib-backend/Tue, 23 May 2017 00:00:00 +0000https://datumorphism.leima.is/til/programming/matplotlib-backend/Matplotlib provides many different backendsMathematica Provides Great PlotTheme Optionshttps://datumorphism.leima.is/til/programming/mathematica/mathematica-plottheme/Fri, 19 May 2017 00:00:00 +0000https://datumorphism.leima.is/til/programming/mathematica/mathematica-plottheme/Amazingly, Mathematica provides an option for plot that automatically generates beautiful plots.Turn a Series Expansion into Function in Mathematicahttps://datumorphism.leima.is/til/programming/mathematica/mathematica-turn-series-into-function/Mon, 15 May 
2017 00:00:00 +0000https://datumorphism.leima.is/til/programming/mathematica/mathematica-turn-series-into-function/Turn a series expansion in Mathematica into a functionOvercoming catastrophic forgetting in neural networkshttps://datumorphism.leima.is/reading/overcoming-catastrophic-forgetting-in-neural-networks/Sun, 14 May 2017 00:00:00 +0000https://datumorphism.leima.is/reading/overcoming-catastrophic-forgetting-in-neural-networks/Using a newly defined loss function the authors could implement an idea that achieves the multi-task within one network.Git Asks for Password Whenever I Pull or Pushhttps://datumorphism.leima.is/til/programming/git/git-ssh-asking-pwd-everytime/Thu, 11 May 2017 00:00:00 +0000https://datumorphism.leima.is/til/programming/git/git-ssh-asking-pwd-everytime/My git asks for password every time I pull or push even with ssh configured.Command Line Russian Roulettehttps://datumorphism.leima.is/til/programming/command-line-russian-roulette/Tue, 09 May 2017 00:00:00 +0000https://datumorphism.leima.is/til/programming/command-line-russian-roulette/Play russian roulette in your command lineGNU Screen Key Conflict with Bashhttps://datumorphism.leima.is/til/programming/gnu-screen-key-conflict-with-bash/Mon, 08 May 2017 00:00:00 +0000https://datumorphism.leima.is/til/programming/gnu-screen-key-conflict-with-bash/GNU screen key conflict with bash can be solvedHow to Run Mathematica Script in Terminalhttps://datumorphism.leima.is/til/programming/run-mathematica-script-in-terminal/Mon, 08 May 2017 00:00:00 +0000https://datumorphism.leima.is/til/programming/run-mathematica-script-in-terminal/Using math -run or wolfram -run we could execute a Mathematica script through ssh in terminal.GNUPLOT Inline Output in iterm2https://datumorphism.leima.is/til/programming/gnuplot-iterm2-imgcat/Fri, 07 Apr 2017 00:00:00 +0000https://datumorphism.leima.is/til/programming/gnuplot-iterm2-imgcat/Using gnuplot in iterm2 we can output result inside terminal combined with 
imgcatMathematica Exclude Singularities in Plothttps://datumorphism.leima.is/til/programming/mathematica/mathematica-plot-exclude-singularities/Wed, 22 Mar 2017 00:00:00 +0000https://datumorphism.leima.is/til/programming/mathematica/mathematica-plot-exclude-singularities/Mathematica Plot sometimes draws non-existent lines; Exclusions is the option for that.Passing Function Arguments Through Lists in Mathematicahttps://datumorphism.leima.is/til/programming/mathematica/mathematica-passing-arguments-through-lists/Mon, 20 Feb 2017 00:00:00 +0000https://datumorphism.leima.is/til/programming/mathematica/mathematica-passing-arguments-through-lists/We can pass a list of arguments using SequenceGit Pull with Submodulehttps://datumorphism.leima.is/til/programming/git/git-pull-with-submodule/Fri, 03 Feb 2017 00:00:00 +0000https://datumorphism.leima.is/til/programming/git/git-pull-with-submodule/Pull git repo with submodulePositioning textblock in LaTeX Beamerhttps://datumorphism.leima.is/til/programming/latex-beamer-textblock-position/Tue, 17 Jan 2017 00:00:00 +0000https://datumorphism.leima.is/til/programming/latex-beamer-textblock-position/Positioning textblock in LaTeX Beamer using the textpos and eso-pic packagesMathematica Different Output Formshttps://datumorphism.leima.is/til/programming/mathematica/mathematica-different-output-forms/Mon, 28 Nov 2016 00:00:00 +0000https://datumorphism.leima.is/til/programming/mathematica/mathematica-different-output-forms/Mathematica has many different output forms. 
Understanding them is extremely helpful when making plots.Git Branch Optionshttps://datumorphism.leima.is/til/programming/git/git-branch-details/Sun, 27 Nov 2016 00:00:00 +0000https://datumorphism.leima.is/til/programming/git/git-branch-details/Some useful options for git branchgit pull multi remotehttps://datumorphism.leima.is/til/programming/git/git-pull-multi-remote/Tue, 22 Nov 2016 00:00:00 +0000https://datumorphism.leima.is/til/programming/git/git-pull-multi-remote/Working with multiple remotesWorking Memory and Brain Waveshttps://datumorphism.leima.is/reading/working-memory-and-brain-waves/Sun, 20 Nov 2016 00:00:00 +0000https://datumorphism.leima.is/reading/working-memory-and-brain-waves/Working memory might be related to the background brain waves from a theoretical point of viewPopularity versus similarity in growing networkshttps://datumorphism.leima.is/reading/popularity-vs-similarity/Sun, 06 Nov 2016 00:00:00 +0000https://datumorphism.leima.is/reading/popularity-vs-similarity/Introduce geometry into the manifold of complex networksFormatting Numbers in Pythonhttps://datumorphism.leima.is/til/programming/formating-numbers-python/Tue, 11 Oct 2016 00:00:00 +0000https://datumorphism.leima.is/til/programming/formating-numbers-python/Formatting numbers in Python using formatSolving Equations Using Differential Transformation Methodhttps://datumorphism.leima.is/til/math/differential-transformation-method-solving-equations/Tue, 11 Oct 2016 00:00:00 +0000https://datumorphism.leima.is/til/math/differential-transformation-method-solving-equations/The differential transformation method can be used to solve differential equations and even integro-differential equations.The Great Chrome Dev Toolhttps://datumorphism.leima.is/til/programming/chrome-dev-tool-usage/Wed, 28 Sep 2016 00:00:00 +0000https://datumorphism.leima.is/til/programming/chrome-dev-tool-usage/How to use the Chrome dev tools wiselyStart a Simple 
Serverhttps://datumorphism.leima.is/til/programming/start-simple-server/Sat, 17 Sep 2016 00:00:00 +0000https://datumorphism.leima.is/til/programming/start-simple-server/With one line of python commandmatplotlib x y limit and aspect ratiohttps://datumorphism.leima.is/til/programming/matplotlib-x-y-limit-and-aspect-ratio/Thu, 21 Jul 2016 00:00:00 +0000https://datumorphism.leima.is/til/programming/matplotlib-x-y-limit-and-aspect-ratio/matplotlib x y limit and aspect ratioTOP Commandhttps://datumorphism.leima.is/til/programming/top/Thu, 21 Jul 2016 00:00:00 +0000https://datumorphism.leima.is/til/programming/top/Some tips about top commandAssigning Values to Multiple Variableshttps://datumorphism.leima.is/til/programming/python/python-assigning-values-to-multiple-variables/Fri, 04 Dec 2015 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-assigning-values-to-multiple-variables/Assigning Values to Multiple Variablesgitignore by file sizehttps://datumorphism.leima.is/til/programming/git/gitignore-by-file-size/Fri, 04 Dec 2015 00:00:00 +0000https://datumorphism.leima.is/til/programming/git/gitignore-by-file-size/gitignore by file sizeHTML Animations Using CSS: AnimateCSShttps://datumorphism.leima.is/til/programming/html-animate-css/Fri, 04 Dec 2015 00:00:00 +0000https://datumorphism.leima.is/til/programming/html-animate-css/HTML Animations Using CSS AnimateCSSImport in Pythonhttps://datumorphism.leima.is/til/programming/import-in-python/Fri, 04 Dec 2015 00:00:00 +0000https://datumorphism.leima.is/til/programming/import-in-python/Import in PythonIPython or Jupyter Notebook Magicshttps://datumorphism.leima.is/til/programming/ipython-or-jupyter-notebook-magics/Fri, 04 Dec 2015 00:00:00 +0000https://datumorphism.leima.is/til/programming/ipython-or-jupyter-notebook-magics/IPython or Jupyter Notebook MagicsLaTeX Automatically Adjust Figurehttps://datumorphism.leima.is/til/programming/latex-automatically-adjust-figure/Fri, 04 Dec 2015 00:00:00 
+0000https://datumorphism.leima.is/til/programming/latex-automatically-adjust-figure/LaTeX Automatically Adjust FigureMathematica Plot Default Font Style and Ticks Style: BaseStylehttps://datumorphism.leima.is/til/programming/mathematica/mathematica-plot-basestyle-default-font-style-and-ticks-style/Fri, 04 Dec 2015 00:00:00 +0000https://datumorphism.leima.is/til/programming/mathematica/mathematica-plot-basestyle-default-font-style-and-ticks-style/Mathematica Plot Default Font Style and Ticks Style BaseStyleMathematica Smooth Plothttps://datumorphism.leima.is/til/programming/mathematica/mathematica-smooth-plot/Fri, 04 Dec 2015 00:00:00 +0000https://datumorphism.leima.is/til/programming/mathematica/mathematica-smooth-plot/Mathematica Smooth PlotMigrating Wordpress to Statichttps://datumorphism.leima.is/til/programming/migrating-wordpress-to-static-site/Fri, 04 Dec 2015 00:00:00 +0000https://datumorphism.leima.is/til/programming/migrating-wordpress-to-static-site/Migrating Wordpress to StaticOpen URL using python using webbrowser modulehttps://datumorphism.leima.is/til/programming/open-url-using-python-webbrowser-module/Fri, 04 Dec 2015 00:00:00 +0000https://datumorphism.leima.is/til/programming/open-url-using-python-webbrowser-module/Open URL using python using webbrowser modulePython Code Stylehttps://datumorphism.leima.is/til/programming/python/python-code-style/Fri, 04 Dec 2015 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-code-style/Code Style of Python Guide.
PEP 20 – The Zen of Python
1. Beautiful is better than ugly. 2. Explicit is better than implicit. 3. Simple is better than complex. 4. Complex is better than complicated. 5. Flat is better than nested. 6. Sparse is better than dense. 7. Readability counts. 8. Special cases aren't special enough to break the rules. 9. Although practicality beats purity. 10. Errors should never pass silently.Python Creating Listshttps://datumorphism.leima.is/til/programming/python/python-creating-lists/Fri, 04 Dec 2015 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-creating-lists/Code Style of Python GuidePython enumeratehttps://datumorphism.leima.is/til/programming/python/python-enumerate/Fri, 04 Dec 2015 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-enumerate/Python enumerate functionPython List Comprehensionshttps://datumorphism.leima.is/til/programming/python/python-list-comprehensions/Fri, 04 Dec 2015 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-list-comprehensions/Python List ComprehensionsPython Making a Listhttps://datumorphism.leima.is/til/programming/python/python-making-a-list/Fri, 04 Dec 2015 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-making-a-list/Python Making a ListPython Map vs For in Pythonhttps://datumorphism.leima.is/til/programming/python/python-map-vs-for/Fri, 04 Dec 2015 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-map-vs-for/Python Map vs For in PythonPython One-liner: Filter Prime Numbershttps://datumorphism.leima.is/til/programming/filter-prime-numbers/Fri, 04 Dec 2015 00:00:00 +0000https://datumorphism.leima.is/til/programming/filter-prime-numbers/Python One-liner Filter Prime NumbersPython Stupid numpy.piecewisehttps://datumorphism.leima.is/til/programming/python/python-stupid-numpy-piecewise/Fri, 04 Dec 2015 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-stupid-numpy-piecewise/Python Stupid numpy.piecewisePython 
Various Ways of Writing Loopshttps://datumorphism.leima.is/til/programming/python/python-writing-loops/Fri, 04 Dec 2015 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-writing-loops/Python Various Ways of Writing LoopsRun a program in the background on ubuntuhttps://datumorphism.leima.is/til/programming/run-program-in-background-ubuntu/Fri, 04 Dec 2015 00:00:00 +0000https://datumorphism.leima.is/til/programming/run-program-in-background-ubuntu/Run a program in the background on ubuntusnakevizhttps://datumorphism.leima.is/til/programming/python/python-profile-snakeviz/Fri, 04 Dec 2015 00:00:00 +0000https://datumorphism.leima.is/til/programming/python/python-profile-snakeviz/Python snakevizArea Enclosed by a Linehttps://datumorphism.leima.is/til/math/area-enclosed-in-a-line/Sun, 15 Feb 2015 00:00:00 +0000https://datumorphism.leima.is/til/math/area-enclosed-in-a-line/Calculate the area enclosed by a lineEigensystem of A Special Matrixhttps://datumorphism.leima.is/til/math/eigensystem-of-a-special-matrix/Sun, 15 Feb 2015 00:00:00 +0000https://datumorphism.leima.is/til/math/eigensystem-of-a-special-matrix/Eigenstates of a very special matrixFeynman Trickhttps://datumorphism.leima.is/til/math/feynman-tricks/Sun, 15 Feb 2015 00:00:00 +0000https://datumorphism.leima.is/til/math/feynman-tricks/An identity about integralSymmetry of second derivativeshttps://datumorphism.leima.is/til/math/symmetry-of-second-derivatives/Sun, 15 Feb 2015 00:00:00 +0000https://datumorphism.leima.is/til/math/symmetry-of-second-derivatives/Symmetry of second derivatives<link>https://datumorphism.leima.is/wiki/dynamical-system/integration-of-ode/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://datumorphism.leima.is/wiki/dynamical-system/integration-of-ode/</guid><description/></item><item><title/><link>https://datumorphism.leima.is/wiki/survival-analysis/survival-probability/</link><pubDate>Mon, 01 Jan 0001 00:00:00 
+0000</pubDate><guid>https://datumorphism.leima.is/wiki/survival-analysis/survival-probability/</guid><description/></item><item><title>Abouthttps://datumorphism.leima.is/about/Mon, 01 Jan 0001 00:00:00 +0000https://datumorphism.leima.is/about/Datumorphism is my notebook about programming, data scraping, statistics, machine learning, and data visualization.
Join our Enki Team Learn, practice, and play together. Programmers are unstoppable. Intelligence Notebook Notes about neuroscience, machine intelligence, and collective intelligence. Lei Ma Visit this page if you would like to know more about me.Cheatsheetshttps://datumorphism.leima.is/awesome/cheatsheets/Mon, 01 Jan 0001 00:00:00 +0000https://datumorphism.leima.is/awesome/cheatsheets/ Supervised Learning k-Nearest Neighbors [Supervised Learning Classification ] : Linear Regression [Supervised Learning Regression ] : Lasso [Supervised Learning Regression Regularization ] : Ridge [Supervised Learning Regression Regularization ] : ElasticNet [Supervised Learning Regression Regularization ] : Unsupervised Learning k-Means [Unsupervised Learning ] : t-SNE [Unsupervised Learning ] : PCA [Unsupervised Learning Dimension Reduction Feature Selection ] : NMF [Unsupervised Learning ] : Non-negative Matrix FactoringCurriculumhttps://datumorphism.leima.is/awesome/curriculum/Mon, 01 Jan 0001 00:00:00 +0000https://datumorphism.leima.is/awesome/curriculum/Prerequisites Programming Bash: all posts with the bash tag Bash Python: The Python Language The Python Language Python as a programming language all posts with the python tag Python C++: C/C++ C/C++ all posts with the C++ tag C++ alternatives:Gridlines in Matplotlibhttps://datumorphism.leima.is/til/programming/matplotlib-gridlines/Mon, 01 Jan 0001 00:00:00 +0000https://datumorphism.leima.is/til/programming/matplotlib-gridlines/Adding gridlines in matplotlibResearchershttps://datumorphism.leima.is/awesome/researchers/Mon, 01 Jan 0001 00:00:00 +0000https://datumorphism.leima.is/awesome/researchers/ Machine Learning Geoffrey Hinton [machine learning psychology artificial intelligence cognitive science computer science ] : Emeritus Prof. 
Comp Sci, U.Toronto & Engineering Fellow, GoogleToolshttps://datumorphism.leima.is/awesome/tools/Mon, 01 Jan 0001 00:00:00 +0000https://datumorphism.leima.is/awesome/tools/List of Tools Dashboard ReDash [Python ] : Superset [Python ] : Metabase [Java ] : Google Data Studio [Free Google BigQuery Cloud ] : Google Data Studio is a convenient tool to produce simple yet massive dashboards for the team. Design and Build a Data Warehouse for Business [Courses Warehouse Business ] : Explained Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) [LSTM RNN ] : JavaScript replacements for Python data science tools [JavaScript Tools Data Science ] : https://github.Typography of this Websitehttps://datumorphism.leima.is/typography/Mon, 01 Jan 0001 00:00:00 +0000https://datumorphism.leima.is/typography/Basic Syntax This website uses kramdown as the basic syntax. However, a lot of HTML/CSS/JS is also used to generate certain content and styles.
Math also follows the kramdown syntax.
Notes div {% highlight html %}
Figure with Caption {% highlight html %}
![]({{ site.url }}/assets/programming/chrome-dev-tools-inspect.png) where {{ site.url }} is the configured url of the site.
Alternatively, we can use the set attributes syntax in kramdown.
{% highlight md %} This is a paragraph with some class.Workflowshttps://datumorphism.leima.is/awesome/workflows/Mon, 01 Jan 0001 00:00:00 +0000https://datumorphism.leima.is/awesome/workflows/The scope of exploratory data analysis is not universally defined. Some of the content discussed here may cross that line. The whole modeling process is never decoupled anyway. Data wrangling is mostly guided by the exploratory data analysis (EDA). In other words, the data cleaning process should be guided mostly by questions from the business and stakeholders, or by curiosity.
There are three key components in EDA.