Datumorphism/
Recent content on Datumorphism
Hugo -- gohugo.io
en-US
Tue, 12 Jan 2021 00:00:00 +0000
Principles of Design/wiki/data-visualization/design/Fri, 20 Nov 2020 00:00:00 +0000/wiki/data-visualization/design/
There are many principles for designing a visual representation of data. However, before we look at how data is represented visually, it helps a lot to understand the basic principles of design on a 2D surface.
Robin’s CRAP
Robin Williams proposed the four elements of design:
- Contrast
- Repetition
- Alignment
- Proximity
Contrast
Use some contrast to distinguish elements with different content.
Repetition
Repeat the design of similar elements on the same page and across pages so that readers learn the meaning of the design quickly.
Receiver Operating Characteristics: ROC/wiki/machine-learning/performance/roc/Wed, 13 May 2020 00:00:00 +0000/wiki/machine-learning/performance/roc/
ROC space is the two-dimensional space spanned by True Positive Rate and False Positive Rate.
ROC Space. The colored boxes indicate the confusion matrices. Green is the fraction of true positives. Orange is the fraction of false positives. Refer to Confusion Matrix for more details.
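As a minimal sketch of how a single point in ROC space could be computed, the snippet below derives the false positive rate and the true positive rate from the four counts of a confusion matrix. The function name `roc_point` and the counts are hypothetical and chosen only for illustration.

```python
def roc_point(tp, fp, fn, tn):
    """Return (FPR, TPR), the coordinates of one classifier in ROC space."""
    tpr = tp / (tp + fn)  # true positive rate: share of actual positives found
    fpr = fp / (fp + tn)  # false positive rate: share of actual negatives flagged
    return fpr, tpr


# Hypothetical counts, not taken from the article.
fpr, tpr = roc_point(tp=40, fp=10, fn=10, tn=40)
print(fpr, tpr)
```

Each threshold of a classifier gives one such point; sweeping the threshold traces out the ROC curve.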
AUC: Area under Curve
TPR = TP Rate
FPR = FP Rate
The ROC curve is defined by the relation $f(TPR, FPR)$.
Tree-based Learning/wiki/machine-learning/tree-based/overview/Wed, 25 Dec 2019 00:00:00 +0000/wiki/machine-learning/tree-based/overview/
A decision tree is an easy-to-interpret method in supervised learning. Though simple, it is the building block of some widely used algorithms such as the random forest method.
Embedding/wiki/machine-learning/embedding/overview/Sun, 13 Oct 2019 00:00:00 +0000/wiki/machine-learning/embedding/overview/
Factorization/wiki/machine-learning/factorization/overview/Mon, 17 Jun 2019 00:00:00 +0000/wiki/machine-learning/factorization/overview/
Feature Engineering/wiki/machine-learning/feature-engineering/overview/Mon, 17 Jun 2019 00:00:00 +0000/wiki/machine-learning/feature-engineering/overview/
Naive Bayes/wiki/machine-learning/bayesian/naive-bayesian/Mon, 17 Jun 2019 00:00:00 +0000/wiki/machine-learning/bayesian/naive-bayesian/
Naive Bayes is a classifier based on Bayes’ theorem. Bayes’ theorem is stated as
$$ P(A\mid B) = \frac{P(B \mid A) P(A)}{P(B)} $$
- $P(A\mid B)$: conditional probability of A given B
- $P(A)$: marginal probability of A
There is a nice tree diagram for Bayes’ theorem on Wikipedia.
Tree diagram of Bayes’ theorem with ‘naive’ assumptions.
Problems with Conditional Probability Calculation
By definition, the conditional probability of event $\mathbf Y$ given features $\mathbf X$ is $$ \begin{equation} P(\mathbf Y\mid \mathbf X) = \frac{P(\mathbf Y, \mathbf X)}{ P(\mathbf X) }, \label{def-cp-y-given-x} \end{equation} $$ where
Confusion Matrix (Contingency Table)/wiki/machine-learning/basics/confusion-matrix/Fri, 31 May 2019 00:00:00 +0000/wiki/machine-learning/basics/confusion-matrix/
Confusion Matrix
It is much easier to understand the confusion matrix if we use a binary classification problem as an example. For example, we have a bunch of cat photos and user-labeled “cute or not” data. Now we use the labeled data to train a cute-or-not binary classifier.
Then we apply the classifier to the test dataset, and we find only four different kinds of results.
Normal Distribution/wiki/distributions/normal-distribution/Tue, 22 Jan 2019 00:00:00 +0000/wiki/distributions/normal-distribution/
Visualization
Math
The formula of the normal distribution is, up to a normalization factor,
$$ \begin{equation} e^{ - \left( \frac{x - \mu}{\sqrt{2} \sigma} \right)^2 } \end{equation} $$
where $\mu$ controls the “center” or “peak” of the distribution and $\sigma$ tells us how “wide” or “disperse” the distribution is.
To understand the distribution, we take some limits.
$x = \mu$ First of all, when $x = \mu$ we have
$$ e^0 = 1. $$
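A quick numerical check of this limit, as a sketch: the snippet assumes the standard bell-curve form with a negative sign in the exponent and no normalization factor. The function name `gaussian_shape` and the sample points are hypothetical.

```python
import math


def gaussian_shape(x, mu=0.0, sigma=1.0):
    """Unnormalized bell curve exp(-((x - mu) / (sqrt(2) * sigma))**2)."""
    return math.exp(-((x - mu) / (math.sqrt(2) * sigma)) ** 2)


print(gaussian_shape(0.0))  # at the peak, x = mu
print(gaussian_shape(1.0))  # one sigma away from the peak
print(gaussian_shape(5.0))  # far out in the tail
```

The value at $x = \mu$ is the largest one, and the function decays quickly as $x$ moves away from $\mu$.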
Notice that the squared term in the argument of the exponential cannot be negative.
Statistical Hypothesis Testing/wiki/statistical-hypothesis-testing/hypothesis-testing/Sun, 20 Jan 2019 00:00:00 +0000/wiki/statistical-hypothesis-testing/hypothesis-testing/
When we have a sample of the population, we immediately calculate the mean using the sample, say the result is $\mu_0$. Of course, the population mean $\mu_p$ is unknown and probably can never be known.
This specific sample mean $\mu_0$ is nothing but an educated guess. Then again, how do we know whether this specific sample mean $\mu_0$ is a faithful representation of the population mean? In fact, this question is not limited to the mean.
Why Estimation Theory/wiki/statistical-estimation/why-estimation-theory/Sun, 20 Jan 2019 00:00:00 +0000/wiki/statistical-estimation/why-estimation-theory/
In statistics, we work with samples. For example, the sample mean is easily calculated. However, it is the population mean that is more valuable.
Suppose we have one sample $S_i$, which is used to calculate the mean of the sample $\mu_i$. We have two key problems to solve at this moment.
- Can we use this sample mean $\mu_i$ to represent the population mean $\mu_p$?
- How good is our estimate?
To answer these questions, we need to work out the properties of the samples themselves and develop a theory that tells us how to infer population statistics from sample statistics.
What is Statistics/wiki/statistics/what-is-statistics/Fri, 18 Jan 2019 00:00:00 +0000/wiki/statistics/what-is-statistics/
A Case Study
We have a problem.
In our lab, we found a huge number of similar robots on a planet (physical population). To know more about the weight of these robots (statistical population), we first need to choose some of them (physical sample), then measure their weight (statistical sample).
To describe the data, we could calculate the mean of the weights. We found that the mean weight is 93 kg (descriptive statistics).
Association Rules/wiki/pattern-mining/association-rules/Sun, 06 Jan 2019 00:00:00 +0000/wiki/pattern-mining/association-rules/
Association rule analysis is a method for pattern mining. In this article, we perform an association rule analysis of some demo data.
The Problem Defined
Suppose we own a store called KIOSK. Here at KIOSK, we sell 4 different things.
- Milk
- Croissant
- Coffee
- Fries
We need to know which items are associated with each other when customers buy them.
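One common way to quantify "associated" is through support and confidence. The sketch below computes both on a handful of hypothetical baskets; the transactions, the function names `support` and `confidence`, and the choice of these two measures are assumptions for illustration and are not the data or the method of the article.

```python
# Hypothetical baskets, NOT the data collected in the article.
transactions = [
    {"Milk", "Croissant"},
    {"Coffee", "Croissant"},
    {"Coffee", "Croissant", "Fries"},
    {"Milk", "Fries"},
]


def support(items):
    """Fraction of transactions that contain every item in `items`."""
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Estimated probability of buying `consequent` given `antecedent`."""
    return support(antecedent | consequent) / support(antecedent)


print(support({"Coffee", "Croissant"}))
print(confidence({"Coffee"}, {"Croissant"}))
```

In these made-up baskets every coffee purchase comes with a croissant, so the confidence of the rule Coffee to Croissant is as high as it can be.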
We have collected the following data. Beware that this small amount of data might not be enough for a real-world problem.
Some Concepts about Data Warehouse/wiki/data-warehouse/data-warehouse-concepts/Fri, 23 Nov 2018 00:00:00 +0000/wiki/data-warehouse/data-warehouse-concepts/
The Three Key Ideas about Warehouse
The purpose of the data warehouse should be clear. In most cases, it is for the analysis of data, not for data production.1
- Subject-oriented: since data warehouses are for decision-makers, arranging them into subjects makes them much easier to access.
- Integrated: many sources are integrated for easy analysis.
- Time-variant: observation time should be recorded since the data is also used to analyze the time evolution.
- Nonvolatile: simply for analysis.
OLTP and OLAP
- OLTP: online transaction processing
- OLAP: online analytical processing

| | OLTP | OLAP |
|---|---|---|
| user | customer | data scientist, managers |
| purpose | production | analysis |
| content | everything | cleaner data |
| database | entity relation model, application-oriented | star/snowflake model, subject-oriented |
| history | usually no need to record the history | history is crucial |
| query | short and frequent read and write | read-only but complicated analysis |

Scope of Data Warehouse
- Enterprise warehouse: targeting the whole organization
- Data mart: for a specific group of people
- Virtual warehouse: views not tables
Fact and Dimension
Fact is the value of something specified by the dimension.
Artificial Neural Networks/wiki/machine-learning/neural-networks/artificial-neural-networks/Mon, 19 Nov 2018 00:00:00 +0000/wiki/machine-learning/neural-networks/artificial-neural-networks/
Artificial neural networks work pretty well for solving some differential equations.
Universal Approximators
Maxwell Stinchcombe and Halbert White proved that there are no theoretical constraints preventing feedforward networks from approximating any measurable function. In principle, one can use feedforward networks to approximate measurable functions to any accuracy.
However, the convergence slows down if we have a lot of hidden units. There is a balance between accuracy and convergence rate. More hidden units lead to slower convergence but higher accuracy.
Ordinary Differential Equations/wiki/dynamical-system/ordinary-differential-method/Mon, 19 Nov 2018 00:00:00 +0000/wiki/dynamical-system/ordinary-differential-method/
For a first-order derivative $\frac{\partial f}{\partial t}$, there are many finite differencing methods.
Euler Method
For a first-order ODE,
$$ \frac{dy}{dx} = f(x, y), $$
we can discretize the equation using a step size $\delta x$ so that the differential equation becomes
$$ \frac{y_{n+1} - y_n }{ \delta x } = f(x_n, y_n), $$
which is also written as
$$ y_{n+1} = y_n + \delta x \cdot f(x_n, y_n). $$
Basics of Computation/wiki/computation/basics-of-computation/Thu, 13 Sep 2018 00:00:00 +0000/wiki/computation/basics-of-computation/
Storage, Precision, Error, etc
To have some understanding of how numbers are processed in computers, we first have to understand how they are stored.
Computers store everything in binary form 1. If we randomly pick some segment of the memory, we have no idea what it stands for unless we know the type of data it represents.
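To illustrate why the type matters, here is a small sketch that reads the same four bytes once as a 32-bit float and once as a 32-bit integer, using Python's standard `struct` module. The choice of the value 1.0 and of little-endian byte order is an assumption made only for this example.

```python
import struct

raw = struct.pack("<f", 1.0)            # four bytes holding the float 1.0
as_float = struct.unpack("<f", raw)[0]  # interpret the bytes as a 32-bit float
as_int = struct.unpack("<i", raw)[0]    # interpret the same bytes as a 32-bit int
print(raw.hex(), as_float, as_int)
```

The bytes are identical in both cases; only the assumed data type changes the number we read back.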
Some of the most used data types in data science are
Introduction to Node Crawler Series/wiki/nodecrawler/node-crawler-introduction/Sun, 15 Jul 2018 00:00:00 +0000/wiki/nodecrawler/node-crawler-introduction/
This is a set of tutorials that will help you build your very first crawler with node.js.
The plan of this tutorial is as follows. First of all, we will write a functional crawler using node.js and dump the data into files or simply print it on screen. In the following article, we will use MongoDB as our data management system and organize our data. Then we will optimize the crawler and tackle some of the pitfalls.
Jupyter Notebook/wiki/tools/jupyter/Wed, 20 Jun 2018 15:58:49 -0400/wiki/tools/jupyter/
Magics
%lsmagic will show all the magics, including line magics and cell magics.
- Line magics are magics that start with one %;
- Cell magics are magics that can be used in the whole cell even with line breaks, where the cell should start with %%.
%env can be used to set environment variables inside the notebook.
```
%env MONGO_URI=localhost:27072
```
%%bash is a cell magic that allows bash commands in the cell.
Regular Expression Basics/wiki/sugar/regular-experssions/Wed, 20 Jun 2018 15:58:49 -0400/wiki/sugar/regular-experssions/
List of Keys
Anchors
at the beginning of line ^
```python
import re

p = re.compile('^T', re.I)
line = "The email address is this this the do you see"
result = p.findall(line)
print(result)
# ['T']
```
at the end of the line $
```python
import re

p = re.compile('e$', re.I)
line = "The email address is this this the do you see"
result = p.findall(line)
print(result)
# ['e']
```
Character Classes
Printable Characters
any character .
Short-Time-Fourier-Transform/wiki/time-series/short-time-fourier-transform/Wed, 20 Jun 2018 15:58:49 -0400/wiki/time-series/short-time-fourier-transform/
Short-Time-Fourier-Transform
We transform the time series data using a Fourier transform with some window function
\begin{equation} \tilde Y[n,k] = \sum_m Y[n+m] W[m] e^{-i \lambda_k m}, \end{equation}
where $\lambda_k=2\pi k/N$ and $W[m]$ is the window function at $m$.
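The formula above can be evaluated directly. The following is a minimal pure-Python sketch, assuming that $N$ is the window length, that $m$ runs from $0$ to $N-1$, and that a Hann window is used; the function name `stft_coefficient` and the test signal are hypothetical.

```python
import cmath
import math


def stft_coefficient(y, window, n, k):
    """Evaluate sum_m Y[n+m] W[m] exp(-i * lambda_k * m) with lambda_k = 2*pi*k/N."""
    N = len(window)  # assumption: N is the window length
    lam = 2 * math.pi * k / N
    return sum(y[n + m] * window[m] * cmath.exp(-1j * lam * m) for m in range(N))


N = 16
hann = [0.5 - 0.5 * math.cos(2 * math.pi * m / N) for m in range(N)]
# Hypothetical signal: a cosine with exactly 4 cycles per window length.
signal = [math.cos(2 * math.pi * 4 * t / N) for t in range(64)]

magnitudes = [abs(stft_coefficient(signal, hann, n=0, k=k)) for k in range(N // 2 + 1)]
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
print(peak_bin)
```

Sliding `n` along the series and repeating the calculation gives the full time-frequency picture; in practice a library routine based on the FFT would be much faster than this direct sum.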
References and Notes
Coursera
Linear Methods/wiki/machine-learning/linear/linear-methods/Fri, 25 May 2018 00:00:00 +0000/wiki/machine-learning/linear/linear-methods/
Solving Classification Problems with Linear Models
One simple idea behind classification is to calculate the posterior probability of each class given the variables.
Suppose a dataset has features $F_\alpha$ where $\alpha = 1, 2, \cdots, K$, with corresponding class labels $G_\alpha$. The dataset provides $N$ datapoints, each denoted as $X_i$. The posterior of the classification is $P(G = G_\alpha \vert X = X_i)$.
A naive idea is to classify the data into two classes $m$ and $n$ using the boundary of a linear model
Machine Learning Overview/wiki/machine-learning/overview/Fri, 25 May 2018 00:00:00 +0000/wiki/machine-learning/overview/
What is Machine Learning
There are many objectives in machine learning. Two of the most common objectives are classification and regression. In classification and regression, the following four factors are relevant.
A simple framework of machine learning. The dataset $\tilde{\mathscr D}$ is first encoded by $\mathscr T$, $\mathscr D(\mathbf X, \mathbf Y) = \mathscr T(\tilde{\mathscr D})$. The dataset is fed into the model, $\bar{\mathbf Y} = f(\mathbf X;\mathbf \theta)$.
Unsupervised Learning/wiki/machine-learning/unsupervised/overview/Fri, 25 May 2018 00:00:00 +0000/wiki/machine-learning/unsupervised/overview/
Unsupervised Learning!
Principal component analysis
Clustering
K-means Clustering
Algorithm:
1. Assign data points to a group
2. Iterate until no change:
   - Find the centroid of each group
   - For each data point, find the centroid that is closest to it
   - Assign that data point to the group of that centroid
How Many Groups
The art of choosing K.
Hierarchical Clustering
Bottom-up hierarchical groups can be read out from the dendrogram.
Some Basic Ideas of Algorithms/wiki/algorithms/algorithms-basics/Tue, 20 Mar 2018 00:00:00 +0000/wiki/algorithms/algorithms-basics/
This set of notes on algorithms is not meant to be comprehensive or complete. These notes are being used as a skeleton framework. There are many useful books to learn about algorithms from a utilitarian point of view. I have listed a few in the references section.
Numerical Recipes is a very comprehensive book that I used during my PhD. It covers almost all the algorithms you need for scientific computing.
The C++ Language/wiki/programming-languages/cpp/references/Tue, 20 Mar 2018 00:00:00 +0000/wiki/programming-languages/cpp/references/
C++!
Books
- The C++ Programming Language
- Programming Principles and Practice Using C++
- The C++ Primer
Lectures
C++ Beginners Tutorial 1 (For Absolute Beginners) C++ Programming Introduction to C++ Coursera Course: C++ For C Programmers, Part A Top C++ Courses and Tutorials On SoloLearn: C++ Tutorial
Practice
SoloLearn provides this code playground that we can use to test C++ code. There is also repl.it
Libraries
For solving differential equations:
The Python Language: Basics/wiki/programming-languages/python/basics/Tue, 20 Mar 2018 00:00:00 +0000/wiki/programming-languages/python/basics/
Numbers, Arithmetics
Two types of numbers exist,
- int
- float, 15 digits, other digits are float error
It is worth noting that in Python 2, we have
```python
print(1.0/3)  # will give us float numbers
# 0.333333333333
```
while
```python
print(1/3)  # will only give us int
# 0
```
However, this was changed in Python 3.
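For comparison, here is a short sketch of the Python 3 behavior: `/` always performs true division, while `//` performs floor division. The variable names are only for illustration.

```python
true_division = 1 / 3      # a float in Python 3, even for two ints
floor_division = 1 // 3    # floor division of two ints stays an int
float_floor = 1.0 // 3     # floor division involving a float gives a float

print(type(true_division).__name__, true_division)
print(type(floor_division).__name__, floor_division)
print(type(float_floor).__name__, float_floor)
```

So code that relied on `1/3` giving `0` in Python 2 needs `//` in Python 3.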
Variables, Functions, Conditions
A variable name should start with either a letter or an underscore.
Variables defined inside a function are local and cannot be found or used outside the function.
Principles of Colors/wiki/data-visualization/colors/Fri, 20 Nov 2020 00:00:00 +0000/wiki/data-visualization/colors/
Basic Concepts of Colors
Color Wheel and Color Sphere
There are two dimensions in the color wheel:
- Hue
- Saturation
When we add another dimension, lightness, to the wheel, we have a color sphere (1, 2).
Many color systems have been invented. The color wheel and the color sphere are two examples.
Data Types and Level of Measurement in Machine Learning/wiki/machine-learning/feature-engineering/data-types/Wed, 15 Jan 2020 00:00:00 +0000/wiki/machine-learning/feature-engineering/data-types/
Types of Data
There are several debatable categorization methods of data.
The first widely spread theory, or level of measurement, is by S. Stevens. The theory categorizes data into four types, nominal, ordinal, interval, and ratio.
Other methods are proposed for other fields of research. For example, N. R. Chrisman proposed a different method for cartography. However, these are not generic enough for data science. They are more general than a specific field of research.
Decision Tree/wiki/machine-learning/tree-based/decision-tree/Wed, 25 Dec 2019 00:00:00 +0000/wiki/machine-learning/tree-based/decision-tree/
In this article, we will explain how decision trees work and build a tree by hand.
The code used in this article can be found in this repo.
Definition of the problem
We will decide whether one should go to work today. In this demo project, we consider the following features.

| feature | possible values |
|---|---|
| health | 0: feeling bad, 1: feeling good |
| weather | 0: bad weather, 1: good weather |
| holiday | 1: holiday, 0: not holiday |

For more compact notations, we use the abstract notation $\{0,1\}^3$ to describe a set of three features each with 0 and 1 as possible values.
Bayesian Linear Regression/wiki/machine-learning/bayesian/bayesian-linear-regression/Tue, 18 Jun 2019 00:00:00 +0000/wiki/machine-learning/bayesian/bayesian-linear-regression/
Linear Regression and Likelihood
The linear estimator $y$ is
$$ \begin{equation} y^n = \beta^m X_m^{\phantom{m}n}. \label{eq-linear-model} \end{equation} $$
As usual, we have redefined our data to get rid of the intercept $\beta^0$.
In ordinary linear models, we define the error as the difference between the target $\hat y$ and the estimator $y$
$$ \epsilon = \hat y - y, $$
which is required to have a minimum absolute value.
In linear regressions, we use least squares to solve the problem.
NMF: Nonnegative Matrix Factorization/wiki/machine-learning/factorization/nmf/Thu, 13 Jun 2019 00:00:00 +0000/wiki/machine-learning/factorization/nmf/
Decomposition
To make it easier to understand, we start with a data point $\mathbf P$ in a $k$-dimensional space spanned by $k$ basis vectors $\mathbf V^k$. Naturally, we could write down the component decomposition of the point using the basis vectors $\mathbf V^k$,
$$ \mathbf P = P_k \mathbf V^k. $$
This is immediately obvious to us since we have been dealing with rank 2 $(k, 1)$ basis vectors and we are talking about the $k$ coordinates for a point.
Word2vec/wiki/machine-learning/embedding/word2vec/Thu, 13 Jun 2019 00:00:00 +0000/wiki/machine-learning/embedding/word2vec/
Word2vec is a word embedding model that learns the probability of some words being neighbours in a sentence $p_{neighbours}(w_i, w_o)$.
- Build a dataset of adjacent words. CBOW; skipgram; negative sampling;
- Encode the words using vectors.
- Build a model $f(\{\theta_i\})$ to calculate the probability of the words being neighbours and improve the parameters $\{\theta_i\}$ using the dataset.
Bias-Variance/wiki/machine-learning/basics/bias-variance/Fri, 07 Jun 2019 00:00:00 +0000/wiki/machine-learning/basics/bias-variance/
Bias and Variance
Suppose $f(X)$ is a perfect model that represents a “tight” model of the dataset $(X,Y)$ up to some irreducible error $\epsilon$,
$$ \begin{equation} Y = f(X) + \epsilon. \label{dataset-using-true-model} \end{equation} $$
On the other hand, we build another model using a specific method such as k-nearest neighbors, which is denoted as $k(X)$.
Why the two models?
Why are we talking about the perfect model and a model using a specific method?
Types of Errors in Statistical Hypothesis Testing/wiki/statistical-hypothesis-testing/type-1-error-and-type-2-error/Fri, 31 May 2019 00:00:00 +0000/wiki/statistical-hypothesis-testing/type-1-error-and-type-2-error/
Type I and Type II Errors
In statistical hypothesis testing, we always have a null hypothesis $H_0$ which refers to the statement to be tested. We have two possible conclusions from a hypothesis test,
- to accept the hypothesis, that is concluding that $H_0$ is true,
- to reject the hypothesis, that is concluding that $H_0$ is false.
However, it is possible that our conclusion is not correct. There are four possible results.
Amazon CloudWatch Logs/wiki/tools/awslogs/Mon, 11 Mar 2019 00:00:00 +0000/wiki/tools/awslogs/
Why
Suppose we have all kinds of pipelines written in different languages, using different tools, and located in different places. It would be frustrating to pull out the logs.
This is why we need a centralized log service, for example CloudWatch.
Sending logs to CloudWatch
First of all, send your logs to awslogs. The easiest way is to use boto.
Retrieving and Analyzing Logs
First of all, we need this: awslogs.
Confidence Interval/wiki/statistical-estimation/confidence-interval/Sun, 20 Jan 2019 00:00:00 +0000/wiki/statistical-estimation/confidence-interval/
We will use upper cases for the abstract variable and lower cases for the actual numbers.
Why is Confidence Interval Needed?
Suppose I sample the population multiple times and calculate the mean value $\mu_i$ of each sample. It is a good question to ask how different these $\mu_i$ are compared to the true mean $\mu_p$ of the population.
In this article, we would need to specify several notations.
Jargons/wiki/statistics/jargons/Sat, 24 Nov 2018 00:00:00 +0000/wiki/statistics/jargons/
Accuracy and Precision
- Accuracy: the measurement compared to the truth
- Precision: variability of repeated measurements; the more precise, the less variation between measurements.

| | Accurate | Inaccurate |
|---|---|---|
| Precise | Close to true value, small variations in each measurement | Far from true value, small variations in each measurement |
| Imprecise | Close to true value, large variations in each measurement | Far from true value, large variations in each measurement |

Here is an example.
Extract, Transform and Load/wiki/data-warehouse/extract-transform-load/Fri, 23 Nov 2018 00:00:00 +0000/wiki/data-warehouse/extract-transform-load/
ETL Process
ETL
- Extract: extract data from sources
- Transform: transform it to proper format
- Load: load it to data storage infrastructure
E for Extract
Should not affect the source system.
T for Transform
- Cleaning
- Filtering
- Enriching
- Splitting
- Joining
L for Load
Deal with sync and waiting
Partial Differential Equations/wiki/dynamical-system/partial-difference-method/Mon, 19 Nov 2018 00:00:00 +0000/wiki/dynamical-system/partial-difference-method/
Forward Time Centered Space
For $\frac{d f}{d t} = - v \frac{ d f }{ dx }$, we write down the finite difference form 1
$$ \frac{f(t_{n+1}, x_i ) - f(t_n, x_i)}{ \Delta t } = - v \frac{ f(t_n, x_{i+1}) - f(t_n, x_{i-1}) }{ 2\Delta x }. $$
FTCS is an explicit method and is not stable.
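The instability can be seen numerically. The following is a minimal sketch that applies the FTCS update above to a sine wave on a periodic grid; the grid size, the value of $c = v \Delta t / \Delta x$, the number of steps, and the function name `ftcs_step` are all hypothetical choices for illustration.

```python
import math


def ftcs_step(f, c):
    """One FTCS update on a periodic grid, with c = v * dt / dx."""
    n = len(f)
    # f[i - 1] wraps around for i = 0, and (i + 1) % n wraps at the other end.
    return [f[i] - 0.5 * c * (f[(i + 1) % n] - f[i - 1]) for i in range(n)]


n_points, c, n_steps = 50, 0.5, 200  # hypothetical grid and step sizes
f = [math.sin(2 * math.pi * i / n_points) for i in range(n_points)]
initial_amplitude = max(abs(value) for value in f)

for _ in range(n_steps):
    f = ftcs_step(f, c)

final_amplitude = max(abs(value) for value in f)
print(initial_amplitude, final_amplitude)
```

The exact solution only shifts the wave without changing its height, so a growing amplitude is purely a numerical artifact of the scheme.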
Lax Method
Change the term $f(t_n, x_i)$ in FTCS to $( f(t_n, x_{i+1}) + f(t_n, x_{i-1}) )/2$ 1.
Basics of Programming/wiki/computation/basics-of-programming/Sun, 23 Sep 2018 00:00:00 +0000/wiki/computation/basics-of-programming/
Recursive and Iterative
Iterative and recursive methods are two quite different approaches to the same kind of problems.
Here we will calculate the factorial of $n$. We define two functions using the iterative method and the recursive method.
Run the program on Repl.it.
```python
def recursiveFactorial(n):
    if n == 0:
        return 1
    else:
        return n * recursiveFactorial(n - 1)


def iterativeFactorial(n):
    ans = 1
    i = 1
    while i <= n:
        ans = ans * i
        i = i + 1
    return ans


print(recursiveFactorial(0))
print(iterativeFactorial(0))
```
Basic Node Crawler/wiki/nodecrawler/basic-crawler/Sun, 15 Jul 2018 00:00:00 +0000/wiki/nodecrawler/basic-crawler/
Prerequisites
Nodejs >= 8.9
Overview
A model for a crawler is as follows.
A crawler requests data from the server, while the server responds with some data. Here is a graphic illustration
```
+----------+                     +-----------+
|          |    HTTP Request     |           |
|          +-------------------> |           |
|  Nodejs  |                     |  Servers  |
|          | <-------------------+           |
|          |    HTTP Response    |           |
+----------+                     +-----------+
```
HTTP Requests
For a good introduction to HTTP requests, please refer to this video on YouTube: Explained HTTP, HTTPS, SSL/TLS
API
As for the first step, we need to find which url to request.
Autoregressive Model/wiki/time-series/autoregressive-model/Wed, 20 Jun 2018 15:58:49 -0400/wiki/time-series/autoregressive-model/
Autoregressive
Given a time series $\{T^i\}$, a simple predictive model can be constructed using an autoregressive model.
$$ \begin{equation} T^t = \sum_{i=1}^p \beta_i T^{t - i} + \beta^t + \beta^0. \end{equation} $$
Such a model is usually called an AR(p) model due to the fact that we are using data back in $p$ steps.
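As a minimal sketch of how such a model makes a one-step prediction, the snippet below sums the last $p$ observations weighted by the coefficients $\beta_i$ and adds the constant $\beta^0$. The term $\beta^t$ is left out here because its meaning is not specified in this excerpt; the function name `ar_predict`, the series, and the coefficients are hypothetical.

```python
def ar_predict(history, betas, beta0=0.0):
    """Predict the next value from the last p = len(betas) observations."""
    p = len(betas)
    recent = history[-p:][::-1]  # recent[0] is T^{t-1}, recent[1] is T^{t-2}, ...
    return beta0 + sum(b * x for b, x in zip(betas, recent))


history = [1.0, 2.0, 3.0, 4.0]  # hypothetical time series
prediction = ar_predict(history, betas=[2.0, -1.0])  # AR(2): 2 * 4.0 - 1 * 3.0
print(prediction)
```

In practice the coefficients would be fitted to data, for example by least squares, rather than chosen by hand.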
Differential Equation
For simplicity we will look at an AR(1) model. Assume the time series has a step size of $dt$, our model can be rewritten as
Unsupervised Learning: PCA/wiki/machine-learning/unsupervised/pca/Fri, 25 May 2018 00:00:00 +0000/wiki/machine-learning/unsupervised/pca/
We use the Einstein summation notation in this article. Principal Component Analysis (PCA) is a commonly used trick for dimensionality reduction so that the new features represent most of the variance of the data.
Representations of Dataset
In theory, a dataset can be represented by a matrix if we specify the basis. However, the initially given basis is not always the most convenient one. If we find a new basis for the dataset, the matrix representation may be simpler and easier to use.
Data Structure/wiki/algorithms/data-structure/Tue, 20 Mar 2018 00:00:00 +0000/wiki/algorithms/data-structure/
Dealing with data structure is like dealing with your clothes. Some people randomly drop their clothes somewhere without thinking. But it takes time to retrieve a specific T-shirt. Some people spend more time folding and arranging their clothes. This process makes it easy to find a specific T-shirt. Similar to retrieving clothes, there is always a balance between the computation time (retrieving clothes) and the coding time (folding clothes).
Keywords
This section serves as some kind of flashcard keywords.
The C++ Language: Basics/wiki/programming-languages/cpp/basics/Tue, 20 Mar 2018 00:00:00 +0000/wiki/programming-languages/cpp/basics/
Make it Work
Apart from the traditional way of running C++ code, Jupyter notebook has a cling kernel that makes it possible to run C++ in a Jupyter notebook. Here is the post: Interactive C++ for HPC.
Concepts
- Namespace
- Operators: assignment operators (=,+=,-=,*=,/=,%=), increment/decrement operator (++x,x++,--x,x--), relational operators (>,<,>=,<=,==,!=), logical operators (&&,||,!), left shift (<<), extraction operator (>>, or right shift), understand the operator precedence
- Variables: variable name starts with underscore or Latin letters, Pascal case (PascalCase), Camel case (camelCase)
- if/else and Loops: if (condition is true ){ then something }
- Data Types: string (double quote), character (char, 1 byte ASCII character, using single quote), float (4 bytes, always signed), double (8 bytes, always signed), long double (8 or 16 bytes, always signed), signed or unsigned short or long int (signed long int, unsigned int)
- Pointers: ampersand (&) accesses the address, pointer is variable thus needs to be declared using asterisk (*), can be declared to be int or double or float or char (int *pt; int* pt; int * pt;)
- Functions: overload, recursion
- Class: identity, attributes, method/behavior, access specifiers (private or public or protected, by default it is set to private), instantiation of object (creating object), constructor, destructor, encapsulation, scope resolution operator (TheClassYouNeed::somefunction()), selection operator (dot member selection .
The Python Language: Decorators/wiki/programming-languages/python/decorators/Tue, 20 Mar 2018 00:00:00 +0000/wiki/programming-languages/python/decorators/
Functions: first-class objects; can be passed around as arguments.
What this tells us is that functions can be passed into a function or even returned by a function. For example,
```python
def a_decoration_function(yet_another_function):
    # wrapper adds behavior before and after the wrapped function
    def wrapper():
        print('Before yet_another_function')
        yet_another_function()
        print('After yet_another_function')
    return wrapper


def yet_another_function():
    print('This is yet_another_function')
```
When we execute a_decoration_function(yet_another_function)(), we will have
```
Before yet_another_function
This is yet_another_function
After yet_another_function
```
So a decorator is simply a function that takes a function as an argument, adds some salt to it.
A Physicist's Crash Course on Artificial Neural Network/wiki/machine-learning/neural-networks/physicists-crash-course-neural-network/Sat, 02 May 2015 00:00:00 +0000/wiki/machine-learning/neural-networks/physicists-crash-course-neural-network/
What is a Neuron
What a neuron does is to respond when a stimulation is given. This response could be strong or weak or even null. If I draw a figure of this behavior, it looks like this.
Neuron response
Using simple single neuron responses, we could compose complicated responses. To achieve that, we study the transformations of the response first.
transformations
Artificial Neural Network
A simple network is a collection of neurons that respond to stimulations, which could be the responses of other neurons.
Random Forest/wiki/machine-learning/tree-based/random-forest/Wed, 25 Dec 2019 00:00:00 +0000/wiki/machine-learning/tree-based/random-forest/
Random forest is an ensemble method based on decision trees. Instead of using one decision tree and modeling on all the features, the random forest method models on random sets of features (feature subspaces) using many decision trees and makes decisions by democratizing the trees.
Given a proper dataset $\mathscr D(\mathbf X, \mathbf y)$, the ensemble of trees, denoted as $\{f_i(\mathbf X)\}$, will predict an ensemble of results.
Predictions Using Time Series Data/wiki/time-series/predictions-time-series-data/Fri, 21 Jun 2019 00:00:00 +0000/wiki/time-series/predictions-time-series-data/
General Phenological Model for Seasonality
In business, time series data $f(t)$ usually carries information about trend $g(t)$ ($g$ is used since trend is usually growth), seasonalities (periodical effects) $p(t)$, holiday effects (structural effects) $s(t)$, etc. We will decompose a time series $f(t)$ into four components
$$ \begin{equation} f(t) = g(t) + p(t) + s(t) + \epsilon(t). \end{equation} $$
To train a model for the predictions, we need to write down the exact models of these three predictable components.
Tensor Factorization/wiki/machine-learning/factorization/tensor-factorization/Mon, 17 Jun 2019 00:00:00 +0000/wiki/machine-learning/factorization/tensor-factorization/
Tensors
We will be talking about tensors but we will skip the introduction to tensors for now.
In this article, we follow a commonly used convention for tensors in physics, the abstract index notation. We will denote tensors as $T^{ab\cdots}_ {\phantom{ab\cdots}cd\cdots}$, where the Latin indices such as $^{a}$ are simply placeholders for the slots of this “tensor machine”. For a given basis (coordinate system), we can write down the components of this tensor $T^{\alpha\beta\cdots} _ {\phantom{\alpha\beta\cdots}\gamma\delta\cdots}$.
Anscombe's quartet/wiki/data-visualization/anscombes-quartet/Mon, 18 Mar 2019 00:00:00 +0000/wiki/data-visualization/anscombes-quartet/
Anscombe’s Quartet
Anscombe’s quartet is a brilliant idea that shows the importance and convenience of visual representation of data.
Anscombe’s quartet has four datasets. The values of each dataset are shown below.
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
x2 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.
OLAP Operations/wiki/data-warehouse/olap-operations/Fri, 23 Nov 2018 00:00:00 +0000/wiki/data-warehouse/olap-operations/
Roll-up or Drill-up
The word ‘up’ in the names refers to going up in concept hierarchies.
For example, we would like to know the revenue of the whole year. However, the record of data is
| Date | Revenue |
|---|---|
| 2018-01-01 | 1023 |
| 2018-01-02 | 934 |
| … | … |
| 2018-12-30 | 1244 |
| 2018-12-31 | 1302 |

Roll-up is performed by summing up everything in the revenue column.
Finite Element Method/wiki/dynamical-system/finite-element-method/Mon, 19 Nov 2018 00:00:00 +0000/wiki/dynamical-system/finite-element-method/
Differential Equations and Boundary Conditions
Two Types of Boundary Conditions
As an example, we have a differential equation
$$ \frac{d^2u}{dx^2} + f = 0, $$
which describes a 1D problem.
Dirichlet boundary condition: specify values for $u$, such as $u(0)=u_0$ and $u(L)=u_L$; Neumann boundary condition: specify values for $u_{,x}$. If we have only Neumann boundary conditions, the solution is not unique. One example is a bar tossed in the air: it has Neumann boundary conditions at both ends, yet it is still free to move.Chi-square Correlation Test for Nominal Data/wiki/statistics/correlation-analysis-chi-square/Sun, 18 Nov 2018 00:00:00 +0000/wiki/statistics/correlation-analysis-chi-square/In this article, we will discuss the chi-square correlation test for detecting correlations between two series.
Steps Find out all the possible values of the two nominal series A and B; Count the co-occurrences of the combinations (A, B); Calculate the expected co-occurrences of the combinations (A, B); Calculate chi-square; Determine whether the hypothesis can be rejected. Define the Series Suppose we are analyzing two series A and B.Basics of Network/wiki/computation/basics-of-network/Sun, 23 Sep 2018 00:00:00 +0000/wiki/computation/basics-of-network/HTTP Keywords Hyper Text Transfer Protocol: deliver hyper text from server to local browser etc. Based on TCP/IP Current version: HTTP/2 Server - Client Client can request through GET, HEAD, POST, PUT, DELETE, TRACE, OPTIONS, CONNECT, PATCH. Transfer anything defined by Content-Type Connectionless Protocol: doesn’t maintain the connection all the time Stateless protocol: A very nice explanation URL Keywords Uniform Resource Locator Interpret each part of this URL: http://abc.Unsupervised Learning: SVM/wiki/machine-learning/unsupervised/svm/Fri, 17 Aug 2018 00:00:00 +0000/wiki/machine-learning/unsupervised/svm/SVM is calculating a hyperplane to separate the data points into groups according to the label.
Hyperplane A hyperplane is defined to be of the following form
$$ \begin{equation} \boldsymbol{\beta} \cdot \mathbf x = \beta_0. \end{equation} $$
where $\boldsymbol\beta$ is the normal vector to the plane and is required to be constant.
It is straightforward to show that the distance $d$ from an arbitrary point $\mathbf x'$ to the hyperplane isManage Data Using MongoDB/wiki/nodecrawler/manage-data-using-mongodb/Wed, 18 Jul 2018 00:00:00 +0000/wiki/nodecrawler/manage-data-using-mongodb/In most cases, databases make the management of data quite convenient. In this article, we will scrape data using the code we discussed before but write the data into MongoDB.
For installation of MongoDB, please refer to the official documentation.
The Code To write data to MongoDB using Node.js, we choose the package mongojs, which provides almost exactly the standard MongoDB syntax.
To install mongojs,
npm i mongojs --save Here is a module that can write data to MongoDB.The Python Language: Multi-Processing/wiki/programming-languages/python/multiprocessing/Thu, 10 May 2018 00:00:00 +0000/wiki/programming-languages/python/multiprocessing/Python has built-in multiprocessing module in its standard library.
One simple example of using the Pool class is the following.
def myfunc(myfuncargs): 'some thing here' with Pool(10) as p: records = p.map(myfunc, myfuncargs) However, this approach has limitations: the function and its arguments must be picklable. Another approach is to use threads.
from multiprocessing import Pool from multiprocessing.dummy import Pool as ThreadPool with ThreadPool(1) as p: records = p.map(myfunc, myfuncargs) Beware that map function will feed in a list of args to the function.Data Structure: Tree/wiki/algorithms/data-structure-tree/Tue, 27 Mar 2018 00:00:00 +0000/wiki/algorithms/data-structure-tree/mind the data structure: here comes the treeThe C++ Language: Numerical Methods/wiki/programming-languages/cpp/numerical/Tue, 20 Mar 2018 00:00:00 +0000/wiki/programming-languages/cpp/numerical/Modularize The code should be designed to separate physics or model from numerical methods. Speed vectors are convenient but slow. 1 Do not copy arrays if not necessary. The example would be for a function return. Most of the time, we can pass the pointer of an array to the function and update the array itself without copying anything and no return is needed at all. inline function.GNUPlot/wiki/tools/gnuplot/Mon, 04 Sep 2017 00:00:00 +0000/wiki/tools/gnuplot/Examples Plot .csv data. Suppose we have data of such.
-0.00999983, 0.99995 -0.0199987, 0.9998 -0.0299955, 0.99955 -0.0399893, 0.9992 -0.0499792, 0.99875 -0.059964, 0.998201 To plot the second column against the first column, we use the using parameter in gnuplot.
gnuplot -e "set terminal png; set datafile separator ',' ; plot 'complex.txt' using 1:2" | imgcat # datafile separator is not always necessary # imgcat is a script in iterm2 on macBoltzmann Machine/wiki/machine-learning/neural-networks/boltzmann-machine/Sun, 27 Aug 2017 00:00:00 +0000/wiki/machine-learning/neural-networks/boltzmann-machine/Boltzmann machine is much like a spin glass model in physics. In short, a Boltzmann machine is a machine that has nodes that can take values, and the nodes are connected through some weight. It is just like any other neural net but with complications and theoretical implications.
Boltzmann Machine and Physics To obtain a good understanding of the Boltzmann machine for a physicist, we begin with the Ising model. We construct a system of neurons $\{ s_i\}$ which can take values of 1 or -1, where each pair of them $s_i$ and $s_j$ is connected by weight $J_{ij}$.Terminal/wiki/tools/terminal/Tue, 31 Dec 2019 00:00:00 +0000/wiki/tools/terminal/Navigating Some tips to help data scientists navigate faster in the terminal.
pushd, popd and dirs pushd to register and change directories: pushd folder_name will change current directory to folder_name and register the folder folder_name in our stack. If no folder name is passed onto the command, it will default to the $HOME folder. popd to go to the last directory in the stack and remove it from the stack. In this example, popd will change the current working directory to folder_name.Statistical Sign Test/wiki/statistical-hypothesis-testing/sign-test/Sun, 20 Jan 2019 00:00:00 +0000/wiki/statistical-hypothesis-testing/sign-test/We have a small dataset, but it doesn’t satisfy the t-test conditions. Then we would use as few assumptions as possible.
Wine Taste Suppose we have two bottles of wine, one of them is 300 euros while the other is 100 euros.
Now we ask the question:
Does expensive wine taste better?
We find 10 experts and give them some experiments. The result is recorded then processed into the following table.Data Storage/wiki/data-warehouse/data-storage/Fri, 23 Nov 2018 00:00:00 +0000/wiki/data-warehouse/data-storage/tl;dr: Use type safe formats such as HDF5 or parquet
HDF5 BCOLZ (http://bcolz.blosc.org/en/latest/): not designed for multidimensional data. Zarr (https://github.com/alimanfoo/zarr): works with multidimensional data and also parallel computing. Blaze ecosystem (http://blaze.pydata.org/) An article that compares HDF5, BCOLZ, and Zarr: To HDF5 and beyond
I also recommend pandas. It is a Python module that works very well with data. It even loads HDF5 out of the box.Bin Size of Histogram/wiki/data-visualization/histogram-bin-size/Thu, 22 Nov 2018 00:00:00 +0000/wiki/data-visualization/histogram-bin-size/Histograms are good for understanding the distribution of your data.
The Bin Size Problem As an example, we will use the following series.
[1.45,2.20,0.75,1.23,1.25,1.25,3.09,1.99,2.00,0.78,1.32,2.25,3.15,3.85,0.52,0.99,1.38,1.75,1.21,1.75] If we use bin size 1, we get this spiky chart and it is not very informative.
We could also set bin size to 2.
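To make the effect of the bin size concrete, here is a minimal sketch that counts the series above into bins of width 1 and width 2. The helper name `histogram` and the choice of left-closed bins starting at 0 are assumptions for illustration, not part of the original article.

```python
from collections import Counter

data = [1.45, 2.20, 0.75, 1.23, 1.25, 1.25, 3.09, 1.99, 2.00, 0.78,
        1.32, 2.25, 3.15, 3.85, 0.52, 0.99, 1.38, 1.75, 1.21, 1.75]


def histogram(values, bin_size):
    """Count values per bin; each key is the left edge of a bin [edge, edge + bin_size)."""
    counts = Counter(int(v // bin_size) * bin_size for v in values)
    return dict(sorted(counts.items()))


# Compare the counts for the two bin sizes discussed in the text.
for bin_size in (1, 2):
    print(bin_size, histogram(data, bin_size))
```

Changing only `bin_size` regroups the same 20 values, which is why the shape of the chart depends so strongly on this one parameter.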
In principle, we could keep tuning the bin size until we get something pretty and informative. But that would be quite depressing.Correlation Coefficient and Covariance for Numeric Data/wiki/statistics/correlation-coefficient/Sun, 18 Nov 2018 00:00:00 +0000/wiki/statistics/correlation-coefficient/Covariances Correlation coefficient is also known as the Pearson’s product moment coefficient. Review of Standard Deviation For a series of data A, we have the standard deviation
$$ \sigma_A = \sqrt{ \frac{ \sum (a_i - \bar A)^2 }{ n } }, $$
where $n$ is the number of elements in series A.
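As a quick check of the formula, here is a minimal sketch of the population standard deviation with the $1/n$ normalization used above. The function name `population_std` and the sample series are assumptions chosen for illustration.

```python
import math


def population_std(series):
    """Standard deviation with 1/n normalization: sqrt(sum((a_i - mean)^2) / n)."""
    n = len(series)
    mean = sum(series) / n
    return math.sqrt(sum((a - mean) ** 2 for a in series) / n)


# Hypothetical series A, used only to exercise the formula.
A = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_std(A))
```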
The standard deviation is very easy to understand. It is basically the root-mean-square distance between the data points and the average value.Basics of Database/wiki/computation/basics-of-database/Wed, 03 Oct 2018 00:00:00 +0000/wiki/computation/basics-of-database/NoSQL NoSQL = Not only SQL. The main types of NoSQL databases are
Key-value store: Amazon Dynamo, memcached, Amazon SimpleDB Column-oriented store: Google BigTable, Cassandra Graph database: Neo4j, VertexDB Document database: MongoDB Object database: ZODB Database Operations Relations Union: $A\cup B$ Intersection: $A\cap B$ $A - B$ Cartesian Product: $A \times B$ Query Union in database: will combine the data with matching common columns.Restrictions of Websites/wiki/nodecrawler/restrictions/Thu, 19 Jul 2018 00:00:00 +0000/wiki/nodecrawler/restrictions/Beware that scraping data off websites is neither always allowed nor as easy as a few lines of code. The preceding articles enable you to scrape a lot of data; however, many websites have countermeasures. In this article, we will be dealing with some of the common ones.
Request Frequency Some websites have limitations on the frequency of API requests. The solution to this is simply a brief pause after each request.The Python Language: Performance/wiki/programming-languages/python/performance/Tue, 20 Mar 2018 00:00:00 +0000/wiki/programming-languages/python/performance/Read the references for performance.
The message:
Use comprehensions Use generatorsBoxplot/wiki/data-visualization/boxplots/Tue, 20 Aug 2019 00:00:00 +0000/wiki/data-visualization/boxplots/The Whiskers in Boxplot The whiskers mark the boundary beyond which data points are treated as outliers.
Outliers are determined using the interquartile range (IQR, i.e., the range from the 25th percentile to the 75th percentile). The whiskers usually extend to the lowest data point within 1.5 IQR below the 25th percentile and to the highest data point within 1.5 IQR above the 75th percentile.Mann-Whitney U Test/wiki/statistical-hypothesis-testing/mann-whitney-u-test/Sun, 20 Jan 2019 00:00:00 +0000/wiki/statistical-hypothesis-testing/mann-whitney-u-test/Mann-Whitney U is good at testing heavy-tailed data.Basics of SQL/wiki/computation/basics-of-sql/Mon, 19 Nov 2018 00:00:00 +0000/wiki/computation/basics-of-sql/Adding a new field to data:
Relational: requires a new column Non-Relational: just add the field to one single document, thus can be easily decentralized. Basics and Background SQL: Structured Query Language
Relational Database:
usually in tables rows are called records columns are certain types of data. Data types of rows are specified: INTEGER TEXT DATE REAL, real numbers NULL … RDBMS: Relational Database Management System, most RDBMS use SQL as the query language.Normalization Methods for Numeric Data/wiki/statistics/normalization-methods/Sun, 18 Nov 2018 00:00:00 +0000/wiki/statistics/normalization-methods/Normalization of data is critical for statistical analysis and feature engineering.
Min-max Normalization This method is linear and straightforward.
Suppose we are analyzing series A, with elements $a_i$. We already know the min and max of the series, $a_{min}$ and $a_{max}$.
Now we would like to normalize the series to be within the range $[a'_{min}, a'_{max}]$. We simply solve for the value of $a'_i$ in $$ \frac{a'_i - a'_{min}}{a'_{max} - a'_{min}} = \frac{a_i - a_{min}}{a_{max} - a_{min}}, $$ where everything on the right hand side is known and $a'_{min}$ and $a'_{max}$ are chosen as the new min and max to be scaled to.Optimization/wiki/nodecrawler/optimization/Thu, 19 Jul 2018 00:00:00 +0000/wiki/nodecrawler/optimization/In this article, we will be optimizing the crawler to get better performance.
Batch Jobs In the article about using MongoDB as data storage, we wrote the data to the database whenever we got it. In practice, this is not efficient at all. This is where batch jobs come in: it is much better to write to the database in batches.
If you recall, the code we used to write to database isGit/wiki/tools/git/Wed, 22 Jun 2016 00:00:00 +0000/wiki/tools/git/Git Services GitHub Bitbucket GitLab Using Git with GUI There are huge amounts of git commands! There are also a lot of GUIs if you don’t like command line.
GitHub Desktop GitKraken SourceTree … Useful Commands To check all the commits related to a file, use git log -u. Try out git log -g before determining which reflog to deal with. To compare the changes with the last commit, use git diff --cached HEAD~1.Linear Regression/wiki/statistics/linear-regression/Tue, 01 Jan 2019 00:00:00 +0000/wiki/statistics/linear-regression/In this article, we will use the Einstein summation convention. For example, $$ X_{ij}\beta_ j $$ is equivalent to $$ \sum_j X_{ij}\beta_ j $$ In statistics, we have at least three categories of quantities:
data and labels abstract theoretical quantities parameters and predictions of models The convention is that quantities with $\hat {}$ are the model quantities. Sometimes we do not distinguish the abstract theoretical quantities and model quantities.Basics of MongoDB/wiki/computation/basics-of-mongodb/Wed, 03 Oct 2018 00:00:00 +0000/wiki/computation/basics-of-mongodb/This MongoDB Cheatsheet is my best friend.
MongoDB Concepts Documents Collections: just like tables in SQL. Database MongoShell Some examples:
// show the databases show dbs // show collections show collections //set any database to current database use database_name // insert entry db.database_name.insert( an_object_2_be_the_entry ) // read document db.database_name.findOne({'some_field':'value_of_field'}) db.database_name.find() // prettify db.database_name.find().pretty()Describing Multi-dimensional Data/wiki/statistics/multidimensional-data/Mon, 03 Dec 2018 00:00:00 +0000/wiki/statistics/multidimensional-data/Descriptions of Multidimensional Data Dispersion Matrix As defined in Correlation Coefficient and Covariance for Numeric Data, covariance is about the variance of two series. This property makes it easy to generalize it to multidimensional data.
The generalized quantity is named as dispersion matrix. Suppose we have a $p$ dimensional dataset $X$,
index $x_1$ $x_2$ … $x_p$ 1 2.3 12.3 83.2 9.3 … … … … … N 3.Signal Processing/wiki/algorithms/singal-processing/Tue, 20 Mar 2018 00:00:00 +0000/wiki/algorithms/singal-processing/There are many fascinating ideas in signal processing.Signal Processing: Audio Basics/wiki/algorithms/signal-processing-audio/Thu, 29 Mar 2018 00:00:00 +0000/wiki/algorithms/signal-processing-audio/Keywords Harmonic structure of sound Parson code of music Linear time-invariant theory Autocorrelation Noise Chirps DCT compression Discrete Fourier transform filtering convolution Linear Time-Invariant System We describe the system with $Y(t) = f(X(t))$, where $X(t)$ is the input, and $Y(t)$ is the output.
Linear: $f(a X_1(t) + b X_2(t)) = a f(X_1(t)) + b f(X_2(t))$ Time-invariant: input $X(t+\Delta t)$ will produce the shifted signal $Y(t+\Delta t)$. LTI systems are memory systems, causal, real, and stable.Basics of MapReduce/wiki/algorithms/map-reduce/Wed, 03 Oct 2018 00:00:00 +0000/wiki/algorithms/map-reduce/Centralized servers are not efficient for big data. Querying and processing data on centralized servers would hit the bottleneck of the servers.
MapReduce is used to solve these problems of big data.
Map: take series of key-value pairs and divide them into groups. Reduce: recombine the key-value pairs Check out the code challenges of MapReduce on HackerRank.A Simple Machine Learning Project Framework/blog/data-science/a-simple-machine-learning-framework/Tue, 12 Jan 2021 00:00:00 +0000/blog/data-science/a-simple-machine-learning-framework/ A simple almost stateless machine learning frameworkBasics of Redis/wiki/computation/basics-of-redis/Fri, 08 Jan 2021 00:00:00 +0000/wiki/computation/basics-of-redis/Basics Redis is:
NoSQL KeyValue In memory Data Structure Server binary safe strings lists, sets, sorted sets, hashes bitmaps, hyperloglogs Open source Redis is:
Fast Low CPU Requirement Scalable Redis can be used as:
Cache Analytics Leaderboard Queues Cookie storage Expiring data Messaging High I/O workloads API throttlings How to persist your data
Snapshot AOF: Append Only File Pros:Audiolization of Covid 19 Data in Europe/blog/ruthless/audiolization-of-covid19-in-eu/Sun, 03 Jan 2021 00:00:00 +0000/blog/ruthless/audiolization-of-covid19-in-eu/Here is an audiolization sound track using a sample of covid19 data in Europe. The audio is the result of the audiorepr Python package I wrote.PREP/cards/communication/prep/Sun, 03 Jan 2021 00:00:00 +0000/cards/communication/prep/PREP PREP is a framework for making your point.
PREP: Point + Reason + Example + Point Point: Make a point; PREP is a good method. Reason: Give the reason; Because it has a clear logic. Example: Show examples; The famous XYZ did ABC and everyone was convinced. Point: State the point for conclusion.SCQ-A/cards/communication/scq-a/Sun, 03 Jan 2021 00:00:00 +0000/cards/communication/scq-a/SCQ-A SCQ-A: Situation + Conflict + Question + Answer SCQ-A is a framework for problem solving.
Situation: background knowledge, set the stage Complications: what is happening Question: propose your hypothesis Answer: accept or reject the hypothesisWWH/cards/communication/wwh/Sun, 03 Jan 2021 00:00:00 +0000/cards/communication/wwh/WWH WWH: What (happened) + Why (this happened) + How (to improve)Graph Creation/reading/grammar-of-graphics/graph-creation/Tue, 29 Dec 2020 00:00:00 +0000/reading/grammar-of-graphics/graph-creation/Stages Three stages of making a graph:
Specification Assembly Display Specification Statistical graphic specifications are expressed in six statements
DATA: a set of data operations that create variables from datasets TRANS: variable transformations (e.g., rank) SCALE: scale transformations (e.g., log) COORD: a coordinate system (e.g., polar) ELEMENT: graphs (e.g., points) and their aesthetic attributes (e.g., color) GUIDE: one or more guides (axes, legends, etc.) Assembly Assembling a scene from a specification requires a variety of structures in order to index and link components with each other.Multiset, mset or bag/cards/math/multiset-mset-bag/Sun, 27 Dec 2020 00:00:00 +0000/cards/math/multiset-mset-bag/A bag is a set in which duplicate elements are allowed.
An ordered bag is a list that we use in programming.Python Class Sequential Inheritance/til/programming/python/python-class-inheritance-sequential/Thu, 03 Dec 2020 00:00:00 +0000/til/programming/python/python-class-inheritance-sequential/# An experiment on python super class Base: def __init__(self): print("Start A") print("End A") class IA(Base): def __init__(self): print("Start IA") super(IA, self).__init__() print("End IA") class IB(IA): def __init__(self): print("Start IB") super(IB, self).__init__() print("End IB") print("Experiment 1:") ib = IB()Three dots in Python/til/programming/python/python-three-dots/Thu, 03 Dec 2020 00:00:00 +0000/til/programming/python/python-three-dots/Using three dots in Python:
from abc import abstractmethod class A: def __init__(self): self.name = "A" print("Init") def three_dots(self): ... @abstractmethod def abs_three_dots(self): ... def raise_it(self): raise Exception("Not yet done") a = A() print("\nthree_dots") print(a.three_dots()) print("\nabs_three_dots") print(a.abs_three_dots()) print("\nraise_it") a.raise_it() Returns
three_dots None abs_three_dots None raise_it Traceback (most recent call last): File "main.py", line 27, in <module> a.raise_it() File "main.py", line 14, in raise_it raise Exception("Not yet done") Exception: Not yet doneOrdered Member Functions of a Class in Python/til/programming/python/python-class-methods-ordered/Wed, 02 Dec 2020 00:00:00 +0000/til/programming/python/python-class-methods-ordered/# References: # 1. https://stackoverflow.com/questions/48145317/can-i-add-attributes-to-class-methods-in-python from functools import wraps # Define a decorator def attributes(**attrs): """ Set attributes of member functions in a class. ``` class AGoodClass: def __init__(self): self.size = 0 @attributes(order=1) def first_good_member(self, new): return "first good member" @attributes(order=2) def second_good_member(self, new): return "second good member" ``` References: 1. https://stackoverflow.com/a/48146924/1477359 """ def decorator(f): @wraps(f) def wrapper(*args, **kwargs): return f(*args, **kwargs) for attr_name, attr_value in attrs.items(): setattr(wrapper, attr_name, attr_value) return wrapper return decorator class AGoodClass: def __init__(self): self.Postgres Optimization in JOIN/til/postgres.join-begin-with-smallest-cardinality/Sat, 28 Nov 2020 11:39:21 +0100/til/postgres.join-begin-with-smallest-cardinality/Join tables together starting with the smallest table (table with less cardinality) speeds things up.Deal with NULL in Postgres/til/data/postgres.deal-with-null/Thu, 26 Nov 2020 00:00:00 +0000/til/data/postgres.deal-with-null/Please deal with null carefully.Experiments in Biology/blog/ruthless/experiments-in-biology/Sun, 01 Nov 2020 00:00:00 +0000/blog/ruthless/experiments-in-biology/ Inspired by @hanlu.ioThe Science Part in Data Science/blog/ruthless/science-part-in-data-science/Sat, 31 Oct 2020 00:00:00 +0000/blog/ruthless/science-part-in-data-science/graph TD; s1(An Idea)--d1{Is this idea in the current literature?}; d1{Is this idea in the current 
literature?}--|Yes|b1(Fail); d1{Is this idea in the current literature?}--|No|b2[Weeks of work]; b2[Weeks of work]--b1(Fail);Conditional Probability Table/cards/statistics/conditional-probability-table/Tue, 27 Oct 2020 00:00:00 +0000/cards/statistics/conditional-probability-table/The conditional probability table, aka CPT, is used to calculate conditional probabilities from a dataset.
Given a dataset with features $\mathbf X$ and their corresponding classes $\mathbf Y$, the conditional probabilities of each class given a certain feature value can be calculated using a CPT which in turn can be calculated using a contingency table Detecting correlations using correlations for numeric data .Pandas Groupby Does Not Guarantee Unique Content in Groupby Columns/til/machine-learning/pandas-groupby-caveats/Mon, 20 Apr 2020 00:00:00 +0000/til/machine-learning/pandas-groupby-caveats/Pandas Groupby Does Not Guarantee Unique Content in Groupby Columns, it also considers the datatypes. Dealing with mixed types requires additional attention.== and is in Python/til/programming/python/python-none/Wed, 01 Apr 2020 00:00:00 +0000/til/programming/python/python-none/== and is are differentArcsine Distribution/cards/statistics/distributions/arcsine/Sat, 14 Mar 2020 00:00:00 +0000/cards/statistics/distributions/arcsine/Arcsine Distribution The PDF is
$$ \frac{1}{\pi\sqrt{x(1-x)}} $$
for $x\in [0,1]$.
It can also be generalized to
$$ \frac{1}{\pi\sqrt{(x-a)(b-x)}} $$
for $x\in [a,b]$.
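A minimal sketch of the arcsine density on a general interval, written with the factor $(x-a)(b-x)$ under the square root so that it reduces to the standard form for $a=0$, $b=1$. The function name `arcsine_pdf` and its default arguments are assumptions for illustration; the density diverges at the endpoints, so only interior points are accepted here.

```python
import math


def arcsine_pdf(x, a=0.0, b=1.0):
    """Arcsine density 1 / (pi * sqrt((x - a) * (b - x))) for a < x < b."""
    if not a < x < b:
        raise ValueError("x must lie strictly between a and b")
    return 1.0 / (math.pi * math.sqrt((x - a) * (b - x)))


# The standard case on [0, 1] and a shifted interval [1, 3].
print(arcsine_pdf(0.5))
print(arcsine_pdf(2.0, a=1.0, b=3.0))
```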
VisualizeBernoulli Distribution/cards/statistics/distributions/bernoulli/Sat, 14 Mar 2020 00:00:00 +0000/cards/statistics/distributions/bernoulli/Two categories with probability $p$ and $1-p$ respectively.
For each experiment, the sample space is $\{A, B\}$. The probability for state $A$ is given by $p$ and the probability for state $B$ is given by $1-p$. The Bernoulli distribution describes the probability of $K$ results with state $s$ being $s=A$ and $N-K$ results with state $s$ being $B$ after $N$ experiments,
$$ P\left(\sum_i^N s_i = K \right) = C _ N^K p^K (1 - p)^{N-K}.Beta Distribution/cards/statistics/distributions/beta/Sat, 14 Mar 2020 00:00:00 +0000/cards/statistics/distributions/beta/Beta Distribution Interact {% include extras/vue.html %}
((makeGraph))Binomial Distribution/cards/statistics/distributions/binomial/Sat, 14 Mar 2020 00:00:00 +0000/cards/statistics/distributions/binomial/The number of successes in $n$ independent events where each trial has a success rate of $p$.
PMF:
$$ C_n^k p^k (1-p)^{n-k} $$Categorical Distribution/cards/statistics/distributions/categorical/Sat, 14 Mar 2020 00:00:00 +0000/cards/statistics/distributions/categorical/By generalizing the Bernoulli distribution to $k$ states, we get a categorical distribution. The sample space is $\{s_1, s_2, \cdots, s_k\}$. The corresponding probabilities for each state are $\{p_1, p_2, \cdots, p_k\}$ with the constraint $\sum_{i=1}^k p_i = 1$.Cauchy-Lorentz Distribution/cards/statistics/distributions/cauchy/Sat, 14 Mar 2020 00:00:00 +0000/cards/statistics/distributions/cauchy/Cauchy-Lorentz Distribution .. ratio of two independent normally distributed random variables with mean zero.
Source: https://en.wikipedia.org/wiki/Cauchy_distribution
Lorentz distribution is frequently used in physics.
PDF:
$$ \frac{1}{\pi\gamma} \left( \frac{\gamma^2}{ (x-x_0)^2 + \gamma^2} \right) $$
The median and mode of the Cauchy-Lorentz distribution are always $x_0$. $\gamma$ is the half width at half maximum (HWHM), so the FWHM is $2\gamma$.
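The role of $\gamma$ can be checked numerically from the PDF: the density at $x_0 \pm \gamma$ is half of the peak value at $x_0$, so the full width at half maximum is $2\gamma$. This is a minimal sketch; the function name `cauchy_pdf` and the parameter values are assumptions for illustration.

```python
import math


def cauchy_pdf(x, x0=0.0, gamma=1.0):
    """Cauchy-Lorentz density with location x0 and scale gamma."""
    return (1.0 / (math.pi * gamma)) * gamma**2 / ((x - x0) ** 2 + gamma**2)


# Hypothetical parameters used only to exercise the formula.
x0, gamma = 1.5, 0.7
peak = cauchy_pdf(x0, x0, gamma)
print(peak, cauchy_pdf(x0 + gamma, x0, gamma), cauchy_pdf(x0 - gamma, x0, gamma))
```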
VisualizeGamma Distribution/cards/statistics/distributions/gamma/Sat, 14 Mar 2020 00:00:00 +0000/cards/statistics/distributions/gamma/Gamma Distribution PDF:
$$ \frac{\beta^\alpha x^{\alpha-1} e^{-\beta x}}{\Gamma(\alpha)} $$
VisualizeDiagonalize Matrices/cards/math/diagonalize-matrix/Wed, 11 Mar 2020 00:00:00 +0000/cards/math/diagonalize-matrix/Given a matrix $\mathbf A$, it is diagonalized using its eigenvectors.
Why are the eigenvectors needed?
Eigenvectors of a matrix $\mathbf A$ are the preferred directions. From the definition of eigenvectors,
$$ \mathbf A \mathbf x = \lambda \mathbf x, $$
we know that the matrix $\mathbf A$ only scales its eigenvectors and does not rotate them. These directions are special to the matrix $\mathbf A$.
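The scaling property can be seen on a small example. The sketch below uses a hypothetical symmetric $2\times 2$ matrix whose eigenvectors are $(1, 1)$ and $(1, -1)$ with eigenvalues 3 and 1; the matrix and the helper `matvec` are assumptions chosen for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vector)) for row in matrix]


# Hypothetical matrix with known eigenvectors.
A = [[2, 1],
     [1, 2]]

print(matvec(A, [1, 1]))   # parallel to (1, 1): scaled by the eigenvalue 3
print(matvec(A, [1, -1]))  # parallel to (1, -1): scaled by the eigenvalue 1
print(matvec(A, [1, 0]))   # not an eigenvector: the direction changes
```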
Find the eigenvectors $\mathbf x_i$ of the matrix $\mathbf A$; If we find degeneracies, the matrix may not be diagonalizable.Mahalanobis Distance/cards/math/mahalanobis-distance/Wed, 11 Mar 2020 00:00:00 +0000/cards/math/mahalanobis-distance/Mahalanobis distance is a distance calculated using the inverse of the covariance matrix as the metric. For two vectors $\mathbf x$ and $\mathbf y$, the Mahalanobis distance is
$$ d^2 = (x_i - y_i) g_{ij} (x_j - y_j), $$
where $g_{ij} = (S^{-1})_{ij}$ and $\mathbf S$ is the covariance matrix.
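A minimal two-dimensional sketch of this distance, using the standard form $d^2 = (\mathbf x - \mathbf y)^T \mathbf S^{-1} (\mathbf x - \mathbf y)$ with the inverse covariance matrix as the metric $g_{ij}$. The helper names and the covariance matrix are assumptions for illustration, and the explicit $2\times 2$ inverse only works for this small case.

```python
def inverse_2x2(matrix):
    """Invert a 2x2 matrix given as a list of rows."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def mahalanobis_squared(x, y, covariance):
    """Squared Mahalanobis distance between two 2D points x and y."""
    g = inverse_2x2(covariance)  # metric g_ij = (S^-1)_ij
    diff = [x_i - y_i for x_i, y_i in zip(x, y)]
    return sum(diff[i] * g[i][j] * diff[j] for i in range(2) for j in range(2))


# Hypothetical covariance: variance 4 along the first axis, 1 along the second.
S = [[4.0, 0.0], [0.0, 1.0]]
print(mahalanobis_squared([2.0, 0.0], [0.0, 0.0], S))
print(mahalanobis_squared([0.0, 2.0], [0.0, 0.0], S))
```

The two points are equally far apart in Euclidean terms, but the separation along the high-variance axis counts for less.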
The inverse covariance matrix acts as a normalization that mitigates the effect of the covariances.Covariance Matrix/cards/statistics/covariance-matrix/Tue, 10 Mar 2020 00:00:00 +0000/cards/statistics/covariance-matrix/We use Einstein’s summation convention. Covariance of two discrete series $A$ and $B$ is defined as
$$ \text{Cov} ({A,B}) = \sigma_{A,B}^2 = \frac{ (a_i - \bar A) (b_i - \bar B) }{ n- 1 }, $$
where $n$ is the length of the series. The normalization factor is set to $1/(n-1)$ to mitigate the bias for small $n$.
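The definition translates directly into a few lines of code. This is a minimal sketch with the $1/(n-1)$ normalization; the function name `covariance` and the two series are assumptions for illustration.

```python
def covariance(series_a, series_b):
    """Sample covariance: sum((a_i - mean_a) * (b_i - mean_b)) / (n - 1)."""
    if len(series_a) != len(series_b):
        raise ValueError("series must have the same length")
    n = len(series_a)
    mean_a = sum(series_a) / n
    mean_b = sum(series_b) / n
    return sum((a - mean_a) * (b - mean_b)
               for a, b in zip(series_a, series_b)) / (n - 1)


# Hypothetical series: B is exactly twice A, so they vary together.
A = [1, 2, 3]
B = [2, 4, 6]
print(covariance(A, B))
```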
One could show that
$$ \mathrm{Cov}({A,B}) = E( A,B ) - \bar A \bar B.Jackknife Resampling/cards/statistics/jacknife-resampling/Sun, 26 Jan 2020 00:00:00 +0000/cards/statistics/jacknife-resampling/Jackknife resampling is a method for estimation of the mean and higher order moments.
Given a sample $\{x_i\}$ of size $n$ for the distribution $X$, the jackknife resampling estimates the mean by leaving out each data point systematically. $n$ estimations of the mean will be obtained, with each of the estimations $\bar x_i$ given by
$$ \bar x_i = \frac{1}{n-1} \sum_{j\neq i} x_j. $$
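The leave-one-out estimates can be computed with a short loop. This is a minimal sketch; the function name `leave_one_out_means` and the sample are assumptions for illustration.

```python
def leave_one_out_means(sample):
    """Return the n jackknife estimates, each the mean of the sample without x_i."""
    n = len(sample)
    total = sum(sample)
    # Removing x_i from the total avoids re-summing the sample n times.
    return [(total - x_i) / (n - 1) for x_i in sample]


# Hypothetical sample used only to exercise the formula.
sample = [1, 2, 3]
print(leave_one_out_means(sample))
```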
The mean of the sample is
$$ \bar x = \frac{1}{n}\sum_i \bar x_i = \frac{1}{n} \sum_i \left(\frac{1}{n-1} \sum_{j\neq i} x_j\right) = \frac{1}{n}\sum_i x_i.CBOW: Continuous Bag of Words/cards/machine-learning/embedding/continuous-bag-of-words/Thu, 16 Jan 2020 00:00:00 +0000/cards/machine-learning/embedding/continuous-bag-of-words/Here we encode all words presented in the corpus to demonstrate the idea of CBOW. In the real world, we might want to remove certain words such as the. We use the following quote by Ford in Westworld as an example.
I read a theory once that the human intellect is like peacock feathers. Just an extravagant display intended to attract a mate, just an elaborate mating ritual.Data Types/cards/machine-learning/datatypes/data-types/Thu, 16 Jan 2020 00:00:00 +0000/cards/machine-learning/datatypes/data-types/Gini Impurity/cards/machine-learning/measurement/gini-impurity/Thu, 16 Jan 2020 00:00:00 +0000/cards/machine-learning/measurement/gini-impurity/The code used in this article can be found in this repo. Suppose we have a dataset $\{0,1\}^{10}$, which has 10 records and 2 possible classes of objects $\{0,1\}$ in each record.
The first example we investigate is a pure 0 dataset.
object 0 0 0 0 0 0 0 0 0 0 0 0 For such an all-0 dataset, we would like to define its impurity as 0.Information Gain/cards/machine-learning/measurement/information-gain/Thu, 16 Jan 2020 00:00:00 +0000/cards/machine-learning/measurement/information-gain/Information gain is a frequently used metric in calculating the gain during a split in tree-based methods.
First of all, the entropy of a dataset is defined as
$$ S = - \sum_i p_i \log p_i - \sum_i (1-p_i)\log p_i, $$
where $p_i$ is the probability of a class.
The information gain is the difference between the entropy before and after a split.
For example, in a decision tree algorithm, we would split a node. Before splitting, we assign a label $m$ to the node,Negative Sampling/cards/machine-learning/embedding/negative-sampling/Thu, 16 Jan 2020 00:00:00 +0000/cards/machine-learning/embedding/negative-sampling/Knowledge of CBOW or skipgram is required.
A naive model to train a model of words is to
encode input words and output words using vectors, use the input word vector to predict the output word vector, calculate the errors between predicted output word vector and real output word vector, minimize the errors. However, it is very expensive to project out the output words and calculate the error every time.PAC: Probably Approximately Correct/cards/machine-learning/learning-theories/pac/Thu, 16 Jan 2020 00:00:00 +0000/cards/machine-learning/learning-theories/pac/skipgram: Continuous skip-gram/cards/machine-learning/embedding/continuous-skip-gram/Thu, 16 Jan 2020 00:00:00 +0000/cards/machine-learning/embedding/continuous-skip-gram/We use the following quote by Ford in Westworld as an example.
I read a theory once that the human intellect is like peacock feathers. Just an extravagant display intended to attract a mate, just an elaborate mating ritual. But, of course, the peacock can barely fly. It lives in the dirt, pecking insects out of the muck, consoling itself with its great beauty.
The word intended is surrounded by extravagant display in the front and to attract after it.Improving Document Ranking with Dual Word Embeddings/reading/word2vec-in-out-embedding/Sat, 05 Oct 2019 00:00:00 +0000/reading/word2vec-in-out-embedding/Word2vec produces two embedding spaces, the in-embedding and out-embedding.Switch statement in Python/til/programming/python/python-switch-statement/Tue, 20 Aug 2019 00:00:00 +0000/til/programming/python/python-switch-statement/Love switch statement? We can design a switch statement in Python.Python Tilde Operator/til/programming/python/python-tilde-operator/Thu, 15 Aug 2019 00:00:00 +0000/til/programming/python/python-tilde-operator/tilde operator may not work as you expectArrays and Dicts in MongoDB/til/programming/database/mongodb-array-and-dict/Wed, 14 Aug 2019 00:00:00 +0000/til/programming/database/mongodb-array-and-dict/Array of dictionaries becomes hard to update in MongoDB.eval in Python is Dangerous/til/programming/python/python-eval/Tue, 13 Aug 2019 00:00:00 +0000/til/programming/python/python-eval/eval is powerful but really dangerousKendall Tau Correlation/cards/statistics/kendall-correlation-coefficient/Sat, 20 Jul 2019 00:00:00 +0000/cards/statistics/kendall-correlation-coefficient/Definition two series of data: $X$ and $Y$ co-occurrence of them: $(x_i, x_j)$, and we assume that $i<j$ concordant: $x_i < x_j$ and $y_i < y_j$; $x_i > x_j$ and $y_i > y_j$; denoted as $C$ discordant: $x_i < x_j$ and $y_i > y_j$; $x_i > x_j$ and $y_i < y_j$; denoted as $D$ neither concordant nor discordant: whenever equal sign happens Kendall’s tau is defined as
$$ \begin{equation} \tau = \frac{C- D}{\text{all possible pairs of comparison}} = \frac{C- D}{n^2/2 - n/2} \end{equation} $$Bayes' Theorem/cards/statistics/bayes-theorem/Tue, 18 Jun 2019 00:00:00 +0000/cards/statistics/bayes-theorem/Bayes’ Theorem is stated as
$$ P(A\mid B) = \frac{P(B \mid A) P(A)}{P(B)} $$
$P(A\mid B)$: conditional (posterior) probability of A given B; $P(A)$: marginal probability of A. There is a nice tree diagram for Bayes’ theorem on Wikipedia.
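As a minimal sketch of the formula above, the snippet below computes $P(A\mid B)$ from $P(A)$, $P(B\mid A)$ and $P(B\mid \neg A)$. The function name `posterior` and the numbers are illustrative assumptions, not part of the original card; $P(B)$ is expanded with the law of total probability.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) using Bayes' theorem.

    prior_a          -- P(A), the marginal probability of A
    p_b_given_a      -- P(B|A)
    p_b_given_not_a  -- P(B|not A)
    """
    # P(B) = P(B|A) P(A) + P(B|not A) P(not A)  (law of total probability)
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / p_b


# Hypothetical numbers: a rare event A (1%) and a noisy indicator B.
p = posterior(prior_a=0.01, p_b_given_a=0.9, p_b_given_not_a=0.05)
print(round(p, 4))
```

Even with a fairly reliable indicator, the small prior keeps the posterior low, which is the usual intuition the tree diagram conveys.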
Tree diagram of Bayes’ theoremCanonical Decomposition/cards/math/canonical-decomposition/Tue, 18 Jun 2019 00:00:00 +0000/cards/math/canonical-decomposition/I find this slide from Christoph Freudenthaler very useful.
Canonical decomposition visualized by Christoph FreudenthalerKhatri-Rao Product/cards/math/khatri-rao/Tue, 18 Jun 2019 00:00:00 +0000/cards/math/khatri-rao/Choose X from N is
$$ C_N^X = \frac{N!}{ X! (N-X)! } $$Modes and Slices of Tensors/cards/math/modes-and-slices-of-tensor/Tue, 18 Jun 2019 00:00:00 +0000/cards/math/modes-and-slices-of-tensor/ Modes of a tensor Slices of a tensorPoisson Process/cards/statistics/poisson-process/Tue, 18 Jun 2019 00:00:00 +0000/cards/statistics/poisson-process/Poisson Process Statistics // define getUnixTime Date.prototype.getUnixTime = function () { return this.getTime() / 1000 | 0 }; if (!Date.now) Date.now = function () { return new Date(); } Date.time = function () { return Date.now().getUnixTime(); } POISSON_EVENT_RATE = 1 function get_event_time() { var time = new Date(); return time } all_event = [] all_event_diff = [] var data = [{ x: [get_event_time], y: [1], mode: 'markers', line: { color: '#80CAF6' } }] var layout = { title: { text: 'Poisson Process' }, xaxis: { title: { text: 'Event Time' }, } }; var layout_rate = { title: { text: 'Average Rate of the Poisson Process' }, xaxis: { title: { text: 'Event Time' }, }, yaxis: { title: { text: 'Average Event Rate per Second' }, rangemode: 'tozero' } }; var data_rate = [{ x: [get_event_time], y: [POISSON_EVENT_RATE], mode: 'lines+markers', line: { color: '#80CAF6' } }] Plotly.SVD: Singular Value Decomposition/cards/math/svd/Tue, 18 Jun 2019 00:00:00 +0000/cards/math/svd/Given a matrix $\mathbf X \to X_{m}^{\phantom{m}n}$, we can decompose it into three matrices
$$ X_{m}^{\phantom{m}n} = U_{m}^{\phantom{m}k} D_{k}^{\phantom{k}l} (V_{n}^{\phantom{n}l} )^{\mathrm T}, $$
where $D_{k}^{\phantom{k}l}$ is diagonal.
Here $\mathbf U$ is constructed from the eigenvectors of $\mathbf X \mathbf X^{\mathrm T}$, while $\mathbf V$ is constructed from the eigenvectors of $\mathbf X^{\mathrm T} \mathbf X$ (which is also the reason we keep the transpose).
I find this slide from Christoph Freudenthaler very useful.Tucker Decomposition/cards/math/tucker-decomposition/Tue, 18 Jun 2019 00:00:00 +0000/cards/math/tucker-decomposition/I find this slide from Christoph Freudenthaler very useful. For the definition of mode 1/2/3 unfold, please refer to Modes and Slices of Tensors.
Tucker decomposition visualized by Christoph FreudenthalerLevenshtein Distance/cards/math/levenshtein-distance/Sun, 19 May 2019 00:00:00 +0000/cards/math/levenshtein-distance/Levenshtein distance calculates the number of operations needed to change one word to another by applying single-character edits (insertions, deletions or substitutions).
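The definition above can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration, not the code from the referenced article; the function name `levenshtein` is an assumption.

```python
def levenshtein(source: str, target: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn `source` into `target`."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into an empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution_cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

Only two rows of the matrix are kept at a time, which is enough when the distance itself is needed rather than the full matrix.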
The reference explains this concept very well. For consistency, I extracted a paragraph from it which explains the operations in Levenshtein algorithm. The source of the following paragraph is the first reference of this article.
Levenshtein Matrix
Cell (0:1) contains red number 1. It means that we need 1 operation to transform M to an empty string.n-gram/cards/math/n-gram/Sun, 19 May 2019 00:00:00 +0000/cards/math/n-gram/n-gram is a method to split words into set of substring elements so that those can be used to match words.
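A minimal sketch of character n-grams follows, assuming no padding and lowercasing as the only cleaning step; the function name `ngrams` is hypothetical and is not the function used on the original page.

```python
def ngrams(word: str, n: int = 2) -> list:
    """Split `word` into overlapping substrings of length `n`."""
    cleaned = word.lower()
    # A word shorter than n yields an empty list.
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


print(ngrams("data", 2))
print(ngrams("datum", 3))
```

Two words can then be matched by comparing their sets of n-grams, for example with a set-overlap measure.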
Examples Use the following examples to get your first idea about it. I created two columns so that we could compare the n-grams of two different words side-by-side.
n in n-gram is Word One Clean Word: (( sentenceOneWords )) n-grams: (( sentenceOneWordsnGram )) Word Two Clean Word: (( sentenceTwoWords )) n-grams: (( sentenceTwoWordsnGram )) /*************************/ /** The function nGram is a copy of https://github.Add New Kernels to Jupyter Notebook in Conda Environment/til/programming/jupyter-notebook-add-new-kernels-in-conda-env/Sun, 12 May 2019 00:00:00 +0000/til/programming/jupyter-notebook-add-new-kernels-in-conda-env/Python package or python module autoreloading in jupyter notebookAuto-reload Python Packages or Python Modules in Jupyter Notebook/til/programming/jupyter-notebook-autoreload-python-modules-or-packages/Sun, 12 May 2019 00:00:00 +0000/til/programming/jupyter-notebook-autoreload-python-modules-or-packages/Python package or python module autoreloading in jupyter notebookBigQuery Meta Tables/til/data/bigquery-meta-tables/Sun, 12 May 2019 00:00:00 +0000/til/data/bigquery-meta-tables/Meta tables are very useful when it comes to get bigquery table information programmatically.Calculate Moving Average Using SQL/BigQquery/til/data/bigquery-moving-average/Sun, 12 May 2019 00:00:00 +0000/til/data/bigquery-moving-average/Snippet for calculating moving avg using sql/biguqeryGenerate a Column of Continuous Dates in BigQuery/til/data/bigquery-generate-continuous-dates-as-a-column/Sun, 12 May 2019 00:00:00 +0000/til/data/bigquery-generate-continuous-dates-as-a-column/Generate a table with a column of continuous datesGet Current User in BigQuery/til/data/bigquery-get-current-user/Sun, 12 May 2019 00:00:00 +0000/til/data/bigquery-get-current-user/BigQuery Current UserMaterialize the Query Result for Performance/til/data/bigquery-materialize-query-results-for-performance/Sun, 12 May 2019 00:00:00 +0000/til/data/bigquery-materialize-query-results-for-performance/Materialize the query result for multistage queries to make your query faster and lower the costs.Cosine Similarity/cards/math/cosine-similarity/Mon, 06 May 2019 
00:00:00 +0000/cards/math/cosine-similarity/As simple as the inner product of two vectors
$$ d_{cos} = \frac{\vec A}{\vert \vec A \vert} \cdot \frac{\vec B }{ \vert \vec B \vert} $$
Examples To use cosine similarity, we have to vectorize the words first. There are many different methods to achieve this. For the purpose of illustrating cosine similarity, we use term frequency.
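A minimal sketch of this idea, assuming whitespace tokenization and raw word counts as the term-frequency vectors; the function name `cosine_similarity` and the sample sentences are illustrative assumptions.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of two texts using term-frequency vectors."""
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    # Only words that appear in both texts contribute to the inner product.
    dot = sum(tf_a[w] * tf_b[w] for w in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for empty input
    return dot / (norm_a * norm_b)


print(cosine_similarity("the cat sat", "the cat ran"))
```

Dividing by both norms is what turns the plain inner product into $d_{cos}$ from the formula above.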
Term frequency is the number of occurrences of each word. We do not remove duplicates, so repeated words affect the similarity.Eigenvalues and Eigenvectors/cards/math/eigendecomposition/Mon, 06 May 2019 00:00:00 +0000/cards/math/eigendecomposition/To find the eigenvectors $\mathbf x$ of a matrix $\mathbf A$, we construct the eigen equation
$$ \mathbf A \mathbf x = \lambda \mathbf x, $$
where $\lambda$ is the eigenvalue.
We rewrite it in the components form,
$$ \begin{equation} A_{ij} x_j = \lambda x_i. \label{eqn-eigen-decomp-def} \end{equation} $$
Mathematically speaking, it is straightforward to find the eigenvectors and eigenvalues.
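As a small worked sketch, restricted (by assumption) to a real symmetric $2\times 2$ matrix, the eigenvalues follow from the characteristic polynomial and each eigenvector from the first row of $(\mathbf A - \lambda \mathbf I)\mathbf x = 0$. The helper names are hypothetical.

```python
import math


def eigen_2x2_symmetric(a: float, b: float, d: float):
    """Eigenpairs (lambda, x) of the symmetric matrix [[a, b], [b, d]]."""
    if b == 0:
        # Already diagonal: the coordinate axes are the eigenvectors.
        return [(a, (1.0, 0.0)), (d, (0.0, 1.0))]
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    # First row of (A - lambda I) x = 0 gives x = (b, lambda - a).
    return [(lam, (b, lam - a)) for lam in (mean + radius, mean - radius)]


def matvec(matrix, x):
    """Component form: (A x)_i = sum_j A_ij x_j."""
    return tuple(sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in matrix)


A = [[2.0, 1.0], [1.0, 2.0]]
for lam, x in eigen_2x2_symmetric(2.0, 1.0, 2.0):
    print(lam, x, matvec(A, x))
```

For each pair, the printed $\mathbf A \mathbf x$ is a multiple of $\mathbf x$, which is the statement of the eigen equation.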
Eigenvectors are Special Directions Judging from the definition in Eq.($\ref{eqn-eigen-decomp-def}$), the eigenvectors do not change direction under the operation of the matrix $\mathbf A$.Jaccard Similarity/cards/math/jaccard-similarity/Mon, 06 May 2019 00:00:00 +0000/cards/math/jaccard-similarity/Jaccard index is the ratio of the size of the intersection of the two sets to the size of their union.
$$ J(A, B) = \frac{ \vert A \cap B \vert }{ \vert A \cup B \vert } $$
Jaccard distance $d_J(A,B)$ is defined as
$$ d_J(A,B) = 1 - J(A,B). $$
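The two definitions above translate directly into set operations. This is a minimal sketch; the function names and the convention for two empty sets are assumptions, not part of the original card.

```python
def jaccard_index(a, b) -> float:
    """J(A, B) = |A intersect B| / |A union B|."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def jaccard_distance(a, b) -> float:
    """d_J(A, B) = 1 - J(A, B)."""
    return 1.0 - jaccard_index(a, b)


print(jaccard_index({"a", "b", "c"}, {"b", "c", "d"}))
print(jaccard_distance({"a", "b", "c"}, {"b", "c", "d"}))
```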
Properties If the two sets are the same, $A=B$, we have $J(A,B)=1$ or $d_J(A,B)=0$. We have maximum similarity.Term Frequency - Inverse Document Frequency/cards/math/tf-idf/Mon, 06 May 2019 00:00:00 +0000/cards/math/tf-idf/The Art of Data Science/reading/art-of-data-science/Fri, 19 Apr 2019 00:00:00 +0000/reading/art-of-data-science/A nice and elegant book on data scienceBlog Posts/projects/blog/Sun, 07 Apr 2019 00:00:00 +0000/projects/blog/My blog postsCombinations/cards/math/combinations/Sun, 07 Apr 2019 00:00:00 +0000/cards/math/combinations/Choose X from N is
$$ C_N^X = \frac{N!}{ X! (N-X)! } $$My Data Wiki/projects/wiki/Sun, 07 Apr 2019 00:00:00 +0000/projects/wiki/A collection of my wiki articles related to data.My Knowledge Cards/projects/cards/Sun, 07 Apr 2019 00:00:00 +0000/projects/cards/A collection of my snippets of knowledgeMy Reading Notes/projects/reading/Sun, 07 Apr 2019 00:00:00 +0000/projects/reading/A collection of my reading notesTIL/projects/til/Sun, 07 Apr 2019 00:00:00 +0000/projects/til/Today I LearnedHuman Graphical Perception of Quantitative Information in Data Visualization/reading/graphical-perception/Sun, 17 Mar 2019 00:00:00 +0000/reading/graphical-perception/Data visualization caveatsAdd Data Files to Python Package/til/programming/python/python-package-including-data-file/Wed, 13 Mar 2019 00:00:00 +0000/til/programming/python/python-package-including-data-file/Add Data Files to Python Package using manifest.in and setup.pyInstalling requirements.txt in Conda Environments/til/programming/python/python-anaconda-install-requirements/Wed, 13 Mar 2019 00:00:00 +0000/til/programming/python/python-anaconda-install-requirements/Why is pip install -r requirements.txt not working?Information Theory and Statistical Mechanics/reading/statistical-physics-and-information-theory/Fri, 01 Mar 2019 00:00:00 +0000/reading/statistical-physics-and-information-theory/Max entropy principle as a method to infer distributions of statistical systemsFlatten 2D List in Python/til/programming/python/python-flatten-2d-list/Wed, 23 Jan 2019 00:00:00 +0000/til/programming/python/python-flatten-2d-list/Flatten 2D list using sumPython Datetime on Different OS/til/programming/python/python-datetime-on-different-os/Mon, 31 Dec 2018 00:00:00 +0000/til/programming/python/python-datetime-on-different-os/Python datetime on different os behaves inconsistentlyPython If on Numbers/til/programming/python/python-if-condition-on-numbers/Mon, 31 Dec 2018 00:00:00 +0000/til/programming/python/python-if-condition-on-numbers/If on int is 
dangerousPython Long String/til/programming/python/python-long-string/Mon, 31 Dec 2018 00:00:00 +0000/til/programming/python/python-long-string/Python long string formattingPython Reliable Path to File/til/programming/python/python-reliable-path/Mon, 31 Dec 2018 00:00:00 +0000/til/programming/python/python-reliable-path/Find the actual path to fileVSCode on Mac Do Not Repeat/til/misc/vscode-on-mac-do-not-repeat/Mon, 31 Dec 2018 00:00:00 +0000/til/misc/vscode-on-mac-do-not-repeat/Enable your key repeat in vscode on macControlled Experiments/til/statistics/controlled-experiments/Tue, 04 Dec 2018 00:00:00 +0000/til/statistics/controlled-experiments/The three levels of controlled experimentsSchaum's Outline of Theories and Problems of Elements of Statistics I and II/reading/elements-of-statistics/Thu, 01 Nov 2018 00:00:00 +0000/reading/elements-of-statistics/The basics and all of modern statisticsPandas with MultiProcessing/til/programming/pandas/pandas-parallel-multiprocessing/Sun, 09 Sep 2018 00:00:00 +0000/til/programming/pandas/pandas-parallel-multiprocessing/Define number of processes, prs; Split dataframe into prs dataframes; Process each dataframe with one process; Merge processed dataframes into one. A piece of demo code is shown below.
from multiprocessing import Pool from multiprocessing.dummy import Pool as ThreadPool import pandas as pd # Create a dataframe to be processed df = pd.read_csv('somedata.csv').reset_index(drop=True) # Define a function to be applied to the dataframe def nice_func(name, age): return (name,age) # Apply to dataframe def apply_to_df(df_chunks): df_chunks['tupled'] = df_chunks.Beer and Life Expectancy/blog/ruthless/beer-and-life-expectancy/Wed, 08 Aug 2018 00:00:00 +0000/blog/ruthless/beer-and-life-expectancy/This is a post of no analysis at all. Everything in this post is meant for fun.
I moved to Germany a few weeks ago and one of the most astonishing things I noticed is that everyone is drinking so much. Yet the life expectancy in Germany is pretty high. So I performed this “analysis” for fun.
Life expectancy vs beer consumption (L) per capita per year. Data obtained from wikipediaList of countries by life expectancy and List of countries by beer consumption per capita.Data Mining: Concepts and Techniques/reading/data-mining/Wed, 01 Aug 2018 00:00:00 +0000/reading/data-mining/How data mining was done in the pastFitt's Law/til/misc/fitts-law/Sun, 22 Jul 2018 00:00:00 +0000/til/misc/fitts-law/How fast can you move your mouse to targetCopy Scalars and Lists in Python/til/programming/python/python-copy-value-or-address/Tue, 03 Jul 2018 00:00:00 +0000/til/programming/python/python-copy-value-or-address/Python copy values of scalars but addresses of listsCertificate Errors in urllib/til/data/python-urllib-ssl/Mon, 25 Jun 2018 00:00:00 +0000/til/data/python-urllib-ssl/Dealing with errors when scraping dataCalculated Columns in Pandas/til/programming/pandas/pandas-new-column-from-other/Sun, 20 May 2018 00:00:00 +0000/til/programming/pandas/pandas-new-column-from-other/Create new columns in pandastree in Linux/til/programming/trees/Tue, 20 Mar 2018 00:00:00 +0000/til/programming/trees/Trees in computer scienceHeap on Mac and Linux/til/programming/cpp/cpp-heap-mac-linux-diff/Tue, 26 Sep 2017 00:00:00 +0000/til/programming/cpp/cpp-heap-mac-linux-diff/Some caveats about heap on mac and linuxC++ int Multiplication/til/programming/cpp/cpp-int-multiply/Thu, 21 Sep 2017 00:00:00 +0000/til/programming/cpp/cpp-int-multiply/int multiplication in C++ should be processed with caution.CMake Usage/til/programming/cmake-usage/Thu, 21 Sep 2017 00:00:00 +0000/til/programming/cmake-usage/How to use CMake to generate makefilesAllocating Memory for Multidimensional Array in C++/til/programming/cpp/cpp-allocating-memory-multidimensional-array/Thu, 14 Sep 2017 00:00:00 +0000/til/programming/cpp/cpp-allocating-memory-multidimensional-array/Some caveatsC++ range-for-statement/til/programming/cpp/cpp-range-for-statement/Tue, 12 Sep 2017 00:00:00 
+0000/til/programming/cpp/cpp-range-for-statement/In C++ we can use range-for-statementList All Folders in Linux or Mac/til/programming/linux-mac-list-all-folders/Tue, 01 Aug 2017 00:00:00 +0000/til/programming/linux-mac-list-all-folders/Using ls and tree commands to list folders onlyPython Default Parameters Tripped Me Up/til/programming/python/python-default-parameters-mutable/Sat, 03 Jun 2017 00:00:00 +0000/til/programming/python/python-default-parameters-mutable/Python default parameters might be changed with each runSome Tests on Matplotlib Backends/til/programming/matplotlib-backend/Tue, 23 May 2017 00:00:00 +0000/til/programming/matplotlib-backend/Matplotlib provides many different backendsMathematica Provides Great PlotTheme Options/til/programming/mathematica/mathematica-plottheme/Fri, 19 May 2017 00:00:00 +0000/til/programming/mathematica/mathematica-plottheme/Amazingly, Mathematica provides an option for plot that automatically generates beautiful plots.Turn a Series Expansion into Function in Mathematica/til/programming/mathematica/mathematica-turn-series-into-function/Mon, 15 May 2017 00:00:00 +0000/til/programming/mathematica/mathematica-turn-series-into-function/Turn a series expansion in Mathematica into a functionGit Asks for Password Whenever I Pull or Push/til/programming/git/git-ssh-asking-pwd-everytime/Thu, 11 May 2017 00:00:00 +0000/til/programming/git/git-ssh-asking-pwd-everytime/My git asks for password every time I pull or push even with ssh configured.Command Line Russian Roulette/til/programming/command-line-russian-roulette/Tue, 09 May 2017 00:00:00 +0000/til/programming/command-line-russian-roulette/Play russian roulette in your command lineGNU Screen Key Conflict with Bash/til/programming/gnu-screen-key-conflict-with-bash/Mon, 08 May 2017 00:00:00 +0000/til/programming/gnu-screen-key-conflict-with-bash/GNU screen key conflict with bash can be solvedHow to Run Mathematica Script in 
Terminal/til/programming/run-mathematica-script-in-terminal/Mon, 08 May 2017 00:00:00 +0000/til/programming/run-mathematica-script-in-terminal/Using math -run or wolfram -run we could execute a Mathematica script through ssh in terminal.GNUPLOT Inline Output in iterm2/til/programming/gnuplot-iterm2-imgcat/Fri, 07 Apr 2017 00:00:00 +0000/til/programming/gnuplot-iterm2-imgcat/Using gnuplot in iterm2 we can output result inside terminal combined with imgcatMathematica Exclude Singularities in Plot/til/programming/mathematica/mathematica-plot-exclude-singularities/Wed, 22 Mar 2017 00:00:00 +0000/til/programming/mathematica/mathematica-plot-exclude-singularities/Mathematica Plot might include some non-existant lines sometimes, Exclusions is the potion for it.Passing Function Arguments Through Lists in Mathematica/til/programming/mathematica/mathematica-passing-arguments-through-lists/Mon, 20 Feb 2017 00:00:00 +0000/til/programming/mathematica/mathematica-passing-arguments-through-lists/We can pass a list of arguments using SequenceGit Pull with Submodule/til/programming/git/git-pull-with-submodule/Fri, 03 Feb 2017 00:00:00 +0000/til/programming/git/git-pull-with-submodule/Pull git repo with submodulePositioning textblock in LaTeX Beamer/til/programming/latex-beamer-textblock-position/Tue, 17 Jan 2017 00:00:00 +0000/til/programming/latex-beamer-textblock-position/Positioning textblock in LaTeX Beamer using textpos package and eso pic packageMathematica Different Output Forms/til/programming/mathematica/mathematica-different-output-forms/Mon, 28 Nov 2016 00:00:00 +0000/til/programming/mathematica/mathematica-different-output-forms/Mathematica has many different output forms. 
Understanding them is extremely helpful when making plots.Git Branch Options/til/programming/git/git-branch-details/Sun, 27 Nov 2016 00:00:00 +0000/til/programming/git/git-branch-details/Some useful options about git branchgit pull multi remote/til/programming/git/git-pull-multi-remote/Tue, 22 Nov 2016 00:00:00 +0000/til/programming/git/git-pull-multi-remote/working with multi remotePopularity versus similarity in growing networks/reading/popularity-vs-similarity/Sun, 06 Nov 2016 00:00:00 +0000/reading/popularity-vs-similarity/Introduce geometry into the manifold of complex networksFormatting Numbers in Python/til/programming/formating-numbers-python/Tue, 11 Oct 2016 00:00:00 +0000/til/programming/formating-numbers-python/Formatting numbers in python using formatSolving Equations Using Differential Transformation Method/til/math/differential-transformation-method-solving-equations/Tue, 11 Oct 2016 00:00:00 +0000/til/math/differential-transformation-method-solving-equations/Differential transformation method can be used to solve differential equation even integro-differential equations.The Great Chrome Dev Tool/til/programming/chrome-dev-tool-usage/Wed, 28 Sep 2016 00:00:00 +0000/til/programming/chrome-dev-tool-usage/How to use the chrome dev tool wiselyStart a Simple Server/til/programming/start-simple-server/Sat, 17 Sep 2016 00:00:00 +0000/til/programming/start-simple-server/With one line of python commandmatplotlib x y limit and aspect ratio/til/programming/matplotlib-x-y-limit-and-aspect-ratio/Thu, 21 Jul 2016 00:00:00 +0000/til/programming/matplotlib-x-y-limit-and-aspect-ratio/matplotlib x y limit and aspect ratioTOP Command/til/programming/top/Thu, 21 Jul 2016 00:00:00 +0000/til/programming/top/Some tips about top commandAssigning Values to Multiple Variables/til/programming/python/python-assigning-values-to-multiple-variables/Fri, 04 Dec 2015 00:00:00 +0000/til/programming/python/python-assigning-values-to-multiple-variables/Assigning Values to Multiple 
Variablesgitignore by file size/til/programming/git/gitignore-by-file-size/Fri, 04 Dec 2015 00:00:00 +0000/til/programming/git/gitignore-by-file-size/gitignore by file sizeHTML Animations Using CSS: AnimateCSS/til/programming/html-animate-css/Fri, 04 Dec 2015 00:00:00 +0000/til/programming/html-animate-css/HTML Animations Using CSS AnimateCSSImport in Python/til/programming/import-in-python/Fri, 04 Dec 2015 00:00:00 +0000/til/programming/import-in-python/Import in PythonIPython or Jupyter Notebook Magics/til/programming/ipython-or-jupyter-notebook-magics/Fri, 04 Dec 2015 00:00:00 +0000/til/programming/ipython-or-jupyter-notebook-magics/IPython or Jupyter Notebook MagicsLaTeX Automatically Adjust Figure/til/programming/latex-automatically-adjust-figure/Fri, 04 Dec 2015 00:00:00 +0000/til/programming/latex-automatically-adjust-figure/LaTeX Automatically Adjust FigureMathematica Plot Default Font Style and Ticks Style: BaseStyle/til/programming/mathematica/mathematica-plot-basestyle-default-font-style-and-ticks-style/Fri, 04 Dec 2015 00:00:00 +0000/til/programming/mathematica/mathematica-plot-basestyle-default-font-style-and-ticks-style/Mathematica Plot Default Font Style and Ticks Style BaseStyleMathematica Smooth Plot/til/programming/mathematica/mathematica-smooth-plot/Fri, 04 Dec 2015 00:00:00 +0000/til/programming/mathematica/mathematica-smooth-plot/Mathematica Smooth PlotMigrating Wordpress to Static/til/programming/migrating-wordpress-to-static-site/Fri, 04 Dec 2015 00:00:00 +0000/til/programming/migrating-wordpress-to-static-site/Migrating Wordpress to StaticOpen URL using python using webbrowser module/til/programming/open-url-using-python-webbrowser-module/Fri, 04 Dec 2015 00:00:00 +0000/til/programming/open-url-using-python-webbrowser-module/Open URL using python using webbrowser modulePython Code Style/til/programming/python/python-code-style/Fri, 04 Dec 2015 00:00:00 +0000/til/programming/python/python-code-style/Code Style of Python Guide.
PEP 20 – The Zen of Python
1. Beautiful is better than ugly. 2. Explicit is better than implicit. 3. Simple is better than complex. 4. Complex is better than complicated. 5. Flat is better than nested. 6. Sparse is better than dense. 7. Readability counts. 8. Special cases aren't special enough to break the rules. 9. Although practicality beats purity. 10. Errors should never pass silently.Python Creating Lists/til/programming/python/python-creating-lists/Fri, 04 Dec 2015 00:00:00 +0000/til/programming/python/python-creating-lists/Code Style of Python GuidePython enumertate/til/programming/python/python-enumerate/Fri, 04 Dec 2015 00:00:00 +0000/til/programming/python/python-enumerate/Python enumertate functionPython List Comprehensions/til/programming/python/python-list-comprehensions/Fri, 04 Dec 2015 00:00:00 +0000/til/programming/python/python-list-comprehensions/Python List ComprehensionsPython Making a List/til/programming/python/python-making-a-list/Fri, 04 Dec 2015 00:00:00 +0000/til/programming/python/python-making-a-list/Python Making a ListPython Map vs For in Python/til/programming/python/python-map-vs-for/Fri, 04 Dec 2015 00:00:00 +0000/til/programming/python/python-map-vs-for/Python Map vs For in PythonPython Onliner: Filter Prime Numbers/til/programming/filter-prime-numbers/Fri, 04 Dec 2015 00:00:00 +0000/til/programming/filter-prime-numbers/Python Onliner Filter Prime NumbersPython Stupid numpy.piecewise/til/programming/python/python-stupid-numpy-piecewise/Fri, 04 Dec 2015 00:00:00 +0000/til/programming/python/python-stupid-numpy-piecewise/Python Stupid numpy.piecewisePython Various Ways of Writing Loops/til/programming/python/python-writing-loops/Fri, 04 Dec 2015 00:00:00 +0000/til/programming/python/python-writing-loops/Python Various Ways of Writing LoopsRun a program in the background on ubuntu/til/programming/run-program-in-background-ubuntu/Fri, 04 Dec 2015 00:00:00 +0000/til/programming/run-program-in-background-ubuntu/Run a program in the background on 
ubuntusnakeviz/til/programming/python/python-profile-snakeviz/Fri, 04 Dec 2015 00:00:00 +0000/til/programming/python/python-profile-snakeviz/Python snakevizArea Enclosed by a Line/til/math/area-enclosed-in-a-line/Sun, 15 Feb 2015 00:00:00 +0000/til/math/area-enclosed-in-a-line/Calculate the area enclosed by a lineEigensystem of A Special Matrix/til/math/eigensystem-of-a-special-matrix/Sun, 15 Feb 2015 00:00:00 +0000/til/math/eigensystem-of-a-special-matrix/Eigenstates of a very special matrixFeynman Trick/til/math/feynman-tricks/Sun, 15 Feb 2015 00:00:00 +0000/til/math/feynman-tricks/An identity about integralSymmetry of second derivatives/til/math/symmetry-of-second-derivatives/Sun, 15 Feb 2015 00:00:00 +0000/til/math/symmetry-of-second-derivatives/Symmetry of second derivatives<link>/wiki/dynamical-system/integration-of-ode/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/wiki/dynamical-system/integration-of-ode/</guid><description/></item><item><title/><link>/wiki/survival-analysis/survival-probability/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/wiki/survival-analysis/survival-probability/</guid><description/></item><item><title>About/about/Mon, 01 Jan 0001 00:00:00 +0000/about/Datumorphism is my notebook about programming, data scraping, statistics, machine learning, and data visualization.
Join our Enki Team Learn, practice, and play together. Programmers are unstoppable. Intelligence Notebook Notes about neuroscience, machine intelligence, and collective intelligence. Lei Ma Visit this page if you would like to know more about me.Akaike Information Criterion/cards/statistics/aic/Mon, 01 Jan 0001 00:00:00 +0000/cards/statistics/aic/Suppose we have a model that describes the data generation process behind a dataset. The distribution by the model is denoted as $\hat f$. The actual data generation process is described by a distribution $f$.
We ask the question:
How good is the approximation using $\hat f$?
To be more precise, how much information is lost if we use our model distribution $\hat f$ in place of the actual data generation distribution $f$?Bayes Factors/cards/statistics/bayes-factors/Mon, 01 Jan 0001 00:00:00 +0000/cards/statistics/bayes-factors/$$ \frac{p(\mathscr M_1|y)}{ p(\mathscr M_2|y) } = \frac{p(\mathscr M_1)}{ p(\mathscr M_2) }\frac{p(y|\mathscr M_1)}{ p(y|\mathscr M_2) } $$
Bayes factor
$$ \mathrm{BF_{12}} = \frac{m(y|\mathscr M_1)}{m(y|\mathscr M_2)} $$
$\mathrm{BF_{12}}$: how many times more likely model $\mathscr M_1$ is than $\mathscr M_2$.Bayesian Information Criterion/cards/statistics/bic/Mon, 01 Jan 0001 00:00:00 +0000/cards/statistics/bic/BIC is the Bayesian information criterion. It replaces the $+2k$ term in AIC with $k\ln n$
$$ \mathrm{BIC} = -2\ln p(y|\hat\theta) + k\ln n = \ln \left(\frac{n^k}{p^2}\right) $$
$n$ is the number of observations and $k$ is the number of parameters. We prefer the model with a small BIC.Cheatsheets/awesome/cheatsheets/Mon, 01 Jan 0001 00:00:00 +0000/awesome/cheatsheets/ Supervised Learning k-Nearest Neighbors [Supervised Learning Classification ] : Linear Regression [Supervised Learning Regression ] : Lasso [Supervised Learning Regression Regularization ] : Ridge [Supervised Learning Regression Regularization ] : ElasticNet [Supervised Learning Regression Regularization ] : Unsupervised Learning k-Means [Unsupervised Learning ] : t-SNE [Unsupervised Learning ] : PCA [Unsupervised Learning Dimension Reduction Feature Selection ] : NMF [Unsupervised Learning ] : Non-negative Matrix FactoringCurriculum/awesome/curriculum/Mon, 01 Jan 0001 00:00:00 +0000/awesome/curriculum/Prerequisites Programming Python C++ alternatives: name: R name: Matlab Computer Science These theories make people think faster. They don’t pose direct limits on what data scientists can do but they will definitely give data scientists a boost.
Data Structures Complexity Math Some basic understanding of these is absolutely required. Higher levels of these topics will also be listed in details.
Statistics Linear Algebra Calculus Differential Equations EDA Tools These tools are used almost everywhere in data science.Fisher Information Approximation/cards/statistics/fia/Mon, 01 Jan 0001 00:00:00 +0000/cards/statistics/fia/#FIA is a method to describe the [[minimum-description-length|minimum description length ( #MDL )]] of models,
$$ \mathrm{FIA} = -\ln p(y | \hat\theta) + \frac{k}{2} \ln \frac{n}{2\pi} + \ln \int_\Theta \sqrt{ \operatorname{det}[I(\theta)] } \, d\theta $$
$I(\theta)$: Fisher information matrix of sample size 1. $$I_{i,j}(\theta) = E\left( \frac{\partial \ln p(y| \theta)}{\partial \theta_i}\frac{ \partial \ln p (y | \theta) }{ \partial \theta_j } \right)$$.Goodness-of-fit/wiki/model-selection/goodness-of-fit/Mon, 01 Jan 0001 00:00:00 +0000/wiki/model-selection/goodness-of-fit/Does the data agree with the model?
distance between data and model predictions likelihood function: likelihood of observing the data if we assume the model; the results will be a set of fitting parameters. Why don’t we always use goodness-of-fit as a measure of the goodness of a model?
overfitting not intuitive This is why we would like to balance it with parsimony using some measures of generalizability.Gridlines in Matplotlib/til/programming/matplotlib-gridlines/Mon, 01 Jan 0001 00:00:00 +0000/til/programming/matplotlib-gridlines/Adding gridlines in matplotlibKolmogorov Complexity/cards/statistics/kolmogorov-complexity/Mon, 01 Jan 0001 00:00:00 +0000/cards/statistics/kolmogorov-complexity/Description:
$\Sigma=\{0,1\}$, a map $f:\Sigma^* \to\Sigma^*$. A description of a binary string $\sigma$ is a string $\tau$ such that $f(\tau)=\sigma$.
Kolmogorov complexity $C_f$
$$ C_f(x) = \begin{cases} \min\{ \vert p \vert : f(p) = x \} & \text{if such a } p \text{ exists} \\ \infty & \text{otherwise} \end{cases} $$ $f$ can be a universal Turing machine.Measures of Generalizability/wiki/model-selection/measures-of-generalizability/Mon, 01 Jan 0001 00:00:00 +0000/wiki/model-selection/measures-of-generalizability/Minimum Description Length/cards/statistics/mdl/Mon, 01 Jan 0001 00:00:00 +0000/cards/statistics/mdl/The minimum description length ( #MDL ) is based on the idea of compression of the data.
MDL looks for the model that compresses the data well. To compress data, we need to find the regularity in the data.
There are many versions of MDL.
crude two-part code Fisher information approximation ( #FIA ) Normalized Maximum Likelihood ( #NML )Model Comparison/wiki/model-selection/model-selection/Mon, 01 Jan 0001 00:00:00 +0000/wiki/model-selection/model-selection/The parsimony model comes from the idea of Occam’s razor: We choose the simple model that has more explanatory power.
The instance theory is a good model to explain the lexical decision task but it is not the only one. However, its simplicity makes it popular.
What is a Good Model? A good model should be presumably
plausibility balance of parsimony and goodness-of-fit coherence of the underlying assumptions easy to understand when it breaks down consistency with known results especially with the simple and basic phenomena ability to explain rather than describe data extent to which model predictions can be falsified through experiments.Normalized Maximum Likelihood/cards/statistics/nml/Mon, 01 Jan 0001 00:00:00 +0000/cards/statistics/nml/$$ \mathrm{NML} = \frac{ p(y| \hat \theta(y)) }{ \int_X p( x| \hat \theta (x) ) dx } $$Parsimony of Models/wiki/model-selection/parsimony-of-models/Mon, 01 Jan 0001 00:00:00 +0000/wiki/model-selection/parsimony-of-models/For models with a lot of parameters, the goodness-of-fit is very likely to be very high. However, such models are also likely to generalize badly. So we need measures of generalizability.
Here parsimony gives us a few advantages.
easy to perceive better generalizationsResearchers/awesome/researchers/Mon, 01 Jan 0001 00:00:00 +0000/awesome/researchers/ Machine Learning Geoffrey Hinton [machine learning psychology artificial intelligence cognitive science computer science ] : Emeritus Prof. Comp Sci, U.Toronto & Engineering Fellow, GoogleTools/awesome/tools/Mon, 01 Jan 0001 00:00:00 +0000/awesome/tools/List of Tools Dashboard ReDash [Python ] : Superset [Python ] : Metabase [Java ] : Google Data Studio [Free Google BigQuery Cloud ] : Google Data Studio is a convenient tool to produce simple yet massive dashboards for the team. Design and Build a Data Warehouse for Business [Courses Warehouse Business ] : Explained Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) [LSTM RNN ] : JavaScript replacements for Python data science tools [JavaScript Tools Data Science ] : https://github.Typography of this Website/typography/Mon, 01 Jan 0001 00:00:00 +0000/typography/Basic Syntax This website uses kramdown as the basic syntax. However, a lot of html/css/js has been applied to generate certain contents or styles.
Math also follows the kramdown syntax.
Notes div {% highlight html %}
Figure with Caption {% highlight html %}
![]({{ site.url }}/assets/programming/chrome-dev-tools-inspect.png) where {{ site.url }} is the configured url of the site.
Alternatively, we can use the set attributes syntax in kramdown.
{% highlight md %} This is a paragraph with some class.Workflows/awesome/workflows/Mon, 01 Jan 0001 00:00:00 +0000/awesome/workflows/The scope of exploratory data analysis is not universally defined. Some of the contents discussed here may have crossed the line. The whole modeling process is never decoupled anyway. Data wrangling is mostly guided by the exploratory data analysis (EDA). In other words, the data cleaning process should be mostly guided by questions from business and stakeholder or out of curiosity.
There are three key components in EDA.