Contrastive Models
Contrastive self-supervised learning models can utilize more data, since they learn from unlabeled examples.
4 Deep Infomax
Published:
Category: { Machine Learning }
Tags:
References:
- Hjelm RD, Fedorov A, Lavoie-Marchildon S, Grewal K, Bachman P, Trischler A, et al. Learning deep representations by mutual information estimation and maximization. arXiv [stat.ML]. 2018. Available: http://arxiv.org/abs/1808.06670
- Newell A, Deng J. How Useful is Self-Supervised Pretraining for Visual Tasks? arXiv [cs.CV]. 2020. Available: http://arxiv.org/abs/2003.14323
Summary: Max Global Mutual Information
Why not just use the global mutual information between the input and the encoder output as the objective?
… maximizing MI between the complete input and the encoder output (i.e., global MI) is often insufficient for learning useful representations.
– Hjelm et al.[^Devon2018]
[[Mutual information]] is defined as
$$ I(X;Y) = \mathbb E_{p_{XY}} \ln \frac{P_{XY}}{P_X P_Y}. $$
In the case that $X$ and $Y$ are independent variables, we have $P_{XY} = P_X P_Y$, thus $I(X;Y) = 0$. This makes sense as there would be no “mutual” information if the two variables are independent of each other.
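As a sanity check on the definition, here is a minimal numerical sketch (a hypothetical `mutual_information` helper using NumPy, not code from the referenced papers) that evaluates $I(X;Y)$ for a small discrete joint distribution and confirms it vanishes when the joint factorizes:

```python
import numpy as np

def mutual_information(p_xy):
    """Mutual information I(X;Y) in nats for a discrete joint distribution.

    p_xy: 2-D array with p_xy[i, j] = P(X=i, Y=j), entries summing to 1.
    """
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal P(X), shape (|X|, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal P(Y), shape (1, |Y|)
    mask = p_xy > 0                          # zero-probability terms contribute 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))

# Independent variables: the joint factorizes, so I(X;Y) = 0.
independent = np.outer([0.5, 0.5], [0.3, 0.7])
print(mutual_information(independent))       # ~0.0

# Perfectly correlated variables: I(X;Y) = H(X) = ln 2.
correlated = np.array([[0.5, 0.0],
                       [0.0, 0.5]])
print(mutual_information(correlated))        # ~0.693
```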
4 Contrastive Predictive Coding
Published:
Category: { Machine Learning }
Tags:
Summary: Contrastive Predictive Coding, a.k.a. CPC, is an autoregressive model combined with the InfoNCE loss[1].
There are two key ideas in CPC:
Autoregressive models in latent space, and the InfoNCE loss, which combines mutual information and [[NCE]]. The noise contrastive estimation (NCE) objective function is[1]
$$ \mathcal L = \mathbb E_{x, x^{+}, x^{-}} \left[ - \ln \frac{ C(x, x^{+})}{ C(x,x^{+}) + C(x,x^{-}) } \right], $$
where $x^{+}$ represents data similar to $x$, $x^{-}$ represents data dissimilar to $x$, and $C(\cdot, \cdot)$ is a function that computes the similarities. For example, we can use
$$ C(x, x^{+}) = e^{ f(x)^T f(x^{+}) }, $$
so that the objective function becomes
$$ \mathcal L = \mathbb E_{x, x^{+}, x^{-}} \left[ - \ln \frac{ e^{ f(x)^T f(x^{+}) } }{ e^{ f(x)^T f(x^{+}) } + e^{ f(x)^T f(x^{-}) } } \right]. $$
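To make the loss concrete, below is a minimal PyTorch sketch of an InfoNCE-style objective with one positive and $N$ negatives per anchor. The function name `info_nce_loss`, the cosine-style normalization, and the `temperature` parameter are illustrative assumptions, not the exact CPC implementation (which also includes the autoregressive context network):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query, positive_key, negative_keys, temperature=0.1):
    """InfoNCE-style loss: score the positive against N negatives with a softmax.

    query:         (B, D) encoded anchors, e.g. context vectors
    positive_key:  (B, D) encodings of the matching (e.g. future) samples
    negative_keys: (B, N, D) encodings of dissimilar samples
    """
    query = F.normalize(query, dim=-1)
    positive_key = F.normalize(positive_key, dim=-1)
    negative_keys = F.normalize(negative_keys, dim=-1)

    # Similarity logits: one positive score and N negative scores per anchor.
    pos_logit = (query * positive_key).sum(dim=-1, keepdim=True)          # (B, 1)
    neg_logits = torch.einsum("bd,bnd->bn", query, negative_keys)         # (B, N)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature      # (B, 1+N)

    # The positive always sits at index 0, so the loss is a cross entropy
    # against the "class" of the positive sample.
    labels = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```

With a single negative ($N = 1$) this reduces to the two-term NCE objective quoted above.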
3 Contrastive Model: Instance-Instance
Published:
Category: { Machine Learning }
Tags:
Summary: It was discovered that the success of [[mutual information based contrastive learning]] is more related to the encoder architecture and the negative sampling strategy[1].
2 Contrastive Model: Context-Instance
Published:
Category: { Machine Learning }
Tags:
Summary: In contrastive methods, we can manipulate the data to create data entries and infer the changes using a model. These methods are models that “predict relative position”[1]. Common tricks are shuffling image sections like a jigsaw puzzle and rotating the image. We can also adjust the model to discriminate similarities and differences. For example, to generate contrast, we can also use [[Mutual Information]].
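As an illustration of the “manipulate the data, then infer the change” idea, here is a small PyTorch sketch of a rotation-prediction pretext task. The helper `make_rotation_task`, the commented training step, and the assumption of square images are all illustrative, not taken from the cited work; a jigsaw-style shuffle of image patches would follow the same pattern with patch-permutation labels:

```python
import torch

def make_rotation_task(images):
    """Pretext task sketch: rotate each image by a random multiple of 90 degrees
    and ask the model to predict which rotation was applied.

    images: (B, C, H, W) tensor with H == W (square images, so shapes match
    after rotation); returns rotated images and rotation labels in {0, 1, 2, 3}.
    """
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack(
        [torch.rot90(img, k=int(k), dims=(1, 2)) for img, k in zip(images, labels)]
    )
    return rotated, labels

# Hypothetical training step with an `encoder` and a 4-way `rotation_head`:
# rotated, labels = make_rotation_task(batch)
# loss = F.cross_entropy(rotation_head(encoder(rotated)), labels)
```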
1 Contrastive Model
Published:
Category: { Machine Learning }
Tags:
Summary: Contrastive models learn to compare[1]. Contrastive models use special objective functions such as [[NCE]] and [[Mutual Information]].
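For illustration, here is a minimal PyTorch sketch of the single-negative NCE objective with the exponential similarity $C(x, x') = e^{ f(x)^T f(x') }$ quoted in the Contrastive Predictive Coding summary above; the helper name `nce_loss` and the batch layout are assumptions made for this example:

```python
import torch

def nce_loss(f_x, f_pos, f_neg):
    """Single-negative NCE loss with C(x, x') = exp(f(x)^T f(x')).

    f_x, f_pos, f_neg: (B, D) embeddings of the anchor, a similar sample,
    and a dissimilar sample.
    """
    pos_score = (f_x * f_pos).sum(dim=-1)   # f(x)^T f(x+)
    neg_score = (f_x * f_neg).sum(dim=-1)   # f(x)^T f(x-)
    # -ln( e^{pos} / (e^{pos} + e^{neg}) ), computed stably via logsumexp.
    scores = torch.stack([pos_score, neg_score], dim=-1)
    return (torch.logsumexp(scores, dim=-1) - pos_score).mean()
```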