Deep Infomax

Max Global Mutual Information

Why not just use the global mutual information of the input and encoder output as the objective?

… maximizing MI between the complete input and the encoder output (i.e., global MI) is often insufficient for learning useful representations.

– Hjelm et al.[^Devon2018]

[[Mutual information]] is defined as $I(X;Y) = \mathbb{E}_{p_{XY}}\left[\ln \frac{p_{XY}}{p_X p_Y}\right]$. If $X$ and $Y$ are independent, we have $p_{XY} = p_X p_Y$ and thus $I(X;Y) = 0$. This makes sense, as there is no "mutual" information between two variables that are independent of each other. Mutual information is also closely related to entropy: a simple decomposition shows that $I(X;Y) = H(X) - H(X \mid Y)$.

Mutual information maximization is performed on the input of the encoder $X$ and the encoded feature $\hat X = E_\theta(X)$,

$$\arg\max_{\theta} I(X; E_\theta(X)).$$
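As a quick numerical check of the definition $I(X;Y) = \mathbb{E}_{p_{XY}}[\ln (p_{XY} / (p_X p_Y))]$, the sketch below evaluates it for small discrete joint distributions. The function name `mutual_information` and the two example tables are illustrative choices, not part of Deep Infomax itself, where $X$ is high dimensional and the joint distribution is unknown.

```python
import numpy as np


def mutual_information(p_xy):
    """Mutual information (in nats) of a discrete joint probability table."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal of X, shape (n, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal of Y, shape (1, m)
    product = p_x @ p_y                    # table of p_X(x) * p_Y(y)
    mask = p_xy > 0                        # 0 * ln(0) is treated as 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / product[mask])))


# Independent variables: the joint table is the outer product of the marginals.
independent = np.outer([0.3, 0.7], [0.6, 0.4])

# Two perfectly correlated fair bits: I(X;Y) = H(X) = ln 2.
correlated = np.array([[0.5, 0.0], [0.0, 0.5]])

print(mutual_information(independent))
print(mutual_information(correlated))
```

The independent table gives zero up to floating point error, while the perfectly correlated table gives $\ln 2$, the full entropy of one fair bit.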

Being a quantity that is notoriously hard to compute, the mutual information $I(X; E_\theta(X))$ is usually estimated through a lower bound that depends on the choice of a functional $T_\omega$. The objective thus becomes maximizing a parametrized estimate of the mutual information,

$$\arg\max_{\theta,\omega} \hat I_\omega(X; E_\theta(X)).$$
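To make the role of the functional $T_\omega$ concrete, here is a minimal sketch that assumes the Donsker-Varadhan bound, $I(X;Y) \ge \mathbb{E}_{p_{XY}}[T] - \ln \mathbb{E}_{p_X p_Y}[e^{T}]$, as the lower bound; the text above does not commit to a specific bound, so this is only one possible choice. Instead of training a network, it plugs in two hand-written critics on a toy pair of correlated Gaussians, where the true mutual information is known in closed form. All names (`dv_lower_bound`, `optimal_critic`) are hypothetical.

```python
import numpy as np


def dv_lower_bound(t_joint, t_marginal):
    """Donsker-Varadhan bound: E_joint[T] - ln E_marginals[exp(T)]."""
    return np.mean(t_joint) - np.log(np.mean(np.exp(t_marginal)))


def optimal_critic(x, y, rho):
    """ln p(x,y) / (p(x) p(y)) for standard Gaussians with correlation rho."""
    quad = (rho**2 * x**2 - 2 * rho * x * y + rho**2 * y**2) / (2 * (1 - rho**2))
    return -0.5 * np.log(1 - rho**2) - quad


rng = np.random.default_rng(0)
rho, n = 0.5, 200_000

# Samples from the joint distribution of two correlated standard Gaussians.
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Shuffling y breaks the pairing and approximates the product of marginals.
y_shuffled = rng.permutation(y)

true_mi = -0.5 * np.log(1 - rho**2)

# With the optimal critic the bound is tight (up to sampling noise).
tight = dv_lower_bound(optimal_critic(x, y, rho),
                       optimal_critic(x, y_shuffled, rho))

# A cruder critic still gives a valid, but looser, lower bound.
weak = dv_lower_bound(0.1 * x * y, 0.1 * x * y_shuffled)

print(true_mi, tight, weak)
```

The gap between `weak` and `tight` is why the estimator parameters $\omega$ are optimized jointly with the encoder parameters $\theta$: a better critic gives a tighter estimate of the quantity being maximized.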

Local or Global

There are two approaches to applying mutual information to encoders:

  • Global mutual information between the full input and the full encoding. This is useful for reconstruction of the input.
  • Local mutual information between local patches of the input and the full encoding. This is useful for classification.

Local Mutual Information

To compare local features to the encoder output, we need to extract values from inside the encoder, i.e.,

$$E_{\theta_f,\theta_C} = f_{\theta_f} \circ C_{\theta_C}.$$

The first step, $C_{\theta_C}$, maps the input into feature maps; the second step, $f_{\theta_f}$, maps the feature maps into the encoding. The feature map $C_{\theta_C}$ is split into patches,

$$C_{\theta_C} = \left\{ C_{\theta_C}^{(i)} \right\}.$$
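The decomposition of the encoder into a feature-map step and a summarizing step can be sketched with a toy NumPy model. Everything here is a hypothetical stand-in: real implementations would typically use convolutional layers, while this sketch uses 8x8 inputs, non-overlapping 2x2 blocks, and random linear maps (`w_c`, `w_f`) purely to show where the local features $C^{(i)}$ and the global encoding come from.

```python
import numpy as np


def feature_map(x, w_c):
    """C: map a batch of 8x8 inputs to 16 local feature vectors (one per 2x2 block)."""
    n = x.shape[0]
    # (n, 8, 8) -> (n, block_row, row_in_block, block_col, col_in_block)
    blocks = x.reshape(n, 4, 2, 4, 2).transpose(0, 1, 3, 2, 4).reshape(n, 16, 4)
    return np.tanh(blocks @ w_c)  # shape (n, 16, d)


def summarize(c, w_f):
    """f: map the whole feature map to a single global encoding."""
    return np.tanh(c.reshape(c.shape[0], -1) @ w_f)


def encoder(x, w_c, w_f):
    """E = f composed with C."""
    return summarize(feature_map(x, w_c), w_f)


rng = np.random.default_rng(0)
w_c = rng.standard_normal((4, 3))       # 2x2 block (4 values) -> 3 local features
w_f = rng.standard_normal((16 * 3, 5))  # all local features -> 5-dim encoding

x = rng.standard_normal((10, 8, 8))
local_features = feature_map(x, w_c)    # local_features[:, i] plays the role of C^(i)
encoding = encoder(x, w_c, w_f)

print(local_features.shape, encoding.shape)
```

Each `local_features[:, i]` depends only on one block of the input, whereas `encoding` depends on all of them; the local objective pairs every such local feature with the same global encoding.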

The objective is

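$$\arg\max_{\theta_f,\theta_C,\omega} \mathbb{E}_i\left[ \hat I_\omega\left( C_{\theta_C}^{(i)}; E_{\theta_f,\theta_C}(X) \right) \right].$$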

Why does local mutual information help

Hjelm et al. explain the idea behind choosing local mutual information[^Devon2018].

Global mutual information doesn’t specify which information is meaningful. An encoder can score well on it by capturing very local noise, which then counts as meaningful information too.

Local mutual information splits the input into patches and calculates the mutual information between each patch and the encoding. If the model only uses information from a few local patches, the mutual information objective will be small after averaging over all the patches. Thus local mutual information forces the model to use information that is shared globally across the input.
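The averaging argument can be illustrated with a toy experiment. The setup below is an assumption made for illustration only: each patch is a shared signal plus patch-specific noise, the encodings are scalars, and mutual information is estimated with the closed-form Gaussian expression $-\tfrac{1}{2}\ln(1 - r^2)$ from the sample correlation $r$ rather than with a learned estimator $\hat I_\omega$.

```python
import numpy as np


def gaussian_mi(a, b):
    """MI (in nats) of two jointly Gaussian scalars, from their sample correlation."""
    r = np.corrcoef(a, b)[0, 1]
    return -0.5 * np.log(1.0 - r**2)


def local_objective(patches, encoding):
    """Average of the per-patch estimates, i.e. E_i[ I(C^(i); E(X)) ]."""
    n_patches = patches.shape[1]
    return float(np.mean([gaussian_mi(patches[:, i], encoding)
                          for i in range(n_patches)]))


rng = np.random.default_rng(0)
n_samples, n_patches = 50_000, 16

shared = rng.standard_normal(n_samples)               # information common to all patches
noise = rng.standard_normal((n_samples, n_patches))   # patch-specific noise
patches = shared[:, None] + noise

enc_local = noise[:, 0]  # encoding that latches onto the noise of a single patch
enc_global = shared      # encoding that captures the shared signal

score_local = local_objective(patches, enc_local)
score_global = local_objective(patches, enc_global)

print(score_local, score_global)
```

The encoding that copies noise from one patch is informative about that patch only, so its score is diluted by the other fifteen patches; the encoding of the shared signal is informative about every patch and scores much higher on the averaged objective.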

Code


L Ma (2021). 'Deep Infomax', Datumorphism, 09 April. Available at: https://datumorphism.leima.is/wiki/machine-learning/contrastive-models/deep-infomax/.