Representation Learning with Mutual Information Maximization¶
$f$-GAN¶
2016 NIPS - $f$-GAN: Training Generative Neural Samplers using Variational Divergence Minimization 1
$f$-divergence¶
Suppose we want to train a generative model $Q$ with density $q(x)$ to approximate a true distribution $P$ with density $p(x)$ by minimizing the $f$-divergence between them:

$$D_f(P \Vert Q) = \int_\mathcal{X} q(x) \, f\!\left(\frac{p(x)}{q(x)}\right) dx,$$

where the generator function $f: \mathbb{R}_+ \to \mathbb{R}$ is convex, lower-semicontinuous, and satisfies $f(1) = 0$, so that $D_f(P \Vert Q) = 0$ when $P = Q$.
Below is a table of commonly used $f$-divergences, their generator functions $f(u)$, and the corresponding optimal variational functions $T^*(x)$ (derived in the variational representation below):
| Name | $D_f(P \Vert Q)$ | Generator $f(u)$ | $T^*(x)$ |
|---|---|---|---|
| Total variation | $\frac{1}{2}\int \lvert p(x) - q(x) \rvert \, dx$ | $\frac{1}{2}\lvert u - 1 \rvert$ | $\frac{1}{2}\operatorname{sign}\left(\frac{p(x)}{q(x)} - 1\right)$ |
| Kullback-Leibler (KL) | $\int p(x) \log \frac{p(x)}{q(x)} \, dx$ | $u \log u$ | $1 + \log \frac{p(x)}{q(x)}$ |
| Reverse KL | $\int q(x) \log \frac{q(x)}{p(x)} \, dx$ | $-\log u$ | $-\frac{q(x)}{p(x)}$ |
| Pearson $\chi^2$ | $\int \frac{(q(x) - p(x))^2}{q(x)} \, dx$ | $(u - 1)^2$ | $2\left(\frac{p(x)}{q(x)} - 1\right)$ |
| Neyman $\chi^2$ | $\int \frac{(p(x) - q(x))^2}{p(x)} \, dx$ | $\frac{(1 - u)^2}{u}$ | $1 - \left(\frac{q(x)}{p(x)}\right)^2$ |
| Squared Hellinger | $\int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 dx$ | $\left(\sqrt{u} - 1\right)^2$ | $1 - \sqrt{\frac{q(x)}{p(x)}}$ |
| Jeffrey | $\int \left(p(x) - q(x)\right) \log \frac{p(x)}{q(x)} \, dx$ | $(u - 1)\log u$ | $1 + \log\frac{p(x)}{q(x)} - \frac{q(x)}{p(x)}$ |
| Jensen-Shannon | $\frac{1}{2}\int p(x)\log\frac{2p(x)}{p(x)+q(x)} + q(x)\log\frac{2q(x)}{p(x)+q(x)} \, dx$ | $u\log u - (u+1)\log\frac{u+1}{2}$ | $\log\frac{2p(x)}{p(x)+q(x)}$ |
Fenchel conjugate¶
The Fenchel conjugate of a function $f: \mathbb{R}_+ \to \mathbb{R}$ is defined as

$$f^*(t) = \sup_{u \in \operatorname{dom}_f} \left\{ ut - f(u) \right\}.$$

We can easily verify that $f^*$ is again convex and lower-semicontinuous, and that the biconjugate recovers the original function, $f^{**} = f$, whenever $f$ is convex and lower-semicontinuous. This lets us write $f(u) = \sup_{t \in \operatorname{dom}_{f^*}} \{ tu - f^*(t) \}$.
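As a quick worked example, the conjugate of the KL generator $f(u) = u \log u$ from the table above follows directly from the definition:

$$
\begin{aligned}
f^*(t) &= \sup_{u > 0} \left\{ ut - u \log u \right\}, \\
\frac{\partial}{\partial u}\left( ut - u \log u \right) &= t - \log u - 1 = 0 \;\Longrightarrow\; u = e^{t - 1}, \\
f^*(t) &= t\, e^{t - 1} - e^{t - 1}(t - 1) = e^{t - 1}.
\end{aligned}
$$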
Variational representation of the $f$-divergence¶
We now derive a variational lower bound on the $f$-divergence. Using $f(u) = \sup_{t \in \operatorname{dom}_{f^*}} \{ tu - f^*(t) \}$,

$$
\begin{aligned}
D_f(P \Vert Q) &= \int_\mathcal{X} q(x) \sup_{t \in \operatorname{dom}_{f^*}} \left\{ t \frac{p(x)}{q(x)} - f^*(t) \right\} dx \\
&\geq \sup_{T \in \mathcal{T}} \left( \int_\mathcal{X} p(x)\, T(x) \, dx - \int_\mathcal{X} q(x)\, f^*\!\left(T(x)\right) dx \right) \\
&= \sup_{T \in \mathcal{T}} \left( \mathbb{E}_{x \sim P}\left[T(x)\right] - \mathbb{E}_{x \sim Q}\left[f^*\!\left(T(x)\right)\right] \right),
\end{aligned}
$$

where $\mathcal{T}$ is an arbitrary class of functions $T: \mathcal{X} \to \operatorname{dom}_{f^*}$; the bound is due to Jensen's inequality when swapping the integration and the supremum, and to the restriction to the class $\mathcal{T}$. The critical value (the maximizer of the inner problem), under mild conditions, takes the form

$$T^*(x) = f'\!\left(\frac{p(x)}{q(x)}\right),$$

so the bound is tight whenever $\mathcal{T}$ contains $T^*$.
$f$-GAN Objective¶
We parameterize the generator $Q_\theta$ (a neural sampler) with parameters $\theta$ and the variational function $T_\omega$ with parameters $\omega$, and learn both through the minimax $f$-GAN objective

$$\min_\theta \max_\omega \; F(\theta, \omega) = \mathbb{E}_{x \sim P}\left[T_\omega(x)\right] - \mathbb{E}_{x \sim Q_\theta}\left[f^*\!\left(T_\omega(x)\right)\right].$$

To account for the restricted domain of the conjugate $f^*$, we decompose $T_\omega(x) = g_f(V_\omega(x))$, where $V_\omega: \mathcal{X} \to \mathbb{R}$ is an unconstrained network and $g_f: \mathbb{R} \to \operatorname{dom}_{f^*}$ is an output activation chosen to match the $f$-divergence:
| Name | Output activation $g_f(v)$ | $\operatorname{dom}_{f^*}$ | Conjugate $f^*(t)$ | $f'(1)$ |
|---|---|---|---|---|
| Total variation | $\frac{1}{2}\tanh(v)$ | $-\frac{1}{2} \leq t \leq \frac{1}{2}$ | $t$ | $0$ |
| Kullback-Leibler (KL) | $v$ | $\mathbb{R}$ | $e^{t-1}$ | $1$ |
| Reverse KL | $-e^{-v}$ | $\mathbb{R}_{-}$ | $-1 - \log(-t)$ | $-1$ |
| Pearson $\chi^2$ | $v$ | $\mathbb{R}$ | $\frac{1}{4}t^2 + t$ | $0$ |
| Neyman $\chi^2$ | $1 - e^{-v}$ | $t < 1$ | $2 - 2\sqrt{1 - t}$ | $0$ |
| Squared Hellinger | $1 - e^{-v}$ | $t < 1$ | $\frac{t}{1 - t}$ | $0$ |
| Jeffrey | $v$ | $\mathbb{R}$ | $W\!\left(e^{1-t}\right) + \frac{1}{W\!\left(e^{1-t}\right)} + t - 2$ | $0$ |
| Jensen-Shannon | $\log 2 - \log\left(1 + e^{-v}\right)$ | $t < \log 2$ | $-\log\left(2 - e^{t}\right)$ | $0$ |
where $v = V_\omega(x)$ denotes the raw, unconstrained output of the network and $W$ is the Lambert-W product log function.
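As a concrete illustration, here is a minimal PyTorch sketch of the Jensen-Shannon instance of this objective, assuming a generator network `G`, an unconstrained variational network `V`, and the activation/conjugate pair from the table above (function and variable names are illustrative, not the paper's reference code):

```python
import math

import torch
import torch.nn.functional as F

# Jensen-Shannon instance of the f-GAN objective (illustrative sketch).
# g_f(v) = log(2) - log(1 + exp(-v)) maps raw outputs into dom_{f*} = (-inf, log 2);
# f*(t) = -log(2 - exp(t)) is the corresponding Fenchel conjugate (see the table above).

def g_f(v):
    return math.log(2.0) - F.softplus(-v)

def f_star(t):
    return -torch.log(2.0 - torch.exp(t))

def f_gan_objective(V, G, x_real, z):
    """F(theta, omega) = E_P[T_w(x)] - E_Q[f*(T_w(x))] for one mini-batch.

    V: unconstrained variational network V_omega; G: generator Q_theta; z: noise batch.
    """
    t_real = g_f(V(x_real))     # T_omega(x) for x ~ P
    t_fake = g_f(V(G(z)))       # T_omega(x) for x ~ Q_theta
    return t_real.mean() - f_star(t_fake).mean()
```

In training, one would take an ascent step on this value with respect to $\omega$ (the variational network) and a descent step with respect to $\theta$ (the generator), alternating between the two.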
Mutual Information Neural Estimator (MINE)¶
2018 ICML - MINE: Mutual Information Neural Estimation 2
MINE has two variants, termed MINE and MINE-$f$; the former is based on the Donsker-Varadhan representation of the KL divergence, the latter on its $f$-divergence (variational) representation.
The Donsker-Varadhan representation of KL¶
The KL divergence admits the Donsker-Varadhan (DV) dual representation

$$D_{KL}(P \Vert Q) = \sup_{T} \; \mathbb{E}_P\left[T\right] - \log \mathbb{E}_Q\left[e^{T}\right],$$

where the supremum is taken over all functions $T$ such that both expectations are finite.

Proof: Consider the Gibbs distribution $g(x) = \frac{1}{Z} q(x) e^{T(x)}$, where $Z = \mathbb{E}_Q\left[e^{T}\right]$ is the normalizing constant. Then $\mathbb{E}_P[T] - \log \mathbb{E}_Q\left[e^{T}\right] = \mathbb{E}_P\left[\log \frac{g(x)}{q(x)}\right]$, and the gap to the KL divergence is

$$D_{KL}(P \Vert Q) - \left(\mathbb{E}_P[T] - \log \mathbb{E}_Q\left[e^{T}\right]\right) = \mathbb{E}_P\left[\log\frac{p(x)}{q(x)}\right] - \mathbb{E}_P\left[\log\frac{g(x)}{q(x)}\right] = D_{KL}(P \Vert G) \geq 0,$$

where equality holds if and only if $g = p$, i.e., for the optimal $T^*(x) = \log\frac{p(x)}{q(x)} + C$ with $C$ a constant.
The $f$-divergence representation of KL¶
Adopting the variational lower bound for the KL divergence (the $f$-divergence with generator $f(u) = u \log u$ and conjugate $f^*(t) = e^{t-1}$) gives

$$D_{KL}(P \Vert Q) \geq \sup_{T} \; \mathbb{E}_P\left[T\right] - \mathbb{E}_Q\left[e^{T - 1}\right],$$

and the optimal variational function is $T^*(x) = 1 + \log\frac{p(x)}{q(x)}$. This bound is looser than the Donsker-Varadhan representation above.
Estimating Mutual Information¶
Mutual information is the KL divergence between the joint distribution and the product of marginals, $I(X; Z) = D_{KL}(P_{XZ} \Vert P_X \otimes P_Z)$. We parameterize $T$ with a neural network $T_\theta$ (the statistics network) and estimate the expectations with empirical samples: joint samples $(x, z) \sim P_{XZ}$ and marginal samples obtained by shuffling $z$ across the batch,

$$\widehat{I}(X; Z) = \mathbb{E}_{P_{XZ}}\left[T_\theta(x, z)\right] - \log \mathbb{E}_{P_X \otimes P_Z}\left[e^{T_\theta(x, z)}\right].$$
When using stochastic gradient descent (SGD), the mini-batch gradient of the MINE objective,

$$\widehat{\nabla}_\theta = \mathbb{E}_B\left[\nabla_\theta T_\theta\right] - \frac{\mathbb{E}_B\left[\nabla_\theta T_\theta \; e^{T_\theta}\right]}{\mathbb{E}_B\left[e^{T_\theta}\right]},$$

is a biased estimate of the full gradient update, because the denominator $\mathbb{E}_B\left[e^{T_\theta}\right]$ is itself a noisy mini-batch estimate and the expectation of a ratio is not the ratio of expectations. This is corrected by replacing the denominator with an exponential moving average of $\mathbb{E}_B\left[e^{T_\theta}\right]$ across mini-batches.
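Below is a minimal PyTorch sketch of this estimator with the moving-average correction; the statistics network `T`, the EMA decay value, and all names are illustrative assumptions rather than the paper's reference implementation:

```python
import torch

class MINE:
    """Donsker-Varadhan MI estimator with EMA-corrected gradients (sketch)."""

    def __init__(self, T, ema_decay=0.99):
        self.T = T                    # statistics network T_theta(x, z)
        self.ema_decay = ema_decay
        self.ema = None               # running estimate of E_Q[exp(T)]

    def loss(self, x, z):
        # Joint samples: (x_i, z_i); marginal samples: pair x_i with shuffled z.
        z_shuffled = z[torch.randperm(z.size(0), device=z.device)]
        t_joint = self.T(x, z)
        t_marg = self.T(x, z_shuffled)

        exp_marg = torch.exp(t_marg).mean()
        if self.ema is None:
            self.ema = exp_marg.detach()
        else:
            self.ema = self.ema_decay * self.ema + (1 - self.ema_decay) * exp_marg.detach()

        # DV estimate of the MI (for logging): E[T] - log E[exp(T)].
        mi_estimate = t_joint.mean() - torch.log(exp_marg)

        # Gradient-corrected surrogate: divide by the EMA instead of the batch mean,
        # so the gradient is E_B[dT] - E_B[dT * exp(T)] / EMA.
        loss = -(t_joint.mean() - exp_marg / self.ema.clamp(min=1e-8))
        return loss, mi_estimate.detach()
```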
For MINE-$f$, the $f$-divergence representation of the KL is used instead. Because that bound is linear in both expectations, its mini-batch gradients are unbiased, but the resulting mutual-information estimate is generally looser than the Donsker-Varadhan one.
Contrastive Predictive Coding (CPC) and the InfoNCE Loss¶
2010 AISTATS - Noise-contrastive estimation: A new estimation principle for unnormalized statistical models 3
2018 NeurIPS - Representation Learning with Contrastive Predictive Coding 4
Noise-Contrastive Estimation (NCE)¶
Suppose we have observed data $x \sim p_d(x)$ and we model the data density with an unnormalized model

$$p_m(x; \theta) = \frac{\tilde{p}_m(x; \theta)}{Z(\theta)}, \qquad Z(\theta) = \int \tilde{p}_m(x; \theta) \, dx.$$

The integral $Z(\theta)$ (the partition function) is intractable for most interesting models. Performing Maximum Likelihood Estimation (MLE) on this model is therefore not feasible, as every likelihood evaluation requires computing $Z(\theta)$. NCE instead turns density estimation into a binary classification problem: draw noise samples $y \sim p_n(y)$ and train a logistic classifier to distinguish data samples from noise samples,

$$J(\theta) = \mathbb{E}_{x \sim p_d}\left[\log \sigma\!\left(G(x; \theta)\right)\right] + \mathbb{E}_{y \sim p_n}\left[\log\left(1 - \sigma\!\left(G(y; \theta)\right)\right)\right],$$

where $G(u; \theta) = \log \tilde{p}_m(u; \theta) - \log p_n(u)$ and $\sigma$ is the sigmoid function. Maximizing $J(\theta)$ estimates the model parameters without ever computing the partition function.
This blog post (in Chinese) shows, by computing the gradients, that as the number of negative samples approaches infinity, the NCE gradient converges to the MLE gradient.
Contrastive Predictive Coding¶
Let $\{x_t\}$ be a sequence of observations. An encoder maps each observation to a latent representation $z_t = g_{\text{enc}}(x_t)$, and an autoregressive model summarizes the past into a context vector $c_t = g_{\text{ar}}(z_{\leq t})$. Rather than predicting the future observation $x_{t+k}$ directly with a generative model, CPC models a density ratio that preserves the mutual information between $x_{t+k}$ and $c_t$:

$$f_k(x_{t+k}, c_t) = \exp\left(z_{t+k}^{\top} W_k \, c_t\right) \propto \frac{p(x_{t+k} \mid c_t)}{p(x_{t+k})},$$

where $W_k$ is a learned projection for prediction step $k$. To maximize our contrastive predictive capability, we minimize the following InfoNCE loss over a set $X = \{x_1, \dots, x_N\}$ of $N$ samples containing one positive sample drawn from $p(x_{t+k} \mid c_t)$ and $N - 1$ negative samples drawn from the proposal distribution $p(x_{t+k})$:

$$\mathcal{L}_N = -\mathbb{E}_X\left[\log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)}\right].$$
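A minimal PyTorch sketch of this loss over a mini-batch, where the other elements of the batch serve as the negative samples and the step-specific projection `W_k` is an assumed learnable matrix (names are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_future, c, W_k):
    """InfoNCE over a batch: z_future[i] is the positive for context c[i].

    z_future: (B, D) encodings z_{t+k}; c: (B, D) context vectors c_t;
    W_k: (D, D) step-specific projection. Other rows of the batch act as negatives.
    """
    # Scores z_j^T W_k c_i, i.e. the logits of f_k(x_j, c_i) = exp(z_j^T W_k c_i).
    logits = z_future @ W_k @ c.t()          # (B, B): rows = candidates, cols = contexts
    logits = logits.t()                      # (B, B): row i scores all candidates for c_i
    labels = torch.arange(c.size(0), device=c.device)
    # -E[ log( exp(logit_pos) / sum_j exp(logit_j) ) ] is exactly a cross-entropy
    # with the diagonal (the true future for each context) as the target class.
    return F.cross_entropy(logits, labels)
```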
Relation with Mutual Information¶
Minimizing the InfoNCE loss maximizes a lower bound on the mutual information:

$$I(x_{t+k}; c_t) \geq \log N - \mathcal{L}_N.$$

Proof: The loss is minimized when $f_k$ is proportional to the density ratio $\frac{p(x \mid c)}{p(x)}$. Plugging this optimum into the loss,

$$
\begin{aligned}
\mathcal{L}_N^{\text{opt}} &= -\mathbb{E}_X \log \frac{\frac{p(x_{t+k} \mid c_t)}{p(x_{t+k})}}{\frac{p(x_{t+k} \mid c_t)}{p(x_{t+k})} + \sum_{x_j \in X_{\text{neg}}} \frac{p(x_j \mid c_t)}{p(x_j)}}
= \mathbb{E}_X \log \left[ 1 + \frac{p(x_{t+k})}{p(x_{t+k} \mid c_t)} \sum_{x_j \in X_{\text{neg}}} \frac{p(x_j \mid c_t)}{p(x_j)} \right] \\
&\approx \mathbb{E}_X \log \left[ 1 + \frac{p(x_{t+k})}{p(x_{t+k} \mid c_t)} (N - 1) \right]
\geq \mathbb{E}_X \log \left[ \frac{p(x_{t+k})}{p(x_{t+k} \mid c_t)} N \right]
= -I(x_{t+k}; c_t) + \log N,
\end{aligned}
$$

where the approximation replaces each negative-sample ratio by its expectation $\mathbb{E}_{x_j \sim p(x)}\left[\frac{p(x_j \mid c_t)}{p(x_j)}\right] = 1$. Note that the approximation becomes more accurate as the number of negative samples increases.
Relation with MINE¶
Let $f_k(x, c) = e^{T(x, c)}$. Then the (negative) InfoNCE objective becomes

$$
\mathbb{E}_X\left[\log \frac{e^{T(x, c)}}{\sum_{x_j \in X} e^{T(x_j, c)}}\right]
= \mathbb{E}_X\left[T(x, c)\right] - \mathbb{E}_X\left[\log \sum_{x_j \in X} e^{T(x_j, c)}\right]
\approx \mathbb{E}_{P_{XC}}\left[T(x, c)\right] - \log\left(N \, \mathbb{E}_{P_X}\left[e^{T(x, c)}\right]\right),
$$

which, up to the constant $\log N$, is equivalent to the MINE estimator:

$$\mathbb{E}_{P_{XC}}\left[T(x, c)\right] - \log \mathbb{E}_{P_X \otimes P_C}\left[e^{T(x, c)}\right].$$
Deep InfoMax (DIM)¶
2019 ICLR - Learning deep representations by mutual information estimation and maximization 5
Deep InfoMax is a principled framework for training a continuous and (almost everywhere) differentiable encoder $E_\psi: \mathcal{X} \to \mathcal{Y}$ so that the mutual information between its input and output is maximized, subject to constraints on the output distribution.

Assume that we are given a set of training examples on an input space $\mathcal{X}$, drawn from an empirical distribution $\mathbb{P}$.
We assert that our encoder should be trained according to the following criteria:

- Local and global mutual information maximization.
- Statistical constraints (matching a prior in the latent space $\mathcal{Y}$).
As a preliminary, we introduce the local feature encoder $C_\psi(x)$, which maps the input to an $M \times M$ feature map of local feature vectors $\{C_\psi^{(i)}(x)\}_{i=1}^{M^2}$; the global representation is obtained by summarizing this map further, $E_\psi(x) = f_\psi(C_\psi(x))$.

The overall DIM objective consists of three parts: global MI maximization, local MI maximization, and statistical constraints. In the following sections, we first introduce how to enforce the statistical constraints, then how to maximize the local MI, and finally the concrete MI maximization objectives that can be plugged into either MI term.
Statistical Constraints¶
Why use adversarial objectives for KL regularization? Here we could also use a VAE-style prior regularization, i.e., an analytic KL term toward the prior, but that requires an encoder with a tractable posterior density; the adversarial objective below only requires samples from the push-forward distribution of the encoder.
DIM imposes statistical constraints onto learned representations by implicitly training the encoder so that the push-forward distribution $\mathbb{U}_{\psi,\mathbb{P}}$ (the distribution of $E_\psi(x)$ for $x \sim \mathbb{P}$) matches a prior $\mathbb{V}$. In the style of adversarial autoencoders, a discriminator $D_\phi$ is trained to distinguish prior samples from encoded samples while the encoder is trained to fool it:

$$\min_\psi \max_\phi \; \mathbb{E}_{y \sim \mathbb{V}}\left[\log D_\phi(y)\right] + \mathbb{E}_{x \sim \mathbb{P}}\left[\log\left(1 - D_\phi\!\left(E_\psi(x)\right)\right)\right].$$
Note that the discriminator $D_\phi$ operates on the (low-dimensional) latent space $\mathcal{Y}$ and is separate from the statistics networks used for MI estimation.
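A minimal PyTorch sketch of this prior-matching term, assuming a uniform prior on $[0,1]^d$ and a latent-space discriminator `D` that outputs logits (all names and the choice of prior are illustrative):

```python
import torch
import torch.nn.functional as F

def prior_matching_losses(D, E, x):
    """Adversarial prior matching on the latent space (sketch).

    D: discriminator on latent codes (outputs logits); E: encoder; x: input batch.
    The prior V is taken to be Uniform[0, 1]^d here as an illustrative choice.
    """
    y_fake = E(x)                          # push-forward samples E_psi(x), x ~ P
    y_real = torch.rand_like(y_fake)       # samples from the prior V
    logits_real = D(y_real)
    logits_fake = D(y_fake.detach())
    # Discriminator maximizes E_V[log D(y)] + E_P[log(1 - D(E(x)))].
    loss_D = F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real)) + \
             F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake))
    # Encoder is trained to fool the discriminator (non-saturating form).
    loss_E = F.binary_cross_entropy_with_logits(D(y_fake), torch.ones_like(logits_fake))
    return loss_D, loss_E
```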
Local MI Maximization¶
Maximizing the mutual information between the encoder's entire input and its output may not be sufficient for learning useful representations. DIM therefore also maximizes the average MI between the high-level (global) representation and local patches of the image. Because the same global representation is encouraged to have high MI with all the patches, this favours encoding aspects of the data that are shared across patches.
First we encode the input to an $M \times M$ feature map of local feature vectors $\{C_\psi^{(i)}(x)\}_{i=1}^{M^2}$ and summarize it into the global representation $E_\psi(x)$; the local objective then maximizes the average estimated MI between the global representation and each local feature vector:

$$\max_{\psi, \omega} \; \frac{1}{M^2} \sum_{i=1}^{M^2} \widehat{I}_{\omega}\!\left(C_\psi^{(i)}(x);\, E_\psi(x)\right).$$
MI Maximization Objectives¶
The Donsker-Varadhan Objective¶
This lower bound on the MI is based on the Donsker-Varadhan representation of the KL divergence (as in MINE):

$$\widehat{I}^{(\mathrm{DV})}(X; Y) = \mathbb{E}_{P_{XY}}\left[T_\omega(x, y)\right] - \log \mathbb{E}_{P_X \otimes P_Y}\left[e^{T_\omega(x, y)}\right].$$

It is the tightest of the bounds considered here, but it is less stable to optimize and requires many negative samples.
The Jensen-Shannon Objective¶
Since we are not concerned with the precise value of the mutual information, but are primarily interested in maximizing it, we can instead optimize a Jensen-Shannon divergence between the joint and the product of marginals. This objective is stable to optimize and requires few negative samples, but it is a looser proxy for the true mutual information.
Following the $f$-GAN formulation of Nowozin et al., the Jensen-Shannon MI estimator is

$$\widehat{I}^{(\mathrm{JSD})}(X; Y) = \mathbb{E}_{P_{XY}}\left[-\operatorname{sp}\!\left(-T_\omega(x, y)\right)\right] - \mathbb{E}_{P_X \otimes P_Y}\left[\operatorname{sp}\!\left(T_\omega(x, y')\right)\right],$$

where $\operatorname{sp}(a) = \log(1 + e^{a})$ is the softplus function and $(x, y')$ denotes a pair drawn from the product of marginals (e.g., pairing $x$ with the representation of a different input).
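A minimal PyTorch sketch of the local objective with this Jensen-Shannon estimator, assuming a "concatenate-and-score" statistics network `T` applied at every location of the feature map and the feature maps of other images as negatives (names and architecture are illustrative):

```python
import torch
import torch.nn.functional as F

def local_jsd_mi_loss(T, global_feat, local_map, local_map_neg):
    """Local DIM objective with the Jensen-Shannon MI estimator (sketch).

    T: statistics network scoring (local feature, global feature) pairs -> scalar logit per location;
    global_feat: (B, D) global representations E_psi(x);
    local_map: (B, C, M, M) feature maps C_psi(x) of the same images (positives);
    local_map_neg: (B, C, M, M) feature maps of other images (negatives).
    """
    B, C, M, _ = local_map.shape
    # Broadcast the global vector to every spatial location of the feature map.
    g = global_feat.unsqueeze(-1).unsqueeze(-1).expand(-1, -1, M, M)   # (B, D, M, M)
    pos = T(torch.cat([local_map, g], dim=1))        # logits on positive (joint) pairs
    neg = T(torch.cat([local_map_neg, g], dim=1))    # logits on negative (marginal) pairs
    # I_hat^(JSD) = E_P[-sp(-T)] - E_{P x P~}[sp(T)]; we return its negation as a loss.
    mi = (-F.softplus(-pos)).mean() - F.softplus(neg).mean()
    return -mi
```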
The InfoNCE Objective¶
This objective uses noise-contrastive estimation to bound mutual information. It obtains strong results, but requires many negative samples.
Deep Graph Infomax¶
2019 ICLR - Deep Graph Infomax 6
Deep Graph Infomax (DGI) is a general approach for learning node representations within graph-structured data in an unsupervised manner. It relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs.
We first introduce the building blocks of DGI:

- The encoder $\mathcal{E}: \mathbb{R}^{N \times F} \times \mathbb{R}^{N \times N} \to \mathbb{R}^{N \times F'}$ such that $\mathcal{E}(\mathbf{X}, \mathbf{A}) = \mathbf{H} = \{\vec{h}_1, \dots, \vec{h}_N\}$ produces node embeddings (or patch representations) $\vec{h}_i$ that summarize a patch of the graph centered around node $i$.
- The readout function $\mathcal{R}: \mathbb{R}^{N \times F'} \to \mathbb{R}^{F'}$, which summarizes the obtained patch representations into a graph-level summary $\vec{s} = \mathcal{R}(\mathbf{H})$. It is implemented as a sigmoid applied after mean pooling, $\mathcal{R}(\mathbf{H}) = \sigma\left(\frac{1}{N}\sum_{i=1}^{N}\vec{h}_i\right)$.
- The discriminator $\mathcal{D}: \mathbb{R}^{F'} \times \mathbb{R}^{F'} \to \mathbb{R}$ such that $\mathcal{D}(\vec{h}_i, \vec{s})$ represents the score assigned to this patch-summary pair (it should be higher for patches contained within the summary's graph). It is implemented as a bilinear function, $\mathcal{D}(\vec{h}_i, \vec{s}) = \sigma\left(\vec{h}_i^{\top} \mathbf{W} \vec{s}\right)$.
- Negative samples are generated by pairing the summary vector $\vec{s}$ of a graph with patch representations $\tilde{\vec{h}}_j$ from another graph $(\tilde{\mathbf{X}}, \tilde{\mathbf{A}})$. This alternative graph is obtained as another element of the training set in a multi-graph setting, or by an explicit corruption function $\mathcal{C}$ which permutes the node feature matrix $\mathbf{X}$ row-wise.
Next we introduce the DGI objective for one training graph $(\mathbf{X}, \mathbf{A})$ with $N$ nodes and $M$ negative patch-summary pairs: a binary cross-entropy (Jensen-Shannon-style) objective that contrasts positive pairs against negative ones,

$$\mathcal{L} = \frac{1}{N + M}\left( \sum_{i=1}^{N} \mathbb{E}_{(\mathbf{X}, \mathbf{A})}\left[\log \mathcal{D}\!\left(\vec{h}_i, \vec{s}\right)\right] + \sum_{j=1}^{M} \mathbb{E}_{(\tilde{\mathbf{X}}, \tilde{\mathbf{A}})}\left[\log\left(1 - \mathcal{D}\!\left(\tilde{\vec{h}}_j, \vec{s}\right)\right)\right] \right),$$

which is maximized jointly over the encoder, readout, and discriminator parameters.
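A minimal PyTorch sketch of this objective with a bilinear discriminator and the row-permutation corruption, assuming a GNN `encoder(X, A)` that returns one embedding per node (names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DGILoss(nn.Module):
    """Deep Graph Infomax objective with a bilinear discriminator (sketch)."""

    def __init__(self, encoder, dim):
        super().__init__()
        self.encoder = encoder                        # GNN: (X, A) -> (N, dim) patch representations
        self.W = nn.Parameter(torch.empty(dim, dim))  # bilinear discriminator weights
        nn.init.xavier_uniform_(self.W)

    def forward(self, X, A):
        H = self.encoder(X, A)                                      # positive patch representations
        X_corrupt = X[torch.randperm(X.size(0), device=X.device)]   # corruption: row-wise feature permutation
        H_neg = self.encoder(X_corrupt, A)                          # negative patch representations
        s = torch.sigmoid(H.mean(dim=0))                            # readout: sigmoid of mean pooling
        # Bilinear discriminator logits h_i^T W s for positive and negative patches.
        logits_pos = H @ self.W @ s
        logits_neg = H_neg @ self.W @ s
        # Minimizing this BCE is equivalent to maximizing
        # sum_i log D(h_i, s) + sum_j log(1 - D(h~_j, s)).
        return F.binary_cross_entropy_with_logits(logits_pos, torch.ones_like(logits_pos)) + \
               F.binary_cross_entropy_with_logits(logits_neg, torch.zeros_like(logits_neg))
```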
InfoGraph¶
2020 ICLR - InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization 7
InfoGraph studies learning the representations of whole graphs (rather than nodes as in DGI) in both unsupervised and semi-supervised scenarios. Its unsupervised version is similar to DGI except for
- Batch-wise generation of negative samples rather than random-sampling- or corruption-based negative samples.
- The use of GIN (Graph Isomorphism Network) as the encoder, for better graph-level representation learning.
In the semi-supervised setting, directly adding a supervised loss on top of the unsupervised objective would likely result in negative transfer. The authors alleviate this problem by separating the parameters into a supervised encoder and an unsupervised encoder: the supervised loss is applied to the former, the unsupervised InfoGraph objective to the latter, and an additional term maximizes the mutual information between the representations learned by the two encoders so that knowledge is transferred between them.
In practice, to reduce the computational overhead, at each training step, we enforce mutual-information maximization on a randomly chosen layer of the encoder.
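A minimal sketch of such a semi-supervised step, assuming both encoders expose per-layer graph-level representations and that `mi_estimator` returns a (negative-MI) surrogate loss; all interfaces and names here are illustrative assumptions:

```python
import random

import torch
import torch.nn.functional as F

def semi_supervised_step(sup_encoder, unsup_encoder, mi_estimator, classifier,
                         batch, labels, unsup_loss_fn, lam=1.0):
    """One InfoGraph-style semi-supervised training step (sketch).

    Both encoders return a list of per-layer graph-level representations, one (B, D)
    tensor per layer; mi_estimator(a, b) returns a loss whose minimization maximizes
    the MI between the two batches of vectors (e.g., a JSD estimator).
    """
    sup_layers = sup_encoder(batch)       # supervised encoder: [(B, D), ...]
    unsup_layers = unsup_encoder(batch)   # unsupervised encoder: [(B, D), ...]

    loss_sup = F.cross_entropy(classifier(sup_layers[-1]), labels)
    loss_unsup = unsup_loss_fn(unsup_layers)          # the unsupervised InfoGraph objective

    # MI transfer term on a single randomly chosen layer (instead of all layers)
    # to reduce the computational overhead.
    k = random.randrange(len(sup_layers))
    loss_transfer = mi_estimator(sup_layers[k], unsup_layers[k])

    return loss_sup + loss_unsup + lam * loss_transfer
```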
1. NeurIPS 2016 - f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization; a blog post explaining the paper in Chinese. ↩
2. ICML 2018 - MINE: Mutual Information Neural Estimation ↩
3. AISTATS 2010 - Noise-contrastive estimation: A new estimation principle for unnormalized statistical models ↩
4. NeurIPS 2018 - Representation Learning with Contrastive Predictive Coding ↩
5. ICLR 2019 - Learning deep representations by mutual information estimation and maximization (slides, video); a blog post explaining the paper in Chinese. ↩↩
6. ICLR 2019 - Deep Graph Infomax. For relations with previous unsupervised graph representation learning methods, see the IPAM tutorial Unsupervised Learning with Graph Neural Networks by Thomas Kipf and also Daza's Master Thesis A Modular Framework for Unsupervised Graph Representation Learning. ↩
7. ICLR 2020 - InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization ↩