Kullback–Leibler and Jensen–Shannon Divergence
KL-divergence and JS-divergence are generally used to measure the distance between two probability distributions $p$ and $q$. KL-divergence is formulated as follows.
\[KL(p\Vert q)=\int p(x)\log\frac{p(x)}{q(x)}\text{d}x\]In practice, we usually assume that $p$ and $q$ are Gaussian to simplify the calculation. Notably, KL-divergence is asymmetric: wherever $p(x)$ is close to zero, the integrand vanishes no matter how large $q(x)$ is, so a weak $p$ may yield an uninformatively small divergence. JS-divergence balances $p$ and $q$ as:
\[JS(p\Vert q)=\frac{1}{2}KL(p\Vert \frac{p+q}{2})+\frac{1}{2}KL(q\Vert\frac{p+q}{2})\]JS-divergence is symmetric, so it is unchanged when $p$ and $q$ are swapped, and it behaves more stably. However, KL- and JS-divergence both rely on the strong assumption that $p$ and $q$ have overlapping support.
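As a quick illustration, the snippet below (a minimal NumPy/SciPy sketch; the two discrete distributions are arbitrary illustrative values) computes both divergences and shows the asymmetry of KL.

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

# Two arbitrary discrete distributions over the same support (illustrative values)
p = np.array([0.36, 0.48, 0.16])
q = np.array([0.30, 0.50, 0.20])
m = 0.5 * (p + q)

kl_pq = entropy(p, q)   # KL(p || q)
kl_qp = entropy(q, p)   # KL(q || p) -- generally not equal to KL(p || q)
js = 0.5 * entropy(p, m) + 0.5 * entropy(q, m)  # JS(p || q), symmetric in p and q

print(f"KL(p||q)={kl_pq:.4f}  KL(q||p)={kl_qp:.4f}  JS(p||q)={js:.4f}")
```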
Generative Adversarial Network
GAN contains two modules: a generator $G$ that synthesizes fake samples close to the real data distribution, and a discriminator $D$ that learns to determine whether a sample comes from $G$ or from the real data distribution $p_r$. In the training phase, $D$ and $G$ play the following two-player minimax game:
\[\begin{aligned} \min_G\max_DL(G,D) &= \mathbb{E}_{x\sim p_r}[\log D(x)]+\mathbb{E}_{z\sim p_z(z)}[\log(1-D(G(z)))] \\ &= \mathbb{E}_{x\sim p_r}[\log D(x)]+\mathbb{E}_{x\sim p_g(x)}[\log(1-D(x))] \end{aligned}\]In other words, the generator $G$ is trained to fool the discriminator $D$, while $D$ learns to tell real data from generated samples. Code for training a GAN on the MNIST dataset is given below.
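A minimal PyTorch sketch of the training loop (the original listing is not reproduced here; the MLP architectures, latent size, and hyperparameters are illustrative assumptions):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
latent_dim = 100

# Generator: maps a latent vector z to a flattened 28x28 image (architecture is illustrative)
G = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 512), nn.ReLU(),
    nn.Linear(512, 28 * 28), nn.Tanh(),
).to(device)

# Discriminator: maps an image to the probability that it is real
D = nn.Sequential(
    nn.Linear(28 * 28, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
).to(device)

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCELoss()

transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize([0.5], [0.5])])
loader = DataLoader(datasets.MNIST("data", train=True, download=True, transform=transform),
                    batch_size=128, shuffle=True)

for epoch in range(20):
    for real, _ in loader:
        real = real.view(real.size(0), -1).to(device)
        b = real.size(0)
        ones = torch.ones(b, 1, device=device)
        zeros = torch.zeros(b, 1, device=device)

        # Train D: maximize log D(x) + log(1 - D(G(z)))
        z = torch.randn(b, latent_dim, device=device)
        fake = G(z).detach()
        loss_D = bce(D(real), ones) + bce(D(fake), zeros)
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

        # Train G: the non-saturating form maximizes log D(G(z)) instead of
        # minimizing log(1 - D(G(z))), which gives stronger gradients early on
        z = torch.randn(b, latent_dim, device=device)
        loss_G = bce(D(G(z)), ones)
        opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```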
We first consider the optimal discriminator $D$ for any generator $G$. The training criterion for the discriminator $D$ is to maximize $L(G,D)$, from which we can obtain the analytical solution for the optimal $D^*$.
\[L(G,D)=\int \big(p_r(x)\log D(x)+p_g(x)\log(1-D(x))\big)\text{d}x\] \[D^*(x)=\arg\max_{D(x)} L(G,D)=\frac{p_r(x)}{p_r(x)+p_g(x)}\] \[\max_DL(G,D)=\mathbb{E}_{x\sim p_r(x)}[\log D^*(x)]+\mathbb{E}_{x\sim p_g(x)}[\log(1-D^*(x))]\]At the global optimum $p_g=p_r$, the optimal $D^*(x)$ equals $1/2$ and $\max_D L(G,D)$ attains its minimum value $-2\log2$. Rethinking the distance between $p_r$ and $p_g$, we can get
\[JS(p_r\Vert p_g)=\log2+\frac{1}{2}L(G,D^*)\] \[L(G,D^*)=2JS(p_r\Vert p_g)-2\log2\]In essence, the loss function of GAN quantifies the distance between the real data distribution $p_r$ and the generated data distribution $p_g$. Since JS-divergence is non-negative, $-2\log2$ is also the lower bound of $L(G,D^*)$, attained exactly when $p_g=p_r$.
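For completeness, the expression for the optimal discriminator follows from maximizing the integrand of $L(G,D)$ pointwise. Writing $a=p_r(x)$ and $b=p_g(x)$ for a fixed $x$,
\[\frac{\text{d}}{\text{d}D}\big(a\log D+b\log(1-D)\big)=\frac{a}{D}-\frac{b}{1-D}=0\quad\Longrightarrow\quad D^*(x)=\frac{a}{a+b}=\frac{p_r(x)}{p_r(x)+p_g(x)}\]Substituting $D^*$ back into $L(G,D)$ and rewriting each term against the mixture $(p_r+p_g)/2$ yields the JS-divergence relation above.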
Problems in GAN
Although GAN has shown significant potential in image generation, its training is notoriously unstable. One possible reason is that the generator and the discriminator are updated independently, with no coordination between the two players; performing simultaneous gradient updates on both models does not guarantee convergence.
The other possible cause is that $p_r$ and $p_g$ lie on low-dimensional manifolds, where they are almost certainly disjoint. In this case an optimal discriminator can separate them perfectly, and its gradient is zero almost everywhere. When the discriminator is perfect, the generator barely updates due to vanishing gradients.
\[\lim_{\Vert D-D^*\Vert\to 0}\nabla_\theta\mathbb{E}_{z\sim p(z)}[\log(1-D(G_\theta(z)))]=0\]As the discriminator gets better, the gradient of the generator vanishes and training stalls. A closely related failure is mode collapse, in which the generator gets stuck producing the same few outputs. See Arjovsky and Bottou for more details.
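The toy sketch below (a contrived 1-D example; the near-perfect discriminator is simulated by hand-setting its weights) shows how the gradient of $\log(1-D(G_\theta(z)))$ with respect to the generator collapses once $D(G(z))\approx 0$.

```python
import torch
import torch.nn as nn

# Toy 1-D setup: a near-perfect discriminator confidently rejects every fake sample,
# so the minimax generator loss log(1 - D(G(z))) saturates and its gradient vanishes.
G = nn.Linear(1, 1)                               # toy generator
D = nn.Sequential(nn.Linear(1, 1), nn.Sigmoid())  # toy discriminator

with torch.no_grad():
    G.weight.zero_(); G.bias.fill_(1.0)           # G(z) = 1 for every z
    D[0].weight.fill_(-20.0); D[0].bias.zero_()   # D(1) = sigmoid(-20) ~ 2e-9

z = torch.randn(256, 1)
loss = torch.log(1.0 - D(G(z))).mean()            # original (saturating) generator loss
loss.backward()
# Both gradients are on the order of 1e-8 or smaller: the generator barely moves.
print(G.weight.grad.norm().item(), G.bias.grad.norm().item())
```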
Improved Techniques for Training GANs
(1) Adding noise. Vanishing gradients typically occur because $p_r$ and $p_g$ are disjoint. We can add continuous noise to the inputs of the discriminator, thereby smoothing the distribution of the probability mass and encouraging the two distributions to overlap (see the sketch after this list).
(2) Softer metrics of distribution distance. When $p_r$ and $p_g$ are disjoint, the JS-divergence cannot provide a meaningful gradient. The Wasserstein metric is introduced to replace JS-divergence because it remains smooth and informative even when the two distributions do not overlap.
As suggested in Salimans et al., we list further improved techniques for training GANs, including (3) feature matching, (4) mini-batch discrimination, (5) historical averaging, (6) one-sided label smoothing, and (7) virtual batch normalization. See the original paper for more details.
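As an illustration of technique (1), the sketch below adds annealed Gaussian noise to the discriminator inputs (the linear annealing schedule and `sigma0` value are illustrative assumptions):

```python
import torch

def add_instance_noise(x: torch.Tensor, step: int, total_steps: int, sigma0: float = 0.1) -> torch.Tensor:
    """Add Gaussian noise to discriminator inputs, linearly annealed to zero over training.

    Applying the same kind of noise to real and generated batches spreads both
    distributions out, so p_r and p_g overlap and the discriminator cannot become perfect.
    """
    sigma = sigma0 * max(0.0, 1.0 - step / total_steps)
    return x + sigma * torch.randn_like(x)

# Usage inside the training loop (real / fake are image batches):
#   d_real = D(add_instance_noise(real, step, total_steps))
#   d_fake = D(add_instance_noise(fake, step, total_steps))
```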
Wasserstein GAN
Wasserstein distance $W(p,q)$ is the minimum cost of transporting the whole probability mass of $p$ to match the probability mass of $q$, which is defined as
\[W(p,q)=\inf_{\gamma\in\Gamma(p,q)}\mathbb{E}_{(x,y)\sim\gamma}[\Vert x-y\Vert]=\inf_{\gamma\in\Gamma(p,q)}\sum_{x,y}\gamma(x,y)\Vert x-y\Vert\]where $\inf$ denotes the infimum and $\Gamma(p,q)$ denotes the set of all joint probability distributions $\gamma(x,y)$ whose marginals are $p$ and $q$. In essence, the Wasserstein distance measures the minimum total work of a transport plan if we treat $\gamma(x,y)$ as the amount of probability mass moved from $x$ to $y$ and $\Vert x-y\Vert$ as the moving distance. Even when two distributions are located on low-dimensional manifolds without overlap, the Wasserstein distance still provides a meaningful and smooth value.
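For intuition, SciPy's `wasserstein_distance` computes the one-dimensional case directly from samples; even for two completely disjoint distributions it returns the actual gap between them, whereas JS-divergence would saturate at $\log 2$ (the example distributions are arbitrary):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two disjoint 1-D distributions: all mass at 0 versus all mass at 3 (illustrative)
p_samples = np.zeros(1000)
q_samples = np.full(1000, 3.0)

# JS-divergence between disjoint distributions is log 2 no matter how far apart they
# are; the Wasserstein distance still reflects the gap and shrinks smoothly as the
# distributions move closer together.
print(wasserstein_distance(p_samples, q_samples))           # 3.0
print(wasserstein_distance(p_samples, np.full(1000, 0.5)))  # 0.5
```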
However, the infimum in $W(p,q)$ is intractable. According to the Kantorovich-Rubinstein duality, we can obtain
\[W(p,q)=\frac{1}{K}\sup_{\Vert f\Vert_L\leq K}\mathbb{E}_{x\sim p}[f(x)]-\mathbb{E}_{x\sim q}[f(x)]\]where $\sup$ denotes the supremum (the counterpart of $\inf$) and the function $f$ is required to be $K$-Lipschitz continuous. Supposing $f_\omega$ is a family of $K$-Lipschitz functions parameterized by $\omega$, the discriminator (critic) of WGAN is optimized by
\[L(p_r,p_g)=W(p_r,p_g)=\max_{\omega}\mathbb{E}_{x\sim p_r}[f_\omega(x)]-\mathbb{E}_{z\sim p(z)}[f_\omega(g_\theta(z))]\]Now comes the question of maintaining the $K$-Lipschitz continuity of $f_\omega$ during training. Arjovsky et al. present a simple yet very practical trick: clamp the weights $\omega$ to a fixed box such as $[-0.01,0.01]$ after every gradient update, inducing a compact space of $\omega$ and thus ensuring the Lipschitz continuity of $f_\omega$. The WGAN algorithm and a PyTorch implementation are sketched below.
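A minimal PyTorch sketch of the WGAN training step with weight clipping (the original listing is not reproduced here; the network definitions, `n_critic`, clip value, and learning rates follow commonly used defaults and are assumptions):

```python
import torch
import torch.nn as nn

latent_dim, clip_value, n_critic = 100, 0.01, 5

# Critic f_w: no sigmoid, it outputs an unbounded score (architectures are illustrative)
critic = nn.Sequential(nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
generator = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                          nn.Linear(256, 28 * 28), nn.Tanh())

opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(generator.parameters(), lr=5e-5)

def train_step(real: torch.Tensor) -> None:
    b = real.size(0)
    # Train the critic n_critic times per generator update
    for _ in range(n_critic):
        z = torch.randn(b, latent_dim)
        fake = generator(z).detach()
        # Maximize E[f(x)] - E[f(g(z))]  <=>  minimize its negation
        loss_c = -(critic(real).mean() - critic(fake).mean())
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        # Weight clipping keeps w in a compact box to enforce Lipschitz continuity
        for p in critic.parameters():
            p.data.clamp_(-clip_value, clip_value)

    # Train the generator: maximize E[f(g(z))]  <=>  minimize -E[f(g(z))]
    z = torch.randn(b, latent_dim)
    loss_g = -critic(generator(z)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# Usage: call train_step(real_batch.view(real_batch.size(0), -1)) inside the data loop.
```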
Empirically, the WGAN authors recommend RMSProp or SGD as the optimizer rather than a momentum-based optimizer such as Adam. Gulrajani et al. propose an alternative way to enforce the Lipschitz constraint via a gradient penalty, as shown below, where the penalty samples $x\sim p(x)$ are drawn uniformly along straight lines between pairs of points sampled from $p_r$ and $p_g$. See the original paper for more details.
\[L=\underbrace{-\mathbb{E}_{x\sim p_r}[D(x)]+\mathbb{E}_{x\sim p_g(x)}[D(x)]}_{\text{Original critic loss}}+\lambda\mathbb{E}_{x\sim p(x)}[(\Vert\nabla_xD(x)\Vert_2-1)^2]\]
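A hedged sketch of the gradient-penalty term (the function name is my own; `critic` follows the WGAN sketch above, and $\lambda=10$ is the value suggested in the gradient-penalty paper):

```python
import torch

def gradient_penalty(critic, real: torch.Tensor, fake: torch.Tensor, lam: float = 10.0) -> torch.Tensor:
    """WGAN-GP penalty: lam * (||grad_x_hat critic(x_hat)||_2 - 1)^2 on interpolated samples."""
    b = real.size(0)
    eps = torch.rand(b, 1, device=real.device)            # uniform mixing coefficient per sample
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(x_hat)
    grads = torch.autograd.grad(outputs=scores, inputs=x_hat,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True, retain_graph=True)[0]
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Critic loss with gradient penalty (replaces weight clipping):
#   loss_c = -(critic(real).mean() - critic(fake).mean()) + gradient_penalty(critic, real, fake)
```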