In 2014, Ian Goodfellow, then at OpenAI, published his seminal paper, titled “Generative Adversarial Networks”1, detailing how competition between generator and discriminator functions, approximated by neural networks, can train the generator to produce realistic images.

In this article I will be discussing the theory behind this idea, my own implementation in Julia (mirroring the network structure in Goodfellow’s original paper), and show some of the images I was able to get out.


To begin with let’s model our generator as a parameterized function $G(z; \theta)$ that outputs images that are the same size as the data; this function is over a lower dimensional latent space, which we assign a prior distribution $p_z(z)$.

We will also model the discriminator as a parametrized function $D(x; \omega)$ which takes in an image $x$ and outputs a single scalar interpreted as the probability of $x$ being in the training data set.

In practice, $D$ and $G$ are usually both implemented as neural networks, either fully connected multi-layer perceptrons (MLPs) or convolutional and “deconvolutional”.

Deconvolutional layers, also called convolutional transpose layers, “upsample” from a lower dimensional space to a higher dimensional space, and while the term is sometimes used, it implies an inverse to a convolutional layer, which it is not. The alternative name (convolutional transpose) is more apt and more frequently used lately – at least Flux.jl implements a ConvTranspose layer.

the minimax problem

At a high level, in training these networks, we want $D$ to accurately tell a real image from a fake image and we want $G$ to trick $D$ into believing the generated images are real. This results in the minimax optimization problem:

$$ \begin{equation} \min_G \max_D \ \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)} \left[ \log\left(1 - D(G(z)) \right)\right] \end{equation} $$

Here, $1 - D(G(z))$ is the probability that a generated image is fake, which we want to minimize w.r.t. $G$, while simultaneously maximizing $D$’s ability to tell real from fake.

Using the value function in (1), denoted $V(G, D)$, we can iteratively update $G$ and $D$, in turn, using minibatch stochastic gradient descent; see below or Algorithm 1 in [Goodfellow et al.] for details.

global optimality

Ultimately, we want the distribution of generated images to match the data distribution, i.e., $p_g = p_{\text{data}}$.

For any fixed (suboptimal) generator $G$ we want to find $D^*_G(x)$ which maximizes $V$:

$$ \begin{align} V(G, D) &= \int_x p_{\text{data}}(x) \log(D(x)) \ \mathrm{d}x + \int_z p_{z}(z) \log(1 - D(G(z))) \ \mathrm{d}z \nonumber \\\ &= \int_x p_{\text{data}}(x) \log(D(x)) + p_{g}(x) \log(1 - D(x)) \ \mathrm{d}x \end{align} $$

The integrand of (2), which has the form $y \to a \log y + b \log(1 - y)$ with $(a, b) \in \mathbb{R}^2 \backslash \{0,0\}$ with a maximum in $[0,1]$ at $a \over a+b$. So we then have

$$ \begin{equation} D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)} \end{equation} $$

Now, looking back at (1) we can see that we need to minimize

$$ \begin{align} C(G) &= \max_D V(G, D) = V(G, D^*_G) \nonumber \\\ &= \mathbb{E}_{x \sim p_{\text{data}}}[\log D^*_G(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D^*_G(G(z))] \nonumber \\\ &= \mathbb{E}_{x \sim p_{\text{data}}}[\log D^*_G(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D^*_G(x))] \nonumber \\\ &= \mathbb{E}_{x \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)} \right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)} \right] \end{align} $$

With equation (3) we can clearly see that $D^*_G(x) = {1 \over 2}$ when $p_g = p_{\text{data}}$, which implies that at its absolute minimum

$$ C(G^*) = \mathbb{E}_{x \sim p_{\text{data}}}[-\log 2] + \mathbb{E}_{x \sim p_g}[-\log 2] = -\log 4. $$

We now note that the Kullback-Leibler divergence is defined as

$$ \begin{align*} D_{KL}(p \ \Vert \ q) &\equiv \int_x p(x) \log {p(x) \over q(x)} \ \mathrm{d}x \\\ &= \mathbb{E}_{x \sim p(x)} \left[ \log {p(x) \over q(x)} \right] \end{align*} $$

and the Jenson-Shannon divergence, which is symmetric, strictly positive, and $0$ iff the distributions are equal, is defined as

$$ \begin{align*} D_{JS}(p \ \Vert \ q) &\equiv {1 \over 2}D_{KL}\left(p \ \Bigg\Vert \ {p + q \over 2} \right) + {1 \over 2}D_{KL}\left(q \ \Bigg\Vert \ {p + q \over 2} \right) \\\ &= {1 \over 2} \mathbb{E}_{x \sim p(x)} \left[ \log {2 \ p(x) \over p(x) + q(x)} \right] + {1 \over 2} \mathbb{E}_{x \sim q(x)} \left[ \log {2 \ q(x) \over p(x) + q(x)} \right] \\\ &= {1 \over 2} \left( \mathbb{E}_{x \sim p(x)} \left[ \log {p(x) \over p(x) + q(x)} \right] + \mathbb{E}_{x \sim q(x)} \left[ \log {q(x) \over p(x) + q(x)} \right] + \log 4 \right) \end{align*} $$

Thus, using (4), the cost function for the generator $G$, for optimal $D^*_G$, can be written as

$$ \begin{equation} C(G) = -\log 4 + 2 \cdot D_{JS}(p_{\text{data}} \ \Vert \ p_g). \end{equation} $$


The above analysis shows that our loss for $G$ is well defined when $D$ is close to optimality. In order to get the discriminator to be as close to optimal as possible, we can train it in an inner loop with $k$ updates for ever one update to the generator. This procedure schematically goes as follows:

Algorithm minibatch SGD training of generative adversarial nets

for $n$ iterations:

for $k$ inner loops:

  • sample $m$ noise samples $z_i \sim p_g(z)$
  • sample $m$ data samples $x_i \sim p_{\text{data}}(x)$
  • update $D$ to ascend via SGD with $$ \nabla_\omega \ {1 \over m} \sum_{i=1}^m \log D\left(x_i; \ \omega\right) + \log \big( 1 - D(G(z_i; \ \theta); \ \omega)\big) $$
  • sample $m$ noise samples $z_i \sim p_g(z)$
  • update $G$ to descend via SGD with $$ \nabla_\theta \ {1 \over m} \sum_{i=1}^m \log \big( 1 - D(G(z_i; \ \theta); \ \omega)\big) $$

The plots below show what the training loss dynamics looks like for different values of $k$:


We can see that the generator $G$’s loss decays more steeply when $k$ is larger, which is good, but comes at a cost: the computational cost increases steeply. This trade off should be considered when training.


An implementation of the “simple” GAN model described here (plus some more advanced models that aren’t yet implemented at the time of writing this) can be found on github at FluxGAN.jl. The package runs on the gpu (if available) and can be easily installed and used.


Visualizing the training of a GAN can be tricky; I found it helpful to make the following gif which shows the generator’s output for specific points in the latent space during the training process.


The following are randomly sampled outputs from generators trained with different architectures on different data sets, all with $k=1$.


fully connected network (code)


convolutional network (code)

fully connected network (code)

  1. Goodfellow et al. (2014) “Generative Adversarial Networks” arXiv:1406.2661 ↩︎