InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

Executive Summary

1) Purpose and scope

Researchers aimed to create a way for machines to learn meaningful breakdowns of data features without human labels. They focused on images like handwritten digits, faces, and objects. The goal was to improve unsupervised learning, which uses unlabeled data abundant in the real world, for tasks like classification and visualization.

2) Methods overview

They modified Generative Adversarial Networks (GANs)—systems where one network generates images and another distinguishes real from fake. They split the generator's input into random noise and "latent codes" meant to capture key data traits, like digit shape or face pose. To ensure the generator used these codes meaningfully, they maximized mutual information between codes and output images using a computable lower bound and an extra network to guess codes from images. They trained on datasets like MNIST digits, CelebA faces, SVHN house numbers, and 3D-rendered faces and chairs.

3) Key results

On MNIST, one code captured digit identity (0-9) with 95% accuracy; others controlled rotation and stroke width. On 3D faces, codes learned pose, lighting, and face width without labels. On chairs, codes handled rotation and width variations. On SVHN and CelebA, codes separated background digits, glasses presence, hairstyles, and emotions. Representations matched supervised methods but used no labels.

4) Main conclusion

InfoGAN learns interpretable data factors unsupervised, adding little extra computation to GANs.

5) Interpretation of findings

This reduces reliance on costly labels, cutting data preparation time and expense for AI training. It lowers risk in applications like face recognition by capturing natural variations (pose, style) over artificial ones. Performance improves for downstream tasks like object detection, as factors are disentangled and generalizable beyond training ranges. Unlike prior supervised or weakly supervised methods, InfoGAN works fully unsupervised on complex, noisy data— a step toward more robust AI.

6) Recommendations and next steps

Adopt InfoGAN for unsupervised representation learning in image generation projects to save labeling costs. Extend the approach to VAE models or hierarchical codes for broader use. Prioritize applying it to real-world tasks like reinforcement learning policies. Trade-offs: standard GANs are simpler but less interpretable; InfoGAN adds tuning of one hyperparameter (lambda). Run pilots on domain-specific data before full rollout.

7) Limitations and confidence

Assumes stable GAN training; may fail on very high-dimensional data without tweaks. Latent code structure (discrete or continuous) needs choice based on data. High confidence in results on tested datasets, as mutual information bounds were tight and visuals showed clear control; caution on untested domains needing validation.

Xi Chen†‡, Yan Duan†‡, Rein Houthooft†‡, John Schulman†‡, Ilya Sutskever‡, Pieter Abbeel†‡
† UC Berkeley, Department of Electrical Engineering and Computer Sciences
‡ OpenAI

Abstract

This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound of the mutual information objective that can be optimized efficiently. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing supervised methods.

1 Introduction


In this section, unsupervised representation learning seeks disentangled factors from unlabeled data to enable downstream tasks like classification and visualization, as generative models like VAEs and GANs may entangle latent variables despite synthesizing realistic samples. InfoGAN addresses this by extending GANs to maximize mutual information between a structured subset of noise variables, termed latent codes, and generated observations, ensuring codes capture salient semantic features without supervision. Experiments demonstrate InfoGAN disentangles digit shapes from styles on MNIST, poses from lighting on 3D faces, and backgrounds from central digits on SVHN, yielding interpretable representations rivaling supervised methods and suggesting information-regularized generative modeling as a promising path forward.

Unsupervised learning can be described as the general problem of extracting value from unlabelled data which exists in vast quantities. A popular framework for unsupervised learning is that of representation learning [1, 2], whose goal is to use unlabelled data to learn a representation that exposes important semantic features as easily decodable factors. A method that can learn such representations is likely to exist [2], and to be useful for many downstream tasks which include classification, regression, visualization, and policy learning in reinforcement learning.
While unsupervised learning is ill-posed because the relevant downstream tasks are unknown at training time, a disentangled representation, one which explicitly represents the salient attributes of a data instance, should be helpful for the relevant but unknown tasks. For example, for a dataset of faces, a useful disentangled representation may allocate a separate set of dimensions to each of the following attributes: facial expression, eye color, hairstyle, presence or absence of eyeglasses, and the identity of the corresponding person. A disentangled representation is useful for natural tasks that require knowledge of the salient attributes of the data, such as face recognition and object recognition, but not for unnatural supervised tasks whose goal could be, for example, to determine whether the number of red pixels in an image is even or odd. Thus, to be useful, an unsupervised learning algorithm must in effect correctly guess the likely set of downstream classification tasks without being directly exposed to them.
A significant fraction of unsupervised learning research is driven by generative modelling. It is motivated by the belief that the ability to synthesize, or “create” the observed data entails some form of understanding, and it is hoped that a good generative model will automatically learn a disentangled representation, even though it is easy to construct perfect generative models with arbitrarily bad representations. The most prominent generative models are the variational autoencoder (VAE) [3] and the generative adversarial network (GAN) [4].
In this paper, we present a simple modification to the generative adversarial network objective that encourages it to learn interpretable and meaningful representations. We do so by maximizing the mutual information between a fixed small subset of the GAN’s noise variables and the observations, which turns out to be relatively straightforward. Despite its simplicity, we found our method to be surprisingly effective: it was able to discover highly semantic and meaningful hidden representations on a number of image datasets: digits (MNIST), faces (CelebA), and house numbers (SVHN). The quality of our unsupervised disentangled representation matches previous works that made use of supervised label information [5–9]. These results suggest that generative modelling augmented with a mutual information cost could be a fruitful approach for learning disentangled representations.
In the remainder of the paper, we begin with a review of the related work, noting the supervision that is required by previous methods that learn disentangled representations. Then we review GANs, which is the basis of InfoGAN. We describe how maximizing mutual information results in interpretable representations and derive a simple and efficient algorithm for doing so. Finally, in the experiments section, we first compare InfoGAN with prior approaches on relatively clean datasets and then show that InfoGAN can learn interpretable representations on complex datasets where no previous unsupervised approach is known to learn representations of comparable quality.

2 Related Work

There exists a large body of work on unsupervised representation learning. Early methods were based on stacked (often denoising) autoencoders or restricted Boltzmann machines [10–13]. A lot of promising recent work originates from the Skip-gram model [14], which inspired the skip-thought vectors [15] and several techniques for unsupervised feature learning of images [16].
Another intriguing line of work consists of the ladder network [17], which has achieved spectacular results on a semi-supervised variant of the MNIST dataset. More recently, a model based on the VAE has achieved even better semi-supervised results on MNIST [18]. GANs [4] have been used by Radford et al. [19] to learn an image representation that supports basic linear algebra on code space. Lake et al. [20] have been able to learn representations using probabilistic inference over Bayesian programs, which achieved convincing one-shot learning results on the OMNI dataset.
In addition, prior research attempted to learn disentangled representations using supervised data. One class of such methods trains a subset of the representation to match the supplied label using supervised learning: bilinear models [21] separate style and content; the multi-view perceptron [22] separates face identity and viewpoint; and Yang et al. [23] developed a recurrent variant that generates a sequence of latent factor transformations. Similarly, VAEs [5] and Adversarial Autoencoders [9] were shown to learn representations in which class label is separated from other variations.
Recently several weakly supervised methods were developed to remove the need for explicitly labeling variations. disBM [24] is a higher-order Boltzmann machine which learns a disentangled representation by “clamping” a part of the hidden units for a pair of data points that are known to match in all but one factor of variation. DC-IGN [7] extends this “clamping” idea to VAEs and successfully learns graphics codes that can represent pose and light in 3D rendered images. This line of work yields impressive results, but it relies on a supervised grouping of the data that is generally not available. Whitney et al. [8] proposed to alleviate the grouping requirement by learning from consecutive frames of images, using temporal continuity as a supervisory signal.
Unlike the cited prior works that strive to recover disentangled representations, InfoGAN requires no supervision of any kind. To the best of our knowledge, the only other unsupervised method that learns disentangled representations is hossRBM [13], a higher-order extension of the spike-and-slab restricted Boltzmann machine that can disentangle emotion from identity on the Toronto Face Dataset [25]. However, hossRBM can only disentangle discrete latent factors, and its computation cost grows exponentially in the number of factors. InfoGAN disentangles both discrete and continuous latent factors, scales to complicated datasets, and typically requires no more training time than a regular GAN.

3 Background: Generative Adversarial Networks


In this section, Generative Adversarial Networks tackle the challenge of training deep generative models to mimic real data distributions without assigning explicit probabilities to every sample. A generator transforms random noise into synthetic data, adversarially trained against a discriminator that distinguishes real from fake samples in a minimax game where the discriminator maximizes classification accuracy and the generator minimizes detection. For any fixed generator, the optimal discriminator outputs the ratio of real data density to the sum of real and generated densities, yielding a formal objective that balances expected log-probabilities of the discriminator correctly identifying real data and being fooled by generated samples.

Goodfellow et al. [4] introduced the Generative Adversarial Network (GAN), a framework for training deep generative models using a minimax game. The goal is to learn a generator distribution $P_G(x)$ that matches the real data distribution $P_{\text{data}}(x)$. Instead of trying to explicitly assign probability to every $x$ in the data distribution, GAN learns a generator network $G$ that generates samples from the generator distribution $P_G$ by transforming a noise variable $z \sim P_{\text{noise}}(z)$ into a sample $G(z)$. This generator is trained by playing against an adversarial discriminator network $D$ that aims to distinguish between samples from the true data distribution $P_{\text{data}}$ and the generator's distribution $P_G$. For a given generator, the optimal discriminator is $D(x) = P_{\text{data}}(x)/(P_{\text{data}}(x) + P_G(x))$. More formally, the minimax game is given by the following expression:
\min_G \max_D V(D,G) = \mathbb{E}_{x \sim P_{\text{data}}} [\log D(x)] + \mathbb{E}_{z \sim P_{\text{noise}}(z)} [\log (1 - D(G(z)))] \tag{1}
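As a sanity check on the optimal-discriminator formula above, here is a minimal sketch (toy two-point distributions chosen for illustration, not from the paper) that evaluates the value function of Eq. (1) at D*(x) = P_data(x)/(P_data(x) + P_G(x)):

```python
# Toy check of the GAN value function V(D, G) on a discrete domain,
# using the optimal discriminator D*(x) = P_data(x) / (P_data(x) + P_G(x)).
import math

def value(p_data, p_g, d):
    """V(D, G) = E_{x~P_data}[log D(x)] + E_{x~P_G}[log(1 - D(x))]."""
    v = 0.0
    for x in p_data:
        v += p_data[x] * math.log(d[x])
        v += p_g[x] * math.log(1.0 - d[x])
    return v

def optimal_d(p_data, p_g):
    return {x: p_data[x] / (p_data[x] + p_g[x]) for x in p_data}

p_data = {0: 0.5, 1: 0.5}
p_g = dict(p_data)  # generator has matched the data distribution
d_star = optimal_d(p_data, p_g)

print(d_star[0])                   # 0.5
print(value(p_data, p_g, d_star))  # -2 log 2 ≈ -1.3863
```

When the generator matches the data distribution, D* outputs 1/2 everywhere and V(D*, G) = -2 log 2, the global optimum of the game; any other constant discriminator scores strictly lower for this fixed generator.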

4 Mutual Information for Inducing Latent Codes


In this section, standard GANs produce entangled representations because their unstructured noise vector allows arbitrary use without semantic alignment. To induce disentangled latent codes capturing factors like digit identity, rotation, or stroke width, the noise decomposes into incompressible randomness z and structured latent code c with independent priors, input to generator G(z,c). Maximizing mutual information between c and outputs—quantifying uncertainty reduction in c from observing images—ensures c's information persists, avoiding trivial solutions where c is ignored. This regularization forms an augmented minimax game balancing adversarial training with information preservation for interpretable representations.

The GAN formulation uses a simple factored continuous input noise vector $z$, while imposing no restrictions on the manner in which the generator may use this noise. As a result, it is possible that the noise will be used by the generator in a highly entangled way, causing the individual dimensions of $z$ to not correspond to semantic features of the data.
However, many domains naturally decompose into a set of semantically meaningful factors of variation. For instance, when generating images from the MNIST dataset, it would be ideal if the model automatically chose to allocate a discrete random variable to represent the numerical identity of the digit (0-9), and chose to have two additional continuous variables that represent the digit’s angle and thickness of the digit’s stroke. It is the case that these attributes are both independent and salient, and it would be useful if we could recover these concepts without any supervision, by simply specifying that an MNIST digit is generated by an independent 1-of-10 variable and two independent continuous variables.
In this paper, rather than using a single unstructured noise vector, we propose to decompose the input noise vector into two parts: (i) $z$, which is treated as a source of incompressible noise; (ii) $c$, which we will call the latent code and which targets the salient structured semantic features of the data distribution. Mathematically, we denote the set of structured latent variables by $c_1, c_2, \dots, c_L$. In its simplest form, we may assume a factored distribution, given by $P(c_1, c_2, \dots, c_L) = \prod_{i=1}^L P(c_i)$. For ease of notation, we will use the latent code $c$ to denote the concatenation of all latent variables $c_i$.
We now propose a method for discovering these latent factors in an unsupervised way: we provide the generator network with both the incompressible noise $z$ and the latent code $c$, so the form of the generator becomes $G(z, c)$. However, in a standard GAN, the generator is free to ignore the additional latent code $c$ by finding a solution satisfying $P_G(x|c) = P_G(x)$. To cope with the problem of trivial codes, we propose an information-theoretic regularization: there should be high mutual information between the latent codes $c$ and the generator distribution $G(z, c)$. Thus $I(c; G(z, c))$ should be high.
In information theory, the mutual information between $X$ and $Y$, $I(X; Y)$, measures the “amount of information” learned from knowledge of random variable $Y$ about the other random variable $X$. The mutual information can be expressed as the difference of two entropy terms:
I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) \tag{2}
This definition has an intuitive interpretation: $I(X; Y)$ is the reduction of uncertainty in $X$ when $Y$ is observed. If $X$ and $Y$ are independent, then $I(X; Y) = 0$, because knowing one variable reveals nothing about the other; by contrast, if $X$ and $Y$ are related by a deterministic, invertible function, then maximal mutual information is attained. This interpretation makes it easy to formulate a cost: given any $x \sim P_G(x)$, we want $P_G(c|x)$ to have a small entropy. In other words, the information in the latent code $c$ should not be lost in the generation process. Similar mutual-information-inspired objectives have been considered before in the context of clustering [26–28]. Therefore, we propose to solve the following information-regularized minimax game:
\min_G \max_D V_I(D,G) = V(D,G) - \lambda I(c; G(z, c)) \tag{3}
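To make the regularizer concrete, a small sketch (illustrative discrete distributions, assumed for this example only) computes I(X; Y) exactly from a joint table. It uses the form I(X; Y) = Σ p(x,y) log[p(x,y)/(p(x)p(y))], which equals H(X) - H(X|Y) from Eq. (2):

```python
# Exact mutual information for a discrete joint distribution p(x, y),
# computed via the KL form, which is algebraically equal to H(X) - H(X|Y).
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def mutual_information(joint):
    """joint[x][y] = p(x, y); returns I(X; Y) in nats."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    i = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0:
                i += pxy * math.log(pxy / (px[x] * py[y]))
    return i

# Independent X, Y: knowing one variable reveals nothing, so I(X; Y) = 0.
independent = [[0.25, 0.25], [0.25, 0.25]]
# Deterministic invertible map (Y = X): I(X; Y) = H(X) = log 2.
deterministic = [[0.5, 0.0], [0.0, 0.5]]

print(mutual_information(independent))    # 0.0
print(mutual_information(deterministic))  # log 2 ≈ 0.6931
```

The two extremes mirror the interpretation above: zero for independent variables, and the full entropy H(X) when the code is perfectly preserved.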

5 Variational Mutual Information Maximization


In this section, maximizing mutual information between latent codes and generated images proves intractable due to the need for the inaccessible posterior over codes given images. A variational lower bound addresses this by introducing an auxiliary distribution Q to approximate the posterior, yielding a tight, Monte Carlo-approximable objective via a lemma that sidesteps posterior sampling and leverages reparametrization for generator updates. The bound equals mutual information when Q matches the posterior and achieves maximum for discrete codes at code entropy. Thus, InfoGAN is formulated as a minimax game augmenting the standard GAN objective with this scalable regularization, enabling efficient, unsupervised disentangled representation learning.

In practice, the mutual information term $I(c; G(z, c))$ is hard to maximize directly as it requires access to the posterior $P(c|x)$. Fortunately, we can obtain a lower bound of it by defining an auxiliary distribution $Q(c|x)$ to approximate $P(c|x)$:
\begin{aligned} I(c; G(z, c)) &= H(c) - H(c|G(z, c)) \\ &= \mathbb{E}_{x \sim G(z,c)}\big[\mathbb{E}_{c' \sim P(c|x)} [\log P(c'|x)]\big] + H(c) \\ &= \mathbb{E}_{x \sim G(z,c)}\big[\underbrace{D_{KL}(P(\cdot|x) \,\|\, Q(\cdot|x))}_{\geq 0} + \mathbb{E}_{c' \sim P(c|x)} [\log Q(c'|x)]\big] + H(c) \\ &\geq \mathbb{E}_{x \sim G(z,c)}\big[\mathbb{E}_{c' \sim P(c|x)} [\log Q(c'|x)]\big] + H(c) \end{aligned} \tag{4}
This technique of lower bounding mutual information is known as Variational Information Maximization [29]. We note in addition that the entropy of the latent codes, $H(c)$, can be optimized over as well, since for common distributions it has a simple analytical form. However, in this paper we opt for simplicity by fixing the latent code distribution, and we will treat $H(c)$ as a constant. So far we have bypassed the problem of having to compute the posterior $P(c|x)$ explicitly via this lower bound, but we still need to be able to sample from the posterior in the inner expectation. Next we state a simple lemma, with its proof deferred to the Appendix, that removes the need to sample from the posterior.
Lemma 5.1 For random variables $X, Y$ and function $f(x, y)$ under suitable regularity conditions: $\mathbb{E}_{x \sim X, y \sim Y|x} [f(x, y)] = \mathbb{E}_{x \sim X, y \sim Y|x, x' \sim X|y} [f(x', y)]$.
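The identity can be checked numerically on a toy discrete joint (values chosen arbitrarily for illustration, not from the paper): resampling x' from the posterior X|y leaves the expectation unchanged.

```python
# Numerical check of Lemma 5.1 on a small discrete joint p(x, y).
# LHS: E_{x~X, y~Y|x}[f(x, y)].
# RHS: E_{x~X, y~Y|x, x'~X|y}[f(x', y)] -- x' is resampled from X|y.
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}
f = lambda x, y: 3.0 * x + 7.0 * y + x * y   # an arbitrary test function

py = {}                                       # marginal p(y)
for (x, y), p in joint.items():
    py[y] = py.get(y, 0.0) + p

lhs = sum(p * f(x, y) for (x, y), p in joint.items())
rhs = sum(p * (joint[(x2, y)] / py[y]) * f(x2, y)
          for (x, y), p in joint.items()
          for x2 in (0, 1))                   # p(x'|y) = p(x', y) / p(y)
print(lhs, rhs)  # both 5.5: the two expectations agree
```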
By using Lemma 5.1, we can define a variational lower bound, $L_I(G, Q)$, of the mutual information $I(c; G(z, c))$:
\begin{aligned} L_I(G,Q) &= \mathbb{E}_{c \sim P(c),\, x \sim G(z,c)} [\log Q(c|x)] + H(c) \\ &= \mathbb{E}_{x \sim G(z,c)}\big[\mathbb{E}_{c' \sim P(c|x)} [\log Q(c'|x)]\big] + H(c) \\ &\leq I(c; G(z, c)) \end{aligned} \tag{5}
We note that $L_I(G, Q)$ is easy to approximate with Monte Carlo simulation. In particular, $L_I$ can be maximized w.r.t. $Q$ directly and w.r.t. $G$ via the reparametrization trick. Hence $L_I(G, Q)$ can be added to GAN's objectives with no change to GAN's training procedure, and we call the resulting algorithm Information Maximizing Generative Adversarial Networks (InfoGAN).
Eq. (4) shows that the lower bound becomes tight as the auxiliary distribution $Q$ approaches the true posterior distribution: $\mathbb{E}_x [D_{KL}(P(\cdot|x) \,\|\, Q(\cdot|x))] \to 0$. In addition, we know that when the variational lower bound attains its maximum $L_I(G, Q) = H(c)$ for discrete latent codes, the bound becomes tight and the maximal mutual information is achieved. In the Appendix, we note how InfoGAN can be connected to the Wake-Sleep algorithm [30] to provide an alternative interpretation.
Hence, InfoGAN is defined as the following minimax game with a variational regularization of mutual information and a hyperparameter $\lambda$:
\min_{G,Q} \max_D V_{\text{InfoGAN}}(D,G,Q) = V(D,G) - \lambda L_I(G,Q) \tag{6}
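The tightness claim can be verified on a toy discrete pair (c, x) (numbers assumed for illustration, not from the paper): with Q equal to the true posterior, the bound of Eq. (5) equals I(c; x) exactly, while a mismatched Q gives a strictly smaller value.

```python
# Toy demonstration that L_I = E_{c,x}[log Q(c|x)] + H(c) is a lower
# bound on I(c; x), tight exactly when Q is the true posterior P(c|x).
import math

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}  # p(c, x)
pc = {0: 0.5, 1: 0.5}
px = {0: 0.5, 1: 0.5}
post = {(c, x): p / px[x] for (c, x), p in joint.items()}      # P(c|x)

h_c = -sum(p * math.log(p) for p in pc.values())               # H(c)
mi = sum(p * math.log(p / (pc[c] * px[x])) for (c, x), p in joint.items())

def bound(q):
    """L_I with auxiliary distribution q[(c, x)] ≈ Q(c|x)."""
    return sum(p * math.log(q[(c, x)]) for (c, x), p in joint.items()) + h_c

uniform_q = {(c, x): 0.5 for (c, x) in joint}
print(bound(post), mi)   # equal: the bound is tight at Q = P(c|x)
print(bound(uniform_q))  # 0.0, strictly below mi for a mismatched Q
```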

6 Implementation


In this section, the practical implementation of InfoGAN's auxiliary distribution Q addresses the challenge of efficiently maximizing mutual information without disrupting GAN training. Q is parametrized as a neural network sharing convolutional layers with the discriminator and adding only a final fully connected layer for conditional distribution outputs—softmax for categorical codes and factored Gaussians for continuous ones—yielding negligible extra computation. The mutual information lower bound converges faster than the core GAN losses, so it comes effectively at no cost, and lambda is simple to tune: 1 for discrete codes, smaller for continuous codes so the term's scale (which involves differential entropy) matches the GAN objectives. DCGAN techniques suffice for stable training, requiring no novel adjustments.

In practice, we parametrize the auxiliary distribution $Q$ as a neural network. In most experiments, $Q$ and $D$ share all convolutional layers and there is one final fully connected layer to output parameters for the conditional distribution $Q(c|x)$, which means InfoGAN only adds a negligible computation cost to GAN. We have also observed that $L_I(G, Q)$ always converges faster than the normal GAN objectives and hence InfoGAN essentially comes for free with GAN.
For a categorical latent code $c_i$, we use the natural choice of a softmax nonlinearity to represent $Q(c_i|x)$. For a continuous latent code $c_j$, there are more options depending on the true posterior $P(c_j|x)$. In our experiments, we have found that simply treating $Q(c_j|x)$ as a factored Gaussian is sufficient.
Even though InfoGAN introduces an extra hyperparameter $\lambda$, it is easy to tune: simply setting it to 1 is sufficient for discrete latent codes. When the latent code contains continuous variables, a smaller $\lambda$ is typically used to ensure that $\lambda L_I(G, Q)$, which now involves differential entropy, is on the same scale as the GAN objectives.
Since GAN is known to be difficult to train, we design our experiments based on existing techniques introduced by DC-GAN [19], which are sufficient to stabilize InfoGAN training; we did not have to introduce new tricks. The detailed experimental setup is described in the Appendix.
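To illustrate why Q adds negligible cost, here is a hypothetical numpy stand-in for the architecture (dense layers in place of the paper's actual convolutions; all layer sizes are assumptions): D and Q share the feature extractor, and each adds only one small head — softmax for a categorical code, means of a factored Gaussian for continuous codes.

```python
# Hypothetical sketch: D and Q sharing a feature extractor, each with its
# own small output head. Dense layers stand in for the real conv stack.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))               # a batch of 4 "images", flattened

W_shared = rng.normal(size=(64, 32)) * 0.1 # shared body (conv stack stand-in)
W_d = rng.normal(size=(32, 1)) * 0.1       # discriminator head
W_q_cat = rng.normal(size=(32, 10)) * 0.1  # Q head for a 10-way categorical code
W_q_mu = rng.normal(size=(32, 2)) * 0.1    # Q head: means of 2 continuous codes

h = np.maximum(x @ W_shared, 0.0)          # shared features (ReLU)

d_out = 1.0 / (1.0 + np.exp(-(h @ W_d)))   # sigmoid: P(real | x)
logits = h @ W_q_cat
q_cat = np.exp(logits - logits.max(axis=1, keepdims=True))
q_cat /= q_cat.sum(axis=1, keepdims=True)  # softmax: Q(c_i | x)
q_mu = h @ W_q_mu                          # factored-Gaussian means for c_j

print(d_out.shape, q_cat.shape, q_mu.shape)
```

The only parameters Q adds beyond the shared body are its final heads, which is why the extra computation over a plain GAN discriminator is negligible.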

7 Experiments


In this section, experiments test whether InfoGAN efficiently maximizes mutual information to yield disentangled, interpretable representations on image datasets by visualizing single latent factor traversals. On MNIST, the information lower bound rapidly reaches its entropy maximum for categorical codes, surpassing regular GANs lacking such incentives. Traversals uncover semantic controls: digit identity, rotation, and width on MNIST; pose, lighting, and a novel face-width factor on 3D faces; chair rotation and type interpolation; house number styles on noisy SVHN; and azimuth, glasses, hairstyles, and emotion on cluttered CelebA. These unsupervised results match supervised benchmarks, demonstrating InfoGAN's robustness for discovering visual concepts.

The first goal of our experiments is to investigate whether mutual information can be maximized efficiently. The second goal is to evaluate whether InfoGAN can learn disentangled and interpretable representations: we use the generator to vary one latent factor at a time and assess whether varying that factor results in only one type of semantic variation in the generated images. DC-IGN [7] also uses this method to evaluate its learned representations on 3D image datasets, to which we also apply InfoGAN to establish a direct comparison.

7.1 Mutual Information Maximization

To evaluate whether the mutual information between the latent codes $c$ and the generated images $G(z, c)$ can be maximized efficiently with the proposed method, we train InfoGAN on the MNIST dataset with a uniform categorical distribution on the latent code, $c \sim \text{Cat}(K=10, p=0.1)$. In Fig. 1, the lower bound $L_I(G, Q)$ is quickly maximized to $H(c) \approx 2.30$, which means the bound (4) is tight and maximal mutual information is achieved.
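The target value H(c) ≈ 2.30 can be checked directly; assuming natural logarithms (nats), the entropy of a uniform 10-way categorical code is log 10:

```python
# Entropy of c ~ Cat(K=10, p=0.1): H(c) = -sum_k 0.1 * log 0.1 = log 10 nats,
# the ceiling that L_I(G, Q) is driven to during training.
import math

h_c = -sum(0.1 * math.log(0.1) for _ in range(10))
print(round(h_c, 2))  # 2.3
```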
As a baseline, we also train a regular GAN with an auxiliary distribution $Q$ when the generator is not explicitly encouraged to maximize the mutual information with the latent codes. Since we use an expressive neural network to parametrize $Q$, we can assume that $Q$ reasonably approximates the true posterior $P(c|x)$ and hence that there is little mutual information between the latent codes and the generated images in a regular GAN. We note that with a different neural network architecture there might be higher mutual information between the latent codes and the generated images, though we have not observed such a case in our experiments. This comparison is meant to demonstrate that in a regular GAN there is no guarantee that the generator will make use of the latent codes.

7.2 Disentangled Representation

To disentangle digit shape from styles on MNIST, we choose to model the latent codes with one categorical code, $c_1 \sim \text{Cat}(K=10, p=0.1)$, which can model discontinuous variation in data, and two continuous codes that can capture variations that are continuous in nature: $c_2, c_3 \sim \text{Unif}(-1, 1)$.
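A sketch of how such a latent input might be sampled (the 62-dimensional noise vector and the concatenation order are assumptions for illustration, not specified here):

```python
# Drawing the MNIST latent input: one 10-way categorical code c1 (one-hot),
# two continuous codes c2, c3 ~ Unif(-1, 1), plus incompressible noise z,
# concatenated as the generator input G(z, c).
import numpy as np

rng = np.random.default_rng(0)
batch, noise_dim = 8, 62                           # 62 noise dims is an assumption

c1 = np.eye(10)[rng.integers(0, 10, size=batch)]   # one-hot categorical code
c23 = rng.uniform(-1.0, 1.0, size=(batch, 2))      # rotation / width codes
z = rng.normal(size=(batch, noise_dim))            # incompressible noise
gen_input = np.concatenate([z, c1, c23], axis=1)   # fed to G(z, c)

print(gen_input.shape)  # (8, 74)
```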
In Figure 2, we show that the discrete code $c_1$ captures drastic changes in shape. Changing the categorical code $c_1$ switches between digits most of the time. In fact, even though we train InfoGAN without any labels, $c_1$ can be used as a classifier that achieves a 5% error rate in classifying MNIST digits by matching each category in $c_1$ to a digit type. In the second row of Figure 2a, we can observe that a digit 7 is classified as a 9.
Continuous codes $c_2, c_3$ capture continuous variations in style: $c_2$ models the rotation of digits and $c_3$ controls the width. What is remarkable is that in both cases the generator does not simply stretch or rotate the digits but instead adjusts other details like thickness or stroke style to make sure the resulting images are natural looking. As a test of whether the latent representation learned by InfoGAN is generalizable, we manipulated the latent codes in an exaggerated way: instead of plotting latent codes from $-1$ to $1$, we plot them from $-2$ to $2$, covering a wide region that the network was never trained on, and we still get meaningful generalization.
Next we evaluate InfoGAN on two datasets of 3D images: faces [31] and chairs [32], on which DC-IGN was shown to learn highly interpretable graphics codes.
On the faces dataset, DC-IGN learns to represent the latent factors of azimuth (pose), elevation, and lighting as continuous latent variables by using supervision. Using the same dataset, we demonstrate that InfoGAN learns a disentangled representation that recovers azimuth (pose), elevation, and lighting. In this experiment, we choose to model the latent codes with five continuous codes, $c_i \sim \text{Unif}(-1, 1)$ with $1 \leq i \leq 5$.
Since DC-IGN requires supervision, it was previously not possible to learn a latent code for a variation that is unlabeled, and hence salient latent factors of variation could not be discovered automatically from data. By contrast, InfoGAN is able to discover such variation on its own: for instance, in Figure 3d a latent code that smoothly changes a face from wide to narrow is learned, even though this variation was neither explicitly generated nor labeled in prior work.
On the chairs dataset, DC-IGN can learn a continuous code that represents rotation. InfoGAN again is able to learn the same concept as a continuous code (Figure 4a), and we show in addition that InfoGAN is also able to continuously interpolate between similar chair types of different widths using a single continuous code (Figure 4b). In this experiment, we choose to model the latent factors with four categorical codes, $c_1, c_2, c_3, c_4 \sim \text{Cat}(K=20, p=0.05)$, and one continuous code $c_5 \sim \text{Unif}(-1, 1)$.
Next we evaluate InfoGAN on the Street View House Number (SVHN) dataset, on which it is significantly more challenging to learn an interpretable representation because the data is noisy, containing images of variable resolution with distracting digits, and does not have multiple variations of the same object. In this experiment, we make use of four 10-dimensional categorical variables and two uniform continuous variables as latent codes. We show two of the learned latent factors in Figure 5.
Finally, we show in Figure 6 that InfoGAN is able to learn many visual concepts on another challenging dataset: CelebA [33], which includes 200,000 celebrity images with large pose variations and background clutter. In this dataset, we model the latent variation as 10 uniform categorical variables, each of dimension 10. Surprisingly, even on this complicated dataset, InfoGAN can recover azimuth as in the 3D images, even though no single face appears in multiple poses in this dataset. Moreover, InfoGAN can disentangle other highly semantic variations like the presence or absence of glasses, hairstyles, and emotion, demonstrating that a level of visual understanding is acquired without any supervision.

8 Conclusion


This paper introduces a representation learning algorithm called Information Maximizing Generative Adversarial Networks (InfoGAN). In contrast to previous approaches, which require supervision, InfoGAN is completely unsupervised and learns interpretable and disentangled representations on challenging datasets. In addition, InfoGAN adds only negligible computation cost on top of GANs and is easy to train. The core idea of using mutual information to induce representations can be applied to other methods such as VAEs [3], which is a promising direction for future work. Other possible extensions of this work include learning hierarchical latent representations, improving semi-supervised learning with better codes [34], and using InfoGAN as a high-dimensional data discovery tool.

References


[1] Y. Bengio, “Learning deep architectures for ai,” Foundations and trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[2] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 8, pp. 1798–1828, 2013.
[3] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” ArXiv preprint arXiv:1312.6114, 2013.
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014, pp. 2672–2680.
[5] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, “Semi-supervised learning with deep generative models,” in NIPS, 2014, pp. 3581–3589.
[6] B. Cheung, J. A. Livezey, A. K. Bansal, and B. A. Olshausen, “Discovering hidden factors of variation in deep networks,” ArXiv preprint arXiv:1412.6583, 2014.
[7] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum, “Deep convolutional inverse graphics network,” in NIPS, 2015, pp. 2530–2538.
[8] W. F. Whitney, M. Chang, T. Kulkarni, and J. B. Tenenbaum, “Understanding visual concepts with continuation learning,” ArXiv preprint arXiv:1602.06822, 2016.
[9] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow, “Adversarial autoencoders,” ArXiv preprint arXiv:1511.05644, 2015.
[10] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
[11] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
[12] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in ICLR, 2008, pp. 1096–1103.
[13] G. Desjardins, A. Courville, and Y. Bengio, “Disentangling factors of variation via generative entangling,” ArXiv preprint arXiv:1210.5474, 2012.
[14] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” ArXiv preprint arXiv:1301.3781, 2013.
[15] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler, “Skip-thought vectors,” in NIPS, 2015, pp. 3276–3284.
[16] C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context prediction,” in ICCV, 2015, pp. 1422–1430.
[17] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, “Semi-supervised learning with ladder networks,” in NIPS, 2015, pp. 3532–3540.
[18] L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther, “Improving semi-supervised learning with auxiliary deep generative models,” in NIPS Workshop on Advances in Approximate Bayesian Inference, 2015.
[19] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” ArXiv preprint arXiv:1511.06434, 2015.
[20] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum, “Human-level concept learning through probabilistic program induction,” Science, vol. 350, no. 6266, pp. 1332–1338, 2015.
[21] J. B. Tenenbaum and W. T. Freeman, “Separating style and content with bilinear models,” Neural computation, vol. 12, no. 6, pp. 1247–1283, 2000.
[22] Z. Zhu, P. Luo, X. Wang, and X. Tang, “Multi-view perceptron: A deep model for learning face identity and view representations,” in NIPS, 2014, pp. 217–225.
[23] J. Yang, S. E. Reed, M.-H. Yang, and H. Lee, “Weakly-supervised disentangling with recurrent transformations for 3d view synthesis,” in NIPS, 2015, pp. 1099–1107.
[24] S. Reed, K. Sohn, Y. Zhang, and H. Lee, “Learning to disentangle factors of variation with manifold interaction,” in ICML, 2014, pp. 1431–1439.
[25] J. Susskind, A. Anderson, and G. E. Hinton, “The Toronto face dataset,” Tech. Rep., 2010.
[26] J. S. Bridle, A. J. Heading, and D. J. MacKay, “Unsupervised classifiers, mutual information and ’phantom targets’,” in NIPS, 1992.
[27] D. Barber and F. V. Agakov, “Kernelized infomax clustering,” in NIPS, 2005, pp. 17–24.
[28] A. Krause, P. Perona, and R. G. Gomes, “Discriminative clustering by regularized information maximization,” in NIPS, 2010, pp. 775–783.
[29] D. Barber and F. V. Agakov, “The IM algorithm: A variational approach to information maximization,” in NIPS, 2003.
[30] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal, “The “wake-sleep” algorithm for unsupervised neural networks,” Science, vol. 268, no. 5214, pp. 1158–1161, 1995.
[31] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter, “A 3d face model for pose and illumination invariant face recognition,” in AVSS, 2009, pp. 296–301.
[32] M. Aubry, D. Maturana, A. Efros, B. Russell, and J. Sivic, “Seeing 3D chairs: Exemplar part-based 2D-3D alignment using a large dataset of CAD models,” in CVPR, 2014, pp. 3762–3769.
[33] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in ICCV, 2015.
[34] J. T. Springenberg, “Unsupervised and semi-supervised learning with categorical generative adversarial networks,” ArXiv preprint arXiv:1511.06390, 2015.

A Proof of Lemma 5.1


Lemma A.1 For random variables $X, Y$ and function $f(x, y)$ under suitable regularity conditions: $\mathbb{E}_{x \sim X, y \sim Y|x}[f(x, y)] = \mathbb{E}_{x \sim X, y \sim Y|x, x' \sim X|y}[f(x', y)]$.
Proof
\begin{aligned} \mathbb{E}_{x \sim X, y \sim Y|x} [f(x, y)] &= \int_x P(x) \int_y P(y|x) f(x, y) \, dy \, dx \\ &= \int_x \int_y P(x, y) f(x, y) \, dy \, dx \\ &= \int_x \int_y P(x, y) f(x, y) \int_{x'} P(x'|y) \, dx' \, dy \, dx \\ &= \int_x P(x) \int_y P(y|x) \int_{x'} P(x'|y) f(x', y) \, dx' \, dy \, dx \\ &= \mathbb{E}_{x \sim X, y \sim Y|x, x' \sim X|y} [f(x', y)] \end{aligned} \tag{7}
Here the third equality inserts $\int_{x'} P(x'|y) \, dx' = 1$, and the fourth follows by exchanging the roles of $x$ and $x'$, both of which are integrated against the same conditional distribution given $y$. $\square$
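As a sanity check, the identity can be verified numerically on a small finite joint distribution, where the integrals become sums. The joint distribution and the function $f$ below are arbitrary illustrative choices, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# A random joint distribution P(x, y) over small finite supports.
P = rng.random((3, 4))
P /= P.sum()

f = rng.random((3, 4))          # arbitrary test function f(x, y)

Py = P.sum(axis=0)              # marginal P(y)
Px_given_y = P / Py             # P(x'|y); each column sums to 1

# LHS: E_{x~X, y~Y|x}[f(x, y)] = sum_{x,y} P(x, y) f(x, y)
lhs = (P * f).sum()

# RHS: E_{x~X, y~Y|x, x'~X|y}[f(x', y)]
#    = sum_{x,y} P(x, y) sum_{x'} P(x'|y) f(x', y)
rhs = sum(P[x, y] * sum(Px_given_y[xp, y] * f[xp, y] for xp in range(3))
          for x in range(3) for y in range(4))

print(abs(lhs - rhs) < 1e-9)  # True
```

The two expectations agree exactly (up to floating-point error), mirroring the proof: summing $P(x, y)$ over $x$ leaves $P(y)$, after which $x'$ plays the role $x$ played on the left-hand side.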

B Interpretation as “Sleep-Sleep” Algorithm


We note that InfoGAN can be viewed as a Helmholtz machine [1]: $P_G(x|c)$ is the generative distribution and $Q(c|x)$ is the recognition distribution. The Wake-Sleep algorithm [2] was proposed to train Helmholtz machines by alternating “wake” phase and “sleep” phase updates.
The “wake” phase update proceeds by optimizing the variational lower bound of $\log P_G(x)$ w.r.t. the generator:
\max_G \mathbb{E}_{x \sim \text{Data},\, c \sim Q(c|x)} [\log P_G(x|c)] \tag{8}
The “sleep” phase updates the auxiliary distribution $Q$ by “dreaming” up samples from the current generator distribution rather than drawing from the real data distribution:
\max_Q \mathbb{E}_{c \sim P(c),\, x \sim P_G(x|c)} [\log Q(c|x)] \tag{9}
Hence, when we optimize the surrogate loss $L_I$ w.r.t. $Q$, the update step is exactly the “sleep” phase update of the Wake-Sleep algorithm. InfoGAN differs from Wake-Sleep when we optimize $L_I$ w.r.t. $G$, which encourages the generator network $G$ to make use of latent codes $c$ over the whole prior distribution $P(c)$. Since InfoGAN also updates the generator in the “sleep” phase, our method can be interpreted as a “Sleep-Sleep” algorithm. This interpretation highlights InfoGAN's difference from previous generative modeling techniques: the generator is explicitly encouraged to convey information in its latent codes. It also suggests that the same principle can be applied to other generative models.
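To make the “sleep” phase of Eq. (9) concrete, here is a minimal NumPy sketch under toy assumptions: a hypothetical linear-template “generator” stands in for $G$, the prior $P(c)$ is a uniform categorical, and $Q(c|x)$ is a softmax regression fit by gradient ascent on the dreamed $(c, x)$ pairs. None of these stand-ins are the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Sleep" phase: dream up (c, x) pairs from the prior and the current
# generator, then fit Q(c|x) by maximizing the mean log Q(c|x).
K, D, N = 4, 8, 2000
templates = rng.normal(size=(K, D))                 # toy stand-in for G's map c -> x

c = rng.integers(K, size=N)                         # c ~ P(c), uniform categorical
x = templates[c] + 0.1 * rng.normal(size=(N, D))    # x ~ P_G(x|c)

# Q(c|x): softmax regression trained by gradient ascent on mean log Q(c|x).
W = np.zeros((D, K))
onehot = np.eye(K)[c]
for _ in range(200):
    logits = x @ W
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    W += 0.5 * x.T @ (onehot - probs) / N           # gradient of mean log-likelihood

acc = (np.argmax(x @ W, axis=1) == c).mean()
print(acc > 0.9)  # True: Q recovers the dreamed codes from generated samples
```

Because the samples are drawn from the generator rather than from real data, this update only ever sees $P_G$, which is exactly what distinguishes the sleep phase (and InfoGAN's $L_I$ update) from the wake phase.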

C Experiment Setup


For all experiments, we use Adam [3] for online optimization and apply batch normalization [4] after most layers; the details are specified for each experiment. We use an up-convolutional architecture for the generator networks [5]. We use leaky rectified linear units (lRELU) [6] with leak rate 0.1 as the nonlinearity applied to hidden layers of the discriminator networks, and regular rectified linear units (RELU) for the generator networks. Unless noted otherwise, the learning rate is 2e-4 for $D$ and 1e-3 for $G$, and $\lambda$ is set to 1.
For discrete latent codes, we apply a softmax nonlinearity over the corresponding units in the recognition network output. For continuous latent codes, we parameterize the approximate posterior through a diagonal Gaussian distribution, and the recognition network outputs its mean and standard deviation, where the standard deviation is parameterized through an exponential transformation of the network output to ensure positivity.
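A minimal NumPy sketch of this output parameterization follows. The layout of the recognition head's raw output (10 categorical logits followed by interleaved mean/log-std pairs) is our illustrative assumption, not something specified by the paper:

```python
import numpy as np

def q_log_prob(raw, c_disc, c_cont, cat_dim=10):
    """Split a recognition-network output into a softmax over a discrete code
    and a diagonal Gaussian over continuous codes; return log Q(c|x)."""
    logits = raw[:cat_dim]
    rest = raw[cat_dim:]
    mean, log_std = rest[0::2], rest[1::2]
    std = np.exp(log_std)                    # exponential transform ensures positivity

    # log-softmax for the discrete code (numerically stable form)
    m = logits.max()
    log_softmax = logits - m - np.log(np.sum(np.exp(logits - m)))
    log_q_disc = log_softmax[c_disc]

    # diagonal Gaussian log-density for the continuous codes
    log_q_cont = np.sum(-0.5 * np.log(2 * np.pi) - log_std
                        - 0.5 * ((c_cont - mean) / std) ** 2)
    return log_q_disc + log_q_cont

raw = np.zeros(14)      # 10 logits + 2 (mean, log-std) pairs, all zero
lp = q_log_prob(raw, c_disc=3, c_cont=np.array([0.0, 0.0]))
# uniform softmax gives log(1/10); two standard normals at 0 give -log(2*pi)
print(np.isclose(lp, np.log(0.1) - np.log(2 * np.pi)))  # True
```

In training, the negative of this quantity (averaged over generated samples) would serve as the variational term of the mutual-information objective.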
The details for each set of experiments are presented below.

C.1 MNIST

The network architectures are shown in Table 1. The discriminator $D$ and the recognition network $Q$ share most of the network. For this task, we use 1 ten-dimensional categorical code, 2 continuous latent codes, and 62 noise variables, resulting in a concatenated dimension of 74.
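The 74-dimensional generator input described above can be sampled as in the following NumPy sketch. Uniform priors for the noise and continuous codes are consistent with the paper's setup, but `sample_latent` and its concatenation order are illustrative, not the released code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(batch, cat_dim=10, n_cont=2, noise_dim=62):
    """Sample the MNIST generator input: a one-hot categorical code,
    continuous codes ~ Unif(-1, 1), and incompressible noise."""
    c_disc = np.eye(cat_dim)[rng.integers(cat_dim, size=batch)]  # (batch, 10)
    c_cont = rng.uniform(-1, 1, size=(batch, n_cont))            # (batch, 2)
    z = rng.uniform(-1, 1, size=(batch, noise_dim))              # (batch, 62)
    return np.concatenate([c_disc, c_cont, z], axis=1)

x = sample_latent(batch=8)
print(x.shape)  # (8, 74) — matches the concatenated dimension of 74
```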

Table 1: The discriminator and generator CNNs used for the MNIST dataset.

| discriminator $D$ / recognition network $Q$ | generator $G$ |
|---|---|
| Input $28 \times 28$ Gray image | Input $\in \mathbb{R}^{74}$ |
| $4 \times 4$ conv. 64 lRELU. stride 2 | FC. 1024 RELU. batchnorm |
| $4 \times 4$ conv. 128 lRELU. stride 2. batchnorm | FC. $7 \times 7 \times 128$ RELU. batchnorm |
| FC. 1024 lRELU. batchnorm | $4 \times 4$ upconv. 64 RELU. stride 2. batchnorm |
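The spatial dimensions in the table are consistent with standard stride-2 convolutions. A quick check, assuming kernel 4, stride 2, and padding 1 (the padding is our assumption; the table does not state it):

```python
def conv_out(size, k=4, s=2, p=1):
    """Output spatial size of a standard convolution."""
    return (size + 2 * p - k) // s + 1

s = 28
for _ in range(2):     # the two stride-2 convolutions in the MNIST discriminator
    s = conv_out(s)
print(s)  # 7 — consistent with the FC layer fed from a 7 x 7 x 128 feature map
```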

C.2 SVHN

The network architectures are shown in Table 2. The discriminator $D$ and the recognition network $Q$ share most of the network. For this task, we use 4 ten-dimensional categorical codes, 4 continuous latent codes, and 124 noise variables, resulting in a concatenated dimension of 168.

Table 2: The discriminator and generator CNNs used for the SVHN dataset.

| discriminator $D$ / recognition network $Q$ | generator $G$ |
|---|---|
| Input $32 \times 32$ Color image | Input $\in \mathbb{R}^{168}$ |
| $4 \times 4$ conv. 64 lRELU. stride 2 | FC. $2 \times 2 \times 448$ RELU. batchnorm |
| $4 \times 4$ conv. 128 lRELU. stride 2. batchnorm | $4 \times 4$ upconv. 256 RELU. stride 2. batchnorm |
| $4 \times 4$ conv. 256 lRELU. stride 2. batchnorm | $4 \times 4$ upconv. 128 RELU. stride 2. |

C.3 CelebA

The network architectures are shown in Table 3. The discriminator $D$ and the recognition network $Q$ share most of the network. For this task, we use 10 ten-dimensional categorical codes and 128 noise variables, resulting in a concatenated dimension of 228.

Table 3: The discriminator and generator CNNs used for the CelebA dataset.

| discriminator $D$ / recognition network $Q$ | generator $G$ |
|---|---|
| Input $32 \times 32$ Color image | Input $\in \mathbb{R}^{228}$ |
| $4 \times 4$ conv. 64 lRELU. stride 2 | FC. $2 \times 2 \times 448$ RELU. batchnorm |
| $4 \times 4$ conv. 128 lRELU. stride 2. batchnorm | $4 \times 4$ upconv. 256 RELU. stride 2. batchnorm |
| $4 \times 4$ conv. 256 lRELU. stride 2. batchnorm | $4 \times 4$ upconv. 128 RELU. stride 2. |

C.4 Faces

The network architectures are shown in Table 4. The discriminator $D$ and the recognition network $Q$ share the same network and only have separate output units at the last layer. For this task, we use 5 continuous latent codes and 128 noise variables, so the input to the generator has dimension 133.
We used separate configurations for each learned variation, shown in Table 5.

Table 4: The discriminator and generator CNNs used for the Faces dataset.

| discriminator $D$ / recognition network $Q$ | generator $G$ |
|---|---|
| Input $32 \times 32$ Gray image | Input $\in \mathbb{R}^{133}$ |
| $4 \times 4$ conv. 64 lRELU. stride 2 | FC. 1024 RELU. batchnorm |
| $4 \times 4$ conv. 128 lRELU. stride 2. batchnorm | FC. $8 \times 8 \times 128$ RELU. batchnorm |
| FC. 1024 lRELU. batchnorm | $4 \times 4$ upconv. 64 RELU. stride 2. batchnorm |
| FC. output layer | $4 \times 4$ upconv. 1 sigmoid. |

Table 5: The hyperparameters for the Faces dataset.

| | Learning rate for $D$ / $Q$ | Learning rate for $G$ | $\lambda$ |
|---|---|---|---|
| Azimuth (pose) | 2e-4 | 5e-4 | 0.2 |
| Elevation | 4e-4 | 3e-4 | 0.1 |
| Lighting | 8e-4 | 3e-4 | 0.1 |
| Wide or Narrow | learned using the same network as the lighting variation | | |

C.5 Chairs

The network architectures are shown in Table 6. The discriminator $D$ and the recognition network $Q$ share the same network and only have separate output units at the last layer. For this task, we use 1 continuous latent code, 3 discrete latent codes (each with dimension 20), and 128 noise variables, so the input to the generator has dimension 189.

Table 6: The discriminator and generator CNNs used for the Chairs dataset.

| discriminator $D$ / recognition network $Q$ | generator $G$ |
|---|---|
| Input $64 \times 64$ Gray image | Input $\in \mathbb{R}^{189}$ |
| $4 \times 4$ conv. 64 lRELU. stride 2 | FC. 1024 RELU. batchnorm |
| $4 \times 4$ conv. 128 lRELU. stride 2. batchnorm | FC. $8 \times 8 \times 256$ RELU. batchnorm |
| $4 \times 4$ conv. 256 lRELU. stride 2. batchnorm | $4 \times 4$ upconv. 256 RELU. batchnorm |
We used separate configurations for each learned variation, shown in Table 7. For this task, we found it necessary to use different regularization coefficients for the continuous and discrete latent codes.

Table 7: The hyperparameters for the Chairs dataset.

| | Learning rate for $D$ / $Q$ | Learning rate for $G$ | $\lambda_{\text{cont}}$ | $\lambda_{\text{disc}}$ |
|---|---|---|---|---|
| Rotation | 2e-4 | 1e-3 | 10.0 | 1.0 |
| Width | 2e-4 | 1e-3 | 0.05 | 2.0 |

References

[1] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel, “The helmholtz machine,” Neural computation, vol. 7, no. 5, pp. 889–904, 1995.
[2] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal, “The “wake-sleep” algorithm for unsupervised neural networks,” Science, vol. 268, no. 5214, pp. 1158–1161, 1995.
[3] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” ArXiv preprint arXiv:1412.6980, 2014.
[4] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” ArXiv preprint arXiv:1502.03167, 2015.
[5] A. Dosovitskiy, J. Tobias Springenberg, and T. Brox, “Learning to generate chairs with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1538–1546.
[6] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. ICML, vol. 30, 2013, p. 1.