What are Generative Adversarial Networks (GANs)?

A Generative Adversarial Network (GAN) is a deep learning architecture where two neural networks, a generator and a discriminator, are trained simultaneously in competition. The generator creates synthetic data; the discriminator tries to distinguish real from fake. This adversarial process pushes the generator to produce increasingly realistic outputs. GANs are used for image synthesis, data augmentation, and generating synthetic training data.

Why do GANs matter for ML practitioners?

GANs are relevant to applied ML in several ways:

Data augmentation: generate synthetic training examples when labelled data is scarce
Synthetic data for privacy: generate realistic but non-identifiable data for testing pipelines with sensitive content
Image-to-image translation: transform document scans, improve image quality, or augment visual datasets
Embedding space analysis: GANs have been used to explore and interpolate in the latent spaces of encoder models

For document processing pipelines, GANs can augment training data for OCR, layout detection, and visual document understanding models.

How does a GAN work?

A GAN has two components trained against each other:

Generator (G) takes random noise as input and produces synthetic samples (images, text, vectors). Its goal: fool the discriminator.

Discriminator (D) takes a sample (real or generated) and outputs a probability that it’s real. Its goal: correctly identify fakes.

Training alternates:

Train D to distinguish real from generated samples
Train G to produce samples that fool D

The generator never sees real data directly; it only receives feedback through the discriminator’s gradients.

Noise z → [Generator] → Fake sample
                              ↓
Real sample → [Discriminator] → Real / Fake?
                              ↓
                    Gradients back to Generator
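
The alternating loop above can be sketched on a deliberately tiny problem. This is an illustrative toy, not a practical GAN: it assumes 1-D real data drawn from a Gaussian, a generator that only learns a shift (G(z) = z + theta), a logistic discriminator D(x) = sigmoid(w*x + c), hand-derived gradients, and the commonly used non-saturating generator loss -log D(G(z)). The function name and all hyperparameters are assumptions made for this example.

```python
import math
import random


def sigmoid(s):
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(real_mean=2.0, steps=5000, batch=64, lr=0.02, seed=0):
    """Alternate D and G updates on a 1-D toy problem (illustrative only)."""
    rng = random.Random(seed)
    theta = 0.0      # generator parameter: G(z) = z + theta
    w, c = 0.0, 0.0  # discriminator parameters: D(x) = sigmoid(w*x + c)

    for _ in range(steps):
        real = [rng.gauss(real_mean, 1.0) for _ in range(batch)]
        fake = [rng.gauss(0.0, 1.0) + theta for _ in range(batch)]

        # Step 1: train D on the loss -log D(real) - log(1 - D(fake)).
        grad_w = grad_c = 0.0
        for x in real:
            g = sigmoid(w * x + c) - 1.0  # d(-log D)/d(logit)
            grad_w += g * x
            grad_c += g
        for x in fake:
            g = sigmoid(w * x + c)        # d(-log(1 - D))/d(logit)
            grad_w += g * x
            grad_c += g
        w -= lr * grad_w / batch
        c -= lr * grad_c / batch

        # Step 2: train G on -log D(G(z)). The generator never touches
        # `real`; its gradient arrives only through D's parameters.
        fake = [rng.gauss(0.0, 1.0) + theta for _ in range(batch)]
        grad_theta = sum((sigmoid(w * x + c) - 1.0) * w for x in fake) / batch
        theta -= lr * grad_theta

    return theta, w, c
```

Calling train_toy_gan() returns the learned shift and the discriminator parameters. In this setup the only equilibrium has theta equal to real_mean, with the discriminator reduced to outputting 0.5 everywhere. Real GANs replace both closed-form models with neural networks and rely on a framework's automatic differentiation.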

What are the main GAN architectures?

Architecture	Key innovation	Best for
DCGAN	Convolutional layers in G and D	Image generation
Conditional GAN (cGAN)	Condition on class label	Class-specific generation
Pix2Pix	Paired image-to-image translation	Document enhancement, style transfer
CycleGAN	Unpaired image-to-image	Domain adaptation
StyleGAN	High-fidelity, controllable	Photorealistic face/image synthesis
WGAN	Wasserstein loss, more stable training	General improvement over vanilla GAN
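
To make the WGAN row concrete, the sketch below contrasts the two discriminator objectives. It assumes the discriminator (or critic, in WGAN terms) has already scored a batch. The function names are hypothetical, and the Lipschitz constraint that WGAN also needs (weight clipping in the original, a gradient penalty in WGAN-GP) is not shown.

```python
import math


def vanilla_d_loss(p_real, p_fake, eps=1e-12):
    """Binary cross-entropy loss for a vanilla GAN discriminator.

    p_real / p_fake are probabilities in (0, 1) that D assigned to
    real and generated samples respectively.
    """
    real_term = -sum(math.log(p + eps) for p in p_real) / len(p_real)
    fake_term = -sum(math.log(1.0 - p + eps) for p in p_fake) / len(p_fake)
    return real_term + fake_term


def wgan_critic_loss(s_real, s_fake):
    """Wasserstein critic loss: mean score on fakes minus mean score on reals.

    s_real / s_fake are unbounded real-valued scores, not probabilities.
    """
    return sum(s_fake) / len(s_fake) - sum(s_real) / len(s_real)
```

The vanilla loss saturates as the discriminator's probabilities approach 0 or 1, while the critic loss stays linear in the scores. That linearity is one reason WGAN training tends to give the generator more usable gradients.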

What are the main challenges with GANs?

Mode collapse: the generator learns to produce only a few types of outputs that fool the discriminator, losing diversity.
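
One simple way to spot mode collapse on synthetic benchmarks is to check how many known modes the generated samples actually reach. The sketch below assumes 1-D samples and a known list of mode centres. The function name and the radius threshold are illustrative choices, not a standard metric.

```python
def mode_coverage(samples, mode_centers, radius=0.5):
    """Fraction of known modes that receive at least one nearby sample."""
    hit = set()
    for x in samples:
        # Find the closest mode centre to this sample.
        nearest = min(range(len(mode_centers)),
                      key=lambda i: abs(x - mode_centers[i]))
        # Count the mode as covered only if the sample is close enough.
        if abs(x - mode_centers[nearest]) <= radius:
            hit.add(nearest)
    return len(hit) / len(mode_centers)
```

A generator that has collapsed onto a single mode of a three-mode dataset would score about 0.33 here, however realistic its individual samples look.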

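Training instability: the adversarial objective is a minimax game, so the two networks need to improve at a similar rate. A discriminator that gets too far ahead leaves the generator with vanishing gradients, while one that falls behind gives it no useful signal, and in either case training can stall or diverge.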
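
Written out, the minimax game in the original GAN formulation is the value function below. The symbols p_data (the real data distribution) and p_z (the noise prior) are introduced here for the formula and are not used elsewhere in this article.

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

D pushes the value up by scoring real samples near 1 and generated ones near 0, and G pushes it down by making D(G(z)) large. In practice the generator is often trained to maximise log D(G(z)) instead, which gives stronger gradients early in training.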

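Evaluation difficulty: unlike a classifier's validation loss, the generator and discriminator losses do not reliably track sample quality, so there is no single loss curve to watch. FID (Fréchet Inception Distance) is the most widely used metric for generated images.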

Competition from diffusion models: for high-quality image synthesis, diffusion models (Stable Diffusion, DALL-E 2) have largely superseded GANs, thanks to more stable training and better output diversity.

GANs vs VAEs vs Diffusion models

	GAN	VAE	Diffusion
Output quality	High	Medium	Very high
Training stability	Low	High	High
Diversity	Risk of mode collapse	Good	Excellent
Speed (inference)	Fast	Fast	Slow
Latent space	Implicit	Explicit	Implicit

For synthetic training data today, LLM-based generation usually outperforms GANs on text and document tasks, while diffusion models dominate image synthesis.

Frequently asked questions

Are GANs still state of the art? For image synthesis, diffusion models have largely replaced GANs at the frontier. GANs remain useful for specific applications (fast inference, conditional generation) and are widely deployed in production systems trained before diffusion models matured.

Can GANs be used for text generation? Text GANs are challenging because text is discrete: sampling a token is not differentiable, so the discriminator's gradients cannot flow back to the generator. Techniques such as REINFORCE-style policy gradients and the Gumbel-softmax relaxation work around this, but LLMs have largely made text GANs obsolete.
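
As a rough illustration of the Gumbel-softmax idea, the sketch below draws a "soft" one-hot vector over a vocabulary instead of a hard token. It shows only the forward sampling step in plain Python. The function name, the default temperature, and the clamping constant are assumptions for this example, and a real text GAN would run this inside an autodiff framework so that gradients pass through the soft sample.

```python
import math
import random


def gumbel_softmax_sample(logits, tau=1.0, rng=random):
    """Draw a relaxed one-hot sample from unnormalised log-probabilities."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    # Softmax over the perturbed, temperature-scaled logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]
```

Lower values of tau push the output closer to a one-hot vector (closer to real token sampling, but with higher-variance gradients), while higher values give smoother, more uniform outputs.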

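What is FID score? Fréchet Inception Distance measures the distance between the distributions of real and generated images in the feature space of an Inception network, with each set of features summarised by a Gaussian (mean and covariance). Lower FID means the generated distribution is closer to the real one, which generally reflects better quality and diversity.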
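
The distance itself is the Fréchet distance between two Gaussians: the squared distance between the means plus Tr(S_r + S_g - 2 (S_r S_g)^(1/2)), where S_r and S_g are the covariance matrices of the real and generated features. The sketch below is a simplified version that assumes diagonal covariances, so it needs only the standard library. It is not the full FID: the real metric uses full covariance matrices of Inception-v3 features, so values from this function are not comparable to published FID scores. The function names are hypothetical.

```python
import math


def mean_and_var(features):
    """Per-dimension mean and unbiased variance of a list of feature vectors."""
    n = len(features)
    d = len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    var = [sum((f[j] - mean[j]) ** 2 for f in features) / (n - 1)
           for j in range(d)]
    return mean, var


def frechet_distance_diag(real_feats, fake_feats):
    """Frechet distance between two Gaussians with diagonal covariances."""
    mu_r, var_r = mean_and_var(real_feats)
    mu_f, var_f = mean_and_var(fake_feats)
    # Squared Euclidean distance between the two mean vectors.
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_f))
    # With diagonal covariances the trace term reduces to a per-dimension sum.
    cov_term = sum(vr + vf - 2.0 * math.sqrt(vr * vf)
                   for vr, vf in zip(var_r, var_f))
    return mean_term + cov_term
```

Two identical feature sets give a distance of 0. Shifting every generated feature by a constant increases only the mean term, while a generator with too little variety shows up in the covariance term.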