What are Generative Adversarial Networks (GANs)?

A Generative Adversarial Network (GAN) is a deep learning architecture in which two neural networks, a generator and a discriminator, are trained together in competition. The generator creates synthetic data; the discriminator tries to distinguish real from fake. This adversarial process pushes the generator to produce increasingly realistic outputs. GANs are used for image synthesis, data augmentation, and generating synthetic training data.


Why do GANs matter for ML practitioners?

GANs are relevant to applied ML in several ways:

  • Data augmentation: generate synthetic training examples when labelled data is scarce
  • Synthetic data for privacy: generate realistic but non-identifiable data for testing pipelines with sensitive content
  • Image-to-image translation: transform document scans, improve image quality, or augment visual datasets
  • Embedding space analysis: GANs have been used to explore and interpolate in the latent spaces of encoder models

For document processing pipelines, GANs can augment training data for OCR, layout detection, and visual document understanding models.


How does a GAN work?

A GAN has two components trained against each other:

Generator (G) takes random noise as input and produces synthetic samples (images, text, vectors). Its goal: fool the discriminator.

Discriminator (D) takes a sample (real or generated) and outputs a probability that it’s real. Its goal: correctly identify fakes.

Training alternates:

  1. Train D to distinguish real from generated samples
  2. Train G to produce samples that fool D

The generator never sees real data directly; it only receives feedback through the discriminator’s gradients.

Noise z → [Generator] → Fake sample
Real or fake sample → [Discriminator] → Real / Fake?
Discriminator gradients → back to [Generator]
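
The alternating loop can be sketched on a toy one-dimensional problem. Everything here is an illustrative assumption rather than a reference implementation: the real data is drawn from N(4, 0.5), the generator is the linear map G(z) = a*z + b, the discriminator is D(x) = sigmoid(w*x + c), gradients are derived by hand, and the generator uses the common non-saturating loss -log D(G(z)). The name train_toy_gan and all hyperparameters are hypothetical.

```python
import math
import random


def sigmoid(s: float) -> float:
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(steps=200, batch=32, lr=0.05, seed=0):
    """Alternate discriminator and generator updates on 1-D data.

    Assumed toy setup: G(z) = a*z + b, D(x) = sigmoid(w*x + c).
    Returns the final parameters (a, b, w, c).
    """
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters
    w, c = 0.0, 0.0  # discriminator parameters

    for _ in range(steps):
        real = [rng.gauss(4.0, 0.5) for _ in range(batch)]
        fake = [a * rng.gauss(0.0, 1.0) + b for _ in range(batch)]

        # Step 1: train D on loss -[log D(real) + log(1 - D(fake))].
        grad_w = grad_c = 0.0
        for x in real:
            d = sigmoid(w * x + c)
            grad_w += (d - 1.0) * x
            grad_c += d - 1.0
        for x in fake:
            d = sigmoid(w * x + c)
            grad_w += d * x
            grad_c += d
        w -= lr * grad_w / batch
        c -= lr * grad_c / batch

        # Step 2: train G on loss -log D(G(z)). G never touches `real`;
        # its gradient arrives only through D's parameter w.
        grad_a = grad_b = 0.0
        for _ in range(batch):
            z = rng.gauss(0.0, 1.0)
            d = sigmoid(w * (a * z + b) + c)
            grad_a += (d - 1.0) * w * z
            grad_b += (d - 1.0) * w
        a -= lr * grad_a / batch
        b -= lr * grad_b / batch

    return a, b, w, c
```

In the first few steps the generator offset b moves toward the real mean. Because this linear discriminator can only compare means, longer runs may oscillate rather than settle, which is a small-scale view of the instability discussed later.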

What are the main GAN architectures?

| Architecture | Key innovation | Best for |
| --- | --- | --- |
| DCGAN | Convolutional layers in G and D | Image generation |
| Conditional GAN (cGAN) | Condition on class label | Class-specific generation |
| Pix2Pix | Image-to-image translation | Document enhancement, style transfer |
| CycleGAN | Unpaired image-to-image | Domain adaptation |
| StyleGAN | High-fidelity, controllable | Photorealistic face/image synthesis |
| WGAN | Wasserstein loss, more stable training | General improvement over vanilla GAN |
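
To make the WGAN row concrete, here is a minimal sketch contrasting the two discriminator objectives on already-computed outputs. The function names are hypothetical. The sketch also omits the Lipschitz constraint (weight clipping or a gradient penalty) that a WGAN critic needs in practice.

```python
import math


def vanilla_d_loss(p_real, p_fake):
    """Binary cross-entropy loss for a vanilla GAN discriminator.

    Inputs are probabilities in (0, 1) that each sample is real.
    """
    real_term = sum(math.log(p) for p in p_real) / len(p_real)
    fake_term = sum(math.log(1.0 - p) for p in p_fake) / len(p_fake)
    return -(real_term + fake_term)


def wgan_critic_loss(s_real, s_fake):
    """Wasserstein critic loss: mean fake score minus mean real score.

    Inputs are unbounded real-valued scores, not probabilities.
    """
    return sum(s_fake) / len(s_fake) - sum(s_real) / len(s_real)
```

The critic loss stays linear in the scores, so it does not saturate the way the log terms do when the discriminator becomes very confident.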

What are the main challenges with GANs?

Mode collapse: the generator learns to produce only a few types of outputs that fool the discriminator, losing diversity.
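
One rough way to spot mode collapse on toy data is to count how many known modes the generator actually reaches. The sketch below assumes one-dimensional samples and a known list of mode centres, which real datasets rarely provide; mode_coverage and its radius parameter are hypothetical names.

```python
def mode_coverage(samples, modes, radius):
    """Return the fraction of modes with at least one sample within radius."""
    covered = set()
    for x in samples:
        # Index of the mode centre closest to this sample.
        nearest = min(range(len(modes)), key=lambda i: abs(x - modes[i]))
        if abs(x - modes[nearest]) <= radius:
            covered.add(nearest)
    return len(covered) / len(modes)


modes = [0.0, 5.0, 10.0]
healthy = [0.1, 4.9, 10.2, -0.2, 5.1]
collapsed = [5.0, 5.1, 4.9, 5.05]
print(mode_coverage(healthy, modes, radius=0.5))
print(mode_coverage(collapsed, modes, radius=0.5))
```

The collapsed generator covers only one of the three modes even though each individual sample looks plausible.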

Training instability: the adversarial objective is a minimax game; both networks need to improve at a similar rate or training diverges.
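
For reference, the minimax game is usually written as the value function below. This is the standard formulation; the symbols p_data (the real data distribution) and p_z (the noise distribution) are notation introduced here, not taken from the text above.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
```

D tries to maximise both terms, while G can only influence the second one, through the samples G(z) it hands to D.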

Evaluation difficulty: unlike a classifier’s loss, GAN training losses do not reliably track sample quality, so there is no single number to monitor during training. FID (Fréchet Inception Distance) is the most widely used metric for image quality.

Replaced by diffusion models: for high-quality image synthesis, diffusion models (Stable Diffusion, DALL-E 2) have largely superseded GANs, owing to more stable training and better output diversity.


GANs vs VAEs vs Diffusion models

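| Property | GAN | VAE | Diffusion |
| --- | --- | --- | --- |
| Output quality | High | Medium | Very high |
| Training stability | Low | High | High |
| Diversity | Risk of mode collapse | Good | Excellent |
| Speed (inference) | Fast | Fast | Slow |
| Latent space | Implicit | Explicit | Implicit |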

For generating synthetic training data today, LLM-based generation often outperforms GANs on document and text tasks, while diffusion models dominate image synthesis.


Frequently asked questions

Are GANs still state of the art? For image synthesis, diffusion models have largely replaced GANs at the frontier. GANs remain useful for specific applications (fast inference, conditional generation) and are still widely deployed in production systems built before diffusion models matured.

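Can GANs be used for text generation? Text GANs are challenging because text is discrete: gradients don’t flow through token sampling. Techniques like REINFORCE (which estimates the gradient from sampled rewards) or Gumbel-softmax (which replaces the discrete sample with a continuous approximation) work around this, but LLMs have largely made text GANs obsolete.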
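
As a rough illustration of the Gumbel-softmax idea, the sketch below computes only the forward pass in plain Python. It has no automatic differentiation, so it shows the relaxed sample, not the gradient flow; the function name and its parameters are assumptions made for this example.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=None):
    """Draw a relaxed (continuous) one-hot sample over token logits.

    Adds Gumbel(0, 1) noise to each logit, then applies a
    temperature-scaled softmax. Lower temperatures give outputs
    closer to a hard one-hot vector.
    """
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)      # guard against log(0)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) sample
        noisy.append((logit + gumbel) / temperature)
    top = max(noisy)                      # subtract max for stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


probs = gumbel_softmax([2.0, 0.5, 0.1], temperature=0.5, rng=random.Random(0))
print(probs)
```

Because the output is a smooth function of the logits, a framework with automatic differentiation can pass gradients through it, which a hard argmax or categorical sample would block.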

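What is FID score? Fréchet Inception Distance measures the distance between the distributions of real and generated images in the feature space of an Inception network, modelling each set of features as a Gaussian. A lower FID means the generated images are statistically closer to the real ones, which reflects both higher quality and greater diversity.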
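
The Fréchet distance behind FID can be sketched in one dimension, where the covariance matrices reduce to scalar variances. This is a simplification for illustration only: real FID uses high-dimensional Inception features and a matrix square root, and frechet_distance_1d is a hypothetical helper, not a library function.

```python
import math
import statistics


def frechet_distance_1d(real_feats, gen_feats):
    """Fréchet distance between two 1-D Gaussians fitted to the inputs.

    Scalar form of ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)).
    """
    mu_r = statistics.fmean(real_feats)
    mu_g = statistics.fmean(gen_feats)
    var_r = statistics.variance(real_feats)  # sample variance (n - 1)
    var_g = statistics.variance(gen_feats)
    return (mu_r - mu_g) ** 2 + var_r + var_g - 2.0 * math.sqrt(var_r * var_g)


print(frechet_distance_1d([0.0, 2.0], [1.0, 3.0]))  # 1.0
```

The two toy sets have the same spread and means one unit apart, so the distance comes entirely from the squared mean difference. A generator that matched the mean but collapsed the variance would be penalised through the variance terms instead.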

