Text-To-Image Context#

Data#

Have I Been Trained?

LAION-5B

LAION-Aesthetics

Text & Images#

CLIP (Contrastive Language-Image Pre-training) is a dual-encoder model trained on hundreds of millions of image–caption pairs. It turns each picture and each sentence into a vector, and the two encoders are trained so that matching pairs land close together in a shared embedding space.

OpenAI CLIP

CLIP

Embeddings
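The matching idea can be sketched with toy numbers. The vectors and labels below are invented for illustration (real CLIP embeddings have hundreds of dimensions); the point is only that cosine similarity pairs each image with its caption.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical 4-d embeddings standing in for CLIP's high-dimensional outputs.
image_embedding = {
    "photo_of_a_dog": [0.9, 0.1, 0.0, 0.2],
    "photo_of_a_car": [0.1, 0.8, 0.3, 0.0],
}
text_embedding = {
    "a dog in the park": [0.85, 0.15, 0.05, 0.25],
    "a red sports car":  [0.05, 0.90, 0.25, 0.10],
}

# A matching image-caption pair scores higher than a mismatched one,
# so each image retrieves its own caption.
for img, vec in image_embedding.items():
    best = max(text_embedding, key=lambda t: cosine(vec, text_embedding[t]))
    print(img, "->", best)
```

Training pushes matched pairs toward high similarity and mismatched pairs toward low similarity, which is what makes this retrieval work.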

This shared space makes meaning computable: adding and subtracting embeddings combines or removes concepts, so semantic relationships can be explored with simple vector arithmetic.

Calculating with CLIP

ZeroCap
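A minimal sketch of embedding arithmetic, using invented 3-d vectors (real CLIP embeddings are much higher-dimensional): subtracting one concept and adding another moves the result toward a related concept.

```python
import math

def nearest(vec, vocab):
    """Return the vocabulary word whose vector is closest (Euclidean) to vec."""
    return min(vocab, key=lambda w: math.dist(vec, vocab[w]))

# Hypothetical embeddings chosen so the classic analogy works.
vocab = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.9, 0.1, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.1, 0.8, 0.9],
}

# king - man + woman lands near queen in the shared space.
result = [k - m + w for k, m, w in zip(vocab["king"], vocab["man"], vocab["woman"])]
print(nearest(result, vocab))  # -> queen
```

The same trick works across modalities: an image embedding minus the text embedding for "day" plus the one for "night" points toward a night-time version of the scene.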

What are embedding-spaces?#

Embeddings

Mapping Embeddings

Stable Diffusion#

Stable Diffusion is an open-source text-to-image latent diffusion model: it starts from random noise in a compact latent space and repeatedly removes that noise, guided at each step by a CLIP-encoded text prompt, until a latent emerges that can be decoded into an image.

Stable Diffusion Explainer
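The denoising loop can be caricatured in a few lines. Everything here is a stand-in: `predict_noise` plays the role of the trained U-Net, and the "target latent" stands in for what the text prompt steers toward; a real sampler uses a learned noise schedule, not a fixed step size.

```python
import random

def predict_noise(x, target):
    """Stand-in for the U-Net noise predictor: pretend the model perfectly
    predicts the noise as the gap between the current sample and the latent
    implied by the prompt. (Invented for illustration.)"""
    return [xi - ti for xi, ti in zip(x, target)]

random.seed(0)
target = [0.2, -0.5, 0.7, 0.1]            # latent the prompt "should" produce
x = [random.gauss(0, 1) for _ in target]  # start from pure Gaussian noise

for _ in range(50):
    eps = predict_noise(x, target)
    # Each step removes a small fraction of the predicted noise.
    x = [xi - 0.1 * ei for xi, ei in zip(x, eps)]

# After many small denoising steps, x sits close to the target latent,
# which a decoder would then turn into the final image.
```

The real model repeats this for a few dozen steps in a latent space roughly 48x smaller than pixel space, which is why latent diffusion is fast enough to run on consumer GPUs.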