Text to Image explained

Text-to-image generators have gained tremendous popularity in recent years. This application of artificial intelligence allows users to create digital images from textual input. It seems that any desired image is just a few clicks away, with little to no technical knowledge of the inner workings needed.

But what happens under the hood? How exactly does a text-to-image AI work? This step-by-step explainer lifts the curtain (at least partly) on image generation and sheds some light inside the black box of AI.

Disclaimer: this is a simplified explanation of text-to-image generation. In reality, there are many different types of generation models, approaches and techniques. To learn more about this complex technology, we recommend reading this or this research. 

Inside the Black Box

Text-to-Image Generation Explained

A text-to-image generator is an online tool that transforms text into an image using artificial intelligence. On platforms such as DALL-E, ChatGPT, Midjourney or Stable Diffusion, a user can submit a piece of text, such as a keyword or a sentence (also known as a prompt). Based on this prompt, the AI spits out an image, or a series of images, that visualises the prompt. Prompts can be as simple (“image of a cat”) or as elaborate (“photorealistic image of a black cat running across a rooftop in an urban environment...”) as the user wants.

To create the visualisation successfully, the AI needs to understand what the prompt as a whole means, as well as the individual concepts within it, such as ‘cat’, ‘rooftop’ and ‘urban’.

To learn how these concepts are visualised, an image generation AI is trained on many existing images. Using a self-learning algorithm, the AI goes through a large collection of images together with their descriptions. This involves a lot of human labour: the images need to be labelled correctly, and to improve results the dataset should include many different visualisations of each concept. For ‘cat’, that means for example that the training set should include images of different breeds, colours and poses.

The algorithm then examines every image pixel by pixel. By comparing each image with the others in the training set, it starts to detect patterns in how the pixels are organised. For instance, an almond-shaped cluster of dark-coloured pixels could eventually be recognised as a cat’s eye, and a pink triangular group of pixels as a cat’s nose. Through this pattern recognition, the algorithm builds an understanding of how a ‘cat’ is represented at pixel level. Pattern recognition requires an immense amount of time and data. For large text-to-image AI systems, the training period can last for months and require millions, sometimes billions, of images.
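As a rough illustration of this kind of pattern detection, the sketch below averages tiny labelled pixel grids into one "prototype" per label and then matches a new grid to the closest prototype. Everything in it is an assumption made for illustration: the 3×3 grids, the labels `cat_eye` and `cat_nose`, and the helper names `learn_prototypes` and `classify` are hypothetical. Real generators rely on neural networks trained on vastly more data, not on simple averaging.

```python
from collections import defaultdict


def flatten(grid):
    """Turn a 2D pixel grid into one flat list of pixel values."""
    return [pixel for row in grid for pixel in row]


def learn_prototypes(labelled_grids):
    """Average all grids that share a label into one prototype per label."""
    sums = {}
    counts = defaultdict(int)
    for label, grid in labelled_grids:
        flat = flatten(grid)
        if label not in sums:
            sums[label] = [0.0] * len(flat)
        sums[label] = [s + p for s, p in zip(sums[label], flat)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def classify(grid, prototypes):
    """Return the label whose prototype has the smallest squared distance."""
    flat = flatten(grid)

    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(flat, prototypes[label]))

    return min(prototypes, key=distance)


# Hypothetical training set: 1 = dark pixel, 0 = light pixel.
training_set = [
    ("cat_eye", [[0, 0, 0], [1, 1, 1], [0, 0, 0]]),
    ("cat_eye", [[0, 1, 0], [1, 1, 1], [0, 0, 0]]),
    ("cat_nose", [[1, 1, 1], [0, 1, 0], [0, 0, 0]]),
    ("cat_nose", [[1, 1, 1], [0, 1, 0], [0, 1, 0]]),
]
prototypes = learn_prototypes(training_set)

# A grid the sketch has not seen before, close to the 'eye' pattern.
unseen = [[0, 0, 0], [1, 1, 1], [0, 1, 0]]
print(classify(unseen, prototypes))
```

The more varied examples each label has, the more its prototype reflects what the examples have in common, which is the toy counterpart of the pattern recognition described above.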

Interactive figure: AI labels with confidence scores placed on an image of a cat (Cat_Eye 0.83, Cat_Ear 0.97, Cat_Eye 0.73, Cat_Head 0.97, Cat_Nose 0.75, Cat_Ear 0.97).

Through pattern recognition, AI loosely mimics how the human brain works. In our brains, visual input from our eyes is processed by a dense network of neurons that pass information to each other. At the end of this process, our neural network draws its conclusion: this image must represent a cat! The connections between neurons can change or strengthen through learning and exposure to images. The network of a self-learning algorithm works similarly: its ‘neurons’ exchange information and, in doing so, can detect patterns in the pixels.
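To make the idea of connections that strengthen through learning a little more concrete, here is a minimal sketch of a single artificial "neuron" (a classic perceptron, not the architecture of any particular image generator). The four-pixel inputs, the targets and the names `train_neuron` and `fires` are assumptions chosen for illustration. Each weight plays the role of a connection, and it is nudged up or down whenever the neuron gets an example wrong.

```python
def train_neuron(examples, epochs=20, learning_rate=0.1):
    """Adjust one neuron's weights until it separates the examples."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for pixels, target in examples:
            activation = sum(w * p for w, p in zip(weights, pixels)) + bias
            prediction = 1 if activation > 0 else 0
            error = target - prediction  # 0 when the neuron was right
            # Strengthen or weaken each connection in proportion to its input.
            weights = [w + learning_rate * error * p
                       for w, p in zip(weights, pixels)]
            bias += learning_rate * error
    return weights, bias


def fires(pixels, weights, bias):
    """Return True when the neuron's weighted input is above zero."""
    return sum(w * p for w, p in zip(weights, pixels)) + bias > 0


# Hypothetical data: target 1 means 'part of a cat', 0 means 'background'.
examples = [
    ([1, 1, 0, 0], 1),
    ([1, 0, 0, 0], 1),
    ([0, 0, 1, 1], 0),
    ([0, 0, 0, 1], 0),
]
weights, bias = train_neuron(examples)
print(fires([1, 1, 0, 0], weights, bias))
print(fires([0, 0, 1, 1], weights, bias))
```

A real network chains a very large number of such units together, but the principle of adjusting connections after each mistake is the same.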

Once a text-to-image AI has finished training, it can link a piece of text to its matching pixel representation. On top of that, the algorithm can also understand combinations of words. It recognises not only the visual representation of ‘cat’ but also that of ‘black cat’ or ‘cat on a rooftop’.

The next step, generation, reverses this process. When presented with a textual prompt, the algorithm calculates, pixel by pixel, which pixel should come next to accurately represent the prompt, based on its understanding of the words in it.
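The pixel-after-pixel idea can be sketched with a toy example. The code below counts, for each caption, which pixel value tends to follow which in a few training rows, and then generates a new row for a prompt by repeatedly picking the most frequent next pixel. The captions, the pixel codes (`"K"` for black, `"W"` for white) and the function names are hypothetical. This is only one simplified way to picture generation: diffusion-based generators such as Stable Diffusion instead refine a whole noisy image over many steps.

```python
from collections import Counter, defaultdict


def learn_transitions(captioned_rows):
    """Count, per caption, which pixel value follows which."""
    table = defaultdict(Counter)
    for caption, row in captioned_rows:
        previous = "START"
        for pixel in row:
            table[(caption, previous)][pixel] += 1
            previous = pixel
    return table


def generate_row(prompt, table, length):
    """Build a row one pixel at a time, always taking the likeliest next pixel."""
    row = []
    previous = "START"
    for _ in range(length):
        options = table.get((prompt, previous))
        if not options:
            break  # nothing was learned for this prompt or situation
        pixel = options.most_common(1)[0][0]
        row.append(pixel)
        previous = pixel
    return row


# Hypothetical training rows: "K" = black pixel, "W" = white pixel.
training_rows = [
    ("black cat", ["K", "K", "K", "K"]),
    ("striped cat", ["K", "W", "K", "W"]),
]
table = learn_transitions(training_rows)

print(generate_row("black cat", table, 4))
print(generate_row("striped cat", table, 4))
```

A prompt the sketch never saw during training yields an empty row, a reminder that the output depends entirely on what was in the training data.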


