Stable Diffusion Image Generation: A Step-by-Step Guide

AI-generated image, created using Midjourney v5.

Introduction to Stable Diffusion Image Generation

Image synthesis is an exciting field in artificial intelligence that enables the creation of realistic images from textual descriptions or other input data. One of the latest advancements in this domain is stable diffusion, a process that leverages the power of deep learning models to generate high-quality images. The image on this blog's cover, for example, was generated with Midjourney, a similar text-to-image tool.

In this blog post, I will explore stable diffusion and demonstrate how you can generate images using a pre-trained stable diffusion model. This educational project aims to help you learn about image synthesis techniques and understand the stable diffusion process.

Exploring the Stable Diffusion Model

Stable diffusion is a powerful method for generating images. Under the hood it is a latent diffusion model: starting from random noise, it iteratively denoises a compressed latent representation of the image under the guidance of a text prompt, building on the idea of denoising score matching. It is particularly well-suited for image synthesis tasks due to its ability to produce high-quality images with impressive detail and realism.
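
Conceptually, generation starts from pure noise and repeatedly removes a small amount of predicted noise until an image emerges. The sketch below is illustrative only; the function names, tensor shapes, and interfaces are simplifications of what the real pipeline does internally, not the actual diffusers code:

import torch

def denoise(noise_predictor, scheduler, text_embedding):
    # Start from pure Gaussian noise in the model's latent space
    latents = torch.randn(1, 4, 64, 64)
    # Each step predicts the noise present and removes a fraction of it
    for t in scheduler.timesteps:
        noise_pred = noise_predictor(latents, t, text_embedding)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    # A separate decoder (a VAE) then turns the final latents into pixels
    return latents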

Pre-trained Models for Stable Diffusion

There are several pre-trained models available for stable diffusion image generation, with varying levels of quality and performance. One such model is the stabilityai/stable-diffusion-2-1 model, which is hosted on Hugging Face's Model Hub. This model is designed to produce high-quality images and has been fine-tuned to work efficiently with a wide range of text prompts.

Model Description and Capabilities

The stabilityai/stable-diffusion-2-1 model leverages a deep neural network architecture and is trained on a large dataset of images and corresponding textual descriptions. This enables the model to generate images with intricate details and high fidelity, closely matching the input text prompts.

For our API, I will be using this pre-trained stable diffusion model to generate images based on user-provided text prompts. By integrating the model into a simple Flask API, we can create a user-friendly interface that allows users to easily generate images using stable diffusion techniques.

Setting Up Our Project

To create our API for stable diffusion image generation, we will need to install and set up a few tools and libraries. In this section, we will cover the required components and provide instructions on how to install and configure them.

Required Tools and Libraries

  • Python 3.7 or higher
  • Pipenv (for managing dependencies)
  • PyTorch:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117

  • Hugging Face Transformers and Diffusers libraries:

pipenv install transformers accelerate "diffusers[torch]"

  • Flask (for creating the API)

Additionally, we will need:

  • a stable internet connection for downloading the pre-trained stable diffusion model (approximately 10 GB)
  • a GPU for running the model (I tested it with my NVIDIA 2060 GPU).
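
Before downloading anything, it is worth confirming that PyTorch can actually see your GPU:

import torch

print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 2060"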

Installation and Setup Process

Follow the steps below to install and set up the required tools and libraries:

  1. Clone my GitHub repository containing the source code:

git clone https://github.com/prakashrx/stable-diffusion.git
cd stable-diffusion
pipenv install

Now that we have installed the required tools and libraries, let's implement the API for stable diffusion image generation.

Code Walkthrough

The main file for our Flask API is app.py, which contains the necessary code to set up the API, load the stable diffusion model, and define the endpoint for image generation. Below is a brief overview of the code structure:

To set up our API for stable diffusion image generation, we first import the necessary libraries and modules: the pipeline and scheduler from Diffusers, PyTorch, Flask with CORS support, and PIL for image handling.

from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
import torch
import random
from flask import Flask, Response, request, jsonify
from flask_cors import CORS
import io
from PIL import Image

To start, initialize the Flask application and enable Cross-Origin Resource Sharing (CORS) using the Flask-CORS extension. CORS lets browsers on other origins (for example, a separately hosted web front end) call the API; without it, such cross-origin requests would be blocked by the browser.

app = Flask(__name__)
CORS(app)

To use the stable diffusion model, we load the pre-trained weights, which are downloaded the first time the application runs and take up approximately 10 GB of disk space. On Windows, the model is cached under %USERPROFILE%\.cache\huggingface on first download and loaded from that cache on subsequent runs. Please ensure that you have enough space on your drive before running the application.

# Download (on first run) and load the pre-trained pipeline in half precision
model_id = "stabilityai/stable-diffusion-2-1"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
# Swap in the faster DPM-Solver multistep scheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# Move the pipeline to the GPU
pipe = pipe.to("cuda")
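
If the drive holding the default cache is short on space, from_pretrained also accepts a cache_dir argument that stores the weights elsewhere (the path below is just an example):

pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    cache_dir="D:/huggingface-cache",  # example path; any writable directory works
)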

In my case, due to limited GPU memory (8 GB), I had to enable attention slicing with the following line. If you have a GPU with more memory, this step is unnecessary.

pipe.enable_attention_slicing()
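
If attention slicing alone is not enough, Diffusers also provides model CPU offloading, which keeps each submodule in system memory and moves it to the GPU only while it is in use. It trades speed for memory and relies on the accelerate package we installed earlier; call it in place of pipe.to("cuda"):

pipe.enable_model_cpu_offload()  # use instead of pipe.to("cuda")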

The generate_image function takes a single argument, prompt, which is the text prompt used to generate the image. It utilizes the pipe object, which is an instance of the pre-trained stable diffusion model, to generate the image based on the input prompt:

def generate_image(prompt):
    image = pipe(prompt).images[0]
    return image
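
If you want to try the function on its own before wiring up the API, a quick test from a Python shell looks like this (the prompt and filename are arbitrary):

image = generate_image("A watercolor painting of a lighthouse")
image.save("lighthouse.jpg")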

Finally, let's put this together and expose generate() as a Flask route mapped to the /generate endpoint. The route reads the prompt query parameter, returns a 400 error if it is missing, and otherwise returns the generated image:

@app.route("/generate", methods=["GET"])
def generate():
    prompt = request.args.get("prompt")
    if not prompt:
        # reject requests that omit the required query parameter
        return jsonify({"error": "Missing 'prompt' query parameter"}), 400
    print("prompt:", prompt)
    image = generate_image(prompt)
    image_bytes = io.BytesIO()
    image.save(image_bytes, format="JPEG")
    image_bytes = image_bytes.getvalue()
    print(f"Generated image of size {len(image_bytes)} bytes.")

    # return the response as an image
    return Response(image_bytes, mimetype="image/jpeg")

The generated image is converted to the JPEG format, and the route returns the image as a response with a mime type of image/jpeg.
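
One detail not shown above: for python app.py (used in the next section) to start the server, app.py needs the standard Flask entry point at the bottom of the file, something like:

if __name__ == "__main__":
    app.run()  # defaults to http://127.0.0.1:5000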

The full code can be found in my GitHub repo: https://github.com/prakashrx/stable-diffusion

Running the Flask API

Now that we've implemented the Flask API, let's see how to run it and generate images using the /generate endpoint.

Starting the Flask Server

pipenv shell
python app.py

The API will be available at http://localhost:5000/generate.

Generating Images

To generate an image using the API, send a GET request with the prompt query parameter:

http://localhost:5000/generate?prompt=Beautiful%20sunset%20over%20the%20ocean

This request will generate an image of a beautiful sunset over the ocean and return it as a JPEG image. You can use any text prompt that you'd like, and the API will generate an image based on your input.
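
You can also call the endpoint from code. For example, using the requests library (an extra dependency, installable with pipenv install requests):

import requests

response = requests.get(
    "http://localhost:5000/generate",
    params={"prompt": "Beautiful sunset over the ocean"},
)
with open("sunset.jpg", "wb") as f:
    f.write(response.content)  # save the returned JPEG bytes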

Example output (screenshot): https://raw.githubusercontent.com/prakashrx/stable-diffusion/main/images/screenshot.jpg

Potential Enhancements and Use Cases

Now that we have our educational Flask API for stable diffusion image generation, there are several enhancements and use cases to consider:

Additional Features

  1. Custom model selection: Allow users to choose from multiple pre-trained stable diffusion models to generate images with different characteristics.
  2. Adjustable image quality: Provide options to adjust image quality settings or add an option for generating images at a faster speed with lower fidelity.
  3. Batch image generation: Enable the API to generate multiple images at once based on an array of text prompts (see the sketch after this list).
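
As a starting point for batch generation, the Diffusers pipeline already accepts a list of prompts, so the same pipe object can serve a batch endpoint. A minimal sketch (the prompts and filenames are arbitrary):

prompts = ["A castle at dawn", "A forest in autumn"]
images = pipe(prompts).images          # one PIL image per prompt
for i, img in enumerate(images):
    img.save(f"batch_{i}.jpg")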

Real-World Applications and Use Cases

  1. Content creation: Use the API to create original images for blog posts, social media, or advertisements based on textual descriptions.
  2. Data augmentation: Generate additional training data for machine learning tasks that require images with specific characteristics.
  3. Design prototyping: Quickly create visual prototypes based on textual descriptions to aid in the design process.

I thoroughly enjoyed working on this project. Though it was a simple educational exercise, I learned a lot from it. It's an excellent starting point for learning about image synthesis techniques, and there are many more things to explore.

I encourage you to continue exploring and experimenting with different pre-trained models, customization options, and use cases. Happy coding!