Introduction
Text-to-image generation is a form of generative AI that creates an image from a provided text description. The goal is to produce an image that accurately portrays the intricacies and nuances described in the text. This is a challenging task: the model must comprehend both the meaning and structure of the textual input while generating a visually realistic image. The applications of text-to-image generation are vast, spanning domains such as AI photography, concept art, architectural design, fashion, video games, graphic design, and many other creative fields.
Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images from any text input. It is an open-source model that can be hosted on AWS. For smooth real-time interactions with the model, it is important to use accelerated hardware such as GPUs or AWS Inferentia2 (Amazon's machine learning inference accelerator).
Input prompt:
- An astronaut riding horse in space, space, interplanetary, so real, unreal, amazing lighting, cinematic, intense, detailed (left image)
- Polaroid photo of a scavenger exploring ancient ruins in mars (right image)
Use case overview
In this two-part blog series, we discuss how to develop an AI image generator application using Stable Diffusion on AWS.
In this first part, we demonstrate how to leverage AWS Deep Learning Containers (DLCs) and Inferentia to optimize serving of Stable Diffusion 2.1 base for text2image predictions. We also run a benchmark test to compare the text2image latency and cost of three deployment configurations:
- Stable Diffusion 2.1 base (Default)
- Stable Diffusion 2.1 base (DJL Serving to host the model using DeepSpeed)
- Stable Diffusion 2.1 base (Inferentia 2)
Serve Stable Diffusion 2.1 base using SageMaker's DLC
The Hugging Face Diffusers pipelines library provides a simple way to run state-of-the-art diffusion models for inference. More information on Diffusers can be found here.
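For instance, a minimal text2image call with Diffusers looks roughly like the sketch below (the model ID is the published Stable Diffusion 2.1 base checkpoint; the fp16 precision and the prompt are illustrative choices):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the Stable Diffusion 2.1 base pipeline in half precision on a GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate a single 512x512 image from a text prompt.
image = pipe("An astronaut riding a horse in space, cinematic, detailed").images[0]
image.save("astronaut.png")
```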
SageMaker maintains deep learning containers (DLCs) with popular open-source libraries for hosting large models such as GPT, T5, OPT, BLOOM, and Stable Diffusion on AWS infrastructure. More information on the libraries supported by SageMaker's DLCs can be found here.
Below are the steps to deploy Stable Diffusion 2.1 base using Diffusers by extending SageMaker's DLC:
- Create a Dockerfile by extending an AWS DLC with the requirements for Stable Diffusion 2.1 base.
- Create an ECR repository and build an ECR image from the Dockerfile.
- Create a custom inference script specific to our use case. This script uses the model artifact available on Hugging Face or in the S3 bucket used by SageMaker JumpStart.
- Deploy the inference script with the created ECR image as a SageMaker real-time inference endpoint (a minimal deployment sketch follows this list).
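As a rough illustration of the last step, the sketch below deploys the model with the SageMaker Python SDK; the ECR image URI, S3 paths, endpoint name, and request payload format are placeholders/assumptions that depend on the Dockerfile and custom inference script created above:

```python
import sagemaker
from sagemaker.model import Model
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Placeholders: the extended DLC image pushed to ECR, and the packaged model
# artifact (Hugging Face weights, or the artifact used by SageMaker JumpStart) in S3.
image_uri = "<account-id>.dkr.ecr.<region>.amazonaws.com/sd-2-1-base-extended:latest"
model_data = "s3://<bucket>/<prefix>/model.tar.gz"

model = Model(
    image_uri=image_uri,
    model_data=model_data,
    role=role,
    sagemaker_session=session,
)

# Real-time inference endpoint on a GPU instance.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    endpoint_name="stable-diffusion-2-1-base-default",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# The request/response format is defined by the custom inference script.
response = predictor.predict({"prompt": "An astronaut riding a horse in space"})
```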
Prompt:
An astronaut llama in space, space, interplanetary, so real, unreal, amazing lighting, cinematic, intense, detailed (left image)
For instructions on creating a custom ECR image, you can find detailed guidance in the “extending-image-notebook” located here.
The notebook to deploy the Stable Diffusion model can be found here.
Serve Stable Diffusion 2.1 base using DJL and DeepSpeed
SageMaker's DLCs support libraries that enable model parallelism and inference optimizations, such as DJL-Serving and DeepSpeed Inference.
DJL-Serving is an open-source, high-performance model server powered by DJL. It takes multiple deep learning models or workflows, and makes them available through an HTTP endpoint. Versions 0.19 and above are supported by SageMaker and work with Amazon EC2 instances with multiple GPUs to facilitate large model inference (LMI) with model parallelism.
DeepSpeed Inference is an open-source inference optimization library. It includes model partitioning schemes for model parallelism with supported models, including many transformer models. It also has optimized kernels for popular models such as OPT, GPT, and BLOOM that can significantly improve inference latency. The version of DeepSpeed in the LMI DLCs is optimized and tested to work on SageMaker. It includes several enhancements, including support for BF16 precision models.
Below are the steps to deploy Stable Diffusion 2.1 base using DJL Serving and DeepSpeed:
- Download a pre-packaged model from a specified S3 location. For example, you can use the following URL to download the prepackaged model: s3://jumpstart-cache-prod-us-west-2/stabilityai-infer/prepack/v1.0.0/infer-prepack-model-txt2img-stabilityai-stable-diffusion-v2-1-base.tar.gz.
- After downloading the prepackaged model, unpack it to your desired S3 location by extracting the contents of the downloaded .tar.gz file and uploading them to S3. Make note of the S3 path where the model is unpacked.
- Use the DJLModel class from the SageMaker Python SDK to create a model instance, specifying the S3 path where the unpacked model is stored, the IAM role required for accessing resources, the SageMaker session, and additional parameters such as the name, data type, and number of partitions.
- Use the deploy method on the DJLModel instance to deploy the model. Specify the instance type (e.g., "ml.g5.xlarge"), an endpoint name (e.g., "stable-diffusion-2-1-DJL"), and serializers/deserializers if needed. A sketch of both steps follows this list.
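Putting the last two steps together, here is a minimal sketch using the SageMaker Python SDK's DJLModel; the S3 path is a placeholder, and the exact parameter names may vary across SDK versions:

```python
import sagemaker
from sagemaker.djl_inference import DJLModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# S3 prefix where the prepackaged model was unpacked (placeholder path).
model = DJLModel(
    "s3://<bucket>/<prefix>/stable-diffusion-2-1-base/",
    role,
    sagemaker_session=session,
    name="stable-diffusion-2-1-DJL",
    dtype="fp16",               # data type used for inference
    number_of_partitions=1,     # model-parallel partitions per worker
)

# Deploy to a GPU instance as a real-time endpoint.
predictor = model.deploy(
    instance_type="ml.g5.xlarge",
    initial_instance_count=1,
    endpoint_name="stable-diffusion-2-1-DJL",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)
```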
Code to deploy this solution can be found here.
Serve Stable Diffusion 2.1 base using Inferentia 2
AWS Inferentia 2 is the latest chip in the Inferentia series, succeeding Inferentia 1, which was introduced in 2019. Powered by Inferentia 1, Amazon EC2 Inf1 instances delivered a 25% increase in throughput and a 70% reduction in cost compared to equivalent G5 instances using the NVIDIA A10G GPU.
The Inferentia 2 chip delivers a 4x increase in throughput and a 10x reduction in latency compared to Inferentia 1. Correspondingly, the newly launched Amazon EC2 Inf2 instances provide up to 2.6x better throughput, 8.1x lower latency, and a 50% increase in performance per watt compared to similar G5 instances. Inferentia 2 thus balances cost-effective inference, thanks to its high throughput, with swift response times for your applications, courtesy of its low inference latency.
To cater to different requirements, Inf2 instances are available in various sizes, each equipped with between 1 and 12 Inferentia 2 chips. When multiple chips are present, they benefit from a high-bandwidth direct Inferentia2-to-Inferentia2 interconnect, enabling distributed inference on large-scale models. For instance, the largest Inf2 instance size, inf2.48xlarge, incorporates 12 chips and offers enough memory capacity to accommodate a 175-billion-parameter model such as GPT-3 or BLOOM. In this blog, we use the inf2.xlarge instance with Stable Diffusion 2.1 base.
Deployment steps:
To deploy Stable Diffusion 2.1 base using Inferentia 2, we need to perform two key steps. First, we compile the model to run on Inf2 using an AWS Trainium (trn1) instance. Then, we use a custom inference script specifically designed for Inferentia 2 to run the model.
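To make the compilation step concrete, the sketch below shows how the UNet (the heaviest component of the pipeline) might be traced for Neuron inside `compile.py`. The wrapper class, input shapes, and output file name are assumptions for 512x512 text2image with Stable Diffusion 2.1 base; in practice the text encoder and VAE decoder are traced the same way on the Trainium instance:

```python
import torch
import torch_neuronx
from diffusers import StableDiffusionPipeline


class UNetWrapper(torch.nn.Module):
    """Wraps the Diffusers UNet so it accepts and returns plain tensors for tracing."""

    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states):
        return self.unet(sample, timestep, encoder_hidden_states, return_dict=False)[0]


pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base")
unet = UNetWrapper(pipe.unet.eval())

# Example inputs: batch of 2 (classifier-free guidance), 64x64 latents for 512x512
# images, and 77 text tokens with hidden size 1024 -- assumptions for SD 2.1 base.
sample = torch.randn(2, 4, 64, 64)
timestep = torch.tensor(999.0)
encoder_hidden_states = torch.randn(2, 77, 1024)

# Compile the UNet for NeuronCores (Trainium/Inferentia2) and save the traced module.
traced_unet = torch_neuronx.trace(unet, (sample, timestep, encoder_hidden_states))
torch.jit.save(traced_unet, "unet_neuron.pt")
```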
Below are the detailed steps to accomplish this:
- Compile the model: Use an AWS Trainium (trn1) instance to compile the model and make it compatible with Inf2. This step runs the `compile.py` script (using torch-neuronx from the AWS Neuron SDK) as a SageMaker training job on an ml.trn1.2xlarge instance.
- Deploy the compiled model: After compiling the model, deploy it using a PyTorchModel. During deployment, provide the path to the compiled model in the Amazon S3 bucket and include the custom `inference.py` script.
- Inference: Deploying the model creates a SageMaker real-time endpoint on an Inf2 instance; the returned Predictor is used to run predictions against the deployed model. A condensed sketch of the compile-and-deploy flow follows this list.
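The sketch below outlines this flow with the SageMaker Python SDK; the Neuron DLC image URIs, the `code/` directory layout, and the endpoint name are placeholders/assumptions:

```python
import sagemaker
from sagemaker.pytorch import PyTorch, PyTorchModel

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# 1) Compilation job on Trainium: runs compile.py (the Neuron tracing script)
#    on an ml.trn1.2xlarge instance. The image URI is a placeholder for the
#    Neuron PyTorch training DLC.
compile_job = PyTorch(
    entry_point="compile.py",
    source_dir="code",
    role=role,
    instance_type="ml.trn1.2xlarge",
    instance_count=1,
    image_uri="<neuron-pytorch-training-dlc-image-uri>",
    sagemaker_session=session,
)
compile_job.fit()  # the compiled artifacts are written to S3 as the job's model output

# 2) Deploy the compiled artifacts to an Inf2 endpoint with the custom inference.py.
#    The image URI is a placeholder for the Neuron PyTorch inference DLC.
model = PyTorchModel(
    model_data=compile_job.model_data,  # S3 path to the compiled model.tar.gz
    role=role,
    entry_point="inference.py",
    source_dir="code",
    image_uri="<neuron-pytorch-inference-dlc-image-uri>",
    sagemaker_session=session,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    endpoint_name="stable-diffusion-2-1-inf2",
)
```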
Code to deploy this solution can be found here.
Performance Benchmark
Lastly, we conducted a benchmark test by processing 100 prompts to compare the text2image latency and cost of the three configurations below.
| Configuration | Average latency (sec/img) | Instance type | Instance cost per hour |
|---|---|---|---|
| Stable Diffusion 2.1 base (Default) | 3.91 | ml.g5.xlarge | $1.4084 |
| Stable Diffusion 2.1 base (DJL Serving to host the model using DeepSpeed) | 2.55 | ml.g5.xlarge | $1.4084 |
| Stable Diffusion 2.1 base (Inferentia 2) | 2.36 | ml.inf2.xlarge | $0.99 |
The Inferentia 2 configuration achieved an average latency of 2.36 seconds per image, making it roughly 40% faster than the default configuration (3.91 seconds per image) and about 8% faster than the DJL+DeepSpeed configuration (2.55 seconds per image).
In terms of cost efficiency, the Inferentia 2 configuration costs $649.00 per 1 million images processed: 57.54% less than the default configuration ($1,529.68 per 1 million images) and 34.91% less than the DJL+DeepSpeed configuration ($997.62 per 1 million images).
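These per-image costs follow directly from the average latencies and the hourly instance prices in the table above; a quick sanity check in Python:

```python
# cost per 1M images = latency (sec/img) / 3600 (sec/hour) * hourly price * 1,000,000
configs = {
    "Default (ml.g5.xlarge)":        (3.91, 1.4084),
    "DJL+DeepSpeed (ml.g5.xlarge)":  (2.55, 1.4084),
    "Inferentia 2 (ml.inf2.xlarge)": (2.36, 0.99),
}

for name, (latency_sec, price_per_hour) in configs.items():
    cost_per_million = latency_sec / 3600 * price_per_hour * 1_000_000
    print(f"{name}: ${cost_per_million:,.2f} per 1M images")
# Prints roughly $1,529.68, $997.62, and $649.00 respectively.
```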
Latency comparison (each cell shows the latency of the column configuration relative to the row configuration):

| Latency | Default (3.91s) | DJL+DeepSpeed (2.55s) | Inferentia 2 (2.36s) |
|---|---|---|---|
| Default (3.91s) | 0.00% | -34.78% | -39.64% |
| DJL+DeepSpeed (2.55s) | 53.33% | 0.00% | -7.45% |
| Inferentia 2 (2.36s) | 65.68% | 8.05% | 0.00% |
Cost comparison per 1 million images (each cell shows the cost of the column configuration relative to the row configuration):

| Cost | Default ($1,529.68) | DJL+DeepSpeed ($997.62) | Inferentia 2 ($649.00) |
|---|---|---|---|
| Default ($1,529.68) | 0.00% | -34.79% | -57.54% |
| DJL+DeepSpeed ($997.62) | 53.27% | 0.00% | -34.91% |
| Inferentia 2 ($649.00) | 135.54% | 53.71% | 0.00% |
Conclusion
In this post, we discussed how you can deploy Stable Diffusion 2.1 base using SageMaker DLCs, DJL Serving with DeepSpeed, and Inferentia 2. We also showed through benchmarking that the Inferentia 2 configuration not only delivers significantly lower latency than the other configurations, but also provides substantial cost savings, making it a highly economical choice for serving Stable Diffusion.
In the next blog, How to create an AI image generator application using Stable Diffusion – Part 2/2, you will learn how to extend our Stable Diffusion model to perform image2image. In addition, we will show how to use vector databases to enable image/prompt recommendations. Finally, we will create and deploy our AI image generator application via Streamlit.
Check out our open-source Git repository for more materials. Also, contact us to learn how to deploy and optimize your favorite generative AI model.