Video Generation · Active

THUDM / CogVideoX-2b

CogVideoX-2b

by thudm

Entry-level model, balancing compatibility. Low cost for running and secondary development.

Text to Video, Cogvideo, Fp16

Model ID: CogVideoX-2b

Provider: thudm

Added: 1724927572 (Unix timestamp, i.e. 2024-08-29 UTC)

CogVideoX-2B

📄 Read in Chinese | 🤗 Hugging Face Space | 🌐 GitHub | 📜 arXiv

Demo Show

Video Gallery with Captions

A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.

The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from its tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered throughout. The car is seen from the rear following the curve with ease, making it seem as if it is on a rugged drive through the rugged terrain. The dirt road itself is surrounded by steep hills and mountains, with a clear blue sky above with wispy clouds.

A street artist, clad in a worn-out denim jacket and a colorful bandana, stands before a vast concrete wall in the heart, holding a can of spray paint, spray-painting a colorful bird on a mottled wall.

In the haunting backdrop of a war-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict.

Model Introduction

CogVideoX is an open-source version of the video generation model that originated from QingYing. The table below lists the video generation models we currently offer, along with their basic information.

| Model Name | CogVideoX-2B (This Repository) | CogVideoX-5B |
|---|---|---|
| Model Description | Entry-level model, balancing compatibility. Low cost for running and secondary development. | Larger model with higher video generation quality and better visual effects. |
| Inference Precision | FP16* (Recommended), BF16, FP32, FP8*, INT8; no support for INT4 | BF16 (Recommended), FP16, FP32, FP8*, INT8; no support for INT4 |
| Single GPU VRAM Consumption | FP16: 18 GB using SAT / 12.5 GB* using diffusers; INT8: 7.8 GB* using diffusers with torchao | BF16: 26 GB using SAT / 20.7 GB* using diffusers; INT8: 11.4 GB* using diffusers with torchao |
| Multi-GPU Inference VRAM Consumption | FP16: 10 GB* using diffusers | BF16: 15 GB* using diffusers |
| Inference Speed (Step = 50, FP/BF16) | Single A100: ~90 seconds; Single H100: ~45 seconds | Single A100: ~180 seconds; Single H100: ~90 seconds |
| Fine-tuning Precision | FP16 | BF16 |
| Fine-tuning VRAM Consumption (per GPU) | 47 GB (bs=1, LoRA); 61 GB (bs=2, LoRA); 62 GB (bs=1, SFT) | 63 GB (bs=1, LoRA); 80 GB (bs=2, LoRA); 75 GB (bs=1, SFT) |
| Prompt Language | English* | English* |
| Prompt Length Limit | 226 Tokens | 226 Tokens |
| Video Length | 6 Seconds | 6 Seconds |
| Frame Rate | 8 Frames per Second | 8 Frames per Second |
| Video Resolution | 720 x 480, no support for other resolutions (including fine-tuning) | 720 x 480, no support for other resolutions (including fine-tuning) |
| Positional Encoding | 3d_sincos_pos_embed | 3d_rope_pos_embed |

Data Explanation

The diffusers figures were measured with enable_model_cpu_offload() and pipe.vae.enable_tiling() enabled. This setup has only been tested on NVIDIA A100 / H100; it should generally work on any device with the NVIDIA Ampere architecture and above. With these optimizations disabled, memory usage rises significantly, with peak memory about 3 times the table value.
The CogVideoX-2B model was trained using FP16 precision, so FP16 is recommended for inference.
For multi-GPU inference, the enable_model_cpu_offload() optimization must be disabled.
Using the INT8 model significantly reduces inference speed. The trade-off lets low-memory GPUs run inference with minimal loss of video quality.
Inference speed tests also used the memory optimizations mentioned above. Without them, inference is approximately 10% faster. Only the diffusers version of the model supports quantization.
The model only supports English input; prompts in other languages can be translated into English and refined with a large language model.
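
The memory notes above can be turned into a quick back-of-the-envelope check. The sketch below is illustrative only: the dictionary copies the single-GPU diffusers figures from the table, the factor of 3 comes from the note on disabled optimizations, and the names `TABLE_VRAM_GB`, `estimate_peak_vram_gb` and `fits_on_gpu` are hypothetical helpers, not part of any library.

```python
# Single-GPU VRAM figures in GB, copied from the table (diffusers, measured
# with enable_model_cpu_offload() and VAE tiling enabled).
TABLE_VRAM_GB = {
    ("CogVideoX-2B", "FP16"): 12.5,
    ("CogVideoX-2B", "INT8"): 7.8,
    ("CogVideoX-5B", "BF16"): 20.7,
    ("CogVideoX-5B", "INT8"): 11.4,
}

# The notes say peak memory is "about 3 times the table value" when the
# optimizations are disabled. This is a rough multiplier, not a measurement.
UNOPTIMIZED_FACTOR = 3.0


def estimate_peak_vram_gb(model: str, precision: str, optimized: bool = True) -> float:
    """Return the table value, scaled up if the memory optimizations are off."""
    base = TABLE_VRAM_GB[(model, precision)]
    return base if optimized else base * UNOPTIMIZED_FACTOR


def fits_on_gpu(model: str, precision: str, gpu_vram_gb: float, optimized: bool = True) -> bool:
    """Rough check of whether the estimated peak fits in the given VRAM."""
    return estimate_peak_vram_gb(model, precision, optimized) <= gpu_vram_gb


if __name__ == "__main__":
    for optimized in (True, False):
        estimate = estimate_peak_vram_gb("CogVideoX-2B", "FP16", optimized)
        print(f"2B FP16, optimized={optimized}: ~{estimate:.1f} GB")
```

These are estimates derived from the published table, so treat the result as a first filter rather than a guarantee for a specific GPU.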

Note

Use SAT for inference and fine-tuning of the SAT-version models. Feel free to visit our GitHub for more information.

Quick Start 🤗

This model can be deployed with the Hugging Face diffusers library by following the steps below.
We recommend visiting our GitHub to check out the prompt optimization and conversion tools for a better experience.
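
Because the model lists a prompt length limit of 226 tokens, it can help to check a prompt before running the pipeline. The helper below is a minimal, hypothetical sketch: it takes the tokenizer as a plain callable so it runs without any downloads, and the whitespace split in the demo is only a stand-in that undercounts real tokens. With the actual model you would presumably pass the pipeline's own tokenizer, for example `lambda s: pipe.tokenizer(s).input_ids`, which is an assumption not confirmed on this page.

```python
from typing import Callable, Sequence, Tuple

# "Prompt Length Limit" as stated for CogVideoX on this page.
PROMPT_TOKEN_LIMIT = 226


def check_prompt(
    prompt: str,
    tokenize: Callable[[str], Sequence],
    limit: int = PROMPT_TOKEN_LIMIT,
) -> Tuple[int, bool]:
    """Return (token_count, within_limit) for the given tokenizer callable."""
    token_count = len(tokenize(prompt))
    return token_count, token_count <= limit


if __name__ == "__main__":
    # str.split is a crude placeholder tokenizer used only for this demo.
    count, ok = check_prompt("A panda plays a guitar in a bamboo forest.", str.split)
    print(f"{count} tokens (placeholder count), within limit: {ok}")
```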

Install the required dependencies

# diffusers>=0.30.1
# transformers>=4.44.0
# accelerate>=0.33.0 (suggest install from source)
# imageio-ffmpeg>=0.5.1
pip install --upgrade transformers accelerate diffusers imageio-ffmpeg

Run the code

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

# Load the 2B pipeline in FP16, the recommended precision for this model.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16,
)

# Memory optimizations used for the VRAM figures in the table.
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

# Generate one video with a fixed seed for reproducibility.
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
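
The call above requests `num_frames=49` and exports at `fps=8`. As a quick arithmetic check (a sketch added for illustration, with a hypothetical helper name), the clip length is the frame count divided by the frame rate, which comes out slightly above the nominal 6 seconds listed for the model.

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Playback length of a clip: number of frames divided by frames per second."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


if __name__ == "__main__":
    # Values taken from the pipeline call and export_to_video call above.
    print(clip_duration_seconds(49, 8))  # 6.125
```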

Quantized Inference

PytorchAO and Optimum-quanto can be used to quantize the Text Encoder, Transformer and VAE modules to lower the memory requirement of CogVideoX. This makes it possible to run the model on free-tier T4 Colab or smaller VRAM GPUs as well! It is also worth noting that TorchAO quantization is fully compatible with torch.compile, which allows for much faster inference speed.
# To get started, PytorchAO needs to be installed from the GitHub source and PyTorch Nightly.
# Source and nightly installation is only required until next release.

import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline
from diffusers.utils import export_to_video
from transformers import T5EncoderModel
from torchao.quantization import quantize_, int8_weight_only, int8_dynamic_activation_int8_weight

# Quantization scheme; int8_dynamic_activation_int8_weight can be swapped in here.
quantization = int8_weight_only

# Load each module from the 2B checkpoint and quantize it in place.
text_encoder = T5EncoderModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="text_encoder", torch_dtype=torch.bfloat16)
quantize_(text_encoder, quantization())

transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.bfloat16)
quantize_(transformer, quantization())

vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.bfloat16)
quantize_(vae, quantization())

# Create pipeline and run inference
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)

Additionally, the models can be serialized and stored in a quantized datatype to save disk space when using PytorchAO. Find examples and benchmarks at these links:

torchao
quanto

Explore the Model

Welcome to our GitHub, where you will find:

More detailed technical explanations and code walkthroughs.
Optimization and conversion of prompts.
Inference and fine-tuning of SAT-version models, including pre-release versions.
Project update logs and more opportunities for interaction.
The CogVideoX toolchain to help you make better use of the model.
INT8 model inference code support.

Model License

The CogVideoX-2B model (including its corresponding Transformer module and VAE module) is released under the Apache 2.0 License.
The CogVideoX-5B model (Transformer module) is released under the CogVideoX LICENSE.

Citation

@article{yang2024cogvideox,
title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
journal={arXiv preprint arXiv:2408.06072},
year={2024}
}


Example prompts

Great starting points for CogVideoX-2b.

A majestic view of an untouched tropical rainforest during sunrise, with vibrant green foliage glistening from morning dew, sunlight breaking through the thick canopy of trees, exotic colorful birds like toucans and parrots flying in slow motion, a gentle mist covering the ground, and a small stream flowing over mossy rocks, accompanied by the faint sounds of chirping birds and rustling leaves.

An ancient stone castle standing tall on a grassy hilltop, surrounded by a dense morning fog, its weathered walls glowing faintly in the golden light of dawn, with distant silhouettes of knights on horseback patrolling the grounds, and the faint fluttering of flags bearing medieval crests. A soft breeze rustling the tall grass adds a timeless atmosphere.

A vibrant underwater scene teeming with life, featuring a majestic sea turtle gliding gracefully above a colorful coral reef, schools of fish shimmering in the sunlight, small sharks swimming in the background, and rays of light piercing the crystal-clear blue water, creating a sense of peace and wonder in the ocean depths.

A close-up shot of a powerful tiger prowling through tall golden savanna grass under the warm glow of a setting sun, its intense amber eyes locked onto its surroundings, each muscle rippling with quiet strength, the background blending into a serene mix of orange and purple hues of the evening sky.

A determined athlete climbing a steep rocky mountain under a fiery orange sky, with sweat glistening on their face, a backpack slung over their shoulder, gripping onto jagged rocks with one hand while reaching for the summit with the other, the sun breaking over the peak, symbolizing perseverance and triumph over adversity.

API quick start

Run CogVideoX-2b with a single API call.

POST https://api.wiro.ai/v1/Run/THUDM/CogVideoX-2b

{
  "prompt": "A sprawling futuristic city illuminated b…",
  "steps": 50,
  "scale": 6.0,
  "seed": 123456
}
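
A minimal Python sketch of the call above, using only the standard library. The URL and JSON fields are taken from the example; authentication is not described here, so `extra_headers` is a placeholder for whatever credentials the API requires, and nothing is assumed about the response format. The helper name `build_run_request` is hypothetical, and it only builds the request; sending it is left to the commented line.

```python
import json
import urllib.request
from typing import Dict, Optional

API_URL = "https://api.wiro.ai/v1/Run/THUDM/CogVideoX-2b"


def build_run_request(
    prompt: str,
    steps: int = 50,
    scale: float = 6.0,
    seed: int = 123456,
    extra_headers: Optional[Dict[str, str]] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request with the JSON body shown above."""
    payload = {"prompt": prompt, "steps": steps, "scale": scale, "seed": seed}
    headers = {"Content-Type": "application/json"}
    # Assumption: any required credentials are passed as HTTP headers.
    headers.update(extra_headers or {})
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


if __name__ == "__main__":
    request = build_run_request("A wooden toy ship gliding over a blue carpet")
    print(request.get_method(), request.full_url)
    # To actually send it (requires network access and valid credentials):
    # with urllib.request.urlopen(request) as response:
    #     print(response.read().decode("utf-8"))
```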
