Introduction
I have written about CogVideoX before:
touch-sp.hatenablog.com
touch-sp.hatenablog.com
This time I ran Text2Video with "CogVideoX1.5-5B".
PCs Used
I ran the script on two PCs.

PC1 (desktop)
OS: Windows 11
CPU: Core(TM) i7-14700K
RAM: 96.0 GB
GPU: RTX 4090 (VRAM 24 GB)
CUDA 12.4
Python 3.12
PC2 (laptop)
OS: Windows 11
CPU: Intel(R) Core(TM) i7-12700H
RAM: 32.0 GB
GPU: RTX 3080 Laptop (VRAM 16 GB)
CUDA 11.8
Python 3.12
Setting Up the Python Environment
Adjust the CUDA version to match your environment.

pip install torch==2.5.1+cu124 --index-url https://download.pytorch.org/whl/cu124
pip install diffusers[torch]
pip install transformers sentencepiece imageio imageio-ffmpeg
pip install torchao
diffusers==0.32.1
imageio==2.36.1
imageio-ffmpeg==0.5.1
sentencepiece==0.2.0
torch==2.5.1+cu124
torchao==0.7.0
transformers==4.48.0
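As a quick sanity check, the pinned versions above can be compared against what is actually installed. This small helper (my own sketch, not part of any package) uses only the standard library:

```python
from importlib import metadata

# Pinned versions from this article's environment.
PINNED = {
    "diffusers": "0.32.1",
    "imageio": "2.36.1",
    "imageio-ffmpeg": "0.5.1",
    "sentencepiece": "0.2.0",
    "torchao": "0.7.0",
    "transformers": "4.48.0",
}


def check_versions(pinned: dict) -> dict:
    """Map each package name to 'ok', 'missing', or a mismatch message."""
    report = {}
    for name, want in pinned.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if have == want else f"installed {have}, pinned {want}"
    return report


if __name__ == "__main__":
    for name, status in check_versions(PINNED).items():
        print(f"{name}: {status}")
```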
Python Script
import gc

import torch
from diffusers import (
    AutoencoderKLCogVideoX,
    CogVideoXPipeline,
    CogVideoXTransformer3DModel,
    TorchAoConfig,
)
from diffusers.utils import export_to_video

from decorator import gpu_monitor, time_monitor


def reset_memory():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_accumulated_memory_stats()
    torch.cuda.reset_peak_memory_stats()


def print_memory():
    max_memory = round(torch.cuda.max_memory_allocated() / 1024**3, 2)
    max_reserved = round(torch.cuda.max_memory_reserved() / 1024**3, 2)
    print(f"{max_memory=} GB")
    print(f"{max_reserved=} GB")


@gpu_monitor(interval=0.5)
@time_monitor
def main():
    # downloaded from https://huggingface.co/THUDM/CogVideoX1.5-5B
    model_id = "CogVideoX1.5-5B"

    prompt = (
        "A panda sits on a wooden stool in a serene bamboo forest. "
        "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. "
        "Sunlight filters through the tall bamboo, casting a gentle glow on the scene. "
        "The panda's face is expressive, showing concentration and joy as it plays. "
        "The background includes a small, flowing stream and vibrant green foliage, "
        "enhancing the peaceful and magical atmosphere of this unique musical performance."
    )

    # Step 1: load only the text encoder and compute the prompt embeddings
    pipe = CogVideoXPipeline.from_pretrained(
        model_id,
        transformer=None,
        vae=None,
        torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()
    with torch.no_grad():
        prompt_embeds, negative_prompt_embeds = pipe.encode_prompt(
            prompt=prompt
        )
    print("text_encoder:")
    print_memory()

    # Step 2: free the text encoder, then load the transformer and VAE
    # with int8 weight-only quantization
    del pipe
    reset_memory()

    quantization_config = TorchAoConfig("int8wo")
    transformer = CogVideoXTransformer3DModel.from_pretrained(
        model_id,
        subfolder="transformer",
        quantization_config=quantization_config,
        torch_dtype=torch.bfloat16
    )
    vae = AutoencoderKLCogVideoX.from_pretrained(
        model_id,
        subfolder="vae",
        quantization_config=quantization_config,
        torch_dtype=torch.bfloat16
    )
    pipe = CogVideoXPipeline.from_pretrained(
        model_id,
        transformer=transformer,
        vae=vae,
        text_encoder=None,
        tokenizer=None,
        torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()
    pipe.vae.enable_tiling()
    pipe.vae.enable_slicing()

    video = pipe(
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_prompt_embeds,
        width=768,
        height=768,
        num_videos_per_prompt=1,
        num_inference_steps=50,
        num_frames=81,
        guidance_scale=6,
        generator=torch.Generator().manual_seed(42)
    ).frames[0]
    export_to_video(video, "output.mp4", fps=8)

    print("transformer and vae:")
    print_memory()


if __name__ == "__main__":
    main()
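The script quantizes the transformer and VAE with TorchAo's int8 weight-only setting ("int8wo"). As a rough back-of-envelope sketch of why that helps on VRAM (the 5B parameter count is an approximation, and activations, embeddings, and reserved memory are ignored):

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Rough weight-only memory footprint in GiB (activations excluded)."""
    return num_params * bytes_per_param / 1024**3


# CogVideoX1.5-5B transformer: roughly 5 billion parameters (approximate).
params = 5e9
bf16 = weight_memory_gb(params, 2)   # bfloat16: 2 bytes per weight
int8 = weight_memory_gb(params, 1)   # int8 weight-only: 1 byte per weight

print(f"bf16 weights: {bf16:.2f} GiB, int8wo weights: {int8:.2f} GiB")
```

Halving the weight footprint is what lets the full pipeline, combined with CPU offload and VAE tiling/slicing, fit within the 16 GB of the RTX 3080 Laptop.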
Results
PC1 (RTX 4090)
text_encoder:
max_memory=8.93 GB
max_reserved=8.96 GB
transformer and vae:
max_memory=9.09 GB
max_reserved=12.27 GB
time: 595.49 sec
GPU 0 - Used memory: 13.93/23.99 GB
PC2 (RTX 3080 Laptop)
text_encoder:
max_memory=8.93 GB
max_reserved=8.96 GB
transformer and vae:
max_memory=9.09 GB
max_reserved=12.27 GB
time: 3199.73 sec
GPU 0 - Used memory: 12.82/16.00 GB
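For a rough comparison of the two runs, the total time can be divided by the 50 inference steps. This is only an approximation, since the total also includes fixed costs such as text encoding and VAE decoding:

```python
def seconds_per_step(total_s: float, steps: int) -> float:
    """Naive per-step time; total time also includes non-denoising work."""
    return total_s / steps


pc1 = 595.49    # RTX 4090 total seconds (from the results above)
pc2 = 3199.73   # RTX 3080 Laptop total seconds

print(f"PC1: {seconds_per_step(pc1, 50):.1f} s/step")
print(f"PC2: {seconds_per_step(pc2, 50):.1f} s/step")
print(f"PC1 is {pc2 / pc1:.1f}x faster overall")
```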
The generated videos are posted on Google Blogger.
support-touchsp.blogspot.com
Compared to the previous model, the quality has improved dramatically.
Miscellaneous
The benchmark was run with the script described in this article:
touch-sp.hatenablog.com