https://touch-sp.hatenablog.com/entry/2025/05/11/101849

Introduction

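"FramePack" appears to be a technique that makes it possible to generate long, high-quality videos with relatively little VRAM.

It seems to have been developed mainly by Lvmin Zhang, who has quite a track record.

His earlier work includes "ControlNet", "Fooocus", and "Stable Diffusion WebUI Forge".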

Python environment setup

pip install torch==2.6.0+cu126 --index-url https://download.pytorch.org/whl/cu126
pip install git+https://github.com/huggingface/diffusers
pip install transformers accelerate bitsandbytes
pip install imageio imageio-ffmpeg
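Before running the script, it can help to confirm that everything installed above is importable. The sketch below is only a convenience check, not part of the original post. The module list mirrors the pip commands, plus bitsandbytes, which the NF4 quantization (BitsAndBytesConfig) in the script relies on; note that the imageio-ffmpeg package is imported as imageio_ffmpeg.

```python
"""Quick sanity check that the required packages can be found.

Sketch only: the module list is an assumption based on the pip
commands and imports used in this post.
"""
import importlib.util

REQUIRED_MODULES = [
    "torch",
    "diffusers",
    "transformers",
    "accelerate",
    "bitsandbytes",    # needed by the NF4 BitsAndBytesConfig
    "imageio",
    "imageio_ffmpeg",  # pip name: imageio-ffmpeg
]


def missing_modules(names):
    """Return the module names that cannot be found on sys.path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        import torch  # imported only after confirming it is installed

        print("torch", torch.__version__, "CUDA available:", torch.cuda.is_available())
```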

Python script

Since this is "Image2Video", you pass in an image, and you can actually pass two images: the first frame and the last frame.

import torch
from diffusers import BitsAndBytesConfig, HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel
from diffusers.utils import export_to_video, load_image
from transformers import SiglipImageProcessor, SiglipVisionModel

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained(
    "lllyasviel/FramePackI2V_HY",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16
)
 
feature_extractor = SiglipImageProcessor.from_pretrained(
    "lllyasviel/flux_redux_bfl",
    subfolder="feature_extractor"
)

image_encoder = SiglipVisionModel.from_pretrained(
    "lllyasviel/flux_redux_bfl",
    subfolder="image_encoder",
    torch_dtype=torch.float16
)
pipe = HunyuanVideoFramepackPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    transformer=transformer,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
    torch_dtype=torch.float16
)

pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

prompt = (
    "CG animation style, a small blue bird takes off from the ground, "
    "and quickly flapping its wings up and down repeatedly. "
    "The camera follows the bird upward, capturing its flight up in the air. "
    "A close-up, low-angle perspective."
)

first_image = load_image("first_frame.png")
last_image = load_image("last_frame.png")

output = pipe(
    image=first_image,
    last_image=last_image,
    prompt=prompt,
    height=512,
    width=512,
    num_frames=91,
    num_inference_steps=30,
    guidance_scale=9.0,
    generator=torch.Generator().manual_seed(1000),
).frames[0]

export_to_video(output, "output.mp4", fps=30)
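A note on the prompt in the script: Python glues adjacent string literals together with no separator, so each fragment has to end with its own space or the words run together (for example "ground,and"). A minimal alternative, sketched below, is to keep the fragments in a list and join them with a single space; the fragment wording here is taken from the prompt above.

```python
# Adjacent string literals are concatenated with no separator at all.
fragile = (
    "takes off from the ground,"
    "and quickly flapping"
)
print(repr(fragile))  # note the missing space after the comma

# Joining a list of fragments with " " puts exactly one space between them,
# so no fragment needs a trailing space.
PROMPT_PARTS = [
    "CG animation style, a small blue bird takes off from the ground,",
    "and quickly flapping its wings up and down repeatedly.",
    "The camera follows the bird upward, capturing its flight up in the air.",
    "A close-up, low-angle perspective.",
]
prompt = " ".join(PROMPT_PARTS)
print(prompt)
```

The resulting string can be passed as prompt=prompt to the pipeline call unchanged.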

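To get a feel for why the NF4 setting matters, here is a rough estimate of how much memory the transformer weights alone would take at 16 bits versus 4 bits per parameter. This is back-of-the-envelope arithmetic under an assumption: roughly 13 billion parameters, the size usually quoted for HunyuanVideo, which is not a measured figure for lllyasviel/FramePackI2V_HY. Real usage will differ, since NF4 stores extra quantization constants, some layers may stay unquantized, and activations are not counted. The script also calls enable_model_cpu_offload(), which keeps sub-models in CPU memory until they are needed.

```python
"""Rough weight-memory estimate for the transformer (sketch only)."""

# Hypothetical round figure, assumed for illustration and not measured.
ASSUMED_PARAMS = 13_000_000_000


def weight_bytes(num_params: int, bits_per_param: int) -> int:
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits_per_param // 8


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


for label, bits in [("bfloat16", 16), ("NF4 (4-bit)", 4)]:
    size = to_gib(weight_bytes(ASSUMED_PARAMS, bits))
    print(f"{label:12s} ~{size:.1f} GiB of weights")
```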
Input images

Results

I have posted the results on Google Blogger:
support-touchsp.blogspot.com
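For reference, the clip length follows directly from the settings in the script: num_frames=91 and export_to_video(..., fps=30). The small sketch below does that arithmetic, assuming the pipeline returns exactly the requested number of frames (FramePack generates in sections, so the actual count may differ slightly). The helper names are hypothetical.

```python
def clip_seconds(num_frames: int, fps: float) -> float:
    """Playback length of a clip with num_frames frames at fps frames per second."""
    return num_frames / fps


def frames_for_duration(seconds: float, fps: float) -> int:
    """Number of frames needed for a clip of the given length."""
    return round(seconds * fps)


# Settings taken from the script above.
print(f"{clip_seconds(91, 30):.2f} s")
# Frames a one-minute clip would need at the same frame rate.
print(frames_for_duration(60, 30), "frames")
```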
