Introduction
Black Forest Labs has released FLUX.2-dev.
It feels like a combination of FLUX.1-dev and FLUX.1 Kontext-dev in a single model.
Results
I compared it with FLUX.1 Kontext-dev.
The left image is the input, the right is the generated result.
FLUX.2-dev result


The hands look a bit too large to me.
FLUX.1 Kontext-dev result


Here the arm length looks unnatural. See this article for details.
Python script
```python
import torch
from diffusers import Flux2Pipeline
from diffusers.utils import load_image

repo_id = "diffusers/FLUX.2-dev-bnb-4bit"  # quantized text-encoder and DiT. VAE still in bf16
device = "cuda"
torch_dtype = torch.bfloat16

pipe = Flux2Pipeline.from_pretrained(repo_id, torch_dtype=torch_dtype)
pipe.enable_model_cpu_offload()

prompt = "Make the lady hold a sign that says 'FLUX.2 dev is awesome'"
image = load_image("girl.jpg")

image = pipe(
    prompt=prompt,
    image=[image],  # multi-image input
    generator=torch.Generator(device=device).manual_seed(42),
    num_inference_steps=50,
    guidance_scale=4,
).images[0]
image.save("flux2_output.png")
```
Environment setup
Here is the pyproject.toml I used.
With uv, running uv sync should be all it takes to set up the environment.
```toml
[project]
name = "flux"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.13"
dependencies = [
    "accelerate==1.12.0",
    "bitsandbytes==0.48.2",
    "diffusers @ git+https://github.com/huggingface/diffusers@c8656ed73c638e51fc2e777a5fd355d69fa5220f",
    "hf-xet==1.2.0",
    "torch==2.9.1+cu126",
    "torchvision==0.24.1+cu126",
    "transformers==4.57.3",
]

[[tool.uv.index]]
name = "torch-cuda"
url = "https://download.pytorch.org/whl/cu126"
explicit = true

[tool.uv.sources]
torch = [{ index = "torch-cuda" }]
torchvision = [{ index = "torch-cuda" }]
```
Appendix
Here are the results of testing the effect of Flash Attention, measured on an RTX 4090.
I built Flash Attention following this article.
With Flash Attention
```
model load: 14.84 sec
flash attention: 0.00 sec
image generation: 246.33 sec
total time: 261.17 sec
GPU 0 - Used memory: 23.86/23.99 GB
```
Without Flash Attention
```
model load: 14.94 sec
image generation: 268.20 sec
total time: 283.15 sec
GPU 0 - Used memory: 23.87/23.99 GB
```
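From the two timings above, the generation-time speedup works out to roughly 8%:

```python
# Speedup computed from the measured generation times above (RTX 4090, 50 steps).
with_fa = 246.33     # image generation with Flash Attention (sec)
without_fa = 268.20  # image generation without Flash Attention (sec)

speedup = without_fa / with_fa
saving = (1 - with_fa / without_fa) * 100
print(f"{speedup:.2f}x faster ({saving:.1f}% less generation time)")
```

Memory usage is essentially unchanged, so the benefit here is purely speed.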
Python code
```python
import time

import torch
from diffusers import Flux2Pipeline
from diffusers.utils import load_image

from decorator import gpu_monitor


@gpu_monitor(interval=0.5)
def main():
    start_time = time.time()

    repo_id = "diffusers/FLUX.2-dev-bnb-4bit"  # quantized text-encoder and DiT. VAE still in bf16
    device = "cuda"
    torch_dtype = torch.bfloat16

    pipe = Flux2Pipeline.from_pretrained(repo_id, torch_dtype=torch_dtype)
    model_load_time = time.time()

    pipe.transformer.set_attention_backend("flash")
    flash_attention_time = time.time()

    pipe.enable_model_cpu_offload()

    prompt = "Make the lady hold a sign that says 'FLUX.2 dev is awesome'"
    image = load_image("girl.jpg")

    image = pipe(
        prompt=prompt,
        image=[image],  # multi-image input
        generator=torch.Generator(device=device).manual_seed(42),
        num_inference_steps=50,
        guidance_scale=4,
    ).images[0]
    image.save("flux2_output.png")
    end_time = time.time()

    print(f"model load: {(model_load_time - start_time):.2f} sec")
    print(f"flash attention: {(flash_attention_time - model_load_time):.2f} sec")
    print(f"image generation: {(end_time - flash_attention_time):.2f} sec")
    print(f"total time: {(end_time - start_time):.2f} sec")


if __name__ == "__main__":
    main()
```
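The gpu_monitor decorator above comes from a local decorator module that is not shown in this post. As a reference, here is a minimal stdlib-only sketch of what such a decorator might look like, polling nvidia-smi from a background thread and printing peak memory at the end; all names and details here are assumptions, not the actual implementation:

```python
import shutil
import subprocess
import threading
import time


def gpu_monitor(interval=0.5):
    """Decorator sketch: sample GPU 0 memory usage every `interval` seconds."""
    def decorate(func):
        def wrapper(*args, **kwargs):
            peak = [0.0]   # peak memory.used seen so far (MiB)
            total = [0.0]  # memory.total (MiB)
            stop = threading.Event()

            def poll():
                # Query used/total memory until the wrapped function finishes.
                while not stop.is_set():
                    out = subprocess.run(
                        ["nvidia-smi",
                         "--query-gpu=memory.used,memory.total",
                         "--format=csv,noheader,nounits"],
                        capture_output=True, text=True,
                    ).stdout.strip()
                    if out:
                        used, tot = map(float, out.splitlines()[0].split(","))
                        peak[0] = max(peak[0], used)
                        total[0] = tot
                    time.sleep(interval)

            monitor = None
            if shutil.which("nvidia-smi"):  # degrade gracefully without a GPU
                monitor = threading.Thread(target=poll, daemon=True)
                monitor.start()
            try:
                return func(*args, **kwargs)
            finally:
                stop.set()
                if monitor:
                    monitor.join()
                    print(f"GPU 0 - Used memory: "
                          f"{peak[0] / 1024:.2f}/{total[0] / 1024:.2f} GB")
        return wrapper
    return decorate
```

On a machine without nvidia-smi the decorator simply runs the function unchanged, so the script still works on CPU-only environments.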