Introduction
OmniGen is a model that can handle many different tasks. The previous article is here: touch-sp.hatenablog.com
This time, I generate a new image while preserving the pose of an existing one.
It is similar to ControlNet, except that OmniGen can produce the new image directly, without first generating a separate pose image.
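OmniGen is driven entirely through the text prompt: each input image is referenced with a numbered placeholder of the form `<img><|image_1|></img>`, and the instruction (here, to follow the pose) is written around it. A minimal sketch of how such a prompt is assembled (the scene description is illustrative, not part of any fixed API):

```python
# OmniGen references the i-th input image via a numbered placeholder in the prompt.
placeholder = "<img><|image_1|></img>"

# Illustrative scene text; any description can go here.
scene = "A young boy is sitting on a sofa in the library, holding a book."

prompt = f"Following the pose of this image {placeholder}, generate a new photo: {scene}"
print(prompt)
```

The same placeholder convention appears in both OmniGen scripts below.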
Source image

Results

The hands are rendered oddly; that may be one of the model's weaknesses.
For comparison, I also include the result of using ControlNet with FLUX.1-dev.

Python scripts
Creating the source image
The source image was created with "FLUX.1-dev" and a LoRA called "anzu-flux v2.2". Some parts were then retouched with "FLUX.1-Fill-dev".
import torch
from diffusers import FluxPipeline
import gc

def flush():
    gc.collect()
    torch.cuda.empty_cache()

model_id = "black-forest-labs/FLUX.1-dev"

prompt = "Realistic photo. A young woman sits on a sofa, holding a book and facing the camera. She wears delicate silver hoop earrings adorned with tiny, sparkling diamonds that catch the light, with her long chestnut hair cascading over her shoulders. Her eyes are focused and gentle, framed by long, dark lashes. She is dressed in a cozy cream sweater, which complements her warm, inviting smile. Behind her, there is a table with a cup of water in a sleek, minimalist blue mug. The background is a serene indoor setting with soft natural light filtering through a window, adorned with tasteful art and flowers, creating a cozy and peaceful ambiance. 4K, HD."

# Step 1: load only the text encoders and encode the prompt.
pipeline = FluxPipeline.from_pretrained(
    model_id,
    transformer=None,
    vae=None
).to("cuda")

with torch.no_grad():
    prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
        prompt=prompt,
        prompt_2=None,
    )

# Free the text encoders before loading the transformer, to keep peak VRAM low.
del pipeline
flush()

# Step 2: load the transformer and VAE without the text encoders.
pipeline = FluxPipeline.from_pretrained(
    model_id,
    text_encoder=None,
    text_encoder_2=None,
    tokenizer=None,
    tokenizer_2=None,
    torch_dtype=torch.bfloat16
)
pipeline.load_lora_weights("anzu-flux-LoRA_v22.safetensors")
pipeline.enable_sequential_cpu_offload()

seed = 20250228
generator = torch.Generator().manual_seed(seed)

image = pipeline(
    prompt_embeds=prompt_embeds.bfloat16(),
    pooled_prompt_embeds=pooled_prompt_embeds.bfloat16(),
    width=1024,
    height=1024,
    num_inference_steps=27,
    generator=generator,
    guidance_scale=3.5,
    joint_attention_kwargs={"scale": 1.0},
).images[0]

image.save(f"lora_result_seed{seed}.jpg")
Image generation with OmniGen
import torch
from diffusers import OmniGenPipeline
from diffusers.utils import load_image

pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

# The source image is referenced via the <img><|image_1|></img> placeholder.
prompt = "Following the pose of this image <img><|image_1|></img>, generate a new photo: A young boy is sitting on a sofa in the library, holding a book. His hair is neatly combed, and a faint smile plays on his lips, with a few freckles scattered across his cheeks. The library is quiet, with rows of shelves filled with books stretching out behind him."
input_images = [load_image("lora_result_seed20250228.jpg")]

seed = 20250301
for i in range(3):
    new_seed = seed + 12345 * i
    generator = torch.manual_seed(new_seed)
    image = pipe(
        prompt=prompt,
        input_images=input_images,
        guidance_scale=2,
        img_guidance_scale=1.6,
        use_input_image_size_as_output=True,
        generator=generator
    ).images[0]
    image.save(f"omnigen_result_seed{new_seed}.jpg")
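Both this script and the ControlNet script below sweep three seeds with the same fixed offset of 12345. The schedule is simple enough to factor into a small helper (`seed_schedule` is a hypothetical name, not part of any library):

```python
def seed_schedule(base_seed: int, count: int, stride: int = 12345) -> list[int]:
    # Deterministic seeds matching the generation loops: base_seed + stride * i.
    return [base_seed + stride * i for i in range(count)]

# The loops above start from seed 20250301 and run three iterations.
print(seed_schedule(20250301, 3))  # [20250301, 20262646, 20274991]
```

Keeping the schedule deterministic makes it easy to regenerate any single output later from its filename.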
Image generation with FLUX.1-dev
With ControlNet, a pose image has to be created first. I used OmniGen for this step as well.

import torch
from diffusers import OmniGenPipeline
from diffusers.utils import load_image

pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

# Ask OmniGen to extract a skeleton (pose) image from the source photo.
prompt = "Detect the skeleton of human in this image: <img><|image_1|></img>"
input_images = [load_image("lora_result_seed20250228.jpg")]

seed = 20250301
generator = torch.manual_seed(seed)
image1 = pipe(
    prompt=prompt,
    input_images=input_images,
    guidance_scale=2,
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
    generator=generator
).images[0]
image1.save("pose.png")

New images were then generated from that pose image.
import torch
from diffusers.utils import load_image
from diffusers import FluxControlNetPipeline, FluxControlNetModel

base_model = 'black-forest-labs/FLUX.1-dev'
controlnet_model = 'InstantX/FLUX.1-dev-Controlnet-Union'

controlnet = FluxControlNetModel.from_pretrained(
    controlnet_model,
    torch_dtype=torch.bfloat16
)
pipe = FluxControlNetPipeline.from_pretrained(
    base_model,
    controlnet=controlnet,
    torch_dtype=torch.bfloat16
)
pipe.enable_sequential_cpu_offload()

control_image = load_image("pose.png")  # pose image produced by OmniGen above
controlnet_conditioning_scale = 0.5
control_mode = 4  # selects the pose mode of the Union ControlNet
width, height = control_image.size

prompt = "A young boy is sitting on a sofa in the library, holding a book. His hair is neatly combed, and a faint smile plays on his lips, with a few freckles scattered across his cheeks. The library is quiet, with rows of shelves filled with books stretching out behind him."

seed = 20250301
for i in range(3):
    new_seed = seed + 12345 * i
    generator = torch.manual_seed(new_seed)
    image = pipe(
        prompt,
        control_image=control_image,
        control_mode=control_mode,
        width=width,
        height=height,
        controlnet_conditioning_scale=controlnet_conditioning_scale,
        num_inference_steps=24,
        guidance_scale=3.5,
        generator=generator
    ).images[0]
    image.save(f"flux_controlnet_result_seed{new_seed}.jpg")
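The bare integer `control_mode = 4` selects pose control in the Union ControlNet. For readability, the mode indices can be given names; the mapping below is taken from my reading of the InstantX model card, so verify it against the card before relying on it:

```python
# Mode indices for InstantX/FLUX.1-dev-Controlnet-Union
# (as listed on the model card; treat as an assumption to double-check).
CONTROL_MODES = {
    "canny": 0,
    "tile": 1,
    "depth": 2,
    "blur": 3,
    "pose": 4,
    "gray": 5,
    "low_quality": 6,
}

control_mode = CONTROL_MODES["pose"]  # equivalent to control_mode = 4 in the script above
print(control_mode)
```

Using a named lookup avoids silently passing the wrong conditioning type when switching between, say, depth and pose inputs.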