https://touch-sp.hatenablog.com/entry/2026/01/15/070925

Introduction

GLM-Image handles both Text2Image and Image2Image with a single model.

I ran both modes and compared the results with Qwen-Image.

I used 4-bit quantization so that everything runs on a single RTX 4090 (24 GB VRAM).
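As a rough illustration of why 4-bit quantization helps here, the sketch below estimates the size of model weights at 16 bits versus 4 bits per parameter. The parameter count is a hypothetical round number chosen only for illustration, not the actual size of GLM-Image, and the estimate ignores activations, quantization constants, and any components left unquantized.

```python
def weight_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate size of model weights in GiB (weights only)."""
    return num_params * bits_per_param / 8 / 2**30


# Hypothetical parameter count, used only for illustration.
params = 10e9

bf16_gib = weight_gib(params, 16)  # bfloat16 weights
nf4_gib = weight_gib(params, 4)    # 4-bit weights, ignoring NF4 overhead

print(f"bf16: {bf16_gib:.1f} GiB, 4-bit: {nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the bfloat16 footprint, which is the margin that makes a 24 GB card workable.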

Text to Image Generation (compared with Qwen-Image-2512)

GLM-Image

Qwen-Image-2512

Image to Image Generation (compared with Qwen-Image-Edit-2511)

Source images

Results

GLM-Image

Qwen-Image-Edit-2511

Python scripts

Text to Image Generation

import torch
from diffusers.pipelines.glm_image import GlmImagePipeline
from diffusers.quantizers import PipelineQuantizationConfig

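# Quantize the text encoder and transformer to 4-bit NF4 with bitsandbytes;
# computation runs in bfloat16.
pipeline_quant_config = PipelineQuantizationConfig(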
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16
    },
    components_to_quantize=["text_encoder", "transformer"]
)

pipe = GlmImagePipeline.from_pretrained(
    "zai-org/GLM-Image",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16
)

# Keep components on the CPU and move each one to the GPU only while it is in use.
pipe.enable_model_cpu_offload()

prompt = "A beautifully designed modern food magazine style dessert recipe illustration, themed around a raspberry mousse cake. The overall layout is clean and bright, divided into four main areas: the top left features a bold black title 'Raspberry Mousse Cake Recipe Guide', with a soft-lit close-up photo of the finished cake on the right, showcasing a light pink cake adorned with fresh raspberries and mint leaves; the bottom left contains an ingredient list section, titled 'Ingredients' in a simple font, listing 'Flour 150g', 'Eggs 3', 'Sugar 120g', 'Raspberry puree 200g', 'Gelatin sheets 10g', 'Whipping cream 300ml', and 'Fresh raspberries', each accompanied by minimalist line icons (like a flour bag, eggs, sugar jar, etc.); the bottom right displays four equally sized step boxes, each containing high-definition macro photos and corresponding instructions, arranged from top to bottom as follows: Step 1 shows a whisk whipping white foam (with the instruction 'Whip egg whites to stiff peaks'), Step 2 shows a red-and-white mixture being folded with a spatula (with the instruction 'Gently fold in the puree and batter'), Step 3 shows pink liquid being poured into a round mold (with the instruction 'Pour into mold and chill for 4 hours'), Step 4 shows the finished cake decorated with raspberries and mint leaves (with the instruction 'Decorate with raspberries and mint'); a light brown information bar runs along the bottom edge, with icons on the left representing 'Preparation time: 30 minutes', 'Cooking time: 20 minutes', and 'Servings: 8'. The overall color scheme is dominated by creamy white and light pink, with a subtle paper texture in the background, featuring compact and orderly text and image layout with clear information hierarchy."

image = pipe(
    prompt=prompt,
    height=32 * 32,  # 1024
    width=32 * 36,   # 1152
    num_inference_steps=50,
    guidance_scale=1.5,
    generator=torch.Generator(device="cuda").manual_seed(42),
).images[0]

image.save("output_t2i.png")
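The script writes the size as 32*32 by 32*36, which suggests the dimensions are meant to be multiples of 32. Assuming that is a requirement of the pipeline (an assumption on my part, not something verified here), a small helper like the hypothetical `snap_to_multiple` below can round an arbitrary target size to the nearest valid value before passing it as `height` or `width`.

```python
def snap_to_multiple(value: int, multiple: int = 32) -> int:
    """Round value to the nearest positive multiple of `multiple` (ties round up)."""
    if value <= 0 or multiple <= 0:
        raise ValueError("value and multiple must be positive")
    snapped = (value + multiple // 2) // multiple * multiple
    # Never return 0: fall back to the smallest valid size.
    return max(multiple, snapped)


# Hypothetical target size that is not aligned to 32.
height = snap_to_multiple(1000)
width = snap_to_multiple(1152)
print(height, width)
```

The sizes used in the script above, 1024 and 1152, pass through unchanged.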

Image to Image Generation

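The vision_language_encoder can also be quantized, but doing so degraded the quality of the generated images.

In addition, the script raised an error unless the following line was included.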

pipe.vision_language_encoder.to("cuda")
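To repeat the quality comparison with and without quantizing the vision_language_encoder, the component list can be built by a small helper instead of being edited by hand. This is only a sketch: the helper name `components_to_quantize` is hypothetical, and it assumes the string "vision_language_encoder" (taken from the pipeline attribute name) is the name to list for that component.

```python
def components_to_quantize(include_vision_language_encoder: bool = False) -> list[str]:
    """Return the component names to pass as components_to_quantize."""
    components = ["text_encoder", "transformer"]
    if include_vision_language_encoder:
        # Assumed component name; quantizing it lowered image quality in this post.
        components.append("vision_language_encoder")
    return components


default_components = components_to_quantize()
all_components = components_to_quantize(include_vision_language_encoder=True)
print(default_components)
print(all_components)
```

The default matches the configuration used in the scripts in this post, so the returned list can be passed straight to `PipelineQuantizationConfig(components_to_quantize=...)`.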

import torch
from diffusers.pipelines.glm_image import GlmImagePipeline
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.utils import load_image

pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16
    },
    components_to_quantize=["text_encoder", "transformer"]
)

pipe = GlmImagePipeline.from_pretrained(
    "zai-org/GLM-Image",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16
)

pipe.enable_model_cpu_offload()

# Workaround: without this line the pipeline raised an error.
pipe.vision_language_encoder.to("cuda")

# Two reference images that are combined into one scene by the prompt below.
image1 = load_image("1.jpg").convert("RGB")
image2 = load_image("2.jpg").convert("RGB")

prompt = "Two women are sitting side by side on a sofa in a cafe."

image = pipe(
    prompt=prompt,
    image=[image1, image2],
    height=32 * 32,  # 1024
    width=32 * 32,   # 1024
    num_inference_steps=50,
    guidance_scale=1.5,
    generator=torch.Generator(device="cuda").manual_seed(42),
).images[0]

image.save("output_i2i.png")
