https://kazuhito00.hatenablog.com/entry/2024/08/20/204803

MiniCPM-V2.6 は、単一画像、複数画像、動画などを処理できるマルチモーダルLLMです。
個人的な感想ですが、この手のローカルで動かせるVLMで、複数画像や動画を処理できるものは珍しい気がしますね👀

MiniCPM-V2.6は、公式の説明では以下のような特徴があるらしいです。

合計8Bパラメータ
単一画像、複数画像、およびビデオ理解においてGPT-4Vを上回ります
単一画像理解ではGPT-4o mini、Gemini 1.5 Pro、Claude 3.5 Sonnetよりも優れている
強力なOCR機能
多言語サポート
エンドサイド展開
優れたトークン密度で、MiniCPM-V 2.6はiPadなどのエンドサイドデバイスでのリアルタイムビデオ理解をサポート

Colaboratoryで試した感じ、速度の割に結構精度が良いように感じます🦔

MiniCPM-V2.6 味見中👀①

プロンプト：explain this image.
回答：In the center of this image, a pair of hands is seen holding a red and white handheld gaming device. The device has a joystick on the right side, indicating it's designed for one-handed use. On the left side, there… pic.twitter.com/b1hCMljFei
— 高橋かずひと@闇のパワポLT職人 (@KzhtTkhs) 2024年8月20日

MiniCPM-V2.6 味見中👀②
画像2枚を入力して質問。

プロンプト：Compare image 1 and image 2, tell me about the differences between image 1 and image 2.
回答：In the second image, one of the individuals has their arm extended with a clenched fist, indicating a gesture of celebration… pic.twitter.com/ds0ohwzPvQ
— 高橋かずひと@闇のパワポLT職人 (@KzhtTkhs) 2024年8月20日

MiniCPM-V2.6 味見中👀③
動画入力
内部処理的には、動画を1秒おきのフレームに切り出して、フレーム列とプロンプトを入力

プロンプト：describe the video
回答：The video begins with a wide shot of a Qantas airplane on the runway, followed by a close-up of a cheerful koala character… pic.twitter.com/n0Hhn2MzHc
— 高橋かずひと@闇のパワポLT職人 (@KzhtTkhs) 2024年8月20日

MiniCPM-V2.6 味見中👀④
フューショット学習

プロンプト：What does this picture represent?
　1枚目の回答：S
　2枚目の回答：G
回答：U pic.twitter.com/KL0yIjZj18
— 高橋かずひと@闇のパワポLT職人 (@KzhtTkhs) 2024年8月20日

フューショット学習無しで同じ質問した場合↓

プロンプト：What does this picture represent?
回答：The image is a silhouette of two people standing side by side, facing each other. The background appears to be blurred with horizontal lines, possibly indicating a wall or a textured… https://t.co/0gUZNMqObo
— 高橋かずひと@闇のパワポLT職人 (@KzhtTkhs) 2024年8月20日

今回試したノートブックは以下にコミットしています🦔

github.com