image-to-video

Vidu Q3 Mix R2V

Vidu Q3 Mix reference-to-video model. Generates a video with your characters from 1-7 reference images using mixed-style synthesis. Both `prompt` and `reference_images` are required. `aspect_ratio` supports 16:9/9:16/1:1; `resolution` supports 720p/1080p only; `duration` range 1-16s.
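The constraints listed above can be checked on the client before a request is sent. The following is a minimal sketch, not the provider's SDK: the function name `build_q3_mix_r2v_payload` is hypothetical, and the assumption that `reference_images` is a list of image URLs or paths is not confirmed by this page. Only the parameter names and limits come from the description.

```python
# Limits taken from the Vidu Q3 Mix R2V description on this page.
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16", "1:1"}
ALLOWED_RESOLUTIONS = {"720p", "1080p"}
MIN_DURATION_S, MAX_DURATION_S = 1, 16
MIN_IMAGES, MAX_IMAGES = 1, 7


def build_q3_mix_r2v_payload(prompt, reference_images, *, aspect_ratio, resolution, duration):
    """Hypothetical helper: validate the inputs and return a request body as a dict.

    Assumes reference_images is a list of strings (URLs or paths); the real
    API may expect a different encoding.
    """
    if not prompt:
        raise ValueError("prompt is required")
    if not MIN_IMAGES <= len(reference_images) <= MAX_IMAGES:
        raise ValueError("reference_images must contain 1-7 images")
    if aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        raise ValueError(f"aspect_ratio must be one of {sorted(ALLOWED_ASPECT_RATIOS)}")
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(ALLOWED_RESOLUTIONS)}")
    if not MIN_DURATION_S <= duration <= MAX_DURATION_S:
        raise ValueError("duration must be between 1 and 16 seconds")
    return {
        "prompt": prompt,
        "reference_images": list(reference_images),
        "aspect_ratio": aspect_ratio,
        "resolution": resolution,
        "duration": duration,
    }
```

Rejecting an unsupported value such as `resolution="540p"` locally avoids a round trip that the service would presumably refuse anyway.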

Explore AI models

minimax/hailuo-2.3-t2v

text-to-video

Hailuo 2.3 generates high-quality videos from text with exceptional instruction following and state-of-the-art extreme physics simulation.

$0.0540~$0.1200/sec
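Price strings in this catalog follow the pattern `$low~$high/unit` or `$price/unit`. As a rough sketch, the bounds of a job's cost can be estimated as below. The helper names are hypothetical, and two things are assumptions rather than facts stated on this page: that `/sec` means per second of generated video, and that the position within a range depends on request settings such as resolution.

```python
from decimal import Decimal


def parse_price_range(text):
    """Parse a catalog price such as '$0.0540~$0.1200/sec' into (low, high, unit).

    A single price such as '$0.4600/sec' yields low == high.
    """
    amounts, unit = text.split("/")
    parts = [Decimal(p.strip().lstrip("$")) for p in amounts.split("~")]
    return parts[0], parts[-1], unit


def estimate_cost(text, quantity):
    """Return the (lowest, highest) cost in USD for `quantity` billed units."""
    low, high, _unit = parse_price_range(text)
    return low * quantity, high * quantity


# A 6-second clip at the range listed above.
low, high = estimate_cost("$0.0540~$0.1200/sec", 6)
print(f"${low} to ${high}")
```

`Decimal` is used instead of `float` so that the four-decimal prices are multiplied exactly.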

minimax/hailuo-02-t2v

text-to-video

Hailuo 02 masters text-to-video generation with exceptional instruction following and sets a new standard in visual realism via extreme physics.

$0.0540~$0.1200/sec

google/imagen-4.0-generate-001

text-to-image

Google Imagen 4.0 standard text-to-image model. High-quality photorealistic output. Supports batch generation (up to 4), person control, and up to 2K.

$0.0330/img

google/veo-3-t2v

text-to-video

Google Veo 3.0 stable text-to-video model. Supports `prompt`, `negativePrompt`, `aspectRatio`, `durationSeconds`, `resolution` (up to 1080p), and `personGeneration`.

$0.4600/sec

bytedance/seedream-5.0-lite

text-to-image

ByteDance Seedream 5.0 Lite text-to-image model with 2K/3K custom resolutions and configurable output format.

$0.0370/img

alibaba/z-image-turbo

text-to-image

Z-Image Turbo is a lightweight text-to-image model that quickly generates images with Chinese and English text rendering support. It always outputs 1 PNG image per request.

$0.0100~$0.0200/img

bytedance/seedance-2.0-fast-i2v

image-to-video

Faster variant of Dreamina Seedance 2.0 image-to-video. Accepts the same multimodal inputs as Seedance 2.0 I2V (a text prompt plus optional reference images and audio) with lower latency. Resolution is limited to 480p/720p.

$0.0650~$0.1400/sec

google/nano-banana-pro

text-to-image

Nano Banana Pro image generation model with higher quality output. Supports aspect ratio and image size (1K/2K/4K resolution).

$0.1300~$0.2200/img

vidu/viduq3-turbo-r2v

image-to-video

Vidu Q3 Turbo reference-to-video model. Fast generation with your characters from 1-7 reference images. Both `prompt` and `reference_images` are required. `aspect_ratio` supports 16:9/9:16/1:1; `resolution` supports 540p/720p/1080p; `duration` range 3-16s.

$0.0100~$0.0320/sec

vidu/viduq3-r2v

image-to-video

Vidu Q3 reference-to-video model. Balanced quality from 1-7 reference images. Both `prompt` and `reference_images` are required. `aspect_ratio` supports 16:9/9:16/1:1; `resolution` supports 540p/720p/1080p; `duration` range 3-16s.
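The three Vidu Q3 reference-to-video variants on this page share the same required fields and aspect ratios but differ in supported resolutions and duration ranges. The sketch below encodes those differences as data and picks the variants that can serve a given request. The `R2VLimits` class and `variants_supporting` function are hypothetical; the Mix variant is keyed by its display name because its model ID does not appear on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class R2VLimits:
    resolutions: frozenset
    min_duration: int  # seconds
    max_duration: int  # seconds


# Limits as stated in the catalog descriptions on this page.
R2V_LIMITS = {
    "Vidu Q3 Mix R2V": R2VLimits(frozenset({"720p", "1080p"}), 1, 16),
    "vidu/viduq3-turbo-r2v": R2VLimits(frozenset({"540p", "720p", "1080p"}), 3, 16),
    "vidu/viduq3-r2v": R2VLimits(frozenset({"540p", "720p", "1080p"}), 3, 16),
}


def variants_supporting(resolution, duration):
    """Return the names of the variants that accept this resolution and duration."""
    return [
        name
        for name, limits in R2V_LIMITS.items()
        if resolution in limits.resolutions
        and limits.min_duration <= duration <= limits.max_duration
    ]


print(variants_supporting("720p", 2))   # only the Mix variant allows clips under 3 s
print(variants_supporting("540p", 5))   # only Turbo and standard Q3 list 540p
```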

vidu/one-click-trending-replicate

video-to-video

Vidu one-click trending replicate model. Recreates a trending video style with your own subjects. Both `video_url` (reference trend video) and `images` (subject images, 1-7) are required. Supports `prompt`, `aspect_ratio`, `resolution` (default 1080p), and `remove_audio`.

$0.0300~$0.0500/sec

vidu/lip-sync

video-to-video

Vidu lip sync model. Reanimates the lip movements in a video to match a replacement audio track. `video_url` is required. Provide `audio_url` as the new audio to sync lips to. Use `reference_face_image_url` to preserve face identity consistency across the video.

$0.0100/sec

vidu/motion-sync

video-to-video

Vidu motion sync model. Transfers motion from a source video onto a target character image. Both `image_url` and `video_url` are required.

$0.0250/sec
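The three Vidu video-to-video models above each name a different set of required fields. A small lookup table, sketched below, makes it possible to report missing fields before sending anything. The helper `missing_fields` is hypothetical. For `vidu/lip-sync` only `video_url` is stated as required, so `audio_url` is left out of the table even though a lip-sync request is presumably not useful without it.

```python
# Required fields per model, as stated in the descriptions on this page.
REQUIRED_FIELDS = {
    "vidu/one-click-trending-replicate": ("video_url", "images"),
    "vidu/lip-sync": ("video_url",),
    "vidu/motion-sync": ("image_url", "video_url"),
}


def missing_fields(model, payload):
    """Return the required fields that are absent or empty in `payload`."""
    return [field for field in REQUIRED_FIELDS[model] if not payload.get(field)]


# Placeholder URL for illustration only.
request = {"video_url": "https://example.com/dance.mp4"}
print(missing_fields("vidu/motion-sync", request))
```

For the request shown, the function reports `image_url` as missing, since motion sync needs both a source video and a target character image.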

vidu/one-click-ad-film

image-to-video

Vidu one-click ad film model. Automatically generates a marketing video from 1-7 product or scene images. `images` is required. Supports `prompt` (up to 2000 chars), `duration` (10-60s, default 15), `aspect_ratio`, and `language` (zh/en).

$0.1000/sec

vidu/one-click-general-film

image-to-video

Vidu one-click general film model. Automatically generates a cinematic film from 1-7 images. Both `images` and `duration` (10-180s) are required. Optionally accepts `prompt` (up to 3000 chars) and `aspect_ratio`.

$0.1000/sec
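The two one-click film models take similar inputs but differ in which fields are required and in their limits: the ad film model defaults `duration` to 15 and caps it at 60 seconds, while the general film model requires `duration` and allows up to 180 seconds. The following sketch captures that contrast. Both builder functions are hypothetical, and the assumption that `images` is a list of strings is not confirmed by this page.

```python
def _check_images(images):
    # Both models accept 1-7 images.
    if not 1 <= len(images) <= 7:
        raise ValueError("images must contain 1-7 items")


def build_ad_film_payload(images, prompt=None, duration=15, aspect_ratio=None, language=None):
    """vidu/one-click-ad-film: only `images` is required; duration defaults to 15 s."""
    _check_images(images)
    if prompt is not None and len(prompt) > 2000:
        raise ValueError("prompt is limited to 2000 characters")
    if not 10 <= duration <= 60:
        raise ValueError("duration must be between 10 and 60 seconds")
    if language is not None and language not in ("zh", "en"):
        raise ValueError("language must be 'zh' or 'en'")
    payload = {"images": list(images), "duration": duration}
    optional = {"prompt": prompt, "aspect_ratio": aspect_ratio, "language": language}
    # Omit optional fields that were not supplied.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


def build_general_film_payload(images, duration, prompt=None, aspect_ratio=None):
    """vidu/one-click-general-film: `images` and `duration` are both required."""
    _check_images(images)
    if not 10 <= duration <= 180:
        raise ValueError("duration must be between 10 and 180 seconds")
    if prompt is not None and len(prompt) > 3000:
        raise ValueError("prompt is limited to 3000 characters")
    payload = {"images": list(images), "duration": duration}
    optional = {"prompt": prompt, "aspect_ratio": aspect_ratio}
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```

At the listed $0.1000/sec, the length of the film is the main cost driver, so validating `duration` before submission is worth the extra lines.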

The GPT-Image series by OpenAI consists of advanced multimodal models, such as GPT-Image-1 and GPT-Image-2, designed for generating and editing photorealistic images from text and image inputs.

MiniMax's Hailuo 02 series is a top-ranked cinematic AI video suite for T2V/I2V, generating native 1080p clips with ultra-realistic physics, character consistency, and director-level controls.

MiniMax's Hailuo 2.3 series elevates cinematic AI video gen with 4K T2V/I2V, hyper-realistic physics/motion, extended clips, and advanced character consistency.

HappyHorse is a leading open-source AI video generation model with 15 billion parameters that jointly produces high-quality 1080p videos and synchronized audio from text or image prompts, currently topping the Artificial Analysis Video Arena leaderboard.

Google Imagen is Google's premier text-to-image diffusion model, excelling in photorealistic, high-resolution image generation from textual prompts with unmatched detail, creativity, and adherence to complex descriptions.

Kuaishou's Kling v3 series is an open multimodal AI suite for T2I/I2V/T2V, generating 4K cinematic visuals with native audio, multi-shot narratives, precise motion control, and consistent characters.

Nano Banana is an advanced AI image generation and editing model based on Google's Gemini technology, delivering fast, precise transformations with exceptional prompt understanding, consistent character editing, and high-quality visuals.

Qwen Image is Alibaba's unified 7B text-to-image generation and editing model series, renowned for high-fidelity visuals, superior text rendering, Photoshop-like layered editing, and top rankings on global leaderboards.

ByteDance's Seedance is a multimodal AI video generation model that creates cinematic 1080p multi-shot videos from text, images, audio, or video prompts with immersive audio-visual realism and director-level creative controls.

ByteDance's Seedream is a high-fidelity text-to-image and editing model supporting native 4K resolution, batch generation, superior typography, and consistent character rendering for professional creative workflows.

Google Veo 3 is Google DeepMind's groundbreaking text-to-video AI model, unveiled at Google I/O 2025, that generates high-fidelity 4K cinematic videos with native synchronized audio from text or image prompts, offering professional controls and multi-scene coherence.

Google Veo 3.1 is the advanced successor to Veo 3, released in October 2025, enhancing 4K video generation with richer native audio, superior narrative control, precise image-to-video conversion, and seamless character consistency for dynamic storytelling.

Vidu Q3 is Shengshu AI’s advanced text-to-video and image-to-video model that generates up to 16-second clips with native audio, enhanced motion, and precise camera control.

Alibaba's Wan 2.6 is a powerful open-source AI video generation model that creates cinematic 1080p multi-shot videos with native audio-visual synchronization, supporting text-to-video, image-to-video, and professional storytelling workflows.

Alibaba's Wan 2.7 series is a comprehensive open-weight AI suite for image generation/editing and video creation, featuring thinking mode reasoning, first/last frame control, up to 4K images and 1080p videos, native audio sync, and exceptional text rendering accuracy.

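Access leading AI media models

Explore image, video, and audio models, and move to production quickly with transparent pricing, online trial runs, and a unified API.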

Video generation (101)

Image generation (66)
