
Vidu Q3 Mix reference-to-video model. Generates a video with your characters from 1-7 reference images using mixed-style synthesis. Both `prompt` and `reference_images` are required. `aspect_ratio` supports 16:9/9:16/1:1; `resolution` supports 720p/1080p only; `duration` range 1-16s.

Hailuo 2.3 generates high-quality videos from text with exceptional instruction following and state-of-the-art extreme physics simulation.

Hailuo 02 is a text-to-video model with exceptional instruction following that sets a new standard in visual realism through extreme physics simulation.

Google Imagen 4.0 standard text-to-image model. High-quality photorealistic output. Supports batch generation (up to 4 images), person control, and resolutions up to 2K.
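
Because the entry above caps batch generation at 4 images, a larger job has to be split across several requests. The sketch below only plans the batch sizes. The function name `plan_batches` is hypothetical, and no API call is made.

```python
def plan_batches(total_images: int, max_per_request: int = 4) -> list[int]:
    """Split a requested image count into per-request batch sizes.

    max_per_request defaults to 4, the batch limit stated in the entry.
    """
    if total_images < 0:
        raise ValueError("total_images must not be negative")
    if max_per_request < 1:
        raise ValueError("max_per_request must be at least 1")
    full, rest = divmod(total_images, max_per_request)
    # Full batches first, then one smaller batch for the remainder, if any.
    return [max_per_request] * full + ([rest] if rest else [])


print(plan_batches(10))  # [4, 4, 2]
```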

Google Veo 3.0 stable text-to-video model. Supports `prompt`, `negativePrompt`, `aspectRatio`, `durationSeconds`, `resolution` (up to 1080p), and `personGeneration`.
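
A minimal sketch of how the parameters listed above might be assembled into a request body. The camelCase keys come from the entry. Everything else is an assumption made for illustration: the helper name `build_veo_request`, the rule that `prompt` must be non-empty, and the set of accepted resolutions (the entry only says "up to 1080p"). No endpoint or client library is shown.

```python
from typing import Any, Optional

# Assumption: the entry states "up to 1080p" without listing exact values.
ASSUMED_RESOLUTIONS = {"720p", "1080p"}


def build_veo_request(
    prompt: str,
    negative_prompt: Optional[str] = None,
    aspect_ratio: Optional[str] = None,
    duration_seconds: Optional[int] = None,
    resolution: Optional[str] = None,
    person_generation: Optional[str] = None,
) -> dict[str, Any]:
    """Build a request body using the parameter names from the entry."""
    if not prompt.strip():
        # Assumption: a text-to-video request needs a non-empty prompt.
        raise ValueError("prompt must not be empty")
    if resolution is not None and resolution not in ASSUMED_RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution}")

    body: dict[str, Any] = {"prompt": prompt}
    optional = {
        "negativePrompt": negative_prompt,
        "aspectRatio": aspect_ratio,
        "durationSeconds": duration_seconds,
        "resolution": resolution,
        "personGeneration": person_generation,
    }
    # Only send the optional parameters the caller actually set.
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


request_body = build_veo_request(
    "a red kite over sand dunes", aspect_ratio="16:9", resolution="1080p"
)
```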

ByteDance Seedream 5.0 Lite text-to-image model with 2K/3K custom resolutions and configurable output format.

Z-Image Turbo is a lightweight text-to-image model that quickly generates images with Chinese and English text rendering support. It always outputs 1 PNG image per request.

Faster variant of Dreamina Seedance 2.0 image-to-video. Accepts the same multimodal inputs as Seedance 2.0 I2V (a text prompt plus optional reference images and audio) with lower latency. Resolution is limited to 480p/720p.

Nano Banana Pro image generation model with higher quality output. Supports aspect ratio and image size (1K/2K/4K resolution).

Vidu Q3 Turbo reference-to-video model. Fast generation with your characters from 1-7 reference images. Both `prompt` and `reference_images` are required. `aspect_ratio` supports 16:9/9:16/1:1; `resolution` supports 540p/720p/1080p; `duration` range 3-16s.

Vidu Q3 reference-to-video model. Balanced quality from 1-7 reference images. Both `prompt` and `reference_images` are required. `aspect_ratio` supports 16:9/9:16/1:1; `resolution` supports 540p/720p/1080p; `duration` range 3-16s.
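
As described in this list, the Vidu Q3 reference-to-video variants (Turbo, Q3, Mix) share the same required fields and aspect ratios and differ only in their resolution sets and duration ranges. A table-driven check is one way to capture that. The variant keys (`q3-turbo`, `q3`, `q3-mix`) and the function name are hypothetical labels for this sketch, not real model identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Q3Limits:
    resolutions: frozenset
    min_duration: int
    max_duration: int


ASPECT_RATIOS = frozenset({"16:9", "9:16", "1:1"})

# Keys are hypothetical labels; the limits are the ones stated in the entries.
Q3_LIMITS = {
    "q3-turbo": Q3Limits(frozenset({"540p", "720p", "1080p"}), 3, 16),
    "q3": Q3Limits(frozenset({"540p", "720p", "1080p"}), 3, 16),
    "q3-mix": Q3Limits(frozenset({"720p", "1080p"}), 1, 16),
}


def check_q3_request(variant, prompt, reference_images, aspect_ratio, resolution, duration):
    """Return a list of problems; an empty list means the request fits the limits."""
    limits = Q3_LIMITS[variant]
    problems = []
    if not prompt:
        problems.append("prompt is required")
    if not 1 <= len(reference_images) <= 7:
        problems.append("reference_images must contain 1-7 items")
    if aspect_ratio not in ASPECT_RATIOS:
        problems.append(f"aspect_ratio {aspect_ratio} is not supported")
    if resolution not in limits.resolutions:
        problems.append(f"resolution {resolution} is not supported by {variant}")
    if not limits.min_duration <= duration <= limits.max_duration:
        problems.append(
            f"duration must be {limits.min_duration}-{limits.max_duration}s for {variant}"
        )
    return problems
```

For example, a 2-second clip at 720p passes for `q3-mix` but fails the duration check for `q3`, and 540p passes for `q3` but not for `q3-mix`.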

Vidu one-click trending replicate model. Recreates a trending video style with your own subjects. Both `video_url` (reference trend video) and `images` (subject images, 1-7) are required. Supports `prompt`, `aspect_ratio`, `resolution` (default 1080p), and `remove_audio`.

Vidu lip sync model. Reanimates the lip movements in a video to match a replacement audio track. `video_url` is required. Provide `audio_url` as the new audio to sync lips to. Use `reference_face_image_url` to preserve face identity consistency across the video.
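
A small sketch of a request body for the lip sync entry above. The field names come from the entry. The helper names and the http(s)-only URL check are assumptions, and because the entry does not say whether `audio_url` is strictly required, the sketch treats only `video_url` as mandatory.

```python
from typing import Optional
from urllib.parse import urlparse


def _require_http_url(name: str, value: str) -> str:
    """Assumption: inputs are passed as absolute http(s) URLs."""
    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"{name} must be an http(s) URL")
    return value


def build_lip_sync_request(
    video_url: str,
    audio_url: Optional[str] = None,
    reference_face_image_url: Optional[str] = None,
) -> dict[str, str]:
    """Build a body with the required video and any optional inputs that were given."""
    body = {"video_url": _require_http_url("video_url", video_url)}
    if audio_url is not None:
        body["audio_url"] = _require_http_url("audio_url", audio_url)
    if reference_face_image_url is not None:
        body["reference_face_image_url"] = _require_http_url(
            "reference_face_image_url", reference_face_image_url
        )
    return body
```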

Vidu motion sync model. Transfers motion from a source video onto a target character image. Both `image_url` and `video_url` are required.

Vidu one-click ad film model. Automatically generates a marketing video from 1-7 product or scene images. `images` is required. Supports `prompt` (up to 2000 chars), `duration` (10-60s, default 15s), `aspect_ratio`, and `language` (zh/en).

Vidu one-click general film model. Automatically generates a cinematic film from 1-7 images. Both `images` and `duration` (10-180s) are required. Optionally accepts `prompt` (up to 3000 chars) and `aspect_ratio`.
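
The ad film and general film entries differ in their prompt length limits, their duration ranges, and whether `duration` has a default. The sketch below encodes those stated limits in one hypothetical helper. The `kind` labels (`ad`, `general`) and the function name are assumptions, as is the choice to send the default duration explicitly.

```python
from typing import Any, Optional, Sequence

# Limits as stated in the two entries; the dictionary keys are hypothetical labels.
LIMITS = {
    "ad": {"max_prompt": 2000, "duration": (10, 60), "default_duration": 15},
    "general": {"max_prompt": 3000, "duration": (10, 180), "default_duration": None},
}


def build_film_request(
    kind: str,
    images: Sequence[str],
    duration: Optional[int] = None,
    prompt: Optional[str] = None,
    aspect_ratio: Optional[str] = None,
    language: Optional[str] = None,
) -> dict[str, Any]:
    limits = LIMITS[kind]
    if not 1 <= len(images) <= 7:
        raise ValueError("images must contain 1-7 items")
    if duration is None:
        duration = limits["default_duration"]
    if duration is None:
        raise ValueError("duration is required for the general film model")
    low, high = limits["duration"]
    if not low <= duration <= high:
        raise ValueError(f"duration must be {low}-{high}s")
    if prompt is not None and len(prompt) > limits["max_prompt"]:
        raise ValueError(f"prompt exceeds {limits['max_prompt']} characters")
    if language is not None:
        if kind != "ad":
            raise ValueError("language is only listed for the ad film model")
        if language not in ("zh", "en"):
            raise ValueError("language must be zh or en")

    body: dict[str, Any] = {"images": list(images), "duration": duration}
    for key, value in (("prompt", prompt), ("aspect_ratio", aspect_ratio), ("language", language)):
        if value is not None:
            body[key] = value
    return body
```

Calling `build_film_request("ad", ["product.png"])` fills in the 15-second default, while the same call with `kind="general"` raises because that entry lists `duration` as required.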

The GPT-Image series by OpenAI consists of advanced multimodal models, such as GPT-Image-1 and GPT-Image-2, designed for generating and editing photorealistic images from text and image inputs.

MiniMax's Hailuo 02 series is a top-ranked cinematic AI video suite for T2V/I2V, generating native 1080p clips with ultra-realistic physics, character consistency, and director-level controls.

MiniMax's Hailuo 2.3 series elevates cinematic AI video generation with 4K T2V/I2V, hyper-realistic physics and motion, extended clips, and advanced character consistency.

HappyHorse is a leading open-source AI video generation model with 15 billion parameters that jointly produces high-quality 1080p videos and synchronized audio from text or image prompts, currently topping the Artificial Analysis Video Arena leaderboard.

Google Imagen is Google's premier text-to-image diffusion model, excelling in photorealistic, high-resolution image generation from textual prompts with unmatched detail, creativity, and adherence to complex descriptions.

Kuaishou's Kling v3 series is an open multimodal AI suite for T2I/I2V/T2V, generating 4K cinematic visuals with native audio, multi-shot narratives, precise motion control, and consistent characters.

Nano Banana is an advanced AI image generation and editing model based on Google's Gemini technology, delivering fast, precise transformations with exceptional prompt understanding, consistent character editing, and high-quality visuals.

Qwen Image is Alibaba's unified 7B text-to-image generation and editing model series, renowned for high-fidelity visuals, superior text rendering, Photoshop-like layered editing, and top rankings on global leaderboards.

ByteDance's Seedance is a multimodal AI video generation model that creates cinematic 1080p multi-shot videos from text, images, audio, or video prompts with immersive audio-visual realism and director-level creative controls.

ByteDance's Seedream is a high-fidelity text-to-image and editing model supporting native 4K resolution, batch generation, superior typography, and consistent character rendering for professional creative workflows.

Google Veo 3 is Google DeepMind's groundbreaking text-to-video AI model, unveiled at Google I/O 2025, that generates high-fidelity 4K cinematic videos with native synchronized audio from text or image prompts, offering professional controls and multi-scene coherence.

Google Veo 3.1 is the advanced successor to Veo 3, released in October 2025, enhancing 4K video generation with richer native audio, superior narrative control, precise image-to-video conversion, and seamless character consistency for dynamic storytelling.

Vidu Q3 is Shengshu AI’s advanced text-to-video and image-to-video model that generates up to 16-second clips with native audio, enhanced motion, and precise camera control.

Alibaba's Wan 2.6 is a powerful open-source AI video generation model that creates cinematic 1080p multi-shot videos with native audio-visual synchronization, supporting text-to-video, image-to-video, and professional storytelling workflows.

Alibaba's Wan 2.7 series is a comprehensive open-weight AI suite for image generation/editing and video creation, featuring thinking mode reasoning, first/last frame control, up to 4K images and 1080p videos, native audio sync, and exceptional text rendering accuracy.
