A curated list of AI video generation APIs, SDKs, and production-ready tools. Focused on services developers can integrate today.
Last verified: March 2026
## Contents

- Text-to-Video APIs
- Real-Time and Interactive Video
- Video Style Transfer and Motion
- Avatar and Talking Head APIs
- Video Enhancement APIs
- Video Understanding APIs
- Open Source Models
- SDKs and Developer Tooling
- Infrastructure and Deployment
- Evaluation and Observability
- Templates and Example Projects
- Learning Resources
## Text-to-Video APIs

- OpenAI Sora – Text-to-video and image-to-video via the `v1/videos` endpoint. Sora 2 supports up to 90s at 4K with spatial audio. API Docs | SDK: Python, Node
- Runway (Gen-4) – Text-to-video and image-to-video with Gen-4 Turbo. Async task-based REST API with polling helpers (see the polling sketch after this list). Docs | SDK: Python, Node
- Google Veo 2 / Veo 3 – Google's video generation models via Vertex AI and Gemini API. Veo 2 is GA; Veo 3 in paid preview. Docs | SDK: Google Cloud Python
- Pika (v2.2) – Text-to-video and image-to-video with Pikaframes multi-keyframe interpolation. API powered by fal.ai. Docs | SDK: fal Python, fal JS
- Luma Dream Machine – High-quality text-to-video with character reference and style reference inputs. Ray 3 is the latest model. Docs | SDK: Python, JS
- Kling AI – Text-to-video and image-to-video from Kuaishou. Up to 30s clips at 1080p/30fps. Async task-based API. Docs | Also on fal.ai
- MiniMax / Hailuo – Hailuo 2.3 model. Text-to-video and image-to-video up to 1080p, 10s clips. Docs | SDK: Python, Node
- Vidu (Shengshu Technology) – Vidu Q3, billed as the first long-form AI video model with native audio-video generation in a single output. Ranked #2 globally on Artificial Analysis benchmarks.
- xAI Aurora / Grok Imagine – Text-to-video and image-to-video using xAI's Aurora autoregressive MoE model. 6–15s clips at 720p with synchronized audio. API
- Seedance 2.0 (ByteDance) – Dual-Branch Diffusion Transformer for simultaneous video + audio generation. Up to 15s at 2K resolution. Available via Dreamina.
- Stability AI (SVD) – Image-to-video via Stable Video Diffusion. Hosted API deprecated July 2025; open weights available for self-hosting. GitHub
- Higgsfield – Cinematic video platform aggregating 15+ premium models (Sora 2, Kling 2.6, Veo 3.1, etc.) with camera simulation, character consistency, and lip-sync. 15M+ users.
- Magic Hour – Multi-modal AI video generation API. Text-to-video, image-to-video, style transfer, 4K upscaling. Scales to zero when idle. Docs | API
- InVideo AI – Turns text prompts into full videos using Sora 2 and Veo 3.1 as underlying models. OpenAI's first official Sora 2 integration partner. 50M+ users.
- Fliki – Text-to-video and text-to-speech platform. Enterprise API with 2,500+ voices in 80+ languages. Docs
- Morph Studio – No-code AI video studio aggregating Wan 2.6, Kling 2.6 Pro, Seedance, Sora 2, Veo 3 into a single canvas with storyboarding and style transfer.
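
Most of the hosted services above expose the same asynchronous pattern: submit a generation task, poll its status, then download the result. Below is a minimal Python sketch of that flow against the `v1/videos` endpoint mentioned for Sora; the response fields (`id`, `status`) and the `/content` download path are assumptions to verify against the current API reference.

```python
import os
import time

import requests

BASE = "https://api.openai.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

# Submit a generation task. Payload fields are assumptions drawn from the
# docs above -- check the API reference for the exact schema.
resp = requests.post(
    f"{BASE}/videos",
    headers=HEADERS,
    json={"model": "sora-2", "prompt": "A drone shot over a glacier at dawn"},
)
resp.raise_for_status()
video = resp.json()

# Poll until the task finishes. Runway, Kling, and MiniMax follow this
# same submit-then-poll pattern with their own endpoints.
while video.get("status") not in ("completed", "failed"):
    time.sleep(5)
    video = requests.get(f"{BASE}/videos/{video['id']}", headers=HEADERS).json()

if video.get("status") == "completed":
    # Download the rendered MP4 (the /content path is an assumption).
    content = requests.get(f"{BASE}/videos/{video['id']}/content", headers=HEADERS)
    with open("output.mp4", "wb") as f:
        f.write(content.content)
```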
## Real-Time and Interactive Video

- Decart (Lucy 2) – Real-time video transformation at 30fps 1080p with near-zero latency. Live-stream style transfer, character swaps, environment transformation, product placement. ~$3/hour. Docs | Platform
- PixVerse – Text-to-video and image-to-video platform. PixVerse-R1 adds real-time interactive video at 720p HD with native audio. Platform | Docs
## Video Style Transfer and Motion

- DomoAI – AI video-to-video style remixer. 50+ styles (anime, Ghibli, cinematic). v2.4.1 supports text-to-video, image-to-video, talking avatars, and animation.
- Viggle AI – Motion-transfer video tool that animates static characters to match a motion video or live webcam input. Mix Mode, Live Mode, and VTubing support. 40M+ users.
## Avatar and Talking Head APIs

- Synthesia – Avatar-based video creation from scripts. 140+ languages, custom avatars, template workflows. API in beta. Docs
- D-ID – Talking head video generation from text or audio. Express and Premium+ avatars, real-time WebRTC streaming. See the talking-head sketch after this list. Docs | SDK: Python
- HeyGen – AI avatar video generation and real-time streaming avatars via WebRTC. Template-based workflows. Docs | SDK: JS/TS
- Tavus – Conversational video AI. The Phoenix-4 model performs real-time Gaussian-diffusion facial synthesis at ~600ms latency. The Replica API clones face + voice. Integrates with Pipecat and LiveKit. Avatar API | Video API
- Colossyan – Enterprise avatar video platform. 130+ avatars, 600+ voices, 100+ languages, instant avatar creation from phone recording. API
- DeepBrain AI (AI Studios) – AI avatar video platform with REST API. Integrates with AWS, Azure, ElevenLabs, IBM Watson, and NVIDIA Riva. Docs
- Hour One – AI avatar video generator with 100+ presenters, voice cloning, 100+ languages. API + Zapier integration.
- Elai.io – AI video platform with streaming avatar API for interactive e-learning. Turns documents and scripts into avatar-presented videos. Docs
- Captions / Mirage – Mirage API generates hyperrealistic talking-head videos from script + image + actor ID. Natural gestures, eye contact, synchronized audio. API | Docs
- SynthLife – Virtual AI influencer creation. Builds AI personas for TikTok, YouTube, and Instagram with auto-scheduling and unlimited content generation.
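
The talking-head services above follow the same submit-then-poll shape. Here is a minimal sketch in the style of D-ID's talks API; the endpoint path, auth scheme, and payload/response field names are assumptions drawn from memory, so check the official docs before use.

```python
import os
import time

import requests

BASE = "https://api.d-id.com"
HEADERS = {
    # Auth scheme is an assumption -- confirm in the D-ID docs.
    "Authorization": f"Basic {os.environ['DID_API_KEY']}",
    "Content-Type": "application/json",
}

# Create a talk: a portrait image plus a text script to be spoken.
talk = requests.post(
    f"{BASE}/talks",
    headers=HEADERS,
    json={
        "source_url": "https://example.com/portrait.jpg",  # hypothetical image URL
        "script": {"type": "text", "input": "Welcome to our product tour!"},
    },
).json()

# Poll until the render finishes, then grab the result URL.
while True:
    status = requests.get(f"{BASE}/talks/{talk['id']}", headers=HEADERS).json()
    if status.get("status") in ("done", "error"):
        break
    time.sleep(3)

print(status.get("result_url"))
```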
## Video Enhancement APIs

- Topaz Video AI – AI video upscaling, denoising, frame interpolation, and artifact correction. 3M+ users; used by Google, Tesla, NASA. API | Docs
## Video Understanding APIs

- Twelve Labs – Video understanding and intelligence API. Multimodal semantic search, analysis, and text generation from video content. Docs
## Open Source Models

Models with hosted inference, Docker support, or diffusers integration.
- Wan 2.1 (Alibaba) – SOTA open T2V-14B model. Also supports I2V, editing, T2I, and V2A. T2V-1.3B runs on consumer GPUs. Apache 2.0. HuggingFace | On Replicate, fal.ai
- Wan 2.2 (Alibaba) – World's first open-source MoE video diffusion model. Trained on 65.6% more image data and 83.2% more video data than 2.1. 5B and 14B variants. Apache 2.0. HuggingFace
- MAGI-1 (Sand AI) – 24B param autoregressive denoising model. Generates video chunk-by-chunk (24 frames/chunk). Supports T2V, I2V, V2V with streaming generation. Outperforms Wan 2.1 and HunyuanVideo on benchmarks. Apache 2.0. HuggingFace | Paper
- Step-Video-T2V (StepFun) – 300B parameter text-to-video model, up to 204 frames, bilingual (EN/ZH). MIT license. TI2V | HuggingFace
- SkyReels (SkyworkAI) – V1: human-centric video fine-tuned on HunyuanVideo. V2: infinite-length video via Autoregressive Diffusion-Forcing. V3: multimodal reaching closed-source SOTA levels. V1 | V3
- HunyuanVideo (Tencent) – 13B+ param model; v1.5 is 8.3B and runs on consumer GPUs. I2V, Avatar, and Foley variants available. v1.5 | On Replicate, fal.ai
- CogVideoX (Zhipu AI / Z.AI) – CogVideoX-5B flagship; supports 10s videos. Commercial product "Ying" available via API. Apache 2.0 (2B variant). See the diffusers sketch after this list. HuggingFace | API
- NVIDIA Cosmos – World foundation model for physical AI (robotics, autonomous vehicles). Cosmos-Predict2.5 generates physics-based video simulations from text/image/video/sensor inputs. Website | Docs
- Meta Movie Gen – 30B param T2V + 13B audio model. Personalized video from single reference photo, local/global editing, synchronized audio. Research paper public; weights not yet released. Rolling out inside Instagram Reels.
- LTX-Video / LTX-2 (Lightricks) – First DiT-based real-time video generation model. LTX-2 adds native 4K at 50fps with synchronized audio. LTX-2 | ComfyUI Nodes
- Pyramid Flow – Efficient autoregressive video generation using pyramidal flow matching. Up to 10s at 768p, 24fps. ICLR 2025. HuggingFace | Paper
- Open-Sora – Open reproduction of Sora-like generation. 2s–15s at 144p–720p. T2V, I2V, V2V. Apache 2.0.
- AnimateDiff – Plug-and-play animation module for Stable Diffusion models. Merged into HuggingFace diffusers. Diffusers Docs | On Replicate
- Mochi 1 (Genmo) – 10B param T2V model with AsymmDiT architecture. 5.4s at 30fps. Apache 2.0. On fal.ai, Replicate
- Allegro (Rhymes AI) – 2.8B param VideoDiT. 6s clips at 720p/15fps. Merged into diffusers. Apache 2.0. HuggingFace
- OmniHuman-1 (ByteDance) – Multimodal human video generation from single image + motion signal (audio, video, or text). Full-body, any aspect ratio. On Replicate
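
Several of the models above (CogVideoX, AnimateDiff, Allegro) ship as HuggingFace diffusers pipelines, so local generation takes only a few lines. A minimal sketch with the CogVideoX-5B checkpoint, assuming a CUDA GPU with enough VRAM:

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the 5B text-to-video pipeline in bfloat16 to reduce VRAM use.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # moves submodules to GPU on demand

frames = pipe(
    prompt="A panda playing guitar in a bamboo forest, cinematic lighting",
    num_frames=49,              # roughly 6 seconds at 8 fps
    guidance_scale=6.0,
    num_inference_steps=50,
).frames[0]

export_to_video(frames, "panda.mp4", fps=8)
```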
## SDKs and Developer Tooling

- Replicate SDK – Python/JS client for 100+ hosted video models. Async, streaming, webhooks, fine-tuning. See the sketch after this list. Docs | `pip install replicate`
- fal.ai SDK – Serverless AI inference with Python, JS, and Swift SDKs. Hosts Kling, Veo, Pika, Wan, LTX, Luma, and more. Docs | `pip install fal-client` / `npm install @fal-ai/client`
- Runway SDK – Official Python and Node.js SDKs with type annotations, async support, and built-in polling. `pip install runwayml` / `npm install @runwayml/sdk`
- Luma AI SDK – Sync and async clients for all Dream Machine generation modes. JS Docs | `pip install lumaai`
- HeyGen Streaming Avatar SDK – TypeScript SDK for real-time WebRTC interactive avatar sessions. `npm install @heygen/streaming-avatar`
- MiniMax MCP Server – Model Context Protocol servers for video gen, TTS, and voice cloning. JS
- HuggingFace Diffusers – The canonical PyTorch library for diffusion models, including video pipelines. Docs | `pip install diffusers`
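
As a taste of how thin these SDKs are, here is a minimal Replicate sketch; the model slug and input keys are illustrative placeholders, not a real model:

```python
import replicate

# replicate.run() blocks until the prediction completes and returns the
# output URL(s). Requires REPLICATE_API_TOKEN in the environment.
output = replicate.run(
    "owner/some-video-model",  # hypothetical slug -- pick a real one from the catalog
    input={"prompt": "A timelapse of clouds rolling over mountains"},
)
print(output)
```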
## Infrastructure and Deployment

- Lambda Labs – On-demand H100/B200 GPUs. SSH and JupyterLab access with REST API for instance management.
- CoreWeave – Kubernetes-native AI cloud with enterprise-scale GPU infrastructure.
- RunPod – GPU pods (persistent) and serverless endpoints. REST, GraphQL, and CLI. Docs
- Modal – Serverless Python-first GPU platform. Container spin-up in ~1 second. Docs
- Together AI – Inference API for 200+ open models plus Instant Clusters for self-service GPU clusters.
- Replicate – Serverless model hosting. Run open-source video models via REST API. Docs
- fal.ai – Serverless inference for generative media. 600+ models. Python, JS, Swift SDKs. Docs
- WaveSpeedAI – Fast AI inference with no cold starts. 600+ models. 30–50% cheaper than HuggingFace Inference. 99.9% uptime SLA. GitHub
- Pollo AI – Video API aggregator providing access to Kling, Veo 3.1, Runway, Hailuo, Wan 2.6, and Pollo 2.0. Docs | API
- FFmpeg – Industry-standard multimedia processing: encode, decode, transcode, stream, filter. See the HLS packaging sketch after this list. GitHub
- HandBrake – Open-source video transcoder wrapping FFmpeg. GUI and CLI. GitHub
- Mux – API-first video infrastructure. Upload, encode, stream (VOD + live), analytics. SDKs for Node, Python, Ruby, Go, and more. Docs
- Cloudflare Stream – Video upload, encoding, and CDN delivery billed per minute watched. Live streaming via RTMP/SRT. Docs
- Backblaze B2 – S3-compatible object storage at ~$0.006/GB/month. Free egress via Cloudflare. Docs
- hls.js – JavaScript HLS playback via MSE. Used by major streaming platforms.
- Shaka Player – Google's open-source DASH + HLS player.
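
A common last mile is packaging a generated MP4 for browser playback with hls.js or Shaka Player. Below is a minimal sketch that shells out to FFmpeg to produce a single-rendition HLS VOD playlist; bitrates and segment length are illustrative defaults.

```python
import subprocess

def package_hls(src: str, out_dir: str) -> None:
    """Transcode an MP4 into a single-rendition HLS VOD playlist."""
    subprocess.run(
        [
            "ffmpeg", "-i", src,
            "-c:v", "libx264", "-b:v", "3000k",   # H.264 video at ~3 Mbps
            "-c:a", "aac", "-b:a", "128k",        # AAC audio
            "-hls_time", "4",                     # 4-second segments
            "-hls_playlist_type", "vod",
            f"{out_dir}/index.m3u8",
        ],
        check=True,
    )

package_hls("output.mp4", "stream")
```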
## Evaluation and Observability

- VBench / VBench-2.0 – Comprehensive benchmark for video generative models. 16 fine-grained dimensions including subject consistency, motion smoothness, and temporal flickering. VBench-2.0 adds Physics and Commonsense evaluation. Leaderboard | Paper (CVPR 2024)
## Templates and Example Projects

- fal.ai Next.js Video Generator – Official Next.js template with queue management and TypeScript. One-click Vercel deploy. Vercel Template
- HeyGen Streaming Avatar Demo – Next.js/TypeScript starter for real-time WebRTC avatar sessions.
- Stability AI SVD Streamlit Demo – Streamlit demo scripts for Stable Video Diffusion (see `scripts/demo/`).
- B2 Video Object Detection with Transformers – Video object detection pipeline using HuggingFace Transformers with Backblaze B2 cloud storage integration.
- Google Gemini Streamlit + Cloud Run – Sample app using Gemini multimodal with Streamlit, deployable to Cloud Run.
## Learning Resources

- OpenAI Videos API Reference
- Runway API Quickstart
- Google Veo Developer Guide
- Luma AI API Docs
- HuggingFace Diffusers Video Pipeline Guide
- fal.ai Documentation
- NVIDIA Cosmos Documentation
Contributions welcome! Please read the contribution guidelines first. PRs for new tools, corrections, and updates are appreciated.
