Generate video + audio in one shot, native 4K resolution, up to 20 seconds of coherent video, runs locally on consumer GPUs
LTX-2 is a next-generation multimodal audio-video generation model developed by the Israeli company Lightricks. In a single generation pass it produces high-quality video clips, up to roughly 20 seconds long, complete with dialogue, sound effects, and background audio. Its core selling points are support for up to 4K resolution and high frame rates, with synchronized video motion, lip movement, and audio.
Video and audio are generated in the same diffusion process, so no additional dubbing or post-production compositing is needed; dialogue, sound effects, and background music align with on-screen action.
Official specifications list support for up to 4096×2160 (4K) resolution at roughly 50 FPS, sufficient for short films and commercial-grade content.
| Parameter | Specification |
|---|---|
| Max Resolution | 4096×2160 (4K) |
| Max Frame Rate | ~50 FPS |
| Output Quality | Short film & commercial-grade content |
Native high-quality output means:
A single generation can run approximately 20 seconds, with a consistent visual style across frames and reduced flicker and structural collapse, making it well suited to narrative and camera-movement scenarios.
The model jointly models three dimensions: temporal (video), spatial (visual), and acoustic (audio), learning physical correspondences between actions and sounds, such as door opening accompanied by door sound.
| Dimension | Description |
|---|---|
| Temporal (Video) | Continuity and motion between frames |
| Spatial (Visual) | Visual content and composition of each frame |
| Acoustic (Audio) | Sound waveforms, dialogue, sound effects |
The official documentation states the model is optimized for mainstream NVIDIA GPUs, enabling local operation on high-VRAM consumer graphics cards with inference several times more efficient than previous models.
Supports content control through text prompts, images, sketches, and more. Can be used in workflow tools like ComfyUI, offering multiple quality and speed modes (Fast, Pro, Ultra) to balance effects and processing time.
| Control Method | Description |
|---|---|
| Text Prompts | Describe scenes, actions, and styles in natural language |
| Image Input | Guide generation style with reference images |
| Sketch Control | Define composition and motion with hand-drawn sketches |
| ControlNet | Structural control via Canny/Depth/Pose, etc. |
Three quality modes are available: Fast, Pro, and Ultra, trading generation speed against output quality.
Natively integrated into ComfyUI with Day-0 support; no additional plugins are required.
LTX-2's main advantages: runs locally, generates high-quality "audio-video sync" 4K videos in one shot, with high efficiency, strong controllability, and high degree of open-source accessibility.
| Feature | Detailed Description |
|---|---|
| Native High Quality | Supports up to native 4K resolution and ~50 FPS, coherent visuals with consistent style, can be used directly for professional editing and post-production without additional upscaling and frame interpolation |
| True Audio-Video Sync | Jointly generates video and audio in a single inference, character lip movements align with dialogue, actions with sound effects, rhythm with background music, avoiding the disconnected feel of "post-dubbing" |
| Efficient Inference | Official materials and reviews indicate that compared to similar models, LTX-2 can reduce computational cost by ~50%, generating faster under the same settings and running efficiently on both data center and consumer GPUs |
| Consumer GPU Support | Model specifically optimized for NVIDIA RTX GPUs, with NVFP4/NVFP8 weights, can generate 4K video locally, reducing dependence on cloud and subscription platforms |
| Longer Duration | Can generate up to ~10-20 seconds of continuous audio-video clips per generation, longer than many open-source solutions, suitable for complete shots and short scenes, not just "animated GIF-level" effects |
| Fine-Grained Control | Provides multi-keyframe, 3D camera path, rhythm control capabilities, supports "director-level" control through prompts, camera settings, style LoRAs, suitable for storyboarding and narrative needs |
| Open-Source Weights & Code | Lightricks announced opening LTX-2's weights, code, and benchmarks, allowing users to freely deploy locally and customize training in an open-source manner, more flexible and controllable than many closed-source SaaS tools |
| Complete Tool Ecosystem | Natively integrated into workflow tools like ComfyUI, with tutorials and optimized workflows from NVIDIA and the community, beginners can get started quickly, advanced users can deeply customize |
No installation required, experience LTX-2's powerful features directly in your browser
Note: Initial loading may take some time, please be patient. Powered by Hugging Face Spaces.
Operating System & Framework: Windows or Linux is recommended; install the latest version of ComfyUI locally (from comfyui.org or GitHub), and update to the nightly/latest commit that supports LTX-2.
| Configuration Level | Recommended GPU | VRAM Required | Description |
|---|---|---|---|
| Comfortable | RTX 4070/4080/4090 | 16 GB+ | Runs high resolutions and long frame counts |
| Basic | RTX 3080/4060 | 12-16 GB | Requires the FP8 lightweight model |
| Low-End | RTX 3060/3070 | 8-12 GB | Requires reduced resolution and frame count |
1. Run `git pull` or use the launcher to update ComfyUI to the latest version, ensuring it includes the LTX-2 nodes and templates.
2. Place the downloaded model files in the corresponding folders (`models/ltx`, `models/checkpoints`, etc.).
3. In `LTXVModelConfigurator` or the LTX-2 configuration node, set:
| Parameter | Recommended Value | Description |
|---|---|---|
| Resolution | 768×512 or 1280×720 | Must be multiples of 32 |
| Frame Count | 64-128 frames | Determines video duration; LTX models typically require frame counts of the form 8n+1 (e.g., 97 or 121) |
| FPS | 24-30 | Frame rate, affects video smoothness |
| Sampling Steps | 10-15 | More steps = higher quality but slower speed |
| CFG | 2-4 | Too high causes stiff visuals |
Text-to-Video: Enter positive prompts (detailed scene, action, and style descriptions) and negative prompts (such as "blurry, jittery, watermark, worst quality") in the prompt nodes, then click Queue Prompt to generate the video.
Suitable for creating scenes from scratch. Focus on clearly describing: subject, scene, action, camera movement, lighting, and style.
Positive Prompt:
a close-up shot of a young woman speaking to camera in a cozy kitchen,
warm soft lighting, natural expressions, 4K film look, cinematic style
Negative Prompt:
worst quality, low quality, blurry, jittery, unstable motion,
distorted faces, extra limbs, watermark, text, logo, flickering
To emphasize sound effects or dialogue, add descriptions like "clear dialogue, realistic sound effects, background music" to help the model generate richer audio tracks.
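When driving ComfyUI programmatically instead of through the UI, the prompt text can be injected into a workflow graph (ComfyUI's API-format JSON) before submission. A minimal sketch follows; the node titles "positive" and "negative" are hypothetical markers used to locate the prompt nodes, so adapt them to the titles in your actual LTX-2 workflow.

```python
import copy

def fill_prompts(workflow: dict, positive: str, negative: str) -> dict:
    """Return a copy of an API-format ComfyUI workflow with prompt text filled in.

    Nodes are matched by their "_meta" title -- assumed here to be
    "positive" / "negative"; rename these to match your own workflow.
    """
    wf = copy.deepcopy(workflow)
    for node in wf.values():
        title = node.get("_meta", {}).get("title", "").lower()
        if title in ("positive", "negative"):
            node["inputs"]["text"] = positive if title == "positive" else negative
    return wf
```

The resulting dict can then be sent as `{"prompt": wf}` via HTTP POST to a locally running ComfyUI instance (by default `http://127.0.0.1:8188/prompt`).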
Load a reference video via `VHS_LoadVideo` and combine it with Canny/Depth/Pose ControlNet nodes so LTX-2 redraws the style or adds audio while preserving the original structure and motion.
| Control Type | Description | Use Cases |
|---|---|---|
| Canny Edge | Preserves frame contours and structure | Style transfer, redrawing |
| Depth | Preserves spatial depth relationships | Scene reconstruction |
| Pose | Preserves character actions and poses | Character motion transfer |
Adjust `sigma_shift` appropriately to balance speed and quality.

| Type | ❌ Bad Example | ✅ Good Example |
|---|---|---|
| Static Description | a man in a park | a man jogging slowly through a city park, passing trees and benches, breathing visibly in the cold air |
| Missing Timeline | woman cooking | a woman chopping vegetables, then stirring a pot on the stove, steam rising |
| Unclear Camera | person walking | medium shot of a person walking towards camera, handheld style, shallow depth of field |
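The pattern in the good examples, a subject with a concrete action, a timeline, and explicit camera language, can be sketched as a small prompt builder. The field names and comma-joining below are stylistic choices for illustration, not a model requirement.

```python
def build_prompt(subject: str, action: str, scene: str = "",
                 camera: str = "", lighting: str = "", style: str = "") -> str:
    """Compose a video prompt from the elements the table above recommends."""
    # Camera language first, then subject + action, then context details.
    parts = [camera, "%s %s" % (subject, action), scene, lighting, style]
    return ", ".join(p for p in parts if p)

prompt = build_prompt(
    subject="a man",
    action="jogging slowly through a city park",
    scene="passing trees and benches",
    camera="medium shot",
    style="handheld style, shallow depth of field",
)
```

Empty fields are simply skipped, so the same helper covers both minimal and fully specified prompts.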
Check whether you are running an old ComfyUI Portable build, incompatible plugins, or have insufficient VRAM.
Use the official LTX-2 audio-video sync nodes in the workflow; do not swap audio tracks manually outside the tool, so the temporal structure from model generation is preserved.
Based on public information and community feedback, LTX-2's user reviews are generally positive, though there are some common criticisms.
Many creators believe the LTX series stands out among similar open-source models for visual detail, lighting, and motion coherence, with characters and scenes less prone to "collapse" in long shots.
On high-performance GPUs, LTX models can generate several seconds of video in just seconds, commonly rated as "faster than most open-source video models," suitable for iterative prompt testing and style adjustment.
Many reviews mention it's "sufficient" for educational content, marketing materials, and short video creation scenarios, not just a tech demo but something that can be integrated into actual workflows.
Overall Assessment: Users generally recognize the LTX series (including LTX-2) for its combination of speed, quality, and open-source availability, and consider it a strong option in open-source video generation, while broadly agreeing that the best results require capable hardware and some tinkering.
LTX-2 user FAQs mainly focus on output quality, parameter/workflow settings, and hardware/stability issues.
Common causes and solutions:
Solutions:
Solutions:
Rules:
Recommended starter configuration:
| Version | Precision | VRAM Required | Use Case |
|---|---|---|---|
| Full BF16 | Highest | 16 GB+ | Maximum quality, ample VRAM |
| Full FP8 | High | 12-16 GB | Balance quality & VRAM |
| Distilled FP8 | Medium | 8-12 GB | Limited VRAM, speed priority |
| LoRA Lightweight | Med-Low | 8 GB | Low-end GPU, quick testing |
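A quick way to pick a starting variant is to map available VRAM onto the table's thresholds. The cutoffs below mirror the table; actual requirements shift with resolution and frame count, so treat this as a rough guide rather than a hard rule.

```python
def pick_variant(vram_gb: float) -> str:
    """Suggest an LTX-2 model variant from available VRAM (per the table above)."""
    if vram_gb >= 16:
        return "Full BF16"
    if vram_gb >= 12:
        return "Full FP8"
    if vram_gb >= 8:
        return "Distilled FP8"
    # Below 8 GB the table only lists the LoRA lightweight option.
    return "LoRA Lightweight"
```

For example, a 24 GB RTX 4090 maps to the full BF16 weights, while a 10 GB card falls into the distilled FP8 tier.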
Core principles:
Example:
a close-up shot of a young woman speaking to camera
in a cozy kitchen, warm soft lighting, natural expressions,
4K film look, cinematic style
Optimization suggestions:
Common causes:
Solutions: