LTX-2 — The New Standard for Open-Source Audio-Video Sync Generation

Generate video + audio in one shot, native 4K resolution, up to 20 seconds of coherent video, runs locally on consumer GPUs

Audio-Video Sync · 4K HD · 50 FPS High Frame Rate · Open Source & Free · Local Deployment · Native ComfyUI Support

What is LTX-2?

LTX-2 is a next-generation multimodal AI audio-video generation model developed by the Israeli company Lightricks. In a single generation pass it produces high-quality video clips up to approximately 20 seconds long, complete with dialogue, sound effects, and background audio. Its core selling point is support for up to 4K resolution and high-frame-rate output, with video motion, lip movements, and audio kept in sync.

LTX-2 Core Capabilities

🎥 One-Shot Generation of "Video + Audio"

Video and audio are generated in the same diffusion process, so no additional dubbing or post-production compositing is needed; dialogue, sound effects, and background music align with on-screen actions.

  • Dialogue Sync: Character lip movements align with speech
  • Action Sound Effects: Door opening accompanied by door sound, footsteps match audio
  • Background Music: Rhythm coordinates with video actions
  • No Post-Production: Eliminates tedious steps like dubbing, mixing, and timeline alignment
📺 High Resolution & High Frame Rate

The official specifications list up to 4096×2160 (4K) resolution and a frame rate of approximately 50 FPS, sufficient for short films and commercial-grade content.

  • Max Resolution: 4096×2160 (4K)
  • Max Frame Rate: ~50 FPS
  • Output Quality: short-film & commercial-grade content

Native high-quality output means:

  • Can be used directly for professional editing and post-production
  • No need for additional upscaling and frame interpolation
  • Outstanding detail and lighting performance among open-source models
⏱️ Duration & Coherence

A single generation runs up to approximately 20 seconds, with a consistent visual style across frames and reduced flicker and structural collapse, making it better suited to narrative and camera-movement scenarios.

  • Up to ~20 seconds of continuous audio-video clips per generation
  • Visual style remains consistent across frames
  • Reduces flicker and structural collapse
  • Better suited for complete scenes with narrative and camera movement
  • Not just "animated GIF-level" effects, but truly usable complete shots

LTX-2 Technical Architecture & Features

🧠 Multimodal Diffusion Architecture

The model jointly models three dimensions: temporal (video), spatial (visual), and acoustic (audio).

  • Temporal (video): continuity and motion between frames
  • Spatial (visual): visual content and composition of each frame
  • Acoustic (audio): sound waveforms, dialogue, and sound effects

The model learns physical correspondences between actions and sounds, such as:

  • Door opening → Door hinge rotation sound
  • Person walking → Footstep sounds
  • Speaking lip movements → Corresponding speech

Deep GPU Optimization

According to the official documentation, the model is optimized for mainstream NVIDIA GPUs: it runs locally on high-VRAM consumer graphics cards, with inference efficiency several times higher than previous models.

  • Can run locally on high-VRAM consumer GPUs
  • Inference efficiency improved several times over previous models
  • Computational cost reduced by approximately 50%
  • With NVFP4/NVFP8 low-precision weights, can generate 4K video locally
  • Reduces dependence on cloud and subscription platforms
🎛️ Flexible Control & Workflow

Supports content control through text prompts, images, sketches, and more. Can be used in workflow tools like ComfyUI, offering multiple quality and speed modes (Fast, Pro, Ultra) to balance effects and processing time.

  • Text Prompts: describe scenes, actions, and styles in natural language
  • Image Input: guide the generation style with reference images
  • Sketch Control: define composition and motion with hand-drawn sketches
  • ControlNet: structural control via Canny/Depth/Pose, etc.

Multiple quality modes available:

  • Fast Mode: Quick output, suitable for iterative testing
  • Pro Mode: Balances quality and speed
  • Ultra Mode: Highest quality output

Natively integrated into ComfyUI with Day-0 support; no additional plugins are required.

LTX-2 Unique Advantages

LTX-2's main advantages: it runs locally, generates high-quality audio-video-synced 4K video in one shot, and combines high efficiency, strong controllability, and a high degree of open-source accessibility.

✅ Quality & Audio-Video Sync Advantages

  • Native High Quality: supports up to native 4K resolution at ~50 FPS with coherent visuals and a consistent style; output can go straight into professional editing and post-production without additional upscaling or frame interpolation
  • True Audio-Video Sync: jointly generates video and audio in a single inference pass, so lip movements align with dialogue, actions with sound effects, and rhythm with background music, avoiding the disconnected feel of post-dubbing

✅ Performance & Local Execution Advantages

  • Efficient Inference: official materials and reviews indicate that, compared with similar models, LTX-2 cuts computational cost by ~50% and generates faster under the same settings, running efficiently on both data-center and consumer GPUs
  • Consumer GPU Support: specifically optimized for NVIDIA RTX GPUs; with NVFP4/NVFP8 weights it can generate 4K video locally, reducing dependence on cloud and subscription platforms

✅ Duration & Creative Control Advantages

  • Longer Duration: generates up to ~10-20 seconds of continuous audio-video per run, longer than many open-source alternatives and suitable for complete shots and short scenes rather than "animated GIF-level" effects
  • Fine-Grained Control: provides multi-keyframe, 3D camera path, and rhythm control, supporting "director-level" direction through prompts, camera settings, and style LoRAs, well suited to storyboarding and narrative work

✅ Open-Source & Ecosystem Advantages

  • Open-Source Weights & Code: Lightricks has announced the open release of LTX-2's weights, code, and benchmarks, so users can deploy locally and customize training freely, more flexible and controllable than many closed-source SaaS tools
  • Complete Tool Ecosystem: natively integrated into workflow tools like ComfyUI, with tutorials and optimized workflows from NVIDIA and the community; beginners can get started quickly, and advanced users can customize deeply


🎮 Try LTX-2 Online

No installation required: try LTX-2 directly in your browser via Hugging Face Spaces (initial loading may take some time).

LTX-2 Quick Start Guide

📋 LTX-2 Environment & Hardware Requirements

Operating System & Framework: Windows or Linux is recommended. Install the latest version of ComfyUI locally (from the official site or GitHub), and update to the nightly/latest commit that supports LTX-2.

GPU Configuration

  • Comfortable: RTX 4070/4080/4090, 16 GB+ VRAM; runs high resolutions and long frame counts
  • Basic: RTX 3080/4060, 12-16 GB VRAM; requires the FP8 lightweight model
  • Low-End: RTX 3060/3070, 8-12 GB VRAM; requires reduced resolution and frame count
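
If you are unsure which tier your machine falls into, checking free VRAM is enough to pick a starting point. Below is a minimal Python sketch, assuming a CUDA build of PyTorch is installed; the thresholds simply mirror the tiers above.

# Minimal sketch: map free VRAM to one of the tiers listed above.
import torch

def suggest_tier() -> str:
    if not torch.cuda.is_available():
        return "No CUDA GPU detected; local LTX-2 inference needs an NVIDIA GPU."
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    free_gb = free_bytes / 1024**3
    if free_gb >= 16:
        return f"{free_gb:.1f} GB free: Comfortable (high resolutions, long frame counts)"
    if free_gb >= 12:
        return f"{free_gb:.1f} GB free: Basic (use the FP8 lightweight model)"
    if free_gb >= 8:
        return f"{free_gb:.1f} GB free: Low-End (reduce resolution and frame count)"
    return f"{free_gb:.1f} GB free: below the documented minimum for LTX-2"

print(suggest_tier())
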
📥 LTX-2 Installation & Model Download

Install ComfyUI

  • Download ComfyUI from the official website or repository and install dependencies (Python, CUDA, etc.) as instructed; then run the ComfyUI server and open the local address in your browser to see the interface
  • If you already have ComfyUI, run git pull or use the launcher to update to the latest version, ensuring it includes LTX-2 related nodes and templates

Download LTX-2 Model Weights

  • Follow tutorials to download the corresponding LTX-2 main model (such as BF16, FP8, or lightweight LoRA version) and required text encoders (such as Gemma 3 or specified CLIP model)
  • Place in ComfyUI's models directory (such as models/ltx, models/checkpoints, etc.)
  • Some integrated packages or "one-click packages" place models in the correct path automatically; beginners may prefer these integrated versions
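
If you prefer scripting the download, the sketch below uses the huggingface_hub Python package. The repo id and filename are placeholders, not confirmed names; substitute whatever the official tutorial you follow specifies.

# Hedged sketch: fetch a checkpoint with huggingface_hub.
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="Lightricks/LTX-2",              # assumption: replace with the real repo id
    filename="ltx-2-fp8.safetensors",        # assumption: choose BF16/FP8 per your VRAM
    local_dir="ComfyUI/models/checkpoints",  # matches the models directory described above
)
print("saved to", checkpoint_path)
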
🎛️ Using LTX-2 in ComfyUI

Use Official/Template Workflows

  • Open ComfyUI's template browser, select LTX-2/LTX Video workflow templates in the "Video" or "LTX" category, and load the node graph with one click
  • Templates typically already include model loading, text encoder, sampler, VAE decoder, video composition nodes, etc. Just modify prompts and a few parameters to generate

Set Key Parameters

In the LTXVModelConfigurator node (or the equivalent LTX-2 configuration node), set:

  • Resolution: 768×512 or 1280×720 (must be multiples of 32)
  • Frame Count: 64-128 frames (determines video duration; must follow the model's frame formula)
  • FPS: 24-30 (affects video smoothness)
  • Sampling Steps: 10-15 (more steps = higher quality but slower)
  • CFG: 2-4 (too high causes stiff visuals)

Text-to-Video: enter a positive prompt (detailed scene, action, and style descriptions) and a negative prompt (such as "blurry, jittery, watermark, worst quality") in the prompt nodes, then queue the prompt to generate the video.
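
The same run can also be scripted against ComfyUI's local HTTP API. The sketch below assumes a default server on port 8188 and a workflow exported with "Save (API Format)"; the filename and node id are hypothetical and vary per workflow.

# Sketch: queue a saved LTX-2 workflow through ComfyUI's HTTP API.
import json
import urllib.request

with open("ltx2_t2v_api.json") as f:         # hypothetical export filename
    workflow = json.load(f)

# Node ids differ per workflow; "6" is only an example positive-prompt node.
# workflow["6"]["inputs"]["text"] = "a close-up shot of a young woman ..."

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",          # ComfyUI's default local endpoint
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode("utf-8"))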

LTX-2 Generation Guide & Tips

📝 LTX-2 Text-to-Video (T2V)

Suitable for creating scenes from scratch. Focus on clearly describing: subject, scene, action, camera movement, lighting, and style.

Prompt Structure

  • Subject: Who is in the frame (young woman, middle-aged man, cyberpunk robot)
  • Scene: Where (indoor studio, cozy kitchen, rainy street, futuristic city rooftop)
  • Action: What they're doing (speaking to camera, playing the piano, running, dancing slowly)
  • Camera: How it's shot (close-up shot, camera slowly moves forward, shallow depth of field, 25 fps cinematic look)

Example Prompts

Positive Prompt:
a close-up shot of a young woman speaking to camera in a cozy kitchen, 
warm soft lighting, natural expressions, 4K film look, cinematic style

Negative Prompt:
worst quality, low quality, blurry, jittery, unstable motion, 
distorted faces, extra limbs, watermark, text, logo, flickering

To emphasize sound effects or dialogue, add descriptions like "clear dialogue, realistic sound effects, background music" to help the model generate richer audio tracks.
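
The recommended ordering can also be wrapped in a tiny helper. This is an illustrative Python sketch, not an official tool; the function and field names are my own.

# Illustrative helper: assemble a prompt in the recommended order
# (camera/subject/action/scene, then lighting/style, then audio cues).
def build_prompt(subject, scene, action, camera, style, audio=None):
    parts = [f"{camera} of {subject} {action} in {scene}", style]
    if audio:
        parts.append(audio)
    return ", ".join(parts)

print(build_prompt(
    subject="a young woman",
    scene="a cozy kitchen",
    action="speaking to camera",
    camera="a close-up shot",
    style="warm soft lighting, natural expressions, 4K film look, cinematic style",
    audio="clear dialogue, soft background music",
))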

🎞️ LTX-2 Video-to-Video (V2V) & Control

Load a reference video via VHS_LoadVideo, then combine it with Canny/Depth/Pose ControlNet nodes so LTX-2 can restyle the footage or add audio while preserving the original structure and motion.

Common Control Methods

  • Canny Edge: preserves frame contours and structure (style transfer, redrawing)
  • Depth: preserves spatial depth relationships (scene reconstruction)
  • Pose: preserves character actions and poses (character motion transfer)

Common Optimizations

  • Lower CFG (e.g., 2-4)
  • Reduce sampling steps (10-15)
  • Adjust sigma_shift appropriately to balance speed and quality

✍️ LTX-2 Prompt Optimization Tips

Basic Writing Approach

  • Write in English, short but specific: Prioritize English descriptions, keep to one or two complete sentences or under ~200 words, avoid long keyword stuffing
  • Clear information structure: Follow the order "who is the subject → where is the scene → what action → camera/lighting/style" to reduce ambiguity

Scene & Action Description Tips

  • Static description: ❌ "a man in a park" → ✅ "a man jogging slowly through a city park, passing trees and benches, breathing visibly in the cold air"
  • Missing timeline: ❌ "woman cooking" → ✅ "a woman chopping vegetables, then stirring a pot on the stove, steam rising"
  • Unclear camera: ❌ "person walking" → ✅ "medium shot of a person walking towards camera, handheld style, shallow depth of field"

Camera Language Vocabulary

  • Shot Types: close-up, medium shot, wide shot, extreme close-up, establishing shot
  • Camera Movement: tracking shot, dolly zoom, slow pan, tilt up/down, handheld, static camera
  • Lighting Style: soft warm light, neon lights, golden hour, harsh shadows, rim lighting
  • Visual Style: film look, anime style, watercolor style, 4K cinematic, vintage 8mm

Audio & Multilingual Prompts

  • Specify language & dialogue: e.g., "she says in clear Mandarin: '大家好'", which helps the model generate matching speech and lip sync
  • Describe sound atmosphere: Add terms like "clear dialogue, soft background music, subtle city ambience, gentle rain sounds"

Practical Tips

  • Start with theme, then expand: Begin with a short theme sentence, use prompt generators or LLMs to expand, then refine and modify yourself
  • Make small incremental adjustments: Change only one thing at a time (e.g., just camera, just lighting), observe differences, gradually find templates and patterns that work for LTX-2

⚙️ LTX-2 Quality Enhancement & Troubleshooting

Improving Quality

  • Use two-stage generation: First generate draft at lower resolution, then use LTX-2's upscaling/high-quality workflow for secondary super-resolution or resampling to achieve near-4K output
  • Maintain parameter standards: Keep resolution and frame count as multiples of 32, maintain VRAM headroom to avoid freezing or overflow
  • Streamline prompts: Avoid overly long or complex prompts, emphasize subject, scene, and action information

Common Troubleshooting

Workflow Crashes or Errors

Check whether you are running an old ComfyUI Portable build, incompatible plugins, or simply have insufficient VRAM. Try:

  • Update ComfyUI to latest version
  • Disable conflicting custom nodes
  • Reduce resolution/frame count

Video and Audio Out of Sync

Use the official LTX-2 audio-video sync nodes in the workflow, and avoid swapping audio tracks in an external editor, so the temporal structure produced by the model is preserved.

LTX-2 User Reviews

Based on public information and community feedback, LTX-2's user reviews are generally positive, though there are some common criticisms.

👍 Positive Aspects

🎨 Quality & Coherence

Many creators believe the LTX series stands out among similar open-source models for visual detail, lighting, and motion coherence, with characters and scenes less prone to "collapse" in long shots.

Generation Speed

On high-performance GPUs, LTX models can generate several seconds of video in just seconds, commonly rated as "faster than most open-source video models," suitable for iterative prompt testing and style adjustment.

💼 Usability & Practical Value

Many reviews mention it's "sufficient" for educational content, marketing materials, and short video creation scenarios, not just a tech demo but something that can be integrated into actual workflows.


Overall Assessment: users generally recognize the LTX series (including LTX-2) for its combination of speed, quality, and openness, and consider it a strong option among open-source video generators, while agreeing that the best results require capable hardware and a willingness to tinker.

LTX-2 Frequently Asked Questions

LTX-2 user FAQs mainly focus on output quality, parameter/workflow settings, and hardware/stability issues.

Q: Generated video barely moves or is almost static?

Common causes and solutions:

  • Resolution/frame count not compliant → Ensure multiples of 32
  • CFG too high → Reduce to 2-4
  • Missing motion-related prompts → Add specific actions and processes in description, e.g., "a man jogging slowly through a park, passing trees"
  • Prompt too static → Avoid just writing "a man in a park," describe the action clearly

Q: Camera keeps randomly zooming or panning?

Solutions:

  • Use "static camera LoRA" or camera control nodes to constrain camera movement
  • Specify camera type explicitly in prompt, e.g., "static camera, medium shot"
  • If you do want camera movement, state it explicitly, e.g., "camera slowly pans left"

Q: Running out of VRAM?

Solutions:

  • Use FP8 or lower precision models (FP4)
  • Reduce resolution and frame count (e.g., from 1280×720 to 768×512)
  • Check if any nodes are still running on CPU
  • Close other programs using VRAM
  • Use ComfyUI's low-VRAM mode

Q: How to set resolution and frame count?

Rules:

  • Resolution must be multiples of 32, e.g., 768×512, 1280×720, 1024×576
  • Frame count must follow the model's frame formula; common values are 64, 96, 128, and 160 frames
  • Non-compliant settings easily lead to no motion or errors

Recommended starter configuration:

  • Resolution: 768×512 or 1280×720
  • Frame count: 64-96 frames
  • FPS: 24-30
  • Sampling steps: 10-15
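
To avoid non-compliant settings, the rules above can be encoded in a few lines. This is a hedged Python sketch based only on the constraints stated here (multiples of 32, frame counts from the listed values), not on the model's internals.

# Sketch: snap a requested size and frame count to the stated rules.
VALID_FRAME_COUNTS = (64, 96, 128, 160)

def snap_settings(width: int, height: int, frames: int) -> tuple[int, int, int]:
    w = max(32, round(width / 32) * 32)   # nearest multiple of 32
    h = max(32, round(height / 32) * 32)
    f = min(VALID_FRAME_COUNTS, key=lambda v: abs(v - frames))
    return w, h, f

print(snap_settings(1920, 1080, 100))  # -> (1920, 1088, 96)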

Q: Which model version to use? Full or Distilled?

  • Full BF16: highest precision, 16 GB+ VRAM; maximum quality when VRAM is ample
  • Full FP8: high precision, 12-16 GB VRAM; balances quality and VRAM
  • Distilled FP8: medium precision, 8-12 GB VRAM; limited VRAM, speed priority
  • LoRA Lightweight: medium-low precision, ~8 GB VRAM; low-end GPUs and quick testing

Q: How to write better prompts?

Core principles:

  • Write in English, model understands English better
  • Short but specific, keep under 200 words
  • Follow order: subject → scene → action → camera/lighting/style
  • Specify action process, don't just write static descriptions

Example:

a close-up shot of a young woman speaking to camera 
in a cozy kitchen, warm soft lighting, natural expressions, 
4K film look, cinematic style

Q: Inference too slow?

Optimization suggestions:

  • Reduce resolution/frame count
  • Switch to lower precision weights (FP8/FP4)
  • Reduce sampling steps (10-15 steps sufficient)
  • Use Fast mode instead of Ultra mode
  • Check if any nodes are running on CPU

Q: Workflow incompatible/errors?

Common causes:

  • ComfyUI or plugin versions too old
  • Node changes causing incompatibility
  • Missing required custom nodes

Solutions:

  • Update ComfyUI to latest version
  • Update or reinstall related plugins
  • Use latest official workflow templates
  • Check node status in ComfyUI Manager