Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?

1The Chinese University of Hong Kong 2National University of Singapore 3Video Rebirth 4University of Oxford
(* indicates equal contribution, † indicates corresponding author)

TL;DR: Can you identify which videos are AI-generated?

Please turn on the sound 👂

(Grid of 12 numbered video clips, 1-12; answers below.)
Real: 2, 6, 7, 11
AI-generated: 1, 3, 4, 5, 8, 9, 10, 12

Abstract

Recent advances in video generation have produced vivid content that is often indistinguishable from real videos, making the detection of AI-generated video an emerging societal challenge. Prior AIGC detection benchmarks mostly evaluate videos without audio, target broad narrative domains, and focus solely on classification. Yet it remains unclear whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive humans and VLMs. To this end, we introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio-visual coupling, featuring the following dimensions: (i) Immersive ASMR video-audio sources. Built on carefully curated real ASMR videos, the benchmark targets fine-grained action-object interactions with diversity across objects, actions, and backgrounds. (ii) Peer-Review evaluation. An adversarial creator-reviewer protocol in which video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers seeking to identify fakeness. Our experiments show that the best creator, Veo3.1-Fast, fools most VLMs: the strongest reviewer (Gemini-2.5-Pro) achieves only 56% accuracy (random guessing is 50%), far below human experts (81.25%). Adding audio improves real-fake discrimination, yet superficial cues such as watermarks can still significantly mislead models. These findings delineate the current boundary of video generation realism and expose the limitations of VLMs in perceptual fidelity and audio-visual consistency.

Figure 1. An overview of the Peer-Review framework for ASMR video reality testing. Video generation models ("creators") attempt to synthesize fake ASMR videos that can fool multimodal reviewers, while video-understanding models ("reviewers") aim to detect fakes.
Figure 2. Illustration of the Video Reality Test creation pipeline, which encompasses four phases: Collection, Preprocessing, Captioning, and Clustering.
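Concretely, the reviewer side of the Peer-Review framework reduces to a per-clip real/fake query followed by per-creator aggregation. The sketch below illustrates that loop under our own assumptions: the `query_reviewer` function and the prompt wording are illustrative placeholders, not the exact prompt used in the paper.

```python
# Minimal sketch of the creator-reviewer ("peer review") loop in Figure 1.
# `query_reviewer` stands in for an actual VLM call (e.g. a Gemini or GPT-4o
# endpoint); its name and the prompt wording are illustrative assumptions.
from dataclasses import dataclass

PROMPT = (
    "You are shown a short ASMR clip. Decide whether it is a real recording "
    "or AI-generated. Answer with exactly one word: REAL or FAKE."
)

@dataclass
class Clip:
    path: str       # path to the .mp4 file
    creator: str    # "real" or the name of the generating model
    is_real: bool   # ground-truth label

def query_reviewer(clip_path: str, prompt: str) -> str:
    """Placeholder for the reviewer VLM; should return 'REAL' or 'FAKE'."""
    raise NotImplementedError

def review(clips: list[Clip]) -> dict[str, float]:
    """Per creator, the fraction of its clips the reviewer judges REAL (fool rate)."""
    verdicts: dict[str, list[int]] = {}
    for clip in clips:
        answer = query_reviewer(clip.path, PROMPT).strip().upper()
        verdicts.setdefault(clip.creator, []).append(int(answer == "REAL"))
    return {creator: sum(v) / len(v) for creator, v in verdicts.items()}
```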

Data Analysis

We release the complete ASMR-based Video Reality Test corpus: real videos, extracted images, prompts, and outputs from 13 different video-generation settings (OpenSoraV2 variants, Wan2.2 variants, Sora2, Veo3.1-Fast, Hunyuan, StepFun, etc.). For each of the 149 scenes we therefore provide 1 + k clips (one real clip plus k = 13 generated variants, one per generation setting). The ModelScope and Hugging Face mirrors host identical content.
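Once downloaded, iterating over the corpus amounts to walking one real clip plus its generated variants per scene. The sketch below shows one way to do this; the repository id and folder names are hypothetical placeholders, so check the ModelScope or Hugging Face dataset card for the actual layout.

```python
# Sketch of downloading and walking the corpus: one real clip plus k generated
# variants per scene. The repo id and folder names below are hypothetical
# placeholders; the actual layout is documented on the dataset pages.
from pathlib import Path
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="your-org/video-reality-test",  # placeholder, not the real repo id
    repo_type="dataset",
)

for scene_dir in sorted(Path(local_dir).glob("scenes/*")):
    real_clips = list(scene_dir.glob("real/*.mp4"))
    fake_clips = list(scene_dir.glob("generated/*/*.mp4"))  # one folder per setting
    print(f"{scene_dir.name}: {len(real_clips)} real, {len(fake_clips)} generated")
```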

Figure 3. Detailed analysis of Video Reality Test. (a) Example instances across different dimensions. (b) Distribution statistics. (c) Action statistics and time distribution.

Leaderboards

Please contact us via email (wjqkoko@foxmail.com) to update and submit your model results.

Table 1. Performance comparison on Video Reality Test for video understanding. We compare various VLMs (Open-source and Proprietary) against human performance. 1st, 2nd, and 3rd denote the best, second, and third performers.
| Model | Veo3.1-Fast | Sora2 | Wan2.2-A14B | Wan2.2-5B | Opensora-V2 | HunyuanVideo | StepVideo | Avg. (↑) | Rank |
|---|---|---|---|---|---|---|---|---|---|
| Random | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 13 |
| Human | 81.25 | 91.25 | 86.25 | 91.25 | 91.25 | 91.25 | 91.25 | 89.11 | 1 |
| Open-source Models | | | | | | | | | |
| Qwen3-VL-8B* | 57.79 | 87.69 | 55.56 | 56.28 | 51.50 | 54.50 | 83.84 | 63.88 | 5 |
| Qwen3-VL-30B-A3B | 51.08 | 51.35 | 49.44 | 54.74 | 47.09 | 49.74 | 81.68 | 54.87 | 12 |
| Qwen2.5-VL-72B | 49.50 | 71.07 | 51.05 | 51.00 | 54.50 | 53.50 | 81.50 | 58.87 | 10 |
| Qwen3-VL-235B-A22B | 56.53 | 80.75 | 53.89 | 52.66 | 50.79 | 48.19 | 90.53 | 61.91 | 8 |
| GLM-4.5V | 54.64 | 63.75 | 54.90 | 57.59 | 66.24 | 61.01 | 87.13 | 63.61 | 6 |
| Proprietary Models | | | | | | | | | |
| GPT-4o-mini | 52.50 | 51.78 | 53.68 | 50.50 | 53.00 | 50.50 | 89.00 | 57.28 | 11 |
| GPT-4o | 51.50 | 51.27 | 55.26 | 55.50 | 56.50 | 56.50 | 95.00 | 60.22 | 9 |
| GPT-5 (Preview) | 54.55 | 95.43 | 55.26 | 57.50 | 56.78 | 56.50 | 93.97 | 67.14 | 4 |
| Gemini-2.5-Flash | 47.72 | 87.56 | 53.55 | 55.44 | 55.15 | 53.06 | 78.63 | 61.59 | 7 |
| + Audio | 52.55 | 93.65 | 53.55 | 55.44 | 55.15 | 53.06 | 78.63 | 63.15 | 8 |
| Gemini-2.5-Pro | 51.56 | 84.49 | 59.09 | 60.21 | 62.30 | 65.76 | 87.98 | 67.34 | 3 |
| + Audio | 56.00 | 87.72 | 59.09 | 60.21 | 62.30 | 65.76 | 87.98 | 68.44 | 3 |
| Gemini-3-Pro-Preview | 77.89 | 89.90 | 57.67 | 73.87 | 65.83 | 80.90 | 87.94 | 76.27 | 2 |

* Note: Qwen3 links currently point to the Qwen organization, as specific Qwen3 repositories may be internal or still grouped under the Qwen2.5 umbrella.
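Each cell in Table 1 can be read as binary real-vs-fake classification accuracy on a pool that mixes one creator's fakes with matched real clips, which is why random guessing sits at 50. A minimal helper for that computation, assuming the per-clip verdicts have already been collected, might look like the following sketch.

```python
# Table 1 style score: binary real-vs-fake accuracy for one reviewer on the pool
# formed by a single creator's fakes plus the matched real clips (random = 50).
def creator_accuracy(verdicts: list[tuple[bool, bool]]) -> float:
    """verdicts holds (is_real, judged_real) pairs for that pool of clips."""
    correct = sum(is_real == judged_real for is_real, judged_real in verdicts)
    return 100.0 * correct / len(verdicts)

def reviewer_average(per_creator_acc: dict[str, float]) -> float:
    """The Avg. column: mean accuracy across the seven creator columns."""
    return sum(per_creator_acc.values()) / len(per_creator_acc)
```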

Table 2. Performance comparison on Video Reality Test of video generation models across different generation settings. Image denotes a start-frame image input and Text denotes a text-description input. Lower Avg. score indicates better fooling capability.
| Inputs | Image | Text | GPT-4o-mini | GPT-4o | Gemini-2.5-Flash | Gemini-2.5-Pro | Avg. (↓) | Rank |
|---|---|---|---|---|---|---|---|---|
| Opensora-V2 | | | | | | | | |
| Text2Vid | | ✓ | 14.00 | 10.00 | 28.72 | 39.18 | 22.98 | 8 |
| ImgText2Vid | ✓ | ✓ | 12.00 | 15.00 | 30.21 | 35.16 | 23.59 | 9 |
| Text2Img2Vid | | ✓ | 14.00 | 18.00 | 45.36 | 43.75 | 30.28 | 10 |
| Wan2.2 | | | | | | | | |
| Text2Vid-A14B | | ✓ | 12.00 | 13.00 | 24.47 | 21.74 | 17.80 | 5 |
| ImgText2Vid-A14B | ✓ | ✓ | 8.89 | 7.78 | 23.53 | 26.19 | 16.10 | 3 |
| ImgText2Vid-5B | ✓ | ✓ | 7.00 | 13.00 | 30.53 | 33.33 | 20.97 | 6 |
| HunyuanVideo | | | | | | | | |
| Text2Vid | | ✓ | 8.00 | 7.00 | 25.51 | 18.56 | 14.77 | 2 |
| ImgText2Vid | ✓ | ✓ | 7.00 | 15.00 | 26.53 | 42.39 | 22.73 | 7 |
| Sora2 | | | | | | | | |
| Text2Vid | | ✓ | 16.00 | 9.00 | 100.00 | 97.89 | 55.72 | 12 |
| ImgText2Vid | ✓ | ✓ | 8.25 | 3.09 | 95.79 | 79.17 | 46.58 | 11 |
| + Audio | ✓ | ✓ | 8.25 | 3.09 | 97.89 | 88.66 | 49.47 (+2.89) | 11 |
| - Watermark | ✓ | ✓ | 8.25 | 6.19 | 25.00 | 24.74 | 16.55 | 4 |
| StepVideo | | | | | | | | |
| Text2Vid | | ✓ | 84.00 | 92.00 | 73.68 | 86.81 | 83.62 | 13 |
| Veo3.1-Fast | | | | | | | | |
| ImgText2Vid | ✓ | ✓ | 11.00 | 5.00 | 16.16 | 17.00 | 12.54 | 1 |
| + Audio | ✓ | ✓ | 11.00 | 5.00 | 19.20 | 25.00 | 15.05 (+2.51) | 1 |
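The Avg. (↓) column is the detection rate averaged over the four reviewers: for each reviewer, the share of a creator's clips flagged as fake, then the mean of those shares (the HunyuanVideo Text2Vid row, for instance, averages 8.00, 7.00, 25.51, and 18.56 to 14.77). A small sketch of that computation:

```python
# Table 2 style score: per-reviewer detection rate (share of a creator's clips
# flagged as fake), averaged over reviewers. Lower = better at fooling VLMs.
def fool_score(flags_by_reviewer: dict[str, list[bool]]) -> float:
    """flags_by_reviewer maps reviewer name -> per-clip 'judged fake' flags."""
    rates = [100.0 * sum(flags) / len(flags) for flags in flags_by_reviewer.values()]
    return sum(rates) / len(rates)

# e.g. the HunyuanVideo Text2Vid row: (8.00 + 7.00 + 25.51 + 18.56) / 4 = 14.77
```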

Experiment Results

Q1-1. Are current VLMs reliable at distinguishing real and generated videos?

Evaluation of 10 VLMs shows that proprietary models generally outperform open-source ones, with Gemini-2.5-Pro performing best and GPT-5 next. Overall performance remains limited: the top average score is only 68.44, far below the human score of 89.11. Most open-source models perform poorly, leaving substantial room for improvement.

Q1-2. Does adding audio improve VLM detection performance?

Yes. Adding audio increases detection accuracy by roughly 5 points for the Gemini models. This is mainly because currently generated audio often misaligns with the video; for example, Sora2 produces human voices rather than continuous ASMR sounds, making the inconsistency easier to detect.
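One straightforward way to run this with/without-audio comparison is to query each reviewer twice on the same clip, once with the audio track stripped. The sketch below uses ffmpeg for the stripping step; the reviewer call itself is left abstract, and the file names are illustrative.

```python
# Produce the audio-free counterpart of a clip so the same reviewer can be
# queried with and without sound. Requires the ffmpeg binary on PATH.
import subprocess
from pathlib import Path

def strip_audio(src: Path, dst: Path) -> None:
    """Copy the video stream as-is (-c:v copy) and drop the audio track (-an)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-c:v", "copy", "-an", str(dst)],
        check=True,
    )

# strip_audio(Path("clip.mp4"), Path("clip_noaudio.mp4"))
# Querying the reviewer on both files isolates how much the audio cue contributes.
```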

Q2-1. Can current video generation models successfully fool VLMs?

Yes. Generation models are highly effective at misleading VLMs. Veo3.1-Fast performs best, with only 12.54% of its videos detected as fake. Some open-source models (e.g., HunyuanVideo I2V, Wan2.2-A14B) even outperform the closed-source Sora2, showing that the performance gap between proprietary and open-source models is narrowing.

Q2-2. What factors affect the realism of generated videos?

Adding audio increases the likelihood of videos being detected as fake (e.g., Veo3.1-Fast rises from 12.54% to 15.05%) because of audio-visual mismatches. The generation type (Image2Video vs. TextImage2Video) has limited effect, but model scale matters significantly: larger models generally produce more realistic videos, e.g., Wan2.2-A14B outperforms its 5B counterpart.

Real or Fake: Can You Tell the Difference?

Scene 1: Cloth Texture & Physics. Clips: Real, Sora-2, Veo-3.1-fast
Scene 2: Intricate Carving. Clips: Real, Sora-2, Veo-3.1-fast
Scene 3: Material Deformation (Wood). Clips: Real, Sora-2, Veo-3.1-fast
Scene 4: Fine Interactions (Tomato Slicing). Clips: Real, Sora-2, Open Sora, Kling, Wan-A14B

BibTeX

@misc{wang2025videorealitytest,
      title={Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?}, 
      author={Jiaqi Wang and Weijia Wu and Yi Zhan and Rui Zhao and Ming Hu and James Cheng and Wei Liu and Philip Torr and Kevin Qinghong Lin},
      year={2025},
      eprint={2512.13281},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.13281}, 
}