Generative Video / Image Quality Assessment

Subjective Video Quality Scoring at 98% Agreement for a Generative Video Model Team

A cross-continental rater pool across the USA, UK, India, and Bangladesh scored 16,000 model-generated videos on noise, sharpness, exposure, color, and overall quality. Pairwise preferences between A/B variants, together with free-text reasoning, fed reward modeling. The subjective brief made the 98% target the hard part.

Video Quality · IQA / VQA · Generative Video · Pairwise Preference

Client

Generative video model team

Volume

16,000 videos, ≈80,000 A/B pairs, ≈640,000 metric judgments

Duration

12 weeks

Team

60 trained raters across USA, UK, India, and Bangladesh, 8 senior reviewers, 2 calibration leads

Languages

English

The challenge

The client was training and evaluating a generative video model. Quality was not a single objective metric. It was five subjective metrics layered together: noise, sharpness, exposure, color, and an overall quality score. Each metric needed per-video ratings on a 1 to 5 scale plus pairwise preferences between A/B model variants for the same prompt.

Subjective ratings at scale are hard to standardize. Sharpness and noise can correlate, because denoising softens detail. Exposure judgments depend on display calibration. Color perception drifts by region and culture. Overall quality is informed by the four sub-metrics but does not always equal their average. The client wanted clean reward modeling signal, which meant the raters had to behave like one coherent system, not 60 independent judges.

Hitting 98% agreement on a subjective task is unusual. Most subjective video quality work in the literature reports kappa in the 0.6 to 0.8 range. The client needed reproducible scoring that the model could learn from without picking up rater-specific bias. The brief was demanding, and it required a different annotation design from a standard image labeling project.
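Percent agreement and kappa sit on different scales, so the two figures above should not be read as directly comparable. The sketch below is only an illustration of that difference for two raters, using made-up scores; it does not describe how the 98% figure on this project was computed, and the function names are hypothetical.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two raters gave the same score."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty score lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa for two raters: observed agreement corrected for chance."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label
    # if each drew independently from their own label distribution.
    p_expected = sum(counts_a[k] * counts_b[k] for k in set(a) | set(b)) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical 1 to 5 scores on ten videos (not project data).
rater_a = [4, 4, 4, 4, 4, 4, 4, 4, 3, 5]
rater_b = [4, 4, 4, 4, 4, 4, 4, 4, 4, 4]

print(percent_agreement(rater_a, rater_b))
print(cohens_kappa(rater_a, rater_b))
```

On these made-up scores the raters match on 8 of 10 videos, yet kappa is 0, because one rater always answers 4 and the matches are no better than chance. This is why a raw agreement figure and a kappa range need to be compared with care.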

Our approach

Five-metric rating schema

Every video was rated on the client's five quality dimensions, each on a 1 to 5 scale with anchored definitions written into the guideline. The overall score followed published anchor language so it stayed comparable across raters and across regions.

  • Noise: low-level artifacts, grain, compression blocks, temporal flicker
  • Sharpness: edge clarity, focus, detail preservation against denoising softness
  • Exposure: brightness range, clipping, dark crush, scene-appropriate dynamic range
  • Color: saturation, white balance, hue accuracy against scene context
  • Overall: 5 excellent, 4 minor imperfections, 3 visible but usable, 2 multiple clear issues, 1 severely degraded

Pairwise A/B preference per metric

Beyond single-video ratings, raters compared A and B variants from the same generative prompt and chose a preferred variant per metric. This pairwise mode is what the client's reward model consumed. The methodology drew on RLHF-style preference annotation as used in InstructGPT (Ouyang et al., 2022) and recent generative-video preference work in VBench (Huang et al., 2024).
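As a rough sketch of how a per-metric pairwise judgment can feed a reward model, the example below uses the pairwise ranking loss described in Ouyang et al. (2022), which penalizes the model when the preferred variant does not receive the higher reward. The record fields and function names are hypothetical, and the client's actual reward model is not described here, so treat this as one plausible shape rather than the pipeline that was used.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class PairJudgment:
    """One rater decision on one metric (field names are illustrative)."""
    prompt_id: str
    metric: str                      # e.g. "noise", "sharpness", "overall"
    preferred: str                   # "A" or "B"
    reasoning: Optional[str] = None  # optional free-text note

def preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_preferred - r_rejected)), in a numerically stable form."""
    margin = reward_preferred - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def judgment_loss(judgment: PairJudgment, reward_a: float, reward_b: float) -> float:
    """Loss for one judgment given the model's current rewards for A and B."""
    if judgment.preferred == "A":
        return preference_loss(reward_a, reward_b)
    if judgment.preferred == "B":
        return preference_loss(reward_b, reward_a)
    raise ValueError("preferred must be 'A' or 'B'")

# Hypothetical judgment: the rater preferred variant B on sharpness.
j = PairJudgment(prompt_id="p-001", metric="sharpness", preferred="B")
print(judgment_loss(j, reward_a=0.2, reward_b=1.4))  # small: model agrees with rater
print(judgment_loss(j, reward_a=1.4, reward_b=0.2))  # larger: model disagrees
```

The loss depends only on the reward difference between the two variants, which is one reason pairwise data sidesteps the question of whether two raters mean the same thing by a 3 or a 4.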

Reasoning capture was offered as an optional free-text field. About 40% of pairs came back with reasoning. The optional structure was intentional. Required reasoning produces shallow text that adds noise. Optional reasoning gets used by raters who genuinely have something to say, and those notes turned out to be the most valuable downstream signal.

Cross-continental rater pool

The 60-rater pool was deliberately distributed across the USA, UK, India, and Bangladesh. Color perception and viewing-condition norms vary by region, and a single-country rater pool would have produced a model that worked well for that region and drifted elsewhere. Distributing raters across four countries forced the consensus to cover the diversity the client's production traffic would see.

Each region had its own calibration sub-lead who knew the local conventions. Disagreements that traced to regional perception (saturated reds preferred in one market, more muted in another) were surfaced explicitly rather than averaged away, and the client used those signals to inform downstream model behavior.
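One simple way to surface regional disagreement instead of averaging it away is to compare per-region mean scores for each video and metric and flag the items where the spread is large. The sketch below assumes a flat list of rating tuples and a one-step gap threshold; both the data layout and the threshold are illustrative assumptions, not the project's actual tooling.

```python
from collections import defaultdict
from statistics import mean

def regional_gaps(ratings, min_gap=1.0):
    """Flag (video, metric) items whose per-region mean scores differ widely.

    ratings: iterable of (video_id, metric, region, score) tuples.
    Returns a list of ((video_id, metric), region_means, gap) for items
    where the highest and lowest regional means differ by at least min_gap.
    """
    by_item = defaultdict(lambda: defaultdict(list))
    for video_id, metric, region, score in ratings:
        by_item[(video_id, metric)][region].append(score)

    flagged = []
    for item, scores_by_region in by_item.items():
        region_means = {r: mean(s) for r, s in scores_by_region.items()}
        gap = max(region_means.values()) - min(region_means.values())
        if gap >= min_gap:
            flagged.append((item, region_means, gap))
    return flagged

# Hypothetical ratings: two regions split on color, agree on sharpness.
sample = [
    ("v1", "color", "India", 5), ("v1", "color", "India", 5),
    ("v1", "color", "UK", 3),
    ("v1", "sharpness", "India", 4), ("v1", "sharpness", "UK", 4),
]
for item, region_means, gap in regional_gaps(sample):
    print(item, region_means, gap)
```

Flagged items would then go to the regional sub-leads for review rather than being silently merged into a single mean.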

Calibration anchors and pilot

The pilot week ran 400 anchor videos through every rater. Anchors had pre-determined target ratings authored by the client's image quality team. Raters whose scores drifted more than one step from anchors were retrained before joining production. Anchors were re-sampled every week during production to catch drift over time.

  • Per-metric calibration thresholds with retraining trigger at >1 step drift
  • Weekly anchor sampling on 5% of in-flight work
  • Display calibration check at workstation setup
  • Reference videos in the guideline for each 1 to 5 rating
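The retraining trigger above can be sketched as a per-metric comparison between a rater's scores and the anchor targets. The text only says "more than one step", so the choice to aggregate with a mean absolute deviation per metric is an assumption, as are the dictionary layout and function name.

```python
from collections import defaultdict

DRIFT_THRESHOLD = 1.0  # "more than one step"; how drift is aggregated is assumed

def metrics_needing_retraining(rater_scores, anchor_targets, threshold=DRIFT_THRESHOLD):
    """Return {metric: mean absolute deviation} for metrics over the threshold.

    rater_scores and anchor_targets map (video_id, metric) -> score on a 1 to 5 scale.
    Scores for videos that are not anchors are ignored.
    """
    deviations = defaultdict(list)
    for (video_id, metric), score in rater_scores.items():
        target = anchor_targets.get((video_id, metric))
        if target is None:
            continue  # not an anchor item
        deviations[metric].append(abs(score - target))

    drifted = {}
    for metric, devs in deviations.items():
        mean_dev = sum(devs) / len(devs)
        if mean_dev > threshold:  # strictly more than one step
            drifted[metric] = mean_dev
    return drifted

# Hypothetical anchors and one rater's scores (not project data).
anchors = {("v1", "color"): 4, ("v2", "color"): 2, ("v1", "sharpness"): 4}
scores = {("v1", "color"): 2, ("v2", "color"): 4, ("v1", "sharpness"): 3}
print(metrics_needing_retraining(scores, anchors))
```

Here the rater is two steps off on color and exactly one step off on sharpness, so only color is flagged. Running the same check on the weekly anchor sample would catch drift during production.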

How we reached 98% agreement

Subjective scoring at 98% inter-rater agreement does not come from any single fix. It came from stacking the right design choices.

  • Anchored definitions per metric instead of free-text descriptions
  • Pairwise preferences as the primary annotation mode, with single-video ratings as a secondary signal that pairwise verifies against
  • Multi-region calibration to surface and handle regional bias instead of averaging it away
  • Senior reviewer adjudication on any pair where regional sub-leads disagreed
  • Iterative guideline revision driven by the optional reasoning corpus, which exposed where the schema was unclear
  • Hard rejection of raters who failed weekly anchor re-checks
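The cross-check between pairwise preferences and single-video ratings can be illustrated with a small consistency test: a pairwise choice contradicts the ratings when the same rater scored the other variant strictly higher on that metric. This is a minimal sketch under that assumed rule; the tuple layout and function names are hypothetical.

```python
def contradicts(preferred, score_a, score_b):
    """True if a pairwise choice disagrees with the rater's own 1 to 5 scores.

    Equal scores cannot contradict either choice, since the pairwise
    judgment may be resolving a difference finer than one scale step.
    """
    if score_a == score_b:
        return False
    favored_by_scores = "A" if score_a > score_b else "B"
    return preferred != favored_by_scores

def contradiction_rate(records):
    """Share of (preferred, score_a, score_b) records that contradict."""
    records = list(records)
    if not records:
        return 0.0
    return sum(contradicts(*r) for r in records) / len(records)

# Hypothetical records for one rater on one metric.
records = [
    ("A", 4, 2),  # consistent
    ("A", 2, 4),  # contradiction: chose A but scored B higher
    ("B", 3, 3),  # tie in scores, not a contradiction
    ("B", 1, 5),  # consistent
]
print(contradiction_rate(records))
```

A rising contradiction rate for a rater or a metric would be a cue to revisit the anchored definitions or send the items to senior review.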

Results

98%

Inter-rater agreement

16,000

Videos rated

≈80,000

Pairwise judgments

USA · UK · India · Bangladesh

Rater countries

What made it work

  1. Anchored rating definitions matter more than the scale itself. A 1 to 5 scale with vague descriptors produces low agreement. The same scale with reference videos and explicit per-point criteria converges fast.

  2. Multi-region distribution surfaces bias instead of hiding it. Averaging single-region ratings gives the illusion of agreement and produces a model that fails outside that region. Cross-region consensus forces the schema to handle perceptual variance.

  3. Optional reasoning beats required reasoning on subjective tasks. Required reasoning produces filler. Optional reasoning gets used where it matters and becomes useful training signal in its own right.

  4. Pairwise preference is more stable than single-video rating for reward modeling. The client used single ratings for evaluation dashboards and pairwise preferences for training, which matched the literature on RLHF preference data quality.

References

Published research that informed the labeling schema and workflow.

  1. Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS 2022.
  2. Huang, Z. et al. (2024). VBench: Comprehensive Benchmark Suite for Video Generative Models. CVPR 2024.
  3. Zhang, R. et al. (2018). The Unreasonable Effectiveness of Deep Features as a Perceptual Metric (LPIPS). CVPR 2018.
  4. Mittal, A., Moorthy, A. K., and Bovik, A. C. (2012). No-Reference Image Quality Assessment in the Spatial Domain (BRISQUE). IEEE Transactions on Image Processing.
  5. Wang, J. et al. (2023). Exploring CLIP for Assessing the Look and Feel of Images (CLIP-IQA). AAAI 2023.
