BLEU evaluation metric
Last updated: May 08, 2025
The BLEU (Bilingual Evaluation Understudy) metric compares generated sentences, such as machine translations, against sentences from one or more reference translations to measure how similar the predictions are to the reference texts.
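As an illustration, the following is a minimal sketch of computing a BLEU score with the Hugging Face `evaluate` library. The documentation does not state which scoring backend this metric uses, so treat the library choice as an assumption.

```python
# Minimal sketch, assuming the Hugging Face `evaluate` library as the backend.
import evaluate

bleu = evaluate.load("bleu")

predictions = ["the cat sat on the mat"]   # model output
references = [["the cat is on the mat"]]   # one or more references per prediction

results = bleu.compute(predictions=predictions, references=references)
print(results["bleu"])  # similarity score in the 0.0-1.0 range
```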
Metric details
BLEU is a generative AI quality evaluation metric that measures how closely the output of a generative AI asset matches reference text for a given task.
Scope
The BLEU metric evaluates generative AI assets only.
- Types of AI assets: Prompt templates
- Generative AI tasks:
- Text summarization
- Content generation
- Question answering
- Retrieval augmented generation (RAG)
- Supported languages: English
Scores and values
The BLEU score indicates how similar the prediction is to the reference translations. Higher scores indicate greater similarity between the reference texts and the predictions.
- Range of values: 0.0-1.0
- Best possible score: 1.0
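For context, the standard BLEU definition (Papineni et al., 2002) combines modified n-gram precisions $p_n$ with a brevity penalty, which is why the score is bounded between 0.0 and 1.0:

$$
\text{BLEU} = \text{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\text{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
$$

where $c$ is the candidate length, $r$ is the reference length, and the weights $w_n$ are typically uniform ($w_n = 1/N$).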
Settings
- Thresholds:
- Lower limit: 0.8
- Upper limit: 1.0
- Parameters:
- Max order: Maximum n-gram order to use when computing the BLEU score
- Smooth: Whether to apply a smoothing function to remove noise from the data (see the sketch after this list)
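As a sketch of how these settings map onto a BLEU computation, the `max_order` and `smooth` parameter names below mirror the Hugging Face `evaluate` BLEU implementation; whether this metric uses that exact backend is an assumption.

```python
# Sketch only: assumes the Hugging Face `evaluate` BLEU backend,
# whose compute() accepts `max_order` and `smooth` parameters.
import evaluate

bleu = evaluate.load("bleu")

results = bleu.compute(
    predictions=["the model answered the question"],
    references=[["the model answered the question correctly"]],
    max_order=2,   # use only unigrams and bigrams
    smooth=True,   # smooth n-gram precisions so higher orders are not zeroed out
)
print(results["bleu"])
```

Lowering the maximum n-gram order relaxes the requirement for long exact matches, while smoothing keeps short predictions from collapsing to a score of 0.0 when a higher-order n-gram has no match.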
Parent topic: Evaluation metrics