BLEU evaluation metric
Last updated: May 08, 2025
The BLEU (Bilingual Evaluation Understudy) metric compares generated sentences, such as machine translations, against sentences from one or more reference translations to measure how similar the predictions are to the reference texts.
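As an illustration, the following is a minimal sketch of computing a BLEU score with the Hugging Face `evaluate` library. The documentation does not state which scoring backend this metric uses, so treat the library choice as an assumption.

```python
# Minimal sketch, assuming the Hugging Face `evaluate` library as the backend.
import evaluate

bleu = evaluate.load("bleu")

predictions = ["the cat sat on the mat"]   # model output
references = [["the cat is on the mat"]]   # one or more references per prediction

results = bleu.compute(predictions=predictions, references=references)
print(results["bleu"])  # similarity score in the 0.0-1.0 range
```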
Metric details
BLEU is a generative AI quality evaluation metric that measures how closely the output of a generative AI asset matches reference text for a given task.
Scope
The BLEU metric evaluates generative AI assets only.
- Types of AI assets: Prompt templates
- Generative AI tasks:
- Text summarization
- Content generation
- Question answering
- Retrieval augmented generation (RAG)
- Supported languages: English
Scores and values
The BLEU score indicates how similar the prediction is to the reference translations. Higher scores indicate greater similarity between the reference texts and the predictions.
- Range of values: 0.0-1.0
- Best possible score: 1.0
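For context, the standard BLEU definition (Papineni et al., 2002) combines modified n-gram precisions $p_n$ with a brevity penalty, which is why the score is bounded between 0.0 and 1.0:

$$
\text{BLEU} = \text{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\text{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
$$

where $c$ is the candidate length, $r$ is the reference length, and the weights $w_n$ are typically uniform ($w_n = 1/N$).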
Settings
- Thresholds:
- Lower limit: 0.8
- Upper limit: 1.0
- Parameters:
- Max order: Maximum n-gram order to use when computing the BLEU score
- Smooth: Whether to apply a smoothing function to remove noise from the data (see the sketch after this list)
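As a sketch of how these settings map onto a BLEU computation, the `max_order` and `smooth` parameter names below mirror the Hugging Face `evaluate` BLEU implementation; whether this metric uses that exact backend is an assumption.

```python
# Sketch only: assumes the Hugging Face `evaluate` BLEU backend,
# whose compute() accepts `max_order` and `smooth` parameters.
import evaluate

bleu = evaluate.load("bleu")

results = bleu.compute(
    predictions=["the model answered the question"],
    references=[["the model answered the question correctly"]],
    max_order=2,   # use only unigrams and bigrams
    smooth=True,   # smooth n-gram precisions so higher orders are not zeroed out
)
print(results["bleu"])
```

Lowering the maximum n-gram order relaxes the requirement for long exact matches, while smoothing keeps short predictions from collapsing to a score of 0.0 when a higher-order n-gram has no match.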
Parent topic: Evaluation metrics