A small pipeline for reproducing BOUQuET machine-translation benchmark numbers and adding new models to the leaderboard.
It is intentionally minimal:
- One dataset (`facebook/bouquet`, `benchmark_sentence_level`).
- One inference path (single-turn chat) with two backends: vLLM (offline, in-process) or any OpenAI-compatible HTTP endpoint.
- Three metrics: ChrF++, MetricX-24, GlotLID.
- No orchestration: the CLI never spawns processes itself. Multi-GPU / multi-node parallelism is user-driven via `--shard I/N` plus a shell loop (or Slurm `srun`, k8s indexed jobs, etc.); see Run all directions on an 8-GPU node below.
The pipeline is intended to be hackable to allow evaluating custom translation systems or prompt functions (see the Custom translators section).
Requires Python ≥ 3.10.
With pip:
git clone https://github.com/facebookresearch/bouquet-eval
cd bouquet-eval
python3 -m venv .venv
source .venv/bin/activate
python3 -m ensurepip --upgrade
python3 -m pip install -e . # core deps (CPU-friendly, supports OpenAI backend + scoring)
python3 -m pip install -e ".[vllm]" # add vLLM for local inference (requires a GPU); quoted so zsh does not glob the brackets

With uv (faster, recommended):
git clone https://github.com/facebookresearch/bouquet-eval
cd bouquet-eval
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
make install # core + dev deps (CPU-friendly, supports OpenAI backend + scoring)
make install-all # add vLLM for local inference (requires a GPU)

A CUDA GPU is required for vLLM inference and for MetricX scoring; ChrF++ and GlotLID run fine on CPU.
bouquet-eval translate \
--backend vllm \
--model meta-llama/Llama-3.1-8B-Instruct \
--src-lang eng_Latn --tgt-lang fra_Latn \
--output-dir runs/llama-3.1-8b/

export OPENAI_API_KEY=sk-...
bouquet-eval translate \
--backend openai \
--model gpt-4o-mini \
--src-lang eng_Latn --tgt-lang fra_Latn \
--output-dir runs/gpt-4o-mini/

`--base-url` lets you point at any compatible server (a local `vllm serve`, OpenRouter, Together, etc.).
bouquet-eval score --output-dir runs/llama-3.1-8b/

This writes `runs/llama-3.1-8b/scores.csv` (one row per direction) and `runs/llama-3.1-8b/scored_rows.parquet` (per-sentence detail).
bouquet-eval run \
--backend openai --model gpt-4o-mini \
--src-lang eng_Latn --tgt-lang fra_Latn \
--output-dir runs/gpt-4o-mini/

bouquet-eval itself doesn't fork or pin GPUs; that's better handled by the shell, so the same recipe works on one node or many. The trick is `--shard I/N`: each process runs only directions `[I::N]`, and atomic per-direction outputs guarantee that concurrent processes never collide. For a model that fits on one GPU (≲ 30B at bf16), launch one process per GPU and aggregate at the end:
OUT=runs/llama-3.1-8b-full
mkdir -p $OUT
for i in 0 1 2 3 4 5 6 7; do
CUDA_VISIBLE_DEVICES=$i bouquet-eval run \
--backend vllm --model meta-llama/Llama-3.1-8B-Instruct \
--all --shard $i/8 \
--output-dir $OUT \
> $OUT/shard_$i.log 2>&1 &
done
wait
bouquet-eval aggregate --output-dir $OUT

The same pattern works on Slurm (`srun --ntasks-per-node=8` + `--shard $SLURM_PROCID/$SLURM_NTASKS`) and across multiple nodes (have each node take a disjoint range of shard ranks; aggregate after `srun` finishes). Each shard process does both translation and scoring for its 1/N of the directions, so MetricX loads once per shard and amortizes across all of that shard's rows.
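The `[I::N]` selection is ordinary strided slicing. The sketch below assumes the directions live in one deterministically ordered list; the `shard` helper and the toy direction list are illustrative and not part of bouquet-eval. It shows that the N shards are disjoint and together cover every direction.

```python
def shard(directions, i, n):
    """Return the directions that shard i of n would handle (strided slice)."""
    if not 0 <= i < n:
        raise ValueError(f"shard index {i} out of range for {n} shards")
    return directions[i::n]


# Toy stand-in for the real direction list (hypothetical subset).
directions = [
    ("eng_Latn", "fra_Latn"),
    ("eng_Latn", "deu_Latn"),
    ("fra_Latn", "eng_Latn"),
    ("deu_Latn", "eng_Latn"),
    ("eng_Latn", "spa_Latn"),
]

shards = [shard(directions, i, 3) for i in range(3)]
for i, part in enumerate(shards):
    print(f"shard {i}/3 -> {part}")
```

Because every direction lands in exactly one slice, changing N between runs only changes which process picks up a direction, not whether it gets processed.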
Each shard process loads one full copy of the translation model. Choose the configuration by what fits on a single GPU at your inference dtype:
| Model size (bf16) | Recommended config (8-GPU node) |
|---|---|
| ≤ ~30B | 8 shards, --tensor-parallel-size 1 |
| ~30–70B | 4 shards, --tensor-parallel-size 2 |
| 70B+ or very long context | 2 shards, --tensor-parallel-size 4 |
| >150B / 405B | 1 shard, --tensor-parallel-size 8 |
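As a rough illustration, the table can be encoded as a lookup. The cutoffs below are approximations read off the table (the 150B boundary in particular is an assumption), and `pick_config` is a hypothetical helper, not part of the CLI.

```python
GPUS_PER_NODE = 8  # the table above assumes an 8-GPU node


def pick_config(params_b: float) -> tuple[int, int]:
    """Return (num_shards, tensor_parallel_size) for a bf16 model.

    params_b is the parameter count in billions; cutoffs are approximate.
    """
    if params_b <= 30:
        tp = 1
    elif params_b <= 70:
        tp = 2
    elif params_b <= 150:
        tp = 4
    else:
        tp = 8
    # Every GPU is used: shards * tensor-parallel size == GPUs on the node.
    return GPUS_PER_NODE // tp, tp


for size in (8, 70, 405):
    shards, tp = pick_config(size)
    print(f"{size}B -> {shards} shards, --tensor-parallel-size {tp}")
```

Long contexts or a non-bf16 dtype shift these boundaries, so treat the result as a starting point.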
Translation, scoring, and aggregate are all independently resumable at the direction level: if a shard dies midway you re-run it (or even change N) and only the missing directions get reprocessed. The dataset has ~1062 directions × 854 sentences; each direction is written as its own JSONL (translation) and as its own <src>-<tgt>.scored.parquet (scoring).
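Direction-level resumability follows from two ideas: write each direction's file atomically, and treat a direction as done only if its file exists. The sketch below illustrates that pattern under stated assumptions; `write_jsonl_atomic` and `missing_directions` are hypothetical helpers, and the package's actual `io.py` may differ.

```python
import json
import os
import tempfile
from pathlib import Path


def write_jsonl_atomic(path: Path, rows) -> None:
    """Write rows to a temp file in the same directory, then rename into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
        os.replace(tmp_name, path)  # atomic when source and target share a filesystem
    except BaseException:
        # Never leave a half-written temp file behind.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def missing_directions(output_dir, directions):
    """List directions whose <src>-<tgt>.jsonl does not exist yet."""
    out = Path(output_dir)
    return [(s, t) for s, t in directions if not (out / f"{s}-{t}.jsonl").exists()]


out_dir = Path(tempfile.mkdtemp())
wanted = [("eng_Latn", "fra_Latn"), ("eng_Latn", "deu_Latn")]
write_jsonl_atomic(out_dir / "eng_Latn-fra_Latn.jsonl", [{"mt_text": "Bonjour"}])
todo = missing_directions(out_dir, wanted)
print(todo)
```

Since a file only appears under its final name once it is complete, a crashed shard leaves nothing that a later run could mistake for finished work.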
bouquet-eval translate --output-dir DIR
[--all | --src-lang ... | --tgt-lang ... | --directions ...] [--shard I/N]
--backend {vllm,openai} --model NAME [--system-prompt ...] [--limit N]
[--tensor-parallel-size N] ...
bouquet-eval score --output-dir DIR [--shard I/N]
[--all | --src-lang ... | --tgt-lang ... | --directions ...]
[--metrics chrf,metricx,glotlid] [--metricx-model ...]
bouquet-eval aggregate --output-dir DIR
bouquet-eval run (translate args) + (score args) [--shard I/N]
Direction selection (any combination):
- `--all`: every direction in the dataset.
- `--src-lang eng_Latn --tgt-lang fra_Latn`: filter by source/target (each repeatable).
- `--directions eng_Latn-fra_Latn,eng_Latn-deu_Latn`: explicit list.
Useful flags:
- `--limit N`: take only the first N sentences per direction (handy for smoke tests).
- `--system-prompt "..."`: prepend a system message; default is empty.
- `--temperature 0 --seed 0`: greedy by default, for reproducibility.
- Sharding: `--shard I/N` makes a single process handle only directions `[I::N]`. Multi-GPU / multi-node parallelism is done by spawning N shard processes from the shell (see Run all directions on an 8-GPU node above) and calling `bouquet-eval aggregate` once they all finish.
- vLLM: `--tensor-parallel-size`, `--dtype`, `--max-model-len`, `--gpu-memory-utilization`, `--deterministic` (turns off some vLLM optimisations for tighter run-to-run reproducibility, at the cost of throughput).
- OpenAI: `--base-url`, `--api-key-env`, `--concurrency` (parallel HTTP requests).
- MetricX: `--metricx-model`, `--metricx-device`, `--metricx-qe`.
runs/<your-model>/
├── eng_Latn-fra_Latn.jsonl # translation, one row per sentence
├── eng_Latn-fra_Latn.scored.parquet # scoring, one row per sentence
├── eng_Latn-deu_Latn.jsonl
├── eng_Latn-deu_Latn.scored.parquet
├── ...
├── scores.csv # produced by `aggregate`: per-direction means
└── scored_rows.parquet # produced by `aggregate`: concatenated per-row
Each translation JSONL row contains: src_lang, tgt_lang, src_text, ref_text, mt_text. Each <src>-<tgt>.scored.parquet produced by score carries those columns plus whichever of chrfpp, metricx, glotlid_score, glotlid_pred were requested. aggregate concatenates them into scored_rows.parquet and emits per-direction means as scores.csv. The score command auto-runs aggregate for you when --shard is absent.
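The per-direction means in `scores.csv` amount to a group-by over the per-sentence rows. A minimal pure-Python sketch of that computation follows; the `per_direction_means` helper and the toy scores are illustrative, not the package's implementation.

```python
from collections import defaultdict
from statistics import mean


def per_direction_means(rows, metric):
    """Average one metric column per (src_lang, tgt_lang) pair."""
    groups = defaultdict(list)
    for row in rows:
        value = row.get(metric)
        if value is not None:  # skip rows where the metric was not requested
            groups[(row["src_lang"], row["tgt_lang"])].append(value)
    return {direction: mean(values) for direction, values in groups.items()}


# Toy per-sentence rows with made-up ChrF++ values.
rows = [
    {"src_lang": "eng_Latn", "tgt_lang": "fra_Latn", "chrfpp": 60.0},
    {"src_lang": "eng_Latn", "tgt_lang": "fra_Latn", "chrfpp": 40.0},
    {"src_lang": "eng_Latn", "tgt_lang": "deu_Latn", "chrfpp": 55.0},
]

means = per_direction_means(rows, "chrfpp")
for (src, tgt), value in means.items():
    print(f"{src}-{tgt}: {value:.1f}")
```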
- Generate translations + scores. For a single GPU this is one command:
  `bouquet-eval run --backend vllm --model <repo> --all --output-dir runs/<name>/`
  For an 8-GPU node, use the shell-loop + shard pattern from Run all directions on an 8-GPU node above (one process per GPU, then `bouquet-eval aggregate --output-dir runs/<name>/`).
- Open a PR against the leaderboard repo with `runs/<name>/scores.csv` and `runs/<name>/manifest.json` attached and the exact CLI commands in the description.
- We might decide to rerun the published baseline on identical hardware to spot-check before merging.
The manifest.json written to every output dir records: bouquet_eval version, full argv, backend / model / translator spec / sampling params / system prompt, the verbatim prompt template, the dataset config, and the number of directions. Resumed runs (and individual shards) append a new entry rather than overwrite, so the manifest preserves the full history of what produced the outputs.
bouquet-eval ships with two translator backends but the runner is happy with any object that implements:
from typing import Protocol

class Translator(Protocol):
    def translate(self, prompts: list[str]) -> list[str]: ...

(`bouquet_eval.Translator` is a runtime-checkable Protocol; no subclassing required.)
from bouquet_eval import run_translation
from bouquet_eval.data import select_directions

class MyTranslator:
    def translate(self, prompts):
        ...  # call your model however you like

run_translation(
    translator=MyTranslator(),
    directions=select_directions(src_filter=["eng_Latn"]),
    output_dir="runs/mine/",
)

This gives you the same atomic-write / resume / per-direction JSONL output as the CLI. You can then run `bouquet-eval score --output-dir runs/mine/` exactly as you would for a built-in backend.
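To make the structural typing concrete, here is a self-contained sketch. The `Translator` class below is a local stand-in that mirrors the protocol shown above so the example runs without bouquet-eval installed, and `UppercaseTranslator` is a toy that performs no real translation.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Translator(Protocol):  # local stand-in for bouquet_eval.Translator
    def translate(self, prompts: list[str]) -> list[str]: ...


class UppercaseTranslator:
    """Toy 'model': returns one output per prompt, in the same order."""

    def translate(self, prompts: list[str]) -> list[str]:
        return [p.upper() for p in prompts]


translator = UppercaseTranslator()
# No inheritance involved: the isinstance check only looks for a `translate` method.
print(isinstance(translator, Translator))
outputs = translator.translate(["hello", "world"])
print(outputs)
```

Note that a runtime-checkable protocol verifies only that the method exists, not its signature or return type, so returning exactly one string per prompt remains the implementer's responsibility.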
bouquet-eval translate \
--translator mypackage.translators:MyTranslator \
--translator-kwargs '{"model_path": "/path", "thinking": true}' \
--all --output-dir runs/mine/

`--translator pkg.module:Name` imports the named callable (class or factory) and calls it with the JSON dict from `--translator-kwargs`. The returned object must have a `.translate(prompts)` method. When `--translator` is given, `--backend` / `--model` are ignored.
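The `pkg.module:Name` convention can be resolved with `importlib`. The sketch below shows one plausible way to do it and is not the CLI's actual code: `resolve`, `build_translator`, and the in-memory `demo_translators` module are all hypothetical, and passing the JSON dict as keyword arguments is an assumption.

```python
import importlib
import json
import sys
import types


def resolve(spec: str):
    """Resolve 'pkg.module:Name' to the named attribute."""
    module_name, sep, attr = spec.partition(":")
    if not (sep and module_name and attr):
        raise ValueError(f"expected 'pkg.module:Name', got {spec!r}")
    return getattr(importlib.import_module(module_name), attr)


def build_translator(spec: str, kwargs_json: str = "{}"):
    """Call the resolved class or factory with the JSON kwargs (assumed to be **kwargs)."""
    obj = resolve(spec)(**json.loads(kwargs_json))
    if not callable(getattr(obj, "translate", None)):
        raise TypeError(f"{spec} did not produce an object with .translate()")
    return obj


# Register a throwaway module so the example is self-contained.
class EchoTranslator:
    def __init__(self, suffix: str = ""):
        self.suffix = suffix

    def translate(self, prompts):
        return [p + self.suffix for p in prompts]


demo = types.ModuleType("demo_translators")
demo.EchoTranslator = EchoTranslator
sys.modules["demo_translators"] = demo

translator = build_translator("demo_translators:EchoTranslator", '{"suffix": "!"}')
print(translator.translate(["bonjour"]))
```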
Skip translation entirely and feed pre-existing translations to the scorer. Write one file per direction at runs/mine/<src>-<tgt>.jsonl with rows of the form
{"src_lang": "eng_Latn", "tgt_lang": "fra_Latn", "src_text": "...", "ref_text": "...", "mt_text": "..."}

then run `bouquet-eval score --output-dir runs/mine/`.
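A small helper can produce files in this layout from translations you already have. The sketch below is an assumption-laden illustration: `write_direction` is hypothetical, and it checks only that the five fields are present, not that `src_text` and `ref_text` match the benchmark.

```python
import json
import tempfile
from pathlib import Path

REQUIRED = ("src_lang", "tgt_lang", "src_text", "ref_text", "mt_text")


def write_direction(output_dir, rows) -> Path:
    """Write the rows of a single direction to <src>-<tgt>.jsonl and return the path."""
    rows = list(rows)
    if not rows:
        raise ValueError("no rows to write")
    for row in rows:
        missing = [key for key in REQUIRED if key not in row]
        if missing:
            raise ValueError(f"row is missing fields: {missing}")
    pairs = {(row["src_lang"], row["tgt_lang"]) for row in rows}
    if len(pairs) != 1:
        raise ValueError(f"expected one direction per file, got {sorted(pairs)}")
    ((src, tgt),) = pairs
    path = Path(output_dir) / f"{src}-{tgt}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path


out_dir = tempfile.mkdtemp()  # stands in for runs/mine/
path = write_direction(out_dir, [{
    "src_lang": "eng_Latn", "tgt_lang": "fra_Latn",
    "src_text": "Hello", "ref_text": "Bonjour", "mt_text": "Bonjour",
}])
print(path.name)
```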
Reproducibility note. If you customize the prompt or sampling, your scores are no longer directly comparable to leaderboard entries. Custom translators / prompt fns are responsible for documenting their differences in their `manifest.json` (the CLI records `--translator`, `--translator-kwargs`, `--prompt-fn`, `--prompt-kwargs` automatically).
For convenience, bouquet_eval.translators.hf_seq2seq ships three opt-in
adapters that wrap Hugging Face transformers seq2seq models behind the
Translator Protocol. They are not registered as --backend values - load
them via --translator like any other custom translator:
# NLLB-200 (sets tokenizer.src_lang + forced_bos_token_id)
bouquet-eval run \
--translator bouquet_eval.translators.hf_seq2seq:NLLBTranslator \
--translator-kwargs '{"model":"facebook/nllb-200-3.3B","num_beams":5}' \
--prompt-fn bouquet_eval.translators.hf_seq2seq:nllb_prompt \
--all --output-dir runs/nllb-200-3.3B/
# MADLAD-400 (prepends "<2xx>" target-language token)
bouquet-eval run \
--translator bouquet_eval.translators.hf_seq2seq:MadladTranslator \
--translator-kwargs '{"model":"google/madlad400-10b-mt"}' \
--prompt-fn bouquet_eval.translators.hf_seq2seq:madlad_prompt \
--all --output-dir runs/madlad400-10b-mt/
# Aya-101 (instruction-tuned mT5-XXL; uses the default prompt fn)
bouquet-eval run \
--translator bouquet_eval.translators.hf_seq2seq:AyaTranslator \
--translator-kwargs '{"model":"CohereForAI/aya-101"}' \
--all --output-dir runs/aya-101/

See the module docstring for details.
Anywhere the CLI builds a prompt it calls a single PromptFn:
from typing import Callable

PromptFn = Callable[[str, str, str], str] # (src_lang, tgt_lang, src_text) -> rendered_user_message

The default implementation (`bouquet_eval.prompt.build_prompt`) renders our zero-shot template. Any callable matching the signature can replace it.
bouquet-eval translate \
--backend vllm --model meta-llama/Llama-3.1-8B-Instruct \
--prompt-fn mypackage.prompts:make_rag_prompt \
--prompt-kwargs '{"index_path": "/data/my-index", "k": 4}' \
--src-lang eng_Latn --tgt-lang fra_Latn \
--output-dir runs/mine/

`--prompt-fn pkg.module:Name` resolves the named attribute. If `--prompt-kwargs` is given (a JSON object), the attribute is treated as a factory and called with those kwargs; the factory must return the actual PromptFn. If `--prompt-kwargs` is omitted, the attribute is used directly, which is convenient for stateless prompt functions.
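The factory-versus-direct rule can be summarised in a few lines. This is a sketch of the decision as described, not the CLI's code; `select_prompt_fn`, `plain_prompt`, and `make_prefixed_prompt` are hypothetical names.

```python
import json
from typing import Callable, Optional

PromptFn = Callable[[str, str, str], str]


def select_prompt_fn(attr, prompt_kwargs_json: Optional[str] = None) -> PromptFn:
    """Use attr directly, or call it as a factory when kwargs were supplied."""
    if prompt_kwargs_json is None:
        return attr  # stateless prompt function, used as-is
    kwargs = json.loads(prompt_kwargs_json)
    if not isinstance(kwargs, dict):
        raise ValueError("--prompt-kwargs must be a JSON object")
    return attr(**kwargs)  # the factory returns the actual PromptFn


def plain_prompt(src_lang: str, tgt_lang: str, src_text: str) -> str:
    return f"Translate from {src_lang} to {tgt_lang}: {src_text}"


def make_prefixed_prompt(prefix: str) -> PromptFn:
    def prompt_fn(src_lang: str, tgt_lang: str, src_text: str) -> str:
        return prefix + plain_prompt(src_lang, tgt_lang, src_text)

    return prompt_fn


direct = select_prompt_fn(plain_prompt)
built = select_prompt_fn(make_prefixed_prompt, '{"prefix": "[formal] "}')
print(direct("eng_Latn", "fra_Latn", "Hi"))
print(built("eng_Latn", "fra_Latn", "Hi"))
```

Under this reading, an empty JSON object still counts as "given", so a factory that takes no arguments would be called with no kwargs.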
# mypackage/prompts.py
from bouquet_eval import PromptFn

# load_index and render_with_examples are your own helpers
def make_rag_prompt(index_path: str, k: int = 4) -> PromptFn:
    index = load_index(index_path)  # heavy, runs once at startup

    def prompt_fn(src_lang, tgt_lang, src_text):
        examples = index.retrieve(src_text, k=k)
        return render_with_examples(src_lang, tgt_lang, src_text, examples)

    return prompt_fn

# mypackage/prompts.py
def few_shot_prompt(src_lang: str, tgt_lang: str, src_text: str) -> str:
    return f"""Examples:
EN: Hello -> FR: Bonjour
EN: Goodbye -> FR: Au revoir
{src_lang} -> {tgt_lang}:
{src_text}"""

bouquet-eval translate ... --prompt-fn mypackage.prompts:few_shot_prompt ...

from bouquet_eval import run_translation
run_translation(
    translator=MyTranslator(),
    directions=[("eng_Latn", "fra_Latn")],
    output_dir="runs/mine/",
    prompt_fn=make_rag_prompt("/data/my-index"),
)

The default `LANG_NAMES` table in `src/bouquet_eval/prompt.py` is intentionally minimal and is only consulted by the default `build_prompt`. A custom prompt fn is responsible for whatever language-name lookup it needs.
- ChrF++ (sacrebleu, `word_order=2`): per-sentence character n-gram F-score, augmented with word unigrams and bigrams. Range `[0, 100]`.
- MetricX-24-hybrid-xl (default `google/metricx-24-hybrid-xl-v2p6`): a learned regression metric. Lower is better; range `[0, 25]`. By default uses both source and reference; pass `--metricx-qe` for reference-free.
- GlotLID (`cis-lmu/glotlid`): probability that the FastText LID model assigns the expected target language to the hypothesis. A small remap table aligns BOUQuET codes (e.g. `cmn_Hans`) to GlotLID's label space (e.g. `cmn_Hani`).
- GlotLID × MetricX (`glotlid_x_metricx`): a composite that `aggregate` computes from the two metrics as `glotlid_score * (1 - sqrt(metricx / 25))`. The rescaling maps a perfect MetricX of 0 to 1 and the worst of 25 to 0 (while compressing the high-error tail), so the product lives in `[0, 1]` and rewards both correct target-language ID and high translation quality. Only emitted when both `glotlid_score` and `metricx` are present.
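The composite formula is easy to check by hand. The sketch below implements it as stated; clamping `metricx` to `[0, 25]` before the square root is an added assumption, included so out-of-range inputs cannot push the result outside `[0, 1]`.

```python
import math

METRICX_MAX = 25.0  # worst possible MetricX score


def glotlid_x_metricx(glotlid_score: float, metricx: float) -> float:
    """glotlid_score * (1 - sqrt(metricx / 25)), with metricx clamped (assumption)."""
    clamped = min(max(metricx, 0.0), METRICX_MAX)
    return glotlid_score * (1.0 - math.sqrt(clamped / METRICX_MAX))


# Toy inputs: (glotlid_score, metricx)
for lid, mx in [(1.0, 0.0), (1.0, 25.0), (0.5, 6.25)]:
    print(f"glotlid={lid}, metricx={mx} -> {glotlid_x_metricx(lid, mx):.3f}")
```

The square root makes the score drop fastest near MetricX 0, so small errors on otherwise good translations are penalised more sharply than additional errors on already poor ones.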
src/bouquet_eval/
├── cli.py # argparse entry point: translate / score / aggregate / run
├── data.py # HF dataset loader + direction enumeration
├── prompt.py # zero-shot translation template + lang-name table
├── translate.py # VllmChatTranslator + OpenAIChatTranslator
├── runner.py # translate / score / aggregate orchestration
├── io.py # atomic per-direction writes + resume
├── utils.py # shared helpers (token-budget batching)
├── metrics/
│ ├── chrf.py
│ ├── metricx.py
│ └── glotlid.py
└── translators/
└── hf_seq2seq.py # opt-in HF seq2seq adapters (NLLB / MADLAD / Aya)
tests/
Heavy imports (vllm, torch, transformers) are lazy so bouquet-eval --help works on a CPU-only login node.
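Lazy importing usually means moving the heavy `import` into the function that needs it, so that building the argument parser stays cheap. The sketch below shows the general pattern with a toy CLI and is not bouquet-eval's `cli.py`: `demo-eval` is a made-up program name, and `json` stands in for a heavy dependency such as `vllm`.

```python
import argparse


def cmd_translate(args):
    # Deferred import: the cost is paid only when this subcommand actually runs.
    import json as heavy_backend  # stand-in for `import vllm`

    return heavy_backend.dumps({"model": args.model})


def build_parser() -> argparse.ArgumentParser:
    """Build the parser without touching any heavy dependency."""
    parser = argparse.ArgumentParser(prog="demo-eval")
    sub = parser.add_subparsers(dest="command", required=True)
    translate = sub.add_parser("translate")
    translate.add_argument("--model", required=True)
    translate.set_defaults(func=cmd_translate)
    return parser


args = build_parser().parse_args(["translate", "--model", "toy"])
result = args.func(args)
print(result)
```

With this layout, `--help` exits during argument parsing, before any subcommand function (and therefore any heavy import) is reached.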
MIT. See LICENSE.
The MetricXScorer implementation is informed by
google-research/metricx (Apache-2.0).
The GlotLID model is from
cis-lmu/glotlid (Apache-2.0).
The BOUQuET dataset is described at
facebook/bouquet.