bouquet-eval

A small pipeline for reproducing BOUQuET machine-translation benchmark numbers and adding new models to the leaderboard.

It is intentionally minimal:

  • One dataset (facebook/bouquet, benchmark_sentence_level).
  • One inference path (single-turn chat) with two backends: vLLM (offline, in-process) or any OpenAI-compatible HTTP endpoint.
  • Three metrics: ChrF++, MetricX-24, GlotLID.
  • No orchestration: the CLI never spawns processes itself. Multi-GPU / multi-node parallelism is user-driven via --shard I/N plus a shell loop (or Slurm srun, k8s indexed jobs, etc.) — see Run all directions on an 8-GPU node below.

The pipeline is meant to be hackable, so you can evaluate custom translation systems or prompt functions (see the Custom translators and Custom prompt functions sections).


Install

Requires Python ≥ 3.10.

With pip:

git clone https://github.com/facebookresearch/bouquet-eval
cd bouquet-eval

python3 -m venv .venv
source .venv/bin/activate

python3 -m ensurepip --upgrade
python3 -m pip install -e .          # core deps (CPU-friendly, supports OpenAI backend + scoring)
python3 -m pip install -e ".[vllm]"  # add vLLM for local inference (requires a GPU); quoted so zsh does not glob the brackets

With uv (faster, recommended):

git clone https://github.com/facebookresearch/bouquet-eval
cd bouquet-eval

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

make install          # core + dev deps (CPU-friendly, supports OpenAI backend + scoring)
make install-all      # add vLLM for local inference (requires a GPU)

A CUDA GPU is required for vLLM inference and for MetricX scoring; ChrF++ and GlotLID run fine on CPU.


Quickstart

Translate one direction with a local vLLM model

bouquet-eval translate \
    --backend vllm \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --src-lang eng_Latn --tgt-lang fra_Latn \
    --output-dir runs/llama-3.1-8b/

Translate via an OpenAI-compatible endpoint

export OPENAI_API_KEY=sk-...
bouquet-eval translate \
    --backend openai \
    --model gpt-4o-mini \
    --src-lang eng_Latn --tgt-lang fra_Latn \
    --output-dir runs/gpt-4o-mini/

--base-url lets you point at any compatible server (a local vllm serve, OpenRouter, Together, etc.).

Score the translations

bouquet-eval score --output-dir runs/llama-3.1-8b/

This writes runs/llama-3.1-8b/scores.csv (one row per direction) and runs/llama-3.1-8b/scored_rows.parquet (per-sentence detail).

Run translate + score in one go

bouquet-eval run \
    --backend openai --model gpt-4o-mini \
    --src-lang eng_Latn --tgt-lang fra_Latn \
    --output-dir runs/gpt-4o-mini/

Run all directions on an 8-GPU node

bouquet-eval itself doesn't fork or pin GPUs — that's better handled by the shell, so the same recipe works on one node or many. The trick is --shard I/N: each process runs only directions [I::N], and atomic per-direction outputs guarantee that concurrent processes never collide. For a model that fits on one GPU (≲ 30B at bf16), launch one process per GPU and aggregate at the end:

OUT=runs/llama-3.1-8b-full
mkdir -p $OUT
for i in 0 1 2 3 4 5 6 7; do
  CUDA_VISIBLE_DEVICES=$i bouquet-eval run \
    --backend vllm --model meta-llama/Llama-3.1-8B-Instruct \
    --all --shard $i/8 \
    --output-dir $OUT \
    > $OUT/shard_$i.log 2>&1 &
done
wait
bouquet-eval aggregate --output-dir $OUT

The same pattern works on Slurm (srun --ntasks-per-node=8 + --shard $SLURM_PROCID/$SLURM_NTASKS) and across multiple nodes (have each node take a disjoint range of shard ranks; aggregate after srun finishes). Each shard process does both translation and scoring for its 1/N of the directions — MetricX loads once per shard and amortizes across all of that shard's rows.
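
The [I::N] notation is Python's stride slice. As a small illustration (the direction list is made up and shard_directions is a hypothetical helper, not part of bouquet-eval), the shards are disjoint and together cover every direction:

```python
def shard_directions(directions, index, count):
    """Return the directions handled by shard `index` out of `count` shards."""
    if not 0 <= index < count:
        raise ValueError(f"shard index {index} is out of range for {count} shards")
    return directions[index::count]


# Toy direction list; the real one is enumerated from the dataset.
targets = ["fra_Latn", "deu_Latn", "spa_Latn", "ita_Latn", "por_Latn"]
directions = [("eng_Latn", tgt) for tgt in targets]

shards = [shard_directions(directions, i, 2) for i in range(2)]

# Shard 0 takes positions 0, 2, 4 and shard 1 takes positions 1, 3.
for i, shard in enumerate(shards):
    print(i, [tgt for _, tgt in shard])

# No direction is lost or duplicated across shards.
assert sorted(shards[0] + shards[1]) == sorted(directions)
```

Because the slice depends only on the ordered direction list and on I/N, every process can compute its own share without talking to the others.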

Choosing TP vs sharding

Each shard process loads one full copy of the translation model. Pick by what fits on a single GPU at your inference dtype:

Model size (bf16)            Recommended config (8-GPU node)
≤ ~30B                       8 shards, --tensor-parallel-size 1
~30–70B                      4 shards, --tensor-parallel-size 2
70B+ or very long context    2 shards, --tensor-parallel-size 4
>150B / 405B                 1 shard,  --tensor-parallel-size 8

Translation, scoring, and aggregation are all independently resumable at the direction level: if a shard dies midway, re-run it (or even change N) and only the missing directions get reprocessed. The dataset has ~1062 directions × 854 sentences; each direction is written as its own JSONL (translation) and as its own <src>-<tgt>.scored.parquet (scoring).
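
Direction-level resume can be pictured as a check for which per-direction files already exist. The sketch below is an illustration under that assumption, not the actual bouquet-eval logic; missing_directions is a hypothetical helper, and it trusts that a file is only present once it is complete.

```python
import tempfile
from pathlib import Path


def missing_directions(output_dir, directions, suffix=".jsonl"):
    """Return the directions whose <src>-<tgt><suffix> file is not yet present."""
    out = Path(output_dir)
    return [
        (src, tgt)
        for src, tgt in directions
        if not (out / f"{src}-{tgt}{suffix}").exists()
    ]


wanted = [("eng_Latn", "fra_Latn"), ("eng_Latn", "deu_Latn")]

with tempfile.TemporaryDirectory() as tmp:
    # Pretend a previous run finished one direction before it died.
    (Path(tmp) / "eng_Latn-fra_Latn.jsonl").write_text("", encoding="utf-8")
    todo = missing_directions(tmp, wanted)

print(todo)  # only the direction without an output file remains
```

Passing suffix=".scored.parquet" would give the same view for the scoring stage.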

CLI reference

bouquet-eval translate --output-dir DIR
                       [--all | --src-lang ... | --tgt-lang ... | --directions ...] [--shard I/N]
                       --backend {vllm,openai} --model NAME [--system-prompt ...] [--limit N]
                       [--tensor-parallel-size N] ...
bouquet-eval score     --output-dir DIR [--shard I/N]
                       [--all | --src-lang ... | --tgt-lang ... | --directions ...]
                       [--metrics chrf,metricx,glotlid] [--metricx-model ...]
bouquet-eval aggregate --output-dir DIR
bouquet-eval run       (translate args) + (score args) [--shard I/N]

Direction selection (any combination):

  • --all: every direction in the dataset.
  • --src-lang eng_Latn --tgt-lang fra_Latn: filter by source/target (each repeatable).
  • --directions eng_Latn-fra_Latn,eng_Latn-deu_Latn: explicit list.

Useful flags:

  • --limit N: take only the first N sentences per direction (handy for smoke tests).
  • --system-prompt "...": prepend a system message; default is empty.
  • --temperature 0 --seed 0: greedy by default, for reproducibility.
  • Sharding: --shard I/N makes a single process handle only directions [I::N]. Multi-GPU / multi-node parallelism is done by spawning N shard processes from the shell (see Run all directions on an 8-GPU node above) and calling bouquet-eval aggregate once they all finish.
  • vLLM: --tensor-parallel-size, --dtype, --max-model-len, --gpu-memory-utilization, --deterministic (turns off some vLLM optimisations for tighter run-to-run reproducibility, at the cost of throughput).
  • OpenAI: --base-url, --api-key-env, --concurrency (parallel HTTP requests).
  • MetricX: --metricx-model, --metricx-device, --metricx-qe.

Output layout

runs/<your-model>/
├── eng_Latn-fra_Latn.jsonl                 # translation, one row per sentence
├── eng_Latn-fra_Latn.scored.parquet        # scoring,    one row per sentence
├── eng_Latn-deu_Latn.jsonl
├── eng_Latn-deu_Latn.scored.parquet
├── ...
├── scores.csv                              # produced by `aggregate`: per-direction means
└── scored_rows.parquet                     # produced by `aggregate`: concatenated per-row

Each translation JSONL row contains: src_lang, tgt_lang, src_text, ref_text, mt_text. Each <src>-<tgt>.scored.parquet produced by score carries those columns plus whichever of chrfpp, metricx, glotlid_score, glotlid_pred were requested. aggregate concatenates them into scored_rows.parquet and emits per-direction means as scores.csv. The score command auto-runs aggregate for you when --shard is absent.
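
The per-direction means in scores.csv amount to a group-by over the scored rows. A minimal standard-library sketch follows; the scores are invented and per_direction_means is a hypothetical helper, not the aggregate implementation.

```python
from collections import defaultdict
from statistics import fmean


def per_direction_means(rows, metric):
    """Average `metric` over rows, grouped by (src_lang, tgt_lang)."""
    grouped = defaultdict(list)
    for row in rows:
        value = row.get(metric)
        if value is not None:  # skip rows where the metric was not requested
            grouped[(row["src_lang"], row["tgt_lang"])].append(value)
    return {direction: fmean(values) for direction, values in grouped.items()}


# Invented per-sentence scores, shaped like rows of scored_rows.parquet.
rows = [
    {"src_lang": "eng_Latn", "tgt_lang": "fra_Latn", "chrfpp": 60.0},
    {"src_lang": "eng_Latn", "tgt_lang": "fra_Latn", "chrfpp": 70.0},
    {"src_lang": "eng_Latn", "tgt_lang": "deu_Latn", "chrfpp": 50.0},
]

means = per_direction_means(rows, "chrfpp")
for (src, tgt), mean in sorted(means.items()):
    print(f"{src}-{tgt}: {mean:.1f}")
```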


Adding a new model to the leaderboard

  1. Generate translations + scores. For a single GPU this is one command:
    bouquet-eval run --backend vllm --model <repo> --all --output-dir runs/<name>/
    For an 8-GPU node, use the shell-loop + shard pattern from Run all directions on an 8-GPU node above (one process per GPU, then bouquet-eval aggregate --output-dir runs/<name>/).
  2. Open a PR against the leaderboard repo with runs/<name>/scores.csv and runs/<name>/manifest.json attached and the exact CLI commands in the description.
  3. We might decide to rerun the published baseline on identical hardware to spot-check before merging.

The manifest.json written to every output dir records: bouquet_eval version, full argv, backend / model / translator spec / sampling params / system prompt, the verbatim prompt template, the dataset config, and the number of directions. Resumed runs (and individual shards) append a new entry rather than overwrite, so the manifest preserves the full history of what produced the outputs.


Custom translators (programmatic)

bouquet-eval ships with two translator backends but the runner is happy with any object that implements:

class Translator(Protocol):
    def translate(self, prompts: list[str]) -> list[str]: ...

(bouquet_eval.Translator is a runtime-checkable Protocol — no subclassing required.)
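
To show what structural typing means here, the sketch below defines a local stand-in for the Protocol (so it runs without bouquet-eval installed) and a toy class that satisfies it. UppercaseTranslator and NotATranslator are hypothetical names used only for illustration.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Translator(Protocol):
    """Local stand-in mirroring the Protocol shown above."""

    def translate(self, prompts: list[str]) -> list[str]: ...


class UppercaseTranslator:
    """Toy 'model': no base class, just a matching translate method."""

    def translate(self, prompts: list[str]) -> list[str]:
        return [p.upper() for p in prompts]


class NotATranslator:
    pass


print(isinstance(UppercaseTranslator(), Translator))  # True
print(isinstance(NotATranslator(), Translator))       # False
```

Note that a runtime isinstance check against a Protocol only verifies that a translate attribute exists; it does not validate the signature or the return type, so returning one output per prompt, in order, remains your responsibility.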

Option A: drive the runner from your own script

from bouquet_eval import run_translation
from bouquet_eval.data import select_directions

class MyTranslator:
    def translate(self, prompts):
        ...  # call your model however you like

run_translation(
    translator=MyTranslator(),
    directions=select_directions(src_filter=["eng_Latn"]),
    output_dir="runs/mine/",
)

This gives you the same atomic-write / resume / per-direction JSONL output as the CLI. You can then bouquet-eval score --output-dir runs/mine/ exactly as you would for a built-in backend.
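
Atomic per-direction writes are commonly done by writing to a temporary file in the same directory and renaming it into place. The sketch below shows that general technique; it is an assumption about the approach, not a copy of bouquet-eval's io.py, and write_jsonl_atomic is a hypothetical helper.

```python
import json
import os
import tempfile
from pathlib import Path


def write_jsonl_atomic(path, rows):
    """Write rows as JSONL so that `path` only ever appears fully written."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
        os.replace(tmp_name, path)  # atomic rename over the destination
    except BaseException:
        # Clean up the partial temp file; the target is left untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "eng_Latn-fra_Latn.jsonl"
    write_jsonl_atomic(target, [{"src_text": "Hello", "mt_text": "Bonjour"}])
    written = target.read_text(encoding="utf-8")

print(written, end="")
```

With this pattern a crashed process leaves at most a stray .tmp file, so a resumed run can treat any existing <src>-<tgt>.jsonl as complete.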

Option B: plug your translator into the CLI

bouquet-eval translate \
    --translator mypackage.translators:MyTranslator \
    --translator-kwargs '{"model_path": "/path", "thinking": true}' \
    --all --output-dir runs/mine/

--translator pkg.module:Name imports the named callable (class or factory) and calls it with the JSON dict from --translator-kwargs. The returned object must have a .translate(prompts) method. When --translator is given, --backend / --model are ignored.
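
A pkg.module:Name spec of this kind is usually resolved with importlib. The following sketch illustrates the behaviour described above under that assumption; load_translator, EchoTranslator and the demo_translators module are hypothetical and exist only so the example runs on its own.

```python
import importlib
import json
import sys
import types


def load_translator(spec, kwargs_json="{}"):
    """Import `pkg.module:Name`, call it with JSON kwargs, check for .translate."""
    module_name, sep, attr_name = spec.partition(":")
    if not (sep and module_name and attr_name):
        raise ValueError(f"expected 'pkg.module:Name', got {spec!r}")
    factory = getattr(importlib.import_module(module_name), attr_name)
    translator = factory(**json.loads(kwargs_json))
    if not callable(getattr(translator, "translate", None)):
        raise TypeError(f"{spec} did not return an object with .translate(prompts)")
    return translator


class EchoTranslator:
    """Toy translator that appends a suffix instead of translating."""

    def __init__(self, suffix=""):
        self.suffix = suffix

    def translate(self, prompts):
        return [p + self.suffix for p in prompts]


# Register a throwaway module so the spec can be imported in this demo.
demo = types.ModuleType("demo_translators")
demo.EchoTranslator = EchoTranslator
sys.modules["demo_translators"] = demo

translator = load_translator("demo_translators:EchoTranslator", '{"suffix": "!"}')
print(translator.translate(["hi"]))
```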

Option C: just bring your translations

Skip translation entirely and feed pre-existing translations to the scorer. Write one file per direction at runs/mine/<src>-<tgt>.jsonl with rows of the form

{"src_lang": "eng_Latn", "tgt_lang": "fra_Latn", "src_text": "...", "ref_text": "...", "mt_text": "..."}

then run bouquet-eval score --output-dir runs/mine/.
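
Before scoring files you produced yourself, it can help to check that every row carries the five fields listed above. This is a hedged sketch: check_translation_rows is a hypothetical helper, and the rule that rows must match the direction in the file name is an assumption, not documented scorer behaviour.

```python
import json

REQUIRED_FIELDS = ("src_lang", "tgt_lang", "src_text", "ref_text", "mt_text")


def check_translation_rows(lines, src_lang, tgt_lang):
    """Return a list of human-readable problems found in JSONL lines."""
    problems = []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        row = json.loads(line)
        missing = [f for f in REQUIRED_FIELDS if f not in row]
        if missing:
            problems.append(f"line {n}: missing {', '.join(missing)}")
        elif (row["src_lang"], row["tgt_lang"]) != (src_lang, tgt_lang):
            problems.append(f"line {n}: direction does not match file name")
    return problems


good = {"src_lang": "eng_Latn", "tgt_lang": "fra_Latn",
        "src_text": "Hello", "ref_text": "Bonjour", "mt_text": "Salut"}
no_mt = {k: v for k, v in good.items() if k != "mt_text"}
wrong_dir = dict(good, tgt_lang="deu_Latn")

sample_lines = [json.dumps(r) for r in (good, no_mt, wrong_dir)]
problems = check_translation_rows(sample_lines, "eng_Latn", "fra_Latn")
print("\n".join(problems))
```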

Reproducibility note. If you customize the prompt or sampling, your scores are no longer directly comparable to leaderboard entries. When you use a custom translator or prompt fn, you are responsible for documenting how it differs in manifest.json (the CLI records --translator, --translator-kwargs, --prompt-fn, --prompt-kwargs automatically).

Option D: built-in adapters for HF encoder-decoder MT models

For convenience, bouquet_eval.translators.hf_seq2seq ships three opt-in adapters that wrap Hugging Face transformers seq2seq models behind the Translator Protocol. They are not registered as --backend values; load them via --translator like any other custom translator:

# NLLB-200 (sets tokenizer.src_lang + forced_bos_token_id)
bouquet-eval run \
    --translator bouquet_eval.translators.hf_seq2seq:NLLBTranslator \
    --translator-kwargs '{"model":"facebook/nllb-200-3.3B","num_beams":5}' \
    --prompt-fn bouquet_eval.translators.hf_seq2seq:nllb_prompt \
    --all --output-dir runs/nllb-200-3.3B/

# MADLAD-400 (prepends "<2xx>" target-language token)
bouquet-eval run \
    --translator bouquet_eval.translators.hf_seq2seq:MadladTranslator \
    --translator-kwargs '{"model":"google/madlad400-10b-mt"}' \
    --prompt-fn bouquet_eval.translators.hf_seq2seq:madlad_prompt \
    --all --output-dir runs/madlad400-10b-mt/

# Aya-101 (instruction-tuned mT5-XXL; uses the default prompt fn)
bouquet-eval run \
    --translator bouquet_eval.translators.hf_seq2seq:AyaTranslator \
    --translator-kwargs '{"model":"CohereForAI/aya-101"}' \
    --all --output-dir runs/aya-101/

See the module docstring for details.


Custom prompt functions (RAG, few-shot, custom templates)

Anywhere the CLI builds a prompt it calls a single PromptFn:

PromptFn = Callable[[str, str, str], str]   # (src_lang, tgt_lang, src_text) -> rendered_user_message

The default implementation (bouquet_eval.prompt.build_prompt) renders our zero-shot template. Any callable matching the signature can replace it.

CLI

bouquet-eval translate \
    --backend vllm --model meta-llama/Llama-3.1-8B-Instruct \
    --prompt-fn mypackage.prompts:make_rag_prompt \
    --prompt-kwargs '{"index_path": "/data/my-index", "k": 4}' \
    --src-lang eng_Latn --tgt-lang fra_Latn \
    --output-dir runs/mine/

--prompt-fn pkg.module:Name resolves the named attribute. If --prompt-kwargs is given (a JSON object), the attribute is treated as a factory and called with those kwargs; the factory must return the actual PromptFn. If --prompt-kwargs is omitted, the attribute is used directly — convenient for stateless prompt functions.
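
The factory-versus-direct rule can be summarised in a few lines. The sketch below starts from an already-imported attribute and only illustrates the decision described above; resolve_prompt_fn, plain_prompt and make_prefixed_prompt are hypothetical names, not bouquet-eval APIs.

```python
import json


def resolve_prompt_fn(attr, prompt_kwargs_json=None):
    """Return a PromptFn: call `attr` as a factory only if kwargs were given."""
    if prompt_kwargs_json is None:
        return attr  # stateless: the attribute is itself the prompt function
    kwargs = json.loads(prompt_kwargs_json)
    if not isinstance(kwargs, dict):
        raise ValueError("--prompt-kwargs must be a JSON object")
    return attr(**kwargs)  # stateful: the factory builds the prompt function


def plain_prompt(src_lang, tgt_lang, src_text):
    return f"{src_lang} -> {tgt_lang}: {src_text}"


def make_prefixed_prompt(prefix):
    def prompt_fn(src_lang, tgt_lang, src_text):
        return prefix + plain_prompt(src_lang, tgt_lang, src_text)
    return prompt_fn


direct = resolve_prompt_fn(plain_prompt)
built = resolve_prompt_fn(make_prefixed_prompt, '{"prefix": "Translate. "}')

print(direct("eng_Latn", "fra_Latn", "Hi"))
print(built("eng_Latn", "fra_Latn", "Hi"))
```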

Stateful example (RAG)

# mypackage/prompts.py
# load_index and render_with_examples stand for your own retrieval and templating helpers.
from bouquet_eval import PromptFn

def make_rag_prompt(index_path: str, k: int = 4) -> PromptFn:
    index = load_index(index_path)             # heavy, runs once at startup
    def prompt_fn(src_lang, tgt_lang, src_text):
        examples = index.retrieve(src_text, k=k)
        return render_with_examples(src_lang, tgt_lang, src_text, examples)
    return prompt_fn

Stateless example

# mypackage/prompts.py
def few_shot_prompt(src_lang: str, tgt_lang: str, src_text: str) -> str:
    return f"""Examples:
EN: Hello -> FR: Bonjour
EN: Goodbye -> FR: Au revoir

{src_lang} -> {tgt_lang}:
{src_text}"""

bouquet-eval translate ... --prompt-fn mypackage.prompts:few_shot_prompt ...

Programmatic

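from bouquet_eval import run_translation
from mypackage.prompts import make_rag_prompt      # the factory from the RAG example
from mypackage.translators import MyTranslator     # your own Translator implementation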

run_translation(
    translator=MyTranslator(),
    directions=[("eng_Latn", "fra_Latn")],
    output_dir="runs/mine/",
    prompt_fn=make_rag_prompt("/data/my-index"),
)

The default LANG_NAMES table in src/bouquet_eval/prompt.py is intentionally minimal and is only consulted by the default build_prompt. A custom prompt fn is responsible for whatever language-name lookup it needs.
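
A custom prompt fn that carries its own language-name lookup might look like the sketch below. The table contents and the template wording are illustrative assumptions and do not reproduce the default build_prompt template.

```python
# Hypothetical lookup table owned by the custom prompt fn.
MY_LANG_NAMES = {
    "eng_Latn": "English",
    "fra_Latn": "French",
    "deu_Latn": "German",
}


def named_prompt(src_lang: str, tgt_lang: str, src_text: str) -> str:
    """PromptFn that spells out language names, falling back to the raw code."""
    src_name = MY_LANG_NAMES.get(src_lang, src_lang)
    tgt_name = MY_LANG_NAMES.get(tgt_lang, tgt_lang)
    return f"Translate the following text from {src_name} to {tgt_name}.\n\n{src_text}"


print(named_prompt("eng_Latn", "fra_Latn", "Good morning."))
```

Falling back to the code keeps the function usable for directions that are not in the table, at the cost of a less natural prompt for those languages.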


Metrics

  • ChrF++ (sacrebleu, word_order=2): per-sentence character n-gram F-score. Range [0, 100].
  • MetricX-24-hybrid-xl (default google/metricx-24-hybrid-xl-v2p6): a learned regression metric. Lower is better; range [0, 25]. By default uses both source and reference; pass --metricx-qe for reference-free.
  • GlotLID (cis-lmu/glotlid): probability that the FastText LID model assigns the expected target language to the hypothesis. A small remap table aligns BOUQuET codes (e.g. cmn_Hans) to GlotLID's label space (e.g. cmn_Hani).
  • GlotLID × MetricX (glotlid_x_metricx): a composite that aggregate computes from the two metrics as glotlid_score * (1 - sqrt(metricx / 25)). The rescaling maps a perfect MetricX of 0 to 1 and the worst of 25 to 0 (while compressing the high-error tail), so the product lives in [0, 1] and rewards both correct target-language ID and high translation quality. Only emitted when both glotlid_score and metricx are present.
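
The composite formula above can be written directly as a small function. This is a sketch of the stated formula only; the function name reuses the column name for readability, and clamping MetricX to [0, 25] is an added assumption rather than documented behaviour.

```python
import math

METRICX_MAX = 25.0  # worst MetricX score, per the range given above


def glotlid_x_metricx(glotlid_score: float, metricx: float) -> float:
    """glotlid_score * (1 - sqrt(metricx / 25)), with metricx clamped to [0, 25]."""
    clamped = min(max(metricx, 0.0), METRICX_MAX)  # assumption: guard the range
    return glotlid_score * (1.0 - math.sqrt(clamped / METRICX_MAX))


# Perfect translation in the right language, worst translation, and a middle case.
for lid, mx in [(1.0, 0.0), (1.0, 25.0), (0.8, 6.25)]:
    print(lid, mx, round(glotlid_x_metricx(lid, mx), 3))
```

Because of the square root, moving MetricX from 0 to 1 already lowers the quality factor from 1.0 to 0.8, while the same one-point change near 25 barely moves it; that is the tail compression mentioned above.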

Project layout

src/bouquet_eval/
├── cli.py            # argparse entry point: translate / score / aggregate / run
├── data.py           # HF dataset loader + direction enumeration
├── prompt.py         # zero-shot translation template + lang-name table
├── translate.py      # VllmChatTranslator + OpenAIChatTranslator
├── runner.py         # translate / score / aggregate orchestration
├── io.py             # atomic per-direction writes + resume
├── utils.py          # shared helpers (token-budget batching)
├── metrics/
│   ├── chrf.py
│   ├── metricx.py
│   └── glotlid.py
└── translators/
    └── hf_seq2seq.py # opt-in HF seq2seq adapters (NLLB / MADLAD / Aya)
tests/

Heavy imports (vllm, torch, transformers) are lazy so bouquet-eval --help works on a CPU-only login node.


License

MIT. See LICENSE.

The MetricXScorer implementation is informed by google-research/metricx (Apache-2.0). The GlotLID model is from cis-lmu/glotlid (Apache-2.0). The BOUQuET dataset is described at facebook/bouquet.
