Section A – Evaluation Setup
- Which exact benchmarks are used for the quality gate? Could you confirm the complete list of benchmarks (and question counts) used to compute the quality score for Text-to-Text? Is the list fixed between Round 1 and Round 2? We will not disclose the benchmarks used for evaluation, but they will be the same for Round 1 and Round 2. We only release the task categories. Note that we will not use benchmarks that require tool use or agents. The evaluation will include Indic benchmarks.
- How is energy measured and aggregated? Energy consumption will be computed over the full benchmark suite. We do not normalize by token, because energy consumption is directly linked to the number of tokens the model produces. A model that is more verbose than another will consume more energy overall, and we want the final ranking to reflect this.
- Will you publish baseline energy numbers before Round 1? We have not set an energy baseline. If you meet the 80% quality threshold, you will then be ranked according to your model’s energy consumption (lower is better).
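The two answers above (energy summed over the whole suite with no per-token normalization, then ranking by energy among submissions that pass the quality gate) can be sketched as follows. This is only an illustration, not the organizers' harness: the `Submission` class, the field names, the energy unit, and the way the quality gate is expressed as a plain number are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    name: str
    quality: float  # hypothetical aggregate quality score, as a fraction
    energy_per_benchmark: dict[str, float]  # hypothetical: benchmark -> energy (e.g. Wh)

    @property
    def total_energy(self) -> float:
        # Energy is summed over the full suite and NOT divided by token count,
        # so a more verbose model pays for the extra tokens it generates.
        return sum(self.energy_per_benchmark.values())


def rank_submissions(submissions, quality_gate):
    """Keep submissions that meet the quality gate, then sort by total energy (lower is better)."""
    eligible = [s for s in submissions if s.quality >= quality_gate]
    return sorted(eligible, key=lambda s: s.total_energy)


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    subs = [
        Submission("a", 0.85, {"bench_1": 10.0, "bench_2": 5.0}),
        Submission("b", 0.79, {"bench_1": 1.0, "bench_2": 1.0}),  # fails the gate
        Submission("c", 0.90, {"bench_1": 6.0, "bench_2": 4.0}),
    ]
    for s in rank_submissions(subs, quality_gate=0.80):
        print(s.name, s.total_energy)
```

In this sketch, submission "b" uses the least energy but is excluded because it does not reach the gate; energy only matters once the quality threshold is met.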
Section B – Submission Mechanics
- Can we deliver our solutions in a container? The organizing team will run the evaluation with the commands ‘vllm serve --config vllm_config.yaml’ or ‘llama-server -hf model_hf’. As long as the submission works with these commands and the base model is the primary model in the inference of the submitted model, the submission will be allowed.
- What is the best way to deliver my solution on Hugging Face? You will need to upload (i) the model weights, (ii) a README describing the process you applied to the model, and (iii) a config file for vllm or llama.
- Is our max_model_len setting in vllm_config.yaml respected as-is by the evaluation harness? The max_model_len parameter will not be overridden by the evaluation harness. Note that setting this parameter too low can hurt your performance on benchmarks.
- Will the submission form accept both a vllm_config.yaml and a llama_config.yaml in the same repo, or must we pick one engine per submission? You must pick one engine.
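Before uploading, the requirements in this section (weights, a README, and exactly one engine config) can be checked locally with a small script. The sketch below is an assumption-laden helper, not an official validator: the function name `check_submission`, the README file name `README.md`, and the weight file extensions are illustrative guesses. Only the config file names `vllm_config.yaml` and `llama_config.yaml` come from the questions above.

```python
from pathlib import Path

# Config file names as they appear in the questions above.
ENGINE_CONFIGS = ("vllm_config.yaml", "llama_config.yaml")
# Assumption: weights are stored as safetensors or GGUF files.
WEIGHT_SUFFIXES = (".safetensors", ".gguf")


def check_submission(repo_dir):
    """Return a list of problems found in a local copy of the submission repo."""
    repo = Path(repo_dir)
    problems = []

    files = [p for p in repo.rglob("*") if p.is_file()]
    if not any(p.suffix in WEIGHT_SUFFIXES for p in files):
        problems.append("no model weights found")

    # Assumption: the README is a top-level file named README.md.
    if not (repo / "README.md").is_file():
        problems.append("README.md is missing")

    # Exactly one engine must be chosen per submission.
    configs = [name for name in ENGINE_CONFIGS if (repo / name).is_file()]
    if len(configs) != 1:
        problems.append(f"expected exactly one engine config, found {len(configs)}")

    return problems


if __name__ == "__main__":
    for problem in check_submission("."):
        print("problem:", problem)
```

An empty list means the local layout matches the three requirements as interpreted here. It says nothing about whether the model actually loads with the evaluation commands, which still needs to be tested separately.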
Section C – Permitted Techniques
- Is quantization allowed? Yes!
- Is distillation allowed? The base model shall be the primary model in the inference of the submitted model. As such, participants may use distillation only if the base model is the student model.
- Is finetuning on the evaluation allowed? We don’t allow further finetuning after compression.
- Is reducing num_experts_per_tok (top-k routing) permitted? Any optimization technique is allowed. Any modification to the model architecture shall be justified.
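As an illustration of the top-k routing question above, the following sketch patches a model configuration dictionary to lower `num_experts_per_tok`. It assumes the base model's `config.json` exposes that key, as some Mixture-of-Experts configurations on Hugging Face do; whether this applies to the challenge's base model is not stated here. The helper name `reduce_top_k` is hypothetical.

```python
import copy


def reduce_top_k(config, new_top_k):
    """Return a copy of a model config dict with a smaller num_experts_per_tok."""
    current = config["num_experts_per_tok"]
    if not 1 <= new_top_k <= current:
        raise ValueError(
            f"new_top_k must be between 1 and the current value {current}, got {new_top_k}"
        )
    patched = copy.deepcopy(config)  # leave the original config untouched
    patched["num_experts_per_tok"] = new_top_k
    return patched


if __name__ == "__main__":
    # Made-up config fragment, for illustration only.
    original = {"num_experts_per_tok": 8}
    print(reduce_top_k(original, 4))
```

Routing through fewer experts per token reduces compute per token but can lower benchmark scores, so a change like this would need to be justified in the README and checked against the quality threshold.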
The organisation’s Hugging Face account will be shared on the submission day.
Should you have any further technical questions, please contact the evaluation team: resilientchallenge2026@peren.gouv.fr








