Newmind AI LLM Evaluation Leaderboard
Evaluate your model's performance in the following categories:
⚔️ Auto Arena - Tournament-style evaluation where models are directly compared and ranked using an ELO rating system.
👥 Human Arena - Comparative evaluation based on human preferences, assessed by a reviewer group.
🔍 Retrieval - Evaluation focused on information retrieval and text generation quality for real-world applications.
⚡ Light Eval - Fast and efficient model evaluation framework for quick testing.
🔄 EvalMix - Multi-dimensional evaluation including lexical accuracy and semantic coherence.
🐍 Snake Bench - Specialized evaluation measuring step-by-step problem solving and complex reasoning abilities.
🧩 Structured Outputs - Coming soon!
Evaluate your model in any or all of these categories to discover its capabilities and areas of excellence.
For any questions, please contact us at info@newmind.ai
Model Evaluation Results
This screen shows model performance across different evaluation categories.
Model Performance Comparison
meta-llama/Meta-Llama-3.1-70B-Instruct | 1848.77 | 1559.97 | 0.84 | 0.33 | 0.85 | 0.58 | 0.25 | bfloat16 | Proprietary |
Arena Detailed Results
Model | ELO Rating | Win Rate (%) | 95% CI | Games | Precision | License |
---|---|---|---|---|---|---|
meta-llama/Meta-Llama-3.1-70B-Instruct | 1848.77 | 99.25 | +0.22/-0.31 | 1020.0 | bfloat16 | Proprietary |
grok-3 | 1848.77 | 99.25 | +0.22/-0.31 | 886.0 | Unknown | Proprietary |
google/gemma-3-27b-it | 1636.89 | 97.51 | +0.47/-0.42 | 896.0 | bfloat16 | Gemma |
newmindai/Qwen2.5-72b-Instruct | 1310.17 | 85.64 | +1.41/-1.21 | 953.0 | bfloat16 | Qwen |
Qwen/Qwen2.5-72B-Instruct | 1263.87 | 82.04 | +1.22/-1.52 | 926.0 | bfloat16 | Qwen |
deepseek-ai/DeepSeek-R1 | 1158.08 | 71.3 | +1.63/-1.36 | 1020.0 | bfloat16 | MIT |
microsoft/phi-4 | 1141.07 | 69.26 | +1.29/-2.12 | 824.0 | bfloat16 | MIT |
Qwen/Qwen3-32B | 1118.19 | 66.38 | +2.00/-2.26 | 1021.0 | bfloat16 | Qwen |
newmindai/Llama-3.3-70b-Instruct | 1049.29 | 57.05 | +2.13/-1.99 | 465.0 | bfloat16 | Llama-3.3 |
meta-llama/Llama-3.3-70b-Instruct | 1000 | 50 | +0.00/-0.00 | 480.0 | bfloat16 | Llama 3.3 |
Qwen/QwQ-32B | 953.18 | 43.3 | +2.04/-2.01 | 1020.0 | bfloat16 | Apache 2.0 |
grok-3-mini-fast-beta | 949.88 | 42.84 | +2.66/-1.65 | 395.0 | Unknown | Proprietary |
mistralai/Magistral-Small-2506 | 904.37 | 36.58 | +1.88/-1.77 | 633.0 | bfloat16 | Apache 2.0 |
Qwen/Qwen3-14B | 888.09 | 34.43 | +1.39/-2.11 | 1022.0 | bfloat16 | Apache 2.0 |
meta-llama/Meta-Llama-3.1-70B-Instruct | 744.81 | 18.71 | +1.46/-1.70 | 486.0 | bfloat16 | Llama 3.1 |
newmindai/QwQ-32B-r1 | 634.84 | 10.89 | +1.07/-1.20 | 257.0 | bfloat16 | Apache 2.0 |
Human Arena Results
Model | ELO Rating | Wins | Losses | Ties | Total Games | Win Rate (%) | Precision | License |
---|---|---|---|---|---|---|---|---|
meta-llama/Meta-Llama-3.1-70B-Instruct | 1749.65 | 1000 | 215 | 159 | 1374 | 72.78 | bfloat16 | Proprietary |
google/gemma-3-27b-it | 1749.65 | 1000 | 215 | 159 | 1374 | 72.78 | bfloat16 | Gemma |
newmindai/Qwen2.5-72b-Instruct | 1674.4 | 726 | 513 | 163 | 1402 | 51.78 | bfloat16 | Qwen |
microsoft/phi-4 | 1664.84 | 531 | 223 | 373 | 1127 | 47.12 | bfloat16 | MIT |
meta-llama/Llama-3.3-70b-Instruct | 1606.71 | 588 | 403 | 378 | 1369 | 42.95 | bfloat16 | Llama 3.3 |
newmindai/Llama-3.3-70b-Instruct | 1584.04 | 567 | 442 | 287 | 1296 | 43.75 | bfloat16 | Llama-3.3 |
Qwen/Qwen2.5-72B-Instruct | 1561.2 | 369 | 515 | 287 | 1171 | 31.51 | bfloat16 | Qwen |
grok-3 | 1559.97 | 429 | 171 | 436 | 1036 | 41.41 | Unknown | Proprietary |
mistralai/Magistral-Small-2506 | 1515.6 | 413 | 723 | 282 | 1418 | 29.13 | bfloat16 | Apache 2.0 |
grok-3-mini-fast-beta | 1456.57 | 386 | 419 | 218 | 1023 | 37.73 | Unknown | Proprietary |
Qwen/Qwen3-32B | 1453.11 | 344 | 429 | 181 | 954 | 36.06 | bfloat16 | Qwen |
Qwen/Qwen3-14B | 1450.2 | 658 | 471 | 177 | 1306 | 50.38 | bfloat16 | Apache 2.0 |
meta-llama/Meta-Llama-3.1-70B-Instruct | 1394.55 | 378 | 537 | 262 | 1177 | 32.12 | bfloat16 | Llama 3.1 |
Qwen/QwQ-32B | 1384.24 | 366 | 766 | 152 | 1284 | 28.5 | bfloat16 | Apache 2.0 |
deepseek-ai/DeepSeek-R1 | 1345.62 | 340 | 469 | 247 | 1056 | 32.2 | bfloat16 | MIT |
newmindai/QwQ-32B-r1 | 1225.64 | 278 | 917 | 168 | 1363 | 20.4 | bfloat16 | Apache 2.0 |
Retrieval Detailed Results
meta-llama/Meta-Llama-3.1-70B-Instruct | 0.88 | 0.89 | 9 | 112 | 4199 | 90.37 | bfloat16 | Proprietary |
Qwen/Qwen3-32B | 0.88 | 0.89 | 9 | 19 | 4199 | 90.37 | bfloat16 | Qwen |
newmindai/Qwen2.5-72b-Instruct | 0.85 | 0.9 | 8 | 24 | 5294 | 89.83 | bfloat16 | Qwen |
deepseek-ai/DeepSeek-R1 | 0.84 | 0.81 | 6 | 1 | 5339 | 90.46 | bfloat16 | MIT |
grok-3 | 0.84 | 0.93 | 8 | 36 | 4845 | 91.66 | Unknown | Proprietary |
meta-llama/Meta-Llama-3.1-70B-Instruct | 0.83 | 0.89 | 9 | 40 | 4380 | 87.8 | bfloat16 | Llama 3.1 |
mistralai/Magistral-Small-2506 | 0.78 | 0.9 | 9 | 61 | 4638 | 89.77 | bfloat16 | Apache 2.0 |
Qwen/Qwen2.5-72B-Instruct | 0.74 | 0.85 | 8 | 62 | 5430 | 90.21 | bfloat16 | Qwen |
newmindai/Llama-3.3-70b-Instruct | 0.74 | 0.91 | 8 | 75 | 4076 | 88.82 | bfloat16 | Llama-3.3 |
newmindai/QwQ-32B-r1 | 0.72 | 0.87 | 8 | 76 | 4129 | 90.89 | bfloat16 | Apache 2.0 |
Qwen/QwQ-32B | 0.68 | 0.7 | 8 | 58 | 5141 | 90.12 | bfloat16 | Apache 2.0 |
meta-llama/Llama-3.3-70b-Instruct | 0.64 | 0.88 | 8 | 112 | 4354 | 89.01 | bfloat16 | Llama 3.3 |
Qwen/Qwen3-14B | 0.59 | 0.49 | 5 | 37 | 5996 | 86.31 | bfloat16 | Apache 2.0 |
grok-3-mini-fast-beta | 0.49 | 0.65 | 6 | 115 | 5631 | 87.99 | Unknown | Proprietary |
microsoft/phi-4 | 0.46 | 0.76 | 8 | 164 | 4997 | 90.72 | bfloat16 | MIT |
google/gemma-3-27b-it | 0.42 | 0.86 | 6 | 190 | 4799 | 91.68 | bfloat16 | Gemma |
Light Eval Detailed Results
meta-llama/Meta-Llama-3.1-70B-Instruct | 0.4097 | 0.5122 | 0.4444 | 0.5039 | 0.2584 | 0.2948 | 0.4445 | bfloat16 | Proprietary |
meta-llama/Meta-Llama-3.1-70B-Instruct | 0.4097 | 0.5122 | 0.4444 | 0.5039 | 0.2584 | 0.2948 | 0.4445 | bfloat16 | Llama 3.1 |
microsoft/phi-4 | 0.3879 | 0.3948 | 0.4656 | 0.4787 | 0.2478 | 0.3538 | 0.3865 | bfloat16 | MIT |
Qwen/Qwen3-32B | 0.3614 | 0.2731 | 0.4948 | 0.515 | 0.2509 | 0.3489 | 0.2858 | bfloat16 | Qwen |
deepseek-ai/DeepSeek-R1 | 0.3575 | 0.2601 | 0.4114 | 0.4961 | 0.2421 | 0.41 | 0.3251 | bfloat16 | MIT |
Qwen/Qwen3-14B | 0.3527 | 0.2341 | 0.4924 | 0.5237 | 0.2494 | 0.3088 | 0.308 | bfloat16 | Apache 2.0 |
Qwen/Qwen2.5-72B-Instruct | 0.3461 | 0.242 | 0.4484 | 0.5 | 0.2832 | 0.36 | 0.2431 | bfloat16 | Qwen |
newmindai/Qwen2.5-72b-Instruct | 0.3432 | 0.2421 | 0.4463 | 0.5039 | 0.2772 | 0.35 | 0.24 | bfloat16 | Qwen |
meta-llama/Llama-3.3-70b-Instruct | 0.3368 | 0.242 | 0.448 | 0.506 | 0.247 | 0.28 | 0.298 | bfloat16 | Llama 3.3 |
grok-3 | 0.3329 | 0.2514 | 0.4513 | 0.4984 | 0.2518 | 0.2903 | 0.2543 | Unknown | Proprietary |
mistralai/Magistral-Small-2506 | 0.3328 | 0.2562 | 0.4377 | 0.4984 | 0.2472 | 0.2732 | 0.284 | bfloat16 | Apache 2.0 |
google/gemma-3-27b-it | 0.3285 | 0.2421 | 0.4421 | 0.5103 | 0.2404 | 0.2734 | 0.2628 | bfloat16 | Gemma |
grok-3-mini-fast-beta | 0.3229 | 0.2421 | 0.4483 | 0.4961 | 0.2505 | 0.2732 | 0.227 | Unknown | Proprietary |
newmindai/Llama-3.3-70b-Instruct | 0.3136 | 0.234 | 0.4483 | 0.476 | 0.248 | 0.18 | 0.292 | bfloat16 | Llama-3.3 |
Qwen/QwQ-32B | 0.3123 | 0.234 | 0.45 | 0.5 | 0.249 | 0.19 | 0.251 | bfloat16 | Apache 2.0 |
newmindai/QwQ-32B-r1 | 0.312 | 0.234 | 0.4428 | 0.484 | 0.282 | 0.19 | 0.238 | bfloat16 | Apache 2.0 |
EvalMix Detailed Results
meta-llama/Meta-Llama-3.1-70B-Instruct | 0.97 | 0.87 | 0.81 | 0.15 | 0.55 | 0.34 | 0.34 | 0.65 | 0.72 | bfloat16 | Proprietary |
newmindai/Llama-3.3-70b-Instruct | 0.97 | 0.87 | 0.81 | 0.15 | 0.55 | 0.34 | 0.34 | 0.7 | 0.72 | bfloat16 | Llama-3.3 |
meta-llama/Llama-3.3-70b-Instruct | 0.93 | 0.77 | 0.69 | 0.05 | 0.35 | 0.16 | 0.23 | 0.6 | 0.58 | bfloat16 | Llama 3.3 |
newmindai/Qwen2.5-72b-Instruct | 0.87 | 0.83 | 0.75 | 0.06 | 0.47 | 0.2 | 0.24 | 0.65 | 0.64 | bfloat16 | Qwen |
google/gemma-3-27b-it | 0.87 | 0.83 | 0.8 | 0.06 | 0.48 | 0.22 | 0.24 | 0.65 | 0.64 | bfloat16 | Gemma |
microsoft/phi-4 | 0.86 | 0.83 | 0.76 | 0.06 | 0.48 | 0.21 | 0.25 | 0.66 | 0.65 | bfloat16 | MIT |
grok-3 | 0.85 | 0.58 | 0.6 | 0.21 | 0.37 | 0.14 | 0.2 | 0.85 | 0.85 | Unknown | Proprietary |
Qwen/Qwen2.5-72B-Instruct | 0.84 | 0.81 | 0.73 | 0.05 | 0.45 | 0.18 | 0.22 | 0.63 | 0.62 | bfloat16 | Qwen |
mistralai/Magistral-Small-2506 | 0.84 | 0.82 | 0.67 | 0.04 | 0.37 | 0.17 | 0.23 | 0.63 | 0.66 | bfloat16 | Apache 2.0 |
meta-llama/Meta-Llama-3.1-70B-Instruct | 0.84 | 0.81 | 0.71 | 0.08 | 0.42 | 0.23 | 0.26 | 0.64 | 0.7 | bfloat16 | Llama 3.1 |
newmindai/QwQ-32B-r1 | 0.84 | 0.55 | 0.6 | 0.02 | 0.32 | 0.11 | 0.15 | 0.48 | 0.46 | bfloat16 | Apache 2.0 |
Qwen/QwQ-32B | 0.83 | 0.53 | 0.61 | 0.02 | 0.31 | 0.11 | 0.15 | 0.47 | 0.45 | bfloat16 | Apache 2.0 |
grok-3-mini-fast-beta | 0.78 | 0.79 | 0.68 | 0.03 | 0.34 | 0.14 | 0.18 | 0.57 | 0.58 | Unknown | Proprietary |
deepseek-ai/DeepSeek-R1 | 0.68 | 0.7 | 0.54 | 0.03 | 0.36 | 0.14 | 0.17 | 0.53 | 0.52 | bfloat16 | MIT |
Qwen/Qwen3-14B | 0.29 | 0.21 | 0.73 | 0.02 | 0.26 | 0.11 | 0.13 | 0.37 | 0.36 | bfloat16 | Apache 2.0 |
Qwen/Qwen3-32B | 0.29 | 0.22 | 0.7 | 0.01 | 0.21 | 0.08 | 0.1 | 0.37 | 0.36 | bfloat16 | Qwen |
Snake Benchmark Detailed Results
Model | ELO Rating | Win Rate (%) | Tie Rate (%) | Wins | Losses | Ties | Loss Rate (%) | Precision | License |
---|---|---|---|---|---|---|---|---|---|
mistralai/Magistral-Small-2506 | 1606.27 | 47.62 | 7.14 | 26 | 13 | 3 | 30.95 | bfloat16 | Apache 2.0 |
newmindai/QwQ-32B-r1 | 1606.27 | 61.9 | 7.14 | 26 | 13 | 3 | 30.95 | bfloat16 | Apache 2.0 |
Qwen/Qwen3-32B | 1543.63 | 47.62 | 11.9 | 20 | 17 | 5 | 40.48 | bfloat16 | Qwen |
Qwen/QwQ-32B | 1526.66 | 52.38 | 7.14 | 22 | 17 | 3 | 40.48 | bfloat16 | Apache 2.0 |
deepseek-r1-distill-llama-70b | 1445.85 | 35.71 | 9.52 | 15 | 23 | 4 | 54.76 | bfloat16 | MIT |
qwen-qwq-32b | 1376.17 | 28.57 | 2.38 | 12 | 29 | 1 | 69.05 | bfloat16 | Apache 2.0 |
mistralai/Magistral-Small-2506 | 1326.15 | 19.05 | 7.14 | 8 | 31 | 3 | 73.81 | bfloat16 | Apache 2.0 |
Evaluation Categories
1. ⚔️ Arena-Hard-Auto: Competitive Benchmarking at Scale
Arena-Hard-Auto is a cutting-edge automatic evaluation framework tailored for instruction-tuned Large Language Models (LLMs). Leveraging a tournament-style evaluation methodology, it pits models against each other in head-to-head matchups, with performance rankings determined via the Elo rating system, a method proven to align closely with human judgment, as evidenced in Chatbot Arena benchmarks. This evaluation suite is grounded in real-world use cases, benchmarking models across 11 diverse legal tasks that encompass an extensive set of Turkish legal question-answer pairs.
Key Evaluation Pillars
- Automated Judging: Relies on specialized judge models to assess and determine the winner in each model-versus-model comparison. Includes dynamic system prompt adaptation to ensure context-aware evaluation based on the specific domain of the query.
- Win Probability Estimation: Computes the probability of victory for each model in a matchup using a logistic regression model, offering a probabilistic understanding of comparative strength.
- Skill Rating (Elo Score): Utilizes the Elo rating system to provide a robust measurement of each model's skill level relative to its competitors, ensuring a competitive and evolving leaderboard.
- Win Rate: Measures a model's dominance by calculating the proportion of head-to-head victories, serving as a direct indicator of real-world performance across the benchmark suite.
Arena-Hard-Auto offers a fast, scalable, and reliable method to benchmark LLMs, combining quantitative rigor with realistic interaction scenarios, making it an indispensable tool for understanding model capabilities in high-stakes legal and instructional domains.
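The rating mechanics can be illustrated with a minimal Elo sketch (illustrative only; the actual leaderboard fits ratings over all battles jointly, in the spirit of the logistic-regression approach described above):

```python
# Minimal sketch of Elo-style rating updates for head-to-head model battles.
# Illustrative only: the real leaderboard estimates ratings over all battles
# at once, but the per-matchup intuition is the same.

def win_probability(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A beats model B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 for an A win, 0.5 for a tie, 0.0 for a loss."""
    expected_a = win_probability(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: a 1500-rated model beats a 1600-rated model.
print(win_probability(1500, 1600))   # ~0.36
print(update_elo(1500, 1600, 1.0))   # A gains ~20 points, B loses ~20
```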
To reproduce the results, you can use this repository: https://github.com/lmarena/arena-hard-auto
2. 🔄 EvalMix
EvalMix is a comprehensive evaluation pipeline designed to assess the performance of language models across multiple dimensions. This hybrid evaluation tool automatically analyzes model outputs, computes various semantic, LLM-based, and lexical metrics, and visualizes the results. EvalMix offers the following features:
Comprehensive Evaluation Metrics
- LLM-as-a-Judge: Uses large language models (primarily GPT variants) to evaluate the accuracy, coherence, and relevance of generated responses.
- Lexical Metrics: Calculates traditional NLP metrics such as BLEU, ROUGE-1, ROUGE-2, and ROUGE-L, along with modern metrics like BERTScore (precision, recall, F1) and cosine similarity.
- Comparative Analysis: Enables performance comparison between multiple models on the same dataset.
- Cosine Similarity (Turkish): Assesses performance in the Turkish language using Turkish-specific embedding models.
- Cosine Similarity (Multilingual): Measures multilingual performance using language-agnostic embeddings.
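For illustration, the sketch below computes lexical and embedding-based metrics of this kind; the libraries and the embedding model name are assumptions, not a description of EvalMix's internals.

```python
# Hedged sketch of lexical + embedding metrics similar in spirit to EvalMix.
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

reference = "Mahkeme, başvurunun reddine karar verdi."
candidate = "Mahkeme başvuruyu reddetmeye karar verdi."

# Lexical overlap: ROUGE-1/2/L F-measures.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
rouge = scorer.score(reference, candidate)
print({k: round(v.fmeasure, 3) for k, v in rouge.items()})

# Semantic similarity with a multilingual sentence-embedding model (assumed choice).
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
emb = model.encode([reference, candidate], convert_to_tensor=True)
print(float(util.cos_sim(emb[0], emb[1])))  # cosine similarity in [-1, 1]
```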
Generation Configuration for Evaluation
The following configuration parameters are used for model generation during evaluation:
{
"num_samples": 1100,
"random_seed": 42,
"temperature": 0.0,
"max_completion_tokens": 1024
}
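As an example of how these parameters might be applied in practice, the sketch below passes them to an OpenAI-compatible chat completions client; the endpoint URL, prompt, and model name are placeholders, not the real evaluation setup.

```python
# Sketch: applying the evaluation generation config through an OpenAI-compatible client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Vergi hukukunda zamanaşımı süresi nedir?"}],
    temperature=0.0,              # deterministic decoding, as in the config
    max_completion_tokens=1024,   # completion-length cap from the config
    seed=42,                      # random_seed analogue, if the backend supports it
)
print(response.choices[0].message.content)
```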
3. ⚡ Light-Eval
LightEval is a fast and modular framework designed to evaluate Large Language Models (LLMs) across a diverse range of tasks. It provides a comprehensive performance analysis by benchmarking models on academic, logical, scientific, and mathematical reasoning challenges.
Evaluation Tasks and Objectives
LightEval assesses model capabilities using the following six core tasks:
- MMLU (5-shot): Evaluates general knowledge and reasoning skills across academic and professional disciplines (professional-law task only).
- TruthfulQA (0-shot): Measures the model's ability to generate accurate and truthful responses.
- Winogrande (5-shot): Tests commonsense reasoning and logical inference abilities.
- Hellaswag (10-shot): Assesses the coherence and logical consistency of model predictions based on contextual cues.
- GSM8k (5-shot): Evaluates step-by-step mathematical reasoning and problem-solving capabilities.
- ARC (25-shot): Tests scientific reasoning and the ability to solve science-based problems.
Overall Score Calculation
The overall performance score of a model is computed as the average of the six evaluation tasks:
LightEval Overall Score = (MMLU_professional_law + TruthfulQA + Winogrande + Hellaswag + GSM8k + ARC) / 6
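Written as a small helper, the calculation looks like this (a sketch; the score keys and example values are placeholders):

```python
# Sketch: LightEval overall score as the plain mean of the six task scores.
def lighteval_overall(scores: dict) -> float:
    tasks = ["mmlu_professional_law", "truthfulqa", "winogrande",
             "hellaswag", "gsm8k", "arc"]
    return sum(scores[t] for t in tasks) / len(tasks)

# Example with illustrative numbers:
print(round(lighteval_overall({
    "mmlu_professional_law": 0.51, "truthfulqa": 0.44, "winogrande": 0.50,
    "hellaswag": 0.26, "gsm8k": 0.29, "arc": 0.44}), 4))
```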
To reproduce the results, you can use this repository: https://github.com/huggingface/lighteval
4. 🐍 Snake-Eval
An evaluation framework where models play the classic Snake game, competing to collect apples while avoiding collisions. Starting from random positions, models must guide their snakes using step-by-step reasoning, with performance measured as an Elo rating. This tests problem-solving ability, spatial awareness, and logical thinking in a challenging environment.
Sample Prompt:
You are controlling a snake in a multi-apple Snake game. The board size is 10x10.
Normal X,Y coordinates are used. Coordinates range from (0,0) at bottom left to (9,9) at top right.
Apples at: (9, 6), (0, 2), (5, 9), (1, 7), (9, 7)
Your snake ID: 1 which is currently positioned at (5, 1)
Enemy snakes positions:
* Snake #2 is at position (7, 1) with body at []
Board state:
9 . . . . . A . . . .
8 . . . . . . . . . .
7 . A . . . . . . . A
6 . . . . . . . . . A
5 . . . . . . . . . .
4 . . . . . . . . . .
3 . . . . . . . . . .
2 A . . . . . . . . .
1 . . . . . 1 . 2 . .
0 . . . . . . . . . .
0 1 2 3 4 5 6 7 8 9
--Your last move information:--
Direction: LEFT
Rationale: I'm noticing that (0,2) is the closest apple from our head at (6,1).
Moving LEFT starts guiding us toward this apple while avoiding the enemy snake.
Strategy: Continue left and then maneuver upward to reach the apple at (0,2).
--End of your last move information.--
Rules:
1) If you move onto an apple, you grow and gain 1 point.
2) If you hit a wall, another snake, or yourself, you die.
3) The goal is to have the most points by the end.
Decreasing x coordinate: LEFT, increasing x coordinate: RIGHT
Decreasing y coordinate: DOWN, increasing y coordinate: UP
Provide your reasoning and end with: UP, DOWN, LEFT, or RIGHT.
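A minimal sketch of how a harness might extract the final move from a model's free-form reasoning is shown below; this is illustrative parsing only, not necessarily how SnakeBench handles responses.

```python
# Sketch: pull the last direction keyword out of a model's reasoning text.
import re

def parse_move(response: str) -> str | None:
    moves = re.findall(r"\b(UP|DOWN|LEFT|RIGHT)\b", response.upper())
    return moves[-1] if moves else None  # the prompt asks models to end with one

print(parse_move("The apple at (0,2) is closest, so I continue moving LEFT."))  # LEFT
```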
To reproduce the results, you can use this repository: https://github.com/gkamradt/SnakeBench
5. 🔍 Retrieval
An evaluation system designed to assess Retrieval-Augmented Generation (RAG) capabilities. It measures how well models can:
- Retrieve relevant information from a knowledge base
- Generate accurate and contextually appropriate responses
- Maintain coherence between retrieved information and generated text
- Handle real-world information retrieval scenarios
Retrieval Metrics
- RAG Success Rate: Percentage of successful retrievals
- Maximum Correct References: Upper limit for correct retrievals per query
- Hallucinated References: Number of irrelevant documents retrieved
- Missed References: Number of relevant documents not retrieved
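As an illustration, these per-query counts can be derived from set comparisons between retrieved and gold references (a sketch with assumed field names and an assumed success criterion):

```python
# Sketch: per-query reference bookkeeping for the retrieval metrics above.
def reference_counts(retrieved_refs: set[str], gold_refs: set[str]) -> dict:
    correct = retrieved_refs & gold_refs
    return {
        "correct_references": len(correct),
        "hallucinated_references": len(retrieved_refs - gold_refs),  # retrieved but irrelevant
        "missed_references": len(gold_refs - retrieved_refs),        # relevant but not retrieved
        "success": len(correct) > 0,  # assumed criterion feeding the RAG success rate
    }

print(reference_counts({"ref-12", "ref-47"}, {"ref-47", "ref-90"}))
```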
LLM Judge Evaluation Metrics
- Legal Reasoning: Assesses the model's ability to understand and apply legal concepts
- Factual Legal Accuracy: Measures accuracy of legal facts and references
- Clarity & Precision: Evaluates clarity and precision of responses
- Factual Reliability: Checks for biases and factual accuracy
- Fluency: Assesses language fluency and coherence
- Relevance: Measures response relevance to the query
- Content Safety: Evaluates content safety and appropriateness
Judge Model: nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
RAG Score Calculation
The RAG Score is a comprehensive metric that combines multiple performance indicators using dynamic normalization across all models. The formula weights different aspects of retrieval performance:
Formula Components:
- RAG Success Rate (0.9 weight): Direct percentage of successful retrievals (higher is better)
- Normalized False Positives (0.9 weight): Hallucinated references, min-max normalized (lower is better)
- Normalized Max Correct References (0.1 weight): Maximum correct retrievals, min-max normalized (higher is better)
- Normalized Missed References (0.1 weight): Relevant documents not retrieved, min-max normalized (lower is better)
Final Score Formula:
RAG Score = (0.9 × RAG_success_rate + 0.9 × norm_false_positives +
             0.1 × norm_max_correct + 0.1 × norm_missed_refs) ÷ 2.0
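A small sketch of this computation is shown below; the inversion of the "lower is better" terms during min-max normalization is an assumption about how the normalized components are oriented, and the input values are illustrative.

```python
# Sketch: RAG Score with min-max normalization across all evaluated models.
def min_max(values):
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def rag_scores(success_rates, false_positives, max_correct, missed_refs):
    norm_fp = [1.0 - x for x in min_max(false_positives)]  # fewer hallucinated refs -> higher
    norm_mc = min_max(max_correct)                          # more correct refs -> higher
    norm_mr = [1.0 - x for x in min_max(missed_refs)]       # fewer missed refs -> higher
    return [
        (0.9 * sr + 0.9 * fp + 0.1 * mc + 0.1 * mr) / 2.0
        for sr, fp, mc, mr in zip(success_rates, norm_fp, norm_mc, norm_mr)
    ]

print(rag_scores([0.89, 0.81], [112, 1], [9, 6], [40, 30]))
```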
6. 👥 Human Arena
Human Arena is a community-driven evaluation platform where language models are compared through human preferences and voting. This evaluation method captures real-world user preferences and provides insights into model performance from a human perspective.
Evaluation Methodology
- Head-to-Head Comparisons: Models are presented with the same prompts and their responses are compared by human evaluators
- ELO Rating System: Similar to chess rankings, models gain or lose rating points based on wins, losses, and ties against other models
- Community Voting: Real users vote on which model responses they prefer, ensuring diverse evaluation perspectives
- Blind Evaluation: Evaluators see responses without knowing which model generated them, reducing bias
Key Metrics
- ELO Rating: Overall skill level based on tournament-style matchups (higher is better)
- Win Rate: Percentage of head-to-head victories against other models
- Wins/Losses/Ties: Direct comparison statistics showing model performance
- Total Games: Number of evaluation rounds completed
- Votes: Community engagement and evaluation volume
- Provider & Technical Details: Infrastructure and model configuration information
Evaluation Criteria Human evaluators consider multiple factors when comparing model responses:
- Response Quality: Accuracy, completeness, and relevance of answers
- Communication Style: Clarity, coherence, and appropriateness of language
- Helpfulness: How well the response addresses the user's needs
- Safety & Ethics: Adherence to safety guidelines and ethical considerations
- Creativity & Originality: For tasks requiring creative or innovative thinking
Human Arena provides a complementary perspective to automated benchmarks, capturing nuanced human preferences that traditional metrics might miss. This evaluation is particularly valuable for understanding how models perform in real-world conversational scenarios.
Benchmark Datasets
This section provides detailed information about the datasets used in our evaluation benchmarks. Each dataset has been carefully selected and adapted to provide comprehensive model evaluation across different domains and capabilities.
Available Datasets for Evaluation
Dataset | Evaluation Task | Language | Description |
---|---|---|---|
malhajar/mmlu_tr-v0.2 | Lighteval MMLU | Turkish | Turkish adaptation of MMLU (Massive Multitask Language Understanding) v0.2 covering 57 academic subjects including mathematics, physics, chemistry, biology, history, law, and computer science. Tests knowledge and reasoning capabilities across multiple domains with multiple-choice questions. |
malhajar/truthful_qa-tr-v0.2 | Lighteval TruthfulQA | Turkish | Turkish version of TruthfulQA (v0.2) designed to measure model truthfulness and resistance to generating false information. Contains questions where humans often answer incorrectly due to misconceptions or false beliefs, testing the model's ability to provide accurate information. |
malhajar/winogrande-tr-v0.2 | Lighteval WinoGrande | Turkish | Turkish adaptation of WinoGrande (v0.2) focusing on commonsense reasoning through pronoun resolution tasks. Tests the model's ability to understand context, make logical inferences, and resolve ambiguous pronouns in everyday scenarios. |
malhajar/hellaswag_tr-v0.2 | Lighteval HellaSwag | Turkish | Turkish version of HellaSwag (v0.2) for commonsense reasoning evaluation. Tests the model's ability to predict plausible continuations of everyday scenarios and activities, requiring understanding of common sense and typical human behavior patterns. |
malhajar/arc-tr-v0.2 | Lighteval ARC | Turkish | Turkish adaptation of ARC (AI2 Reasoning Challenge) v0.2 focusing on science reasoning and question answering. Contains grade school level science questions that require reasoning beyond simple factual recall, covering topics in physics, chemistry, biology, and earth science. |
malhajar/gsm8k_tr-v0.2 | Lighteval GSM8K | Turkish | Turkish version of GSM8K (Grade School Math 8K) v0.2 for mathematical reasoning evaluation. Contains grade school level math word problems that require multi-step reasoning, arithmetic operations, and logical problem-solving skills to arrive at the correct numerical answer. |
newmindai/mezura-eval-data | Auto-Arena | Turkish | The mezura-eval dataset is a Turkish-language legal text dataset designed for evaluation tasks with RAG context support. The subsets cover domains such as Environmental Law, Tax Law, Data Protection Law, and Health Law, each containing annotated samples. Every row includes structured fields such as the category, concept, input, and contextual information drawn from sources like official decisions. |
newmindai/mezura-eval-data | EvalMix | Turkish | The mezura-eval dataset is a Turkish-language legal text dataset designed for evaluation tasks with RAG context support. The subsets cover domains such as Environmental Law, Tax Law, Data Protection Law, and Health Law, each containing annotated samples. Every row includes structured fields such as the category, concept, input, and contextual information drawn from sources like official decisions. |
newmindai/mezura-eval-data | Retrieval | Turkish | The mezura-eval dataset is a Turkish-language legal text dataset designed for evaluation tasks with RAG context support. The subsets cover domains such as Environmental Law, Tax Law, Data Protection Law, and Health Law, each containing annotated samples. Every row includes structured fields such as the category, concept, input, and contextual information drawn from sources like official decisions. |
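The benchmark datasets can be inspected directly from the Hugging Face Hub; the sketch below loads two of them (split and config names are not guaranteed and should be checked on each dataset card).

```python
# Sketch: loading two of the benchmark datasets from the Hugging Face Hub.
# Config and split names may differ per dataset; check the dataset cards first.
from datasets import load_dataset

mmlu_tr = load_dataset("malhajar/mmlu_tr-v0.2")
mezura = load_dataset("newmindai/mezura-eval-data")

print(mmlu_tr)                             # available splits and row counts
print(mezura[list(mezura.keys())[0]][0])   # peek at one annotated sample
```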
Model Evaluation
Evaluation Process:
1. Login to Your Hugging Face Account
- You must be logged in to submit models for evaluation
2. Enter Model Name
- Input the HuggingFace model name or path you want to evaluate
- Example: meta-llama/Meta-Llama-3.1-70B-Instruct
3. Select Base Model
- Choose the base model from the dropdown list
- The system will verify that your repository is a valid Hugging Face repository
- It will also check whether the model was trained from the selected base model
4. Start Evaluation
- Click the "Start All Benchmarks" button to begin the evaluation
- If validation passes, your request will be processed
- If validation fails, you'll see an error message
Important Limitations:
- The model repository must be a maximum of 750 MB in size.
- For trained adapters, the maximum LoRA rank must be 32.
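For adapter submissions, the rank limit corresponds to keeping the LoRA rank at or below 32 when training with PEFT; the sketch below shows one such configuration, with target modules and other hyperparameters as illustrative choices only.

```python
# Sketch: a PEFT LoRA configuration that stays within the rank-32 submission limit.
# Target modules and alpha/dropout values are illustrative, not required settings.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                        # must not exceed the leaderboard's max LoRA rank
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```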
Model Submission
Note: Currently, only adapter models are supported. Merged models are not yet supported.
- Optional: enable reasoning capability during evaluation