Newmind AI LLM Evaluation Leaderboard

Evaluate your model's performance in the following categories:

  1. ⚔️ Auto Arena - Tournament-style evaluation where models are directly compared and ranked using an ELO rating system (see the rating-update sketch after this list).

  2. 👥 Human Arena - Comparative evaluation based on human preferences, assessed by a reviewer group.

  3. 📚 Retrieval - Evaluation focused on information retrieval and text generation quality for real-world applications.

  4. ⚡ Light Eval - Lightweight evaluation framework for fast, efficient model testing.

  5. 🔄 EvalMix - Multi-dimensional evaluation including lexical accuracy and semantic coherence.

  6. 🐍 Snake Bench - Specialized evaluation measuring step-by-step problem solving and complex reasoning abilities.

  7. 🧩 Structured Outputs - Evaluation of models' ability to generate properly formatted, structured responses with accurate field extraction and semantic understanding (see the field-check sketch after this list).
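
The following is a minimal sketch of the kind of ELO update used for pairwise rankings (item 1). The K-factor of 32 and the 1500 starting rating are common defaults, not confirmed Auto Arena parameters.

```python
# Minimal ELO update sketch for pairwise model comparisons.
# Assumptions: K-factor of 32 and a 1500 starting rating; the Auto Arena's
# actual parameters may differ.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the ELO model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return the new (rating_a, rating_b) after one head-to-head comparison."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (s_a - e_a)
    new_b = rating_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b

# Example: two models start at 1500; model A wins one comparison.
a, b = update_elo(1500, 1500, a_won=True)
print(round(a), round(b))  # 1516 1484
```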
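
Below is a minimal sketch of a field-level check of the kind Structured Outputs measures (item 7). The three-field schema ("title", "date", "amount") is a hypothetical example, not the leaderboard's actual format or scoring.

```python
import json

# Hypothetical expected schema for a structured response; the real
# Structured Outputs benchmark defines its own fields and scoring.
EXPECTED_FIELDS = {"title": str, "date": str, "amount": float}

def check_structured_output(response_text: str) -> dict:
    """Return per-field pass/fail results for one model response."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        # Unparseable output fails every field check.
        return {field: False for field in EXPECTED_FIELDS}
    return {
        field: isinstance(payload.get(field), expected_type)
        for field, expected_type in EXPECTED_FIELDS.items()
    }

# Example: "amount" comes back as a string, so only 2 of 3 checks pass.
print(check_structured_output('{"title": "Invoice 42", "date": "2026-01-26", "amount": "19.99"}'))
```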

Evaluate your model in any or all of these categories to discover its capabilities and areas of excellence.

For any questions, please contact us at info@newmind.ai

Model Evaluation Results

This page shows model performance across the different evaluation categories.

Human Arena Results (updated 26.01.2026)

meta-llama/Llama-3.3-70B-Instruct | 1628 | 111 | 103 | 20 | 191 | 58 | bfloat16 | Proprietary