Newmind AI LLM Evaluation Leaderboard
Evaluate your model's performance in the following categories:
⚔️ Auto Arena - Tournament-style evaluation where models are directly compared and ranked using an ELO rating system.
👥 Human Arena - Comparative evaluation based on human preferences, assessed by a reviewer group.
🔍 Retrieval - Evaluation focused on information retrieval and text generation quality for real-world applications.
⚡ Light Eval - Fast and efficient model evaluation framework for quick testing.
🔄 EvalMix - Multi-dimensional evaluation including lexical accuracy and semantic coherence.
🐍 Snake Bench - Specialized evaluation measuring step-by-step problem solving and complex reasoning abilities.
🧩 Structured Outputs - Coming soon!
Evaluate your model in any or all of these categories to discover its capabilities and areas of excellence.
For any questions, please contact us at info@newmind.ai
Model Evaluation Results
This screen shows model performance across different evaluation categories.
Model Performance Comparison
meta-llama/Meta-Llama-3.1-70B-Instruct | 1848.77 | 1559.97 | 0.84 | 0.33 | 0.85 | 0.58 | 0.25 | bfloat16 | Proprietary |
Arena Detailed Results
Model | ELO Rating | Win Rate (%) | 95% CI | Games | Precision | License |
---|---|---|---|---|---|---|
meta-llama/Meta-Llama-3.1-70B-Instruct | 1848.77 | 99.25 | +0.22/-0.31 | 1020.0 | bfloat16 | Proprietary |
grok-3 | 1848.77 | 99.25 | +0.22/-0.31 | 886.0 | Unknown | Proprietary |
google/gemma-3-27b-it | 1636.89 | 97.51 | +0.47/-0.42 | 896.0 | bfloat16 | Gemma |
newmindai/Qwen2.5-72b-Instruct | 1310.17 | 85.64 | +1.41/-1.21 | 953.0 | bfloat16 | Qwen |
Qwen/Qwen2.5-72B-Instruct | 1263.87 | 82.04 | +1.22/-1.52 | 926.0 | bfloat16 | Qwen |
deepseek-ai/DeepSeek-R1 | 1158.08 | 71.3 | +1.63/-1.36 | 1020.0 | bfloat16 | MIT |
microsoft/phi-4 | 1141.07 | 69.26 | +1.29/-2.12 | 824.0 | bfloat16 | MIT |
Qwen/Qwen3-32B | 1118.19 | 66.38 | +2.00/-2.26 | 1021.0 | bfloat16 | Qwen |
newmindai/Llama-3.3-70b-Instruct | 1049.29 | 57.05 | +2.13/-1.99 | 465.0 | bfloat16 | Llama-3.3 |
meta-llama/Llama-3.3-70b-Instruct | 1000 | 50 | +0.00/-0.00 | 480.0 | bfloat16 | Llama 3.3 |
Qwen/QwQ-32B | 953.18 | 43.3 | +2.04/-2.01 | 1020.0 | bfloat16 | Apache 2.0 |
grok-3-mini-fast-beta | 949.88 | 42.84 | +2.66/-1.65 | 395.0 | Unknown | Proprietary |
mistralai/Magistral-Small-2506 | 904.37 | 36.58 | +1.88/-1.77 | 633.0 | bfloat16 | Apache 2.0 |
Qwen/Qwen3-14B | 888.09 | 34.43 | +1.39/-2.11 | 1022.0 | bfloat16 | Apache 2.0 |
meta-llama/Meta-Llama-3.1-70B-Instruct | 744.81 | 18.71 | +1.46/-1.70 | 486.0 | bfloat16 | Llama 3.1 |
newmindai/QwQ-32B-r1 | 634.84 | 10.89 | +1.07/-1.20 | 257.0 | bfloat16 | Apache 2.0 |
Human Arena Results
Model | ELO Rating | Wins | Losses | Ties | Total Games | Win Rate (%) | Precision | License |
---|---|---|---|---|---|---|---|---|
meta-llama/Meta-Llama-3.1-70B-Instruct | 1749.65 | 1000 | 215 | 159 | 1374 | 72.78 | bfloat16 | Proprietary |
google/gemma-3-27b-it | 1749.65 | 1000 | 215 | 159 | 1374 | 72.78 | bfloat16 | Gemma |
newmindai/Qwen2.5-72b-Instruct | 1674.4 | 726 | 513 | 163 | 1402 | 51.78 | bfloat16 | Qwen |
microsoft/phi-4 | 1664.84 | 531 | 223 | 373 | 1127 | 47.12 | bfloat16 | MIT |
meta-llama/Llama-3.3-70b-Instruct | 1606.71 | 588 | 403 | 378 | 1369 | 42.95 | bfloat16 | Llama 3.3 |
newmindai/Llama-3.3-70b-Instruct | 1584.04 | 567 | 442 | 287 | 1296 | 43.75 | bfloat16 | Llama-3.3 |
Qwen/Qwen2.5-72B-Instruct | 1561.2 | 369 | 515 | 287 | 1171 | 31.51 | bfloat16 | Qwen |
grok-3 | 1559.97 | 429 | 171 | 436 | 1036 | 41.41 | Unknown | Proprietary |
mistralai/Magistral-Small-2506 | 1515.6 | 413 | 723 | 282 | 1418 | 29.13 | bfloat16 | Apache 2.0 |
grok-3-mini-fast-beta | 1456.57 | 386 | 419 | 218 | 1023 | 37.73 | Unknown | Proprietary |
Qwen/Qwen3-32B | 1453.11 | 344 | 429 | 181 | 954 | 36.06 | bfloat16 | Qwen |
Qwen/Qwen3-14B | 1450.2 | 658 | 471 | 177 | 1306 | 50.38 | bfloat16 | Apache 2.0 |
meta-llama/Meta-Llama-3.1-70B-Instruct | 1394.55 | 378 | 537 | 262 | 1177 | 32.12 | bfloat16 | Llama 3.1 |
Qwen/QwQ-32B | 1384.24 | 366 | 766 | 152 | 1284 | 28.5 | bfloat16 | Apache 2.0 |
deepseek-ai/DeepSeek-R1 | 1345.62 | 340 | 469 | 247 | 1056 | 32.2 | bfloat16 | MIT |
newmindai/QwQ-32B-r1 | 1225.64 | 278 | 917 | 168 | 1363 | 20.4 | bfloat16 | Apache 2.0 |
Retrieval Detailed Results
meta-llama/Meta-Llama-3.1-70B-Instruct | 0.88 | 0.89 | 9 | 112 | 4199 | 90.37 | bfloat16 | Proprietary |
Qwen/Qwen3-32B | 0.88 | 0.89 | 9 | 19 | 4199 | 90.37 | bfloat16 | Qwen |
newmindai/Qwen2.5-72b-Instruct | 0.85 | 0.9 | 8 | 24 | 5294 | 89.83 | bfloat16 | Qwen |
deepseek-ai/DeepSeek-R1 | 0.84 | 0.81 | 6 | 1 | 5339 | 90.46 | bfloat16 | MIT |
grok-3 | 0.84 | 0.93 | 8 | 36 | 4845 | 91.66 | Unknown | Proprietary |
meta-llama/Meta-Llama-3.1-70B-Instruct | 0.83 | 0.89 | 9 | 40 | 4380 | 87.8 | bfloat16 | Llama 3.1 |
mistralai/Magistral-Small-2506 | 0.78 | 0.9 | 9 | 61 | 4638 | 89.77 | bfloat16 | Apache 2.0 |
Qwen/Qwen2.5-72B-Instruct | 0.74 | 0.85 | 8 | 62 | 5430 | 90.21 | bfloat16 | Qwen |
newmindai/Llama-3.3-70b-Instruct | 0.74 | 0.91 | 8 | 75 | 4076 | 88.82 | bfloat16 | Llama-3.3 |
newmindai/QwQ-32B-r1 | 0.72 | 0.87 | 8 | 76 | 4129 | 90.89 | bfloat16 | Apache 2.0 |
Qwen/QwQ-32B | 0.68 | 0.7 | 8 | 58 | 5141 | 90.12 | bfloat16 | Apache 2.0 |
meta-llama/Llama-3.3-70b-Instruct | 0.64 | 0.88 | 8 | 112 | 4354 | 89.01 | bfloat16 | Llama 3.3 |
Qwen/Qwen3-14B | 0.59 | 0.49 | 5 | 37 | 5996 | 86.31 | bfloat16 | Apache 2.0 |
grok-3-mini-fast-beta | 0.49 | 0.65 | 6 | 115 | 5631 | 87.99 | Unknown | Proprietary |
microsoft/phi-4 | 0.46 | 0.76 | 8 | 164 | 4997 | 90.72 | bfloat16 | MIT |
google/gemma-3-27b-it | 0.42 | 0.86 | 6 | 190 | 4799 | 91.68 | bfloat16 | Gemma |
Light Eval Detailed Results
meta-llama/Meta-Llama-3.1-70B-Instruct | 0.4097 | 0.5122 | 0.4444 | 0.5039 | 0.2584 | 0.2948 | 0.4445 | bfloat16 | Proprietary |
meta-llama/Meta-Llama-3.1-70B-Instruct | 0.4097 | 0.5122 | 0.4444 | 0.5039 | 0.2584 | 0.2948 | 0.4445 | bfloat16 | Llama 3.1 |
microsoft/phi-4 | 0.3879 | 0.3948 | 0.4656 | 0.4787 | 0.2478 | 0.3538 | 0.3865 | bfloat16 | MIT |
Qwen/Qwen3-32B | 0.3614 | 0.2731 | 0.4948 | 0.515 | 0.2509 | 0.3489 | 0.2858 | bfloat16 | Qwen |
deepseek-ai/DeepSeek-R1 | 0.3575 | 0.2601 | 0.4114 | 0.4961 | 0.2421 | 0.41 | 0.3251 | bfloat16 | MIT |
Qwen/Qwen3-14B | 0.3527 | 0.2341 | 0.4924 | 0.5237 | 0.2494 | 0.3088 | 0.308 | bfloat16 | Apache 2.0 |
Qwen/Qwen2.5-72B-Instruct | 0.3461 | 0.242 | 0.4484 | 0.5 | 0.2832 | 0.36 | 0.2431 | bfloat16 | Qwen |
newmindai/Qwen2.5-72b-Instruct | 0.3432 | 0.2421 | 0.4463 | 0.5039 | 0.2772 | 0.35 | 0.24 | bfloat16 | Qwen |
meta-llama/Llama-3.3-70b-Instruct | 0.3368 | 0.242 | 0.448 | 0.506 | 0.247 | 0.28 | 0.298 | bfloat16 | Llama 3.3 |
grok-3 | 0.3329 | 0.2514 | 0.4513 | 0.4984 | 0.2518 | 0.2903 | 0.2543 | Unknown | Proprietary |
mistralai/Magistral-Small-2506 | 0.3328 | 0.2562 | 0.4377 | 0.4984 | 0.2472 | 0.2732 | 0.284 | bfloat16 | Apache 2.0 |
google/gemma-3-27b-it | 0.3285 | 0.2421 | 0.4421 | 0.5103 | 0.2404 | 0.2734 | 0.2628 | bfloat16 | Gemma |
grok-3-mini-fast-beta | 0.3229 | 0.2421 | 0.4483 | 0.4961 | 0.2505 | 0.2732 | 0.227 | Unknown | Proprietary |
newmindai/Llama-3.3-70b-Instruct | 0.3136 | 0.234 | 0.4483 | 0.476 | 0.248 | 0.18 | 0.292 | bfloat16 | Llama-3.3 |
Qwen/QwQ-32B | 0.3123 | 0.234 | 0.45 | 0.5 | 0.249 | 0.19 | 0.251 | bfloat16 | Apache 2.0 |
newmindai/QwQ-32B-r1 | 0.312 | 0.234 | 0.4428 | 0.484 | 0.282 | 0.19 | 0.238 | bfloat16 | Apache 2.0 |
EvalMix Detailed Results
meta-llama/Meta-Llama-3.1-70B-Instruct | 0.97 | 0.87 | 0.81 | 0.15 | 0.55 | 0.34 | 0.34 | 0.65 | 0.72 | bfloat16 | Proprietary |
newmindai/Llama-3.3-70b-Instruct | 0.97 | 0.87 | 0.81 | 0.15 | 0.55 | 0.34 | 0.34 | 0.7 | 0.72 | bfloat16 | Llama-3.3 |
meta-llama/Llama-3.3-70b-Instruct | 0.93 | 0.77 | 0.69 | 0.05 | 0.35 | 0.16 | 0.23 | 0.6 | 0.58 | bfloat16 | Llama 3.3 |
newmindai/Qwen2.5-72b-Instruct | 0.87 | 0.83 | 0.75 | 0.06 | 0.47 | 0.2 | 0.24 | 0.65 | 0.64 | bfloat16 | Qwen |
google/gemma-3-27b-it | 0.87 | 0.83 | 0.8 | 0.06 | 0.48 | 0.22 | 0.24 | 0.65 | 0.64 | bfloat16 | Gemma |
microsoft/phi-4 | 0.86 | 0.83 | 0.76 | 0.06 | 0.48 | 0.21 | 0.25 | 0.66 | 0.65 | bfloat16 | MIT |
grok-3 | 0.85 | 0.58 | 0.6 | 0.21 | 0.37 | 0.14 | 0.2 | 0.85 | 0.85 | Unknown | Proprietary |
Qwen/Qwen2.5-72B-Instruct | 0.84 | 0.81 | 0.73 | 0.05 | 0.45 | 0.18 | 0.22 | 0.63 | 0.62 | bfloat16 | Qwen |
mistralai/Magistral-Small-2506 | 0.84 | 0.82 | 0.67 | 0.04 | 0.37 | 0.17 | 0.23 | 0.63 | 0.66 | bfloat16 | Apache 2.0 |
meta-llama/Meta-Llama-3.1-70B-Instruct | 0.84 | 0.81 | 0.71 | 0.08 | 0.42 | 0.23 | 0.26 | 0.64 | 0.7 | bfloat16 | Llama 3.1 |
newmindai/QwQ-32B-r1 | 0.84 | 0.55 | 0.6 | 0.02 | 0.32 | 0.11 | 0.15 | 0.48 | 0.46 | bfloat16 | Apache 2.0 |
Qwen/QwQ-32B | 0.83 | 0.53 | 0.61 | 0.02 | 0.31 | 0.11 | 0.15 | 0.47 | 0.45 | bfloat16 | Apache 2.0 |
grok-3-mini-fast-beta | 0.78 | 0.79 | 0.68 | 0.03 | 0.34 | 0.14 | 0.18 | 0.57 | 0.58 | Unknown | Proprietary |
deepseek-ai/DeepSeek-R1 | 0.68 | 0.7 | 0.54 | 0.03 | 0.36 | 0.14 | 0.17 | 0.53 | 0.52 | bfloat16 | MIT |
Qwen/Qwen3-14B | 0.29 | 0.21 | 0.73 | 0.02 | 0.26 | 0.11 | 0.13 | 0.37 | 0.36 | bfloat16 | Apache 2.0 |
Qwen/Qwen3-32B | 0.29 | 0.22 | 0.7 | 0.01 | 0.21 | 0.08 | 0.1 | 0.37 | 0.36 | bfloat16 | Qwen |
Snake Benchmark Detailed Results
Model | ELO Rating | Win Rate (%) | Tie Rate (%) | Wins | Losses | Ties | Loss Rate (%) | Precision | License |
---|---|---|---|---|---|---|---|---|---|
mistralai/Magistral-Small-2506 | 1606.27 | 47.62 | 7.14 | 26 | 13 | 3 | 30.95 | bfloat16 | Apache 2.0 |
newmindai/QwQ-32B-r1 | 1606.27 | 61.9 | 7.14 | 26 | 13 | 3 | 30.95 | bfloat16 | Apache 2.0 |
Qwen/Qwen3-32B | 1543.63 | 47.62 | 11.9 | 20 | 17 | 5 | 40.48 | bfloat16 | Qwen |
Qwen/QwQ-32B | 1526.66 | 52.38 | 7.14 | 22 | 17 | 3 | 40.48 | bfloat16 | Apache 2.0 |
deepseek-r1-distill-llama-70b | 1445.85 | 35.71 | 9.52 | 15 | 23 | 4 | 54.76 | bfloat16 | MIT |
qwen-qwq-32b | 1376.17 | 28.57 | 2.38 | 12 | 29 | 1 | 69.05 | bfloat16 | Apache 2.0 |
mistralai/Magistral-Small-2506 | 1326.15 | 19.05 | 7.14 | 8 | 31 | 3 | 73.81 | bfloat16 | Apache 2.0 |
Evaluation Categories
1. ⚔️ Arena-Hard-Auto: Competitive Benchmarking at Scale
Arena-Hard-Auto is a cutting-edge automatic evaluation framework tailored for instruction-tuned Large Language Models (LLMs). Leveraging a tournament-style evaluation methodology, it pits models against each other in head-to-head matchups, with performance rankings determined via the Elo rating system, a method proven to align closely with human judgment, as evidenced in Chatbot Arena benchmarks. This evaluation suite is grounded in real-world use cases, benchmarking models across 11 diverse legal tasks that encompass an extensive set of Turkish legal question-answer pairs.
Key Evaluation Pillars
- Automated Judging: Relies on specialized judge models to assess and determine the winner in each model-versus-model comparison. Includes dynamic system prompt adaptation to ensure context-aware evaluation based on the specific domain of the query.
- Win Probability Estimation: Computes the probability of victory for each model in a matchup using a logistic regression model, offering a probabilistic understanding of comparative strength.
- Skill Rating (Elo Score): Utilizes the Elo rating system to provide a robust measurement of each model's skill level relative to its competitors, ensuring a competitive and evolving leaderboard.
- Win Rate: Measures a model's dominance by calculating the proportion of head-to-head victories, serving as a direct indicator of real-world performance across the benchmark suite.
Arena-Hard-Auto offers a fast, scalable, and reliable method to benchmark LLMs, combining quantitative rigor with realistic interaction scenarios, making it an indispensable tool for understanding model capabilities in high-stakes legal and instructional domains.
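The rating mechanics can be illustrated with a minimal Elo sketch (illustrative only; the actual leaderboard fits ratings over all battles jointly, in the spirit of the logistic-regression approach described above):

```python
# Minimal sketch of Elo-style rating updates for head-to-head model battles.
# Illustrative only: the real leaderboard estimates ratings over all battles
# at once, but the per-matchup intuition is the same.

def win_probability(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A beats model B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 for an A win, 0.5 for a tie, 0.0 for a loss."""
    expected_a = win_probability(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: a 1500-rated model beats a 1600-rated model.
print(win_probability(1500, 1600))   # ~0.36
print(update_elo(1500, 1600, 1.0))   # A gains ~20 points, B loses ~20
```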
To reproduce the results, you can use this repository: https://github.com/lmarena/arena-hard-auto
2. 🔄 EvalMix
EvalMix is a comprehensive evaluation pipeline designed to assess the performance of language models across multiple dimensions. This hybrid evaluation tool automatically analyzes model outputs, computes various semantic, LLM-based, and lexical metrics, and visualizes the results. EvalMix offers the following features:
Comprehensive Evaluation Metrics
- LLM-as-a-Judge: Uses large language models (primarily GPT variants) to evaluate the accuracy, coherence, and relevance of generated responses.
- Lexical Metrics: Calculates traditional NLP metrics such as BLEU, ROUGE-1, ROUGE-2, and ROUGE-L, along with modern metrics like BERTScore (precision, recall, F1) and cosine similarity.
- Comparative Analysis: Enables performance comparison between multiple models on the same dataset.
- Cosine Similarity (Turkish): Assesses performance in the Turkish language using Turkish-specific embedding models.
- Cosine Similarity (Multilingual): Measures multilingual performance using language-agnostic embeddings.
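For illustration, the sketch below computes lexical and embedding-based metrics of this kind; the libraries and the embedding model name are assumptions, not a description of EvalMix's internals.

```python
# Hedged sketch of lexical + embedding metrics similar in spirit to EvalMix.
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

reference = "Mahkeme, başvurunun reddine karar verdi."
candidate = "Mahkeme başvuruyu reddetmeye karar verdi."

# Lexical overlap: ROUGE-1/2/L F-measures.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
rouge = scorer.score(reference, candidate)
print({k: round(v.fmeasure, 3) for k, v in rouge.items()})

# Semantic similarity with a multilingual sentence-embedding model (assumed choice).
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
emb = model.encode([reference, candidate], convert_to_tensor=True)
print(float(util.cos_sim(emb[0], emb[1])))  # cosine similarity in [-1, 1]
```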
Generation Configuration for Evaluation
The following configuration parameters are used for model generation during evaluation:
{
"num_samples": 1100,
"random_seed": 42,
"temperature": 0.0,
"max_completion_tokens": 1024
}
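As an example of how these parameters might be applied in practice, the sketch below passes them to an OpenAI-compatible chat completions client; the endpoint URL, prompt, and model name are placeholders, not the real evaluation setup.

```python
# Sketch: applying the evaluation generation config through an OpenAI-compatible client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Vergi hukukunda zamanaşımı süresi nedir?"}],
    temperature=0.0,              # deterministic decoding, as in the config
    max_completion_tokens=1024,   # completion-length cap from the config
    seed=42,                      # random_seed analogue, if the backend supports it
)
print(response.choices[0].message.content)
```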
3. ⚡ Light-Eval
LightEval is a fast and modular framework designed to evaluate Large Language Models (LLMs) across a diverse range of tasks. It provides a comprehensive performance analysis by benchmarking models on academic, logical, scientific, and mathematical reasoning challenges.
Evaluation Tasks and Objectives
LightEval assesses model capabilities using the following six core tasks:
- MMLU (5-shot): Evaluates general knowledge and reasoning skills across academic and professional disciplines (professional-law task only).
- TruthfulQA (0-shot): Measures the model's ability to generate accurate and truthful responses.
- Winogrande (5-shot): Tests commonsense reasoning and logical inference abilities.
- Hellaswag (10-shot): Assesses the coherence and logical consistency of model predictions based on contextual cues.
- GSM8k (5-shot): Evaluates step-by-step mathematical reasoning and problem-solving capabilities.
- ARC (25-shot): Tests scientific reasoning and the ability to solve science-based problems.
Overall Score Calculation
The overall performance score of a model is computed as the average of the six evaluation tasks:
LightEval Overall Score = (MMLU_professional_law + TruthfulQA + Winogrande + Hellaswag + GSM8k + ARC) / 6
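Written as a small helper, the calculation looks like this (a sketch; the score keys and example values are placeholders):

```python
# Sketch: LightEval overall score as the plain mean of the six task scores.
def lighteval_overall(scores: dict) -> float:
    tasks = ["mmlu_professional_law", "truthfulqa", "winogrande",
             "hellaswag", "gsm8k", "arc"]
    return sum(scores[t] for t in tasks) / len(tasks)

# Example with illustrative numbers:
print(round(lighteval_overall({
    "mmlu_professional_law": 0.51, "truthfulqa": 0.44, "winogrande": 0.50,
    "hellaswag": 0.26, "gsm8k": 0.29, "arc": 0.44}), 4))
```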
To reproduce the results, you can use this repository: https://github.com/huggingface/lighteval
4. 🐍 Snake-Eval
An evaluation framework where models play the classic Snake game, competing to collect apples while avoiding collisions. Starting from random positions, models must guide their snakes using step-by-step reasoning, with performance measured as an Elo rating. This tests problem-solving ability, spatial awareness, and logical thinking in a challenging environment.
Sample Prompt:
You are controlling a snake in a multi-apple Snake game. The board size is 10x10.
Normal X,Y coordinates are used. Coordinates range from (0,0) at bottom left to (9,9) at top right.
Apples at: (9, 6), (0, 2), (5, 9), (1, 7), (9, 7)
Your snake ID: 1 which is currently positioned at (5, 1)
Enemy snakes positions:
* Snake #2 is at position (7, 1) with body at []
Board state:
9 . . . . . A . . . .
8 . . . . . . . . . .
7 . A . . . . . . . A
6 . . . . . . . . . A
5 . . . . . . . . . .
4 . . . . . . . . . .
3 . . . . . . . . . .
2 A . . . . . . . . .
1 . . . . . 1 . 2 . .
0 . . . . . . . . . .
0 1 2 3 4 5 6 7 8 9
--Your last move information:--
Direction: LEFT
Rationale: I'm noticing that (0,2) is the closest apple from our head at (6,1).
Moving LEFT starts guiding us toward this apple while avoiding the enemy snake.
Strategy: Continue left and then maneuver upward to reach the apple at (0,2).
--End of your last move information.--
Rules:
1) If you move onto an apple, you grow and gain 1 point.
2) If you hit a wall, another snake, or yourself, you die.
3) The goal is to have the most points by the end.
Decreasing x coordinate: LEFT, increasing x coordinate: RIGHT
Decreasing y coordinate: DOWN, increasing y coordinate: UP
Provide your reasoning and end with: UP, DOWN, LEFT, or RIGHT.
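A minimal sketch of how a harness might extract the final move from a model's free-form reasoning is shown below; this is illustrative parsing only, not necessarily how SnakeBench handles responses.

```python
# Sketch: pull the last direction keyword out of a model's reasoning text.
import re

def parse_move(response: str) -> str | None:
    moves = re.findall(r"\b(UP|DOWN|LEFT|RIGHT)\b", response.upper())
    return moves[-1] if moves else None  # the prompt asks models to end with one

print(parse_move("The apple at (0,2) is closest, so I continue moving LEFT."))  # LEFT
```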
To reproduce the results, you can use this repository: https://github.com/gkamradt/SnakeBench
5. 🔍 Retrieval
An evaluation system designed to assess Retrieval-Augmented Generation (RAG) capabilities. It measures how well models can:
- Retrieve relevant information from a knowledge base
- Generate accurate and contextually appropriate responses
- Maintain coherence between retrieved information and generated text
- Handle real-world information retrieval scenarios
Retrieval Metrics
- RAG Success Rate: Percentage of successful retrievals
- Maximum Correct References: Upper limit for correct retrievals per query
- Hallucinated References: Number of irrelevant documents retrieved
- Missed References: Number of relevant documents not retrieved
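As an illustration, these per-query counts can be derived from set comparisons between retrieved and gold references (a sketch with assumed field names and an assumed success criterion):

```python
# Sketch: per-query reference bookkeeping for the retrieval metrics above.
def reference_counts(retrieved_refs: set[str], gold_refs: set[str]) -> dict:
    correct = retrieved_refs & gold_refs
    return {
        "correct_references": len(correct),
        "hallucinated_references": len(retrieved_refs - gold_refs),  # retrieved but irrelevant
        "missed_references": len(gold_refs - retrieved_refs),        # relevant but not retrieved
        "success": len(correct) > 0,  # assumed criterion feeding the RAG success rate
    }

print(reference_counts({"ref-12", "ref-47"}, {"ref-47", "ref-90"}))
```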
LLM Judge Evaluation Metrics
- Legal Reasoning: Assesses the model's ability to understand and apply legal concepts
- Factual Legal Accuracy: Measures accuracy of legal facts and references
- Clarity & Precision: Evaluates clarity and precision of responses
- Factual Reliability: Checks for biases and factual accuracy
- Fluency: Assesses language fluency and coherence
- Relevance: Measures response relevance to the query
- Content Safety: Evaluates content safety and appropriateness
Judge Model: nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
RAG Score Calculation
The RAG Score is a comprehensive metric that combines multiple performance indicators using dynamic normalization across all models. The formula weights different aspects of retrieval performance:
Formula Components:
- RAG Success Rate (0.9 weight): Direct percentage of successful retrievals (higher is better)
- Normalized False Positives (0.9 weight): Hallucinated references, min-max normalized (lower is better)
- Normalized Max Correct References (0.1 weight): Maximum correct retrievals, min-max normalized (higher is better)
- Normalized Missed References (0.1 weight): Relevant documents not retrieved, min-max normalized (lower is better)
Final Score Formula:
RAG Score = (0.9 × RAG_success_rate + 0.9 × norm_false_positives +
             0.1 × norm_max_correct + 0.1 × norm_missed_refs) ÷ 2.0
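A small sketch of this computation is shown below; the inversion of the "lower is better" terms during min-max normalization is an assumption about how the normalized components are oriented, and the input values are illustrative.

```python
# Sketch: RAG Score with min-max normalization across all evaluated models.
def min_max(values):
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def rag_scores(success_rates, false_positives, max_correct, missed_refs):
    norm_fp = [1.0 - x for x in min_max(false_positives)]  # fewer hallucinated refs -> higher
    norm_mc = min_max(max_correct)                          # more correct refs -> higher
    norm_mr = [1.0 - x for x in min_max(missed_refs)]       # fewer missed refs -> higher
    return [
        (0.9 * sr + 0.9 * fp + 0.1 * mc + 0.1 * mr) / 2.0
        for sr, fp, mc, mr in zip(success_rates, norm_fp, norm_mc, norm_mr)
    ]

print(rag_scores([0.89, 0.81], [112, 1], [9, 6], [40, 30]))
```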
6. 👥 Human Arena
Human Arena is a community-driven evaluation platform where language models are compared through human preferences and voting. This evaluation method captures real-world user preferences and provides insights into model performance from a human perspective.
Evaluation Methodology
- Head-to-Head Comparisons: Models are presented with the same prompts and their responses are compared by human evaluators
- ELO Rating System: Similar to chess rankings, models gain or lose rating points based on wins, losses, and ties against other models
- Community Voting: Real users vote on which model responses they prefer, ensuring diverse evaluation perspectives
- Blind Evaluation: Evaluators see responses without knowing which model generated them, reducing bias
Key Metrics
- ELO Rating: Overall skill level based on tournament-style matchups (higher is better)
- Win Rate: Percentage of head-to-head victories against other models
- Wins/Losses/Ties: Direct comparison statistics showing model performance
- Total Games: Number of evaluation rounds completed
- Votes: Community engagement and evaluation volume
- Provider & Technical Details: Infrastructure and model configuration information
Evaluation Criteria Human evaluators consider multiple factors when comparing model responses:
- Response Quality: Accuracy, completeness, and relevance of answers
- Communication Style: Clarity, coherence, and appropriateness of language
- Helpfulness: How well the response addresses the user's needs
- Safety & Ethics: Adherence to safety guidelines and ethical considerations
- Creativity & Originality: For tasks requiring creative or innovative thinking
Human Arena provides a complementary perspective to automated benchmarks, capturing nuanced human preferences that traditional metrics might miss. This evaluation is particularly valuable for understanding how models perform in real-world conversational scenarios.
Benchmark Datasets
This section provides detailed information about the datasets used in our evaluation benchmarks. Each dataset has been carefully selected and adapted to provide comprehensive model evaluation across different domains and capabilities.
Available Datasets for Evaluation
Dataset | Evaluation Task | Language | Description |
---|---|---|---|
malhajar/mmlu_tr-v0.2 | Lighteval MMLU | Turkish | Turkish adaptation of MMLU (Massive Multitask Language Understanding) v0.2 covering 57 academic subjects including mathematics, physics, chemistry, biology, history, law, and computer science. Tests knowledge and reasoning capabilities across multiple domains with multiple-choice questions. |
malhajar/truthful_qa-tr-v0.2 | Lighteval TruthfulQA | Turkish | Turkish version of TruthfulQA (v0.2) designed to measure model truthfulness and resistance to generating false information. Contains questions where humans often answer incorrectly due to misconceptions or false beliefs, testing the model's ability to provide accurate information. |
malhajar/winogrande-tr-v0.2 | Lighteval WinoGrande | Turkish | Turkish adaptation of WinoGrande (v0.2) focusing on commonsense reasoning through pronoun resolution tasks. Tests the model's ability to understand context, make logical inferences, and resolve ambiguous pronouns in everyday scenarios. |
malhajar/hellaswag_tr-v0.2 | Lighteval HellaSwag | Turkish | Turkish version of HellaSwag (v0.2) for commonsense reasoning evaluation. Tests the model's ability to predict plausible continuations of everyday scenarios and activities, requiring understanding of common sense and typical human behavior patterns. |
malhajar/arc-tr-v0.2 | Lighteval ARC | Turkish | Turkish adaptation of ARC (AI2 Reasoning Challenge) v0.2 focusing on science reasoning and question answering. Contains grade school level science questions that require reasoning beyond simple factual recall, covering topics in physics, chemistry, biology, and earth science. |
malhajar/gsm8k_tr-v0.2 | Lighteval GSM8K | Turkish | Turkish version of GSM8K (Grade School Math 8K) v0.2 for mathematical reasoning evaluation. Contains grade school level math word problems that require multi-step reasoning, arithmetic operations, and logical problem-solving skills to arrive at the correct numerical answer. |
newmindai/mezura-eval-data | Auto-Arena | Turkish | The mezura-eval dataset is a Turkish-language legal text dataset designed for evaluation tasks with RAG context support. The subsets cover domains such as Environmental Law, Tax Law, Data Protection Law, and Health Law, each containing annotated samples. Every row includes structured fields such as the category, concept, input, and contextual information drawn from sources like official decisions. |
newmindai/mezura-eval-data | EvalMix | Turkish | The mezura-eval dataset is a Turkish-language legal text dataset designed for evaluation tasks with RAG context support. The subsets cover domains such as Environmental Law, Tax Law, Data Protection Law, and Health Law, each containing annotated samples. Every row includes structured fields such as the category, concept, input, and contextual information drawn from sources like official decisions. |
newmindai/mezura-eval-data | Retrieval | Turkish | The mezura-eval dataset is a Turkish-language legal text dataset designed for evaluation tasks with RAG context support. The subsets cover domains such as Environmental Law, Tax Law, Data Protection Law, and Health Law, each containing annotated samples. Every row includes structured fields such as the category, concept, input, and contextual information drawn from sources like official decisions. |
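The benchmark datasets can be inspected directly from the Hugging Face Hub; the sketch below loads two of them (split and config names are not guaranteed and should be checked on each dataset card).

```python
# Sketch: loading two of the benchmark datasets from the Hugging Face Hub.
# Config and split names may differ per dataset; check the dataset cards first.
from datasets import load_dataset

mmlu_tr = load_dataset("malhajar/mmlu_tr-v0.2")
mezura = load_dataset("newmindai/mezura-eval-data")

print(mmlu_tr)                             # available splits and row counts
print(mezura[list(mezura.keys())[0]][0])   # peek at one annotated sample
```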
Model Evaluation
Evaluation Process:
1. Login to Your Hugging Face Account
- You must be logged in to submit models for evaluation
2. Enter Model Name
- Input the HuggingFace model name or path you want to evaluate
- Example: meta-llama/Meta-Llama-3.1-70B-Instruct
3. Select Base Model
- Choose the base model from the dropdown list
- The system will verify that your repository is a valid Hugging Face repository
- It will also check whether the model was trained from the selected base model
4. Start Evaluation
- Click the "Start All Benchmarks" button to begin the evaluation
- If validation passes, your request will be processed
- If validation fails, you'll see an error message
Important Limitations:
- The model repository must be a maximum of 750 MB in size.
- For trained adapters, the maximum LoRA rank must be 32.
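For adapter submissions, the rank limit corresponds to keeping the LoRA rank at or below 32 when training with PEFT; the sketch below shows one such configuration, with target modules and other hyperparameters as illustrative choices only.

```python
# Sketch: a PEFT LoRA configuration that stays within the rank-32 submission limit.
# Target modules and alpha/dropout values are illustrative, not required settings.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                        # must not exceed the leaderboard's max LoRA rank
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```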
Model Submission
Note: Currently, only adapter models are supported. Merged models are not yet supported.
- Optional: enable reasoning capability during evaluation