Newmind AI LLM Evaluation Leaderboard

Evaluate your model's performance in the following categories:

  1. ⚔️ Auto Arena - Tournament-style evaluation where models are directly compared and ranked using an ELO rating system (see the rating-update sketch after this list).

  2. 👥 Human Arena - Comparative evaluation based on human preferences, assessed by a reviewer group.

  3. 📚 Retrieval - Evaluation focused on information retrieval and text generation quality for real-world applications.

  4. ⚡ Light Eval - Lightweight evaluation framework for fast, efficient model testing.

  5. 🔄 EvalMix - Multi-dimensional evaluation including lexical accuracy and semantic coherence.

  6. 🐍 Snake Bench - Specialized evaluation measuring step-by-step problem solving and complex reasoning abilities.

  7. 🧩 Structured Outputs - Evaluation of models' ability to generate properly formatted, structured responses with accurate field extraction and semantic understanding (see the field-check sketch after this list).
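
The following is a minimal sketch of the kind of ELO update used for pairwise rankings (item 1). The K-factor of 32 and the 1500 starting rating are common defaults, not confirmed Auto Arena parameters.

```python
# Minimal ELO update sketch for pairwise model comparisons.
# Assumptions: K-factor of 32 and a 1500 starting rating; the Auto Arena's
# actual parameters may differ.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the ELO model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return the new (rating_a, rating_b) after one head-to-head comparison."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (s_a - e_a)
    new_b = rating_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b

# Example: two models start at 1500; model A wins one comparison.
a, b = update_elo(1500, 1500, a_won=True)
print(round(a), round(b))  # 1516 1484
```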
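
Below is a minimal sketch of a field-level check of the kind Structured Outputs measures (item 7). The three-field schema ("title", "date", "amount") is a hypothetical example, not the leaderboard's actual format or scoring.

```python
import json

# Hypothetical expected schema for a structured response; the real
# Structured Outputs benchmark defines its own fields and scoring.
EXPECTED_FIELDS = {"title": str, "date": str, "amount": float}

def check_structured_output(response_text: str) -> dict:
    """Return per-field pass/fail results for one model response."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        # Unparseable output fails every field check.
        return {field: False for field in EXPECTED_FIELDS}
    return {
        field: isinstance(payload.get(field), expected_type)
        for field, expected_type in EXPECTED_FIELDS.items()
    }

# Example: "amount" comes back as a string, so only 2 of 3 checks pass.
print(check_structured_output('{"title": "Invoice 42", "date": "2026-01-26", "amount": "19.99"}'))
```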

Evaluate your model in any or all of these categories to discover its capabilities and areas of excellence.

For any questions, please contact us at info@newmind.ai

Model Evaluation Results

This page shows model performance across the different evaluation categories.

Human Arena Results (updated 26.01.2026)

meta-llama/Llama-3.3-70B-Instruct | 1628 | 111 | 103 | 20 | 191 | 58 | bfloat16 | Proprietary