Newmind AI LLM Evaluation Leaderboard

Evaluate your model's performance in the following categories:

  1. โš”๏ธ Auto Arena - Tournament-style evaluation where models are directly compared and ranked using an ELO rating system.

  2. ๐Ÿ‘ฅ Human Arena - Comparative evaluation based on human preferences, assessed by a reviewer group.

  3. ๐Ÿ“š Retrieval - Evaluation focused on information retrieval and text generation quality for real-world applications.

  4. โšก Light Eval - Fast and efficient model evaluation framework for quick testing.

  5. ๐Ÿ”„ EvalMix - Multi-dimensional evaluation including lexical accuracy and semantic coherence.

  6. ๐Ÿ Snake Bench - Specialized evaluation measuring step-by-step problem solving and complex reasoning abilities.

  7. ๐Ÿงฉ Structured Outputs - Coming soon!
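
The exact rating procedure behind Auto Arena is not specified here; the sketch below is only a minimal illustration of how pairwise wins and losses translate into Elo updates, assuming a standard Elo formulation with a 400-point scale and a K-factor of 32 (both assumed values, not documented leaderboard settings).

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of model A against model B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one head-to-head comparison.

    score_a is 1.0 if model A wins, 0.0 if it loses, and 0.5 for a tie.
    k is the K-factor controlling how far a single result moves the ratings
    (32 is an assumption here, not a value documented by the leaderboard).
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Example: a 1500-rated model beats a 1600-rated model once;
# the winner gains exactly the points the loser gives up.
print(update_elo(1500.0, 1600.0, score_a=1.0))  # -> approximately (1520.5, 1579.5)
```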

Evaluate your model in any or all of these categories to discover its capabilities and strengths.

For any questions, please contact us at info@newmind.ai

Model Evaluation Results

This section summarizes model performance across the evaluation categories listed above.

Model Performance Comparison

meta-llama/Meta-Llama-3.1-70B-Instruct | 1848.77 | 1559.97 | 0.84 | 0.33 | 0.85 | 0.58 | 0.25 | bfloat16 | Proprietary