LLM Leaderboard

Find the best LLMs for your use case

Compare LLMs against standard benchmarks. Choose the best AI model for your GenAI-powered apps.


Best LLMs Per Task

Discover the best LLMs for common tasks like multilingual Q&A, multi-task reasoning, and math problem-solving.


Reasoning (GPQA Diamond)

1. Claude 3.7 Sonnet (64K extended thinking): 84.8%
2. Grok 3 Beta (Extended thinking): 84.6%
3. OpenAI o3-mini: 79.7%
4. OpenAI o1: 78%
5. DeepSeek R1 (32K extended thinking): 71.5%

Multilingual Q&A (MMMLU)

1. OpenAI o1: 87.7%
2. Claude 3.7 Sonnet (64K extended thinking): 86.1%
3. Claude 3.7 Sonnet (No extended thinking): 83.2%
4. Claude 3.5 Sonnet: 82.1%
5. OpenAI o3-mini: 79.5%

Math Problem-Solving (MATH 500)

1. OpenAI o3-mini: 97.9%
2. DeepSeek R1 (32K extended thinking): 97.3%
3. OpenAI o1: 96.4%
4. Claude 3.7 Sonnet (64K extended thinking): 96.2%
5. Claude 3.7 Sonnet (No extended thinking): 82.2%

Fast & Cheapest LLMs

Find the fastest and most cost-efficient LLMs. Get the best performance without breaking the bank.


Fastest Models (output tokens/second)

1. Llama 70b: 2,100
2. Llama 8b: 723
3. o1-mini: 237
4. Nova Lite: 150
5. GPT-4o mini: 11

Lowest Latency (time to first token, TTFT)

1. Nova Micro: 0.3s
2. Gemini 1.5 Flash: 0.3s
3. Llama 8b: 0.3s
4. Nova Pro: 0.4s
5. Nova Lite: 0.4s
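Both speed metrics above are easy to reproduce against your own stack. Below is a minimal Python sketch, assuming a hypothetical stream_completion() generator that yields tokens as your provider streams them back (substitute your provider's SDK); it reports time to first token and decode throughput in tokens/second. Providers count tokens differently, so your absolute numbers may not match the leaderboard's.

```python
import time

def measure_stream(stream):
    """Time-to-first-token (TTFT) and decode throughput for a token stream.

    `stream` is any iterable that yields tokens as they arrive, e.g. a
    hypothetical stream_completion("prompt") from your provider's SDK.
    """
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token landed
        n_tokens += 1
    total = time.perf_counter() - start
    # Throughput over the generation phase only: tokens after the first,
    # divided by the time elapsed since the first token arrived.
    tps = (n_tokens - 1) / (total - ttft) if n_tokens > 1 and total > ttft else 0.0
    return ttft, tps
```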

Cheapest Models (input vs. output cost per 1M tokens)

[Bar chart comparing input and output costs for Nova Micro, Gemini 1.5 Flash, Llama 70b, and GPT-4o mini; y-axis from $0 to $0.60]
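Per-1M-token pricing makes request costs a one-line calculation. A worked sketch using GPT-4o mini's listed rates from the comparison table below ($0.15 input, $0.60 output per 1M tokens):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Dollar cost of one call, given $-per-1M-token prices."""
    return input_tokens / 1e6 * input_per_m + output_tokens / 1e6 * output_per_m

# GPT-4o mini at $0.15 input / $0.60 output per 1M tokens:
# a 2,000-token prompt with a 500-token reply costs $0.0006.
print(f"${request_cost(2_000, 500, 0.15, 0.60):.4f}")
```

At a million such requests per month that is $600, which is why the input/output split matters: chat workloads are usually input-heavy, while generation workloads are output-heavy.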

Model Comparison

| Model | Release Date | Context Window | Input Cost / 1M tokens | Output Cost / 1M tokens | Average | MMLU (General) | GPQA (Reasoning) | HumanEval (Coding) | Math | BFCL (Tool Use) |
|---|---|---|---|---|---|---|---|---|---|---|
| AWS Nova Lite | 03/12/2024 | 300,000 | $0 | $0 | N/A | 80.50% | 42% | 85.40% | 73.30% | 66.60% |
| AWS Nova Micro | 03/12/2024 | 300,000 | $0 | $0 | N/A | 77.60% | 40% | 81.10% | 69.30% | 56.20% |
| AWS Nova Pro | 03/12/2024 | 300,000 | $0 | $0 | N/A | 85.90% | 46.90% | 89% | 76.60% | 68.40% |
| Claude 3 Haiku | 13/03/2024 | 200,000 | $0.25 | $1.25 | 62.90% | 75.20% | 35.70% | 75.90% | 38.90% | 74.65% |
| Claude 3 Opus | 14/03/2024 | 200,000 | $15 | $75 | 76.70% | 85.70% | 50.40% | 84.90% | 60.10% | 88.40% |
| Claude 3.5 Haiku | 22/10/2024 | 200,000 | $0.80 | $4 | 68.30% | 65% | 41.60% | 88.10% | 69.40% | 60% |
| Claude 3.7 Sonnet | 24/02/2025 | 200,000 | $3 | $15 | N/A | 83.20% | 68% | N/A | 82.20% | N/A |
| Claude 3 Sonnet (Reasoner) | 20/06/2024 | 200,000 | $3 | $15 | N/A | N/A | N/A | N/A | N/A | N/A |
| DeepSeek R1 | 20/01/2025 | 128,000 | $0.55 | $2.19 | N/A | 90.8% | 71.5% | N/A | 97.3% | N/A |
| DeepSeek V3 | 26/12/2024 | 128,000 | $0.27 | $1.10 | 76.24% | 88.50% | 59.10% | 82.60% | 90.20% | 57.23% |
| GPT-4.5 | 27/02/2025 | 128,000 | $25 | $150 | N/A | 89.60% | 71.4% | 76% | 36.7% | N/A |
| GPT-3.5 Turbo | 30/11/2022 | 16,000 | $0.50 | $1.50 | 59.20% | 69.80% | 30.80% | 68% | 34.10% | 64.41% |
| GPT-4 | 14/03/2023 | 8,000 | $30 | $60 | 75.50% | 86.40% | 41.40% | 86.60% | 64.50% | 88.30% |
| GPT-4o | 13/05/2024 | 128,000 | $5 | $15 | 80.50% | 88.70% | 53.60% | 90.20% | 76.60% | 83.59% |
| GPT-4o mini | 18/07/2024 | 128,000 | $0.15 | $0.60 | N/A | 82% | 40.20% | 87.20% | 70.20% | N/A |
| Gemini 1.5 Flash | 14/05/2024 | 1,000,000 | $0.35 | $0.70 | 66.70% | 78.90% | 39.50% | 71.50% | 54.90% | 79.88% |
| Gemini 1.5 Pro | 24/09/2024 | 128,000 | $7 | $21 | 74.10% | 85.90% | 46.20% | 71.90% | 67.70% | 84.35% |
| Gemini 2.0 Flash | 30/01/2025 | 1,000,000 | $0.15 | $0.60 | N/A | 76.40% | 62.10% | N/A | 89.70% | N/A |
| Gemini Ultra | 24/09/2024 | 32,000 | N/A | N/A | N/A | 83.70% | 35.70% | N/A | 53.20% | N/A |
| Grok-2 | 13/08/2024 | 128,000 | $5 | $15 | N/A | 87.50% | 56% | 88.40% | 76.10% | N/A |
| Grok-2 mini | 14/08/2024 | 128,000 | $2 | $10 | N/A | 86.20% | 51% | 85.70% | 73% | N/A |
| Llama 3.1 405b | 23/07/2024 | 128,000 | $1.79 | $1.79 | 80.40% | 88.60% | 51.10% | 89% | 73.80% | 88.50% |
| Llama 3.1 70b | 23/07/2024 | 128,000 | $0.23 | $0.40 | 75.50% | 86% | 46.70% | 80.50% | 68% | 84.80% |
| Llama 3.1 8b | 23/07/2024 | 128,000 | $0.09 | $0.09 | 62.50% | 73% | 32.80% | 72.60% | 51.90% | 76.10% |
| Llama 3.3 70b | 23/07/2024 | 128,000 | $0.23 | $0.40 | 74.50% | 86% | 48% | 88.40% | 77% | 77.50% |
| Mistral Large | 26/02/2024 | 32,000 | $8 | $24 | N/A | 81.20% | N/A | N/A | N/A | N/A |
| Mistral Medium | 09/12/2023 | 32,000 | $2.70 | $8.10 | N/A | 75.30% | N/A | N/A | N/A | N/A |
| Mistral Small | 17/09/2024 | 16,000 | $2 | $6 | N/A | 70.6% | N/A | N/A | N/A | N/A |
| OpenAI o1 | 05/12/2024 | 128,000 | $15 | $60 | 85.39% | 91.80% | 75.70% | 92.40% | 96.40% | 66.73% |
| OpenAI o1-mini | 12/09/2024 | 64,000 | $1.10 | $4.40 | 80.07% | 85.20% | 60% | 92.40% | 90% | 62.89% |
| OpenAI o3-mini | 31/01/2025 | 128,000 | $1.10 | $4.40 | N/A | 86.90% | 79.70% | N/A | 97.90% | N/A |
| Qwen2.5-70b | 19/09/2024 | 128,000 | $0.90 | $1.20 | N/A | N/A | N/A | 88% | N/A | N/A |
| Qwen2.5-72b | 19/09/2024 | 131,000 | $0.40 | $0.75 | N/A | 86.1% | 45.9% | 59.1% | 62.1% | 61.31% |
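To query the table programmatically rather than by eye, a few rows can be transcribed into plain data and filtered on whatever constraint matters most. A small sketch (five rows copied from the table above; scores are percentages, prices are $ per 1M tokens):

```python
MODELS = [
    {"name": "GPT-4o mini",      "in": 0.15, "out": 0.60, "mmlu": 82.0, "math": 70.2},
    {"name": "Gemini 2.0 Flash", "in": 0.15, "out": 0.60, "mmlu": 76.4, "math": 89.7},
    {"name": "DeepSeek R1",      "in": 0.55, "out": 2.19, "mmlu": 90.8, "math": 97.3},
    {"name": "OpenAI o3-mini",   "in": 1.10, "out": 4.40, "mmlu": 86.9, "math": 97.9},
    {"name": "OpenAI o1",        "in": 15.0, "out": 60.0, "mmlu": 91.8, "math": 96.4},
]

def best_under_budget(models, metric, max_output_price):
    """Highest-scoring model on `metric` whose output price fits the budget."""
    affordable = [m for m in models if m["out"] <= max_output_price]
    return max(affordable, key=lambda m: m[metric], default=None)

# Best Math score among models charging at most $5 per 1M output tokens:
print(best_under_budget(MODELS, "math", 5.0)["name"])  # OpenAI o3-mini (97.90%)
```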


FAQ

Frequently asked questions

What is the LLM Leaderboard, and how does it work?

The LLM Leaderboard is a performance-driven ranking of the top large language models (LLMs), focusing on key evaluation benchmarks such as reasoning, multilingual Q&A, and math problem-solving. The leaderboard provides a comparative analysis of LLMs based on speed (tokens per second), cost efficiency (input/output costs), and latency. It also includes a comprehensive model comparison table covering aspects like context window size, accuracy on various benchmarks (GPQA Diamond, MMMLU, MATH 500, and more), and pricing metrics.

What benchmarks are used to evaluate models on the LLM Leaderboard?

The LLM Leaderboard evaluates models based on industry-standard benchmarks, including:

  • GPQA Diamond: Measures graduate-level reasoning capabilities.

  • MMMLU: Assesses performance in multilingual Q&A across diverse knowledge areas.

  • MATH 500: Tests advanced mathematical problem-solving skills.

  • HumanEval: Evaluates code generation accuracy.

  • BFCL & MGSM: Additional benchmarks covering tool use / function calling (Berkeley Function Calling Leaderboard) and multilingual grade-school math problem-solving.

Each model’s ranking reflects its proficiency in these tasks, helping users select the most suitable LLM for their needs.

How does the LLM Leaderboard rank models based on speed, cost, and latency?

Beyond accuracy, the leaderboard also highlights the fastest, cheapest, and lowest-latency models:

  • Fastest Models: Ranked by tokens generated per second.

  • Cheapest Models: Evaluated based on input and output costs per 1 million tokens.

  • Lowest Latency: Ranked by time to first token (TTFT), which is crucial for real-time applications.

This data helps businesses and developers choose models that balance performance, cost, and speed.

What details are included in the model comparison table?

The model comparison table provides a quick view of each LLM’s capabilities, including:

  • Context Window Size: Determines how much text a model can process at once.

  • Input & Output Costs (per 1M tokens): Helps estimate operational expenses.

  • Performance on Key Benchmarks: Includes MMLU, GPQA, HumanEval, Math, and BFCL.

This structured comparison makes it easier to evaluate models based on specific business or technical requirements.
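Context window size is the easiest of these columns to act on directly. As a rough pre-flight check, the common heuristic of about four characters per token gives a quick estimate of whether a document fits; a sketch (real tokenizers vary by model and language, so treat the result as approximate):

```python
def fits_in_context(text: str, context_window: int, reply_budget: int = 1_000) -> bool:
    """Rough check that `text` plus a reply budget fits a model's context window.

    Assumes ~4 characters per token, a common heuristic for English text.
    """
    est_tokens = len(text) / 4
    return est_tokens + reply_budget <= context_window

doc = "x" * 150_000                    # a 150,000-character document, ~37,500 tokens
print(fits_in_context(doc, 128_000))   # True:  38,500 <= 128,000
print(fits_in_context(doc, 32_000))    # False: 38,500 >  32,000
```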

How can I use the LLM Leaderboard to choose the best model for my use case?

The LLM Leaderboard is designed to help users select the best LLM based on their priorities:

  • For reasoning-heavy tasks: Look for high GPQA Diamond and MMMLU scores.

  • For math-related applications: Focus on MATH 500 and MGSM benchmarks.

  • For cost-sensitive projects: Check the cheapest models based on input/output token pricing.

  • For real-time applications: Prioritize the fastest models and those with the lowest latency.

By considering both accuracy benchmarks and efficiency metrics, users can make informed decisions about which LLM best fits their needs.


Create an account and start building today.
