Cracking the Code of AI Report Cards! The Secrets Behind Evaluation Rankings

Let’s Peek Behind the Curtain of “Evaluation” in Rankings

Article Summary 📝

“This AI is the smartest in the world!” “A new AI has topped the rankings!”… You’ve probably seen headlines like these, right? But who exactly decides how “smart” they are, and how do they do it?

This article gently pulls back the curtain on the various “evaluation methods” used to measure an AI’s performance—just like school tests or sports competitions. We’ll break it down so that even if you’re not an AI expert, you’ll get it instantly. By the end, you’ll find AI news much more interesting and understand it on a deeper level!

Chapter 1: How Are AI Grades Determined? The Basics of Evaluation

There isn’t just one way to measure an AI’s abilities. Broadly speaking, there are two approaches: having people check its work directly, and making it solve a set of standardized tests. Let’s start by looking at the basic ideas behind them!

🧑‍⚖️

Checked Directly by Humans


Human Evaluation

Human experts directly review an AI’s output to see if its writing is natural or if it understands the intent of a question. Because it can assess creativity and contextual nuance, which automated scoring struggles to capture, it’s often called the “gold standard” of evaluation. The only catch is that it’s extremely time-consuming and expensive.

📝

A “Standardized Test” for AIs


Benchmark Evaluation

This involves scoring an AI by having it solve pre-prepared, standard tests (called benchmarks), much like math or science quizzes. It allows for fair comparisons because different AIs can be measured on a level playing field. Lately, AI developers are in a daily race to see who can achieve the highest benchmark scores.

Test Scoring Methods: A Simple Guide to Common “Evaluation Metrics”

If there are “tests,” there must also be “scoring rules,” right? In AI evaluation, some rules with rather interesting names are used. Here are a few of the most common ones.

Accuracy

In a nutshell: It’s the “simple percentage of correct answers.”

Explanation: This is the most basic and straightforward metric. It’s just like true/false or multiple-choice questions at school. For example, if an AI takes a 100-question multiple-choice quiz on history and law (a format used in the famous MMLU benchmark) and answers 90 correctly, its “Accuracy is 90%.” It’s often used in tests that measure breadth of knowledge.
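
To make the idea concrete, here’s a tiny sketch in Python of how an accuracy score could be computed. The quiz questions and the model_answer function are made-up placeholders, not part of any real benchmark.

```python
# Minimal sketch: accuracy on a toy multiple-choice quiz.
# The questions and model_answer below are invented placeholders.

questions = [
    {"prompt": "In which year did World War II end? (A) 1943 (B) 1945 (C) 1950", "correct": "B"},
    {"prompt": "What is the chemical symbol for gold? (A) Au (B) Ag (C) Fe", "correct": "A"},
]

def model_answer(prompt: str) -> str:
    """Stand-in for asking the AI; this fake model always answers 'B'."""
    return "B"

num_correct = sum(1 for q in questions if model_answer(q["prompt"]) == q["correct"])
accuracy = num_correct / len(questions)
print(f"Accuracy: {accuracy:.0%}")  # 1 out of 2 correct -> 50%
```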
Pass@k

In a nutshell: The rule is, “Give it a few chances, and if it succeeds even once, it’s a pass!”

Explanation: This is mainly used for tests that involve solving programming problems (like HumanEval). For example, with “Pass@3,” you let the AI generate program code three times. If at least one of them works correctly, it’s counted as a “success.” This method evaluates the ability to reach a correct answer through trial and error, rather than producing a perfect answer on the first try.
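
Here’s a small Python sketch of the Pass@3 rule described above: generate up to three candidate programs and count the problem as solved if any one of them passes the tests. The generate_code and run_tests functions are made-up stand-ins for a real code model and a real test harness.

```python
# Illustrative sketch of Pass@3: success if any of 3 attempts passes the tests.
# generate_code and run_tests are invented stand-ins, not a real API.
import random

def generate_code(problem: str) -> str:
    """Pretend to ask the AI for a solution; sometimes it returns buggy code."""
    return random.choice([
        "def add(a, b): return a + b",   # correct
        "def add(a, b): return a - b",   # buggy
    ])

def run_tests(code: str) -> bool:
    """Pretend to run the benchmark's unit tests on the generated code."""
    namespace = {}
    exec(code, namespace)                 # define the candidate function
    return namespace["add"](2, 3) == 5    # a single toy test case

def pass_at_3(problem: str) -> bool:
    return any(run_tests(generate_code(problem)) for _ in range(3))

print("Solved within 3 tries:", pass_at_3("Write a function add(a, b)."))
```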
BLEU / ROUGE

In a nutshell: A way to compare, at the word level, “how similar an AI’s text is to a model answer.”

Explanation: These have traditionally been used to evaluate machine translation (BLEU) and text summarization (ROUGE). They mechanically count how many words or short phrases in the AI-generated text match a human-written model answer to produce a score. However, their weakness is that they can’t grasp nuances, like when two sentences have different wording but the same meaning.
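
To see why these word-matching scores miss paraphrases, here’s a toy overlap score in Python. It is heavily simplified: real BLEU and ROUGE also use n-grams, clipping, and length penalties, so treat this only as the general flavor.

```python
# Toy word-overlap score in the spirit of BLEU/ROUGE (heavily simplified).

def word_overlap(candidate: str, reference: str) -> float:
    """Fraction of the candidate's words that also appear in the reference."""
    cand_words = candidate.lower().split()
    ref_words = set(reference.lower().split())
    if not cand_words:
        return 0.0
    return sum(w in ref_words for w in cand_words) / len(cand_words)

reference = "the cat sat on the mat"
print(word_overlap("the cat sat on the mat", reference))      # 1.0 (identical wording)
print(word_overlap("a feline rested on the rug", reference))  # ~0.33, despite similar meaning
```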

Chapter 2: The Two Major Styles of Evaluation

There are two main schools of thought in AI evaluation. One is a “battle format” where AIs compete against each other and a human picks the winner based on preference. The other is an “exam format” where they solve a fixed set of problems. Let’s compare their features side-by-side!

🥊 Battle Style

Chatbot Arena Style

A real-world evaluation style where the winner is decided by user “preference.”

  • 📝Method: Users freely chat with two anonymous AIs and vote for the one they think is better; a sketch of how those votes can become ratings follows this list.
  • 👍Strength: Reveals how well an AI handles a wide variety of real-world questions. It’s considered fair because it’s hard to “cram for the test.”
  • 🤔Weakness: There’s a tendency for users to prefer confident and “eloquent” AIs, even if their answers aren’t as correct.
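
Behind the scenes, arena-style rankings aggregate thousands of these votes into ratings. One common way to do that is an Elo-style update (the kind of rating system arena leaderboards have used); the sketch below shows the basic idea with made-up starting ratings and a standard K-factor.

```python
# Minimal sketch of an Elo-style rating update after one head-to-head vote.
# Starting ratings and the K-factor are illustrative, not real leaderboard values.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Predicted chance that model A wins, under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    exp_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (actual_a - exp_a)
    new_b = rating_b + k * ((1.0 - actual_a) - (1.0 - exp_a))
    return new_a, new_b

# A user preferred Model A in one anonymous battle:
print(update_elo(1000.0, 1000.0, a_won=True))  # A gains ~16 points, B loses ~16
```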

🏫 Exam Style

Hugging Face Leaderboard Style

A style that measures objective skill with a “standardized test” where everyone solves the same problems.

  • 📝Method: All AIs solve several fixed benchmarks, such as science and math, under the same conditions, and their scores are compared; a toy score-averaging sketch follows this list.
  • 👍Strength: It’s objective, and its high “reproducibility” (getting the same result no matter who runs the test) is a major advantage. It also makes it easy to track research progress.
  • 🤔Weakness: The risk of “data contamination” and the problem of “saturation,” where tests become too easy to differentiate models, are major concerns.
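
As a toy illustration of the exam style, the sketch below averages each model’s scores across a few fixed benchmarks into one leaderboard number. Every name and score here is invented; real leaderboards such as Hugging Face’s use their own specific benchmark suites and scoring rules.

```python
# Toy leaderboard: average each model's scores across fixed benchmarks.
# All model names and numbers are invented for illustration.

scores = {
    "model-alpha": {"science_qa": 0.81, "math": 0.62, "reasoning": 0.74},
    "model-beta":  {"science_qa": 0.77, "math": 0.70, "reasoning": 0.69},
}

leaderboard = sorted(
    ((name, sum(results.values()) / len(results)) for name, results in scores.items()),
    key=lambda pair: pair[1],
    reverse=True,  # highest average score first
)

for rank, (name, avg) in enumerate(leaderboard, start=1):
    print(f"{rank}. {name}: {avg:.3f}")
```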

Chapter 3: The Latest Method? The ‘AI Teacher’ That Grades Other AIs

Human evaluation is accurate, but it’s a ton of work! The approach that has emerged to solve this problem is something straight out of sci-fi: “an AI that evaluates other AIs.” This is called “LLM-as-a-Judge.”

🧑‍🏫
Hello! I’m the “AI Teacher,” an evaluator AI. It’s my job to grade the answers of other AIs.
🧑‍🏫
We’re used because we can evaluate faster, more cheaply, and at a much larger scale than human reviewers. It’s also based on the idea that it’s easier to evaluate an existing answer than to create one from scratch.
🧑‍🏫
Research shows that our evaluations agree with human evaluators over 80% of the time. Sometimes, that’s even higher than the agreement rate between two humans!
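
To make the “AI Teacher” concrete, here’s a minimal sketch of how an LLM-as-a-Judge call might be structured. The call_llm function is a made-up placeholder for whatever judge model is being used, and the prompt wording is just one illustrative possibility.

```python
# Minimal sketch of LLM-as-a-Judge: ask a judge model to score an answer from 1 to 10.
# call_llm is a hypothetical placeholder, not a real library function.

JUDGE_PROMPT = """You are an impartial evaluator.
Question: {question}
Answer to evaluate: {answer}
Rate the answer's helpfulness and accuracy on a scale of 1 to 10.
Reply with only the number."""

def call_llm(prompt: str) -> str:
    """Stand-in for sending the prompt to a judge model and reading its reply."""
    return "8"  # pretend the judge replied with a score of 8

def judge(question: str, answer: str) -> int:
    prompt = JUDGE_PROMPT.format(question=question, answer=answer)
    return int(call_llm(prompt).strip())

print(judge("What causes tides?", "Mainly the Moon's gravity, with some help from the Sun."))
```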

Weaknesses of the “AI Teacher”: Does it Play Favorites or Make Mistakes?

But this “AI Teacher” has its weaknesses, too. In fact, it can be quite opinionated…

🚨 AI Teacher’s Bias Quiz 🚨

What kind of “favorites” do you think an AI teacher tends to play? Try guessing before you read the answers below!

Q. Between an answer from an AI in its own family and one from a completely unknown AI, which one does it score higher?
A. An AI from its own family.
This is called “self-preference bias,” where it tends to favor writing styles similar to its own. It’s like playing favorites with family!
Q. Between a concise, accurate answer and a long, detailed-looking answer, which one does it tend to rate higher?
A. The long, detailed-looking answer.
This is “verbosity bias.” It might give a higher score to a longer answer just because it’s long, even if the content isn’t as accurate.

Chapter 4: The ‘Underside’ of AI Evaluation: Important Caveats

We’ve looked at various evaluation methods, but there’s a huge challenge facing the entire world of AI evaluation. Knowing this will help you look at AI news with a more critical and informed eye.

Are the Test Questions Leaked? “Data Contamination”

This is one of the biggest problems in AI evaluation. AIs learn from vast amounts of data from the internet, right? Sometimes, the test questions from the benchmarks used for evaluation are unintentionally included in that training data.

An analogy…
It’s like cramming for an entrance exam by memorizing a study guide that contains the exact questions and answers. You can’t measure true ability that way, can you? The AI might not be “solving” the problems but simply “recalling” the answers it has already memorized.
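
One simple (and far from perfect) way researchers look for this kind of leakage is to check whether long chunks of a test question appear word-for-word in the training data. Here’s a rough sketch of that n-gram overlap idea; the toy corpus and question are invented for illustration.

```python
# Rough sketch of an n-gram contamination check: does any 8-word chunk of the
# benchmark question appear verbatim in the training corpus? Toy data only.

def ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(question: str, training_corpus: str, n: int = 8) -> bool:
    corpus_ngrams = ngrams(training_corpus, n)
    return any(gram in corpus_ngrams for gram in ngrams(question, n))

corpus = "some web page that happens to quote the exact benchmark question word for word"
question = "quote the exact benchmark question word for word"
print(looks_contaminated(question, corpus))  # True -> this question may be 'leaked'
```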

Why Do the Results Change Every Time? “The Difficulty of Reproducibility”

In science, “reproducibility”—getting the same result no matter who conducts the experiment—is crucial. However, in the world of AI, especially with AI provided by companies, this is extremely difficult.

That’s because closed models are constantly updated without notice. The GPT-4 I tested yesterday and the GPT-4 you use today might already be different models internally. This makes it difficult to compare evaluations.

Conclusion: How Should We Read AI Report Cards?

We’ve now seen the various methods of AI evaluation and what goes on behind the scenes. The most important conclusion I want to share is this:

“A single leaderboard is not the absolute truth.”

Every evaluation method has its strengths and, as we’ve seen, significant drawbacks.

  • A high rank in the Exam Style might indicate it’s good at academic tests.
  • A high rank in the Battle Style might indicate it’s good at talking in a way humans like.

But neither guarantees that it will be “truly easy to use” for your specific purpose.
That’s why it’s important to take a step back when you look at rankings and ask, “What method was used to measure this score?” Having that critical perspective is key. AI evaluation will continue to evolve, and following that evolution might just be another fascinating part of the world of AI!
