Beyond the Basics: A Comprehensive Analysis of GPT-4's Advanced Test Performances

In our previous exploration, “The Ultimate AI Challenge,” we lifted the curtain on the fascinating world of tests like MMLU, HellaSwag, and others that GPT-4 faced, offering a foundational understanding of these AI trials.


Now, we delve deeper, going beyond the basics to unravel the complexities and nuances of GPT-4’s performances. This next-level analysis aims not just to inform but to engage and challenge our perceptions of AI’s evolving capabilities.


The Massive Multitask Language Understanding (MMLU) benchmark, introduced by researchers led by Dan Hendrycks at UC Berkeley, stands as a comprehensive challenge for AIs like GPT-4.


The MMLU test is crucial because it is one of the most comprehensive evaluations of a language model’s understanding and reasoning, spanning 57 subjects from STEM to the humanities (and, in translated versions, multiple languages). Its significance lies in its ability to measure an AI’s adaptability and depth of knowledge, beyond mere language proficiency.


MMLU challenges GPT-4 to apply context, reason abstractly, and draw on a wide range of knowledge, making it a critical benchmark for assessing the true intelligence and versatility of AI models like GPT-4.


For instance (actual MMLU items are four-option multiple-choice; the open-ended prompts below illustrate the range of topics):

  • Philosophy: “Explain Nietzsche’s concept of ‘eternal recurrence’ and its implications on free will.”

  • Advanced Mathematics: “Describe the Riemann Hypothesis and its significance in number theory.”

  • Cultural Studies: “Analyze the impact of post-colonialism on modern literature in Southeast Asia.”

  • Astrophysics: “Discuss the evidence supporting the existence of dark matter in the universe.”

  • Comparative Religion: “Explain the differences and similarities between the concepts of karma in Hinduism and Buddhism.”


We took the liberty of asking GPT-4 what the most difficult question on this test might look like:


“Evaluate the impact of quantum computing on modern encryption methods, considering both theoretical and practical implications.”


GPT-4 achieved 86.4% accuracy on this test, the best result among LLMs so far. The score was obtained with the “5-shot” technique, explained here.
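As a rough illustration of what “5-shot” means in practice, here is a minimal sketch of how such a prompt could be assembled: the model sees five solved exemplars before the real question. The helper and the exemplar questions below are invented for illustration, not OpenAI’s actual evaluation harness.

```python
# Minimal sketch of few-shot prompting: k solved exemplars are placed
# before the unanswered question. Exemplars here are placeholders.

def build_few_shot_prompt(exemplars, question, k=5):
    """Concatenate k solved exemplars followed by the unanswered question."""
    parts = []
    for q, a in exemplars[:k]:
        parts.append(f"Question: {q}\nAnswer: {a}\n")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)

exemplars = [
    ("What is the capital of France?", "Paris"),
    ("2 + 2 = ?", "4"),
    ("Who wrote 'Hamlet'?", "William Shakespeare"),
    ("What gas do plants absorb during photosynthesis?", "Carbon dioxide"),
    ("What is the chemical symbol for gold?", "Au"),
]

prompt = build_few_shot_prompt(
    exemplars, "Explain Nietzsche's concept of 'eternal recurrence'."
)
print(prompt)  # the assembled prompt is what gets sent to the model
```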


HellaSwag, developed by AI2, tests commonsense reasoning by challenging the AI to complete narratives and predict logical story endings. It is pivotal in evaluating AI’s understanding of everyday scenarios and its ability to predict outcomes based on contextual cues. For instance:


  • “Predict the outcome: A scientist combines two chemicals that react. What happens next?”

  • “A chef starts to bake a cake without preheating the oven. Predict the baking process’s result.”

  • “A student studies for a test but doesn’t sleep well. Anticipate the student’s performance.”


Toughest Question?:


“A quantum physicist experiments with particle entanglement. 

Predict the most likely outcome: 

  • Altered time perception 

  • Generation of a new element 

  • Breakthrough in teleportation.”



GPT-4 achieved 95.3% accuracy on this test, the best result among LLMs so far, using the “10-shot” technique.
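Under the hood, HellaSwag is a multiple-choice completion task: the model is given a context and several candidate endings, and the ending it judges most plausible is compared against the human-written one. Here is a toy sketch of that selection loop; the word-overlap scorer and the item are invented stand-ins for a real model’s likelihood scoring.

```python
# Toy sketch of HellaSwag-style evaluation: pick the candidate ending
# the model scores highest. The scorer below is a crude word-overlap
# placeholder standing in for a real model's log-likelihood.
import re

def placeholder_score(context, ending):
    context_words = set(re.findall(r"[a-z']+", context.lower()))
    ending_words = re.findall(r"[a-z']+", ending.lower())
    return sum(w in context_words for w in ending_words) / max(len(ending_words), 1)

def pick_ending(context, endings):
    scores = [placeholder_score(context, e) for e in endings]
    return scores.index(max(scores))  # index of the most plausible ending

item = {
    "context": "A chef starts to bake a cake without preheating the oven.",
    "endings": [
        "The cake bakes unevenly because the oven is still heating up.",
        "Penguins immediately march out of the kitchen.",
        "It starts raining indoors.",
    ],
    "label": 0,
}

choice = pick_ending(item["context"], item["endings"])
print("predicted:", choice, "correct:", choice == item["label"])
```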


AI2 Reasoning Challenge (ARC)


Created by AI2, ARC focuses on grade-school level science questions, assessing AI’s ability to apply reasoning and understanding in scientific contexts. It’s crucial for exploring AI’s role in educational assistance and understanding scientific concepts. For instance:


  • “Explain how photosynthesis contributes to the carbon cycle.”

  • “What causes the phases of the Moon?”

  • “Describe the relationship between a food chain and an ecosystem’s stability.”


Toughest Question?:


“Explain the role of dark energy in the accelerating expansion of the universe and its implications for the Big Bang theory.”


GPT-4 achieved 96.3% accuracy on this test, the best result among LLMs so far, using the “25-shot” technique.


WinoGrande, also developed by AI2, tests AI’s language understanding, specifically its ability to resolve ambiguous pronouns in sentences. This test is key to evaluating natural language processing and the AI’s capability to interpret complex, nuanced human language. For instance:

  • “Alex asked Jordan to help with his homework after school. ‘He’ was struggling with math.”

  • “Jamie borrowed a book from Taylor and returned it late. ‘She’ apologized for the delay.”

  • “Chris watched Pat’s dog while ‘he’ was on vacation.”
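In the benchmark itself, each item is a sentence with a blank and exactly two candidate fillers, and the model must pick the one that makes sense; the sentences above paraphrase that idea. A hypothetical item in WinoGrande’s two-option style:

```python
# Hypothetical item in WinoGrande's two-option, fill-in-the-blank style.
# The model must decide which of the two names the blank refers to.
item = {
    "sentence": "Alex asked Jordan for help with the math homework because _ was struggling.",
    "option1": "Alex",
    "option2": "Jordan",
    "answer": "option1",  # Alex is the one who was struggling
}

# Evaluation substitutes each option into the blank and asks the model
# which completed sentence is more plausible.
candidates = {
    key: item["sentence"].replace("_", item[key])
    for key in ("option1", "option2")
}
for key, sentence in candidates.items():
    print(key, "->", sentence)
```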


Toughest Question?:


“Alex discussed quantum mechanics with Jordan, who found the conversation enlightening. ‘Their’ interest in the subject was piqued.”


GPT-4 achieved 87.5% accuracy on this test, the best result among LLMs so far, using the “5-shot” technique.



HumanEval, developed by OpenAI, tests AI’s coding abilities. It assesses the model’s proficiency in understanding programming concepts, its problem-solving, and its capacity to write functional code. This test is significant for evaluating AI’s application in software development.

For instance:

  • “Create a function to find the nth Fibonacci number.”

  • “Write a script that sorts a list of tuples based on the second element.”

  • “Develop a program that merges two dictionaries without losing any data.”
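The prompts above paraphrase the idea; actual HumanEval problems are posed as a Python function signature plus docstring, the model writes the body, and the completion is checked by running unit tests (scores are typically reported as pass@k, the fraction of problems solved within k sampled completions). A hypothetical task in that style, with one possible completion and a check:

```python
# Hypothetical HumanEval-style task: the prompt is the signature and
# docstring; the body is the kind of completion the model is asked to
# produce, verified by running unit tests.

def fib(n: int) -> int:
    """Return the n-th Fibonacci number, with fib(0) == 0 and fib(1) == 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# Unit tests of the kind the harness would run against the completion.
assert fib(0) == 0
assert fib(1) == 1
assert fib(10) == 55
print("all tests passed")
```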


Toughest Question?:


“Develop an algorithm that optimizes the travel route of a salesman visiting multiple cities, considering time and distance constraints.”


GPT-4 achieved 67.0% accuracy on this test, the best result among LLMs so far, using the “0-shot” technique. Note that coding remains the area where GPT-4’s scores are lowest and its output least reliable.
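For what it’s worth, the “toughest question” above is a small instance of the travelling salesman problem. A brute-force sketch, fine for a handful of cities and hopeless beyond that, looks like this; the distance matrix is invented purely for illustration.

```python
# Brute-force travelling salesman sketch: try every ordering of the
# remaining cities and keep the cheapest round trip. Exponential in the
# number of cities, so only viable for tiny instances.
from itertools import permutations

# Invented symmetric distance matrix between 4 cities (0 is the start).
dist = [
    [0, 10, 15, 20],
    [10, 0, 35, 25],
    [15, 35, 0, 30],
    [20, 25, 30, 0],
]

def shortest_tour(dist):
    n = len(dist)
    best_cost, best_route = float("inf"), None
    for perm in permutations(range(1, n)):
        route = (0, *perm, 0)  # start and end at city 0
        cost = sum(dist[a][b] for a, b in zip(route, route[1:]))
        if cost < best_cost:
            best_cost, best_route = cost, route
    return best_cost, best_route

print(shortest_tour(dist))  # -> (80, (0, 1, 3, 2, 0))
```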


DROP (Discrete Reasoning Over Paragraphs):

Created by the Allen Institute for Artificial Intelligence, DROP focuses on AI’s reading comprehension and reasoning skills. It requires AI to interpret text passages and perform logical operations like calculation and sorting, testing its understanding of complex written information. For instance:


  • “If a train leaves at 9 AM and travels at 60 mph, when does it reach a station 300 miles away?”

  • “A recipe uses 4 eggs to make 2 cakes. How many eggs are needed for 5 cakes?”

  • “During a game, a player scores 3 times with 2 points each. Calculate the total score.”
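The “discrete reasoning” DROP asks for is exactly the kind of small calculation in these examples; worked out explicitly as a quick sketch:

```python
# The three example questions above, worked out as the discrete
# operations DROP expects the model to perform.

# Train: 300 miles at 60 mph -> 5 hours after a 9 AM departure.
travel_hours = 300 / 60
arrival = 9 + travel_hours           # 14, i.e. 2 PM
print(f"arrival: {int(arrival)}:00 (2 PM)")

# Recipe: 4 eggs for 2 cakes -> 2 eggs per cake -> 10 eggs for 5 cakes.
eggs_per_cake = 4 / 2
print("eggs needed:", int(eggs_per_cake * 5))

# Game: 3 scores worth 2 points each.
print("total score:", 3 * 2)
```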


Toughest Question?:


“A scientist conducts an experiment with varying temperatures and measures the rate of a chemical reaction. Calculate the reaction speed at 25°C.”


GPT-4 scored 80.9 (F1) on this test, using the “3-shot” technique. Please note that, according to OpenAI, the previous state of the art (SOTA), a model fine-tuned specifically for DROP rather than a general-purpose LLM, scored 88.4. This is the only test covered here where GPT-4 is not the top performer.



GSM-8K (Grade School Math 8K):


Also developed by OpenAI, GSM-8K assesses AI’s mathematical reasoning through grade-school level math problems. This test evaluates the AI’s ability in arithmetic and logical problem-solving, reflecting its potential in educational and analytical applications. For instance:


  • “Solve for x in the equation 2x + 3 = 11.”

  • “A rectangle’s length is twice its width. If the area is 50 sq. units, find the dimensions.”

  • “Calculate the volume of a cylinder with a radius of 3 units and a height of 10 units.”
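For reference, the three examples work out as follows (a quick sketch of the arithmetic):

```python
# Worked answers to the three example problems above.
import math

# 2x + 3 = 11  ->  x = (11 - 3) / 2
x = (11 - 3) / 2
print("x =", x)                       # 4.0

# Rectangle: length = 2 * width, area = 50  ->  2w^2 = 50  ->  w = 5, l = 10
width = math.sqrt(50 / 2)
print("width =", width, "length =", 2 * width)

# Cylinder volume: V = pi * r^2 * h with r = 3, h = 10
volume = math.pi * 3**2 * 10
print("volume ≈", round(volume, 2))   # ≈ 282.74
```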


Toughest Question?:


“The probability of an event changes over time following a logarithmic scale; calculate its likelihood at a given point, considering initial probability values.”


GPT-4 achieved 92.0% accuracy on this test, the best result among LLMs so far, using the “5-shot, chain-of-thought” technique.
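“Chain of thought” means the few-shot exemplars include the intermediate reasoning, not just the final answer, which nudges the model to reason step by step before committing to a number. A hypothetical single exemplar in that style (a real 5-shot prompt would stack five of these before the test question):

```python
# Hypothetical chain-of-thought exemplar: unlike plain few-shot prompts,
# each solved example spells out the intermediate reasoning before the
# final answer. A real 5-shot CoT prompt would include five such exemplars.
cot_exemplar = """\
Question: A rectangle's length is twice its width. If the area is 50 sq. units, find the dimensions.
Let the width be w, so the length is 2w.
Area = w * 2w = 2w^2 = 50, so w^2 = 25 and w = 5.
Therefore the width is 5 units and the length is 10 units.
The answer is width 5, length 10.
"""

test_question = "Question: Solve for x in the equation 2x + 3 = 11.\n"

prompt = cot_exemplar + "\n" + test_question
print(prompt)  # the model is expected to continue with its own reasoning
```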