Yesterday I shared the results of the Polish eight-grade exam. In the math exam, the 16 Polish states scored between 44% and 56%.

    For my little experiment I individually screenshoted the first 15 questions and one by one – without further instructions – gave them to OpenAI o3, Gemini 2.5 Pro and Claude Sonnet 4, after the initial prompt

    You are a Polish student passing the mathematics exam. You get one question at once. Solve them and finish the answer with the correct solution.

    o3 and Gemini each scored 14/15 (93.3%)both getting task 12 wrong:

    https://preview.redd.it/wapmhfwvtqbf1.png?width=1193&format=png&auto=webp&s=8979ae9d07ddb41019b9dd610ea3d27cc58cfca1

    Claude Sonnet 4 lags behind with 12/15 (80%)but I should add that it’s not the company’s strongest model. I don’t have access to Claude Opus 4, safe to assume it would have performed better.

    I uploaded the answers of Gemini 2.5 Pro to Imgurif anyone wants to see how it solve the tasks. o3 was much less talkative.

    Note: With such benchmarks there’s always a risk of contamination, meaning public questions and answers becoming part of the training data and thus the models having them memorized. This is highly unlikely here, since questions and answers have only been made public very recently. The Gemini version I used has a knowledge cutoff of January 2025, that’s before the exams were held in May.

    Polish eight-grade exams vs AI
    byu/opolsce inpoland



    Posted by opolsce

    Share.

    8 Comments

    1. Suheil-got-your-back on

      Time and time again, it’s proven that we shouldn’t apply human evaluation to ai, because their weaknesses lie in different areas.

    2. Did you add “don’t search the web for solution”? They sometimes do even unprompted.

    3. Interesting- yet only thing I can see a bit wrong is the prompt

      Word “student” in polish refers to college students only.

      So it could mislead model a bit and make it more advanced in learning than it should

      I think “uczniem klasy 8” or just “uczniem” would be better

    4. Box_of_Hope on

      I wonder why did the models get confused with this exact question. Perhaps it’s about the illustration being imprecise, yet used as a source of knowledge? I wonder what would happen if B was moved slightly to the right.

    5. I could have had that easier. Instead of individual screenshots one by one, I tried uploading the entire pdf with the prompt

      >Jesteś polskim studentem zdającym egzamin z matematyki. Rozwiąż pytania od 1 do 15 w załączonym pliku i zakończ odpowiedź poprawnym rozwiązaniem. Na koniec, podsumuj swoje odpowiedzi w tabeli z dwiema kolumnami: Numer zadania i poprawna odpowiedź.

      in AI Studio. After 196.4 seconds, just a tad faster than the three hours humans have: Same score (14/15) and again task 12 wrong.

      https://preview.redd.it/fm5dlrkwmtbf1.png?width=344&format=png&auto=webp&s=ded69dbfbd642731188d9066a8d89e9b29ef459c

    6. Okay and? What’s your conclusion? I think most people are capable of consistently scoring high on tests if you let them freely cheat. Do you think the progress of LLMs is going to make schools redundant? Education is based on the internalization of existing knowledge. Whether AI can find an answer to test questions nobody actually cares about is at best irrelevant, and at worst detrimental to the process

    7. (83-56)/3 = 9,

      83+2*9 = 101, C jest nieparzysta, **F**

      83-9 = 74, B jest mniejsze niż 74, **P**