Polish eighth-grade exams vs AI

Yesterday I shared the results of the Polish eighth-grade exam. In the math exam, the 16 Polish voivodeships (regions) averaged between 44% and 56%.

For my little experiment, I took individual screenshots of the first 15 questions and gave them one by one, without further instructions, to OpenAI o3, Gemini 2.5 Pro and Claude Sonnet 4, after this initial prompt:

>You are a Polish student passing the mathematics exam. You get one question at once. Solve them and finish the answer with the correct solution.

o3 and Gemini each scored 14/15 (93.3%), both getting task 12 wrong:

https://preview.redd.it/wapmhfwvtqbf1.png?width=1193&format=png&auto=webp&s=8979ae9d07ddb41019b9dd610ea3d27cc58cfca1

Claude Sonnet 4 lags behind with 12/15 (80%), but I should add that it’s not the company’s strongest model. I don’t have access to Claude Opus 4, but it’s safe to assume it would have performed better.
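The percentages quoted above can be reproduced with a few lines of Python. This is only a sketch: the counts are the ones reported in this post, and the names `correct` and `percent` are made up for illustration.

```python
# Correct answers out of the first 15 tasks, as reported in the post.
TOTAL_TASKS = 15
correct = {
    "OpenAI o3": 14,
    "Gemini 2.5 Pro": 14,
    "Claude Sonnet 4": 12,
}


def percent(score, total=TOTAL_TASKS):
    """Return the score as a percentage, rounded to one decimal place."""
    return round(100 * score / total, 1)


# Hypothetical helper mapping: model name -> percentage of tasks solved.
percentages = {model: percent(n) for model, n in correct.items()}

for model, pct in percentages.items():
    print(f"{model}: {correct[model]}/{TOTAL_TASKS} = {pct}%")
```

Keep in mind that these numbers cover only the first 15 tasks, so they are not directly comparable to the regional averages for the whole exam.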

I uploaded Gemini 2.5 Pro’s answers to Imgur if anyone wants to see how it solved the tasks. o3 was much less talkative.

Note: With such benchmarks there’s always a risk of contamination, meaning public questions and answers become part of the training data and the models have them memorized. This is highly unlikely here, since the questions and answers were made public only very recently. The Gemini version I used has a knowledge cutoff of January 2025, which is before the exams were held in May.

Posted by u/opolsce in r/poland

8 Comments

Suheil-got-your-back on July 9, 2025 9:12 am

Time and time again, it’s been shown that we shouldn’t apply human evaluation to AI, because its weaknesses lie in different areas.

tibmb on July 9, 2025 9:13 am

Did you add “don’t search the web for the solution”? They sometimes do, even unprompted.

wizarddos on July 9, 2025 9:15 am

Interesting, yet the only thing I can see that’s a bit wrong is the prompt.

The word “student” in Polish refers to college students only.

So it could mislead the model a bit and make it act more advanced in its learning than it should be.

I think “uczniem klasy 8” (“an eighth-grade pupil”) or just “uczniem” (“a pupil”) would be better.

Box_of_Hope on July 9, 2025 9:52 am

I wonder why the models got confused by this exact question. Perhaps it’s because the illustration is imprecise, yet is used as a source of information? I wonder what would happen if B were moved slightly to the right.

opolsce on July 9, 2025 9:57 am

I could have made that easier for myself. Instead of individual screenshots one by one, I tried uploading the entire PDF in AI Studio with the prompt

>Jesteś polskim studentem zdającym egzamin z matematyki. Rozwiąż pytania od 1 do 15 w załączonym pliku i zakończ odpowiedź poprawnym rozwiązaniem. Na koniec, podsumuj swoje odpowiedzi w tabeli z dwiema kolumnami: Numer zadania i poprawna odpowiedź.

(In English: “You are a Polish student taking the mathematics exam. Solve questions 1 to 15 in the attached file and end your answer with the correct solution. Finally, summarize your answers in a table with two columns: task number and correct answer.”)

After 196.4 seconds, just a tad faster than the three hours humans have, it got the same score (14/15), again with task 12 wrong.

https://preview.redd.it/fm5dlrkwmtbf1.png?width=344&format=png&auto=webp&s=ded69dbfbd642731188d9066a8d89e9b29ef459c

The_InHuman on July 9, 2025 11:26 am

Okay, and? What’s your conclusion? I think most people are capable of consistently scoring high on tests if you let them cheat freely. Do you think the progress of LLMs is going to make schools redundant? Education is based on the internalization of existing knowledge. Whether AI can find an answer to test questions nobody actually cares about is at best irrelevant, and at worst detrimental to the process.

Jesper537 on July 9, 2025 12:39 pm

(83-56)/3 = 9,

83+2*9 = 101, C is odd, **F** (false)

83-9 = 74, B is less than 74, **P** (true)
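The arithmetic in the comment above can be checked with a short Python sketch. The layout of the number line is an assumption inferred from the comment itself (56 and 83 on tick marks three equal gaps apart, C two ticks to the right of 83); the task text is not reproduced here, and all variable names are made up for illustration.

```python
# Assumption (inferred from the comment, not from the task text):
# 56 and 83 sit on tick marks of an evenly spaced number line, 3 gaps apart.
LEFT, RIGHT, GAPS = 56, 83, 3

# Distance between two neighbouring ticks.
step, remainder = divmod(RIGHT - LEFT, GAPS)
if remainder != 0:
    raise ValueError("ticks would not fall on whole numbers")

# Assumption: C sits two ticks to the right of 83, as in "83+2*9".
c = RIGHT + 2 * step
c_is_odd = c % 2 == 1

# The tick one step to the left of 83; the comment places B below this value.
tick_left_of_83 = RIGHT - step

print(f"step = {step}")
print(f"C = {c}, odd: {c_is_odd}")
print(f"tick left of 83 = {tick_left_of_83}")
```

The sketch only confirms the tick values. Whether B really lies to the left of that tick still has to be read off the illustration, which is what the comment does.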

13579konrad on July 9, 2025 10:31 pm

Why not all the questions?