Hey everyone,

This should be an even bigger announcement. I wanted to share a project I worked on this summer with two interns. We tackled a problem that tech giants like Google, OpenAI (Whisper), and Meta have effectively ignored.

🚫 The Problem

In Cyprus, as you know, the official language is Standard Greek, but in daily life, most people speak the Cypriot dialect.

If you try to use AI speech-to-text apps or AI voice assistants here, they fail hard. Big models treat the dialect as "noise" or "bad Greek." Even Meta’s massive 1,600-language model doesn’t support it.

💡 The Plan vs. The Reality

We thought this would be a standard fine-tuning job:

  1. Grab a Greek wav2vec model.
  2. Download a dataset.
  3. Burn some GPU credits.
  4. Profit.

The Reality Check: There were NO datasets. Existing research data was either lost, broken, or locked behind a €35,000 paywall.

🛠 The Solution (The Hard Way)

Since data = AI, we had to build the entire pipeline from scratch. Here is what we did:

  • Getting data: We downloaded hours of Cypriot TV, radio, and podcasts.
  • Crowdsourcing: We built a platform (voiceofcyprus.org) to gather real translations from locals.
  • Messy Data: We used every NLP trick to teach the model context using very imperfect audio data.
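
For anyone wondering what "messy data" cleanup can look like in practice, here is a minimal sketch. It is not our actual pipeline: the function names and the characters-per-second thresholds are made up for illustration. It normalizes a transcript and drops audio/text pairs whose text length is implausible for the clip duration.

```python
import re
import unicodedata


def normalize_transcript(text):
    """Lowercase, unify Unicode form, and strip punctuation from a transcript."""
    text = unicodedata.normalize("NFC", text).lower()
    # Keep letters, digits, and whitespace; turn everything else into a space.
    text = "".join(ch if (ch.isalnum() or ch.isspace()) else " " for ch in text)
    # Collapse runs of whitespace left behind by the step above.
    return re.sub(r"\s+", " ", text).strip()


def keep_pair(duration_s, transcript, min_chars_per_s=2.0, max_chars_per_s=30.0):
    """Heuristic filter for (audio, text) pairs.

    The thresholds are illustrative assumptions, not tuned values.
    """
    norm = normalize_transcript(transcript)
    if duration_s <= 0 or not norm:
        return False
    chars_per_second = len(norm) / duration_s
    return min_chars_per_s <= chars_per_second <= max_chars_per_s
```

A clip that lasts 10 seconds but has a two-letter transcript would be rejected here, which is the kind of mismatch you tend to get when aligning scraped broadcast audio with text.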

We’ve open-sourced everything:

https://huggingface.co/datasets/Elormiden/RIK_Cypriot_Collection_Dataset
https://huggingface.co/datasets/Elormiden/RIK_Cypriot_News_Dataset
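
If you want to poke at the data, something like the sketch below should work with the Hugging Face `datasets` library. The repo IDs come from the links above; the helper names are just for illustration, and the sketch makes no assumption about split or column names, so print the result and inspect it first.

```python
DATASET_IDS = [
    "Elormiden/RIK_Cypriot_Collection_Dataset",
    "Elormiden/RIK_Cypriot_News_Dataset",
]


def dataset_url(repo_id):
    """Build the Hugging Face page URL for a dataset repo ID."""
    return f"https://huggingface.co/datasets/{repo_id}"


def load_cypriot_datasets(dataset_ids=DATASET_IDS):
    """Download each dataset and return a dict keyed by repo ID.

    Needs network access and `pip install datasets`.
    """
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    return {repo_id: load_dataset(repo_id) for repo_id in dataset_ids}


# Usage (downloads data, so it is left commented out):
# for repo_id, ds in load_cypriot_datasets().items():
#     print(dataset_url(repo_id))
#     print(ds)  # shows the available splits and columns
```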

🚀 The Result

In just 6 weeks and with only $150 in GPU credits, we trained a small model that actually understands the Cypriot dialect!
https://huggingface.co/Elormiden/bert-base-cypriot-greek
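
A quick way to try the checkpoint is sketched below. It assumes the model was published with a masked-language-model head, which the "bert-base" name suggests but which you should verify on the model card. The helper names are hypothetical; `pipeline("fill-mask", ...)` is the standard `transformers` call.

```python
MODEL_ID = "Elormiden/bert-base-cypriot-greek"


def build_fill_mask(model_id=MODEL_ID):
    """Create a fill-mask pipeline for the checkpoint.

    Needs network access and `pip install transformers torch`.
    Assumption: the checkpoint includes a masked-LM head.
    """
    # Imported lazily so top_predictions() can be used without the library.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_predictions(fill_mask, text_with_mask, k=3):
    """Return (token, score) pairs for a sentence containing one mask token."""
    results = fill_mask(text_with_mask, top_k=k)
    return [(r["token_str"], r["score"]) for r in results]


# Usage (downloads the model, so it is left commented out):
# fm = build_fill_mask()
# sentence = f"Καλημέρα {fm.tokenizer.mask_token}"
# print(top_predictions(fm, sentence))
```

Reading the mask token from `fm.tokenizer.mask_token` avoids hard-coding `[MASK]`, in case the tokenizer uses a different one.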

It isn’t production-perfect yet, but it is the first working pipeline. We proved that the task is solvable without a Big Tech budget. This provides a foundation for researchers and local devs to finally build voice AI for the island that can understand local people.

The Takeaway: You can have the best architecture in the world, but No Data = No AI.

Check out the full breakdown of the project here: https://youtu.be/zN_FMIWRSLA

And for the Greek speakers, here is an AI-translated version of the explanation: https://youtu.be/hcoXFNVP6L4

https://i.redd.it/ewoikegz3e2g1.png

Posted by AkimovIgor

3 Comments

  1. ForsakenMarzipan3133 on

    Great. Now scammers from Indian call-centers will be able to sound just like Pampos the taxi-driver from down the road.

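    Ρε πελλέ, εν θα το πιστέψεις! έχω σου μιαν απίστευτην ευκαιρίαν να επενδύσεις σε κάτι κρυπτονομίσματα να σιέζεις τις λίρες! έν εσιη λάθος άκου μου που σου λαλώ.

    [Cypriot Greek: "Hey, you crazy fool, you won't believe it! I've got an incredible opportunity for you to invest in some cryptocurrencies, you'll be shitting pounds! You can't go wrong, listen to what I'm telling you."]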