Introduction: Why this project matters?

Training instruction following LLMs is no longer just about scaling models. It is about scaling data quality.

In high resource languages like English, datasets such as Alpaca and OpenAssistant already exist. However, in low resource languages like Persian, high quality instruction datasets are extremely limited.

Most available Persian corpora suffer from:

• lack of instruction structure