Best Resources On Building Datasets to Train LLMs

Youssef Hosni
Apr 18, 2024
Large language models (LLMs), such as OpenAI’s GPT series and Google’s Bard, are driving profound technological changes. With the emergence of open-source large models such as LLaMA and ChatGLM, training an LLM is no longer the exclusive domain of resource-rich companies.

Training LLMs by small organizations or individuals has become a major focus of the open-source community, with notable efforts including Alpaca, Vicuna, and Luotuo.

In addition to model frameworks, large-scale, high-quality training corpora are essential for training large language models. Currently, the relevant open-source corpora remain scattered across the community.

Therefore, the goal of this article is to introduce and collect high-quality resources for learning how to build training datasets for LLM applications.
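To make the idea concrete, a common starting point is to take an existing open instruction dataset and reshape it into prompt/response pairs ready for fine-tuning. The sketch below assumes the Hugging Face `datasets` library and the publicly hosted `tatsu-lab/alpaca` dataset; it is a minimal illustration, not a recommendation of any particular corpus or format.

```python
# Minimal sketch: assemble an instruction-tuning corpus from an open dataset.
# Assumes `pip install datasets` and network access to the Hugging Face Hub.
from datasets import load_dataset


def format_example(example):
    """Flatten an Alpaca-style record (instruction/input/output) into a prompt/response pair."""
    if example.get("input"):
        prompt = (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n"
        )
    else:
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
    return {"prompt": prompt, "response": example["output"]}


if __name__ == "__main__":
    # Download the open-source Alpaca instruction dataset (~52k examples).
    raw = load_dataset("tatsu-lab/alpaca", split="train")
    # Map every record into a prompt/response pair, dropping the original columns.
    corpus = raw.map(format_example, remove_columns=raw.column_names)
    print(corpus[0]["prompt"])
    print(corpus[0]["response"])
```

The resources collected below go well beyond this toy example, covering how to source, clean, deduplicate, and mix corpora at the scale real LLM training requires.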
