Large language models (LLMs), such as OpenAI's GPT series and Google's Bard, are driving profound technological change. Recently, with the emergence of open-source LLM frameworks like LLaMA and ChatGLM, training an LLM is no longer the exclusive domain of resource-rich companies.
Training LLMs has become an active pursuit for small organizations and individuals in the open-source community, with notable projects including Alpaca, Vicuna, and Luotuo.
Beyond model frameworks, large-scale, high-quality training corpora are equally essential for training LLMs, yet the relevant open-source corpora remain scattered across the community.
The goal of this article is therefore to collect and introduce high-quality, open-source resources for building training datasets for LLM applications.
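As a point of reference before surveying the corpora themselves, here is a minimal sketch of how an open-source corpus is typically pulled into a training pipeline, assuming the Hugging Face `datasets` library; the `wikitext` dataset is used purely as an illustrative public example, not as one of this article's recommendations.

```python
# A minimal sketch of loading an open-source corpus for LLM training.
# Assumes the Hugging Face `datasets` library is installed (pip install datasets).
from datasets import load_dataset

# "wikitext" is an illustrative public corpus; the open-source datasets
# surveyed in this article can generally be loaded the same way by name.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# Inspect the first record to see the raw text field used for training.
print(dataset[0]["text"])
```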