Best Resources On Building Datasets to Train LLMs
Large language models (LLMs), such as OpenAI’s GPT series and Google’s Bard, are driving profound technological change. With the recent emergence of open-source large model frameworks like LLaMA and ChatGLM, training an LLM is no longer the exclusive domain of resource-rich companies.
Training LLMs has become a major interest for small organizations and individuals in the open-source community, with notable works including Alpaca, Vicuna, and Luotuo.
Beyond the model frameworks themselves, large-scale, high-quality training corpora are essential for training large language models. However, the relevant open-source corpora remain scattered across the community.
The goal of this article, therefore, is to collect and introduce high-quality resources for learning how to build training datasets for LLM applications.



