Large language models are now used by many companies: with their help, systems can generate coherent text and even program code. These models are trained on data from Wikipedia, scientific papers, books, and other sources. The trend in recent years has been to train models on ever larger amounts of data in the hope that this will make them more accurate. However, a problem has emerged.
Reportedly, the supply of data commonly used for training language models could run out around 2026: researchers keep building more powerful models with ever more capabilities, and those models require ever more text.
Part of the problem stems from the fact that researchers pre-filter the data they prepare for training language models by quality. High-quality texts might be well-edited articles, while low-quality texts might be social-media posts or website comments, although the boundary between these categories is quite blurry.
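To make the filtering idea concrete, here is a minimal sketch of a heuristic quality filter. Real pipelines use far more elaborate rules and learned classifiers; the function name, thresholds, and sample corpus below are purely illustrative assumptions, not any production system's logic.

```python
import re

def looks_high_quality(text: str, min_words: int = 20,
                       max_symbol_ratio: float = 0.1) -> bool:
    """Toy heuristic: keep longer, mostly-alphabetic, punctuated prose.

    Thresholds are illustrative, not tuned against any real corpus.
    """
    words = text.split()
    if len(words) < min_words:
        # Very short snippets (comments, chat messages) are dropped.
        return False
    # A high share of unusual characters often indicates noisy text.
    symbols = sum(1 for ch in text
                  if not (ch.isalnum() or ch.isspace() or ch in ".,;:!?'\"-()"))
    if symbols / max(len(text), 1) > max_symbol_ratio:
        return False
    # Require sentence-final punctuation, a crude proxy for edited prose.
    return bool(re.search(r"[.!?]", text))

corpus = [
    "The mitochondrion is the powerhouse of the cell. It produces ATP "
    "through oxidative phosphorylation and plays a central role in "
    "cellular metabolism, apoptosis, and calcium signalling in eukaryotes.",
    "lol +1 same here!!!",
]
filtered = [t for t in corpus if looks_high_quality(t)]
```

Here the encyclopedic sentence passes while the short comment is discarded, mirroring the article/comment divide described above.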
Researchers usually train models on high-quality texts, an approach that has paid off in GPT-3 and similar systems. If the shortage of material materializes in the coming years, however, neural networks may have to be "fed" lower-quality texts.
However, not all experts agree with this. Percy Liang, a professor of computer science at Stanford University, said there is evidence that small models trained on high-quality texts outperform large models trained on low-quality texts.
It is also possible to train models on the same texts several times. Today, large language models are typically trained on each piece of data only once, in a single pass. At the same time, users themselves often take part in training neural networks; a recent example is Galaxy.
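The contrast between today's single-pass training and the multi-epoch reuse suggested above can be sketched with a generic training loop. The `train` function, `model_update` callback, and document names are hypothetical stand-ins, not the API of any real framework:

```python
import random

def train(model_update, corpus, epochs: int = 1, seed: int = 0) -> int:
    """Sketch of a training loop; `model_update` stands in for one
    optimization step on a text. With epochs=1 each text is seen once
    (today's common practice); epochs > 1 reuses the same texts.
    Returns the total number of steps taken."""
    rng = random.Random(seed)
    steps = 0
    for _ in range(epochs):
        order = corpus[:]      # reshuffle each pass so repeats differ in order
        rng.shuffle(order)
        for text in order:
            model_update(text)
            steps += 1
    return steps

seen = []
# Three epochs over three documents: the same material is fed three times.
n_steps = train(seen.append, ["doc_a", "doc_b", "doc_c"], epochs=3)
```

Whether such repetition helps in practice is exactly the open question raised above; the sketch only shows where the extra passes would enter the pipeline.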