Are LLMs trained on Internet crap? / Slow Learner's Quest

airspresso:

I keep hearing that LLMs are trained on “Internet crap”, but is it true? Karpathy repeated this in a recent interview: if you looked at random samples from the pretraining set, you’d mostly see garbage text, and it’s very surprising that it works at all.

The labs have focused a lot more on finetuning (post-training) and RL lately, and as I understand it, that’s where all the desirable properties of an LLM are trained into it. Pretraining just teaches the LLM the semantic relations it needs as the foundation for finetuning to work.