Re Bob's claim that LLM training is doing the work of evolution plus learning:
While it may well be true that LLM training is doing some of the work that evolution did, I don't think it's technically correct to say that "the system for representing words as vectors" is itself learned when training transformers. Or at least it's ambiguous what that means.
Before any training at all, with a randomly initialized network, transformers still represent words (well, tokens) as vectors - so "the system for representing words as vectors" is there before the transformer has seen a single bit of training data. There's not really any other way for a transformer (or a neural network in general) to represent things. It's just that during training, these vectors come to take on something you could call semantic meaning - the geometric relationships between token vectors start to correspond to semantic relationships.
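To make that concrete, here's a minimal sketch (assuming PyTorch, a made-up toy vocabulary, and an arbitrary d_model of 768 - none of this is tied to any particular model) of how a transformer already "represents words as vectors" at random initialization:

```python
import torch
import torch.nn.functional as F

vocab = {"cat": 0, "dog": 1, "carburetor": 2}  # hypothetical toy vocabulary
d_model = 768

# This is all "the system for representing words as vectors" amounts to before
# training: one randomly initialized d_model-dimensional vector per token id.
embedding = torch.nn.Embedding(len(vocab), d_model)

def sim(a: str, b: str) -> float:
    # Cosine similarity between the two token vectors.
    va = embedding(torch.tensor(vocab[a]))
    vb = embedding(torch.tensor(vocab[b]))
    return F.cosine_similarity(va, vb, dim=0).item()

# Before any training, "cat" is no closer to "dog" than to "carburetor" -
# random high-dimensional vectors are all roughly orthogonal.
print(sim("cat", "dog"), sim("cat", "carburetor"))
```

The representation machinery is already there; what training does is move these vectors around so that the geometry (e.g. cosine similarity) starts tracking semantics.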
I agree, though, that it is totally possible the model is kind of "re-learning" some generic linguistic algorithm that was discovered by evolution - I just wouldn't say it's accurate to describe that as "the system for representing words as vectors". There should be plenty of space for it to fit such an algorithm - humans and chimps differ by about 60MB of DNA, and GPT-4 has about 100,000x that much storage available in its weights.
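A quick back-of-envelope check of that ratio, using only the figures above plus one labeled assumption (2 bytes per weight, e.g. fp16/bf16 - an assumption for illustration, not a published GPT-4 spec):

```python
dna_diff_bytes = 60e6                            # ~60 MB human-chimp DNA difference (from above)
weight_storage_bytes = 100_000 * dna_diff_bytes  # "100,000x that much" => ~6 TB of weight storage
params_implied = weight_storage_bytes / 2        # assuming 2 bytes per weight
print(f"~{weight_storage_bytes / 1e12:.0f} TB of weights, i.e. roughly {params_implied / 1e12:.0f}T parameters")
# => ~6 TB of weights, i.e. roughly 3T parameters
```

So the claim amounts to saying the weights have terabytes of capacity versus tens of megabytes of genomic difference - a huge margin even if the exact parameter count is off by a factor of a few.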