Argyro (Iro) Tasitsiomi, Ph.D., Head of Investment Data Science at T. Rowe Price
Since the introduction of ChatGPT, there has been a huge amount of excitement surrounding generative artificial intelligence and large language models (or “LLMs”, with “large” referring to the number of model parameters), and more specifically their ability to create stories and poetry, answer questions, and engage in conversation.
But how intelligent are these AIs? By intelligence I mean the ability to successfully cope with new, previously unseen challenges. Does the question even matter? From a practical perspective, their potential is undeniable: even if LLMs are not intelligent themselves, they can make users more productive in consuming and creating content.
The answer does matter, however. To respond effectively to potential risks, we must understand both the capabilities and the limitations of LLMs. Both are critical to mitigating overreliance on AI-generated information on the one hand, and unfounded fears of automation replacing humans on the other: distinct risks, yet each capable of leading to adverse outcomes.
In what follows, I offer some considerations for readers to keep in mind when reflecting on this and related topics.
Why the “perfect” model isn’t the best
When we fit a model to data, we are in effect looking for a data compression mechanism. For example, suppose we fit a line to 1,000 points; assuming the fit is good, this means we have managed to store most of the information in the data in just two parameters: the line’s slope and intercept. So if we want to convey the message carried by the 1,000 data points, we can now do so using only two values instead of 1,000.
Good models compress data with high efficiency and low information loss. Efficient means the model captures the information content of the data with only a few parameters, far fewer than the number of data points. Low information loss means the values the model produces are close to the real data. That is why we find the best model parameters by minimizing a metric that represents this loss: the distance between the fitted predictions and the real data (think least squares).
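To make the line-fitting example concrete, here is a minimal sketch in Python; the synthetic data and the “true” slope and intercept are arbitrary illustrative choices. It summarizes 1,000 noisy points with just two fitted parameters, and the mean squared error plays the role of the information loss.

```
# A minimal sketch of the compression idea: 1,000 noisy points summarized by
# just two numbers (slope and intercept) found via least squares.
# The data is synthetic; the "true" slope and intercept are illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 1000)
y = 2.5 * x + 1.0 + rng.normal(scale=1.0, size=x.size)  # signal + noise

# Least squares finds the two parameters that minimize the squared distance
# between the fitted line and the data.
slope, intercept = np.polyfit(x, y, deg=1)
predictions = slope * x + intercept
mse = np.mean((y - predictions) ** 2)  # the "information loss"

print(f"1000 points compressed into 2 parameters: "
      f"slope={slope:.3f}, intercept={intercept:.3f}, MSE={mse:.3f}")
```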
The scenario of perfectly zero information loss is the one where the model’s values equal the real data exactly. This can happen if the model has as many parameters as there are data points: each parameter “stores” the information of exactly one data point. However, this “perfect” fit achieves nothing in terms of compression: it uses as many parameters as there are data points to capture the data…
Furthermore, all real-world data sets contain both useful information and useless noise. By forcing the model to use fewer parameters to represent the data, we force it to learn the information, not the noise. Allowing more parameters beyond a certain point leads to overfitting: the model learns every tiny twist and kink in the data, signal and noise alike, and therefore lacks the flexibility to fit data it has never seen before. As we approach the “perfect” model, we can fit absolutely all the data the model sees; but the moment the model encounters data it has not seen, it becomes useless. This is why a perfect model is not the best: it is like a student who memorizes everything taught in class but cannot solve any unfamiliar problem.
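Here is a small sketch of that failure mode, again with arbitrary illustrative choices of sample sizes and polynomial degrees: a modest polynomial and a polynomial with roughly one parameter per training point are both fit to the same noisy data, and their errors are compared on data the models have never seen.

```
# Overfitting sketch: a near-"perfect" fit on the training data generalizes
# poorly. Degrees, sample sizes, and the underlying curve are illustrative.
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = np.sort(rng.uniform(0, 1, n))
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)  # signal + noise
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)  # unseen data from the same process

for degree in (3, 19):  # modest model vs. ~one parameter per training point
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE={train_mse:.4f}, test MSE={test_mse:.4f}")
```

The high-degree fit drives the training error toward zero while the error on unseen points blows up, which is exactly the “perfect but useless” behavior described above.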
What does all this have to do with LLMs and intelligence?
Although a little harder to see when language is involved, all of the above applies, at least intuitively, to LLMs as well.
It turns out that LLMs are relatively poor compressors: they “fit” a huge number of data points (roughly the entire internet), but they also use a huge number of parameters (trillions and counting!). The larger the LLMs we develop, the greater the proportion of the internet that gets “stored” in their parameters, and therefore the closer we get to “overfitting” and memorization.
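To get a rough feel for this, consider a back-of-envelope calculation. The parameter count, bytes per parameter, and corpus size below are purely hypothetical round numbers, not figures for any actual model or training set; the point is only the contrast with the 500-to-1 compression of the line-fit example above.

```
# Back-of-envelope illustration of the compression argument.
# All figures are hypothetical round numbers, not real model or corpus sizes.
params = 1e12                 # assumed: a trillion-parameter model
bytes_per_param = 2           # assumed: 16-bit weights
model_bytes = params * bytes_per_param

training_text_bytes = 10e12   # assumed: ~10 TB of training text

compression_ratio = training_text_bytes / model_bytes
print(f"Model parameters: ~{model_bytes / 1e12:.0f} TB")
print(f"Training text:    ~{training_text_bytes / 1e12:.0f} TB")
print(f"Compression ratio: ~{compression_ratio:.0f}x "
      f"(vs. ~500x for 1,000 points stored in 2 line parameters)")
```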
Furthermore, in the hypothetical scenario where LLMs are given the entire internet to learn from, we run into a paradox: overfitting is no longer a problem because… there is no data the model has not seen! And if the model is large enough to retain most of that data, we end up with one giant “perfect” model that contains everything: little more than an elaborate copy of the internet…?
In conclusion
LLMs are marvels of human ingenuity that can bring us immense value, not because they are efficient, but because they are enormous. This makes them “brute force”, extremely complex models with an equally brute ability to “remember” information in ways that can mimic intelligence.
But then again, life itself may have emerged from large, complex systems as unexpected behavior. Couldn’t these giant LLMs give rise to genuine intelligence in a similar way? Well, that’s a conversation for another time!