

How LLMs Learn: What We Know, What We Don’t (Yet) Know, and What Comes Next

by Jonas Braadbaart, July 2024


These models also cost a pretty penny to train. GPT-4, for example, cost USD 41M in compute alone. And that is just the cost of the compute, which excludes the costs of personnel, research, engineering and dataset preparation that are also needed to train these beasts. Some internet sources estimate that developing Llama 3 set Meta back somewhere in the 1 to 2 billion USD range.

All these parameters and all this data are needed so that LLMs can learn the basics of human language***. When training LLMs, researchers have found that one approach that works well is to show the model the same sentence as both input and output while hiding one or more of the words in the output.

By learning to correctly “guess” the hidden words in the output sentence, LLMs are able to predict the next word in a sequence to a very high degree of accuracy. This little trick is at the foundation of all recent advances in AI!
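As a rough illustration of what that trick looks like in practice (a minimal sketch, not the exact recipe any lab uses), here is how a single sentence can be turned into a series of “guess the hidden next word” examples:

# Turn one sentence into next-word prediction examples (illustrative sketch only).
sentence = "to be or not to be that is the question".split()

# At each position the model sees the words so far and must guess the hidden next word.
for t in range(1, len(sentence)):
    context = " ".join(sentence[:t])
    target = sentence[t]
    print(f"input: '{context}'  ->  hidden word to predict: '{target}'")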

Let’s look at an example. I asked Claude-3.5 Sonnet to generate the code for a toy LLM for me, along with code to train it on Shakespeare’s corpus.

This is the code for a simple “transformer” neural network, which is a lower-parameter and simpler version of the same model architecture used in most of the state-of-the-art LLMs:

import torch
import torch.nn as nn

class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=64, nhead=2, num_layers=2):
        super(SimpleTransformer, self).__init__()
        # Token embeddings plus a learned positional embedding (up to 1000 positions)
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = nn.Embedding(1000, d_model)
        encoder_layers = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=128, batch_first=True)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, num_layers)
        # Project each position's hidden state back onto the vocabulary
        self.fc_out = nn.Linear(d_model, vocab_size)

    def forward(self, src, return_activations=False):
        # One position index per token, repeated for every sequence in the batch
        positions = torch.arange(0, src.size(1), device=src.device).unsqueeze(0).expand(src.size(0), -1)
        embedded = self.embedding(src) + self.pos_encoder(positions)
        encoder_output = self.transformer_encoder(embedded)
        output = self.fc_out(encoder_output)

        if return_activations:
            return output, embedded, encoder_output
        return output

Before training the model, I asked it to generate text based on the following input:

To be, or not to be, that is the question:

This is what our SimpleTransformer model generated:

Qow,lLxRPJ'wQOImwAYOa-avDeI,a?x,xC
laBQU-,P,vFKWiH:KfJBqSgFQ&o
FhvOJKEsjBQPlDd&;nnn!twyjb!YMVjkzHJMnkcBOcmF$W'&jXacilcMFFTCk&Xwg;jHB'sw:aYYUjih'iJPiFUbBacs-FvyDv;$haMP!ZMx-HAzjdpfgK''Ak!bObmoj,3!xvLcw

Not very interesting, right? The only reason the model generates anything at all is that its parameters are initialised with random values before training. This is another trick that researchers have stumbled on that just works.
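For reference, the “generate text” step above boils down to repeatedly sampling the next token from the model’s output distribution and appending it to the prompt. Here is a minimal sketch of that loop, assuming character-level char_to_idx and idx_to_char lookup tables built from the corpus (these helpers are not shown in this post):

import torch

def generate(model, prompt, char_to_idx, idx_to_char, max_new_tokens=100):
    """Sample text from the toy model one character at a time (sketch)."""
    model.eval()
    ids = torch.tensor([[char_to_idx[c] for c in prompt]])
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(ids)                       # (1, seq_len, vocab_size)
            probs = torch.softmax(logits[0, -1], dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
            ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return "".join(idx_to_char[i] for i in ids[0].tolist())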

If we look at the neurons**** that were activated in the untrained model we see the same degree of randomness we saw in the output. They are all over the place:

Mapping the token inputs to the corresponding model parameters before training the model.
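As an aside, activation maps like the one above can be pulled out of the toy model via the return_activations flag defined earlier. A rough sketch, assuming a tokenised prompt input_ids of shape (1, seq_len) on CPU (the exact plotting code isn’t shown in the post):

import torch
import matplotlib.pyplot as plt

model.eval()
with torch.no_grad():
    # return_activations=True also returns the intermediate tensors
    logits, embedded, encoder_output = model(input_ids, return_activations=True)

# encoder_output[0] has shape (seq_len, d_model): one row of activations per input token
plt.imshow(encoder_output[0].numpy(), aspect="auto", cmap="viridis")
plt.xlabel("hidden units (d_model)")
plt.ylabel("input token position")
plt.title("Encoder activations for the input prompt")
plt.show()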

In this case, we’re training the model to complete sentences, so for an input like “To be, or not to be, that is the question:”, we’d show the model “Whether ’tis nobler in the mind to suffer” as output.
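To make that concrete, a training loop for the toy model could look roughly like the sketch below. The dataset and vocab_size variables are assumptions for illustration (Claude’s actual training script isn’t reproduced here), and the loss function is the cross-entropy loss discussed later on:

import torch
import torch.nn as nn

# Minimal sketch of the training loop. Assumes `dataset` is a list of
# (input_ids, target_ids) LongTensor pairs of shape (1, seq_len), tokenised and
# padded to the same length, and `vocab_size` matches that tokenisation.
model = SimpleTransformer(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):  # the "10 iterations" over the corpus
    total_loss = 0.0
    for input_ids, target_ids in dataset:
        optimizer.zero_grad()
        logits = model(input_ids)  # (1, seq_len, vocab_size)
        # Compare the prediction at every position against the target token there
        loss = loss_fn(logits.view(-1, logits.size(-1)), target_ids.view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"epoch {epoch + 1}: average loss {total_loss / len(dataset):.3f}")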

Now let’s look at the output after training the model for 10 iterations — after asking it to look at each sentence pair in Shakespeare’s corpus 10 times.

uqubtt ub, u ob,
nnbttobnnottinottototttin ntiaiatiiaiia unst ie ttonty,
osoiatoeoobibttiu,iril utttolybnottettttehootimt
intitoebuieiuiiioteiouiiinatiantoieisuianubeienltctirb'iiniitiuiuilt,ltiilbbii

Admittedly, Shakespeare probably said it better — but you can already see the model is starting to learn patterns from the English-language corpus it has been training on. There are no more random or uppercase characters, and it’s introduced spacing and commas in the generated text.

Let’s see what this looks like at the level of individual neurons — the parameters of the model that determine the output it will generate:

The token inputs now activate different neurons in the neural network.

It shows the emergence of the first patterns in the neurons, and a higher level of contrast in the activations than before, given the same input.

The neat thing researchers at OpenAI found about the transformer architecture, the discovery that basically kickstarted this whole LLM craze, was that if you increase the number of parameters and the size of the training data enough, your LLM will actually start to generate high-quality, syntactically and semantically correct text!

But just having a working model of human languages isn’t enough. If you’ve ever interacted with an LLM that has only been “pre-trained”, you’ll know that its generations will often miss the point completely, the model won’t know when to stop generating text, and generations will very likely devolve into complete gibberish at some point.

This is where instruction tuning and supervised fine-tuning (SFT) come in.

These are a set of techniques that teach LLMs how to respond to human input by showing them examples of text inputs and outputs in a conversational context.

Whereas during pre-training LLMs are shown raw text, instruction tuning data is often conversational in nature, since it needs to teach the LLM how to respond to human inputs. Think of data like question-answer pairs or movie scripts.

Similarly, SFT data is domain- or task-specific, since it needs to teach the LLM how to complete tasks in a certain context or domain (for example in a medical setting).

Training the model on this kind of data provides it with a baseline of human expectations — of the kind of responses humans expect, of how much text it should generate, as well as other domain- or context-specific information humans expect it to have access to for its generations.
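To give a feel for the data, here is a hypothetical instruction-tuning record and one way it might be flattened into a training string. The role markers are made up for illustration; every lab uses its own format:

# A hypothetical instruction-tuning record; real formats vary between labs.
example = {
    "messages": [
        {"role": "user", "content": "Summarise the plot of Hamlet in two sentences."},
        {"role": "assistant", "content": "Prince Hamlet feigns madness while plotting revenge on his uncle Claudius, who murdered Hamlet's father to take the throne. The scheme spirals into tragedy, ending with the deaths of Hamlet, Claudius and most of the Danish court."},
    ]
}

# Flattened into a single training string with simple role markers, so the model
# learns where the human turn ends and where its own answer should begin and stop.
text = ""
for message in example["messages"]:
    text += f"<|{message['role']}|>\n{message['content']}\n"
text += "<|end|>"
print(text)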

A great example of an LLM that in my opinion has been fine-tuned very well is Claude-3.5 Sonnet. My guess is the Anthropic team spent a lot of time curating a high-quality instruction-tuning dataset. This has resulted in a model that produces much more useful generations than GPT-4o.

Since the type of data needed for instruction-tuning is much more rare and harder to come by than data for pre-training, the volumes of data used in this stage are also much smaller — in the tens of millions of examples, rather than the billions or trillions of examples of the internet-scale pre-training data.

Creating instruction and SFT datasets is also where a lot of the budget of LLM providers like Google, OpenAI, Anthropic and Meta is allocated. They often rely on people in low-income countries to manually curate these datasets for them.

A last step that has become a common practice is to teach LLMs our preferences for certain responses by using the feedback users provide. This can only be done after the model has been made available for public use, so data volumes here are often even lower than in the SFT or instruction tuning datasets. The one exception to this rule is OpenAI, because ChatGPT has hundreds of millions of active users (I’m ignoring Google since they have a bit of work to do getting their genAI teams sorted out).

The feedback you submit will be used for preference optimisation (image: ChatGPT)

The techniques used by LLMs to learn from human preferences rely on the fact that due to their stochastic nature, LLMs can generate multiple distinct outputs from the same human input.

However, in order to take advantage of this fact and teach the LLM which output users prefer, researchers have to first learn a model of these human preferences. As we have seen, LLMs by themselves are trained to predict words in sentences. They have no idea what individual humans might be interested in.

In fact, all the “knowledge” on what humans find interesting stored in their model parameters is a byproduct of them learning patterns in human language.

So in order to teach LLMs user preferences (“optimise” them, in jargon), we first need to be able to model user preferences. This is usually done as part of a technique called reinforcement learning from human feedback, in which a separate reward model learns which of all the possible LLM generations are preferred by users.

All their “knowledge” of us humans is a byproduct of LLMs learning patterns in human language.

Once a good model of human preferences has been learnt, it can be used to directly improve the LLM’s output by tweaking (“fine-tuning”) the layers that determine its final output.

The reward model learns to predict LLM outputs preferred by humans. It is then used to further improve (“fine-tune”) selected parameter layers of the LLM (image: HuggingFace).
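A common way to build such a reward model, sketched below under the assumption that we already have fixed-size embeddings of the candidate responses, is to train a small scoring network on pairs of generations where a human has marked one as preferred, using a pairwise ranking loss. This is only an illustration of the idea, not any provider’s actual implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: maps a response embedding to a single scalar score."""
    def __init__(self, d_model=64):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)

    def forward(self, response_embedding):
        return self.scorer(response_embedding).squeeze(-1)

def preference_loss(reward_model, preferred_emb, rejected_emb):
    # Pairwise ranking loss: push the score of the human-preferred response
    # above the score of the rejected one.
    return -F.logsigmoid(reward_model(preferred_emb) - reward_model(rejected_emb)).mean()

# Usage sketch: the embeddings would come from the LLM's hidden states for two
# candidate answers to the same prompt; random tensors stand in for them here.
reward_model = RewardModel()
preferred_emb = torch.randn(8, 64)
rejected_emb = torch.randn(8, 64)
loss = preference_loss(reward_model, preferred_emb, rejected_emb)
loss.backward()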

Most LLMs used today are trained with a combination of these three techniques. AI researchers are working on novel approaches such as self-play (where LLMs learn by talking to each other or to themselves), but the current generation of LLMs is trained using pre-training, supervised and/or instruction tuning, and preference optimisation methods.

These techniques map naturally to the datasets available — internet-scale raw text data for learning human languages, curated data for learning how to respond, and data generated from human interactions to learn which responses humans prefer.

The strange thing is that researchers today don’t really know how LLMs generate their outputs. There are two main issues. One is the size and complexity of these LLMs. That makes figuring out which of the tens of billions of parameters are reacting to inputs and shaping the outputs of LLMs a very hard task. Researchers at Anthropic have been making some interesting inroads using a technique called dictionary learning, which we’ll discuss in the next section.


The second issue is the empirical nature of AI research. A lot of the canonical techniques and tricks used to train LLMs have been discovered by researchers in AI labs around the world trying a bunch of different things and seeing which one would stick. In this sense, AI research is a lot closer to an engineering discipline than a lot of researchers and professors would have you believe. We’ll dive into the implications of this approach for the “AI revolution” in part three.

One of the main questions AI researchers have been struggling with is how the neurons of LLMs — the learnt mathematical representations — map to semantic units in human language. In other words, how neurons in an artificial neural network map to concepts like “trees”, “birds”, and “polynomial equations” — concepts that neuroscientists have shown to have a biological basis in our neural substrates.

The main issue is that the same neuron in a neural network can activate for many different inputs — e.g. you’d see the same neuron fire whether the input is Burmese characters, math equations, or abstract Chinese nouns*****. This makes it pretty much impossible for us humans to interpret what is going on inside an LLM.

At Anthropic, they’ve tried to tackle this problem using a method called dictionary learning. The key idea driving this line of research is the hypothesis that the neural networks we end up with after training an LLM are actually compressed versions of higher-dimensional neural networks: that somewhere during training, neurons become “superimposed” onto each other.

A key feature of the “superposition hypothesis” is that neurons of LLMs will take on different semantic meaning depending on the input vector (image source: Anthropic, 2023).

This would mean that the neurons of LLMs are polysemantic, which is exactly the problem we were trying to solve! For the details of dictionary learning and the method they used to disentangle the semantic units of a neural network (its “features”), I highly recommend reading their well-written blogpost on this.
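The tool Anthropic uses for this dictionary learning is a sparse autoencoder trained on the model’s activations. The sketch below is a heavily simplified version of that idea, not their actual setup: activations are re-encoded into a much wider, mostly-zero feature vector and then reconstructed, with a sparsity penalty that encourages each feature to stand for one interpretable direction:

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy dictionary-learning module: expands d_model activations into a much
    larger, sparsely-active feature space and reconstructs them from it."""
    def __init__(self, d_model=64, n_features=1024):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # non-negative, mostly zero
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff=1e-3):
    # The reconstruction term keeps the features faithful to the original activations;
    # the L1 penalty pushes most features to zero so that individual features can
    # line up with single, more interpretable directions in activation space.
    mse = ((reconstruction - activations) ** 2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

# Usage sketch on activations captured from the toy model earlier:
# features, reconstruction = sae(encoder_output.reshape(-1, 64))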

Just because it works doesn’t mean it’s understood (image: Anthropic, 2023)

I’m not a computer science major, so when I think of compression I think of something like gzip. Ignoring for a moment that this (compression, not gzip) is the foundation of all modern information theory, it’s very hard to see how a simple step like compressing a neural network can lead to the reasoning abilities we see in top-of-the-line LLMs.

The thing that is most astounding to me — which is mentioned in a side-note in the Anthropic write-up — is that this type of compression is known to occur only when neural networks are trained with a specific function to reduce prediction errors called “cross-entropy loss”:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{c=1}^{C} y_{i,t,c} \, \log \hat{y}_{i,t,c}$$

where:

  • N is the number of sequences in the batch.
  • T is the length of the target sequence.
  • C is the number of classes (vocabulary size).
  • y_{i,t,c} is a binary indicator (0 or 1) if the target token at position t in sequence i is class c.
  • \hat{y}_{i,t,c} is the predicted probability that the token at position t in sequence i is class c.

This formula quantifies the prediction error of an LLM by assigning a numerical value to the sequence-to-sequence mappings it generates (the input and output examples used when training the LLM).
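In code, this is the quantity PyTorch’s built-in cross-entropy computes from the model’s logits; a small sketch with made-up dimensions, just to tie the formula back to the toy training setup above:

import torch
import torch.nn.functional as F

# N sequences of length T, each position scored over a vocabulary of C tokens
N, T, C = 2, 5, 30
logits = torch.randn(N, T, C)           # unnormalised predictions (softmax gives the \hat{y} values)
targets = torch.randint(0, C, (N, T))   # indices of the true tokens (where y_{i,t,c} = 1)

# F.cross_entropy applies softmax and the negative log-likelihood in one step,
# averaging the per-token terms of the sum in the formula above.
loss = F.cross_entropy(logits.view(-1, C), targets.view(-1))
print(loss)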

And somehow along the way we end up with technological artefacts that are able to reason through and solve problems at practically the same level as humans!

Nothing in the way humans use language suggests this has happened before. I’ve done a lot of research on the evolution of languages over time, and on how languages relate to knowledge systems, and can’t think of any historical process that would generate the same kind of cultural compression that training a neural network does. Even time itself doesn’t result in anything like this.

A projection of the scaling laws for transformer models(image: Leopold Aschenbrenner, Situational Awareness)

Part of my bewilderment stems from the fact that language as a means of communication has many flaws. It’s not a pure, exact or even particularly successful representation of human thoughts, of our internal states. States that also happen to be embodied in a central nervous system and in biomolecular processes that have taken 4 billion years to refine.


But somehow, LLMs trained on text — on technological artefacts produced in the technology that is language — are able to pick up on enough patterns to mimic human reasoning and problem-solving skills.

It’s still quite astounding to see LLMs reason through problems when I am building AI applications that leverage their reasoning capabilities.

One possible explanation I’ve read for the massive jumps in reasoning capabilities from GPT-2 to GPT-3.5 and beyond is that researchers started including source code data in the training datasets of LLMs. While this seems plausible, I haven’t come across any clear evidence that this is really what is happening.

I guess you could look at evolution as a form of compression, of iterating over traits in the same way an LLM iterates over the “features” found by Anthropic researchers. The main difference — and where the analogy breaks down — is that the traits that have been most successful in natural selection combine effectiveness to cope with a specific environment with adaptability to new environments.******

It is unclear at this point how well LLMs will work in agentic systems that need to do a lot of context switching, since this is an ongoing area of research in both industry and academia. My personal experience is that LLMs require a lot of guardrails to ensure they perform even reasonably well in any given context.

In this sense, compression is definitely not producing the same results as natural selection — LLMs miss the kind of information-seeking drive all living beings have.

What does all this mean for the future of AI? For one, to me at least it is very clear we haven’t yet “solved” AI, AGI, superintelligence, or whatever else you want to call it with our current set of machine learning methods.

Even though people like Leopold Aschenbrenner make a very convincing case the path towards superintelligence is scaling compute, I don’t think the only thing holding back vLLMs (very Large Language Models) from taking over the world is the sandbox in which they are deployed.

People using vLLMs the right way are a different thing altogether, obviously.

I think we need some major innovations in algorithms and representation learning before we will have truly autonomous agents — ”AI” in the sci-fi sense of the word.

In LLMs, as I hope has become clear from reading this blogpost, the information-seeking behaviour is an after-thought, bolted on by humans during preference optimisation like the guardrails that make GPT-4o refrain from generating racist, sexist and other reputationally damaging outputs.

In fact, most of the successful neural network solutions in the domain of computer games — where neural networks are allowed to act on their environments — have been combinations of neural networks and reinforcement learning. Large neural networks (like LLMs) learn to process and compress environmental data, and the reinforcement learning model then learns how to act on the environment using this compressed representation of the environment.

In all of these applications, it is the reinforcement learning agents that drive the exploration, information seeking and acting, and they are horribly inefficient.

So how should we look at the rise of LLMs? Is this a moonshot like the Apollo program, as Leopold Aschenbrenner and many others in Silicon Valley would have us believe? Or is it something closer to the dot-com bubble, where there are real use cases and business cases for the technology, but they will take a lot longer to realise and be a lot less transformative than AI marketing gurus would have us believe?

I think — but I could well be wrong — that a more fruitful way to look at LLMs is to view them through the lens of the technological breakthrough of a different era — that of the industrial revolution.

The main driving force of social, technological and economic change in that period was the steam engine. The switch from biological energy sources to fossil fuels enabled us to concentrate much more kinetic energy into much smaller containers, culminating in the automobiles, airplanes and spaceships breaking down physical distances for humanity today.

In the same way, LLMs could be the steam engines of the information age, allowing us to switch our cultural evolution from one technology — language — to another — computing. The issue that we then run into is one voiced by numerous smart people around the world, namely what problem does this solve?

This image has been making the rounds on social media recently. Seems like a valid point to me. I’ve also written about this in a previous blogpost.

There is a case to be made for LLMs to automate or augment a lot of the knowledge work that is currently driving the information economy, allowing us to spend more time away from our devices, working on things that have a more direct impact on our own and our groups’ social, cultural and economic wellbeing. This would, in my mind at least, be a very positive outcome, given that I believe none of us were brought into this world to stare at a computer screen 8+ hours a day.

Such a change would of course also result in a massive period of disruption, the biggest humanity has ever seen given the number of people currently roaming the earth (England had around 6 million inhabitants at the start of the industrial revolution in 1750, 16.7 million in 1851, and 56 million today).

Either way, judging by how I’ve seen LLMs perform in the day-to-day, we’re not there yet. I think we need further innovations in AI before computers can be trusted to act correctly and competently on your input.

Maybe the distant past is not that far away? (photo taken at TNW Amsterdam 2024)


Source link: https://medium.com/@denominations/how-llms-learn-what-we-know-what-we-dont-yet-know-and-what-comes-next-eecf3f04f78b?source=rss——ai-5
