Web Excursions 2023-02-18
What Is ChatGPT Doing … and Why Does It Work?
My purpose here is to give a rough outline of what’s going on inside ChatGPT—
and then to explore why it is that it can do so well in producing what we might consider to be meaningful text.
What ChatGPT is always fundamentally trying to do is to produce a “reasonable continuation” of whatever text it’s got so far
when ChatGPT does something like write an essay, what it’s essentially doing is
just asking over and over again “given the text so far, what should the next word be?”—and each time adding a word.
(More precisely it’s adding a “token”, which could be just a part of a word, which is why it can sometimes “make up new words”.)
but which word should it actually pick to add to the essay (or whatever) that it’s writing?
if we always pick the highest-ranked word, we’ll typically get a very “flat” essay that never seems to “show any creativity” (and even sometimes repeats itself word for word).
But if sometimes (at random) we pick lower-ranked words, we get a “more interesting” essay.
there’s a particular so-called “temperature” parameter that determines how often lower-ranked words will be used,
and for essay generation, it turns out that a “temperature” of 0.8 seems best.
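A minimal sketch of what that temperature-controlled sampling might look like (the example words and probabilities here are made up, and real systems work with the model’s log-probabilities):

```python
import math, random

def sample_next_word(word_probs, temperature=0.8):
    """Pick a next word, sharpening or flattening the distribution by temperature."""
    words = list(word_probs)
    # Dividing log-probabilities by the temperature: values below 1 favor the
    # top-ranked word; values above 1 let lower-ranked words through more often.
    logits = [math.log(word_probs[w]) / temperature for w in words]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]  # softmax, shifted for numerical stability
    return random.choices(words, weights=weights, k=1)[0]

# Hypothetical next-word probabilities from a language model:
probs = {"learn": 0.5, "predict": 0.3, "create": 0.15, "dream": 0.05}
print(sample_next_word(probs, temperature=0.8))
```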
Where Do the Probabilities Come From?
if we take a large enough sample of English text, we can expect to eventually get at least fairly consistent results
With enough English text we can get pretty good estimates not just for probabilities of single letters or pairs of letters (2-grams), but also for longer runs of letters.
Just like with letters, we can start taking into account not just probabilities for single words but probabilities for pairs or longer n-grams of words.
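As a toy illustration of where such estimates come from, here is a sketch of computing word 2-gram probabilities by simple counting (a real estimate would of course need vastly more text):

```python
from collections import Counter

def bigram_probabilities(words):
    """Estimate P(next word | word) by counting adjacent pairs in a corpus."""
    pair_counts = Counter(zip(words, words[1:]))
    word_counts = Counter(words[:-1])
    return {(w1, w2): c / word_counts[w1] for (w1, w2), c in pair_counts.items()}

corpus = "the cat sat on the mat and the cat ran".split()
probs = bigram_probabilities(corpus)
print(probs[("the", "cat")])  # 2 of the 3 occurrences of "the" are followed by "cat"
```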
But here’s the problem: there just isn’t even close to enough English text that’s ever been written to be able to deduce those probabilities
with 40,000 common words, even the number of possible 2-grams is already 1.6 billion—and the number of possible 3-grams is 60 trillion.
So there’s no way we can estimate the probabilities for all of these from the text that’s out there.
And by the time we get to “essay fragments” of 20 words, the number of possibilities is larger than the number of particles in the universe, so in a sense they could never all be written down.
So what can we do?
The big idea is to make a model that lets us estimate the probabilities with which sequences should occur—even though we’ve never explicitly seen those sequences in the corpus of text we’ve looked at.
And at the core of ChatGPT is precisely a so-called “large language model” (LLM) that’s been built to do a good job of estimating those probabilities.
How do we figure out how long it’s going to take an object to fall from a floor we don’t explicitly have data about?
In this particular case, we can use known laws of physics to work it out.
But say all we’ve got is the data, and we don’t know what underlying laws govern it.
Then we might make a mathematical guess—say, using a straight line as a model
How did we know to try using a straight line here?
At some level we didn’t.
It’s just something that’s mathematically simple, and we’re used to the fact that lots of data we measure turns out to be well fit by mathematically simple things.
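As a sketch of that kind of guess, here is a least-squares straight-line fit over made-up (floor height, fall time) measurements, used to “predict” a floor that was never measured:

```python
import numpy as np

# Made-up (floor height, fall time) observations standing in for real measurements.
heights = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
times = np.array([1.5, 2.1, 2.4, 2.9, 3.2])

# Fit the straight-line model t = a*h + b by least squares.
a, b = np.polyfit(heights, times, deg=1)

# "Predict" the fall time from a floor we never measured.
print(a * 35.0 + b)
```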
another human-like task: recognizing images.
if we treat the gray-level value of each pixel here as some variable xᵢ, is there some function of all those variables that—when evaluated—tells us what digit the image is of?
It turns out that it’s possible to construct such a function.
Not surprisingly, it’s not particularly simple, though.
And a typical example might involve perhaps half a million mathematical operations.
we have a “good model” if the results we get from our function typically agree with what a human would say
Invented—in a form remarkably close to their use today—in the 1940s, neural nets can be thought of as simple idealizations of how brains seem to work.
but how does a neural net like this “recognize things”?
The key is the notion of attractors.
We somehow want all the 1’s to “be attracted to one place”, and all the 2’s to “be attracted to another place”.
Or, put a different way, if an image is somehow “closer to being a 1” than to being a 2, we want it to end up in the “1 place” and vice versa.
let’s say we have certain positions in the plane, indicated by dots (in a real-life setting they might be positions of coffee shops).
Then we might imagine that starting from any point on the plane we’d always want to end up at the closest dot
We can represent this by dividing the plane into regions (“attractor basins”) separated by idealized “watersheds”
We can think of this as implementing a kind of “recognition task”
in which we’re not doing something like identifying what digit a given image “looks most like”—
but rather we’re just, quite directly, seeing what dot a given point is closest to.
(The “Voronoi diagram” setup we’re showing here separates points in 2D Euclidean space;
the digit recognition task can be thought of as doing something very similar—
but in a 784-dimensional space formed from the gray levels of all the pixels in each image.)
Our goal is to take an “input” corresponding to a position {x,y}—
and then to “recognize” it as whichever of the three points it’s closest to.
Or, in other words, we want the neural net to compute a function of {x, y} that picks out whichever of the three points is nearest.
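Here is a sketch (with assumed point positions) of the function the net is supposed to approximate—directly returning the index of the nearest point:

```python
import math

# Three hypothetical "attractor" points in the plane (e.g. coffee shop positions).
points = [(-1.0, 1.0), (1.0, 1.0), (0.0, -1.0)]

def nearest_point_index(x, y):
    """The function the net should approximate: which point is closest to {x, y}?"""
    return min(range(len(points)), key=lambda i: math.dist((x, y), points[i]))

print(nearest_point_index(0.2, 0.9))  # 1: this position lies in point 1's attractor basin
```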
Each “neuron” is effectively set up to evaluate a simple numerical function.
And to “use” the network, we simply feed numbers (like our coordinates x and y) in at the top,
then have neurons on each layer “evaluate their functions” and feed the results forward through the network—
eventually producing the final result at the bottom
In the traditional (biologically inspired) setup each neuron effectively has a certain set of “incoming connections” from the neurons on the previous layer,
with each connection being assigned a certain “weight” (which can be a positive or negative number).
The value of a given neuron is determined by multiplying the values of “previous neurons” by their corresponding weights, then adding these up and adding a constant—
and finally applying a “thresholding” (or “activation”) function.
In mathematical terms, if a neuron has inputs x = {x₁, x₂, …} then we compute f(w · x + b),
where the weights w and constant b are generally chosen differently for each neuron in the network;
the function f is usually the same.
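A minimal sketch of one such neuron, assuming a ReLU activation for f (just one common choice) and made-up weights:

```python
import numpy as np

def neuron(x, w, b, f=lambda z: np.maximum(z, 0.0)):
    """One neuron: weighted sum of inputs, plus a constant, through an activation f (here ReLU)."""
    return f(np.dot(w, x) + b)

x = np.array([0.5, -1.2])   # values from neurons on the previous layer
w = np.array([2.0, 1.0])    # one (made-up) weight per incoming connection
print(neuron(x, w, b=0.3))  # f(2.0*0.5 + 1.0*(-1.2) + 0.3) = f(0.1) = 0.1
```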
The neural net of ChatGPT also just corresponds to a mathematical function like this—but effectively with billions of terms.
Can we say “mathematically” how the network makes its distinctions?
Not really. It’s just “doing what the neural net does”.
But it turns out that that normally seems to agree fairly well with the distinctions we humans make.
Whatever input it’s given, the neural net is generating an answer.
And, it turns out, to do it in a way that’s reasonably consistent with what humans might do.
that’s not a fact we can “derive from first principles”.
It’s just something that’s empirically been found to be true, at least in certain domains.
But it’s a key reason why neural nets are useful: that they somehow capture a “human-like” way of doing things.
What about for an (artificial) neural net?
it’s straightforward to see what each “neuron” does when you show a picture of a cat.
But even to get a basic visualization is usually very difficult
we can represent the states of neurons at the first layer by a collection of derived images—
many of which we can readily interpret as being things like “the cat without its background”, or “the outline of the cat”
By the 10th layer it’s harder to interpret what’s going on
But in general we might say that the neural net is “picking out certain features” (maybe pointy ears are among them), and using these to determine what the image is of.
But are those features ones for which we have names—like “pointy ears”? Mostly not.
But at least as of now we don’t have a way to “give a narrative description” of what the network is doing.
And maybe that’s because it truly is computationally irreducible, and there’s no general way to find what it does except by explicitly tracing each step.
Or maybe it’s just that we haven’t “figured out the science”, and identified the “natural laws” that allow us to summarize what’s going on.
We’ll encounter the same kinds of issues when we talk about generating language with ChatGPT.
And again it’s not clear whether there are ways to “summarize what it’s doing”.
But the richness and detail of language (and our experience with it) may allow us to get further than with images.
Machine Learning, and the Training of Neural Nets
what makes neural nets so useful (presumably also in brains) is that
not only can they in principle do all sorts of tasks,
but they can be incrementally “trained from examples” to do those tasks.
Essentially what we’re always trying to do is to find weights that make the neural net successfully reproduce the examples we’ve given.
And then we’re relying on the neural net to “interpolate” (or “generalize”) “between” these examples in a “reasonable” way.
To find out “how far away we are” we compute what’s usually called a “loss function” (or sometimes “cost function”).
Here we’re using a simple (L2) loss function that’s just the sum of the squares of the differences between the values we get, and the true values.
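In code, that loss is simply (a sketch with made-up values):

```python
def l2_loss(predicted, target):
    """Sum of squared differences between network outputs and true values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

print(l2_loss([0.9, 0.2, 0.1], [1.0, 0.0, 0.0]))  # 0.01 + 0.04 + 0.01 = 0.06
```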
And what we see is that as our training process progresses, the loss function progressively decreases (following a certain “learning curve” that’s different for different tasks)—
until we reach a point where the network (at least to a good approximation) successfully reproduces the function we want
now imagine that the weights are variables—say wᵢ.
We want to find out how to adjust the values of these variables to minimize the loss that depends on them.
Numerical analysis provides a variety of techniques for finding the minimum in cases like this.
But a typical approach is just to progressively follow the path of steepest descent from whatever previous weights w₁, w₂ we had
Like water flowing down a mountain, all that’s guaranteed is that this procedure will end up at some local minimum of the surface (“a mountain lake”);
it might well not reach the ultimate global minimum.
It’s not obvious that it would be feasible to find the path of steepest descent on the “weight landscape”.
But calculus comes to the rescue.
one can always think of a neural net as computing a mathematical function—that depends on its inputs, and its weights.
But now consider differentiating with respect to these weights.
It turns out that the chain rule of calculus in effect lets us “unravel” the operations done by successive layers in the neural net.
And the result is that we can—at least in some local approximation—“invert” the operation of the neural net,
and progressively find weights that minimize the loss associated with the output
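A toy sketch of that procedure: gradient descent on an L2 loss for a two-weight “network”, with the gradient written out via the chain rule (the data, model, and learning rate are all made up):

```python
import numpy as np

# Made-up data and a two-weight model: prediction = w1*x + w2.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([1.0, 3.0, 5.0, 7.0])  # generated by w1 = 2, w2 = 1

w = np.array([0.0, 0.0])  # initial weights
lr = 0.05                 # step size down the loss surface

for _ in range(2000):
    err = w[0] * xs + w[1] - ys
    # Gradient of the L2 loss with respect to each weight, via the chain rule.
    grad = np.array([2 * np.sum(err * xs), 2 * np.sum(err)])
    w -= lr * grad        # one step of steepest descent

print(w)                  # approaches [2.0, 1.0]
```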
it turns out that even with many more weights (ChatGPT uses 175 billion) it’s still possible to do the minimization, at least to some level of approximation.
And in fact the big breakthrough in “deep learning” that occurred around 2011 was associated with the discovery that
in some sense it can be easier to do (at least approximate) minimization when there are lots of weights involved than when there are fairly few.
In other words—somewhat counterintuitively—it can be easier to solve more complicated problems with neural nets than simpler ones.
And the rough reason for this seems to be that when one has a lot of “weight variables” one has a high-dimensional space with “lots of different directions” that can lead one to the minimum—
whereas with fewer variables it’s easier to end up getting stuck in a local minimum (“mountain lake”) from which there’s no “direction to get out”.
The Practice and Lore of Neural Net Training
key parts:
what architecture of neural net one should use for a particular task;
how one’s going to get the data on which to train the neural net.
increasingly one isn’t dealing with training a net from scratch:
instead a new net can either directly incorporate another already-trained net, or
at least can use that net to generate more training examples for itself.
the same architecture often seems to work even for apparently quite different tasks
at least for “human-like tasks”—it’s usually better just to try to train the neural net on the “end-to-end problem”,
letting it “discover” the necessary intermediate features, encodings, etc. for itself.
it’s better just to deal with very simple components and let them “organize themselves” (albeit usually in ways we can’t understand) to achieve (presumably) the equivalent of those algorithmic ideas
That’s not to say that there are no “structuring ideas” that are relevant for neural nets.
having 2D arrays of neurons with local connections seems at least very useful in the early stages of processing images.
having patterns of connectivity that concentrate on “looking back in sequences” seems useful in dealing with things like human language, for example in ChatGPT.
one can often get away with a smaller network if there’s a “squeeze” in the middle that forces everything to go through a smaller intermediate number of neurons
it’s a standard strategy to just show a neural net all the examples one has, over and over again.
In each of these “training rounds” (or “epochs”) the neural net will be in at least a slightly different state, and somehow “reminding it” of a particular example is useful in getting it to “remember that example”.
(And, yes, perhaps this is analogous to the usefulness of repetition in human memorization.)
But often just repeating the same example over and over again isn’t enough.
It’s also necessary to show the neural net variations of the example.
And it’s a feature of neural net lore that those “data augmentation” variations don’t have to be sophisticated to be useful.
Just slightly modifying images with basic image processing can make them essentially “as good as new” for neural net training.
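For instance, a sketch of such a basic augmentation—small random shifts and brightness changes on an image array (the particular transformations and parameters are just illustrative):

```python
import numpy as np

def augment(image, rng):
    """Make a 'new' training example by slightly shifting and rescaling an image."""
    dy, dx = rng.integers(-2, 3, size=2)             # small random translation
    shifted = np.roll(image, (dy, dx), axis=(0, 1))
    brightness = rng.uniform(0.9, 1.1)               # slight brightness change
    return np.clip(shifted * brightness, 0.0, 1.0)

rng = np.random.default_rng(0)
image = np.zeros((28, 28))        # stand-in for a grayscale digit image
image[10:18, 12:16] = 1.0
print(augment(image, rng).shape)  # still a 28x28 image, but a "new" example
```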
And, similarly, when one’s run out of actual video, etc.
for training self-driving cars, one can go on and just get data from running simulations in a model videogame-like environment without all the detail of actual real-world scenes.
ChatGPT has the nice feature that it can do “unsupervised learning”, making it much easier to get it examples to train from.
typically the loss decreases for a while, but eventually flattens out at some constant value.
If that value is sufficiently small, then the training can be considered successful;
otherwise it’s probably a sign one should try changing the network architecture.
there are things that can be figured out by formal processes, but aren’t readily accessible to immediate human thinking.
The kinds of things that we normally do with our brains are presumably specifically chosen to avoid computational irreducibility.
It takes special effort to do math in one’s brain.
And it’s in practice largely impossible to “think through” the steps in the operation of any nontrivial program just in one’s brain.
computational irreducibility means that we can never guarantee that the unexpected won’t happen—and it’s only by explicitly doing the computation that you can tell what actually happens in any particular case.
And in the end there’s just a fundamental tension between learnability and computational irreducibility.
Learning involves in effect compressing data by leveraging regularities.
But computational irreducibility implies that ultimately there’s a limit to what regularities there may be.
there’s an ultimate tradeoff between capability and trainability: the more you want a system to make “true use” of its computational capabilities, the more it’s going to show computational irreducibility, and the less it’s going to be trainable.
And the more it’s fundamentally trainable, the less it’s going to be able to do sophisticated computation.
what we should conclude is that tasks—like writing essays—that we humans could do, but we didn’t think computers could do, are actually in some sense computationally easier than we thought.
the reason a neural net can be successful in writing an essay is because writing an essay turns out to be a “computationally shallower” problem than we thought.
And in a sense this takes us closer to “having a theory” of how we humans manage to do things like writing essays, or in general deal with language.
The Concept of Embeddings
One can think of an embedding as a way to try to represent the “essence” of something by an array of numbers—with the property that “nearby things” are represented by nearby numbers.
we can think of a word embedding as trying to lay out words in a kind of “meaning space” in which words that are somehow “nearby in meaning” appear nearby in the embedding.
But how can we construct such an embedding?
Roughly the idea is to look at large amounts of text (here 5 billion words from the web) and then see “how similar” the “environments” are in which different words appear.
“alligator” and “crocodile” will often appear almost interchangeably in otherwise similar sentences, and that means they’ll be placed nearby in the embedding.
But “turnip” and “eagle” won’t tend to appear in otherwise similar sentences, so they’ll be placed far apart in the embedding.
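Nearness in such a space is typically measured with something like cosine similarity; here is a sketch with made-up low-dimensional vectors (real word embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(u, v):
    """Nearness in 'meaning space': 1.0 means the vectors point the same way."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up 4-dimensional embedding vectors, purely for illustration.
embedding = {
    "alligator": np.array([0.9, 0.8, 0.1, 0.0]),
    "crocodile": np.array([0.85, 0.75, 0.15, 0.05]),
    "turnip":    np.array([0.0, 0.1, 0.9, 0.8]),
}

print(cosine_similarity(embedding["alligator"], embedding["crocodile"]))  # close to 1
print(cosine_similarity(embedding["alligator"], embedding["turnip"]))     # much lower
```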
Rather than directly trying to characterize “what image is near what other image”, we instead consider a well-defined task (in this case digit recognition)
for which we can get explicit training data—then use the fact that in doing this task the neural net implicitly has to make what amount to “nearness decisions”.
So instead of us ever explicitly having to talk about “nearness of images” we’re just talking about the concrete question of what digit an image represents,
and then we’re “leaving it to the neural net” to implicitly determine what that implies about “nearness of images”.
so how do we follow the same kind of approach to find embeddings for words?
The key is to start from a task about words for which we can readily do training.
And the standard such task is “word prediction”.
Imagine we’re given “the ___ cat”.
Based on a large corpus of text (say, the text content of the web), what are the probabilities for different words that might “fill in the blank”?
Or, alternatively, given “___ black ___” what are the probabilities for different “flanking words”?
How do we set this problem up for a neural net?
Ultimately we have to formulate everything in terms of numbers.
And one way to do this is just to assign a unique number to each of the 50,000 or so common words in English.
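In its simplest form that’s just a lookup table (a toy sketch with a six-word “vocabulary”):

```python
# Assign each word in a (toy) vocabulary a unique number, so that text
# becomes a sequence of numbers a neural net can consume.
vocabulary = ["the", "black", "cat", "sat", "on", "mat"]
word_to_id = {word: i for i, word in enumerate(vocabulary)}

sentence = "the black cat".split()
print([word_to_id[w] for w in sentence])  # [0, 1, 2]
```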
But actually we can go further than just characterizing words by collections of numbers;
we can also do this for sequences of words, or indeed whole blocks of text.
And inside ChatGPT that’s how it’s dealing with things.
It takes the text it’s got so far, and generates an embedding vector to represent it.
Then its goal is to find the probabilities for different words that might occur next.
And it represents its answer for this as a list of numbers that essentially give the probabilities for each of the 50,000 or so possible words.
Inside ChatGPT
In many ways this is a neural net very much like the other ones we’ve discussed.
But it’s a neural net that’s particularly set up for dealing with language.
And its most notable feature is a piece of neural net architecture called a “transformer”.
The idea of transformers is to do something at least somewhat similar for the sequences of tokens that make up a piece of text (much as convolutional nets exploit the layout of nearby pixels in an image).
But instead of just defining a fixed region in the sequence over which there can be connections, transformers introduce the notion of “attention”—
and the idea of “paying attention” more to some parts of the sequence than others.
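The core computation behind that idea can be sketched as standard scaled dot-product attention—this is the textbook formulation, not ChatGPT’s actual code, and the inputs here are random stand-ins:

```python
import numpy as np

def attention(Q, K, V):
    """Each position takes a weighted average of the values V, 'paying attention'
    more to some positions in the sequence than to others."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # how much each token attends to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sequence
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d = 5, 8                                   # 5 tokens, 8-dimensional features
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
V = rng.normal(size=(seq_len, d))
print(attention(Q, K, V).shape)                     # (5, 8): one mixed vector per token
```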
A critical point is that every part of this pipeline is implemented by a neural network, whose weights are determined by end-to-end training of the network.
In other words, in effect nothing except the overall architecture is “explicitly engineered”;
everything is just “learned” from training data.
it’s part of the lore of neural nets that—in some sense—so long as the setup one has is “roughly right”,
it’s usually possible to home in on details just by doing sufficient training,
without ever really needing to “understand at an engineering level” quite how the neural net has ended up configuring itself.
The original input to ChatGPT is an array of numbers (the embedding vectors for the tokens so far),
and what happens when ChatGPT “runs” to produce a new token is just that these numbers “ripple through” the layers of the neural net,
with each neuron “doing its thing” and passing the result to neurons on the next layer.
There’s no looping or “going back”.
Everything just “feeds forward” through the network.
It’s a very different setup from a typical computational system—like a Turing machine—in which results are repeatedly “reprocessed” by the same computational elements.
Here—at least in generating a given token of output—each computational element (i.e. neuron) is used only once.
But there is in a sense still an “outer loop” that reuses computational elements even in ChatGPT.
Because when ChatGPT is going to generate a new token, it always “reads” (i.e. takes as input) the whole sequence of tokens that come before it, including tokens that ChatGPT itself has “written” previously.
And we can think of this setup as meaning that ChatGPT does—at least at its outermost level—involve a “feedback loop”,
albeit one in which every iteration is explicitly visible as a token that appears in the text that it generates.
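That outer loop can be sketched like this, with model a hypothetical stand-in for the whole feed-forward network:

```python
import random

def generate(model, prompt_tokens, n_new_tokens):
    """The 'outer loop': feed the whole sequence back in, one new token at a time."""
    tokens = list(prompt_tokens)
    for _ in range(n_new_tokens):
        probs = model(tokens)  # one full feed-forward pass; no looping inside
        words = list(probs)
        tokens.append(random.choices(words, weights=[probs[w] for w in words])[0])
        # The token just "written" is part of the input on the next iteration.
    return tokens

# A stand-in model that slightly prefers repeating the last token.
def toy_model(tokens):
    return {tokens[-1]: 0.6, "and": 0.4}

print(generate(toy_model, ["the", "cat"], 5))
```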
The Training of ChatGPT
With modern GPU hardware, it’s straightforward to compute the results from batches of thousands of examples in parallel.
But when it comes to actually updating the weights in the neural net, current methods require one to do this basically batch by batch.
(This is probably where actual brains—with their combined computation and memory elements—have, for now, at least an architectural advantage.)
When we run ChatGPT to generate text, we’re basically having to use each weight once.
So if there are n weights, we’ve got of order n computational steps to do—though in practice many of them can typically be done in parallel in GPUs.
But if we need about n words of training data to set up those weights, then from what we’ve said above we can conclude that we’ll need about n² computational steps to do the training of the network—
which is why, with current methods, one ends up needing to talk about billion-dollar training efforts.
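A back-of-envelope version of that argument, using only the 175-billion-weight figure quoted earlier (the n and n² scalings are the article’s rough estimates, not measured counts):

```python
n = 175e9                 # weights in ChatGPT's network

generation_steps = n      # ~one use of each weight per generated token
training_steps = n ** 2   # ~n words of training data, each needing ~n steps

print(f"~{generation_steps:.1e} steps per token, ~{training_steps:.1e} steps to train")
# ~1.8e+11 steps per token, ~3.1e+22 steps to train
```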
Beyond Basic Training
And a key idea in the construction of ChatGPT was to have another step after “passively reading” things like the web: to have actual humans actively interact with ChatGPT, see what it produces, and in effect give it feedback on “how to be a good chatbot”
now with ChatGPT we’ve got an important new piece of information: we know that a pure, artificial neural network with about as many connections as brains have neurons
is capable of doing a surprisingly good job of generating human language.
language is at a fundamental level somehow simpler than it seems.
And this means that ChatGPT—even with its ultimately straightforward neural net structure—is successfully able to “capture the essence” of human language and the thinking behind it.
The success of ChatGPT is suggesting that we can expect there to be major new “laws of language”—and effectively “laws of thought”—out there to discover.
In ChatGPT—built as it is as a neural net—those laws are at best implicit.
But if we could somehow make the laws explicit, there’s the potential to do the kinds of things ChatGPT does in vastly more direct, efficient—and transparent—ways.
a key “natural-science-like” observation is that the transformer architecture of neural nets like the one in ChatGPT seems to successfully be able to learn the kind of nested-tree-like syntactic structure that seems to exist (at least in some approximation) in all human languages.
Meaning Space and Semantic Laws of Motion
when ChatGPT continues a piece of text, this corresponds to tracing out a trajectory in linguistic feature space.
But now we can ask what makes this trajectory correspond to text we consider meaningful.
And might there perhaps be some kind of “semantic laws of motion” that define—or at least constrain—how points in linguistic feature space can move around while preserving “meaningfulness”?
syntactic grammar gives rules for how words corresponding to things like different parts of speech can be put together in human language.
to deal with meaning, we need to go further.
And one version of how to do this is to think about not just a syntactic grammar for language, but also a semantic one.
From its training ChatGPT has effectively “pieced together” a certain (rather impressive) quantity of what amounts to semantic grammar.
it’s going to be feasible to construct something more complete in computational language form.
And, unlike what we’ve so far figured out about the innards of ChatGPT, we can expect to design the computational language so that it’s readily understandable to humans.
When we talk about semantic grammar, we can draw an analogy to syllogistic logic.