A straightforward summary of how LLMs work
At this point, there are several guides out there that explain how LLMs work. Andrej Karpathy’s YouTube channel is probably the canonical source in this regard, and worth watching.
But if you’re looking for a quicker read, this ngrok blog post shines. The article aims to discuss prompt caching, and along the way it explains LLMs in a clear way, with interactive diagrams and code samples.
Like many (most?) individuals in the software industry, I didn’t really have a clue what LLMs were as Copilot and ChatGPT became commercialized in 2022. Along the way, I’ve sought out videos, blog posts, and other resources to try to make sense of this increasingly fundamental technology.
As a continual learner, I’ve watched some of Karpathy’s videos, but not in a studious way. I have an understanding of how LLMs operate, but I don’t claim to be an expert on how they’re composed at a deeply technical level. This ngrok blog post helped solidify the following concepts for me:
Tokens
The same prompt always results in the same tokens. Tokens are also case-sensitive, and this is because capitalisation tells you something about the word. “Will” with a capital W is more likely to be a name than “will” with a lowercase W, for example.
This makes sense. If a user YELLS at an LLM, then that would be interpreted differently than if they typed in a casual tone. We also often refer to LLMs as “non-deterministic,” which is true, but the tokenization is fully deterministic.
Last thing on tokenizers: there are lots of them! The tokenizer that ChatGPT uses is different to the one Claude uses. Even different models made by OpenAI use different tokenizers.
While there are open source tokenizers out there, tokenization is a key ingredient that contributes to LLM performance. It’s easy to forget that each AI lab has their own way of doing this!
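Those two properties — determinism and case-sensitivity — are easy to demonstrate. Here’s a toy sketch with a tiny, hand-picked vocabulary (real tokenizers like OpenAI’s tiktoken learn their vocabularies from data, and the vocabulary and ids below are entirely made up), but the behavior carries over:

```python
# A toy tokenizer with a hypothetical, hand-picked vocabulary.
# Real tokenizers learn theirs from data, but the same two
# properties hold: same input -> same tokens, and case matters.
VOCAB = {"Will": 0, "will": 1, " you": 2, " come": 3, "?": 4}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenization over the toy vocabulary."""
    tokens = []
    while text:
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece):
                tokens.append(VOCAB[piece])
                text = text[len(piece):]
                break
        else:
            raise ValueError(f"no token for: {text!r}")
    return tokens

# Deterministic: the same prompt always yields the same tokens.
assert tokenize("Will you come?") == tokenize("Will you come?")

# Case-sensitive: "Will" and "will" map to different token ids.
assert tokenize("Will you come?") != tokenize("will you come?")
```

The ids themselves are arbitrary; what matters is that the mapping from text to ids is a pure function, even though everything downstream of it is not.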
Embeddings
Tokens do not have dimensions […]. An embedding is an array of length n representing a position in n-dimensional space.
This captures the relationship between tokens and embeddings succinctly. But how do embeddings capture that one concept is close to a similar concept? Let’s find out.
So we take tokens, an array of integers, and convert them into an array of embeddings. An array of arrays, or a “matrix.”
The higher the dimensionality of the embeddings, the more accurately a model can compare relationships between them. The visuals in the article are fantastic, and I’d recommend clicking through them. The article uses examples of 3-dimensional embeddings, but notes “the biggest ones have more than 10,000.” And that’s probably increased as I write this. Hard to fathom!
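To make “close in n-dimensional space” concrete, here’s a minimal sketch with made-up 3-dimensional embeddings (a real model learns these values during training, and these particular numbers are pure illustration). Cosine similarity — the cosine of the angle between two vectors — is one common way to measure how related two embeddings are:

```python
import math

# Hypothetical 3-dimensional embeddings; real models learn these
# values during training and use thousands of dimensions.
EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# In this toy space, "cat" is closer to "dog" than to "car".
assert cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["dog"]) > \
       cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["car"])
```

Turning a prompt into input for the model is then just a lookup: each token id indexes a row of the embedding matrix, producing the array-of-arrays described above.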
Transformer
Remember when I asked how the LLM “knows” when embeddings are related? This is how: attention. You may have heard of the famous Google paper “Attention Is All You Need.”
The input is an entire prompt’s embeddings, and the output is a single new embedding that is a weighted combination of all of the input embeddings.
Remember, an embedding represents a token as a point in high-dimensional space. The attention mechanism then 1) takes the tokens in the sequence, 2) decides how its attention is allocated amongst those tokens, and 3) uses that allocation to determine which of the tokens to pay close attention to in order to generate the next token in the sequence.
The author points out that given a phrase like “Mary had a little _”, an attention mechanism would direct 63% of its attention to “Mary”. That makes intuitive sense; to predict the next word in the phrase, this token is the one most likely to direct the sentence’s meaning. But the rest of the tokens also matter for yielding the final token, just less so.
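The “weighted combination” idea can be sketched in a few lines. Note the relevance scores below are invented for illustration — in a real transformer they come from learned query/key projections, not hand-picked numbers — but the softmax-then-weighted-sum shape is the real one:

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Turn raw relevance scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores: list[float], embeddings: list[list[float]]) -> list[float]:
    """Weighted combination of the input embeddings -> one new embedding."""
    weights = softmax(scores)
    dims = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings))
            for d in range(dims)]

# Hypothetical scores for the tokens of "Mary had a little _".
tokens = ["Mary", "had", "a", "little"]
scores = [2.0, 0.3, 0.1, 0.8]
weights = softmax(scores)

# Most of the attention lands on "Mary", but every token contributes.
assert weights[0] == max(weights)

# Toy 2-dimensional embeddings, just to show the output shape:
embeddings = [[0.9, 0.1], [0.2, 0.5], [0.1, 0.1], [0.4, 0.7]]
new_embedding = attend(scores, embeddings)  # one embedding, not four
```

The output is a single new embedding blended mostly from “Mary” — exactly the input-in, weighted-combination-out behavior the quote describes.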
The article also discusses prompt caching, but I admittedly bounced off that section. I feel like the above sections underpin the core of how LLMs operate on a fundamental level and that’s what I was hoping to learn from the blog post.
Closing thoughts
There’s so much to learn in the LLM space. They feel like a black box, like magic. But ultimately they’re just super powerful prediction computers, driving instructions through electrified silicon.
I find these kinds of articles and visualizations helpful for better understanding the space, as well as using my own writing to articulate concepts. The author also points to the site Transformer Explainer, which is even more detailed.