Thank you, sir~!!!
Hey, I am a rookie in the world of programming and I am really curious to learn new things. What are the prerequisites if I want to understand what Andrej will teach in this video?
Thank you for all the information. Truly! I hate to be that guy but, wow, you make tons of unnecessary clicks... it's so annoying. lol
Thanks again for the video!
It is difficult to comprehend how lucky we are to have you teaching us. Thank you, Andrej.
Thank you so much, Andrej! This was a big joy to watch. No need for an A100: the 10M parameter model trains fine on a MacBook (M1 Max), as the log and sample below show (a device-selection sketch follows the sample):
2024-03-01T11:04:20.855197: 10.788929 M parameters
2024-03-01T11:05:05.557170: step 0: train loss 4.2849, val loss 4.2823
2024-03-01T11:09:37.926982: step 500: train loss 2.0008, val loss 2.0863
2024-03-01T11:14:12.468949: step 1000: train loss 1.5974, val loss 1.7752
2024-03-01T11:18:58.406528: step 1500: train loss 1.4382, val loss 1.6382
2024-03-01T11:23:38.201690: step 2000: train loss 1.3417, val loss 1.5709
2024-03-01T11:28:22.162644: step 2500: train loss 1.2798, val loss 1.5286
2024-03-01T11:33:10.709514: step 3000: train loss 1.2263, val loss 1.5048
2024-03-01T11:38:02.321640: step 3500: train loss 1.1827, val loss 1.4917
2024-03-01T11:42:50.745175: step 4000: train loss 1.1447, val loss 1.4825
2024-03-01T11:47:36.918182: step 4500: train loss 1.1075, val loss 1.4720
2024-03-01T11:52:23.090766: step 4999: train loss 1.0760, val loss 1.4867
2024-03-01T11:52:23.612078
ESCALUS:
Enone, changle there I stay it so: it is off alls
then revenge that I say for the lambs
full deep: their counsel duty head, stopes,
babes.
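For anyone curious how a run like the one above targets the M1 GPU: here is a minimal, assumed device-selection pattern for a nanoGPT-style PyTorch script. The commenter's exact configuration is not shown in the log, so treat this as a sketch rather than their setup.

```python
import torch

# Assumed device pick for a nanoGPT-style script: prefer CUDA, then the
# Apple-silicon MPS backend (what an M1 Max would use), else fall back to CPU.
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

model = torch.nn.Linear(8, 8).to(device)   # stand-in for the real 10M-parameter model
x = torch.randn(2, 8, device=device)
print(device, model(x).shape)
```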
Thank You Andrej
Has anyone progressed with the suggested exercises? I'm stuck on all of them.
Thanks a lot.
Can anybody tell me what prerequisite knowledge I should have to get a hold of this?
Thank you very much. It was so helpful.
love you xx
ELI5 Abstract
Imagine you want to teach a computer to write like a famous author.
* We start small: The computer first learns how to predict the next
letter in a word. It's like a baby learning the alphabet!
* Making friends: Then, we teach it to pay attention to the letters
around it, so it understands words in a sentence better. It's like
friends helping each other learn.
* Bigger and better: We make our computer brain much bigger, and
give it lots of stories to practice with. Now it can write things
that sound like the famous author, even if they don't make perfect
sense yet.
* Secret superpower: Big computers like the one that makes ChatGPT
have extra lessons. They learn to answer questions, follow
instructions, and be more helpful, just like how kids keep learning
new things at school.
This video shows you the building blocks of how it works. With some
code and practice, maybe you can teach a computer to write something
awesome too!
Abstract
This video transcript presents a series of discussions and code
implementations centered on building a simplified Transformer-based
language model. Drawing inspiration from the concepts behind ChatGPT,
the focus is on understanding the core principles of Transformers for
natural language processing.
The exploration begins with a simple bigram language model, used as a
baseline for comparison as the concepts evolve toward the Transformer
architecture. Initial emphasis is placed on how tokens can communicate
with each other for contextualized predictions, evolving from basic
averaging to the more powerful self-attention mechanism. Key concepts
like positional embeddings, masking, and scaling of attention scores
are introduced.
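The masked, scaled self-attention step described above can be sketched in a few lines of PyTorch. This is a minimal illustration in the spirit of the video's code, with illustrative sizes, not the exact nanoGPT implementation:

```python
import torch
import torch.nn.functional as F

B, T, C, head_size = 4, 8, 32, 16
x = torch.randn(B, T, C)                             # token representations

key = torch.nn.Linear(C, head_size, bias=False)
query = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)

k, q, v = key(x), query(x), value(x)                 # each (B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size ** -0.5    # scaled attention scores, (B, T, T)
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))      # mask: a token only sees the past
wei = F.softmax(wei, dim=-1)                         # each row sums to 1
out = wei @ v                                        # weighted aggregation, (B, T, head_size)
```

The softmax rows generalize the basic averaging mentioned above: instead of weighting all past tokens equally, the weights are computed from the data.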
To improve the model's optimization, residual connections and
LayerNorm techniques are incorporated. The model is then scaled up,
demonstrating the ability of the Transformer architecture to achieve
more nuanced language generation with larger datasets and
hyperparameter tuning.
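Those residual connections and LayerNorm wrap the attention and feed-forward sub-layers roughly as follows (pre-norm form, as in the video). This sketch uses PyTorch's built-in nn.MultiheadAttention with a causal mask purely to show the wiring; the video builds its own attention heads instead:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm Transformer block: x + attn(LN(x)), then x + ffwd(LN(x))."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # hide future positions
        a = self.ln1(x)
        x = x + self.attn(a, a, a, attn_mask=causal, need_weights=False)[0]  # residual around attention
        x = x + self.ffwd(self.ln2(x))                                       # residual around feed-forward
        return x

x = torch.randn(4, 8, 64)        # (batch, time, channels)
print(Block(64, 4)(x).shape)     # torch.Size([4, 8, 64])
```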
Finally, the discussion touches on the distinction between
encoder-decoder Transformers and decoder-only models like GPT, as
well as the multi-stage training process behind a model like ChatGPT,
including pre-training on vast amounts of data and subsequent
fine-tuning with reward models. The focus on practical code
implementation with nanoGPT provides a strong foundation for further
exploration of these advanced concepts in natural language
processing.
"infinite Shakespeare" 😀
Now I get why OpenAI engineers get paid so much
It's by far the best video on Transformers. Thank you so much! I have a question on position encoding. When we do x = tok_emb + pos_emb, don't we lose the position encoding as that data gets merged with tok_emb? Shouldn't we save it as a separate dimension (like a (B, T, 2C) matrix)?
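For context, the line in question adds two learned embedding tables of the same width C. A minimal sketch with illustrative sizes (the variable names follow the video):

```python
import torch
import torch.nn as nn

vocab_size, block_size, n_embd = 65, 8, 32
token_embedding_table = nn.Embedding(vocab_size, n_embd)
position_embedding_table = nn.Embedding(block_size, n_embd)

idx = torch.randint(vocab_size, (4, block_size))               # (B, T) token indices
tok_emb = token_embedding_table(idx)                           # (B, T, C): what each token is
pos_emb = position_embedding_table(torch.arange(block_size))   # (T, C): where each token sits
x = tok_emb + pos_emb                                          # (B, T, C), broadcast over the batch
```

In this design the positional signal shares the same C channels as the token identity and is simply summed; keeping it as a separate axis such as (B, T, 2C) would be an alternative, but the summed form is what the video (and the original Transformer paper) uses.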
Go forward and transform, I'll tattoo that.
Thank you, Andrej! You're so passionate about your job. It was 11:00 am when you started coding. Now it's dark in here and you're still trying to teach! 🙏
The community is blessed to have people like you.
Can you guys explain in detail the "head_size" hyperparameter mentioned at timestamp 1:05:10? What is the difference between this head_size and the multi-head diagram in Attention Is All You Need?
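For reference, head_size in the video is the per-head dimension of the keys, queries, and values; the multi-head diagram in Attention Is All You Need corresponds to running several such heads in parallel and concatenating their outputs. A rough sketch along those lines (illustrative sizes, not the exact nanoGPT code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One causal self-attention head with its own head_size-dimensional K, Q, V."""
    def __init__(self, n_embd, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

    def forward(self, x):
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.size(-1) ** -0.5        # scaled scores, (B, T, T)
        T = x.size(1)
        tril = torch.tril(torch.ones(T, T, device=x.device))
        wei = F.softmax(wei.masked_fill(tril == 0, float('-inf')), dim=-1)
        return wei @ v                                            # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    """num_heads heads of head_size each; outputs concatenate back to n_embd."""
    def __init__(self, n_embd, num_heads):
        super().__init__()
        head_size = n_embd // num_heads                           # e.g. 32 // 4 = 8
        self.heads = nn.ModuleList([Head(n_embd, head_size) for _ in range(num_heads)])

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)      # (B, T, n_embd)

x = torch.randn(4, 8, 32)
print(MultiHeadAttention(32, 4)(x).shape)                         # torch.Size([4, 8, 32])
```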
Can you please tell me which of your videos I should watch to understand the code in depth? Like when you said, "I have already explained this in my previous video."