
Let’s build GPT: from scratch, in code, spelled out.



We build a Generatively Pretrained Transformer (GPT), following the paper “Attention is All You Need” and OpenAI’s GPT-2 …

20 thoughts on “Let’s build GPT: from scratch, in code, spelled out.”

  1. Hey, I'm a rookie in the world of programming and I'm really curious to learn new things. What are the prerequisites for understanding what Andrej will teach in this video?

  2. Thank you so much Andrej! This was a big joy to watch. No need for an A100: the 10M-parameter model trains fine on a MacBook (M1 Max); training log, generated sample, and a device-selection sketch below:
    2024-03-01T11:04:20.855197: 10.788929 M parameters
    2024-03-01T11:05:05.557170: step 0: train loss 4.2849, val loss 4.2823
    2024-03-01T11:09:37.926982: step 500: train loss 2.0008, val loss 2.0863
    2024-03-01T11:14:12.468949: step 1000: train loss 1.5974, val loss 1.7752
    2024-03-01T11:18:58.406528: step 1500: train loss 1.4382, val loss 1.6382
    2024-03-01T11:23:38.201690: step 2000: train loss 1.3417, val loss 1.5709
    2024-03-01T11:28:22.162644: step 2500: train loss 1.2798, val loss 1.5286
    2024-03-01T11:33:10.709514: step 3000: train loss 1.2263, val loss 1.5048
    2024-03-01T11:38:02.321640: step 3500: train loss 1.1827, val loss 1.4917
    2024-03-01T11:42:50.745175: step 4000: train loss 1.1447, val loss 1.4825
    2024-03-01T11:47:36.918182: step 4500: train loss 1.1075, val loss 1.4720
    2024-03-01T11:52:23.090766: step 4999: train loss 1.0760, val loss 1.4867
    2024-03-01T11:52:23.612078

    ESCALUS:
    Enone, changle there I stay it so: it is off alls
    then revenge that I say for the lambs
    full deep: their counsel duty head, stopes,
    babes.
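
    For anyone who wants to reproduce this on Apple Silicon: below is a minimal,
    hypothetical device-selection sketch, not code from the video. The lecture's
    training script defaults to CUDA when available and otherwise falls back to
    CPU; this sketch simply adds a preference for PyTorch's MPS backend. The
    helper name pick_device is made up for illustration.

      import torch

      # Hypothetical helper (not in the lecture code): choose the best available
      # backend so the ~10M-parameter model can train on an M1/M2 MacBook.
      def pick_device() -> str:
          if torch.backends.mps.is_available():
              return "mps"   # Apple Metal backend
          if torch.cuda.is_available():
              return "cuda"  # NVIDIA GPU
          return "cpu"

      device = pick_device()
      print(f"training on {device}")
      # A nanoGPT-style script would then call model.to(device) and move each
      # (x, y) batch to the same device before the forward pass.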

  3. ELI5 Abstract

    Imagine you want to teach a computer to write like a famous author.

    * We start small: The computer first learns how to predict the next
    letter in a word. It's like a baby learning the alphabet!
    * Making friends: Then, we teach it to pay attention to the letters
    around it, so it understands words in a sentence better. It's like
    friends helping each other learn.
    * Bigger and better: We make our computer brain much bigger, and
    give it lots of stories to practice with. Now it can write things
    that sound like the famous author, even if they don't make perfect
    sense yet.
    * Secret superpower: Big computers like the one that makes ChatGPT
    have extra lessons. They learn to answer questions, follow
    instructions, and be more helpful, just like how kids keep learning
    new things at school.

    This video shows you the building blocks of how it works. With some
    code and practice, maybe you can teach a computer to write something
    awesome too!

    Abstract

    This video transcript presents a series of discussions and code
    implementations centered on building a simplified Transformer-based
    language model. Drawing inspiration from the concepts behind ChatGPT,
    the focus is on understanding the core principles of Transformers for
    natural language processing.

    The exploration begins with a simple bigram language model, used as a
    baseline for comparison as the concepts evolve toward the Transformer
    architecture. Initial emphasis is placed on how tokens can communicate
    with each other for contextualized predictions, evolving from basic
    averaging to the more powerful self-attention mechanism. Key concepts
    like positional embeddings, masking, and scaling of attention scores
    are introduced (a short code sketch of this step follows the abstract).

    To improve the model's optimization, residual connections and
    LayerNorm techniques are incorporated. The model is then scaled up,
    demonstrating the ability of the Transformer architecture to achieve
    more nuanced language generation with larger datasets and
    hyperparameter tuning.

    Finally, the discussion touches on the distinction between decoder-only
    and encoder-decoder Transformers, as well as the multi-stage training
    process behind a model like ChatGPT, including pre-training on vast
    amounts of data and subsequent fine-tuning with reward models. The focus
    on practical code implementation with nanoGPT provides a strong
    foundation for further exploration of these advanced concepts in
    natural language processing.
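
    As a companion to the abstract, here is a minimal sketch of the masked,
    scaled self-attention step it describes, written in PyTorch from the
    concepts named above (causal masking, scaling by 1/sqrt(head_size))
    rather than copied verbatim from the lecture; the sizes B, T, C and
    head_size are illustrative.

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      torch.manual_seed(1337)
      B, T, C = 4, 8, 32        # batch, time (context length), channels
      head_size = 16            # illustrative head dimension

      x = torch.randn(B, T, C)

      # Linear projections for a single self-attention head
      key   = nn.Linear(C, head_size, bias=False)
      query = nn.Linear(C, head_size, bias=False)
      value = nn.Linear(C, head_size, bias=False)

      k = key(x)      # (B, T, head_size)
      q = query(x)    # (B, T, head_size)

      # Attention scores, scaled by 1/sqrt(head_size) so the softmax stays diffuse
      wei = q @ k.transpose(-2, -1) * head_size ** -0.5   # (B, T, T)

      # Causal mask: each token attends only to itself and earlier tokens
      tril = torch.tril(torch.ones(T, T))
      wei = wei.masked_fill(tril == 0, float('-inf'))
      wei = F.softmax(wei, dim=-1)   # rows sum to 1: a learned weighted average

      out = wei @ value(x)           # (B, T, head_size)
      print(out.shape)               # torch.Size([4, 8, 16])

    The evolution the abstract mentions is visible here: replacing the learned
    scores with uniform weights over the unmasked positions recovers the simple
    averaging baseline.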

  4. It's by far the best video on Transformers. Thank you so much! I have a question on position encoding: when we do x = tok_emb + pos_emb, don't we lose the positional encoding, since that data gets merged with tok_emb? Shouldn't we keep it as a separate dimension (like a (B, T, 2C) matrix)?
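
    For context on the shapes in that question, here is a minimal sketch of
    how the sum is formed in a nanoGPT-style forward pass; the sizes are
    illustrative. The (T, C) positional table is broadcast across the batch,
    so the result stays (B, T, C) rather than growing to (B, T, 2C); whether
    a concatenated layout would work better is exactly the question being
    asked.

      import torch
      import torch.nn as nn

      B, T, C = 4, 8, 32               # illustrative batch, context length, embedding size
      vocab_size, block_size = 65, 8

      token_embedding_table    = nn.Embedding(vocab_size, C)
      position_embedding_table = nn.Embedding(block_size, C)

      idx = torch.randint(0, vocab_size, (B, T))            # token indices, (B, T)

      tok_emb = token_embedding_table(idx)                   # (B, T, C)
      pos_emb = position_embedding_table(torch.arange(T))    # (T, C)

      x = tok_emb + pos_emb    # broadcast add -> (B, T, C)
      print(x.shape)           # torch.Size([4, 8, 32])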

  5. Thank you Andrej! You're so passionate about your job. It was 11:00 am when you started coding; now it's dark and you're still teaching! 🙏

  6. Can you please tell me which of your videos I should watch to understand the code in depth? Like when you say "I have already explained this in my previous video."

Comments are closed.