Thank you, sir~!!!
Hey, I am a rookie in the world of programming and I am really curious to learn new things. What are the prerequisites if I want to understand what Andrej will teach in this video?
Thank you for all the information. Truly! I hate to be that guy but, wow, you make tons of unnecessary clicks... it's so annoying. lol
Thanks again for the video!
It is difficult to comprehend how lucky we are to have you teaching us. Thank you, Andrej.
Thank you so much, Andrej! This was a big joy to watch. No need for an A100: the 10M parameter model trains fine on a MacBook (M1 Max), as the log and sample below show (a device-selection sketch follows the sample):
2024-03-01T11:04:20.855197: 10.788929 M parameters
2024-03-01T11:05:05.557170: step 0: train loss 4.2849, val loss 4.2823
2024-03-01T11:09:37.926982: step 500: train loss 2.0008, val loss 2.0863
2024-03-01T11:14:12.468949: step 1000: train loss 1.5974, val loss 1.7752
2024-03-01T11:18:58.406528: step 1500: train loss 1.4382, val loss 1.6382
2024-03-01T11:23:38.201690: step 2000: train loss 1.3417, val loss 1.5709
2024-03-01T11:28:22.162644: step 2500: train loss 1.2798, val loss 1.5286
2024-03-01T11:33:10.709514: step 3000: train loss 1.2263, val loss 1.5048
2024-03-01T11:38:02.321640: step 3500: train loss 1.1827, val loss 1.4917
2024-03-01T11:42:50.745175: step 4000: train loss 1.1447, val loss 1.4825
2024-03-01T11:47:36.918182: step 4500: train loss 1.1075, val loss 1.4720
2024-03-01T11:52:23.090766: step 4999: train loss 1.0760, val loss 1.4867
2024-03-01T11:52:23.612078
ESCALUS:
Enone, changle there I stay it so: it is off alls
then revenge that I say for the lambs
full deep: their counsel duty head, stopes,
babes.
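For anyone curious how a run like the one above targets the M1 GPU: here is a minimal, assumed device-selection pattern for a nanoGPT-style PyTorch script. The commenter's exact configuration is not shown in the log, so treat this as a sketch rather than their setup.

```python
import torch

# Assumed device pick for a nanoGPT-style script: prefer CUDA, then the
# Apple-silicon MPS backend (what an M1 Max would use), else fall back to CPU.
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

model = torch.nn.Linear(8, 8).to(device)   # stand-in for the real 10M-parameter model
x = torch.randn(2, 8, device=device)
print(device, model(x).shape)
```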
Thank You Andrej
Has anyone progressed with the suggested exercises? I'm stuck on all of them.
Thanks a lot.
Can anybody tell me what prerequisite knowledge I should have to get a hold of this?
Thank you very much. It was so helpful.
love you xx
ELI5 Abstract
Imagine you want to teach a computer to write like a famous author.
* We start small: The computer first learns how to predict the next
letter in a word. It's like a baby learning the alphabet!
* Making friends: Then, we teach it to pay attention to the letters
around it, so it understands words in a sentence better. It's like
friends helping each other learn.
* Bigger and better: We make our computer brain much bigger, and
give it lots of stories to practice with. Now it can write things
that sound like the famous author, even if they don't make perfect
sense yet.
* Secret superpower: Big computers like the one that makes ChatGPT
have extra lessons. They learn to answer questions, follow
instructions, and be more helpful, just like how kids keep learning
new things at school.
This video shows you the building blocks of how it works. With some
code and practice, maybe you can teach a computer to write something
awesome too!
Abstract
This video transcript presents a series of discussions and code
implementations centered on building a simplified Transformer-based
language model. Drawing inspiration from the concepts behind ChatGPT,
the focus is on understanding the core principles of Transformers for
natural language processing.
The exploration begins with a simple bigram language model, used as a
baseline for comparison as the concepts evolve toward the Transformer
architecture. Initial emphasis is placed on how tokens can communicate
with each other for contextualized predictions, evolving from basic
averaging to the more powerful self-attention mechanism. Key concepts
like positional embeddings, masking, and scaling of attention scores
are introduced.
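The masked, scaled self-attention step described above can be sketched in a few lines of PyTorch. This is a minimal illustration in the spirit of the video's code, with illustrative sizes, not the exact nanoGPT implementation:

```python
import torch
import torch.nn.functional as F

B, T, C, head_size = 4, 8, 32, 16
x = torch.randn(B, T, C)                             # token representations

key = torch.nn.Linear(C, head_size, bias=False)
query = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)

k, q, v = key(x), query(x), value(x)                 # each (B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size ** -0.5    # scaled attention scores, (B, T, T)
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))      # mask: a token only sees the past
wei = F.softmax(wei, dim=-1)                         # each row sums to 1
out = wei @ v                                        # weighted aggregation, (B, T, head_size)
```

The softmax rows generalize the basic averaging mentioned above: instead of weighting all past tokens equally, the weights are computed from the data.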
To improve the model's optimization, residual connections and
LayerNorm techniques are incorporated. The model is then scaled up,
demonstrating the ability of the Transformer architecture to achieve
more nuanced language generation with larger datasets and
hyperparameter tuning.
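Those residual connections and LayerNorm wrap the attention and feed-forward sub-layers roughly as follows (pre-norm form, as in the video). This sketch uses PyTorch's built-in nn.MultiheadAttention with a causal mask purely to show the wiring; the video builds its own attention heads instead:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm Transformer block: x + attn(LN(x)), then x + ffwd(LN(x))."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # hide future positions
        a = self.ln1(x)
        x = x + self.attn(a, a, a, attn_mask=causal, need_weights=False)[0]  # residual around attention
        x = x + self.ffwd(self.ln2(x))                                       # residual around feed-forward
        return x

x = torch.randn(4, 8, 64)        # (batch, time, channels)
print(Block(64, 4)(x).shape)     # torch.Size([4, 8, 64])
```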
Finally, the discussion touches on the distinction between
encoder-decoder Transformers and decoder-only models like GPT, as
well as the multi-stage training process behind a model like ChatGPT,
including pre-training on vast amounts of data and subsequent
fine-tuning with reward models. The focus on practical code
implementation with nanoGPT provides a strong foundation for further
exploration of these advanced concepts in natural language
processing.
"infinite Shakespeare" 😀
Now I get why OpenAI engineers get paid so much
It's by far the best video on Transformers. Thank you so much! I have a question on position encoding. When we do x = tok_emb + pos_emb, don't we lose the position encoding as that data gets merged with tok_emb? Shouldn't we save it as a separate dimension (like a (B, T, 2C) matrix)?
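For context, the line in question adds two learned embedding tables of the same width C. A minimal sketch with illustrative sizes (the variable names follow the video):

```python
import torch
import torch.nn as nn

vocab_size, block_size, n_embd = 65, 8, 32
token_embedding_table = nn.Embedding(vocab_size, n_embd)
position_embedding_table = nn.Embedding(block_size, n_embd)

idx = torch.randint(vocab_size, (4, block_size))               # (B, T) token indices
tok_emb = token_embedding_table(idx)                           # (B, T, C): what each token is
pos_emb = position_embedding_table(torch.arange(block_size))   # (T, C): where each token sits
x = tok_emb + pos_emb                                          # (B, T, C), broadcast over the batch
```

In this design the positional signal shares the same C channels as the token identity and is simply summed; keeping it as a separate axis such as (B, T, 2C) would be an alternative, but the summed form is what the video (and the original Transformer paper) uses.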
Go forward and transform, I'll tattoo that.
Thank you, Andrej! You're so passionate about your job. It was 11:00 am when you started coding. Now it's dark in here and you're still trying to teach! 🙏
The community is blessed to have people like you.
Can you guys explain in detail the "head_size" hyperparameter mentioned at timestamp 1:05:10? What is the difference between this head_size and the multi-head diagram in Attention Is All You Need?
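For reference, head_size in the video is the per-head dimension of the keys, queries, and values; the multi-head diagram in Attention Is All You Need corresponds to running several such heads in parallel and concatenating their outputs. A rough sketch along those lines (illustrative sizes, not the exact nanoGPT code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One causal self-attention head with its own head_size-dimensional K, Q, V."""
    def __init__(self, n_embd, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

    def forward(self, x):
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.size(-1) ** -0.5        # scaled scores, (B, T, T)
        T = x.size(1)
        tril = torch.tril(torch.ones(T, T, device=x.device))
        wei = F.softmax(wei.masked_fill(tril == 0, float('-inf')), dim=-1)
        return wei @ v                                            # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    """num_heads heads of head_size each; outputs concatenate back to n_embd."""
    def __init__(self, n_embd, num_heads):
        super().__init__()
        head_size = n_embd // num_heads                           # e.g. 32 // 4 = 8
        self.heads = nn.ModuleList([Head(n_embd, head_size) for _ in range(num_heads)])

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)      # (B, T, n_embd)

x = torch.randn(4, 8, 32)
print(MultiHeadAttention(32, 4)(x).shape)                         # torch.Size([4, 8, 32])
```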
Can you please tell me which of your videos I should watch to understand the code in depth? Like when you said, "I have already explained this in my previous video."