What's inside ChatGPT?

And what does it mean for robotics/roboticists?

Marek Šuppa
Fablab 2023

1 / 81

$ whoami

  • "Principal Data Scientist/Engineer" at Slido (now part of Cisco)
2 / 81

$ whoami

  • "Principal Data Scientist/Engineer" at Slido (now part of Cisco)

  • Lecturer at Matfyz (ML, NLP)

3 / 81

$ whoami

  • "Principal Data Scientist/Engineer" at Slido (now part of Cisco)

  • Lecturer at Matfyz (ML, NLP)

  • RoboCupJunior Exec

4 / 81

Short history of Neural Network approaches to sequence processing

  • 2001: Neural Language Models
  • 2013: Word Embeddings
  • 2014: Sequence-to-Sequence models
  • 2015: Attention
  • 2016: Neural Machine Translation boom
  • 2017: Transformers
  • 2018: Pretrained Contextualized Word Embeddings (ELMo)
  • 2019: Massive Transformer Models (BERT, GPT-2, ...)
  • 2020: GPT-3
  • 2021: Large Language Models trained on Code (Codex)
  • 2022: ChatGPT?
  • 2023+: Current Frontiers
5 / 81

Short history of Neural Network approaches to sequence processing

  • 2001: Neural Language Models
  • 2013: Word Embeddings
  • 2014: Sequence-to-Sequence models
  • 2015: Attention *
  • 2016: Neural Machine Translation boom
  • 2017: Transformers *
  • 2018: Pretrained Contextualized Word Embeddings (ELMo)
  • 2019: Massive Transformer Models (BERT, GPT-2, ...)
  • 2020: GPT-3 *
  • 2021: Large Language Models trained on Code (Codex)
  • 2022: ChatGPT? *
  • 2023+: Current Frontiers

* Today's (loose) agenda

6 / 81

Attention and Transformers

7 / 81

History of Deep Learning Milestones

From Deep Learning State of the Art (2020) by Lex Fridman at MIT

8 / 81

The perils of seq2seq modeling

9 / 81

The perils of seq2seq modeling

Aren't we throwing out a bit too much?

Videos from https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

10 / 81

The fix

Let's use the full encoder output!

11 / 81

The fix

Let's use the full encoder output!

But how do we combine all the hidden states together?
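
A toy NumPy sketch of one common answer (dot-product attention; an illustration, not the exact formulation from any one paper): score each encoder hidden state against the current decoder state, turn the scores into weights with a softmax, and take the weighted sum as the context vector.

import numpy as np

def attention_context(decoder_state, encoder_states):
    """Toy dot-product attention over encoder hidden states.

    decoder_state:  (d,)    current decoder hidden state
    encoder_states: (T, d)  one hidden state per source token
    returns:        (d,)    context vector = weighted sum of encoder states
    """
    scores = encoder_states @ decoder_state      # (T,) similarity of each source position
    weights = np.exp(scores - scores.max())      # softmax (shifted for numerical stability)
    weights /= weights.sum()
    return weights @ encoder_states              # weighted combination of all hidden states

# Tiny example: 5 source tokens, hidden size 4
enc = np.random.randn(5, 4)
dec = np.random.randn(4)
print(attention_context(dec, enc).shape)         # (4,)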

12 / 81

The mechanics of Attention

13 / 81

The mechanics of Attention II

14 / 81

The mechanics of Attention III

15 / 81

Getting alignment with attention

16 / 81

Attention visualized

See nice demo at https://distill.pub/2016/augmented-rnns/

17 / 81

What if we only used attention?

18 / 81

The Transformer architecture

Images from https://jalammar.github.io/illustrated-transformer/

20 / 81

The Transformer's Encoder

21 / 81

What's Self Attention?

The animal didn't cross the street because it was too tired.

What does "it" refer to?

22 / 81

What's Self Attention?

The animal didn't cross the street because it was too tired.

What does "it" refer to?

23 / 81

Self Attention mechanics
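
A minimal NumPy sketch of single-head scaled dot-product self-attention, using random stand-in weight matrices instead of learned ones: every token is projected into a query, a key and a value, and each position attends to all the others, which is how a token like "it" can pull in information from "animal".

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (T, d_model) one embedding per token; Wq/Wk/Wv: (d_model, d_k) projections
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                     # (T, d_k) each
    scores = Q @ K.T / np.sqrt(K.shape[-1])              # (T, T) pairwise attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # (T, d_k) new representation per token

# Toy run: 10 tokens, d_model=16, d_k=8 (random weights, so only the shapes are meaningful)
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 16))
print(self_attention(X, *(rng.normal(size=(16, 8)) for _ in range(3))).shape)   # (10, 8)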

24 / 81

Multi-headed Self Attention
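
Continuing the toy sketch above (still with random, untrained projections): multi-head attention just runs several independent attention heads on smaller projections and concatenates their outputs, so different heads can pick up different relations between tokens.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, n_heads=4, d_k=8, seed=0):
    # X: (T, d_model). Each head gets its own random Q/K/V projections; outputs are concatenated.
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(size=(X.shape[1], d_k)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        weights = softmax(Q @ K.T / np.sqrt(d_k))        # (T, T) attention pattern of this head
        heads.append(weights @ V)                        # (T, d_k)
    return np.concatenate(heads, axis=-1)                # (T, n_heads * d_k)

X = np.random.default_rng(1).normal(size=(10, 16))
print(multi_head_self_attention(X).shape)                # (10, 32)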

25 / 81

The full Transformer seq2seq process I

26 / 81

The full Transformer seq2seq process II

27 / 81

Intermezzo: implement it yourself

To actually understand what's going on, there is no better approach.

28 / 81

Intermezzo: implement it yourself

To actually understand what's going on, there is no better approach.

29 / 81

Intermezzo: implement it yourself

To actually understand what's going on, there is no better approach.

30 / 81

Intermezzo: implement it yourself

To actually understand what's going on, there is no better approach.

31 / 81

Big Transformers Wins: GPT-2

Try it yourself at https://transformer.huggingface.co/doc/gpt2-large

32 / 81

Big Transformer Wins: Huggingface transformers

  • A very nicely done library that allows anyone with some Python knowledge to play with pretrained state-of-the-art models (more in the docs).

33 / 81

Big Transformer Wins: Huggingface transformers II

  • A small example: English to Slovak translator in about 3 lines of Python code: *
from transformers import pipeline

# en→sk has no default model, so pass one explicitly (e.g. an OPUS-MT English-to-Slovak checkpoint):
en_sk_translator = pipeline("translation_en_to_sk", model="Helsinki-NLP/opus-mt-en-sk")
print(en_sk_translator("When will this presentation end?"))

Works with many other languages as well -- the full list is here

34 / 81

Attention and Transformers: Recap

35 / 81

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well
36 / 81

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well

  • It turned out it was all that was needed for (bounded) sequence processing

37 / 81

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well

  • It turned out it was all that was needed for (bounded) sequence processing

  • Transformer is an encoder-decoder architecture that is "all the rage" now

38 / 81

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well

  • It turned out it was all that was needed for (bounded) sequence processing

  • Transformer is an encoder-decoder architecture that is "all the rage" now

  • It has no time dependency due to self-attention and is therefore easy to parallelize

39 / 81

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well

  • It turned out it was all that was needed for (bounded) sequence processing

  • Transformer is an encoder-decoder architecture that is "all the rage" now

  • It has no time dependency due to self-attention and is therefore easy to parallelize

  • Well known models like BERT and GPT-* took the world of NLP by storm

40 / 81

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well

  • It turned out it was all that was needed for (bounded) sequence processing

  • Transformer is an encoder-decoder architecture that is "all the rage" now

  • It has no time dependency due to self-attention and is therefore easy to parallelize

  • Well known models like BERT and GPT-* took the world of NLP by storm

  • Very helpful in many tasks, easy to play with thanks to the Huggingface transformers library
41 / 81

GPT3 and ChatGPT

42 / 81

GPT-2 vs GPT-3

43 / 81

GPT-2 vs GPT-3

44 / 81

GPT3

  • Basically the same architecture as GPT2
45 / 81

GPT3

  • Basically the same architecture as GPT2

  • The sheer size is astounding (power-law of model/dataset/computation size)

46 / 81

GPT3

47 / 81

GPT3

  • Basically the same architecture as GPT2

  • The sheer size is astounding (power-law of model/dataset/computation size)

  • It would take 355 years of Tesla V100 GPU time to train

  • Training it would cost about $4.6M at retail prices

  • It was so expensive to train they didn't even fix the bugs they themselves found:

48 / 81

GPT3

  • Basically the same architecture as GPT2

  • The sheer size is astounding (power-law of model/dataset/computation size)

  • It would take 355 years of Tesla V100 GPU time to train

  • Training it would cost about $4.6M at retail prices (see the quick arithmetic below)

  • It was so expensive to train they didn't even fix the bugs they themselves found:
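
A quick sanity check of that cost figure (the ~$1.50 per V100 GPU-hour retail rate below is an assumption, not a number from the slide):

gpu_years = 355                     # V100 GPU-time estimate from the slide
price_per_gpu_hour = 1.50           # assumed retail on-demand V100 price in USD
gpu_hours = gpu_years * 365 * 24    # ~3.1 million GPU-hours
print(f"${gpu_hours * price_per_gpu_hour / 1e6:.1f}M")   # -> $4.7M, same ballpark as the $4.6M claim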

49 / 81

Q: Ok, so is ChatGPT simply a wrapper around GPT-3?

  • Very short A: It depends
51 / 81

Q: Ok, so is ChatGPT simply a wrapper around GPT-3?

  • Very short A: It depends

  • Short A: It depends on who you ask

52 / 81

Q: Ok, so is ChatGPT simply a wrapper around GPT-3?

  • Very short A: It depends

  • Short A: It depends on who you ask

  • A: It depends on who you ask. OpenAI's Docs probably wouldn't agree.

53 / 81

Q: Ok, so is ChatGPT simply a wrapper around GPT-3?

  • Very short A: It depends

  • Short A: It depends on who you ask

  • A: It depends on who you ask. OpenAI's Docs probably wouldn't agree.

  • Actual A: We don't really know. It's behind an API, so we have no way of proving this one way or the other.

54 / 81

55 / 81

The "potential" ChatGPT training procedure

InstructGPT: Training language models to follow instructions with human feedback (2022)

56 / 81

Supervised FineTuning (SFT) Model

  • Prompts collected from the OpenAI API, together with prompts hand-written by labelers, resulted in roughly 13,000 input/output samples for training the supervised model.
57 / 81

Reward Model Training
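
In the InstructGPT setup, the reward model is trained on pairwise comparisons: given a prompt and two responses, it should assign a higher scalar score to the response labelers preferred. A minimal sketch of that ranking loss (plain NumPy with stand-in scores, not a real 6B-parameter reward model):

import numpy as np

def pairwise_ranking_loss(score_preferred, score_rejected):
    # -log(sigmoid(r(x, y_preferred) - r(x, y_rejected))):
    # minimizing it pushes the preferred response's score above the rejected one's
    return -np.log(1.0 / (1.0 + np.exp(-(score_preferred - score_rejected))))

# Stand-in scores a reward model might assign to two responses to the same prompt
print(pairwise_ranking_loss(score_preferred=2.1, score_rejected=0.3))   # small loss: ranking is right
print(pairwise_ranking_loss(score_preferred=0.3, score_rejected=2.1))   # large loss: ranking is wrong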

58 / 81

The "potential" ChatGPT training procedure

InstructGPT: Training language models to follow instructions with human feedback (2022)

61 / 81

InstructGPT: Summary

  • The outputs generated by a small (1.3B) InstructGPT model were preferred to those of GPT-3
62 / 81

InstructGPT: Summary

  • The outputs generated by a small (1.3B) InstructGPT model were preferred to those of GPT-3

  • The reward model was also "rather small" (6B)

63 / 81

InstructGPT: Summary

  • The outputs generated by a small (1.3B) InstructGPT model were preferred to those of GPT-3

  • The reward model was also "rather small" (6B)

  • We don't know how large the model behind ChatGPT is, but chances are it's this "small"

64 / 81

Implications

65 / 81

66 / 81

72 / 81

73 / 81

77 / 81

78 / 81

Three rules for using (things like) ChatGPT

  1. Let it do things you will manually check anyway

  2. Have it draft things you'll rewrite anyway

  3. Assume the first response will be far from final

Inspired by https://vickiboykis.com/2023/02/26/what-should-you-use-chatgpt-for/

79 / 81

marek@mareksuppa.com

80 / 81
