What's inside ChatGPT?

And what does it mean for robotics/roboticists?

Marek Šuppa
Fablab 2023

1 / 81

$ whoami

  • "Principal Data Scientist/Engineer" at Slido (now part of Cisco)
2 / 81

$ whoami

  • "Principal Data Scientist/Engineer" at Slido (now part of Cisco)

  • Lecturer at Matfyz (ML, NLP)

3 / 81

$ whoami

  • "Principal Data Scientist/Engineer" at Slido (now part of Cisco)

  • Lecturer at Matfyz (ML, NLP)

  • RoboCupJunior Exec

4 / 81

Short history of Neural Network approaches to sequence processing

  • 2001: Neural Language Models
  • 2013: Word Embeddings
  • 2014: Sequence-to-Sequence models
  • 2015: Attention
  • 2016: Neural Machine Translation boom
  • 2017: Transformers
  • 2018: Pretrained Contextualized Word Embeddings (ELMo)
  • 2019: Massive Transformer Models (BERT, GPT-2, ...)
  • 2020: GPT-3
  • 2021: Large Language Models trained on Code (Codex)
  • 2022: ChatGPT?
  • 2023+: Current Frontiers
5 / 81

Short history of Neural Network approaches to sequence processing

  • 2001: Neural Language Models
  • 2013: Word Embeddings
  • 2014: Sequence-to-Sequence models
  • 2015: Attention *
  • 2016: Neural Machine Translation boom
  • 2017: Transformers *
  • 2018: Pretrained Contextualized Word Embeddings (ELMo)
  • 2019: Massive Transformer Models (BERT, GPT-2, ...)
  • 2020: GPT-3 *
  • 2021: Large Language Models trained on Code (Codex)
  • 2022: ChatGPT? *
  • 2023+: Current Frontiers

* Today's (loose) agenda

6 / 81

Attention and Transformers

7 / 81

History of Deep Learning Milestones

From Deep Learning State of the Art (2020) by Lex Fridman at MIT

8 / 81

The perils of seq2seq modeling

9 / 81

The perils of seq2seq modeling

Aren't we throwing out a bit too much?

Videos from https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

10 / 81

The fix

Let's use the full encoder output!

11 / 81

The fix

Let's use the full encoder output!

But how do we combine all the hidden states together?
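
A toy NumPy sketch of one common answer (dot-product attention; an illustration, not the exact formulation from any one paper): score each encoder hidden state against the current decoder state, turn the scores into weights with a softmax, and take the weighted sum as the context vector.

import numpy as np

def attention_context(decoder_state, encoder_states):
    """Toy dot-product attention over encoder hidden states.

    decoder_state:  (d,)    current decoder hidden state
    encoder_states: (T, d)  one hidden state per source token
    returns:        (d,)    context vector = weighted sum of encoder states
    """
    scores = encoder_states @ decoder_state      # (T,) similarity of each source position
    weights = np.exp(scores - scores.max())      # softmax (shifted for numerical stability)
    weights /= weights.sum()
    return weights @ encoder_states              # weighted combination of all hidden states

# Tiny example: 5 source tokens, hidden size 4
enc = np.random.randn(5, 4)
dec = np.random.randn(4)
print(attention_context(dec, enc).shape)         # (4,)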

12 / 81

The mechanics of Attention

13 / 81

The mechanics of Attention II

14 / 81

The mechanics of Attention III

15 / 81

Getting alignment with attention

16 / 81

Attention visualized

See nice demo at https://distill.pub/2016/augmented-rnns/

17 / 81

What if we only used attention?

18 / 81

The Transformer architecture

Images from https://jalammar.github.io/illustrated-transformer/

20 / 81

The Transformer's Encoder

21 / 81

What's Self Attention?

The animal didn't cross the street because it was too tired.

What does "it" refer to?

22 / 81

What's Self Attention?

The animal didn't cross the street because it was too tired.

What does "it" refer to?

23 / 81

Self Attention mechanics
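
A minimal NumPy sketch of single-head scaled dot-product self-attention, using random stand-in weight matrices instead of learned ones: every token is projected into a query, a key and a value, and each position attends to all the others, which is how a token like "it" can pull in information from "animal".

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (T, d_model) one embedding per token; Wq/Wk/Wv: (d_model, d_k) projections
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                     # (T, d_k) each
    scores = Q @ K.T / np.sqrt(K.shape[-1])              # (T, T) pairwise attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # (T, d_k) new representation per token

# Toy run: 10 tokens, d_model=16, d_k=8 (random weights, so only the shapes are meaningful)
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 16))
print(self_attention(X, *(rng.normal(size=(16, 8)) for _ in range(3))).shape)   # (10, 8)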

24 / 81

Multi-headed Self Attention
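
Continuing the toy sketch above (still with random, untrained projections): multi-head attention just runs several independent attention heads on smaller projections and concatenates their outputs, so different heads can pick up different relations between tokens.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, n_heads=4, d_k=8, seed=0):
    # X: (T, d_model). Each head gets its own random Q/K/V projections; outputs are concatenated.
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(size=(X.shape[1], d_k)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        weights = softmax(Q @ K.T / np.sqrt(d_k))        # (T, T) attention pattern of this head
        heads.append(weights @ V)                        # (T, d_k)
    return np.concatenate(heads, axis=-1)                # (T, n_heads * d_k)

X = np.random.default_rng(1).normal(size=(10, 16))
print(multi_head_self_attention(X).shape)                # (10, 32)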

25 / 81

The full Transformer seq2seq process I

26 / 81

The full Transformer seq2seq process II

27 / 81

Intermezzo: implement it yourself

To actually understand what's going on, there is no better approach.

28 / 81

Intermezzo: implement it yourself

To actually understand what's going on, there is no better approach.

29 / 81

Intermezzo: implement it yourself

To actually understand what's going on, there is no better approach.

30 / 81

Intermezzo: implement it yourself

To actually understand what's going on, there is no better approach.

31 / 81

Big Transformers Wins: GPT-2

Try it yourself at https://transformer.huggingface.co/doc/gpt2-large

32 / 81

Big Transformer Wins: Huggingface transformers

  • A very nicely done library that allows anyone with some Python knowledge to play with pretrained state-of-the-art models (more in the docs).

33 / 81

Big Transformer Wins: Huggingface transformers II

  • A small example: English to Slovak translator in about 3 lines of Python code: *
from transformers import pipeline

# en→sk has no default model, so pass one explicitly (e.g. an OPUS-MT English-to-Slovak checkpoint):
en_sk_translator = pipeline("translation_en_to_sk", model="Helsinki-NLP/opus-mt-en-sk")
print(en_sk_translator("When will this presentation end?"))

Works with many other languages as well -- the full list is here

34 / 81

Attention and Transformers: Recap

35 / 81

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well
36 / 81

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well

  • It turned out it was all that was needed for (bounded) sequence processing

37 / 81

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well

  • It turned out it was all that was needed for (bounded) sequence processing

  • Transformer is an encoder-decoder architecture that is "all the rage" now

38 / 81

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well

  • It turned out it was all that was needed for (bounded) sequence processing

  • Transformer is an encoder-decoder architecture that is "all the rage" now

  • It has no time dependency due to self-attention and is therefore easy to parallelize

39 / 81

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well

  • It turned out it was all that was needed for (bounded) sequence processing

  • Transformer is an encoder-decoder architecture that is "all the rage" now

  • It has no time dependency due to self-attention and is therefore easy to parallelize

  • Well known models like BERT and GPT-* took the world of NLP by storm

40 / 81

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well

  • It turned out it was all that was needed for (bounded) sequence processing

  • Transformer is an encoder-decoder architecture that is "all the rage" now

  • It has no time dependency due to self-attention and is therefore easy to parallelize

  • Well known models like BERT and GPT-* took the world of NLP by storm

  • Very helpful in many tasks, easy to play with thanks to the Huggingface transformers library
41 / 81

GPT3 and ChatGPT

42 / 81

GPT-2 vs GPT-3

43 / 81

GPT-2 vs GPT-3

44 / 81

GPT3

  • Basically the same architecture as GPT2
45 / 81

GPT3

  • Basically the same architecture as GPT2

  • The sheer size is astounding (power-law of model/dataset/computation size)

46 / 81

GPT3

47 / 81

GPT3

  • Basically the same architecture as GPT2

  • The sheer size is astounding (power-law of model/dataset/computation size)

  • It would take 355 years of Tesla V100 GPU time to train

  • Training it would cost about $4.6M at retail prices

  • It was so expensive to train they didn't even fix the bugs they themselves found:

48 / 81

GPT3

  • Basically the same architecture as GPT2

  • The sheer size is astounding (power-law of model/dataset/computation size)

  • It would take 355 years of Tesla V100 GPU time to train

  • Training it would cost about $4.6M at retail prices (see the quick arithmetic below)

  • It was so expensive to train they didn't even fix the bugs they themselves found:
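
A quick sanity check of that cost figure (the ~$1.50 per V100 GPU-hour retail rate below is an assumption, not a number from the slide):

gpu_years = 355                     # V100 GPU-time estimate from the slide
price_per_gpu_hour = 1.50           # assumed retail on-demand V100 price in USD
gpu_hours = gpu_years * 365 * 24    # ~3.1 million GPU-hours
print(f"${gpu_hours * price_per_gpu_hour / 1e6:.1f}M")   # -> $4.7M, same ballpark as the $4.6M claim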

49 / 81

Q: Ok, so is ChatGPT simply a wrapper around GPT-3?

  • Very short A: It depends
51 / 81

Q: Ok, so is ChatGPT simply a wrapper around GPT-3?

  • Very short A: It depends

  • Short A: It depends on who you ask

52 / 81

Q: Ok, so is ChatGPT simply a wrapper around GPT-3?

  • Very short A: It depends

  • Short A: It depends on who you ask

  • A: It depends on who you ask. OpenAI's Docs probably wouldn't agree.

53 / 81

Q: Ok, so is ChatGPT simply a wrapper around GPT-3?

  • Very short A: It depends

  • Short A: It depends on who you ask

  • A: It depends on who you ask. OpenAI's Docs probably wouldn't agree.

  • Actual A: We don't really know. It's behind an API, so we have no way of proving this one way or the other.

54 / 81

55 / 81

The "potential" ChatGPT training procedure

InstructGPT: Training language models to follow instructions with human feedback (2022)

56 / 81

Supervised FineTuning (SFT) Model

  • Prompts collected from the OpenAI API, together with prompts hand-written by labelers, resulted in roughly 13,000 input/output samples for training the supervised model.
57 / 81

Reward Model Training
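
In the InstructGPT setup, the reward model is trained on pairwise comparisons: given a prompt and two responses, it should assign a higher scalar score to the response labelers preferred. A minimal sketch of that ranking loss (plain NumPy with stand-in scores, not a real 6B-parameter reward model):

import numpy as np

def pairwise_ranking_loss(score_preferred, score_rejected):
    # -log(sigmoid(r(x, y_preferred) - r(x, y_rejected))):
    # minimizing it pushes the preferred response's score above the rejected one's
    return -np.log(1.0 / (1.0 + np.exp(-(score_preferred - score_rejected))))

# Stand-in scores a reward model might assign to two responses to the same prompt
print(pairwise_ranking_loss(score_preferred=2.1, score_rejected=0.3))   # small loss: ranking is right
print(pairwise_ranking_loss(score_preferred=0.3, score_rejected=2.1))   # large loss: ranking is wrong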

58 / 81

The "potential" ChatGPT training procedure

InstructGPT: Training language models to follow instructions with human feedback (2022)

61 / 81

InstructGPT: Summary

  • The outputs generated by a small (1.3B) InstructGPT model were preferred to those of GPT-3
62 / 81

InstructGPT: Summary

  • The outputs generated by a small (1.3B) InstructGPT model were preferred to those of GPT-3

  • The reward model was also "rather small" (6B)

63 / 81

InstructGPT: Summary

  • The outputs generated by a small (1.3B) InstructGPT model were preferred to those of GPT-3

  • The reward model was also "rather small" (6B)

  • We don't know how large the model behind ChatGPT is, but chances are it's this "small"

64 / 81

Implications

65 / 81

66 / 81

72 / 81

73 / 81

77 / 81

78 / 81

Three rules for using (things like) ChatGPT

  1. Let it do things you will manually check anyway

  2. Have it draft things you'll rewrite anyway

  3. Assume the first response will be far from final

Inspired by https://vickiboykis.com/2023/02/26/what-should-you-use-chatgpt-for/

79 / 81

marek@mareksuppa.com

80 / 81
