
Neural Network techniques for sequence processing (in NLP)

From word vectors to transformers and back

Marek Šuppa
MLSS 2020, Žilina

1 / 126

Short history of Neural Network approaches to sequence processing

  • 2001: Neural Language Models
  • 2013: Word Embeddings
  • 2014: Sequence-to-Sequence models
  • 2015: Attention
  • 2016: Neural Machine Translation boom
  • 2017: Transformers
  • 2018: Pretrained Contextualized Word Embeddings (ELMo)
  • 2019+: Massive Transformer Models (BERT, GPT-2, ...)
  • 2020: Current Frontiers
3 / 126

Short history of Neural Network approaches to sequence processing

  • 2001: Neural Language Models
  • 2013: Word Embeddings *
  • 2014: Sequence-to-Sequence models
  • 2015: Attention *
  • 2016: Neural Machine Translation boom
  • 2017: Transformers *
  • 2018: Pretrained Contextualized Word Embeddings (ELMo)
  • 2019+: Massive Transformer Models (BERT, GPT-2, ...) *
  • 2020: Current Frontiers

* Today's (loose) agenda

4 / 126

Word Embeddings

5 / 126

Word Embeddings

"How do we convert free text into a format understandable by neural networks?"

One-hot encoding.

6 / 126

Word Embeddings

"How do we convert free text into a format understandable by neural networks?"

One-hot encoding.

"That results in extra large vectors. Can we do better?"

Well, there is this thing called dense representations.

7 / 126

Word Embeddings

"How do we convert free text into a format understandable by neural networks?"

One-hot encoding.

"That results in extra large vectors. Can we do better?"

Well, there is this thing called dense representations.

"Cool! While we are at it, can we make the format semantically meaningful?"

We don't even know how to train them quickly, but we can try...
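To make the contrast concrete, here is a minimal sketch (a toy four-word vocabulary, random values standing in for trained ones) of a one-hot vector versus a dense embedding lookup:

import numpy as np

vocab = ["king", "queen", "man", "woman"]

# One-hot: a vector as long as the vocabulary, with a single 1
one_hot_king = np.zeros(len(vocab))
one_hot_king[vocab.index("king")] = 1.0              # [1, 0, 0, 0]

# Dense representation: a small (trainable) matrix with one row per word
embedding_dim = 3
embedding_matrix = np.random.randn(len(vocab), embedding_dim)
dense_king = embedding_matrix[vocab.index("king")]   # 3 numbers instead of |V|

With a real vocabulary the one-hot vector would have hundreds of thousands of entries, while the dense one stays at a few hundred at most.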

8 / 126

Word Embeddings: A real-world example

A GloVe word embedding for king (trained on Wikipedia):

[
0.50451 , 0.68607 , -0.59517 , -0.022801, 0.60046 , -0.13498 , -0.08813 ,
0.47377 , -0.61798 , -0.31012 , -0.076666, 1.493 , -0.034189, -0.98173 ,
0.68229 , 0.81722 , -0.51874 , -0.31503 , -0.55809 , 0.66421 , 0.19611 ,
-0.13495 , -0.11476 , -0.30344 , 0.41177 , -2.223 , -1.0756 , -1.0783 ,
-0.34354 , 0.33505 , 1.9927 , -0.04234 , -0.64319 , 0.71125 , 0.49159 ,
0.16754 , 0.34344 , -0.25663 , -0.8523 , 0.1661 , 0.40102 , 1.1685 ,
-1.0137 , -0.21585 , -0.15155 , 0.78321 , -0.91241 , -1.6106 , -0.64426 ,
-0.51042
]

Example and subsequent images taken from the great Illustrated Word2Vec
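If you want to poke at these vectors yourself, one option (not from the slides) is gensim's downloader, which packages the same 50-dimensional Wikipedia GloVe vectors under the name "glove-wiki-gigaword-50":

import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")   # downloads the vectors on first use
print(glove["king"][:7])                     # the first few numbers of the vector above
print(glove.similarity("king", "queen"))     # cosine similarity between two word vectors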

9 / 126

Word Embeddings: Visualized

By converting the numbers to colored bars we get a totally different picture:

10 / 126

Word Embeddings: Visualized

By converting the numbers to colored bars we get a totally different picture:

Especially in context:

11 / 126

Word Embeddings: Visualized II

12 / 126

Word Embeddings: Visualized III

It seems that some semantic meaning does indeed get encoded!*

13 / 126

Word Embeddings: Visualized III

It seems that some semantic meaning does indeed get encoded!*

* Note that king-man+woman does not exactly equal queen -- it's just that queen was the closest word in the space to the result of king-man+woman
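The footnote can be reproduced directly; a minimal sketch reusing the gensim GloVe vectors loaded in the earlier example:

result = glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)   # queen is expected to come out on top, but only as the *nearest* word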

14 / 126

Word Embeddings: Training

So how would one come up with these?

15 / 126

Word Embeddings: Training

So how would one come up with these?

Let a neural net learn the proper values on its own!

16 / 126

Word Embeddings: Training

So how would one come up with these?

Let a neural net learn the proper values on its own!

But how?

17 / 126

Word Embeddings: Training

So how would one come up with these?

Let a neural net learn the proper values on its own!

But how?

Give it some context and let it predict the center word.

18 / 126

Word Embeddings: Training

So how would one come up with these?

Let a neural net learn the proper values on its own!

But how?

Give it some context and let it predict the center word.

Image from https://amitness.com/2020/06/fasttext-embeddings/
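As a rough illustration of the idea, here is a CBOW-style sketch in PyTorch with made-up word ids (the real word2vec training additionally relies on tricks such as negative sampling to make this fast):

import torch
import torch.nn as nn

vocab_size, dim = 1000, 50
emb = nn.Embedding(vocab_size, dim)   # these weights become the word vectors
out = nn.Linear(dim, vocab_size)      # scores every word in the vocabulary
loss_fn = nn.CrossEntropyLoss()

context = torch.tensor([[12, 7, 431, 9]])   # ids of the surrounding words (hypothetical)
center = torch.tensor([256])                # id of the center word to predict (hypothetical)

logits = out(emb(context).mean(dim=1))      # combine the context, score every candidate
loss = loss_fn(logits, center)              # training minimizes this over a large corpus
loss.backward()

After enough such updates, the rows of emb end up being the word embeddings we were after.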

19 / 126

Word Embeddings: Limitations

Out Of Vocabulary Words

Morphology

21 / 126

Zipf's Law: the fundamental limitation

The general take-home point seems to be:

You never see most things.

Image from https://shadycharacters.co.uk/2015/10/zipfs-law/
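You can see this on any text you have lying around; a minimal sketch (the file name is a placeholder):

from collections import Counter

with open("some_corpus.txt") as f:                 # hypothetical corpus file
    counts = Counter(f.read().lower().split())

for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(rank, word, freq)                        # frequency falls off roughly as 1/rank

# Meanwhile, a large fraction of the vocabulary appears only once or twice.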

22 / 126

Word Embeddings: FastText

The idea is relatively simple: split words into their character n-grams

23 / 126

Word Embeddings: FastText

The idea is relatively simple: split words into their character n-grams

Sum the n-grams together to create the embedding.
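A minimal sketch of the idea (real FastText uses n-grams of several lengths plus the whole word, and hashes the n-grams into buckets, but the principle is the same):

import numpy as np

def char_ngrams(word, n=3):
    # FastText-style n-grams: add boundary symbols and slide a window over the word
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))   # ['<wh', 'whe', 'her', 'ere', 're>']

# The word vector is then (roughly) the sum of its n-gram vectors, so even an
# unseen word gets a representation from n-grams it shares with known words.
ngram_vectors = {g: np.random.randn(50) for g in char_ngrams("where")}   # stand-ins
word_vector = sum(ngram_vectors[g] for g in char_ngrams("where"))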

24 / 126

Word Embeddings: FastText Classification

This concept can be very easily extended to classification, which you can try out in Google Colab:

Text Classification with FastText in Google Colab
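Outside of Colab, the same thing with the official fasttext Python package looks roughly like this (the file name and labels are made up; each training line starts with __label__something followed by the text):

import fasttext

model = fasttext.train_supervised(input="train.txt", epoch=10, wordNgrams=2)
print(model.predict("this lecture was great"))   # predicted label(s) and probability
model.save_model("classifier.bin")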

25 / 126

Word Embeddings: Recap

  • Alternative to one-hot encoding
26 / 126

Word Embeddings: Recap

  • Alternative to one-hot encoding

  • Created as a byproduct of training a neural network

27 / 126

Word Embeddings: Recap

  • Alternative to one-hot encoding

  • Created as a byproduct of training a neural network

  • Seems to encode some semantic meaning and allows for quick comparison

28 / 126

Word Embeddings: Recap

  • Alternative to one-hot encoding

  • Created as a byproduct of training a neural network

  • Seems to encode some semantic meaning and allows for quick comparison

  • Suffers from the implications of Zipf's law

29 / 126

Word Embeddings: Recap

  • Alternative to one-hot encoding

  • Created as a byproduct of training a neural network

  • Seems to encode some semantic meaning and allows for quick comparison

  • Suffers from the implications of Zipf's law

  • FastText tries to alleviate this and can be directly used for classification as well.
30 / 126

Recurrent Neural Networks

31 / 126

Types of Neural Networks

Image from https://karpathy.github.io/2015/05/21/rnn-effectiveness/

32 / 126

Unfolded RNN

33 / 126

Unfolded RNN

34 / 126

Training unfolded RNN

This concept is called Back Propagation Through Time (BPTT)
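A minimal sketch of what the unfolding means in code (toy dimensions, random inputs): the same weight matrices are reused at every time step, and the gradient flows back through all of them.

import torch

dim_in, dim_h, steps = 8, 16, 5
W_x = torch.randn(dim_in, dim_h, requires_grad=True)
W_h = torch.randn(dim_h, dim_h, requires_grad=True)

h = torch.zeros(1, dim_h)
for t in range(steps):
    x_t = torch.randn(1, dim_in)            # stand-in for the t-th input
    h = torch.tanh(x_t @ W_x + h @ W_h)     # new hidden state from input + previous state

loss = h.sum()      # a real loss would compare outputs against targets
loss.backward()     # BPTT: gradients flow back through every unrolled step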

35 / 126

Training unfolded RNN

Note how various parts of the unfolded RNN impact h₂

36 / 126

Problems with long-term dependencies

37 / 126

LSTM: what to forget and what to remember

38 / 126

LSTM: Conveyor belt

39 / 126

LSTM: Conveyor belt

  • The cell state is sort of a "conveyor belt"
40 / 126

LSTM: Conveyor belt

  • The cell state is sort of a "conveyor belt"

  • Allows information to stay unchanged or get slightly updated

All the following nice images are from http://colah.github.io/posts/2015-08-Understanding-LSTMs/ which I highly recommend

41 / 126

LSTM: Conveyor belt

Step 1: Decide what to forget

42 / 126

LSTM: Conveyor belt

Step 2: Decide

  • which values to update (iₜ)
  • what the new values should be (C̃ₜ)
43 / 126

LSTM: Conveyor belt

Step 2.5: perform forgetting and update

44 / 126

LSTM: Conveyor belt

Step 3: produce output (hₜ)
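Putting the three steps together, a minimal sketch of a single LSTM step (toy sizes, random parameters; in practice you would simply use torch.nn.LSTM):

import torch

d_in, d_h = 8, 16
gates = ["f", "i", "c", "o"]   # forget, input, candidate, output
W = {g: torch.randn(d_in, d_h) for g in gates}
U = {g: torch.randn(d_h, d_h) for g in gates}
b = {g: torch.zeros(d_h) for g in gates}

def lstm_step(x_t, h_prev, C_prev):
    f_t = torch.sigmoid(x_t @ W["f"] + h_prev @ U["f"] + b["f"])   # step 1: what to forget
    i_t = torch.sigmoid(x_t @ W["i"] + h_prev @ U["i"] + b["i"])   # step 2: which values to update
    C_hat = torch.tanh(x_t @ W["c"] + h_prev @ U["c"] + b["c"])    # step 2: candidate new values
    C_t = f_t * C_prev + i_t * C_hat                               # step 2.5: update the conveyor belt
    o_t = torch.sigmoid(x_t @ W["o"] + h_prev @ U["o"] + b["o"])
    h_t = o_t * torch.tanh(C_t)                                    # step 3: produce the output
    return h_t, C_t

h, C = torch.zeros(1, d_h), torch.zeros(1, d_h)
h, C = lstm_step(torch.randn(1, d_in), h, C)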

45 / 126

LSTM: Conveyor belt

A conveyor belt that can pick

  • what to remember
  • what to forget
  • what to output
46 / 126

GRU: Simplified conveyor belt

47 / 126

GRU: Simplified conveyor belt

  • Forget and input combined into a single "update gate" (zₜ)
48 / 126

GRU: Simplified conveyor belt

  • Forget and input combined into a single "update gate" (zₜ)

  • Cell state (Cₜ in LSTM) merged with the hidden state (hₜ)

49 / 126

GRU vs LSTM

  • GRU is smaller and hence requires less compute
  • But it turns out it cannot count (especially over longer sequences)
50 / 126

GRU vs LSTM

  • GRU is smaller and hence requires less compute
  • But it turns out it cannot count (especially over longer sequences)

On the Practical Computational Power of Finite Precision RNNs for Language Recognition (2018)
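The size difference is easy to check directly (arbitrary sizes; a GRU has three gate-sized weight blocks where an LSTM has four):

import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=256)
gru = nn.GRU(input_size=128, hidden_size=256)

count = lambda m: sum(p.numel() for p in m.parameters())
print("LSTM:", count(lstm))   # roughly 395k parameters
print("GRU: ", count(gru))    # roughly 296k parameters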

51 / 126

Application: Handwriting from Text

Input: "He dismissed the idea"

52 / 126

Application: Handwriting from Text

Input: "He dismissed the idea"

Output:

53 / 126

Application: Handwriting from Text

Input: "He dismissed the idea"

Output:

Generating Sequences With Recurrent Neural Networks, Alex Graves, 2013

Demo at https://www.cs.toronto.edu/~graves/handwriting.html

54 / 126

Application: Character-Level Text Generation

"The Unreasonable Effectiveness of Recurrent Neural Networks", Andrej Karpathy, 2015

55 / 126

Application: Image Question Answering

56 / 126

Application: Image Question Answering

Exploring Models and Data for Image Question Answering, 2015

57 / 126

Application: Image Caption Generation

58 / 126

Application: Video Caption Generation

59 / 126

Application: Video Caption Generation

Sequence to Sequence - Video to Text, Venugopalan et al., 2015

60 / 126

Application: Adding Audio to Silent Film

61 / 126

Application: Adding Audio to Silent Film

Visually Indicated Sounds, Owens et al., 2015

62 / 126

Application: Medical Diagnosis

63 / 126

Application: End-to-End Driving *



64 / 126

Application: End-to-End Driving *



Input: features extracted from CNN
Output: predicted steering angle

* On relatively straight roads

65 / 126

Application: Sentiment Analysis

Try it yourself at https://demo.allennlp.org/sentiment-analysis/

66 / 126

Application: Named Entity Recognition (NER)

Try it yourself at https://demo.allennlp.org/named-entity-recognition/

67 / 126

Application: Trump2Cash

68 / 126

Application: Trump2Cash

  • A combination of Sentiment Analysis and Named Entity Recognition
69 / 126

Application: Trump2Cash

  • A combination of Sentiment Analysis and Named Entity Recognition

How it works (a rough sketch in code follows the steps):

  1. Monitor tweets of Donald Trump
  2. Use NER to see if some of them mention a publicly traded company
  3. Apply sentiment analysis and use its result to decide whether to buy or sell
  4. Profit?
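A rough sketch of steps 2 and 3 using off-the-shelf models (here the Huggingface pipelines introduced later in this talk; the actual Trump2Cash bot is built from different components):

from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")   # named entity recognition
sentiment = pipeline("sentiment-analysis")

tweet = "Boeing is building a brand new 747 Air Force One. Costs are out of control!"
companies = [e["word"] for e in ner(tweet) if e["entity_group"] == "ORG"]
verdict = sentiment(tweet)[0]                          # e.g. {'label': 'NEGATIVE', ...}

for company in companies:
    action = "buy" if verdict["label"] == "POSITIVE" else "sell"
    print(action, company)                             # step 4 is left as an exercise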
70 / 126

Application: Trump2Cash

See predictions live at https://twitter.com/Trump2Cash

71 / 126

Application: Machine Translation

72 / 126

Application: Machine Translation

73 / 126

Recurrent Neural Networks

  • A "neural" way of handling sequences
74 / 126

Recurrent Neural Networks

  • A "neural" way of handling sequences

  • Their training usually happens by "unfolding" the network in time (Back Propagation Through Time -- BPTT)

75 / 126

Recurrent Neural Networks

  • A "neural" way of handling sequences

  • Their training usually happens by "unfolding" the network in time (Back Propagation Through Time -- BPTT)

  • In theory they can handle sequences of any length. In practice it is difficult due to exploding and vanishing gradients.

76 / 126

Recurrent Neural Networks

  • A "neural" way of handling sequences

  • Their training usually happens by "unfolding" the network in time (Back Propagation Through Time -- BPTT)

  • In theory they can handle sequences of any length. In practice it is difficult due to exploding and vanishing gradients.

  • One solution is to use a different network architecture, such as LSTM or GRU

77 / 126

Recurrent Neural Networks

  • A "neural" way of handling sequences

  • Their training usually happens by "unfolding" the network in time (Back Propagation Through Time -- BPTT)

  • In theory they can handle sequences of any length. In practice it is difficult due to exploding and vanishing gradients.

  • One solution is to use a different network architecture, such as LSTM or GRU

  • Both of these are used with great success in many practical applications, especially in the sequence-to-sequence setting

78 / 126

Attention and Transformers

79 / 126

History of Deep Learning Milestones

From Deep Learning State of the Art (2020) by Lex Fridman at MIT

80 / 126

The perils of seq2seq modeling

81 / 126

The perils of seq2seq modeling

Aren't we throwing out a bit too much?

Videos from https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

82 / 126

The fix

Let's use the full encoder output!

83 / 126

The fix

Let's use the full encoder output!

But how do we combine all the hidden states together?

84 / 126

The mechanics of Attention

85 / 126

Getting alignment with attention
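In code, the alignment is just a few lines; a minimal sketch with random numbers (dot-product scoring here; the original Bahdanau et al. attention uses a small feed-forward network as the scorer):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

encoder_states = np.random.randn(6, 16)   # one hidden state per source word
decoder_state = np.random.randn(16)       # current state of the decoder

scores = encoder_states @ decoder_state   # how relevant is each source position right now
weights = softmax(scores)                 # the "alignment" that gets visualized below
context = weights @ encoder_states        # weighted sum the decoder gets to use

Instead of a single final hidden state, the decoder gets a freshly mixed context vector at every output step.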

86 / 126

Attention visualized

See nice demo at https://distill.pub/2016/augmented-rnns/

87 / 126

What if we only used attention?

88 / 126

89 / 126

The Transformer architecture

Images from https://jalammar.github.io/illustrated-transformer/

90 / 126

The Transformer's Encoder

91 / 126

What's Self Attention?

The animal didn't cross the street because it was too tired.

What does "it" refer to?

92 / 126

What's Self Attention?

The animal didn't cross the street because it was too tired.

What does "it" refer to?

93 / 126

Self Attention mechanics
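The computation itself is compact; a minimal single-head sketch with random numbers (the full Transformer adds multiple heads, a projection back to the model dimension, and positional encodings):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, d_k = 10, 64, 64
x = np.random.randn(seq_len, d_model)          # embeddings of the input words
W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))

Q, K, V = x @ W_q, x @ W_k, x @ W_v            # every word builds a query, key and value
weights = softmax(Q @ K.T / np.sqrt(d_k))      # how much each word attends to each other word
output = weights @ V                           # context-aware representation of every word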

94 / 126

The full Transformer seq2seq process

95 / 126

Big Transformers Wins: GPT-2

Try it yourself at https://transformer.huggingface.co/doc/gpt2-large
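If you would rather try it locally, a hedged sketch using the Huggingface transformers library discussed a few slides further on:

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("Neural networks for sequence processing", max_length=40)[0]["generated_text"])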

96 / 126

Big Transformers Wins: BERT

97 / 126

Big Transformers Wins: BERT

98 / 126

BERT Applications: any classification task

  • Thanks to contextualized word vectors that BERT provides, the performance on many tasks has increased significantly
99 / 126

BERT Applications: any classification task

  • Thanks to contextualized word vectors that BERT provides, the performance on many tasks has increased significantly

  • (This is what it means to go from word vectors to transformers and back)

100 / 126

BERT Applications: any classification task

  • Thanks to contextualized word vectors that BERT provides, the performance on many tasks has increased significantly

  • (This is what it means to go from word vectors to transformers and back)

  • For a real-life example of what it is like to work with it, I recommend the PyTorch Sentiment Analysis tutorial; a minimal sketch of the raw contextualized vectors follows below
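A minimal sketch of what those contextualized vectors look like in practice (model and sentences chosen for illustration): the same word gets a different vector depending on its sentence.

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

batch = tok(["The bank was closed.", "The river bank was muddy."],
            padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**batch).last_hidden_state   # (sentences, tokens, 768)

print(hidden.shape)   # one 768-dimensional vector per token, per sentence
# The two vectors for "bank" will differ, unlike with static word embeddings.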

101 / 126

BERT Applications: ViLBERT

  • A single model that can perform various vision and language tasks

vilbert.cloudcv.org

102 / 126

BERT Applications: Better Ctrl+F

103 / 126

Big Transformer Wins: Huggingface transformers

  • A very nicely done library that allows anyone with some Python knowledge to play with pretrained state-of-the-art models (more in the docs).
104 / 126

Big Transformer Wins: Huggingface transformers

  • A very nicely done library that allows anyone with some Python knowledge to play with pretrained state-of-the-art models (more in the docs).

  • A small example: English to Slovak translator in about 10 lines of Python code: *

from transformers import MarianTokenizer, MarianMTModel

src = 'en'  # source language
trg = 'sk'  # target language
sample_text = "When will this presentation end ?"

# Helsinki-NLP publishes a pretrained MarianMT model for each language pair
mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'
model = MarianMTModel.from_pretrained(mname)
tok = MarianTokenizer.from_pretrained(mname)

# Tokenize, translate and decode back to text
# (newer transformers versions deprecate prepare_seq2seq_batch in favour of
# calling the tokenizer directly, e.g. tok([sample_text], return_tensors="pt"))
batch = tok.prepare_seq2seq_batch(src_texts=[sample_text])
gen = model.generate(**batch)
words = tok.batch_decode(gen, skip_special_tokens=True)
print(words)  # a list with the Slovak translation

Works with many other languages as well -- the full list is here

105 / 126

Attention and Transformers: Recap

106 / 126

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well
107 / 126

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well

  • It turned out it was all that was needed for (bounded) sequence processing

108 / 126

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well

  • It turned out it was all that was needed for (bounded) sequence processing

  • Transformer is an encoder-decoder architecture that is "all the rage" now

109 / 126

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well

  • It turned out it was all that was needed for (bounded) sequence processing

  • Transformer is an encoder-decoder architecture that is "all the rage" now

  • It has no time dependency due to self-attention and is therefore easy to parallelize

110 / 126

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well

  • It turned out it was all that was needed for (bounded) sequence processing

  • Transformer is an encoder-decoder architecture that is "all the rage" now

  • It has no time dependency due to self-attention and is therefore easy to parallelize

  • Well known models like BERT and GPT-* took the world of NLP by storm

111 / 126

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well

  • It turned out it was all that was needed for (bounded) sequence processing

  • Transformer is an encoder-decoder architecture that is "all the rage" now

  • It has no time dependency due to self-attention and is therefore easy to parallelize

  • Well known models like BERT and GPT-* took the world of NLP by storm

  • Very helpful in many tasks, easy to play with thanks to the Huggingface transformers library
112 / 126

Our path today

  • From text to word vectors
113 / 126

Our path today

  • From text to word vectors

  • From word vectors to sequence processing

114 / 126

Our path today

  • From text to word vectors

  • From word vectors to sequence processing

  • From sequence processing to attention

115 / 126

Our path today

  • From text to word vectors

  • From word vectors to sequence processing

  • From sequence processing to attention

  • From attention to transformers

116 / 126

Our path today

  • From text to word vectors

  • From word vectors to sequence processing

  • From sequence processing to attention

  • From attention to transformers

  • Via transformers back to (better, contextualized) word vectors

117 / 126

Current Frontiers

118 / 126

Smaller models

  • BERT (Large) is over 1 GB when serialized on disk (even BERT Base is about 430 MB). Even for Google it's at best impractical to put into production
  • The trend is to make models smaller while keeping their performance roughly the same (a quick parameter-count sketch follows this list)
    • DistilBERT
    • ALBERT
    • TinyBERT
    • MobileBERT
  • Various avenues of research:
    • Knowledge Distillation
    • Quantization
    • ...
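A quick parameter-count comparison (approximate numbers in the comments; model names as published on the Huggingface hub):

from transformers import AutoModel

count = lambda m: sum(p.numel() for p in m.parameters())

bert = AutoModel.from_pretrained("bert-base-uncased")           # ~110M parameters
distil = AutoModel.from_pretrained("distilbert-base-uncased")   # ~66M parameters
print(count(bert), count(distil))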
119 / 126

Non-English NLP

  • Most of the cutting-edge research in NLP happens in English
120 / 126

Non-English NLP

121 / 126

Non-English NLP

  • Most of the cutting-edge research in NLP happens in English

  • Most of the world does not speak English

  • Many problems "solved" in English are still open in other (smaller) languages like Slovak or Ukrainian

122 / 126

Doing more with less data

  • Data is almost always the bottleneck for NLP projects
123 / 126

Doing more with less data

  • Data is almost always the bottleneck for NLP projects

  • Being able to do data augmentation can push a project from "no-go" to "doable"

124 / 126

Doing more with less data

  • Data is almost always the bottleneck for NLP projects

  • Being able to do data augmentation can push a project from "no-go" to "doable"

  • An open area of research; some examples include backtranslation or MixUp for text (a minimal backtranslation sketch follows)
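A minimal backtranslation sketch, reusing the MarianMT models from the translation example earlier (the en-fr pair is just one possible choice):

from transformers import MarianMTModel, MarianTokenizer

def translate(texts, mname):
    tok = MarianTokenizer.from_pretrained(mname)
    model = MarianMTModel.from_pretrained(mname)
    batch = tok(texts, return_tensors="pt", padding=True)
    return tok.batch_decode(model.generate(**batch), skip_special_tokens=True)

original = ["Data is almost always the bottleneck for NLP projects."]
french = translate(original, "Helsinki-NLP/opus-mt-en-fr")
paraphrase = translate(french, "Helsinki-NLP/opus-mt-fr-en")
print(paraphrase)   # a noisy paraphrase that can be added to the training data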

125 / 126

marek@mareksuppa.com

126 / 126