
Neural Network techniques for sequence processing (in NLP)

From word vectors to transformers and back

Marek Šuppa
MLSS 2020, Žilina

1 / 126

Short history of Neural Network approaches to sequence processing

  • 2001: Neural Language Models
  • 2013: Word Embeddings
  • 2014: Sequence-to-Sequence models
  • 2015: Attention
  • 2016: Neural Machine Translation boom
  • 2017: Transformers
  • 2018: Pretrained Contextualized Word Embeddings (ELMo)
  • 2019+: Massive Transformer Models (BERT, GPT-2, ...)
  • 2020: Current Frontiers
3 / 126

Short history of Neural Network approaches to sequence processing

  • 2001: Neural Language Models
  • 2013: Word Embeddings *
  • 2014: Sequence-to-Sequence models
  • 2015: Attention *
  • 2016: Neural Machine Translation boom
  • 2017: Transformers *
  • 2018: Pretrained Contextualized Word Embeddings (ELMo)
  • 2019+: Massive Transformer Models (BERT, GPT-2, ...) *
  • 2020: Current Frontiers

* Today's (loose) agenda

4 / 126

Word Embeddings

5 / 126

Word Embeddings

"How do we convert free text into a format understandable by neural networks?"

One-hot encoding.

6 / 126

Word Embeddings

"How do we convert free text into a format understandable by neural networks?"

One-hot encoding.

"That results in extra large vectors. Can we do better?"

Well, there is this thing called dense representations.

7 / 126

Word Embeddings

"How do we convert free text into a format understandable by neural networks?"

One-hot encoding.

"That results in extra large vectors. Can we do better?"

Well, there is this thing called dense representations.

"Cool! While we are at it, can we make the format semantically meaningful?"

We don't even know how to train them quickly, but we can try...
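To make the contrast concrete, here is a minimal sketch (a toy four-word vocabulary, random values standing in for trained ones) of a one-hot vector versus a dense embedding lookup:

import numpy as np

vocab = ["king", "queen", "man", "woman"]

# One-hot: a vector as long as the vocabulary, with a single 1
one_hot_king = np.zeros(len(vocab))
one_hot_king[vocab.index("king")] = 1.0              # [1, 0, 0, 0]

# Dense representation: a small (trainable) matrix with one row per word
embedding_dim = 3
embedding_matrix = np.random.randn(len(vocab), embedding_dim)
dense_king = embedding_matrix[vocab.index("king")]   # 3 numbers instead of |V|

With a real vocabulary the one-hot vector would have hundreds of thousands of entries, while the dense one stays at a few hundred at most.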

8 / 126

Word Embeddings: A real-world example

A GloVe word embedding for king (trained on Wikipedia):

[
0.50451 , 0.68607 , -0.59517 , -0.022801, 0.60046 , -0.13498 , -0.08813 ,
0.47377 , -0.61798 , -0.31012 , -0.076666, 1.493 , -0.034189, -0.98173 ,
0.68229 , 0.81722 , -0.51874 , -0.31503 , -0.55809 , 0.66421 , 0.19611 ,
-0.13495 , -0.11476 , -0.30344 , 0.41177 , -2.223 , -1.0756 , -1.0783 ,
-0.34354 , 0.33505 , 1.9927 , -0.04234 , -0.64319 , 0.71125 , 0.49159 ,
0.16754 , 0.34344 , -0.25663 , -0.8523 , 0.1661 , 0.40102 , 1.1685 ,
-1.0137 , -0.21585 , -0.15155 , 0.78321 , -0.91241 , -1.6106 , -0.64426 ,
-0.51042
]

Example and subsequent images taken from the great Illustrated Word2Vec
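If you want to poke at these vectors yourself, one option (not from the slides) is gensim's downloader, which packages the same 50-dimensional Wikipedia GloVe vectors under the name "glove-wiki-gigaword-50":

import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")   # downloads the vectors on first use
print(glove["king"][:7])                     # the first few numbers of the vector above
print(glove.similarity("king", "queen"))     # cosine similarity between two word vectors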

9 / 126

Word Embeddings: Visualized

By converting the numbers to colored bars we get a totally different picture:

10 / 126

Word Embeddings: Visualized

By converting the numbers to colored bars we get a totally different picture:

Especially in context:

11 / 126

Word Embeddings: Visualized II

12 / 126

Word Embeddings: Visualized III

It seems that some semantic meaning does indeed get encoded!*

13 / 126

Word Embeddings: Visualized III

It seems that some semantic meaning does indeed get encoded!*

* Note that king-man+woman does not exactly equal queen -- it's just that queen was the closest word in the space to the result of king-man+woman
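The footnote can be reproduced directly; a minimal sketch reusing the gensim GloVe vectors loaded in the earlier example:

result = glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)   # queen is expected to come out on top, but only as the *nearest* word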

14 / 126

Word Embeddings: Training

So how would one come up with these?

15 / 126

Word Embeddings: Training

So how would one come up with these?

Let a neural net learn the proper values on its own!

16 / 126

Word Embeddings: Training

So how would one come up with these?

Let a neural net learn the proper values on its own!

But how?

17 / 126

Word Embeddings: Training

So how would one come up with these?

Let a neural net learn the proper values on its own!

But how?

Give it some context and let it predict the center word.

18 / 126

Word Embeddings: Training

So how would one come up with these?

Let a neural net learn the proper values on its own!

But how?

Give it some context and let it predict the center word.

Image from https://amitness.com/2020/06/fasttext-embeddings/
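As a rough illustration of the idea, here is a CBOW-style sketch in PyTorch with made-up word ids (the real word2vec training additionally relies on tricks such as negative sampling to make this fast):

import torch
import torch.nn as nn

vocab_size, dim = 1000, 50
emb = nn.Embedding(vocab_size, dim)   # these weights become the word vectors
out = nn.Linear(dim, vocab_size)      # scores every word in the vocabulary
loss_fn = nn.CrossEntropyLoss()

context = torch.tensor([[12, 7, 431, 9]])   # ids of the surrounding words (hypothetical)
center = torch.tensor([256])                # id of the center word to predict (hypothetical)

logits = out(emb(context).mean(dim=1))      # combine the context, score every candidate
loss = loss_fn(logits, center)              # training minimizes this over a large corpus
loss.backward()

After enough such updates, the rows of emb end up being the word embeddings we were after.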

19 / 126

Word Embeddings: Limitations

Out Of Vocabulary Words

Morphology

21 / 126

Zipf's Law: the fundamental limitation

The general take-home point seems to be:

You never see most things.

Image from https://shadycharacters.co.uk/2015/10/zipfs-law/
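You can see this on any text you have lying around; a minimal sketch (the file name is a placeholder):

from collections import Counter

with open("some_corpus.txt") as f:                 # hypothetical corpus file
    counts = Counter(f.read().lower().split())

for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(rank, word, freq)                        # frequency falls off roughly as 1/rank

# Meanwhile, a large fraction of the vocabulary appears only once or twice.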

22 / 126

Word Embeddings: FastText

The idea is relatively simple: split words into their character n-grams

23 / 126

Word Embeddings: FastText

The idea is relatively simple: split words into their character n-grams

Sum the n-grams together to create the embedding.
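A minimal sketch of the idea (real FastText uses n-grams of several lengths plus the whole word, and hashes the n-grams into buckets, but the principle is the same):

import numpy as np

def char_ngrams(word, n=3):
    # FastText-style n-grams: add boundary symbols and slide a window over the word
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))   # ['<wh', 'whe', 'her', 'ere', 're>']

# The word vector is then (roughly) the sum of its n-gram vectors, so even an
# unseen word gets a representation from n-grams it shares with known words.
ngram_vectors = {g: np.random.randn(50) for g in char_ngrams("where")}   # stand-ins
word_vector = sum(ngram_vectors[g] for g in char_ngrams("where"))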

24 / 126

Word Embeddings: FastText Classification

This concept can be very easily extended to classification, which you can try out in Google Colab:

Text Classification with FastText in Google Colab
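Outside of Colab, the same thing with the official fasttext Python package looks roughly like this (the file name and labels are made up; each training line starts with __label__something followed by the text):

import fasttext

model = fasttext.train_supervised(input="train.txt", epoch=10, wordNgrams=2)
print(model.predict("this lecture was great"))   # predicted label(s) and probability
model.save_model("classifier.bin")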

25 / 126

Word Embeddings: Recap

  • Alternative to one-hot encoding
26 / 126

Word Embeddings: Recap

  • Alternative to one-hot encoding

  • Created as a byproduct of training a neural network

27 / 126

Word Embeddings: Recap

  • Alternative to one-hot encoding

  • Created as a byproduct of training a neural network

  • Seems to encode some semantic meaning and allows for quick comparison

28 / 126

Word Embeddings: Recap

  • Alternative to one-hot encoding

  • Created as a byproduct of training a neural network

  • Seems to encode some semantic meaning and allows for quick comparison

  • Suffers from the implications of Zipf's law

29 / 126

Word Embeddings: Recap

  • Alternative to one-hot encoding

  • Created as a byproduct of training a neural network

  • Seems to encode some semantic meaning and allows for quick comparison

  • Suffers from the implications of Zipf's law

  • FastText tries to alleviate this and can be directly used for classification as well.
30 / 126

Recurrent Neural Networks

31 / 126

Types of Neural Networks

Image from https://karpathy.github.io/2015/05/21/rnn-effectiveness/

32 / 126

Unfolded RNN

33 / 126

Unfolded RNN

34 / 126

Training unfolded RNN

This concept is called Back Propagation Through Time (BPTT)
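A minimal sketch of what the unfolding means in code (toy dimensions, random inputs): the same weight matrices are reused at every time step, and the gradient flows back through all of them.

import torch

dim_in, dim_h, steps = 8, 16, 5
W_x = torch.randn(dim_in, dim_h, requires_grad=True)
W_h = torch.randn(dim_h, dim_h, requires_grad=True)

h = torch.zeros(1, dim_h)
for t in range(steps):
    x_t = torch.randn(1, dim_in)            # stand-in for the t-th input
    h = torch.tanh(x_t @ W_x + h @ W_h)     # new hidden state from input + previous state

loss = h.sum()      # a real loss would compare outputs against targets
loss.backward()     # BPTT: gradients flow back through every unrolled step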

35 / 126

Training unfolded RNN

Note how various parts of the unfolded RNN impact h₂

36 / 126

Problems with long-term dependencies

37 / 126

LSTM: what to forget and what to remember

38 / 126

LSTM: Conveyor belt

39 / 126

LSTM: Conveyor belt

  • The cell state is sort of a "conveyor belt"
40 / 126

LSTM: Conveyor belt

  • The cell state is sort of a "conveyor belt"

  • Allows information to stay unchanged or get slightly updated

All the following nice images are from http://colah.github.io/posts/2015-08-Understanding-LSTMs/ which I highly recommend

41 / 126

LSTM: Conveyor belt

Step 1: Decide what to forget

42 / 126

LSTM: Conveyor belt

Step 2: Decide

  • which values to update (iₜ)
  • what the new values should be (C̃ₜ)
43 / 126

LSTM: Conveyor belt

Step 2.5: perform forgetting and update

44 / 126

LSTM: Conveyor belt

Step 3: produce output (hₜ)
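Putting the three steps together, a minimal sketch of a single LSTM step (toy sizes, random parameters; in practice you would simply use torch.nn.LSTM):

import torch

d_in, d_h = 8, 16
gates = ["f", "i", "c", "o"]   # forget, input, candidate, output
W = {g: torch.randn(d_in, d_h) for g in gates}
U = {g: torch.randn(d_h, d_h) for g in gates}
b = {g: torch.zeros(d_h) for g in gates}

def lstm_step(x_t, h_prev, C_prev):
    f_t = torch.sigmoid(x_t @ W["f"] + h_prev @ U["f"] + b["f"])   # step 1: what to forget
    i_t = torch.sigmoid(x_t @ W["i"] + h_prev @ U["i"] + b["i"])   # step 2: which values to update
    C_hat = torch.tanh(x_t @ W["c"] + h_prev @ U["c"] + b["c"])    # step 2: candidate new values
    C_t = f_t * C_prev + i_t * C_hat                               # step 2.5: update the conveyor belt
    o_t = torch.sigmoid(x_t @ W["o"] + h_prev @ U["o"] + b["o"])
    h_t = o_t * torch.tanh(C_t)                                    # step 3: produce the output
    return h_t, C_t

h, C = torch.zeros(1, d_h), torch.zeros(1, d_h)
h, C = lstm_step(torch.randn(1, d_in), h, C)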

45 / 126

LSTM: Conveyor belt

A conveyor belt that can pick

  • what to remember
  • what to forget
  • what to output
46 / 126

GRU: Simplified conveyor belt

47 / 126

GRU: Simplified conveyor belt

  • Forget and input combined into a single "update gate" (zₜ)
48 / 126

GRU: Simplified conveyor belt

  • Forget and input combined into a single "update gate" (zₜ)

  • Cell state (Cₜ in LSTM) merged with the hidden state (hₜ)

49 / 126

GRU vs LSTM

  • GRU is smaller and hence requires less compute
  • But it turns out it cannot count (especially over longer sequences)
50 / 126

GRU vs LSTM

  • GRU is smaller and hence requires less compute
  • But it turns out it cannot count (especially over longer sequences)

On the Practical Computational Power of Finite Precision RNNs for Language Recognition (2018)
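The size difference is easy to check directly (arbitrary sizes; a GRU has three gate-sized weight blocks where an LSTM has four):

import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=256)
gru = nn.GRU(input_size=128, hidden_size=256)

count = lambda m: sum(p.numel() for p in m.parameters())
print("LSTM:", count(lstm))   # roughly 395k parameters
print("GRU: ", count(gru))    # roughly 296k parameters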

51 / 126

Application: Handwriting from Text

Input: "He dismissed the idea"

52 / 126

Application: Handwriting from Text

Input: "He dismissed the idea"

Output:

53 / 126

Application: Handwriting from Text

Input: "He dismissed the idea"

Output:

Generating Sequences With Recurrent Neural Networks, Alex Graves, 2013

Demo at https://www.cs.toronto.edu/~graves/handwriting.html

54 / 126

Application: Character-Level Text Generation

"The Unreasonable Effectiveness of Recurrent Neural Networks", Andrej Karpathy, 2015

55 / 126

Application: Image Question Answering

56 / 126

Application: Image Question Answering

Exploring Models and Data for Image Question Answering, 2015

57 / 126

Application: Image Caption Generation

58 / 126

Application: Video Caption Generation

59 / 126

Application: Video Caption Generation

Sequence to Sequence - Video to Text, Venugopalan et al., 2015

60 / 126

Application: Adding Audio to Silent Film

61 / 126

Application: Adding Audio to Silent Film

Visually Indicated Sounds, Owens et al., 2015

62 / 126

Application: Medical Diagnosis

63 / 126

Application: End-to-End Driving *



64 / 126

Application: End-to-End Driving *



Input: features extracted from CNN
Output: predicted steering angle

* On relatively straight roads

65 / 126

Application: Sentiment Analysis

Try it yourself at https://demo.allennlp.org/sentiment-analysis/

66 / 126

Application: Named Entity Recognition (NER)

Try it yourself at https://demo.allennlp.org/named-entity-recognition/

67 / 126

Application: Trump2Cash

68 / 126

Application: Trump2Cash

  • A combination of Sentiment Analysis and Named Entity Recognition
69 / 126

Application: Trump2Cash

  • A combination of Sentiment Analysis and Named Entity Recognition

How it works (a rough sketch in code follows the steps):

  1. Monitor tweets of Donald Trump
  2. Use NER to see if some of them mention a publicly traded company
  3. Apply sentiment analysis and use its result to decide whether to buy or sell
  4. Profit?
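A rough sketch of steps 2 and 3 using off-the-shelf models (here the Huggingface pipelines introduced later in this talk; the actual Trump2Cash bot is built from different components):

from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")   # named entity recognition
sentiment = pipeline("sentiment-analysis")

tweet = "Boeing is building a brand new 747 Air Force One. Costs are out of control!"
companies = [e["word"] for e in ner(tweet) if e["entity_group"] == "ORG"]
verdict = sentiment(tweet)[0]                          # e.g. {'label': 'NEGATIVE', ...}

for company in companies:
    action = "buy" if verdict["label"] == "POSITIVE" else "sell"
    print(action, company)                             # step 4 is left as an exercise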
70 / 126

Application: Trump2Cash

See predictions live at https://twitter.com/Trump2Cash

71 / 126

Application: Machine Translation

72 / 126

Application: Machine Translation

73 / 126

Recurrent Neural Networks

  • A "neural" way of handling sequences
74 / 126

Recurrent Neural Networks

  • A "neural" way of handling sequences

  • Their training usually happens by "unfolding" the network in time (Back Propagation Through Time -- BPTT)

75 / 126

Recurrent Neural Networks

  • A "neural" way of handling sequences

  • Their training usually happens by "unfolding" the network in time (Back Propagation Through Time -- BPTT)

  • In theory they can handle sequences of any length. In practice it is difficult due to exploding and vanishing gradients.

76 / 126

Recurrent Neural Networks

  • A "neural" way of handling sequences

  • Their training usually happens by "unfolding" the network in time (Back Propagation Through Time -- BPTT)

  • In theory they can handle sequences of any length. In practice it is difficult due to exploding and vanishing gradients.

  • One solution is to use a different network architecture, such as LSTM or GRU

77 / 126

Recurrent Neural Networks

  • A "neural" way of handling sequences

  • Their training usually happens by "unfolding" the network in time (Back Propagation Through Time -- BPTT)

  • In theory they can handle sequences of any length. In practice it is difficult due to exploding and vanishing gradients.

  • One solution is to use a different network architecture, such as LSTM or GRU

  • Both of these are used with great success in many practical applications, especially in the sequence-to-sequence setting

78 / 126

Attention and Transformers

79 / 126

History of Deep Learning Milestones

From Deep Learning State of the Art (2020) by Lex Fridman at MIT

80 / 126

The perils of seq2seq modeling

81 / 126

The perils of seq2seq modeling

Aren't we throwing out a bit too much?

Videos from https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

82 / 126

The fix

Let's use the full encoder output!

83 / 126

The fix

Let's use the full encoder output!

But how do we combine all the hidden states together?

84 / 126

The mechanics of Attention

85 / 126

Getting alignment with attention
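In code, the alignment is just a few lines; a minimal sketch with random numbers (dot-product scoring here; the original Bahdanau et al. attention uses a small feed-forward network as the scorer):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

encoder_states = np.random.randn(6, 16)   # one hidden state per source word
decoder_state = np.random.randn(16)       # current state of the decoder

scores = encoder_states @ decoder_state   # how relevant is each source position right now
weights = softmax(scores)                 # the "alignment" that gets visualized below
context = weights @ encoder_states        # weighted sum the decoder gets to use

Instead of a single final hidden state, the decoder gets a freshly mixed context vector at every output step.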

86 / 126

Attention visualized

See nice demo at https://distill.pub/2016/augmented-rnns/

87 / 126

What if we only used attention?

88 / 126

89 / 126

The Transformer architecture

Images from https://jalammar.github.io/illustrated-transformer/

90 / 126

The Transformer's Encoder

91 / 126

What's Self Attention?

The animal didn't cross the street because it was too tired.

What does "it" refer to?

92 / 126

What's Self Attention?

The animal didn't cross the street because it was too tired.

What does "it" refer to?

93 / 126

Self Attention mechanics
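The computation itself is compact; a minimal single-head sketch with random numbers (the full Transformer adds multiple heads, a projection back to the model dimension, and positional encodings):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, d_k = 10, 64, 64
x = np.random.randn(seq_len, d_model)          # embeddings of the input words
W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))

Q, K, V = x @ W_q, x @ W_k, x @ W_v            # every word builds a query, key and value
weights = softmax(Q @ K.T / np.sqrt(d_k))      # how much each word attends to each other word
output = weights @ V                           # context-aware representation of every word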

94 / 126

The full Transformer seq2seq process

95 / 126

Big Transformers Wins: GPT-2

Try it yourself at https://transformer.huggingface.co/doc/gpt2-large
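If you would rather try it locally, a hedged sketch using the Huggingface transformers library discussed a few slides further on:

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("Neural networks for sequence processing", max_length=40)[0]["generated_text"])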

96 / 126

Big Transformers Wins: BERT

97 / 126

Big Transformers Wins: BERT

98 / 126

BERT Applications: any classification task

  • Thanks to contextualized word vectors that BERT provides, the performance on many tasks has increased significantly
99 / 126

BERT Applications: any classification task

  • Thanks to contextualized word vectors that BERT provides, the performance on many tasks has increased significantly

  • (This is what it means to go from word vectors to transformers and back)

100 / 126

BERT Applications: any classification task

  • Thanks to contextualized word vectors that BERT provides, the performance on many tasks has increased significantly

  • (This is what it means to go from word vectors to transformers and back)

  • For a real-life example of what it is like to work with it, I recommend the PyTorch Sentiment Analysis tutorial; a minimal sketch of the raw contextualized vectors follows below
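A minimal sketch of what those contextualized vectors look like in practice (model and sentences chosen for illustration): the same word gets a different vector depending on its sentence.

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

batch = tok(["The bank was closed.", "The river bank was muddy."],
            padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**batch).last_hidden_state   # (sentences, tokens, 768)

print(hidden.shape)   # one 768-dimensional vector per token, per sentence
# The two vectors for "bank" will differ, unlike with static word embeddings.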

101 / 126

BERT Applications: ViLBERT

  • A single model that can perform various vision and language tasks

vilbert.cloudcv.org

102 / 126

BERT Applications: Better Ctrl+F

103 / 126

Big Transformer Wins: Huggingface transformers

  • A very nicely done library that allows anyone with some Python knowledge to play with pretrained state-of-the-art models (more in the docs).
104 / 126

Big Transformer Wins: Huggingface transformers

  • A very nicely done library that allows anyone with some Python knowledge to play with pretrained state-of-the-art models (more in the docs).

  • A small example: English to Slovak translator in about 10 lines of Python code: *

from transformers import MarianTokenizer, MarianMTModel

src = 'en'  # source language
trg = 'sk'  # target language
sample_text = "When will this presentation end ?"

# Helsinki-NLP publishes a pretrained MarianMT model for each language pair
mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'
model = MarianMTModel.from_pretrained(mname)
tok = MarianTokenizer.from_pretrained(mname)

# Tokenize, translate and decode back to text
# (newer transformers versions deprecate prepare_seq2seq_batch in favour of
# calling the tokenizer directly, e.g. tok([sample_text], return_tensors="pt"))
batch = tok.prepare_seq2seq_batch(src_texts=[sample_text])
gen = model.generate(**batch)
words = tok.batch_decode(gen, skip_special_tokens=True)
print(words)  # a list with the Slovak translation

Works with many other languages as well -- the full list is here

105 / 126

Attention and Transformers: Recap

106 / 126

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well
107 / 126

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well

  • It turned out it was all that was needed for (bounded) sequence processing

108 / 126

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well

  • It turned out it was all that was needed for (bounded) sequence processing

  • Transformer is an encoder-decoder architecture that is "all the rage" now

109 / 126

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well

  • It turned out it was all that was needed for (bounded) sequence processing

  • Transformer is an encoder-decoder architecture that is "all the rage" now

  • It has no time dependency due to self-attention and is therefore easy to parallelize

110 / 126

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well

  • It turned out it was all that was needed for (bounded) sequence processing

  • Transformer is an encoder-decoder architecture that is "all the rage" now

  • It has no time dependency due to self-attention and is therefore easy to parallelize

  • Well known models like BERT and GPT-* took the world of NLP by storm

111 / 126

Attention and Transformers: Recap

  • Attention was a fix for sequence models that did not really work too well

  • It turned out it was all that was needed for (bounded) sequence processing

  • Transformer is an encoder-decoder architecture that is "all the rage" now

  • It has no time dependency due to self-attention and is therefore easy to parallelize

  • Well known models like BERT and GPT-* took the world of NLP by storm

  • Very helpful in many tasks, easy to play with thanks to the Huggingface transformers library
112 / 126

Our path today

  • From text to word vectors
113 / 126

Our path today

  • From text to word vectors

  • From word vectors to sequence processing

114 / 126

Our path today

  • From text to word vectors

  • From word vectors to sequence processing

  • From sequence processing to attention

115 / 126

Our path today

  • From text to word vectors

  • From word vectors to sequence processing

  • From sequence processing to attention

  • From attention to transformers

116 / 126

Our path today

  • From text to word vectors

  • From word vectors to sequence processing

  • From sequence processing to attention

  • From attention to transformers

  • Via transformers back to (better, contextualized) word vectors

117 / 126

Current Frontiers

118 / 126

Smaller models

  • BERT (Large) is over 1 GB when serialized on disk (even BERT Base is about 430 MB). Even for Google it's at best impractical to put into production
  • The trend is to make models smaller while keeping their performance roughly the same (a quick parameter-count sketch follows this list)
    • DistilBERT
    • ALBERT
    • TinyBERT
    • MobileBERT
  • Various avenues of research:
    • Knowledge Distillation
    • Quantization
    • ...
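A quick parameter-count comparison (approximate numbers in the comments; model names as published on the Huggingface hub):

from transformers import AutoModel

count = lambda m: sum(p.numel() for p in m.parameters())

bert = AutoModel.from_pretrained("bert-base-uncased")           # ~110M parameters
distil = AutoModel.from_pretrained("distilbert-base-uncased")   # ~66M parameters
print(count(bert), count(distil))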
119 / 126

Non-English NLP

  • Most of the cutting-edge research in NLP happens in English
120 / 126

Non-English NLP

121 / 126

Non-English NLP

  • Most of the cutting-edge research in NLP happens in English

  • Most of the world does not speak English

  • Many problems "solved" in English are still open in other (smaller) languages like Slovak or Ukrainian

122 / 126

Doing more with less data

  • Data is almost always the bottleneck for NLP projects
123 / 126

Doing more with less data

  • Data is almost always the bottleneck for NLP projects

  • Being able to do data augmentation can push a project from "no-go" to "doable"

124 / 126

Doing more with less data

  • Data is almost always the bottleneck for NLP projects

  • Being able to do data augmentation can push a project from "no-go" to "doable"

  • An open area of research; some examples include backtranslation or MixUp for text (a minimal backtranslation sketch follows)
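A minimal backtranslation sketch, reusing the MarianMT models from the translation example earlier (the en-fr pair is just one possible choice):

from transformers import MarianMTModel, MarianTokenizer

def translate(texts, mname):
    tok = MarianTokenizer.from_pretrained(mname)
    model = MarianMTModel.from_pretrained(mname)
    batch = tok(texts, return_tensors="pt", padding=True)
    return tok.batch_decode(model.generate(**batch), skip_special_tokens=True)

original = ["Data is almost always the bottleneck for NLP projects."]
french = translate(original, "Helsinki-NLP/opus-mt-en-fr")
paraphrase = translate(french, "Helsinki-NLP/opus-mt-fr-en")
print(paraphrase)   # a noisy paraphrase that can be added to the training data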

125 / 126

marek@mareksuppa.com

126 / 126