From word vectors to transformers and back
Marek Šuppa
MLSS 2020, Žilina
* Our (loose) agenda for today
"How do we convert free text into a format understandable by neural networks?"
One-hot encoding.
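To make this concrete, here is a minimal sketch of one-hot encoding (the toy sentence and vocabulary are made up for illustration):

import numpy as np

sentence = "the king met the queen".split()
vocab = sorted(set(sentence))                  # ['king', 'met', 'queen', 'the']
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A vector of zeros as long as the vocabulary, with a single 1 at the word's position
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

print(one_hot("king"))   # [1. 0. 0. 0.]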
"How do we convert free text into a format understandable by neural networks?"
One-hot encoding.
"That results in extra large vectors. Can we do better?"
Well, there is this thing called dense representations.
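A quick sketch of the contrast: instead of a vector as long as the vocabulary, each word gets a short row in an embedding matrix (the values below are random stand-ins for what would normally be learned):

import numpy as np

vocab = ['king', 'met', 'queen', 'the']
embedding_dim = 5                                           # much smaller than the vocabulary size
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), embedding_dim))   # normally learned, random here

king_vector = embeddings[vocab.index('king')]   # dense: 5 numbers, none of them forced to zero
print(king_vector)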
"How do we convert free text into a format understandable by neural networks?"
One-hot encoding.
"That results in extra large vectors. Can we do better?"
Well, there is this thing called dense representations.
"Cool! While we are at it, can we make the format semantically meaningful?"
We don't even know how to train them quickly, but we can try...
A GloVe word embedding for king
(trained on Wikipedia):
[ 0.50451 , 0.68607 , -0.59517 , -0.022801, 0.60046 , -0.13498 , -0.08813 , 0.47377 , -0.61798 , -0.31012 , -0.076666, 1.493 , -0.034189, -0.98173 , 0.68229 , 0.81722 , -0.51874 , -0.31503 , -0.55809 , 0.66421 , 0.1961 , -0.13495 , -0.11476 , -0.30344 , 0.41177 , -2.223 , -1.0756 , -1.0783 , -0.34354 , 0.33505 , 1.9927 , -0.04234 , -0.64319 , 0.71125 , 0.49159 , 0.16754 , 0.34344 , -0.25663 , -0.8523 , 0.1661 , 0.40102 , 1.1685 , -1.0137 , -0.21585 , -0.15155 , 0.78321 , -0.91241 , -1.6106 , -0.64426 , -0.51042]
Example and subsequent images taken from the great Illustrated Word2Vec
By converting the numbers to colored bars we get a totally different picture:
Especially in context:
It seems that some semantic meaning does indeed get encoded!*
* Note that king-man+woman does not exactly equal queen -- it's just that queen was the closest word in the space to the result of king-man+woman
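If you want to reproduce this, a hedged sketch using the gensim library and its downloader (both are assumptions, not part of the slides) looks roughly like this:

import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")   # downloads the vectors on first use
print(glove["king"][:5])                     # a peek at the 50-dimensional vector for "king"
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' should come out on top: it is the nearest word to king-man+woman, not an exact match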
So how would one come up with these?
Let a neural net learn the proper values on its own!
But how?
Give it some context and let it predict the center word.
Image from https://amitness.com/2020/06/fasttext-embeddings/
Image from https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html
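A minimal PyTorch sketch of this "predict the center word from its context" setup (CBOW-style, with a made-up toy corpus and hyperparameters) could look like this:

import torch
import torch.nn as nn

vocab = ['the', 'king', 'met', 'old', 'queen']
word2idx = {w: i for i, w in enumerate(vocab)}

# (context words, center word) pairs from a toy corpus
pairs = [(['the', 'met'], 'king'), (['the', 'old'], 'queen')]

emb = nn.Embedding(len(vocab), 10)      # the embedding table we actually care about
out = nn.Linear(10, len(vocab))         # predicts the center word from the averaged context
opt = torch.optim.SGD(list(emb.parameters()) + list(out.parameters()), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):
    for context, center in pairs:
        ctx = torch.tensor([word2idx[w] for w in context])
        target = torch.tensor([word2idx[center]])
        logits = out(emb(ctx).mean(dim=0, keepdim=True))   # average the context embeddings
        loss = loss_fn(logits, target)
        opt.zero_grad(); loss.backward(); opt.step()

print(emb.weight[word2idx['king']])     # the learned 10-dimensional vector for 'king'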
Out Of Vocabulary Words
Morphology
The general take-home point seems to be:
You never see most things.
Image from https://shadycharacters.co.uk/2015/10/zipfs-law/
The idea is relatively simple: split words into their character n-grams.
Sum the n-grams together to create the embedding.
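A rough sketch of that idea (a toy re-implementation, not the real fastText code; the bucket count and dimensionality below are made-up stand-ins):

import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    w = f"<{word}>"                                   # fastText marks word boundaries
    return [w[i:i + n] for n in range(n_min, n_max + 1)
                       for i in range(len(w) - n + 1)]

BUCKETS, DIM = 100_000, 50                            # fastText itself defaults to ~2M buckets
rng = np.random.default_rng(0)
ngram_table = rng.normal(size=(BUCKETS, DIM))         # in reality these vectors are learned

def embed(word):
    # hash each n-gram into the table and sum the vectors (real fastText uses a fixed hash)
    grams = char_ngrams(word)
    return np.sum([ngram_table[hash(g) % BUCKETS] for g in grams], axis=0)

print(char_ngrams("king")[:4])   # ['<ki', 'kin', 'ing', 'ng>']
print(embed("kingdom").shape)    # (50,) -- works even for never-seen words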
This concept can be very easily extended to classification, which you can try out in Google Colab:
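Outside of Colab, a hedged sketch with the fasttext Python package (the tiny training file and its labels are invented for illustration) would be roughly:

import fasttext

# a tiny made-up training file in the __label__<class> <text> format fastText expects
with open("toy_train.txt", "w") as f:
    f.write("__label__positive I really loved this movie\n")
    f.write("__label__negative This was a complete waste of time\n")
    f.write("__label__positive What a wonderful performance\n")
    f.write("__label__negative Terrible plot and terrible acting\n")

model = fasttext.train_supervised("toy_train.txt", epoch=25, wordNgrams=2)
print(model.predict("what a lovely film"))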
Alternative to one-hot encoding
Created as a byproduct of training a neural network
Seems to encode some semantic meaning and allows for quick comparison
Suffers from the implications of Zipf's law
Image from https://karpathy.github.io/2015/05/21/rnn-effectiveness/
This concept is called Back Propagation Through Time (BPTT)
Note how various parts of the unfolded RNN impact h2
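To see why, here is a tiny numpy sketch of unrolling a vanilla RNN for three steps (random toy weights, no training):

import numpy as np

rng = np.random.default_rng(0)
DIM = 4
W_xh = rng.normal(size=(DIM, DIM))   # input-to-hidden weights
W_hh = rng.normal(size=(DIM, DIM))   # hidden-to-hidden weights

def step(x_t, h_prev):
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)

x0, x1, x2 = rng.normal(size=(3, DIM))
h0 = step(x0, np.zeros(DIM))
h1 = step(x1, h0)
h2 = step(x2, h1)    # h2 depends on x2, x1, x0 and on both weight matrices at every step
print(h2)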
The cell state is sort of a "conveyor belt"
Allows information to stay unchanged or get slightly updated
All the following nice images are from http://colah.github.io/posts/2015-08-Understanding-LSTMs/ which I highly recommend
Step 1: Decide what to forget
Step 2: Decide what new information to store
Step 2.5: Perform the forgetting and the update
Step 3: Produce the output (ht)
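Putting the three steps together, a toy numpy sketch of a single LSTM step (random stand-in weights, biases omitted for brevity) looks like this:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

DIM = 4
rng = np.random.default_rng(0)
W_f, W_i, W_c, W_o = rng.normal(size=(4, DIM, 2 * DIM))   # one weight matrix per gate

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z)                # Step 1: decide what to forget
    i_t = sigmoid(W_i @ z)                # Step 2: decide what to store...
    C_tilde = np.tanh(W_c @ z)            # ...and the candidate values to store
    C_t = f_t * C_prev + i_t * C_tilde    # Step 2.5: forget and update the cell state
    o_t = sigmoid(W_o @ z)
    h_t = o_t * np.tanh(C_t)              # Step 3: produce the output ht
    return h_t, C_t

h, C = lstm_step(rng.normal(size=DIM), np.zeros(DIM), np.zeros(DIM))
print(h, C)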
A conveyor belt that can pick
Forget and input combined into a single "update gate" (zt)
Cell state (Ct in LSTM) merged with the hidden state (ht)
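The same kind of toy sketch for a single GRU step, showing the update gate zt doing the work of both the forget and input gates (again, random stand-in weights):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

DIM = 4
rng = np.random.default_rng(1)
W_z, W_r, W_h = rng.normal(size=(3, DIM, 2 * DIM))

def gru_step(x_t, h_prev):
    z_t = sigmoid(W_z @ np.concatenate([h_prev, x_t]))            # update gate
    r_t = sigmoid(W_r @ np.concatenate([h_prev, x_t]))            # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde                     # keep some, update some

print(gru_step(rng.normal(size=DIM), np.zeros(DIM)))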
On the Practical Computational Power of Finite Precision RNNs for Language Recognition (2018)
Input: "He dismissed the idea"
Input: "He dismissed the idea"
Output:
Input: "He dismissed the idea"
Output:
Generating Sequences With Recurrent Neural Networks, Alex Graves, 2013
Demo at https://www.cs.toronto.edu/~graves/handwriting.html
"The Unreasonable Effectiveness of Recurrent Neural Networks", Andrej Karpathy, 2015
Exploring Models and Data for Image Question Answering, 2015
Live Demo at https://vqa.cloudcv.org/
Sequence to Sequence - Video to Text, Venugopalan et al., 2015
Visually Indicated Sounds, Owens et al., 2015
More at http://andrewowens.com/vis/
Input: features extracted from a CNN
Output: predicted steering angle
* On relatively straight roads
Try it yourself at https://demo.allennlp.org/sentiment-analysis/
Try it yourself at https://demo.allennlp.org/named-entity-recognition/
How it works:
A "neural" way of handling sequences
Their training usually happens by "unfolding" the network in time (Back Propagation Through Time -- BPTT)
A "neural" way of handling sequences
Their training usually happens by "unfolding" the network in time (Back Propagation Through Time -- BPTT)
In theory they can handle sequences of any length. In practice it is difficult due to exploding and vanishing gradient.
A "neural" way of handling sequences
Their training usually happens by "unfolding" the network in time (Back Propagation Through Time -- BPTT)
In theory they can handle sequences of any length. In practice it is difficult due to exploding and vanishing gradient.
One solution is to use a different the network architecture, such as LSTM or GRU
A "neural" way of handling sequences
Their training usually happens by "unfolding" the network in time (Back Propagation Through Time -- BPTT)
In theory they can handle sequences of any length. In practice this is difficult due to exploding and vanishing gradients.
One solution is to use a different network architecture, such as LSTM or GRU
Both of these are used with great success in many practical applications, especially in the sequence-to-sequence setting
From Deep Learning State of the Art (2020) by Lex Fridman at MIT
Aren't we throwing out a bit too much?
Videos from https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
Let's use the full encoder output!
But how do we combine all the hidden states together?
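One common answer, sketched here with toy numpy values: score each encoder hidden state against the current decoder state, softmax the scores, and take the weighted sum as the context vector.

import numpy as np

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))     # 6 source tokens, hidden size 8
decoder_state = rng.normal(size=8)           # the decoder's current hidden state

scores = encoder_states @ decoder_state              # dot-product score for every encoder state
weights = np.exp(scores) / np.exp(scores).sum()      # softmax -> attention weights
context = weights @ encoder_states                   # weighted sum of all hidden states

print(weights.round(2))      # the weights sum to 1
print(context.shape)         # (8,) -- a single context vector fed to the decoder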
Images from https://jalammar.github.io/illustrated-transformer/
The animal didn't cross the street because it was too tired.
What does "it" refer to?
Try it yourself at https://transformer.huggingface.co/doc/gpt2-large
Thanks to contextualized word vectors that BERT provides, the performance on many tasks has increased significantly
(This is what it means to come from word vectors to transformers and back)
For a real-life example of what it means to work with it, I recommend the PyTorch Sentiment Analysis tutorial
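For a taste of what "contextualized" means, a hedged sketch with the transformers library (bert-base-uncased is just a commonly used checkpoint, and the example sentences are made up):

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["He sat on the river bank", "He robbed the bank"]
batch = tok(sentences, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = bert(**batch).last_hidden_state          # shape: (2, tokens, 768)

# Unlike GloVe, the vector for "bank" now differs between the two sentences
bank_id = tok.convert_tokens_to_ids("bank")
bank_1 = hidden[0, batch.input_ids[0].tolist().index(bank_id)]
bank_2 = hidden[1, batch.input_ids[1].tolist().index(bank_id)]
print(torch.cosine_similarity(bank_1, bank_2, dim=0))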
transformers
A very nicely done library that allows anyone with some Python knowledge to play with pretrained state-of-the-art models (more in the docs).
A small example: English to Slovak translator in about 10 lines of Python code: *
from transformers import MarianTokenizer, MarianMTModel

src = 'en'  # source language
trg = 'sk'  # target language
sample_text = "When will this presentation end ?"
mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'

model = MarianMTModel.from_pretrained(mname)
tok = MarianTokenizer.from_pretrained(mname)

batch = tok.prepare_seq2seq_batch(src_texts=[sample_text])
gen = model.generate(**batch)
words = tok.batch_decode(gen, skip_special_tokens=True)
print(words)
Works with many other languages as well -- the full list is here
Attention was a fix for sequence models that did not really work too well
It turned out it was all that was needed for (bounded) sequence processing
Transformer is an encoder-decoder architecture that is "all the rage" now
It has no time-dependency due to self-attention and is therefore easy to parallelize
Well known models like BERT and GPT-* took the world of NLP by storm
transformers library
From text to word vectors
From word vectors to sequence processing
From sequence processing to attention
From attention to transformers
Via transformers back to (better, contextualized) word vectors
Most of the cutting-edge research in NLP happens in English
Most of the world does not speak English
Many problems "solved" in English are still open in other (smaller) languages like Slovak or Ukrainian
Data is almost always the bottleneck for NLP projects
Being able to do data augmentation can push a project from "no-go" to "doable"
An open area of research; some examples include backtranslation or MixUp for text
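As a hedged sketch, backtranslation can be put together from the same MarianMT models used earlier (the model names and example sentence here are illustrative): translate English to German and back to get a paraphrase as an extra training example.

from transformers import MarianMTModel, MarianTokenizer

def translate(texts, mname):
    tok = MarianTokenizer.from_pretrained(mname)
    model = MarianMTModel.from_pretrained(mname)
    batch = tok.prepare_seq2seq_batch(src_texts=texts)
    gen = model.generate(**batch)
    return tok.batch_decode(gen, skip_special_tokens=True)

original = ["The movie was surprisingly good."]
german = translate(original, "Helsinki-NLP/opus-mt-en-de")       # English -> German
augmented = translate(german, "Helsinki-NLP/opus-mt-de-en")      # German -> back to English
print(original, augmented)   # the round trip usually yields a slightly different paraphrase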