name: inverse
layout: true
class: center, middle, inverse

---

# Neural Network techniques for sequence processing (in NLP)

From word vectors to transformers and back

.footnote[Marek Šuppa
MLSS 2020, Žilina]

---

# https://is.gd/MLSS2020NLP

---
layout: false

# Short history of Neural Network approaches to sequence processing

- *2001*: Neural Language Models
- *2013*: Word Embeddings
- *2014*: Sequence-to-Sequence models
- *2015*: Attention
- *2016*: Neural Machine Translation boom
- *2017*: Transformers
- *2018*: Pretrained Contextualized Word Embeddings (ELMo)
- *2019+*: Massive Transformer Models (BERT, GPT-2, ...)
- *2020*: Current Frontiers

---
layout: false

# Short history of Neural Network approaches to sequence processing

- *2001*: Neural Language Models
- *2013*: Word Embeddings .red[*]
- *2014*: Sequence-to-Sequence models
- *2015*: Attention .red[*]
- *2016*: Neural Machine Translation boom
- *2017*: Transformers .red[*]
- *2018*: Pretrained Contextualized Word Embeddings (ELMo)
- *2019+*: Massive Transformer Models (BERT, GPT-2, ...) .red[*]
- *2020*: Current Frontiers

.footnote[.red[*] Today's (loose) agenda]

---
template: inverse

## Word Embeddings

---
layout: false

# Word Embeddings

"How do we convert free text into a format understandable by neural networks?"

> One-hot encoding.

--

"That results in extremely large and sparse vectors. Can we do better?"

> Well, there is this thing called _dense representations_.

--

"Cool! While we are at it, can we make the format semantically meaningful?"

> We don't even know how to train them quickly, but we can try...

---

# Word Embeddings: A real-world example

A GloVe word embedding for `king` (trained on Wikipedia):

```
[ 0.50451 ,  0.68607 , -0.59517 , -0.022801,  0.60046 , -0.13498 , -0.08813 ,
  0.47377 , -0.61798 , -0.31012 , -0.076666,  1.493   , -0.034189, -0.98173 ,
  0.68229 ,  0.81722 , -0.51874 , -0.31503 , -0.55809 ,  0.66421 ,  0.19611 ,
 -0.13495 , -0.11476 , -0.30344 ,  0.41177 , -2.223   , -1.0756  , -1.0783  ,
 -0.34354 ,  0.33505 ,  1.9927  , -0.04234 , -0.64319 ,  0.71125 ,  0.49159 ,
  0.16754 ,  0.34344 , -0.25663 , -0.8523  ,  0.1661  ,  0.40102 ,  1.1685  ,
 -1.0137  , -0.21585 , -0.15155 ,  0.78321 , -0.91241 , -1.6106  , -0.64426 ,
 -0.51042 ]
```

.footnote[.font-small[Example and subsequent images taken from the great [Illustrated Word2Vec](https://jalammar.github.io/illustrated-word2vec/)]]

---

# Word Embeddings: Visualized

By converting the numbers to colored bars we get a totally different picture:

![:scale 100%](images/king-colored-embedding.png)

--

Especially in context:

![:scale 100%](images/king-man-woman-embedding.png)

---

# Word Embeddings: Visualized II

.center[![:scale 100%](images/queen-woman-girl-embeddings.png)]

---

# Word Embeddings: Visualized III

It seems that some semantic meaning does indeed get encoded!.red[*]

.center[![:scale 80%](images/king-analogy-viz.png)]

--

.footnote[.font-small[.red[*] Note that `king-man+woman` does not exactly equal `queen` -- it's just that `queen` was the closest word in the space to the result of `king-man+woman`]]
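---

# Word Embeddings: Analogies in code

A minimal sketch of the `king - man + woman` arithmetic, assuming the pretrained
`glove-wiki-gigaword-50` vectors that can be fetched via `gensim`'s downloader:

```python
import gensim.downloader as api

# Download (on the first run) and load 50-dimensional GloVe vectors
glove = api.load("glove-wiki-gigaword-50")

# most_similar adds the "positive" vectors, subtracts the "negative" ones
# and returns the words closest (by cosine similarity) to the result
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# `queen` should come out on top -- as the nearest neighbor, not an exact match
```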
---

# Word Embeddings: Training

> So how would one come up with these?

--

Let a neural net learn the proper values on its own!

--

> But how?

--

Give it some context and let it predict the center word.

--

.center[![:scale 80%](images/nlp-ssl-center-word-prediction.gif)]

.footnote[.font-small[Image from https://amitness.com/2020/06/fasttext-embeddings/]]

---

# Word Embeddings: CBOW Training

.center[![:scale 80%](images/word2vec-cbow.png)]

.footnote[.quote_author[.font-small[Image from https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html]]]

---

# Word Embeddings: Limitations

**Out Of Vocabulary Words**

.center[![:scale 50%](images/word2vec-oov-tensorflow.png)]

**Morphology**

.center[![:scale 50%](images/word2vec-radicals.png)]

---

# Zipf's Law: the fundamental limitation

The general take-home point seems to be:

> You never see most things.

![:scale 100%](images/brown-linear-1.png)

.footnote[.font-small[Image from https://shadycharacters.co.uk/2015/10/zipfs-law/]]

---

# Word Embeddings: FastText

The idea is relatively simple: split words into their character n-grams

![:scale 70%](images/fasttext-3-grams-list.png)

--

Sum the n-grams together to create the embedding.

![:scale 70%](images/fasttext-center-word-embedding.png)

---

# Word Embeddings: FastText Classification

This concept can be very easily extended to classification, which you can try out in Google Colab:

## [Text Classification with FastText in Google Colab](https://colab.research.google.com/github/NaiveNeuron/nlp-exercises/blob/master/tutorial2-fasttext/Text_Classification_fastText.ipynb)

.center[![:scale 40%](images/3-Figure1-1.png)]

???
Image from https://www.semanticscholar.org/paper/Analysis-and-Optimization-of-fastText-Linear-Text-Zolotov-Kung/9d6993f60539d30ee325138b3465aa020fa3bcb4/figure/0

---

# Word Embeddings: Recap

- Alternative to one-hot encoding

--

- Created as a byproduct of training a neural network

--

- Seems to encode some semantic meaning and allows for quick comparison

--

- Suffers from the implications of Zipf's law

--

- FastText tries to alleviate this and can be directly used for classification as well
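---

# Word Embeddings: FastText n-grams in code

A tiny sketch of the n-gram splitting from the previous slides (plain Python, just for
illustration -- the real FastText additionally hashes the n-grams into a fixed-size table):

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word, with < and > marking the word boundaries."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("eating"))
# ['<ea', 'eat', 'ati', 'tin', 'ing', 'ng>']
```

The vector of a word (even an out-of-vocabulary one) is then obtained by summing the
vectors of its n-grams.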
---
template: inverse

## Recurrent Neural Networks

---
layout: false
class: center

# Types of Neural Networks

![:scale 100%](images/rnns.png)

.footnote[.font-small[Image from https://karpathy.github.io/2015/05/21/rnn-effectiveness/]]

---
layout: false
class: center

# Unfolded RNN

![:scale 100%](images/unfolded-rnn.png)

---
layout: false
class: center

# Unfolded RNN

![:scale 100%](images/unfolded-rnn-2.png)

---
layout: false
class: center

# Training unfolded RNN

![:scale 90%](images/rnn-bptt.png)

This concept is called Backpropagation Through Time (**BPTT**)

---
layout: false
class: center

# Training unfolded RNN

![:scale 95%](images/rnn-bptt-2.png)

Note how various parts of the unfolded RNN impact $h_2$

---
layout: false
class: center

# Problems with long-term dependencies

![:scale 95%](images/long-term-dep.png)

---
layout: false
class: center

## LSTM: what to forget and what to remember

![:scale 100%](images/lstm-intro.png)

---
layout: false

## LSTM: Conveyor belt

![:scale 100%](images/LSTM3-C-line.png)

--

- The cell state is sort of a "conveyor belt"

--

- Allows information to stay unchanged or get slightly updated

.footnote[.font-small[All the following nice images are from http://colah.github.io/posts/2015-08-Understanding-LSTMs/ which I highly recommend]]

---
layout: false
class: center, middle

## LSTM: Conveyor belt

![:scale 100%](images/LSTM3-focus-f.png)

**Step 1**: Decide what to forget

---
layout: false
class: center, middle

## LSTM: Conveyor belt

![:scale 100%](images/LSTM3-focus-i.png)

**Step 2**: Decide

- which values to update ($i_t$)
- what the new values should be ($\tilde{C}_t$)

---
layout: false
class: center, middle

## LSTM: Conveyor belt

![:scale 100%](images/LSTM3-focus-C.png)

**Step 2.5**: Perform the forgetting and the update

---
layout: false
class: center, middle

## LSTM: Conveyor belt

![:scale 100%](images/LSTM3-focus-o.png)

**Step 3**: Produce the output ($h_t$)

---
layout: false
class: center, middle

## LSTM: Conveyor belt

![:scale 100%](images/LSTM3-chain.png)

A conveyor belt that can pick

- what to remember
- what to forget
- what to output

---
layout: false
class: center, middle

## GRU: Simplified conveyor belt

![:scale 100%](images/LSTM3-var-GRU.png)

--

- Forget and input combined into a single "update gate" ($z_t$)

--

- Cell state ($C_t$ in LSTM) merged with the hidden state ($h_t$)

---
layout: false

## GRU vs LSTM

- GRU is smaller and hence requires less compute
- But it turns out it cannot count (especially over longer sequences)

--

.center[![:scale 80%](images/lstm-gru-counting.png)]

[On the Practical Computational Power of Finite Precision RNNs for Language Recognition (2018)](https://arxiv.org/abs/1805.04908)
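---
layout: false

## LSTM: One step in code

A minimal numpy sketch of a single LSTM step, with the four gates stacked into one
matrix multiplication (a common implementation trick) -- just an illustration of the
"conveyor belt" steps above, not production code:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, U, b):
    z = W @ x_t + U @ h_prev + b                    # pre-activations of all four gates
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)    # forget, input and output gates
    C_tilde = np.tanh(g)                            # candidate cell values
    C_t = f * C_prev + i * C_tilde                  # Steps 1 + 2.5: forget, then update the belt
    h_t = o * np.tanh(C_t)                          # Step 3: produce the output
    return h_t, C_t

H, D = 4, 3                                         # toy hidden and input sizes
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(4 * H, D)), rng.normal(size=(4 * H, H)), np.zeros(4 * H)
h, C = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):                 # a toy input sequence of length 5
    h, C = lstm_step(x_t, h, C, W, U, b)
```

In a real network $W$, $U$ and $b$ are of course learned with BPTT rather than sampled at random.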
---

## Application: Handwriting from Text

**Input:** "He dismissed the idea"

--

**Output:**

.center[![:scale 50%](images/handwriting.png)]

--

[Generating Sequences With Recurrent Neural Networks, Alex Graves, 2013](https://arxiv.org/abs/1308.0850)

Demo at https://www.cs.toronto.edu/~graves/handwriting.html

---

## Application: Character-Level Text Generation

.center[![:scale 100%](images/char-rnn.png)]

.footnote[.font-small[["The Unreasonable Effectiveness of Recurrent Neural Networks"](https://karpathy.github.io/2015/05/21/rnn-effectiveness/), Andrej Karpathy, 2015]]

---

## Application: Image Question Answering

.center[![:scale 100%](images/vqa.png)]

--

.left-eq-column[
.center[![:scale 100%](images/vqa-arch.png)]
.font-small[Exploring Models and Data for Image Question Answering, 2015]
]
.right-eq-column[
Live Demo at https://vqa.cloudcv.org/
]

---

## Application: Image Caption Generation

.center[![:scale 100%](images/captioning.png)]

---

## Application: Video Caption Generation

.center[![:scale 100%](images/video-caption-generation.png)]

--

.left-eq-column[
.center[![:scale 100%](images/S2VTarchitecture.png)]
.font-small[Sequence to Sequence - Video to Text, Venugopalan et al., 2015]
]
.right-eq-column[
More at https://vsubhashini.github.io/s2vt.html
]

---

## Application: Adding Audio to Silent Film

.center[![:scale 100%](images/silent-audio.png)]

--

.left-eq-column[
.center[![:scale 60%](images/pipeline.jpg)]
.font-small[Visually Indicated Sounds, Owens et al., 2015]
]
.right-eq-column[
More at http://andrewowens.com/vis/
]

---

## Application: Medical Diagnosis

.center[![:scale 100%](images/medical-diagnosis.png)]

---

## Application: End-to-End Driving .red[*]
.left-eq-column[![:scale 100%](images/rnn-steering.gif)]
.right-eq-column[![:scale 100%](images/LSTM3-chain.png)]

--

**Input**: features extracted from a CNN

**Output**: predicted steering angle

.footnote[.red[*] On relatively straight roads]
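---

## Application: End-to-End Driving (sketch)

A hypothetical PyTorch sketch of the idea on the previous slide: per-frame CNN features
go into an LSTM, which predicts one steering angle per frame (names and sizes are made up):

```python
import torch
import torch.nn as nn

# Hypothetical model, for illustration only
class SteeringRNN(nn.Module):
    def __init__(self, feature_dim=512, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)      # one steering angle per time step

    def forward(self, features):                  # features: (batch, time, feature_dim)
        out, _ = self.lstm(features)
        return self.head(out)

frames = torch.randn(2, 16, 512)                  # 2 clips, 16 frames of CNN features each
print(SteeringRNN()(frames).shape)                # torch.Size([2, 16, 1])
```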
---

## Application: Sentiment Analysis

.center[![:scale 80%](images/sentiment.analysis.png)]

Try it yourself at https://demo.allennlp.org/sentiment-analysis/

---

## Application: Named Entity Recognition (NER)

.center[![:scale 100%](images/NER.png)]

Try it yourself at https://demo.allennlp.org/named-entity-recognition/

---

## Application: Trump2Cash

.center[![:scale 60%](images/trump2cash.png)]

---

## Application: Trump2Cash

- A combination of Sentiment Analysis and Named Entity Recognition

--

How it works:

1. Monitor the tweets of Donald Trump
2. Use NER to see if some of them mention a publicly traded company
3. Apply sentiment analysis and use its result to decide whether to buy or sell
4. Profit?
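---

## Application: Trump2Cash (sketch)

A toy sketch of steps 2 and 3 above using off-the-shelf `transformers` pipelines -- not the
actual Trump2Cash code, and step 1 (fetching the tweets) is left out entirely:

```python
from transformers import pipeline

ner = pipeline("ner")                        # pretrained named entity recognition
clf = pipeline("sentiment-analysis")         # pretrained sentiment classifier

def trade_signal(tweet):
    # Step 2: does the tweet mention an organization at all? (toy, not the real strategy)
    orgs = {e["word"] for e in ner(tweet) if "ORG" in e["entity"]}
    if not orgs:
        return None
    # Step 3: let the sentiment decide between buying and selling
    mood = clf(tweet)[0]
    return ("buy" if mood["label"] == "POSITIVE" else "sell", orgs)

print(trade_signal("Boeing costs are out of control, more than $4 billion. Cancel order!"))
```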
---

## Application: Trump2Cash

.center[![:scale 50%](images/simulated-twitter-fund.png)]

See predictions live at https://twitter.com/Trump2Cash

---
class: middle

## Application: Machine Translation

.center[![:scale 80%](images/machine-translation.png)]

---
class: middle

## Application: Machine Translation

.center[![:scale 80%](images/cover_ELMo_web.jpg)]

---

## Recurrent Neural Networks

- A "neural" way of handling sequences

--

- Their training usually happens by "unfolding" the network in time (Backpropagation Through Time -- **BPTT**)

--

- In theory they can handle sequences of any length. In practice this is difficult due to exploding and vanishing gradients.

--

- One solution is to use a different network architecture, such as LSTM or GRU

--

- Both of these are used with great success in many practical applications, especially in the sequence-to-sequence setting

---
class: center, middle, inverse

## Attention and Transformers

---

## History of Deep Learning Milestones

![:scale 70%](images/timeline.png)

.footnote[From [Deep Learning State of the Art (2020)](https://www.youtube.com/watch?v=0VH1Lim8gL8) by Lex Fridman at MIT]

---
class: middle

## The perils of seq2seq modeling

.center[*(video omitted: a seq2seq model squeezing the whole source sentence into a single vector -- see the post linked in the footnote)*]
--

Aren't we throwing out a bit too much?

.footnote[.font-small[Videos from https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/]]

---
class: middle

## The fix

Let's use the full encoder output!
.center[*(video omitted: the encoder passes all of its hidden states to the decoder)*]
--

But how do we combine all the hidden states together?

---
class: middle

## The mechanics of Attention
.center[*(video omitted: the attention mechanism scoring and combining the encoder hidden states)*]
---
class: middle

## Getting alignment with attention
.center[*(video omitted: attention producing a soft alignment between source and target words during translation)*]
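---

## Attention in code

A minimal numpy sketch of the mechanics from the previous slides: score every encoder
hidden state against the current decoder state, softmax the scores, and take the weighted
sum (real models typically add learned projections on top of this):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    scores = encoder_states @ decoder_state     # one (dot-product) score per source position
    weights = softmax(scores)                   # the attention weights sum to 1
    context = weights @ encoder_states          # weighted sum = context vector
    return context, weights

H = 4
rng = np.random.default_rng(0)
enc = rng.normal(size=(6, H))                   # hidden states of a 6-token source sentence
dec = rng.normal(size=H)                        # current decoder hidden state
context, weights = attend(dec, enc)
print(weights.round(2))                         # which source positions the decoder "looks at"
```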
---

## Attention visualized

.center[![:scale 60%](images/attention_sentence.png)]

See the nice demo at https://distill.pub/2016/augmented-rnns/

---
class: middle

# What if we only used attention?

---
class: middle

.center[![:scale 100%](images/attention_is_all_you_need.png)]

---
class: middle

## The Transformer architecture

.center[![:scale 90%](images/The_transformer_encoder_decoder_stack.png)]

.footnote[.font-small[Images from https://jalammar.github.io/illustrated-transformer/]]

---
class: middle

## The Transformer's Encoder

.center[![:scale 100%](images/encoder_with_tensors_2.png)]

---

## What's Self-Attention?

.center[
*The animal didn't cross the street because it was too tired.*
]

What does "it" refer to?

--

.center[![:scale 50%](images/transformer_self-attention_visualization.png)]

---

## Self-Attention mechanics

.center[![:scale 70%](images/self-attention-output.png)]
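---

## Self-Attention mechanics: in code

A single-head, scaled dot-product self-attention sketch in numpy (the projection
matrices are random here; in a Transformer they are learned, and there are many heads):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # queries, keys and values
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # every token attends to every token
    return softmax(scores) @ V                      # one re-mixed vector per token

T, d = 5, 8                                         # 5 tokens with embedding size 8
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)          # (5, 8)
```

Since every position is processed in the same way, there is no recurrence to wait for --
which is exactly what makes Transformers so easy to parallelize.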
---

## The full Transformer seq2seq process

.center[![:scale 100%](images/transformer_decoding_2.gif)]

---

## Big Transformer Wins: GPT-2

.center[![:scale 100%](images/gpt2-sizes.png)]

Try it yourself at https://transformer.huggingface.co/doc/gpt2-large

---

## Big Transformer Wins: BERT

.center[![:scale 100%](images/bert.png)]

---

## Big Transformer Wins: BERT

.center[![:scale 100%](images/nlp-ssl-masked-lm.png)]

---

### BERT Applications: any classification task

- Thanks to the contextualized word vectors that BERT provides, the performance on many tasks has increased significantly

--

- (This is what it means to go from word vectors to transformers and back)

--

- For a real-life example of what it means to work with it, I recommend the [PyTorch Sentiment Analysis tutorial](https://github.com/bentrevett/pytorch-sentiment-analysis)

---

## BERT Applications: ViLBERT

- A single model that can perform various vision and language tasks

.center[[vilbert.cloudcv.org](https://vilbert.cloudcv.org/)]

.center[![:scale 100%](images/vilbert_architecture.png)]

---

## BERT Applications: Better Ctrl+F

- Repurposing a question answering model to ask questions about the webpage the user is currently on
- Check it out at [github.com/model-zoo/shift-ctrl-f](https://github.com/model-zoo/shift-ctrl-f)

.center[![:scale 90%](images/googledemo.gif)]

---

## Big Transformer Wins: Huggingface `transformers`

- A very nicely done library that allows anyone with some Python knowledge to play with pretrained state-of-the-art models (more in the [docs](https://huggingface.co/transformers/)).

--

- A small example: an English-to-Slovak translator in about 10 lines of Python code: .red[*]

```python
from transformers import MarianTokenizer, MarianMTModel

src = 'en'  # source language
trg = 'sk'  # target language
sample_text = "When will this presentation end ?"
mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'

model = MarianMTModel.from_pretrained(mname)
tok = MarianTokenizer.from_pretrained(mname)
batch = tok.prepare_seq2seq_batch(src_texts=[sample_text])
gen = model.generate(**batch)
words = tok.batch_decode(gen, skip_special_tokens=True)
print(words)
```

.footnote[.font-small[Works with many other languages as well -- the full list is [here](https://huggingface.co/Helsinki-NLP)]]

---

## Attention and Transformers: Recap

--

- Attention was a fix for sequence models that did not really work too well

--

- It turned out it was all that was needed for (bounded) sequence processing

--

- The Transformer is an encoder-decoder architecture that is "all the rage" now

--

- It has no time dependency due to self-attention and is therefore easy to parallelize

--

- Well-known models like BERT and GPT-* took the world of NLP by storm

--

- Very helpful in many tasks, and easy to play with thanks to the Huggingface `transformers` library

---

## Our path today

- From text to word vectors

--

- From word vectors to sequence processing

--

- From sequence processing to attention

--

- From attention to transformers

--

- Via transformers back to (better, contextualized) word vectors

---
class: center, middle, inverse

## Current Frontiers

---

## Smaller models

- BERT (Base) is about 430MB when serialized on disk. Even for Google it is at best [impractical to put into production](https://medium.com/@neal_lathia/when-is-a-neural-net-too-big-for-production-4315452193ef)
- The trend is to make models smaller while keeping their performance roughly the same:
  - DistilBERT
  - ALBERT
  - TinyBERT
  - MobileBERT
- Various avenues of research:
  - Knowledge Distillation
  - Quantization
  - ...

---

## Non-English NLP

- Most of the cutting-edge research in NLP happens in English

--

- Most of the world [does not speak English](https://ruder.io/nlp-beyond-english/)

--

- Many problems "solved" in English are still open in other (smaller) languages like Slovak or Ukrainian

.center[![:scale 80%](images/language_data_distribution.png)]

---

## Doing more with less data

- Data is almost always the bottleneck for NLP projects

--

- Being able to do data augmentation can push a project from "no-go" to "doable"

--

- An open area of research; some examples include [**backtranslation**](https://amitness.com/back-translation/) or **MixUp for text**

.center[![:scale 80%](images/back-translation-marianmt.png)]

---
class: center, middle, inverse

## marek@mareksuppa.com