name: inverse layout: true class: center, middle, inverse --- # Introduction to Supervised Learning From kNN to CNN in (less than) 123 slides .footnote[Marek Šuppa
ESS 2020, Bratislava] --- layout: false # Ball snake vs Carpet Python .center[ ![:scale 100%](images/ball_snake_vs_python.png) ] ??? - The ball python is a ground snake -- it spends a lot of time hiding - The carpet python will spend more time out and on display --- class: center ## Supervised learning in the context of ML/AI ![:scale 100%](images/AI_ML.jpg) .footnote[.font-small[Image from https://read.hyperight.com/a-beginners-guide-to-machine-learning-for-hr-practitioners/]] --- class: center ## Supervised vs. Unsupervised learning ![:scale 100%](images/supervised_machine_learning.png) .footnote[.font-small[Image from https://julienbeaulieu.gitbook.io/wiki/sciences/machine-learning/machine-learning-overview]] --- layout: false class: center, middle .left-column[ ## Supervised Learning ### - Definition ] .right-column[ .larger[ $$ (x_i, y_i) \sim p(x, y) \text{ i.i.d.}$$ $$ x_i \in \mathbb{R}^n, y_i \in \mathbb{R}$$ $$f(x_i) \approx y_i$$ $$f(x) \approx y$$ ] ] --- layout: false class: center, middle .left-column[ ## Supervised Learning ### - Definition ### - kNN Example ] .right-column[ ![:scale 80%](images/knn_boundary_test_points.png) $$f(x) = y_i, i = \text{argmin}_j || x_j - x||$$ ] ??? - Two class (red/blue) problem (ball snake vs carpet python) - Two features x_1 and x_2 (snake weight and length) on the x and y axes - We are trying to predict the values of the points with stars - What will the k=1 closest neighbor prediction look like? --- layout: false class: center, middle .left-column[ ## Supervised Learning ### - Definition ### - kNN Example ] .right-column[ ![:scale 80%](images/knn_boundary_k1.png) $$f(x) = y_i, i = \text{argmin}_j || x_j - x||$$ ] ??? - So in what way was this problem "supervised"? - How do we find out whether this model is any good? --- layout: false class: center, middle .left-column[ ## Supervised Learning ### - Definition ### - kNN Example ] .right-column[ ![:scale 90%](images/train_test_set_2d_classification.png) ] ??? - Well, we put together a dataset and split it into train/test in, say, a 75:25 ratio - Train on the 75, test on the remaining 25 - Applying the model to the test set gives us an unbiased estimate ("how would this work in real life on data it did not see") --- # kNN in scikit-learn ```python from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y) from sklearn.neighbors import KNeighborsClassifier knn = KNeighborsClassifier(n_neighbors=1) knn.fit(X_train, y_train) print("accuracy: {:.2f}".format(knn.score(X_test, y_test))) y_pred = knn.predict(X_test) ``` accuracy: 0.77 ??? - train_test_split does 75:25 by default - note the 1 in n_neighbors - the `fit` function trains the model - `predict` runs it on input data - the `score` function runs `predict` internally and compares the predictions against the ground truth it received in `y_test` - for classification `score` computes accuracy by default --- layout: false class: center, middle .left-column[ ## Supervised Learning ### - Definition ### - kNN Example ### - Influence of k ] .right-column[ ![:scale 90%](images/knn_boundary_k1.png) ] ??? --- layout: false class: center, middle .left-column[ ## Supervised Learning ### - Definition ### - kNN Example ### - Influence of k ] .right-column[ ![:scale 90%](images/knn_boundary_k3.png) ] ??? - The situation did change, as we can see - Clearly, the choice of k matters - How do we know which k is better?
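- One way to see it, sketched below (reusing the `X_train`/`X_test` split from the scikit-learn slide; this is roughly how the model-complexity plot a few slides ahead can be produced):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

neighbors = np.arange(1, 30, 2)
train_scores, test_scores = [], []
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_scores.append(knn.score(X_train, y_train))   # accuracy on the data the model has seen
    test_scores.append(knn.score(X_test, y_test))      # accuracy on held-out data
```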
- This is called hyperparameter tuning, and a ton of time is spent on it in practice --- layout: false class: center, middle .left-column[ ## Supervised Learning ### - Definition ### - kNN Example ### - Influence of k ] .right-column[ #### `n_neighbors`'s influence visualized ![:scale 90%](images/knn_boundary_varying_k.png) ] ??? - Perfect classification for one neighbor, but we are fitting a lot of noise - We can compare the settings (different `k`) not just visually but also by evaluating them --- layout: false class: center, middle .left-column[ ## Supervised Learning ### - Definition ### - kNN Example ### - Influence of k ### - Evaluation ] .right-column[ ### Model complexity vs. accuracy ![:scale 90%](images/knn_model_complexity.png) ] ??? --- layout: false class: center, middle .left-column[ ## Supervised Learning ### - Definition ### - kNN Example ### - Influence of k ### - Evaluation ] .right-column[ ### Overfitting and Underfitting ![:scale 95%](images/overfitting_underfitting_cartoon_train.png) ] ??? --- layout: false class: center, middle .left-column[ ## Supervised Learning ### - Definition ### - kNN Example ### - Influence of k ### - Evaluation ] .right-column[ ### Overfitting and Underfitting ![:scale 95%](images/overfitting_underfitting_cartoon_generalization.png) ] ??? --- layout: false class: center, middle .left-column[ ## Supervised Learning ### - Definition ### - kNN Example ### - Influence of k ### - Evaluation ] .right-column[ ### Overfitting and Underfitting ![:scale 95%](images/overfitting_underfitting_cartoon_full.png) ] ??? --- class: middle # Supervised Learning: Recap - Supervised learning requires "supervision" -- some way of knowing what the truth is ($y$) - One of the simplest supervised learning models is kNN - ML models have hyperparameters that need to be tuned - By looking at the relationship between model complexity and performance we can notice when a model overfits or underfits - Our aim is to find the hyperparameters at which the model no longer underfits but does not yet overfit --- layout: false class: center, middle .left-column[ ## Model Selection ] .right-column[ ### So far: Train-test-split ![:scale 100%](images/train_test_split_new.png) ] ??? --- layout: false class: center, middle .left-column[ ## Model Selection ### - 3-fold split ] .right-column[ ### Threefold split ![:scale 100%](images/train_test_validation_split.png) ] ??? --- layout: false class: middle # Threefold split example in scikit-learn .smaller[ ```python X_trainval, X_test, y_trainval, y_test = train_test_split(X, y) X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval) val_scores = [] neighbors = np.arange(1, 15, 2) for i in neighbors: knn = KNeighborsClassifier(n_neighbors=i) knn.fit(X_train, y_train) val_scores.append(knn.score(X_val, y_val)) print(f"best validation score: {np.max(val_scores):.3}") best_n_neighbors = neighbors[np.argmax(val_scores)] print("best n_neighbors:", best_n_neighbors) knn = KNeighborsClassifier(n_neighbors=best_n_neighbors) knn.fit(X_trainval, y_trainval) print(f"test-set score: {knn.score(X_test, y_test):.3f}") ``` ``` best validation score: 0.991 best n_neighbors: 11 test-set score: 0.951 ``` ] --- layout: false class: center, middle .left-column[ ## Model Selection ### - 3-fold split ### - Cross-validation ] .right-column[ .center[ ### Cross-validation ![:scale 85%](images/cross_validation_new.png) ] ] ???
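---
layout: false
class: middle

# Cross-validation example in scikit-learn

A minimal sketch of the idea from the previous figure, assuming the same `X_train` and `y_train` as before. For classifiers, passing an integer `cv` uses stratified folds by default.

.smaller[
```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X_train, y_train, cv=5)   # one accuracy score per fold
print(scores)
print(f"mean cross-validated accuracy: {scores.mean():.3f}")
```
]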
--- layout: false class: center, middle .left-column[ ## Model Selection ### - 3-fold split ### - Cross-validation ] .right-column[ .center[ ### Cross-validation + test set ![:scale 105%](images/grid_search_cross_validation_new.png) ] ] ??? --- layout: false class: middle ### Grid-Search with Cross-Validation .smaller[ ```python from sklearn.model_selection import cross_val_score X_train, X_test, y_train, y_test = train_test_split(X, y) cross_val_scores = [] for i in neighbors: knn = KNeighborsClassifier(n_neighbors=i) scores = cross_val_score(knn, X_train, y_train, cv=10) cross_val_scores.append(np.mean(scores)) print(f"best cross-validation score: {np.max(cross_val_scores):.3}") best_n_neighbors = neighbors[np.argmax(cross_val_scores)] print(f"best n_neighbors: {best_n_neighbors}") knn = KNeighborsClassifier(n_neighbors=best_n_neighbors) knn.fit(X_train, y_train) print(f"test-set score: {knn.score(X_test, y_test):.3f}") ``` ``` best cross-validation score: 0.967 best n_neighbors: 9 test-set score: 0.965 ``` ] --- layout: false class: center, middle .left-column[ ## Model Selection ### - 3-fold split ### - Cross-validation ### - Grid search ] .right-column[ ### Parameter tuning via Grid Search ![:scale 80%](images/gridsearch_workflow.png) ] --- ### GridSearchCV in scikit-learn .smaller[ ```python from sklearn.model_selection import GridSearchCV X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y) param_grid = {'n_neighbors': np.arange(1, 30, 2)} grid = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=10, return_train_score=True) grid.fit(X_train, y_train) print(f"best mean cross-validation score: {grid.best_score_}") print(f"best parameters: {grid.best_params_}") print(f"test-set score: {grid.score(X_test, y_test):.3f}") ``` ``` best mean cross-validation score: 0.967 best parameters: {'n_neighbors': 9} test-set score: 0.993 ``` ] --- layout: false class: center, middle .left-column[ ## Model Selection ### - 3-fold split ### - Cross-validation ### - Grid search ### - CV strategies ] .right-column[ .center[ ![:scale 100%](images/kfold_cv.png) ] ] --- layout: false class: center, middle .left-column[ ## Model Selection ### - 3-fold split ### - Cross-validation ### - Grid search ### - CV strategies ] .right-column[ .center[ ![:scale 100%](images/stratified_cv.png) ] ] --- class: middle # Model Selection: Recap - The `test` part of your dataset should be sacred -- make sure you only use it when you are willing to "bet" on a model. - Wherever possible, use cross-validation to get an unbiased estimate of your model's performance - In the absolute worst case, have at least a threefold split (train/valid/test) - Cross Validation and Grid Search are a nice combo for parameter tuning - When working with an imbalanced dataset, consider using Stratified Cross Validation --- class: center .left-column[ ## Linear Models ### - Linear Regression ] .right-column[ ### Linear Regression ![:scale 90%](images/linear_regression_1d.png) $$\hat{y} = w^T \mathbf{x} + b = \sum_{i=1}^p w_i x_i +b$$ ] ??? Predictions in all linear models for regression are of the form shown here: it's an inner product of the features with some coefficient or weight vector w, and some bias or intercept b. In other words, the output is a weighted sum of the inputs, possibly with a shift. Here i runs over the features and x_i is one feature of the data point x. These models are called linear models because they are linear in the parameters w.
The way I wrote it down here, they are also linear in the features x_i. However, you can replace the features by any non-linear function of the inputs, and it'll still be a linear model. There are many different linear models for regression, and they all share this formula for making predictions. The difference between them is in how they find w and b based on the training data. --- .left-column[ ## Linear Models ### - Linear Regression ] .right-column[ ### Ordinary Least Squares $$\hat{y} = w^T \mathbf{x} + b = \sum_{i=1}^p w_i x_i +b $$ `$$\min_{w \in \mathbb{R}^p, b\in\mathbb{R}} \sum_{i=1}^n (w^T\mathbf{x}_i + b - y_i)^2$$` ] ??? - Unique solution if $\mathbf{X} = (\mathbf{x}_1, ... \mathbf{x}_n)^T$ has full column rank. The most straightforward solution, which goes back to Gauss, is ordinary least squares. In ordinary least squares, we find w and b such that the predictions on the training set are as accurate as possible according to the squared error. That intuitively makes sense: we want the predictions to be good on the training set. If there are more samples than features (and the samples span the whole feature space), then there is a unique solution. The problem is what's called a least squares problem, which is particularly easy to optimize and get the unique solution to. However, if there are more features than samples, there are usually many perfect solutions that lead to 0 error on the training set. Then it's not clear which solution to pick. Even if there are more samples than features, if there are strong correlations among features the results might be unstable, and we'll see some examples of that soon. Before we look at examples, I want to introduce a popular alternative. --- .left-column[ ## Linear Models ### - Linear Regression ### - Ridge Regression ] .right-column[ ### Ridge Regression `$$ \min_{w \in \mathbb{R}^p, b\in\mathbb{R}} \sum_{i=1}^n (w^T\mathbf{x}_i + b - y_i)^2 + \alpha ||w||^2 $$` ] ??? - Always has a unique solution. - Tuning parameter alpha. In Ridge regression we add another term to the optimization problem. Not only do we want to fit the training data well, we also want w to have a small squared L2 norm, or squared Euclidean norm. The idea here is that we're decreasing the "slope" along each of the features by pushing the coefficients towards zero. This constrains the model to be simpler. So there are two terms in this optimization problem, which is also called the objective function of the model: the data-fitting term that wants to be close to the training data according to the squared error, and the penalty or regularization term that wants w to have a small norm and that doesn't depend on the data. Usually these two goals are somewhat opposing. If we made w zero, the second term would be zero, but the predictions would be bad. So we need to trade off between these two. The trade-off is problem-specific and is specified by the user. If we set alpha to zero, we get linear regression; if we set alpha to infinity, we get a constant model. Obviously we usually want something in between. This is a very typical example of a general principle in machine learning, called regularized empirical risk minimization.
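---
.left-column[ ## Linear Models ### - Linear Regression ### - Ridge Regression ]
.right-column[

### The effect of `alpha`, sketched

A hedged sketch on hypothetical synthetic data (not the dataset used elsewhere in these slides), just to show the trade-off: as `alpha` grows, the norm of `w` shrinks.

.smaller[
```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# synthetic regression problem, only for illustration
X_syn, y_syn = make_regression(n_samples=100, n_features=30, noise=10, random_state=0)

for alpha in [0.01, 1, 100]:
    ridge = Ridge(alpha=alpha).fit(X_syn, y_syn)
    print(f"alpha={alpha:>6}: ||w|| = {np.linalg.norm(ridge.coef_):.1f}")
```
]
]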
--- ### Linear Regression vs Ridge ```python log_regressor = TransformedTargetRegressor( LinearRegression(), func=np.log, inverse_func=np.exp) cross_val_score(make_pipeline(preprocess, log_regressor), X_train, y_train, cv=5) ``` ``` array([0.95 , 0.943, 0.941, 0.913, 0.922]) ``` ```python log_ridge = TransformedTargetRegressor( Ridge(), func=np.log, inverse_func=np.exp) cross_val_score(make_pipeline(preprocess, log_ridge), X_train, y_train, cv=5) ``` ``` array([0.948, 0.95 , 0.941, 0.915, 0.931]) ``` ??? Let’s look at two simple models. Linear regression and Ridge regression. What I've done is I’ve split the data into training and test set and used 10 fold cross-validation to evaluate them. Here I use cross_val_score together with the model, the training data, training labels, and 10 fold cross-validation. This will return 10 scores and I'm going to compute the mean of them. I'm doing this for both linear regression and Ridge regression. Here is ridge regression uses a default value of alpha of 1. Here these two scores are quite similar. --- .left-column[ ## Linear Models ### - Linear Regression ### - Ridge Regression ### - Lasso Regression ] .right-column[ ### Lasso Regression `$$ \min_{w \in \mathbb{R}^p, b\in\mathbb{R}} \sum_{i=1}^n (w^T\mathbf{x}_i + b - y_i)^2 + \alpha ||w||_1 $$` ] ??? - Shrinks w towards zero like Ridge - Sets some w exactly to zero - automatic feature selection! Lasso Regression looks very similar to Ridge Regression. The only thing that is changed is we use the L1 norm instead of the L2 norm. L2 norm is the sum of squares, the L1 norm is the sum of the absolute values. So again, we are shrinking w towards 0, but we're shrinking it in a different way. The L2 norm penalizes very large coefficients more, the L1 norm penalizes all coefficients equally. What this does in practice is its sets some entries of W to exactly 0. It does automatic feature selection if the coefficient of zero means it doesn't influence the prediction and so you can just drop it out of the model. This model does features selection together with prediction. Ideally what you would want is, let's say you want a model that does features selections. The goal is to make our model automatically select the features that are good. What you would want to penalize the number of features that it uses, that would be L0 norm. --- .left-column[ ## Linear Models ### - Linear Regression ### - Ridge Regression ### - Lasso Regression ### - Logistic Regression ] .right-column[ .center[ ![:scale 90%](images/linear_boundary_vector.png) ] $$\hat{y} = \text{sign}(w^T \textbf{x} + b) = \text{sign}\left(\sum\limits_{i}w_ix_i + b\right)$$ ] ??? --- .left-column[ ## Linear Models ### - Linear Regression ### - Ridge Regression ### - Lasso Regression ### - Logistic Regression ] .right-column[ ### What would be an appropriate loss? $$\hat{y} = \text{sign}(w^T \textbf{x} + b)$$ `$$\min_{w \in \mathbb{R}^{p}, b \in \mathbb{R}} \sum_{i=1}^n 1_{y_i \neq \text{sign}(w^T \textbf{x} + b)}$$` .center[ ![:scale 90%](images/binary_loss.png) ] ] ??? So we need to define a loss function for given w and b that tell us how well they fit the training set. Obvious Idea: Minimize number of misclassifications aka 0-1 loss but this loss is non-convex, not continuous and minimizing it is NP-hard. So we need to relax it, which basically means we want to find a convex upper bound for this loss. This is not done on the actual prediction, but on the inner product $w^T x$, which is also called the decision function. 
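For reference, a small sketch of the three curves in the figure as functions of the margin (my assumption here is that the log loss is scaled by log 2 so that it stays an upper bound on the 0-1 loss, which is how such plots are usually drawn):

```python
import numpy as np

m = np.linspace(-3, 3, 200)               # margin: y * (w^T x + b)
zero_one = (m <= 0).astype(float)          # 0-1 loss
hinge = np.maximum(0, 1 - m)               # hinge loss
log_loss = np.log2(1 + np.exp(-m))         # logistic (log) loss, scaled by log 2
```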
So this graph here has the inner product on the x axis, and shows what the loss would be for class 1. The 0-1 loss is zero if the decision function is positive, and one if it's negative. Because a positive decision function means a positive predition, means correct classification in the case of y=1. A negative prediction means a wrong classification, which is penalized by the 0-1 loss with a loss of 1, i.e. one mistake. The other losses we'll talk about are mostly the hinge loss and the log loss. You can see they are both upper bounds on the 0-1 loss but they are convex and continuous. Both of these losses care not only that you make a correct prediction, but also "how correct" your prediction is, i.e. how positive or negative your decision function is. We'll talk a bit more about the motivation of these two losses, starting with the logistic loss. --- # Logistic Regression .left-eq-column[ $$\log\left(\frac{p(y=1|x)}{p(y=-1|x)}\right) = w^T\textbf{x} + b$$ $$p(y=1|\textbf{x}) = \frac{1}{1+e^{-w^T\textbf{x} -b }}$$ `$$\min_{w \in ℝ^{p}, b \in \mathbb{R}} \sum_{i=1}^n \log(\exp(-y_i(w^T \textbf{x}_i + b)) + 1)$$` $$\hat{y} = \text{sign}(w^T\textbf{x} + b)$$ ] .right-eq-column[ ![:scale 70%](images/logit.png)] ??? Logistic regression is probably the most commonly used linear classifier, maybe the most commonly used classifier overall. The idea is to model the log-odds, which is log p(y=1|x) - log p(y=0|x) as a linear function, as shown here. Rearranging the formula, you get a model of p(y=1|x) as 1 over 1 + ... This function is called the logistic sigmoid, and is drawn to the right here. Basically it squashed the linear function $w^Tx$ between 0 and 1, so that it can model a probability. Given this equation for p(y|x), what we want to do is maximize the probability of the training set under this model. This approach is known as maximum likelihood. Basically you want to find w and b such that they assign maximum probability to the labels observed in the training data. You can rearrange that a bit and end up with this equation here, which contains the log-loss as seen on the last slide. The prediction is the class with the higher probability. In the binary case, that's the same as asking whether the probability of class 1 is bigger or smaller than .5. And as you can see from the plot of the logistic sigmoid, the probability of the class +1 is greater than .5 exactly if the decision function $w^T x$ is greater than 0. So predicting the class with maximum probability is the same as predicting which side of the hyperplane given by w we are on. Ok so this is logistic regression. We minimize this loss and get a w which defines a hyper plane. But if you think back to last time, this is only part of what we want. This formulation tries to fit the training data, but it doesn't care about finding a simple solution. --- .left-column[ ## Linear Models ### - Linear Regression ### - Ridge Regression ### - Lasso Regression ### - Logistic Regression ] .right-column[ # Penalized Logistic Regression `$$\min_{w \in ℝ^{p}, b \in \mathbb{R}}C \sum_{i=1}^n\log(\exp(-y_i(w^T \textbf{x}_i + b )) + 1) + ||w||_2^2$$` `$$\min_{w \in ℝ^{p}, b \in \mathbb{R}}C \sum_{i=1}^n\log(\exp(-y_i (w^T \textbf{x}_i + b)) + 1) + ||w||_1$$` ] ??? - C is inverse to alpha (or alpha / n_samples) - Both versions strongly convex, l2 version smooth (differentiable). - All points contribute to $w$ (dense solution to dual). So we can do the same we did for regression: we can add regularization terms using the L1 and L2 norm. 
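In scikit-learn this is the `penalty` parameter of `LogisticRegression`; a minimal sketch, assuming the same `X_train` and `y_train` as before (the l1 penalty needs a solver that supports it, e.g. `liblinear` or `saga`):

```python
from sklearn.linear_model import LogisticRegression

logreg_l2 = LogisticRegression(penalty='l2', C=1.0).fit(X_train, y_train)
logreg_l1 = LogisticRegression(penalty='l1', C=1.0, solver='liblinear').fit(X_train, y_train)

# the l1 penalty drives some coefficients to exactly zero
print((logreg_l1.coef_ == 0).sum(), "zero coefficients with l1,",
      (logreg_l2.coef_ == 0).sum(), "with l2")
```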
The effects are the same as for regression: both push the coefficients towards zero, but the l1 norm encourages coefficients to be exactly zero, for the same reasons we discussed last time. You could also use a mixed penalty to get something like the elasticnet. That's not implemented in the logisticregression class in scikit-learn right now, but it's certainly a sensible thing to do. Here I used a slightly different notation as last time, though. I'm not using alpha to multiply the regularizer, instead I'm using C to multiply the loss. That's mostly because that's how it's done in scikit-learn and it has only historic reasons. The idea is exactly the same, only now C is 1 over alpha. So large C means heavy weight to the loss, means little regularization, while small C means less weight on the loss, means strong regularization. Depending on the model, there might be a factor of n_samples in there somewhere. Usually we try to make the objective as independent of the number of samples as possible in scikit-learn, but that might lead to surprises if you're not aware of it. Some side-notes on the optimization problem: here, as in regression, having more regularization makes the optimization problem easier. You might have seen this in your homework already, if you decrease C, meaning you add more regularization, your model fits more quickly. One particular property of the logistic loss, compared to the hinge loss we'll discuss next is that each data point contributes to the loss, so each data point has an effect on the solution. That's also true for all the regression models we saw last time. --- .left-column[ ## Linear Models ### - Linear Regression ### - Ridge Regression ### - Lasso Regression ### - Logistic Regression ] .right-column[ ### Effect of regularization .center[ ![:scale 90%](images/logreg_regularization.png) ] ] ??? - Small C (a lot of regularization) limits the influence of individual points! So I spared you with coefficient plots, because they looks the same as for regression. All the things I said about model complexity and dependency on the number of features and samples is as true for classification as it is for regression. There is another interesting way to thing about regularization that I found helpful, though. I'm not going to walk through the math for this, but you can reformulate the optimization problem and find that what the C parameter does is actually limit the influence of individual data points. With very large C, we said we have no regularization. It also means individual data points can have basically unlimited influence, as you can see here. There are two outliers here, which basically completely tilt the decision boundary. But if we decrease C, and therefore increase the regularization, what happens is that the influence of these outlier points becomes limited, and the other points get more influence. --- class: middle # Linear Models: Recap - Linear models are the most often used baselines when it comes to ML models - Statisticians use them for their interpretability, in ML they are used for simplicity - Essentially the same training mechanics (except for the different objective functions) can be used for both regression and classification - Vanilla versions of these models can be improved by adding L1 or L2 norms as a form of regularization. --- layout: false class: center, middle ![:scale 100%](images/data_scientists_data_cleanup.jpg) --- layout: false class: center, middle Coming up with features is difficult, time-consuming, requires expert knowledge. 
"Applied machine learning" is basically feature engineering. .quote_author[-- Andrew Ng] --- layout: false class: center, middle .left-column[ ## Preprocessing ### - Scaling ] .right-column[ ![:scale 100%](images/knn_scaling.png) ] --- layout: false class: center, middle .left-column[ ## Preprocessing ### - Scaling ] .right-column[ ![:scale 100%](images/knn_scaling2.png) ] ??? - kNN computes euclidian distances and those are much much bigger along the X axis --- layout: false class: center, middle .left-column[ ## Preprocessing ### - Scaling ] .right-column[ ### Various scaling approaches ![:scale 100%](images/scaler_comparison_scatter.png) ] ??? --- class: middle ### StandardScaler in scikit-learn ```python from sklearn.linear_model import Ridge X_train, X_test, y_train, y_test = train_test_split( X, y, random_state=0) scaler = StandardScaler() scaler.fit(X_train) X_train_scaled = scaler.transform(X_train) ridge = Ridge().fit(X_train_scaled, y_train) X_test_scaled = scaler.transform(X_test) ridge.score(X_test_scaled, y_test) ``` ``` 0.684 ``` --- layout: false class: center, middle ### Beware Improper Scaling ![:scale 100%](images/no_separate_scaling.png) ??? --- layout: false class: middle .left-column[ ## Preprocessing ### - Scaling ### - Categorical Variables ] .right-column[ .smaller[ ```python import pandas as pd df = pd.DataFrame({ 'boro': ['Manhattan', 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Bronx'], 'salary': [103, 89, 142, 54, 63, 219], 'vegan': ['No', 'No','No','Yes', 'Yes', 'No']}) ``` ]
|   | boro      | salary | vegan |
|---|-----------|--------|-------|
| 0 | Manhattan | 103    | No    |
| 1 | Queens    | 89     | No    |
| 2 | Manhattan | 142    | No    |
| 3 | Brooklyn  | 54     | Yes   |
| 4 | Brooklyn  | 63     | Yes   |
| 5 | Bronx     | 219    | No    |
] ??? --- # Ordinal encoding .smaller[ ```python df['boro_ordinal'] = df.boro.astype("category").cat.codes df ``` ] .left-eq-column[
|   | boro | salary | vegan |
|---|------|--------|-------|
| 0 | 2    | 103    | No    |
| 1 | 3    | 89     | No    |
| 2 | 2    | 142    | No    |
| 3 | 1    | 54     | Yes   |
| 4 | 1    | 63     | Yes   |
| 5 | 0    | 219    | No    |
] -- .right-eq-column[ ![:scale 100%](images/boro_ordinal.png) ] ??? - This imposes order (it is only by luck that 0 is No, because it was first in line) --- # One-Hot (Dummy) Encoding .narrow-left-column[
|   | boro      | salary | vegan |
|---|-----------|--------|-------|
| 0 | Manhattan | 103    | No    |
| 1 | Queens    | 89     | No    |
| 2 | Manhattan | 142    | No    |
| 3 | Brooklyn  | 54     | Yes   |
| 4 | Brooklyn  | 63     | Yes   |
| 5 | Bronx     | 219    | No    |
] .wide-right-column[ .tiny[ ```python pd.get_dummies(df) ```
|   | salary | boro_Bronx | boro_Brooklyn | boro_Manhattan | boro_Queens | vegan_No | vegan_Yes |
|---|--------|------------|---------------|----------------|-------------|----------|-----------|
| 0 | 103    | 0          | 0             | 1              | 0           | 1        | 0         |
| 1 | 89     | 0          | 0             | 0              | 1           | 1        | 0         |
| 2 | 142    | 0          | 0             | 1              | 0           | 1        | 0         |
| 3 | 54     | 0          | 1             | 0              | 0           | 0        | 1         |
| 4 | 63     | 0          | 1             | 0              | 0           | 0        | 1         |
| 5 | 219    | 1          | 0             | 0              | 0           | 1        | 0         |
] ] --- # One-Hot (Dummy) Encoding .narrow-left-column[
|   | boro      | salary | vegan |
|---|-----------|--------|-------|
| 0 | Manhattan | 103    | No    |
| 1 | Queens    | 89     | No    |
| 2 | Manhattan | 142    | No    |
| 3 | Brooklyn  | 54     | Yes   |
| 4 | Brooklyn  | 63     | Yes   |
| 5 | Bronx     | 219    | No    |
] .wide-right-column[ .tiny[ ```python pd.get_dummies(df, columns=['boro']) ```
|   | salary | vegan | boro_Bronx | boro_Brooklyn | boro_Manhattan | boro_Queens |
|---|--------|-------|------------|---------------|----------------|-------------|
| 0 | 103    | No    | 0          | 0             | 1              | 0           |
| 1 | 89     | No    | 0          | 0             | 0              | 1           |
| 2 | 142    | No    | 0          | 0             | 1              | 0           |
| 3 | 54     | Yes   | 0          | 1             | 0              | 0           |
| 4 | 63     | Yes   | 0          | 1             | 0              | 0           |
| 5 | 219    | No    | 1          | 0             | 0              | 0           |
] ] ??? We can specify selectively which columns to apply the encoding to. --- # One-Hot (Dummy) Encoding .narrow-left-column[
|   | boro | salary | vegan |
|---|------|--------|-------|
| 0 | 2    | 103    | No    |
| 1 | 3    | 89     | No    |
| 2 | 2    | 142    | No    |
| 3 | 1    | 54     | Yes   |
| 4 | 1    | 63     | Yes   |
| 5 | 0    | 219    | No    |
] .wide-right-column[ .tiny[ ```python pd.get_dummies(df_ordinal, columns=['boro']) ```
|   | salary | vegan | boro_0 | boro_1 | boro_2 | boro_3 |
|---|--------|-------|--------|--------|--------|--------|
| 0 | 103    | No    | 0      | 0      | 1      | 0      |
| 1 | 89     | No    | 0      | 0      | 0      | 1      |
| 2 | 142    | No    | 0      | 0      | 1      | 0      |
| 3 | 54     | Yes   | 0      | 1      | 0      | 0      |
| 4 | 63     | Yes   | 0      | 1      | 0      | 0      |
| 5 | 219    | No    | 1      | 0      | 0      | 0      |
] ] ??? This also helps if the variable was already encoded using integers. Sometimes, someone has already encoded the categorical variables as integers, like here. So this is exactly the same information, except that instead of strings you have numbers. If you call get_dummies on this, nothing happens, because none of the columns are object or categorical data types. If you want the one-hot encoding, you can explicitly pass `columns=` and this will transform them into boro_0, boro_1, boro_2, boro_3. In this case get_dummies usually wouldn't do anything, but we can tell it which variables are categorical and it will dummy-encode those for us. --- layout: false class: center, middle .left-column[ ## Preprocessing ### - Scaling ### - Categorical Variables ## Feature Engineering ] -- .right-column[ ![:scale 70%](images/1d-linearly-inseparable-classes.png) ] --- layout: false class: center, middle .left-column[ ## Preprocessing ### - Scaling ### - Categorical Variables ## Feature Engineering ] .right-column[ ![:scale 100%](images/1d-linearly-inseparable-classes-solution.png) ] --- layout: false class: center, middle .left-column[ ## Preprocessing ### - Scaling ### - Categorical Variables ## Feature Engineering ] -- .right-column[ ![:scale 70%](images/2d-linearly-inseparable-classes.png) ] --- layout: false class: center, middle .left-column[ ## Preprocessing ### - Scaling ### - Categorical Variables ## Feature Engineering ] .right-column[ ![:scale 100%](images/2d-linearly-inseparable-classes-solution.png) ] ??? - Is this really all we can do? Do we really have to guess? Is there not a better way? - Turns out there has been, for about 10 years now: neural networks --- # Preprocessing and Feature Engineering: Recap - The scale of the input data matters (especially when it differs across features) - Beware Improper Scaling: never call `fit` or `fit_transform` on the test set - When dealing with categorical data, one-hot (dummy) encoding generally works best - Engineering better features is one of the prime ways of improving a model's performance, but it's dull and tedious. Can we not automate it? --- layout: false # Neural Networks: Historical perspective - Nearly all the things we talk about today existed before the year 2000. -- - So what changed? -- A few things: - More data - More computational power -- - And a few new tricks .red[*]: - ReLU - Dropout - Adam, RMSProp, AdamW, RAdam, Ranger, cyclic learning rates... - BatchNorm - residual connections (this one is actually old as well...) .footnote[.font-small[.red[*]And sadly, way, way too much hype. ]] --- .left-column[ ## Neural Networks ### - Perceptron ] .right-column[ ### The basic building block .center[ ![:scale 60%](images/log_reg_nn.png) ] ] ??? Before I introduce you to neural networks, I want to show you how people often draw them, by drawing a model that you already know, binary logistic regression, in the same way. This drawing basically only encodes the prediction process, not the model building or fitting process. Networks are often drawn as circles, which basically represent numbers. So here, I drew a four-dimensional input vector, x[0] to x[3]. For each input, we drew a little circle representing this number. In binary logistic regression, the output is computed as the inner product of the input vector and a weight vector w, or in other words a weighted sum of the x's, weighted by the ws.
So each of these arrows corresponds to weighting the input by an w_i and then they are all added up to the output y. In logistic regression we usually compute the output y as the probability, given by the logistic sigmoid of this weighted sum. This is not drawn but kind of implicit in the picture. I also didn't draw the bias here, which is also added to the final sum. So to summarize, circles mean numbers, and arrows mean products between what's on the origin of the arrow, and a coefficient that's part of the model. And arrows into circles mean sums, and they often imply some non-linear function like the logistic sigmoid. --- .left-column[ ## Neural Networks ### - Perceptron ] .right-column[ ### But can it solve XOR? ![:scale 100%](images/xor.png) Let's try out at [playground.tensorflow.org](https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=xor®Dataset=reg-plane&learningRate=0.03®ularizationRate=0&noise=15&networkShape=&seed=0.71253&showTestData=false&discretize=false&percTrainData=70&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false) ] .footnote[.font-small[Image from https://medium.com/@lucaspereira0612/solving-xor-with-a-single-perceptron-34539f395182]] --- .left-column[ ## Neural Networks ### - Perceptron ### - MLP ] .right-column[ ### Basic Architecture .center[ ![:scale 80%](images/nn_basic_arch.png) ] $ h(x) = f(W_1x+b_1) $ $ o(x) = g(W_2h(x) + b_2) $ ] ??? So now let's look at what a neural network is. It's very similar to the logistic regression, only applied several times. So here I drew three inputs x, from which we compute weighted sums. But now there is not only a single output, but we compute several intermediate outputs, the hidden units. Each of the hidden units is a weighted sum of the inputs, using different weights. Each hidden unit corresponds to an inner product with a weight vector, so all together they correspond to multiplying the input by a matrix, here a 3 by 4 matrix. We also add a bias for each of the hidden units, so a vector of dimension 4 for all of them. Then, we basically repeat the process and compute again weighted sums of the hidden units, to arrive at the outputs. Here this would correspond to multiplying by a 4 by 2 matrix, and adding a vector of dimension 2. Then we could apply for example a logistic sigmoid or softmax to get a classification output. So we basically do two matrix multiplications. If that was all, we could simplify this by just multiplying the matrices together into a single matrix, and we would just have a linear model. But the interesting part is that in the hidden layer, after we compute the weighted sum, we apply a non-linear function, often called activation function or just nonlinearity. That is what makes this whole function non-linear, and allows us to express much more interesting relationships. You can think of this as doing logistic regression with learning non-linear basis functions. The process is written here in formulas, we take the input x, multiply with a matrix W, add a bias b, and apply a non-linearity f, to get h. Then we multipy h by another matrix W', add a bias b' and apply a non-linearity g. This looks a bit like the word2vec we had last time, though here it's really very important to have the non-linear activation functions. Wha we want to learn from data are the weights and biases, so in this case a 3x4 matrix, a vector of dim 4, a 2x4 matrix and a vector of dim 2. 
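As a reference, the forward pass in the two formulas on the slide is just a couple of matrix products; a minimal NumPy sketch with random placeholder weights (the shapes match the drawing: 3 inputs, 4 hidden units, 2 outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                                  # input
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)           # first layer parameters
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)           # second layer parameters

h = np.tanh(W1 @ x + b1)                                # h(x) = f(W1 x + b1)
logits = W2 @ h + b2
o = np.exp(logits) / np.exp(logits).sum()               # o(x) = softmax(W2 h(x) + b2)
```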
Each of these steps of computation is usually called a layer. Here we have an input layer, one hidden layer, and an output layer. The hidden layer is called hidden because the computation is not actually part of the result, it's just internal to the algorithm. Though I'm drawing all these things in a similar way, don't confuse them with graphical models, as I drew them for latent dirichlet allocation. All these nodes here are completely deterministic functions, and the graph just illustrates very simple computations. --- .left-column[ ## Neural Networks ### - Perceptron ### - MLP ] .right-column[ ### More layers -> Multi Layer Perceptron (MLP) .center[ ![:scale 60%](images/nn_manylayers.png) ] ] ??? - Hidden layers usually all have the same non-linear function, weights are different for each layer. - Many layers → “deep learning”. - This is called a multilayer perceptron, feed-forward neural network, vanilla feed-forward neural network. - For regression usually single output neuron with linear activation. - For classification one-hot-encoding of classes, n_classes many output variables with softmax. We can have arbitrary many layers in a neural network, and more layers or more units allow to express more complex functions. And the term deep-learning is referring to neural nets with many hidden layers. This type of network is called a multilayer perceptron, basically because it has more than one layer of computation. For the output, it functions similar to linear models. You can do binary classification with a single output, you can do multi-class classification with a softmax like in multinomial logistic regression, or you can have just a single output for regression. All of the hidden layers usually have the same non-linear function f, but that's not usually the same as the output function. --- .left-column[ ## Neural Networks ### - Perceptron ### - MLP ### - Activation Functions ] .right-column[ ### Nonlinear activation function .center[ ![:scale 65%](images/nonlin_fn.png) ] ] ??? - There is also sigmoid which we already saw - Reasons for preferring relu over tanh - Comment a bit on new developments (Swish, Leaky Relu) The two standard choices for the non-linear activation function are shown here, either the tanh, which is the more traditional choice, or the rectified linear unit or relu, which is more commonly used recently. Tanh basically squashes everything between -inf and +inf to -1 and 1 and saturates towards the infinities. The rectified linear unit just is constant zero for all negative numbers, and then the identity. One of the reasons given for preferring the relu unit is that the gradient of tanh is very small in most places, which makes it hard to optimize. --- # Training Objective `$$ h(x) = f(W_1x+b_1) $$` `$$ o(x) = g(W_2h(x)+b_2) = g(W_2f(W_1x + b_1) + b_2)$$` -- `$$ \min_{W_1,W_2,b_1,b_2} \sum\limits_{i=1}^N l(y_i,o(x_i)) $$` -- `$$ =\min_{W_1,W_2,b_1,b_2} \sum\limits_{i=1}^N l(y_i,g(W_2f(W_1x+b_1)+b_2)$$` -- - `$l$` Squared loss for regression. Cross-entropy loss for classification ??? So how are we going to learn these parameters, these weights and biases for all the layers? If we did logistic regression, this would be relatively simple. We know it's a convex optimization problem, to optimize the log-loss, so we can just run any optimizer, and they'll all come to the same result. For the neural network it's a bit more tricky. We can still write down the objective. 
So let's say we have a single hidden layer, then the hidden layer is x multiplied by W with bias b added, and then a non-linearity f, and the output o(x) is another matrix multiplication, another bias, and the output non-linearity like softmax. If we want to find the parameters W1,w2,b1,b2, we want to do empirical risk minimization, so for classification we want to minimize the cross-entropy loss for classification of the outputs o(x) given the ground thruth y_i over the training set. For regression we'd just use the square loss instead. FIXME why W1/W2 here and before W and W'? We could also add a regularizer, like an L2 penalty on the weights W if we wanted, though that's not necessarily as important in neural networks as in linear models. Generally, this is the same approach as for linear models, only the formula for the output is more complicated now. In particular this objective is not convex, so we can basically not hope to find the global optimum. But we can still try to find "good enough" values for the parameters w and b by running an optimizer. We could use a gradient based optimizer like gradient descent, or newton, or conjugate gradient of lbfgs on this objective and this will yield a local optimum, but not necessarily a global one, and that's basically the best we can do. Because you have to care about this optimization more than for many other models I'll go into a bit more details about how this works. Let me know if anything is unclear. --- .left-column[ ## Neural Networks ### - Perceptron ### - MLP ### - Activation Functions ### - Training ] .right-column[ ### Backpropagation .center[ Need $\frac{\partial l(y, o)}{\partial W_i}$ and $\frac{\partial l(y, o)}{\partial b_i}$ $$ \text{net}(x) := W_1x + b_1 $$ .center[![:scale 70%](images/backprop_eqn.png)] ] ] ??? To run an optimizer, we do need to compute the gradients for all our parameters though, and that's a bit non-obious. Luckily there's a simple algorithm to do that, called backpropagation, that is computationally very simple. You probably heard the name backpropagation before, and often people make a big deal out of it, but it is not actually a learning algorithm or anything like that, it's just a nice way to compute the gradients in a neural network. And back propagation is basically just a clever application of the chain rule for derivatives. So let's say we want to get the gradients for the first weight matrix W1, so del o/del w1. If you try to write this down directly from the formula for the network it's a bit gnarly, but using the chain rule we can simplify this a bit. Let's define net(X) as the first hidden layer before the non-linearity. Then we can apply the chain rule (twice) and we see that we can write the gradient as a product of three terms, the input to the first layer, x, the gradient of the non-linearity f, and the gradient of the layer after W1, namely h. So to compute the gradient of the first weight vector, we need the activation/value of the layer before, and the derivative of the activation after. FIXME backpropagation image? When computing the predictions, we compute all the activations of the layers, and we get an error. So we already have all these values. So to compute the gradients, we can do a single sweep, a backward pass, from the output to the input, computing the derivatives using the chain rule. It's probably educational to go through the details of this once yourself, deriving this starting from the chain rule. But you can also just look it up in the deep learning book I linked to. 
I don't think it's gonna be instructive if I try to walk you through the algebra. Anyone see a problem with this? In particular the gradient of the non-linearity? --- .left-column[ ## Neural Networks ### - Perceptron ### - MLP ### - Activation Functions ### - Training ] .right-column[ ### Gradient Descent Variants Batch `$$ W_i \leftarrow W_i - \eta\sum\limits_{j=1}^N \frac{\partial l(x_j,y_j)}{\partial W_i} $$` Online/Stochastic `$$ W_i \leftarrow W_i - \eta\frac{\partial l(x_j,y_j)}{\partial W_i}$$` Minibatch `$$ W_i \leftarrow W_i - \eta\sum\limits_{j=k}^{k+m} \frac{\partial l(x_j,y_j)}{\partial W_i}$$` ] ??? So doing standard gradient descent, we would update a weight matrix W_i but using the old W_i and taking a gradient step, so subtracting the gradient of the loss wrt the paramters, summed over the whole training set, times some learning rate. The problem with this is that it's quite slow. Computing all these gradients means that we need to pass all the examples forward through the network, make predictions, and then do a backward pass with backpropagation. That's a lot of matrix multiplications to do a single gradient step, in particular given that we want to do this for very large datasets. So what we can do to speed this up is doing a stochastic approximation, as we already saw for linear models, doing stochastic gradient descent aka online gradient descent. Here, you pick a sample at random, compute the gradient just considering that sample, and then update the parameter. So you update the weights much more often, but you have a much less stable estimate of the gradient. In practice, we often just iterate through the data instead of picking a sample at random. And as with linear models, this is much faster than doing full batches for large datasets. However, it's less stable, and also it doesn't necessarily use the hardware in the best way. So we can do a compromise in where we look at mini-batches of size k, usually something like 64 or 512. So we look at k samples, compute the gradients, average them, and update the weights. That allows us to update much more often than looking at the whole dataset, while still having a more stable gradient, and better being able to use the parallel computing capabilities of modern CPUs and GPUs. This is what's used in practice basically always. The reason why this is faster is basically that doing a matrix-matrix multiplication is faster than doing a bunch of matrix-vector operations. In principle we could also be using smarter optimization methods, like second order methods or LBFGS, but these are often not very effective on these large non-convex problems. One, called levenberg-marquardt is actually a possibility, but it's not really used these days. 
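---
class: middle

### Minibatch SGD, sketched

A minimal sketch of the minibatch update from the previous slide, on hypothetical synthetic data and a linear model with squared loss, so the gradient can be written down by hand (this is not the training loop used for the networks later on):

.smaller[
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1., -2., 0.5, 0., 3.])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w, eta, batch_size = np.zeros(5), 0.1, 64
for epoch in range(20):
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        # gradient of the mean squared loss on this minibatch
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w -= eta * grad        # the minibatch update rule from the slide
print(w.round(2))
```
]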
--- .left-column[ ## Neural Networks ### - Perceptron ### - MLP ### - Activation Functions ### - Training ] .right-column[ ### Playing with $\eta$ - Turns out static $\eta$ does not work very well - Having some sort of a schedule for $\eta$ helps - It is even better to have adaptive $\eta$ for each $W_i$ (for each layer) - Current state of the art: `Adam` .red[*] .footnote[.font-small[.red[*] Which was proven not to converge but people use it anyway because it somewhat works in practice...]] ] --- class: middle # Neural Networks: Recap - Neural Networks came a long way since their inception in 1950s -- most of the models we use today have withstood the test of time - They are a versatile framework that can encompass many "standard" model using the same learning mechanics - Despite their apparent simplicity, they have a lot of hyperparameters to tune, most notably the learning rate $\eta$ --- # Ball snake vs Carpet Python .center[ ![:scale 100%](images/ball_snake_vs_python.png) ] ??? - The ball python is a ground snake -- spend a lot of time hiding - The carpet Python will spend more time out and on display --- template: inverse class: center, middle ![:scale 100%](images/quote-to-deal-with-a-14-dimensional-space-visualize-a-3-d-space-and-say-fourteen-to-yourself-geoffrey-hinton-135-56-47.jpg) .footnote[.font-small[Source: https://www.azquotes.com/quote/1355647]] --- class: middle ![:scale 100%](images/MLvsDL.png) --- .left-column[ ## CNNs ] .right-column[ ### Historical perspective - Historically somewhat inspired by the brain (as understood back then) - Currently very far from what we understand the brain does. .font-small[__IEEE Spectrum__: We read about Deep Learning in the news a lot these days. What’s your least favorite definition of the term that you see in these stories? __Yann LeCun__: My least favorite description is, “It works just like the brain.” I don’t like people saying this because, while Deep Learning gets an inspiration from biology, it’s very, very far from what the brain actually does. And describing it like the brain gives a bit of the aura of magic to it, which is dangerous. It leads to hype; people claim things that are not true. AI has gone through a number of AI winters because people claimed things they couldn’t deliver. https://spectrum.ieee.org/automaton/artificial-intelligence/machine-learning/facebook-ai-director-yann-lecun-on-deep-learning ] ] ??? There are two things we want from CNNs: - translation invariance - weight sharing And that's implemented via convolutions. --- .left-column[ ## CNNs ] .right-column[ ### Historical perspective II - First practical success was LeNet applied to zip code prediction (LeCun et. al, 1998) .center[ ![:scale 100%](images/CNET1.png) ] ] --- .center[ ![:scale 100%](images/other_architectures.png) ] ??? Here are two more recent architectures, AlexNet from 2012 and VGG net from 2015. These nets are typically very deep, but often have very small convolutions. In VGG there are 3x3 convolutions and even 1x1 convolutions which serve to summarize multiple feature maps into one. There is often multiple convolutions without pooling in between but pooling is definitely essential. 
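---
class: middle

### What a convolution filter computes

The next slides visualize the operation a single filter performs -- a small kernel slid over the image, taking a dot product at every position. As a reference, a hedged sketch in plain NumPy (deep learning libraries call this a convolution, though strictly it is a cross-correlation):

.smaller[
```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D cross-correlation of an image with a kernel."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(8, 8)
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])   # a horizontal-gradient filter
print(conv2d(image, sobel_x).shape)                         # (6, 6)
```
]

In a CNN the kernel values are not hand-crafted like this Sobel filter; they are learned, just like the weights of an MLP.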
--- .left-column[ ## CNNs ### - Convolution filters ] .right-column[ ### Visualizing the mechanics of a convolution filter .center[ ![:scale 95%](images/2dconv_illustration.png) ] [source: Arden Dertat](https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2) ] --- .left-column[ ## CNNs ### - Convolution filters ] .right-column[ ### Visualizing the mechanics of a convolution filter .center[ ![:scale 90%](images/2dconv_animation.gif) ] [source: Arden Dertat](https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2) ] --- .left-column[ ## CNNs ### - Convolution filters ] .right-column[ ### 2d Smoothing .center[ ![:scale 80%](images/2dsmoothing.png) ] ] --- .left-column[ ## CNNs ### - Convolution filters ] .right-column[ ### 2d Gradients .center[ ![:scale 80%](images/2dgradient.png) ] ] --- class: center, middle .left-column[ ## CNNs ### - Convolution filters ### - Convolution layers ] .right-column[ ![:scale 40%](images/ml_cnn/layer1.png) ] --- class: center, middle .left-column[ ## CNNs ### - Convolution filters ### - Convolution layers ] .right-column[ ![:scale 100%](images/ml_cnn/layer2.png) ] --- class: center, middle .left-column[ ## CNNs ### - Convolution filters ### - Convolution layers ] .right-column[ ![:scale 100%](images/ml_cnn/layer3.png) ] --- class: center, middle .left-column[ ## CNNs ### - Convolution filters ### - Convolution layers ] .right-column[ ![:scale 100%](images/ml_cnn/layer4.png) ] --- class: center, middle .left-column[ ## CNNs ### - Convolution filters ### - Convolution layers ] .right-column[ ![:scale 100%](images/ml_cnn/layer5.png) ] --- class: center, middle .left-column[ ## CNNs ### - Convolution filters ### - Convolution layers ] .right-column[ ![:scale 100%](images/ml_cnn/layer6.png) ] --- class: center, middle .left-column[ ## CNNs ### - Convolution filters ### - Convolution layers ] .right-column[ ![:scale 100%](images/ml_cnn/layer7.png) ] --- class: center, middle .left-column[ ## CNNs ### - Convolution filters ### - Convolution layers ] .right-column[ ![:scale 100%](images/ml_cnn/layer8.png) ] --- class: center, middle .left-column[ ## CNNs ### - Convolution filters ### - Convolution layers ] .right-column[ ![:scale 100%](images/ml_cnn/layer9.png) ] --- class: center, middle .left-column[ ## CNNs ### - Convolution filters ### - Convolution layers ] .right-column[ ![:scale 100%](images/ml_cnn/layer10.png) ] --- class: center, middle .left-column[ ## CNNs ### - Convolution filters ### - Convolution layers ] .right-column[ ![:scale 100%](images/ml_cnn/convnet.jpeg) ] --- class: center, middle .left-column[ ## CNNs ### - Convolution filters ### - Convolution layers ### - Pooling layers ] .right-column[ ![:scale 90%](images/ml_cnn/pool_layer.png) ] --- class: center, middle .left-column[ ## CNNs ### - Convolution filters ### - Convolution layers ### - Pooling layers ] .right-column[ ![:scale 100%](images/ml_cnn/max_pool.png) ] --- class: middle .left-column[ ## CNNs ### - Convolution filters ### - Convolution layers ### - Pooling layers ### - Fully connected layers ] .right-column[ The final part of a CNN is usually a standard MLP we've already seen ![:scale 100%](images/ml_cnn/convnet.jpeg) ] --- class: middle .left-column[ ## CNNs ### - Convolution filters ### - Convolution layers ### - Pooling layers ### - Fully connected layers
### - Example in Keras ] .right-column[ ### Prepare Data .smaller[ ```python import keras from keras.datasets import mnist batch_size = 128 num_classes = 10 epochs = 12 # input image dimensions img_rows, img_cols = 28, 28 # the data, shuffled and split between train and test sets (x_train, y_train), (x_test, y_test) = mnist.load_data() X_train_images = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1) X_test_images = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1) input_shape = (img_rows, img_cols, 1) y_train = keras.utils.to_categorical(y_train, num_classes) y_test = keras.utils.to_categorical(y_test, num_classes) ``` ] ] --- class: middle .left-column[ ## CNNs ### - Convolution filters ### - Convolution layers ### - Pooling layers ### - Fully connected layers ### - Example in Keras ] .right-column[ # Create Tiny CNN ```python from keras.models import Sequential from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense num_classes = 10 cnn = Sequential() cnn.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) cnn.add(MaxPooling2D(pool_size=(2, 2))) cnn.add(Conv2D(32, (3, 3), activation='relu')) cnn.add(MaxPooling2D(pool_size=(2, 2))) cnn.add(Flatten()) cnn.add(Dense(64, activation='relu')) cnn.add(Dense(num_classes, activation='softmax')) ``` ] --- ### Number of Parameters .left-eq-column[ Convolutional Network for MNIST ![:scale 100%](images/cnn_params_mnist.png) ] .right-eq-column[ Dense Network for MNIST ![:scale 100%](images/dense_params_mnist.png) ] --- class: middle .left-column[ ## CNNs ### - Convolution filters ### - Convolution layers ### - Pooling layers ### - Fully connected layers ### - Example in Keras ] .right-column[ ### Train and Evaluate .smaller[ ```python cnn.compile("adam", "categorical_crossentropy", metrics=['accuracy']) history_cnn = cnn.fit(X_train_images, y_train, batch_size=128, epochs=20, verbose=1, validation_split=.1) cnn.evaluate(X_test_images, y_test) ``` ``` 9952/10000 [============================>.] - ETA: 0s [0.089020583277629253, 0.98429999999999995] ``` ] .center[ ![:scale 50%](images/train_evaluate.png) ] ] --- class: middle # CNNs: Recap - CNNs are a special kind of neural network model that allows for translation invariance in its input whilst also utilizing weight sharing - They are implemented as a set of convolution operators with trainable filters - The convolution layers are usually followed by pooling layers that reduce the input size. - The fully connected layers that are oftentimes mentioned are just the MLPs we've seen before -- they perform the final classification --- class: middle .left-column[ ## DeepLearning Tricks ### - Data Augmentation ] .right-column[ ![:scale 35%](images/carpet_snake.png) --
![:scale 22%](preview/snek_0_1271.jpeg) ![:scale 22%](preview/snek_0_3411.jpeg) ![:scale 22%](preview/snek_0_4863.jpeg) ![:scale 22%](preview/snek_0_5876.jpeg) ![:scale 22%](preview/snek_0_9549.jpeg) ![:scale 22%](preview/snek_0_4484.jpeg) ![:scale 22%](preview/snek_0_6377.jpeg) ![:scale 22%](preview/snek_0_4599.jpeg) ] --- class: middle .left-column[ ## DeepLearning Tricks ### - Data Augmentation ] .right-column[ - Rotation - Mirroring - Random crops - Addition/removal of noise/contrast/blur - ...
- One of the few ways of getting supervised (labeled) training data for free. ] --- class: middle .left-column[ ## DeepLearning Tricks ### - Data Augmentation ### - Dropout ] .right-column[ .center[ ![:scale 75%](images/dropout_reg.png) ] -- - Randomly set a fraction of the units (activations) to zero during training; at prediction time use all of them, down-weighted by `1 - dropout_rate` - Essentially transforms the network into an ensemble of smaller networks that share weights. ] --- class: middle .left-column[ ## DeepLearning Tricks ### - Data Augmentation ### - Dropout ] .right-column[ .smaller[ ```python from keras.models import Sequential from keras.layers import Dense, Dropout model_dropout = Sequential([ Dense(1024, input_shape=(784,), activation='relu'), Dropout(.5), Dense(1024, activation='relu'), Dropout(.5), Dense(10, activation='softmax'), ]) model_dropout.compile("adam", "categorical_crossentropy", metrics=['accuracy']) history_dropout = model_dropout.fit(X_train, y_train, batch_size=128, epochs=20, verbose=1, validation_split=.1) ``` ] ] --- class: center, middle .left-column[ ## DeepLearning Tricks ### - Data Augmentation ### - Dropout ### - Residual Connections ] .right-column[ ### Problem ![:scale 90%](images/resnet-no-deep-nets.png) ] ??? We can't fit deep networks well - not even on the training set! The "vanishing gradient problem" was a motivation for relu, but it is not solved yet. The deeper the network gets, usually the better the performance. But if you make your network too deep, then you can't learn it anymore. This is on CIFAR-10, which is a relatively small dataset. But if you try to learn a 56-layer convolutional network, you cannot even optimize it on the training set. So basically, it's not that we can't generalize, it's that we can't optimize. These are universal approximators, so ideally we should be able to completely overfit the training set. But here, if we make it too deep, we cannot overfit the training set anymore, which is a bad thing, because it means we can't really optimize the problem. This is connected to the vanishing gradient problem: it's very hard to backpropagate the error through a very deep net, because the further you get from the output, the less informative the gradients become. We talked about ReLU units, which helped to make this a little bit better. Without ReLU units you had like 4 or 5 layers, with ReLU units you get like 20 layers. But if you do 56 layers, it's not going to work anymore, even on the training set. So this has been a big problem. And it has a surprisingly simple solution, which is the residual (ResNet) layer. --- class: center, middle .left-column[ ## DeepLearning Tricks ### - Data Augmentation ### - Dropout ### - Residual Connections ] .right-column[ ### Solution ![:scale 50%](images/residual-layer.png) `$$\text{out} = F(x, \{W_i\}) + x \quad \text{ for same size layers}$$` `$$\text{out} = F(x, \{W_i\}) + W_sx \quad \text{ for different size layers}$$` `$$F(x) = \text{out} - x \quad \text{learning the residual}$$` ] ??? instead of learning a function, learn the difference to the identity. if sizes differ, add a linear projection. Here's what the residual layer looks like. The idea is, let's say you have a bunch of weight layers; instead of learning a function, we learn how the function is different from the identity. So you're not trying to model the whole relationship between x and y, you want to model how y is different from x.
In practice, what happens is you have multiple weight layers, usually like 2, and you have skip connection that gives the identity from before these layers to after these layers. So basically, if you set these weights all to zero, you have a pass-through layer. And this allows information to be back-propagated much more easily because you have all these identity matrices, so something always gets backpropagated. So this obviously only works if y and x have the same shape. So in CNNs, often the convolutional layers have sort of the same shape. But then you also have max-pooling layers. And so what you can do is, instead of having the identity, you use a linear transformation. And this way, the gradients can propagate better. So just seems like a very simple idea but people have tried it before and it really made a big difference. --- class: center, middle ![:scale 40%](images/resnet-architecture.png) --- class: center, middle .left-column[ ## DeepLearning Tricks ### - Data Augmentation ### - Dropout ### - Residual Connections ] .right-column[ ![:scale 100%](images/resnet-success.png) ] ??? Here’s the result. The solid line is the training error and the bold line is the test error. The lines are depicted over the number of iterations. So what you can see here is the 18 layers works well than the 34 layers. And the training set is worse than the test set everywhere. But with the 34 layers, even the training set can’t beat the test set of the 18 layers. The next one is exactly the same architecture, but we put in all these identity matrices. And now we can see that the 18 layer is pretty unchanged. But the 34 layer is now actually better than the 18 layers. And in particular, we can overfit the dataset a little bit. This is a much better result than before. And when you publish this, this was state of the art and quite a big jump. 
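---
class: middle

### A residual block, sketched in Keras

A hedged sketch of the block from the Solution slide, using the Keras functional API (the `residual_block` helper is hypothetical, not part of any library). It assumes `x` already has `filters` channels; if not, the shortcut needs the 1x1-convolution projection `W_s x` from the slide:

.smaller[
```python
from keras import layers

def residual_block(x, filters):
    # F(x, {W_i}): two convolutions
    h = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
    h = layers.Conv2D(filters, (3, 3), padding='same')(h)
    # the skip connection: out = F(x) + x
    out = layers.add([h, x])
    return layers.Activation('relu')(out)
```
]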
--- class: center, middle .left-column[ ## DeepLearning Tricks ### - Data Augmentation ### - Dropout ### - Residual Connections ] .right-column[ ![:scale 70%](images/resnet-results.png) ] --- class: center, middle .left-column[ ## DeepLearning Tricks ### - Data Augmentation ### - Dropout ### - Residual Connections ### - Transfer Learning ] .right-column[ ### Reusing big CNN (and other) architectures .center[ ![:scale 80%](images/pretrained_network.png) ] ] --- class: spacious # Ball snake vs Carpet Python .center[ ![:scale 100%](images/ball_snake_vs_python.png) ] --- .center[ ![:scale 80%](images/carpet_python_snake.png) ] --- class: middle .left-column[ ## DeepLearning Tricks ### - Data Augmentation ### - Dropout ### - Residual Connections ### - Transfer Learning ] .right-column[ ### Extracting Features with VGG .smaller[ ```python import numpy as np from keras import applications from keras.preprocessing import image X = np.array([image.img_to_array(img) for img in images_carpet + images_ball]) # load VGG16 model = applications.VGG16(include_top=False, weights='imagenet') # preprocessing for VGG16 from keras.applications.vgg16 import preprocess_input X_pre = preprocess_input(X) features = model.predict(X_pre) print(X.shape) print(features.shape) features_ = features.reshape(200, -1) ``` ``` (200, 224, 224, 3) (200, 7, 7, 512) ```] ] --- class: middle .left-column[ ## DeepLearning Tricks ### - Data Augmentation ### - Dropout ### - Residual Connections ### - Transfer Learning ] .right-column[ ### Classification with Logistic Regression .smaller[ ```python from sklearn.linear_model import LogisticRegressionCV lr = LogisticRegressionCV().fit(X_train, y_train) print(lr.score(X_train, y_train)) print(lr.score(X_test, y_test)) from sklearn.metrics import confusion_matrix confusion_matrix(y_test, lr.predict(X_test)) ``` ``` 1.0 0.82 array([[24, 1], [ 8, 17]]) ```] ] --- class: center, middle .left-column[ ## DeepLearning Tricks ### - Data Augmentation ### - Dropout ### - Residual Connections ### - Transfer Learning ] .right-column[ ### Finetuning .center[ ![:scale 90%](images/finetuning.png) ] ] --- # Deep Learning: Recap - Data Augmentation is one of the easiest ways of getting labeled data, and it almost always helps. It is much easier to do for images than for other modalities (text, for instance). - Dropout and residual connections are two simple ideas that are very helpful in training Deep Learning models - When dealing with a new task, the shortest path to a baseline or proof-of-concept is via Transfer Learning - Thanks to Transfer Learning, Deep Learning architectures can be used even on small datasets that would otherwise be prone to overfitting --- class: middle # Introduction to Supervised Learning: Recap - Supervised Learning - Model Selection - Linear Models - Preprocessing and Feature Engineering - Neural Networks - CNNs - Deep Learning Tricks --- # Resources - [Introduction to Statistical Learning](https://faculty.marshall.usc.edu/gareth-james/ISL/) - [Applied Machine Learning](https://www.cs.columbia.edu/~amueller/comsw4995s20/schedule/) - [Basics](https://github.com/madewithml/basics) - many nicely done introductory tutorials, along with code