Linear Classifiers
Francisco Carvalho and Cleber Zanchettin
UFPE - Universidade Federal de Pernambuco
CIn - Centro de Informática

Image classification
A 248 x 400 image with 3 color channels is, to the computer, an array of 297,600 numbers, each in the range 0-255. (Source: http://cs231n.github.io/)

Challenges
A good classification model must be robust to intra-class variations such as viewpoint, illumination, deformation and occlusion, while remaining sensitive to the variation between classes.

Data-driven approach
Pipeline: input - learning - evaluation.

The essence of machine learning (Learning From Data, Prof. Yaser Abu-Mostafa):
- A pattern exists.
- We cannot pin it down with a mathematical equation.
- We have data available.

Example image classification dataset: CIFAR-10
60,000 images of 32 x 32 pixels in 10 classes (50k training and 10k test).

Nearest neighbor
Classify each test image with the label of its closest training image. With the L1 distance, d_1(I_1, I_2) = Σ_p |I_1^p − I_2^p|, this reaches 38.6% accuracy on CIFAR-10.

Looking at the neighborhood: changing the distance
With the L2 (Euclidean) distance instead, accuracy is 35.4%.

Using k nearest neighbors
Instead of copying the label of the single closest image, take a majority vote over the k closest.

Regions of separation - linear
If we form a linear combination of two variables and set it equal to a number, the points of the two-dimensional space are divided into three categories:
- a) points on the line, with coordinates such that w1·x1 + w2·x2 = q;
- b) points on one side of the line, with coordinates such that w1·x1 + w2·x2 < q;
- c) points on the other side of the line, with coordinates such that w1·x1 + w2·x2 > q.

The role of the bias
Example: suppose a neural network whose activation function is the bipolar step, that is, f(u) = 1 if u ≥ θ and f(u) = -1 otherwise. Since u = x1*w1 + x2*w2 + … + xn*wn, the problem reduces to the inequality

x1*w1 + x2*w2 + … + xn*wn ≥ θ

which, when true, makes f(u) equal to 1.

To adjust such a network correctly, then, we would need to adjust not only the weights but also the threshold θ, which is not so simple, since we would need some update formula for the threshold. However, we can rearrange:

x1*w1 + x2*w2 + … + xn*wn ≥ θ
-θ + x1*w1 + x2*w2 + … + xn*wn ≥ 0
1*(-θ) + x1*w1 + x2*w2 + … + xn*wn ≥ 0

Now we can treat the first term as 1*(-θ) = x0*w0: an input fixed at 1 with a weight that can be adjusted exactly like all the other weights, which amounts to adjusting the threshold.

Example
Positions of points relative to the boundary line 2x1 + 3x2 = 6 (here q = 6):

Point (x1, x2)    2x1 + 3x2    Position
(0.0, 2.0)        6            on the line
(1.0, 1.0)        5            below
(1.0, 2.0)        8            above
(2.0, 0.0)        4            below
(2.0, 0.66)       6            on the line
(2.0, 1.0)        7            above

On the line: 2x1 + 3x2 = 6. Above: 2x1 + 3x2 > 6. Below: 2x1 + 3x2 < 6.
[Figure: the six points plotted against the boundary line 2x1 + 3x2 = 6.]

Linear models
Think of training examples as points in d-dimensional space, where each dimension corresponds to one feature. A linear binary classifier defines a plane in this space which separates positive from negative examples. (Slides: Chrupala and Stroppa, UdS, Linear models, 2010.)

Linear decision boundary
A hyperplane is the generalization of a straight line to more than 2 dimensions: it contains all the points of a d-dimensional space satisfying the equation

w1x1 + w2x2 + … + wdxd + w0 = 0

Each coefficient wi can be thought of as a weight on the corresponding feature. The vector containing all the weights, w = (w0, …, wd), is the parameter vector or weight vector.
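To make this concrete, here is a minimal sketch in Python/numpy (the slides contain no code, so the names and values are illustrative) that classifies the points of the table above against the line 2x1 + 3x2 = 6, written in weight-vector form with the bias trick from the section on the role of the bias:

```python
import numpy as np

# The boundary 2*x1 + 3*x2 = 6 rewritten as w . x = 0 by prepending the
# dummy feature x0 = 1 and folding the threshold in as w0 = -6.
w = np.array([-6.0, 2.0, 3.0])  # (w0, w1, w2)

def side(x1, x2):
    g = w @ np.array([1.0, x1, x2])
    if np.isclose(g, 0.0):
        return "on the line"
    return "above" if g > 0 else "below"

for p in [(0.0, 2.0), (1.0, 1.0), (1.0, 2.0), (2.0, 0.0), (2.0, 1.0)]:
    print(p, side(*p))  # on the line, below, above, below, above
```

The point (2.0, 0.66) is left out of the check because 0.66 only approximates the exact value 2/3, so numerically it falls slightly below the line.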
Normal vector
Geometrically, the weight vector w is a normal vector of the separating hyperplane; a normal vector of a surface is any vector perpendicular to it.

Hyperplane as a classifier
Let g(x) = w1x1 + w2x2 + … + wdxd + w0. Then

y = sign(g(x)) = +1 if g(x) ≥ 0, and −1 otherwise.

Bias
The slope of the hyperplane is determined by w1…wd; its location (intercept) is determined by the bias w0. Include the bias in the weight vector, add a dummy component to the feature vector, and set this component to x0 = 1. Then

g(x) = Σ_{i=0}^{d} wi xi = w · x

Separating hyperplanes in 2 dimensions
[Figure: a 2-D point set with three candidate separating lines, y = −1x − 0.5, y = −3x + 1 and y = 69x + 1.]

Example: the advantage of linear separability

X1    Class
0     +1
1     +1
2     -1
3     -1

These two classes can be separated by a single point (threshold) on the X1 axis.

Example

X1    Class
0     +1
1     -1
2     -1
3     +1

How can the two classes be separated with a single point? A possible solution is to increase the number of dimensions (Cover's theorem, 1965), for instance with the mapping Φ(X1) = (X1, X1²):

X1    X1²    Class
0     0      +1
1     1      -1
2     4      -1
3     9      +1

Example dataset: CIFAR-10 (recap)
10 labels; 50,000 training images, each 32x32x3; 10,000 test images. (Slides: Fei-Fei Li, Andrej Karpathy & Justin Johnson, CS231n, Lectures 2-3, Jan 2016.)

Parametric approach: linear classifier
f(x, W) maps an image x, a [32x32x3] array of numbers in 0…1 stretched into a 3072x1 column, and parameters W to 10 numbers indicating class scores. W holds 10x3072 parameters, or "weights"; together with a bias vector b (10x1), the score function is

f(x, W) = Wx + b

which yields a 10x1 vector of class scores. (A worked example in the slides uses an image with 4 pixels and 3 classes, cat/dog/ship.)

Interpreting a linear classifier
Looking at example trained weights of a linear classifier trained on CIFAR-10: each row of W can be reshaped back into a 32x32x3 image and read as a template for one class, and the class score is the dot product between that template and the input image.
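To pin down the shapes, a minimal numpy sketch of this score function (weights random and untrained, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random((3072, 1))                   # a 32x32x3 image stretched into a column
W = 0.01 * rng.standard_normal((10, 3072))  # 10x3072 weights
b = np.zeros((10, 1))                       # 10x1 biases

def f(x, W, b):
    # Linear score function: one matrix multiply plus bias gives all 10 class scores.
    return W @ x + b

scores = f(x, W, b)
print(scores.shape)  # (10, 1): one score per class
```

As with the 2-D line earlier, the bias can also be folded into W by appending a constant-1 component to x, giving a single 10x3073 weight matrix.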
Q2: what would be a very hard set of classes for a linear classifier to distinguish?

So far: we defined a (linear) score function, f(x, W) = Wx + b. Example class scores for 3 images with a random W (one column of 10 class scores per image):

Image 1: -3.45, -8.87, 0.09, 2.9, 4.48, 8.02, 3.78, 1.06, -0.36, -0.72
Image 2: -0.51, 6.04, 5.31, -4.22, -4.19, 3.58, 4.49, -4.37, -2.09, -2.93
Image 3: 3.42, 4.64, 2.65, 5.1, 2.64, 5.55, -4.34, -1.5, -4.79, 6.14

TODO:
1. Define a loss function that quantifies our unhappiness with the scores across the training data.
2. Come up with a way of efficiently finding the parameters that minimize the loss function (optimization).

Multiclass SVM loss
Suppose 3 training examples (a cat, a car and a frog image) and 3 classes. With some W the scores f(x, W) are:

             cat image    car image    frog image
cat score     3.2          1.3          2.2
car score     5.1          4.9          2.5
frog score   -1.7          2.0         -3.1

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the scores vector, the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)

For the cat image:
L_1 = max(0, 5.1 − 3.2 + 1) + max(0, −1.7 − 3.2 + 1) = max(0, 2.9) + max(0, −3.9) = 2.9 + 0 = 2.9

For the car image:
L_2 = max(0, 1.3 − 4.9 + 1) + max(0, 2.0 − 4.9 + 1) = max(0, −2.6) + max(0, −1.9) = 0 + 0 = 0

For the frog image:
L_3 = max(0, 2.2 − (−3.1) + 1) + max(0, 2.5 − (−3.1) + 1) = max(0, 6.3) + max(0, 6.6) = 6.3 + 6.6 = 12.9

The full training loss is the mean over all examples in the training data:

L = (1/N) Σ_{i=1}^{N} L_i = (2.9 + 0 + 12.9)/3 ≈ 5.27
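A short numpy check of these numbers (a sketch; variable names are mine):

```python
import numpy as np

# Columns: cat, car, frog images; rows: cat, car, frog scores (the table above).
scores = np.array([[ 3.2, 1.3,  2.2],
                   [ 5.1, 4.9,  2.5],
                   [-1.7, 2.0, -3.1]])
y = np.array([0, 1, 2])  # correct class index for each image

def svm_loss(s, yi, delta=1.0):
    # L_i = sum over j != y_i of max(0, s_j - s_{y_i} + delta)
    margins = np.maximum(0.0, s - s[yi] + delta)
    margins[yi] = 0.0  # the correct class does not contribute
    return margins.sum()

losses = [svm_loss(scores[:, i], y[i]) for i in range(3)]
print([round(L, 1) for L in losses])  # [2.9, 0.0, 12.9]
print(round(sum(losses) / 3, 2))      # 5.27, the full training loss
```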
There is a bug with the loss. Suppose that we found a W such that L = 0: is this W unique? No. Take the car image, whose loss under W was 0; with W twice as large the scores double, and the loss is still

max(0, 2.6 − 9.8 + 1) + max(0, 4.0 − 9.8 + 1) = max(0, −6.2) + max(0, −4.8) = 0 + 0 = 0

so 2W also achieves L = 0.

Weight regularization
Resolve the ambiguity by adding a regularization term to the loss:

L = (1/N) Σ_i L_i + λ R(W)

where λ is the regularization strength (a hyperparameter). In common use:
- L2 regularization: R(W) = Σ_k Σ_l W_{k,l}²
- L1 regularization: R(W) = Σ_k Σ_l |W_{k,l}|
- Elastic net (L1 + L2)
- Max norm regularization (might see later)
- Dropout (will see later)

L2 regularization, motivation: for the input x = [1, 1, 1, 1], the weight vectors w1 = [1, 0, 0, 0] and w2 = [0.25, 0.25, 0.25, 0.25] produce the same score w · x = 1, but L2 regularization prefers w2, which spreads its influence across all input dimensions.
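A tiny numeric illustration of that preference (values are mine, not from the slides):

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])

def l2_penalty(w, lam=0.1):
    # lambda * R(w), with R(w) = sum of squared weights
    return lam * np.sum(w * w)

print(w1 @ x, w2 @ x)                  # 1.0 1.0 -> identical scores, identical data loss
print(l2_penalty(w1), l2_penalty(w2))  # 0.1 vs 0.025 -> L2 prefers the diffuse w2
```

The same mechanism removes the W-versus-2W ambiguity above: doubling W leaves a zero data loss unchanged but quadruples the L2 penalty.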
Softmax classifier (multinomial logistic regression)
Here the scores s = f(x_i; W) are interpreted as unnormalized log probabilities of the classes:

P(Y = k | X = x_i) = e^{s_k} / Σ_j e^{s_j}

where the right-hand side is the softmax function. We want to maximize the log likelihood or, for a loss function, to minimize the negative log likelihood of the correct class:

L_i = −log P(Y = y_i | X = x_i)

In summary:

L_i = −log( e^{s_{y_i}} / Σ_j e^{s_j} )

Worked example for the cat image (scores for cat, car, frog):

unnormalized log probabilities:     3.2     5.1    −1.7
exp -> unnormalized probabilities:  24.5   164.0    0.18
normalize -> probabilities:         0.13    0.87    0.00

L_i = −log(0.13) = 0.89 (with the base-10 log, as on the slide; with the natural log, −ln(0.13) ≈ 2.04).
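The same computation in numpy (a sketch; the max-shift is the usual numerical-stability trick, not something the slide shows):

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])  # cat, car, frog scores for the cat image

# Softmax: exponentiate to get unnormalized probabilities, then normalize.
# Shifting by the max first avoids overflow and leaves the result unchanged.
exp_scores = np.exp(scores - scores.max())
probs = exp_scores / exp_scores.sum()
print(probs.round(2))       # [0.13 0.87 0.  ]

# Negative log likelihood of the correct class (cat):
print(-np.log(probs[0]))    # ~2.04 with the natural log
print(-np.log10(probs[0]))  # ~0.89 with the base-10 log, matching the slide
```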
Interactive web demo: http://vision.stanford.edu/teaching/cs231n/linear-classify-demo/

Optimization
Recap:
- We have some dataset of (x, y).
- We have a score function: s = f(x; W) = Wx.
- We have a loss function, e.g. softmax or SVM, plus regularization: full loss L = (1/N) Σ_i L_i + λ R(W).

Strategy #1: a first, very bad idea: random search
Sample random weight matrices and keep whichever gives the lowest loss. Let's see how well this works on the test set... 15.5% accuracy! Not bad! (SOTA is ~95%.)

Strategy #2: follow the slope
In one dimension, the derivative of a function is

df(x)/dx = lim_{h→0} [f(x + h) − f(x)] / h

In multiple dimensions, the gradient is the vector of partial derivatives.

Evaluating the gradient numerically, one dimension at a time. With the current W = [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …] the loss is 1.25347. Then:
- W + h in the first dimension, [0.34 + 0.0001, -1.11, 0.78, …], gives loss 1.25322, so dW[0] ≈ (1.25322 − 1.25347)/0.0001 = −2.5.
- W + h in the second dimension, [0.34, -1.11 + 0.0001, 0.78, …], gives loss 1.25353, so dW[1] ≈ (1.25353 − 1.25347)/0.0001 = 0.6.
- W + h in the third dimension, [0.34, -1.11, 0.78 + 0.0001, …], gives loss 1.25347, so dW[2] ≈ (1.25347 − 1.25347)/0.0001 = 0.

Gradient so far: dW = [-2.5, 0.6, 0, ?, ?, ?, ?, ?, ?, …], and so on for every dimension. Evaluating the gradient numerically is approximate and very slow.
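The loop above, as a minimal sketch (the toy loss is mine; in practice loss_fn would be the full SVM/softmax loss over the training data):

```python
import numpy as np

def numerical_gradient(loss_fn, W, h=0.0001):
    # Forward-difference approximation, one dimension at a time:
    # dW[i] ~= (loss(W + h*e_i) - loss(W)) / h
    grad = np.zeros_like(W)
    base = loss_fn(W)
    for i in range(W.size):
        W_h = W.copy()
        W_h.flat[i] += h
        grad.flat[i] = (loss_fn(W_h) - base) / h
    return grad

toy_loss = lambda W: np.sum(W ** 2)     # toy quadratic loss, true gradient 2W
W = np.array([0.34, -1.11, 0.78])
print(numerical_gradient(toy_loss, W))  # ~[0.68, -2.22, 1.56]
```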
This is silly. The loss is just a function of W, and we want its gradient ∇_W L. Instead of perturbing one dimension at a time, use calculus to write down an expression for the gradient directly:

dW = ... (some function of the data and W)

so that from the current W = [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …] with loss 1.25347 we obtain the whole gradient dW = [-2.5, 0.6, 0, 0.2, 0.7, -0.5, 1.1, 1.3, -2.1, …] in one evaluation.

In summary:
- Numerical gradient: approximate, slow, easy to write.
- Analytic gradient: exact, fast, error-prone.

In practice: always use the analytic gradient, but check the implementation with the numerical gradient. This is called a gradient check.

Gradient descent
Repeatedly step from the current W in the negative gradient direction.
[Figure: loss contours over (W_1, W_2), with steps from the original W along the negative gradient direction.]
See also: http://sebastianruder.com/optimizing-gradient-descent/
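A minimal vanilla gradient descent loop, again as a sketch (the step size and step count are illustrative hyperparameters):

```python
import numpy as np

def gradient_descent(W, evaluate_gradient, step_size=0.01, num_steps=500):
    # Repeatedly step in the negative gradient direction.
    for _ in range(num_steps):
        dW = evaluate_gradient(W)  # analytic gradient of the full loss
        W = W - step_size * dW
    return W

# Usage on the toy quadratic loss L(W) = sum(W^2), whose analytic gradient is 2W:
W = gradient_descent(np.array([0.34, -1.11, 0.78]), lambda W: 2 * W)
print(W)  # very close to [0, 0, 0], the minimum
```

A gradient check here would compare `2 * W` against `numerical_gradient(toy_loss, W)` from the previous sketch and confirm the two agree to within the finite-difference error.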