Baixe o app para aproveitar ainda mais
Esta é uma pré-visualização de arquivo. Entre para ver o arquivo original
Camera Calibration and Epipolar Geometry CS5330, Huaizu Jiang Fall 2022, Northeastern University Announcement Grades of PA2 Grades of PA2 Reach out to us if you believe your grade is incorrect Piazza is preferred Email is also fine Using Teams, comments on Canvas may result in significant delay Recap Batch Normalization Learnable scale and shift parameters: Output, Shape is N x D Learning , will recover the identity function (in expectation) Input: Per-channel mean, shape is D Normalized x, Shape is N x D Per-channel std, shape is D Slide credit: D Fouhey & J Johnson Don’t/Can’t treat them as constants in PA4 (how about the gradients?) Derivation: https://www.adityaagrawal.net/blog/deep_learning/bprop_batch_norm Regularization: Dropout “randomly set some neurons to zero in the forward pass” [Srivastava et al., 2014] Slide credit: E. Learned-Miller, originally developed by Andrej Karpathy, Justin Johnson, Fei-Fei Li Randomly set the output value of some neurons to be 0 Oct. 5, 2021 VGG: Deeper Networks, Regular Design 3x3 conv, 128 Pool 3x3 conv, 64 3x3 conv, 64 Input 3x3 conv, 128 Pool 3x3 conv, 256 3x3 conv, 256 Pool 3x3 conv, 512 3x3 conv, 512 Pool 3x3 conv, 512 3x3 conv, 512 Pool FC 4096 FC 1000 Softmax FC 4096 3x3 conv, 512 3x3 conv, 512 3x3 conv, 384 Pool 5x5 conv, 256 11x11 conv, 96 Input Pool 3x3 conv, 384 3x3 conv, 256 Pool FC 4096 FC 4096 Softmax FC 1000 Pool Input Pool Pool Pool Pool Softmax 3x3 conv, 512 3x3 conv, 512 3x3 conv, 256 3x3 conv, 256 3x3 conv, 128 3x3 conv, 128 3x3 conv, 64 3x3 conv, 64 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 FC 4096 FC 1000 FC 4096 AlexNet VGG16 VGG19 VGG Design rules: All conv are 3x3 stride 1 pad 1 All max pool are 2x2 stride 2 After pool, double #channels Simonyan and Zissermann, “Very Deep Convolutional Networks for Large-Scale Image Recognition”, ICLR 2015 Slide credit: D Fouhey & J Johnson Residual Networks Once we have Batch Normalization, we can train networks with 10+ layers. What happens as we go deeper? Training error Iterations 56-layer 20-layer Iterations 56-layer 20-layer Test error In fact the deep model seems to be underfitting since it also performs worse than the shallow model on the training set! It is actually underfitting He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016 Slide credit: D Fouhey & J Johnson Residual Networks Input Softmax 3x3 conv, 64 7x7 conv, 64, / 2 FC 1000 Pool 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 128 3x3 conv, 128, / 2 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 ... 3x3 conv, 512 3x3 conv, 512, /2 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 Pool relu Residual block 3x3 conv 3x3 conv F(x) + x F(x) relu X A residual network is a stack of many residual blocks Regular design, like VGG: each residual block has two 3x3 conv Network is divided into stages: the first block of each stage halves the resolution (with stride-2 conv) and doubles the number of channels He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016 Adapted from: D Fouhey & J Johnson relu Residual block 3x3 conv, stride 2 3x3 conv F(x) + x F(x) relu X 1x1 conv, stride 2 Today’s Class Training of CNNs Camera calibration Group-based Convolution Input: Cin x H x W Hyperparameters: Kernel size: KH x KW Number filters: Cout Padding: P Stride: S Groups: G Weight matrix: Cout /G x Cin /G x KH x KW x G Bias vector: Cout /G FLOPS: Cout /G x Cin /G x KH x KW x G x H x W MobileNet Channel: C, Groups: C Pointwise Conv 9C2HW 9CHW C2HW Computation reduction: 9C2HW/(9CHW + C2HW) = 9C/(9+C) [Howard et al., MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017] ShuffleNet [Zhang et al., ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. CVPR 2018] ShuffleNet Units [Zhang et al., ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. CVPR 2018] Readings AlexNet: https://papers.nips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf VGG: https://arxiv.org/abs/1409.1556 ResNet: https://arxiv.org/abs/1512.03385 MobileNets:https://arxiv.org/abs/1704.04861 ShuffleNet: https://arxiv.org/abs/1707.01083 Training Convolutional Networks Download big datasets Design CNN architecture Initialize Weights For t = 1 to T: Form minibatch Compute loss + gradient Update Weights Apply trained model to task Slide credit: D Fouhey & J Johnson Training Convolutional Networks Download big datasets Design CNN architecture Initialize Weights For t = 1 to T: Form minibatch Compute loss + gradient Update Weights Apply trained model to task Slide credit: D Fouhey & J Johnson Weight Initialization: Activation Statistics Forward pass for a 6-layer net with hidden size 4096 Slide credit: D Fouhey & J Johnson Weight Initialization: Activation Statistics Forward pass for a 6-layer net with hidden size 4096 All activations tend to zero for deeper network layers Q: What do the gradients dL/dW look like? Slide credit: D Fouhey & J Johnson All activations tend to zero for deeper network layers Q: What do the gradients dL/dW look like? A: All zero, no learning =( Weight Initialization: Activation Statistics Forward pass for a 6-layer net with hidden size 4096 Weights are too small at initialization! Slide credit: D Fouhey & J Johnson Weight Initialization: Activation Statistics Increase scale of weights at initialization 0.01 -> 0.05 Slide credit: D Fouhey & J Johnson Weight Initialization: Activation Statistics All activations saturate Q: What do the gradients look like? Weights are too big at initialization! Increase scale of weights at initialization 0.01 -> 0.05 Slide credit: D Fouhey & J Johnson All activations saturate Q: What do the gradients look like? A: Local gradients all zero, no learning =( Weight Initialization: Activation Statistics Increase scale of weights at initialization 0.01 -> 0.05 Weights are too big at initialization! Slide credit: D Fouhey & J Johnson Weight Initialization: Xavier Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010 “Xavier” initialization: std = 1 / sqrt(Din) Slide credit: D Fouhey & J Johnson Weight Initialization: Xavier Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010 “Xavier” initialization: std = 1 / sqrt(Din) Weights are just right at initialization! Slide credit: D Fouhey & J Johnson For conv layers, Din is kernel_size2 * input_channels Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010 Weight Initialization: Xavier “Xavier” initialization: std = 1 / sqrt(Din) Weights are just right at initialization! Slide credit: D Fouhey & J Johnson Weight Initialization: MSRA ”Just right” – activations nicely scaled for all layers He et al, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, ICCV 2015 For ReLU networks: std = sqrt(2 / Din) Weights are just right at initialization! Slide credit: D Fouhey & J Johnson Training Convolutional Networks Download big datasets Design CNN architecture Initialize Weights For t = 1 to T: Form minibatch Compute loss + gradient Update Weights Apply trained model to task Slide credit: D Fouhey & J Johnson Training Convolutional Networks Download big datasets Design CNN architecture Initialize Weights For t = 1 to T: Form minibatch Compute loss + gradient Update Weights Apply trained model to task If the model is big, won’t we overfit? Slide credit: D Fouhey & J Johnson Regularizing CNNs: Weight Decay Add L2 regularization term to the loss penalizing large weight matrices Usually don’t regularize bias terms, or BatchNorm scale / shift params *Technical note: Adding an explicit term to the loss is “L2 Regularization”; “Weight decay” adds a term to the gradient. They are equivalent for SGD, but not quite the same for other optimizers like Adam Slide credit: D Fouhey & J Johnson Regularizing CNNs: Data Augmentation Horizontal Flip Color Jitter Image Cropping Hippo Hippo? Hippo? Hippo? Slide credit: D Fouhey & J Johnson Regularizing CNNs: Data Augmentation Apply random transformations to input images during training Artificially “inflate” the size of your dataset Training Image Augmented images Hippo Hippo Slide credit: D Fouhey & J Johnson Training Convolutional Networks Download big datasets Design CNN architecture Initialize Weights For t = 1 to T: Form minibatch Compute loss + gradient Update Weights Apply trained model to task If the model is big, won’t we overfit? Slide credit: D Fouhey & J Johnson Training Convolutional Networks Download big datasets Design CNN architecture Initialize Weights For t = 1 to T: Form minibatch Compute loss + gradient Update Weights Apply trained model to task What if we can’t find one? Slide credit: D Fouhey & J Johnson Transfer Learning: Feature Extraction Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014 Image Conv-64 Conv-64 MaxPool Conv-128 Conv-128 MaxPool Conv-256 Conv-256 MaxPool Conv-512 Conv-512 MaxPool Conv-512 Conv-512 MaxPool FC-4096 FC-4096 FC-1000 1. Train on ImageNet Slide credit: D Fouhey & J Johnson Transfer Learning: Feature Extraction Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014 Use your small dataset to train a linear classifier on top of pretrained CNN features Image Conv-64 Conv-64 MaxPool Conv-128 Conv-128 MaxPool Conv-256 Conv-256 MaxPool Conv-512 Conv-512 MaxPool Conv-512 Conv-512 MaxPool FC-4096 FC-4096 FC-1000 Image Conv-64 Conv-64 MaxPool Conv-128 Conv-128 MaxPool Conv-256 Conv-256 MaxPool Conv-512 Conv-512 MaxPool Conv-512 Conv-512 MaxPool FC-4096 FC-4096 Freeze these Remove last layer 1. Train on ImageNet 2. CNN as feature extractor Slide credit: D Fouhey & J Johnson Transfer Learning: Fine-Tuning Reinitialize last layer and continue training whole network on your dataset Image Conv-64 Conv-64 MaxPool Conv-128 Conv-128 MaxPool Conv-256 Conv-256 MaxPool Conv-512 Conv-512 MaxPool Conv-512 Conv-512 MaxPool FC-4096 FC-4096 FC-1000 Image Conv-64 Conv-64 MaxPool Conv-128 Conv-128 MaxPool Conv-256 Conv-256 MaxPool Conv-512 Conv-512 MaxPool Conv-512 Conv-512 MaxPool FC-4096 FC-4096 Image Conv-64 Conv-64 MaxPool Conv-128 Conv-128 MaxPool Conv-256 Conv-256 MaxPool Conv-512 Conv-512 MaxPool Conv-512 Conv-512 MaxPool FC-4096 FC-4096 Freeze these Remove last layer FC 1. Train on ImageNet 2. CNN as feature extractor 3. Bigger dataset: Fine-Tuning Slide credit: D Fouhey & J Johnson Transfer Learning: Fine-Tuning Some tricks: Train with feature extraction first before fine-tuning Lower the learning rate: use ~1/10 of LR used in original training Sometimes freeze lower layers to save computation Reinitialize last layer and continue training whole network on your dataset Image Conv-64 Conv-64 MaxPool Conv-128 Conv-128 MaxPool Conv-256 Conv-256 MaxPool Conv-512 Conv-512 MaxPool Conv-512 Conv-512 MaxPool FC-4096 FC-4096 FC-1000 Image Conv-64 Conv-64 MaxPool Conv-128 Conv-128 MaxPool Conv-256 Conv-256 MaxPool Conv-512 Conv-512 MaxPool Conv-512 Conv-512 MaxPool FC-4096 FC-4096 Image Conv-64 Conv-64 MaxPool Conv-128 Conv-128 MaxPool Conv-256 Conv-256 MaxPool Conv-512 Conv-512 MaxPool Conv-512 Conv-512 MaxPool FC-4096 FC-4096 Freeze these Remove last layer FC 1. Train on ImageNet 2. CNN as feature extractor 3. Bigger dataset: Fine-Tuning Slide credit: D Fouhey & J Johnson Convolution Layers Pooling Layers Fully-Connected Layers Activation Function Normalization Recap: Convolutional Networks Slide credit: D Fouhey & J Johnson Recap: CNN Architectures Lin et al Sanchez & Perronnin Krizhevsky et al (AlexNet) Zeiler & Fergus Simonyan & Zisserman (VGG) Szegedy et al (GoogLeNet) He et al (ResNet) Russakovsky et al Shao et al Hu et al (SENet) Shallow 8 Layer 8 Layer 19 Layer 22 Layer 152 Layer 152 Layer 152 Layer AlexNet VGG ResNet Slide credit: D Fouhey & J Johnson 2010 2011 2012 2013 2014 2014 2015 2016 2017 Human 28.2 25.8 16.399999999999999 11.7 7.3 6.7 3.6 3 2.2999999999999998 5.0999999999999996 Error Rate Recap: Training CNNs Download big datasets Design CNN architecture Initialize Weights For t = 1 to T: Form minibatch Compute loss + gradient Update Weights Apply trained model to task Transfer Learning Xavier / MSRA Init Regularization + Data Augmentation Slide credit: D Fouhey & J Johnson Perspective & 3D Geometry Our goal: Recovery of 3D structure J. Vermeer, Music Lesson, 1662 A. Criminisi, M. Kemp, and A. Zisserman,Bringing Pictorial Space to Life: computer techniques for the analysis of paintings, Proc. Computers and the History of Art, 2002 Slide credit: S Lazebnik 44 Single-view ambiguity x X? X? X? Slide credit: S Lazebnik 45 Things aren’t always as they appear… http://en.wikipedia.org/wiki/Ames_room Slide credit: S Lazebnik 46 Single-view ambiguity Slide credit: S Lazebnik 47 Single-view ambiguity Rashad Alakbarov shadow sculptures Slide credit: S Lazebnik Anamorphic perspective Image source Slide credit: S Lazebnik Resolving Single-view Ambiguity Shoot light (lasers etc.) out of your eyes! Con: not so biologically plausible, dangerous? Slide credit: D Fouhey & J Johnson Resolving Single-view Ambiguity Shoot light (lasers etc.) out of your eyes! Con: not so biologically plausible, dangerous? Slide credit: D Fouhey & J Johnson Resolving Single-view Ambiguity X Stereo: given 2 calibrated cameras in different views and correspondences, can solve for X x x Original diagram credit: S. Lazebnik Slide credit: D Fouhey & J Johnson http://www.well.com/~jimg/stereo/stereo_list.html Slide credit: J. Hays 53 53 CS 376 Lecture 16: Stereo 1 http://www.well.com/~jimg/stereo/stereo_list.html Slide credit: J. Hays 54 54 CS 376 Lecture 16: Stereo 1 Autostereograms Exploit disparity as depth cue using single image. (Single image random dot stereogram, Single image stereogram) Slide credit: J. Hays, Images from magiceye.com 55 55 CS 376 Lecture 16: Stereo 1 Autostereograms Slide credit: J. Hays, Images from magiceye.com 56 56 CS 376 Lecture 16: Stereo 1 Multi-view geometry problems Camera 1 K ? Slide credit: Noah Snavely Calibration: Figure out intrinsics of camera (K). We need camera intrinsics / K in order to figure out where the rays are Structure from motion solves the following problem: Given a set of images of a static scene with 2D points in correspondence, shown here as color-coded points, find… a set of 3D points P and a rotation R and position t of the cameras that explain the observed correspondences. In other words, when we project a point into any of the cameras, the reprojection error between the projected and observed 2D points is low. This problem can be formulated as an optimization problem where we want to find the rotations R, positions t, and 3D point locations P that minimize sum of squared reprojection errors f. This is a non-linear least squares problem and can be solved with algorithms such as Levenberg-Marquart. However, because the problem is non-linear, it can be susceptible to local minima. Therefore, it’s important to initialize the parameters of the system carefully. In addition, we need to be able to deal with erroneous correspondences. Multi-view geometry problems Camera 3 R3,t3 ? Camera 1 Camera 2 R1,t1 R2,t2 Recovering structure: Given cameras and correspondences, find 3D. Slide credit: Noah Snavely Structure from motion solves the following problem: Given a set of images of a static scene with 2D points in correspondence, shown here as color-coded points, find… a set of 3D points P and a rotation R and position t of the cameras that explain the observed correspondences. In other words, when we project a point into any of the cameras, the reprojection error between the projected and observed 2D points is low. This problem can be formulated as an optimization problem where we want to find the rotations R, positions t, and 3D point locations P that minimize sum of squared reprojection errors f. This is a non-linear least squares problem and can be solved with algorithms such as Levenberg-Marquart. However, because the problem is non-linear, it can be susceptible to local minima. Therefore, it’s important to initialize the parameters of the system carefully. In addition, we need to be able to deal with erroneous correspondences. Multi-view geometry problems Camera 3 R3,t3 Camera 1 Camera 2 R1,t1 R2,t2 Stereo/Epipolar Geomery: Given 2 cameras and find where a point could be Slide credit: Noah Snavely Structure from motion solves the following problem: Given a set of images of a static scene with 2D points in correspondence, shown here as color-coded points, find… a set of 3D points P and a rotation R and position t of the cameras that explain the observed correspondences. In other words, when we project a point into any of the cameras, the reprojection error between the projected and observed 2D points is low. This problem can be formulated as an optimization problem where we want to find the rotations R, positions t, and 3D point locations P that minimize sum of squared reprojection errors f. This is a non-linear least squares problem and can be solved with algorithms such as Levenberg-Marquart. However, because the problem is non-linear, it can be susceptible to local minima. Therefore, it’s important to initialize the parameters of the system carefully. In addition, we need to be able to deal with erroneous correspondences. Multi-view geometry problems Camera 1 Camera 2 Camera 3 R1,t1 R2,t2 R3,t3 ? ? ? Motion: Figure out R, t for a set of cameras given correspondences Slide credit: Noah Snavely Structure from motion solves the following problem: Given a set of images of a static scene with 2D points in correspondence, shown here as color-coded points, find… a set of 3D points P and a rotation R and position t of the cameras that explain the observed correspondences. In other words, when we project a point into any of the cameras, the reprojection error between the projected and observed 2D points is low. This problem can be formulated as an optimization problem where we want to find the rotations R, positions t, and 3D point locations P that minimize sum of squared reprojection errors f. This is a non-linear least squares problem and can be solved with algorithms such as Levenberg-Marquart. However, because the problem is non-linear, it can be susceptible to local minima. Therefore, it’s important to initialize the parameters of the system carefully. In addition, we need to be able to deal with erroneous correspondences. Outline (Today) Calibration: Getting intrinsic matrix/K (Today, next week) Epipolar geometry: 2 pictures -> depth map (Next week) Stereo Rectified 2 pictures -> depth map Structure from motion (SfM): 2+ pictures -> cameras, pointcloud Optical flow (we may have a guest lecture here) Finding correspondences in 2 pictures (similar to stereo) Slide inspired by: D Fouhey & J Johnson Our goal: Recovery of 3D structure When certain assumptions hold, we can recover structure from a single view In general, we need multi-view geometry Image source But first, we need to understand the geometry of a single camera… Slide credit: S Lazebnik 62 Projection matrix Optical center Principal point Image plane center R,t Ow iw kw jw x: Image Coordinates: (u,v,1) K: Intrinsic Matrix (3x3) R: Rotation (3x3) t: Translation (3x1) X: World Coordinates: (X,Y,Z,1) Typical Perspective Model focal length principal point (image coords of camera origin on retina) Just moves camera origin 2D Projection of X rotation translation 3D point Slide credit: D Fouhey & J Johnson Notes on the Intrinsic Matrix . : how many pixels per unit in the x direction on the sensor : skew factor. Exists if the x and y coordinates are not perpendicular. : offset from the image plane center Radial Distortion in Wide-angle Lenses Slide adapted from James Hays Reading: Section 2.1.5, Computer Vision: Algorithms and Applications, 2nd Edition Camera Calibration Pairs of [X,Y,Z] and [u,v] → eqns to constrain M How do I get [X,Y,Z], [u,v]? Slide credit: D Fouhey & J Johnson Camera Calibration Targets Using a tape measure 312.747 309.140 30.086 305.796 311.649 30.356 307.694 312.358 30.418 310.149 307.186 29.298 311.937 310.105 29.216 311.202 307.572 30.682 307.106 306.876 28.660 309.317 312.490 30.230 307.435 310.151 29.318 308.253 306.300 28.881 306.650 309.301 28.905 308.069 306.831 29.189 309.671 308.834 29.029 308.255 309.955 29.267 307.546 308.613 28.963 311.036 309.206 28.913 307.518 308.175 29.069 309.950 311.262 29.990 312.160 310.772 29.080 311.988 312.709 30.514 880 214 43 203 270 197 886 347 745 302 943 128 476 590 419 214 317 335 783 521 235 427 665 429 655 362 427 333 412 415 746 351 434 415 525 234 716 308 602 187 Known 3d locations Known 2d image coords Image credit: J. Hays Slide credit: D Fouhey & J Johnson Camera Calibration Targets … A set of views of a plane (not covered here) Slide credit: D Fouhey & J Johnson Camera Calibration Targets A single, huge plane. What’s this for? Slide credit: D Fouhey & J Johnson Camera calibration Given n points with known 3D coordinates Xi and known image projections pi, estimate the camera parameters Xi pi Slide credit: S. Lazebnik 71 Camera Calibration: Linear Method Remember (from geometry): this implies MXi & pi are proportional/scaled copies of each other Remember (from homography fitting): this implies their cross product is 0 Slide credit: D Fouhey & J Johnson Camera Calibration: Linear Method …Some tedious math occurs… (see Homography deriviation) Slide credit: D Fouhey & J Johnson Camera Calibration: Linear Method How many linearly independent equations? 2 How many equations per [u,v] + [X,Y,Z] pair? 2 If M is 3x4, how many degrees of freedom? 11 Slide credit: D Fouhey & J Johnson Camera Calibration: Linear Method Derivation from L. Lazebnik; note we negate one of the equations from the cross product How do we solve problems of the form ? Eigenvector of ATA with smallest eigenvalue Slide credit: D Fouhey & J Johnson Camera calibration: Linear method Note: for coplanar points that satisfy ΠTX=0, we will get degenerate solutions (Π,0,0), (0,Π,0), or (0,0,Π) Slide credit: S. Lazebnik 76 Camera calibration: Linear vs. nonlinear Linear calibration is easy to formulate and solve, but it doesn’t directly tell us the camera parameters In practice, non-linear methods are preferred Write down objective function in terms of intrinsic and extrinsic parameters Define error as sum of squared distances between measured 2D points and estimated projections of 3D points Minimize error using Newton’s method or other non-linear optimization Can model radial distortion and impose constraints such as known focal length and orthogonality vs. Slide credit: S. Lazebnik 77 x h s x h s [ ] X t R K x = ú ú ú ú û ù ê ê ê ê ë é ú ú ú û ù ê ê ê ë é = ú ú ú û ù ê ê ê ë é 1 * * * * * * * * * * * * Z Y X y x l l l [ ] X t R K x =
Compartilhar