Siamese Neural Networks
- useful in the case of few-shot learning.
- can be built from any kind of neural network: the difference arises in how we engineer the loss function
- specifically, we use a triplet loss (specializing here to computer vision applications)
1. Triplet Loss Function
Starting out with three images:
- A : anchor
- P : positive
- N : negative
.. such that A and P are two different pictures of the same person while N is a picture of another person.
Given a model f that produces an embedding from an input image, the triplet loss is defined as:
;; f is the embedding model; its body is architecture specific and left as a stub
(defun f (input-image)
  ;; returns the embedding vector for INPUT-IMAGE
  (error "model specific: replace with a forward pass through the network"))

;; squared euclidean distance between two embedding vectors
(defun L2-norm-squared (a b)
  (reduce #'+ (map 'vector (lambda (x y) (expt (- x y) 2)) a b)))

;; the margin hyperparameter discussed below
(defparameter alpha 0.2)

;; A, P and N are images as defined above
(defun triplet-loss (A P N)
  (max (+ (- (L2-norm-squared (f A) (f P))
             (L2-norm-squared (f A) (f N)))
          alpha)
       0))
Intuitively, this loss pushes the embeddings of A and P closer together while pushing the embedding of N further from A's. alpha is a positive margin hyperparameter: the loss only reaches zero once the A-N distance exceeds the A-P distance by at least alpha. For example, with d(A,P) = 0.5, d(A,N) = 1.0 and alpha = 0.2, the loss is max(0.5 - 1.0 + 0.2, 0) = 0. Increasing alpha demands a larger separation between the two terms; a lower alpha makes the model more lenient.
The overall cost function is the average of all such triplet losses.
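A minimal sketch of that averaging, assuming the dataset's triplets are collected as a list of (A P N) lists (the name total-cost is illustrative, not from any source):

;; average triplet loss over a dataset of (A P N) triplets
(defun total-cost (triplets)
  (/ (reduce #'+ (mapcar (lambda (triplet) (apply #'triplet-loss triplet))
                         triplets))
     (length triplets)))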
After running forward passes and backpropagation over all such triplets in a dataset (note that one shot doesn't imply training on only one sample, but inference with respect to a single instance), we have a Siamese neural network.
To use it, given two pictures P1 and P2, we compute their embeddings with a forward pass; if the euclidean distance between the embeddings is less than a threshold (a hyperparameter), we mark the pair as being of the same person.
It's called one shot because you need only one picture per identity for a specific comparison (the training phase still needs a larger dataset, though).
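A minimal sketch of this verification step, reusing f and L2-norm-squared from above. The threshold value is an arbitrary placeholder, and since we compare the squared distance, the threshold lives on the squared scale:

;; the threshold is a hyperparameter; 1.0 is a placeholder value
(defparameter same-person-threshold 1.0)

;; T if P1 and P2 are judged to be pictures of the same person
(defun same-person-p (P1 P2)
  (< (L2-norm-squared (f P1) (f P2)) same-person-threshold))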
1.1. Triplet selection
- most triplets from a dataset wouldn't contribute much toward the objective: easy triplets already satisfy the margin, so their loss (and gradient) is zero.
- the triplet selection strategy then becomes important, for faster convergence:
- it is beneficial to select hard positives (the P furthest from A among A's positives) and hard negatives (the N closest to A among A's negatives).
- it isn't feasible to compute these maxes and mins over the entire dataset.
- so, one approach is to generate triplets offline (pausing the training loop) every n steps, using the most recent network checkpoint to compute the argmins and argmaxes on a subset of the data
- another, online approach (part of the training loop) is to mine hard positives and negatives from within the current minibatch (a sketch follows this list)
- do note that the minibatch here needs to be large enough for enough hard positives to be formed (FaceNet used 1000+ faces in a minibatch)
- read more about this process in section 3.2 of the FaceNet paper: https://arxiv.org/pdf/1503.03832.pdf
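A minimal sketch of the in-batch mining step, under assumed conventions: embeddings for the current minibatch are precomputed and stored as (embedding . identity) pairs, and we pick the hardest negative for a given anchor. (The paper actually prefers semi-hard negatives over the absolute hardest ones, since the latter can collapse training early on; this sketch only shows the basic selection.)

;; pick the hardest negative for an anchor: the closest embedding in the
;; minibatch that belongs to a different identity
(defun hardest-negative (anchor-embedding anchor-identity batch)
  (let ((best nil)
        (best-distance nil))
    (dolist (entry batch best)
      (destructuring-bind (embedding . identity) entry
        (when (not (equal identity anchor-identity))
          (let ((d (L2-norm-squared anchor-embedding embedding)))
            (when (or (null best-distance) (< d best-distance))
              (setq best embedding
                    best-distance d))))))))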