HardNet interpretation

Paper: Working hard to know your neighbor's margins: Local descriptor learning loss

Why this article is introduced: This paper (NIPS 2017) starts from the idea of hard samples and proposes a loss that is simple but effective. It reports extensive, detailed experiments on image matching, retrieval, wide baseline stereo, and more, and achieves genuine state-of-the-art results on real tasks. Code: https://github.com/DagnyT/hardnet. It can be read together with the paper covered in the previous blog post; the two complement each other. In addition, learned descriptors have so far not been popular in practice, sometimes performing even worse than traditional SIFT and its variants despite good results on some datasets; the conclusion is that the existing datasets are not large or diverse enough.

 

Abstract:

Inspired by Lowe's SIFT matching criterion, the paper introduces a loss for metric learning: maximize the distance between the closest positive and the closest negative sample in a batch. This simple method is more effective than more complex alternatives and works well for both shallow and deep networks. Combining this loss with the L2Net architecture yields a compact descriptor named HardNet. It has the same dimensionality as SIFT (128) and shows state-of-the-art performance in wide baseline stereo, patch verification, and instance retrieval benchmarks.

 

1. Introduction

Many computer vision tasks rely on finding local correspondences, such as image retrieval, panorama stitching, wide baseline stereo, and 3D reconstruction. Although more and more end-to-end methods try to replace complex classical pipelines, classical detectors and descriptors for local patches are still widely used, mainly because of their robustness, efficiency, and ease of tight integration.

LIFT, MatchNet and DeepCompare were among the first attempts at end-to-end learning, but these methods are not popular in practice despite their good performance on the patch verification task. Recent studies confirm that SIFT and its variants (RootSIFT-PCA [16], DSP-SIFT [17]) are far ahead of learned descriptors in matching, retrieval, and 3D reconstruction. [19] concluded that the local patch datasets are not large enough to support learning a descriptor of such high quality and wide applicability.

This article focuses on descriptor learning and proposes a novel network, HardNet. A large number of experiments show that the descriptor learned by this method is far superior to both hand-crafted and previously learned descriptors in real-world tasks, reaching true state-of-the-art.

 

2. Related work

Classical SIFT local feature matching consists of two parts: finding the nearest neighbor, and comparing the distance ratio between the first and second nearest neighbors to filter out false positive matches. To the authors' knowledge, this strategy had not previously been used in descriptor learning.
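To make that concrete, here is a minimal Python sketch of Lowe's ratio test (my own illustration, not code from the paper); the function and variable names and the 0.8 ratio are assumptions for the example.

```python
import numpy as np

def lowe_ratio_match(query_desc, db_desc, ratio=0.8):
    """Illustrative sketch of SIFT-style matching: keep the nearest neighbor
    only if it is clearly closer than the second nearest (ratio test)."""
    matches = []
    for i, q in enumerate(query_desc):
        # L2 distances from this query descriptor to all database descriptors
        dists = np.linalg.norm(db_desc - q, axis=1)
        nn1, nn2 = np.argsort(dists)[:2]
        # Filter false positives whose best match is ambiguous
        if dists[nn1] < ratio * dists[nn2]:
            matches.append((i, nn1))
    return matches

# toy usage with random 128-D descriptors
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 128)).astype(np.float32)
db = rng.normal(size=(50, 128)).astype(np.float32)
print(lowe_ratio_match(q, db))
```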

[20] proposed a simple filter-plus-pooling scheme learned via convex optimization to replace the hand-crafted filters and pooling in SIFT. MatchNet uses a Siamese structure with a feature network followed by a metric network, which improves matching performance but rules out fast nearest neighbor search such as kd-trees. [15] explored different structures with the same approach, and [22] explored pair-based similarity using hard negative mining and a relatively shallow architecture.

The following papers are more closely integrated with the classic SIFT matching strategy. [23] used a triplet margin loss and a triplet distance loss with randomly sampled patches, showing the superiority of triplet-based formulations over pair-based ones; however, their negative samples are chosen randomly. [7] computes the distance matrix of positive and negative samples and then applies a pairwise contrastive loss.

L2-Net uses the n matching pairs in a batch to generate n² − n negative pairs and requires that the minimum distance in each row and column of the distance matrix correspond to the true match. There is no other constraint on distances or distance ratios. Instead, they add a penalty on the correlation between descriptor dimensions and apply deep supervision to an intermediate feature map. This paper uses the L2-Net architecture as its base and shows that a simpler objective function, without those two auxiliary loss terms, can learn an even more powerful descriptor.

Figure 1 shows that this sampling strategy is relatively simple but effective.

The sampling process is analyzed below.

 

3. The proposed descriptor

3.1 Sampling and loss

The learning objective mimics the SIFT matching criterion, as shown in Figure 1. First, a batch of n matching patch pairs is generated, where A stands for anchor and P for positive; each pair is derived from the same 3D point.

Second, the 2n patches are passed through the network in Figure 2, and the n × n matrix of pairwise L2 distances is computed:

d(a_i, p_j) = sqrt(2 − 2 · a_i·p_j), for i = 1..n, j = 1..n, where a_i and p_j are the unit-length descriptors of the anchor and positive patches.
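As a rough illustration, here is a minimal PyTorch sketch of that distance matrix, assuming the descriptors are already L2-normalized (as HardNet's output is); the function and variable names are mine, not from the official repo.

```python
import torch

def pairwise_distance_matrix(anchors, positives, eps=1e-8):
    """Sketch: n x n matrix of L2 distances between unit-norm descriptors.
    For unit vectors, ||a - p||^2 = 2 - 2 * a.p, so a single matmul suffices."""
    dot = anchors @ positives.t()                      # (n, n) dot products
    return torch.sqrt(torch.clamp(2.0 - 2.0 * dot, min=eps))

# toy usage: 4 pairs of 128-D descriptors
a = torch.nn.functional.normalize(torch.randn(4, 128), dim=1)
p = torch.nn.functional.normalize(torch.randn(4, 128), dim=1)
print(pairwise_distance_matrix(a, p))   # the diagonal holds the matching distances
```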

Then, for each matching pair (a_i, p_i), we find the closest non-matching patch to the anchor and the closest non-matching patch to the positive in descriptor space (this mirrors the second-nearest-neighbor check in SIFT matching).

The goal is to make the distance between the matching descriptors smaller than the distance to the closest non-matching descriptor. The n resulting triplet distances are then fed into a triplet margin loss.
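Putting the two steps together, the following sketch shows hardest-in-batch mining plus the triplet margin loss as I understand it (margin 1.0, with the diagonal masked so a true match is never selected as its own negative); it is a simplified reading of the paper, not the authors' exact implementation.

```python
import torch

def hardnet_loss(anchors, positives, margin=1.0, eps=1e-8):
    """Sketch of the hardest-in-batch triplet margin loss.
    anchors, positives: (n, 128) L2-normalized descriptors; row i of each
    comes from the same 3D point."""
    n = anchors.size(0)
    dot = anchors @ positives.t()
    dist = torch.sqrt(torch.clamp(2.0 - 2.0 * dot, min=eps))    # (n, n)

    pos = dist.diag()                                            # d(a_i, p_i)
    # Mask the diagonal so the true match cannot be picked as a negative
    masked = dist + torch.eye(n, device=dist.device) * 10.0
    neg_for_anchor = masked.min(dim=1).values      # closest non-matching p to a_i
    neg_for_positive = masked.min(dim=0).values    # closest non-matching a to p_i
    hardest_neg = torch.min(neg_for_anchor, neg_for_positive)

    return torch.clamp(margin + pos - hardest_neg, min=0.0).mean()

# toy usage
a = torch.nn.functional.normalize(torch.randn(8, 128), dim=1)
p = torch.nn.functional.normalize(torch.randn(8, 128), dim=1)
print(hardnet_loss(a, p))
```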

The distances are already computed during triplet formation, so the overhead compared to random triplet sampling is only the computation of the distance matrix and of the per-row and per-column minima. In addition, compared to the usual triplet learning, this solution needs only a two-stream CNN rather than a three-stream one, reducing computation and memory by about 30%. Unlike L2-Net, neither deep supervision of intermediate layers nor a constraint on the correlation of descriptor dimensions is used, and no overfitting was observed.

 

3.2 Model architecture

The structure comes from L2Net, and the specific parameters and settings are presented in the paper.
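For orientation, here is a rough PyTorch sketch of an L2Net-style network as I understand it: seven conv layers with batch normalization, two of them strided, a final 8×8 convolution down to 128 dimensions, and L2 normalization of the output. The exact channel counts, padding, and dropout placement below are my assumptions and should be checked against the paper and the official repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HardNetSketch(nn.Module):
    """Rough sketch of the L2Net/HardNet architecture; details may differ
    from the official implementation."""
    def __init__(self, dropout=0.3):
        super().__init__()
        def block(cin, cout, stride=1):
            return [nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
                    nn.BatchNorm2d(cout, affine=False),
                    nn.ReLU(inplace=True)]
        self.features = nn.Sequential(
            *block(1, 32), *block(32, 32),
            *block(32, 64, stride=2), *block(64, 64),
            *block(64, 128, stride=2), *block(128, 128),
            nn.Dropout(dropout),
            nn.Conv2d(128, 128, 8, bias=False),   # 8x8 conv collapses the map to 1x1
            nn.BatchNorm2d(128, affine=False),
        )

    def forward(self, patches):                    # patches: (n, 1, 32, 32)
        x = self.features(patches).view(patches.size(0), -1)
        return F.normalize(x, p=2, dim=1)          # unit-length 128-D descriptors

print(HardNetSketch()(torch.randn(2, 1, 32, 32)).shape)   # torch.Size([2, 128])
```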

An interesting detail: after a PyTorch update the authors could not reproduce their results; through a hyper-parameter search they found that setting LR = 10 and dropout rate = 0.3 gave better results.

 

3.3 Model training

On the UBC dataset, in addition to FPR95 (the false positive rate at the point of 0.95 true positive recall), the FDR (false discovery rate) is also reported for MatchNet and TFeat. The authors did not run experiments with the CS (center-surround) structure, even though many papers have shown that it improves results; personally, I would have liked to see how much more it could add.
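For reference, a small sketch of how FPR95 can be computed from descriptor distances and ground-truth match labels; the function name and toy data are mine, not taken from the benchmark code.

```python
import numpy as np

def fpr_at_95_recall(distances, is_match):
    """Sketch: false positive rate at the distance threshold that keeps
    95% of the true matching pairs (lower is better)."""
    distances = np.asarray(distances, dtype=np.float64)
    is_match = np.asarray(is_match, dtype=bool)
    # threshold = distance below which 95% of true matches fall
    threshold = np.percentile(distances[is_match], 95)
    false_pos = np.sum((distances <= threshold) & ~is_match)
    return false_pos / max(np.sum(~is_match), 1)

# toy usage: matching pairs tend to have smaller distances
rng = np.random.default_rng(0)
d = np.concatenate([rng.uniform(0.0, 0.8, 1000), rng.uniform(0.5, 1.5, 1000)])
labels = np.concatenate([np.ones(1000, bool), np.zeros(1000, bool)])
print(fpr_at_95_recall(d, labels))
```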

The results show that HardNet is stronger than L2Net both with and without data augmentation.

The authors also explore the impact of batch size, finding that 128 is appropriate and that 256 improves results only slightly.

 

There is still much follow-up work to do; some experiments on practical real-world applications will come in a later post.
