Figure 1: Unlike existing methods that explicitly compute and store a discrete matching field defined at low resolution, we implicitly represent a high-dimensional 4D matching field with deep fully-connected networks, defined at arbitrary resolution, including the original image resolution.
Existing pipelines for semantic correspondence commonly extract high-level semantic features to achieve invariance against intra-class variations and background clutter. This architecture, however, inevitably produces a low-resolution matching field that additionally requires an ad-hoc interpolation step as post-processing to convert it into a high-resolution one, which limits the quality of the matching results. To overcome this, inspired by the recent success of implicit neural representations, we present a novel method for semantic correspondence, called Neural Matching Field (NeMF). The complexity and high dimensionality of a 4D matching field, however, are the major hindrances. To address them, we propose a cost embedding network, consisting of convolution and self-attention layers, that processes the coarse cost volume into a cost feature representation, which then guides the subsequent fully-connected network in establishing a high-precision matching field. Although this helps to better structure the matching field, learning a high-dimensional matching field remains challenging, mainly due to computational complexity: a naïve exhaustive inference would require querying all pixels in the 4D space to infer pixel-wise correspondences. To overcome this, in the training phase we randomly sample matching candidates, and in the inference phase we propose a novel approach that iteratively performs PatchMatch-based inference and coordinate optimization at test time. With the proposed method, competitive results are attained on several standard benchmarks for semantic correspondence.
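To make the core idea concrete, the following minimal numpy sketch represents a matching field implicitly as a small fully-connected network f(x, y, u, v) → score that can be queried at arbitrary continuous 4D coordinates. All names, sizes, and the random weights are illustrative assumptions, not the paper's implementation (which additionally conditions on the encoded cost).

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny fully-connected network f(x, y, u, v) -> matching score.
# Weights are random here; in an actual model they would be trained.
W1 = rng.standard_normal((4, 64)) * 0.1
b1 = np.zeros(64)
W2 = rng.standard_normal((64, 1)) * 0.1
b2 = np.zeros(1)

def matching_score(coords):
    """coords: (N, 4) array of normalized (x, y, u, v) in [0, 1].
    Returns an (N,) array of matching scores."""
    h = np.maximum(coords @ W1 + b1, 0.0)  # ReLU hidden layer
    return (h @ W2 + b2).ravel()

# Because the field is a continuous function, it can be queried at
# arbitrary (sub-pixel) locations without storing a dense 4D volume.
queries = rng.uniform(0.0, 1.0, size=(8, 4))
scores = matching_score(queries)
```

The key contrast with a discrete cost volume is that resolution is not fixed at training time: any point of the 4D space can be evaluated directly.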

Figure 2: Given a pair of images as input, we first extract features using CNNs and compute an initial noisy cost volume at low resolution. We feed the noisy cost volume to the proposed encoder, consisting of convolution and Transformer layers, and decode with deep fully-connected networks that take the encoded cost and coordinates as inputs.
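The initial cost volume in such a pipeline is typically the correlation between source and target feature maps. A minimal sketch, with illustrative shapes and random features standing in for CNN outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, C = 16, 16, 32  # low-resolution feature map size and channels
feat_src = rng.standard_normal((H, W, C))
feat_tgt = rng.standard_normal((H, W, C))

# L2-normalize along channels so correlation equals cosine similarity.
feat_src /= np.linalg.norm(feat_src, axis=-1, keepdims=True)
feat_tgt /= np.linalg.norm(feat_tgt, axis=-1, keepdims=True)

# 4D cost volume: cost[i, j, k, l] = <feat_src[i, j], feat_tgt[k, l]>
cost = np.einsum('ijc,klc->ijkl', feat_src, feat_tgt)
```

Note that even this modest 16×16 resolution already yields a 16^4-entry volume, which is why the cost is computed coarsely and then encoded rather than stored at full resolution.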

Figure 3: Overview of neural matching field optimization: given an encoded cost, we randomly sample coordinates from a uniform distribution. The random coordinates and the ground-truth coordinate are then processed together to obtain matching scores, and the cross-entropy loss is computed as the training signal.
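The sampling-based training signal described above can be sketched as follows: uniformly sampled target candidates are concatenated with the ground-truth coordinate, and a softmax cross-entropy treats the ground truth as the positive class. The `score` function here is a hypothetical stand-in for the network (a real model would also condition on the encoded cost):

```python
import numpy as np

rng = np.random.default_rng(0)

def score(coords):
    # Stand-in for the fully-connected network's matching score,
    # peaked at the target location (0.5, 0.5) for illustration.
    return -np.sum((coords[:, 2:] - 0.5) ** 2, axis=1)

# One source keypoint (x, y) with ground-truth target location (u*, v*).
src_xy = np.array([0.3, 0.7])
gt_uv = np.array([0.5, 0.5])

# Randomly sample target candidates from a uniform distribution,
# then append the ground-truth coordinate as the positive sample.
neg_uv = rng.uniform(0.0, 1.0, size=(255, 2))
uv = np.vstack([neg_uv, gt_uv])                      # (256, 2)
coords = np.hstack([np.tile(src_xy, (256, 1)), uv])  # (256, 4)

# Cross-entropy over the candidate set, ground truth as target class.
logits = score(coords)
m = logits.max()
log_probs = logits - (m + np.log(np.exp(logits - m).sum()))
loss = -log_probs[-1]  # ground truth is the last candidate
```

Sampling a small candidate set per keypoint avoids evaluating the loss over the entire 4D coordinate space at every training step.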

Figure 4: Illustration of the proposed PatchMatch and coordinate optimization: with the learned neural matching field, the proposed PatchMatch step injects explicit smoothness and reduces the search range. The subsequent optimization strategy searches for the location that maximizes the score of the MLP.
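The refinement stage can be illustrated with a simple local random search in the spirit of PatchMatch's random-search step: starting from a coarse match (e.g. one propagated from a neighboring pixel), candidates are drawn in a shrinking radius and kept only if they raise the score. The `score` function is again a hypothetical stand-in for the trained MLP at a fixed source pixel:

```python
import numpy as np

rng = np.random.default_rng(0)

def score(uv):
    # Stand-in for the trained MLP's matching score at one source
    # pixel; peaked at (0.6, 0.4) for illustration.
    return -((uv[0] - 0.6) ** 2 + (uv[1] - 0.4) ** 2)

def refine(uv, steps=50, radius=0.1):
    """Local random search: accept a candidate only if it improves
    the score, shrinking the search radius over iterations."""
    best, best_s = uv.copy(), score(uv)
    for _ in range(steps):
        cand = best + rng.uniform(-radius, radius, size=2)
        s = score(cand)
        if s > best_s:
            best, best_s = cand, s
        radius *= 0.95
    return best

# Start from a coarse match, e.g. propagated from a neighbor.
init = np.array([0.2, 0.8])
final = refine(init)
```

This keeps inference local: only a handful of queries per pixel are made against the field instead of an exhaustive scan over all target locations.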

Figure 5: Visualization of flow maps for different numbers of iterations N: (a) source image, (b) target image. As the number of iterations at the inference phase increases from (c) to (f), NeMF with the trained MLP predicts progressively more precise matching fields through PatchMatch-based sampling and coordinate optimization.

Figure 6: Visualization of matching fields: (a) source image, with the keypoint marked as a green triangle; (b), (c) 2D contour plots of the cost by CATs and NeMF (ours), respectively; and (d), (e) 3D visualizations of the cost by CATs and NeMF, with respect to the keypoint in (a). Note that all visualizations are smoothed with a Gaussian kernel. Compared to CATs, NeMF has a higher peak near the ground truth and makes a more accurate prediction.

Table 1: Quantitative evaluation on standard benchmarks: higher PCK is better. The best results are in bold, and the second-best results are underlined. All results are taken from the respective papers. Eval. Reso.: evaluation resolution; Flow Reso.: flow resolution.

Table 2: Per-class quantitative evaluation on the SPair-71k benchmark.

Figure 7: Qualitative results on PF-PASCAL: keypoint transfer results by (a), (c) CATs and (b), (d) NeMF. Green and red lines denote correct and wrong predictions (α_img = 0.1), respectively. Note that correspondences are estimated at the original image resolutions.
Acknowledgements 