Add pair mining

Alexandre Foucher 2025-04-04 17:46:42 +02:00
parent a5092fdee6
commit 91a9809e83
2 changed files with 374 additions and 28 deletions

main.txt Executable file
\documentclass[conference]{IEEEtran}
\IEEEoverridecommandlockouts
\usepackage{color,soul}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage[utf8]{inputenc}
\usepackage{placeins}
\usepackage{stfloats}
\usepackage[caption=false]{subfig}
\usepackage[style=ieee]{biblatex}
\usepackage{tabularray}
\usepackage{array,tabularx}
\usepackage{tabularray}
\UseTblrLibrary{siunitx}
\addbibresource{refs.bib}
\begin{document}
\title{Deep Visual Geo-localization in Maritime Coastal Environment
\thanks{This study is being conducted with the funding support of Agence de l'Innovation de Défense - Ministère des Armées - France.}
}
\author{
\IEEEauthorblockN{Alexandre Foucher, Cédric Seguin, Dominique Heller, Johann Laurent}
\textit{Lab-STICC, CNRS UMR 6285, Université Bretagne-Sud} \\
Lorient, France \\
\{firstname\}.\{lastname\}@univ-ubs.fr
}
\maketitle
\begin{abstract}
In the marine environment, geolocation is crucial to the autonomy of unmanned surface vehicles (USVs). Nowadays it relies mainly on the Global Navigation Satellite System (GNSS), which has become indispensable. However, this system is vulnerable to jamming and spoofing, and must be complemented to improve overall resilience. Visual Geo-Localization (VG) is a major challenge in maritime environments, and it becomes necessary when the context does not allow the use of active sensors. This study investigates a real-time visual localization framework based on deep learning and designed for coastal navigation using a limited-field-of-view camera. Traditional visual localization methods show a decrease in accuracy when the field of view is restricted, a common constraint for thermal imaging and other specialized sensors. To address this, we propose an artificial-intelligence-driven approach that relies on horizon-based correlation techniques to estimate USV localization. Our results highlight the feasibility of GNSS-free navigation for USVs, opening up new possibilities for robust and autonomous maritime operations.
\end{abstract}
\begin{IEEEkeywords}
Visual Geolocalization (VG), Unmanned Surface Vehicle (USV), K Nearest Neighbor (KNN), Deep Learning.
\end{IEEEkeywords}
\section{Introduction}
% Problem Statement
The increasing capabilities of Artificial Intelligence (AI) and embedded systems have enabled Unmanned Surface Vehicles (USVs) to integrate complete real-time automation chains. However, despite these advancements, USVs still rely on remote supervisors for monitoring operations and intervening in the event of malfunctions. To enhance autonomy, it is crucial to improve system resilience by reducing human dependency.
One critical aspect of achieving this autonomy is geolocalization, which enables subsequent navigation planning. The Global Navigation Satellite System (GNSS) is a reliable and accurate method; however, its accuracy can be compromised by sensor failure or malicious attacks such as spoofing and jamming. These types of interference are not limited to war zones: North Korea, for example, emits strong interference from its coastline, disrupting surrounding aircraft and ships.
In certain application contexts, such as defense, USVs may need to operate stealthily, discretion being a critical element of success. This constraint restricts the sensors that can be used for localization, eliminating active sensors like sonar, lidar, or radar, whose detectability makes the system vulnerable.
% Current State of the Art
In scenarios where positioning using common methods is not possible or no longer reliable, such as in areas with limited GNSS coverage or in environments with significant interference, the scientific literature offers various geolocalization methods, primarily focusing on two approaches: absolute and relative positioning. On the one hand, absolute geolocalization involves estimating the USV's position in relation to a fixed reference frame, often relying on sensor fusion with active sensors such as radars \cite{han2019coastal,ma2017radar}. On the other hand, relative positioning methods estimate the USV's position relative to a reference point. One prominent example of relative positioning is the use of Unmanned Aerial Vehicles (UAVs) in cooperative navigation methods, where the position of the USV is estimated from aerial shots taken by the UAV \cite{dufek2016visual}. However, although these methods are effective, they are not applicable to our application context, since their communications are detectable in the same way as active sensors.
A lesser-known approach for USVs is Visual Geo-localization (VG), which aims to estimate the position solely from visual information, respecting our constraints. One example is localization through correlation of the visible horizon line with known terrain topography for Martian rovers, which cannot rely on a surrounding GNSS constellation \cite{chiodini2017mars}. In space exploration, this method is effective due to the stationary environment.
For maritime applications, however, the complexity increases due to factors such as ocean currents that keep the observer moving, tides that can change the terrain shape rapidly, weather, seasons, etc. Fig \ref{fig:usvlog} shows typical examples of images taken by our USV on our test lake. Note that we operate in hilly terrain dominated by vegetation, and that the field of view of our cameras is limited to 70\textdegree.
\begin{figure}[!h]
\centering
\vspace{0.62em}
\subfloat{\includegraphics[width=0.49\linewidth]{figure/log/2.jpg}}
\hfill
\subfloat{\includegraphics[width=0.49\linewidth]{figure/log/3.jpg}}
\vspace{0.62em}
\caption{Images extracted from USV's rosbags during testing at Lac de Guerlédan, France.}
\label{fig:usvlog}
\end{figure}
\newpage
In order to design a method capable of addressing this complexity, it is necessary to move towards an AI-based approach. Indeed, computer vision has shown great potential with neural networks, particularly Convolutional Neural Networks (CNNs), such as ResNet \cite{he2016deep}, which achieved state-of-the-art results in the Large Scale Visual Recognition Challenge (LSVRC) in 2015 \cite{alom2018history}.
% Your Contribution
Methods from the literature tend to show that the accuracy of visual localization is strongly dependent on the width of the field of view, with poorer results for reduced fields of view \cite{grelsson2020gps}. However, depending on the sensors used, such as thermal imaging cameras, it may be impossible to achieve a wide field of view without impacting sensor cost.
This paper aims to explore and validate the hypothesis of a real-time, GNSS-free visual localization method for USVs in coastal marine environments, using AI and a limited-field-of-view camera.
\section{Related Work}
The goal of visual localization is to answer the question: where should I position myself on a map to obtain an identical image? If the method is effective, the response should be the observer's location. Various methods exist, depending on the level of granularity, either global or local. Local methods assume that the search area is limited and typically focused around a known initial position \cite{liu2023cmlocate, grelsson2020gps}. They are expected to provide higher localization precision than global methods, which instead expand their search to the entire globe, dividing it into regions \cite{vo2017revisiting}.
Visual localization methods can also be distinguished by the type of visual scene for which they are designed: indoor, urban, natural or global environments. Each environment has its own specific characteristics, making it challenging to transfer a method from one environment to another \cite{brejcha2017state}.
\begin{figure*}[b]
\includegraphics[width=\textwidth]{figure/process_pipeline.png}
\caption{Localization heatmap generation process pipeline.}
\label{fig:process_pipeline}
\end{figure*}
\subsection{Horizon correlation}
Horizon matching methods aim to estimate the observer's location by correlating extracted horizon features with a pre-computed database of digitally rendered horizons. The idea is that the observed horizon line contains terrain and altitude information, which can be compared to known topographic data from digital elevation models (DEMs). This approach is not new and is not exclusive to any particular unmanned system, although it was first developed for Unmanned Ground Vehicles (UGVs) on Mars \cite{stein1995map}. Still, this method requires a clear and distant view of the terrain's elevation, favoring terrestrial systems. Moreover, depending on the correlation method used, high computational power may be required, limiting real-time applications \cite{saurer2016image, grelsson2020gps}.
In a maritime context on a USV, a research team \cite{grelsson2020gps} obtained a mean localization error of 2.72m for a limited search area with a 360° field of view. They also highlight that the localization error grew exponentially as the field of view was reduced, reaching a mean error of 18.96m for a 120° field of view. Based on this observation, we will compare our method, which aims to compensate for this limitation.
\subsection{Feature Extraction}
To perform a correlation and obtain a similarity score between two horizons, it is necessary to have features to compare; today, neural networks are the most versatile method to solve this problem. More specifically, CNNs are used as the backbone for feature extraction in computer vision tasks such as image classification, object detection, segmentation and more, for example VGG \cite{simonyan2014very}, ResNet \cite{he2016deep}, and MobileNet \cite{howard2017mobilenets}, to name only the most commonly used.
In the literature based on this principle, whether global approaches \cite{vo2017revisiting} or local approaches \cite{liu2023cmlocate}, authors precompute embeddings\footnote{Vector that summarizes the features of a data point resulting from the inference of a neural network.} for their entire dataset. The embeddings are then stored to enable rapid nearest-neighbor (KNN) searches with minimal memory usage.
\section{Proposed Approach}
To obtain accurate localization from an image despite the limited field of view, while respecting real-time constraints, we propose a pipeline based on feature extraction using CNNs and a nearest neighbor search to visualize the closest hypotheses on a heat map, as shown in Fig \ref{fig:process_pipeline}.
This paper focuses on a visual geo-localization method in a natural marine environment, assuming that we approximately know the location where the system operates, in order to limit the search area.
The principle is first to generate a sufficiently exhaustive dataset of digital horizon renders from topographic data. Subsequently, after training a ResNet model to recognize similarities and differences between horizons, a vector database storing the entirety of the embeddings from the original dataset is created. It is important to note that all computationally expensive processes are performed beforehand on a calculation server. Finally, only the database of embeddings and the trained model are stored onboard the USV, limiting memory usage. For localization, the drone then only needs to perform a single inference and a search for the most similar horizons. As mentioned, we distinguish between the dataset, which represents image sets for deep-learning training, and the database, which represents embedding storage.
\subsection{Siamese Networks}
In order to train our model to extract relevant information from horizon images, we use siamese networks. This training method is suitable in our case, where we need to train the model to find similarities and disparities between the perceived horizon and the simulated horizon. These are two identical networks that share the same weights and work in parallel on different inputs. The two output vectors, called embeddings, are then compared by calculating the Euclidean distance between them.
\begin{equation}
d(a,b)=\sqrt{\sum_{i=0}^{n}(a_i-b_i)^2}
\end{equation}
\begin{tabularx}{0.9\linewidth}{>{$}r<{$} @{${}={}$} X}
a, b & Two embeddings \\
n & Size of embeddings
\end{tabularx}
\vspace{1em}
The aim of this training is to ensure that the distance between the two outputs is inversely proportional to the similarity of the images: the lower the distance between two images, the more similar they should be. To evaluate the network's performance, the loss is calculated from the distance obtained and the target distance. In our training, we use the Mean Squared Error (MSE) loss function.
\begin{equation}
MSE = \frac{1}{m}\sum_{i=0}^{m}(Y_i-\hat{Y}_i)^2
\end{equation}
\begin{tabularx}{0.9\linewidth}{>{$}r<{$} @{${}={}$} X}
m & Batch size \\
Y & Target values \\
\hat{Y} & Predicted values
\end{tabularx}
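The siamese step described above (Euclidean distance between the two embeddings, then MSE against a target distance) can be sketched in PyTorch, the training library used in this paper. Here `backbone` stands for any embedding network; all names are illustrative, not taken from the authors' code:

```python
import torch

def siamese_step(backbone, img_a, img_b, target_dist):
    # Shared weights: the same backbone processes both inputs in parallel.
    emb_a = backbone(img_a)  # (m, n) batch of embeddings
    emb_b = backbone(img_b)
    # d(a, b): Euclidean distance between the two embeddings.
    dist = torch.sqrt(((emb_a - emb_b) ** 2).sum(dim=1))
    # MSE between the obtained distances and the target distances.
    loss = torch.mean((target_dist - dist) ** 2)
    return dist, loss
```

During training, `target_dist` would come from the pair-mining step, and `loss.backward()` would update the shared weights.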
\subsection{Nearest Neighbor and Heat map Localization}
Our proposed method aims to generate a probabilistic map of the observer's position in a given area. This probability map, or heat map, is generated by obtaining the nearest neighbors of a given horizon from a vector database. This database is pre-filled with simulation data, for which our trained model generates an embedding that serves as an identifier. With each embedding, we associate a triplet containing latitude, longitude and heading.
Lastly, the heat map is created from the $n$ nearest neighbors, such that the more intense the color, the closer the similarity between the simulation horizon and the real horizon. This means that the more intense the color, the higher the probability that the observer is at that particular location.
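As an illustrative sketch, and assuming a brute-force nearest-neighbour search over precomputed embeddings with local metric $(x, y)$ coordinates standing in for latitude/longitude, the heat-map generation could look like this (all names hypothetical):

```python
import numpy as np

def knn_heatmap(query_emb, db_embs, db_poses, k=200, grid=(100, 100), extent=1000.0):
    """Accumulate the k nearest database entries into a 2D heat map.

    query_emb : (n,) embedding of the current horizon image
    db_embs   : (N, n) precomputed embeddings of the simulated horizons
    db_poses  : (N, 3) associated (x, y, heading) triplets, x and y in metres
    """
    # Brute-force Euclidean nearest-neighbour search over the database.
    dists = np.linalg.norm(db_embs - query_emb, axis=1)
    nearest = np.argsort(dists)[:k]
    heat = np.zeros(grid)
    for idx in nearest:
        x, y, _ = db_poses[idx]
        i = min(int(y / extent * grid[0]), grid[0] - 1)
        j = min(int(x / extent * grid[1]), grid[1] - 1)
        # Closer embeddings contribute more weight to their cell.
        heat[i, j] += 1.0 / (1.0 + dists[idx])
    return heat / heat.max()
```

In practice the brute-force search would be replaced by an indexed vector database, but the weighting idea is the same: the more nearby neighbors fall in a cell, the more intense its color.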
\section{Experimental Setup}
\subsection{Topographic simulation}
The training of an AI relies on the size, diversity and precision of the annotations used in the dataset. The quality of the final model is therefore determined by these factors. To better control the data, we chose to develop a Unity3D simulator that generates synthetic images automatically from the topographic data of the testing site.
\begin{figure}[b]
\centering
\includegraphics[width=0.9\linewidth]{figure/plt_map_guerledan_big_0.jpg}
\caption{Topographic map of the testing area (5$\times$5 km) at a resolution of 1 m, based on RGE ALTI data, Lac de Guerlédan, France. The inner white square represents the area of interest for the following figures.}
\label{fig:topographic}
\end{figure}
The topographic data are extracted from the national French DEM dataset RGE ALTI®, which divides France into 1km² tiles, each composed of a $1000\times1000$ matrix representing the elevation of a 1m² parcel. Using this base, we generated a 3D terrain model of the Lac de Guerlédan area in France, covering 5$\times$5 km and containing 25 individual tiles, as shown in Fig \ref{fig:topographic}. Note that, in contrast to the real world, the simulation terrain is completely clear of any vegetation or buildings. Moreover, we chose Unity3D to get better control over rendering through its integrated Shader Graph tool, which allowed us to create custom shaders based on scene depth to increase the visibility of horizon details.
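For illustration, the tile-stitching step could be sketched as follows; the nested-list tile layout is an assumption for the example, not the RGE ALTI distribution format:

```python
import numpy as np

def assemble_terrain(tiles):
    """Stitch a square grid of DEM tiles into a single heightmap.

    tiles: nested list [[tile, ...], ...] of equal-shape 2D elevation
    arrays, e.g. 5x5 RGE ALTI tiles of 1000x1000 values (1 m resolution).
    """
    return np.block(tiles)
```

For the 25 tiles of the test area this yields a 5000$\times$5000 elevation matrix, which the simulator can turn into a terrain mesh.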
\subsection{Simulation Dataset}
With the aid of virtual terrain, we can generate an infinite amount of data with complete control over its characteristics. For instance, we can manipulate the position and rotation of the observer, as well as the width of its field of view, the water level, or simulate image deformations caused by the lens.
To create a large dataset, we have automated the generation process. To achieve this, we first initialize a matrix of points on the water surface of the terrain, the distance between them depending on the resolution of the grid, denoted $r$. In the following, we set $r=1000$ points per km$^2$, allowing a maximum theoretical positioning accuracy of 1m. Subsequently, for each point, we perform $n$ camera captures with a rotation step of $360/n$ degrees. We selected $n=16$, resulting in 16 captures with a step of 22.5°. This step is a balance between simplicity of training, offering sufficiently similar images, and limiting the number of images in the dataset. As Fig \ref{fig:simulation_render} shows, the simulated renders are made from terrain depth and shape only; water and sky are intentionally removed to force the model to learn from the horizon. Note also that we decided to simulate a field of view of 70\textdegree\space to match the specifications of our real cameras.
The theoretical maximum number of images that can be generated for a given surface $s$ in km$^2$ is estimated by $nb\_img = s \times r \times n$, where $r$ is the number of points per km$^2$ and $n$ the number of camera captures per point. For a simulation area of 1 km$^2$, this means we can generate up to 16000 images. Moreover, the generated images are $640\times400$ JPEGs, as neural networks require relatively small image sizes to operate efficiently. Given that each image weighs approximately 15kB, a 1 km$^2$ simulation amounts to up to 240 MB of data.
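The sizing estimate above can be reproduced in a few lines of Python; the 15 kB per image figure is the approximation stated in the text:

```python
def dataset_size(s_km2, r=1000, n=16, img_kb=15):
    """Theoretical image count and storage for a simulated area.

    s_km2: surface in km^2; r: grid points per km^2; n: captures per point.
    """
    nb_img = s_km2 * r * n                  # nb_img = s x r x n
    return nb_img, nb_img * img_kb / 1000   # (images, size in MB)
```

`dataset_size(1)` gives 16000 images and 240 MB, matching the figures above.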
\begin{figure}[!ht]
\centering
\subfloat{\includegraphics[width=0.32\linewidth]{figure/sim/0.jpg}}
\hfill
\subfloat{\includegraphics[width=0.32\linewidth]{figure/sim/2.jpg}}
\hfill
\subfloat{\includegraphics[width=0.32\linewidth]{figure/sim/3.jpg}}
\caption{Simulation renderings, with water and sky intentionally rendered black to draw attention to the shape and depth of the terrain.}
\label{fig:simulation_render}
\end{figure}
Through this process we produce a dataset with a surface area of 1.5 km$^2$, depicted in Fig \ref{fig:topographic}, which we divided randomly between training and testing at a 90/10 ratio. We also generate evaluation data that is not extracted from the previous grid but follows the log of an actual GNSS path, ensuring that the network evaluation will not involve over-fitting. The dataset specifications used in the rest of the document are as follows:
\begin{table}[!ht]
% \caption{Specifications of the dataset generated and used for the following section.}
\renewcommand{\arraystretch}{1.3}
\centering
\begin{tabular}{lccc}
& \textbf{Train} & \textbf{Test} & \textbf{Eval} \\
\hline
Images count & 12240 & 1360 & 238
\end{tabular}
\end{table}
It can be observed that the number of images in the dataset for 1.5km$^2$ is lower than the theoretical maximum number of images for a 1km$^2$ area. This is due to the fact that we only consider shots taken on water, whereas our terrain includes a lot of land.
\begin{figure*}[!th]
\centering
\subfloat{\includegraphics[width=0.24\linewidth]{figure/heatmap/0.png}}
\hfill
\subfloat{\includegraphics[width=0.24\linewidth]{figure/heatmap/1.png}}
\hfill
\subfloat{\includegraphics[width=0.24\linewidth]{figure/heatmap/2.png}}
\hfill
\subfloat{\includegraphics[width=0.24\linewidth]{figure/heatmap/3.png}}
\vspace{3pt}
\subfloat{\includegraphics[width=0.24\linewidth]{figure/heatmap/4.png}}
\hfill
\subfloat{\includegraphics[width=0.24\linewidth]{figure/heatmap/5.png}}
\hfill
\subfloat{\includegraphics[width=0.24\linewidth]{figure/heatmap/6.png}}
\hfill
\subfloat{\includegraphics[width=0.24\linewidth]{figure/heatmap/7.png}}
\caption{Generated heatmaps of the focus area with the 200 nearest neighbors displayed. The white shape represents the real GNSS position with a heading indicator. The more intense the blue color, the higher the probability of localization.}
\label{fig:result_heatmap}
\end{figure*}
\section{Training}
Our model design and training are based on the Python library PyTorch. Using this library, we modified a ResNet18 network by changing the output size of its last fully connected layer to the desired embedding size. In our case, we chose an embedding size of 64, which is small but allows us to limit the effects of over-fitting in relation to the complexity of our images.
\subsection{Pair Mining}
In order to train our model on similarity, we need to extract from our dataset pairs of images along with a corresponding similarity score; we call this step pair mining. It involves choosing an anchor, an image of the dataset that serves as the reference, and then finding a pair with a consistent similarity score, in this case an image with the same coordinates but a random heading. The similarity score is then based on the angular distance between the headings associated with the two images. This method guarantees efficient mining without slowing down training.
In addition, to prevent overfitting, we apply data augmentation to all training samples. For each sample, each augmentation, such as rotation, scaling, translation and brightness multiplication, has a 30\% chance of being applied.
Although siamese networks are heavier to train, as they require more inferences per epoch, our models seemed to reach their minimum loss within a few hours.
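A hedged sketch of the pair-mining step; the dataset layout and the normalisation of the angular distance to a $[0,1]$ target are illustrative assumptions, not the authors' implementation:

```python
import random

def mine_pair(dataset):
    """Pick an anchor and a pair at the same coordinates, random headings.

    dataset: maps (x, y) coordinates to {heading_degrees: image} dicts
    (hypothetical layout; the paper's storage format is not specified).
    """
    coords = random.choice(list(dataset.keys()))
    headings = list(dataset[coords].keys())
    h_a, h_b = random.choice(headings), random.choice(headings)
    # Angular distance in [0, 180] degrees, normalised to a [0, 1]
    # target distance: identical headings give target 0.
    diff = abs(h_a - h_b) % 360
    target = min(diff, 360 - diff) / 180.0
    return dataset[coords][h_a], dataset[coords][h_b], target
```

Because only indexing and one subtraction are involved, mining a pair costs almost nothing compared to a forward pass, which is why it does not slow down training.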
\section{Results}
To evaluate and compare our approach, we focus on two main metrics: inference time and accuracy. The following results are obtained by evaluating the model on 238 unique images in a 1.5km$^2$ area.
\subsection{Inference Time}
To evaluate the time efficiency of this approach, we have averaged the execution time for the nearest neighbor search and inference according to the number of neighbors requested (See Table \ref{table:inference_time}). Note that all measurements and training were performed on the following machine: Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz, Nvidia Quadro RTX 5000.
\hfill
\begin{table}[ht]
\caption{Mean execution time depending on the number of neighbors requested.}
\label{table:inference_time}
\centering
\begin{tblr}{
colspec = {QS[table-format=1.2]S[table-format=1.2]},
row{1} = {c},
row{2} = {c},
cell{1}{1} = {r=2}{},
cell{1}{2} = {c=2}{},
cell{3}{1} = {c},
cell{4}{1} = {c},
cell{5}{1} = {c},
cell{6}{1} = {c},
hline{1,7} = {-}{0.08em},
hline{3} = {-}{},
}
Neighbors count & {{{Mean time (ms)}}} & \\
& {{{KNN}}} & {{{ResNet18}}} \\
10 & 0.05 & 3.68 \\
20 & 0.11 & 3.64 \\
200 & 0.26 & 3.64 \\
2000 & 1.72 & 3.58
\end{tblr}
\end{table}
\hfill
\subsection{Accuracy}
The accuracy of the localization method is evaluated by a success rate, computed for a given number of neighbors and a given precision radius. For example, the success rate for an accuracy of 1m among 10 neighbors corresponds to the percentage of predictions for which, among their 10 closest neighbors, at least one candidate lies less than 1m away from the authentic position of the observer.
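The success-rate computation described above can be sketched as follows, assuming neighbor positions are expressed in metres in a local frame (names are illustrative):

```python
import numpy as np

def success_rate(pred_neighbors_xy, true_xy, radius):
    """Percentage of queries whose returned neighbors include at least
    one candidate within `radius` metres of the true position.

    pred_neighbors_xy : (Q, k, 2) neighbor coordinates per query
    true_xy           : (Q, 2) ground-truth GNSS positions
    """
    d = np.linalg.norm(pred_neighbors_xy - true_xy[:, None, :], axis=2)
    return 100.0 * np.mean(d.min(axis=1) <= radius)
```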
\hfill
\begin{table}[ht]
\caption{Success rate according to level of precision and number of neighbors requested.}
\label{table:accuracy}
\centering
% \resizebox{\linewidth}{!}{%
\begin{tblr}{
colspec = {QQS[table-format=2.2]},
column{2} = {c},
cell{1}{3} = {c},
cell{2}{1} = {r=4}{c},
cell{6}{1} = {r=4}{c},
cell{10}{1} = {r=4}{c},
hline{1,14} = {-}{0.08em},
hline{2,6,10} = {-}{},
}
Neighbors count & Accuracy (m) & {{{Success rates (\%)}}} \\
10 & 1 & 0.84 \\
& 5 & 14.71 \\
& 10 & 47.48 \\
& 50 & 88.66 \\
50 & 1 & 1.26 \\
& 5 & 21.43 \\
& 10 & 67.23 \\
& 50 & 97.90 \\
200 & 1 & 1.26 \\
& 5 & 23.53 \\
& 10 & 79.41 \\
& 50 & 99.58
\end{tblr}
% }
\end{table}
\section{Discussion}
The results from the previous section allow us to note the efficiency of vector databases, with an average search time of just 1.72 ms for a maximum of 2000 neighbors. Additionally, the average execution time of the ResNet18 network, $\simeq 3.6$ms, remains constant whatever the number of neighbors. This is one of the strengths of this approach: in operation, there will always be only one inference, as most calculations are pre-processed.
Although the method has not been evaluated on an embedded platform (e.g. Nvidia Jetson), its order of magnitude is around $\simeq 6$ms for a search through 2000 neighbors, whereas some other approaches, closer to signal processing, take around $\simeq 25$ms per grid point \cite{grelsson2020gps}, meaning approximately 21.2s for our area size on substantially identical hardware.
Considering the frequency of our sensors and the maximum speed of our USV, we estimate that a minimum frequency of 10Hz is required to be considered real time. However, with 6ms of processing we can reach more than 166Hz.
In addition, the heat maps show fairly heterogeneous results. Some are excellent and very promising, such as A, E and F in Fig \ref{fig:result_heatmap}, while others show imprecision of up to 500m over a 1.5km$^2$ area, such as B, C and G in Fig \ref{fig:result_heatmap}. However, despite a field of view limited to 70°, the approach locates the observer to within 50m in 88\% of cases within a 1.5km$^2$ zone, using the 10 nearest neighbors from a database of 13,600 potential neighbors (See Table \ref{table:accuracy}).
\subsection{Limitation}
As a result of this work we were able to identify certain limitations listed below:
\begin{itemize}
\item When the observer is a few dozen meters from the coast, the vegetation and mainly the trees generate noise on the horizon line, making it much more difficult to exploit. In contrast to reality, the simulation terrain is vegetation-free.
\item The data used to train the model correspond to the same geographical location as that used for the evaluation. To guarantee the model's capabilities, the geographical area used for training would need to differ from that used for evaluation; without this, it is impossible to ensure performance at another location without training dedicated to that location.
\item Some results exhibit a corridor effect, like heatmap C in Fig \ref{fig:result_heatmap}, or a spread effect, like heatmap G in Fig \ref{fig:result_heatmap}, due to the small field of view limiting the amount of information available for processing.
\end{itemize}
\section{Conclusion}
In this paper, we propose a proof of concept for a real-time visual localization method with a limited field of view, based on artificial-intelligence feature extraction and nearest neighbor search. We have also proposed a simulator for rendering topographic data under Unity3D, enabling training and validation data to be generated and controlled.
This proof of concept is the first necessary element in the development of a visual localization method for coastal environments without GNSS, subsequently enhancing the resilience of autonomous systems and reducing human dependency.
Future work will need to improve this model for use with IR camera feeds and to overcome the limitations identified above. In particular, it will be necessary to carry out training and evaluation in two different geographical areas to guarantee the model's capabilities. Furthermore, to improve accuracy despite a small field of view, it would be appropriate to add a notion of memory to the process and to apply statistical filtering to limit corridor or propagation effects during localization.
To ensure operation on real data, it would also be interesting to explore training in successive stages: starting with exclusively synthetic data, then progressively adding real data to let the model learn at progressively increasing complexity.
\FloatBarrier
\printbibliography
\end{document}