Automated video identification of marine species (AVIMS) - new application: report

A commissioned report on the development of a web-based computer application for machine learning-based (semi-)automated analysis of underwater video footage obtained during the monitoring of aquatic environments.

Methodology and design

The overall objective of this project was to develop an application that provides a user-friendly interface that allows Marine Directorate biologists – who do not have machine learning or programming expertise – to train computer vision and machine learning models for the purpose of analysing video footage, detecting and recognizing a range of species of fish and benthos.

Our system builds on the work of Blowers et al. (2020) and on our prior experience from the EU-funded H2020 SMARTFISH project (SMARTFISH H2020 – Innovation for Sustainable Fisheries; French et al., 2020) and from JellyMonitor (French et al., 2018; Gorpincenko et al., 2020). Blowers et al. provided a feasibility study that considered a handful of entities of interest extracted from a few hundred video frames. Their work demonstrated the power of convolutional neural networks (CNNs) over more traditional approaches and suggested a possible system architecture. We followed Blowers et al.'s recommendations closely, making a number of important refinements based on our experience of developing larger-scale systems capable of handling many more entities of interest.

In the initial phase of the project, a number of meetings with Marine Directorate scientists took place, which helped us to design and iteratively refine our approach so that the end product met Marine Directorate requirements. During this phase we also obtained video footage from Marine Directorate, which included overhead or underwater footage from fish counters, fish surveys and seabed surveys. We also considered whether the system should be a standalone desktop application or a web-based application. We decided to proceed with the latter, as it was considered to meet the project requirements. The web-based solution also allowed a number of Marine Directorate scientists to work with the evolving prototypes of the tool concurrently and remotely. This work included uploading datasets, annotating parts of the uploaded datasets and providing feedback to the development team.

Our application allows users without computer science or coding experience to create, train and execute computer vision models without needing to interact with the underlying code. The application supports computer vision models that fall into a common framework: detecting objects from a predefined set of classes (the schema), tracking the detected objects across consecutive frames in the video, and finally counting all distinct objects in the video for each class in the schema.

The workflow of our web-based application which implements the above computer vision framework is as follows:

Survey type. The application supports users in accomplishing a wide variety of tasks, e.g. counting fish and benthos in footage of the seabed or taken in fish survey work, and counting salmon in overhead footage from in-river fish counter sites. Given the differing appearance of the targets and the background in these tasks, the best performance is obtained by training separate models, even if both tasks share a general goal - counting entities of interest in a video. The application permits users to create different survey types for the various tasks that they wish to train models for.

Schema design. Given that each survey type is aimed at accomplishing a different task, the entities of interest that the user wishes to quantify are likely to differ. The application therefore provides an interface that allows the user to specify the list of species or classes of fish or other entities that the model should count within the video. Each survey type has its own schema that corresponds to the goals of the survey.
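As an illustration of the relationship between survey types and schemas described above, the following is a minimal sketch in Python. It is not the application's actual data model: the `SurveyType` class, its fields and the class labels other than salmon are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class SurveyType:
    """Hypothetical container: one survey type with its own schema and videos."""
    name: str
    schema: list = field(default_factory=list)   # class labels the model should count
    videos: list = field(default_factory=list)   # video files imported for this survey type

    def add_class(self, label):
        """Add a class label to this survey type's schema, rejecting duplicates."""
        if label in self.schema:
            raise ValueError(f"{label!r} is already in the schema for {self.name!r}")
        self.schema.append(label)


# Two survey types with independent schemas (labels are illustrative only).
fish_counter = SurveyType("In-river fish counter", ["salmon"])
seabed = SurveyType("Seabed survey")
seabed.add_class("crab")
seabed.add_class("starfish")
```

Keeping the schema attached to the survey type, rather than shared globally, mirrors the point made above that each task has its own entities of interest and its own model.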

Video import. Videos selected by the user are added to the survey type with a view to extracting training images.

Extraction of images from video. Still images are extracted from the video files to be annotated by the user in the subsequent stages. Here, the user is required to choose which frames in the video they wish to include in the dataset that will be used to train and test the machine learning models. While extracting and annotating every frame may yield a robust model, the effort required to manually annotate potentially thousands of images is likely to be prohibitive. The user is presented with a graphical tool with a conventional video slider, which they use to scroll through the video and choose the frames they wish to include in the dataset. These should be representative of the conditions in which the model is required to work.

Annotation. The selected frames from the imported videos need to be annotated by expert taxonomists in order to provide the ground-truth data for training and testing the machine learning models. The annotation tool, which is built on our earlier work (Britefury/django-labeller on GitHub: an image labelling tool for creating segmentation data sets, for Django and Flask), is presented to the user, allowing them to identify entities of interest and annotate them using either polygonal annotations or ellipses. We anticipate that after testing the model the user may wish to annotate more data in order to improve the model performance.

Dataset construction. Machine learning requires an annotated training set for training the model and an annotated testing set for evaluating its performance. Segments from a single video are likely to share common visual characteristics, e.g. lighting or turbidity due to time of day and weather conditions. If a model is trained on footage from one part of a video and tested on footage from a different part of the same video, the testing scores are likely to be higher than those achievable in the desired practical scenario of using the model to analyze footage from an as-yet unseen site. Our system therefore offers a number of approaches to splitting the data so that the testing results are indicative of future performance.
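One way to keep test results honest, in the spirit of the paragraph above, is to split at the level of whole videos so that no video contributes frames to both the training and the testing set. The sketch below illustrates that idea only; it is not necessarily one of the splitting approaches the application implements, and the function name, the `(video_id, frame_index)` representation and the default test fraction are assumptions.

```python
import random
from collections import defaultdict


def split_by_video(frames, test_fraction=0.2, seed=0):
    """Split annotated frames into train and test sets by whole video.

    frames: list of (video_id, frame_index) tuples.
    Returns (train, test); no video_id appears in both lists.
    """
    by_video = defaultdict(list)
    for video_id, frame_index in frames:
        by_video[video_id].append((video_id, frame_index))

    # Shuffle the videos (not the frames) with a fixed seed for repeatability.
    video_ids = sorted(by_video)
    random.Random(seed).shuffle(video_ids)

    # Hold out at least one whole video for testing.
    n_test = max(1, round(len(video_ids) * test_fraction))
    test_ids = set(video_ids[:n_test])

    train = [f for v in video_ids if v not in test_ids for f in by_video[v]]
    test = [f for v in video_ids if v in test_ids for f in by_video[v]]
    return train, test
```

With only one video available this sketch would place every frame in the test set, so in practice a per-video split needs footage from several videos, ideally from different sites and conditions.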

Training and Testing. A subset of annotated images is used to train the model. This involves iteratively presenting images to the neural network and optimizing the network parameters. This is computationally expensive and can take hours to complete on a GPU server. A proportion of the dataset constructed above is marked for testing. These images are held out during training i.e. are not part of the training set, and given to the model for inference at this stage. The predictions generated by the model are compared to the manually generated ground-truths. The system determines which objects of interest the model failed to detect (false negatives) and which detections produced by the model were spurious (false positives). The user is presented with these results.
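The comparison of predictions with ground truths can be sketched as follows. This is a simplified illustration, not the system's evaluation code: it assumes axis-aligned bounding boxes (the application's annotations are polygons or ellipses), a greedy matching strategy and an intersection-over-union (IoU) threshold of 0.5, all of which are assumptions made for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def compare(predictions, ground_truths, iou_threshold=0.5):
    """Greedily match predictions to ground truths of the same class.

    predictions, ground_truths: lists of (label, box).
    Returns (true_positives, false_positives, false_negatives).
    """
    unmatched_gt = list(range(len(ground_truths)))
    true_positives, false_positives = [], []
    for p_label, p_box in predictions:
        best, best_iou = None, iou_threshold
        for g in unmatched_gt:
            g_label, g_box = ground_truths[g]
            overlap = iou(p_box, g_box)
            if g_label == p_label and overlap >= best_iou:
                best, best_iou = g, overlap
        if best is None:
            false_positives.append((p_label, p_box))   # spurious detection
        else:
            unmatched_gt.remove(best)                  # each ground truth matches once
            true_positives.append((p_label, p_box))
    # Ground truths left unmatched are objects the model failed to detect.
    false_negatives = [ground_truths[g] for g in unmatched_gt]
    return true_positives, false_positives, false_negatives
```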

Mean Average Precision (mAP) is used as the measure of model accuracy during training. An average precision is computed for each class separately, and these per-class values are then averaged to give a single figure, the mAP. For example, a model with a mAP of 80% can be expected to be highly reliable across all classes. In this case, a user can be relatively confident about the model predictions and the subsequent tracking results, as long as the video passed for subsequent analysis (see Inference below) is similar to the footage the model was trained on.
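To make the averaging concrete, the sketch below computes a per-class average precision from a ranked list of detections and then averages over classes. Several conventions for average precision exist (for example with or without interpolation of the precision-recall curve), and the report does not state which one the system uses; the non-interpolated form here and the function names are assumptions made for illustration.

```python
def average_precision(ranked_hits, n_ground_truth):
    """Non-interpolated average precision for one class.

    ranked_hits: booleans for detections sorted by descending model score;
                 True means the detection matched a ground-truth object.
    n_ground_truth: number of ground-truth objects of this class.
    """
    if n_ground_truth == 0:
        return 0.0
    true_positives = 0
    precision_sum = 0.0
    for rank, hit in enumerate(ranked_hits, start=1):
        if hit:
            true_positives += 1
            precision_sum += true_positives / rank   # precision at this recall point
    return precision_sum / n_ground_truth


def mean_average_precision(per_class):
    """per_class: {label: (ranked_hits, n_ground_truth)}. Returns (mAP, per-class APs)."""
    aps = {label: average_precision(hits, n) for label, (hits, n) in per_class.items()}
    return sum(aps.values()) / len(aps), aps
```

Because the mAP is an average, it is worth inspecting the per-class values as well: a strong class can mask a weak one in the single figure.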

Object tracker. Without an object tracker it is not possible to count objects in a video. The object detection model trained in prior stages detects entities in single frames. Using this alone would result in a significant over-count as an entity will be counted multiple times, once for each frame in which it is detected. The task of the object tracking system is to associate detections that correspond to a single entity across the frames in which that entity is visible. The user is presented with an interface that allows them to tune the tracker in order to achieve the most accurate results.
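The over-counting problem and the role of the tracker can be illustrated with a deliberately simple tracker that links detections in consecutive frames by bounding-box overlap. This is a sketch of the general idea, not the tracker used in the application; the box representation and the two tunable parameters (`iou_threshold` and `max_missed`) are assumptions chosen for the example.

```python
from collections import Counter


def _iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_tracks(frames, iou_threshold=0.3, max_missed=2):
    """Count distinct entities per class by linking detections across frames.

    frames: one list per video frame, each containing (label, box) detections.
    max_missed: frames a track may go undetected before it is dropped.
    """
    active = []          # tracks currently being followed
    counts = Counter()   # number of distinct tracks started, per label
    for detections in frames:
        unmatched = list(detections)
        for track in active:
            best, best_iou = None, iou_threshold
            for det in unmatched:
                overlap = _iou(track["box"], det[1])
                if det[0] == track["label"] and overlap >= best_iou:
                    best, best_iou = det, overlap
            if best is None:
                track["missed"] += 1
            else:
                unmatched.remove(best)
                track["box"], track["missed"] = best[1], 0
        active = [t for t in active if t["missed"] <= max_missed]
        # Any detection not linked to an existing track starts a new entity.
        for label, box in unmatched:
            active.append({"label": label, "box": box, "missed": 0})
            counts[label] += 1
    return counts
```

Counting detections frame by frame would report a salmon visible in three frames as three salmon, whereas the tracker reports one. The two parameters show why tuning matters: too strict a setting splits one entity into several tracks, while too lenient a setting can merge separate entities.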

Inference (analysis of new videos). Once a model has been trained and achieved an acceptable level of accuracy, the system uses the trained model to analyze videos provided by the user, outputting the desired results. This stage can also be computationally expensive (especially for long videos) and usually requires high performance computer hardware.

Here, the user chooses a machine learning model for the task at hand and selects a set of videos for analysis. The system then processes the videos in turn, saving the event logs to .csv and .json files, which the user can view in the application GUI or download. The event logs contain an entry for each detected entity, consisting of a time stamp, the event duration, and the predicted class label of the detected object.
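An event log with the three fields described above could be serialized along the following lines. The column names, units and example values are hypothetical; the report does not specify the actual file layout produced by the application.

```python
import csv
import io
import json

# Hypothetical field names for the three values each log entry contains.
FIELDS = ["timestamp_s", "duration_s", "label"]


def events_to_csv(events):
    """Render a list of event dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(events)
    return buffer.getvalue()


def events_to_json(events):
    """Render the same events as a JSON array."""
    return json.dumps(events, indent=2)


# Illustrative entries: one per tracked entity, not one per frame.
events = [
    {"timestamp_s": 12.4, "duration_s": 3.2, "label": "salmon"},
    {"timestamp_s": 47.0, "duration_s": 1.6, "label": "salmon"},
]
csv_text = events_to_csv(events)
json_text = events_to_json(events)
```

Per-class counts for a video can then be obtained by counting the rows for each label.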

We also considered adding a confidence score for each detection. We eventually decided against this feature, as we did not have a reliable way of generating a confidence score that is correct and meaningful. While the system could in theory provide such scores, our experiments and observations show that they would not be helpful, i.e. they do not correspond to what a human would consider a confidence or probability.

At the end of this stage, the user is also given the opportunity to add the videos that have just undergone inference to the training dataset, to improve a subsequent iteration of the machine learning model. This may be particularly worthwhile where the videos are substantially different from those in the dataset used to train and test the model and have consequently returned unsatisfactory results in the inference stage. Targeting such videos maximizes the improvement in performance obtained by expanding the machine learning dataset.

Our application supports a rapid annotate and test cycle, allowing the user to quickly assess the performance of the model and grow the training set in order to achieve the desired level of performance as quickly as possible.


