Advancing Fauna Conservation through Machine Learning-Based Spectrogram Recognition: A Study on Object Detection using YOLOv5

The protection and monitoring of fauna species are essential for maintaining biodiversity and ensuring the sustainability of ecosystems. Traditional methods of fauna conservation and habitat monitoring rely heavily on manual observation and data collection, which can be time-consuming and labor-intensive. In recent years, the application of machine learning techniques, such as object detection, has shown great potential in automating the identification of fauna species. In this study, we propose an approach to advancing fauna conservation through machine learning-based spectrogram recognition. Specifically, we employ an object detection algorithm, YOLOv5, to detect and classify fauna species from spectrogram images obtained from acoustic recordings. The spectrograms provide a visual representation of audio signals, capturing distinct patterns and characteristics unique to different fauna species. Through extensive experimentation and evaluation, our approach achieved promising results, demonstrating a precision of 0.95, recall of 0.98, F1 score of 0.91, and mean Average Precision (mAP) of 0.934. These performance metrics indicate a high level of accuracy and reliability in fauna species detection. By automating the identification process, our approach provides a scalable solution for monitoring fauna populations over large geographical areas and enables the collection of comprehensive data, facilitating better decision-making and targeted conservation strategies.


INTRODUCTION
The conservation of fauna species and the monitoring of their habitats are of paramount importance for preserving biodiversity and ensuring the long-term health of ecosystems (Stephenson et al., 2022). However, traditional methods of fauna conservation and habitat monitoring often rely on manual observation and data collection, which can be labor-intensive, time-consuming, and limited in their scope (Musvuugwa et al., 2021). In recent years, advancements in machine learning and computer vision techniques have provided promising avenues for automating and enhancing these processes (Taye, 2023).
The application of audio signal processing, specifically through the analysis of acoustic recordings, allows for the automation and streamlining of fauna monitoring processes (Mutanu et al., 2022). By employing machine learning algorithms, we can accurately detect and classify fauna species from these recordings, eliminating the need for manual observation and significantly reducing the time and effort required for data collection (Binta Islam et al., 2023). This enhanced efficiency enables researchers and conservationists to monitor larger areas and achieve a more comprehensive understanding of fauna populations (Dalton et al., 2023).
Traditional methods of fauna monitoring often face limitations in terms of coverage and scalability (Prosekov et al., 2020). Manual observation is inherently constrained by human resources and time, making it difficult to gather data across vast territories or inaccessible regions (Sharma et al., 2023). Audio signal processing combined with machine learning techniques enables us to overcome these limitations. By analyzing acoustic recordings, we can extend monitoring efforts to remote or challenging environments, providing a more complete picture of fauna populations and their habitats (Ravaglia et al., 2023).
Many fauna species, such as Bucheros Bicornis, are sensitive to human presence, and their behavior can be altered by direct observation (Hoy & Brereton, 2022). Bucheros Bicornis, commonly known as the Bicornis Hornbill, is a remarkable species found in the conservation forest of Bukit Suligi. These birds belong to the family Bucerotidae and are characterized by their striking appearance and fascinating behavior. Bucheros Bicornis is classified as a species of conservation concern due to habitat loss and degradation (Woods et al., 2022). The conservation forest of Bukit Suligi plays a crucial role in the survival and protection of this species. Efforts to preserve and restore the habitat, along with sustainable forest management practices, are essential for the long-term conservation of Bucheros Bicornis and its ecosystem (Angelstam et al., 2023). The presence of Bucheros Bicornis in the conservation forest of Bukit Suligi is a testament to the importance of protecting and maintaining the biodiversity within this area. By safeguarding its habitat and implementing conservation measures, we can ensure the survival of this species and contribute to the overall preservation of our natural heritage.
Audio signal processing offers a non-intrusive alternative for monitoring these species without disturbing their natural habitats (Benocci et al., 2022). By analyzing acoustic recordings, we can obtain valuable insights into their vocalizations, migration patterns, and habitat preferences, all while minimizing our impact on their environment (Sharma et al., 2023; Stephenson et al., 2022).
Previous research studies (Jung & Choi, 2022; Kim et al., 2022) focused on utilizing YOLOv5, a state-of-the-art object detection algorithm, for the detection and tracking of individuals in camera footage. This approach was successful in identifying and monitoring objects of interest within the visual domain. The algorithm's ability to accurately detect and track individuals in real time provided valuable insights for various applications, such as surveillance, human activity recognition, and behavior analysis. The earlier research demonstrated the effectiveness of YOLOv5 in the context of visual object detection, establishing it as a powerful tool in the field of computer vision.
In contrast, the current research builds upon the success of this earlier work but applies YOLOv5 in a different domain: the automated detection of species from audio recordings of bird song. Instead of using camera footage, this research converts the audio recordings into spectrogram images, which capture the frequency and time characteristics of the bird songs. By adopting YOLOv5 for spectrogram recognition, the research aims to automate the detection and classification of different species based on their unique vocalizations.

Research Location
The research takes place in the forest area with special purpose (Kawasan hutan dengan tujuan khusus) known as Bukit Suligi, located in Riau Province. Bukit Suligi, being a forest area with special purpose, serves as a critical habitat for various flora and fauna species, including Bucheros Bicornis. This bird species plays a significant ecological role within the ecosystem and is of conservation concern. To effectively monitor and protect the Bucheros Bicornis population, the research aims to leverage machine learning techniques, specifically spectrogram recognition, as a tool for their detection and tracking.
The research involves recording the songs of Bucheros Bicornis using the Thronmax Pulse recording instrument at latitude 0°35'46.92" N and longitude 100°32'32.78" E. The Thronmax Pulse is a specialized recording device capable of capturing audio signals within a recording radius of up to 500 meters. A total of 48 recordings of Bucheros Bicornis song were collected. By utilizing this instrument, the research aims to collect high-quality audio recordings of Bucheros Bicornis songs within their natural habitat in Bukit Suligi. These recorded audio samples are then converted into spectrogram images, which provide a visual representation of the frequency and time characteristics of the bird songs. The YOLOv5 algorithm, a state-of-the-art object detection model, is employed to analyze and recognize the spectrogram images, enabling the automated detection and identification of Bucheros Bicornis individuals within the recorded audio data.

Data Analysis
This research involves data analysis using object detection as the model type, implementing YOLO (You Only Look Once) as the architecture, and utilizing the PyTorch framework. All data processing, including training, validation, and testing, is conducted in Google Colaboratory using open-source code.

Data Collection
The data analysis process begins with the collection of audio recordings from the designated forest area. Suitable recording devices or sensors are used to capture the vocalizations of the fauna species of interest. Forty-eight recordings of Bucheros Bicornis song were collected and used as input data for the analysis. The collected audio recordings are then transferred to a computer or a cloud-based platform such as Google Colaboratory for further processing and analysis.

Audio Data Conversion
The application of the trained YOLOv5 model to unseen or real-time audio recordings involves converting them into spectrogram images.

Image Annotation
The next step involves data annotation, where the spectrogram images are manually annotated by marking the regions that correspond to the presence of the target fauna species. Labels indicating the species and potentially additional attributes, such as bounding boxes or key points, are assigned to the annotated regions.
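As an illustration of what one such annotation can look like on disk, the sketch below converts a pixel-space bounding box into the text format that YOLOv5 reads: one object per row, given as class index, box center x and y, width, and height, all normalized by the image size. The 640x640 image size, the box coordinates, and the class index 0 are hypothetical values chosen for illustration and are not taken from the dataset of this study.

```python
def to_yolo_label(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space box to a YOLO label line.

    YOLOv5 label files hold one object per row:
    class x_center y_center width height, each normalized to [0, 1].
    """
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical example: one call marked on a 640x640 spectrogram image.
line = to_yolo_label(0, 160, 320, 480, 480, 640, 640)
print(line)
```

On a spectrogram, the horizontal extent of the box corresponds to the duration of the vocalization and the vertical extent to its frequency band.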

Architecture Selection
For the model training phase, the YOLOv5 architecture is implemented, leveraging its ability to perform object detection tasks following the "You Only Look Once" principle (Glenn et al., 2022). The PyTorch framework is utilized to facilitate deep learning and computer vision tasks. The annotated dataset is split into training, validation, and testing sets, enabling the model to be trained and its performance evaluated.
An overview of the YOLOv5 architecture is given in Figure 2.
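The split into training, validation, and testing sets could be done along the following lines. This is a minimal sketch, not the procedure used in the study: the 70/20/10 proportions, the fixed random seed, the file names, and the assumption of one spectrogram image per recording are all hypothetical.

```python
import random


def split_dataset(items, train_frac=0.7, val_frac=0.2, seed=0):
    """Shuffle items reproducibly and split into train/val/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to testing
    return train, val, test


# Hypothetical file names standing in for 48 spectrogram images.
images = [f"spectrogram_{i:02d}.png" for i in range(48)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the split reproducible, so that reported metrics refer to the same held-out images across runs.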

Data Training
During model training, the YOLOv5 architecture is fine-tuned on the annotated spectrogram images, allowing it to learn to detect and localize the target fauna species accurately. The model's performance is evaluated using metrics such as precision, recall, F1 score, and mean Average Precision (mAP). Throughout the research, all data processing tasks, including training, validation, and testing, are conducted in Google Colaboratory using open-source code. This choice of platform and tools provides computational resources, collaborative features, and access to a wide range of libraries and frameworks available in the Python ecosystem.

Model Evaluation
The trained model was evaluated using the validation dataset, measuring key metrics such as precision, recall, and F1 score (Arifando et al., 2023). Precision, recall, and F1 score are defined by Equations (1), (2), and (3), respectively.
Precision is defined by:
$$P = \frac{TP}{TP + FP} \tag{1}$$
Recall is defined by:
$$R = \frac{TP}{TP + FN} \tag{2}$$
where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively. The F1 score combines the precision and recall measurements using Equation (3). It is calculated as the harmonic mean of precision and recall, not the arithmetic mean. The F1 score has a value between 0 and 1; the higher the value, the higher the accuracy of detecting an object:
$$F_1 = \frac{2PR}{P + R} \tag{3}$$
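As a worked illustration, the three metrics can be computed from counts of true positives (TP), false positives (FP), and false negatives (FN). The counts used below are hypothetical and are not results from this study.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, not results from this study.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, the F1 score here (about 0.818) lies below the arithmetic mean of precision and recall (0.825).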

Model Deployment
During the deployment phase, the trained model is used to make predictions on new, unseen data. In the case of fauna conservation, the deployed model takes in spectrogram images as input and performs object detection to identify and localize the target fauna species within the images.
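Once a detection has been localized in a spectrogram image, its bounding box can be mapped back to a time interval and frequency band in the original recording. The sketch below shows one way to do this under stated assumptions: the image spans the whole clip from left to right, the frequency axis is linear with the highest frequency at the top edge, and the box uses normalized YOLO coordinates. The function name, clip length, and frequency limits are hypothetical.

```python
def box_to_time_freq(x_center, y_center, width, height,
                     clip_seconds, f_min_hz, f_max_hz):
    """Map a normalized YOLO box on a spectrogram to time and frequency.

    Assumes the image spans the whole clip left to right, and a linear
    frequency axis with f_max_hz at the top edge (y = 0).
    """
    t_start = (x_center - width / 2) * clip_seconds
    t_end = (x_center + width / 2) * clip_seconds
    span = f_max_hz - f_min_hz
    f_high = f_max_hz - (y_center - height / 2) * span
    f_low = f_max_hz - (y_center + height / 2) * span
    return t_start, t_end, f_low, f_high


# Hypothetical detection on a 10-second clip covering 0 to 8000 Hz.
t_start, t_end, f_low, f_high = box_to_time_freq(
    0.5, 0.75, 0.2, 0.25, 10.0, 0.0, 8000.0)
print(t_start, t_end, f_low, f_high)
```

If the spectrograms use a logarithmic or mel frequency axis, the vertical mapping would need to be replaced by the corresponding inverse scale.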

RESULT AND DISCUSSION
With the available training dataset, the YOLOv5 model achieved a mean Average Precision (mAP) of 0.934 in detecting Bucheros Bicornis spectrograms. The evaluation of the model's performance, as depicted in Table 1 and Figure 3, provides an in-depth analysis of key metrics including precision, recall, F1 score, and the mAP value. These metrics were computed based on the optimal outcome achieved after 300 epochs of training.
The utilization of YOLOv5 in the current research presents an exciting opportunity for advancing fauna conservation and habitat monitoring. By automating the detection of species from audio recordings, researchers and conservationists can gather comprehensive data on fauna populations across large geographic areas. This approach complements traditional visual monitoring methods and enhances efficiency by reducing manual effort and expanding the scope of monitoring initiatives. The integration of YOLOv5 into audio-based species detection represents a significant step forward in the field of fauna conservation and contributes to the development of innovative tools for effective habitat monitoring.
In conclusion, our research demonstrates the potential of machine learning-based spectrogram recognition for advancing fauna conservation and habitat monitoring. Traditional methods of fauna observation and data collection are often time-consuming, labor-intensive, and limited in scope. However, by harnessing the power of machine learning, particularly through object detection using the YOLOv5 algorithm, we have achieved promising results. Our approach, which utilizes spectrogram images obtained from acoustic recordings, enables the automated detection and classification of fauna species. The spectrograms capture unique patterns and characteristics specific to each species, allowing for accurate identification. The evaluation of our methodology yielded impressive performance metrics, including a precision of 0.95, recall of 0.98, F1 score of 0.91, and mean Average Precision (mAP) of 0.934. These metrics indicate a high level of accuracy and reliability in detecting and classifying fauna species.
By automating the identification process, our approach reduces the reliance on manual observation, making it more efficient and scalable for monitoring fauna populations over large geographical areas. This automation opens up new possibilities for effective and efficient habitat monitoring, leading to better conservation strategies. The integration of machine learning techniques into existing conservation efforts facilitates the collection of comprehensive data, enabling better decision-making and targeted conservation strategies.
In summary, our research demonstrates the potential of machine learning-based spectrogram recognition as a valuable tool for advancing fauna conservation. The successful application of this approach contributes to the development of effective and efficient habitat monitoring methods, ultimately aiding in the protection and preservation of biodiversity and ecosystems.
Although our research achieved promising results with a limited dataset, further improvements can be made by increasing the size and diversity of the training dataset. By including a broader range of spectrogram recordings from various species and habitats, the model can enhance its ability to detect and classify fauna species accurately. Furthermore, experimentation with different hyperparameters and model architectures could lead to improved performance. Fine-tuning the YOLOv5 model by adjusting parameters such as learning rate, network depth, or anchor box sizes may enhance the model's precision, recall, and F1 score.

ACKNOWLEDGMENT
We would like to express our heartfelt gratitude to Balai Pelatihan Lingkungan Hidup Dan Kehutanan Pekanbaru, Ministry of Environment and Forestry of the Republic of Indonesia, for their invaluable contribution to our research. We extend our sincere thanks for providing us with the necessary resources, specifically the sensitive audio recorder used to gather the bird song audio dataset, and access to the image processing platform.
Once new recordings have been converted into spectrogram images, the model's object detection capabilities are utilized to automatically detect and classify the target fauna species within the spectrogram images. The results of the detection process provide insights into the species' presence and distribution. To convert the audio recordings into spectrogram images, the audio processing library Librosa is used (Alamhashmi et al., 2022). The spectrogram parameters, including window size, hop length, and frequency range, are adjusted to optimize the visualization of the acoustic features relevant to the target fauna species.
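To make the role of these parameters concrete, the sketch below estimates how window size, hop length, and frequency range determine the dimensions of a short-time Fourier transform spectrogram. It is plain arithmetic rather than a call into Librosa, and the sample rate, clip length, window size, hop length, and frequency limits are hypothetical values, since the source does not report the settings used.

```python
import math


def spectrogram_shape(n_samples, sample_rate, n_fft, hop_length,
                      f_min_hz, f_max_hz):
    """Estimate frame and frequency-bin counts for an STFT spectrogram.

    Assumes no padding: only windows that fit entirely inside the
    signal are counted. Frequency bins are spaced sample_rate / n_fft
    apart, from 0 Hz up to the Nyquist frequency.
    """
    if n_samples >= n_fft:
        n_frames = 1 + (n_samples - n_fft) // hop_length
    else:
        n_frames = 0
    bin_hz = sample_rate / n_fft
    first_bin = math.ceil(f_min_hz / bin_hz)
    last_bin = min(math.floor(f_max_hz / bin_hz), n_fft // 2)
    n_bins = max(0, last_bin - first_bin + 1)
    return n_frames, n_bins, bin_hz


# Hypothetical settings: 10 s of audio at 22050 Hz, 2048-sample window,
# 512-sample hop, keeping 200 Hz to 4000 Hz.
frames, bins, res = spectrogram_shape(220500, 22050, 2048, 512, 200, 4000)
print(frames, bins, round(res, 2))
```

A longer window gives finer frequency resolution but blurs short calls in time, while a shorter hop gives more frames per second at higher storage cost. Libraries that pad the signal before framing, as Librosa does by default, will report a slightly different frame count.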

Figure 3. Generated mAP of Bucheros Bicornis Spectrogram Detection by YOLOv5 (Source: private documentation)
The mAP value (Figure 3) further validates the model's performance. The mAP represents the average precision across various confidence thresholds, providing a comprehensive evaluation of the model's ability to accurately classify fauna species. Mean Average Precision (mAP) is a performance metric commonly used in object detection tasks to evaluate the accuracy and reliability of a model's predictions. It measures the average precision across multiple classes or categories. The obtained precision of 0.95 indicates that 95% of the identified fauna species were accurately classified. This high precision demonstrates the model's ability to minimize false positives, reducing the risk of misidentifying non-target species or generating erroneous results. The recall value of 0.98 indicates that the model successfully detected 98% of the instances of the target fauna species present in the spectrogram images. This high recall signifies the model's capability to capture the majority of instances, ensuring comprehensive monitoring and conservation efforts.
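The computation behind this metric can be sketched as follows. For one class, detections are ranked by confidence, precision and recall are accumulated down the ranking, and the area under the resulting precision-recall curve gives the average precision (AP); mAP is the mean of AP over classes. This is a simplified illustration with hypothetical detections. It assumes each detection has already been marked as a true or false positive at some overlap threshold, and YOLOv5's own implementation may differ in interpolation details.

```python
def average_precision(scored_hits, n_ground_truth):
    """Area under the precision-recall curve for one class.

    scored_hits: list of (confidence, is_true_positive) per detection.
    Uses all-point interpolation: precision at each recall level is
    replaced by the maximum precision at any higher recall.
    """
    ranked = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    tp = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        tp += 1 if is_tp else 0
        precisions.append(tp / rank)
        recalls.append(tp / n_ground_truth)
    # Make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical detections: (confidence, matched a ground-truth box?).
detections = [(0.9, True), (0.8, False), (0.7, True), (0.6, True)]
ap = average_precision(detections, n_ground_truth=4)
print(round(ap, 4))
```

With a single target class, as in a detector trained for one species, mAP reduces to the AP of that class.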

Figure 10. Number of annotations per class, visualization of the location and size of each bounding box, statistical distribution of bounding box positions, and statistical distribution of bounding box sizes used as input for data training using YOLOv5 (Source: private documentation)

Figure 11. Epochs (Iterations) of Training Data and Train Loss Function Curve of Bucheros Bicornis Spectrogram Detection Performed by YOLOv5 (Source: private documentation)
This innovative application of YOLOv5 in the field of audio signal processing offers several advantages for fauna conservation and habitat monitoring. The algorithm's object detection capabilities are repurposed to identify and localize distinct acoustic boundaries associated with different species' vocalizations. By training the model on annotated spectrogram images, the researchers enable it to recognize and classify species directly from audio recordings, reducing the need for manual observation and analysis. The current research expands the scope of YOLOv5 beyond visual object detection, demonstrating its versatility and adaptability in different domains. By applying the algorithm to audio signal

Table 1. Result data analysis for automated