Paper on enhancing the usability of metadata in the BioSample database using an LLM published in GigaScience.
2025. 07. 04
The research paper “Extraction of biological terms using large language models enhances the usability of metadata in the BioSample database” by Shuya Ikeda and colleagues at DBCLS has been published in GigaScience. The paper is available at the following URL:
https://doi.org/10.1093/gigascience/giaf070
BioSample is a repository of experimental sample metadata. It is a comprehensive archive that enables searches of experiments, regardless of type. However, there is substantial variability in the submitted metadata due to the difficulty in defining comprehensive rules for describing them and the limited user awareness of best practices in creating them. This inconsistency poses considerable challenges to the findability and reusability of archived data. To improve the usability of the data, previous studies attempted to map terms in BioSample records to ontologies, but it was difficult for conventional rule-based methods to accurately grasp the meaning of the terms.
In this study, a large language model was used to extract, based on context, the terms that should be mapped to ontologies. An evaluation on a test set confirmed that both accuracy and coverage improved over a conventional method. Metadata organized semantically by ontologies is expected to improve the findability and reusability of experimental data.