Journal article

More than 17,000 tree species are at risk from rapid global change

Abstract:
The vast volume of currently available unstructured text data, such as research papers, news, and technical reports, shows great potential for ecological research. However, manual processing of such data is labour-intensive, posing a significant challenge. In this study, we aimed to assess the application of three state-of-the-art prompt-based large language models (LLMs), GPT-3.5, GPT-4, and LLaMA-2-70B, to automate the identification, interpretation, extraction, and structuring of relevant ecological information from unstructured textual sources. We focused on species distribution data from two sources: news outlets and research papers. We assessed the LLMs on four key tasks: classification of documents with species distribution data, identification of regions where species are recorded, generation of geographical coordinates for these regions, and supply of results in a structured format. GPT-4 consistently outperformed the other models, demonstrating a high capacity to interpret textual data and extract relevant information, with the percentage of correct outputs often exceeding 90% (average accuracy across tasks: 87–100%). Its performance also depended on the data source type and task, with better results for news reports and for the tasks of identifying regions with species records and presenting structured output. Its predecessor, GPT-3.5, exhibited slightly lower accuracy across all tasks and data sources (average accuracy across tasks: 81–97%), whereas LLaMA-2-70B showed the worst performance (37–73%). These results demonstrate the potential benefit of integrating prompt-based LLMs into ecological data assimilation workflows as essential tools to efficiently process large volumes of textual data.

Funding Information: AC was supported by a grant (PRT/BD/152100/2021) financed by the Portuguese Foundation for Science and Technology (FCT) under the MIT Portugal Program. AC and CC acknowledge support from FCT through support to the CEG/IGOT Research Unit (UIDB/00295/2020 and UIDP/00295/2020). JP was funded by FCT through funds to GHTM (UID/04413/2020). LR was funded through the FCT contract ‘CEECIND/00445/2017’ under the ‘Stimulus of Scientific Employment - Individual Support’ and by the FCT ‘UNRAVEL’ project (PTDC/BIA-ECO/0207/2020; https://doi.org/10.54499/PTDC/BIA-ECO/0207/2020). PP acknowledges support from the Czech Science Foundation (project no. 23-07278S).

Publisher Copyright: © 2024 The Authors
Publication status:
Published
Peer review status:
Peer reviewed

Authors

Role:
Author
ORCID:
0000-0003-2417-1579

Role:
Author
ORCID:
0000-0003-1988-1154

Role:
Author
ORCID:
0000-0001-5619-3233

Role:
Author
ORCID:
0000-0002-6124-7096


Funder identifier:
10.13039/501100001732
Grant:
DNRF173

Funder identifier:
10.13039/100005716
Grant:
2225076

Funder identifier:
10.13039/501100001665
Grant:
ANR-21-CE32-0003

Funder identifier:
10.13039/501100000275

Funder identifier:
10.13039/100002158


Publisher:
Nature Research
Journal:
Nature Communications
Volume:
15
Issue:
1
Pages:
166-166
Article number:
166
Publication date:
2024-01-02
DOI:
EISSN:
2041-1723
ISSN:
2041-1723


Language:
English
Keywords:
Pubs id:
1595481
Local pid:
pubs:1595481
Source identifiers:
W4390511178
Deposit date:
2026-06-04
ARK identifier:
This ORA record was generated from metadata provided by an external service. It has not been edited by the ORA Team.