Journal article
More than 17,000 tree species are at risk from rapid global change
- Abstract:
- Funding Information: AC was supported by a grant (PRT/BD/152100/2021) financed by the Portuguese Foundation for Science and Technology (FCT) under the MIT Portugal Program. AC and CC acknowledge support from FCT through support to the CEG/IGOT Research Unit (UIDB/00295/2020 and UIDP/00295/2020). JP was supported by FCT funding to GHTM (UID/04413/2020). LR was funded through the FCT contract ‘CEECIND/00445/2017’ under the ‘Stimulus of Scientific Employment, Individual Support’ and by the FCT ‘UNRAVEL’ project (PTDC/BIA-ECO/0207/2020; https://doi.org/10.54499/PTDC/BIA-ECO/0207/2020). PP acknowledges support from the Czech Science Foundation (project no. 23-07278S). Publisher Copyright: © 2024 The Authors
- The vast volume of currently available unstructured text data, such as research papers, news, and technical report data, shows great potential for ecological research. However, manual processing of such data is labour-intensive, posing a significant challenge. In this study, we aimed to assess the application of three state-of-the-art prompt-based large language models (LLMs), GPT-3.5, GPT-4, and LLaMA-2-70B, to automate the identification, interpretation, extraction, and structuring of relevant ecological information from unstructured textual sources. We focused on species distribution data from two sources: news outlets and research papers. We assessed the LLMs on four key tasks: classification of documents with species distribution data, identification of regions where species are recorded, generation of geographical coordinates for these regions, and supply of results in a structured format. GPT-4 consistently outperformed the other models, demonstrating a high capacity to interpret textual data and extract relevant information, with the percentage of correct outputs often exceeding 90% (average accuracy across tasks: 87–100%). Its performance also depended on the data source type and task, with better results achieved with news reports, in the identification of regions with species reports, and in the presentation of structured output. Its predecessor, GPT-3.5, exhibited slightly lower accuracy across all tasks and data sources (average accuracy across tasks: 81–97%), whereas LLaMA-2-70B showed the worst performance (37–73%). These results demonstrate the potential benefit of integrating prompt-based LLMs into ecological data assimilation workflows as essential tools to efficiently process large volumes of textual data.
- Publication status:
- Published
- Peer review status:
- Peer reviewed
Access Document
- Files:
- (Version of record, pdf, 9.5MB)
- Publisher copy:
- 10.1038/s41467-023-44321-9
Authors
+ Danmarks Grundforskningsfond
- Funder identifier:
- 10.13039/501100001732
- Grant:
- DNRF173
+ NSF | National Science Board
- Funder identifier:
- 10.13039/100005716
- Grant:
- 2225076
+ Agence Nationale de la Recherche
- Funder identifier:
- 10.13039/501100001665
- Grant:
- ANR-21-CE32-0003
- Publisher:
- Nature Research
- Journal:
- Nature Communications
- Volume:
- 15
- Issue:
- 1
- Pages:
- 166-166
- Article number:
- 166
- Publication date:
- 2024-01-02
- DOI:
- EISSN:
- 2041-1723
- ISSN:
- 2041-1723
- Language:
- English
- Keywords:
- Pubs id:
- 1595481
- Local pid:
- pubs:1595481
- Source identifiers:
- W4390511178
- Deposit date:
- 2026-06-04
- ARK identifier:
This ORA record was generated from metadata provided by an external service. It has not been edited by the ORA Team.
Terms of use
- Copyright date:
- 2024
- Licence:
- CC Attribution (CC BY)