| Title: | Detecting UN Sustainable Development Goals in Text |
|---|---|
| Description: | The United Nations' Sustainable Development Goals (SDGs) have become an important guideline for organisations to monitor and plan their contributions to social, economic, and environmental transformations. The 'text2sdg' package is an open-source analysis package that identifies SDGs in text using scientifically developed query systems, opening up the opportunity to monitor any type of text-based data, such as scientific output or corporate publications. For more information see Meier, Mata & Wulff (2025) <doi:10.32614/RJ-2024-005> and Wulff, Meier & Mata (2024) <doi:10.1007/s11625-024-01516-3>. |
| Authors: | Dirk U. Wulff [aut] (ORCID: <https://orcid.org/0000-0002-4008-8022>), Dominik S. Meier [aut, cre] (ORCID: <https://orcid.org/0000-0002-3999-1388>), Rui Mata [ctb] (ORCID: <https://orcid.org/0000-0002-1679-906X>) |
| Maintainer: | Dominik S. Meier <[email protected]> |
| License: | GPL-3 |
| Version: | 1.1.2 |
| Built: | 2026-06-05 08:11:47 UTC |
| Source: | https://github.com/dwulff/text2sdg |
A dataset containing the SDG queries of the University of Auckland (version 1). The queries are available from https://www.sdgmapping.auckland.ac.nz/. The Auckland queries build on the processes developed by the United Nations and the Times Higher Education Ranking to create an expanded list of keywords that can be used to identify SDG-relevant research. There is one query per SDG. There are no queries for SDG-17.
auckland_queries
A data frame with 16 rows and 4 columns
Name of system
Label of the SDG
Index of the query
SDG query
https://www.sdgmapping.auckland.ac.nz/
A dataset containing the SDG queries version 5.0 of the Aurora Universities Network. See the corresponding GitHub repository. For the actual implementation of the queries see aurora_simple, aurora_and, aurora_w, and the queries hard-coded in detect_aurora. There are multiple queries per SDG (one per row). In comparison to previous versions, this version of the Aurora queries added more keywords related to academic terminology in order to detect more research papers related to the SDGs. The current version also drew inspiration from the SIRIS query system (siris_queries). The Aurora queries were designed to be precise rather than sensitive. To achieve this, the queries make use of complex keyword combinations using several different logical search operators. All SDGs (1-17) are covered.
aurora_queries
A data frame with 373 rows and 5 columns
Name of system
Label of the SDG
Title of the SDG
Description of the SDG
Index of the query
Original SDG query
https://github.com/Aurora-Network-Global/sdg-queries/releases/tag/v5.0
crosstab_sdg calculates cross tables (aka contingency tables) of SDGs or systems across hits identified via detect_sdg_systems.
crosstab_sdg(hits, compare = c("systems", "sdgs"), systems = NULL, sdgs = NULL)
hits |
|
compare |
|
systems |
|
sdgs |
|
crosstab_sdg determines correlations between either query systems or SDGs; the other dimension is ignored. Note that correlations between SDGs may vary between query systems.
A matrix showing correlation coefficients for all pairs of query systems (if compare = "systems") or SDGs (if compare = "sdgs").

    # run sdg detection
    hits <- detect_sdg_systems(projects)

    # create cross table of systems
    crosstab_sdg(hits)

    # create cross table of SDGs
    crosstab_sdg(hits, compare = "sdgs")
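To make the returned correlation matrix more concrete, the following base R sketch computes pairwise correlations between binary hit indicators on a toy data set. The data, the system names "A" to "C", and the choice of Pearson correlation on 0/1 indicators (the phi coefficient) are assumptions for illustration only; the source does not state how crosstab_sdg computes its coefficients.

```r
# Toy hits (hypothetical): which system flagged which document.
toy_hits <- data.frame(
  document = c(1, 1, 2, 3, 3, 4),
  system   = c("A", "B", "A", "B", "C", "C")
)

# Document-by-system indicator matrix (1 = at least one hit, 0 = none).
tab <- table(toy_hits$document, toy_hits$system)
indicator <- matrix(
  as.integer(tab > 0),
  nrow = nrow(tab),
  dimnames = dimnames(tab)
)

# Pearson correlation of binary indicators equals the phi coefficient.
phi <- cor(indicator)
round(phi, 2)
```

In this toy example, systems A and C never flag the same document, so their coefficient is -1, while A and B agree exactly as often as chance would predict, giving 0.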
detect_any identifies SDGs in text using a user-provided query system. It works like detect_sdg_systems but applies a user-specified query system instead of one of the built-in systems.
detect_any(text, system, queries = lifecycle::deprecated(), sdgs = NULL, output = c("features", "documents"), verbose = TRUE)
text |
|
system |
a data frame that must contain the following variables: a |
queries |
deprecated. |
sdgs |
|
output |
|
verbose |
|
The function returns a tibble containing the SDG hits found in the vector of documents. Depending on the value of output the tibble will contain all or some of the following columns:
Index of the element in text where match was found. Formatted as a factor with the number of levels matching the original number of documents.
Label of the SDG found in document.
The name of the query system that produced the match.
Index of the query within the query system that produced the match.
Concatenated list of words that caused the query to match.
Index of hit for a given system.

    # create data frame with query system
    my_queries <- tibble::tibble(
      system = "my_system",
      query = c(
        "theory",
        "analysis OR analyses OR analyzed",
        "study AND hypothesis"
      ),
      sdg = c(1, 2, 2)
    )

    # run sdg detection with own query system
    hits <- detect_any(projects, my_queries)

    # run sdg detection for sdg 2 only
    hits <- detect_any(projects, my_queries, sdgs = 2)
detect_sdg identifies SDGs in text using an ensemble model approach considering multiple existing SDG query systems and text length.
detect_sdg(text, systems = lifecycle::deprecated(), output = lifecycle::deprecated(), sdgs = 1:17, synthetic = c("equal"), verbose = TRUE)
text |
|
systems |
As of text2sdg 1.0.0 the 'systems' argument of 'detect_sdg()' is deprecated. This is because 'detect_sdg()' now makes use of an ensemble approach that draws on all systems as well as on the text length, see –preprint– for more information. The old version of 'detect_sdg()' is available through the 'detect_sdg_systems()' function. |
output |
As of text2sdg 1.0.0 the 'output' argument of 'detect_sdg()' is deprecated. This is because 'detect_sdg()' now makes use of an ensemble approach that draws on all systems as well as on the text length, see –preprint– for more information. The old version of 'detect_sdg()' is available through the 'detect_sdg_systems()' function. |
sdgs |
|
synthetic |
|
verbose |
|
detect_sdg implements an ensemble model to detect SDGs in text. The ensemble model combines the six systems implemented by detect_sdg_systems and text length in a random forest architecture. The ensemble model has been trained on three data sets with SDG labels assigned by experts and a matching number of synthetic texts generated by random sampling from a word frequency list. The user can choose among multiple versions of the ensemble model that have been trained on different amounts of synthetic texts to adjust the sensitivity and specificity of the model. Increasing the amount of synthetic data makes the ensemble more conservative, leading to decreased sensitivity and increased specificity.
By default, detect_sdg implements the version of the ensemble model that has been trained on an equal amount of expert-labeled and synthetic data, providing a reasonable balance between sensitivity and specificity. For details, see the article by Wulff et al. (2024).
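As a reminder of what the two metrics measure, the following base R sketch computes sensitivity and specificity for two made-up classifiers, one that flags many texts and one that flags few. The labels, predictions, and helper names are hypothetical and do not come from the ensemble model or its training data.

```r
# Hypothetical gold labels for ten texts (1 = SDG present, 0 = absent).
truth <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)

# Two hypothetical classifiers.
flags_many <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0)
flags_few  <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)

# Sensitivity: share of true SDG texts that were detected.
sensitivity <- function(truth, pred) {
  sum(pred == 1 & truth == 1) / sum(truth == 1)
}

# Specificity: share of non-SDG texts that were correctly left unflagged.
specificity <- function(truth, pred) {
  sum(pred == 0 & truth == 0) / sum(truth == 0)
}

data.frame(
  classifier  = c("flags_many", "flags_few"),
  sensitivity = c(sensitivity(truth, flags_many), sensitivity(truth, flags_few)),
  specificity = c(specificity(truth, flags_many), specificity(truth, flags_few))
)
```

The classifier that flags many texts finds every true SDG text but also mislabels some non-SDG texts, whereas the one that flags few texts makes no false alarms but misses half of the true cases.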
The function returns a tibble containing the SDG hits found in the vector of documents. The columns of the tibble are described below. The tibble also carries an attribute named "system_hits" that holds the predictions of the individual systems produced by detect_sdg_systems().
Index of the element in text where match was found. Formatted as a factor with the number of levels matching the original number of documents.
Label of the SDG found in document.
The name of the ensemble system that produced the match.
Index of hit for the Ensemble model.
Wulff, D. U., Meier, D. S., & Mata, R. (2024). Using novel data and ensemble models to improve automated labeling of Sustainable Development Goals. Sustainability Science. https://doi.org/10.1007/s11625-024-01516-3

    # run sdg detection
    hits <- detect_sdg(projects)

    # run sdg detection for sdg 3 only
    hits <- detect_sdg(projects, sdgs = 3)

    # extract systems hits
    attr(hits, "system_hits")
detect_sdg_systems identifies SDGs in text using multiple SDG query systems.
detect_sdg_systems(text, systems = c("Aurora", "Elsevier", "Auckland", "SIRIS"), sdgs = 1:17, output = c("features", "documents"), verbose = TRUE)
text |
|
systems |
|
sdgs |
|
output |
|
verbose |
|
detect_sdg_systems implements six SDG query systems. Four systems, developed by the Aurora Universities Network (see aurora_queries), Elsevier (see elsevier_queries), the University of Auckland (see auckland_queries), and SIRIS Academic (see siris_queries), rely on Lucene-style Boolean queries, whereas two systems, namely SDGO (see sdgo_queries) and SDSN (see sdsn_queries), rely on basic keyword matching. detect_sdg_systems calls a dedicated detect_* function for each of the six systems. The queries are searched using the search_features function from the corpustools package.
By default, detect_sdg_systems runs only the Aurora, Elsevier, Auckland, and SIRIS query systems, as they are considerably less liberal than the SDSN and SDGO systems and are therefore likely to produce more valid SDG classifications. Users should be aware that systematic validations and comparisons between the systems are largely lacking and that results should be interpreted with caution.
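The difference between basic keyword matching and Boolean queries can be sketched in base R as follows. This is a simplified illustration using grepl, not the corpustools-based search the package relies on; the helper has_term, the example documents, and the example queries are all hypothetical.

```r
# Hypothetical helper: TRUE if `term` occurs as a whole word in `text`.
has_term <- function(text, term) {
  grepl(paste0("\\b", term, "\\b"), text, ignore.case = TRUE, perl = TRUE)
}

docs <- c(
  "We study poverty reduction in rural regions.",
  "A study of hypothesis testing in physics.",
  "Marine ecosystems and ocean acidification."
)

# Basic keyword matching: a single term decides.
keyword_hit <- has_term(docs, "poverty")

# Boolean query "study AND hypothesis": both terms must occur.
and_hit <- has_term(docs, "study") & has_term(docs, "hypothesis")

# Boolean query "ocean OR marine": either term suffices.
or_hit <- has_term(docs, "ocean") | has_term(docs, "marine")

data.frame(keyword_hit, and_hit, or_hit)
```

The first document contains "study" but not "hypothesis", so the AND query rejects it even though a plain keyword match on "study" would have accepted it. This is the sense in which Boolean systems tend to be less liberal than keyword lists.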
The function returns a tibble containing the SDG hits found in the vector of documents. The columns of the tibble depend on the value of output. Possible columns are:
Index of the element in text where match was found. Formatted as a factor with the number of levels matching the original number of documents.
Label of the SDG found in document.
The name of the query system that produced the match.
Index of the query within the query system that produced the match.
Concatenated list of words that caused the query to match.
Index of hit for a given system.
Number of queries that produced a hit for a given system, sdg, and document.

    # run sdg detection
    hits <- detect_sdg_systems(projects)

    # run sdg detection with Aurora only
    hits <- detect_sdg_systems(projects, systems = "Aurora")

    # run sdg detection for sdg 3 only
    hits <- detect_sdg_systems(projects, sdgs = 3)
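Once hits are available, they can be summarised with ordinary data frame tools. The sketch below uses a hand-made stand-in for the returned tibble, so it runs without the package. The column names document, sdg, and system, as well as the label format "SDG-03", are assumptions inferred from the column descriptions above; check names(hits) on real output before relying on them.

```r
# Hand-made stand-in for the hits tibble (column names are assumptions).
# `document` is a factor whose levels cover all four original documents.
toy_hits <- data.frame(
  document = factor(c(1, 1, 2, 2, 3), levels = 1:4),
  sdg      = c("SDG-03", "SDG-03", "SDG-03", "SDG-13", "SDG-13"),
  system   = c("Aurora", "Elsevier", "Aurora", "SIRIS", "SIRIS")
)

# Number of hits per SDG and query system.
counts <- table(toy_hits$sdg, toy_hits$system)
counts

# Because `document` keeps a level for every original document,
# documents without any hit can still be identified.
docs_without_hits <- names(which(table(toy_hits$document) == 0))
docs_without_hits
```

Storing the document index as a factor with one level per input document is what allows the last step: a plain integer column would silently drop documents in which nothing was found.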
A dataset containing the SDG queries of Elsevier (version 1). The queries are available from data.mendeley.com. The Elsevier queries were developed to maximize SDG hits on the Scopus database. A detailed description of how each SDG query was developed can be found here. There is one query per SDG. There are no queries for SDG-17.
elsevier_queries
A data frame with 16 rows and 4 columns
Name of system
Label of the SDG
Index of the query
SDG query
https://data.mendeley.com/datasets/87txkw7khs/1
plot_sdg creates a (stacked) barplot of the frequency distribution of SDGs identified via detect_sdg or detect_sdg_systems.
plot_sdg(hits, systems = NULL, sdgs = NULL, normalize = "none", color = "unibas", sdg_titles = FALSE, remove_duplicates = TRUE, ...)
hits |
|
systems |
|
sdgs |
|
normalize |
|
color |
|
sdg_titles |
|
remove_duplicates |
|
... |
arguments passed to |
The function is built using ggplot and can thus be flexibly extended. See examples.
The function returns a ggplot object that can either be stored in an object or printed to produce the plot.

    # run sdg detection
    hits <- detect_sdg_systems(projects)

    # create barplot
    plot_sdg(hits)

    # create barplot with facets
    plot_sdg(hits) + ggplot2::facet_wrap(~system)
500 project descriptions of University of Basel research projects that were funded by the Swiss National Science Foundation. The project descriptions were drawn randomly from University of Basel projects listed in the public data.snf.ch project database.
projects
A character vector of length 500.
https://data.snf.ch/about/glossary
A dataset containing the SDG queries based on the keyword ontology by OSDG. The queries are available from figshare.com.
sdgo_queries
A data frame with 4,122 rows and 5 columns
Name of system
Label of the SDG
SDG keyword used in query
Index of the query
SDG query
Bautista-Puig, N.; Mauleón E. (2019). Unveiling the path towards sustainability: is there a research interest on sustainable goals? In the 17th International Conference on Scientometrics & Informetrics (ISSI 2019), Rome (Italy), Volume II, ISBN 978-88-3381-118-5, p.2770-2771. The authors of these queries first created an ontology from central keywords in the SDG UN description and expanded these keywords with keywords they identified in SDG related research output. There are multiple queries per SDG. All SDGs (1-17) are covered.
https://figshare.com/articles/dataset/SDG_ontology/11106113/1
A dataset containing SDG-specific keywords compiled from several universities from the Sustainable Development Solutions Network (SDSN) Australia, New Zealand & Pacific Network. The authors used UN documents, Google searches and personal communications as sources for the keywords. All SDGs (1-17) are covered.
sdsn_queries
A data frame with 847 rows and 5 columns
Name of system
Label of the SDG
SDG keyword used in query
Index of the query
SDG query
https://ap-unsdsn.org/regional-initiatives/universities-sdgs/
A dataset containing the SDG queries of SIRIS Academic. The queries are available from Zenodo.org. The SIRIS queries were developed by extracting key terms from the UN official list of goals, targets and indicators as well as from relevant literature around SDGs. The query system has subsequently been expanded with a pre-trained word2vec model and an algorithm that selects related words from Wikipedia. There are multiple queries per SDG (one per row). There are no queries for SDG-17.
siris_queries
A data frame with 3,445 rows and 6 columns
Name of system
Label of the SDG
Primary SDG query element
Secondary SDG query element
Index of the query
SDG query
https://zenodo.org/record/3567769#.YVMhH9gzYUG