Detect SDGs in text using ensemble model — detect_sdg

detect_sdg identifies SDGs in text using an ensemble model that combines the predictions of multiple existing SDG query systems with text length.

detect_sdg(
  text,
  systems = lifecycle::deprecated(),
  output = lifecycle::deprecated(),
  sdgs = 1:17,
  synthetic = c("equal"),
  verbose = TRUE
)

Arguments

text: character vector or object of class tCorpus containing the text in which SDGs are to be detected. Must not contain any missing values.
systems: As of text2sdg 1.0.0 the `systems` argument of `detect_sdg()` is deprecated. This is because `detect_sdg()` now makes use of an ensemble approach that draws on all systems as well as on the text length; see Wulff et al. (2024) for more information. The old version of `detect_sdg()` is available through the `detect_sdg_systems()` function.
output: As of text2sdg 1.0.0 the `output` argument of `detect_sdg()` is deprecated. This is because `detect_sdg()` now makes use of an ensemble approach that draws on all systems as well as on the text length; see Wulff et al. (2024) for more information. The old version of `detect_sdg()` is available through the `detect_sdg_systems()` function.
sdgs: numeric vector with integers between 1 and 17 specifying the SDGs to identify in text. Defaults to 1:17.
synthetic: character vector specifying the ensemble version to be used. These versions vary in terms of the amount of synthetic data used in training (relative to the amount of expert-labeled data). Can be one or more of "none", "third", "equal", and "triple". The default is "equal".
verbose: logical specifying whether messages on the function's progress should be printed.
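
Because text must not contain missing values, it can help to drop them before the call and keep track of the original positions. The following is a minimal sketch, not part of the package: the input strings are made up for illustration, the `original_position` column is a hypothetical addition, and the mapping assumes that the levels of `document` are the integer indices of the elements of text. The call to `detect_sdg()` only runs if text2sdg is installed.

```r
# Toy input with a missing value; these strings are illustrative only.
texts <- c(
  "Access to clean water and sanitation remains limited in rural areas.",
  NA,
  "New vaccines reduce child mortality from infectious diseases."
)

# detect_sdg() does not accept missing values, so drop them first and
# remember which original positions were kept.
keep <- which(!is.na(texts))
clean_texts <- texts[keep]

if (requireNamespace("text2sdg", quietly = TRUE)) {
  hits <- text2sdg::detect_sdg(
    clean_texts,
    sdgs = c(3, 6),                  # restrict detection to SDG 3 and SDG 6
    synthetic = c("none", "triple"), # request two ensemble versions
    verbose = FALSE                  # suppress progress messages
  )
  # Assumption: `document` levels are "1", "2", ... in input order, so they
  # can be mapped back to positions in the original vector.
  hits$original_position <- keep[as.integer(as.character(hits$document))]
}
```

Passing more than one value to `synthetic` requests several ensemble versions in a single call; the `system` column of the result then indicates which version produced each hit.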

Value

The function returns a tibble containing the SDG hits found in the vector of documents. The columns of the tibble are described below. The tibble also carries an attribute named "system_hits" that holds the predictions of the individual systems produced by detect_sdg_systems().

document: Index of the element in text where the match was found. Formatted as a factor with the number of levels matching the original number of documents.
sdg: Label of the SDG found in document.
system: The name of the ensemble system that produced the match.
hit: Index of the hit for the ensemble model.
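
Since `document` is a factor whose levels cover all input documents, documents without any hit can be recovered from the unused levels. The sketch below illustrates this with a hypothetical helper, `summarise_hits()`, applied to a small hand-made data frame that mimics only the `document` and `sdg` columns; it is a stand-in for real output, not actual results.

```r
# Hypothetical helper: count hits per SDG and list documents without any hit.
summarise_hits <- function(hits) {
  list(
    hits_per_sdg = table(hits$sdg),
    # Levels of `document` that never occur among the hits
    documents_without_hits = setdiff(
      levels(hits$document),
      as.character(hits$document)
    )
  )
}

# Toy stand-in for the returned tibble: four documents, hits in two of them.
toy_hits <- data.frame(
  document = factor(c(1, 1, 3), levels = 1:4),
  sdg = c("SDG-03", "SDG-06", "SDG-03")
)

summary_toy <- summarise_hits(toy_hits)
summary_toy$hits_per_sdg
summary_toy$documents_without_hits
```

On real output, the same helper would be called as `summarise_hits(hits)`, where `hits` is the tibble returned by `detect_sdg()`.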

Details

detect_sdg implements an ensemble model to detect SDGs in text. The ensemble model combines the six systems implemented by detect_sdg_systems and text length in a random forest architecture. The ensemble model has been trained on three data sets with SDG labels assigned by experts and a matching number of synthetic texts generated by random sampling from a word frequency list. The user can choose among multiple versions of the ensemble model that have been trained on different amounts of synthetic texts to adjust the sensitivity and specificity of the model. Increasing the amount of synthetic data makes the ensemble more conservative, leading to decreased sensitivity and increased specificity.

By default, detect_sdg implements the version of the ensemble model that has been trained on an equal amount of expert-labeled and synthetic data, providing a reasonable balance between sensitivity and specificity. For details, see the article by Wulff et al. (2024).
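
One way to explore the effect of the ensemble version is to run several versions at once and compare how many hits each one produces. The following is a minimal sketch under the assumption that text2sdg is installed and that the bundled `projects` example texts are available after attaching the package; the variable name `versions` is chosen here for illustration.

```r
# Ensemble versions to compare; all are valid values of `synthetic`.
versions <- c("none", "equal", "triple")

if (requireNamespace("text2sdg", quietly = TRUE)) {
  library(text2sdg)  # makes the `projects` example texts available

  hits <- detect_sdg(projects, synthetic = versions, verbose = FALSE)

  # One row per hit; `system` names the ensemble version that produced it,
  # so this tabulates the number of hits per version.
  print(table(hits$system))
}
```

Comparing these counts for a sample of one's own texts can inform the choice of `synthetic` before processing a full corpus.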

References

Wulff, D. U., Meier, D. S., & Mata, R. (2024). Using novel data and ensemble models to improve automated labeling of Sustainable Development Goals. Sustainability Science. https://doi.org/10.1007/s11625-024-01516-3

Examples

# \donttest{
# run sdg detection
hits <- detect_sdg(projects)
#> Running systems
#> Obtaining text lengths
#> Building features
#> Running ensemble

# run sdg detection for sdg 3 only
hits <- detect_sdg(projects, sdgs = 3)
#> Running systems
#> Obtaining text lengths
#> Building features
#> Running ensemble

# extract the hits of the individual systems
attr(hits, "system_hits")
#> # A tibble: 979 × 5
#>    document sdg    system   n_hit features                                      
#>    <fct>    <chr>  <chr>    <int> <chr>                                         
#>  1 1        SDG-03 Auckland     1 tuberculosis, human, tuberculosis, disease    
#>  2 1        SDG-03 Elsevier     1 tuberculosis, human, tuberculosis, disease    
#>  3 1        SDG-03 SDGO         7 antibiotics, bacteria, disease, infection, in…
#>  4 1        SDG-03 SDSN         1 tuberculosis                                  
#>  5 2        SDG-03 Auckland     1 SARs                                          
#>  6 3        SDG-03 Auckland     1 immunology, medicine                          
#>  7 3        SDG-03 SDGO         7 disease, disease, states, epigenetic, fetal, …
#>  8 3        SDG-03 SDSN         1 health                                        
#>  9 6        SDG-03 Auckland     1 cancer                                        
#> 10 6        SDG-03 Elsevier     1 cancer                                        
#> # ℹ 969 more rows
# }