Check input list against a reference list and perform fuzzy matching
rfs_update_list.Rd
This function takes an input list and a reference list, and performs a series of checks and fuzzy matching operations to create a verified list. The function checks if the input list is a data frame, if it has a column named "species", if the reference list is a data frame, and if the reference list has the mandatory columns. It also allows for specifying the method and maximum distance for fuzzy matching. The function returns a final data frame with the verified list.
Usage
rfs_update_list(
input_list,
reference_list,
complete_name = FALSE,
method = "lv",
max_dist = 3,
max_dist_2 = 5
)
Arguments
- input_list
A data frame representing the input list to be checked. Must contain a column named "species". Species name like genus + "sp." should be avoided. If you want to use complete species names (genus + species
authors + year) as key, you should avoid using comma to separate authors and year, and use space instead. For example: 'Genus species Author 0000'.
- reference_list
A data frame representing the reference list to compare against. This is obtained using rFishStatus::rfs_get_species_list().
- complete_name
A logical value indicating whether to use the species + author + year nomenclature (please, avoid using comma before the year), TRUE, or genus + epithet, FALSE, for joining. Default is FALSE.
- method
The method to use for fuzzy matching. Default is "lv" (Levenshtein distance).
- max_dist
The maximum distance for fuzzy matching. Default is 3. The user can change this value according to the desired string distance; this will affect the number of matches and false positives, and will vary according to the input list. Most of cases between 2 and 3 are good values when using species names, and 4 to 5 when using complete names. Please, check the fuzzy matching results carefully.
- max_dist_2
The maximum dostance for deeper fuzzy matching. You can keep the same as of "max_dist" or increase it by 1 or 2. This can improve query_species solving but also increase false positives.
Examples
input_list <- data.frame(
species = c("Acanthochromis polyacanthus", "Acanthochromis sp.", "Cichla kelberi", "Cichla piquitii",
"Cichla monocolos", "Cichla monoculus", "Cicla monoculus")
)
reference_list <- rFishStatus::rfs_get_species_list(
rFishStatus::data_template_ref
)
#> ℹ Creating species list from the input dataset.
#> Looking for uncertain species.
#> Looking for valid species.
#> Looking for all previous valid species names.
#> Filtering valid species.
#> Error in dplyr::filter(dplyr::mutate(dplyr::mutate(species_df, genus = stringr::str_extract(scientific_name, "\\w+"), epithet = stringr::str_extract(scientific_name, stringr::regex("(?<=\\s)[a-z]+(\\s(?![a-z]*\\.)[a-z]+)*"))), species = paste(genus, epithet, sep = " "), year = as.numeric(stringr::str_extract(scientific_name_author, "\\d{4}")), authors = stringr::str_replace(scientific_name_author, paste0(" ", year), ""), scientific_name = stringr::str_replace_all(scientific_name, "\\[.*?\\]", ""), scientific_name = stringr::str_trim(scientific_name, side = "both"), status = dplyr::if_else(scientific_name %in% valid_spp_list, "Valid", status), status = dplyr::if_else(scientific_name == valid_scientific_name, "Valid", status)), !(status == "Valid" & scientific_name != valid_scientific_name), year != "Invalid Number", year <= lubridate::year(Sys.Date()), !grepl("^\\w+ & \\w+ \\d{4}$", scientific_name), !grepl("^[a-z]", scientific_name), !stringr::str_detect(scientific_name, " \\[ref. "), !stringr::str_detect(scientific_name, "^\\["), !stringr::str_detect(scientific_name, ";")): ℹ In argument: `year <= lubridate::year(Sys.Date())`.
#> Caused by error in `loadNamespace()`:
#> ! there is no package called ‘lubridate’
checked_list <- rfs_update_list(input_list, reference_list)
#> Error: object 'reference_list' not found