library(qpdt)

Disclaimer

Pedigrees that are analysed by the software system PopReport must fullfill certain properties and must be formatted according to certain rules. The proporties and the fullfillment of the rules are checked by this package. The checks that are implemented in this package are described in this vignette.

Checks and Diagnoses

The checks of the pedigrees are based on the following points

  • number of records
  • number of unique IDs
  • number of animals with parents
  • empirical distribution of birthdates
  • empirical distribution of sexes
  • empirical distribution of postal codes

Uniqueness of Animal IDs

Animal IDs are the primary keys of an individual. Hence these IDs must be unique. This uniqueness is tested with the following statement using the function check_pedig_id(). This function requires the path to an input pedigree as an argument. This package contains a number of test pedigrees which can be tested by the following statement.

# get path to test pedigrees
(vec_ped_path <- list.files(system.file('extdata', package = 'qpdt'), full.names = TRUE))
#> [1] "/private/var/folders/2v/jfsqj8zj2f122jcgy15nzfn00000gn/T/Rtmp4wSii3/temp_libpath909f297b1eee/qpdt/extdata/data_sample_cycle.csv"                       
#> [2] "/private/var/folders/2v/jfsqj8zj2f122jcgy15nzfn00000gn/T/Rtmp4wSii3/temp_libpath909f297b1eee/qpdt/extdata/data_sample.csv"                             
#> [3] "/private/var/folders/2v/jfsqj8zj2f122jcgy15nzfn00000gn/T/Rtmp4wSii3/temp_libpath909f297b1eee/qpdt/extdata/data_sample2.csv"                            
#> [4] "/private/var/folders/2v/jfsqj8zj2f122jcgy15nzfn00000gn/T/Rtmp4wSii3/temp_libpath909f297b1eee/qpdt/extdata/PopReport_GG_mit_20210113.csv_adaptfin2.csv" 
#> [5] "/private/var/folders/2v/jfsqj8zj2f122jcgy15nzfn00000gn/T/Rtmp4wSii3/temp_libpath909f297b1eee/qpdt/extdata/PopReport_SN_ohne_20210115.csv_adaptfin2.csv"

The first pedigree is the test-pedigree that can be obtained from the GenMon-Website. The difference between the first and the second test-pedigree is that the first has an animal with ‘F’ as a sire. In the second pedigree, this sire is removed. The first pedigree is checked by the following statement.

check_pedig_id(ps_pedig_path = vec_ped_path[2],
               ps_id_col = '#animal')
#> Warning: `guess_max` is a very large value, setting to `21474836` to avoid
#> exhausting memory
#> $PedFile
#> [1] "/private/var/folders/2v/jfsqj8zj2f122jcgy15nzfn00000gn/T/Rtmp4wSii3/temp_libpath909f297b1eee/qpdt/extdata/data_sample.csv"
#> 
#> $NrRecord
#> [1] 1227
#> 
#> $NrAnimals
#> [1] 1227
#> 
#> $TblDuplicates
#> NULL

The output of the test-pedigree indicates that all animal IDs are unique.

The checks of the remaining pedigrees are run by applying the function check_pedig_id to all the pedigrees.

# run checks
lapply(vec_ped_path[3:length(vec_ped_path)], check_pedig_id)
#> Warning: `guess_max` is a very large value, setting to `21474836` to avoid
#> exhausting memory

#> Warning: `guess_max` is a very large value, setting to `21474836` to avoid
#> exhausting memory

#> Warning: `guess_max` is a very large value, setting to `21474836` to avoid
#> exhausting memory
#> [[1]]
#> [[1]]$PedFile
#> [1] "/private/var/folders/2v/jfsqj8zj2f122jcgy15nzfn00000gn/T/Rtmp4wSii3/temp_libpath909f297b1eee/qpdt/extdata/data_sample2.csv"
#> 
#> [[1]]$NrRecord
#> [1] 1227
#> 
#> [[1]]$NrAnimals
#> [1] 0
#> 
#> [[1]]$TblDuplicates
#> NULL
#> 
#> 
#> [[2]]
#> [[2]]$PedFile
#> [1] "/private/var/folders/2v/jfsqj8zj2f122jcgy15nzfn00000gn/T/Rtmp4wSii3/temp_libpath909f297b1eee/qpdt/extdata/PopReport_GG_mit_20210113.csv_adaptfin2.csv"
#> 
#> [[2]]$NrRecord
#> [1] 172076
#> 
#> [[2]]$NrAnimals
#> [1] 172063
#> 
#> [[2]]$TblDuplicates
#> # A tibble: 32 x 9
#>    `#IDTier`   IDVater IDMutter Birthdate  Geschlecht   PLZ introg inb_gen  cryo
#>        <dbl>     <dbl>    <dbl> <date>     <chr>      <dbl>  <dbl> <lgl>   <dbl>
#>  1     99954 999532045   9.99e8 1996-01-26 F           3550      0 NA          0
#>  2     99954 999555626   1.00e9 1998-03-08 F           3274      0 NA          0
#>  3     99950 999434122   1.00e9 1981-04-04 F           3052      0 NA          0
#>  4     99950 999558998   1.00e9 2002-03-12 F           3537      0 NA          0
#>  5     99956 999555626   1.00e9 1998-03-08 F           3274      0 NA          0
#>  6     99956 999466991   1.00e9 2003-02-02 F           3152      0 NA          0
#>  7     99956 999560618   1.00e9 2004-02-14 F           3619      0 NA          0
#>  8     99957 999467818   1.00e9 2000-04-10 F           3537      0 NA          0
#>  9     99957 999466991   1.00e9 2003-02-04 M           3152      0 NA          0
#> 10     99958 999555626   1.00e9 1998-03-05 F           3252      0 NA          0
#> # … with 22 more rows
#> 
#> 
#> [[3]]
#> [[3]]$PedFile
#> [1] "/private/var/folders/2v/jfsqj8zj2f122jcgy15nzfn00000gn/T/Rtmp4wSii3/temp_libpath909f297b1eee/qpdt/extdata/PopReport_SN_ohne_20210115.csv_adaptfin2.csv"
#> 
#> [[3]]$NrRecord
#> [1] 95119
#> 
#> [[3]]$NrAnimals
#> [1] 95119
#> 
#> [[3]]$TblDuplicates
#> NULL

The output of the second pedigree shows that all animal IDs are unique. The output of the first pedigree shows a list of duplicated IDs. They must be removed.

Parents

Interesting quantities related to parents are

  • number of animals with parents
  • are parents also animals
  • parent and offspring pairs with inconsistent birthdates

The following function call checks the properties of parents for the first test pedigree /private/var/folders/2v/jfsqj8zj2f122jcgy15nzfn00000gn/T/Rtmp4wSii3/temp_libpath909f297b1eee/qpdt/extdata/data_sample2.csv.

check_pedig_parent(ps_pedig_path = vec_ped_path[3],
                   ps_id_col        = '#animal',
                   ps_sire_col      = 'sire',
                   ps_dam_col       = 'dam',
                   ps_bd_col        = 'birth_date',
                   ps_sex_col       = 'sex')
#> Warning: `guess_max` is a very large value, setting to `21474836` to avoid
#> exhausting memory
#> $PedFile
#> [1] "/private/var/folders/2v/jfsqj8zj2f122jcgy15nzfn00000gn/T/Rtmp4wSii3/temp_libpath909f297b1eee/qpdt/extdata/data_sample2.csv"
#> 
#> $NrMissingSire
#> [1] 2
#> 
#> $NrMissingDam
#> [1] 1
#> 
#> $NrSireNotAnimal
#> [1] 137
#> 
#> $NrDamNotAnimal
#> [1] 330
#> 
#> $TblSireBdate
#> # A tibble: 0 x 4
#> # … with 4 variables: `#animal` <dbl>, sire <dbl>, `birth_date.#animal` <date>,
#> #   birth_date.sire <date>
#> 
#> $TblDamBdate
#> # A tibble: 0 x 4
#> # … with 4 variables: `#animal` <dbl>, dam <dbl>, `birth_date.#animal` <date>,
#> #   birth_date.dam <date>
#> 
#> $TblSireEqID
#> # A tibble: 0 x 9
#> # … with 9 variables: `#animal` <dbl>, sire <dbl>, dam <dbl>,
#> #   birth_date <date>, sex <chr>, plz <dbl>, introg <dbl>, inb_gen <lgl>,
#> #   cryo <lgl>
#> 
#> $TblDamEqID
#> # A tibble: 0 x 9
#> # … with 9 variables: `#animal` <dbl>, sire <dbl>, dam <dbl>,
#> #   birth_date <date>, sex <chr>, plz <dbl>, introg <dbl>, inb_gen <lgl>,
#> #   cryo <lgl>
#> 
#> $TblSireWrongSex
#> # A tibble: 0 x 2
#> # … with 2 variables: sire <dbl>, sex <chr>
#> 
#> $TblDamWrongSex
#> # A tibble: 0 x 2
#> # … with 2 variables: dam <dbl>, sex <chr>

The checks for the other test pedigrees are done with the following apply-statement. This is possible, because these pedigrees have the same column headers.

lapply(vec_ped_path[4:length(vec_ped_path)], check_pedig_parent)
#> Warning: `guess_max` is a very large value, setting to `21474836` to avoid
#> exhausting memory

#> Warning: `guess_max` is a very large value, setting to `21474836` to avoid
#> exhausting memory
#> [[1]]
#> [[1]]$PedFile
#> [1] "/private/var/folders/2v/jfsqj8zj2f122jcgy15nzfn00000gn/T/Rtmp4wSii3/temp_libpath909f297b1eee/qpdt/extdata/PopReport_GG_mit_20210113.csv_adaptfin2.csv"
#> 
#> [[1]]$NrMissingSire
#> [1] 9288
#> 
#> [[1]]$NrMissingDam
#> [1] 5617
#> 
#> [[1]]$NrSireNotAnimal
#> [1] 4
#> 
#> [[1]]$NrDamNotAnimal
#> [1] 23
#> 
#> [[1]]$TblSireBdate
#> # A tibble: 6 x 4
#>   `#IDTier`   IDVater `Birthdate.#IDTier` Birthdate.IDVater
#>       <dbl>     <dbl> <date>              <date>           
#> 1 999780574 999780544 1910-01-01          2007-03-01       
#> 2 999780582 999780544 1910-01-01          2007-03-01       
#> 3 999750225 999439795 1910-01-02          1991-12-14       
#> 4 999552253 999552298 1994-03-01          1995-03-01       
#> 5 999780535 999481777 1995-03-01          1997-11-15       
#> 6 999791525 999757076 2010-03-01          2010-12-23       
#> 
#> [[1]]$TblDamBdate
#> # A tibble: 1 x 4
#>   `#IDTier`  IDMutter `Birthdate.#IDTier` Birthdate.IDMutter
#>       <dbl>     <dbl> <date>              <date>            
#> 1 999780585 999774799 1910-01-01          2007-03-01        
#> 
#> [[1]]$TblSireEqID
#> # A tibble: 0 x 9
#> # … with 9 variables: `#IDTier` <dbl>, IDVater <dbl>, IDMutter <dbl>,
#> #   Birthdate <date>, Geschlecht <chr>, PLZ <dbl>, introg <dbl>, inb_gen <lgl>,
#> #   cryo <dbl>
#> 
#> [[1]]$TblDamEqID
#> # A tibble: 1 x 9
#>    `#IDTier`  IDVater  IDMutter Birthdate  Geschlecht   PLZ introg inb_gen  cryo
#>        <dbl>    <dbl>     <dbl> <date>     <chr>      <dbl>  <dbl> <lgl>   <dbl>
#> 1 1004954789   1.00e9    1.00e9 NA         F             NA      0 NA          0
#> 
#> [[1]]$TblSireWrongSex
#> # A tibble: 0 x 2
#> # … with 2 variables: IDVater <dbl>, Geschlecht <chr>
#> 
#> [[1]]$TblDamWrongSex
#> # A tibble: 1 x 2
#>     IDMutter Geschlecht
#>        <dbl> <chr>     
#> 1 1004954570 M         
#> 
#> 
#> [[2]]
#> [[2]]$PedFile
#> [1] "/private/var/folders/2v/jfsqj8zj2f122jcgy15nzfn00000gn/T/Rtmp4wSii3/temp_libpath909f297b1eee/qpdt/extdata/PopReport_SN_ohne_20210115.csv_adaptfin2.csv"
#> 
#> [[2]]$NrMissingSire
#> [1] 1211
#> 
#> [[2]]$NrMissingDam
#> [1] 462
#> 
#> [[2]]$NrSireNotAnimal
#> [1] 4243
#> 
#> [[2]]$NrDamNotAnimal
#> [1] 6468
#> 
#> [[2]]$TblSireBdate
#> # A tibble: 9 x 4
#>    `#IDTier`    IDVater `Birthdate.#IDTier` Birthdate.IDVater
#>        <dbl>      <dbl> <date>              <date>           
#> 1 1000810031 1000050624 2013-03-09          2013-05-02       
#> 2  999870937 1000161006 2013-03-20          2013-10-08       
#> 3 1000810027 1000050624 2013-03-25          2013-05-02       
#> 4 1000813707 1000050624 2013-03-26          2013-05-02       
#> 5 1000810038 1000050624 2013-03-26          2013-05-02       
#> 6 1000810020 1000050624 2013-04-15          2013-05-02       
#> 7 1000711443 1000050624 2013-04-15          2013-05-02       
#> 8 1000456126 1005609568 2013-11-26          2016-02-02       
#> 9 1004653188 1005845648 2015-03-07          2016-09-29       
#> 
#> [[2]]$TblDamBdate
#> # A tibble: 2 x 4
#>    `#IDTier`   IDMutter `Birthdate.#IDTier` Birthdate.IDMutter
#>        <dbl>      <dbl> <date>              <date>            
#> 1      99768      99765 1983-03-20          1985-02-26        
#> 2 1000878464 1004910666 2015-03-05          2015-10-12        
#> 
#> [[2]]$TblSireEqID
#> # A tibble: 0 x 9
#> # … with 9 variables: `#IDTier` <dbl>, IDVater <dbl>, IDMutter <dbl>,
#> #   Birthdate <date>, Geschlecht <chr>, PLZ <dbl>, introg <dbl>, inb_gen <lgl>,
#> #   cryo <dbl>
#> 
#> [[2]]$TblDamEqID
#> # A tibble: 1 x 9
#>    `#IDTier`  IDVater  IDMutter Birthdate  Geschlecht   PLZ introg inb_gen  cryo
#>        <dbl>    <dbl>     <dbl> <date>     <chr>      <dbl>  <dbl> <lgl>   <dbl>
#> 1 1004866450   9.99e8    1.00e9 2011-03-27 F           7411      0 NA          0
#> 
#> [[2]]$TblSireWrongSex
#> # A tibble: 3 x 2
#>      IDVater Geschlecht
#>        <dbl> <chr>     
#> 1 1007737726 F         
#> 2 1006000916 F         
#> 3 1004668293 F         
#> 
#> [[2]]$TblDamWrongSex
#> # A tibble: 0 x 2
#> # … with 2 variables: IDMutter <dbl>, Geschlecht <chr>

Datatypes

The processing of the pedigrees by PopReport and further by GenMon, requires that the columns of the pedigree contain objects of a certain data-type. The requirement of the data-types is checked by the function check_pedigree_datatypes(). The following statement shows a few tests and experiments.

tbl_ped <- read_prp_pedigree(ps_pedig_path = vec_ped_path[2], ps_delim = '|')
#> Warning: `guess_max` is a very large value, setting to `21474836` to avoid
#> exhausting memory

The data-types can be checked using the function class().

class(tbl_ped[["#animal"]])
#> [1] "numeric"

In the pedigree ‘tbl_ped’, the column ‘sire’ has an unexpected data-type which is shown by

class(tbl_ped[["sire"]])
#> [1] "character"

The column birthdate can have a special data-type

class(tbl_ped[["birth_date"]])
#> [1] "Date"

A similar functionality can be achieved by the function readr::guess_parser().

readr::guess_parser(tbl_ped[["sire"]], guess_integer = TRUE)
#> [1] "character"

For the column of dams

readr::guess_parser(tbl_ped[["dam"]], guess_integer = TRUE)
#> [1] "integer"

In case of a column with floating point numbers, we get

readr::guess_parser(tbl_ped[["introg"]], guess_integer = TRUE)
#> [1] "double"

For each of the columns to be checked, we give the required data-type. This can be specified in a list

(l_req_dt <- list(col = c("#animal", "sire", "dam", "birth_date", "sex", "plz", "introg"),
                  dtp = c("integer", "integer", "integer", "date", "character", "integer", "double")))
#> $col
#> [1] "#animal"    "sire"       "dam"        "birth_date" "sex"       
#> [6] "plz"        "introg"    
#> 
#> $dtp
#> [1] "integer"   "integer"   "integer"   "date"      "character" "integer"  
#> [7] "double"

The check can be performed in a simple loop

require(readr)
tbl_par_problem <- NULL
for (idx in seq_along(l_req_dt$col)) {
  cat(" * Checking column: ", l_req_dt$col[idx], "\n")
  s_cur_parser <- readr::guess_parser(tbl_ped[[ l_req_dt$col[idx]]], guess_integer = TRUE)
  if (s_cur_parser != l_req_dt$dtp[idx]){
    cat(" *** ERROR: column ",  l_req_dt$col[idx], " wrong datatype\n")
    parfun <- match.fun(paste('parse_', l_req_dt$dtp[idx], sep = ''))
    par_result <- parfun(tbl_ped[[l_req_dt$col[idx]]])
    tbl_par_problem <- problems(par_result)
  }
}
#>  * Checking column:  #animal 
#>  * Checking column:  sire 
#>  *** ERROR: column  sire  wrong datatype
#> Warning: 1 parsing failure.
#> row col   expected actual
#> 477  -- an integer      F
#>  * Checking column:  dam 
#>  * Checking column:  birth_date 
#>  * Checking column:  sex 
#>  * Checking column:  plz 
#>  * Checking column:  introg
tbl_par_problem
#> # A tibble: 1 x 4
#>     row   col expected   actual
#>   <int> <int> <chr>      <chr> 
#> 1   477    NA an integer F

The check is implemented in the function check_pedigree_datatypes(). This function requires as an input a list with column names and required datatypes. Such a list can be obtained using the function get_pedigree_datatypes().

(l_ped_dt_result <- get_pedigree_datatypes(ps_pedig_path = vec_ped_path[2]))
#> Warning: `guess_max` is a very large value, setting to `21474836` to avoid
#> exhausting memory
#> $PedFile
#> [1] "/private/var/folders/2v/jfsqj8zj2f122jcgy15nzfn00000gn/T/Rtmp4wSii3/temp_libpath909f297b1eee/qpdt/extdata/data_sample.csv"
#> 
#> $l_dtype
#> $l_dtype$col
#> [1] "#animal"    "sire"       "dam"        "birth_date" "sex"       
#> [6] "plz"        "introg"     "inb_gen"    "cryo"      
#> 
#> $l_dtype$dtp
#> [1] "integer"   "character" "integer"   "date"      "character" "integer"  
#> [7] "double"    "logical"   "logical"

The list required as input for check_pedigree_datatypes() is obtained as the component l_dtype of the result of get_pedigree_datatypes() by

l_ped_dt_result$l_dtype
#> $col
#> [1] "#animal"    "sire"       "dam"        "birth_date" "sex"       
#> [6] "plz"        "introg"     "inb_gen"    "cryo"      
#> 
#> $dtp
#> [1] "integer"   "character" "integer"   "date"      "character" "integer"  
#> [7] "double"    "logical"   "logical"

From this result, we can see that the column of “sire-IDs” has datatype character. This is not correct, because all ID-columns should have the same datatype. The problematic entry in the sire-ID column can be found by the result of the function check_pedigree_datatypes().

The check can now be run by

l_dtype <- l_ped_dt_result$l_dtype
(l_dtp_check_result <- check_pedigree_datatypes(ps_pedig_path = vec_ped_path[2], pl_dtype = l_dtype))
#> Warning: `guess_max` is a very large value, setting to `21474836` to avoid
#> exhausting memory
#> $PedFile
#> [1] "/private/var/folders/2v/jfsqj8zj2f122jcgy15nzfn00000gn/T/Rtmp4wSii3/temp_libpath909f297b1eee/qpdt/extdata/data_sample.csv"
#> 
#> $ReqDType
#> $ReqDType$col
#> [1] "#animal"    "sire"       "dam"        "birth_date" "sex"       
#> [6] "plz"        "introg"     "inb_gen"    "cryo"      
#> 
#> $ReqDType$dtp
#> [1] "integer"   "character" "integer"   "date"      "character" "integer"  
#> [7] "double"    "logical"   "logical"  
#> 
#> 
#> $CurDType
#> $CurDType$col
#> [1] "#animal"    "sire"       "dam"        "birth_date" "sex"       
#> [6] "plz"        "introg"     "inb_gen"    "cryo"      
#> 
#> $CurDType$dtp
#> [1] "integer"   "character" "integer"   "date"      "character" "integer"  
#> [7] "double"    "logical"   "logical"  
#> 
#> 
#> $DTypeProblems
#> NULL

Using the corrected version of this pedigree results in the following check

(l_ped_dt_result <- get_pedigree_datatypes(ps_pedig_path = vec_ped_path[3]))
#> Warning: `guess_max` is a very large value, setting to `21474836` to avoid
#> exhausting memory
#> $PedFile
#> [1] "/private/var/folders/2v/jfsqj8zj2f122jcgy15nzfn00000gn/T/Rtmp4wSii3/temp_libpath909f297b1eee/qpdt/extdata/data_sample2.csv"
#> 
#> $l_dtype
#> $l_dtype$col
#> [1] "#animal"    "sire"       "dam"        "birth_date" "sex"       
#> [6] "plz"        "introg"     "inb_gen"    "cryo"      
#> 
#> $l_dtype$dtp
#> [1] "integer"   "integer"   "integer"   "date"      "character" "integer"  
#> [7] "double"    "logical"   "logical"
l_dtype <- l_ped_dt_result$l_dtype
(l_dtp_check_result <- check_pedigree_datatypes(ps_pedig_path = vec_ped_path[3], pl_dtype = l_dtype))
#> Warning: `guess_max` is a very large value, setting to `21474836` to avoid
#> exhausting memory
#> $PedFile
#> [1] "/private/var/folders/2v/jfsqj8zj2f122jcgy15nzfn00000gn/T/Rtmp4wSii3/temp_libpath909f297b1eee/qpdt/extdata/data_sample2.csv"
#> 
#> $ReqDType
#> $ReqDType$col
#> [1] "#animal"    "sire"       "dam"        "birth_date" "sex"       
#> [6] "plz"        "introg"     "inb_gen"    "cryo"      
#> 
#> $ReqDType$dtp
#> [1] "integer"   "integer"   "integer"   "date"      "character" "integer"  
#> [7] "double"    "logical"   "logical"  
#> 
#> 
#> $CurDType
#> $CurDType$col
#> [1] "#animal"    "sire"       "dam"        "birth_date" "sex"       
#> [6] "plz"        "introg"     "inb_gen"    "cryo"      
#> 
#> $CurDType$dtp
#> [1] "integer"   "integer"   "integer"   "date"      "character" "integer"  
#> [7] "double"    "logical"   "logical"  
#> 
#> 
#> $DTypeProblems
#> NULL