Genetic File Information

library(BinaryDosage)

The routines getbdinfo, getvcfinfo, and getgeninfo return a list with information about the data in the files. The list returned by each of these routines has a section common to them all and a list, additionalinfo, that is specific to the file type.

Common section

The common section has the following elements.

  • filename - Character value with the complete path and file name of the file with the genetic data
  • usesfid - Logical value indicating if the subject data has family IDs.
  • samples - Data frame containing the following information about the subjects
    • fid - Character value with family IDs
    • sid - Character value with the individual IDs
  • onchr - Logical value indicating if all the SNPs are on the same chromosome
  • snpidformat - Integer indicating the format of the SNP IDs as follows
    • 0 - Unknown for VCF and GEN files or user specified for binary dosage files
    • 1 - chromosome:location
    • 2 - chromosome:location:referenceallele:alternateallele
    • 3 - chromosome:location_referenceallele_alternateallele
  • snps - Data frame containing the following values
    • chromosome - Character value indicating what chromosome the SNP is on
    • location - Integer value with the location of the SNP on the chromosome
    • snpid - Character value with the ID of the SNP
    • reference - Character value of the reference allele
    • alternate - Character value of the alternate allele
  • snpinfo - List that contains the following information
    • aaf - Numeric vector with the alternate allele frequencies
    • maf - Numeric vector with the minor allele frequencies
    • avgcall - Numeric vector with the imputation average call
    • rsq - Numeric vector with the imputation r squared value
  • datasize - Numeric vector indicating the size of data in the file for each SNP
  • indices - Numeric vector indicating the starting location in the file for each SNP
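
The three named SNP ID formats can be illustrated with a small sketch. The helper make_snpid and the example data frame are hypothetical and are not part of BinaryDosage; they only show how each format combines the columns of a snps-style data frame.

```r
# Hypothetical helper (not part of BinaryDosage): build SNP IDs in one of
# the documented formats from the columns of a snps-style data frame.
make_snpid <- function(snps, snpidformat) {
  switch(as.character(snpidformat),
    "1" = paste(snps$chromosome, snps$location, sep = ":"),
    "2" = paste(snps$chromosome, snps$location,
                snps$reference, snps$alternate, sep = ":"),
    "3" = paste0(snps$chromosome, ":", snps$location, "_",
                 snps$reference, "_", snps$alternate),
    snps$snpid  # format 0: keep the IDs already stored
  )
}

# Made-up SNP data with the documented column names
snps <- data.frame(
  chromosome = c("1", "1"),
  location = c(10000L, 11000L),
  snpid = c("rs1", "rs2"),
  reference = c("A", "C"),
  alternate = c("G", "T"),
  stringsAsFactors = FALSE
)

id1 <- make_snpid(snps, 1)  # chromosome:location
id2 <- make_snpid(snps, 2)  # chromosome:location:reference:alternate
id3 <- make_snpid(snps, 3)  # chromosome:location_reference_alternate
```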

The list returned has its class value set to “genetic-info”.

The datasize and indices values are only returned if the parameter index is set to TRUE.
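
Code that consumes one of these lists may want to confirm the class and check whether the index values are present before using them. The sketch below uses a hand-built list with made-up values in place of a real return value, and it assumes that "only returned" means the two elements are absent from the list when index is not TRUE. The helper has_index is hypothetical.

```r
# Hand-built stand-in for a returned list; all values are made up.
info <- list(
  filename = "example.vcf",
  usesfid = FALSE,
  samples = data.frame(fid = c("", ""), sid = c("I1", "I2"),
                       stringsAsFactors = FALSE),
  snpidformat = 0L,
  additionalinfo = list()
)
class(info) <- "genetic-info"

# Hypothetical helper: TRUE when both index elements are present.
# Assumes the elements are missing entirely when no index was built.
has_index <- function(x) {
  all(c("datasize", "indices") %in% names(x))
}

is_genetic <- inherits(info, "genetic-info")
indexed_before <- has_index(info)

# Add made-up index values to mimic a list built with index = TRUE
info$datasize <- c(20, 20)
info$indices <- c(100, 120)
indexed_after <- has_index(info)
```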

Binary Dosage Additional Information

The additional information returned for binary dosage files contains the following information.

  • format - Numeric value with the format of the binary dosage file
  • subformat - Numeric value with the subformat of the binary dosage file
  • headersize - Integer value with the size of the header in the binary dosage file
  • numgroups - Integer value of the number of groups of subjects in the binary dosage file. This is usually the number of binary dosage files merged together to form the file
  • groups - Integer vector with the size of each of the groups

This list has its class value set to “bdose-info”.
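
As a rough illustration of how numgroups and groups relate, the sketch below builds a stand-in list with made-up values and labels each subject with a group number. It assumes that subjects are stored group by group in the order given by groups, which the text above does not state.

```r
# Hand-built stand-in for the binary dosage additional information;
# every value here is made up for illustration.
bdextra <- list(
  format = 4,
  subformat = 2,
  headersize = 512L,
  numgroups = 2L,
  groups = c(3L, 2L)
)
class(bdextra) <- "bdose-info"

# One group size per group
sizes_match <- length(bdextra$groups) == bdextra$numgroups

# Assumption: subjects appear group by group, so subject i belongs to
# the group whose cumulative size first reaches i.
group_of_subject <- rep(seq_len(bdextra$numgroups), times = bdextra$groups)
total_subjects <- sum(bdextra$groups)
```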

VCF File Additional Information

The additional information returned for VCF files contains the following information.

  • gzipped - Logical value indicating if the file has been compressed using gzip
  • headerlines - Integer value indicating the number of lines in the header
  • headersize - Numeric value indicating the size of the header in bytes
  • quality - Character vector containing the values in the QUAL column
  • filter - Character vector containing the values in the FILTER column
  • info - Character vector containing the values in the INFO column
  • format - Character vector containing the values in the FORMAT column
  • datacolumns - Data frame summarizing the entries in the FORMAT value containing the following information
    • numcolumns - Integer value indicating the number of values in the FORMAT value
    • dosage - Integer value indicating the column containing the dosage value
    • genotypeprob - Integer value indicating the column containing the genotype probabilities
    • genotype - Integer value indicating the column containing the genotype call

This list has its class value set to “vcf-info”.

The values for quality, filter, info, and format can have a length of 0 if all the values are missing. They will have a length of 1 if all the values are equal. The number of rows in the datacolumns data frame will be equal to the length of the format value.
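
Because these vectors can have a length of 0 or 1, code that needs one value per SNP has to expand them first. The helper expand_field below is hypothetical and not part of BinaryDosage. It assumes that a vector that is neither empty nor of length 1 holds exactly one value per SNP, which the text above implies but does not state.

```r
# Hypothetical helper: expand a quality/filter/info/format vector so that
# it has one entry per SNP.
expand_field <- function(x, nsnps) {
  if (length(x) == 0) {
    # All values missing
    return(rep(NA_character_, nsnps))
  }
  if (length(x) == 1) {
    # All values equal
    return(rep(x, nsnps))
  }
  # Assumption: otherwise there is one value per SNP
  stopifnot(length(x) == nsnps)
  x
}

all_missing <- expand_field(character(0), 3)
all_equal <- expand_field("PASS", 3)
per_snp <- expand_field(c("PASS", "q10", "PASS"), 3)
```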

GEN File Additional Information

The additional information returned for GEN files contains the following information.

  • gzipped - Logical value indicating if the GEN file has been compressed using gzip
  • headersize - Integer value indicating the size of the header in bytes
  • format - Integer value indicating the number of genotype probabilities for each subject with the following meanings
    • 1 - Dosage only
    • 2 - Pr(g = 0) and Pr(g = 1)
    • 3 - Pr(g = 0), Pr(g = 1), and Pr(g = 2)
  • startcolumn - Integer value indicating in which column the genetic data starts
  • sep - Character value indicating what value separates the columns

g indicates the number of alternate alleles the subject has.

This list has its class value set to “gen-info”.
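
The three GEN formats carry the same information in different forms. The sketch below shows one way to turn the per-subject values into a dosage, taken here to be the expected number of alternate alleles, Pr(g = 1) + 2 Pr(g = 2). The helper gen_dosage is hypothetical and not part of BinaryDosage. For format 2 it assumes the three genotype probabilities sum to 1, so Pr(g = 2) = 1 - Pr(g = 0) - Pr(g = 1).

```r
# Hypothetical helper: compute dosages from a matrix with one row per
# subject and `format` columns, following the GEN format codes above.
gen_dosage <- function(probs, format) {
  probs <- as.matrix(probs)
  stopifnot(ncol(probs) == format)
  if (format == 1) {
    # Dosage only: nothing to compute
    return(probs[, 1])
  }
  p0 <- probs[, 1]
  p1 <- probs[, 2]
  # Format 3 stores Pr(g = 2); format 2 leaves it implied
  p2 <- if (format == 3) probs[, 3] else 1 - p0 - p1
  p1 + 2 * p2
}

# Made-up probabilities for two subjects
d3 <- gen_dosage(rbind(c(0.5, 0.5, 0), c(0, 0, 1)), 3)
d2 <- gen_dosage(rbind(c(0.5, 0.5), c(0, 0)), 2)
d1 <- gen_dosage(cbind(c(0.5, 2)), 1)
```

With these inputs all three calls describe the same two subjects, so d1, d2, and d3 agree.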