Package 'BinaryDosage'

Title: Creates, Merges, and Reads Binary Dosage Files
Description: Tools to create binary dosage from either VCF or GEN files, merge binary dosage files, and read binary dosage files.
Authors: John Morrison [aut, cre], NIEHS [fnd] (P01 CA196559), NIEHS [fnd] (R01 CA201407), NIEHS [fnd] (P30 ES007048), NIEHS [fnd] (P01 HL115606)
Maintainer: John Morrison <[email protected]>
License: GPL-3
Version: 1.0.0
Built: 2025-03-07 05:05:31 UTC
Source: https://github.com/cran/BinaryDosage

Help Index


Apply a function to each SNP in a binary dosage file

Description

A routine that reads in the SNP data serially from a binary dosage file and applies a user specified function to the data.

Usage

bdapply(bdinfo, func, ...)

Arguments

bdinfo

List with information about the binary dosage file returned from getbdinfo

func

A user supplied function to apply to the data for each snp. The function must be provide with the following parameters, dosage, p0, p1, and p2, where dosage is the dosage values for each subject and p0, p1, and p2 are the probabilities that a subject has zero, one, and two copies of the alternate allele, respectively.

...

Additional parameters needed by the user supplied function

Value

A list with length equal to the number of SNPs in the binary dosage file. Each element of the list is the value returned by the user supplied function

See Also

Other Iterating functions: genapply(), vcfapply()

Examples

# Get information about a binary dosage file

vcf1abdfile <- system.file("extdata", "vcf1a.bdose", package = "BinaryDosage")
bdinfo <- getbdinfo(bdfiles = vcf1abdfile)

# Apply the getaaf, get alternate allele frequency, function
# to all the SNPs in the binary dosage file

aaf <- bdapply(bdinfo = bdinfo,
               func = BinaryDosage:::getaaf)

Merge binary dosage files together

Description

Routine to merge binary dosage files together. The files don't have to be in the same format. They will be merged into a file with the format specified. Information about the SNPs, aaf, maf, avgcall, rsq, can be maintained for each file, or recalculated for the merged set.

Usage

bdmerge(
  mergefiles,
  format = 4,
  subformat = 0L,
  bdfiles,
  famfiles = character(),
  mapfiles = character(),
  onegroup = TRUE,
  bdoptions = character(),
  snpjoin = "inner"
)

Arguments

mergefiles

Vector of file names for the merged binary files. The first is the binary dosage data containing the dosages and genetic probabilities. The second file name is the family information file. The third file name is the SNP information file. The family and SNP information files are not used if the binary dosage file is in format 4. For this format the family and SNP information are in the file with the dosages and genetic probabilities.

format

The format of the output binary dosage file. Allowed values are 1, 2, 3, and 4. The default value is 4. Using the default value is recommended.

subformat

The subformat of the format of the output binary dosage file. A value of 1 or 3 indicates that only the dosage value is saved. A value of 2 or 4 indicates the dosage and genetic probabilities will be output. Values of 3 or 4 are only allowed with formats 3 and 4. If a value of zero if provided, and genetic probabilities are in the vcf file, subformat 2 will be used for formats 1 and 2, and subformat 4 will be used for formats 3 and 4. If the vcf file does not contain genetic probabilities, subformat 1 will be used for formats 1 and 2, and subformat 3 will be used for formats 3 and 4. The default value is 0.

bdfiles

Vector of binary dosage file names to be merged.

famfiles

Vector of family file names that correspond to the names in bdfiles. If the binary dosage files are all in format 4, this may be an empty character array. Default value is character().

mapfiles

Vector of map file names that correspond to the names in bdfiles. If the binary dosage files are all in format 4, this may be an empty character array. Default value is character().

onegroup

Indicator to combine all the samples in one group. If this is FALSE, the groups in each binary dosage file are maintained and any binary dosage file with one group is made into its own group. Default value is TRUE.

bdoptions

Options indicating what information to calculate and store for each SNP. These can be aaf, maf, and rsq. This option is only available if format is equal to 4 and onegroup is TRUE. Default value is character().

snpjoin

Character value that can be either "inner" or "outer". This indicates whether to do an inner or outer join of the SNPs in each binary dosage file. Default value is "inner".

Value

None

Examples

bdvcf1afile <- system.file("extdata", "vcf1a.bdose", package = "BinaryDosage")
bdvcf1bfile <- system.file("extdata", "vcf1b.bdose", package = "BinaryDosage")
mergefiles <- tempfile()

BinaryDosage:::bdmerge(mergefiles = mergefiles,
                       bdfiles = c(bdvcf1afile, bdvcf1bfile),
                       bdoptions = "maf")
bdinfo <- getbdinfo(mergefiles)

Apply a function to each SNP in a gen, impute2, file

Description

A routine that reads in the SNP data serially from a gen file and applies a user specified function to the data.

Usage

genapply(geninfo, func, ...)

Arguments

geninfo

List with information about the gen, impute2, file returned from getgeninfo

func

A user supplied function to apply to the data for each snp. The function must be provide with the following parameters, dosage, p0, p1, and p2, where dosage is the dosage values for each subject and p0, p1, and p2 are the probabilities that a subject has zero, one, and two copies of the alternate allele, respectively.

...

Additional parameters needed by the user supplied function

Value

A list with length equal to the number of SNPs in the vcf file. Each element of the list is the value returned by the user supplied function

See Also

Other Iterating functions: bdapply(), vcfapply()

Examples

# Get information about a gen, impute2, file

gen1afile <- system.file("extdata", "set1a.imp", package = "BinaryDosage")
geninfo <- getgeninfo(genfiles = gen1afile,
                      snpcolumns = c(1L, 3L, 2L, 4L, 5L),
                      header = TRUE)


aaf <- genapply(geninfo = geninfo,
                func = BinaryDosage:::getaaf)

Convert a gen file to a binary dosage file

Description

Routine to read information from a gen file and create a binary dosage file. Note: This routine can take a long time to run if the gen file is large.

Usage

gentobd(
  genfiles,
  snpcolumns = 1L:5L,
  startcolumn = 6L,
  impformat = 3L,
  chromosome = character(),
  header = c(FALSE, TRUE),
  gz = FALSE,
  sep = "\t",
  bdfiles,
  format = 4L,
  subformat = 0L,
  snpidformat = 0L,
  bdoptions = character(0)
)

Arguments

genfiles

A vector of file names. The first is the name of the gen file. The second is name of the sample file that contains the subject information.

snpcolumns

Column numbers containing chromosome, snpid, location, reference allele, alternate allele, respectively. This must be an integer vector. All values must be positive except for the chromosome. The value for the chromosome may be -1 or -0. -1 indicates that the chromosome value is passed to the routine using the chromosome parameter. 0 indicates that the chromosome value is in the snpid and that the snpid has the format chromosome:other_data. Default value is c(1L, 2L, 3L, 4L, 5L).

startcolumn

Column number of first column with genetic probabilities or dosages. Must be an integer value. Default value is 6L.

impformat

Number of genetic data values per subject. 1 indicates dosage only, 2 indicates P(g=0) and P(g=1) only, 3 indicates P(g=0), P(g=1), and P(g=2). Default value is 3L.

chromosome

Chromosome value to use if the first value of the snpcolumns is equal to 0. Default value is character().

header

Indicators if the gen and sample files have headers. If the gen file does not have a header. A sample file must be included. Default value is c(FALSE, TRUE).

gz

Indicator if file is compressed using gzip. Default value is FALSE.

sep

Separator used in the gen file. Default value is "\t"

bdfiles

Vector of names of the output files. The binary dosage file name is first. The family and map files follow. For format 4, no family and map file names are needed.

format

The format of the output binary dosage file. Allowed values are 1, 2, 3, and 4. The default value is 4. Using the default value is recommended.

subformat

The subformat of the format of the output binary dosage file. A value of 1 or 3 indicates that only the dosage value is saved. A value of 2 or 4 indicates the dosage and genetic probabilities will be output. Values of 3 or 4 are only allowed with formats 3 and 4. If a value of zero if provided, and genetic probabilities are in the vcf file, subformat 2 will be used for formats 1 and 2, and subformat 4 will be used for formats 3 and 4. If the vcf file does not contain genetic probabilities, subformat 1 will be used for formats 1 and 2, and subformat 3 will be used for formats 3 and 4. The default value is 0.

snpidformat

The format that the SNP ID will be saved as. -1 - SNP ID not written. 0 - same as in the VCF file. 1 - chromosome:location. 2 - chromosome:location:reference_allele:alternate_allele. If snpidformat is 1 and the VCF file uses format 2, an error is generated. Default value is 0.

bdoptions

Character array containing any of the following value, "aaf", "maf", "rsq". The presence of any of these values indicates that the specified values should be calculates and stored in the binary dosage file. These values only apply to format 4.

Value

None

Examples

# Find the gen file names
gen3afile <- system.file("extdata", "set3a.imp", package = "BinaryDosage")
gen3asample <- system.file("extdata", "set3a.sample", package = "BinaryDosage")
# Get temporary output file name
bdfiles <- tempfile()
# Convert the file
gentobd(genfiles = c(gen3afile, gen3asample),
        snpcolumns = c(0L, 2L:5L),
        bdfiles = bdfiles)
# Verify the file was written correctly
bdinfo <- getbdinfo(bdfiles = bdfiles)

Calculate alternate allele frequency

Description

Routine to calculate the alternate allele frequency given the dosages. Missing values for dosage ignored. This function is used internally and is exported for use in examples.

Usage

getaaf(dosage, p0, p1, p2)

Arguments

dosage

Dosage values

p0

Pr(g=0) - unused

p1

Pr(g=1) - unused

p2

Pr(g=2) - unused

Value

Alternate allele frequency

Examples

# Get information about binary dosage file
bdfile <- system.file("extdata", "vcf1a.bdose", package = "BinaryDosage")
bdinfo <- getbdinfo(bdfiles = bdfile)
snp1 <- getsnp(bdinfo = bdinfo, 1)
aaf <- getaaf(snp1$dosage)

Get information about a binary dosage file

Description

Routine to return information about a binary dosage file. This information is used by other routines to allow for quicker extraction of values from the file.

Usage

getbdinfo(bdfiles)

Arguments

bdfiles

Vector of file names. The first is the binary dosage data containing the dosages and genetic probabilities. The second file name is the family information file. The third file name is the SNP information file. The family and SNP information files are not used if the binary dosage file is in format 4. For this format the family and SNP information are in the file with the dosages and genetic probabilities.

Value

List with information about the binary dosage file. This includes family and subject IDs along with a list of the SNPs in the file. Other information needed to read the file is also included.

Examples

vcf1abdfile <- system.file("extdata", "vcf1a.bdose", package = "BinaryDosage")
bdinfo <- getbdinfo(bdfiles = vcf1abdfile)

Get information about a gen, impute2, file

Description

Routine to return information about a gen file. This information is used by other routines to allow for quicker extraction of values from the file.

Usage

getgeninfo(
  genfiles,
  snpcolumns = 1L:5L,
  startcolumn = 6L,
  impformat = 3L,
  chromosome = character(),
  header = c(FALSE, TRUE),
  gz = FALSE,
  index = TRUE,
  snpidformat = 0L,
  sep = c("\t", "\t")
)

Arguments

genfiles

A vector of file names. The first is the name of the gen file. The second is name of the sample file that contains the subject information.

snpcolumns

Column numbers containing chromosome, snpid, location, reference allele, alternate allele, respectively. This must be an integer vector. All values must be positive except for the chromosome. The value for the chromosome may be -1 or -0. -1 indicates that the chromosome value is passed to the routine using the chromosome parameter. 0 indicates that the chromosome value is in the snpid and that the snpid has the format chromosome:other_data. Default value is c(1L, 2L, 3L, 4L, 5L).

startcolumn

Column number of first column with genetic probabilities or dosages. Must be an integer value. Default value is 6L.

impformat

Number of genetic data values per subject. 1 indicates dosage only, 2 indicates P(g=0) and P(g=1) only, 3 indicates P(g=0), P(g=1), and P(g=2). Default value is 3L.

chromosome

Chromosome value to use if the first value of the snpcolumns is equal to 0. Default value is character().

header

Indicators if the gen and sample files have headers. If the gen file does not have a header. A sample file must be included. Default value is c(FALSE, TRUE).

gz

Indicator if file is compressed using gzip. Default value is FALSE.

index

Indicator if file should be indexed. This allows for faster reading of the file. Indexing a gzipped file is not supported. Default value is TRUE.

snpidformat

Format to change the snpid to. 0 indicates to use the snpid format in the file. 1 indicates to change the snpid into chromosome:location, 2 indicates to change the snpid into chromosome:location:referenceallele:alternateallele, 3 indicates to change the snpid into chromosome:location_referenceallele_alternateallele, Default value is 0.

sep

Separators used in the gen file and sample files, respectively. If only value is provided it is used for both files. Default value is c("\t", "\t")

Value

List with information about the gen file. This includes family and subject IDs along with a list of the SNPs in the file. Other information needed to read the file is also included.

Examples

# Get file names of th gen and sample file
gen3afile <- system.file("extdata", "set3a.imp", package = "BinaryDosage")
gen3ainfo <- system.file("extdata", "set3a.sample", package = "BinaryDosage")

# Get the information about the gen file
geninfo <- getgeninfo(genfiles = c(gen3afile, gen3ainfo),
                      snpcolumns = c(0L, 2L:5L))

Calculate minor allele frequency

Description

Routine to calculate the minor allele frequency given the dosages. Missing values for dosage ignored. This function is used internally and is exported for use in examples. Note: The minor allele in one data set may be different from another data set. This can make comparing minor allele frequencies between data sets nonsensical.

Usage

getmaf(dosage, p0, p1, p2)

Arguments

dosage

Dosage values

p0

Pr(g=0) - unused

p1

Pr(g=1) - unused

p2

Pr(g=2) - unused

Value

Minor allele frequency

Examples

# Get information about binary dosage file
bdfile <- system.file("extdata", "vcf1a.bdose", package = "BinaryDosage")
bdinfo <- getbdinfo(bdfiles = bdfile)
snp1 <- getsnp(bdinfo = bdinfo, 1)
maf <- getmaf(snp1$dosage)

Calculate imputation r squared

Description

Routine to calculate the imputation r squared given the dosages and Pr(g=2). This is an estimate for the imputation r squared returned from minimac and impute2. The r squared values are calculated slightly differently between the programs. This estimate is based on the method used by minimac. It does well for minor allele frequencies above 5%. This function is used internally and is exported for use in examples.

Usage

getrsq(dosage, p0, p1, p2)

Arguments

dosage

Dosage values

p0

Pr(g=0) - unused

p1

Pr(g=1) - unused

p2

Pr(g=2)

Value

Imputation r squared

Examples

# Get information about binary dosage file
bdfile <- system.file("extdata", "vcf1a.bdose", package = "BinaryDosage")
bdinfo <- getbdinfo(bdfiles = bdfile)
snp1 <- getsnp(bdinfo = bdinfo, 1, dosageonly = FALSE)
rsq <- BinaryDosage:::getrsq(snp1$dosage, p2 = snp1$p2)

Read SNP data from a binary dosage file

Description

Routine to read the dosage and genetic probabilities about a SNP from a binary dosage file

Usage

getsnp(bdinfo, snp, dosageonly = TRUE)

Arguments

bdinfo

Information about a binary dosage file return from getbdinfo

snp

The SNP to read the information about. This may be the SNP ID or the index of the SNP in the snps dataset in the bdinfo list

dosageonly

Indicator to return the dosages only or the dosages allowing with the genetic probabilities. Default value is TRUE

Value

A list with either the dosages or the dosages and the genetic probabilities.

Examples

# Get the information about the file
vcf1abdfile <- system.file("extdata", "vcf1a.bdose", package = "BinaryDosage")
bdinfo <- getbdinfo(bdfiles = vcf1abdfile)

# Read the first SNP
getsnp(bdinfo, 1, FALSE)

Get information about a vcf file

Description

Routine to return information about a vcf file. This information is used by other routines to allow for quicker extraction of values from the file.

Usage

getvcfinfo(vcffiles, gz = FALSE, index = TRUE, snpidformat = 0L)

Arguments

vcffiles

A vector of file names. The first is the name of the vcf file. The second is name of the file that contains information about the imputation of the SNPs. This file is produced by minimac 3 and 4.

gz

Indicator if VCF file is compressed using gzip. Default value is FALSE.

index

Indicator if file should be indexed. This allows for faster reading of the file. Indexing a gzipped file is not supported. Default value is TRUE.

snpidformat

The format that the SNP ID will be saved as. 0 - same as in the VCF file 1 - chromosome:location 2 - chromosome:location:referenceallele:alternateallele If snpidformat is 1 and the VCF file uses format 2, an error is generated. Default value is 0.

Value

List containing information about the VCF file to include file name, subject IDs, and information about the SNPs. Indices for faster reading will be included if index is set to TRUE

Examples

# Get file names of th vcf and infromation file
vcf1afile <- system.file("extdata", "set1a.vcf", package = "BinaryDosage")
vcf1ainfo <- system.file("extdata", "set1a.info", package = "BinaryDosage")

# Get the information about the vcf file
vcf1ainfo <- getvcfinfo(vcffiles = c(vcf1afile, vcf1ainfo))

Apply a function to each SNP in a vcf file

Description

A routine that reads in the SNP data serially from a vcf file and applies a user specified function to the data.

Usage

vcfapply(vcfinfo, func, ...)

Arguments

vcfinfo

List with information about the vcf file returned from getvcfinfo

func

A user supplied function to apply to the data for each snp. The function must be provide with the following parameters, dosage, p0, p1, and p2, where dosage is the dosage values for each subject and p0, p1, and p2 are the probabilities that a subject has zero, one, and two copies of the alternate allele, respectively.

...

Additional parameters needed by the user supplied function

Value

A list with length equal to the number of SNPs in the vcf file. Each element of the list is the value returned by the user supplied function

See Also

Other Iterating functions: bdapply(), genapply()

Examples

# Get information about a vcf file

vcf1afile <- system.file("extdata", "set1a.vcf", package = "BinaryDosage")
vcfinfo <- getvcfinfo(vcffiles = vcf1afile)

# Apply the getaaf, get alternate allele frequency, function
# to all the SNPs in the vcf file

aaf <- vcfapply(vcfinfo = vcfinfo,
                func = BinaryDosage:::getaaf)

Convert a VCF file to a binary dosage file

Description

Routine to read information from a VCF file and create a binary dosage file. The function is designed to use files return from the Michigan Imputation Server but will run on other VCF files if they contain dosage and genetic probabilities. Note: This routine can take a long time to run if the VCF file is large.

Usage

vcftobd(
  vcffiles,
  gz = FALSE,
  bdfiles,
  format = 4L,
  subformat = 0L,
  snpidformat = 0,
  bdoptions = character(0)
)

Arguments

vcffiles

A vector of file names. The first is the name of the vcf file. The second is name of the file that contains information about the imputation of the SNPs. This file is produced by minimac 3 and 4.

gz

Indicator if VCF file is compressed using gzip. Default value is FALSE.

bdfiles

Vector of names of the output files. The binary dosage file name is first. The family and map files follow. For format 4, no family and map file names are needed.

format

The format of the output binary dosage file. Allowed values are 1, 2, 3, and 4. The default value is 4. Using the default value is recommended.

subformat

The subformat of the format of the output binary dosage file. A value of 1 or 3 indicates that only the dosage value is saved. A value of 2 or 4 indicates the dosage and genetic probabilities will be output. Values of 3 or 4 are only allowed with formats 3 and 4. If a value of zero if provided, and genetic probabilities are in the vcf file, subformat 2 will be used for formats 1 and 2, and subformat 4 will be used for formats 3 and 4. If the vcf file does not contain genetic probabilities, subformat 1 will be used for formats 1 and 2, and subformat 3 will be used for formats 3 and 4. The default value is 0.

snpidformat

The format that the SNP ID will be saved as. -1 SNP ID not written 0 - same as in the VCF file 1 - chromosome:location 2 - chromosome:location:reference_allele:alternate_allele If snpidformat is 1 and the VCF file uses format 2, an error is generated. Default value is 0.

bdoptions

Character array containing any of the following value, "aaf", "maf", "rsq". The presence of any of these values indicates that the specified values should be calculates and stored in the binary dosage file. These values only apply to format 4.

Value

None

Examples

# Find the vcf file names
vcf1afile <- system.file("extdata", "set1a.vcf", package = "BinaryDosage")
vcf1ainfo <- system.file("extdata", "set1a.info", package = "BinaryDosage")
bdfiles <- tempfile()
# Convert the file
vcftobd(vcffiles = c(vcf1afile, vcf1ainfo), bdfiles = bdfiles)
# Verify the file was written correctly
bdinfo <- getbdinfo(bdfiles)