Sample Missingness Module¶
The usage of the standalone module is shown below:
$ pyGenClean_sample_missingness --help
usage: pyGenClean_sample_missingness [-h] [-v] --ifile FILE [--is-bfile]
[--mind FLOAT] [--out FILE]
Computes sample missingness using Plink.
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
Input File:
--ifile FILE The input file prefix (by default, this input file must be a
tfile. If options --is-bfile is used, the input file must be
a bfile).
Options:
--is-bfile The input file (--ifile) is a bfile instead of a tfile.
--mind FLOAT The missingness threshold (remove samples with more than x
percent missing genotypes). [Default: 0.100]
Output File:
--out FILE The prefix of the output files (wich will be a Plink binary
file). [default: clean_mind]
Input Files¶
This module uses either PLINK’s binary file format (bed, bim and fam
files) or the transposed pedfile format separated by tabulations (tped and
tfam) for the source data set (the data of interest).
Procedure¶
Here are the steps performed by the module:
- Uses Plink to remove samples with a high missing rate (above a user defined threshold).
Output Files¶
The output files of each of the steps described above are as follow (note that
the output prefix shown is the one by default [i.e. clean_geno]):
- One set of PLINK’s output and result files:
clean_mind: the new dataset with samples having a high missing rate removed (above a user defined threshold). The fileclean_mind.iremcontains a list of samples that were removed.
The Algorithm¶
For more information about the actual algorithms and source codes, refer to the following page.
