List of Modules and their Options

The following sections show a list the available scripts that can be used in the configuration file, along with their options for customization.

Contamination

The name to use in the configuration file is contamination and the List of options for the contamination script. table shows its configuration.

List of options for the contamination script.
Option Type Description
--raw-dir STRING Directory containing the raw data (one file per sample, where the name of the file (minus the extension) is the sample identification number.
--colsample STRING The sample column.
--colmarker STRING The marker column.
--colbaf STRING The B allele frequency column.
--colab1 STRING The AB Allele 1 column.
--colab2 STRING The AB Allele 2 column.
--sge   Use SGE for parallelization.
--sge-walltime STRING The walltime for the job to run on the cluster. Do not use if you are not required to specify a walltime for your jobs on your cluster (e.g.qsub -lwalltime=1:0:0‘ on the cluster).
--sge-nodes INT The number of nodes and the number of processor per nodes to use (e.g.qsub -lnodes=X:ppn=Y‘ on the cluster, where X is the number of nodes and Y is the number of processor to use. Do not use if you are not required to specify the number of nodes for your jobs on the cluster.
--sample-per-run-for-sge INT The number of sample to run for a single SGE job.

The name of the standalone script is pyGenClean_check_contamination.

Duplicated Samples

The name to use in the configuration file is duplicated_samples and the List of options for the duplicated_samples script. table shows its configuration.

List of options for the duplicated_samples script.
Option Type Description
--sample-completion-threshold FLOAT The completion threshold to consider a replicate when choosing the best replicates and for creating the composite samples. [default: 0.9]
--sample-concordance-threshold FLOAT The concordance threshold to consider a replicate when choosing the best replicates and for creating the composite samples. [default: 0.97]

The name of the standalone script is pyGenClean_duplicated_samples.

Duplicated Markers

The name to use in the configuration file is duplicated_snps and the List of options for the duplicated_snps script. table shows its configuration.

List of options for the duplicated_snps script.
Option Type Description
--snp-completion-threshold FLOAT The completion threshold to consider a replicate when choosing the best replicates and for composite creation. [default: 0.9]
--snp-concordance-threshold FLOAT The concordance threshold to consider a replicate when choosing the best replicates and for composite creation. [default: 0.98]
--frequency_difference FLOAT The maximum difference in frequency between duplicated markers [default: 0.05]

The name of the standalone script is pyGenClean_duplicated_snps.

Clean No Call and Only Heterozygous Markers

The name to use in the configuration file is noCall_hetero_snps and there are no customization possible.

The name of the standalone script is pyGenClean_clean_noCall_hetero_snps.

Sample Missingness

The name to use in the configuration file is sample_missingness and the List of options for the sample_missingness script. table shows its configuration.

List of options for the sample_missingness script.
Option Type Description
--mind FLOAT The missingness threshold (remove samples with more than x percent missing genotypes). [Default: 0.100]

The name of the standalone script is pyGenClean_sample_missingness.

Marker Missingness

The name to use in the configuration file is snp_missingness and the List of options for the snp_missingness script. table shows its configuration.

List of options for the snp_missingness script.
Option Type Description
--geno FLOAT The missingness threshold (remove SNPs with more than x percent missing genotypes). [Default: 0.020]

The name of the standalone script is pyGenClean_snp_missingness.

Sex Check

The name to use in the configuration file is sex_check and the List of options for the sex_check script. table shows its configuration.

List of options for the sex_check script.
Option Type Description
--femaleF FLOAT The female F threshold. [default: < 0.300000]
--maleF FLOAT The male F threshold. [default: > 0.700000]
--nbChr23 INT The minimum number of markers on chromosome 23 before computing Plink’s sex check [default: 50]
--gender-plot   Create the gender plot (summarized chr Y intensities in function of summarized chr X intensities) for problematic samples. Not used by default.
--sex-chr-intensities FILE A file containing alleles intensities for each of the markers located on the X and Y chromosome for the gender plot.
--gender-plot-format STRING The output file format for the gender plot (png, ps, or pdf formats are available). [default: png]
--lrr-baf   Create the LRR and BAF plot for problematic samples. Not used by default.
--lrr-baf-raw-dir DIR Directory or list of directories containing information about every samples (BAF and LRR).
--lrr-baf-format STRING The output file format for the LRR and BAF plot (png, ps or pdf formats are available). [default: png]
--lrr-baf-dpi INT The pixel density of the figure(s) (DPI).

The name of the standalone script is pyGenClean_sex_check. If you want to redo the BAF and LRR plot or the gender plot, you can use the pyGenClean_baf_lrr_plot and pyGenClean_gender_plot scripts, respectively.

Plate Bias

The name to use in the configuration file is plate_bias and the List of options for the plate_bias script. table shows its configuration.

List of options for the plate_bias script.
Option Type Description
--loop-assoc FILE The file containing the plate organization of each samples. Must contains three column (with no header): famID, indID and plateName.
--pfilter FLOAT The significance threshold used for the plate effect. [default: 1.0e-07]

The name of the standalone script is pyGenClean_plate_bias.

Heterozygous Haploid

The name to use in the configuration file is remove_heterozygous_haploid and there are no customization possible.

The name of the standalone script is pyGenClean_remove_heterozygous_haploid.

Ethnicity

The name to use in the configuration file is check_ethnicity and the List of options for the check_ethnicity script. table shows its configuration.

List of options for the check_ethnicity script.
Option Type Description
--skip-ref-pops   Perform the MDS computation, but skip the three reference panels.
--ceu-bfile FILE The input file prefix (will find the plink binary files by appending the prefix to the .bim, .bed and .fam files, respectively.) for the CEU population.
--yri-bfile FILE The input file prefix (will find the plink binary files by appending the prefix to the .bim, .bed and .fam files, respectively.) for the CEU population.
--jpt-chb-bfile FILE The input file prefix (will find the plink binary files by appending the prefix to the .bim, .bed and .fam files, respectively.) for the JPT-CHB population.
--min-nb-snp FILE The minimum number of markers needed to compute IBS values. [Default: 8000]
--indep-pairwise INT INT FLOAT Three numbers: window size, window shift and the r2 threshold. [default: [‘50’, ‘5’, ‘0.1’]]
--maf INT Restrict to SNPs with MAF >= threshold. [default: 0.05]
--sge   Use SGE for parallelization.
--sge-walltime STRING The time limit (for clusters). Do not use if you are not required to specify a walltime for your jobs on your cluster (e.g. -lwalltime=1:0:0 on the cluster). Allow enough time for proper job completion.
--sge-nodes INT INT The number of nodes and the number of processor per nodes to use (e.g. qsub -lnodes=X:ppn=Y on the cluster, where X is the number of nodes and Y is the number of processor to use. Do not use if you are not required to specify the number of nodes for your jobs on the cluster. Allow enough ressources for proper job completion.
--ibs-sge-walltime STRING The time limit (for clusters) for the IBS jobs. Do not use if you are not required to specify a walltime for your jobs on your cluster (e.g. -lwalltime=1:0:0 on the cluster). Allow enough time for proper job completion.
--ibs-sge-nodes INT INT The number of nodes and the number of processor per nodes to use for the IBS jobs (e.g. qsub -lnodes=X:ppn=Y on the cluster, where X is the number of nodes and Y is the number of processor to use. Do not use if you are not required to specify the number of nodes for your jobs on the cluster. Allow enough ressources for proper job completion.
--line-per-file-for-sge INT The number of line per file for SGE task array. [default: 100]
--nb-components INT The number of component to compute. [default: 10]
--outliers-of STRING Finds the outliers of this population. [default: CEU]
--multiplier FLOAT To find the outliers, we look for more than x times the cluster standard deviation. [default: 1.9]
--xaxis STRING The component to use for the X axis. [default: C1]
--yaxis STRING The component to use for the Y axis. [default: C2]
--format STRING The output file format (png, ps, pdf, or X11 formats are available). [default: png]
--title STRING The title of the MDS plot. [default: C2 in function of C1 - MDS]
--xlabel STRING The label of the X axis. [default: C1]
--ylabel STRING The label of the Y axis. [default: C2]
--create-scree-plot   Computes Eigenvalues and creates a scree plot.
--scree-plot-title STRING The main title of the scree plot

The name of the standalone script is pyGenClean_check_ethnicity. If you want to redo the outlier detection using a different multiplier, have a look at the pyGenClean_find_outliers script. If you want to redo any MDS plot, have a look at the pyGenClean_plot_MDS script. If you want to compute the Eigenvectors using the smartpca tool, have a look at the pyGenClean_plot_eigenvalues script.

Minor Allele Frequency of Zero

The name to use in the configuration file is flag_maf_zero and there are no customization possible.

The name of the standalone script is pyGenClean_flag_maf_zero.

Hardy Weinberg Equilibrium

The name to use in the configuration file is flag_hw and the List of options for the flag_hw script. table shows its configuration.

List of options for the flag_hw script.
Option Type Description
--hwe FLOAT The Hardy-Weinberg equilibrium threshold. [default: 1e-4]

The name of the standalone script is pyGenClean_flag_hw.

Subsetting the Data

The name to use in the configuration file is subset and the List of options for the subset script. table shows its configuration.

List of options for the subset script.
Option Type Description
--exclude FILE A file containing SNPs to exclude from the data set.
--extract FILE A file containing SNPs to extract from the data set.
--remove FILE A file containing samples (FID and IID) to remove from the data set.
--keep FILE A file containing samples (FID and IID) to keep from the data set.

The name of the standalone script is pyGenClean_subset_data.

Comparison with a Gold Standard

The name to use in the configuration file is compare_gold_standard and the List of options for the compare_gold_standard script. table shows its configuration.

List of options for the compare_gold_standard script.
Option Type Description
--gold-bfile FILE The input file prefix (will find the plink binary files by appending the prefix to the .bim, .bed and .fam files, respectively.) for the Gold Standard .
--same-samples FILE A file containing samples which are present in both the gold standard and the source panel. One line by identity and tab separated. For each row, first sample is Gold Standard, second is source panel.
--source-manifest FILE The illumina marker manifest.
--source-alleles FILE A file containing the source alleles (TOP). Two columns (separated by tabulation, one with the marker name, the other with the alleles (separated by space). No header.
--sge   Use SGE for parallelization.
--do-not-flip   Do not flip SNPs. WARNING: only use this option only if the Gold Standard was generated using the same chip (hence, flipping is unnecessary).
--use-marker-names   Use marker names instead of (chr, position). WARNING: only use this options only if the Gold Standard was generated using the same chip (hence, they have the same marker names).

The name of the standalone script is pyGenClean_compare_gold_standard.