List of Modules and their Options

The following sections list the available scripts that can be used in the configuration file, along with their customization options.

Contamination

The name to use in the configuration file is contamination. Its options are listed in the table "List of options for the contamination script".

List of options for the **contamination** script.
Option	Type	Description
`--raw-dir`	`STRING`	Directory containing the raw data (one file per sample, where the name of the file, minus the extension, is the sample identification number).
`--colsample`	`STRING`	The sample column.
`--colmarker`	`STRING`	The marker column.
`--colbaf`	`STRING`	The B allele frequency column.
`--colab1`	`STRING`	The AB Allele 1 column.
`--colab2`	`STRING`	The AB Allele 2 column.
`--sge`		Use SGE for parallelization.
`--sge-walltime`	`STRING`	The walltime for the job to run on the cluster. Do not use if you are not required to specify a walltime for your jobs on your cluster (e.g. `qsub -lwalltime=1:0:0` on the cluster).
`--sge-nodes`	`INT`	The number of nodes and the number of processors per node to use (e.g. `qsub -lnodes=X:ppn=Y` on the cluster, where X is the number of nodes and Y is the number of processors to use). Do not use if you are not required to specify the number of nodes for your jobs on the cluster.
`--sample-per-run-for-sge`	`INT`	The number of samples to run for a single SGE job.

The name of the standalone script is pyGenClean_check_contamination.

Duplicated Samples

The name to use in the configuration file is duplicated_samples. Its options are listed in the table "List of options for the duplicated_samples script".

List of options for the **duplicated_samples** script.
Option	Type	Description
`--sample-completion-threshold`	`FLOAT`	The completion threshold to consider a replicate when choosing the best replicates and for creating the composite samples. [default: 0.9]
`--sample-concordance-threshold`	`FLOAT`	The concordance threshold to consider a replicate when choosing the best replicates and for creating the composite samples. [default: 0.97]

The name of the standalone script is pyGenClean_duplicated_samples.
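
To make the two thresholds above more concrete, here is a minimal sketch of how completion and concordance could be computed for a pair of replicates. It is not the pyGenClean implementation: the function names, the encoding of genotypes as strings with `None` for a no-call, and the use of "greater than or equal" comparisons are all assumptions made for illustration.

```python
def completion(genotypes):
    """Fraction of genotypes that are called (None marks a no-call here)."""
    called = sum(1 for g in genotypes if g is not None)
    return called / len(genotypes)


def concordance(rep_a, rep_b):
    """Fraction of identical genotypes among markers called in both replicates."""
    pairs = [(a, b) for a, b in zip(rep_a, rep_b)
             if a is not None and b is not None]
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if a == b) / len(pairs)


def keep_replicate(rep, other, completion_threshold=0.9,
                   concordance_threshold=0.97):
    """Hypothetical rule: a replicate is considered only if it reaches both thresholds."""
    return (completion(rep) >= completion_threshold
            and concordance(rep, other) >= concordance_threshold)
```

With these definitions, a replicate that has one no-call out of four markers has a completion of 0.75 and would not be considered under the default threshold of 0.9.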

Duplicated Markers

The name to use in the configuration file is duplicated_snps. Its options are listed in the table "List of options for the duplicated_snps script".

List of options for the **duplicated_snps** script.
Option	Type	Description
`--snp-completion-threshold`	`FLOAT`	The completion threshold to consider a replicate when choosing the best replicates and for composite creation. [default: 0.9]
`--snp-concordance-threshold`	`FLOAT`	The concordance threshold to consider a replicate when choosing the best replicates and for composite creation. [default: 0.98]
`--frequency_difference`	`FLOAT`	The maximum difference in frequency between duplicated markers. [default: 0.05]

The name of the standalone script is pyGenClean_duplicated_snps.

Clean No Call and Only Heterozygous Markers

The name to use in the configuration file is noCall_hetero_snps and no customization is possible.

The name of the standalone script is pyGenClean_clean_noCall_hetero_snps.

Sample Missingness

The name to use in the configuration file is sample_missingness. Its options are listed in the table "List of options for the sample_missingness script".

List of options for the **sample_missingness** script.
Option	Type	Description
`--mind`	`FLOAT`	The missingness threshold (remove samples with more than x percent missing genotypes). [Default: 0.100]

The name of the standalone script is pyGenClean_sample_missingness.
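
As an illustration of what the `--mind` threshold means, the sketch below flags samples whose proportion of missing genotypes is strictly above the threshold. Note that the default value (0.100) is a proportion, that is, 10% of genotypes. The function names and the `None` encoding for a missing genotype are assumptions made for this example, which does not reproduce how the script itself performs the filtering.

```python
def missing_rate(genotypes):
    """Proportion of missing genotypes (None marks a missing genotype here)."""
    return sum(1 for g in genotypes if g is None) / len(genotypes)


def samples_to_remove(genotypes_by_sample, mind=0.1):
    """Return the IDs of samples with a missing rate strictly above `mind`."""
    return sorted(sample for sample, genotypes in genotypes_by_sample.items()
                  if missing_rate(genotypes) > mind)
```

A sample sitting exactly at the threshold is kept in this sketch, following the "more than" wording of the option description.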

Marker Missingness

The name to use in the configuration file is snp_missingness. Its options are listed in the table "List of options for the snp_missingness script".

List of options for the **snp_missingness** script.
Option	Type	Description
`--geno`	`FLOAT`	The missingness threshold (remove SNPs with more than x percent missing genotypes). [Default: 0.020]

The name of the standalone script is pyGenClean_snp_missingness.

Sex Check

The name to use in the configuration file is sex_check. Its options are listed in the table "List of options for the sex_check script".

List of options for the **sex_check** script.
Option	Type	Description
`--femaleF`	`FLOAT`	The female F threshold. [default: < 0.300000]
`--maleF`	`FLOAT`	The male F threshold. [default: > 0.700000]
`--nbChr23`	`INT`	The minimum number of markers on chromosome 23 before computing Plink’s sex check. [default: 50]
`--gender-plot`		Create the gender plot (summarized chr Y intensities as a function of summarized chr X intensities) for problematic samples. Not used by default.
`--sex-chr-intensities`	`FILE`	A file containing allele intensities for each of the markers located on the X and Y chromosomes for the gender plot.
`--gender-plot-format`	`STRING`	The output file format for the gender plot (png, ps, or pdf formats are available). [default: png]
`--lrr-baf`		Create the LRR and BAF plot for problematic samples. Not used by default.
`--lrr-baf-raw-dir`	`DIR`	Directory or list of directories containing information about every sample (BAF and LRR).
`--lrr-baf-format`	`STRING`	The output file format for the LRR and BAF plot (png, ps or pdf formats are available). [default: png]
`--lrr-baf-dpi`	`INT`	The pixel density of the figure(s) (DPI).

The name of the standalone script is pyGenClean_sex_check. If you want to redo the BAF and LRR plot or the gender plot, you can use the pyGenClean_baf_lrr_plot and pyGenClean_gender_plot scripts, respectively.
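
The `--femaleF` and `--maleF` thresholds can be read as two cut-offs on the F value computed by Plink's sex check. The sketch below shows one way such cut-offs translate into an inferred sex, using the defaults from the table. The function names and the rule that an unresolved or mismatched inference counts as a problem are assumptions made for illustration, not a description of the script's internals.

```python
def infer_sex(f_value, female_f=0.3, male_f=0.7):
    """Infer sex from an F value using the two thresholds."""
    if f_value < female_f:
        return "female"
    if f_value > male_f:
        return "male"
    return "unknown"  # F value falls between the two thresholds


def has_sex_problem(reported_sex, f_value, female_f=0.3, male_f=0.7):
    """Assumed rule: a sample is problematic when the inferred sex differs
    from the reported one (including when it cannot be inferred)."""
    return infer_sex(f_value, female_f, male_f) != reported_sex
```

For instance, a sample reported as male with an F value of 0.1 would be inferred as female and flagged in this sketch.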

Plate Bias

The name to use in the configuration file is plate_bias. Its options are listed in the table "List of options for the plate_bias script".

List of options for the **plate_bias** script.
Option	Type	Description
`--loop-assoc`	`FILE`	The file containing the plate organization of each sample. Must contain three columns (with no header): famID, indID and plateName.
`--pfilter`	`FLOAT`	The significance threshold used for the plate effect. [default: 1.0e-07]

The name of the standalone script is pyGenClean_plate_bias.
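
Before launching the analysis, it can be useful to check that the file given to `--loop-assoc` really has three columns and no stray lines. The following sketch groups samples by plate and raises an error on malformed lines. The function name is hypothetical, and the column separator is assumed to be whitespace (spaces or tabs), which the option description does not specify.

```python
def read_plate_lines(lines):
    """Group (famID, indID) pairs by plateName from a header-less file."""
    plates = {}
    for number, line in enumerate(lines, start=1):
        fields = line.split()  # assumption: whitespace-separated columns
        if not fields:
            continue  # skip empty lines
        if len(fields) != 3:
            raise ValueError(
                "line {}: expected 3 columns, got {}".format(number, len(fields))
            )
        fam_id, ind_id, plate_name = fields
        plates.setdefault(plate_name, []).append((fam_id, ind_id))
    return plates
```

The function accepts any iterable of lines, so an open file object can be passed directly.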

Heterozygous Haploid

The name to use in the configuration file is remove_heterozygous_haploid and no customization is possible.

The name of the standalone script is pyGenClean_remove_heterozygous_haploid.

Related Samples

The name to use in the configuration file is find_related_samples. Its options are listed in the table "List of options for the find_related_samples script".

List of options for the **find_related_samples** script.
Option	Type	Description
`--genome-only`		Only create the genome file. Not selected by default.
`--min-nb-snp`	`INT`	The minimum number of markers needed to compute IBS values. [Default: 10000]
`--indep-pairwise`	`INT` `INT` `FLOAT`	Three numbers: window size, window shift and the r2 threshold. [default: [‘50’, ‘5’, ‘0.1’]]
`--maf`	`FLOAT`	Restrict to SNPs with MAF >= threshold. [default: 0.05]
`--ibs2-ratio`	`FLOAT`	The initial IBS2* ratio (the minimum value to show in the plot). [default: 0.8]
`--sge`		Use SGE for parallelization.
`--sge-walltime`	`STRING`	The time limit (for clusters). Do not use if you are not required to specify a walltime for your jobs on your cluster (e.g. `-lwalltime=1:0:0` on the cluster). Allow enough time for proper job completion.
`--sge-nodes`	`INT` `INT`	The number of nodes and the number of processors per node to use (e.g. `qsub -lnodes=X:ppn=Y` on the cluster, where X is the number of nodes and Y is the number of processors to use). Do not use if you are not required to specify the number of nodes for your jobs on the cluster. Allow enough resources for proper job completion.
`--line-per-file-for-sge`	`INT`	The number of lines per file for the SGE task array. [default: 100]

The name of the standalone script is pyGenClean_find_related_samples. Even though randomly choosing a subset of related samples is done automatically, you can use the pyGenClean_merge_related_samples script to perform it again.
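
The option table mixes three kinds of options: flags without a value (such as `--sge`), options with a single value (such as `--maf`) and options with several values (such as `--indep-pairwise`). The sketch below shows one way to turn a dictionary of options into a command-line argument list for the standalone script. The helper `build_args` is hypothetical, and the resulting list is deliberately incomplete: the input and output options of the script are not described in this section and are therefore left out.

```python
def build_args(script, options):
    """Build an argument list from a mapping of option names to values."""
    args = [script]
    for name, value in options.items():
        if value is True:
            args.append("--" + name)  # flag without a value
        elif value is False or value is None:
            continue  # option not selected
        elif isinstance(value, (list, tuple)):
            args.append("--" + name)  # option followed by several values
            args.extend(str(v) for v in value)
        else:
            args.extend(["--" + name, str(value)])
    return args


args = build_args(
    "pyGenClean_find_related_samples",
    {
        "genome-only": False,
        "min-nb-snp": 10000,
        "indep-pairwise": [50, 5, 0.1],  # window size, window shift, r2 threshold
        "maf": 0.05,
        "sge": True,
    },
)
```

Such a list could then be handed to `subprocess.run` once the remaining required options have been added.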

Ethnicity

The name to use in the configuration file is check_ethnicity. Its options are listed in the table "List of options for the check_ethnicity script".

List of options for the **check_ethnicity** script.
Option	Type	Description
`--skip-ref-pops`		Perform the MDS computation, but skip the three reference panels.
`--ceu-bfile`	`FILE`	The input file prefix for the CEU population (the Plink binary files are found by appending .bim, .bed and .fam to the prefix).
`--yri-bfile`	`FILE`	The input file prefix for the YRI population (the Plink binary files are found by appending .bim, .bed and .fam to the prefix).
`--jpt-chb-bfile`	`FILE`	The input file prefix for the JPT-CHB population (the Plink binary files are found by appending .bim, .bed and .fam to the prefix).
`--min-nb-snp`	`INT`	The minimum number of markers needed to compute IBS values. [Default: 8000]
`--indep-pairwise`	`INT` `INT` `FLOAT`	Three numbers: window size, window shift and the r2 threshold. [default: [‘50’, ‘5’, ‘0.1’]]
`--maf`	`FLOAT`	Restrict to SNPs with MAF >= threshold. [default: 0.05]
`--sge`		Use SGE for parallelization.
`--sge-walltime`	`STRING`	The time limit (for clusters). Do not use if you are not required to specify a walltime for your jobs on your cluster (e.g. `-lwalltime=1:0:0` on the cluster). Allow enough time for proper job completion.
`--sge-nodes`	`INT` `INT`	The number of nodes and the number of processors per node to use (e.g. `qsub -lnodes=X:ppn=Y` on the cluster, where X is the number of nodes and Y is the number of processors to use). Do not use if you are not required to specify the number of nodes for your jobs on the cluster. Allow enough resources for proper job completion.
`--ibs-sge-walltime`	`STRING`	The time limit (for clusters) for the IBS jobs. Do not use if you are not required to specify a walltime for your jobs on your cluster (e.g. `-lwalltime=1:0:0` on the cluster). Allow enough time for proper job completion.
`--ibs-sge-nodes`	`INT` `INT`	The number of nodes and the number of processors per node to use for the IBS jobs (e.g. `qsub -lnodes=X:ppn=Y` on the cluster, where X is the number of nodes and Y is the number of processors to use). Do not use if you are not required to specify the number of nodes for your jobs on the cluster. Allow enough resources for proper job completion.
`--line-per-file-for-sge`	`INT`	The number of lines per file for the SGE task array. [default: 100]
`--nb-components`	`INT`	The number of components to compute. [default: 10]
`--outliers-of`	`STRING`	Finds the outliers of this population. [default: CEU]
`--multiplier`	`FLOAT`	To find the outliers, we look for more than x times the cluster standard deviation. [default: 1.9]
`--xaxis`	`STRING`	The component to use for the X axis. [default: C1]
`--yaxis`	`STRING`	The component to use for the Y axis. [default: C2]
`--format`	`STRING`	The output file format (png, ps, pdf, or X11 formats are available). [default: png]
`--title`	`STRING`	The title of the MDS plot. [default: C2 in function of C1 - MDS]
`--xlabel`	`STRING`	The label of the X axis. [default: C1]
`--ylabel`	`STRING`	The label of the Y axis. [default: C2]
`--create-scree-plot`		Computes Eigenvalues and creates a scree plot.
`--scree-plot-title`	`STRING`	The main title of the scree plot.

The name of the standalone script is pyGenClean_check_ethnicity. If you want to redo the outlier detection using a different multiplier, have a look at the pyGenClean_find_outliers script. If you want to redo any MDS plot, have a look at the pyGenClean_plot_MDS script. If you want to compute the Eigenvectors using the smartpca tool, have a look at the pyGenClean_plot_eigenvalues script.
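
To give an intuition for the `--multiplier` option, the sketch below flags samples that lie further from the center of a reference cluster than a multiple of the cluster spread, using two MDS components as coordinates. This is only one possible reading of "x times the cluster standard deviation": the spread is taken here as the root mean square distance of the reference points to their center, and all function names are hypothetical. It does not reproduce the clustering performed by the script.

```python
import math


def center_of(points):
    """Mean position of a list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def cluster_spread(points, center):
    """Root mean square distance of the points to the center (assumed spread)."""
    return math.sqrt(sum(math.dist(p, center) ** 2 for p in points) / len(points))


def find_outliers(samples, reference, multiplier=1.9):
    """Return the names of samples further than multiplier * spread from the center."""
    center = center_of(reference)
    limit = multiplier * cluster_spread(reference, center)
    return sorted(name for name, point in samples.items()
                  if math.dist(point, center) > limit)
```

Lowering the multiplier makes the criterion stricter, so more samples are reported as outliers of the reference population.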

Minor Allele Frequency of Zero

The name to use in the configuration file is flag_maf_zero and no customization is possible.

The name of the standalone script is pyGenClean_flag_maf_zero.

Hardy Weinberg Equilibrium

The name to use in the configuration file is flag_hw. Its options are listed in the table "List of options for the flag_hw script".

List of options for the **flag_hw** script.
Option	Type	Description
`--hwe`	`FLOAT`	The Hardy-Weinberg equilibrium threshold. [default: 1e-4]

The name of the standalone script is pyGenClean_flag_hw.

Subsetting the Data

The name to use in the configuration file is subset. Its options are listed in the table "List of options for the subset script".

List of options for the **subset** script.
Option	Type	Description
`--exclude`	`FILE`	A file containing SNPs to exclude from the data set.
`--extract`	`FILE`	A file containing SNPs to extract from the data set.
`--remove`	`FILE`	A file containing samples (FID and IID) to remove from the data set.
`--keep`	`FILE`	A file containing samples (FID and IID) to keep from the data set.

The name of the standalone script is pyGenClean_subset_data.

Comparison with a Gold Standard

The name to use in the configuration file is compare_gold_standard. Its options are listed in the table "List of options for the compare_gold_standard script".

List of options for the **compare_gold_standard** script.
Option	Type	Description
`--gold-bfile`	`FILE`	The input file prefix for the Gold Standard (the Plink binary files are found by appending .bim, .bed and .fam to the prefix).
`--same-samples`	`FILE`	A file containing the samples that are present in both the Gold Standard and the source panel. One line per identity, tab-separated. On each row, the first sample is from the Gold Standard and the second is from the source panel.
`--source-manifest`	`FILE`	The Illumina marker manifest.
`--source-alleles`	`FILE`	A file containing the source alleles (TOP). Two tab-separated columns: one with the marker name, the other with the alleles (separated by a space). No header.
`--sge`		Use SGE for parallelization.
`--do-not-flip`		Do not flip SNPs. WARNING: use this option only if the Gold Standard was generated using the same chip (hence, flipping is unnecessary).
`--use-marker-names`		Use marker names instead of (chr, position). WARNING: use this option only if the Gold Standard was generated using the same chip (hence, they have the same marker names).

The name of the standalone script is pyGenClean_compare_gold_standard.
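
The layout expected for the `--source-alleles` file (marker name, a tabulation, then the alleles separated by a space, with no header) can be checked with a few lines of code. The sketch below parses such lines into a dictionary. The function name is hypothetical, and the assumption that each marker has exactly two alleles is made for this example only.

```python
def parse_source_alleles(lines):
    """Map each marker name to its pair of alleles."""
    alleles = {}
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError("line {}: expected 2 tab-separated columns".format(number))
        marker, pair = fields
        parts = pair.split(" ")
        if len(parts) != 2:  # assumption: exactly two alleles per marker
            raise ValueError("line {}: expected 2 alleles".format(number))
        alleles[marker] = (parts[0], parts[1])
    return alleles
```

A line that uses spaces instead of a tabulation between the marker name and the alleles is rejected, which helps catch files saved with the wrong separator.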
