List of Modules and their Options¶
The following sections show a list the available scripts that can be used in the configuration file, along with their options for customization.
Contamination¶
The name to use in the configuration file is contamination
and the
List of options for the contamination script. table shows its configuration.
Option | Type | Description |
---|---|---|
--raw-dir |
STRING |
Directory containing the raw data (one file per sample, where the name of the file (minus the extension) is the sample identification number. |
--colsample |
STRING |
The sample column. |
--colmarker |
STRING |
The marker column. |
--colbaf |
STRING |
The B allele frequency column. |
--colab1 |
STRING |
The AB Allele 1 column. |
--colab2 |
STRING |
The AB Allele 2 column. |
--sge |
Use SGE for parallelization. | |
--sge-walltime |
STRING |
The walltime for the job to
run on the cluster. Do not
use if you are not required
to specify a walltime for
your jobs on your cluster
(e.g.
‘qsub -lwalltime=1:0:0 ‘
on the cluster). |
--sge-nodes |
INT |
The number of nodes and the
number of processor per
nodes to use (e.g.
‘qsub -lnodes=X:ppn=Y ‘
on the cluster, where X is
the number of nodes and Y is
the number of processor to
use. Do not use if you are
not required to specify the
number of nodes for your
jobs on the cluster. |
--sample-per-run-for-sge |
INT |
The number of sample to run for a single SGE job. |
The name of the standalone script is pyGenClean_check_contamination
.
Duplicated Samples¶
The name to use in the configuration file is duplicated_samples
and the
List of options for the duplicated_samples script. table shows its configuration.
Option | Type | Description |
---|---|---|
--sample-completion-threshold |
FLOAT |
The completion threshold to consider a replicate when choosing the best replicates and for creating the composite samples. [default: 0.9] |
--sample-concordance-threshold |
FLOAT |
The concordance threshold to consider a replicate when choosing the best replicates and for creating the composite samples. [default: 0.97] |
The name of the standalone script is pyGenClean_duplicated_samples
.
Duplicated Markers¶
The name to use in the configuration file is duplicated_snps
and the
List of options for the duplicated_snps script. table shows its configuration.
Option | Type | Description |
---|---|---|
--snp-completion-threshold |
FLOAT |
The completion threshold to consider a replicate when choosing the best replicates and for composite creation. [default: 0.9] |
--snp-concordance-threshold |
FLOAT |
The concordance threshold to consider a replicate when choosing the best replicates and for composite creation. [default: 0.98] |
--frequency_difference |
FLOAT |
The maximum difference in frequency between duplicated markers [default: 0.05] |
The name of the standalone script is pyGenClean_duplicated_snps
.
Clean No Call and Only Heterozygous Markers¶
The name to use in the configuration file is noCall_hetero_snps
and there
are no customization possible.
The name of the standalone script is pyGenClean_clean_noCall_hetero_snps
.
Sample Missingness¶
The name to use in the configuration file is sample_missingness
and the
List of options for the sample_missingness script. table shows its configuration.
Option | Type | Description |
---|---|---|
--mind |
FLOAT |
The missingness threshold (remove samples with more than x percent missing genotypes). [Default: 0.100] |
The name of the standalone script is pyGenClean_sample_missingness
.
Marker Missingness¶
The name to use in the configuration file is snp_missingness
and the
List of options for the snp_missingness script. table shows its configuration.
Option | Type | Description |
---|---|---|
--geno |
FLOAT |
The missingness threshold (remove SNPs with more than x percent missing genotypes). [Default: 0.020] |
The name of the standalone script is pyGenClean_snp_missingness
.
Sex Check¶
The name to use in the configuration file is sex_check
and the
List of options for the sex_check script. table shows its configuration.
Option | Type | Description |
---|---|---|
--femaleF |
FLOAT |
The female F threshold. [default: < 0.300000] |
--maleF |
FLOAT |
The male F threshold. [default: > 0.700000] |
--nbChr23 |
INT |
The minimum number of markers on chromosome 23 before computing Plink’s sex check [default: 50] |
--gender-plot |
Create the gender plot (summarized chr Y intensities in function of summarized chr X intensities) for problematic samples. Not used by default. | |
--sex-chr-intensities |
FILE |
A file containing alleles intensities for each of the markers located on the X and Y chromosome for the gender plot. |
--gender-plot-format |
STRING |
The output file format for the gender plot (png, ps, or pdf formats are available). [default: png] |
--lrr-baf |
Create the LRR and BAF plot for problematic samples. Not used by default. | |
--lrr-baf-raw-dir |
DIR |
Directory or list of directories containing information about every samples (BAF and LRR). |
--lrr-baf-format |
STRING |
The output file format for the LRR and BAF plot (png, ps or pdf formats are available). [default: png] |
--lrr-baf-dpi |
INT |
The pixel density of the figure(s) (DPI). |
The name of the standalone script is pyGenClean_sex_check
. If you want to
redo the BAF and LRR plot or the gender plot, you can use the
pyGenClean_baf_lrr_plot
and pyGenClean_gender_plot
scripts,
respectively.
Plate Bias¶
The name to use in the configuration file is plate_bias
and the
List of options for the plate_bias script. table shows its configuration.
Option | Type | Description |
---|---|---|
--loop-assoc |
FILE |
The file containing the plate organization of each samples. Must contains three column (with no header): famID, indID and plateName. |
--pfilter |
FLOAT |
The significance threshold used for the plate effect. [default: 1.0e-07] |
The name of the standalone script is pyGenClean_plate_bias
.
Heterozygous Haploid¶
The name to use in the configuration file is remove_heterozygous_haploid
and
there are no customization possible.
The name of the standalone script is pyGenClean_remove_heterozygous_haploid
.
Ethnicity¶
The name to use in the configuration file is check_ethnicity
and the
List of options for the check_ethnicity script. table shows its configuration.
Option | Type | Description |
---|---|---|
--skip-ref-pops |
Perform the MDS computation, but skip the three reference panels. | |
--ceu-bfile |
FILE |
The input file prefix (will find the plink binary files by appending the prefix to the .bim, .bed and .fam files, respectively.) for the CEU population. |
--yri-bfile |
FILE |
The input file prefix (will find the plink binary files by appending the prefix to the .bim, .bed and .fam files, respectively.) for the CEU population. |
--jpt-chb-bfile |
FILE |
The input file prefix (will find the plink binary files by appending the prefix to the .bim, .bed and .fam files, respectively.) for the JPT-CHB population. |
--min-nb-snp |
FILE |
The minimum number of markers needed to compute IBS values. [Default: 8000] |
--indep-pairwise |
INT
INT
FLOAT |
Three numbers: window size, window shift and the r2 threshold. [default: [‘50’, ‘5’, ‘0.1’]] |
--maf |
INT |
Restrict to SNPs with MAF >= threshold. [default: 0.05] |
--sge |
Use SGE for parallelization. | |
--sge-walltime |
STRING |
The time limit (for clusters).
Do not use if you are not
required to specify a walltime
for your jobs on your cluster
(e.g. -lwalltime=1:0:0 on
the cluster). Allow enough
time for proper job
completion. |
--sge-nodes |
INT
INT |
The number of nodes and the
number of processor per nodes
to use (e.g. qsub
-lnodes=X:ppn=Y on the
cluster, where X is the number
of nodes and Y is the number
of processor to use. Do not
use if you are not required to
specify the number of nodes
for your jobs on the cluster.
Allow enough ressources for
proper job completion. |
--ibs-sge-walltime |
STRING |
The time limit (for clusters)
for the IBS jobs. Do not use
if you are not required to
specify a walltime for your
jobs on your cluster (e.g.
-lwalltime=1:0:0 on the
cluster). Allow enough time
for proper job completion. |
--ibs-sge-nodes |
INT
INT |
The number of nodes and the
number of processor per nodes
to use for the IBS jobs (e.g.
qsub
-lnodes=X:ppn=Y on the
cluster, where X is the number
of nodes and Y is the number
of processor to use. Do not
use if you are not required to
specify the number of nodes
for your jobs on the cluster.
Allow enough ressources for
proper job completion. |
--line-per-file-for-sge |
INT |
The number of line per file for SGE task array. [default: 100] |
--nb-components |
INT |
The number of component to compute. [default: 10] |
--outliers-of |
STRING |
Finds the outliers of this population. [default: CEU] |
--multiplier |
FLOAT |
To find the outliers, we look for more than x times the cluster standard deviation. [default: 1.9] |
--xaxis |
STRING |
The component to use for the X axis. [default: C1] |
--yaxis |
STRING |
The component to use for the Y axis. [default: C2] |
--format |
STRING |
The output file format (png, ps, pdf, or X11 formats are available). [default: png] |
--title |
STRING |
The title of the MDS plot. [default: C2 in function of C1 - MDS] |
--xlabel |
STRING |
The label of the X axis. [default: C1] |
--ylabel |
STRING |
The label of the Y axis. [default: C2] |
--create-scree-plot |
Computes Eigenvalues and creates a scree plot. | |
--scree-plot-title |
STRING |
The main title of the scree plot |
The name of the standalone script is pyGenClean_check_ethnicity
. If you want
to redo the outlier detection using a different multiplier, have a look at the
pyGenClean_find_outliers
script. If you want to redo any MDS plot, have a
look at the pyGenClean_plot_MDS
script. If you want to compute the
Eigenvectors using the smartpca
tool, have a look at the
pyGenClean_plot_eigenvalues
script.
Minor Allele Frequency of Zero¶
The name to use in the configuration file is flag_maf_zero
and there
are no customization possible.
The name of the standalone script is pyGenClean_flag_maf_zero
.
Hardy Weinberg Equilibrium¶
The name to use in the configuration file is flag_hw
and the
List of options for the flag_hw script. table shows its configuration.
Option | Type | Description |
---|---|---|
--hwe |
FLOAT |
The Hardy-Weinberg equilibrium threshold. [default: 1e-4] |
The name of the standalone script is pyGenClean_flag_hw
.
Subsetting the Data¶
The name to use in the configuration file is subset
and the
List of options for the subset script. table shows its configuration.
Option | Type | Description |
---|---|---|
--exclude |
FILE |
A file containing SNPs to exclude from the data set. |
--extract |
FILE |
A file containing SNPs to extract from the data set. |
--remove |
FILE |
A file containing samples (FID and IID) to remove from the data set. |
--keep |
FILE |
A file containing samples (FID and IID) to keep from the data set. |
The name of the standalone script is pyGenClean_subset_data
.
Comparison with a Gold Standard¶
The name to use in the configuration file is compare_gold_standard
and the
List of options for the compare_gold_standard script. table shows its configuration.
Option | Type | Description |
---|---|---|
--gold-bfile |
FILE |
The input file prefix (will find the plink binary files by appending the prefix to the .bim, .bed and .fam files, respectively.) for the Gold Standard . |
--same-samples |
FILE |
A file containing samples which are present in both the gold standard and the source panel. One line by identity and tab separated. For each row, first sample is Gold Standard, second is source panel. |
--source-manifest |
FILE |
The illumina marker manifest. |
--source-alleles |
FILE |
A file containing the source alleles (TOP). Two columns (separated by tabulation, one with the marker name, the other with the alleles (separated by space). No header. |
--sge |
Use SGE for parallelization. | |
--do-not-flip |
Do not flip SNPs. WARNING: only use this option only if the Gold Standard was generated using the same chip (hence, flipping is unnecessary). | |
--use-marker-names |
Use marker names instead of (chr, position). WARNING: only use this options only if the Gold Standard was generated using the same chip (hence, they have the same marker names). |
The name of the standalone script is pyGenClean_compare_gold_standard
.