Proposed Protocol¶

Contamination Module¶

Note

Input files:

PLINK binary pedfiles (BED, BIM and FAM)

Examine the output file named contamination.bafRegress. It will contain the contamination estimates (along with confidence values). Usually, an estimate higher than 0.01 means possible contamination.
The automatic report will contain the list of samples with possible contamination (i.e. an estimate value higher than 0.01).
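
The same check can be done by hand. The function below is a minimal sketch that assumes contamination.bafRegress is whitespace-delimited, has a header row with a column named `estimate`, and stores the sample identifier in the first column. The name `list_contaminated` is hypothetical and not part of the pipeline.

```shell
# Hypothetical helper (not part of the pipeline): print the samples whose
# contamination estimate is above a threshold (default 0.01).
# Assumes a whitespace-delimited file with a header row that has a column
# named "estimate" and the sample identifier in the first column.
list_contaminated() {
    awk -v thr="${2:-0.01}" '
        NR == 1 {
            # Locate the "estimate" column from the header.
            for (i = 1; i <= NF; i++) if ($i == "estimate") col = i
            if (!col) { print "no estimate column found" > "/dev/stderr"; exit 1 }
            next
        }
        ($col + 0) > (thr + 0) { print $1 "\t" $col }
    ' "$1"
}
```

Calling `list_contaminated contamination.bafRegress` prints one sample and its estimate per line; a second argument overrides the 0.01 threshold.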

Preprocessing Steps¶

Remove SNPs without chromosomal and physical position (chromosome and position of 0).
Remove INDELs (markers with alleles I or D).
Determine if there are duplicated samples. These samples must have exactly the same family (FID) and individual (IID) identification to be treated as duplicated samples by the Duplicated Samples Module. PLINK’s option --update-ids could be used.
If input is a transposed pedfile, be sure to use PLINK’s option --tab to produce the appropriate file format.
For the Plate Bias Module, a text file explaining the plate distribution of each sample must be provided using the option --loop-assoc in the configuration file. The following columns are required (in order, without a header):
- the family identification;
- the individual identification;
- and the plate identification.
Produce the parameter files (see Configuration Files for details about their format).
To launch the analysis, consult the section How to Run the Pipeline.
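
The first two preprocessing steps can be sketched as follows, assuming the standard six-column BIM layout (chromosome, marker name, genetic distance, position, allele 1, allele 2). The function name `list_markers_to_exclude` is hypothetical. The sketch flags a marker when either its chromosome or its position is 0; replace `||` with `&&` on that line if only markers with both values equal to 0 should be flagged.

```shell
# Hypothetical helper: list marker names to exclude before running the pipeline.
# Assumes the standard six-column BIM layout:
# chromosome, marker name, genetic distance, position, allele 1, allele 2.
list_markers_to_exclude() {
    awk '
        # Markers without a chromosome or without a physical position.
        $1 == "0" || $4 == "0" { print $2; next }
        # INDELs, encoded with alleles I or D.
        $5 == "I" || $5 == "D" || $6 == "I" || $6 == "D" { print $2 }
    ' "$1"
}
```

The resulting list, one marker name per line, could then be given to PLINK’s option --exclude.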

Duplicated Samples Module¶

Note

Input files:

PLINK transposed pedfiles from the Preprocessing Steps.

Examine the log to confirm options used and to detect any problems occurring while running the script.
Examine dup_samples.diff to evaluate whether some samples have many discordant genotypes (this could indicate a possible sample mix-up). To count how often each sample appears in the file, use the following command lines:
```
$ cut -f4 dup_samples.diff | sort -k1,1 | uniq -c
$ cut -f5 dup_samples.diff | sort -k1,1 | uniq -c
```
If a sample is present more than 10,000 times (for 2.5 million SNPs), this could indicate a sample mix-up.
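
The two counts can also be combined and filtered in one step. The sketch below assumes dup_samples.diff is tab-delimited with sample names (without spaces) in columns 4 and 5, as in the commands above. Unlike those commands, it adds up the occurrences from both columns, which is a design choice of this example. The function name `flag_discordant_samples` is hypothetical.

```shell
# Hypothetical helper: print samples that occur more than a given number of
# times in column 4 or column 5 of a tab-delimited dup_samples.diff file.
# The threshold defaults to 10000; occurrences in both columns are summed.
flag_discordant_samples() {
    cut -f4,5 "$1" | tr '\t' '\n' | sort | uniq -c |
        awk -v thr="${2:-10000}" '$1 > thr + 0 { print $2 "\t" $1 }'
}
```

For example, `flag_discordant_samples dup_samples.diff 10000` prints each sample over the threshold followed by its count.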
Examine dup_samples.not_good_enough to determine if samples have a concordance rate below the threshold set by the user. These samples are present in the dup_samples.final.tfam if they are the chosen ones.
Examine dup_samples.summary to evaluate completion rate and concordance between the replicates of potentially problematic samples.
Examine dup_samples.concordance file for the problematic samples; this could help to determine which sample is the discordant replicate.
If a sample appears problematic, rename it and keep it in the analysis so that the Related Samples Module can determine whether it is a duplicate of another sample (mix-up).

If necessary, samples present in the dup_samples.not_good_enough file can be removed from the data set with the subset module (see the First Subset Module (optional)). If not, proceed to the Duplicated Markers Module.

First Subset Module (optional)¶

Note

Input files:

From the Duplicated Samples Module:

dup_samples.final.tfam
dup_samples.final.tped

Extract the family (FID) and individual (IID) identification from dup_samples.not_good_enough with the following command line:
```
$ cut -f3,4 dup_samples.not_good_enough | sed "1d" | sort -k1,1 \
>     | uniq > samples_to_remove
```

Duplicated Markers Module¶

Note

Input files:

From the Duplicated Samples Module:

dup_samples.final.tfam
dup_samples.final.tped

or from the First Subset Module (optional):

subset.bed
subset.bim
subset.fam

Examine the log to confirm options used and to detect any problems occurring while running the script.
Examine dup_snps.duplicated_marker_names to detect SNPs with exactly the same name but mapping to different chromosomal locations. (This file is not produced if no duplicated marker names are identified.)
Determine the number of duplicated SNPs merged (same allele, same frequency, etc.). Merged SNPs were removed and are listed in the file dup_snps.removed_duplicates; the number of lines in this file corresponds to the number of SNPs merged. SNPs that were not merged, and the reasons why (e.g. homo_hetero, diff_frequency, homo_flip, etc.), are listed in the file dup_snps.problems.
SNPs with concordance rate below the threshold are present in dup_snps.not_good_enough. To have the list of those SNPs:
```
$ grep -w concordance dup_snps.not_good_enough | cut -f1 \
>     > SNP_with_low_concordance_rate
```

If necessary, use the subset option in the configuration file to remove the low concordance rate SNPs (see the Second Subset Module (optional)).

Second Subset Module (optional)¶

Note

Input files:

From the Duplicated Markers Module:

dup_snps.final.tfam
dup_snps.final.tped

Extract SNPs with a concordance rate below the threshold set by the user with the following command line:
```
$ grep -w concordance dup_snps.not_good_enough | cut -f1 \
>     > SNP_with_low_concordance_rate
```

Clean No Call and Only Heterozygous Markers Module¶

Note

Input files:

From the Duplicated Markers Module:

dup_snps.final.tfam
dup_snps.final.tped

or from the Second Subset Module (optional):

subset.bed
subset.bim
subset.fam

Examine the log to confirm options used and to detect any problems occurring while running the script.
SNPs removed because they failed (no call for every sample) are listed in clean_noCall_hetero.allFailed.
SNPs removed because they are heterozygous for every sample are listed in clean_noCall_hetero.allHetero.

Sample Missingness Module (mind 0.1)¶

Note

Input files:

From the Clean No Call and Only Heterozygous Markers Module:

clean_noCall_hetero.tfam
clean_noCall_hetero.tped

Examine the log to confirm options used and to detect any problems occurring while running the script.
Examine PLINK’s log file to detect any problem at this step.
Individuals removed because they did not pass the completion rate threshold are listed in clean_mind.irem.

Marker Missingness Module¶

Note

Input files:

From the Sample Missingness Module (mind 0.1):

clean_mind.bed
clean_mind.bim
clean_mind.fam

Examine the log to confirm options used and to detect any problems occurring while running the script.
Examine PLINK’s log file to detect any problem at this step.
SNPs removed because they did not pass the completion rate threshold are listed in clean_geno.removed_snps.

Sample Missingness Module (mind 0.02)¶

Note

Input files:

From the Marker Missingness Module:

clean_geno.bed
clean_geno.bim
clean_geno.fam

Examine the log to confirm options used and to detect any problems occurring while running the script.
Examine PLINK’s log file to detect any problem at this step.
Individuals removed because they did not pass the completion rate threshold are listed in clean_mind.irem.

Sex Check Module¶

Note

Input files:

From Sample Missingness Module (mind 0.02):

clean_mind.bed
clean_mind.bim
clean_mind.fam

Examine the log to confirm options used and to detect any problems occurring while running the script.
Examine PLINK’s log file to detect any problem at this step.
Examine sexcheck.list_problem_sex, which contains all individuals identified by PLINK as having a gender problem.
Examine sexcheck.chr23_recodeA.raw.hetero to determine heterozygosity on the X chromosome of problematic samples. Consanguineous females may have low heterozygosity on the X chromosome. If many genotyped SNPs are rare, heterozygosity may also be low.
Examine sexcheck.chr24_recodeA.raw.noCall to determine the number of Y markers with missing calls. Column nbGeno is the total number of genotypes checked and nbNoCall is the number of genotypes with missing calls on chromosome Y. Males should have low values of nbNoCall. Females have few called genotypes for Y chromosome markers (high values of nbNoCall), but nbNoCall is often not equal to nbGeno, probably because some Y markers lie in pseudo-autosomal regions.
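
To compare samples more easily, the two columns can be turned into a missing-call fraction. The sketch below assumes the noCall file is whitespace-delimited and has a header row containing the nbGeno and nbNoCall column names described above; the function name `y_missing_fraction` and the added `fracNoCall` column are hypothetical.

```shell
# Hypothetical helper: append the fraction of missing Y chromosome genotypes
# (nbNoCall / nbGeno) to every row of the noCall file.
# Assumes a whitespace-delimited file with a header row that names the
# nbGeno and nbNoCall columns. Rows with nbGeno equal to 0 are skipped.
y_missing_fraction() {
    awk '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "nbGeno") geno = i
                if ($i == "nbNoCall") nocall = i
            }
            if (!geno || !nocall) { print "expected columns not found" > "/dev/stderr"; exit 1 }
            print $0 "\tfracNoCall"
            next
        }
        $geno > 0 { printf "%s\t%.3f\n", $0, $nocall / $geno }
    ' "$1"
}
```

Males are expected to have a fraction close to 0, and females a high fraction that usually stays below 1.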

If probe intensities from X and Y chromosomes are available and the gender plot has been created:

Examine the log to confirm options used and to detect any problems occurring while running the script.
Examine sexcheck.png to detect any individuals in the XXY or X0 regions, females in the male cluster and males in the female cluster (see the Gender plot figure). Confirm if possible the gender problems identified with the previous sex check problem step.

If intensity files for each sample are available and the BAF and LRR plots have been created:

Examine the log to confirm options used and to detect any problems occurring while running the script.
Examine sexcheck_sample-id_lrr_baf.png for each sample. Usually, females have LRR values around 0 (between -0.5 and 0.5) while males have LRR values between -0.5 and -1. Females have three lines on BAF graphics: one at 1 (homozygous for the B allele), one at 0.5 (heterozygous AB) and one at 0 (homozygous for the A allele). Males have two lines: one at 1 (homozygous for the B allele) and one at 0 (homozygous for the A allele). For more details, see The Plots section of the Sex Check Module.

Keep individuals identified with gender problem until the Related Samples Module (mix up of samples could be resolved at this step).

Plate Bias Module¶

Note

Input files:

From the Sample Missingness Module (mind 0.02):

clean_mind.bed
clean_mind.bim
clean_mind.fam

or, if the subset option is used to remove the SNPs listed in the nof file (see below):

subset.bed
subset.bim
subset.fam

Verify whether PLINK produced a nof file when the input files for this step were created (from the Sample Missingness Module (mind 0.02)). The nof file contains SNPs with no founder genotypes observed. If it exists, remove the SNPs it lists using the subset tool before launching the plate bias analysis. If they are not removed, those SNPs will produce an error when PLINK performs the loop-assoc analysis, and the following message will be present in PLINK’s log file plate_bias.log: “ERROR: FEXACT error 3”. SNPs on chromosome 24 could also produce this error.
Examine the log to confirm options used and to detect any problems occurring while running the script.
Examine plate_bias.log to detect any problem at this step.
The plate_bias.significant_SNPs.txt file contains a list of SNPs with P value below the threshold. Care should be taken with those SNPs if significant results are obtained in association tests. These SNPs are NOT removed from the data set, only flagged.
Low MAF can explain part of the plate bias. Examine the output file plate_bias.significant_SNPs.frq to determine whether the flagged SNPs have a low MAF. Other possible explanations are relatedness or shared ethnicity among individuals assigned to the same plates, with none of them on other plates.
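
The low-MAF check can be sketched as follows, assuming plate_bias.significant_SNPs.frq follows the usual PLINK frequency layout with a header row containing the SNP and MAF columns. The function name `low_maf_snps` and the default cutoff of 0.05 are illustrative assumptions, not pipeline defaults.

```shell
# Hypothetical helper: list flagged SNPs whose MAF is below a cutoff
# (default 0.05, an arbitrary illustrative value).
# Assumes a whitespace-delimited file with a header row containing
# the SNP and MAF columns.
low_maf_snps() {
    awk -v thr="${2:-0.05}" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "SNP") snp = i
                if ($i == "MAF") maf = i
            }
            if (!snp || !maf) { print "SNP or MAF column not found" > "/dev/stderr"; exit 1 }
            next
        }
        # Skip undefined frequencies (reported as NA) and keep low-MAF markers.
        $maf != "NA" && ($maf + 0) < (thr + 0) { print $snp "\t" $maf }
    ' "$1"
}
```

For example, `low_maf_snps plate_bias.significant_SNPs.frq 0.01` lists the flagged SNPs with a MAF below 1%.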

Related Samples Module¶

Note

Input files:

From the Sample Missingness Module (mind 0.02):

clean_mind.bed
clean_mind.bim
clean_mind.fam

Examine the log to confirm options used and to detect any problems occurring while running the script.
File ibs.pruning_0.1.prune.in contains the list of uncorrelated SNPs used for the IBS analysis.
Examine ibs.related_individuals_z1.png and ibs.related_individuals_z2.png to detect whether there are samples in the parent-child, duplicated samples, first degree relative and second degree relative areas (see the Z1 in function of IBS2 ratio and Z2 in function of IBS2 ratio plots).
File ibs.related_individuals lists pairs of related individuals. The index column indicates the group of related samples. The status column indicates the probable link between a pair of individuals based on $Z_0$, $Z_1$ and $Z_2$ values (see the IBD allele sharing values table [which the $Z$ values approximate] or the RelatedSamples.find_related_samples.extractRelatedIndividuals() function for thresholds).
If there are known duplicated samples, examine ibs.related_individuals to determine whether they were identified correctly. If not, this could indicate a possible sample mix-up.
File ibs.choosen_related_individuals contains a list of related samples to keep. One related sample from each pair is randomly selected. If there is a group of related individuals, one sample is randomly selected from the group. All non-selected samples are listed in ibs.discarded_related_individuals and should be removed from the analysis at a later stage.
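
A quick overview of the relationships found can be obtained by tallying the status column. The sketch below assumes ibs.related_individuals is tab-delimited with a header row in which the status column is named `status` (matched case-insensitively); both the layout and the function name `count_relationship_status` are assumptions to verify against the actual file.

```shell
# Hypothetical helper: count how many pairs fall under each status.
# Assumes a tab-delimited file with a header row that has a column named
# "status" (any letter case).
count_relationship_status() {
    awk -F '\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) if (tolower($i) == "status") col = i
            if (!col) { print "status column not found" > "/dev/stderr"; exit 1 }
            next
        }
        { n[$col]++ }
        END { for (s in n) print s "\t" n[s] }
    ' "$1" | sort
}
```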

IBD allele sharing values¶
Relationship	$k_0$	$k_1$	$k_2$	Coancestry $\theta = 1/2 k_2 + 1/4 k_1$
Unrelated	1	0	0	0
Identical twins	0	0	1	$1/2$
Parent-child	0	1	0	$1/4$
Full siblings	$1/4$	$1/2$	$1/4$	$1/4$
Half siblings	$1/2$	$1/2$	0	$1/8$
Uncle nephew	$1/2$	$1/2$	0	$1/8$
Grandparent grandchild	$1/2$	$1/2$	0	$1/8$
Double first cousins	$9/16$	$3/8$	$1/16$	$1/8$
First cousins	$3/4$	$1/4$	0	$1/16$
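
As a worked check of the coancestry column, substituting the full-sibling and first-cousin values of $k_1$ and $k_2$ from the table into $\theta = 1/2 k_2 + 1/4 k_1$ gives:

```latex
\begin{align*}
\theta_{\text{full siblings}} &= \tfrac{1}{2} k_2 + \tfrac{1}{4} k_1
  = \tfrac{1}{2} \cdot \tfrac{1}{4} + \tfrac{1}{4} \cdot \tfrac{1}{2} = \tfrac{1}{4}, \\
\theta_{\text{first cousins}} &= \tfrac{1}{2} \cdot 0 + \tfrac{1}{4} \cdot \tfrac{1}{4} = \tfrac{1}{16}.
\end{align*}
```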

Ethnicity Module¶

Note

Input files:

From the Sample Missingness Module (mind 0.02):

clean_mind.bed
clean_mind.bim
clean_mind.fam

Examine the log to confirm options used and to detect any problems occurring while running the script.
File ethnic.ibs.pruning_0.1.prune.in contains the list of uncorrelated SNPs used for the MDS analysis.
File ethnic.mds.mds contains the list of principal components as calculated by PLINK.
Examine ethnicity.mds.png, ethnicity.before.png, ethnicity.after.png and ethnicity.outliers.png to detect samples outside the selected cluster (see The Plots generated from the Ethnicity Module for more information).

If there are too many outliers still present in the data set (i.e. radius is too large), analysis can be redone using the pyGenClean_find_outliers standalone script, using a different value for --multiplier. For more information, refer to the Finding Outliers section of the Ethnicity Module.
Samples outside the selected cluster are listed in ethnicity.outliers. If necessary those samples could be removed at a later stage with the subset option.

Third Subset Module¶

Note

Input files:

From the Sample Missingness Module (mind 0.02):

clean_mind.bed
clean_mind.bim
clean_mind.fam

Use the subset module to remove samples with gender problems (the Sex Check Module), outliers from the ethnicity cluster (the Ethnicity Module), related samples (the Related Samples Module) and any other samples that need to be removed from the data set.

To produce a file containing all the samples to remove from the data set:
```
$ cat sexcheck.list_problem_sex_ids ibs.discarded_related_individuals \
>     ethnicity.outliers > samples_to_remove.txt
```
One sample may be removed for more than one reason, hence be present more than one time in the final samples_to_remove.txt file. This is not an issue for this step.
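
Although repeated samples are harmless here, a deduplicated list is easier to count and review. The sketch below assumes that the first two columns of each input file are the family (FID) and individual (IID) identification; the function name `merge_sample_lists` is hypothetical.

```shell
# Hypothetical helper: merge several sample lists into one without repeats.
merge_sample_lists() {
    # Keep only the first two columns (assumed to be FID and IID),
    # normalize the delimiter to a tab, then drop duplicated lines.
    awk 'NF >= 2 { print $1 "\t" $2 }' "$@" | sort -u
}
```

For example, `merge_sample_lists sexcheck.list_problem_sex_ids ibs.discarded_related_individuals ethnicity.outliers > samples_to_remove.txt` writes each sample once.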

Heterozygote Haploid Module¶

Note

Input files:

From the Third Subset Module:

subset.bed
subset.bim
subset.fam

Samples with gender problems must have been removed before performing this module.

Examine the log to confirm options used and to detect any problems occurring while running the script.
Examine without_hh_genotypes.log to detect any problem at this step.

The number of heterozygous haploid genotypes set to missing is indicated in the without_hh_genotypes.log file.

Minor Allele Frequency of Zero Module¶

Note

Input files:

From the Heterozygote Haploid Module:

without_hh_genotypes.bed
without_hh_genotypes.bim
without_hh_genotypes.fam

Examine the log to confirm options used and to detect any problems occurring while running the script.
Examine flag_maf_0.log to detect any problem at this step.
The file flag_maf_0.na_list contains a list of SNPs with minor allele frequency of 0.

These SNPs were only flagged, not removed. If necessary, use the subset module to remove SNPs with a minor allele frequency of 0 (see the Fourth Subset Module (optional)).

Fourth Subset Module (optional)¶

Note

Input files:

From the Heterozygote Haploid Module:

without_hh_genotypes.bed
without_hh_genotypes.bim
without_hh_genotypes.fam

Examine the log to confirm options used and to detect any problems occurring while running the script.
Examine subset.log to detect any problem at this step.

Hardy Weinberg Equilibrium Module¶

Note

Input files:

From the Heterozygote Haploid Module:

without_hh_genotypes.bed
without_hh_genotypes.bim
without_hh_genotypes.fam

or from the Fourth Subset Module (optional):

subset.bed
subset.bim
subset.fam

Examine the log to confirm options used and to detect any problems occurring while running the script.
Examine flag_hw.threshold_1e-4.log and flag_hw.threshold_Bonferroni.log to detect any problem at this step.
The files flag_hw.snp_flag_threshold_Bonferroni and flag_hw.snp_flag_threshold_1e-4 contain lists of SNPs with P value below Bonferroni and below $1 \times 10^{-4}$ threshold, respectively.
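
For reference, a Bonferroni threshold is a significance level divided by the number of tests. The sketch below computes it from the number of markers in a BIM file (one marker per line). The significance level of 0.05 is an assumption for illustration, so the result may differ from the value the module actually uses; the function name `bonferroni_threshold` is hypothetical.

```shell
# Hypothetical helper: Bonferroni-corrected threshold, alpha / number of markers,
# with the marker count taken from the number of lines of a BIM file.
# The default alpha of 0.05 is an assumption, not a documented pipeline default.
bonferroni_threshold() {
    awk -v alpha="${2:-0.05}" 'END { if (NR > 0) printf "%.3e\n", alpha / NR }' "$1"
}
```

For example, `bonferroni_threshold without_hh_genotypes.bim` prints the threshold in scientific notation.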

The markers are only flagged using this module. If you want to remove those markers, have a look at the Fifth Subset Module (optional).

Fifth Subset Module (optional)¶

Note

Input files:

From the Heterozygote Haploid Module:

without_hh_genotypes.bed
without_hh_genotypes.bim
without_hh_genotypes.fam

or from the Fourth Subset Module (optional):

subset.bed
subset.bim
subset.fam

Examine the log to confirm options used and to detect any problems occurring while running the script.
Examine subset.log to detect any problem at this step.
