Proposed Protocol

Contamination Module


Input files:

  • PLINK binary pedfiles (BED, BIM and FAM)
  1. Examine the output file named contamination.bafRegress. It will contain the contamination estimates (along with confidence values). Usually, an estimate higher than 0.01 means possible contamination.
  2. The automatic report will contain the list of samples with possible contamination (i.e. an estimate value higher than 0.01).
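
The 0.01 cutoff described above can also be applied by hand. The following is a minimal sketch, assuming that contamination.bafRegress is a tab-delimited file whose header line contains columns named sample and estimate (an assumption; check the header of your own file first). The function name list_contaminated is hypothetical.

```shell
# list_contaminated FILE [THRESHOLD]
# Prints "sample<TAB>estimate" for every row whose estimate exceeds THRESHOLD
# (default 0.01). Assumption: tab-delimited input with a header line that
# contains columns named "sample" and "estimate".
list_contaminated() {
    awk -F'\t' -v thr="${2:-0.01}" '
        NR == 1 {
            # Locate the two columns by name instead of by position.
            for (i = 1; i <= NF; i++) {
                if ($i == "sample") s = i
                if ($i == "estimate") e = i
            }
            if (!s || !e) {
                print "sample/estimate column not found" > "/dev/stderr"
                exit 1
            }
            next
        }
        ($e + 0) > (thr + 0) { print $s "\t" $e }
    ' "$1"
}
```

For example, list_contaminated contamination.bafRegress 0.01 would print only the samples above the cutoff, which can then be compared with the list in the automatic report.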

Preprocessing Steps

  • Remove SNPs without chromosomal and physical position (chromosome and position of 0).

  • Remove INDELs (markers with alleles I or D).

  • Determine if there are duplicated samples. These samples must have exactly the same family (FID) and individual (IID) identification to be treated as duplicated samples by the Duplicated Samples Module. PLINK’s option --update-ids could be used.

  • If input is a transposed pedfile, be sure to use PLINK’s option --tab to produce the appropriate file format.

  • For the Plate Bias Module, a text file explaining the plate distribution of each sample must be provided using the option --loop-assoc in the configuration file. The following columns are required (in order, without a header):

    • the family identification;
    • the individual identification;
    • and the plate identification.
  • Produce the parameter files (see Configuration Files for details about the parameter file).

  • To launch the analysis, consult the section How to Run the Pipeline.
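
The first two preprocessing steps can be sketched as follows, assuming a standard six-column BIM file (chromosome, marker name, genetic distance, position, allele 1, allele 2). The function name list_bad_markers and the file prefixes raw and raw_clean are placeholders, not names used by the pipeline.

```shell
# list_bad_markers BIM
# Prints the names of markers that have both chromosome 0 and position 0,
# or that carry an I or D allele (INDELs).
# Assumption: standard six-column PLINK BIM layout.
list_bad_markers() {
    awk '($1 == "0" && $4 == "0") ||
         $5 == "I" || $5 == "D" || $6 == "I" || $6 == "D" { print $2 }' "$1"
}

# Possible use (file prefixes are placeholders):
#   list_bad_markers raw.bim > markers_to_exclude.txt
#   plink --bfile raw --exclude markers_to_exclude.txt --make-bed --out raw_clean
```

Listing the markers first, rather than filtering in a single step, leaves a file that can be inspected before anything is removed.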

Duplicated Samples Module


Input files:

  1. Examine the log to confirm options used and to detect any problems occurring while running the script.

  2. Examine dup_samples.diff to evaluate if some samples have many discordant genotypes (this could indicate a possible sample mix-up). To identify discordant samples, use the following command lines:

    $ cut -f4 dup_samples.diff | sort -k1,1 | uniq -c
    $ cut -f5 dup_samples.diff | sort -k1,1 | uniq -c

    If a sample is present more than 10,000 times (for 2.5 million SNPs), this could indicate a sample mix-up.

  3. Examine dup_samples.not_good_enough to determine if samples have a concordance rate below the threshold set by the user. These samples are present in the if they are the chosen ones.

  4. Examine dup_samples.summary to evaluate completion rate and concordance between the replicates of potentially problematic samples.

  5. Examine dup_samples.concordance file for the problematic samples; this could help to determine which sample is the discordant replicate.

  6. If a sample appears problematic, rename it and keep it in the analysis to determine, with the Related Samples Module, whether it is a duplicate of another sample (mix-up).

If necessary, samples present in the dup_samples.not_good_enough file can be removed from the data set with the subset module (see the First Subset Module (optional)). If not, proceed to the Duplicated Markers Module.
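
The counting commands of step 2 can be wrapped so that only samples above a chosen count are printed. This is a minimal sketch; the function name flag_discordant is hypothetical, and the default of 10,000 is taken from the guideline in step 2 and should be scaled to the number of SNPs in the data set.

```shell
# flag_discordant FILE COLUMN [MIN_COUNT]
# Counts how often each value occurs in the given tab-delimited column and
# prints "count<TAB>value" for values seen more than MIN_COUNT times
# (default 10000). A header line, if present, is counted once and so never
# exceeds a realistic threshold.
flag_discordant() {
    cut -f"$2" "$1" | sort | uniq -c |
        awk -v min="${3:-10000}" '$1 > (min + 0) { print $1 "\t" $2 }'
}
```

For example, flag_discordant dup_samples.diff 4 and flag_discordant dup_samples.diff 5 correspond to the two command lines shown in step 2.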

First Subset Module (optional)


Input files:

From the Duplicated Samples Module:

  1. Extract the family (FID) and individual (IID) identification from dup_samples.not_good_enough with the following command line:

    $ cut -f3,4 dup_samples.not_good_enough | sed "1d" | sort -k1,1 \
    >     | uniq > samples_to_remove

Duplicated Markers Module


Input files:

From the Duplicated Samples Module:


or from the First Subset Module (optional):

  • subset.bed
  • subset.bim
  • subset.fam
  1. Examine the log to confirm options used and to detect any problems occurring while running the script

  2. Examine dup_snps.duplicated_marker_names to detect SNPs with exactly the same name but mapping to different chromosomal locations. (This file is not produced if no duplicated marker names are identified.)

  3. Determine the number of duplicated SNPs merged (same allele, same frequency, etc.). Merged SNPs were removed and are listed in the file dup_snps.removed_duplicates. The number of lines in this file corresponds to the number of SNPs merged. SNPs that were not merged, and the reasons why (e.g. homo_hetero, diff_frequency, homo_flip, etc.), are listed in the file dup_snps.problems.

  4. SNPs with concordance rate below the threshold are present in dup_snps.not_good_enough. To have the list of those SNPs:

    $ grep -w concordance dup_snps.not_good_enough | cut -f1 \
    >     > SNP_with_low_concordance_rate

If necessary, use the subset option in the configuration file to remove the low concordance rate SNPs (see the Second Subset Module (optional)).
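
Since the number of merged SNPs is the number of lines in dup_snps.removed_duplicates, and since some output files are not produced when they would be empty, a small helper that treats a missing file as zero lines can be convenient. This is only a sketch; the function name count_lines is hypothetical.

```shell
# count_lines FILE
# Prints the number of lines in FILE, or 0 if FILE does not exist
# (some output files are only written when there is something to report).
count_lines() {
    if [ -f "$1" ]; then
        wc -l < "$1" | tr -d ' '
    else
        echo 0
    fi
}

# Possible use:
#   count_lines dup_snps.removed_duplicates
#   count_lines dup_snps.duplicated_marker_names
```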

Second Subset Module (optional)


Input files:

From the Duplicated Markers Module:

  • Extract SNPs with a concordance rate below the threshold set by the user with the following command line:

    $ grep -w concordance dup_snps.not_good_enough | cut -f1 \
    >     > SNP_with_low_concordance_rate

Clean No Call and Only Heterozygous Markers Module


Input files:

From the Duplicated Markers Module:


or from the Second Subset Module (optional):

  • subset.bed
  • subset.bim
  • subset.fam
  1. Examine the log to confirm options used and to detect any problems occurring while running the script.
  2. SNPs removed because they failed (no call) in all samples are listed in clean_noCall_hetero.allFailed.
  3. SNPs removed because they are all heterozygous are listed in clean_noCall_hetero.allHetero.

Sample Missingness Module (mind 0.1)


Input files:

From the Clean No Call and Only Heterozygous Markers Module:

  • clean_noCall_hetero.tfam
  • clean_noCall_hetero.tped
  1. Examine the log to confirm options used and to detect any problems occurring while running the script.
  2. Examine PLINK’s log file to detect any problem at this step.
  3. Individuals removed because they did not pass the completion rate threshold are listed in clean_mind.irem.

Marker Missingness Module


Input files:

From the Sample Missingness Module (mind 0.1):

  • clean_mind.bed
  • clean_mind.bim
  • clean_mind.fam
  1. Examine the log to confirm options used and to detect any problems occurring while running the script.
  2. Examine PLINK’s log file to detect any problem at this step.
  3. SNPs removed because they did not pass the completion rate threshold are listed in clean_geno.removed_snps.

Sample Missingness Module (mind 0.02)


Input files:

From the Marker Missingness Module:

  • clean_geno.bed
  • clean_geno.bim
  • clean_geno.fam
  1. Examine the log to confirm options used and to detect any problems occurring while running the script.
  2. Examine PLINK’s log file to detect any problem at this step.
  3. Individuals removed because they did not pass the completion rate threshold are listed in clean_mind.irem.

Sex Check Module


Input files:

From Sample Missingness Module (mind 0.02):

  • clean_mind.bed
  • clean_mind.bim
  • clean_mind.fam
  1. Examine the log to confirm options used and to detect any problems occurring while running the script.
  2. Examine PLINK’s log file to detect any problem at this step.
  3. Examine sexcheck.list_problem_sex, which contains all individuals identified by PLINK as having a gender problem.
  4. Examine sexcheck.chr23_recodeA.raw.hetero to determine heterozygosity on the X chromosome of problematic samples. Consanguineous females may have low heterozygosity on the X chromosome. If many genotyped SNPs are rare, heterozygosity may also be low.
  5. Examine sexcheck.chr24_recodeA.raw.noCall to determine the number of Y markers with missing calls. Column nbGeno is the total number of genotypes checked and nbNoCall is the number of genotypes with missing calls on chromosome Y. Males should have low nbNoCall values. Females have few genotypes for Y chromosome markers, hence high nbNoCall values, but nbNoCall is often not equal to nbGeno, probably because some Y markers come from pseudo-autosomal regions.
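
To compare samples in step 5, it can help to express nbNoCall as a fraction of nbGeno. The following is a minimal sketch, assuming the noCall file is whitespace- or tab-delimited with a header line that names the nbGeno and nbNoCall columns (the rest of the layout is not assumed). The function name nocall_fraction is hypothetical.

```shell
# nocall_fraction FILE
# Appends nbNoCall / nbGeno (three decimals) to every data row.
# Assumption: the header line contains columns named "nbGeno" and "nbNoCall".
nocall_fraction() {
    awk '
        NR == 1 {
            # Locate the two columns by name.
            for (i = 1; i <= NF; i++) {
                if ($i == "nbGeno") g = i
                if ($i == "nbNoCall") n = i
            }
            if (!g || !n) {
                print "nbGeno/nbNoCall column not found" > "/dev/stderr"
                exit 1
            }
            next
        }
        # Skip rows with no genotypes to avoid dividing by zero.
        $g > 0 { printf "%s\t%.3f\n", $0, $n / $g }
    ' "$1"
}
```

Following the description above, males are expected to show fractions close to 0, and females fractions that are high but usually below 1.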

If probe intensities from X and Y chromosomes are available and the gender plot has been created:

  1. Examine the log to confirm options used and to detect any problems occurring while running the script.
  2. Examine sexcheck.png to detect any individuals in the XXY or X0 regions, females in the male cluster and males in the female cluster (see the Gender plot figure). If possible, confirm the gender problems identified in the previous sex check step.

If intensity files for each sample are available and the BAF and LRR plot has been created:

  1. Examine the log to confirm options used and to detect any problems occurring while running the script.
  2. Examine sexcheck_sample-id_lrr_baf.png for each sample. Usually, females have LRR values around 0 (between -0.5 and 0.5) while males have LRR values between -0.5 and -1. Females have three lines on BAF graphics: one at 1 (homozygous for the B allele), one at 0.5 (heterozygous AB) and one at 0 (homozygous for the A allele). Males have two lines: one at 1 (homozygous for the B allele) and one at 0 (homozygous for the A allele). For more details, see The Plots section of the Sex Check Module.

Keep individuals identified with gender problem until the Related Samples Module (mix up of samples could be resolved at this step).

Plate Bias Module


Input files:

From the Sample Missingness Module (mind 0.02):

  • clean_mind.bed
  • clean_mind.bim
  • clean_mind.fam

or, if the subset option is used to remove SNPs listed in the nof file (see below):

  • subset.bed
  • subset.bim
  • subset.fam
  1. Verify whether a nof file was produced by PLINK when the input files for this step were created (from the Sample Missingness Module (mind 0.02)). The nof file lists SNPs with no founder genotypes observed. If it exists, remove the SNPs it lists using the subset tool before launching the plate bias analysis. If they are not removed, those SNPs will produce an error when PLINK performs the loop-assoc analysis, and the following message will be present in PLINK’s log file plate_bias.log: “ERROR: FEXACT error 3”. SNPs on chromosome 24 could also produce this error.
  2. Examine the log to confirm options used and to detect any problems occurring while running the script.
  3. Examine plate_bias.log to detect any problem at this step.
  4. The plate_bias.significant_SNPs.txt file contains a list of SNPs with P value below the threshold. Care should be taken with those SNPs if significant results are obtained in association tests. These SNPs are NOT removed from the data set, only flagged.
  5. Low MAF can explain part of plate bias. Examine the output file plate_bias.significant_SNPs.frq to determine if SNPs have low MAF. Other possible explanations for plate bias are relatedness or shared ethnicity among individuals assigned to the same plates, with none of them on other plates.
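
The MAF check in step 5 can be sketched as follows, assuming plate_bias.significant_SNPs.frq follows PLINK's usual frequency file layout (a header line, then CHR, SNP, A1, A2, MAF and NCHROBS columns). The 0.05 default is an arbitrary illustration of "low MAF", not a value set by the pipeline, and the function name low_maf_snps is hypothetical.

```shell
# low_maf_snps FRQ [THRESHOLD]
# Prints "SNP<TAB>MAF" for markers whose MAF is below THRESHOLD (default 0.05).
# Assumption: PLINK .frq layout, i.e. one header line, SNP name in column 2
# and MAF in column 5. Rows with an NA frequency are skipped.
low_maf_snps() {
    awk -v thr="${2:-0.05}" '
        NR > 1 && $5 != "NA" && ($5 + 0) < (thr + 0) { print $2 "\t" $5 }
    ' "$1"
}
```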

Ethnicity Module


Input files:

From the Sample Missingness Module (mind 0.02):

  • clean_mind.bed
  • clean_mind.bim
  • clean_mind.fam
  1. Examine the log to confirm options used and to detect any problems occurring while running the script.

  2. File contains the list of uncorrelated SNPs used for the MDS analysis.

  3. File ethnic.mds.mds contains the list of principal components as calculated by PLINK.

  4. Examine ethnicity.mds.png, ethnicity.before.png, ethnicity.after.png and ethnicity.outliers.png to detect samples outside the selected cluster (see The Plots generated from the Ethnicity Module for more information).

    If there are too many outliers still present in the data set (i.e. radius is too large), analysis can be redone using the pyGenClean_find_outliers standalone script, using a different value for --multiplier. For more information, refer to the Finding Outliers section of the Ethnicity Module.

  5. Samples outside the selected cluster are listed in ethnicity.outliers. If necessary, those samples can be removed at a later stage with the subset option.

Third Subset Module


Input files:

From the Sample Missingness Module (mind 0.02):

  • clean_mind.bed
  • clean_mind.bim
  • clean_mind.fam

Use the subset module to remove samples with gender problems (the Sex Check Module), outliers from the ethnicity cluster (the Ethnicity Module), related samples (the Related Samples Module) and any other samples that need to be removed from the data set.

  • To produce a file containing all the samples to remove from the data set:

    $ cat sexcheck.list_problem_sex_ids ibs.discarded_related_individuals \
    >     ethnicity.outliers > samples_to_remove.txt

    One sample may be removed for more than one reason, hence be present more than one time in the final samples_to_remove.txt file. This is not an issue for this step.
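
Duplicated entries are harmless, but a de-duplicated list is easier to count and review. A minimal sketch, assuming that every input file gives the family (FID) and individual (IID) identification in its first two whitespace-separated columns; the function name unique_samples is hypothetical.

```shell
# unique_samples FILE...
# Prints each distinct FID/IID pair once, as "FID<TAB>IID".
# Assumption: FID and IID are the first two whitespace-separated columns
# of every input file; any further columns are ignored.
unique_samples() {
    awk 'NF >= 2 { print $1 "\t" $2 }' "$@" | sort -u
}

# Possible use:
#   unique_samples sexcheck.list_problem_sex_ids \
#       ibs.discarded_related_individuals ethnicity.outliers | wc -l
```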

Heterozygote Haploid Module


Input files:

From the Third Subset Module:

  • subset.bed
  • subset.bim
  • subset.fam

Samples with gender problems must have been removed before performing this module.

  1. Examine the log to confirm options used and to detect any problems occurring while running the script.
  2. Examine without_hh_genotypes.log to detect any problem at this step.

The number of heterozygous haploid genotypes set to missing is indicated in the without_hh_genotypes.log file.

Minor Allele Frequency of Zero Module


Input files:

From the Heterozygote Haploid Module:

  • without_hh_genotypes.bed
  • without_hh_genotypes.bim
  • without_hh_genotypes.fam
  1. Examine the log to confirm options used and to detect any problems occurring while running the script.
  2. Examine flag_maf_0.log to detect any problem at this step.
  3. The file flag_maf_0.na_list contains a list of SNPs with minor allele frequency of 0.

If necessary, use the Fourth Subset Module (optional) to remove SNPs with a minor allele frequency of 0, since this module only flags them.

Fourth Subset Module (optional)


Input files:

From the Heterozygote Haploid Module:

  • without_hh_genotypes.bed
  • without_hh_genotypes.bim
  • without_hh_genotypes.fam
  1. Examine the log to confirm options used and to detect any problems occurring while running the script.
  2. Examine subset.log to detect any problem at this step.

Hardy Weinberg Equilibrium Module


Input files:

From the Heterozygote Haploid Module:

  • without_hh_genotypes.bed
  • without_hh_genotypes.bim
  • without_hh_genotypes.fam

or from the Fourth Subset Module (optional):

  • subset.bed
  • subset.bim
  • subset.fam
  1. Examine the log to confirm options used and to detect any problems occurring while running the script.
  2. Examine flag_hw.threshold_1e-4.log and flag_hw.threshold_Bonferroni.log to detect any problem at this step.
  3. The files flag_hw.snp_flag_threshold_Bonferroni and flag_hw.snp_flag_threshold_1e-4 contain the lists of SNPs with P values below the Bonferroni threshold and below the \(1 \times 10^{-4}\) threshold, respectively.

The markers are only flagged using this module. If you want to remove those markers, have a look at the Fifth Subset Module (optional).
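
For orientation, a Bonferroni threshold is a significance level divided by the number of tests performed. The following sketch assumes a significance level of 0.05 and one test per marker in the BIM file; both are assumptions made for illustration, so check the module's log for the values actually used. The function name bonferroni_threshold is hypothetical.

```shell
# bonferroni_threshold BIM [ALPHA]
# Prints ALPHA divided by the number of lines (markers) in the BIM file,
# in scientific notation. ALPHA defaults to 0.05 (an assumed value).
bonferroni_threshold() {
    awk -v alpha="${2:-0.05}" 'END { if (NR > 0) printf "%.3e\n", alpha / NR }' "$1"
}
```

With 500,000 markers and a level of 0.05, for example, this gives 1e-7, which is much stricter than the fixed \(1 \times 10^{-4}\) threshold.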

Fifth Subset Module (optional)


Input files:

From the Heterozygote Haploid Module:

  • without_hh_genotypes.bed
  • without_hh_genotypes.bim
  • without_hh_genotypes.fam

or from the Fourth Subset Module (optional):

  • subset.bed
  • subset.bim
  • subset.fam
  1. Examine the log to confirm options used and to detect any problems occurring while running the script.
  2. Examine subset.log to detect any problem at this step.