Proposed Protocol¶
Contamination Module¶
Note
Input files:
- PLINK binary pedfiles (BED, BIM and FAM)
- Examine the output file named contamination.bafRegress. It contains the contamination estimates (along with confidence values). An estimate higher than 0.01 usually indicates possible contamination.
- The automatic report lists the samples with possible contamination (i.e. an estimate higher than 0.01).
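The 0.01 cut-off can also be applied by hand. The following is a minimal sketch, assuming that contamination.bafRegress is tab-delimited with a header row containing columns named sample and estimate (check your own file and adjust the names). The function name list_contaminated is hypothetical and not part of the pipeline.

```shell
# List samples whose contamination estimate exceeds a threshold.
# ASSUMPTION: tab-delimited input with a header row that contains
# columns named "sample" and "estimate".
# Usage: list_contaminated contamination.bafRegress [threshold]
list_contaminated() {
    file=$1
    threshold=${2:-0.01}
    awk -F '\t' -v thr="$threshold" '
        NR == 1 {
            # Locate the two columns by name instead of by position.
            for (i = 1; i <= NF; i++) {
                if ($i == "sample") s = i
                if ($i == "estimate") e = i
            }
            if (!s || !e) {
                print "sample/estimate column not found" > "/dev/stderr"
                exit 1
            }
            next
        }
        # Non-numeric estimates (e.g. NA) evaluate to 0 and are not listed.
        ($e + 0) > (thr + 0) { print $s "\t" $e }
    ' "$file"
}
```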
Preprocessing Steps¶
Remove SNPs without chromosomal and physical position (chromosome and position of 0).
Remove INDELs (markers with alleles I or D).
Determine if there are duplicated samples. These samples must have exactly the same family (FID) and individual (IID) identification to be treated as duplicated samples by the Duplicated Samples Module. PLINK’s option --update-ids could be used.
If input is a transposed pedfile, be sure to use PLINK’s option --tab to produce the appropriate file format.
For the Plate Bias Module, a text file describing the plate distribution of each sample must be provided using the option --loop-assoc in the configuration file. The following columns are required (in order, without a header):
- the family identification;
- the individual identification;
- and the plate identification.
Produce the parameter files (see the Configuration Files section for details about the parameter file).
To launch the analysis consult the section How to Run the Pipeline.
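The first preprocessing checks can be sketched with standard tools on the PLINK text files. The example below is only an illustration: the function names are hypothetical, and the first function flags markers with chromosome 0 or position 0 (replace || with && between the first two conditions to require both). It relies on the standard BIM column order (chromosome, marker name, genetic distance, position, allele 1, allele 2) and on FID and IID being the first two columns of a FAM file.

```shell
# Print the names of markers to exclude from a PLINK BIM file:
# unplaced markers (chromosome 0 or position 0) and INDELs (allele I or D).
# Usage: markers_to_exclude data.bim > markers_to_exclude.txt
markers_to_exclude() {
    awk '$1 == "0" || $4 == "0" || $5 ~ /^[ID]$/ || $6 ~ /^[ID]$/ { print $2 }' "$1"
}

# Print the FID/IID pairs that occur more than once in a PLINK FAM file,
# i.e. the samples that would be treated as duplicated samples.
# Usage: duplicated_sample_ids data.fam
duplicated_sample_ids() {
    awk '{ print $1 "\t" $2 }' "$1" | sort | uniq -d
}
```

The marker list could then be given to PLINK, for example with plink --bfile data --exclude markers_to_exclude.txt --make-bed --out data_clean, where the file names are placeholders.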
Duplicated Samples Module¶
Examine the log to confirm options used and to detect any problems occurring while running the script.
Examine dup_samples.diff to evaluate if some samples have many discordant genotypes (this could indicate a possible sample mix-up). To identify discordant samples, use the following command lines:
$ cut -f4 dup_samples.diff | sort -k1,1 | uniq -c
$ cut -f5 dup_samples.diff | sort -k1,1 | uniq -c
If a sample is present more than 10,000 times (for 2.5E6 SNPs, i.e. 2.5 million), this could indicate a sample mix-up.
Examine dup_samples.not_good_enough to determine if samples have a concordance rate below the threshold set by the user. These samples are present in dup_samples.final.tfam if they are the chosen ones.
Examine dup_samples.summary to evaluate completion rate and concordance between the replicates of potentially problematic samples.
Examine the dup_samples.concordance file for the problematic samples; this could help to determine which sample is the discordant replicate.
If a sample appears problematic, rename it and keep it in the analysis to determine, with the Related Samples Module, if it is a duplicate of another sample (mix-up).
If necessary, samples present in the dup_samples.not_good_enough file can be removed from the data set with the subset module (see the First Subset Module (optional)). If not, proceed to the Duplicated Markers Module.
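The counting step on dup_samples.diff can be wrapped so that only values above a cut-off are printed. This is a sketch under assumptions: the function name frequent_values is hypothetical, the default cut-off of 10,000 is the value quoted above for about 2.5 million SNPs, and the values in the chosen column are assumed to contain no spaces.

```shell
# Count how often each value occurs in one tab-delimited column of a file
# and print the values seen more often than a cut-off (value, then count).
# ASSUMPTION: values in the column contain no spaces.
# Usage: frequent_values dup_samples.diff 4 [cutoff]
frequent_values() {
    cut -f"$2" "$1" | sort | uniq -c |
        awk -v cutoff="${3:-10000}" '$1 > cutoff + 0 { print $2 "\t" $1 }'
}
```

Running it once with column 4 and once with column 5 mirrors the two cut commands of this module.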
First Subset Module (optional)¶
Extract the family (FID) and individual (IID) identification from dup_samples.not_good_enough with the following command line:
$ cut -f3,4 dup_samples.not_good_enough | sed "1d" | sort -k1,1 | uniq > samples_to_remove
Duplicated Markers Module¶
Note
Input files:
From the Duplicated Samples Module:
dup_samples.final.tfam
dup_samples.final.tped
or from the First Subset Module (optional):
subset.bed
subset.bim
subset.fam
Examine the log to confirm options used and to detect any problems occurring while running the script.
Examine dup_snps.duplicated_marker_names to detect SNPs with exactly the same name but mapping to different chromosomal locations. (This file is not produced if no duplicated marker names are identified.)
Determine the number of duplicated SNPs merged (same allele, same frequency, etc.). Merged SNPs were removed and are listed in the file dup_snps.removed_duplicates. The number of lines in this file corresponds to the number of SNPs merged. SNPs that were not merged, and the reasons why (e.g. homo_hetero, diff_frequency, homo_flip, etc.), are listed in the file dup_snps.problems.
SNPs with a concordance rate below the threshold are listed in dup_snps.not_good_enough. To obtain the list of those SNPs:
$ grep -w concordance dup_snps.not_good_enough | cut -f1 > SNP_with_low_concordance_rate
If necessary, use the subset option in the configuration file to remove the low concordance rate SNPs (see the Second Subset Module (optional)).
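The two counts described in this module (merged SNPs and low concordance SNPs) can be collected in one step. The function below is a small sketch with a hypothetical name. It only relies on what is stated above: one line per merged SNP in the removed_duplicates file, and the word concordance marking the relevant lines of the not_good_enough file.

```shell
# Summarise the Duplicated Markers Module output for a given file prefix.
# Usage: summarise_dup_snps dup_snps
summarise_dup_snps() {
    prefix=${1:-dup_snps}
    merged=0
    low=0
    if [ -f "$prefix.removed_duplicates" ]; then
        # One line per merged (removed) SNP.
        merged=$(wc -l < "$prefix.removed_duplicates" | tr -d ' ')
    fi
    if [ -f "$prefix.not_good_enough" ]; then
        # grep -c exits with 1 when nothing matches but still prints 0.
        low=$(grep -c -w concordance "$prefix.not_good_enough" || true)
    fi
    printf 'merged\t%d\nlow_concordance\t%d\n' "$merged" "$low"
}
```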
Second Subset Module (optional)¶
Extract SNPs with concordance rate below the threshold set by the user with the following command line:
$ grep -w concordance dup_snps.not_good_enough | cut -f1 > SNP_with_low_concordance_rate
Clean No Call and Only Heterozygous Markers Module¶
Note
Input files:
From the Duplicated Markers Module:
dup_snps.final.tfam
dup_snps.final.tped
or from the Second Subset Module (optional):
subset.bed
subset.bim
subset.fam
- Examine the log to confirm options used and to detect any problems occurring while running the script.
- SNPs removed because they failed in all samples (no call) are listed in clean_noCall_hetero.allFailed.
- SNPs removed because they are all heterozygous are listed in clean_noCall_hetero.allHetero.
Sample Missingness Module (mind 0.1)¶
Note
Input files:
From the Clean No Call and Only Heterozygous Markers Module:
clean_noCall_hetero.tfam
clean_noCall_hetero.tped
- Examine the log to confirm options used and to detect any problems occurring while running the script.
- Examine PLINK’s log file to detect any problem at this step.
- Individuals removed because they did not pass the completion rate threshold are listed in clean_mind.irem.
Marker Missingness Module¶
Note
Input files:
From the Sample Missingness Module (mind 0.1):
clean_mind.bed
clean_mind.bim
clean_mind.fam
- Examine the log to confirm options used and to detect any problems occurring while running the script.
- Examine PLINK’s log file to detect any problem at this step.
- SNPs removed because they did not pass the completion rate threshold are listed in clean_geno.removed_snps.
Sample Missingness Module (mind 0.02)¶
- Examine the log to confirm options used and to detect any problems occurring while running the script.
- Examine PLINK’s log file to detect any problem at this step.
- Individuals removed because they did not pass the completion rate threshold are listed in clean_mind.irem.
Sex Check Module¶
Note
Input files:
From Sample Missingness Module (mind 0.02):
clean_mind.bed
clean_mind.bim
clean_mind.fam
- Examine the log to confirm options used and to detect any problems occurring while running the script.
- Examine PLINK’s log file to detect any problem at this step.
- Examine sexcheck.list_problem_sex; it contains all individuals identified by PLINK as having a gender problem.
- Examine sexcheck.chr23_recodeA.raw.hetero to determine heterozygosity on the X chromosome of problematic samples. Consanguineous females may have low heterozygosity on the X chromosome. If many genotyped SNPs are rare, heterozygosity may also be low.
- Examine sexcheck.chr24_recodeA.raw.noCall to determine the number of Y markers with missing calls. Females have a low number of genotypes for Y chromosome markers (high values of missing calls), but this number is often not equal to 0, probably because some Y markers come from pseudo-autosomal regions. Column nbGeno is the total number of genotypes checked and nbNoCall is the number of genotypes with missing calls on chromosome Y. Males should have low values in the nbNoCall column, while females have a higher number of missing calls that is nevertheless not equal to the total number of genotypes tested.
If probe intensities from X and Y chromosomes are available and the gender plot has been created:
- Examine the log to confirm options used and to detect any problems occurring while running the script.
- Examine sexcheck.png to detect any individuals in the XXY or X0 regions, females in the male cluster and males in the female cluster (see the Gender plot figure). If possible, confirm the gender problems identified in the previous sex check step.
If an intensity file is available for each sample and the BAF and LRR plots have been created:
- Examine the log to confirm options used and to detect any problems occurring while running the script.
- Examine sexcheck_sample-id_lrr_baf.png for each sample. Usually, females have LRR values around 0 (between -0.5 and 0.5) while males have LRR values between -0.5 and -1. Females have three lines on the BAF plot: one at 1 (homozygous for the B allele), one at 0.5 (heterozygous AB) and one at 0 (homozygous for the A allele). Males have two lines: one at 1 (homozygous for the B allele) and one at 0 (homozygous for the A allele). For more details, see The Plots section of the Sex Check Module.
Keep individuals identified with a gender problem until the Related Samples Module (a sample mix-up could be resolved at that step).
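To compare samples on the Y chromosome more easily, the nbNoCall and nbGeno columns can be turned into a missing call fraction. The following is a sketch, assuming that sexcheck.chr24_recodeA.raw.noCall is whitespace- or tab-delimited with a header row that names the nbGeno and nbNoCall columns. The function name and the added fracNoCall column are hypothetical.

```shell
# Append the fraction of missing Y chromosome calls (nbNoCall / nbGeno)
# to every line of the noCall file.
# ASSUMPTION: delimited text with a header row naming nbGeno and nbNoCall,
# and no spaces inside fields.
# Usage: y_missing_fraction sexcheck.chr24_recodeA.raw.noCall
y_missing_fraction() {
    awk '
        NR == 1 {
            # Locate the two columns by name instead of by position.
            for (i = 1; i <= NF; i++) {
                if ($i == "nbGeno") g = i
                if ($i == "nbNoCall") n = i
            }
            if (!g || !n) {
                print "nbGeno/nbNoCall column not found" > "/dev/stderr"
                exit 1
            }
            print $0 "\tfracNoCall"
            next
        }
        { printf "%s\t%.3f\n", $0, ($g > 0 ? $n / $g : 0) }
    ' "$1"
}
```

Following the description above, males are expected to show a fraction close to 0, and females a high fraction that usually stays below 1.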
Plate Bias Module¶
Note
Input files:
From the Sample Missingness Module (mind 0.02):
clean_mind.bed
clean_mind.bim
clean_mind.fam
or, if the subset option is used to remove the SNPs listed in the nof file (see below):
subset.bed
subset.bim
subset.fam
- Verify whether a nof file was produced by PLINK when the input files for this step were created (by the Sample Missingness Module (mind 0.02)). The nof file contains SNPs with no founder genotypes observed. If there is one, remove the SNPs present in the nof file using the subset tool before launching the plate bias analysis. If they are not removed, those SNPs will produce an error when PLINK performs the loop-assoc analysis, and the following message will be present in PLINK’s log file plate_bias.log: “ERROR: FEXACT error 3”. SNPs on chromosome 24 could also produce this error.
- Examine the log to confirm options used and to detect any problems occurring while running the script.
- Examine plate_bias.log to detect any problem at this step.
- The plate_bias.significant_SNPs.txt file contains a list of SNPs with a P value below the threshold. Care should be taken with those SNPs if significant results are obtained in association tests. These SNPs are NOT removed from the data set, only flagged.
- Low MAF can explain part of the plate bias. Examine the output file plate_bias.significant_SNPs.frq to determine if SNPs have low MAF. Other possible explanations for plate bias are relatedness or ethnicity, when related individuals or individuals of the same ethnicity are all assigned to the same plates and none to other plates.
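The low MAF check can be sketched as a filter on the frequency file. The example assumes that plate_bias.significant_SNPs.frq follows PLINK's usual frequency report layout (header line, then CHR, SNP, A1, A2, MAF and NCHROBS columns). The function name and the default cut-off of 0.05 are hypothetical and should be adapted to the study.

```shell
# Print the SNPs (name and MAF) whose minor allele frequency is below a
# cut-off in a PLINK frequency report.
# ASSUMPTION: columns are CHR SNP A1 A2 MAF NCHROBS, with one header line.
# Usage: low_maf_snps plate_bias.significant_SNPs.frq [cutoff]
low_maf_snps() {
    awk -v cutoff="${2:-0.05}" '
        # Skip the header and markers without a frequency (NA).
        NR > 1 && $5 != "NA" && $5 + 0 < cutoff + 0 { print $2 "\t" $5 }
    ' "$1"
}
```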
Ethnicity Module¶
Note
Input files:
From the Sample Missingness Module (mind 0.02):
clean_mind.bed
clean_mind.bim
clean_mind.fam
Examine the log to confirm options used and to detect any problems occurring while running the script.
File ethnic.ibs.pruning_0.1.prune.in contains the list of uncorrelated SNPs used for the MDS analysis.
File ethnic.mds.mds contains the list of principal components as calculated by PLINK.
Examine ethnicity.mds.png, ethnicity.before.png, ethnicity.after.png and ethnicity.outliers.png to detect samples outside the selected cluster (see The Plots generated from the Ethnicity Module for more information).
If there are too many outliers still present in the data set (i.e. radius is too large), the analysis can be redone using the pyGenClean_find_outliers standalone script, with a different value for --multiplier. For more information, refer to the Finding Outliers section of the Ethnicity Module.
Samples outside the selected cluster are listed in ethnicity.outliers. If necessary, those samples can be removed at a later stage with the subset option.
Third Subset Module¶
Note
Input files:
From the Sample Missingness Module (mind 0.02):
clean_mind.bed
clean_mind.bim
clean_mind.fam
Use the subset module to remove samples with gender problems (the Sex Check Module), outliers from the ethnicity cluster (the Ethnicity Module), related samples (the Related Samples Module) and any other samples that need to be removed from the data set.
To produce a file containing all the samples to remove from the data set:
$ cat sexcheck.list_problem_sex_ids ibs.discarded_related_individuals ethnicity.outliers > samples_to_remove.txt
One sample may be removed for more than one reason, and hence be present more than once in the final samples_to_remove.txt file. This is not an issue for this step.
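Although repeated entries are harmless here, it can be useful to know how many distinct samples will actually be removed. The sketch below assumes that each input file lists the family and individual identification in its first two columns. The function name unique_samples is hypothetical.

```shell
# Print each FID/IID pair found in the given files exactly once.
# ASSUMPTION: FID and IID are the first two whitespace-separated columns.
# Usage: unique_samples sexcheck.list_problem_sex_ids ethnicity.outliers | wc -l
unique_samples() {
    cat "$@" | awk 'NF >= 2 { print $1 "\t" $2 }' | sort -u
}
```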
Heterozygote Haploid Module¶
Samples with gender problems must have been removed before performing this module.
- Examine the log to confirm options used and to detect any problems occurring while running the script.
- Examine without_hh_genotypes.log to detect any problem at this step.
The number of heterozygous haploid genotypes set to missing is indicated in the without_hh_genotypes.log file.
Minor Allele Frequency of Zero Module¶
Note
Input files:
From the Heterozygote Haploid Module:
without_hh_genotypes.bed
without_hh_genotypes.bim
without_hh_genotypes.fam
- Examine the log to confirm options used and to detect any problems occurring while running the script.
- Examine flag_maf_0.log to detect any problem at this step.
- The file flag_maf_0.na_list contains a list of SNPs with minor allele frequency of 0.
If necessary, use the subset module to remove SNPs with a minor allele frequency of 0, since they were only flagged (see the Fourth Subset Module (optional)).
Fourth Subset Module (optional)¶
Note
Input files:
From the Heterozygote Haploid Module:
without_hh_genotypes.bed
without_hh_genotypes.bim
without_hh_genotypes.fam
- Examine the log to confirm options used and to detect any problems occurring while running the script.
- Examine subset.log to detect any problem at this step.
Hardy Weinberg Equilibrium Module¶
Note
Input files:
From the Heterozygote Haploid Module:
without_hh_genotypes.bed
without_hh_genotypes.bim
without_hh_genotypes.fam
or from the Fourth Subset Module (optional):
subset.bed
subset.bim
subset.fam
- Examine the log to confirm options used and to detect any problems occurring while running the script.
- Examine flag_hw.threshold_1e-4.log and flag_hw.threshold_Bonferroni.log to detect any problem at this step.
- The files flag_hw.snp_flag_threshold_Bonferroni and flag_hw.snp_flag_threshold_1e-4 contain lists of SNPs with a P value below the Bonferroni threshold and below the \(1 \times 10^{-4}\) threshold, respectively.
The markers are only flagged using this module. If you want to remove those markers, have a look at the Fifth Subset Module (optional).
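For orientation, a Bonferroni threshold is usually obtained by dividing a significance level by the number of tests. The sketch below computes it from the number of markers in a BIM file. The significance level of 0.05 and the function name are assumptions and may differ from what the module uses internally.

```shell
# Print a Bonferroni-corrected threshold: alpha divided by the number of
# markers (lines) in a PLINK BIM file.
# ASSUMPTION: alpha defaults to 0.05; one test per marker.
# Usage: bonferroni_threshold without_hh_genotypes.bim [alpha]
bonferroni_threshold() {
    awk -v alpha="${2:-0.05}" 'END { if (NR > 0) printf "%.3e\n", alpha / NR }' "$1"
}
```

With 500,000 markers and a level of 0.05, for instance, the threshold is \(0.05 / 500000 = 1 \times 10^{-7}\), which is stricter than the fixed \(1 \times 10^{-4}\) threshold.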
Fifth Subset Module (optional)¶
Note
Input files:
From the Heterozygote Haploid Module:
without_hh_genotypes.bed
without_hh_genotypes.bim
without_hh_genotypes.fam
or from the Fourth Subset Module (optional):
subset.bed
subset.bim
subset.fam
- Examine the log to confirm options used and to detect any problems occurring while running the script.
- Examine subset.log to detect any problem at this step.