.. _proposed_protocol_label: Proposed Protocol ***************** .. _contamination_module_label: Contamination Module ==================== .. note:: **Input files:** * PLINK binary pedfiles (BED, BIM and FAM) 1. Examine the output file named ``contamination.bafRegress``. It will contain the contamination estimates (along with confidence values). Usually, an estimate higher than 0.01 means possible contamination. 2. The automatic report will contain the list of samples with possible contamination (*i.e.* an estimate value higher than 0.01). .. _preprocessing_label: Preprocessing Steps =================== * Remove SNPs without chromosomal and physical position (chromosome and position of 0). * Remove INDELs (markers with alleles ``I`` or ``D``). * Determine if there are duplicated samples. These samples must have exactly the same family (``FID``) and individual (``IID``) identification to be treated as duplicated samples by the :ref:`dup_sample_module_label`. PLINK's option ``--update-ids`` could be used. * If input is a transposed pedfile, be sure to use PLINK's option ``--tab`` to produce the appropriate file format. * For the :ref:`plate_bias`, a text file explaining the plate distribution of each sample must be provided using the option ``--loop-assoc`` in the configuration file. The following columns are required (in order, without a header): * the family identification; * the individual identification; * and the plate identification. * Produce parameter files (see the :ref:`config_files` for details about parameter file). * To launch the analysis consult the section :ref:`how_ro_run`. .. _dup_sample_module_label: Duplicated Samples Module ========================= .. note:: **Input files:** * PLINK transposed pedfiles from the :ref:`preprocessing_label`. 1. Examine the log to confirm options used and to detect any problems occurring while running the script. 2. Examine ``dup_samples.diff`` to evaluate if some samples have many discordant genotypes (this could indicate a possible samples mix up). To identify discordant samples, use the following command line: .. code-block:: console $ cut -f4 dup_samples.diff | sort -k1,1 | uniq -c $ cut -f5 dup_samples.diff | sort -k1,1 | uniq -c If samples are present more than 10,000 times (for 2.5E-6 SNPs) this could indicate a sample mix up. 3. Examine ``dup_samples.not_good_enough`` to determine if samples have a concordance rate below the threshold set by the user. These samples **are present** in the ``dup_samples.final.tfam`` if they are the chosen ones. 4. Examine ``dup_samples.summary`` to evaluate completion rate and concordance between the replicates of potentially problematic samples. 5. Examine ``dup_samples.concordance`` file for the problematic samples; this could help to determine which sample is the discordant replicate. 6. If a sample appears problematic rename it and keep it in the analysis to determine if it is a duplicate of another sample (mix up) with the related sample module. If necessary, samples present in the ``dup_samples.not_good_enough`` file can be removed from the data set with the subset module (see the :ref:`first_subset_label`). If not, proceed to the :ref:`dup_marker_module_label`). .. _first_subset_label: First Subset Module (optional) ============================== .. note:: **Input files:** From the :ref:`dup_sample_module_label`: * ``dup_samples.final.tfam`` * ``dup_samples.final.tped`` 1. Extract the family (``FID``) and individual (``IID``) identification from ``dup_samples.not_good_enough`` with the following command line: .. code-block:: console $ cut -f3,4 dup_samples.not_good_enough | sed "1d" | sort -k1,1 \ > | uniq > samples_to_remove .. _dup_marker_module_label: Duplicated Markers Module ========================== .. note:: **Input files:** From the :ref:`dup_sample_module_label`: * ``dup_samples.final.tfam`` * ``dup_samples.final.tped`` or from the :ref:`first_subset_label`: * ``subset.bed`` * ``subset.bim`` * ``subset.fam`` 1. Examine the log to confirm options used and to detect any problems occurring while running the script 2. Examine ``dup_snps.duplicated_marker_names`` to detect SNPs with exactly the same name but mapping to different chromosomal location. (This file is not produce if no duplicated marker names are identified). 3. Determine the number of duplicated SNPs merged (same allele, same frequency, etc). SNPs merged were removed and are listed in the file ``dup_snps.removed_duplicates``. Number of lines in this file corresponds to number of SNPs merged. SNPs not merged and reasons why (*e.g.* ``homo_hetero``, ``diff_frequency``, ``homo_flip``, etc.) are present in file ``dup_snps.problems``. 4. SNPs with concordance rate below the threshold are present in ``dup_snps.not_good_enough``. To have the list of those SNPs: .. code-block:: console $ grep -w concordance dup_snps.not_good_enough | cut -f1 \ > > SNP_with_low_concordance_rate If necessary, use the subset option in the configuration file to remove the low concordance rate SNPs (see the :ref:`second_subset_label`). .. _second_subset_label: Second Subset Module (optional) =============================== .. note:: **Input files:** From the :ref:`dup_marker_module_label`: * ``dup_snps.final.tfam`` * ``dup_snps.final.tped`` * Extract SNPs with concordance rate below the threshold set by the user with the command line .. code-block:: console $ grep -w concordance dup_snps.not_good_enough | cut -f1 \ > > SNP_with_low_concordance_rate .. _clean_noCall_hetero: Clean No Call and Only Heterozygous Markers Module ================================================== .. note:: **Input files:** From the :ref:`dup_marker_module_label`: * ``dup_snps.final.tfam`` * ``dup_snps.final.tped`` or from the :ref:`second_subset_label`: * ``subset.bed`` * ``subset.bim`` * ``subset.fam`` 1. Examine the log to confirm options used and to detect any problems occurring while running the script. 2. SNPs removed because they are failed are listed in ``clean_noCall_hetero.allFailed``. 3. SNPs removed because they are all heterozygous are listed in ``clean_noCall_hetero.allHetero``. .. _sample_missingness_01: Sample Missingness Module (mind 0.1) ===================================== .. note:: **Input files:** From the :ref:`clean_noCall_hetero`: * ``clean_noCall_hetero.tfam`` * ``clean_noCall_hetero.tped`` 1. Examine the log to confirm options used and to detect any problems occurring while running the script. 2. Examine PLINK's log file to detect any problem at this step. 3. Individuals removed because they did not pass the completion rate threshold are listed in ``clean_mind.irem``. .. _marker_missingness: Marker Missingness Module ========================= .. note:: **Input files:** From the :ref:`sample_missingness_01`: * ``clean_mind.bed`` * ``clean_mind.bim`` * ``clean_mind.fam`` 1. Examine the log to confirm options used and to detect any problems occurring while running the script. 2. Examine PLINK's log file to detect any problem at this step. 3. SNPs removed because they did not pass the completion rate threshold are listed in ``clean_geno.removed_snps``. .. _sample_missingness_02: Sample Missingness Module (mind 0.02) ===================================== .. note:: **Input files:** From the :ref:`marker_missingness`: * ``clean_geno.bed`` * ``clean_geno.bim`` * ``clean_geno.fam`` 1. Examine the log to confirm options used and to detect any problems occurring while running the script. 2. Examine PLINK's log file to detect any problem at this step. 3. Individuals removed because they did not pass the completion rate threshold are listed in ``clean_mind.irem``. .. _sex_check: Sex Check Module ================ .. note:: **Input files:** From :ref:`sample_missingness_02`: * ``clean_mind.bed`` * ``clean_mind.bim`` * ``clean_mind.fam`` 1. Examine the log to confirm options used and to detect any problems occurring while running the script. 2. Examine PLINK's log file to detect any problem at this step. 3. Examine ``sexcheck.list_problem_sex``, it contains all individuals identified by PLINK as having gender problem. 4. Examine ``sexcheck.chr23_recodeA.raw.hetero`` to determine heterozygosity on the X chromosome of problematic samples. Consanguineous females may have low heterozygosity on the X chromosome. If many genotyped SNPs are rare, heterozygosity may also be low. 5. Examine ``sexcheck.chr24_recodeA.raw.noCall`` to determine the number of Y markers with missing calls. Females have low number of genotypes for Y chromosome markers (high values of missing calls), but is often not equal to 0 probably because some Y markers come from pseudo autosomal regions. Column ``nbGeno`` is the total number of genotypes check and ``nbNoCall`` is the number of genotypes with missing calls on chromosome Y. Males should have low values in this column while females have higher number of missing calls but not equal to the total number of genotypes tested. If probe intensities from X and Y chromosomes are available and the gender plot has been created: 1. Examine the log to confirm options used and to detect any problems occurring while running the script. 2. Examine ``sexcheck.png`` to detect any individuals in the XXY or X0 regions, females in the male cluster and males in the female cluster (see the :ref:`gender_plot_figure` figure). Confirm if possible the gender problems identified with the previous sex check problem step. If intensities file for each sample are available and the BAF and LRR plot has been created: 1. Examine the log to confirm options used and to detect any problems occurring while running the script. 2. Examine ``sexcheck_sample-id_lrr_baf.png`` for each sample. Usually, females have LRR values around 0 (between -0.5 and 0.5) while males have LRR values between -0.5 and -1. Females have three lines on BAF graphics: one at 1 (homozygous for the B allele), one at 0.5 (heterozygous AB) and one at 0 (homozygous for the A allele). Males have two lines: one at 1 (homozygous for the B allele) and one a at 0 (homozygous for the A allele). For more details, see the :ref:`sex_check_plots` section of the :ref:`sexcheck_module_lable`. Keep individuals identified with gender problem until the :ref:`related_samples` (mix up of samples could be resolved at this step). .. _plate_bias: Plate Bias Module ================= .. note:: **Input files:** From the :ref:`sample_missingness_02`: * ``clean_mind.bed`` * ``clean_mind.bim`` * ``clean_mind.fam`` or if subset option is used to remove SNPs from ``nof`` file (see below): * ``subset.bed`` * ``subset.bim`` * ``subset.fam`` 1. Verify if there is a ``nof`` file produce by PLINK when the input files for this step were produced (from the :ref:`sample_missingness_02`). The ``nof`` contains SNPs with no founder genotypes observed. If so, remove the SNPs present in the ``nof`` file using the subset tool before launching the plate bias analysis. Those SNPs, if they are not removed will produced an error message when PLINK performs the ``loop-assoc`` analysis and the following message will be present in PLINK's log file ``plate_bias.log``: "``ERROR: FEXACT error 3``". SNPs on chromosome 24 could also produce this error. 2. Examine the log to confirm options used and to detect any problems occurring while running the script. 3. Examine ``plate_bias.log`` to detect any problem at this step. 4. The ``plate_bias.significant_SNPs.txt`` file contains a list of SNPs with P value below the threshold. Care should be taken with those SNPs if significant results are obtained in association tests. These SNPs are NOT removed from the data set, only flagged. 5. Low MAF can explain part of plate bias. Examine the output file ``plate_bias.significant_SNPs.frq`` to determine if SNPs have low MAF. Other reasons explaining plate bias are relatedness or ethnicity of individuals assign to the same plates and none of them on other plates. .. _related_samples: Related Samples Module ====================== .. note:: **Input files:** From the :ref:`sample_missingness_02`: * ``clean_mind.bed`` * ``clean_mind.bim`` * ``clean_mind.fam`` 1. Examine the log to confirm options used and to detect any problems occurring while running the script. 2. File ``ibs.pruning_0.1.prune.in`` contains the list of uncorrelated SNPs used for the IBS analysis 3. Examine ``ibs.related_individuals_z1.png`` and ``ibs.related_individuals_z2.png`` to detect if there are samples in the parent-child, duplicated samples, first degree relative and second degree relative areas. (see :ref:`ibs_z1_figure` and :ref:`ibs_z2_figure` plots). 4. File ``ibs.related_individuals`` lists pairs of related individuals. Index column indicates group of related samples. Status column indicated the probable link between pair of individuals based on :math:`Z_0`, :math:`Z_1` and :math:`Z_2` values (see the :ref:`z_values_table` table [for which :math:`Z` values are approximation] or :py:func:`RelatedSamples.find_related_samples.extractRelatedIndividuals` function for thresholds). 5. If there are known duplicated samples, examine ``ibs.related_individuals`` to determine if they were identified correctly, if not this could indicate a possible samples mix up. 6. File ``ibs.choosen_related_individuals`` contains a list of related samples to keep. One related sample from the pair is randomly selected. If there are a group of related individuals, one sample in randomly selected from the group. All non selected samples are listed in ``ibs.discarded_related_individuals`` and should be removed from the analysis at a later stage. .. _z_values_table: .. table:: IBD allele sharing values +--------------+--------------+-------------+--------------+------------------------------------+ | Relationship | :math:`k_0` | :math:`k_1` | :math:`k_2` | Coancestry | | | | | | :math:`\theta = 1/2 k_2 + 1/4 k_1` | +==============+==============+=============+==============+====================================+ | Unrelated | 1 | 0 | 0 | 0 | +--------------+--------------+-------------+--------------+------------------------------------+ | Identical | 0 | 0 | 1 | :math:`1/2` | | twins | | | | | +--------------+--------------+-------------+--------------+------------------------------------+ | Parent-child | 0 | 1 | 0 | :math:`1/4` | +--------------+--------------+-------------+--------------+------------------------------------+ | Full | :math:`1/4` | :math:`1/2` | :math:`1/4` | :math:`1/4` | | siblings | | | | | +--------------+--------------+-------------+--------------+------------------------------------+ | Half | :math:`1/2` | :math:`1/2` | 0 | :math:`1/8` | | siblings | | | | | +--------------+--------------+-------------+--------------+------------------------------------+ | Uncle | :math:`1/2` | :math:`1/2` | 0 | :math:`1/8` | | nephew | | | | | +--------------+--------------+-------------+--------------+------------------------------------+ | Grandparent | :math:`1/2` | :math:`1/2` | 0 | :math:`1/8` | | grandchild | | | | | +--------------+--------------+-------------+--------------+------------------------------------+ | Double first | :math:`9/16` | :math:`3/8` | :math:`1/16` | :math:`1/8` | | cousins | | | | | +--------------+--------------+-------------+--------------+------------------------------------+ | First | :math:`3/4` | :math:`1/4` | 0 | :math:`1/16` | | cousins | | | | | +--------------+--------------+-------------+--------------+------------------------------------+ .. _ethnicity: Ethnicity Module ================ .. note:: **Input files:** From the :ref:`sample_missingness_02`: * ``clean_mind.bed`` * ``clean_mind.bim`` * ``clean_mind.fam`` 1. Examine the log to confirm options used and to detect any problems occurring while running the script. 2. File ``ethnic.ibs.pruning_0.1.prune.in`` contains the list of uncorrelated SNPs used for the MDS analysis. 3. File ``ethnic.mds.mds`` contains the list of principale components as calculated by PLINK. 4. Examine ``ethnicity.mds.png``, ``ethnicity.before.png``, ``ethnicity.after.png`` and ``ethnicity.outliers.png`` to detect samples outside the selected cluster (see :ref:`ethnicity_plots_label` generated from the :ref:`ethnicity_module_label` for more information). If there are too many outliers still present in the data set (*i.e.* radius is too large), analysis can be redone using the ``pyGenClean_find_outliers`` standalone script, using a different value for ``--multiplier``. For more information, refer to the :ref:`ethnicity_find_outliers` section of the :ref:`ethnicity_module_label`. 5. Samples outside the selected cluster are listed in ``ethnicity.outliers``. If necessary those samples could be removed at a later stage with the subset option. .. _third_subset_label: Third Subset Module =================== .. note:: **Input files:** From the :ref:`sample_missingness_02`: * ``clean_mind.bed`` * ``clean_mind.bim`` * ``clean_mind.fam`` Use the subset module to remove samples with gender problems (the :ref:`sex_check`), outliers from the ethnicity cluster (the :ref:`ethnicity`), related samples (the :ref:`related_samples`) and any other samples that need to be removed from the data set. * To produces a file containing all the samples to remove from the dataset: .. code-block:: console $ cat sexcheck.list_problem_sex_ids ibs.discarded_related_individuals \ > ethnicity.outliers > samples_to_remove.txt One sample may be removed for more than one reason, hence be present more than one time in the final ``samples_to_remove.txt`` file. This is not an issue for this step. .. _heterozygote_haploid: Heterozygote Haploid Module =========================== .. note:: **Input files:** From the :ref:`third_subset_label`: * ``subset.bed`` * ``subset.bim`` * ``subset.fam`` Samples with gender problems **must have been removed before** performing this module. 1. Examine the log to confirm options used and to detect any problems occurring while running the script. 2. Examine ``without_hh_genotypes.log`` to detect any problem at this step. Number of heterozygous haploid genotypes set to missing are indicated in ``without_hh_genotypes.log`` file. .. _maf: Minor Allele Frequency of Zero Module ===================================== .. note:: **Input files:** From the :ref:`heterozygote_haploid`: * ``without_hh_genotypes.bed`` * ``without_hh_genotypes.bim`` * ``without_hh_genotypes.fam`` 1. Examine the log to confirm options used and to detect any problems occurring while running the script. 2. Examine ``flag_maf_0.log`` to detect any problem at this step. 3. The file ``flag_maf_0.na_list`` contains a list of SNPs with minor allele frequency of 0. If necessary, use subset module to remove SNPs with minor allele frequency of 0, since they were only flagged using the :ref:`fourth_subset_label`. .. _fourth_subset_label: Fourth Subset Module (optional) =============================== .. note:: **Input files:** From the :ref:`heterozygote_haploid`: * ``without_hh_genotypes.bed`` * ``without_hh_genotypes.bim`` * ``without_hh_genotypes.fam`` 1. Examine the log to confirm options used and to detect any problems occurring while running the script. 2. Examine ``subset.log`` to detect any problem at this step. .. _hwe: Hardy Weinberg Equilibrium Module ================================= .. note:: **Input files:** From the :ref:`heterozygote_haploid`: * ``without_hh_genotypes.bed`` * ``without_hh_genotypes.bim`` * ``without_hh_genotypes.fam`` or from the :ref:`fourth_subset_label`: * ``subset.bed`` * ``subset.bim`` * ``subset.fam`` 1. Examine the log to confirm options used and to detect any problems occurring while running the script. 2. Examine ``flag_hw.threshold_1e-4.log`` and ``flag_hw.threshold_Bonferroni.log`` to detect any problem at this step. 3. The files ``flag_hw.snp_flag_threshold_Bonferroni`` and ``flag_hw.snp_flag_threshold_1e-4`` contain lists of SNPs with P value below Bonferroni and below :math:`1 \times 10^{-4}` threshold, respectively. The markers are only flagged using this module. If you want to remove those markers, have a look at the :ref:`fifth_subset_label`. .. _fifth_subset_label: Fifth Subset Module (optional) ============================== .. note:: **Input files:** From the :ref:`heterozygote_haploid`: * ``without_hh_genotypes.bed`` * ``without_hh_genotypes.bim`` * ``without_hh_genotypes.fam`` or from the :ref:`fourth_subset_label`: * ``subset.bed`` * ``subset.bim`` * ``subsert.fam`` 1. Examine the log to confirm options used and to detect any problems occurring while running the script. 2. Examine ``subset.log`` to detect any problem at this step.