Duplicated Samples Module
The usage of the standalone module is shown below:
$ pyGenClean_duplicated_samples --help
usage: pyGenClean_duplicated_samples [-h] [-v] --tfile FILE
[--sample-completion-threshold FLOAT]
[--sample-concordance-threshold FLOAT]
[--out FILE]
Extracts and merges duplicated samples.
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
Input File:
--tfile FILE The input file prefix (will find the tped and tfam
file by appending the prefix to .tped and .tfam,
respectively.) The duplicated samples should have the
same identification numbers (both family and
individual ids.)
Options:
--sample-completion-threshold FLOAT
The completion threshold to consider a replicate when
choosing the best replicates and for creating the
composite samples. [default: 0.9]
--sample-concordance-threshold FLOAT
The concordance threshold to consider a replicate when
choosing the best replicates and for creating the
composite samples. [default: 0.97]
Output File:
--out FILE The prefix of the output files. [default: dup_samples]
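As a rough illustration of how the two thresholds above could act as a filter on replicates, here is a minimal Python 3 sketch. The function name, the record layout and the inclusive comparisons are assumptions made for this example; only the default values (0.9 and 0.97) come from the help message.

```python
def passes_thresholds(completion, mean_concordance,
                      completion_threshold=0.9,
                      concordance_threshold=0.97):
    """Return True when a replicate meets both thresholds.

    Hypothetical helper: the inclusive comparison (>=) is an assumption,
    not necessarily what the module does.
    """
    return (completion >= completion_threshold
            and mean_concordance >= concordance_threshold)


# Made-up replicates of a single sample.
replicates = [
    {"index": 0, "completion": 0.99, "mean_concordance": 0.995},
    {"index": 1, "completion": 0.85, "mean_concordance": 0.990},
    {"index": 2, "completion": 0.95, "mean_concordance": 0.960},
]

kept = [
    r["index"] for r in replicates
    if passes_thresholds(r["completion"], r["mean_concordance"])
]
print(kept)  # [0]
```

In this made-up example, replicate 1 fails on completion and replicate 2 fails on concordance, so only replicate 0 would remain a candidate.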
Input Files
This module uses PLINK’s transposed pedfile format (tped and tfam files). For
this step to work, the duplicated samples must have the same identification
(family and sample ID). One should keep a file containing the original
identifications before modifying the dataset accordingly.
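The requirement that duplicates share the same family and sample ID can be illustrated with a short Python 3 sketch that groups tfam rows by those two fields. The function name and the example rows are hypothetical; the sketch only assumes the standard PLINK layout, in which the first two whitespace-separated fields of a tfam row are the family ID and the individual ID.

```python
from collections import defaultdict


def find_duplicated_samples(tfam_lines):
    """Group sample positions by (family ID, individual ID).

    Hypothetical helper, not the module's own code. Blank lines are
    ignored, so positions refer to the order of the samples in the file.
    """
    rows = [line.split() for line in tfam_lines if line.strip()]
    positions = defaultdict(list)
    for index, fields in enumerate(rows):
        positions[(fields[0], fields[1])].append(index)

    unique = {k: v[0] for k, v in positions.items() if len(v) == 1}
    duplicated = {k: v for k, v in positions.items() if len(v) > 1}
    return unique, duplicated


# Made-up tfam content: sample F1/S1 appears twice.
example = [
    "F1 S1 0 0 1 -9",
    "F1 S2 0 0 2 -9",
    "F1 S1 0 0 1 -9",
]
unique, duplicated = find_duplicated_samples(example)
print(duplicated)  # {('F1', 'S1'): [0, 2]}
```

Samples whose identification was not harmonized beforehand would end up in the unique group, which is why the identifications must match exactly.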
Procedure
Here are the steps performed by the module:
- Reads the tfam file to find duplicated samples.
- Separates the duplicated samples from the unique samples.
- Writes the unique samples into a file.
- Reads the tped file and writes the pedigree file for the unique samples. Saves in memory the pedigree for the duplicated samples. Updates the indexes of the duplicated samples.
- If there are no duplicated samples, simply creates the final file and stops here.
- Computes the completion (for each of the duplicated samples) and the concordance of each sample pair.
- Prints statistics (concordance and completion).
- Prints the concordance matrix for each duplicated sample.
- Prints the tped and the tfam files for the duplicated samples.
- Chooses the best of each set of duplicates (to keep and to complete) according to completion and concordance.
- Creates a unique tped and tfam from the duplicated samples by completing the best chosen one with the other samples.
- Creates the final dataset.
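The completion and concordance computed in the steps above can be sketched as follows in Python 3. The exact definitions used by the module are not given here, so this sketch assumes that completion is the fraction of markers with a called genotype, that concordance is the fraction of identical genotypes among markers called in both replicates, and that a missing genotype is written "0 0" as in PLINK files. The function names are hypothetical.

```python
def completion(genotypes, missing="0 0"):
    """Fraction of markers with a called (non-missing) genotype (assumed definition)."""
    called = sum(1 for g in genotypes if g != missing)
    return called / len(genotypes)


def concordance(geno_a, geno_b, missing="0 0"):
    """Fraction of identical genotypes among markers called in both samples.

    Assumed definition. Alleles are compared as unordered pairs, so
    "A G" matches "G A".
    """
    compared = 0
    same = 0
    for a, b in zip(geno_a, geno_b):
        if a == missing or b == missing:
            continue  # markers missing in either replicate are not compared
        compared += 1
        if sorted(a.split()) == sorted(b.split()):
            same += 1
    return same / compared if compared else 0.0


# Made-up genotypes for two replicates of the same sample (4 markers).
rep_1 = ["A A", "A G", "0 0", "C C"]
rep_2 = ["A A", "G A", "T T", "C T"]

print(completion(rep_1))          # 0.75
print(concordance(rep_1, rep_2))  # 2 of the 3 comparable markers agree
```

Skipping markers that are missing in either replicate keeps a poorly genotyped replicate from lowering the concordance; its quality is captured by the completion instead.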
Output Files
The output files of each of the steps described above are as follows (note that the output prefix shown is the default one, i.e. dup_samples):
- No output file is created.
- No output file is created.
- Only one of the two PLINK’s transposed pedfiles is created:
  - dup_samples.unique_samples.tfam: the tfam file containing only the unique samples from the original dataset.
- The second of the two PLINK’s transposed pedfiles is created (see previous step):
  - dup_samples.unique_samples.tped: the tped file containing only the unique samples from the original dataset.
- If there are no duplicated samples, the final PLINK’s transposed pedfiles are created (otherwise, continue to the next step):
  - dup_samples.final: the final tfam and tped files.
- One result file is created:
  - dup_samples.diff: a file containing the differences in the genotypes for each pair of duplicated samples. Each line contains the following information: name, the name of the marker; famID, the family ID; indID, the individual ID; dupIndex_1, the index of the first duplicated sample in the original dataset (since the duplicated samples all share the same identification); dupIndex_2, the index of the second duplicated sample in the original dataset; and genotype_1 and genotype_2, the genotypes of the first and second duplicated samples for the current marker, respectively.
- One result file is created:
  - dup_samples.summary: the completion and summarized concordance of each duplicated sample pair. The first two columns (origIndex and dupIndex) are the indexes of the duplicated sample in the original and duplicated transposed pedfiles, respectively.
- One result file is created:
  - dup_samples.concordance: the pairwise concordance of each duplicated sample.
- One set of PLINK’s transposed pedfiles is created:
  - dup_samples.duplicated_samples: the dataset containing the duplicated samples from the original dataset.
- Two output files are created:
  - dup_samples.chosen_samples.info: a list of samples that were chosen for completion according to their completion and summarized concordance with their duplicates. Again, their indexes in the original and duplicated transposed pedfiles are saved (the first two columns).
  - dup_samples.excluded_samples.info: a list of samples that were not chosen for completion according to their completion and summarized concordance with their duplicates.
- Multiple output files are created, along with one set of PLINK’s transposed pedfiles:
  - dup_samples.zeroed_out: the list of genotypes that were zeroed out during completion of the chosen duplicated samples.
  - dup_samples.not_good_enough: the list of samples that were not good enough (according to completion and concordance) to create the composite sample (the chosen duplicated samples).
- Two sets of PLINK’s transposed pedfiles are created:
  - dup_samples.chosen_samples: a set of transposed pedfiles containing the completed chosen samples.
  - dup_samples.final: the final dataset.
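As an example of how one of these result files might be post-processed, the following Python 3 sketch counts the number of differing genotypes per pair of duplicated samples from a diff file. The column names are the ones listed above; the tab delimiter, the presence of a header line and the example rows are assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Made-up content laid out like dup_samples.diff (tab-separated, with a
# header line; both are assumptions).
sample_diff = (
    "name\tfamID\tindID\tdupIndex_1\tdupIndex_2\tgenotype_1\tgenotype_2\n"
    "rs1\tF1\tS1\t1\t3\tA A\tA G\n"
    "rs2\tF1\tS1\t1\t3\tC C\tC T\n"
    "rs1\tF2\tS7\t5\t6\tA G\tG G\n"
)


def count_differences(handle):
    """Count differing genotypes for each pair of duplicated samples."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["famID"], row["indID"],
               row["dupIndex_1"], row["dupIndex_2"])
        counts[key] += 1
    return counts


counts = count_differences(io.StringIO(sample_diff))
for pair, n_diff in sorted(counts.items()):
    print(pair, n_diff)
```

To read an actual file, the same function could be given an open file handle instead of the in-memory example.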
The Algorithm
For more information about the actual algorithms and source code, refer to the following page.