pyGenClean.DupSamples package
For more information about how to use this module, refer to the Duplicated Samples Module.
Module contents
Submodules
pyGenClean.DupSamples.duplicated_samples module

exception pyGenClean.DupSamples.duplicated_samples.ProgramError(msg)
Bases: exceptions.Exception
An Exception raised in case of a problem.
Parameters: msg (str) – the message to print to the user before exiting.

pyGenClean.DupSamples.duplicated_samples.addToTPEDandTFAM(tped, tfam, prefix, toAddPrefix)
Append a tfile to another, creating a new one.
Here are the steps of this function:
1. Writes the tped into prefix.chosen_samples.tped.
2. Writes the tfam into prefix.chosen_samples.tfam.
3. Copies the previous tfam (toAddPrefix.tfam) into the final tfam (prefix.final.tfam).
4. Appends the tfam to the final tfam (prefix.final.tfam).
5. Reads the previous tped (toAddPrefix.tped) and appends the new tped to it, writing the final one (prefix.final.tped).
Warning
The tped and tfam variables need to contain at least one sample.
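The merge in steps 3 to 5 can be sketched in memory as follows. This is only an illustration, not the module's code: the name append_tfiles is hypothetical, the tfiles are assumed to be lists of already-split rows, and each tped row is assumed to start with four marker columns followed by one genotype per sample.

```python
def append_tfiles(prev_tfam, prev_tped, new_tfam, new_tped):
    """Append the new samples (rows of new_tfam, columns of new_tped)
    to a previous tfile. Hypothetical helper, for illustration only."""
    if not new_tfam:
        # Mirrors the warning: at least one sample is needed.
        raise ValueError("the new tfam must contain at least one sample")
    if len(prev_tped) != len(new_tped):
        raise ValueError("both tped must describe the same markers")

    # A tfam has one row per sample, so samples are appended as rows.
    final_tfam = [list(row) for row in prev_tfam] + [list(row) for row in new_tfam]

    # A tped has one row per marker, so samples are appended as columns.
    final_tped = []
    for old_row, new_row in zip(prev_tped, new_tped):
        if list(old_row[:4]) != list(new_row[:4]):
            raise ValueError("marker information differs between tped")
        final_tped.append(list(old_row) + list(new_row[4:]))
    return final_tfam, final_tped
```

The row/column asymmetry is the reason the tfam can simply be copied and extended, while the tped has to be read and rewritten line by line.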

pyGenClean.DupSamples.duplicated_samples.checkArgs(args)
Checks the arguments and options.
Parameters: args (argparse.Namespace) – an argparse.Namespace object containing the options of the program.
Returns: True if everything was OK.
If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr and the program exits with code 1.
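A minimal sketch of this validation pattern is shown below. The checks themselves are assumptions (this page does not list what checkArgs verifies): here the two thresholds are simply required to lie between 0 and 1. The names check_thresholds and the local ProgramError class are stand-ins written for this example.

```python
import argparse


class ProgramError(Exception):
    """Stand-in for the module's ProgramError (illustration only)."""

    def __init__(self, msg):
        super().__init__(msg)
        self.message = str(msg)


def check_thresholds(args):
    """Return True if both thresholds are in [0, 1] (assumed rule),
    otherwise raise ProgramError."""
    for name in ("sample_completion_threshold",
                 "sample_concordance_threshold"):
        value = getattr(args, name)
        if not 0 <= value <= 1:
            raise ProgramError(
                "{}: must be between 0 and 1 (got {})".format(name, value)
            )
    return True
```

A caller would typically catch ProgramError at the top level, print the message to sys.stderr and exit with code 1, as described above.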

pyGenClean.DupSamples.duplicated_samples.chooseBestDuplicates(tped, samples, oldSamples, completion, concordance_all, prefix)
Choose the best duplicates according to the completion and the concordance.
Parameters:
- tped (numpy.array) – the tped containing the duplicated samples.
- samples (dict) – the updated position of the samples in the tped containing only duplicated samples.
- oldSamples (dict) – the original duplicated sample positions.
- completion (numpy.array) – the completion of each of the duplicated samples.
- concordance_all (dict) – the concordance of every duplicated sample.
- prefix (str) – the prefix of all the files.
Returns: a tuple where the first element is a list of the chosen samples’ indexes, the second one is the completion and the last one is the concordance (a map).
These are the steps to find the best duplicated sample:
1. Sort the list of concordances.
2. Sort the list of completions.
3. Choose the best of the concordance and put it in a set.
4. Choose the best of the completion and put it in a set.
5. Compute the intersection of the two sets. If there is one sample or more, then randomly choose one sample.
6. If the intersection doesn’t contain at least one sample, redo steps 3 and 4, but increase the number of chosen best by one. Redo steps 5 and 6 (if required).
The chosen samples are written in prefix.chosen_samples.info. The rest are written in prefix.excluded_samples.info.
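The selection loop described in steps 1 to 6 can be sketched as follows for a single set of replicates. This is an illustration under assumptions, not the module's implementation: choose_best is a hypothetical name, each replicate is assumed to have one completion score and one summary concordance score, and "the best k" is taken as the first k entries of each sorted list (ties are not treated specially).

```python
import random


def choose_best(completion, concordance, rng=None):
    """Return the index of the replicate to keep.

    completion and concordance hold one score per replicate
    (higher is better). rng is an optional random.Random instance.
    """
    rng = rng or random.Random()
    n = len(completion)
    if n == 0 or n != len(concordance):
        raise ValueError("need one score of each kind per replicate")

    # Steps 1 and 2: rank the replicates by each criterion.
    by_concordance = sorted(range(n), key=lambda i: concordance[i], reverse=True)
    by_completion = sorted(range(n), key=lambda i: completion[i], reverse=True)

    # Steps 3 to 6: widen the "best k" sets until they share a sample.
    for k in range(1, n + 1):
        common = set(by_concordance[:k]) & set(by_completion[:k])
        if common:
            # Step 5: pick randomly among the shared candidates.
            return rng.choice(sorted(common))
    raise AssertionError("unreachable: at k == n both sets hold every index")
```

With completion [0.99, 0.95, 0.97] and concordance [0.90, 0.99, 0.98], the best single sample differs between the two criteria, so the sets are widened to two samples each and replicate 2 is the only one they share.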

pyGenClean.DupSamples.duplicated_samples.computeStatistics(tped, tfam, samples, oldSamples, prefix)
Computes the completion and concordance of each sample.
Parameters:
- tped (numpy.array) – the tped containing duplicated samples.
- tfam (numpy.array) – the tfam containing duplicated samples.
- samples (dict) – the updated position of the samples in the tped containing only duplicated samples.
- oldSamples (dict) – the original duplicated sample positions.
- prefix (str) – the prefix of all the files.
Returns: a tuple containing the completion (numpy.array) as first element, and the concordance (dict) as last element.
Reads the tped file and computes the completion for each duplicated sample and the pairwise concordance between duplicated samples.
Note
The completion and concordance computation excludes a marker if it is on chromosome 24 and the sample is a female.
Note
A missing genotype is encoded by 0.
Note
No percentage is computed here, only the numbers. Percentages are computed in other functions: printStatistics(), for completion, and printConcordance(), for concordance.
Completion
Computes the completion of non-zero values (where all genotypes of at least one duplicated sample are no call [i.e. 0]). The completion of sample \(i\) (i.e. \(Comp_i\)) is the number of genotypes that have a call divided by the total number of genotypes (the set \(G_i\)):
\[Comp_i = \frac{||g \in G_i\textrm{ where }g \neq 0||}{||G_i||}\]
Note
We consider a genotype as being missing if the sample is a male and if a marker on chromosome 23 or 24 is heterozygous.
Concordance
Computes the pairwise concordance between duplicated samples. For each marker, if both genotypes are not missing, we add one to the total number of compared markers. If both genotypes are the same, we add one to the number of concordant calls. We write the observed genotype difference in the file prefix.diff. The concordance between sample \(i\) and \(j\) (i.e. \(Concordance_{i,j}\)) is the number of genotypes that are equal divided by the total number of genotypes (excluding the no calls):
\[Concordance_{i,j} = \frac{ ||g \in G_i \cup G_j \textrm{ where } g_i = g_j \neq 0|| }{ ||g \in G_i \cup G_j \textrm{ where } g \neq 0|| }\]
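The two counts can be sketched as below. This is a simplified illustration, not the module's code: the function names are hypothetical, genotypes are assumed to be strings of two space-separated alleles with "0 0" as the no call, and the sex-chromosome rules from the notes above are left out. As in the description, only the counts are returned, not percentages.

```python
MISSING = "0 0"  # assumed encoding of a no call for this sketch


def completion_counts(genotypes):
    """Return (number of called genotypes, total number of genotypes)."""
    called = sum(1 for g in genotypes if g != MISSING)
    return called, len(genotypes)


def concordance_counts(genotypes_i, genotypes_j):
    """Return (concordant calls, compared markers) for two replicates."""
    same = compared = 0
    for g_i, g_j in zip(genotypes_i, genotypes_j):
        if g_i == MISSING or g_j == MISSING:
            continue  # a marker counts only if both genotypes are called
        compared += 1
        # Allele order is ignored, so "A G" and "G A" are concordant.
        if sorted(g_i.split()) == sorted(g_j.split()):
            same += 1
    return same, compared
```

For example, comparing ["A A", "A G", "0 0", "C C"] with ["A A", "G A", "T T", "C T"] skips the third marker, counts the first two as concordant and the last as discordant.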

pyGenClean.DupSamples.duplicated_samples.createAndCleanTPED(tped, tfam, samples, oldSamples, chosenSamples, prefix, completion, completionT, concordance, concordanceT)
Complete a TPED for duplicate samples.
Parameters:
- tped (numpy.array) – the tped containing the duplicated samples.
- tfam (numpy.array) – the tfam containing the duplicated samples.
- samples (dict) – the updated position of the samples in the tped containing only duplicated samples.
- oldSamples (dict) – the original duplicated sample positions.
- chosenSamples (dict) – the position of the chosen samples.
- prefix (str) – the prefix of all the files.
- completion (numpy.array) – the completion of each of the duplicated samples.
- completionT (float) – the completion threshold.
- concordance (dict) – the pairwise concordance of each of the duplicated samples.
- concordanceT (float) – the concordance threshold.
Using a tped containing duplicated samples, it creates a tped containing unique samples by completing a chosen sample with the other replicates.
Note
A chosen sample is not completed using bad replicates (those that don’t have a concordance or a completion higher than a certain threshold). The bad replicates are written in the file prefix.not_good_enough.
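One way to picture the completion of a chosen sample is sketched below. The rule used here is an assumption made for the example, since this page does not spell it out: a missing genotype of the chosen sample is filled in only when all good replicates that have a call agree. Both function names are hypothetical, "0 0" is the assumed no call, and thresholds are compared with >=.

```python
MISSING = "0 0"  # assumed encoding of a no call for this sketch


def good_replicates(indexes, completion, concordance, completion_t, concordance_t):
    """Keep the replicates that reach both thresholds (assumed rule).

    completion[i] and concordance[i] are fractions for replicate i, the
    concordance being taken against the chosen sample.
    """
    return [i for i in indexes
            if completion[i] >= completion_t and concordance[i] >= concordance_t]


def complete_sample(chosen, replicates):
    """Fill the no calls of `chosen` using unanimous replicate calls."""
    completed = []
    for marker, genotype in enumerate(chosen):
        if genotype == MISSING:
            calls = {rep[marker] for rep in replicates if rep[marker] != MISSING}
            if len(calls) == 1:  # every replicate with a call agrees
                genotype = calls.pop()
        completed.append(genotype)
    return completed
```

Replicates that fail good_replicates would be the ones reported as not good enough and left out of the completion.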

pyGenClean.DupSamples.duplicated_samples.findDuplicates(tfam)
Finds the duplicates in a TFAM.
Parameters: tfam (list) – representation of a tfam file.
Returns: two dict, containing the positions of the unique and of the duplicated samples.
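A minimal sketch of this grouping is given below. It assumes that a sample is identified by its family and individual IDs (the first two tfam columns) and that the tfam is a list of already-split rows; the name find_duplicates and the exact shape of the returned dictionaries are choices made for this example.

```python
from collections import defaultdict


def find_duplicates(tfam):
    """Return (unique, duplicated) position dictionaries.

    unique maps a sample ID to its single row index; duplicated maps a
    sample ID to the list of row indexes where it appears.
    """
    positions = defaultdict(list)
    for index, row in enumerate(tfam):
        sample_id = (row[0], row[1])  # (family ID, individual ID)
        positions[sample_id].append(index)

    unique = {s: p[0] for s, p in positions.items() if len(p) == 1}
    duplicated = {s: p for s, p in positions.items() if len(p) > 1}
    return unique, duplicated
```

Because a tfam row index is also the sample's genotype column index in the tped, these positions are what later steps need to split the tped.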

pyGenClean.DupSamples.duplicated_samples.main(argString=None)
Check for duplicated samples in a tfam/tped file.
Parameters: argString (list) – the options
Here are the steps for the duplicated samples step.
1. Prints the options.
2. Reads the tfam file (readTFAM()).
3. Separates the duplicated samples from the unique samples (findDuplicates()).
4. Writes the unique samples into a file named prefix.unique_samples.tfam (printUniqueTFAM()).
5. Reads the tped file and writes into prefix.unique_samples.tped the pedigree file for the unique samples (processTPED()). Saves in memory the pedigree for the duplicated samples. Updates the indexes of the duplicated samples.
6. If there are no duplicated samples, simply copies the files prefix.unique_samples (tped and tfam) to prefix.final.tped and prefix.final.tfam, respectively.
7. Computes the completion (for each of the duplicated samples) and the concordance of each sample pair (computeStatistics()).
8. Prints statistics (concordance and completion) (printStatistics()).
9. Prints the concordance matrix for each set of duplicated samples (printConcordance()).
10. Prints the tped and the tfam file for the duplicated samples (prefix.duplicated_samples) (printDuplicatedTPEDandTFAM()).
11. Chooses the best of each set of duplicates (to keep and to complete) according to completion and concordance (chooseBestDuplicates()).
12. Creates a unique tped and tfam from the duplicated samples by completing the best chosen one with the other samples (createAndCleanTPED()).
13. Merges the two tfiles together (prefix.unique_samples and prefix.chosen_samples) to create the final dataset (prefix.final) (addToTPEDandTFAM()).

pyGenClean.DupSamples.duplicated_samples.parseArgs(argString=None)
Parses the command line options and arguments.
Parameters: argString (list) – the options
Returns: An argparse.Namespace object created by the argparse module. It contains the values of the different options.

Options                          Type    Description
--tfile                          string  The input file prefix (of type tfile).
--sample-completion-threshold    float   The completion threshold.
--sample-concordance-threshold   float   The concordance threshold.
--out                            string  The prefix of the output files.

Note
No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see checkArgs()).
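The options in the table can be mirrored with a small argparse parser, sketched below. Only the option names and types come from the table; the function names, the help strings, the absence of default values and making --tfile required are assumptions made for this example.

```python
import argparse


def build_parser():
    """Build a parser exposing the four documented options."""
    parser = argparse.ArgumentParser(
        description="Sketch of the duplicated samples options."
    )
    # required=True is an assumption of this sketch.
    parser.add_argument("--tfile", type=str, required=True,
                        help="The input file prefix (of type tfile).")
    parser.add_argument("--sample-completion-threshold", type=float,
                        help="The completion threshold.")
    parser.add_argument("--sample-concordance-threshold", type=float,
                        help="The concordance threshold.")
    parser.add_argument("--out", type=str,
                        help="The prefix of the output files.")
    return parser


def parse_args(arg_string=None):
    """Parse a list of options, or sys.argv[1:] when arg_string is None."""
    return build_parser().parse_args(arg_string)
```

argparse turns the dashes of long option names into underscores, so the thresholds are read back as args.sample_completion_threshold and args.sample_concordance_threshold.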

pyGenClean.DupSamples.duplicated_samples.printConcordance(concordance, prefix)
Print the concordance.
Returns: the concordance percentage (dict)
The concordance is the number of genotypes that are equal when comparing a duplicated sample with another one, divided by the total number of genotypes (excluding genotypes that are no call [i.e. 0]). If a duplicated sample has 100% of no calls, the concordance will be zero.
The file prefix.concordance will contain \(N \times N\) matrices for each set of duplicated samples.
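The conversion from counts to a concordance matrix, including the zero case for a sample made only of no calls, can be sketched as follows. The function name and the use of nested lists are choices made for this example, and values are returned as fractions between 0 and 1.

```python
def concordance_percentages(same, compared):
    """Divide concordant counts by compared counts, cell by cell.

    same[i][j] is the number of equal genotypes between replicates i
    and j; compared[i][j] is the number of markers where both have a
    call. A cell with nothing to compare gets a concordance of zero.
    """
    size = len(same)
    return [
        [same[i][j] / compared[i][j] if compared[i][j] else 0.0
         for j in range(size)]
        for i in range(size)
    ]
```

Guarding the division is what makes a replicate with 100% of no calls yield zero instead of a division error.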

pyGenClean.DupSamples.duplicated_samples.printDuplicatedTPEDandTFAM(tped, tfam, samples, oldSamples, prefix)
Print the TPED and TFAM of the duplicated samples.
Parameters:
- tped (numpy.array) – the tped containing duplicated samples.
- tfam (numpy.array) – the tfam containing duplicated samples.
- samples (dict) – the updated position of the samples in the tped containing only duplicated samples.
- oldSamples (dict) – the original duplicated sample positions.
- prefix (str) – the prefix of all the files.
The tped and tfam files are written in prefix.duplicated_samples.tped and prefix.duplicated_samples.tfam, respectively.

pyGenClean.DupSamples.duplicated_samples.printStatistics(completion, concordance, tpedSamples, oldSamples, prefix)
Print the statistics in a file.
Parameters:
- completion (numpy.array) – the completion of each duplicated sample.
- concordance (dict) – the concordance of each duplicated sample.
- tpedSamples (dict) – the updated position of the samples in the tped containing only duplicated samples.
- oldSamples (dict) – the original duplicated sample positions.
- prefix (str) – the prefix of all the files.
Returns: the completion for each duplicated sample, as a numpy.array.
Prints the statistics (completion of each sample and pairwise concordance between duplicated samples) in a file (prefix.summary).

pyGenClean.DupSamples.duplicated_samples.printUniqueTFAM(tfam, samples, prefix)
Prints a new TFAM with only unique samples.

pyGenClean.DupSamples.duplicated_samples.processTPED(uniqueSamples, duplicatedSamples, fileName, prefix)
Process the TPED file.
Parameters:
- uniqueSamples (dict) – the position of unique samples.
- duplicatedSamples (collections.defaultdict) – the position of duplicated samples.
- fileName (str) – the name of the file.
- prefix (str) – the prefix of all the files.
Returns: a tuple containing the tped (numpy.array) as first element, and the updated positions of the duplicated samples (dict).
Reads the entire tped and prints another one containing only unique samples (prefix.unique_samples.tped). It then creates a numpy.array containing the duplicated samples.
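The split of one tped line can be sketched as below. It is an illustration under assumptions: split_tped_row is a hypothetical name, the row is assumed to be already split into four marker columns followed by one genotype per sample, and positions are sample indexes as found in the tfam.

```python
def split_tped_row(row, unique_positions, duplicated_positions):
    """Split one tped row into its unique and duplicated sample parts.

    Both returned rows keep the four marker columns (chromosome, marker
    name, genetic distance, physical position).
    """
    marker_info, genotypes = list(row[:4]), row[4:]
    unique_part = marker_info + [genotypes[i] for i in unique_positions]
    duplicated_part = marker_info + [genotypes[i] for i in duplicated_positions]
    return unique_part, duplicated_part
```

Once the duplicated samples are extracted, their column indexes no longer match the original file, which is why the function also returns updated positions for them.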