pyGenClean.DupSamples package
For more information about how to use this module, refer to the Duplicated Samples Module.
Module contents
Submodules
pyGenClean.DupSamples.duplicated_samples module
-
exception pyGenClean.DupSamples.duplicated_samples.ProgramError(msg)
Bases: exceptions.Exception
An Exception raised in case of a problem.
Parameters: msg (str) – the message to print to the user before exiting.
-
pyGenClean.DupSamples.duplicated_samples.addToTPEDandTFAM(tped, tfam, prefix, toAddPrefix)
Appends a tfile to another, creating a new one.
Here are the steps of this function:
- Writes the tped into prefix.chosen_samples.tped.
- Writes the tfam into prefix.chosen_samples.tfam.
- Copies the previous tfam (toAddPrefix.tfam) into the final tfam (prefix.final.tfam).
- Appends the tfam to the final tfam (prefix.final.tfam).
- Reads the previous tped (toAddPrefix.tped) and appends the new tped to it, writing the final one (prefix.final.tped).
Warning
The tped and tfam variables need to contain at least one sample.
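The append steps above can be sketched in memory as follows. This is only an illustration under assumptions, not the module's implementation: the function name append_tfile is hypothetical, a tfam is taken to be a list of rows (one per sample), and a tped is taken to be a list of rows (one per marker) whose first four fields describe the marker and whose remaining fields hold one genotype per sample.

```python
def append_tfile(prev_tfam, prev_tped, new_tfam, new_tped):
    """Append the samples of one tfile (new_*) to another (prev_*).

    Hypothetical helper: works on in-memory rows instead of files.
    """
    if len(prev_tped) != len(new_tped):
        raise ValueError("both tpeds must describe the same markers")

    # A tfam has one row per sample, so appending samples means adding rows.
    final_tfam = prev_tfam + new_tfam

    # A tped has one row per marker, so appending samples means adding
    # genotype columns to every marker row.
    final_tped = []
    for prev_row, new_row in zip(prev_tped, new_tped):
        if prev_row[:4] != new_row[:4]:
            raise ValueError("marker mismatch between the two tpeds")
        final_tped.append(prev_row + new_row[4:])

    return final_tfam, final_tped


prev_tfam = [["F1", "S1", "0", "0", "1", "-9"]]
prev_tped = [["1", "rs1", "0", "100", "A A"]]
new_tfam = [["F2", "S2", "0", "0", "2", "-9"]]
new_tped = [["1", "rs1", "0", "100", "A G"]]
final_tfam, final_tped = append_tfile(prev_tfam, prev_tped, new_tfam, new_tped)
```

The sketch makes the asymmetry explicit: samples are rows in a tfam but columns in a tped, which is why the two files are merged differently.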
-
pyGenClean.DupSamples.duplicated_samples.checkArgs(args)
Checks the arguments and options.
Parameters: args (argparse.Namespace) – an argparse.Namespace object containing the options of the program.
Returns: True if everything was OK.
If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr, and the program exits with code 1.
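The error-handling pattern described above might look like the sketch below, written for Python 3. The specific check is an assumption: this page does not list which options are validated, so the range test on the two thresholds and the names check_thresholds and run are hypothetical and serve only to show the raise, report, exit-code flow.

```python
import sys


class ProgramError(Exception):
    """Exception raised in case of a problem (stand-in for the module's class)."""

    def __init__(self, msg):
        super().__init__(msg)
        self.message = str(msg)

    def __str__(self):
        return self.message


def check_thresholds(completion_threshold, concordance_threshold):
    """Hypothetical check: both thresholds must lie between 0 and 1."""
    for name, value in (("completion", completion_threshold),
                        ("concordance", concordance_threshold)):
        if not 0.0 <= value <= 1.0:
            raise ProgramError(
                "{} threshold must be between 0 and 1 (got {})".format(name, value)
            )
    return True


def run(completion_threshold, concordance_threshold):
    """Return the exit code a caller could pass to sys.exit()."""
    try:
        check_thresholds(completion_threshold, concordance_threshold)
    except ProgramError as error:
        # Report the problem on the standard error stream, then signal failure.
        sys.stderr.write("ERROR: {}\n".format(error))
        return 1
    return 0
```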
-
pyGenClean.DupSamples.duplicated_samples.chooseBestDuplicates(tped, samples, oldSamples, completion, concordance_all, prefix)
Chooses the best duplicates according to the completion rate and the concordance.
Parameters:
- tped (numpy.array) – the tped containing the duplicated samples.
- samples (dict) – the updated position of the samples in the tped containing only duplicated samples.
- oldSamples (dict) – the original duplicated sample positions.
- completion (numpy.array) – the completion of each of the duplicated samples.
- concordance_all (dict) – the concordance of every duplicated sample.
- prefix (str) – the prefix of all the files.
Returns: a tuple where the first element is a list of the chosen samples’ indexes, the second one is the completion and the last one is the concordance (a map).
These are the steps to find the best duplicated sample:
1. Sort the list of concordances.
2. Sort the list of completions.
3. Choose the best of the concordance and put it in a set.
4. Choose the best of the completion and put it in a set.
5. Compute the intersection of the two sets. If there is one sample or more, then randomly choose one sample.
6. If the intersection doesn’t contain at least one sample, redo steps 3 and 4, but increase the number of chosen best by one. Redo steps 5 and 6 (if required).
The chosen samples are written in prefix.chosen_samples.info. The rest are written in prefix.excluded_samples.info.
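The selection loop above can be sketched as follows for a single set of replicates. This is an illustration under assumptions, not the module's code: choose_best is a hypothetical name, each replicate is assumed to have one completion score and one summary concordance score, and "the best n" is taken to mean the first n entries of each sorted list (ties are not treated specially).

```python
import random


def choose_best(completion, concordance, rng=random):
    """Return the index of the replicate to keep.

    completion and concordance are sequences with one score per replicate.
    """
    n_samples = len(completion)

    # Steps 1 and 2: rank the replicates by each criterion, best first.
    by_concordance = sorted(range(n_samples), key=lambda i: concordance[i], reverse=True)
    by_completion = sorted(range(n_samples), key=lambda i: completion[i], reverse=True)

    # Steps 3 to 6: widen the "best" sets until they share a replicate.
    for nb_best in range(1, n_samples + 1):
        best_concordance = set(by_concordance[:nb_best])
        best_completion = set(by_completion[:nb_best])
        candidates = best_concordance & best_completion
        if candidates:
            # Several candidates may remain, so pick one at random.
            return rng.choice(sorted(candidates))

    raise ValueError("no replicate to choose from")


# Replicate 1 has the best completion, replicate 0 the best concordance:
# the sets only intersect once the two best of each list are considered.
chosen = choose_best([0.90, 0.99, 0.95], [0.99, 0.98, 0.97])
```

The loop always terminates for a non-empty input, because once nb_best reaches the number of replicates both sets contain every replicate.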
-
pyGenClean.DupSamples.duplicated_samples.computeStatistics(tped, tfam, samples, oldSamples, prefix)
Computes the completion and concordance of each sample.
Parameters:
- tped (numpy.array) – the tped containing duplicated samples.
- tfam (numpy.array) – the tfam containing duplicated samples.
- samples (dict) – the updated position of the samples in the tped containing only duplicated samples.
- oldSamples (dict) – the original duplicated sample positions.
- prefix (str) – the prefix of all the files.
Returns: a tuple containing the completion (numpy.array) as first element, and the concordance (dict) as last element.
Reads the tped file and computes the completion for each duplicated sample and the pairwise concordance between duplicated samples.
Note
The completion and concordance computation excludes a marker if it is on chromosome 24 and the sample is a female.
Note
A missing genotype is encoded by 0.
Note
No percentage is computed here, only the numbers. Percentages are computed in other functions: printStatistics(), for completion, and printConcordance(), for concordance.
Completion
Computes the completion of non-zero values (where all genotypes of at least one duplicated sample are no call [i.e. 0]). The completion of sample \(i\) (i.e. \(Comp_i\)) is the number of genotypes that have a call divided by the total number of genotypes (the set \(G_i\)):
\[Comp_i = \frac{||g \in G_i\textrm{ where }g \neq 0||}{||G_i||}\]
Note
We consider a genotype as being missing if the sample is a male and if a marker on chromosome 23 or 24 is heterozygous.
Concordance
Computes the pairwise concordance between duplicated samples. For each marker, if both genotypes are not missing, we add one to the total number of compared markers. If both genotypes are the same, we add one to the number of concordant calls. We write the observed genotype difference in the file prefix.diff. The concordance between sample \(i\) and \(j\) (i.e. \(Concordance_{i,j}\)) is the number of genotypes that are equal divided by the total number of genotypes (excluding the no calls):
\[Concordance_{i,j} = \frac{ ||g \in G_i \cup G_j \textrm{ where } g_i = g_j \neq 0|| }{ ||g \in G_i \cup G_j \textrm{ where } g \neq 0|| }\]
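The two counts defined above can be sketched with NumPy as follows. This is a simplified illustration, not the module's code: the function names are hypothetical, genotypes are assumed to be stored as strings in a (markers x samples) array with "0 0" as the missing call, alleles are assumed to be written in a consistent order, and the sex-chromosome rules from the notes are deliberately left out. As in the text, only counts are returned, not percentages.

```python
import numpy as np

MISSING = "0 0"  # assumed encoding of a no call


def completion_counts(genotypes):
    """Return (number of called genotypes, total genotypes) per sample."""
    called = (genotypes != MISSING).sum(axis=0)
    total = np.full(genotypes.shape[1], genotypes.shape[0])
    return called, total


def concordance_counts(genotypes):
    """Return (concordant calls, compared markers) for every sample pair."""
    n_samples = genotypes.shape[1]
    concordant = np.zeros((n_samples, n_samples), dtype=int)
    compared = np.zeros((n_samples, n_samples), dtype=int)
    called = genotypes != MISSING
    for i in range(n_samples):
        for j in range(n_samples):
            # A marker is compared only if both genotypes are called.
            both = called[:, i] & called[:, j]
            compared[i, j] = both.sum()
            concordant[i, j] = (both & (genotypes[:, i] == genotypes[:, j])).sum()
    return concordant, compared


genotypes = np.array([
    ["A A", "A A"],
    ["A G", "0 0"],
    ["G G", "A G"],
])
called, total = completion_counts(genotypes)
concordant, compared = concordance_counts(genotypes)
```

In this toy input the second sample has one no call, so only two markers are compared between the pair, and one of them is concordant.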
-
pyGenClean.DupSamples.duplicated_samples.createAndCleanTPED(tped, tfam, samples, oldSamples, chosenSamples, prefix, completion, completionT, concordance, concordanceT)
Completes a TPED for duplicated samples.
Parameters:
- tped (numpy.array) – the tped containing the duplicated samples.
- tfam (numpy.array) – the tfam containing the duplicated samples.
- samples (dict) – the updated position of the samples in the tped containing only duplicated samples.
- oldSamples (dict) – the original duplicated sample positions.
- chosenSamples (dict) – the position of the chosen samples.
- prefix (str) – the prefix of all the files.
- completion (numpy.array) – the completion of each of the duplicated samples.
- completionT (float) – the completion threshold.
- concordance (dict) – the pairwise concordance of each of the duplicated samples.
- concordanceT (float) – the concordance threshold.
Using a tped containing duplicated samples, it creates a tped containing unique samples by completing a chosen sample with the other replicates.
Note
A chosen sample is not completed using bad replicates (those that don’t have a concordance or a completion higher than a certain threshold). The bad replicates are written in the file prefix.not_good_enough.
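One way to picture the completion step is the sketch below. It rests on several assumptions that this page does not confirm: complete_sample is a hypothetical name, genotypes are strings with "0 0" as the missing call, a replicate counts as good when both of its scores reach the thresholds (whether the boundary is inclusive is a guess), and a missing genotype is filled only when the good replicates agree on a single call.

```python
MISSING = "0 0"  # assumed encoding of a no call


def complete_sample(chosen, replicates, completion, concordance,
                    completion_t, concordance_t):
    """Fill the missing genotypes of the chosen sample from good replicates.

    chosen      -- genotypes of the chosen sample, one per marker
    replicates  -- one genotype list per other replicate
    completion  -- completion rate of each replicate
    concordance -- concordance of each replicate with the chosen sample
    """
    # Bad replicates are never used to complete the chosen sample.
    good = [
        genotypes
        for genotypes, comp, conc in zip(replicates, completion, concordance)
        if comp >= completion_t and conc >= concordance_t
    ]

    completed = []
    for marker, genotype in enumerate(chosen):
        if genotype == MISSING:
            observed = {rep[marker] for rep in good} - {MISSING}
            # Assumption: only an unambiguous call is copied over.
            if len(observed) == 1:
                genotype = observed.pop()
        completed.append(genotype)
    return completed


completed = complete_sample(
    chosen=["A A", "0 0", "0 0"],
    replicates=[["A A", "A G", "0 0"], ["G G", "G G", "C C"]],
    completion=[0.95, 0.50],
    concordance=[1.00, 0.20],
    completion_t=0.90,
    concordance_t=0.97,
)
```

Here the second replicate fails both thresholds, so its calls are ignored and the last marker stays missing.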
-
pyGenClean.DupSamples.duplicated_samples.findDuplicates(tfam)
Finds the duplicates in a TFAM.
Parameters: tfam (list) – representation of a tfam file.
Returns: two dict, containing unique and duplicated samples position.
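A minimal sketch of this kind of lookup is shown below. It assumes, without confirmation from this page, that a sample is identified by the first two columns of its tfam row (family ID and individual ID); the name find_duplicates and the exact shape of the returned dictionaries are hypothetical.

```python
from collections import defaultdict


def find_duplicates(tfam):
    """Split tfam rows into unique and duplicated samples.

    Returns two dicts keyed by (family ID, individual ID): the position of
    each unique sample, and the list of positions of each duplicated sample.
    """
    positions = defaultdict(list)
    for index, row in enumerate(tfam):
        positions[(row[0], row[1])].append(index)

    unique = {sample: pos[0] for sample, pos in positions.items() if len(pos) == 1}
    duplicated = {sample: pos for sample, pos in positions.items() if len(pos) > 1}
    return unique, duplicated


tfam = [
    ["F1", "S1", "0", "0", "1", "-9"],
    ["F2", "S2", "0", "0", "2", "-9"],
    ["F1", "S1", "0", "0", "1", "-9"],
]
unique, duplicated = find_duplicates(tfam)
```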
-
pyGenClean.DupSamples.duplicated_samples.main(argString=None)
Checks for duplicated samples in a tfam/tped file.
Parameters: argString (list) – the options
Here are the steps for the duplicated samples step.
- Prints the options.
- Reads the tfam file (readTFAM()).
- Separates the duplicated samples from the unique samples (findDuplicates()).
- Writes the unique samples into a file named prefix.unique_samples.tfam (printUniqueTFAM()).
- Reads the tped file and writes into prefix.unique_samples.tped the pedigree file for the unique samples (processTPED()). Saves in memory the pedigree for the duplicated samples. Updates the indexes of the duplicated samples.
- If there are no duplicated samples, simply copies the files prefix.unique_samples (tped and tfam) to prefix.final.tped and prefix.final.tfam, respectively.
- Computes the completion (for each of the duplicated samples) and the concordance of each sample pair (computeStatistics()).
- Prints statistics (concordance and completion) (printStatistics()).
- Prints the concordance matrix for each set of duplicated samples (printConcordance()).
- Prints the tped and the tfam file for the duplicated samples (prefix.duplicated_samples) (printDuplicatedTPEDandTFAM()).
- Chooses the best of each set of duplicates (to keep and to complete) according to completion and concordance (chooseBestDuplicates()).
- Creates a unique tped and tfam from the duplicated samples by completing the best chosen one with the other samples (createAndCleanTPED()).
- Merges the two tfiles together (prefix.unique_samples and prefix.chosen_samples) to create the final dataset (prefix.final) (addToTPEDandTFAM()).
-
pyGenClean.DupSamples.duplicated_samples.parseArgs(argString=None)
Parses the command line options and arguments.
Parameters: argString (list) – the options
Returns: An argparse.Namespace object created by the argparse module. It contains the values of the different options.

Options                          Type    Description
--tfile                          string  The input file prefix (of type tfile).
--sample-completion-threshold    float   The completion threshold.
--sample-concordance-threshold   float   The concordance threshold.
--out                            string  The prefix of the output files.

Note
No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see checkArgs()).
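A parser exposing the four options listed above could be declared as in the sketch below. Only the option names and types come from the table; the function name parse_args, the help strings, and the absence of default values are assumptions, and the module's real defaults are not shown on this page.

```python
import argparse


def parse_args(arg_string=None):
    """Build a parser for the four documented options (sketch only)."""
    parser = argparse.ArgumentParser(
        description="Duplicated samples options (illustrative sketch)."
    )
    parser.add_argument("--tfile", type=str,
                        help="The input file prefix (of type tfile).")
    parser.add_argument("--sample-completion-threshold", type=float,
                        help="The completion threshold.")
    parser.add_argument("--sample-concordance-threshold", type=float,
                        help="The concordance threshold.")
    parser.add_argument("--out", type=str,
                        help="The prefix of the output files.")
    # With arg_string=None, argparse falls back to sys.argv[1:].
    return parser.parse_args(arg_string)


args = parse_args([
    "--tfile", "data",
    "--sample-completion-threshold", "0.9",
    "--sample-concordance-threshold", "0.97",
    "--out", "dup_samples",
])
```

argparse converts the dashes in option names to underscores, so the thresholds are read back as args.sample_completion_threshold and args.sample_concordance_threshold.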
-
pyGenClean.DupSamples.duplicated_samples.printConcordance(concordance, prefix)
Prints the concordance.
Returns: the concordance percentage (dict)
The concordance is the number of genotypes that are equal when comparing a duplicated sample with another one, divided by the total number of genotypes (excluding genotypes that are no call [i.e. 0]). If a duplicated sample has 100% of no calls, the concordance will be zero.
The file prefix.concordance will contain \(N \times N\) matrices for each set of duplicated samples.
-
pyGenClean.DupSamples.duplicated_samples.printDuplicatedTPEDandTFAM(tped, tfam, samples, oldSamples, prefix)
Prints the TPED and TFAM of the duplicated samples.
Parameters:
- tped (numpy.array) – the tped containing duplicated samples.
- tfam (numpy.array) – the tfam containing duplicated samples.
- samples (dict) – the updated position of the samples in the tped containing only duplicated samples.
- oldSamples (dict) – the original duplicated sample positions.
- prefix (str) – the prefix of all the files.
The tped and tfam files are written in prefix.duplicated_samples.tped and prefix.duplicated_samples.tfam, respectively.
-
pyGenClean.DupSamples.duplicated_samples.printStatistics(completion, concordance, tpedSamples, oldSamples, prefix)
Prints the statistics in a file.
Parameters:
- completion (numpy.array) – the completion of each duplicated sample.
- concordance (dict) – the concordance of each duplicated sample.
- tpedSamples (dict) – the updated position of the samples in the tped containing only duplicated samples.
- oldSamples (dict) – the original duplicated sample positions.
- prefix (str) – the prefix of all the files.
Returns: the completion for each duplicated sample, as a numpy.array.
Prints the statistics (completion of each sample and pairwise concordance between duplicated samples) in a file (prefix.summary).
-
pyGenClean.DupSamples.duplicated_samples.printUniqueTFAM(tfam, samples, prefix)
Prints a new TFAM with only unique samples.
-
pyGenClean.DupSamples.duplicated_samples.processTPED(uniqueSamples, duplicatedSamples, fileName, prefix)
Processes the TPED file.
Parameters:
- uniqueSamples (dict) – the position of unique samples.
- duplicatedSamples (collections.defaultdict) – the position of duplicated samples.
- fileName (str) – the name of the file.
- prefix (str) – the prefix of all the files.
Returns: a tuple containing the tped (numpy.array) as first element, and the updated positions of the duplicated samples (dict)
Reads the entire tped and prints another one containing only unique samples (prefix.unique_samples.tped). It then creates a numpy.array containing the duplicated samples.
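The column split described above can be sketched on a single tped row. The helper names are hypothetical, and the sketch assumes that the first four fields of a row describe the marker and that sample positions are zero-based indexes into the genotype columns that follow.

```python
def split_tped_row(row, unique_positions, duplicated_positions):
    """Split one tped row into a unique-samples row and a duplicated-samples row."""
    marker_info, genotypes = row[:4], row[4:]
    unique_row = marker_info + [genotypes[i] for i in unique_positions]
    duplicated_row = marker_info + [genotypes[i] for i in duplicated_positions]
    return unique_row, duplicated_row


def updated_positions(duplicated_positions):
    """Map each original sample position to its position in the duplicated-only data."""
    return {old: new for new, old in enumerate(duplicated_positions)}


row = ["1", "rs1", "0", "100", "A A", "A G", "G G", "A A"]
unique_row, duplicated_row = split_tped_row(row, [1], [0, 2, 3])
new_positions = updated_positions([0, 2, 3])
```

The second helper illustrates why the positions need updating: once the unique samples are removed, the duplicated samples no longer sit at their original column indexes.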
