pyGenClean.DupSamples package

For more information about how to use this module, refer to the Duplicated Samples Module.

Module contents

Submodules

pyGenClean.DupSamples.duplicated_samples module

exception pyGenClean.DupSamples.duplicated_samples.ProgramError(msg)[source]

Bases: exceptions.Exception

An Exception raised in case of a problem.

Parameters: msg (str) – the message to print to the user before exiting.

pyGenClean.DupSamples.duplicated_samples.addToTPEDandTFAM(tped, tfam, prefix, toAddPrefix)[source]

Append a tfile to another, creating a new one.

Parameters:
  • tped (numpy.array) – the tped that will be appended to the other one.
  • tfam (numpy.array) – the tfam that will be appended to the other one.
  • prefix (str) – the prefix of all the files.
  • toAddPrefix (str) – the prefix of the final file.

Here are the steps of this function:

  1. Writes the tped into prefix.chosen_samples.tped.
  2. Writes the tfam into prefix.chosen_samples.tfam.
  3. Copies the previous tfam (toAddPrefix.tfam) into the final tfam (prefix.final.tfam).
  4. Appends the tfam to the final tfam (prefix.final.tfam).
  5. Reads the previous tped (toAddPrefix.tped) and appends the new tped to it, writing the final one (prefix.final.tped).

Warning

The tped and tfam variables need to contain at least one sample.
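
The merge in steps 3 to 5 can be pictured with a minimal in-memory sketch. It is not the module's code: the function name append_tfile and its arguments are hypothetical, it works on lists of lines instead of files, and it assumes both tfiles list the same markers in the same order. It relies only on the layout of a transposed fileset: one line per sample in the tfam, one line per marker in the tped.

```python
def append_tfile(previous_tfam, previous_tped, new_tfam, new_genotypes):
    """Return the merged (tfam, tped) lines of two tfiles sharing the same markers.

    previous_tfam, previous_tped: lines of the tfile that receives the samples.
    new_tfam: one row (list of str) per added sample.
    new_genotypes: one row per marker, holding the added samples' genotypes.
    """
    if len(previous_tped) != len(new_genotypes):
        raise ValueError("both tfiles must describe the same markers")
    # A tfam has one line per sample, so the new samples are appended as rows.
    final_tfam = [line.rstrip("\n") for line in previous_tfam]
    final_tfam += ["\t".join(row) for row in new_tfam]
    # A tped has one line per marker, so the new samples are appended as columns.
    final_tped = [
        "\t".join([line.rstrip("\n")] + list(genotypes))
        for line, genotypes in zip(previous_tped, new_genotypes)
    ]
    return final_tfam, final_tped


# Hypothetical data: one existing sample, one added sample, two markers.
previous_tfam = ["F1\tS1\t0\t0\t1\t-9\n"]
previous_tped = ["1\trs1\t0\t100\tA A\n", "1\trs2\t0\t200\tA G\n"]
new_tfam = [["F2", "S2", "0", "0", "2", "-9"]]
new_genotypes = [["G G"], ["0 0"]]
final_tfam, final_tped = append_tfile(
    previous_tfam, previous_tped, new_tfam, new_genotypes
)
```
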

pyGenClean.DupSamples.duplicated_samples.checkArgs(args)[source]

Checks the arguments and options.

Parameters: args (argparse.Namespace) – an argparse.Namespace object containing the options of the program.
Returns: True if everything was OK.

If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr and the program exits with code 1.

pyGenClean.DupSamples.duplicated_samples.chooseBestDuplicates(tped, samples, oldSamples, completion, concordance_all, prefix)[source]

Choose the best duplicates according to the completion rate.

Parameters:
  • tped (numpy.array) – the tped containing the duplicated samples.
  • samples (dict) – the updated position of the samples in the tped containing only duplicated samples.
  • oldSamples (dict) – the original duplicated sample positions.
  • completion (numpy.array) – the completion of each of the duplicated samples.
  • concordance_all (dict) – the concordance of every duplicated sample.
  • prefix (str) – the prefix of all the files.
Returns:

a tuple where the first element is a list of the chosen samples’ indexes, the second one is the completion and the last one is the concordance (a map).

These are the steps to find the best duplicated sample:

  1. Sort the list of concordances.
  2. Sort the list of completions.
  3. Choose the best of the concordance and put it in a set.
  4. Choose the best of the completion and put it in a set.
  5. Compute the intersection of the two sets. If there is one sample or more, then randomly choose one sample.
  6. If the intersection doesn’t contain at least one sample, redo steps 3 and 4, but increase the number of chosen best by one. Redo steps 5 and 6 (if required).

The chosen samples are written in prefix.chosen_samples.info. The rest are written in prefix.excluded_samples.info.
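
The six steps above can be sketched for a single set of replicates as follows. This is an illustration under stated assumptions, not the module's implementation: choose_best is a hypothetical name, each replicate's concordance is assumed to be summarised by one score (mean_concordance), and ties are resolved by position rather than by value.

```python
import random


def choose_best(completion, mean_concordance, rng=random):
    """Pick one replicate index using a widening top-k intersection."""
    n = len(completion)
    # Steps 1 and 2: rank the replicate indexes from best to worst.
    by_concordance = sorted(
        range(n), key=lambda i: mean_concordance[i], reverse=True
    )
    by_completion = sorted(range(n), key=lambda i: completion[i], reverse=True)
    for k in range(1, n + 1):
        # Steps 3 and 4: the k best replicates under each criterion.
        best_concordance = set(by_concordance[:k])
        best_completion = set(by_completion[:k])
        # Step 5: a replicate present in both sets is acceptable.
        common = best_concordance & best_completion
        if common:
            return rng.choice(sorted(common))
        # Step 6: otherwise widen both sets by one and try again.
    raise ValueError("no replicates to choose from")


# Hypothetical scores for three replicates of the same sample.
chosen = choose_best([0.90, 0.99, 0.95], [0.99, 0.98, 0.97])
```

With these scores, the best replicate by concordance (index 0) differs from the best by completion (index 1), so the sets are widened to two replicates each and index 1 is the only one they share. When k reaches the number of replicates, both sets contain every replicate, so the loop always ends with a choice for a non-empty input.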

pyGenClean.DupSamples.duplicated_samples.computeStatistics(tped, tfam, samples, oldSamples, prefix)[source]

Computes the completion and concordance of each sample.

Parameters:
  • tped (numpy.array) – the tped containing duplicated samples.
  • tfam (numpy.array) – the tfam containing duplicated samples.
  • samples (dict) – the updated position of the samples in the tped containing only duplicated samples.
  • oldSamples (dict) – the original duplicated sample positions.
  • prefix (str) – the prefix of all the files.
Returns:

a tuple containing the completion (numpy.array) as first element, and the concordance (dict) as last element.

Reads the tped file and computes the completion for each duplicated sample and the pairwise concordance between duplicated samples.

Note

The completion and concordance computation excludes a marker if it is on chromosome 24 and the sample is female.

Note

A missing genotype is encoded by 0.

Note

No percentage is computed here, only the numbers. Percentages are computed in other functions: printStatistics(), for completion, and printConcordance(), for concordance.

Completion

Computes the completion of non-zero values (where all genotypes of at least one duplicated sample are no call [i.e. 0]). The completion of sample \(i\) (i.e. \(Comp_i\)) is the number of genotypes that have a call divided by the total number of genotypes (the set \(G_i\)):

\[Comp_i = \frac{||g \in G_i\textrm{ where }g \neq 0||}{||G_i||}\]

Note

A genotype is considered missing if the sample is male and the marker is heterozygous on chromosome 23 or 24.
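
As a rough illustration of the completion counts, the sketch below tallies called genotypes per sample. It makes several assumptions that are not taken from the module: the name completion_counts is hypothetical, genotypes are held as a list of rows (one per marker) of strings, a no call is written "0 0", and the sex-chromosome rules described in the notes are left out.

```python
MISSING = "0 0"  # assumed encoding of a no call in this sketch


def completion_counts(genotypes):
    """Return (called, total): called genotypes per sample, and marker count."""
    n_samples = len(genotypes[0])
    called = [0] * n_samples
    for row in genotypes:
        for sample, genotype in enumerate(row):
            if genotype != MISSING:
                called[sample] += 1
    return called, len(genotypes)


# Hypothetical data: four markers (rows) for three replicates (columns).
genotypes = [
    ["A A", "A A", "0 0"],
    ["A G", "0 0", "0 0"],
    ["G G", "G G", "G G"],
    ["A A", "A A", "A G"],
]
called, total = completion_counts(genotypes)
# Only counts are produced at this stage; the ratio is derived afterwards.
rates = [count / total for count in called]
```
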

Concordance

Computes the pairwise concordance between duplicated samples. For each marker, if both genotypes are not missing, we add one to the total number of compared markers. If both genotypes are the same, we add one to the number of concordant calls. We write the observed genotype difference in the file prefix.diff. The concordance between sample \(i\) and \(j\) (i.e. \(Concordance_{i,j}\)) is the number of genotypes that are equal divided by the total number of genotypes (excluding the no calls):

\[Concordance_{i,j} = \frac{ ||g \in G_i \cup G_j \textrm{ where } g_i = g_j \neq 0|| }{ ||g \in G_i \cup G_j \textrm{ where } g \neq 0|| }\]
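
The counting rule described above can be sketched as follows. The names are hypothetical, a no call is assumed to be written "0 0", and two genotypes are assumed to be equal regardless of allele order; the writing of prefix.diff and the sex-chromosome rules are omitted.

```python
from itertools import combinations

MISSING = "0 0"  # assumed encoding of a no call in this sketch


def same_genotype(first, second):
    """Compare two genotypes regardless of allele order (an assumption)."""
    return sorted(first.split()) == sorted(second.split())


def pairwise_concordance(genotypes):
    """Return (concordant, compared) counts keyed by sample pair (i, j)."""
    n_samples = len(genotypes[0])
    concordant = {pair: 0 for pair in combinations(range(n_samples), 2)}
    compared = dict(concordant)
    for row in genotypes:
        for i, j in concordant:
            # A marker is compared only if both genotypes have a call.
            if MISSING in (row[i], row[j]):
                continue
            compared[(i, j)] += 1
            if same_genotype(row[i], row[j]):
                concordant[(i, j)] += 1
    return concordant, compared


# Hypothetical data: four markers (rows) for three replicates (columns).
genotypes = [
    ["A A", "A A", "0 0"],
    ["A G", "0 0", "0 0"],
    ["G G", "G G", "G G"],
    ["A A", "A A", "A G"],
]
concordant, compared = pairwise_concordance(genotypes)
# A pair with no compared marker gets a concordance of zero.
ratios = {
    pair: (concordant[pair] / compared[pair] if compared[pair] else 0.0)
    for pair in compared
}
```
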
pyGenClean.DupSamples.duplicated_samples.createAndCleanTPED(tped, tfam, samples, oldSamples, chosenSamples, prefix, completion, completionT, concordance, concordanceT)[source]

Complete a TPED for duplicate samples.

Parameters:
  • tped (numpy.array) – the tped containing the duplicated samples.
  • tfam (numpy.array) – the tfam containing the duplicated samples.
  • samples (dict) – the updated position of the samples in the tped containing only duplicated samples.
  • oldSamples (dict) – the original duplicated sample positions.
  • chosenSamples (dict) – the position of the chosen samples.
  • prefix (str) – the prefix of all the files.
  • completion (numpy.array) – the completion of each of the duplicated samples.
  • completionT (float) – the completion threshold.
  • concordance (dict) – the pairwise concordance of each of the duplicated samples.
  • concordanceT (float) – the concordance threshold.

Using a tped containing duplicated samples, it creates a tped containing unique samples by completing a chosen sample with the other replicates.

Note

A chosen sample is not completed using bad replicates (those that don’t have a concordance or a completion higher than a certain threshold). The bad replicates are written in the file prefix.not_good_enough.

pyGenClean.DupSamples.duplicated_samples.findDuplicates(tfam)[source]

Finds the duplicates in a TFAM.

Parameters: tfam (list) – representation of a tfam file.
Returns: two dicts, containing the positions of the unique and of the duplicated samples.
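
One way to picture the two returned dictionaries is the sketch below. It assumes, without confirmation from the module, that a sample is identified by its first two tfam columns (family ID and individual ID); the name find_duplicates and the exact shape of the values are hypothetical.

```python
from collections import defaultdict


def find_duplicates(tfam):
    """Split sample positions into unique and duplicated ones."""
    positions = defaultdict(list)
    for index, row in enumerate(tfam):
        # Assumption: (family ID, individual ID) identifies a sample.
        positions[(row[0], row[1])].append(index)
    unique = {
        sample: found[0] for sample, found in positions.items() if len(found) == 1
    }
    duplicated = {
        sample: found for sample, found in positions.items() if len(found) > 1
    }
    return unique, duplicated


# Hypothetical tfam rows: sample S1 appears at positions 0 and 2.
tfam = [
    ("F1", "S1", "0", "0", "1", "-9"),
    ("F2", "S2", "0", "0", "2", "-9"),
    ("F1", "S1", "0", "0", "1", "-9"),
]
unique, duplicated = find_duplicates(tfam)
```
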
pyGenClean.DupSamples.duplicated_samples.main(argString=None)[source]

Check for duplicated samples in a tfam/tped file.

Parameters: argString (list) – the options.

Here are the steps for the duplicated samples step.

  1. Prints the options.
  2. Reads the tfam file (readTFAM()).
  3. Separates the duplicated samples from the unique samples (findDuplicates()).
  4. Writes the unique samples into a file named prefix.unique_samples.tfam (printUniqueTFAM()).
  5. Reads the tped file and writes into prefix.unique_samples.tped the pedigree file for the unique samples (processTPED()). Saves in memory the pedigree for the duplicated samples. Updates the indexes of the duplicated samples.
  6. If there are no duplicated samples, simply copies the files prefix.unique_samples (tped and tfam) to prefix.final.tped and prefix.final.tfam, respectively.
  7. Computes the completion (for each of the duplicated samples) and the concordance of each sample pairs (computeStatistics()).
  8. Prints statistics (concordance and completion) (printStatistics()).
  9. Prints the concordance matrix for each set of duplicated samples (printConcordance()).
  10. Prints the tped and the tfam files for the duplicated samples (prefix.duplicated_samples) (printDuplicatedTPEDandTFAM()).
  11. Chooses the best sample of each set of duplicates (to keep and to complete) according to completion and concordance (chooseBestDuplicates()).
  12. Creates a unique tped and tfam from the duplicated samples by completing the best chosen one with the other samples (createAndCleanTPED()).
  13. Merges the two tfiles together (prefix.unique_samples and prefix.chosen_samples) to create the final dataset (prefix.final) (addToTPEDandTFAM()).

pyGenClean.DupSamples.duplicated_samples.parseArgs(argString=None)[source]

Parses the command line options and arguments.

Parameters: argString (list) – the options.
Returns: An argparse.Namespace object created by the argparse module. It contains the values of the different options.

Options                          Type    Description
--tfile                          string  The input file prefix (of type tfile).
--sample-completion-threshold    float   The completion threshold.
--sample-concordance-threshold   float   The concordance threshold.
--out                            string  The prefix of the output files.

Note

No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see checkArgs()).
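
A parser exposing the four options of the table could look like the minimal sketch below. It is not the module's parser: the function name parse_args and the help strings are illustrative, and no default values are set because the documentation above does not state any.

```python
import argparse


def parse_args(arg_string=None):
    """Parse the four documented options; no validation beyond argparse's own."""
    parser = argparse.ArgumentParser(
        description="Check for duplicated samples in a tfile."
    )
    parser.add_argument("--tfile", type=str,
                        help="The input file prefix (of type tfile).")
    parser.add_argument("--sample-completion-threshold", type=float,
                        help="The completion threshold.")
    parser.add_argument("--sample-concordance-threshold", type=float,
                        help="The concordance threshold.")
    parser.add_argument("--out", type=str,
                        help="The prefix of the output files.")
    # With arg_string=None, argparse falls back to sys.argv[1:].
    return parser.parse_args(arg_string)


# Hypothetical invocation with an explicit option list.
args = parse_args(
    ["--tfile", "dataset", "--sample-completion-threshold", "0.9", "--out", "dup"]
)
```

argparse turns the dashes of an option name into underscores, so the thresholds are read as args.sample_completion_threshold and args.sample_concordance_threshold.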

pyGenClean.DupSamples.duplicated_samples.printConcordance(concordance, prefix)[source]

Print the concordance.

Parameters:
  • concordance (dict) – the concordance of each sample.
  • prefix (str) – the prefix of all the files.
Returns:

the concordance percentage (dict)

The concordance is the number of genotypes that are equal when comparing a duplicated sample with another one, divided by the total number of genotypes (excluding genotypes that are no call [i.e. 0]). If a duplicated sample has 100% of no calls, the concordance will be zero.

The file prefix.concordance will contain \(N \times N\) matrices for each set of duplicated samples.

pyGenClean.DupSamples.duplicated_samples.printDuplicatedTPEDandTFAM(tped, tfam, samples, oldSamples, prefix)[source]

Print the TPED and TFAM of the duplicated samples.

Parameters:
  • tped (numpy.array) – the tped containing duplicated samples.
  • tfam (numpy.array) – the tfam containing duplicated samples.
  • samples (dict) – the updated position of the samples in the tped containing only duplicated samples.
  • oldSamples (dict) – the original duplicated sample positions.
  • prefix (str) – the prefix of all the files.

The tped and tfam files are written in prefix.duplicated_samples.tped and prefix.duplicated_samples.tfam, respectively.

pyGenClean.DupSamples.duplicated_samples.printStatistics(completion, concordance, tpedSamples, oldSamples, prefix)[source]

Print the statistics in a file.

Parameters:
  • completion (numpy.array) – the completion of each duplicated sample.
  • concordance (dict) – the concordance of each duplicated sample.
  • tpedSamples (dict) – the updated position of the samples in the tped containing only duplicated samples.
  • oldSamples (dict) – the original duplicated sample positions.
  • prefix (str) – the prefix of all the files.
Returns:

the completion for each duplicated sample, as a numpy.array.

Prints the statistics (completion of each sample and pairwise concordance between duplicated samples) in a file (prefix.summary).

pyGenClean.DupSamples.duplicated_samples.printUniqueTFAM(tfam, samples, prefix)[source]

Prints a new TFAM with only unique samples.

Parameters:
  • tfam (list) – a representation of a TFAM file.
  • samples (dict) – the position of the samples
  • prefix (str) – the prefix of the output file name

pyGenClean.DupSamples.duplicated_samples.processTPED(uniqueSamples, duplicatedSamples, fileName, prefix)[source]

Process the TPED file.

Parameters:
  • uniqueSamples (dict) – the position of unique samples.
  • duplicatedSamples (collections.defaultdict) – the position of duplicated samples.
  • fileName (str) – the name of the file.
  • prefix (str) – the prefix of all the files.
Returns:

a tuple containing the tped (numpy.array) as first element, and the updated positions of the duplicated samples (dict)

Reads the entire tped and prints another one containing only unique samples (prefix.unique_samples.tped). It then creates a numpy.array containing the duplicated samples.

pyGenClean.DupSamples.duplicated_samples.readTFAM(fileName)[source]

Reads the TFAM file.

Parameters: fileName (str) – the name of the tfam file.
Returns: a list of tuples, representing the tfam file.

pyGenClean.DupSamples.duplicated_samples.safe_main()[source]

A safe version of the main function (that catches ProgramError).
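
The wrapper pattern can be sketched as follows, in Python 3 syntax. Everything here is a stand-in: this ProgramError only mimics the documented exception, failing_main and its message are hypothetical, and returning an exit code of 1 is an assumption based on the behaviour described for checkArgs().

```python
import sys


class ProgramError(Exception):
    """Stand-in for the module's ProgramError, which carries a message."""

    def __init__(self, msg):
        super().__init__(msg)
        self.message = str(msg)


def failing_main():
    """Hypothetical main function that hits a problem."""
    raise ProgramError("--tfile: no such file prefix")


def safe_main(run):
    """Run a main function, turning ProgramError into a message and a code."""
    try:
        run()
    except ProgramError as error:
        # Report the problem to the user instead of showing a traceback.
        print(error.message, file=sys.stderr)
        return 1
    return 0


exit_code = safe_main(failing_main)
```

A script entry point would then typically end with sys.exit(safe_main(main)).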