pyGenClean.DupSNPs package

For more information about how to use this module, refer to the Duplicated Markers Module.

Module contents

Submodules

pyGenClean.DupSNPs.duplicated_snps module

exception pyGenClean.DupSNPs.duplicated_snps.ProgramError(msg)

Bases: exceptions.Exception

An Exception raised in case of a problem.

Parameters: msg (str) – the message to print to the user before exiting.

pyGenClean.DupSNPs.duplicated_snps.checkArgs(args)

Checks the arguments and options.

Parameters: args (argparse.Namespace) – an object containing the options of the program.
Returns: True if everything was OK.

If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr and the program exits with code 1.

pyGenClean.DupSNPs.duplicated_snps.chooseBestSnps(tped, snps, trueCompletion, trueConcordance, prefix)

Choose the best duplicates according to the completion and concordance.

Parameters:
  • tped (numpy.array) – a representation of the tped of duplicated markers.
  • snps (dict) – the position of the duplicated markers in the tped.
  • trueCompletion (numpy.array) – the completion of each marker.
  • trueConcordance (dict) – the pairwise concordance of the markers.
  • prefix (str) – the prefix of the output files.
Returns:

a tuple containing the chosen indexes (dict) as the first element, the completion (numpy.array) as the second element, and the concordance (dict) as last element.

It creates two output files: prefix.chosen_snps.info and prefix.not_chosen_snps.info. The first one contains the markers that were chosen for completion, and the second one, the markers that weren’t.

It starts by computing the completion of each marker (the number of calls divided by the total number of genotypes). Then, for each set of duplicated markers, it chooses the best one according to completion and concordance (see the explanation in DupSamples.duplicated_samples.chooseBestDuplicates() for more details).

pyGenClean.DupSNPs.duplicated_snps.computeFrequency(prefix, outPrefix)

Computes the frequency of the SNPs using Plink.

Parameters:
  • prefix (str) – the prefix of the input files.
  • outPrefix (str) – the prefix of the output files.
Returns:

a dict containing the frequency of each marker.

It starts by computing the frequency of all markers using Plink. Then, it reads the output file and saves the frequency and allele information.
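
The reading step can be illustrated with a minimal Python 3 sketch. It assumes the standard column layout of a Plink frequency report (CHR, SNP, A1, A2, MAF, NCHROBS); the function name parse_frq and the shape of the returned values are hypothetical and are not taken from computeFrequency() itself.

```python
def parse_frq(lines):
    """Parse the lines of a Plink frequency report into a dict.

    Assumption: the first non-empty line is a header naming the columns
    SNP, A1, A2 and MAF. The value stored for each marker is a
    (maf, (a1, a2)) tuple, chosen here for illustration only.
    """
    frequencies = {}
    header = None
    for line in lines:
        row = line.split()
        if not row:
            continue
        if header is None:
            # Map each column name to its index.
            header = {name: i for i, name in enumerate(row)}
            continue
        maf = row[header["MAF"]]
        frequencies[row[header["SNP"]]] = (
            None if maf == "NA" else float(maf),
            (row[header["A1"]], row[header["A2"]]),
        )
    return frequencies


report = [
    " CHR  SNP   A1   A2          MAF  NCHROBS",
    "   1  rs1    A    G         0.25      200",
    "   1  rs2    0    C           NA        0",
]
frequencies = parse_frq(report)
```

A marker without any called genotype is reported with a frequency of NA, which the sketch stores as None.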

pyGenClean.DupSNPs.duplicated_snps.computeStatistics(tped, tfam, snps)

Computes the completion and concordance of each SNP.

Parameters:
  • tped (numpy.array) – a representation of the tped.
  • tfam (list) – a representation of the tfam.
  • snps (dict) – the position of the duplicated markers in the tped.
Returns:

a tuple containing the completion of duplicated markers (numpy.array) as first element, and the concordance (dict) of duplicated markers, as last element.

A marker’s completion is computed using this formula (where \(G_i\) is the set of genotypes for the marker \(i\)):

\[Completion_i = \frac{||g \in G_i \textrm{ where } g \neq 0||} {||G_i||}\]

The pairwise concordance between duplicated markers is computed as follows (where \(G_i\) and \(G_j\) are the sets of genotypes for markers \(i\) and \(j\), respectively):

\[Concordance_{i,j} = \frac{ ||g \in G_i \cup G_j \textrm{ where } g_i = g_j \neq 0|| }{ ||g \in G_i \cup G_j \textrm{ where } g \neq 0|| }\]

The function computes only the numerators and denominators of the completion and concordance, for later use.

Note

When the genotypes are not comparable, the function tries to flip one of the genotypes to see if it becomes comparable.
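
As a rough illustration of the two formulas, the following Python 3 sketch counts numerators and denominators on plain lists. It assumes that each genotype is a string such as "A G", that "0 0" means no call, and that only samples called for both markers enter the concordance; the function names are hypothetical, and the flipping step described in the note is left out.

```python
MISSING = "0 0"  # assumed encoding of a genotype that was not called


def completion_counts(genotypes):
    """Return (number of called genotypes, total number of genotypes)."""
    called = sum(1 for genotype in genotypes if genotype != MISSING)
    return called, len(genotypes)


def concordance_counts(first, second):
    """Return (number of identical calls, number of compared calls).

    A sample is compared only when both markers were called. Genotypes
    are compared as unordered sets of alleles, so "A G" equals "G A".
    """
    same = 0
    compared = 0
    for g1, g2 in zip(first, second):
        if g1 == MISSING or g2 == MISSING:
            continue
        compared += 1
        if set(g1.split()) == set(g2.split()):
            same += 1
    return same, compared


marker_1 = ["A A", "A G", "0 0", "G G"]
marker_2 = ["A A", "G A", "A G", "A G"]
completion = completion_counts(marker_1)
concordance = concordance_counts(marker_1, marker_2)
```

Keeping the counts instead of the ratios avoids a division by zero when no sample can be compared.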

pyGenClean.DupSNPs.duplicated_snps.createAndCleanTPED(tped, tfam, snps, prefix, chosenSNPs, completion, concordance, snpsToComplete, tfamFileName, completionT, concordanceT)

Complete a TPED for duplicated SNPs.

Parameters:
  • tped (numpy.array) – a representation of the tped of duplicated markers.
  • tfam (list) – a representation of the tfam.
  • snps (dict) – the position of duplicated markers in the tped.
  • prefix (str) – the prefix of the output files.
  • chosenSNPs (dict) – the markers that were chosen for completion (including problems).
  • completion (numpy.array) – the completion of each of the duplicated markers.
  • concordance (dict) – the pairwise concordance of the duplicated markers.
  • snpsToComplete (set) – the markers that will be completed (excluding problems).
  • tfamFileName (str) – the name of the original tfam file.
  • completionT (float) – the completion threshold.
  • concordanceT (float) – the concordance threshold.
Returns:

a tuple containing the new tped after completion (numpy.array) as the first element, and the indexes of the markers that will need to be removed (set) as the last element.

It creates three different files:

  • prefix.zeroed_out: contains information about markers and samples where the genotype was zeroed out.
  • prefix.not_good_enough: contains information about markers that were not good enough to help in completing the chosen markers (because of concordance or completion).
  • prefix.removed_duplicates: the list of markers that were used for completing the chosen one, hence they will be removed from the final data set.

Cycling through every genotype of every sample for every set of duplicated markers, it checks if the genotypes are all the same. If the chosen one was not called, but the other ones were, then the chosen one is completed with the genotype of the others (assuming that they are all the same). If there is a difference between the genotypes, the genotype is zeroed out for the chosen marker.
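
The completion rule for a single sample can be sketched as follows in Python 3. This is only a simplified reading of the paragraph above: genotypes are assumed to be strings such as "A G" with "0 0" for a missing call, the name complete_genotype is hypothetical, and the completion and concordance thresholds as well as strand flipping are ignored.

```python
MISSING = "0 0"  # assumed encoding of a genotype that was not called


def complete_genotype(chosen, others):
    """Return the genotype to keep for the chosen marker in one sample.

    chosen -- genotype of the chosen marker
    others -- genotypes of the other duplicated markers, same sample
    """
    called = [g for g in [chosen, *others] if g != MISSING]
    # Compare genotypes as unordered sets of alleles ("A G" == "G A").
    if len({frozenset(g.split()) for g in called}) > 1:
        return MISSING  # disagreement: zero out the chosen marker
    if chosen == MISSING and called:
        return called[0]  # complete with the genotype of the others
    return chosen


completed = complete_genotype("0 0", ["A G", "G A"])
zeroed = complete_genotype("A A", ["A G"])
unchanged = complete_genotype("A A", ["0 0", "A A"])
```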

pyGenClean.DupSNPs.duplicated_snps.createFinalTPEDandTFAM(tped, toReadPrefix, prefix, snpToRemove)

Creates the final TPED and TFAM.

Parameters:
  • tped (numpy.array) – a representation of the tped of duplicated markers.
  • toReadPrefix (str) – the prefix of the unique files.
  • prefix (str) – the prefix of the output files.
  • snpToRemove (set) – the markers to remove.

Starts by copying the unique markers’ tfam file to prefix.final.tfam. Then, it copies the unique markers’ tped file, in which the chosen markers will be appended.

The final data set will include the unique markers, the chosen markers which were completed, and the problematic duplicated markers (for further analysis). The markers that were used to complete the chosen ones are not present in the final data set.

pyGenClean.DupSNPs.duplicated_snps.findUniques(mapF)

Finds the unique markers in a MAP.

Parameters: mapF (list) – representation of a map file.
Returns: a dict containing unique markers (according to their genomic localisation).
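
Grouping markers by genomic localisation can be sketched as below (Python 3). The sketch assumes that each row of the map representation is a (chromosome, name, genetic distance, position) tuple, following the Plink MAP column order; the function name and the shape of the returned dicts are hypothetical and may differ from what findUniques() returns.

```python
from collections import defaultdict


def split_unique_and_duplicated(map_rows):
    """Split markers according to their (chromosome, position) pair.

    Returns two dicts keyed by (chromosome, position): the first maps
    to the name of a marker that is alone at its position, the second
    to the list of names sharing a position.
    """
    groups = defaultdict(list)
    for chrom, name, _genetic_distance, position in map_rows:
        groups[(chrom, position)].append(name)
    unique = {
        key: names[0] for key, names in groups.items() if len(names) == 1
    }
    duplicated = {
        key: names for key, names in groups.items() if len(names) > 1
    }
    return unique, duplicated


rows = [
    ("1", "rs1", "0", "100"),
    ("1", "rs2", "0", "100"),
    ("2", "rs3", "0", "100"),
]
unique, duplicated = split_unique_and_duplicated(rows)
```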

pyGenClean.DupSNPs.duplicated_snps.flipGenotype(genotype)

Flips a genotype.

Parameters: genotype (set) – the genotype to flip.
Returns: the new flipped genotype (as a set).

>>> flipGenotype({"A", "T"})
set(['A', 'T'])
>>> flipGenotype({"C", "T"})
set(['A', 'G'])
>>> flipGenotype({"T", "G"})
set(['A', 'C'])
>>> flipGenotype({"0", "0"})
Traceback (most recent call last):
    ...
ProgramError: 0: unkown allele
>>> flipGenotype({"A", "N"})
Traceback (most recent call last):
    ...
ProgramError: N: unkown allele
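
The behaviour shown in the doctests above can be reproduced with a small stand-alone Python 3 sketch. It replaces each allele by its complement; the name flip_genotype is hypothetical, and ValueError is used here in place of the module's ProgramError so that the block is self-contained.

```python
# Complement of each nucleotide on the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip_genotype(genotype):
    """Return the set of complementary alleles of a genotype (a set)."""
    flipped = set()
    for allele in genotype:
        if allele not in COMPLEMENT:
            # The real function raises ProgramError instead.
            raise ValueError("{}: unknown allele".format(allele))
        flipped.add(COMPLEMENT[allele])
    return flipped


flipped = flip_genotype({"C", "T"})
```

Note that an A/T or C/G genotype is its own complement, so flipping it returns the same set.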

pyGenClean.DupSNPs.duplicated_snps.getIndexOfHeteroMen(genotypes, menIndex)

Get the indexes of heterozygous men.

Parameters:
  • genotypes (numpy.array) – the genotypes of everybody.
  • menIndex (numpy.array) – the indexes of the men (for the genotypes).
Returns:

a numpy.array containing the indexes of the genotypes to remove.

Finds the men that have a heterozygous genotype for the current marker. Usually used on sex chromosomes.
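
A minimal Python 3 sketch of this check, using plain lists instead of numpy arrays. It assumes that each genotype is a string such as "A G"; the function name is hypothetical.

```python
def hetero_men_indexes(genotypes, men_index):
    """Return the indexes of men whose genotype has two different alleles.

    genotypes -- the genotypes of everybody for one marker
    men_index -- the indexes (in genotypes) of the men
    """
    return [
        i for i in men_index
        if len(set(genotypes[i].split())) > 1
    ]


genotypes = ["A G", "A A", "0 0", "G A", "A G"]
to_remove = hetero_men_indexes(genotypes, men_index=[0, 1, 2, 3])
```

The last sample is heterozygous but is not listed in men_index, so it is not reported.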

pyGenClean.DupSNPs.duplicated_snps.main(argString=None)

The main function of the module.

Here are the steps for duplicated markers:

  1. Prints the options.
  2. Reads the map file to gather marker’s position (readMAP()).
  3. Reads the tfam file (readTFAM()).
  4. Finds the unique markers using the map file (findUniques()).
  5. Processes the tped file. Writes the unique markers to prefix.unique_snps (tfam and tped). Keeps in memory information about the duplicated markers (tped) (processTPED()).
  6. If there are no duplicated markers, the prefix.unique_snps files (tped and tfam) are copied to prefix.final.
  7. If there are duplicated markers, prints a tped and tfam file containing the duplicated markers (printDuplicatedTPEDandTFAM()).
  8. Computes the frequency of the duplicated markers (using Plink) and reads the output file (computeFrequency()).
  9. Computes the completion and pairwise concordance of each of the duplicated markers (computeStatistics()).
  10. Prints the problematic duplicated markers with a file containing the summary of the statistics (completion and pairwise concordance) (printProblems()).
  11. Prints the pairwise concordance in a file (matrices) (printConcordance()).
  12. Chooses the best duplicated markers using concordance and completion (chooseBestSnps()).
  13. Completes the chosen markers with the remaining duplicated markers (createAndCleanTPED()).
  14. Creates the final tped file, containing the unique markers, the chosen duplicated markers that were completed, and the problematic duplicated markers (for further analysis). This set excludes markers that were used for completing the chosen ones (createFinalTPEDandTFAM()).

pyGenClean.DupSNPs.duplicated_snps.parseArgs(argString=None)

Parses the command line options and arguments.

Parameters: argString (list) – the options.
Returns: An argparse.Namespace object created by the argparse module. It contains the values of the different options.

Options                      Type    Description
--tfile                      string  The input file prefix (Plink tfile).
--snp-completion-threshold   float   The completion threshold to consider a replicate when choosing the best replicates.
--snp-concordance-threshold  float   The concordance threshold to consider a replicate when choosing the best replicates.
--frequency_difference       float   The maximum difference in frequency between duplicated markers.
--out                        string  The prefix of the output files.

Note

No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see checkArgs()).
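
The options listed above could be declared with argparse roughly as follows (Python 3). This is a sketch only: the function name build_parser and the help strings are illustrative, and no default values are set because this reference does not list them.

```python
import argparse


def build_parser():
    """Build a parser accepting the options listed in the table."""
    parser = argparse.ArgumentParser(
        description="Illustrative parser for the duplicated markers options.",
    )
    parser.add_argument("--tfile", type=str,
                        help="The input file prefix (Plink tfile).")
    parser.add_argument("--snp-completion-threshold", type=float,
                        help="The completion threshold.")
    parser.add_argument("--snp-concordance-threshold", type=float,
                        help="The concordance threshold.")
    parser.add_argument("--frequency_difference", type=float,
                        help="The maximum difference in frequency.")
    parser.add_argument("--out", type=str,
                        help="The prefix of the output files.")
    return parser


# argparse turns the hyphens of an option name into underscores.
args = build_parser().parse_args(
    ["--tfile", "data", "--snp-completion-threshold", "0.9", "--out", "clean"]
)
```

As in parseArgs(), argparse only enforces the declared types here; checking that the values make sense is left to a separate step.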

pyGenClean.DupSNPs.duplicated_snps.printConcordance(concordance, prefix, tped, snps)

Print the concordance.

Parameters:
  • concordance (dict) – the concordance.
  • prefix (str) – the prefix of the output files.
  • tped (numpy.array) – a representation of the tped of duplicated markers.
  • snps (dict) – the position of the duplicated markers in the tped.

Prints the concordance in a file, in the format of a matrix. For each set of duplicated markers, the first line (starting with the # sign) contains the names of all the markers in the set. Then a \(N \times N\) matrix is printed to the file (where \(N\) is the number of markers in the set), containing the pairwise concordance.
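
The layout of one such block can be sketched in Python 3 as follows. The tab separator and the number of decimals are assumptions made for the example, and the function name is hypothetical; the actual file may be formatted differently.

```python
def format_concordance(names, matrix):
    """Format one set of duplicated markers as a header plus a matrix.

    names  -- the N marker names of the set
    matrix -- an N x N nested list of pairwise concordance values
    """
    lines = ["#" + "\t".join(names)]
    for row in matrix:
        lines.append("\t".join("{:.6f}".format(value) for value in row))
    return "\n".join(lines) + "\n"


block = format_concordance(
    ["rs1", "rs2"],
    [[1.0, 0.5], [0.5, 1.0]],
)
```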

pyGenClean.DupSNPs.duplicated_snps.printDuplicatedTPEDandTFAM(tped, tfamFileName, outPrefix)

Print the duplicated SNPs TPED and TFAM.

Parameters:
  • tped (numpy.array) – a representation of the tped of duplicated markers.
  • tfamFileName (str) – the name of the original tfam file.
  • outPrefix (str) – the output prefix.

First, it copies the original tfam into outPrefix.duplicated_snps.tfam. Then, it prints the tped of duplicated markers in outPrefix.duplicated_snps.tped.

pyGenClean.DupSNPs.duplicated_snps.printProblems(completion, concordance, tped, snps, frequencies, prefix, diffFreq)

Print the statistics.

Parameters:
  • completion (numpy.array) – the completion of each duplicated marker.
  • concordance (dict) – the pairwise concordance between duplicated markers.
  • tped (numpy.array) – a representation of the tped of duplicated markers.
  • snps (dict) – the positions of the duplicated markers in the tped.
  • frequencies (dict) – the frequency of each of the duplicated markers.
  • prefix (str) – the prefix of the output files.
  • diffFreq (float) – the frequency difference threshold.
Returns:

a set containing duplicated markers to complete.

Creates a summary file (prefix.summary) containing information about duplicated markers: chromosome, position, name, alleles, status, completion percentage, completion number and mean concordance.

The frequency and the minor allele are used to be certain that two duplicated markers are exactly the same marker (and not a tri-allelic one, for example).

For each pair of duplicated markers:

  1. Constructs the set of available alleles for the first marker.
  2. Constructs the set of available alleles for the second marker.
  3. If the two sets are different, but the number of alleles is the same, we try to flip one of the markers. If the two sets are then the same and the number of alleles is 1, we set the status to homo_flip. If the markers are heterozygous, we set the status to flip.
  4. If there is a difference in the number of alleles (one is homozygous, the other heterozygous) and there is one allele in common, we set the status to homo_hetero. If there is no allele in common, we try to flip one. If the new sets have one allele in common, we set the status to homo_hetero_flip.
  5. If the sets of available alleles are the same (without flip), we check the frequency and the minor alleles. If the minor allele is different, we set the status to diff_minor_allele. If the difference in frequencies is higher than a threshold, we set the status to diff_frequency.
  6. If all of the above fail, we set the status to problem.

Problems are written to the prefix.problems file, which contains the following columns: chromosome, position, name and status. This file contains all the markers with a status, as explained above.
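
One possible reading of the status rules listed above, written as a Python 3 sketch. It assumes that allele sets contain only A, C, G and T (missing alleles already removed), and the function names are hypothetical. The order of the checks is an interpretation of the numbered steps, not a copy of printProblems().

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip(alleles):
    """Return the complementary allele set."""
    return {COMPLEMENT[allele] for allele in alleles}


def allele_status(first, second):
    """Return a status from the allele sets, or None if they are equal."""
    if first == second:
        return None  # same alleles: frequencies are checked separately
    if len(first) == len(second):
        if flip(first) == second:
            return "homo_flip" if len(first) == 1 else "flip"
        return "problem"
    # One marker is homozygous, the other heterozygous.
    if len(first & second) == 1:
        return "homo_hetero"
    if len(flip(first) & second) == 1:
        return "homo_hetero_flip"
    return "problem"


def frequency_status(minor_1, freq_1, minor_2, freq_2, diff_freq):
    """Return a status from the minor alleles and their frequencies."""
    if minor_1 != minor_2:
        return "diff_minor_allele"
    if abs(freq_1 - freq_2) > diff_freq:
        return "diff_frequency"
    return None


status = allele_status({"T"}, {"A", "G"})
```

In the example, the homozygous T marker shares no allele with the A/G marker until it is flipped to A, hence the homo_hetero_flip status.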

pyGenClean.DupSNPs.duplicated_snps.processTPED(uniqueSNPs, mapF, fileName, tfam, prefix)

Process the TPED file.

Parameters:
  • uniqueSNPs (dict) – the unique markers.
  • mapF (list) – a representation of the map file.
  • fileName (str) – the name of the tped file.
  • tfam (str) – the name of the tfam file.
  • prefix (str) – the prefix of all the files.
Returns:

a tuple with the representation of the tped file (numpy.array) as first element, and the updated position of the duplicated markers in the tped representation.

Copies the tfam file into prefix.unique_snps.tfam. While reading the tped file, creates a new one (prefix.unique_snps.tped) containing only unique markers.

pyGenClean.DupSNPs.duplicated_snps.readMAP(fileName, prefix)

Reads the MAP file.

Parameters: fileName (str) – the name of the map file.
Returns: a list of tuples, representing the map file.

While reading the map file, it saves a file (prefix.duplicated_marker_names) containing the name of the unique duplicated markers.

pyGenClean.DupSNPs.duplicated_snps.readTFAM(fileName)

Reads the TFAM file.

Parameters: fileName (str) – the name of the tfam file.
Returns: a representation of the tfam file (numpy.array).

pyGenClean.DupSNPs.duplicated_snps.runCommand(command)

Run the command in Plink.

Parameters:command (list) – the command to run.

Tries to run a command using subprocess.
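
Running an external command given as a list can be sketched with the standard subprocess module (Python 3). The function name run_command and the use of RuntimeError are choices made for this example; the module's own function presumably reports failures through ProgramError.

```python
import subprocess
import sys


def run_command(command):
    """Run a command (list of str) and return its combined output."""
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            check=True,  # raise if the exit code is not zero
        )
    except (OSError, subprocess.CalledProcessError) as err:
        raise RuntimeError("could not run: " + " ".join(command)) from err
    return completed.stdout.decode(errors="replace")


# The Python interpreter stands in for Plink, which may not be installed.
output = run_command([sys.executable, "-c", "print('ok')"])
```

Passing the command as a list avoids shell quoting issues with file prefixes that contain spaces.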

pyGenClean.DupSNPs.duplicated_snps.safe_main()

A safe version of the main function (that catches ProgramError).
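
The pattern of a safe wrapper can be sketched as follows in Python 3. Both the ProgramError stand-in and the safe_call helper are illustrative: they show how an error raised deep in the pipeline can be turned into a message and an exit code of 1, without claiming to match the body of safe_main().

```python
import sys


class ProgramError(Exception):
    """Illustrative stand-in for the module's exception class."""

    def __init__(self, msg):
        super().__init__(msg)
        self.message = str(msg)


def safe_call(func, report=None):
    """Run func(); on ProgramError, report the message and return 1."""
    if report is None:
        def report(text):
            print(text, file=sys.stderr)
    try:
        func()
    except ProgramError as err:
        report(err.message)
        return 1
    return 0


def failing_main():
    raise ProgramError("missing file")


messages = []
exit_code = safe_call(failing_main, report=messages.append)
```

A command-line entry point would typically pass the returned code to sys.exit().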