pyGenClean.DupSNPs package¶

For more information about how to use this module, refer to the Duplicated Markers Module.

Module contents¶

Submodules¶

pyGenClean.DupSNPs.duplicated_snps module¶

exception pyGenClean.DupSNPs.duplicated_snps.ProgramError(msg)[source]¶

Bases: exceptions.Exception

An Exception raised in case of a problem.

Parameters:	msg (str) – the message to print to the user before exiting.

pyGenClean.DupSNPs.duplicated_snps.checkArgs(args)[source]¶

Checks the arguments and options.

Parameters:	args (argparse.Namespace) – an object containing the options of the program.
Returns:	`True` if everything was OK.

If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr and the program exits with code 1.

pyGenClean.DupSNPs.duplicated_snps.chooseBestSnps(tped, snps, trueCompletion, trueConcordance, prefix)[source]¶

Choose the best duplicates according to the completion and concordance.

Parameters:	tped (numpy.array) – a representation of the `tped` of duplicated markers. snps (dict) – the position of the duplicated markers in the `tped`. trueCompletion (numpy.array) – the completion of each marker. trueConcordance (dict) – the pairwise concordance of each marker. prefix (str) – the prefix of the output files.
Returns:	a tuple containing the chosen indexes (`dict`) as the first element, the completion (`numpy.array`) as the second element, and the concordance (`dict`) as last element.

It creates two output files: prefix.chosen_snps.info and prefix.not_chosen_snps.info. The first one contains the markers that were chosen for completion, and the second one, the markers that weren’t.

It starts by computing the completion of each marker (the number of calls divided by the total number of genotypes). Then, for each set of duplicated markers, the best one is chosen according to completion and concordance (see the explanation in DupSamples.duplicated_samples.chooseBestDuplicates() for more details).

pyGenClean.DupSNPs.duplicated_snps.computeFrequency(prefix, outPrefix)[source]¶

Computes the frequency of the SNPs using Plink.

Parameters:	prefix (str) – the prefix of the input files. outPrefix (str) – the prefix of the output files.
Returns:	a `dict` containing the frequency of each marker.

Starts by computing the frequency of all markers using Plink. Then, it reads the output file and saves the frequency and allele information.
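As an illustration of the second step, here is a minimal Python 3 sketch that parses the lines of a Plink 1.x `.frq` file (columns `CHR`, `SNP`, `A1`, `A2`, `MAF`, `NCHROBS`). The function name `read_frq` and the shape of the returned dictionary are assumptions made for this example; the module's own parser and return structure may differ.

```python
def read_frq(lines):
    """Parse Plink .frq lines into {marker: (maf, a1, a2)} (hypothetical layout)."""
    frequencies = {}
    header = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        if header is None:
            # First non-empty line: map each column name to its index.
            header = {name: i for i, name in enumerate(fields)}
            continue
        maf = fields[header["MAF"]]
        frequencies[fields[header["SNP"]]] = (
            None if maf == "NA" else float(maf),  # Plink writes NA when undefined
            fields[header["A1"]],
            fields[header["A2"]],
        )
    return frequencies


example = """ CHR  SNP   A1   A2          MAF  NCHROBS
   1  rs1    A    G       0.2500       40
   1  rs2    0    C       0.0000       40
"""
freqs = read_frq(example.splitlines())
```

Reading the column positions from the header keeps the sketch independent of the exact column order.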

pyGenClean.DupSNPs.duplicated_snps.computeStatistics(tped, tfam, snps)[source]¶

Computes the completion and concordance of each SNP.

Parameters:	tped (numpy.array) – a representation of the `tped`. tfam (list) – a representation of the `tfam`. snps (dict) – the position of the duplicated markers in the `tped`.
Returns:	a tuple containing the completion of duplicated markers (`numpy.array`) as first element, and the concordance (`dict`) of duplicated markers, as last element.

A marker’s completion is computed using this formula (where \(G_i\) is the set of genotypes for the marker \(i\)):

\[Completion_i = \frac{||g \in G_i \textrm{ where } g \neq 0||} {||G_i||}\]

The pairwise concordance between duplicated markers is computed as follows (where \(G_i\) and \(G_j\) are the sets of genotypes for markers \(i\) and \(j\), respectively):

\[Concordance_{i,j} = \frac{ ||g \in G_i \cup G_j \textrm{ where } g_i = g_j \neq 0|| }{ ||g \in G_i \cup G_j \textrm{ where } g \neq 0|| }\]

Hence, the function only computes the numerators and denominators of the completion and concordance, for future reference.

Note

When the genotypes are not comparable, the function tries to flip one of the genotypes to see if it becomes comparable.
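The two formulas can be sketched as follows in Python 3. This is only an illustration under stated assumptions: genotypes are taken to be strings such as `"A G"`, with `"0 0"` for a missing call, the concordance denominator is taken to be the number of samples called for both markers, and strand flipping is ignored. The function names are hypothetical and are not part of the module.

```python
import numpy as np


def completion_counts(genotypes, missing="0 0"):
    """Return (number of called genotypes, total number of genotypes)."""
    called = int(np.sum(genotypes != missing))
    return called, int(genotypes.size)


def concordance_counts(geno_i, geno_j, missing="0 0"):
    """Return (identical calls, samples called for both markers)."""
    both_called = (geno_i != missing) & (geno_j != missing)
    # Compare alleles as sets so that "A G" and "G A" are considered identical.
    same = np.array(
        [set(a.split()) == set(b.split()) for a, b in zip(geno_i, geno_j)],
        dtype=bool,
    )
    return int(np.sum(same & both_called)), int(np.sum(both_called))


marker_i = np.array(["A G", "A A", "0 0", "G G"])
marker_j = np.array(["G A", "A A", "A G", "A G"])
completion_i = completion_counts(marker_i)
concordance_ij = concordance_counts(marker_i, marker_j)
```

Keeping numerators and denominators separate, as the function description states, allows the ratios to be computed later without losing the underlying counts.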

pyGenClean.DupSNPs.duplicated_snps.createAndCleanTPED(tped, tfam, snps, prefix, chosenSNPs, completion, concordance, snpsToComplete, tfamFileName, completionT, concordanceT)[source]¶

Complete a TPED for duplicated SNPs.

Parameters:

tped (numpy.array) – a representation of the tped of duplicated markers.
tfam (list) – a representation of the tfam.
snps (dict) – the position of duplicated markers in the tped.
prefix (str) – the prefix of the output files.
chosenSNPs (dict) – the markers that were chosen for completion (including problems).
completion (numpy.array) – the completion of each of the duplicated markers.
concordance (dict) – the pairwise concordance of the duplicated markers.
snpsToComplete (set) – the markers that will be completed (excluding problems).
tfamFileName (str) – the name of the original tfam file.
completionT (float) – the completion threshold.
concordanceT (float) – the concordance threshold.

Returns:

a tuple containing the new tped after completion (numpy.array) as the first element, and the indexes of the markers that will need to be removed (set) as the last element.

It creates three different files:

prefix.zeroed_out: contains information about markers and samples where the genotype was zeroed out.
prefix.not_good_enough: contains information about markers that were not good enough to help in completing the chosen markers (because of concordance or completion).
prefix.removed_duplicates: the list of markers that were used for completing the chosen one, hence they will be removed from the final data set.

Cycling through every genotype of every sample for every set of duplicated markers, the function checks whether the genotypes are all the same. If the chosen marker was not called but the other ones were, the chosen one is completed with the genotype of the others (assuming that they are all the same). If there is a difference between the genotypes, the genotype is zeroed out for the chosen marker.
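The per-sample rule described above can be sketched in Python 3 as follows. The function `complete_genotype` is hypothetical, genotypes are assumed to be strings such as `"A G"` with `"0 0"` for a missing call, and the completion and concordance thresholds as well as strand flipping are left out.

```python
def complete_genotype(chosen, others, missing="0 0"):
    """Return the genotype to keep for the chosen marker, for one sample."""
    # Distinct called genotypes, ignoring allele order ("A G" equals "G A").
    called = {
        frozenset(g.split()) for g in [chosen] + list(others) if g != missing
    }
    if len(called) > 1:
        return missing  # discordant duplicates: zero out the genotype
    if chosen == missing and called:
        # The chosen marker was not called: borrow the call of a duplicate.
        return next(g for g in others if g != missing)
    return chosen


completed = complete_genotype("0 0", ["A G", "G A"])
zeroed = complete_genotype("A A", ["A G"])
unchanged = complete_genotype("A A", ["0 0", "A A"])
```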

pyGenClean.DupSNPs.duplicated_snps.createFinalTPEDandTFAM(tped, toReadPrefix, prefix, snpToRemove)[source]¶

Creates the final TPED and TFAM.

Parameters:	tped (numpy.array) – a representation of the `tped` of duplicated markers. toReadPrefix (str) – the prefix of the unique files. prefix (str) – the prefix of the output files. snpToRemove (set) – the markers to remove.

Starts by copying the unique markers’ tfam file to prefix.final.tfam. Then, it copies the unique markers’ tped file, to which the chosen markers are appended.

The final data set will include the unique markers, the chosen markers which were completed, and the problematic duplicated markers (for further analysis). The markers that were used to complete the chosen ones are not present in the final data set.

pyGenClean.DupSNPs.duplicated_snps.findUniques(mapF)[source]¶

Finds the unique markers in a MAP.

Parameters:	mapF (list) – representation of a `map` file.
Returns:	a `dict` containing unique markers (according to their genomic localisation).
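A minimal Python 3 sketch of grouping markers by genomic location is shown below. It assumes that each row of the `map` representation is a `(chromosome, name, position)` tuple, which is a hypothetical layout; the function name and the returned structures are also assumptions and may not match what findUniques() returns.

```python
from collections import defaultdict


def group_by_position(map_rows):
    """Split markers into unique and duplicated ones by (chromosome, position)."""
    positions = defaultdict(list)
    for index, (chrom, _name, pos) in enumerate(map_rows):
        positions[(chrom, pos)].append(index)
    # A location seen once is unique; a location seen several times is duplicated.
    unique = {key: idx[0] for key, idx in positions.items() if len(idx) == 1}
    duplicated = {key: idx for key, idx in positions.items() if len(idx) > 1}
    return unique, duplicated


rows = [
    ("1", "rs1", 100),
    ("1", "rs2", 200),
    ("1", "rs2_dup", 200),
    ("2", "rs3", 100),
]
unique, duplicated = group_by_position(rows)
```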

pyGenClean.DupSNPs.duplicated_snps.flipGenotype(genotype)[source]¶

Flips a genotype.

Parameters:	genotype (set) – the genotype to flip.
Returns:	the new flipped genotype (as a `set`)

>>> flipGenotype({"A", "T"})
set(['A', 'T'])
>>> flipGenotype({"C", "T"})
set(['A', 'G'])
>>> flipGenotype({"T", "G"})
set(['A', 'C'])
>>> flipGenotype({"0", "0"})
Traceback (most recent call last):
    ...
ProgramError: 0: unkown allele
>>> flipGenotype({"A", "N"})
Traceback (most recent call last):
    ...
ProgramError: N: unkown allele

pyGenClean.DupSNPs.duplicated_snps.getIndexOfHeteroMen(genotypes, menIndex)[source]¶

Get the indexes of heterozygous men.

Parameters:	genotypes (numpy.array) – the genotypes of everybody. menIndex (numpy.array) – the indexes of the men (for the genotypes).
Returns:	a `numpy.array` containing the indexes of the genotypes to remove.

Finds the men who have a heterozygous genotype for the current marker. Usually used on sex chromosomes.
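The idea can be sketched with NumPy as follows (Python 3). The genotype encoding (strings such as `"A G"`, with `"0 0"` for a missing call) and the function name are assumptions made for this example.

```python
import numpy as np


def index_of_hetero_men(genotypes, men_index):
    """Return the indexes (into genotypes) of the men with two different alleles."""
    men_genotypes = genotypes[men_index]
    # A genotype is heterozygous when its two alleles differ.
    is_hetero = np.array(
        [len(set(g.split())) > 1 for g in men_genotypes], dtype=bool
    )
    return men_index[is_hetero]


genotypes = np.array(["A G", "A A", "G A", "0 0", "A G"])
men_index = np.array([0, 1, 3])  # samples 2 and 4 are not men in this example
to_remove = index_of_hetero_men(genotypes, men_index)
```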

pyGenClean.DupSNPs.duplicated_snps.main(argString=None)[source]¶

The main function of the module.

Here are the steps for duplicated markers:

Prints the options.
Reads the map file to gather marker’s position (readMAP()).
Reads the tfam file (readTFAM()).
Finds the unique markers using the map file (findUniques()).
Processes the tped file. Writes a file containing unique markers in prefix.unique_snps (tfam and tped). Keeps in memory information about the duplicated markers (tped) (processTPED()).
If there are no duplicated markers, the files prefix.unique_snps (tped and tfam) are copied to prefix.final.
If there are duplicated markers, prints a tped and tfam file containing the duplicated markers (printDuplicatedTPEDandTFAM()).
Computes the frequency of the duplicated markers (using Plink) and reads the output file (computeFrequency()).
Computes the completion and pairwise concordance of each of the duplicated markers (computeStatistics()).
Prints the problematic duplicated markers with a file containing the summary of the statistics (completion and pairwise concordance) (printProblems()).
Prints the pairwise concordance in a file (matrices) (printConcordance()).
Chooses the best duplicated markers using concordance and completion (chooseBestSnps()).
Completes the chosen markers with the remaining duplicated markers (createAndCleanTPED()).
Creates the final tped file, containing the unique markers, the chosen duplicated markers that were completed, and the problematic duplicated markers (for further analysis). This set excludes markers that were used for completing the chosen ones (createFinalTPEDandTFAM()).

pyGenClean.DupSNPs.duplicated_snps.parseArgs(argString=None)[source]¶

Parses the command line options and arguments.

Parameters:	argString (list) – the options.
Returns:	An `argparse.Namespace` object created by the `argparse` module. It contains the values of the different options.

Options	Type	Description
`--tfile`	string	The input file prefix (Plink tfile).
`--snp-completion-threshold`	float	The completion threshold to consider a replicate when choosing the best replicates.
`--snp-concordance-threshold`	float	The concordance threshold to consider a replicate when choosing the best replicates.
`--frequency_difference`	float	The maximum difference in frequency between duplicated markers.
`--out`	string	The prefix of the output files.

Note

No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see checkArgs()).
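For illustration, the options in the table above could be declared with `argparse` roughly as follows (Python 3). The option names and types come from the table; the default values and the `required` flag used here are placeholders chosen for this sketch and are not confirmed by this page.

```python
import argparse


def build_parser():
    """Build a parser exposing the options listed in the table (sketch)."""
    parser = argparse.ArgumentParser(description="Duplicated markers (sketch).")
    parser.add_argument("--tfile", type=str, required=True,
                        help="The input file prefix (Plink tfile).")
    # The defaults below are placeholders, not the module's documented values.
    parser.add_argument("--snp-completion-threshold", type=float, default=0.9,
                        help="The completion threshold.")
    parser.add_argument("--snp-concordance-threshold", type=float, default=0.98,
                        help="The concordance threshold.")
    parser.add_argument("--frequency_difference", type=float, default=0.05,
                        help="The maximum difference in frequency.")
    parser.add_argument("--out", type=str, default="dup_snps",
                        help="The prefix of the output files.")
    return parser


args = build_parser().parse_args(
    ["--tfile", "my_data", "--snp-completion-threshold", "0.95"]
)
```

Note that `argparse` replaces the hyphens of `--snp-completion-threshold` with underscores, so the value is read as `args.snp_completion_threshold`.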

pyGenClean.DupSNPs.duplicated_snps.printConcordance(concordance, prefix, tped, snps)[source]¶

Print the concordance.

Parameters:	concordance (dict) – the concordance. prefix (str) – the prefix of the output files. tped (numpy.array) – a representation of the `tped` of duplicated markers. snps (dict) – the position of the duplicated markers in the `tped`.

Prints the concordance in a file, in the format of a matrix. For each set of duplicated markers, the first line (starting with the # sign) contains the names of all the markers in the set. Then, an \(N \times N\) matrix is printed to the file (where \(N\) is the number of markers in the set), containing the pairwise concordance.
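A minimal Python 3 sketch of such a layout is given below. The tab separator, the four-decimal formatting and the function name `write_concordance` are assumptions; only the `#` header line followed by an \(N \times N\) matrix comes from the description above.

```python
import io

import numpy as np


def write_concordance(out, marker_names, matrix):
    """Write one header line starting with '#', then the N x N matrix."""
    out.write("#" + "\t".join(marker_names) + "\n")
    for row in matrix:
        out.write("\t".join("{:.4f}".format(value) for value in row) + "\n")


buffer = io.StringIO()
write_concordance(
    buffer,
    ["rs1", "rs1_dup"],
    np.array([[1.0, 0.98], [0.98, 1.0]]),
)
text = buffer.getvalue()
```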

pyGenClean.DupSNPs.duplicated_snps.printDuplicatedTPEDandTFAM(tped, tfamFileName, outPrefix)[source]¶

Print the duplicated SNPs TPED and TFAM.

Parameters:	tped (numpy.array) – a representation of the `tped` of duplicated markers. tfamFileName (str) – the name of the original `tfam` file. outPrefix (str) – the output prefix.

First, it copies the original tfam into outPrefix.duplicated_snps.tfam. Then, it prints the tped of duplicated markers in outPrefix.duplicated_snps.tped.

pyGenClean.DupSNPs.duplicated_snps.printProblems(completion, concordance, tped, snps, frequencies, prefix, diffFreq)[source]¶

Print the statistics.

Parameters:

completion (numpy.array) – the completion of each duplicated marker.
concordance (dict) – the pairwise concordance between duplicated markers.
tped (numpy.array) – a representation of the tped of duplicated markers.
snps (dict) – the positions of the duplicated markers in the tped.
frequencies (dict) – the frequency of each of the duplicated markers.
prefix (str) – the prefix of the output files.
diffFreq (float) – the frequency difference threshold.

Returns:

a set containing duplicated markers to complete.

Creates a summary file (prefix.summary) containing information about duplicated markers: chromosome, position, name, alleles, status, completion percentage, completion number and mean concordance.

The frequency and the minor allele are used to be certain that two duplicated markers are exactly the same marker (and not a tri-allelic one, for example).

For each pair of duplicated markers:

Constructs the set of available alleles for the first marker.
Constructs the set of available alleles for the second marker.
If the two sets are different, but the number of alleles is the same, we try to flip one of the markers. If the two sets are then the same, but the number of alleles is 1, we set the status to homo_flip. If the markers are heterozygous, we set the status to flip.
If there is a difference in the number of alleles (one is homozygous, the other heterozygous) and there is one allele in common, we set the status to homo_hetero. If there is no allele in common, we try to flip one. If the new sets have one allele in common, we set the status to homo_hetero_flip.
If the sets of available alleles are the same (without flip), we check the frequency and the minor alleles. If the minor allele is different, we set the status to diff_minor_allele. If the difference in frequencies is higher than a threshold, we set the status to diff_frequency.
If all of the above fail, we set the status to problem.

Problems are written in the prefix.problems file, which contains the following columns: chromosome, position, name and status. This file contains all the markers with a status, as explained above.
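The allele comparison described above can be sketched as follows in Python 3. This is one reading of the rules, written under stated assumptions: allele sets contain only `A`, `C`, `G` and `T` (anything else raises a `KeyError` here), the frequency and minor allele checks are left out, and the names `flip` and `allele_status` are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip(alleles):
    """Return the complementary-strand version of a set of alleles."""
    return {COMPLEMENT[allele] for allele in alleles}


def allele_status(alleles_1, alleles_2):
    """Return a status, or None when the sets agree without flipping."""
    if alleles_1 == alleles_2:
        return None  # the frequency and minor allele checks would follow here
    if len(alleles_1) == len(alleles_2):
        # Same number of alleles: see whether flipping one marker reconciles them.
        if flip(alleles_1) == alleles_2:
            return "homo_flip" if len(alleles_1) == 1 else "flip"
        return "problem"
    # One marker is homozygous, the other heterozygous.
    if alleles_1 & alleles_2:
        return "homo_hetero"
    if flip(alleles_1) & alleles_2:
        return "homo_hetero_flip"
    return "problem"


statuses = [
    allele_status({"A", "G"}, {"T", "C"}),
    allele_status({"A"}, {"T"}),
    allele_status({"A"}, {"A", "G"}),
    allele_status({"A"}, {"T", "C"}),
    allele_status({"A"}, {"C"}),
]
```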

pyGenClean.DupSNPs.duplicated_snps.processTPED(uniqueSNPs, mapF, fileName, tfam, prefix)[source]¶

Process the TPED file.

Parameters:	uniqueSNPs (dict) – the unique markers. mapF (list) – a representation of the `map` file. fileName (str) – the name of the `tped` file. tfam (str) – the name of the `tfam` file. prefix (str) – the prefix of all the files.
Returns:	a tuple with the representation of the `tped` file (`numpy.array`) as first element, and the updated position of the duplicated markers in the `tped` representation.

Copies the tfam file into prefix.unique_snps.tfam. While reading the tped file, creates a new one (prefix.unique_snps.tped) containing only unique markers.

pyGenClean.DupSNPs.duplicated_snps.readMAP(fileName, prefix)[source]¶

Reads the MAP file.

Parameters:	fileName (str) – the name of the `map` file. prefix (str) – the prefix of the output files.
Returns:	a list of tuples, representing the `map` file.

While reading the map file, it saves a file (prefix.duplicated_marker_names) containing the name of the unique duplicated markers.

pyGenClean.DupSNPs.duplicated_snps.readTFAM(fileName)[source]¶

Reads the TFAM file.

Parameters:	fileName (str) – the name of the `tfam` file.
Returns:	a representation of the `tfam` file (`numpy.array`).

pyGenClean.DupSNPs.duplicated_snps.runCommand(command)[source]¶

Run the command in Plink.

Parameters:	command (list) – the command to run.

Tries to run a command using subprocess.
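A minimal Python 3 sketch of running an external command through `subprocess` is shown below. It is not the module's implementation: the name `run_command` is hypothetical, a `RuntimeError` stands in for the module's ProgramError, and the example runs the Python interpreter rather than Plink so that it works without Plink installed.

```python
import subprocess
import sys


def run_command(command):
    """Run a command given as a list of strings and return its output."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge the error stream into the output
            check=True,                # raise if the exit code is not zero
        )
    except (OSError, subprocess.CalledProcessError) as error:
        # OSError covers a missing executable; the module raises ProgramError.
        raise RuntimeError("could not run: " + " ".join(command)) from error
    return result.stdout.decode()


output = run_command([sys.executable, "-c", "print('ok')"])
```

Passing the command as a list avoids shell quoting issues with file prefixes that contain spaces.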

pyGenClean.DupSNPs.duplicated_snps.safe_main()[source]¶: A safe version of the main function (that catches ProgramError).
