pyGenClean.DupSNPs package
For more information about how to use this module, refer to the Duplicated Markers Module.
Module contents
Submodules
pyGenClean.DupSNPs.duplicated_snps module
-
exception pyGenClean.DupSNPs.duplicated_snps.ProgramError(msg)
Bases: exceptions.Exception
An Exception raised in case of a problem.
Parameters: msg (str) – the message to print to the user before exiting.
-
pyGenClean.DupSNPs.duplicated_snps.checkArgs(args)
Checks the arguments and options.
Parameters: args (argparse.Namespace) – an object containing the options of the program.
Returns: True if everything was OK.
If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr, and the program exits with code 1.
-
pyGenClean.DupSNPs.duplicated_snps.chooseBestSnps(tped, snps, trueCompletion, trueConcordance, prefix)
Chooses the best duplicates according to the completion and concordance.
Parameters:
- tped (numpy.array) – a representation of the tped of duplicated markers.
- snps (dict) – the position of the duplicated markers in the tped.
- trueCompletion (numpy.array) – the completion of each marker.
- trueConcordance (dict) – the pairwise concordance of each marker.
- prefix (str) – the prefix of the output files.
Returns: a tuple containing the chosen indexes (dict) as the first element, the completion (numpy.array) as the second element, and the concordance (dict) as the last element.
It creates two output files: prefix.chosen_snps.info and prefix.not_chosen_snps.info. The first one contains the markers that were chosen for completion, and the second one, the markers that weren’t.
It starts by computing the completion of each marker (the number of calls divided by the total number of genotypes). Then, for each set of duplicated markers, the best one is chosen according to completion and concordance (see the explanation in DupSamples.duplicated_samples.chooseBestDuplicates() for more details).
-
pyGenClean.DupSNPs.duplicated_snps.computeFrequency(prefix, outPrefix)
Computes the frequency of the SNPs using Plink.
Returns: a dict containing the frequency of each marker.
It starts by computing the frequency of all markers using Plink. Then, it reads the output file and saves the frequency and allele information.
-
pyGenClean.DupSNPs.duplicated_snps.computeStatistics(tped, tfam, snps)
Computes the completion and concordance of each SNP.
Parameters:
- tped (numpy.array) – a representation of the tped.
- tfam (list) – a representation of the tfam.
- snps (dict) – the position of the duplicated markers in the tped.
Returns: a tuple containing the completion of duplicated markers (numpy.array) as the first element, and the concordance (dict) of duplicated markers as the last element.
A marker’s completion is computed using this formula (where \(G_i\) is the set of genotypes for the marker \(i\)):
\[Completion_i = \frac{||g \in G_i \textrm{ where } g \neq 0||} {||G_i||}\]
The pairwise concordance between duplicated markers is computed as follows (where \(G_i\) and \(G_j\) are the sets of genotypes for markers \(i\) and \(j\), respectively):
\[Concordance_{i,j} = \frac{ ||g \in G_i \cup G_j \textrm{ where } g_i = g_j \neq 0|| }{ ||g \in G_i \cup G_j \textrm{ where } g \neq 0|| }\]
Hence, only the numerators and denominators of the completion and concordance are computed, for future reference.
Note
When the genotypes are not comparable, the function tries to flip one of the genotypes to see if they become comparable.
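The two formulas above can be sketched as follows. This is a minimal illustration and not the module's implementation: the function names, the "A G" string encoding of genotypes, and the "0 0" no-call value are assumptions, the concordance denominator is read as the number of samples called for both markers, and strand flipping is left out.

```python
import numpy as np

NO_CALL = "0 0"  # assumed encoding of a missing genotype


def completion_counts(genotypes):
    """Return (numerator, denominator) of the completion of one marker."""
    called = genotypes != NO_CALL
    return int(called.sum()), int(genotypes.size)


def concordance_counts(geno_i, geno_j):
    """Return (numerator, denominator) of the concordance of two markers."""
    # Only samples called for both markers are compared (assumption).
    both_called = (geno_i != NO_CALL) & (geno_j != NO_CALL)
    # Genotypes are compared as unordered allele sets, so "A G" == "G A".
    same = np.array(
        [set(a.split()) == set(b.split()) for a, b in zip(geno_i, geno_j)],
        dtype=bool,
    )
    return int((same & both_called).sum()), int(both_called.sum())


marker_i = np.array(["A G", "0 0", "G G", "A A"])
marker_j = np.array(["G A", "A A", "0 0", "A G"])
print(completion_counts(marker_i))
print(concordance_counts(marker_i, marker_j))
```

Keeping numerators and denominators separate, as the text describes, lets the ratios be computed or summed later without loss of precision.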
-
pyGenClean.DupSNPs.duplicated_snps.createAndCleanTPED(tped, tfam, snps, prefix, chosenSNPs, completion, concordance, snpsToComplete, tfamFileName, completionT, concordanceT)
Completes a TPED for duplicated SNPs.
Parameters:
- tped (numpy.array) – a representation of the tped of duplicated markers.
- tfam (list) – a representation of the tfam.
- snps (dict) – the position of duplicated markers in the tped.
- prefix (str) – the prefix of the output files.
- chosenSNPs (dict) – the markers that were chosen for completion (including problems).
- completion (numpy.array) – the completion of each of the duplicated markers.
- concordance (dict) – the pairwise concordance of the duplicated markers.
- snpsToComplete (set) – the markers that will be completed (excluding problems).
- tfamFileName (str) – the name of the original tfam file.
- completionT (float) – the completion threshold.
- concordanceT (float) – the concordance threshold.
Returns: a tuple containing the new tped after completion (numpy.array) as the first element, and the index of the markers that will need to be removed (set) as the last element.
It creates three different files:
- prefix.zeroed_out: contains information about markers and samples where the genotype was zeroed out.
- prefix.not_good_enough: contains information about markers that were not good enough to help in completing the chosen markers (because of concordance or completion).
- prefix.removed_duplicates: the list of markers that were used for completing the chosen one; hence they will be removed from the final data set.
Cycling through every genotype of every sample of every duplicated marker, it checks if the genotypes are all the same. If the chosen one was not called, but the other ones were, then the chosen one is completed with the genotype of the others (assuming that they are all the same). If there is a difference between the genotypes, it is zeroed out for the chosen marker.
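The per-sample completion rule described above can be sketched as a small helper. The name complete_genotype, the "A G" string encoding, and the "0 0" no-call value are assumptions made for illustration; the sketch ignores strand flipping and the completion and concordance thresholds that the real function takes into account.

```python
def complete_genotype(chosen, others, missing="0 0"):
    """Return the genotype to keep for the chosen marker of one sample.

    chosen: genotype of the chosen marker (e.g. "A G").
    others: genotypes of the remaining duplicated markers.
    """
    # Collect the distinct called genotypes, ignoring allele order.
    called = {
        frozenset(g.split())
        for g in [chosen] + list(others)
        if g != missing
    }
    if len(called) > 1:
        # The duplicates disagree: zero out the chosen genotype.
        return missing
    if chosen == missing and called:
        # The chosen marker was not called: complete it from the others.
        return next(g for g in others if g != missing)
    return chosen


print(complete_genotype("0 0", ["A G", "G A"]))  # completed from duplicates
print(complete_genotype("A A", ["A G"]))         # discordant, zeroed out
```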
-
pyGenClean.DupSNPs.duplicated_snps.createFinalTPEDandTFAM(tped, toReadPrefix, prefix, snpToRemove)
Creates the final TPED and TFAM.
Parameters:
- tped (numpy.array) – a representation of the tped of duplicated markers.
- toReadPrefix (str) – the prefix of the unique files.
- prefix (str) – the prefix of the output files.
- snpToRemove (set) – the markers to remove.
Starts by copying the unique markers’ tfam file to prefix.final.tfam. Then, it copies the unique markers’ tped file, to which the chosen markers are appended.
The final data set will include the unique markers, the chosen markers which were completed, and the problematic duplicated markers (for further analysis). The markers that were used to complete the chosen ones are not present in the final data set.
-
pyGenClean.DupSNPs.duplicated_snps.findUniques(mapF)
Finds the unique markers in a MAP.
Parameters: mapF (list) – representation of a map file.
Returns: a dict containing unique markers (according to their genomic localisation).
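As an illustration of finding markers that are unique by genomic location, here is a minimal sketch. The function name, the four-column row layout (chromosome, name, genetic distance, position), and the shape of the returned dict are all assumptions; the layout of the dict returned by findUniques() is not documented here.

```python
from collections import Counter


def find_uniques(map_rows):
    """Map (chromosome, position) to the row index of each unique marker.

    map_rows: list of (chromosome, name, genetic_distance, position) tuples.
    """
    # Count how many markers share each genomic location.
    counts = Counter((chrom, pos) for chrom, _name, _cm, pos in map_rows)
    # Keep only the locations that were seen exactly once.
    return {
        (chrom, pos): index
        for index, (chrom, _name, _cm, pos) in enumerate(map_rows)
        if counts[(chrom, pos)] == 1
    }


rows = [
    ("1", "rs1", "0", "100"),
    ("1", "rs2", "0", "200"),
    ("1", "rs3", "0", "100"),  # same location as rs1, so a duplicate
]
print(find_uniques(rows))
```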
-
pyGenClean.DupSNPs.duplicated_snps.flipGenotype(genotype)
Flips a genotype.
Parameters: genotype (set) – the genotype to flip.
Returns: the new flipped genotype (as a set)
>>> flipGenotype({"A", "T"})
set(['A', 'T'])
>>> flipGenotype({"C", "T"})
set(['A', 'G'])
>>> flipGenotype({"T", "G"})
set(['A', 'C'])
>>> flipGenotype({"0", "0"})
Traceback (most recent call last):
    ...
ProgramError: 0: unkown allele
>>> flipGenotype({"A", "N"})
Traceback (most recent call last):
    ...
ProgramError: N: unkown allele
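The behaviour shown in the examples above is consistent with replacing each allele by its complementary base. The following is a minimal Python 3 sketch of that idea, not the module's code; the names flip_genotype and _COMPLEMENT and the local ProgramError stand-in are assumptions.

```python
class ProgramError(Exception):
    """Stand-in for the module's ProgramError exception."""


# Complementary base of each valid allele.
_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip_genotype(genotype):
    """Return the set of complementary alleles of a genotype (a set)."""
    flipped = set()
    for allele in genotype:
        if allele not in _COMPLEMENT:
            # Missing ("0") or ambiguous ("N") alleles cannot be flipped.
            raise ProgramError("{}: unknown allele".format(allele))
        flipped.add(_COMPLEMENT[allele])
    return flipped


print(sorted(flip_genotype({"C", "T"})))
```

Note that a genotype such as {"A", "T"} flips to itself, which is why strand cannot be resolved from the alleles alone for A/T and C/G markers.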
-
pyGenClean.DupSNPs.duplicated_snps.getIndexOfHeteroMen(genotypes, menIndex)
Gets the indexes of heterozygous men.
Parameters:
- genotypes (numpy.array) – the genotypes of everybody.
- menIndex (numpy.array) – the indexes of the men (for the genotypes).
Returns: a numpy.array containing the indexes of the genotypes to remove.
Finds the men that have a heterozygous genotype for the current marker. Usually used on sex chromosomes.
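A minimal sketch of this lookup with NumPy is given below. The function name and the "A G" string encoding of genotypes are assumptions for illustration only.

```python
import numpy as np


def hetero_men_index(genotypes, men_index):
    """Return the indexes (into genotypes) of men with two different alleles."""
    men_genotypes = genotypes[men_index]
    # A genotype is heterozygous when its two alleles differ.
    is_hetero = np.array(
        [len(set(g.split())) > 1 for g in men_genotypes],
        dtype=bool,
    )
    return men_index[is_hetero]


genotypes = np.array(["A A", "A G", "C T", "0 0"])
men = np.array([0, 1, 3])  # sample 2 is assumed not to be a man
print(hetero_men_index(genotypes, men))
```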
-
pyGenClean.DupSNPs.duplicated_snps.main(argString=None)
The main function of the module.
Here are the steps for duplicated markers:
- Prints the options.
- Reads the map file to gather the markers’ positions (readMAP()).
- Reads the tfam file (readTFAM()).
- Finds the unique markers using the map file (findUniques()).
- Processes the tped file. Writes a file containing unique markers in prefix.unique_snps (tfam and tped). Keeps in memory information about the duplicated markers (tped) (processTPED()).
- If there are no duplicated markers, the prefix.unique_snps files (tped and tfam) are copied to prefix.final.
- If there are duplicated markers, prints a tped and tfam file containing the duplicated markers (printDuplicatedTPEDandTFAM()).
- Computes the frequency of the duplicated markers (using Plink) and reads the output file (computeFrequency()).
- Computes the completion and pairwise concordance of each of the duplicated markers (computeStatistics()).
- Prints the problematic duplicated markers with a file containing the summary of the statistics (completion and pairwise concordance) (printProblems()).
- Prints the pairwise concordance in a file (matrices) (printConcordance()).
- Chooses the best duplicated markers using concordance and completion (chooseBestSnps()).
- Completes the chosen markers with the remaining duplicated markers (createAndCleanTPED()).
- Creates the final tped file, containing the unique markers, the chosen duplicated markers that were completed, and the problematic duplicated markers (for further analysis). This set excludes markers that were used for completing the chosen ones (createFinalTPEDandTFAM()).
-
pyGenClean.DupSNPs.duplicated_snps.parseArgs(argString=None)
Parses the command line options and arguments.
Parameters: argString (list) – the options.
Returns: An argparse.Namespace object created by the argparse module. It contains the values of the different options.
Options                       Type    Description
--tfile                       string  The input file prefix (Plink tfile).
--snp-completion-threshold    float   The completion threshold to consider a replicate when choosing the best replicates.
--snp-concordance-threshold   float   The concordance threshold to consider a replicate when choosing the best replicates.
--frequency_difference        float   The maximum difference in frequency between duplicated markers.
--out                         string  The prefix of the output files.
Note
No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see checkArgs()).
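The options listed above could be declared with argparse roughly as follows. This is a sketch only: the function name build_parser and the help strings are illustrative, and no default values or required flags are set because the defaults used by the real module are not given here.

```python
import argparse


def build_parser():
    """Build a parser exposing the options listed in the table above."""
    parser = argparse.ArgumentParser(
        description="Sketch of the duplicated markers options.",
    )
    parser.add_argument("--tfile", type=str,
                        help="The input file prefix (Plink tfile).")
    parser.add_argument("--snp-completion-threshold", type=float,
                        help="The completion threshold.")
    parser.add_argument("--snp-concordance-threshold", type=float,
                        help="The concordance threshold.")
    parser.add_argument("--frequency_difference", type=float,
                        help="Maximum frequency difference between duplicates.")
    parser.add_argument("--out", type=str,
                        help="The prefix of the output files.")
    return parser


# argparse turns the dashes of long option names into underscores.
args = build_parser().parse_args(
    ["--tfile", "data", "--snp-completion-threshold", "0.9"]
)
print(args.tfile, args.snp_completion_threshold)
```

Passing an explicit list to parse_args(), as main(argString=None) and parseArgs(argString=None) allow, makes the module callable from other Python code as well as from the command line.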
-
pyGenClean.DupSNPs.duplicated_snps.printConcordance(concordance, prefix, tped, snps)
Prints the concordance.
Parameters:
- concordance (dict) – the concordance.
- prefix (str) – the prefix of the output files.
- tped (numpy.array) – a representation of the tped of duplicated markers.
- snps (dict) – the position of the duplicated markers in the tped.
Prints the concordance in a file, in the format of a matrix. For each set of duplicated markers, the first line (starting with the # sign) contains the names of all the markers in the set. Then an \(N \times N\) matrix is printed to the file (where \(N\) is the number of markers in the duplicated marker list), containing the pairwise concordance.
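The matrix layout described above can be sketched as follows. The function name, the tab delimiter, and the number formatting are assumptions; only the overall shape (a header line starting with #, followed by N rows of N values) comes from the description.

```python
import io

import numpy as np


def write_concordance(names, matrix, out):
    """Write one set of duplicated markers as a header line and a matrix."""
    # Header line: marker names of the set, prefixed with "#".
    out.write("#" + "\t".join(names) + "\n")
    # One row per marker, holding its concordance with every other marker.
    for row in matrix:
        out.write("\t".join("{:.4f}".format(value) for value in row) + "\n")


buffer = io.StringIO()
write_concordance(
    ["rs1", "rs2"],
    np.array([[1.0, 0.5], [0.5, 1.0]]),
    buffer,
)
print(buffer.getvalue())
```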
-
pyGenClean.DupSNPs.duplicated_snps.printDuplicatedTPEDandTFAM(tped, tfamFileName, outPrefix)
Prints the duplicated SNPs TPED and TFAM.
Parameters:
- tped (numpy.array) – a representation of the tped of duplicated markers.
- tfamFileName (str) – the name of the original tfam file.
- outPrefix (str) – the output prefix.
First, it copies the original tfam into outPrefix.duplicated_snps.tfam. Then, it prints the tped of duplicated markers in outPrefix.duplicated_snps.tped.
-
pyGenClean.DupSNPs.duplicated_snps.printProblems(completion, concordance, tped, snps, frequencies, prefix, diffFreq)
Prints the statistics.
Parameters:
- completion (numpy.array) – the completion of each duplicated marker.
- concordance (dict) – the pairwise concordance between duplicated markers.
- tped (numpy.array) – a representation of the tped of duplicated markers.
- snps (dict) – the positions of the duplicated markers in the tped.
- frequencies (dict) – the frequency of each of the duplicated markers.
- prefix (str) – the prefix of the output files.
- diffFreq (float) – the frequency difference threshold.
Returns: a set containing duplicated markers to complete.
Creates a summary file (prefix.summary) containing information about duplicated markers: chromosome, position, name, alleles, status, completion percentage, completion number and mean concordance.
The frequency and the minor allele are used to be certain that two duplicated markers are exactly the same marker (and not a tri-allelic one, for example).
For each pair of duplicated markers:
- Constructs the set of available alleles for the first marker.
- Constructs the set of available alleles for the second marker.
- If the two sets are different, but the number of alleles is the same, we try to flip one of the markers. If the two sets are then the same, but the number of alleles is 1, we set the status to homo_flip. If the markers are heterozygous, we set the status to flip.
- If there is a difference in the number of alleles (one is homozygous, the other heterozygous), and there is one allele in common, we set the status to homo_hetero. If there is no allele in common, we try to flip one. If the new sets have one allele in common, we set the status to homo_hetero_flip.
- If the sets of available alleles are the same (without flip), we check the frequency and the minor alleles. If the minor allele is different, we set the status to diff_minor_allele. If the difference in frequencies is higher than a threshold, we set the status to diff_frequency.
- If all of the above fail, we set the status to problem.
Problems are written to the prefix.problems file, which contains the following columns: chromosome, position, name and status. This file contains all the markers with a status, as explained above.
-
pyGenClean.DupSNPs.duplicated_snps.processTPED(uniqueSNPs, mapF, fileName, tfam, prefix)
Processes the TPED file.
Returns: a tuple with the representation of the tped file (numpy.array) as the first element, and the updated position of the duplicated markers in the tped representation as the last element.
Copies the tfam file into prefix.unique_snps.tfam. While reading the tped file, creates a new one (prefix.unique_snps.tped) containing only unique markers.
-
pyGenClean.DupSNPs.duplicated_snps.readMAP(fileName, prefix)
Reads the MAP file.
Parameters: fileName (str) – the name of the map file.
Returns: a list of tuples, representing the map file.
While reading the map file, it saves a file (prefix.duplicated_marker_names) containing the name of the unique duplicated markers.
-
pyGenClean.DupSNPs.duplicated_snps.readTFAM(fileName)
Reads the TFAM file.
Parameters: fileName (str) – the name of the tfam file.
Returns: a representation of the tfam file (numpy.array).
-
pyGenClean.DupSNPs.duplicated_snps.runCommand(command)
Runs the command in Plink.
Parameters: command (list) – the command to run.
Tries to run a command using subprocess.
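Running an external command with subprocess and reporting failures can be sketched as follows. The function name, the local ProgramError stand-in, and the choice to raise ProgramError on failure are assumptions; the documentation above only states that the command is run through subprocess.

```python
import subprocess
import sys


class ProgramError(Exception):
    """Stand-in for the module's ProgramError exception."""


def run_command(command):
    """Run a command (list of strings) and return its combined output."""
    try:
        # check_output raises CalledProcessError on a non-zero exit code,
        # and OSError if the executable cannot be started.
        return subprocess.check_output(command, stderr=subprocess.STDOUT)
    except (subprocess.CalledProcessError, OSError) as exc:
        message = "could not run: {}".format(" ".join(command))
        raise ProgramError(message) from exc


# The Python interpreter is used here only so the example runs anywhere;
# the real function is given a Plink command line.
print(run_command([sys.executable, "-c", "print('ok')"]))
```

Passing the command as a list, as the command (list) parameter suggests, avoids shell quoting issues with file prefixes that contain spaces.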