pyGenClean.RelatedSamples package
For more information about how to use this module, refer to the Related Samples Module.
Module contents
Submodules
pyGenClean.RelatedSamples.find_related_samples module
-
exception pyGenClean.RelatedSamples.find_related_samples.ProgramError(msg)
Bases: exceptions.Exception
An Exception raised in case of a problem.
Parameters: msg (str) – the message to print to the user before exiting.
-
pyGenClean.RelatedSamples.find_related_samples.checkArgs(args)
Checks the arguments and options.
Parameters: args (argparse.Namespace) – an object containing the options of the program.
Returns: True if everything was OK.
If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr and the program exits with code 1.
-
pyGenClean.RelatedSamples.find_related_samples.checkNumberOfSNP(fileName, minimumNumber)
Checks that there are enough SNPs in the file (compared with the minimum).
Returns: True if there are enough markers in the file, False otherwise.
Reads the number of markers (number of lines) in a file.
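The check described above can be sketched as follows. This is only a minimal Python 3 illustration, not the module's actual code: the name has_enough_markers is hypothetical, the file is assumed to hold one marker per line, and "enough" is assumed to mean "at least the minimum" (the exact comparison is not stated here).

```python
import os
import tempfile


def has_enough_markers(file_name, minimum_number):
    """Return True if the file holds at least `minimum_number` lines.

    Assumption: one marker per line, and "enough" means >= minimum.
    """
    with open(file_name) as handle:
        nb_markers = sum(1 for _ in handle)
    return nb_markers >= minimum_number


# Demo on a throwaway file listing three (made-up) marker names.
with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("rs1\nrs2\nrs3\n")
enough = has_enough_markers(tmp.name, 3)
too_few = has_enough_markers(tmp.name, 4)
os.remove(tmp.name)
```

Counting lines with a generator keeps memory use constant, which matters for marker lists with hundreds of thousands of entries.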
-
pyGenClean.RelatedSamples.find_related_samples.extractRelatedIndividuals(fileName, outPrefix, ibs2_ratio_threshold)
Extracts related individuals according to the IBS2* ratio.
Returns: a numpy.recarray data set containing (for each related sample pair) the IBS2 ratio, Z1, Z2 and the type of relatedness.
Reads a genome file (provided by runGenome()) and extracts related sample pairs according to the IBS2 ratio.
A genome file contains at least the following information for each sample pair:
- FID1: the family ID of the first sample in the pair.
- IID1: the individual ID of the first sample in the pair.
- FID2: the family ID of the second sample in the pair.
- IID2: the individual ID of the second sample in the pair.
- Z0: the probability that \(IBD = 0\).
- Z1: the probability that \(IBD = 1\).
- Z2: the probability that \(IBD = 2\).
- HOMHOM: the number of \(IBS = 0\) SNP pairs used in the PPC test.
- HETHET: the number of \(IBS = 2\) het/het SNP pairs in the PPC test.
The IBS2 ratio is computed using the following formula:
\[\textrm{IBS2 ratio} = \frac{\textrm{HETHET}} {\textrm{HOMHOM} + \textrm{HETHET}}\]
If the IBS2 ratio is higher than the threshold, the samples in the pair are related. The following values help in finding the relatedness of the sample pair.
Values | Relation | Code
\(0.17 \leq z_0 \leq 0.33\) and \(0.40 \leq z_1 \leq 0.60\) | Full-sibs | 1
\(0.40 \leq z_0 \leq 0.60\) and \(0.40 \leq z_1 \leq 0.60\) | Half-sibs or Grand-parent-Child or Uncle-Nephew | 2
\(z_0 \leq 0.05\) and \(z_1 \geq 0.95\) and \(z_2 \leq 0.05\) | Parent-Child | 3
\(z_0 \leq 0.05\) and \(z_1 \leq 0.05\) and \(z_2 \geq 0.95\) | Twins or Duplicated samples | 4
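The formula and the thresholds above translate into a few lines of Python 3. This is a hedged sketch rather than the module's implementation: the function names are hypothetical, the rows are tested in table order, and returning None for a pair that matches no row is an assumption (how such pairs are coded is not stated here).

```python
def ibs2_ratio(homhom, hethet):
    """IBS2 ratio = HETHET / (HOMHOM + HETHET).

    Raises ZeroDivisionError if both counts are zero.
    """
    return hethet / (homhom + hethet)


def relatedness_code(z0, z1, z2):
    """Return the relation code (1 to 4) of the first matching row, else None."""
    if 0.17 <= z0 <= 0.33 and 0.40 <= z1 <= 0.60:
        return 1  # Full-sibs
    if 0.40 <= z0 <= 0.60 and 0.40 <= z1 <= 0.60:
        return 2  # Half-sibs or Grand-parent-Child or Uncle-Nephew
    if z0 <= 0.05 and z1 >= 0.95 and z2 <= 0.05:
        return 3  # Parent-Child
    if z0 <= 0.05 and z1 <= 0.05 and z2 >= 0.95:
        return 4  # Twins or Duplicated samples
    return None  # assumption: no row matched


# Made-up values for one sample pair.
pair_ratio = ibs2_ratio(homhom=20, hethet=80)
pair_code = relatedness_code(z0=0.25, z1=0.50, z2=0.25)
```

A pair would then be reported as related only when pair_ratio exceeds ibs2_ratio_threshold; the code merely labels the kind of relation.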
-
pyGenClean.RelatedSamples.find_related_samples.extractSNPs(snpsToExtract, options)
Extract markers using Plink.
Parameters:
- snpsToExtract (str) – the name of the file containing markers to extract.
- options (argparse.Namespace) – the options.
Returns: the prefix of the output files.
-
pyGenClean.RelatedSamples.find_related_samples.main(argString=None)
The main function of this module.
Parameters: argString (list) – the options.
Here are the steps for this function:
- Prints the options.
- Uses Plink to extract markers according to LD (selectSNPsAccordingToLD()).
- Checks if there are enough markers after pruning (checkNumberOfSNP()). If not, then quits.
- Extracts markers according to LD (extractSNPs()).
- Runs Plink with the genome option (runGenome()). Quits here if the user asked only for the genome file.
- Finds related individuals and gets values for plotting (extractRelatedIndividuals()).
- Plots Z1 as a function of the IBS2 ratio for related individuals (plot_related_data()).
- Plots Z2 as a function of the IBS2 ratio for related individuals (plot_related_data()).
-
pyGenClean.RelatedSamples.find_related_samples.mergeGenomeLogFiles(outPrefix, nbSet)
Merge genome and log files together.
Returns: the name of the output file (the genome file).
After merging, the files are deleted to save space.
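The merge-then-delete idea can be illustrated with the following Python 3 sketch. It is not the module's code: the function name merge_parts, the part file names and the toy three-column content are all hypothetical, and it assumes that every partial genome file starts with the same header line, which should appear only once in the merged file.

```python
import os
import shutil
import tempfile


def merge_parts(part_names, out_name):
    """Concatenate the parts into `out_name`, then delete the parts.

    Assumption: each part begins with an identical header line.
    """
    with open(out_name, "w") as out:
        for index, name in enumerate(part_names):
            with open(name) as part:
                header = part.readline()
                if index == 0:
                    out.write(header)  # keep the header of the first part only
                for line in part:
                    out.write(line)
    for name in part_names:
        os.remove(name)  # free the space once everything is merged
    return out_name


# Demo with two tiny, made-up parts in a temporary directory.
work_dir = tempfile.mkdtemp()
parts = []
for index, row in enumerate(("s1 s2 0.9\n", "s3 s4 0.1\n")):
    name = os.path.join(work_dir, "part{}.genome".format(index))
    with open(name, "w") as handle:
        handle.write("IID1 IID2 Z0\n" + row)
    parts.append(name)
merged_name = merge_parts(parts, os.path.join(work_dir, "merged.genome"))
with open(merged_name) as handle:
    merged_lines = handle.read().splitlines()
parts_left = [name for name in parts if os.path.exists(name)]
shutil.rmtree(work_dir)
```

Deleting the parts only after the whole output has been written means that a failure during the merge leaves the partial results on disk.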
-
pyGenClean.RelatedSamples.find_related_samples.parseArgs(argString=None)
Parses the command line options and arguments.
Parameters: argString (list) – the options.
Returns: An argparse.Namespace object created by the argparse module. It contains the values of the different options.
Options | Type | Description
--bfile | string | The input file prefix (Plink binary file).
--genome-only | bool | Only create the genome file.
--min-nb-snp | int | The minimum number of markers needed to compute IBS values.
--indep-pairwise | string | Three numbers: window size, window shift and the r2 threshold.
--maf | string | Restrict to SNPs with MAF >= threshold.
--ibs2-ratio | float | The initial IBS2* ratio (the minimum value to show in the plot).
--sge | bool | Use SGE for parallelization.
--sge-walltime | int | The time limit (for clusters).
--sge-nodes | int int | Two INTs (number of nodes and number of processors per node).
--line-per-file-for-sge | int | The number of lines per file for SGE task array.
--out | string | The prefix of the output files.
Note
No option check is done here (except for the ones automatically done by argparse). Those need to be done elsewhere (see checkArgs()).
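A parser accepting the options listed above could look like the following argparse sketch. It only mirrors the option names and types from the table: the function name build_parser, the metavar labels and the absence of default values are assumptions, and the real parser may declare required options, defaults and help texts differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (defaults not assumed)."""
    parser = argparse.ArgumentParser(description="Find related samples (sketch).")
    parser.add_argument("--bfile", type=str)
    parser.add_argument("--genome-only", action="store_true")
    parser.add_argument("--min-nb-snp", type=int)
    # Three values kept as strings: window size, window shift, r2 threshold.
    parser.add_argument("--indep-pairwise", type=str, nargs=3,
                        metavar=("WINDOW", "SHIFT", "R2"))
    parser.add_argument("--maf", type=str)
    parser.add_argument("--ibs2-ratio", type=float)
    parser.add_argument("--sge", action="store_true")
    parser.add_argument("--sge-walltime", type=int)
    # Two integers: number of nodes and processors per node.
    parser.add_argument("--sge-nodes", type=int, nargs=2,
                        metavar=("NODES", "PPN"))
    parser.add_argument("--line-per-file-for-sge", type=int)
    parser.add_argument("--out", type=str)
    return parser


# Passing a list of strings plays the role of argString (made-up values).
args = build_parser().parse_args([
    "--bfile", "data", "--genome-only",
    "--indep-pairwise", "50", "5", "0.1",
    "--sge-nodes", "1", "8",
])
```

argparse turns dashes into underscores, so the values are read as args.genome_only, args.sge_nodes or args.line_per_file_for_sge.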
-
pyGenClean.RelatedSamples.find_related_samples.plot_related_data(x, y, code, ylabel, fileName, options)
Plots Z1 and Z2 as a function of the IBS2* ratio.
Parameters:
- x (numpy.array of floats) – the x axis of the plot (IBS2 ratio).
- y (numpy.array of floats) – the y axis of the plot (either z1 or z2).
- code (numpy.array) – the code of the relatedness of each sample pair.
- ylabel (str) – the label of the y axis (either z1 or z2).
- fileName (str) – the name of the output file.
- options (argparse.Namespace) – the options.
There are four different relation codes (represented by four different colors in the plots):
Code | Relation | Color
1 | Full-sibs | #CC0000
2 | Half-sibs or Grand-parent-Child or Uncle-Nephew | #0099CC
3 | Parent-Child | #FF8800
4 | Twins or Duplicated samples | #9933CC
Sample pairs with unknown relation are plotted using #669900 as color.
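The color scheme above amounts to a lookup with a fallback, as in this small Python sketch. The names RELATION_COLORS, UNKNOWN_COLOR and colors_for_codes are hypothetical, and the codes are assumed to be integers; any value other than 1 to 4 is treated as an unknown relation.

```python
# Colors taken from the table above.
RELATION_COLORS = {
    1: "#CC0000",  # Full-sibs
    2: "#0099CC",  # Half-sibs or Grand-parent-Child or Uncle-Nephew
    3: "#FF8800",  # Parent-Child
    4: "#9933CC",  # Twins or Duplicated samples
}
UNKNOWN_COLOR = "#669900"


def colors_for_codes(codes):
    """Map each relation code to its plot color (unknown codes get the fallback)."""
    return [RELATION_COLORS.get(code, UNKNOWN_COLOR) for code in codes]


# Made-up codes: a full-sib pair, a duplicate pair and an unknown relation.
point_colors = colors_for_codes([1, 4, 0])
```

A list built this way can be handed to a scatter plot as per-point colors, for instance through the c argument of matplotlib's scatter.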
-
pyGenClean.RelatedSamples.find_related_samples.runCommand(command)
Runs a command.
Parameters: command (list) – the command to run.
Tries to run a command. If it fails, raises a ProgramError. This function uses the subprocess module.
Warning
The variable command should be a list of strings (no other type).
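The behaviour described above can be sketched with subprocess as follows. This is an illustrative Python 3 version, not the module's source: the stand-in ProgramError class, the error message and the fact that the captured output is returned are assumptions.

```python
import subprocess
import sys


class ProgramError(Exception):
    """Stand-in for the module's exception (assumed to only carry a message)."""


def run_command(command):
    """Run `command` (a list of strings); raise ProgramError if it fails."""
    try:
        # stderr is folded into stdout so that nothing is printed on failure.
        return subprocess.check_output(command, stderr=subprocess.STDOUT)
    except (subprocess.CalledProcessError, OSError) as error:
        raise ProgramError("could not run: " + " ".join(command)) from error


# Demo: one command that succeeds and one that exits with a non-zero status.
ok_output = run_command([sys.executable, "-c", "print('ok')"])
try:
    run_command([sys.executable, "-c", "import sys; sys.exit(2)"])
    failed = False
except ProgramError:
    failed = True
```

Passing the command as a list (rather than a single string) avoids going through a shell, which is why the warning above insists on a list of strings.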
-
pyGenClean.RelatedSamples.find_related_samples.runGenome(bfile, options)
Runs the genome command from Plink.
Parameters:
- bfile (str) – the input file prefix.
- options (argparse.Namespace) – the options.
Returns: the name of the genome file.
Runs Plink with the genome option. If the user asks for SGE (options.sge is True), a frequency file is first created by Plink. Then, the input files are split into files of options.line_per_file_for_sge lines and Plink is called (using the genome option) on the cluster using SGE (runGenomeSGE()). After the analysis, Plink’s output files and logs are merged using mergeGenomeLogFiles().
-
pyGenClean.RelatedSamples.find_related_samples.runGenomeSGE(bfile, freqFile, nbJob, outPrefix, options)
Runs the genome command from Plink, on SGE.
Parameters:
- bfile (str) – the prefix of the input file.
- freqFile (str) – the name of the frequency file (from Plink).
- nbJob (int) – the number of jobs to launch.
- outPrefix (str) – the prefix of all the output files.
- options (argparse.Namespace) – the options.
Runs Plink with the genome option on the cluster (using SGE).
-
pyGenClean.RelatedSamples.find_related_samples.safe_main()
A safe version of the main function (that catches ProgramError).
-
pyGenClean.RelatedSamples.find_related_samples.selectSNPsAccordingToLD(options)
Compute LD using Plink.
Parameters: options (argparse.Namespace) – the options.
Returns: the name of the output file (from Plink).
pyGenClean.RelatedSamples.merge_related_samples module
-
exception pyGenClean.RelatedSamples.merge_related_samples.ProgramError(msg)
Bases: exceptions.Exception
An Exception raised in case of a problem.
Parameters: msg (str) – the message to print to the user before exiting.
-
pyGenClean.RelatedSamples.merge_related_samples.checkArgs(args)
Checks the arguments and options.
Parameters: args (argparse.Namespace) – an object containing the options of the program.
Returns: True if everything was OK.
If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr and the program exits with code 1.
-
pyGenClean.RelatedSamples.merge_related_samples.main(argString=None)
The main function of the module.
Parameters: argString (list) – the options.
-
pyGenClean.RelatedSamples.merge_related_samples.merge_related_samples(file_name, out_prefix, no_status)
Merge related samples.
In the output file, there is one pair of samples per line. Hence, one can find related individuals by merging overlapping pairs.
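Merging overlapping pairs amounts to finding connected groups of samples. The sketch below uses a small union-find structure for that purpose; it is one possible way to do it and not necessarily how the module proceeds. The function name merge_pairs and the sample labels are hypothetical, and file parsing as well as the status column are left out.

```python
def merge_pairs(pairs):
    """Group samples that are linked through a chain of related pairs."""
    parent = {}

    def find(sample):
        # Register unseen samples as their own group, then walk up to the root.
        parent.setdefault(sample, sample)
        while parent[sample] != sample:
            parent[sample] = parent[parent[sample]]  # path halving
            sample = parent[sample]
        return sample

    for first, second in pairs:
        root_first, root_second = find(first), find(second)
        if root_first != root_second:
            parent[root_second] = root_first  # the two groups overlap: merge them

    groups = {}
    for sample in parent:
        groups.setdefault(find(sample), set()).add(sample)
    return sorted(sorted(group) for group in groups.values())


# Made-up pairs: A-B and B-C overlap on B, D-E stands alone.
related_groups = merge_pairs([("A", "B"), ("B", "C"), ("D", "E")])
```

In practice a sample would more likely be identified by a (family ID, individual ID) tuple than by a single letter; the function works unchanged with any hashable, sortable identifier.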
-
pyGenClean.RelatedSamples.merge_related_samples.parseArgs(argString=None)
Parses the command line options and arguments.
Parameters: argString (list) – the options.
Returns: An argparse.Namespace object created by the argparse module. It contains the values of the different options.
Options | Type | Description
--ibs-related | string | The input file containing related individuals according to IBS value.
--no-status | bool | The input file doesn’t have a status column.
--out | string | The prefix of the output files.
Note
No option check is done here (except for the ones automatically done by argparse). Those need to be done elsewhere (see checkArgs()).
