pyGenClean.RelatedSamples package
For more information about how to use this module, refer to the Related Samples Module.
Module contents
Submodules
pyGenClean.RelatedSamples.find_related_samples module
-
exception pyGenClean.RelatedSamples.find_related_samples.ProgramError(msg)
Bases: exceptions.Exception
An Exception raised in case of a problem.
Parameters: msg (str) – the message to print to the user before exiting.
-
pyGenClean.RelatedSamples.find_related_samples.checkArgs(args)
Checks the arguments and options.
Parameters: args (argparse.Namespace) – an object containing the options of the program.
Returns: True if everything was OK.
If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr and the program exits with code 1.
-
pyGenClean.RelatedSamples.find_related_samples.checkNumberOfSNP(fileName, minimumNumber)
Checks that there are enough SNPs in the file (given a minimum).
Parameters: fileName, minimumNumber
Returns: True if there are enough markers in the file, False otherwise.
Reads the number of markers (number of lines) in a file.
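The check described above can be sketched as a simple line count. This is only an illustration: the function name has_enough_markers is hypothetical, and treating the minimum as inclusive (>=) is an assumption, since the documentation does not say whether the comparison is strict.

```python
def has_enough_markers(file_name, minimum_number):
    """Return True if the file holds at least `minimum_number` lines.

    One marker per line is assumed, as stated in the description above.
    The inclusive comparison is an assumption of this sketch.
    """
    with open(file_name) as input_file:
        # Count the lines without loading the whole file in memory.
        nb_markers = sum(1 for _ in input_file)
    return nb_markers >= minimum_number
```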
-
pyGenClean.RelatedSamples.find_related_samples.extractRelatedIndividuals(fileName, outPrefix, ibs2_ratio_threshold)
Extracts related individuals according to the IBS2* ratio.
Parameters: fileName, outPrefix, ibs2_ratio_threshold
Returns: a numpy.recarray data set containing (for each related sample pair) the ibs2 ratio, Z1, Z2 and the type of relatedness.
Reads a genome file (provided by runGenome()) and extracts related sample pairs according to the IBS2 ratio.
A genome file contains at least the following information for each sample pair:
- FID1: the family ID of the first sample in the pair.
- IID1: the individual ID of the first sample in the pair.
- FID2: the family ID of the second sample in the pair.
- IID2: the individual ID of the second sample in the pair.
- Z0: the probability that \(IBD = 0\).
- Z1: the probability that \(IBD = 1\).
- Z2: the probability that \(IBD = 2\).
- HOMHOM: the number of \(IBS = 0\) SNP pairs used in the PPC test.
- HETHET: the number of \(IBS = 2\) het/het SNP pairs used in the PPC test.
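The IBS2 ratio is computed using the following formula:
\[\textrm{IBS2 ratio} = \frac{\textrm{HETHET}} {\textrm{HOMHOM} + \textrm{HETHET}}\]
If the IBS2 ratio is higher than the threshold, the samples in the pair are related. The following values help in finding the relatedness of the sample pair.
=============================================================  ===============================================  ====
Values                                                         Relation                                         Code
=============================================================  ===============================================  ====
\(0.17 \leq z_0 \leq 0.33\) and \(0.40 \leq z_1 \leq 0.60\)    Full-sibs                                        1
\(0.40 \leq z_0 \leq 0.60\) and \(0.40 \leq z_1 \leq 0.60\)    Half-sibs or Grand-parent-Child or Uncle-Nephew  2
\(z_0 \leq 0.05\) and \(z_1 \geq 0.95\) and \(z_2 \leq 0.05\)  Parent-Child                                     3
\(z_0 \leq 0.05\) and \(z_1 \leq 0.05\) and \(z_2 \geq 0.95\)  Twins or Duplicated samples                      4
=============================================================  ===============================================  ====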
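As an illustration of the formula and of the table above, the following sketch computes the IBS2 ratio and assigns a relation code to a pair. The function names, the NaN returned when no SNP pair is available, and the use of None for pairs matching no row are assumptions of this sketch, not the module's documented behaviour.

```python
def ibs2_ratio(homhom, hethet):
    """IBS2 ratio = HETHET / (HOMHOM + HETHET)."""
    total = homhom + hethet
    if total == 0:
        # Assumption: no usable SNP pair gives an undefined ratio.
        return float("nan")
    return hethet / float(total)


def relation_code(z0, z1, z2):
    """Return the relation code from the table, or None if no row matches."""
    if 0.17 <= z0 <= 0.33 and 0.40 <= z1 <= 0.60:
        return 1  # Full-sibs
    if 0.40 <= z0 <= 0.60 and 0.40 <= z1 <= 0.60:
        return 2  # Half-sibs or Grand-parent-Child or Uncle-Nephew
    if z0 <= 0.05 and z1 >= 0.95 and z2 <= 0.05:
        return 3  # Parent-Child
    if z0 <= 0.05 and z1 <= 0.05 and z2 >= 0.95:
        return 4  # Twins or Duplicated samples
    return None  # Unknown relation (hypothetical marker)
```

A pair would then be kept as related when ibs2_ratio(HOMHOM, HETHET) is higher than the chosen threshold.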
-
pyGenClean.RelatedSamples.find_related_samples.extractSNPs(snpsToExtract, options)
Extract markers using Plink.
Parameters:
- snpsToExtract (str) – the name of the file containing markers to extract.
- options (argparse.Namespace) – the options.
Returns: the prefix of the output files.
-
pyGenClean.RelatedSamples.find_related_samples.main(argString=None)
The main function of this module.
Parameters: argString (list) – the options.
Here are the steps for this function:
- Prints the options.
- Uses Plink to extract markers according to LD (selectSNPsAccordingToLD()).
- Checks if there are enough markers after pruning (checkNumberOfSNP()). If not, then quits.
- Extracts markers according to LD (extractSNPs()).
- Runs Plink with the genome option (runGenome()). Quits here if the user asked only for the genome file.
- Finds related individuals and gets values for plotting (extractRelatedIndividuals()).
- Plots Z1 as a function of the IBS2 ratio for related individuals (plot_related_data()).
- Plots Z2 as a function of the IBS2 ratio for related individuals (plot_related_data()).
-
pyGenClean.RelatedSamples.find_related_samples.mergeGenomeLogFiles(outPrefix, nbSet)
Merge genome and log files together.
Parameters: outPrefix, nbSet
Returns: the name of the output file (the genome file).
After merging, the files are deleted to save space.
-
pyGenClean.RelatedSamples.find_related_samples.parseArgs(argString=None)
Parses the command line options and arguments.
Parameters: argString (list) – the options.
Returns: An argparse.Namespace object created by the argparse module. It contains the values of the different options.
=======================  =======  ==============================================================
Options                  Type     Description
=======================  =======  ==============================================================
--bfile                  string   The input file prefix (Plink binary file).
--genome-only            bool     Only create the genome file.
--min-nb-snp             int      The minimum number of markers needed to compute IBS values.
--indep-pairwise         string   Three numbers: window size, window shift and the r2 threshold.
--maf                    string   Restrict to SNPs with MAF >= threshold.
--ibs2-ratio             float    The initial IBS2* ratio (the minimum value to show in the plot).
--sge                    bool     Use SGE for parallelization.
--sge-walltime           int      The time limit (for clusters).
--sge-nodes              int int  Two INTs (number of nodes and number of processors per node).
--line-per-file-for-sge  int      The number of lines per file for SGE task array.
--out                    string   The prefix of the output files.
=======================  =======  ==============================================================
Note
No option check is done here (except for the ones automatically done by argparse). Those need to be done elsewhere (see checkArgs()).
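For illustration only, the option table above could be declared with argparse roughly as follows. The function name build_parser is hypothetical, no default values are set because none are documented here, and the use of nargs=3 and nargs=2 for the multi-value options is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (sketch only)."""
    parser = argparse.ArgumentParser(description="Finds related samples.")
    parser.add_argument("--bfile", type=str)
    parser.add_argument("--genome-only", action="store_true")
    parser.add_argument("--min-nb-snp", type=int)
    # Assumption: window size, window shift and r2 threshold as 3 strings.
    parser.add_argument("--indep-pairwise", type=str, nargs=3)
    parser.add_argument("--maf", type=str)
    parser.add_argument("--ibs2-ratio", type=float)
    parser.add_argument("--sge", action="store_true")
    parser.add_argument("--sge-walltime", type=int)
    # Assumption: number of nodes and processors per node as 2 integers.
    parser.add_argument("--sge-nodes", type=int, nargs=2)
    parser.add_argument("--line-per-file-for-sge", type=int)
    parser.add_argument("--out", type=str)
    return parser


# Parsing an explicit list, in the spirit of the argString parameter.
# The values below are purely illustrative.
example_args = build_parser().parse_args(
    ["--bfile", "data", "--indep-pairwise", "50", "5", "0.1",
     "--ibs2-ratio", "0.8", "--sge-nodes", "1", "8"]
)
```

Passing a list of strings to parse_args() instead of reading sys.argv is what makes a parser like this easy to call from another module.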
-
pyGenClean.RelatedSamples.find_related_samples.plot_related_data(x, y, code, ylabel, fileName, options)
Plots Z1 and Z2 as a function of the IBS2* ratio.
Parameters:
- x (numpy.array of floats) – the x axis of the plot (IBS2 ratio).
- y (numpy.array of floats) – the y axis of the plot (either z1 or z2).
- code (numpy.array) – the code of the relatedness of each sample pair.
- ylabel (str) – the label of the y axis (either z1 or z2).
- fileName (str) – the name of the output file.
- options (argparse.Namespace) – the options.
There are four different relation codes (represented by four different colors in the plots):
====  ===============================================  =======
Code  Relation                                         Color
====  ===============================================  =======
1     Full-sibs                                        #CC0000
2     Half-sibs or Grand-parent-Child or Uncle-Nephew  #0099CC
3     Parent-Child                                     #FF8800
4     Twins or Duplicated samples                      #9933CC
====  ===============================================  =======
Sample pairs with unknown relation are plotted using #669900 as color.
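The color scheme above can be expressed as a small lookup. This is a sketch: the names RELATION_COLORS, UNKNOWN_COLOR and colors_for_codes are hypothetical, and it assumes the relation codes are stored as integers.

```python
# Colors taken from the table above, keyed by relation code.
RELATION_COLORS = {
    1: "#CC0000",  # Full-sibs
    2: "#0099CC",  # Half-sibs or Grand-parent-Child or Uncle-Nephew
    3: "#FF8800",  # Parent-Child
    4: "#9933CC",  # Twins or Duplicated samples
}
UNKNOWN_COLOR = "#669900"  # Sample pairs with unknown relation


def colors_for_codes(codes):
    """Map each relation code to its plotting color."""
    return [RELATION_COLORS.get(code, UNKNOWN_COLOR) for code in codes]
```

A list built this way could, for instance, be handed to a scatter plot as its per-point colors.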
-
pyGenClean.RelatedSamples.find_related_samples.runCommand(command)
Run a command.
Parameters: command (list) – the command to run.
Tries to run a command. If it fails, raises a ProgramError. This function uses the subprocess module.
Warning
The variable command should be a list of strings (no other type).
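A minimal sketch of this pattern, assuming subprocess.check_output is an acceptable way to run the command (the documentation only says that the subprocess module is used). The ProgramError class is redefined here to keep the example self-contained, and the error message format is hypothetical.

```python
import subprocess


class ProgramError(Exception):
    """Stand-in for the module's ProgramError (sketch only)."""

    def __init__(self, msg):
        Exception.__init__(self, msg)
        self.message = str(msg)


def run_command(command):
    """Run `command` (a list of strings) and return its output as bytes."""
    try:
        # A list of strings avoids any shell interpretation of the command.
        return subprocess.check_output(command, stderr=subprocess.STDOUT)
    except (subprocess.CalledProcessError, OSError):
        # Raised on a non-zero exit code or when the program is not found.
        raise ProgramError("could not run: " + " ".join(command))
```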
-
pyGenClean.RelatedSamples.find_related_samples.runGenome(bfile, options)
Runs the genome command from plink.
Parameters:
- bfile (str) – the input file prefix.
- options (argparse.Namespace) – the options.
Returns: the name of the genome file.
Runs Plink with the genome option. If the user asks for SGE (options.sge is True), a frequency file is first created by plink. Then, the input files are split into files of options.line_per_file_for_sge lines and Plink is called (using the genome option) on the cluster using SGE (runGenomeSGE()). After the analysis, Plink’s output files and logs are merged using mergeGenomeLogFiles().
-
pyGenClean.RelatedSamples.find_related_samples.runGenomeSGE(bfile, freqFile, nbJob, outPrefix, options)
Runs the genome command from plink, on SGE.
Parameters:
- bfile (str) – the prefix of the input file.
- freqFile (str) – the name of the frequency file (from Plink).
- nbJob (int) – the number of jobs to launch.
- outPrefix (str) – the prefix of all the output files.
- options (argparse.Namespace) – the options.
Runs Plink with the genome option on the cluster (using SGE).
-
pyGenClean.RelatedSamples.find_related_samples.safe_main()
A safe version of the main function (that catches ProgramError).
-
pyGenClean.RelatedSamples.find_related_samples.selectSNPsAccordingToLD(options)
Compute LD using Plink.
Parameters: options (argparse.Namespace) – the options.
Returns: the name of the output file (from Plink).
pyGenClean.RelatedSamples.merge_related_samples module
-
exception pyGenClean.RelatedSamples.merge_related_samples.ProgramError(msg)
Bases: exceptions.Exception
An Exception raised in case of a problem.
Parameters: msg (str) – the message to print to the user before exiting.
-
pyGenClean.RelatedSamples.merge_related_samples.checkArgs(args)
Checks the arguments and options.
Parameters: args (argparse.Namespace) – an object containing the options of the program.
Returns: True if everything was OK.
If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr and the program exits with code 1.
-
pyGenClean.RelatedSamples.merge_related_samples.main(argString=None)
The main function of the module.
Parameters: argString (list) – the options.
-
pyGenClean.RelatedSamples.merge_related_samples.merge_related_samples(file_name, out_prefix, no_status)
Merge related samples.
Parameters: file_name, out_prefix, no_status
In the output file, there is a pair of samples per line. Hence, one can find related individuals by merging overlapping pairs.
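Merging overlapping pairs amounts to finding connected groups of samples. The sketch below uses a union-find structure for this; it is one possible approach and not necessarily the one used by the module, and the function name merge_pairs and the sample identifiers are hypothetical.

```python
def merge_pairs(pairs):
    """Group samples linked, directly or indirectly, by related pairs."""
    parent = {}

    def find(sample):
        # Register unseen samples as their own group representative.
        parent.setdefault(sample, sample)
        while parent[sample] != sample:
            # Path halving keeps later lookups short.
            parent[sample] = parent[parent[sample]]
            sample = parent[sample]
        return sample

    for first, second in pairs:
        root_first, root_second = find(first), find(second)
        if root_first != root_second:
            parent[root_second] = root_first

    groups = {}
    for sample in list(parent):
        groups.setdefault(find(sample), set()).add(sample)
    # Sorted output makes the result deterministic.
    return sorted(sorted(group) for group in groups.values())


# Illustrative identifiers: A-B and B-C overlap, so A, B and C are merged.
example_groups = merge_pairs([("A", "B"), ("B", "C"), ("D", "E")])
```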
-
pyGenClean.RelatedSamples.merge_related_samples.parseArgs(argString=None)
Parses the command line options and arguments.
Parameters: argString (list) – the options.
Returns: An argparse.Namespace object created by the argparse module. It contains the values of the different options.
=============  ======  =====================================================================
Options        Type    Description
=============  ======  =====================================================================
--ibs-related  string  The input file containing related individuals according to IBS value.
--no-status    bool    The input file doesn’t have a status column.
--out          string  The prefix of the output files.
=============  ======  =====================================================================
Note
No option check is done here (except for the ones automatically done by argparse). Those need to be done elsewhere (see checkArgs()).