pyGenClean.RelatedSamples package

For more information about how to use this module, refer to the Related Samples Module.

Module contents

Submodules

pyGenClean.RelatedSamples.find_related_samples module

exception pyGenClean.RelatedSamples.find_related_samples.ProgramError(msg)

Bases: exceptions.Exception

An Exception raised in case of a problem.

Parameters:	msg (str) – the message to print to the user before exiting.

pyGenClean.RelatedSamples.find_related_samples.checkArgs(args)

Checks the arguments and options.

Parameters:	args (argparse.Namespace) – an object containing the options of the program.
Returns:	`True` if everything was OK.

If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr, and the program exits with code 1.

pyGenClean.RelatedSamples.find_related_samples.checkNumberOfSNP(fileName, minimumNumber)

Checks that there are enough SNPs in the file (at least minimumNumber).

Parameters:

fileName (str) – the name of the file.
minimumNumber (int) – the minimum number of markers that need to be in the file.

Returns:	`True` if there are enough markers in the file, `False` otherwise.

Reads the number of markers (number of lines) in a file.
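The line-counting check described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the name `has_enough_markers` is hypothetical, and the sketch assumes that "enough" means "at least `minimum_number` lines".

```python
def has_enough_markers(file_name, minimum_number):
    """Return True if the file holds at least `minimum_number` lines.

    Assumption: one marker per line, and the comparison is inclusive.
    """
    with open(file_name) as handle:
        # Count lines lazily so large marker files are not loaded in memory.
        nb_markers = sum(1 for _ in handle)
    return nb_markers >= minimum_number
```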

pyGenClean.RelatedSamples.find_related_samples.extractRelatedIndividuals(fileName, outPrefix, ibs2_ratio_threshold)

Extracts related individuals according to the IBS2* ratio.

Parameters:

fileName (str) – the name of the input file.
outPrefix (str) – the prefix of the output files.
ibs2_ratio_threshold (float) – the IBS2 ratio threshold (determines whether a sample pair is related).

Returns:	a `numpy.recarray` data set containing (for each related sample pair) the `ibs2 ratio`, `Z1`, `Z2` and the type of relatedness.

Reads a genome file (provided by runGenome()) and extracts related sample pairs according to the IBS2 ratio.

A genome file contains at least the following information for each sample pair:

FID1: the family ID of the first sample in the pair.
IID1: the individual ID of the first sample in the pair.
FID2: the family ID of the second sample in the pair.
IID2: the individual ID of the second sample in the pair.
Z0: the probability that \(IBD = 0\).
Z1: the probability that \(IBD = 1\).
Z2: the probability that \(IBD = 2\).
HOMHOM: the number of \(IBS = 0\) SNP pairs used in PPC test.
HETHET: the number of \(IBS = 2\) het/het SNP pairs in PPC test.

The IBS2 ratio is computed using the following formula:

\[\textrm{IBS2 ratio} = \frac{\textrm{HETHET}} {\textrm{HOMHOM} + \textrm{HETHET}}\]

If the IBS2 ratio is higher than the threshold, the samples in the pair are related. The following values help in finding the relatedness of the sample pair.

Values	Relation	Code
\(0.17 \leq z_0 \leq 0.33\) and \(0.40 \leq z_1 \leq 0.60\)	Full-sibs	1
\(0.40 \leq z_0 \leq 0.60\) and \(0.40 \leq z_1 \leq 0.60\)	Half-sibs or Grand-parent-Child or Uncle-Nephew	2
\(z_0 \leq 0.05\) and \(z_1 \geq 0.95\) and \(z_2 \leq 0.05\)	Parent-Child	3
\(z_0 \leq 0.05\) and \(z_1 \leq 0.05\) and \(z_2 \geq 0.95\)	Twins or Duplicated samples	4
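As an illustration, the formula and the table above translate into the following sketch. The function names are hypothetical, the rows are tested in table order, and returning `None` for a pair that matches no row (or when both counts are zero) is an assumption made here, not documented behaviour.

```python
def ibs2_ratio(hom_hom, het_het):
    """IBS2 ratio = HETHET / (HOMHOM + HETHET)."""
    total = hom_hom + het_het
    if total == 0:
        return None  # assumption: ratio is undefined without informative pairs
    return het_het / total


def relatedness_code(z0, z1, z2):
    """Map (Z0, Z1, Z2) to the relation code of the table, or None."""
    if 0.17 <= z0 <= 0.33 and 0.40 <= z1 <= 0.60:
        return 1  # full-sibs
    if 0.40 <= z0 <= 0.60 and 0.40 <= z1 <= 0.60:
        return 2  # half-sibs, grand-parent-child or uncle-nephew
    if z0 <= 0.05 and z1 >= 0.95 and z2 <= 0.05:
        return 3  # parent-child
    if z0 <= 0.05 and z1 <= 0.05 and z2 >= 0.95:
        return 4  # twins or duplicated samples
    return None  # unknown relation
```

A pair would then be reported as related when `ibs2_ratio(...)` exceeds the chosen threshold, and `relatedness_code(...)` would suggest the kind of relation.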

pyGenClean.RelatedSamples.find_related_samples.extractSNPs(snpsToExtract, options)

Extract markers using Plink.

Parameters:

snpsToExtract (str) – the name of the file containing markers to extract.
options (argparse.Namespace) – the options.

Returns:	the prefix of the output files.

pyGenClean.RelatedSamples.find_related_samples.main(argString=None)

The main function of this module.

Parameters:	argString (list) – the options.

Here are the steps for this function:

Prints the options.
Uses Plink to extract markers according to LD (selectSNPsAccordingToLD()).
Checks if there are enough markers after pruning (checkNumberOfSNP()). If not, then quits.
Extracts markers according to LD (extractSNPs()).
Runs Plink with the genome option (runGenome()). Quits here if the user asked only for the genome file.
Finds related individuals and gets values for plotting (extractRelatedIndividuals()).
Plots Z1 as a function of the IBS2 ratio for related individuals (plot_related_data()).
Plots Z2 as a function of the IBS2 ratio for related individuals (plot_related_data()).

pyGenClean.RelatedSamples.find_related_samples.mergeGenomeLogFiles(outPrefix, nbSet)

Merge genome and log files together.

Parameters:

outPrefix (str) – the prefix of the output files.
nbSet (int) – the number of sets of files to merge together.

Returns:	the name of the output file (the `genome` file).

After merging, the files are deleted to save space.

pyGenClean.RelatedSamples.find_related_samples.parseArgs(argString=None)

Parses the command line options and arguments.

Parameters:	argString (list) – the options.
Returns:	An `argparse.Namespace` object created by the `argparse` module. It contains the values of the different options.

Options	Type	Description
`--bfile`	string	The input file prefix (Plink binary file).
`--genome-only`	bool	Only create the genome file.
`--min-nb-snp`	int	The minimum number of markers needed to compute IBS values.
`--indep-pairwise`	string	Three numbers: window size, window shift and the r2 threshold.
`--maf`	string	Restrict to SNPs with MAF >= threshold.
`--ibs2-ratio`	float	The initial IBS2* ratio (the minimum value to show in the plot).
`--sge`	bool	Use SGE for parallelization.
`--sge-walltime`	int	The time limit (for clusters).
`--sge-nodes`	int int	Two INTs (number of nodes and number of processors per node).
`--line-per-file-for-sge`	int	The number of lines per file for SGE task array.
`--out`	string	The prefix of the output files.

Note

No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see checkArgs()).
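For orientation, a parser accepting the options of the table above could be declared roughly as below. This is a sketch only: `build_parser` is a hypothetical name, no default values are reproduced (the real ones are not stated here), and the use of `nargs=3` and `nargs=2` for the multi-value options is an assumption about how they are collected.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (defaults omitted)."""
    parser = argparse.ArgumentParser(description="Finds related samples.")
    parser.add_argument("--bfile", type=str)
    parser.add_argument("--genome-only", action="store_true")
    parser.add_argument("--min-nb-snp", type=int)
    # Assumption: the three values are kept as strings and passed to Plink.
    parser.add_argument("--indep-pairwise", type=str, nargs=3)
    parser.add_argument("--maf", type=str)
    parser.add_argument("--ibs2-ratio", type=float)
    parser.add_argument("--sge", action="store_true")
    parser.add_argument("--sge-walltime", type=int)
    # Assumption: number of nodes, then number of processors per node.
    parser.add_argument("--sge-nodes", type=int, nargs=2)
    parser.add_argument("--line-per-file-for-sge", type=int)
    parser.add_argument("--out", type=str)
    return parser
```

Calling `build_parser().parse_args(argString)` returns an `argparse.Namespace`; dashes in option names become underscores in attribute names (for example `args.min_nb_snp`).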

pyGenClean.RelatedSamples.find_related_samples.plot_related_data(x, y, code, ylabel, fileName, options)

Plots Z1 or Z2 as a function of the IBS2* ratio.

Parameters:

x (numpy.array of floats) – the x axis of the plot (IBS2 ratio).
y (numpy.array of floats) – the y axis of the plot (either z1 or z2).
code (numpy.array) – the code of the relatedness of each sample pair.
ylabel (str) – the label of the y axis (either z1 or z2).
fileName (str) – the name of the output file.
options (argparse.Namespace) – the options.

There are four different relation codes (represented by four different colors in the plots):

Code	Relation	Color
1	Full-sibs	`#CC0000`
2	Half-sibs or Grand-parent-Child or Uncle-Nephew	`#0099CC`
3	Parent-Child	`#FF8800`
4	Twins or Duplicated samples	`#9933CC`

Sample pairs with unknown relation are plotted using #669900 as color.
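The color assignment above amounts to a lookup with a fallback, as in this small sketch. The names are hypothetical, and it assumes the codes are stored as integers and that any value other than 1 to 4 marks an unknown relation.

```python
# Colors taken from the table above; names are illustrative only.
RELATION_COLORS = {
    1: "#CC0000",  # full-sibs
    2: "#0099CC",  # half-sibs, grand-parent-child or uncle-nephew
    3: "#FF8800",  # parent-child
    4: "#9933CC",  # twins or duplicated samples
}
UNKNOWN_COLOR = "#669900"


def colors_for_codes(codes):
    """Return one color per sample pair, falling back to the unknown color."""
    return [RELATION_COLORS.get(code, UNKNOWN_COLOR) for code in codes]
```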

pyGenClean.RelatedSamples.find_related_samples.runCommand(command)

Run a command.

Parameters:	command (list) – the command to run.

Tries to run a command. If it fails, raises a ProgramError. This function uses the subprocess module.

Warning

The variable command should be a list of strings (no other type).
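A minimal sketch of this pattern, written for Python 3, is shown below. It is not the module's implementation: the `ProgramError` class is redefined locally for self-containment, `run_command` is a hypothetical name, and the choice of `subprocess.check_output` is an assumption.

```python
import subprocess


class ProgramError(Exception):
    """Local stand-in for the module's ProgramError."""

    def __init__(self, msg):
        super().__init__(msg)
        self.message = str(msg)


def run_command(command):
    """Run `command` (a list of strings) and return its combined output."""
    try:
        output = subprocess.check_output(command, stderr=subprocess.STDOUT)
    except (subprocess.CalledProcessError, OSError) as exc:
        # Non-zero exit status or missing executable.
        raise ProgramError("could not run: " + " ".join(command)) from exc
    return output.decode(errors="replace")
```

Passing a list rather than a single string avoids shell interpretation of the arguments, which is why every element must be a string.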

pyGenClean.RelatedSamples.find_related_samples.runGenome(bfile, options)

Runs the genome command from Plink.

Parameters:

bfile (str) – the input file prefix.
options (argparse.Namespace) – the options.

Returns:	the name of the `genome` file.

Runs Plink with the genome option. If the user asks for SGE (options.sge is True), a frequency file is first created by Plink. Then, the input files are split into chunks of options.line_per_file_for_sge lines and Plink is called (using the genome option) on the cluster using SGE (runGenomeSGE()). After the analysis, Plink’s output files and logs are merged using mergeGenomeLogFiles().

pyGenClean.RelatedSamples.find_related_samples.runGenomeSGE(bfile, freqFile, nbJob, outPrefix, options)

Runs the genome command from Plink, on SGE.

Parameters:

bfile (str) – the prefix of the input file.
freqFile (str) – the name of the frequency file (from Plink).
nbJob (int) – the number of jobs to launch.
outPrefix (str) – the prefix of all the output files.
options (argparse.Namespace) – the options.

Runs Plink with the genome options on the cluster (using SGE).

pyGenClean.RelatedSamples.find_related_samples.safe_main(): A safe version of the main function (that catches ProgramError).

pyGenClean.RelatedSamples.find_related_samples.selectSNPsAccordingToLD(options)

Compute LD using Plink.

Parameters:	options (argparse.Namespace) – the options.
Returns:	the name of the output file (from Plink).

pyGenClean.RelatedSamples.find_related_samples.splitFile(inputFileName, linePerFile, outPrefix)

Split a file.

Parameters:

inputFileName (str) – the name of the input file.
linePerFile (int) – the number of lines per file (after splitting).
outPrefix (str) – the prefix of the output files.

Returns:	the number of created temporary files.

Splits a file (inputFileName) into multiple files containing at most linePerFile lines.
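The splitting behaviour can be sketched as follows. This is an illustration under assumptions: `split_file` is a hypothetical name and the `<prefix>_tmp.<n>` naming scheme for the chunks is invented for the example, since the real file names are not given here.

```python
def split_file(input_file_name, line_per_file, out_prefix):
    """Split a file into chunks of at most `line_per_file` lines.

    Returns the number of chunk files written.
    """
    nb_files = 0
    out_handle = None
    try:
        with open(input_file_name) as in_handle:
            for i, line in enumerate(in_handle):
                if i % line_per_file == 0:
                    # Start a new chunk every `line_per_file` lines.
                    if out_handle is not None:
                        out_handle.close()
                    nb_files += 1
                    # Hypothetical naming scheme for the chunk files.
                    name = "{}_tmp.{}".format(out_prefix, nb_files)
                    out_handle = open(name, "w")
                out_handle.write(line)
    finally:
        if out_handle is not None:
            out_handle.close()
    return nb_files
```

With a five-line input and `line_per_file=2`, this writes three files holding two, two and one lines.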

pyGenClean.RelatedSamples.merge_related_samples module

exception pyGenClean.RelatedSamples.merge_related_samples.ProgramError(msg)

Bases: exceptions.Exception

An Exception raised in case of a problem.

Parameters:	msg (str) – the message to print to the user before exiting.

pyGenClean.RelatedSamples.merge_related_samples.checkArgs(args)

Checks the arguments and options.

Parameters:	args (argparse.Namespace) – an object containing the options of the program.
Returns:	`True` if everything was OK.

If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr, and the program exits with code 1.

pyGenClean.RelatedSamples.merge_related_samples.main(argString=None)

The main function of the module.

Parameters:	argString (list) – the options.

pyGenClean.RelatedSamples.merge_related_samples.merge_related_samples(file_name, out_prefix, no_status)

Merge related samples.

Parameters:

file_name (str) – the name of the input file.
out_prefix (str) – the prefix of the output files.
no_status (boolean) – is there a status column in the file?

In the output file, there is one pair of samples per line. Hence, one can find related individuals by merging overlapping pairs.
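Merging overlapping pairs is equivalent to finding connected components in a graph whose edges are the pairs. The sketch below uses a small union-find structure for this. It illustrates the idea only: `merge_pairs` is a hypothetical name, and the technique is not necessarily the one used by the module.

```python
def merge_pairs(pairs):
    """Group samples that are linked, directly or not, by related pairs."""
    parent = {}

    def find(sample):
        # Follow parent links up to the group representative,
        # shortening the path along the way.
        parent.setdefault(sample, sample)
        while parent[sample] != sample:
            parent[sample] = parent[parent[sample]]
            sample = parent[sample]
        return sample

    for first, second in pairs:
        root_first, root_second = find(first), find(second)
        if root_first != root_second:
            parent[root_second] = root_first  # merge the two groups

    groups = {}
    for sample in list(parent):
        groups.setdefault(find(sample), set()).add(sample)
    return sorted(sorted(group) for group in groups.values())
```

For example, the pairs (A, B), (B, C) and (D, E) yield two groups: A, B, C and D, E.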

pyGenClean.RelatedSamples.merge_related_samples.parseArgs(argString=None)

Parses the command line options and arguments.

Parameters:	argString (list) – the options.
Returns:	An `argparse.Namespace` object created by the `argparse` module. It contains the values of the different options.

Options	Type	Description
`--ibs-related`	string	The input file containing related individuals according to IBS value.
`--no-status`	bool	The input file doesn’t have a `status` column.
`--out`	string	The prefix of the output files.

Note

No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see checkArgs()).

pyGenClean.RelatedSamples.merge_related_samples.safe_main(): A safe version of the main function (that catches ProgramError).
