pyGenClean.Ethnicity package
For more information about how to use this module, refer to the Ethnicity Module.
Module contents
Submodules
pyGenClean.Ethnicity.check_ethnicity module
-
exception pyGenClean.Ethnicity.check_ethnicity.ProgramError(msg)
Bases: exceptions.Exception
An Exception raised in case of a problem.
Parameters: msg (str) – the message to print to the user before exiting.
-
pyGenClean.Ethnicity.check_ethnicity.allFileExists(fileList)
Check that all files exist.
Parameters: fileList (list) – the list of files to check.
Checks whether all the files in fileList exist.
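A minimal sketch of such a check, written for illustration only. The helper name find_missing_files is hypothetical and not part of pyGenClean, and how the real function reports a missing file is not stated here; this version simply returns the missing names.

```python
import os


def find_missing_files(file_list):
    """Return the entries of file_list that are not existing regular files."""
    return [name for name in file_list if not os.path.isfile(name)]
```

An empty result means every file in the list was found.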
-
pyGenClean.Ethnicity.check_ethnicity.checkArgs(args)
Checks the arguments and options.
Parameters: args (argparse.Namespace) – an object containing the options of the program.
Returns: True if everything was OK.
If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr and the program exits with code 1.
-
pyGenClean.Ethnicity.check_ethnicity.combinePlinkBinaryFiles(prefixes, outPrefix)
Combine Plink binary files.
Uses Plink to merge a list of binary files (given as a list of prefixes, which are strings) and creates the final data set, which has outPrefix as its prefix.
-
pyGenClean.Ethnicity.check_ethnicity.computeFrequency(prefix, outPrefix)
Compute the frequency using Plink.
Uses Plink to compute the frequency of all the markers in the prefix binary file.
-
pyGenClean.Ethnicity.check_ethnicity.compute_eigenvalues(in_prefix, out_prefix)
Computes the Eigenvalues using smartpca from Eigensoft.
Creates a “parameter file” used by smartpca and runs it.
-
pyGenClean.Ethnicity.check_ethnicity.createMDSFile(nb_components, inPrefix, outPrefix, genomeFileName)
Creates an MDS file using Plink.
Using Plink, computes the MDS values for each individual using inPrefix, genomeFileName and the number of components. The results are saved using the outPrefix prefix.
-
pyGenClean.Ethnicity.check_ethnicity.createPopulationFile(inputFiles, labels, outputFileName)
Creates a population file.
inputFiles is a list of tfam files composed of samples. Each of those tfam files has a label associated with it (the name of the population).
The output file consists of one row per sample, with the following three columns: the family ID, the individual ID and the population of each sample.
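The three-column layout can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the helper names are hypothetical, tfam content is passed in as lists of lines, the first two whitespace-separated tfam columns are taken as the family ID and the individual ID, and the tab separator is an assumption.

```python
def build_population_rows(tfam_contents, labels):
    """Pair every sample of each tfam file with the label of that file.

    tfam_contents: one list of lines per tfam file.
    labels: one population label per tfam file, in the same order.
    """
    rows = []
    for lines, label in zip(tfam_contents, labels):
        for line in lines:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            # Columns 1 and 2 of a tfam file: family ID, individual ID.
            rows.append((fields[0], fields[1], label))
    return rows


def format_population_rows(rows):
    """Format the rows as text (tab-separated, an assumed delimiter)."""
    return "".join("\t".join(row) + "\n" for row in rows)
```

For example, two tfam files labelled CEU and SOURCE give one row per sample, each tagged with the label of the file it came from.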
-
pyGenClean.Ethnicity.check_ethnicity.create_scree_plot(in_filename, out_filename, plot_title)
Creates a scree plot using smartpca results.
-
pyGenClean.Ethnicity.check_ethnicity.excludeSNPs(inPrefix, outPrefix, exclusionFileName)
Exclude some SNPs using Plink.
Using Plink, excludes a list of markers from inPrefix and saves the results in outPrefix. The list of markers is in exclusionFileName.
-
pyGenClean.Ethnicity.check_ethnicity.extractSNPs(snpToExtractFileNames, referencePrefixes, popNames, outPrefix, runSGE, options)
Extract a list of SNPs using Plink.
Parameters:
- snpToExtractFileNames (list) – the names of the files that contain the markers to extract from the original data set.
- referencePrefixes (list) – a list containing the three reference population prefixes (the original data sets).
- popNames (list) – a list containing the three reference population names.
- outPrefix (str) – the prefix of the output file.
- runSGE (boolean) – whether to use SGE or not.
- options (argparse.Namespace) – the options.
Using Plink, extracts a set of markers from a list of prefixes.
-
pyGenClean.Ethnicity.check_ethnicity.findFlippedSNPs(frqFile1, frqFile2, outPrefix)
Find flipped SNPs and flip them in the data.
By reading two frequency files (frqFile1 and frqFile2), it finds a list of markers that need to be flipped so that the first file becomes comparable with the second one. It also finds markers that need to be removed.
A marker needs to be flipped in one of the two data sets if the two markers are not comparable (that is, they do not have the same minor allele), but become comparable if we flip one of them.
A marker will be removed if it is all homozygous in at least one data set. It will also be removed if it’s impossible to determine the phase of the marker (e.g. if the two alleles are A and T or C and G).
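The flip and removal rules can be sketched as a small classifier. This is a simplified illustration, not the module's implementation: the function name is hypothetical, it compares the allele pairs of the two data sets rather than the minor allele specifically, and it treats any allele other than A, C, G or T (for instance a missing allele code in a monomorphic marker) as grounds for removal.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_marker(alleles_1, alleles_2):
    """Return "keep", "flip" or "remove" for one marker.

    alleles_1 and alleles_2 are the two alleles of the marker in the
    first and in the second data set.
    """
    first, second = set(alleles_1), set(alleles_2)
    nucleotides = set(COMPLEMENT)
    # All homozygous (or unknown allele) in at least one data set.
    if len(first & nucleotides) != 2 or len(second & nucleotides) != 2:
        return "remove"
    flipped = {COMPLEMENT[allele] for allele in first}
    # A/T and C/G markers are their own complement: strand is ambiguous.
    if flipped == first or {COMPLEMENT[a] for a in second} == second:
        return "remove"
    if first == second:
        return "keep"
    if flipped == second:
        return "flip"
    # Alleles disagree even after flipping.
    return "remove"
```

Here A/G against T/C is classified as a flip, while A/T against A/T is removed because flipping leaves the alleles unchanged.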
-
pyGenClean.Ethnicity.check_ethnicity.findOverlappingSNPsWithReference(prefix, referencePrefixes, referencePopulations, outPrefix)
Find the overlapping SNPs in 4 different data sets.
It starts by reading the bim file of the source data set (prefix.bim). It finds all the markers (excluding the duplicated ones). Then it reads all of the reference population bim files (referencePrefixes.bim) and finds all the markers that were found in the source data set.
It creates three output files:
- outPrefix.ref_snp_to_extract: the names of the markers that need to be extracted from the three reference panels.
- outPrefix.source_snp_to_extract: the names of the markers that need to be extracted from the source panel.
- outPrefix.update_names: a file (readable by Plink) that will help in changing the names of the selected markers in the reference panels, so that they become comparable with the source panel.
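One way to picture the overlap search is sketched below. Several points are assumptions made for illustration: markers are matched by chromosome and position (suggested by the need for an update_names file, but not stated above), duplicated positions are dropped in every panel, only markers present in all reference panels are kept, and the function names are hypothetical.

```python
from collections import Counter


def read_bim(lines):
    """Map (chromosome, position) to marker name, dropping duplicates."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # not a complete bim row
        # bim columns: chromosome, name, genetic distance, position, a1, a2
        records.append(((fields[0], fields[3]), fields[1]))
    counts = Counter(key for key, _ in records)
    return {key: name for key, name in records if counts[key] == 1}


def find_overlap(source_lines, reference_lines_list):
    """Return the shared source marker names and the (old, new) renames."""
    source = read_bim(source_lines)
    references = [read_bim(lines) for lines in reference_lines_list]
    shared = set(source)
    for reference in references:
        shared &= set(reference)
    source_names = sorted(source[key] for key in shared)
    # Reference markers whose name differs from the source name.
    renames = sorted({
        (reference[key], source[key])
        for reference in references
        for key in shared
        if reference[key] != source[key]
    })
    return source_names, renames
```

The rename pairs correspond to the role of the update_names file: each pair gives a reference marker name and the source name it should take.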
-
pyGenClean.Ethnicity.check_ethnicity.find_the_outliers(mds_file_name, population_file_name, ref_pop_name, multiplier, out_prefix)
Finds the outliers of a given population.
Parameters:
- mds_file_name (str) – the name of the mds file.
- population_file_name (str) – the name of the population file.
- ref_pop_name (str) – the name of the reference population for which to find outliers.
- multiplier (float) – the multiplier of the cluster standard deviation to modify the strictness of the outlier removal procedure.
- out_prefix (str) – the prefix of the output file.
Uses the pyGenClean.Ethnicity.find_outliers module to find outliers. It requires the mds file created by createMDSFile() and the population file created by createPopulationFile().
-
pyGenClean.Ethnicity.check_ethnicity.flipSNPs(inPrefix, outPrefix, flipFileName)
Flip SNPs using Plink.
Using Plink, flips a set of markers in inPrefix and saves the results in outPrefix. The list of markers to be flipped is in flipFileName.
-
pyGenClean.Ethnicity.check_ethnicity.main(argString=None)
The main function.
Parameters: argString (list) – the options.
These are the steps of this module:
- Prints the options.
- Finds the overlapping markers between the three reference panels and the source panel (findOverlappingSNPsWithReference()).
- Extracts the required markers from all the data sets (extractSNPs()).
- Renames the reference panels’ marker names so that they are the same as the source panel (for all populations) (renameSNPs()).
- Combines the three reference panels together (combinePlinkBinaryFiles()).
- Computes the frequency of all the markers from the reference and the source panels (computeFrequency()).
- Finds the markers to flip in the reference panel (when compared to the source panel) (findFlippedSNPs()).
- Excludes the unflippable markers from the reference and the source panels (excludeSNPs()).
- Flips the markers that need flipping in their reference panel (flipSNPs()).
- Combines the reference and the source panels (combinePlinkBinaryFiles()).
- Runs part of pyGenClean.RelatedSamples.find_related_samples on the combined data set (runRelatedness()).
- Creates the mds file from the combined data set and the result of the previous step (createMDSFile()).
- Creates the population file (createPopulationFile()).
- Plots the mds values (plotMDS()).
- Finds the outliers of a given reference population (find_the_outliers()).
- If required, computes the Eigenvalues using smartpca (compute_eigenvalues()).
- If required, creates a scree plot from smartpca results (create_scree_plot()).
-
pyGenClean.Ethnicity.check_ethnicity.parseArgs(argString=None)
Parses the command line options and arguments.
Parameters: argString (list) – the options.
Returns: An argparse.Namespace object created by the argparse module. It contains the values of the different options.

Options                   Type      Description
--bfile                   string    The input file prefix (Plink binary file).
--skip-ref-pops           bool      Perform the MDS computation, but skip the three reference panels.
--ceu-bfile               string    The input file prefix for the CEU population (Plink binary file).
--yri-bfile               string    The input file prefix for the YRI population (Plink binary file).
--jpt-chb-bfile           string    The input file prefix for the JPT-CHB population (Plink binary file).
--min-nb-snp              int       The minimum number of markers needed to compute IBS.
--indep-pairwise          string    Three numbers: window size, window shift and the r2 threshold.
--maf                     string    Restrict to SNPs with MAF >= threshold.
--sge                     bool      Use SGE for parallelization.
--sge-walltime            int       The time limit (for clusters).
--sge-nodes               int int   Two INTs (number of nodes and number of processors per node).
--ibs-sge-walltime        int       The time limit (for clusters) (for IBS).
--ibs-sge-nodes           int int   Two INTs (number of nodes and number of processors per node) (for IBS).
--line-per-file-for-sge   int       The number of lines per file for the SGE task array.
--nb-components           int       The number of components to compute.
--outliers-of             string    Finds the outliers of this population.
--multiplier              float     To find the outliers, we look for more than x times the cluster standard deviation.
--xaxis                   string    The component to use for the X axis.
--yaxis                   string    The component to use for the Y axis.
--format                  string    The output file format.
--title                   string    The title of the MDS plot.
--xlabel                  string    The label of the X axis.
--ylabel                  string    The label of the Y axis.
--create-scree-plot       bool      Computes Eigenvalues and creates a scree plot.
--scree-plot-title        string    The main title of the scree plot.
--out                     string    The prefix of the output files.

Note: No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see checkArgs()).
-
pyGenClean.Ethnicity.check_ethnicity.plotMDS(inputFileName, outPrefix, populationFileName, options)
Plots the MDS value.
Parameters:
- inputFileName (str) – the name of the mds file.
- outPrefix (str) – the prefix of the output files.
- populationFileName (str) – the name of the population file.
- options (argparse.Namespace) – the options.
Plots the mds value according to the inputFileName file (mds) and the populationFileName (the population file).
-
pyGenClean.Ethnicity.check_ethnicity.renameSNPs(inPrefix, updateFileName, outPrefix)
Updates the name of the SNPs using Plink.
Using Plink, changes the name of the markers in inPrefix using updateFileName. It saves the results in outPrefix.
-
pyGenClean.Ethnicity.check_ethnicity.runCommand(command)
Run a command.
Parameters: command (list) – the command to run.
Tries to run a command. If it fails, raises a ProgramError. This function uses the subprocess module.
Warning: The variable command should be a list of strings (no other type).
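The behaviour described above can be sketched with the standard subprocess module. This is an assumed shape, not the module's source: the local ProgramError stand-in, the function name run_command and the error message are all hypothetical, and the sketch is written for Python 3.

```python
import subprocess
import sys


class ProgramError(Exception):
    """Stand-in for the module's exception, for illustration only."""


def run_command(command):
    """Run command (a list of strings) and return its combined output."""
    try:
        # shell=False: the list is passed as-is, with no shell parsing.
        return subprocess.check_output(
            command, stderr=subprocess.STDOUT, shell=False
        )
    except (subprocess.CalledProcessError, OSError) as error:
        message = "could not run command: " + " ".join(command)
        raise ProgramError(message) from error
```

Passing the command as a list of strings avoids shell quoting issues, which is consistent with the warning above about the type of command.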
-
pyGenClean.Ethnicity.check_ethnicity.runRelatedness(inputPrefix, outPrefix, options)
Run the relatedness step of the data clean up.
Parameters:
- inputPrefix (str) – the prefix of the input file.
- outPrefix (str) – the prefix of the output file.
- options (argparse.Namespace) – the options.
Returns: the prefix of the new bfile.
Runs pyGenClean.RelatedSamples.find_related_samples using the inputPrefix files and the options options, and saves the results using the outPrefix prefix.
pyGenClean.Ethnicity.find_outliers module
-
exception pyGenClean.Ethnicity.find_outliers.ProgramError(msg)
Bases: exceptions.Exception
An Exception raised in case of a problem.
Parameters: msg (str) – the message to print to the user before exiting.
-
pyGenClean.Ethnicity.find_outliers.add_custom_options(parser)
Adds custom options to a parser.
Parameters: parser (argparse.ArgumentParser) – the parser to which to add options.
-
pyGenClean.Ethnicity.find_outliers.checkArgs(args)
Checks the arguments and options.
Parameters: args (argparse.Namespace) – an argparse.Namespace object containing the options of the program.
Returns: True if everything was OK.
If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr and the program exits with code 1.
-
pyGenClean.Ethnicity.find_outliers.find_outliers(mds, centers, center_info, ref_pop, options)
Finds the outliers for a given population.
Parameters:
- mds (numpy.recarray) – the mds information about each sample.
- centers (numpy.array) – the centers of the three reference population clusters.
- center_info (dict) – the label of the three reference population clusters.
- ref_pop (str) – the reference population for which we need the outliers.
- options (argparse.Namespace) – the options.
Returns: a set of outliers from the ref_pop population.
Performs a KMeans classification using the three centers from the three reference population clusters.
Samples are outliers of the required reference population (ref_pop) if:
- the sample is part of another reference population cluster;
- the sample is an outlier of the desired reference population (ref_pop).
A sample is an outlier of a given cluster \(C_j\) if the distance between this sample and the center of the cluster \(C_j\) (\(O_j\)) is bigger than a constant times the cluster’s standard deviation \(\sigma_j\).
\[\sigma_j = \sqrt{\frac{\sum{d(s_i,O_j)^2}}{||C_j|| - 1}}\]
where \(||C_j||\) is the number of samples in the cluster \(C_j\), and \(d(s_i,O_j)\) is the distance between the sample \(s_i\) and the center \(O_j\) of the cluster \(C_j\).
\[d(s_i, O_j) = \sqrt{(x_{O_j} - x_{s_i})^2 + (y_{O_j} - y_{s_i})^2}\]
Using a constant equal to one ensures we remove 100% of the outliers from the cluster. Using a constant of 1.6 or 1.9 ensures we remove 99% and 95% of outliers, respectively (an error rate of 1% and 5%, respectively).
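The two formulas can be checked with a short pure-Python sketch. It covers only the distance and standard deviation rule for a single cluster and leaves out the KMeans step; the function names and the use of plain (x, y) tuples are assumptions made for illustration.

```python
import math


def distance(sample, center):
    """Euclidean distance d(s_i, O_j) between a sample and a center."""
    return math.hypot(center[0] - sample[0], center[1] - sample[1])


def cluster_sigma(samples, center):
    """Standard deviation sigma_j of the distances to the center."""
    squared = sum(distance(sample, center) ** 2 for sample in samples)
    return math.sqrt(squared / (len(samples) - 1))


def cluster_outliers(samples, center, multiplier):
    """Samples farther from the center than multiplier * sigma_j."""
    threshold = multiplier * cluster_sigma(samples, center)
    return [s for s in samples if distance(s, center) > threshold]
```

With five samples at distance 0 or 1 from the center and one at distance 5, the sum of squared distances is 29, so \(\sigma_j = \sqrt{29/5} \approx 2.41\), and only the distant sample exceeds \(1.9\,\sigma_j \approx 4.58\).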
-
pyGenClean.Ethnicity.find_outliers.find_ref_centers(mds)
Finds the center of the three reference clusters.
Parameters: mds (numpy.recarray) – the mds information about each sample.
Returns: a tuple with a numpy.array containing the centers of the three reference population clusters as first element, and a dict containing the label of each of the three reference population clusters.
First, we extract the mds values of each of the three reference populations. Then, we compute the center of each of those clusters by computing the means.
\[\textrm{Cluster}_\textrm{pop} = \left( \frac{\sum_{i=1}^n x_i}{n}, \frac{\sum_{i=1}^n y_i}{n} \right)\]
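The center computation amounts to a per-population mean, sketched here in plain Python. The function name, the (x, y, population) tuple layout and the dictionary return value are assumptions for illustration; the documented function works on a numpy.recarray and returns an array plus a label dictionary.

```python
def reference_centers(samples, reference_pops=("CEU", "YRI", "JPT-CHB")):
    """Return {population: (mean_x, mean_y)} for the reference populations.

    samples: iterable of (x, y, population) tuples.
    """
    samples = list(samples)
    centers = {}
    for pop in reference_pops:
        points = [(x, y) for x, y, sample_pop in samples if sample_pop == pop]
        if not points:
            continue  # no sample from this population
        n = len(points)
        centers[pop] = (
            sum(x for x, _ in points) / n,
            sum(y for _, y in points) / n,
        )
    return centers
```

Samples from other populations, such as SOURCE, are ignored when computing the centers.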
-
pyGenClean.Ethnicity.find_outliers.main(argString=None)
The main function.
Parameters: argString (list of strings) – the options.
These are the steps of the module:
- Prints the options.
- Reads the population file (read_population_file()).
- Reads the mds file (read_mds_file()).
- Computes the three reference population clusters’ centers (find_ref_centers()).
- Computes three clusters according to the reference population clusters’ centers, and finds the outliers of a given reference population (find_outliers()). This step also produces three different plots.
- Writes outliers in a file (prefix.outliers).
-
pyGenClean.Ethnicity.find_outliers.overwrite_tex(tex_fn, nb_outliers, script_options)
Overwrites the TeX summary file with new values.
-
pyGenClean.Ethnicity.find_outliers.parseArgs(argString=None)
Parses the command line options and arguments.
Parameters: argString (list) – the options.
Returns: An argparse.Namespace object created by the argparse module. It contains the values of the different options.

Options             Type     Description
--mds               string   The MDS file from Plink.
--population-file   string   A population file from the pyGenClean.Ethnicity.check_ethnicity module.
--outliers-of       string   Finds the outliers of this population.
--multiplier        float    To find the outliers, we look for more than \(x\) times the cluster standard deviation.
--xaxis             string   The component to use for the X axis.
--yaxis             string   The component to use for the Y axis.
--format            string   The output file format (png, ps, or pdf formats are available).
--out               string   The prefix of the output files.

Note: No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see checkArgs()).
-
pyGenClean.Ethnicity.find_outliers.read_mds_file(file_name, c1, c2, pops)
Reads an MDS file.
Returns: a numpy.recarray (one sample per line) with the information about the family ID, the individual ID, the first component to extract, the second component to extract and the population.
The mds file is the result of Plink (as produced by the pyGenClean.Ethnicity.check_ethnicity module).
-
pyGenClean.Ethnicity.find_outliers.read_population_file(file_name)
Reads the population file.
Parameters: file_name (str) – the name of the population file.
Returns: a dict containing the population for each of the samples.
The population file should contain three columns:
- The family ID.
- The individual ID.
- The population of the sample (one of CEU, YRI, JPT-CHB or SOURCE).
The outliers are from the SOURCE population, when compared to one of the three reference populations (CEU, YRI or JPT-CHB).
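Parsing the three-column layout described above could look like the sketch below. The function name, the whitespace delimiter, the (family ID, individual ID) dictionary key and the use of ValueError are assumptions for illustration; the documented function takes a file name, while this sketch takes an iterable of lines.

```python
VALID_POPULATIONS = {"CEU", "YRI", "JPT-CHB", "SOURCE"}


def parse_population_lines(lines):
    """Map (family ID, individual ID) to the population of each sample."""
    populations = {}
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError("line %d: expected 3 columns" % number)
        family_id, individual_id, population = fields
        if population not in VALID_POPULATIONS:
            raise ValueError("line %d: unknown population" % number)
        populations[(family_id, individual_id)] = population
    return populations
```

Rejecting unknown labels early keeps a mistyped population from silently dropping out of the cluster computation later on.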
pyGenClean.Ethnicity.plot_eigenvalues module
-
exception pyGenClean.Ethnicity.plot_eigenvalues.ProgramError(msg)
Bases: exceptions.Exception
An Exception raised in case of a problem.
Parameters: msg (str) – the message to print to the user before exiting.
-
pyGenClean.Ethnicity.plot_eigenvalues.add_custom_options(parser)
Adds custom options to a parser.
Parameters: parser (argparse.ArgumentParser) – the parser to which the options will be added.
-
pyGenClean.Ethnicity.plot_eigenvalues.check_args(args)
Checks the arguments and options.
Parameters: args (argparse.Namespace) – an object containing the options and arguments of the program.
Returns: True if everything was OK.
If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr and the program exits with error code 1.
-
pyGenClean.Ethnicity.plot_eigenvalues.create_scree_plot(data, o_filename, options)
Creates the scree plot.
Parameters:
- data (numpy.ndarray) – the eigenvalues.
- o_filename (str) – the name of the output files.
- options (argparse.Namespace) – the options.
-
pyGenClean.Ethnicity.plot_eigenvalues.main(argString=None)
The main function.
The purpose of this module is to plot Eigenvectors provided by the Eigensoft software.
Here are the steps of this module:
- Reads the Eigenvector (read_eigenvalues()).
- Plots the Scree Plot (create_scree_plot()).
-
pyGenClean.Ethnicity.plot_eigenvalues.parse_args(argString=None)
Parses the command line options and arguments.
Returns: An argparse.Namespace object created by the argparse module. It contains the values of the different options.

Options              Type     Description
--evec               string   The EVEC file from EIGENSOFT.
--scree-plot-title   string   The main title of the scree plot.
--out                string   The name of the output file.

Note: No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see check_args()).
-
pyGenClean.Ethnicity.plot_eigenvalues.read_eigenvalues(i_filename)
Reads the eigenvalues from EIGENSOFT results.
Parameters: i_filename (str) – the name of the input file.
Returns: a numpy.ndarray containing the eigenvalues.