pyGenClean.Ethnicity package
For more information about how to use this module, refer to the Ethnicity Module.
Module contents
Submodules
pyGenClean.Ethnicity.check_ethnicity module
-
exception pyGenClean.Ethnicity.check_ethnicity.ProgramError(msg)
Bases: exceptions.Exception
An Exception raised in case of a problem.
Parameters: msg (str) – the message to print to the user before exiting.
-
pyGenClean.Ethnicity.check_ethnicity.allFileExists(fileList)
Checks that all files exist.
Parameters: fileList (list) – the list of files to check.
Checks that every file in fileList exists.
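The check described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the names all_files_exist and missing_files are hypothetical, and the source does not say what allFileExists() returns or does when a file is missing.

```python
import os


def all_files_exist(file_list):
    """Return True only if every path in file_list is an existing file."""
    return all(os.path.isfile(name) for name in file_list)


def missing_files(file_list):
    """Return the paths that are not existing files, in input order."""
    return [name for name in file_list if not os.path.isfile(name)]
```

Listing the missing paths separately makes it possible to name the offending files in an error message.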
-
pyGenClean.Ethnicity.check_ethnicity.checkArgs(args)
Checks the arguments and options.
Parameters: args (argparse.Namespace) – an object containing the options of the program.
Returns: True if everything was OK.
If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr and the program exits with code 1.
-
pyGenClean.Ethnicity.check_ethnicity.combinePlinkBinaryFiles(prefixes, outPrefix)
Combines Plink binary files.
Parameters: It uses Plink to merge a list of binary files (given as a list of prefix strings) and creates the final data set, which has outPrefix as its prefix.
-
pyGenClean.Ethnicity.check_ethnicity.computeFrequency(prefix, outPrefix)
Computes the frequency using Plink.
Parameters: Uses Plink to compute the frequency of all the markers in the prefix binary file.
-
pyGenClean.Ethnicity.check_ethnicity.compute_eigenvalues(in_prefix, out_prefix)
Computes the Eigenvalues using smartpca from Eigensoft.
Parameters: Creates a “parameter file” used by smartpca and runs it.
-
pyGenClean.Ethnicity.check_ethnicity.createMDSFile(nb_components, inPrefix, outPrefix, genomeFileName)
Creates an MDS file using Plink.
Parameters: Using Plink, computes the MDS values for each individual using the inPrefix, genomeFileName and the number of components. The results are saved using the outPrefix prefix.
-
pyGenClean.Ethnicity.check_ethnicity.createPopulationFile(inputFiles, labels, outputFileName)
Creates a population file.
Parameters: The inputFiles is in reality a list of tfam files composed of samples. For each of those tfam files, there is a label associated with it (representing the name of the population).
The output file consists of one row per sample, with the following three columns: the family ID, the individual ID and the population of each sample.
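The mapping from tfam files to population rows might look like the sketch below. It works on lines already read into memory. The function names and the tab separator are assumptions made for illustration. The only format facts relied on are that the first two whitespace-separated tfam columns are the family ID and the individual ID.

```python
def create_population_rows(tfam_contents, labels):
    """Pair each sample of each tfam with the label of its population.

    tfam_contents holds one list of lines per tfam file; labels gives one
    population name per tfam file, in the same order.
    """
    rows = []
    for lines, label in zip(tfam_contents, labels):
        for line in lines:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            # The first two tfam columns are the family and individual IDs.
            rows.append((fields[0], fields[1], label))
    return rows


def format_population_file(rows):
    """Render the rows as tab-separated text (assumed), one sample per line."""
    return "".join("\t".join(row) + "\n" for row in rows)
```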
-
pyGenClean.Ethnicity.check_ethnicity.create_scree_plot(in_filename, out_filename, plot_title)
Creates a scree plot using smartpca results.
Parameters:
-
pyGenClean.Ethnicity.check_ethnicity.excludeSNPs(inPrefix, outPrefix, exclusionFileName)
Excludes some SNPs using Plink.
Parameters: Using Plink, excludes a list of markers from inPrefix and saves the results in outPrefix. The list of markers is in exclusionFileName.
-
pyGenClean.Ethnicity.check_ethnicity.extractSNPs(snpToExtractFileNames, referencePrefixes, popNames, outPrefix, runSGE, options)
Extracts a list of SNPs using Plink.
Parameters: - snpToExtractFileNames (list) – the names of the files which contain the markers to extract from the original data set.
- referencePrefixes (list) – a list containing the three reference population prefixes (the original data sets).
- popNames (list) – a list containing the three reference population names.
- outPrefix (str) – the prefix of the output file.
- runSGE (boolean) – Whether using SGE or not.
- options (argparse.Namespace) – the options.
Using Plink, extract a set of markers from a list of prefixes.
-
pyGenClean.Ethnicity.check_ethnicity.findFlippedSNPs(frqFile1, frqFile2, outPrefix)
Finds flipped SNPs and flips them in the data.
Parameters: By reading two frequency files (frqFile1 and frqFile2), it finds a list of markers that need to be flipped so that the first file becomes comparable with the second one. It also finds markers that need to be removed.
A marker needs to be flipped in one of the two data sets if the two markers are not comparable (same minor allele), but become comparable if we flip one of them.
A marker will be removed if it is all homozygous in at least one data set. It will also be removed if it is impossible to determine the phase of the marker (e.g. if the two alleles are A and T or C and G).
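The flip/remove rules can be illustrated with a simplified sketch. It is not the module's algorithm: it compares the two allele sets rather than the minor alleles, and classify_marker is a hypothetical name. It assumes that a monomorphic marker is reported by Plink with a missing allele coded as 0, and that markers whose alleles stay incompatible after flipping are also removed.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_marker(alleles1, alleles2):
    """Classify a marker as 'keep', 'flip' or 'remove'.

    alleles1 and alleles2 are (minor, major) allele tuples for the same
    marker in the two data sets.
    """
    # '0' (all homozygous marker) or any non-ACGT code: cannot be compared.
    if any(allele not in COMPLEMENT for allele in alleles1 + alleles2):
        return "remove"
    set1, set2 = set(alleles1), set(alleles2)
    flipped1 = {COMPLEMENT[allele] for allele in set1}
    if set1 == flipped1:
        return "remove"  # A/T or C/G: flipping gives the same alleles
    if set1 == set2:
        return "keep"
    if flipped1 == set2:
        return "flip"
    return "remove"  # assumption: still incompatible after flipping
```

For example, a marker reported as A/G in one data set and T/C in the other is classified as a flip, whereas A/T in both is removed because flipping it changes nothing observable.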
-
pyGenClean.Ethnicity.check_ethnicity.findOverlappingSNPsWithReference(prefix, referencePrefixes, referencePopulations, outPrefix)
Finds the overlapping SNPs in 4 different data sets.
Parameters: It starts by reading the bim file of the source data set (prefix.bim). It finds all the markers (excluding the duplicated ones). Then it reads all of the reference population bim files (referencePrefixes.bim) and finds all the markers that were found in the source data set.
It creates three output files:
- outPrefix.ref_snp_to_extract: the names of the markers that need to be extracted from the three reference panels.
- outPrefix.source_snp_to_extract: the names of the markers that need to be extracted from the source panel.
- outPrefix.update_names: a file (readable by Plink) that will help in changing the names of the selected markers in the reference panels, so that they become comparable with the source panel.
-
pyGenClean.Ethnicity.check_ethnicity.find_the_outliers(mds_file_name, population_file_name, ref_pop_name, multiplier, out_prefix)
Finds the outliers of a given population.
Parameters: - mds_file_name (str) – the name of the mds file.
- population_file_name (str) – the name of the population file.
- ref_pop_name (str) – the name of the reference population for which to find outliers.
- multiplier (float) – the multiplier of the cluster standard deviation to modify the strictness of the outlier removal procedure.
- out_prefix (str) – the prefix of the output file.
Uses the pyGenClean.Ethnicity.find_outliers module to find outliers. It requires the mds file created by createMDSFile() and the population file created by createPopulationFile().
-
pyGenClean.Ethnicity.check_ethnicity.flipSNPs(inPrefix, outPrefix, flipFileName)
Flips SNPs using Plink.
Parameters: Using Plink, flips a set of markers in inPrefix and saves the results in outPrefix. The list of markers to be flipped is in flipFileName.
-
pyGenClean.Ethnicity.check_ethnicity.main(argString=None)
The main function.
Parameters: argString (list) – the options.
These are the steps of this module:
- Prints the options.
- Finds the overlapping markers between the three reference panels and the source panel (findOverlappingSNPsWithReference()).
- Extracts the required markers from all the data sets (extractSNPs()).
- Renames the reference panel’s marker names so that they are the same as the source panel (for all populations) (renameSNPs()).
- Combines the three reference panels together (combinePlinkBinaryFiles()).
- Computes the frequency of all the markers from the reference and the source panels (computeFrequency()).
- Finds the markers to flip in the reference panel (when compared to the source panel) (findFlippedSNPs()).
- Excludes the unflippable markers from the reference and the source panels (excludeSNPs()).
- Flips the markers that need flipping in their reference panel (flipSNPs()).
- Combines the reference and the source panels (combinePlinkBinaryFiles()).
- Runs part of pyGenClean.RelatedSamples.find_related_samples on the combined data set (runRelatedness()).
- Creates the mds file from the combined data set and the result of the previous step (createMDSFile()).
- Creates the population file (createPopulationFile()).
- Plots the mds values (plotMDS()).
- Finds the outliers of a given reference population (find_the_outliers()).
- If required, computes the Eigenvalues using smartpca (compute_eigenvalues()).
- If required, creates a scree plot from smartpca results (create_scree_plot()).
-
pyGenClean.Ethnicity.check_ethnicity.parseArgs(argString=None)
Parses the command line options and arguments.
Parameters: argString (list) – the options.
Returns: An argparse.Namespace object created by the argparse module. It contains the values of the different options.
Options (Type) – Description:
- --bfile (string) – The input file prefix (Plink binary file).
- --skip-ref-pops (bool) – Perform the MDS computation, but skip the three reference panels.
- --ceu-bfile (string) – The input file prefix for the CEU population (Plink binary file).
- --yri-bfile (string) – The input file prefix for the YRI population (Plink binary file).
- --jpt-chb-bfile (string) – The input file prefix for the JPT-CHB population (Plink binary file).
- --min-nb-snp (int) – The minimum number of markers needed to compute IBS.
- --indep-pairwise (string) – Three numbers: window size, window shift and the r2 threshold.
- --maf (string) – Restrict to SNPs with MAF >= threshold.
- --sge (bool) – Use SGE for parallelization.
- --sge-walltime (int) – The time limit (for clusters).
- --sge-nodes (int int) – Two INTs (number of nodes and number of processors per node).
- --ibs-sge-walltime (int) – The time limit (for clusters) (for IBS).
- --ibs-sge-nodes (int int) – Two INTs (number of nodes and number of processors per node) (for IBS).
- --line-per-file-for-sge (int) – The number of lines per file for the SGE task array.
- --nb-components (int) – The number of components to compute.
- --outliers-of (string) – Finds the outliers of this population.
- --multiplier (float) – To find the outliers, we look for more than x times the cluster standard deviation.
- --xaxis (string) – The component to use for the X axis.
- --yaxis (string) – The component to use for the Y axis.
- --format (string) – The output file format.
- --title (string) – The title of the MDS plot.
- --xlabel (string) – The label of the X axis.
- --ylabel (string) – The label of the Y axis.
- --create-scree-plot (bool) – Computes Eigenvalues and creates a scree plot.
- --scree-plot-title (string) – The main title of the scree plot.
- --out (string) – The prefix of the output files.
Note: No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see checkArgs()).
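A parser for a few of these options could be declared with argparse roughly as below. This is a sketch only: it covers a handful of options, sets no defaults because none are given here, and the function names build_parser and parse_args are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser for a small subset of the documented options."""
    parser = argparse.ArgumentParser(description="Ethnicity check (sketch).")
    parser.add_argument("--bfile", type=str,
                        help="The input file prefix (Plink binary file).")
    parser.add_argument("--skip-ref-pops", action="store_true",
                        help="Skip the three reference panels.")
    parser.add_argument("--nb-components", type=int,
                        help="The number of components to compute.")
    parser.add_argument("--multiplier", type=float,
                        help="Multiplier of the cluster standard deviation.")
    parser.add_argument("--out", type=str,
                        help="The prefix of the output files.")
    return parser


def parse_args(arg_string=None):
    """Parse arg_string (a list of strings), or sys.argv[1:] when None."""
    return build_parser().parse_args(arg_string)
```

argparse only enforces the declared types. Checks such as whether the input files exist are left to a separate validation step, as the note above indicates.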
-
pyGenClean.Ethnicity.check_ethnicity.plotMDS(inputFileName, outPrefix, populationFileName, options)
Plots the MDS values.
Parameters: - inputFileName (str) – the name of the mds file.
- outPrefix (str) – the prefix of the output files.
- populationFileName (str) – the name of the population file.
- options (argparse.Namespace) – the options.
Plots the mds values according to the inputFileName file (mds) and the populationFileName (the population file).
-
pyGenClean.Ethnicity.check_ethnicity.renameSNPs(inPrefix, updateFileName, outPrefix)
Updates the name of the SNPs using Plink.
Parameters: Using Plink, changes the name of the markers in inPrefix using updateFileName. It saves the results in outPrefix.
-
pyGenClean.Ethnicity.check_ethnicity.runCommand(command)
Runs a command.
Parameters: command (list) – the command to run.
Tries to run a command. If it fails, raises a ProgramError. This function uses the subprocess module.
Warning: The variable command should be a list of strings (no other type).
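The behaviour described for runCommand() can be approximated with the subprocess module as in this sketch. The ProgramError class here is a stand-in for the documented exception, run_command is a hypothetical name, and returning the captured output is an assumption: the source only says that a failure raises ProgramError.

```python
import subprocess


class ProgramError(Exception):
    """Stand-in for the documented exception; carries a user message."""

    def __init__(self, msg):
        super().__init__(msg)
        self.message = str(msg)


def run_command(command):
    """Run command (a list of strings) and return its output as text."""
    if not isinstance(command, list) or not all(
            isinstance(part, str) for part in command):
        raise ProgramError("the command must be a list of strings")
    try:
        # stderr is merged into stdout so nothing is lost on failure.
        return subprocess.check_output(command, stderr=subprocess.STDOUT,
                                       universal_newlines=True)
    except (subprocess.CalledProcessError, OSError):
        raise ProgramError("could not run: " + " ".join(command))
```

Passing the command as a list rather than a single string avoids going through a shell, which is why the warning above insists on a list of strings.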
-
pyGenClean.Ethnicity.check_ethnicity.runRelatedness(inputPrefix, outPrefix, options)
Runs the relatedness step of the data clean up.
Parameters: - inputPrefix (str) – the prefix of the input file.
- outPrefix (str) – the prefix of the output file.
- options (argparse.Namespace) – the options.
Returns: the prefix of the new bfile.
Runs pyGenClean.RelatedSamples.find_related_samples using the inputPrefix files and the options in options, and saves the results using the outPrefix prefix.
pyGenClean.Ethnicity.find_outliers module
-
exception pyGenClean.Ethnicity.find_outliers.ProgramError(msg)
Bases: exceptions.Exception
An Exception raised in case of a problem.
Parameters: msg (str) – the message to print to the user before exiting.
-
pyGenClean.Ethnicity.find_outliers.add_custom_options(parser)
Adds custom options to a parser.
Parameters: parser (argparse.ArgumentParser) – the parser to which to add options.
-
pyGenClean.Ethnicity.find_outliers.checkArgs(args)
Checks the arguments and options.
Parameters: args (argparse.Namespace) – an argparse.Namespace object containing the options of the program.
Returns: True if everything was OK.
If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr and the program exits with code 1.
-
pyGenClean.Ethnicity.find_outliers.find_outliers(mds, centers, center_info, ref_pop, options)
Finds the outliers for a given population.
Parameters: - mds (numpy.recarray) – the mds information about each sample.
- centers (numpy.array) – the centers of the three reference population clusters.
- center_info (dict) – the label of the three reference population clusters.
- ref_pop (str) – the reference population for which we need the outliers.
- options (argparse.Namespace) – the options.
Returns: a set of outliers from the ref_pop population.
Performs a KMeans classification using the three centers from the three reference population clusters.
Samples are outliers of the required reference population (ref_pop) if:
- the sample is part of another reference population cluster;
- the sample is an outlier of the desired reference population (ref_pop).
A sample is an outlier of a given cluster \(C_j\) if the distance between this sample and the center of the cluster \(C_j\) (\(O_j\)) is bigger than a constant times the cluster’s standard deviation \(\sigma_j\).
\[\sigma_j = \sqrt{\frac{\sum{d(s_i,O_j)^2}}{||C_j|| - 1}}\]
where \(||C_j||\) is the number of samples in the cluster \(C_j\), and \(d(s_i,O_j)\) is the distance between the sample \(s_i\) and the center \(O_j\) of the cluster \(C_j\).
\[d(s_i, O_j) = \sqrt{(x_{O_j} - x_{s_i})^2 + (y_{O_j} - y_{s_i})^2}\]
Using a constant equal to one ensures we remove 100% of the outliers from the cluster. Using a constant of 1.6 or 1.9 ensures we remove 99% and 95% of outliers, respectively (an error rate of 1% and 5%, respectively).
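The distance and standard deviation formulas translate directly into code. The sketch below uses plain Python tuples instead of the numpy structures the module works with, skips the KMeans assignment step, and uses hypothetical function names. It only shows how the threshold of multiplier times \(\sigma_j\) is applied to one cluster.

```python
import math


def distance(sample, center):
    """Euclidean distance d(s_i, O_j) between two (x, y) points."""
    return math.sqrt((center[0] - sample[0]) ** 2 +
                     (center[1] - sample[1]) ** 2)


def cluster_std(samples, center):
    """sigma_j: square root of the summed squared distances to the
    center, divided by the cluster size minus one."""
    total = sum(distance(sample, center) ** 2 for sample in samples)
    return math.sqrt(total / (len(samples) - 1))


def cluster_outliers(samples, center, multiplier):
    """Indices of the samples lying farther than multiplier * sigma_j
    from the center of their cluster."""
    threshold = multiplier * cluster_std(samples, center)
    return {index for index, sample in enumerate(samples)
            if distance(sample, center) > threshold}
```

A larger multiplier raises the threshold, so fewer samples are flagged.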
-
pyGenClean.Ethnicity.find_outliers.find_ref_centers(mds)
Finds the center of the three reference clusters.
Parameters: mds (numpy.recarray) – the mds information about each sample.
Returns: a tuple with a numpy.array containing the centers of the three reference population clusters as first element, and a dict containing the label of each of the three reference population clusters.
First, we extract the mds values of each of the three reference populations. Then, we compute the center of each of those clusters by computing the means.
\[\textrm{Cluster}_\textrm{pop} = \left( \frac{\sum_{i=1}^n x_i}{n}, \frac{\sum_{i=1}^n y_i}{n} \right)\]
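The center of each cluster is simply the per-population mean of the two components. A minimal sketch in plain Python follows. The name find_centers and the dictionary return value are assumptions; the documented function returns a numpy.array and a dict of labels instead.

```python
def find_centers(points, populations):
    """Return {population: (mean_x, mean_y)} for the labelled points.

    points is a list of (x, y) tuples and populations gives the
    population label of each point, in the same order.
    """
    sums = {}
    for (x, y), pop in zip(points, populations):
        total_x, total_y, count = sums.get(pop, (0.0, 0.0, 0))
        sums[pop] = (total_x + x, total_y + y, count + 1)
    return {pop: (total_x / count, total_y / count)
            for pop, (total_x, total_y, count) in sums.items()}
```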
-
pyGenClean.Ethnicity.find_outliers.main(argString=None)
The main function.
Parameters: argString (list of strings) – the options.
These are the steps of the module:
- Prints the options.
- Reads the population file (read_population_file()).
- Reads the mds file (read_mds_file()).
- Computes the three reference population clusters’ centers (find_ref_centers()).
- Computes three clusters according to the reference population clusters’ centers, and finds the outliers of a given reference population (find_outliers()). This step also produces three different plots.
- Writes outliers in a file (prefix.outliers).
-
pyGenClean.Ethnicity.find_outliers.overwrite_tex(tex_fn, nb_outliers, script_options)
Overwrites the TeX summary file with new values.
Parameters:
-
pyGenClean.Ethnicity.find_outliers.parseArgs(argString=None)
Parses the command line options and arguments.
Parameters: argString (list) – the options.
Returns: An argparse.Namespace object created by the argparse module. It contains the values of the different options.
Options (Type) – Description:
- --mds (string) – The MDS file from Plink.
- --population-file (string) – A population file from the pyGenClean.Ethnicity.check_ethnicity module.
- --format (string) – The output file format (png, ps, or pdf formats are available).
- --outliers-of (string) – Finds the outliers of this population.
- --multiplier (float) – To find the outliers, we look for more than \(x\) times the cluster standard deviation.
- --xaxis (string) – The component to use for the X axis.
- --yaxis (string) – The component to use for the Y axis.
- --out (string) – The prefix of the output files.
Note: No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see checkArgs()).
-
pyGenClean.Ethnicity.find_outliers.read_mds_file(file_name, c1, c2, pops)
Reads an MDS file.
Parameters:
Returns: a numpy.recarray (one sample per line) with the information about the family ID, the individual ID, the first component to extract, the second component to extract and the population.
The mds file is the result of Plink (as produced by the pyGenClean.Ethnicity.check_ethnicity module).
-
pyGenClean.Ethnicity.find_outliers.read_population_file(file_name)
Reads the population file.
Parameters: file_name (str) – the name of the population file.
Returns: a dict containing the population for each of the samples.
The population file should contain three columns:
- The family ID.
- The individual ID.
- The population of the sample (one of CEU, YRI, JPT-CHB or SOURCE).
The outliers are from the SOURCE population, when compared to one of the three reference populations (CEU, YRI or JPT-CHB).
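Parsing such a three-column file could look like the sketch below, which operates on lines already read from the file. Keying the dictionary by a (family ID, individual ID) tuple, splitting on any whitespace and raising ValueError on bad input are all assumptions made for illustration. parse_population_lines is a hypothetical name.

```python
def parse_population_lines(lines):
    """Map (family ID, individual ID) to the population of each sample."""
    allowed = {"CEU", "YRI", "JPT-CHB", "SOURCE"}
    populations = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 3:
            raise ValueError("expected three columns: " + line.strip())
        family_id, individual_id, population = fields
        if population not in allowed:
            raise ValueError("unknown population: " + population)
        populations[(family_id, individual_id)] = population
    return populations
```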
pyGenClean.Ethnicity.plot_eigenvalues module
-
exception pyGenClean.Ethnicity.plot_eigenvalues.ProgramError(msg)
Bases: exceptions.Exception
An Exception raised in case of a problem.
Parameters: msg (str) – the message to print to the user before exiting.
-
pyGenClean.Ethnicity.plot_eigenvalues.add_custom_options(parser)
Adds custom options to a parser.
Parameters: parser (argparse.ArgumentParser) – the parser to which the options will be added.
-
pyGenClean.Ethnicity.plot_eigenvalues.check_args(args)
Checks the arguments and options.
Parameters: args (argparse.Namespace) – an object containing the options and arguments of the program.
Returns: True if everything was OK.
If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr and the program exits with error code 1.
-
pyGenClean.Ethnicity.plot_eigenvalues.create_scree_plot(data, o_filename, options)
Creates the scree plot.
Parameters: - data (numpy.ndarray) – the eigenvalues.
- o_filename (str) – the name of the output files.
- options (argparse.Namespace) – the options.
-
pyGenClean.Ethnicity.plot_eigenvalues.main(argString=None)
The main function.
The purpose of this module is to plot Eigenvectors provided by the Eigensoft software.
Here are the steps of this module:
- Reads the Eigenvector (read_eigenvalues()).
- Plots the Scree Plot (create_scree_plot()).
-
pyGenClean.Ethnicity.plot_eigenvalues.parse_args(argString=None)
Parses the command line options and arguments.
Returns: An argparse.Namespace object created by the argparse module. It contains the values of the different options.
Options (Type) – Description:
- --evec (string) – The EVEC file from EIGENSOFT.
- --scree-plot-title (string) – The main title of the scree plot.
- --out (string) – The name of the output file.
Note: No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see check_args()).
-
pyGenClean.Ethnicity.plot_eigenvalues.read_eigenvalues(i_filename)
Reads the eigenvalues from EIGENSOFT results.
Parameters: i_filename (str) – the name of the input file.
Returns: a numpy.ndarray array containing the eigenvalues.
