pyGenClean.Ethnicity package

For more information about how to use this module, refer to the Ethnicity Module.

Module contents

Submodules

pyGenClean.Ethnicity.check_ethnicity module

class pyGenClean.Ethnicity.check_ethnicity.Dummy[source]

Bases: object

exception pyGenClean.Ethnicity.check_ethnicity.ProgramError(msg)[source]

Bases: exceptions.Exception

An Exception raised in case of a problem.

Parameters:msg (str) – the message to print to the user before exiting.
pyGenClean.Ethnicity.check_ethnicity.allFileExists(fileList)[source]

Checks that all files exist.

Parameters:fileList (list) – the list of files to check.

Checks whether all the files in fileList exist.

pyGenClean.Ethnicity.check_ethnicity.checkArgs(args)[source]

Checks the arguments and options.

Parameters:args (argparse.Namespace) – an object containing the options of the program.
Returns:True if everything was OK.

If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr and the program exits with code 1.

pyGenClean.Ethnicity.check_ethnicity.combinePlinkBinaryFiles(prefixes, outPrefix)[source]

Combine Plink binary files.

Parameters:
  • prefixes (list) – a list of the prefixes of the files that need to be combined.
  • outPrefix (str) – the prefix of the output file (the combined file).

It uses Plink to merge a list of binary files (given as a list of prefixes (strings)) and creates the final data set, which has outPrefix as its prefix.
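A minimal sketch of how such a merge could be driven, assuming the Plink 1.x options --bfile, --merge-list, --make-bed and --out. The helper name build_merge_command and the merge list file name are hypothetical and are not taken from pyGenClean.

```python
import os
import tempfile


def build_merge_command(prefixes, out_prefix):
    """Write a Plink merge list and return the command (hypothetical helper)."""
    if len(prefixes) < 2:
        raise ValueError("need at least two prefixes to merge")

    # The first data set is passed with --bfile; the remaining ones go in the
    # merge list, one "bed bim fam" triplet per line.
    merge_list = out_prefix + ".files_to_merge"  # hypothetical file name
    with open(merge_list, "w") as f:
        for prefix in prefixes[1:]:
            print(prefix + ".bed", prefix + ".bim", prefix + ".fam", file=f)

    return ["plink", "--noweb", "--bfile", prefixes[0],
            "--merge-list", merge_list, "--make-bed", "--out", out_prefix]


out_dir = tempfile.mkdtemp()
command = build_merge_command(["ceu", "yri", "jpt-chb"],
                              os.path.join(out_dir, "combined"))
```

The returned list is only built here, not executed; it has the list-of-strings shape expected by runCommand().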

pyGenClean.Ethnicity.check_ethnicity.computeFrequency(prefix, outPrefix)[source]

Compute the frequency using Plink.

Parameters:
  • prefix (str) – the prefix of the binary file for which we need to compute frequencies.
  • outPrefix (str) – the prefix of the output files.

Uses Plink to compute the frequency of all the markers in the prefix binary file.

pyGenClean.Ethnicity.check_ethnicity.compute_eigenvalues(in_prefix, out_prefix)[source]

Computes the Eigenvalues using smartpca from Eigensoft.

Parameters:
  • in_prefix (str) – the prefix of the input files.
  • out_prefix (str) – the prefix of the output files.

Creates a “parameter file” used by smartpca and runs it.
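A minimal sketch of what writing such a parameter file might look like, assuming smartpca reads "key: value" lines with the keys genotypename, snpname, indivname, evecoutname, evaloutname and numoutlieriter. The helper name and the output file suffixes are hypothetical.

```python
import os
import tempfile


def write_smartpca_parameters(in_prefix, out_prefix):
    """Write a smartpca parameter file and return its name (hypothetical helper)."""
    parameters = [
        ("genotypename", in_prefix + ".bed"),
        ("snpname", in_prefix + ".bim"),
        ("indivname", in_prefix + ".fam"),
        ("evecoutname", out_prefix + ".evec.txt"),  # hypothetical suffix
        ("evaloutname", out_prefix + ".eval.txt"),  # hypothetical suffix
        # Assumption for this sketch: turn off smartpca's own outlier removal.
        ("numoutlieriter", "0"),
    ]
    file_name = out_prefix + ".parameters"  # hypothetical suffix
    with open(file_name, "w") as f:
        for key, value in parameters:
            f.write("{}: {}\n".format(key, value))
    return file_name


par_file = write_smartpca_parameters(
    "data", os.path.join(tempfile.mkdtemp(), "data.smartpca"))
# smartpca takes the parameter file through its -p option.
command = ["smartpca", "-p", par_file]
```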

pyGenClean.Ethnicity.check_ethnicity.createMDSFile(nb_components, inPrefix, outPrefix, genomeFileName)[source]

Creates a MDS file using Plink.

Parameters:
  • nb_components (int) – the number of components.
  • inPrefix (str) – the prefix of the input file.
  • outPrefix (str) – the prefix of the output file.
  • genomeFileName (str) – the name of the genome file.

Using Plink, computes the MDS values for each individual using the inPrefix, genomeFileName and the number of components. The results are saved using the outPrefix prefix.
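The Plink call described above might look like the following sketch, assuming the Plink options --read-genome, --cluster and --mds-plot. The helper name build_mds_command is hypothetical, and the exact options used by pyGenClean may differ.

```python
def build_mds_command(nb_components, in_prefix, out_prefix, genome_file_name):
    """Return a Plink MDS command as a list of strings (hypothetical helper)."""
    return [
        "plink", "--noweb",
        "--bfile", in_prefix,
        "--read-genome", genome_file_name,  # pairwise IBS/IBD estimates
        "--cluster",
        "--mds-plot", str(nb_components),   # every element must be a string
        "--out", out_prefix,
    ]


command = build_mds_command(10, "combined", "combined.mds", "combined.genome")
```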

pyGenClean.Ethnicity.check_ethnicity.createPopulationFile(inputFiles, labels, outputFileName)[source]

Creates a population file.

Parameters:
  • inputFiles (list) – the list of input files.
  • labels (list) – the list of labels (corresponding to the input files).
  • outputFileName (str) – the name of the output file.

inputFiles is a list of tfam files, each describing a set of samples. Each tfam file has an associated label (the name of the population).

The output file consists of one row per sample, with the following three columns: the family ID, the individual ID and the population of each sample.
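A minimal sketch of this transformation, assuming whitespace-delimited tfam files whose first two columns are the family and individual IDs, and a tab-delimited output (the delimiter is an assumption). The function below is an illustration, not the module's implementation.

```python
import os
import tempfile


def create_population_file(input_files, labels, output_file_name):
    """Write one "family ID, individual ID, population" row per sample."""
    with open(output_file_name, "w") as out:
        for file_name, label in zip(input_files, labels):
            with open(file_name) as tfam:
                for line in tfam:
                    fields = line.split()
                    if not fields:
                        continue  # skip blank lines
                    print(fields[0], fields[1], label, sep="\t", file=out)


# Two small, made-up tfam files.
work_dir = tempfile.mkdtemp()
examples = [("CEU", ["F1 S1 0 0 1 -9"]),
            ("SOURCE", ["F2 S2 0 0 2 -9", "F3 S3 0 0 1 -9"])]
tfam_names = []
for label, samples in examples:
    name = os.path.join(work_dir, label + ".tfam")
    with open(name, "w") as f:
        f.write("\n".join(samples) + "\n")
    tfam_names.append(name)

population_file = os.path.join(work_dir, "example.population_file")
create_population_file(tfam_names, ["CEU", "SOURCE"], population_file)
```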

pyGenClean.Ethnicity.check_ethnicity.create_scree_plot(in_filename, out_filename, plot_title)[source]

Creates a scree plot using smartpca results.

Parameters:
  • in_filename (str) – the name of the input file.
  • out_filename (str) – the name of the output file.
  • plot_title (str) – the title of the scree plot.
pyGenClean.Ethnicity.check_ethnicity.excludeSNPs(inPrefix, outPrefix, exclusionFileName)[source]

Exclude some SNPs using Plink.

Parameters:
  • inPrefix (str) – the prefix of the input file.
  • outPrefix (str) – the prefix of the output file.
  • exclusionFileName (str) – the name of the file containing the markers to be excluded.

Using Plink, excludes a list of markers from inPrefix, and saves the results in outPrefix. The list of markers is in exclusionFileName.

pyGenClean.Ethnicity.check_ethnicity.extractSNPs(snpToExtractFileNames, referencePrefixes, popNames, outPrefix, runSGE, options)[source]

Extract a list of SNPs using Plink.

Parameters:
  • snpToExtractFileNames (list) – the names of the files which contain the markers to extract from the original data set.
  • referencePrefixes (list) – a list containing the three reference population prefixes (the original data sets).
  • popNames (list) – a list containing the three reference population names.
  • outPrefix (str) – the prefix of the output file.
  • runSGE (boolean) – whether to use SGE or not.
  • options (argparse.Namespace) – the options.

Using Plink, extract a set of markers from a list of prefixes.

pyGenClean.Ethnicity.check_ethnicity.findFlippedSNPs(frqFile1, frqFile2, outPrefix)[source]

Find flipped SNPs and flip them in the data.

Parameters:
  • frqFile1 (str) – the name of the first frequency file.
  • frqFile2 (str) – the name of the second frequency file.
  • outPrefix (str) – the prefix of the output files.

By reading two frequency files (frqFile1 and frqFile2), it finds a list of markers that need to be flipped so that the first file becomes comparable with the second one. It also finds the markers that need to be removed.

A marker needs to be flipped in one of the two data sets if the two markers are not comparable (that is, they do not have the same minor allele) but become comparable when one of them is flipped.

A marker will be removed if it is all homozygous in at least one data set. It will also be removed if it’s impossible to determine the phase of the marker (e.g. if the two alleles are A and T or C and G).
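The rules above can be illustrated with a simplified sketch. It compares the set of alleles seen in each data set rather than the minor allele itself, treats "0" as a missing allele (a homozygous marker), and removes markers whose alleles cannot be reconciled; those three choices are assumptions of this example, and classify_marker is a hypothetical name.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
AMBIGUOUS = ({"A", "T"}, {"C", "G"})


def classify_marker(alleles1, alleles2):
    """Return "keep", "flip" or "remove" for one marker (illustrative only)."""
    set1, set2 = set(alleles1), set(alleles2)

    # Homozygous in at least one data set: remove.
    if "0" in set1 or "0" in set2 or len(set1) < 2 or len(set2) < 2:
        return "remove"

    # A/T and C/G markers read the same on both strands: remove.
    if set1 in AMBIGUOUS or set2 in AMBIGUOUS:
        return "remove"

    # Same alleles in both data sets: nothing to do.
    if set1 == set2:
        return "keep"

    # Same alleles once one data set is read on the other strand: flip.
    if {COMPLEMENT.get(allele) for allele in set1} == set2:
        return "flip"

    # Assumption: alleles that cannot be reconciled are removed.
    return "remove"


decisions = [
    classify_marker(("A", "G"), ("G", "A")),
    classify_marker(("A", "G"), ("T", "C")),
    classify_marker(("A", "T"), ("A", "T")),
    classify_marker(("0", "G"), ("A", "G")),
]
```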

pyGenClean.Ethnicity.check_ethnicity.findOverlappingSNPsWithReference(prefix, referencePrefixes, referencePopulations, outPrefix)[source]

Finds the overlapping SNPs in four different data sets (the source panel and the three reference panels).

Parameters:
  • prefix (str) – the prefix of all the files.
  • referencePrefixes (list) – the prefix of the reference population files.
  • referencePopulations (list) – the names of the reference populations (same order as referencePrefixes).
  • outPrefix (str) – the prefix of the output files.

It starts by reading the bim file of the source data set (prefix.bim). It finds all the markers (excluding the duplicated ones). Then it reads all of the reference population bim files (referencePrefixes.bim) and finds all the markers that were found in the source data set.

It creates three output files:

  • outPrefix.ref_snp_to_extract: the names of the markers that need to be extracted from the three reference panels.
  • outPrefix.source_snp_to_extract: the names of the markers that need to be extracted from the source panel.
  • outPrefix.update_names: a file (readable by Plink) that will help in changing the names of the selected markers in the reference panels, so that they become comparable with the source panel.
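A rough sketch of the overlap search for a single reference panel. It assumes that markers are matched on chromosome and position (the text above does not say how markers are matched, but the need for an update_names file suggests the names can differ) and that "duplicated" means two markers sharing a position. The helper name is hypothetical.

```python
def read_bim_positions(lines):
    """Map (chromosome, position) to marker name, dropping duplicated positions."""
    markers = {}
    duplicated = set()
    for line in lines:
        # bim columns: chromosome, name, genetic distance, position, alleles.
        chrom, name, _distance, position = line.split()[:4]
        key = (chrom, position)
        if key in markers or key in duplicated:
            duplicated.add(key)
            markers.pop(key, None)
        else:
            markers[key] = name
    return markers


# Made-up bim content for the source panel and one reference panel.
source = read_bim_positions([
    "1 rs1 0 100 A G",
    "1 rs2 0 200 C T",
    "1 rs2dup 0 200 C T",
    "2 rs3 0 300 A C",
])
reference = read_bim_positions([
    "1 snp_a 0 100 A G",
    "2 rs3 0 300 A C",
    "3 rs9 0 900 G T",
])

shared = sorted(set(source) & set(reference))
source_to_extract = [source[key] for key in shared]
ref_to_extract = [reference[key] for key in shared]
# Pairs of (reference name, source name) for markers whose names differ.
update_names = [(reference[key], source[key]) for key in shared
                if reference[key] != source[key]]
```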
pyGenClean.Ethnicity.check_ethnicity.find_the_outliers(mds_file_name, population_file_name, ref_pop_name, multiplier, out_prefix)[source]

Finds the outliers of a given population.

Parameters:
  • mds_file_name (str) – the name of the mds file.
  • population_file_name (str) – the name of the population file.
  • ref_pop_name (str) – the name of the reference population for which to find outliers.
  • multiplier (float) – the multiplier of the cluster standard deviation to modify the strictness of the outlier removal procedure.
  • out_prefix (str) – the prefix of the output file.

Uses the pyGenClean.Ethnicity.find_outliers module to find outliers. It requires the mds file created by createMDSFile() and the population file created by createPopulationFile().

pyGenClean.Ethnicity.check_ethnicity.flipSNPs(inPrefix, outPrefix, flipFileName)[source]

Flip SNPs using Plink.

Parameters:
  • inPrefix (str) – the prefix of the input file.
  • outPrefix (str) – the prefix of the output file.
  • flipFileName (str) – the name of the file containing the markers to flip.

Using Plink, flips a set of markers in inPrefix, and saves the results in outPrefix. The list of markers to be flipped is in flipFileName.

pyGenClean.Ethnicity.check_ethnicity.main(argString=None)[source]

The main function.

Parameters:argString (list) – the options.

These are the steps of this module:

  1. Prints the options.
  2. Finds the overlapping markers between the three reference panels and the source panel (findOverlappingSNPsWithReference()).
  3. Extracts the required markers from all the data sets (extractSNPs()).
  4. Renames the reference panel’s marker names so that they are the same as the source panel (for all populations) (renameSNPs()).
  5. Combines the three reference panels together (combinePlinkBinaryFiles()).
  6. Computes the frequency of all the markers from the reference and the source panels (computeFrequency()).
  7. Finds the markers to flip in the reference panel (when compared to the source panel) (findFlippedSNPs()).
  8. Excludes the unflippable markers from the reference and the source panels (excludeSNPs()).
  9. Flips the markers that need flipping in their reference panel (flipSNPs()).
  10. Combines the reference and the source panels (combinePlinkBinaryFiles()).
  11. Runs part of pyGenClean.RelatedSamples.find_related_samples on the combined data set (runRelatedness()).
  12. Creates the mds file from the combined data set and the result of previous step (createMDSFile()).
  13. Creates the population file (createPopulationFile()).
  14. Plots the mds values (plotMDS()).
  15. Finds the outliers of a given reference population (find_the_outliers()).
  16. If required, computes the Eigenvalues using smartpca (compute_eigenvalues()).
  17. If required, creates a scree plot from smartpca results (create_scree_plot()).
pyGenClean.Ethnicity.check_ethnicity.parseArgs(argString=None)[source]

Parses the command line options and arguments.

Parameters:argString (list) – the options.
Returns:An argparse.Namespace object created by the argparse module. It contains the values of the different options.
Options                  Type     Description
--bfile                  string   The input file prefix (Plink binary file).
--skip-ref-pops          bool     Perform the MDS computation, but skip the three reference panels.
--ceu-bfile              string   The input file prefix for the CEU population (Plink binary file).
--yri-bfile              string   The input file prefix for the YRI population (Plink binary file).
--jpt-chb-bfile          string   The input file prefix for the JPT-CHB population (Plink binary file).
--min-nb-snp             int      The minimum number of markers needed to compute IBS.
--indep-pairwise         string   Three numbers: window size, window shift and the r2 threshold.
--maf                    string   Restrict to SNPs with MAF >= threshold.
--sge                    bool     Use SGE for parallelization.
--sge-walltime           int      The time limit (for clusters).
--sge-nodes              int int  Two INTs (number of nodes and number of processors per node).
--ibs-sge-walltime       int      The time limit (for clusters) (for IBS).
--ibs-sge-nodes          int int  Two INTs (number of nodes and number of processors per node) (for IBS).
--line-per-file-for-sge  int      The number of lines per file for the SGE task array.
--nb-components          int      The number of components to compute.
--outliers-of            string   Finds the outliers of this population.
--multiplier             float    To find the outliers, we look for more than x times the cluster standard deviation.
--xaxis                  string   The component to use for the X axis.
--yaxis                  string   The component to use for the Y axis.
--format                 string   The output file format.
--title                  string   The title of the MDS plot.
--xlabel                 string   The label of the X axis.
--ylabel                 string   The label of the Y axis.
--create-scree-plot      bool     Computes Eigenvalues and creates a scree plot.
--scree-plot-title       string   The main title of the scree plot.
--out                    string   The prefix of the output files.

Note

No option checks are done here (except for those automatically done by argparse). They need to be done elsewhere (see checkArgs()).
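As an illustration of how a few of these options could be declared with argparse, here is a small sketch. It covers only a subset of the options, and every default value in it is a placeholder chosen for the example, not the module's real default.

```python
import argparse

parser = argparse.ArgumentParser(description="Illustrative subset of options.")
parser.add_argument("--bfile", type=str, required=True,
                    help="The input file prefix (Plink binary file).")
parser.add_argument("--nb-components", type=int, default=10,  # placeholder
                    help="The number of components to compute.")
parser.add_argument("--multiplier", type=float, default=1.9,  # placeholder
                    help="Multiplier of the cluster standard deviation.")
parser.add_argument("--create-scree-plot", action="store_true",
                    help="Computes Eigenvalues and creates a scree plot.")
parser.add_argument("--out", type=str, default="ethnicity",  # placeholder
                    help="The prefix of the output files.")

# Passing a list of strings mirrors the argString parameter described above.
args = parser.parse_args(["--bfile", "study", "--nb-components", "4"])
```

argparse converts dashes in option names to underscores, so --nb-components is read back as args.nb_components.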

pyGenClean.Ethnicity.check_ethnicity.plotMDS(inputFileName, outPrefix, populationFileName, options)[source]

Plots the MDS value.

Parameters:
  • inputFileName (str) – the name of the mds file.
  • outPrefix (str) – the prefix of the output files.
  • populationFileName (str) – the name of the population file.
  • options (argparse.Namespace) – the options

Plots the mds value according to the inputFileName file (mds) and the populationFileName (the population file).

pyGenClean.Ethnicity.check_ethnicity.renameSNPs(inPrefix, updateFileName, outPrefix)[source]

Updates the name of the SNPs using Plink.

Parameters:
  • inPrefix (str) – the prefix of the input file.
  • updateFileName (str) – the name of the file containing the updated marker names.
  • outPrefix (str) – the prefix of the output file.

Using Plink, changes the name of the markers in inPrefix using updateFileName. It saves the results in outPrefix.

pyGenClean.Ethnicity.check_ethnicity.runCommand(command)[source]

Run a command.

Parameters:command (list) – the command to run.

Tries to run a command. If it fails, raises a ProgramError. This function uses the subprocess module.

Warning

The variable command should be a list of strings (no other type).
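A minimal sketch of this pattern, assuming subprocess.check_output is used and that both a non-zero exit status and a missing executable count as failures. The ProgramError class below is a stand-in defined for the example.

```python
import subprocess
import sys


class ProgramError(Exception):
    """Stand-in for the module's ProgramError exception."""


def run_command(command):
    """Run a command given as a list of strings; raise ProgramError on failure."""
    try:
        subprocess.check_output(command, stderr=subprocess.STDOUT)
    except (subprocess.CalledProcessError, OSError) as error:
        raise ProgramError("could not run: " + " ".join(command)) from error


# A command that succeeds, then one that exits with a non-zero status.
run_command([sys.executable, "-c", "print('ok')"])
try:
    run_command([sys.executable, "-c", "import sys; sys.exit(1)"])
    failed = False
except ProgramError:
    failed = True
```

Because the command is a list of strings, no shell is involved and each argument is passed to the program as is.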

pyGenClean.Ethnicity.check_ethnicity.runRelatedness(inputPrefix, outPrefix, options)[source]

Run the relatedness step of the data clean up.

Parameters:
  • inputPrefix (str) – the prefix of the input file.
  • outPrefix (str) – the prefix of the output file.
  • options (argparse.Namespace) – the options
Returns:

the prefix of the new bfile.

Runs pyGenClean.RelatedSamples.find_related_samples using the inputPrefix files and the options in options, and saves the results using the outPrefix prefix.

pyGenClean.Ethnicity.check_ethnicity.safe_main()[source]

A safe version of the main function (that catches ProgramError).

pyGenClean.Ethnicity.find_outliers module

exception pyGenClean.Ethnicity.find_outliers.ProgramError(msg)[source]

Bases: exceptions.Exception

An Exception raised in case of a problem.

Parameters:msg (str) – the message to print to the user before exiting.
pyGenClean.Ethnicity.find_outliers.add_custom_options(parser)[source]

Adds custom options to a parser.

Parameters:parser (argparse.ArgumentParser) – the parser to which to add options.
pyGenClean.Ethnicity.find_outliers.checkArgs(args)[source]

Checks the arguments and options.

Parameters:args (argparse.Namespace) – an argparse.Namespace object containing the options of the program.
Returns:True if everything was OK.

If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr and the program exits with code 1.

pyGenClean.Ethnicity.find_outliers.find_outliers(mds, centers, center_info, ref_pop, options)[source]

Finds the outliers for a given population.

Parameters:
  • mds (numpy.recarray) – the mds information about each sample.
  • centers (numpy.array) – the centers of the three reference population clusters.
  • center_info (dict) – the label of the three reference population clusters.
  • ref_pop (str) – the reference population for which we need the outliers.
  • options (argparse.Namespace) – the options
Returns:

a set of outliers from the ref_pop population.

Performs a KMeans classification using the three centers from the three reference population clusters.

Samples are outliers of the required reference population (ref_pop) if:

  • the sample is part of another reference population cluster;
  • the sample is an outlier of the desired reference population (ref_pop).

A sample is an outlier of a given cluster \(C_j\) if the distance between this sample and the center of the cluster \(C_j\) (\(O_j\)) is bigger than a constant times the cluster’s standard deviation \(\sigma_j\).

\[\sigma_j = \sqrt{\frac{\sum{d(s_i,O_j)^2}}{||C_j|| - 1}}\]

where \(||C_j||\) is the number of samples in the cluster \(C_j\), and \(d(s_i,O_j)\) is the distance between the sample \(s_i\) and the center \(O_j\) of the cluster \(C_j\).

\[d(s_i, O_j) = \sqrt{(x_{O_j} - x_{s_i})^2 + (y_{O_j} - y_{s_i})^2}\]

Using a constant equal to one ensures we remove 100% of the outliers from the cluster. Using a constant of 1.6 or 1.9 ensures we remove 99% and 95% of the outliers, respectively (an error rate of 1% and 5%, respectively).
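The two formulas above can be sketched in plain Python as follows. This example skips the KMeans step and takes the cluster members and the center as given; the function names and the sample coordinates are made up for illustration.

```python
import math


def distance(sample, center):
    """Euclidean distance d(s_i, O_j) between a sample and a cluster center."""
    return math.sqrt((center[0] - sample[0]) ** 2 +
                     (center[1] - sample[1]) ** 2)


def cluster_outliers(samples, center, multiplier):
    """Return the indices of samples farther than multiplier * sigma_j."""
    distances = [distance(sample, center) for sample in samples]
    # sigma_j: root of the summed squared distances over (cluster size - 1).
    sigma = math.sqrt(sum(d ** 2 for d in distances) / (len(samples) - 1))
    return [i for i, d in enumerate(distances) if d > multiplier * sigma]


# Four samples at distance 1 from the center and one at distance 5.
samples = [(0, 1), (1, 0), (0, -1), (-1, 0), (5, 0)]
strict = cluster_outliers(samples, (0, 0), 1.6)
lenient = cluster_outliers(samples, (0, 0), 1.9)
```

Here sigma is about 2.69, so the last sample is flagged with a multiplier of 1.6 (threshold about 4.31) but not with 1.9 (threshold about 5.12).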

pyGenClean.Ethnicity.find_outliers.find_ref_centers(mds)[source]

Finds the center of the three reference clusters.

Parameters:mds (numpy.recarray) – the mds information about each sample.
Returns:a tuple with a numpy.array containing the centers of the three reference population clusters as first element, and a dict containing the label of each of the three reference population clusters.

First, we extract the mds values of each of the three reference populations. Then, we compute the center of each of those clusters by computing the means.

\[\textrm{Cluster}_\textrm{pop} = \left( \frac{\sum_{i=1}^n x_i}{n}, \frac{\sum_{i=1}^n y_i}{n} \right)\]
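The center computation can be sketched in plain Python as below. The input layout (tuples of x, y and population) and the function name are assumptions made for the example; the module itself works on a numpy.recarray.

```python
def find_centers(samples):
    """Return {population: (mean x, mean y)} for the reference populations."""
    sums = {}
    for x, y, population in samples:
        if population == "SOURCE":
            continue  # only the reference populations define centers
        total = sums.setdefault(population, [0.0, 0.0, 0])
        total[0] += x
        total[1] += y
        total[2] += 1
    return {population: (sum_x / n, sum_y / n)
            for population, (sum_x, sum_y, n) in sums.items()}


centers = find_centers([
    (1.0, 2.0, "CEU"), (3.0, 4.0, "CEU"),
    (-1.0, 0.0, "YRI"), (-3.0, 2.0, "YRI"),
    (0.5, 0.5, "SOURCE"),
])
```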
pyGenClean.Ethnicity.find_outliers.main(argString=None)[source]

The main function.

Parameters:argString (list of strings) – the options.

These are the steps of this module:

  1. Prints the options.
  2. Reads the population file (read_population_file()).
  3. Reads the mds file (read_mds_file()).
  4. Computes the three reference population clusters’ centers (find_ref_centers()).
  5. Computes three clusters according to the reference population clusters’ centers, and finds the outliers of a given reference population (find_outliers()). This step also produces three different plots.
  6. Writes outliers in a file (prefix.outliers).
pyGenClean.Ethnicity.find_outliers.overwrite_tex(tex_fn, nb_outliers, script_options)[source]

Overwrites the TeX summary file with new values.

Parameters:
  • tex_fn (str) – the name of the TeX summary file to overwrite.
  • nb_outliers (int) – the number of outliers.
  • script_options – the script options.
pyGenClean.Ethnicity.find_outliers.parseArgs(argString=None)[source]

Parses the command line options and arguments.

Parameters:argString (list) – the options.
Returns:An argparse.Namespace object created by the argparse module. It contains the values of the different options.
Options            Type     Description
--mds              string   The MDS file from Plink.
--population-file  string   A population file from the pyGenClean.Ethnicity.check_ethnicity module.
--outliers-of      string   Finds the outliers of this population.
--multiplier       float    To find the outliers, we look for more than \(x\) times the cluster standard deviation.
--xaxis            string   The component to use for the X axis.
--yaxis            string   The component to use for the Y axis.
--format           string   The output file format (png, ps, or pdf formats are available).
--out              string   The prefix of the output files.

Note

No option checks are done here (except for those automatically done by argparse). They need to be done elsewhere (see checkArgs()).

pyGenClean.Ethnicity.find_outliers.read_mds_file(file_name, c1, c2, pops)[source]

Reads a MDS file.

Parameters:
  • file_name (str) – the name of the mds file.
  • c1 (str) – the first component to read (x axis).
  • c2 (str) – the second component to read (y axis).
  • pops (dict) – the population of each sample.
Returns:

a numpy.recarray (one sample per line) with the information about the family ID, the individual ID, the first component to extract, the second component to extract and the population.

The mds file is the result of Plink (as produced by the pyGenClean.Ethnicity.check_ethnicity module).

pyGenClean.Ethnicity.find_outliers.read_population_file(file_name)[source]

Reads the population file.

Parameters:file_name (str) – the name of the population file.
Returns:a dict containing the population for each of the samples.

The population file should contain three columns:

  1. The family ID.
  2. The individual ID.
  3. The population of the sample (one of CEU, YRI, JPT-CHB or SOURCE).

The outliers are from the SOURCE population, when compared to one of the three reference populations (CEU, YRI or JPT-CHB).
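A minimal sketch of parsing such a file, assuming whitespace-delimited columns and a dict keyed by the (family ID, individual ID) pair. The key layout and the function name are assumptions of this example.

```python
VALID_POPULATIONS = {"CEU", "YRI", "JPT-CHB", "SOURCE"}


def read_population_lines(lines):
    """Return {(family ID, individual ID): population} (illustrative only)."""
    populations = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        if len(fields) != 3:
            raise ValueError("expected three columns: " + line.rstrip())
        family_id, individual_id, population = fields
        if population not in VALID_POPULATIONS:
            raise ValueError("unknown population: " + population)
        populations[(family_id, individual_id)] = population
    return populations


populations = read_population_lines([
    "F1\tS1\tCEU\n",
    "F2\tS2\tSOURCE\n",
])
```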

pyGenClean.Ethnicity.find_outliers.safe_main()[source]

A safe version of the main function (that catches ProgramError).

pyGenClean.Ethnicity.plot_eigenvalues module

exception pyGenClean.Ethnicity.plot_eigenvalues.ProgramError(msg)[source]

Bases: exceptions.Exception

An Exception raised in case of a problem.

Parameters:msg (str) – the message to print to the user before exiting.
pyGenClean.Ethnicity.plot_eigenvalues.add_custom_options(parser)[source]

Adds custom options to a parser.

Parameters:parser (argparse.ArgumentParser) – the parser to which the options will be added.
pyGenClean.Ethnicity.plot_eigenvalues.check_args(args)[source]

Checks the arguments and options.

Parameters:args (argparse.Namespace) – an object containing the options and arguments of the program.
Returns:True if everything was OK.

If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to the sys.stderr and the program exits with error code 1.

pyGenClean.Ethnicity.plot_eigenvalues.create_scree_plot(data, o_filename, options)[source]

Creates the scree plot.

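Parameters:
  • data (numpy.ndarray) – the eigenvalues to plot.
  • o_filename (str) – the name of the output file.
  • options (argparse.Namespace) – the options.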
pyGenClean.Ethnicity.plot_eigenvalues.main(argString=None)[source]

The main function.

The purpose of this module is to plot the eigenvalues provided by the Eigensoft software.

Here are the steps of this module:

  1. Reads the eigenvalues (read_eigenvalues()).
  2. Plots the Scree Plot (create_scree_plot()).
pyGenClean.Ethnicity.plot_eigenvalues.parse_args(argString=None)[source]

Parses the command line options and arguments.

Returns:An argparse.Namespace object created by the argparse module. It contains the values of the different options.
Options             Type     Description
--evec              string   The EVEC file from EIGENSOFT.
--scree-plot-title  string   The main title of the scree plot.
--out               string   The name of the output file.

Note

No option checks are done here (except for those automatically done by argparse). They need to be done elsewhere (see check_args()).

pyGenClean.Ethnicity.plot_eigenvalues.read_eigenvalues(i_filename)[source]

Reads the eigenvalues from EIGENSOFT results.

Parameters:i_filename (str) – the name of the input file.
Returns:a numpy.ndarray array containing the eigenvalues.
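A minimal sketch of such a reader, assuming the EVEC file starts with a header line of the form "#eigvals: v1 v2 ...", as smartpca output usually does. It returns a plain list instead of a numpy.ndarray to stay dependency-free, and the function name is hypothetical.

```python
def read_eigenvalues_from_lines(lines):
    """Return the eigenvalues found on the "#eigvals:" header line."""
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "#eigvals:":
            return [float(value) for value in fields[1:]]
    raise ValueError("no '#eigvals:' header line found")


# Made-up EVEC content: a header line followed by one sample row.
eigenvalues = read_eigenvalues_from_lines([
    "           #eigvals:     5.250     2.500     1.125\n",
    "   F1:S1     0.0123    -0.0456     0.0789      CEU\n",
])
```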
pyGenClean.Ethnicity.plot_eigenvalues.safe_main()[source]

A safe version of the main function (that catches ProgramError).