pyGenClean.PlinkUtils package

For more information about how to use this module, refer to the Plink Utils.

Module contents


Remove leading spaces and change spaces to tabs.

Parameters:line (str) – a line from a Plink report file.
Returns:an array containing each field from the input line.

Plink’s output files are usually formatted to be human readable. Hence, instead of separating fields with tabs, Plink pads them with a variable number of spaces to form columns. The fields are split using the re module.

>>> line = " CHR               SNP         BP   A1      A2"
>>> createRowFromPlinkSpacedOutput(line)
['CHR', 'SNP', 'BP', 'A1', 'A2']
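
A minimal sketch of this kind of split, assuming only the behaviour described above (strip the leading spaces, turn each run of spaces into a tab, then split). The name split_plink_row is hypothetical; it is not necessarily how createRowFromPlinkSpacedOutput is implemented.

```python
import re


def split_plink_row(line):
    """Split a space-aligned Plink report line into fields (hypothetical helper)."""
    # Remove surrounding whitespace so no empty field appears at either end.
    line = line.strip()
    # Replace every run of whitespace with one tab, then split on tabs.
    return re.sub(r"\s+", "\t", line).split("\t")
```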

Gets the Plink version from the binary.

Returns:the version of the Plink software
Return type:str

This function uses subprocess.Popen to gather the version of the Plink binary. Since executing the software to gather the version creates an output file, that file is deleted afterwards.


This function only works as long as the version is returned as | PLINK! | NNN | (where NNN is the version), since a regular expression is used to extract the version number from the standard output of the software.
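
The extraction step might look like the sketch below. It assumes a banner line shaped like "| PLINK! | v1.07 | 10/Aug/2009 |"; the pattern and the name parse_plink_version are illustrative and not taken from the package. Running the binary itself is left out.

```python
import re

# Assumed banner layout: "|   PLINK!   |   v1.07   |   10/Aug/2009   |"
VERSION_RE = re.compile(r"\|\s*PLINK!\s*\|\s*(\S[^|]*?)\s*\|")


def parse_plink_version(stdout_text):
    """Return the version field of the Plink banner, or None if it is absent."""
    match = VERSION_RE.search(stdout_text)
    return match.group(1) if match else None
```

If the banner changes shape, the pattern no longer matches and the sketch returns None, which is the limitation noted above.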


pyGenClean.PlinkUtils.compare_bim module

exception pyGenClean.PlinkUtils.compare_bim.ProgramError(msg)[source]

Bases: exceptions.Exception

An Exception raised in case of a problem.

Parameters:msg (str) – the message to print to the user before exiting.

Checks the arguments and options.

Parameters:args (argparse.Namespace) – an object containing the options of the program.
Returns:True if everything was OK.

If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr, and the program exits with code 1.

pyGenClean.PlinkUtils.compare_bim.compareSNPs(before, after, outFileName)[source]

Compares two sets of SNPs.

  • before (set) – the names of the markers in the before file.
  • after (set) – the names of the markers in the after file.
  • outFileName (str) – the name of the output file.

Finds the difference between two sets of markers, and writes it to the outFileName file.


A ProgramError is raised if:

  1. There are more markers in the after set than in the before set.
  2. Some markers that are in the after set are not in the before set.
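
The comparison and the two error conditions can be sketched with plain set operations. This is an illustration only: the ProgramError class below is a stand-in, removed_markers is a hypothetical name, and writing the result to outFileName is omitted.

```python
class ProgramError(Exception):
    """Stand-in for the module's ProgramError."""


def removed_markers(before, after):
    """Return the markers found in `before` but not in `after`."""
    # Condition 1: the after set cannot be larger than the before set.
    if len(after) > len(before):
        raise ProgramError("more markers in the after set than in the before set")
    # Condition 2: every marker in the after set must exist in the before set.
    if not after <= before:
        raise ProgramError("some markers in the after set are not in the before set")
    return before - after
```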

The main function of the module.

The purpose of this module is to find markers that were removed by Plink. When Plink excludes some markers from binary files, there is no easy way to find the list of removed markers, except by comparing the two BIM files (before and after modification).

Here are the steps of this module:

  1. Reads the BIM file before the modification (readBIM()).
  2. Reads the BIM file after the modification (readBIM()).
  3. Compares the list of markers before and after modification, and writes the removed markers into a file (compareSNPs()).


This module only finds markers that were removed (since adding markers to a BIM file usually involves a companion file that tells Plink which markers to add).


Parses the command line options and arguments.

Returns:An argparse.Namespace object created by the argparse module. It contains the values of the different options.

Options   Type    Description
--before  string  The name of the BIM file before modification.
--after   string  The name of the BIM file after modification.
--out     string  The prefix of the output files.


No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see checkArgs()).
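
For illustration, a parser exposing the three options listed above could be declared as follows. Whether --before and --after are required, and the default value for --out, are assumptions made for this sketch; build_parser is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser with the --before, --after and --out options (sketch)."""
    parser = argparse.ArgumentParser(description="Compare two BIM files.")
    # Assumption: both BIM files are mandatory.
    parser.add_argument("--before", type=str, required=True, metavar="FILE",
                        help="The name of the BIM file before modification.")
    parser.add_argument("--after", type=str, required=True, metavar="FILE",
                        help="The name of the BIM file after modification.")
    # Assumption: hypothetical default prefix.
    parser.add_argument("--out", type=str, default="compare_bim", metavar="PREFIX",
                        help="The prefix of the output files.")
    return parser
```

argparse only verifies that the options are present and well formed; checking that the files exist is left to a separate step.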


Reads a BIM file.

Parameters:fileName (str) – the name of the BIM file to read.
Returns:the set of markers in the BIM file.

Reads a Plink BIM file and extracts the names of the markers. There is one marker per line, and the name of the marker is in the second column. There is no header in the BIM file.
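
A minimal sketch of this extraction, assuming whitespace-separated columns. The hypothetical helper read_bim_markers takes any iterable of lines (an open file works) rather than a file name, which keeps the example easy to try.

```python
def read_bim_markers(lines):
    """Collect marker names (second column) from the lines of a BIM file."""
    markers = set()
    for line in lines:
        fields = line.split()
        # Skip blank or truncated lines; there is no header to skip.
        if len(fields) >= 2:
            markers.add(fields[1])
    return markers
```

With a real file this would be called as `with open(fileName) as f: markers = read_bim_markers(f)`.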


A safe version of the main function (that catches ProgramError).
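
The pattern can be sketched as below. Both main and ProgramError are stand-ins defined only for this example, and the wrapper returns the exit code instead of calling sys.exit so that it can be tried interactively.

```python
import sys


class ProgramError(Exception):
    """Stand-in for the module's ProgramError."""


def main(argv):
    """Hypothetical main function that raises ProgramError on bad input."""
    if not argv:
        raise ProgramError("no options given")
    return 0


def safe_main(argv):
    """Run main() and turn a ProgramError into a message and exit code 1."""
    try:
        return main(argv)
    except ProgramError as error:
        sys.stderr.write("ERROR: {}\n".format(error))
        return 1
```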

pyGenClean.PlinkUtils.plot_MDS module

exception pyGenClean.PlinkUtils.plot_MDS.ProgramError(msg)[source]

Bases: exceptions.Exception

An Exception raised in case of a problem.

Parameters:msg (str) – the message to print to the user before exiting.

Adds custom options to a parser.

Parameters:parser (argparse.ArgumentParser) – the parser.

Checks the arguments and options.

Parameters:args (argparse.Namespace) – an object containing the options of the program.
Returns:True if everything was OK.

If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr, and the program exits with code 1.

pyGenClean.PlinkUtils.plot_MDS.extractData(fileName, populations)[source]

Extract the C1 and C2 columns for plotting.

  • fileName (str) – the name of the MDS file.
  • populations (dict) – the population of each sample in the MDS file.

Returns:the MDS data with the population of each sample, as a tuple of two elements. The first element is itself a tuple holding the C1 data followed by the C2 data. The last element is the list of populations, in the same order as the data.


If a sample in the MDS file is not in the population file, it is skipped.
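
The return structure and the skipping rule can be illustrated with the sketch below. It assumes that the MDS file has a header containing FID, IID, C1 and C2, and that populations maps (FID, IID) tuples to a population name; both are assumptions. The name extract_c1_c2 is hypothetical, and plain lists are returned in place of arrays.

```python
def extract_c1_c2(lines, populations):
    """Return ((c1, c2), labels) from the lines of a Plink MDS file (sketch)."""
    rows = (line.split() for line in lines if line.strip())
    header = next(rows)
    col = {name: i for i, name in enumerate(header)}
    c1, c2, labels = [], [], []
    for row in rows:
        sample = (row[col["FID"]], row[col["IID"]])
        if sample not in populations:
            continue  # samples without a known population are skipped
        c1.append(float(row[col["C1"]]))
        c2.append(float(row[col["C2"]]))
        labels.append(populations[sample])
    return (c1, c2), labels
```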


The main function of the module.

These are the steps:

  1. Reads the population file (readPopulations()).
  2. Extracts the MDS data (extractData()).
  3. Plots the MDS data (plotMDS()).

Parses the command line options and arguments.

Returns:An argparse.Namespace object created by the argparse module. It contains the values of the different options.

Options            Type    Description
--file             string  The MDS file.
--population-file  string  A file containing population information.
--format           string  The output file format.
--title            string  The title of the MDS plot.
--xlabel           string  The label of the X axis.
--ylabel           string  The label of the Y axis.
--out              string  The prefix of the output files.


No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see checkArgs()).

pyGenClean.PlinkUtils.plot_MDS.plotMDS(data, theOrders, theLabels, theColors, theSizes, theMarkers, options)[source]

Plot the MDS data.

  • data (list of numpy.array) – the data to plot (MDS values).
  • theOrders (list) – the order of the populations to plot.
  • theLabels (list) – the names of populations to plot.
  • theColors (list) – the colors of the populations to plot.
  • theSizes (list) – the sizes of the markers for each population to plot.
  • theMarkers (list) – the type of markers for each population to plot.
  • options (argparse.Namespace) – the options.

Reads a population file.

Parameters:inputFileName (str) – the name of the population file.
Returns:a dict of population for each of the samples.

A safe version of the main function (that catches ProgramError).

pyGenClean.PlinkUtils.plot_MDS_standalone module

exception pyGenClean.PlinkUtils.plot_MDS_standalone.ProgramError(msg)[source]

Bases: exceptions.Exception

An Exception raised in case of a problem.

Parameters:msg (str) – the message to print to the user before exiting.

Checks the arguments and options.

Parameters:args (argparse.Namespace) – an object containing the options of the program.
Returns:True if everything was OK.

If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr, and the program exits with code 1.

pyGenClean.PlinkUtils.plot_MDS_standalone.extractData(fileName, populations, population_order, xaxis, yaxis)[source]

Extract the C1 and C2 columns for plotting.

  • fileName (str) – the name of the MDS file.
  • populations (dict) – the population of each sample in the MDS file.
  • population_order (list) – the required population order.
  • xaxis (str) – the component to print as the X axis.
  • yaxis (str) – the component to print as the Y axis.

Returns:the MDS data with the population of each sample, as a tuple of two elements. The first element is itself a tuple holding the C1 data followed by the C2 data. The last element is the list of populations, in the same order as the data.


If a sample in the MDS file is not in the population file, it is skipped.


The main function of the module.

These are the steps:

  1. Reads the population file (readPopulations()).
  2. Extracts the MDS values (extractData()).
  3. Plots the MDS values (plotMDS()).

Parses the command line options and arguments.

Returns:An argparse.Namespace object created by the argparse module. It contains the values of the different options.

Options               Type    Description
--file                string  The MDS file.
--population-file     string  A file containing population information.
--population-order    string  The order to print the different populations.
--population-colors   string  The population point color in the plot.
--population-sizes    string  The population point size in the plot.
--population-markers  string  The population point marker in the plot.
--population-alpha    string  The population alpha value in the plot.
--format              string  The output file format.
--title               string  The title of the MDS plot.
--xaxis               string  The component to print on the X axis.
--xlabel              string  The label of the X axis.
--yaxis               string  The component to print on the Y axis.
--ylabel              string  The label of the Y axis.
--legend-position     string  The position of the legend.
--legend-size         int     The size of the legend text.
--legend-ncol         int     The number of columns for the legend.
--legend-alpha        float   The alpha value of the legend.
--title-fontsize      int     The font size of the title.
--label-fontsize      int     The font size of the X and Y labels.
--axis-fontsize       int     The font size of the X and Y axis.
--adjust-left         float   Adjust the left margin.
--adjust-right        float   Adjust the right margin.
--adjust-top          float   Adjust the top margin.
--adjust-bottom       float   Adjust the bottom margin.
--out                 string  The prefix of the output files.


No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see checkArgs()).

pyGenClean.PlinkUtils.plot_MDS_standalone.plotMDS(data, theOrders, theLabels, theColors, theAlphas, theSizes, theMarkers, options)[source]

Plot the MDS data.

  • data (list of numpy.array) – the data to plot (MDS values).
  • theOrders (list) – the order of the populations to plot.
  • theLabels (list) – the names of the populations to plot.
  • theColors (list) – the colors of the populations to plot.
  • theAlphas (list) – the alpha value for the populations to plot.
  • theSizes (list) – the sizes of the markers for each population to plot.
  • theMarkers (list) – the type of marker for each population to plot.
  • options (argparse.Namespace) – the options.

pyGenClean.PlinkUtils.plot_MDS_standalone.readPopulations(inputFileName, requiredPopulation)[source]

Reads a population file.

  • inputFileName (str) – the name of the population file.
  • requiredPopulation (list) – the required population.

Returns:a dict containing the population of each sample.


A safe version of the main function (that catches ProgramError).

pyGenClean.PlinkUtils.subset_data module

exception pyGenClean.PlinkUtils.subset_data.ProgramError(msg)[source]

Bases: exceptions.Exception

An Exception raised in case of a problem.

Parameters:msg (str) – the message to print to the user before exiting.

Checks the arguments and options.

Parameters:args (argparse.Namespace) – an object containing the options of the program.
Returns:True if everything was OK.

If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr, and the program exits with code 1.


Only one operation for markers and one operation for samples can be done at a time. Hence, one of --exclude or --extract can be done for markers, and one of --remove or --keep can be done for samples.
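
This rule can be sketched as a small check on the parsed options. The attribute names follow the option names listed for this module; the ProgramError class and the function name check_exclusive are stand-ins for the example.

```python
import argparse


class ProgramError(Exception):
    """Stand-in for the module's ProgramError."""


def check_exclusive(args):
    """Raise ProgramError if conflicting marker or sample options are set."""
    # Markers: at most one of --exclude / --extract.
    if args.exclude is not None and args.extract is not None:
        raise ProgramError("use only one of --exclude or --extract")
    # Samples: at most one of --remove / --keep.
    if args.remove is not None and args.keep is not None:
        raise ProgramError("use only one of --remove or --keep")
    return True
```

An alternative design is to declare each pair with argparse's add_mutually_exclusive_group(), which rejects conflicting options at parse time.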


The main function of the module.

Parameters:argString (list) – the options.

Here are the steps:

  1. Prints the options.
  2. Subsets the data (subset_data()).


The type of the output files is determined by the type of the input files (e.g. if the input files are binary files, so will be the output ones).


Parses the command line options and arguments.

Parameters:argString (list) – the parameters.
Returns:An argparse.Namespace object created by the argparse module. It contains the values of the different options.

Options     Type    Description
--ifile     string  The input file prefix.
--is-bfile  bool    The input file is a bfile.
--is-tfile  bool    The input file is a tfile.
--is-file   bool    The input file is a file.
--exclude   string  A file containing SNPs to exclude from the data set.
--extract   string  A file containing SNPs to extract from the data set.
--remove    string  A file containing samples (FID and IID) to remove from the data set.
--keep      string  A file containing samples (FID and IID) to keep from the data set.
--out       string  The prefix of the output files.


No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see checkArgs()).


Runs a command.

Parameters:command (list) – the command to run.

If there is a problem, a ProgramError is raised.
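
One way to get this behaviour is to wrap subprocess.check_output, as sketched below. The ProgramError class and the name run_command are stand-ins; the usage line runs the current Python interpreter instead of Plink so that the example works anywhere.

```python
import subprocess
import sys


class ProgramError(Exception):
    """Stand-in for the module's ProgramError."""


def run_command(command):
    """Run `command` (a list of strings) and raise ProgramError on failure."""
    try:
        # Capture stderr together with stdout.
        return subprocess.check_output(command, stderr=subprocess.STDOUT)
    except (subprocess.CalledProcessError, OSError) as error:
        raise ProgramError("could not run {}: {}".format(" ".join(command), error))


# Usage: a harmless command standing in for a Plink call.
output = run_command([sys.executable, "-c", "print('ok')"])
```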


A safe version of the main function (that catches ProgramError).


Subset the data.

Parameters:options (argparse.Namespace) – the options.

Subsets the data using either --exclude or --extract for markers, or --remove or --keep for samples.
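
As an illustration, the Plink command for such a subset could be assembled as in the sketch below. It assumes Plink 1.07 flags (--bfile, --tfile, --file, --make-bed, --recode, --transpose, --noweb) and a binary named plink on the PATH; build_subset_command is a hypothetical helper, not the module's own code.

```python
def build_subset_command(ifile, file_type, out,
                         exclude=None, extract=None, remove=None, keep=None):
    """Build a Plink command (list of strings) that subsets a data set.

    `file_type` is one of "bfile", "tfile" or "file".
    """
    # The output format mirrors the input format.
    recode = {
        "bfile": ["--make-bed"],
        "tfile": ["--recode", "--transpose"],
        "file": ["--recode"],
    }
    command = ["plink", "--noweb", "--" + file_type, ifile]
    # Only the options that were actually provided are added.
    for flag, value in (("--exclude", exclude), ("--extract", extract),
                        ("--remove", remove), ("--keep", keep)):
        if value is not None:
            command += [flag, value]
    return command + recode[file_type] + ["--out", out]
```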