pyGenClean.SexCheck package

For more information about how to use this module, refer to the Sex Check Module.

Module contents

Submodules

pyGenClean.SexCheck.baf_lrr_plot module

exception pyGenClean.SexCheck.baf_lrr_plot.ProgramError(msg)[source]

Bases: exceptions.Exception

An Exception raised in case of a problem.

Parameters:msg (str) – the message to print to the user before exiting.
pyGenClean.SexCheck.baf_lrr_plot.checkArgs(args)[source]

Checks the arguments and options.

Parameters:args (argparse.Namespace) – an object containing the options of the program.
Returns:True if everything was OK.

If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr and the program exits with code 1.

pyGenClean.SexCheck.baf_lrr_plot.check_file_names(samples, raw_dir, options)[source]

Check if all files are present.

Parameters:
  • samples (list of tuples) – a list of tuples with the family ID as first element (str) and sample ID as last element (str).
  • raw_dir (str) – the directory containing the raw files.
  • options (argparse.Namespace) – the options.
Returns:

a dict with samples as keys (each a tuple with the family ID as first element and the sample ID as last element) and the name of the raw file as values.

pyGenClean.SexCheck.baf_lrr_plot.encode_chromosome(chromosome)[source]

Encodes chromosomes.

Parameters:chromosome (str) – the chromosome to encode.
Returns:the encoded chromosome.

Encodes the sex chromosomes, from 23 and 24 to X and Y, respectively.

Note

Only the sex chromosomes are encoded.

>>> encode_chromosome("23")
'X'
>>> encode_chromosome("24")
'Y'
>>> encode_chromosome("This is not a chromosome")
'This is not a chromosome'
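
The behaviour shown in the examples above can be reproduced with a plain dictionary lookup. The following is a minimal sketch written for illustration, not the module's actual source; the `_SEX_CHROMOSOMES` table is a hypothetical name.

```python
# Hypothetical re-implementation, for illustration only.
_SEX_CHROMOSOMES = {"23": "X", "24": "Y"}


def encode_chromosome(chromosome):
    """Return "X" or "Y" for "23" or "24"; return any other value unchanged."""
    return _SEX_CHROMOSOMES.get(chromosome, chromosome)
```
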
pyGenClean.SexCheck.baf_lrr_plot.main(argString=None)[source]

The main function of this module.

Parameters:argString (list) – the options.

These are the steps:

  1. Prints the options.
  2. Reads the problematic samples (read_problematic_samples()).
  3. Finds and checks the raw files for each of the problematic samples (check_file_names()).
  4. Plots the BAF and LRR (plot_baf_lrr()).
pyGenClean.SexCheck.baf_lrr_plot.parseArgs(argString=None)[source]

Parses the command line options and arguments.

Parameters:argString (list) – the options.
Returns:An argparse.Namespace object created by the argparse module. It contains the values of the different options.

Options                 Type    Description
--problematic-samples   string  The list of samples with sex problems to plot.
--use-full-ids          bool    Use full sample IDs (famID and indID).
--full-ids-delimiter    string  The delimiter between famID and indID.
--raw-dir               string  Directory containing information about every sample (BAF and LRR).
--format                string  The output file format (png, ps, pdf, or X11).
--out                   string  The prefix of the output files.

Note

No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see checkArgs()).

pyGenClean.SexCheck.baf_lrr_plot.plot_baf_lrr(file_names, options)[source]

Plot BAF and LRR for a list of files.

Parameters:
  • file_names (dict) – contains the name of the input file for each sample.
  • options (argparse.Namespace) – the options.

Plots the BAF (B Allele Frequency) and LRR (Log R Ratio) of each sample. Only the sex chromosomes are shown.

pyGenClean.SexCheck.baf_lrr_plot.read_problematic_samples(file_name)[source]

Reads a file with sample IDs.

Parameters:file_name (str) – the name of the file containing problematic samples after sex check.
Returns:a set of problematic samples (tuple containing the family ID as first element and the sample ID as last element).

Reads a file containing problematic samples after sex check. The file is provided by the module pyGenClean.SexCheck.sex_check. This file contains two columns, the first one being the family ID and the second one, the sample ID.
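
As a rough sketch of the reading step described above, the function below collects (family ID, sample ID) tuples from a two-column file. It assumes whitespace-separated columns and no header line, neither of which is stated in this documentation, and it is not the module's actual source.

```python
def read_problematic_samples(file_name):
    """Read (family ID, sample ID) tuples from a two-column text file."""
    samples = set()
    with open(file_name) as input_file:
        for line in input_file:
            # Assumption: whitespace-separated columns, no header line.
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or incomplete lines
            samples.add((fields[0], fields[1]))
    return samples
```
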

pyGenClean.SexCheck.baf_lrr_plot.safe_main()[source]

A safe version of the main function (that catches ProgramError).
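
The pattern behind a "safe" main function can be sketched as follows. This is an illustration under assumptions: the stand-in `main()` always fails, the `ERROR:` prefix is invented, and the exit code of 1 is borrowed from the description of checkArgs() rather than confirmed for safe_main().

```python
import sys


class ProgramError(Exception):
    """Raised when the program encounters a problem."""

    def __init__(self, msg):
        super().__init__(msg)
        self.message = str(msg)


def main(arg_string=None):
    """Stand-in for the real main(); always fails, for demonstration."""
    raise ProgramError("missing required option")


def safe_main():
    """Run main() and turn a ProgramError into a message and exit code 1."""
    try:
        main()
    except ProgramError as error:
        # Assumption: the message goes to stderr and the exit code is 1.
        sys.stderr.write("ERROR: {}\n".format(error.message))
        sys.exit(1)
```
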

pyGenClean.SexCheck.gender_plot module

exception pyGenClean.SexCheck.gender_plot.ProgramError(msg)[source]

Bases: exceptions.Exception

An Exception raised in case of a problem.

Parameters:msg (str) – the message to print to the user before exiting.
pyGenClean.SexCheck.gender_plot.checkArgs(args)[source]

Checks the arguments and options.

Parameters:args (argparse.Namespace) – an object containing the options of the program.
Returns:True if everything was OK.

If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr and the program exits with code 1.

pyGenClean.SexCheck.gender_plot.encode_chr(chromosome)[source]

Encodes chromosomes.

Parameters:chromosome (str) – the chromosome to encode.
Returns:the encoded chromosome as int.

It changes X, Y, XY and MT to 23, 24, 25 and 26, respectively. It converts everything else to int.

If the conversion raises ValueError, a ProgramError is raised instead. If a chromosome has a value below 1 or above 26, a ProgramError is raised.

>>> [encode_chr(str(i)) for i in range(0, 11)]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> [encode_chr(str(i)) for i in range(11, 21)]
[11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
>>> [encode_chr(str(i)) for i in range(21, 27)]
[21, 22, 23, 24, 25, 26]
>>> [encode_chr(i) for i in ["X", "Y", "XY", "MT"]]
[23, 24, 25, 26]
>>> encode_chr("27")
Traceback (most recent call last):
    ...
ProgramError: 27: invalid chromosome
>>> encode_chr("XX")
Traceback (most recent call last):
    ...
ProgramError: XX: invalid chromosome
pyGenClean.SexCheck.gender_plot.encode_gender(gender)[source]

Encodes the gender.

Parameters:gender (str) – the gender to encode.
Returns:the encoded gender.

It changes 1 and 2 to Male and Female respectively. It encodes everything else to Unknown.

>>> encode_gender("1")
'Male'
>>> encode_gender("2")
'Female'
>>> encode_gender("0")
'Unknown'
>>> encode_gender("This is not a gender code")
'Unknown'
pyGenClean.SexCheck.gender_plot.main(argString=None)[source]

The main function of the module.

Parameters:argString (list) – the options.

These are the steps:

  1. Prints the options.
  2. If summarized intensities are provided, reads the files (read_summarized_intensities()) and skips to step 7.
  3. Reads the bim file to get markers on the sex chromosomes (read_bim()).
  4. Reads the fam file to get gender (read_fam()).
  5. Reads the file containing samples with sex problems (read_sex_problems()).
  6. Reads the intensities and summarizes them (read_intensities()).
  7. Plots the summarized intensities (plot_gender()).
pyGenClean.SexCheck.gender_plot.parseArgs(argString=None)[source]

Parses the command line options and arguments.

Parameters:argString (list) – the options.
Returns:An argparse.Namespace object created by the argparse module. It contains the values of the different options.

Options                    Type    Description
--bfile                    string  The plink binary file containing information about markers and individuals.
--intensities              string  A file containing allele intensities for each of the markers located on the X and Y chromosomes.
--summarized-intensities   string  The prefix of six files containing summarized chr23 and chr24 intensities.
--sex-problems             string  The file containing individuals with sex problems.
--format                   string  The output file format (png, ps, pdf, or X11).
--xlabel                   string  The label of the X axis.
--ylabel                   string  The label of the Y axis.
--out                      string  The prefix of the output files.

Note

No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see checkArgs()).

pyGenClean.SexCheck.gender_plot.plot_gender(data, options)[source]

Plots the gender.

Parameters:

Plots the summarized intensities of the markers on the Y chromosome as a function of those on the X chromosome, with problematic samples shown in different colors.

Also uses print_data_to_file() to save the data, so that it is faster to rerun the analysis.

pyGenClean.SexCheck.gender_plot.print_data_to_file(data, file_name)[source]

Prints data to file.

Parameters:
  • data (numpy.recarray) – the data to print.
  • file_name (str) – the name of the output file.
pyGenClean.SexCheck.gender_plot.read_bim(file_name)[source]

Reads the BIM file to gather marker names.

Parameters:file_name (str) – the name of the bim file.
Returns:a dict containing the chromosomal location of each marker on the sex chromosomes.

It uses encode_chr() to encode the chromosomes from X and Y to 23 and 24, respectively.

pyGenClean.SexCheck.gender_plot.read_fam(file_name)[source]

Reads the FAM file to gather sample names.

Parameters:file_name (str) – the fam file to read.
Returns:a dict containing the gender of each sample.

It uses encode_gender() to encode the gender from 1 and 2 to Male and Female, respectively.

pyGenClean.SexCheck.gender_plot.read_intensities(file_name, needed_markers_chr, needed_samples_gender, sex_problems)[source]

Reads the intensities from a file.

Parameters:
  • file_name (str) – the name of the input file.
  • needed_markers_chr (dict) – the markers that are needed.
  • needed_samples_gender (dict) – the gender of all the samples.
  • sex_problems (frozenset) – the samples with sex problems.
Returns:

a numpy.recarray containing the following columns (for each of the samples): sampleID, chr23, chr24, gender and status.

Reads the normalized intensities from a final report. The file must contain the following columns: SNP Name, Sample ID, X, Y and Chr. It then keeps only the required markers (those on sex chromosomes 23 or 24), encoding NaN intensities as zero.

The final data set contains the following information for each sample:

  • sampleID: the sample ID.
  • chr23: the summarized intensities for chromosome 23.
  • chr24: the summarized intensities for chromosome 24.
  • gender: the gender of the sample (Male or Female).
  • status: the status of the sample (OK or Problem).

The summarized intensity for a chromosome (\(S_{chr}\)) is computed using this formula (where \(I_{chr}\) is the set of all marker intensities on chromosome \(chr\) and \(||I_{chr}||\) is the number of intensities in that set):

\[S_{chr} = \frac{\sum{I_{chr}}}{||I_{chr}||}\]
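
As a worked illustration of the formula, the sketch below averages a list of intensities after replacing NaN values with zero. The function name `summarize_intensities` is hypothetical, and the choice to keep NaN values in the denominator is an assumption drawn from the phrase about encoding NaN intensities to zero, not from the source code.

```python
import math


def summarize_intensities(intensities):
    """Return the mean intensity, with NaN values counted as zero."""
    if not intensities:
        raise ValueError("no intensities to summarize")
    # Assumption: NaN values become zero and still count in the denominator.
    cleaned = [0.0 if math.isnan(value) else value for value in intensities]
    return sum(cleaned) / len(cleaned)


print(summarize_intensities([0.5, 1.5, float("nan"), 2.0]))  # prints 1.0
```
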
pyGenClean.SexCheck.gender_plot.read_sex_problems(file_name)[source]

Reads the sex problem file.

Parameters:file_name (str) – the name of the file containing sex problems.
Returns:a frozenset containing samples with sex problems.

If file_name is None, an empty frozenset is returned.

pyGenClean.SexCheck.gender_plot.read_summarized_intensities(prefix)[source]

Reads the summarized intensities from 6 files.

Parameters:prefix (str) – the prefix of the six files.
Returns:a numpy.recarray containing the following columns (for each of the samples): sampleID, chr23, chr24, gender and status.

Instead of reading a final report (like read_intensities()), this function reads six files previously created by this module to gather sample information. Here is the content of the six files:

  • prefix.ok_females.txt: information about females without sex problems.
  • prefix.ok_males.txt: information about males without sex problems.
  • prefix.ok_unknowns.txt: information about samples of unknown gender
    without sex problems.
  • prefix.problematic_females.txt: information about females with sex
    problems.
  • prefix.problematic_males.txt: information about males with sex
    problems.
  • prefix.problematic_unknowns.txt: information about samples of unknown
    gender with sex problems.

Each file contains the following columns: sampleID, chr23, chr24, gender and status.
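
The six file names listed above follow a regular pattern, which the short sketch below reproduces from a prefix. The helper name `summarized_file_names` is hypothetical and is not part of the module.

```python
def summarized_file_names(prefix):
    """Build the six summarized-intensity file names for a given prefix."""
    statuses = ("ok", "problematic")
    genders = ("females", "males", "unknowns")
    return [
        "{}.{}_{}.txt".format(prefix, status, gender)
        for status in statuses
        for gender in genders
    ]


for name in summarized_file_names("prefix"):
    print(name)
```
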

The final data set contains the following information for each sample:

  • sampleID: the sample ID.
  • chr23: the summarized intensities for chromosome 23.
  • chr24: the summarized intensities for chromosome 24.
  • gender: the gender of the sample (Male or Female).
  • status: the status of the sample (OK or Problem).

The summarized intensity for a chromosome (\(S_{chr}\)) is computed using this formula (where \(I_{chr}\) is the set of all marker intensities on chromosome \(chr\) and \(||I_{chr}||\) is the number of intensities in that set):

\[S_{chr} = \frac{\sum{I_{chr}}}{||I_{chr}||}\]
pyGenClean.SexCheck.gender_plot.safe_main()[source]

A safe version of the main function (that catches ProgramError).

pyGenClean.SexCheck.sex_check module

exception pyGenClean.SexCheck.sex_check.ProgramError(msg)[source]

Bases: exceptions.Exception

An Exception raised in case of a problem.

Parameters:msg (str) – the message to print to the user before exiting.
pyGenClean.SexCheck.sex_check.checkArgs(args)[source]

Checks the arguments and options.

Parameters:args (argparse.Namespace) – an object containing the options of the program.
Returns:True if everything was OK.

If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr and the program exits with code 1.

pyGenClean.SexCheck.sex_check.checkBim(fileName, minNumber, chromosome)[source]

Checks the BIM file for chrN markers.

Parameters:
  • fileName (str) – the name of the BIM file.
  • minNumber (int) – the minimum number of markers required.
  • chromosome (str) – the chromosome to look for.
Returns:

True if there are at least minNumber markers on chromosome chromosome, False otherwise.
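
The check can be sketched as a simple count over the BIM file, relying on the fact that the chromosome code is the first column of a Plink BIM file. The function name `check_bim` and the whitespace splitting are choices made for this illustration, not the module's actual code.

```python
def check_bim(file_name, min_number, chromosome):
    """Return True if at least min_number markers are on the chromosome."""
    nb_markers = 0
    with open(file_name) as bim_file:
        for line in bim_file:
            fields = line.split()
            # In a Plink BIM file, the chromosome is the first column.
            if fields and fields[0] == chromosome:
                nb_markers += 1
    return nb_markers >= min_number
```
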

pyGenClean.SexCheck.sex_check.computeHeteroPercentage(fileName)[source]

Computes the heterozygosity percentage.

Parameters:fileName (str) – the name of the input file.

Reads the ped file created by Plink using the recodeA option (see createPedChr23UsingPlink()) and computes the heterozygosity percentage on chromosome 23.
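
A possible computation is sketched below, under several assumptions that this documentation does not confirm: the recodeA output has one header line, six leading columns (FID, IID, PAT, MAT, SEX, PHENOTYPE) followed by genotypes coded 0, 1, 2 or NA, a heterozygous call is coded 1, and missing calls are excluded from the denominator. The function name `hetero_percentage` is hypothetical.

```python
def hetero_percentage(raw_file_name):
    """Map (family ID, sample ID) to the percentage of heterozygous calls."""
    percentages = {}
    with open(raw_file_name) as raw_file:
        next(raw_file, None)  # skip the header line
        for line in raw_file:
            fields = line.split()
            if len(fields) < 6:
                continue  # skip blank or incomplete lines
            # Assumption: missing genotypes ("NA") are ignored.
            genotypes = [call for call in fields[6:] if call != "NA"]
            if genotypes:
                percent = 100.0 * genotypes.count("1") / len(genotypes)
            else:
                percent = float("nan")
            percentages[(fields[0], fields[1])] = percent
    return percentages
```
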

pyGenClean.SexCheck.sex_check.computeNoCall(fileName)[source]

Computes the number of no calls.

Parameters:fileName (str) – the name of the file.

Reads the ped file created by Plink using the recodeA option (see createPedChr24UsingPlink()) and computes the number and percentage of no calls on chromosome 24.

pyGenClean.SexCheck.sex_check.createGenderPlot(bfile, intensities, problematic_samples, format, out_prefix)[source]

Creates the gender plot.

Parameters:
  • bfile (str) – the prefix of the input binary file.
  • intensities (str) – the file containing the intensities.
  • problematic_samples (str) – the file containing the problematic samples.
  • format (str) – the format of the output plot.
  • out_prefix (str) – the prefix of the output file.

Creates the gender plot of the samples using the pyGenClean.SexCheck.gender_plot module.

pyGenClean.SexCheck.sex_check.createLrrBafPlot(raw_dir, problematic_samples, format, dpi, out_prefix)[source]

Creates the LRR and BAF plot.

Parameters:
  • raw_dir (str) – the directory containing the intensities.
  • problematic_samples (str) – the file containing the problematic samples.
  • format (str) – the format of the plot.
  • dpi (int) – the DPI of the resulting images.
  • out_prefix (str) – the prefix of the output file.

Creates the LRR (Log R Ratio) and BAF (B Allele Frequency) plots of the problematic samples using the pyGenClean.SexCheck.baf_lrr_plot module.

pyGenClean.SexCheck.sex_check.createPedChr23UsingPlink(options)

Runs Plink to create a ped file.

Parameters:options (argparse.Namespace) – the options.

Uses Plink to create a ped file of markers on chromosome 23. It uses the recodeA option to apply additive coding. It also subsets the data to keep only samples with sex problems.

pyGenClean.SexCheck.sex_check.createPedChr24UsingPlink(options)

Runs Plink to create a ped file.

Parameters:options (argparse.Namespace) – the options.

Uses Plink to create a ped file of markers on chromosome 24. It uses the recodeA option to apply additive coding. It also subsets the data to keep only samples with sex problems.

pyGenClean.SexCheck.sex_check.main(argString=None)[source]

The main function of the module.

Parameters:argString (list) – the options.

These are the steps:

  1. Prints the options.
  2. Checks if there are enough markers on the chromosome 23 (checkBim()). If not, quits here.
  3. Runs the sex check analysis using Plink (runPlinkSexCheck()).
  4. If there are no sex problems, then quits (readCheckSexFile()).
  5. Creates the recoded file for the chromosome 23 (createPedChr23UsingPlink()).
  6. Computes the heterozygosity percentage on the chromosome 23 (computeHeteroPercentage()).
  7. If there are enough markers on chromosome 24 (at least 1), creates the recoded file for this chromosome (createPedChr24UsingPlink()).
  8. Computes the number of no calls on the chromosome 24 (computeNoCall()).
  9. If required, plots the gender plot (createGenderPlot()).
  10. If required, plots the BAF and LRR plot (createLrrBafPlot()).
pyGenClean.SexCheck.sex_check.parseArgs(argString=None)[source]

Parses the command line options and arguments.

Parameters:argString (list) – the options.
Returns:An argparse.Namespace object created by the argparse module. It contains the values of the different options.

Options                 Type    Description
--bfile                 string  The input file prefix (Plink binary).
--femaleF               float   The female F threshold.
--maleF                 float   The male F threshold.
--nbChr23               int     The minimum number of markers on chromosome 23 before computing Plink’s sex check.
--gender-plot           bool    Create the gender plot.
--sex-chr-intensities   string  A file containing allele intensities for each of the markers located on the X and Y chromosomes.
--gender-plot-format    string  The output file format for the gender plot.
--lrr-baf               bool    Create the LRR and BAF plot.
--lrr-baf-raw-dir       string  Directory containing information about every sample (BAF and LRR).
--lrr-baf-format        string  The output file format.
--lrr-baf-dpi           int     The pixel density of the figure(s) (DPI).
--out                   string  The prefix of the output files.

Note

No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see checkArgs()).
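
To make the option table concrete, here is a sketch of an argparse parser that accepts the same option names and types. It is not the module's parser: defaults, help texts and required flags are unknown and therefore left out, and the threshold values in the example call are arbitrary.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented option names and types."""
    parser = argparse.ArgumentParser(description="Sketch of the sex check options.")
    parser.add_argument("--bfile", type=str)
    parser.add_argument("--femaleF", type=float)
    parser.add_argument("--maleF", type=float)
    parser.add_argument("--nbChr23", type=int)
    # Assumption: the bool options are simple on/off flags.
    parser.add_argument("--gender-plot", action="store_true")
    parser.add_argument("--sex-chr-intensities", type=str)
    parser.add_argument("--gender-plot-format", type=str)
    parser.add_argument("--lrr-baf", action="store_true")
    parser.add_argument("--lrr-baf-raw-dir", type=str)
    parser.add_argument("--lrr-baf-format", type=str)
    parser.add_argument("--lrr-baf-dpi", type=int)
    parser.add_argument("--out", type=str)
    return parser


# Arbitrary example values; argparse stores "--gender-plot" as gender_plot.
options = build_parser().parse_args(
    ["--bfile", "data", "--femaleF", "0.3", "--maleF", "0.7", "--gender-plot"]
)
print(options.femaleF, options.gender_plot)  # prints: 0.3 True
```

Because argparse performs only type conversion here, any consistency check between options (for example, requiring an intensity file when the gender plot is requested) would still belong in a separate function such as checkArgs().
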

pyGenClean.SexCheck.sex_check.readCheckSexFile(fileName, allProblemsFileName, idsFileName, femaleF, maleF)[source]

Reads the Plink check-sex output file.

Parameters:
  • fileName (str) – the name of the input file.
  • allProblemsFileName (str) – the name of the output file that will contain all the problems.
  • idsFileName (str) – the name of the output file that will contain samples with sex problems.
  • femaleF (float) – the F threshold for females.
  • maleF (float) – the F threshold for males.
Returns:

True if there are sex problems, False otherwise.

Reads the sex check file produced by runPlinkSexCheck() (Plink) and extracts the samples that have sex problems.
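
The extraction step can be sketched as below, using the standard column names of Plink's check-sex output (FID, IID, PEDSEX, SNPSEX, STATUS, F). The sketch only selects rows whose STATUS is PROBLEM; how the real function uses femaleF and maleF, and what it writes to its two output files, is not described here and is deliberately left out. The function name `find_sex_problems` is hypothetical.

```python
def find_sex_problems(sexcheck_file_name):
    """Return the (family ID, sample ID) pairs flagged as PROBLEM."""
    problems = set()
    with open(sexcheck_file_name) as sexcheck_file:
        header = sexcheck_file.readline().split()
        index = {name: i for i, name in enumerate(header)}
        for line in sexcheck_file:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            if fields[index["STATUS"]] == "PROBLEM":
                problems.add((fields[index["FID"]], fields[index["IID"]]))
    return problems
```
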

pyGenClean.SexCheck.sex_check.runCommand(command)[source]

Run a command.

Parameters:command (list) – the command to run.

Tries to run a command. If it fails, raises a ProgramError. This function uses the subprocess module.

Warning

The variable command should be a list of strings (no other type).
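
A minimal sketch of such a wrapper is shown below. It assumes that failure means either a non-zero exit status or a command that cannot be started; the error message wording and the name `run_command` are invented for the example, and the real function may differ.

```python
import subprocess
import sys


class ProgramError(Exception):
    """Raised when a command cannot be run."""


def run_command(command):
    """Run a command given as a list of strings; raise ProgramError on failure."""
    try:
        return subprocess.check_output(command, stderr=subprocess.STDOUT)
    except (subprocess.CalledProcessError, OSError):
        raise ProgramError("could not run: " + " ".join(command))


# Example: the current Python interpreter stands in for Plink.
output = run_command([sys.executable, "-c", "print('done')"])
```

Passing the command as a list of strings, as the warning above requires, lets subprocess run it without a shell, so arguments containing spaces need no quoting.
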

pyGenClean.SexCheck.sex_check.runPlinkSexCheck(options)[source]

Runs Plink to perform a sex check analysis.

Parameters:options (argparse.Namespace) – the options.

Uses Plink to perform a sex check analysis.

pyGenClean.SexCheck.sex_check.safe_main()[source]

A safe version of the main function (that catches ProgramError).