pyGenClean package

Subpackages

Submodules

pyGenClean.pipeline_error module

exception pyGenClean.pipeline_error.ProgramError(msg)[source]

Bases: exceptions.Exception

An Exception raised in case of a problem.

Parameters: msg (str) – the message to print to the user before exiting.
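To make the role of msg concrete, here is a minimal sketch of such an exception. It is not the package's source: the message attribute and the __str__ method are assumptions made for illustration.

```python
class ProgramError(Exception):
    """An exception raised in case of a problem (illustrative sketch)."""

    def __init__(self, msg):
        # Keep the message so the caller can print it before exiting.
        super().__init__(msg)
        self.message = str(msg)  # attribute name is an assumption

    def __str__(self):
        return self.message


# Usage: raise the error, then recover the message for the user.
try:
    raise ProgramError("missing input file")
except ProgramError as e:
    caught = str(e)
```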

pyGenClean.run_data_clean_up module

pyGenClean.run_data_clean_up.all_files_exist(file_list)[source]

Check if all files exist.

Parameters: file_list (list) – the names of files to check.
Returns: True if all files exist, False otherwise.
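A check of this kind can be sketched in one line. The use of os.path.isfile is an assumption; the package may test existence differently.

```python
import os


def all_files_exist(file_list):
    """Return True if every name in file_list is an existing file."""
    # all() stops at the first missing file.
    return all(os.path.isfile(name) for name in file_list)
```

Note that, written this way, an empty list yields True.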
pyGenClean.run_data_clean_up.check_args(args)[source]

Checks the arguments and options.

Parameters: args (argparse.Namespace) – an object containing the options and arguments of the program.
Returns: True if everything was OK.

If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr, and the program exits with error code 1.

pyGenClean.run_data_clean_up.check_input_files(prefix, the_type, required_type)[source]

Check that the file is of a certain file type.

Parameters:
  • prefix (str) – the prefix of the input files.
  • the_type (str) – the type of the input files (bfile, tfile or file).
  • required_type (str) – the required type of the input files (bfile, tfile or file).
Returns:

True if everything is OK.

Checks if the files are of the required type, according to their current type. The available types are bfile (binary), tfile (transposed) and file (normal).
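The three types correspond to Plink's standard file sets. The sketch below only lists the file names expected for a prefix and a type; expected_files is a hypothetical helper, not part of pyGenClean, and it performs no conversion between types.

```python
# File extensions used by Plink for each input type.
PLINK_EXTENSIONS = {
    "bfile": (".bed", ".bim", ".fam"),  # binary
    "tfile": (".tped", ".tfam"),        # transposed
    "file": (".ped", ".map"),           # normal
}


def expected_files(prefix, file_type):
    """List the file names Plink expects for a prefix and a type."""
    if file_type not in PLINK_EXTENSIONS:
        raise ValueError("unknown file type: {}".format(file_type))
    return [prefix + ext for ext in PLINK_EXTENSIONS[file_type]]
```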

pyGenClean.run_data_clean_up.count_markers_samples(prefix, file_type)[source]

Counts the number of markers and samples in a Plink file.

Parameters:
  • prefix (str) – the prefix of the files.
  • file_type (str) – the file type.
Returns:

the number of markers and samples (in a tuple).
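One way to obtain these counts is to count lines, since Plink stores one marker per line in the .bim, .tped and .map files, and one sample per line in the .fam, .tfam and .ped files. This is a sketch under that assumption (no header, no blank lines), not necessarily how the package does it.

```python
def count_lines(filename):
    """Count the lines of a text file."""
    with open(filename) as f:
        return sum(1 for _ in f)


# For each type: (file with one line per marker, file with one line per sample).
COUNT_FILES = {
    "bfile": (".bim", ".fam"),
    "tfile": (".tped", ".tfam"),
    "file": (".map", ".ped"),
}


def count_markers_samples(prefix, file_type):
    """Return (number of markers, number of samples)."""
    marker_ext, sample_ext = COUNT_FILES[file_type]
    return count_lines(prefix + marker_ext), count_lines(prefix + sample_ext)
```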

pyGenClean.run_data_clean_up.main()[source]

The main function.

These are the steps performed for the data clean up:

  1. Prints the version number.
  2. Reads the configuration file (read_config_file()).
  3. Creates a new directory with data_clean_up as prefix and the date and time as suffix.
  4. Checks the input file type (bfile, tfile or file).
  5. Creates an intermediate directory with the section as prefix and the script name as suffix (inside the previous directory).
  6. Runs the required script in order (according to the configuration file section).

Note

The main function is not responsible for checking whether the required files exist. This should be done in the run functions.
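The directory naming in steps 3 and 5 can be sketched as follows. Both helper names are hypothetical, and the timestamp layout and the separators are assumptions; only the prefix/suffix scheme comes from the description above.

```python
import datetime
import os


def make_output_dir_name(now=None):
    """Build a name with data_clean_up as prefix and date and time as suffix."""
    if now is None:
        now = datetime.datetime.now()
    # The timestamp layout below is an assumption.
    return "data_clean_up." + now.strftime("%Y-%m-%d_%H.%M.%S")


def make_step_dir_name(base_dir, section, script):
    """Build an intermediate directory name: section prefix, script suffix."""
    # The underscore separator is an assumption.
    return os.path.join(base_dir, "{}_{}".format(section, script))
```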

pyGenClean.run_data_clean_up.parse_args()[source]

Parses the command line options and arguments.

Returns: An argparse.Namespace object created by the argparse module. It contains the values of the different options.

Options              Type    Description
--bfile              String  The input binary file prefix from Plink.
--tfile              String  The input transposed file prefix from Plink.
--file               String  The input file prefix from Plink.
--conf               String  The parameter file for the data clean up.
--report-author      String  The current project author.
--report-number      String  The current project number.
--report-background  String  Text or file containing the background section of the report.

Note

No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see check_args()).
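For illustration, a parser accepting the options listed above could look like the sketch below. The option names come from the table; the build_parser name, the help strings, and the absence of defaults are assumptions.

```python
import argparse


def build_parser():
    """Build a parser with the documented options (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Data clean up (sketch)")
    parser.add_argument("--bfile", type=str, help="Input binary file prefix.")
    parser.add_argument("--tfile", type=str, help="Input transposed file prefix.")
    parser.add_argument("--file", type=str, help="Input file prefix.")
    parser.add_argument("--conf", type=str, help="Parameter file.")
    parser.add_argument("--report-author", type=str, help="Report author.")
    parser.add_argument("--report-number", type=str, help="Report number.")
    parser.add_argument("--report-background", type=str,
                        help="Background section of the report.")
    return parser


# Dashes in option names become underscores in the Namespace attributes.
args = build_parser().parse_args(["--bfile", "my_data", "--conf", "steps.conf"])
```

Options that are not given on the command line are set to None here, which is why a separate validation step is needed.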

pyGenClean.run_data_clean_up.read_config_file(filename)[source]

Reads the configuration file.

Parameters: filename (str) – the name of the file containing the configuration.
Returns: A tuple where the first element is a list of sections, and the second element is a map containing the configuration (options and values).

The structure of the configuration file is important. Here is an example of a configuration file:

[1] # Computes statistics on duplicated samples
script = duplicated_samples

[2] # Removes samples according to missingness
script = sample_missingness

[3] # Removes markers according to missingness
script = snp_missingness

[4] # Removes samples according to missingness (98%)
script = sample_missingness
mind = 0.02

[5] # Performs a sex check
script = sex_check

[6] # Flags markers with MAF=0
script = flag_maf_zero

[7] # Flags markers according to Hardy Weinberg
script = flag_hw

[8] # Subset the dataset (excludes markers and remove samples)
script = subset
exclude = .../filename
remove = .../filename

Sections are in square brackets and must be integers. The section number represents the step at which the script will be run (i.e. from the smallest number to the biggest). The sections must be continuous.

Each section contains the script name (script variable) and the options of the script (all other variables). For example, section 4 runs the sample_missingness script (run_sample_missingness()) with the option mind set to 0.02.
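The rules above can be sketched with Python 3's configparser. This is an illustration, not the package's implementation: read_config is a hypothetical name, it parses text rather than a file, and treating "continuous" as "no gap in the numbering" is an assumption.

```python
import configparser


def read_config(text):
    """Parse configuration text into (sorted sections, {section: options})."""
    # Allow "# ..." comments after a section header or a value.
    parser = configparser.ConfigParser(
        inline_comment_prefixes=("#",), interpolation=None
    )
    parser.read_string(text)

    # Section names must be integers.
    names = {}
    for name in parser.sections():
        try:
            names[int(name)] = name
        except ValueError:
            raise ValueError("{}: section must be an integer".format(name))
    sections = sorted(names)

    # Sections must be continuous (assumed to mean: no gap in the numbering).
    if sections:
        expected = list(range(sections[0], sections[0] + len(sections)))
        if sections != expected:
            raise ValueError("sections must be continuous")

    # Each section needs a script; every other variable is a script option.
    config = {}
    for section in sections:
        options = dict(parser.items(names[section]))
        if "script" not in options:
            raise ValueError("section {}: no script".format(section))
        config[section] = options
    return sections, config


EXAMPLE = """
[1] # Removes samples according to missingness
script = sample_missingness
mind = 0.02

[2] # Performs a sex check
script = sex_check
"""
sections, config = read_config(EXAMPLE)
```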

Here is a list of the available scripts:

pyGenClean.run_data_clean_up.run_check_ethnicity(in_prefix, in_type, out_prefix, base_dir, options)[source]

Runs step10 (check ethnicity).

Parameters:
  • in_prefix (str) – the prefix of the input files.
  • in_type (str) – the type of the input files.
  • out_prefix (str) – the output prefix.
  • base_dir (str) – the output directory.
  • options (list) – the options needed.
Returns:

a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).

This function calls the pyGenClean.Ethnicity.check_ethnicity module. The required file type for this module is bfile, hence the need to use check_input_files() to check whether the input file type is the correct one, or to create it if needed.

Note

The pyGenClean.Ethnicity.check_ethnicity module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.
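All run functions share the same signature and return a (prefix, type) tuple, so each step can feed the next. The sketch below illustrates that chaining with two hypothetical steps and a hypothetical run_pipeline driver; the output prefix layout is an assumption, and the steps do no real work.

```python
def run_filtering_step(in_prefix, in_type, out_prefix, base_dir, options):
    """Hypothetical step that writes new files under out_prefix."""
    return out_prefix, "bfile"


def run_flagging_step(in_prefix, in_type, out_prefix, base_dir, options):
    """Hypothetical step that produces no usable output files."""
    # Nothing new to pass on, so the next step reuses this step's input.
    return in_prefix, in_type


def run_pipeline(steps, prefix, file_type, base_dir):
    """Feed each step's (prefix, type) result to the next step."""
    for number, (run_function, options) in enumerate(steps, start=1):
        # Output prefix layout is an assumption made for this sketch.
        out_prefix = "{}/{}/clean".format(base_dir, number)
        prefix, file_type = run_function(
            prefix, file_type, out_prefix, base_dir, options
        )
    return prefix, file_type


steps = [(run_filtering_step, []), (run_flagging_step, [])]
result = run_pipeline(steps, "raw_data", "bfile", "out")
```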

pyGenClean.run_data_clean_up.run_command(command)[source]

Run a command using subprocesses.

Parameters: command (list) – the command to run.

Tries to run a command. If it fails, raises a ProgramError.

Warning

The variable command should be a list of strings (no other type).
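A command runner of this kind can be sketched with the subprocess module. The ProgramError class below is a stand-in for pyGenClean.pipeline_error.ProgramError, and returning the captured output is an assumption.

```python
import subprocess
import sys


class ProgramError(Exception):
    """Stand-in for pyGenClean.pipeline_error.ProgramError."""


def run_command(command):
    """Run a command (a list of strings); raise ProgramError on failure."""
    try:
        # shell=False: the list is passed to the program without shell parsing.
        return subprocess.check_output(
            command, stderr=subprocess.STDOUT, shell=False
        )
    except (subprocess.CalledProcessError, OSError):
        # Non-zero exit status, or the program could not be started.
        raise ProgramError("could not run: " + " ".join(command))


output = run_command([sys.executable, "-c", "print('ok')"])
```

Passing a list rather than a single string avoids quoting problems with file names that contain spaces.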

pyGenClean.run_data_clean_up.run_compare_gold_standard(in_prefix, in_type, out_prefix, base_dir, options)[source]

Compares with a gold standard data set (compare_gold_standard).

Parameters:
  • in_prefix (str) – the prefix of the input files.
  • in_type (str) – the type of the input files.
  • out_prefix (str) – the output prefix.
  • base_dir (str) – the output directory.
  • options (list) – the options needed.
Returns:

a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).

This function calls the pyGenClean.Misc.compare_gold_standard module. The required file type for this module is bfile, hence the need to use check_input_files() to check whether the input file type is the correct one, or to create it if needed.

Note

The pyGenClean.Misc.compare_gold_standard module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.

pyGenClean.run_data_clean_up.run_contamination(in_prefix, in_type, out_prefix, base_dir, options)[source]

Runs the contamination check for samples.

Parameters:
  • in_prefix (str) – the prefix of the input files.
  • in_type (str) – the type of the input files.
  • out_prefix (str) – the output prefix.
  • base_dir (str) – the output directory.
  • options (list) – the options needed.
Returns:

a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).

pyGenClean.run_data_clean_up.run_duplicated_samples(in_prefix, in_type, out_prefix, base_dir, options)[source]

Runs step1 (duplicated samples).

Parameters:
  • in_prefix (str) – the prefix of the input files.
  • in_type (str) – the type of the input files.
  • out_prefix (str) – the output prefix.
  • base_dir (str) – the output directory.
  • options (list) – the options needed.
Returns:

a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (tfile).

This function calls the pyGenClean.DupSamples.duplicated_samples module. The required file type for this module is tfile, hence the need to use check_input_files() to check whether the input file type is the correct one, or to create it if needed.

pyGenClean.run_data_clean_up.run_duplicated_snps(in_prefix, in_type, out_prefix, base_dir, options)[source]

Runs step2 (duplicated snps).

Parameters:
  • in_prefix (str) – the prefix of the input files.
  • in_type (str) – the type of the input files.
  • out_prefix (str) – the output prefix.
  • base_dir (str) – the output directory.
  • options (list) – the options needed.
Returns:

a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (tfile).

This function calls the pyGenClean.DupSNPs.duplicated_snps module. The required file type for this module is tfile, hence the need to use check_input_files() to check whether the input file type is the correct one, or to create it if needed.

Note

This function creates a map file, needed for the pyGenClean.DupSNPs.duplicated_snps module.

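pyGenClean.run_data_clean_up.run_find_related_samples(in_prefix, in_type, out_prefix, base_dir, options)[source]

Runs step9 (find related samples).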

Parameters:
  • in_prefix (str) – the prefix of the input files.
  • in_type (str) – the type of the input files.
  • out_prefix (str) – the output prefix.
  • base_dir (str) – the output directory.
  • options (list) – the options needed.
Returns:

a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).

This function calls the pyGenClean.RelatedSamples.find_related_samples module. The required file type for this module is bfile, hence the need to use check_input_files() to check whether the input file type is the correct one, or to create it if needed.

Note

The pyGenClean.RelatedSamples.find_related_samples module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.

pyGenClean.run_data_clean_up.run_flag_hw(in_prefix, in_type, out_prefix, base_dir, options)[source]

Runs step12 (flag HW).

Parameters:
  • in_prefix (str) – the prefix of the input files.
  • in_type (str) – the type of the input files.
  • out_prefix (str) – the output prefix.
  • base_dir (str) – the output directory.
  • options (list) – the options needed.
Returns:

a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).

This function calls the pyGenClean.FlagHW.flag_hw module. The required file type for this module is bfile, hence the need to use check_input_files() to check whether the input file type is the correct one, or to create it if needed.

Note

The pyGenClean.FlagHW.flag_hw module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.

pyGenClean.run_data_clean_up.run_flag_maf_zero(in_prefix, in_type, out_prefix, base_dir, options)[source]

Runs step11 (flag MAF zero).

Parameters:
  • in_prefix (str) – the prefix of the input files.
  • in_type (str) – the type of the input files.
  • out_prefix (str) – the output prefix.
  • base_dir (str) – the output directory.
  • options (list) – the options needed.
Returns:

a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).

This function calls the pyGenClean.FlagMAF.flag_maf_zero module. The required file type for this module is bfile, hence the need to use check_input_files() to check whether the input file type is the correct one, or to create it if needed.

Note

The pyGenClean.FlagMAF.flag_maf_zero module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.

pyGenClean.run_data_clean_up.run_noCall_hetero_snps(in_prefix, in_type, out_prefix, base_dir, options)[source]

Runs step3 (clean no call and hetero).

Parameters:
  • in_prefix (str) – the prefix of the input files.
  • in_type (str) – the type of the input files.
  • out_prefix (str) – the output prefix.
  • base_dir (str) – the output directory.
  • options (list) – the options needed.
Returns:

a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (tfile).

This function calls the pyGenClean.NoCallHetero.clean_noCall_hetero_snps module. The required file type for this module is tfile, hence the need to use check_input_files() to check whether the input file type is the correct one, or to create it if needed.

pyGenClean.run_data_clean_up.run_plate_bias(in_prefix, in_type, out_prefix, base_dir, options)[source]

Runs step7 (plate bias).

Parameters:
  • in_prefix (str) – the prefix of the input files.
  • in_type (str) – the type of the input files.
  • out_prefix (str) – the output prefix.
  • base_dir (str) – the output directory.
  • options (list) – the options needed.
Returns:

a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).

This function calls the pyGenClean.PlateBias.plate_bias module. The required file type for this module is bfile, hence the need to use check_input_files() to check whether the input file type is the correct one, or to create it if needed.

Note

The pyGenClean.PlateBias.plate_bias module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.

pyGenClean.run_data_clean_up.run_remove_heterozygous_haploid(in_prefix, in_type, out_prefix, base_dir, options)[source]

Runs step8 (remove heterozygous haploid).

Parameters:
  • in_prefix (str) – the prefix of the input files.
  • in_type (str) – the type of the input files.
  • out_prefix (str) – the output prefix.
  • base_dir (str) – the output directory.
  • options (list) – the options needed.
Returns:

a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).

This function calls the pyGenClean.HeteroHap.remove_heterozygous_haploid module. The required file type for this module is bfile, hence the need to use check_input_files() to check whether the input file type is the correct one, or to create it if needed.

pyGenClean.run_data_clean_up.run_sample_missingness(in_prefix, in_type, out_prefix, base_dir, options)[source]

Runs step4 (clean mind).

Parameters:
  • in_prefix (str) – the prefix of the input files.
  • in_type (str) – the type of the input files.
  • out_prefix (str) – the output prefix.
  • base_dir (str) – the output directory.
  • options (list) – the options needed.
Returns:

a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).

This function calls the pyGenClean.SampleMissingness.sample_missingness module. The required file type for this module is either bfile or tfile, hence the need to use check_input_files() to check whether the input file type is the correct one, or to create it if needed.

pyGenClean.run_data_clean_up.run_sex_check(in_prefix, in_type, out_prefix, base_dir, options)[source]

Runs step6 (sexcheck).

Parameters:
  • in_prefix (str) – the prefix of the input files.
  • in_type (str) – the type of the input files.
  • out_prefix (str) – the output prefix.
  • base_dir (str) – the output directory.
  • options (list) – the options needed.
Returns:

a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).

This function calls the pyGenClean.SexCheck.sex_check module. The required file type for this module is bfile, hence the need to use check_input_files() to check whether the input file type is the correct one, or to create it if needed.

Note

The pyGenClean.SexCheck.sex_check module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.

pyGenClean.run_data_clean_up.run_snp_missingness(in_prefix, in_type, out_prefix, base_dir, options)[source]

Runs step5 (clean geno).

Parameters:
  • in_prefix (str) – the prefix of the input files.
  • in_type (str) – the type of the input files.
  • out_prefix (str) – the output prefix.
  • base_dir (str) – the output directory.
  • options (list) – the options needed.
Returns:

a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).

This function calls the pyGenClean.MarkerMissingness.snp_missingness module. The required file type for this module is bfile, hence the need to use check_input_files() to check whether the input file type is the correct one, or to create it if needed.

pyGenClean.run_data_clean_up.run_subset_data(in_prefix, in_type, out_prefix, base_dir, options)[source]

Subsets the data.

Parameters:
  • in_prefix (str) – the prefix of the input files.
  • in_type (str) – the type of the input files.
  • out_prefix (str) – the output prefix.
  • base_dir (str) – the output directory.
  • options (list) – the options needed.
Returns:

a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).

This function calls the pyGenClean.PlinkUtils.subset_data module. The required file type for this module is bfile, hence the need to use check_input_files() to check whether the input file type is the correct one, or to create it if needed.

Note

The output file type is the same as the input file type.

pyGenClean.run_data_clean_up.safe_main()[source]

A safe version of the main function (that catches ProgramError).
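The idea of a safe wrapper can be sketched as follows. The safe_call name and its main_function parameter are hypothetical (the documented safe_main() takes no argument), ProgramError is a stand-in class, and the exit code 1 follows the behavior described for check_args().

```python
import sys


class ProgramError(Exception):
    """Stand-in for pyGenClean.pipeline_error.ProgramError."""


def safe_call(main_function):
    """Call main_function; on ProgramError, report it and exit with code 1."""
    try:
        main_function()
    except ProgramError as e:
        # Show a short message instead of a full traceback.
        sys.stderr.write("ERROR: {}\n".format(e))
        sys.exit(1)
```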

pyGenClean.version module

Module contents

pyGenClean.add_file_handler_to_root(log_fn)[source]

Adds a file handler to the root logger.

Parameters: log_fn (str) – the name of the log file.
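With the standard logging module, attaching a file handler to the root logger could look like this minimal sketch. The file mode, the format string, and returning the handler are assumptions.

```python
import logging


def add_file_handler_to_root(log_fn):
    """Attach a handler writing to log_fn to the root logger."""
    file_handler = logging.FileHandler(log_fn, mode="w")
    # The format string is an assumption.
    file_handler.setFormatter(
        logging.Formatter("[%(asctime)s %(name)s %(levelname)s] %(message)s")
    )
    # getLogger() without a name returns the root logger.
    logging.getLogger().addHandler(file_handler)
    return file_handler
```

Because the handler sits on the root logger, messages from any module-level logger that propagates to the root end up in the same file.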