pyGenClean package¶

Subpackages¶

pyGenClean.Contamination package
- Module contents
- Submodules
  - pyGenClean.Contamination.contamination module
pyGenClean.DupSNPs package
- Module contents
- Submodules
  - pyGenClean.DupSNPs.duplicated_snps module
pyGenClean.DupSamples package
- Module contents
- Submodules
  - pyGenClean.DupSamples.duplicated_samples module
pyGenClean.Ethnicity package
- Module contents
- Submodules
pyGenClean.FlagHW package
- Module contents
- Submodules
  - pyGenClean.FlagHW.flag_hw module
pyGenClean.FlagMAF package
- Module contents
- Submodules
  - pyGenClean.FlagMAF.flag_maf_zero module
pyGenClean.HeteroHap package
- Module contents
- Submodules
  - pyGenClean.HeteroHap.remove_heterozygous_haploid module
pyGenClean.LaTeX package
pyGenClean.MarkerMissingness package
- Module contents
- Submodules
  - pyGenClean.MarkerMissingness.snp_missingness module
pyGenClean.Misc package
pyGenClean.NoCallHetero package
- Module contents
- Submodules
  - pyGenClean.NoCallHetero.clean_noCall_hetero_snps module
  - pyGenClean.NoCallHetero.heterozygosity_plot module
pyGenClean.PlateBias package
- Module contents
- Submodules
  - pyGenClean.PlateBias.plate_bias module
pyGenClean.PlinkUtils package
- Module contents
- Submodules
pyGenClean.RelatedSamples package
- Module contents
- Submodules
  - pyGenClean.RelatedSamples.find_related_samples module
  - pyGenClean.RelatedSamples.merge_related_samples module
pyGenClean.SampleMissingness package
- Module contents
- Submodules
  - pyGenClean.SampleMissingness.sample_missingness module
pyGenClean.SexCheck package
- Module contents
- Submodules

Submodules¶

pyGenClean.pipeline_error module¶

exception pyGenClean.pipeline_error.ProgramError(msg)[source]¶

Bases: exceptions.Exception

An Exception raised in case of a problem.

Parameters:	msg (str) – the message to print to the user before exiting.
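As a rough illustration, an exception of this kind might be defined and caught as follows. This is a sketch written for Python 3 (the page lists the Python 2 base `exceptions.Exception`); the `message` attribute, the `__str__` behaviour and the helper functions are assumptions for illustration, not taken from the pyGenClean source.

```python
class ProgramError(Exception):
    """Illustrative stand-in for pyGenClean.pipeline_error.ProgramError."""

    def __init__(self, msg):
        super().__init__(msg)
        # Assumed attribute name; the real class may store the text differently.
        self.message = str(msg)

    def __str__(self):
        return self.message


def describe_failure(func):
    """Call func() and return the error text if it raises ProgramError."""
    try:
        func()
    except ProgramError as e:
        return str(e)
    return None


def failing_step():
    """Hypothetical pipeline step that always fails."""
    raise ProgramError("missing input file")
```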

pyGenClean.run_data_clean_up module¶

pyGenClean.run_data_clean_up.all_files_exist(file_list)[source]¶

Check if all files exist.

Parameters:	file_list (list) – the names of files to check.
Returns:	`True` if all files exist, `False` otherwise.
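A check like this can be sketched in a couple of lines. The version below is an assumption about the behaviour described above (it treats only regular files as existing), not the package's actual code.

```python
import os


def all_files_exist(file_list):
    """Return True only if every name in file_list is an existing file."""
    # all() of an empty sequence is True, so an empty list passes the check.
    return all(os.path.isfile(name) for name in file_list)
```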

pyGenClean.run_data_clean_up.check_args(args)[source]¶

Checks the arguments and options.

Parameters:	args (`argparse.Namespace`) – an object containing the options and arguments of the program.
Returns:	`True` if everything was OK.

If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr, and the program exits with error code 1.

pyGenClean.run_data_clean_up.check_input_files(prefix, the_type, required_type)[source]¶

Check that the file is of a certain file type.

Parameters:	prefix (str) – the prefix of the input files. the_type (str) – the type of the input files (bfile, tfile or file). required_type (str) – the required type of the input files (bfile, tfile or file).
Returns:	`True` if everything is OK.

Checks if the files are of the required type, according to their current type. The available types are bfile (binary), tfile (transposed) and file (normal).
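The three types map onto the standard Plink file extensions. The sketch below lists the files expected for a prefix and reports whether a conversion would be needed; the helper names `expected_files` and `needs_conversion` are hypothetical, and the conversion itself (which the real function performs when needed) is not shown.

```python
# Standard Plink file extensions for each input type.
PLINK_EXTENSIONS = {
    "bfile": (".bed", ".bim", ".fam"),  # binary
    "tfile": (".tped", ".tfam"),        # transposed
    "file": (".ped", ".map"),           # normal
}


def expected_files(prefix, file_type):
    """List the file names that should exist for a prefix and type."""
    if file_type not in PLINK_EXTENSIONS:
        raise ValueError("unknown file type: {}".format(file_type))
    return [prefix + ext for ext in PLINK_EXTENSIONS[file_type]]


def needs_conversion(the_type, required_type):
    """Return True when the current type differs from the required one."""
    for name in (the_type, required_type):
        if name not in PLINK_EXTENSIONS:
            raise ValueError("unknown file type: {}".format(name))
    return the_type != required_type
```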

pyGenClean.run_data_clean_up.count_markers_samples(prefix, file_type)[source]¶

Counts the number of markers and samples in a Plink file.

Parameters:	prefix (str) – the prefix of the files. file_type (str) – the file type.
Returns:	the number of markers and samples (in a tuple).
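One way to obtain these counts is to count lines in the Plink text files that hold one marker or one sample per line. The sketch below assumes that approach; the `COUNT_FILES` table, `count_lines` and `demo` are illustrative names, and the real function may work differently.

```python
import os
import tempfile

# For each Plink type: (file with one line per marker, file with one line per sample).
COUNT_FILES = {
    "bfile": (".bim", ".fam"),
    "tfile": (".tped", ".tfam"),
    "file": (".map", ".ped"),
}


def count_lines(filename):
    """Count the lines of a text file without loading it into memory."""
    with open(filename) as f:
        return sum(1 for _ in f)


def count_markers_samples(prefix, file_type):
    """Return (number of markers, number of samples) for a Plink data set."""
    marker_ext, sample_ext = COUNT_FILES[file_type]
    return count_lines(prefix + marker_ext), count_lines(prefix + sample_ext)


def demo():
    """Build a toy data set with 3 markers and 2 samples and count them."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = os.path.join(tmp, "toy")
        with open(prefix + ".bim", "w") as f:
            f.write("1\trs1\t0\t100\tA\tG\n" * 3)
        with open(prefix + ".fam", "w") as f:
            f.write("F1 S1 0 0 1 -9\n" * 2)
        return count_markers_samples(prefix, "bfile")
```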

pyGenClean.run_data_clean_up.main()[source]¶

The main function.

These are the steps performed for the data clean up:

1. Prints the version number.
2. Reads the configuration file (read_config_file()).
3. Creates a new directory with data_clean_up as prefix and the date and time as suffix.
4. Checks the input file type (bfile, tfile or file).
5. Creates an intermediate directory with the section as prefix and the script name as suffix (inside the previous directory).
6. Runs the required scripts in order (according to the configuration file sections).
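The directory naming in the steps above can be sketched as follows. The separators and the timestamp format are assumptions (the text only states which part is the prefix and which is the suffix), and both helper names are hypothetical.

```python
import datetime
import os


def build_output_dir(now):
    """Name of the top-level directory: data_clean_up prefix, date/time suffix."""
    # Assumed timestamp layout; the real format is not given on this page.
    return "data_clean_up." + now.strftime("%Y-%m-%d_%H.%M.%S")


def build_step_dir(base_dir, section, script_name):
    """Name of a step directory: section number prefix, script name suffix."""
    return os.path.join(base_dir, "{}_{}".format(section, script_name))
```

For example, section 4 running sample_missingness would get a directory named `4_sample_missingness` inside the top-level directory under these assumptions.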

Note

The main function is not responsible for checking whether the required files exist. This should be done in the run functions.

pyGenClean.run_data_clean_up.parse_args()[source]¶

Parses the command line options and arguments.

Returns:	An `argparse.Namespace` object created by the `argparse` module. It contains the values of the different options.

Options	Type	Description
`--bfile`	String	The input binary file prefix from Plink.
`--tfile`	String	The input transposed file prefix from Plink.
`--file`	String	The input file prefix from Plink.
`--conf`	String	The parameter file for the data clean up.
`--report-author`	String	The current project author.
`--report-number`	String	The current project number.
`--report-background`	String	Text or file containing the background section of the report.

Note

No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see check_args()).
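A parser for the options in the table could look roughly like this. The option names come from the table; the `build_parser` helper, the metavars, the help wording and the absence of defaults are assumptions, and no cross-option validation is done here, in line with the note above.

```python
import argparse


def build_parser():
    """Build an argparse parser for the documented options (sketch only)."""
    parser = argparse.ArgumentParser(description="Data clean up (illustrative sketch).")
    parser.add_argument("--bfile", type=str, metavar="FILE",
                        help="The input binary file prefix from Plink.")
    parser.add_argument("--tfile", type=str, metavar="FILE",
                        help="The input transposed file prefix from Plink.")
    parser.add_argument("--file", type=str, metavar="FILE",
                        help="The input file prefix from Plink.")
    parser.add_argument("--conf", type=str, metavar="FILE",
                        help="The parameter file for the data clean up.")
    parser.add_argument("--report-author", type=str, metavar="AUTHOR",
                        help="The current project author.")
    parser.add_argument("--report-number", type=str, metavar="NUMBER",
                        help="The current project number.")
    parser.add_argument("--report-background", type=str, metavar="BACKGROUND",
                        help="The background section of the report.")
    return parser
```

Calling `build_parser().parse_args(["--bfile", "study", "--conf", "steps.conf"])` returns a namespace in which the unspecified options are `None`; deciding whether that combination is valid is left to the separate argument check.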

pyGenClean.run_data_clean_up.read_config_file(filename)[source]¶

Reads the configuration file.

Parameters:	filename (str) – the name of the file containing the configuration.
Returns:	A tuple where the first element is a list of sections, and the second element is a map containing the configuration (options and values).

The structure of the configuration file is important. Here is an example of a configuration file:

[1] # Computes statistics on duplicated samples
script = duplicated_samples

[2] # Removes samples according to missingness
script = sample_missingness

[3] # Removes markers according to missingness
script = snp_missingness

[4] # Removes samples according to missingness (98%)
script = sample_missingness
mind = 0.02

[5] # Performs a sex check
script = sex_check

[6] # Flags markers with MAF=0
script = flag_maf_zero

[7] # Flags markers according to Hardy Weinberg
script = flag_hw

[8] # Subset the dataset (excludes markers and remove samples)
script = subset
exclude = .../filename
remove = .../filename

Section names are written in square brackets and must be integers. The section number represents the step at which the script will be run (i.e. from the smallest number to the biggest). The section numbers must be continuous.

Each section contains the script name (script variable) and the options of the script (all other variables). For example, section 4 runs the sample_missingness script (run_sample_missingness()) with the option mind set to 0.02.
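The rules above (integer section names, continuous numbering, one script per section plus its options) can be sketched with the standard `configparser` module. This is an illustration under assumptions: it reads from a string rather than a file name, the function name `read_config` is hypothetical, and the real read_config_file() may be implemented differently.

```python
import configparser

CONFIG_TEXT = """
[1] # Computes statistics on duplicated samples
script = duplicated_samples

[2] # Removes samples according to missingness (98%)
script = sample_missingness
mind = 0.02
"""


def read_config(text):
    """Return (sorted section numbers, {section number: {option: value}})."""
    parser = configparser.ConfigParser(
        inline_comment_prefixes=("#",),  # allow comments after section headers
        interpolation=None,              # treat '%' literally
    )
    parser.read_string(text)

    try:
        sections = sorted(int(name) for name in parser.sections())
    except ValueError:
        raise ValueError("section names must be integers")

    # Continuous numbering: no gaps between the smallest and biggest section.
    if sections:
        expected = list(range(sections[0], sections[0] + len(sections)))
        if sections != expected:
            raise ValueError("section numbers must be continuous")

    configuration = {}
    for number in sections:
        options = dict(parser[str(number)])
        if "script" not in options:
            raise ValueError("section {} has no script".format(number))
        configuration[number] = options
    return sections, configuration
```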

Here is a list of the available scripts:

duplicated_samples (run_duplicated_samples())
duplicated_snps (run_duplicated_snps())
noCall_hetero_snps (run_noCall_hetero_snps())
sample_missingness (run_sample_missingness())
snp_missingness (run_snp_missingness())
sex_check (run_sex_check())
plate_bias (run_plate_bias())
contamination (run_contamination())
remove_heterozygous_haploid (run_remove_heterozygous_haploid())
find_related_samples (run_find_related_samples())
check_ethnicity (run_check_ethnicity())
flag_maf_zero (run_flag_maf_zero())
flag_hw (run_flag_hw())
subset (run_subset_data())
compare_gold_standard (run_compare_gold_standard())
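The mapping from script names to run functions, and the way each step's output becomes the next step's input, can be sketched as below. The two run functions are stand-ins that only mimic the documented return values, and `AVAILABLE`, `run_steps` and the directory layout are hypothetical names and choices for illustration.

```python
def run_sample_missingness(in_prefix, in_type, out_prefix, base_dir, options):
    """Stand-in: pretends to write new binary files under out_prefix."""
    return out_prefix, "bfile"


def run_sex_check(in_prefix, in_type, out_prefix, base_dir, options):
    """Stand-in: produces no usable output files, so the input is passed on."""
    return in_prefix, in_type


# Script name (as written in the configuration file) -> run function.
AVAILABLE = {
    "sample_missingness": run_sample_missingness,
    "sex_check": run_sex_check,
}


def run_steps(prefix, file_type, steps, base_dir="."):
    """Run (script, options) pairs in order, chaining prefixes and types."""
    for number, (script, options) in enumerate(steps, start=1):
        if script not in AVAILABLE:
            raise ValueError("unknown script: {}".format(script))
        out_prefix = "{}/{}_{}/clean".format(base_dir, number, script)
        prefix, file_type = AVAILABLE[script](
            prefix, file_type, out_prefix, base_dir, options
        )
    return prefix, file_type
```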

pyGenClean.run_data_clean_up.run_check_ethnicity(in_prefix, in_type, out_prefix, base_dir, options)[source]¶

Runs step10 (check ethnicity).

Parameters:	in_prefix (str) – the prefix of the input files. in_type (str) – the type of the input files. out_prefix (str) – the output prefix. base_dir (str) – the output directory. options (list) – the options needed.
Returns:	a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (`bfile`).

This function calls the pyGenClean.Ethnicity.check_ethnicity module. The required file type for this module is bfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.

Note

The pyGenClean.Ethnicity.check_ethnicity module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.

pyGenClean.run_data_clean_up.run_command(command)[source]¶

Run a command using subprocesses.

Parameters:	command (list) – the command to run.

Tries to run a command. If it fails, raises a ProgramError.

Warning

The variable command should be a list of strings (no other type).
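A minimal sketch of this behaviour using the standard `subprocess` module is shown below. The local `ProgramError` class is a stand-in for the package's exception, returning the captured output is an assumption (the page documents no return value), and `command_succeeds` is a hypothetical helper added for illustration.

```python
import subprocess
import sys


class ProgramError(Exception):
    """Stand-in for pyGenClean.pipeline_error.ProgramError."""


def run_command(command):
    """Run a command given as a list of strings; raise ProgramError on failure."""
    if not all(isinstance(part, str) for part in command):
        raise ProgramError("the command must be a list of strings")
    try:
        # shell=False: the list is passed to the program as-is, without a shell.
        return subprocess.check_output(command, stderr=subprocess.STDOUT, shell=False)
    except (subprocess.CalledProcessError, OSError):
        raise ProgramError("could not run: " + " ".join(command))


def command_succeeds(command):
    """Return True if the command ran without raising ProgramError."""
    try:
        run_command(command)
    except ProgramError:
        return False
    return True
```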

pyGenClean.run_data_clean_up.run_compare_gold_standard(in_prefix, in_type, out_prefix, base_dir, options)[source]¶

Compares with a gold standard data set (compare_gold_standard).

Parameters:	in_prefix (str) – the prefix of the input files. in_type (str) – the type of the input files. out_prefix (str) – the output prefix. base_dir (str) – the output directory. options (list) – the options needed.
Returns:	a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (`bfile`).

This function calls the pyGenClean.Misc.compare_gold_standard module. The required file type for this module is bfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.

Note

The pyGenClean.Misc.compare_gold_standard module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.

pyGenClean.run_data_clean_up.run_contamination(in_prefix, in_type, out_prefix, base_dir, options)[source]¶

Runs the contamination check for samples.

Parameters:	in_prefix (str) – the prefix of the input files. in_type (str) – the type of the input files. out_prefix (str) – the output prefix. base_dir (str) – the output directory. options (list) – the options needed.
Returns:	a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (`bfile`).

pyGenClean.run_data_clean_up.run_duplicated_samples(in_prefix, in_type, out_prefix, base_dir, options)[source]¶

Runs step1 (duplicated samples).

Parameters:	in_prefix (str) – the prefix of the input files. in_type (str) – the type of the input files. out_prefix (str) – the output prefix. base_dir (str) – the output directory. options (list) – the options needed.
Returns:	a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (`tfile`).

This function calls the pyGenClean.DupSamples.duplicated_samples module. The required file type for this module is tfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.

pyGenClean.run_data_clean_up.run_duplicated_snps(in_prefix, in_type, out_prefix, base_dir, options)[source]¶

Runs step2 (duplicated snps).

Parameters:	in_prefix (str) – the prefix of the input files. in_type (str) – the type of the input files. out_prefix (str) – the output prefix. base_dir (str) – the output directory. options (list) – the options needed.
Returns:	a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (`tfile`).

This function calls the pyGenClean.DupSNPs.duplicated_snps module. The required file type for this module is tfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.

Note

This function creates a map file, needed for the pyGenClean.DupSNPs.duplicated_snps module.

pyGenClean.run_data_clean_up.run_find_related_samples(in_prefix, in_type, out_prefix, base_dir, options)[source]¶

Runs step9 (find related samples).

Parameters:	in_prefix (str) – the prefix of the input files. in_type (str) – the type of the input files. out_prefix (str) – the output prefix. base_dir (str) – the output directory. options (list) – the options needed.
Returns:	a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (`bfile`).

This function calls the pyGenClean.RelatedSamples.find_related_samples module. The required file type for this module is bfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.

Note

The pyGenClean.RelatedSamples.find_related_samples module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.

pyGenClean.run_data_clean_up.run_flag_hw(in_prefix, in_type, out_prefix, base_dir, options)[source]¶

Runs step12 (flag HW).

Parameters:	in_prefix (str) – the prefix of the input files. in_type (str) – the type of the input files. out_prefix (str) – the output prefix. base_dir (str) – the output directory. options (list) – the options needed.
Returns:	a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (`bfile`).

This function calls the pyGenClean.FlagHW.flag_hw module. The required file type for this module is bfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.

Note

The pyGenClean.FlagHW.flag_hw module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.

pyGenClean.run_data_clean_up.run_flag_maf_zero(in_prefix, in_type, out_prefix, base_dir, options)[source]¶

Runs step11 (flag MAF zero).

Parameters:	in_prefix (str) – the prefix of the input files. in_type (str) – the type of the input files. out_prefix (str) – the output prefix. base_dir (str) – the output directory. options (list) – the options needed.
Returns:	a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (`bfile`).

This function calls the pyGenClean.FlagMAF.flag_maf_zero module. The required file type for this module is bfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.

Note

The pyGenClean.FlagMAF.flag_maf_zero module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.

pyGenClean.run_data_clean_up.run_noCall_hetero_snps(in_prefix, in_type, out_prefix, base_dir, options)[source]¶

Runs step3 (clean no call and hetero).

Parameters:	in_prefix (str) – the prefix of the input files. in_type (str) – the type of the input files. out_prefix (str) – the output prefix. base_dir (str) – the output directory. options (list) – the options needed.
Returns:	a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (`tfile`).

This function calls the pyGenClean.NoCallHetero.clean_noCall_hetero_snps module. The required file type for this module is tfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.

pyGenClean.run_data_clean_up.run_plate_bias(in_prefix, in_type, out_prefix, base_dir, options)[source]¶

Runs step7 (plate bias).

Parameters:	in_prefix (str) – the prefix of the input files. in_type (str) – the type of the input files. out_prefix (str) – the output prefix. base_dir (str) – the output directory. options (list) – the options needed.
Returns:	a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (`bfile`).

This function calls the pyGenClean.PlateBias.plate_bias module. The required file type for this module is bfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.

Note

The pyGenClean.PlateBias.plate_bias module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.

pyGenClean.run_data_clean_up.run_remove_heterozygous_haploid(in_prefix, in_type, out_prefix, base_dir, options)[source]¶

Runs step8 (remove heterozygous haploid).

Parameters:	in_prefix (str) – the prefix of the input files. in_type (str) – the type of the input files. out_prefix (str) – the output prefix. base_dir (str) – the output directory. options (list) – the options needed.
Returns:	a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (`bfile`).

This function calls the pyGenClean.HeteroHap.remove_heterozygous_haploid module. The required file type for this module is bfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.

pyGenClean.run_data_clean_up.run_sample_missingness(in_prefix, in_type, out_prefix, base_dir, options)[source]¶

Runs step4 (clean mind).

Parameters:	in_prefix (str) – the prefix of the input files. in_type (str) – the type of the input files. out_prefix (str) – the output prefix. base_dir (str) – the output directory. options (list) – the options needed.
Returns:	a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (`bfile`).

This function calls the pyGenClean.SampleMissingness.sample_missingness module. The required file type for this module is either bfile or tfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.

pyGenClean.run_data_clean_up.run_sex_check(in_prefix, in_type, out_prefix, base_dir, options)[source]¶

Runs step6 (sexcheck).

Parameters:	in_prefix (str) – the prefix of the input files. in_type (str) – the type of the input files. out_prefix (str) – the output prefix. base_dir (str) – the output directory. options (list) – the options needed.
Returns:	a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (`bfile`).

This function calls the pyGenClean.SexCheck.sex_check module. The required file type for this module is bfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.

Note

The pyGenClean.SexCheck.sex_check module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.

pyGenClean.run_data_clean_up.run_snp_missingness(in_prefix, in_type, out_prefix, base_dir, options)[source]¶

Runs step5 (clean geno).

Parameters:	in_prefix (str) – the prefix of the input files. in_type (str) – the type of the input files. out_prefix (str) – the output prefix. base_dir (str) – the output directory. options (list) – the options needed.
Returns:	a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (`bfile`).

This function calls the pyGenClean.MarkerMissingness.snp_missingness module. The required file type for this module is bfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.

pyGenClean.run_data_clean_up.run_subset_data(in_prefix, in_type, out_prefix, base_dir, options)[source]¶

Subsets the data.

Parameters:	in_prefix (str) – the prefix of the input files. in_type (str) – the type of the input files. out_prefix (str) – the output prefix. base_dir (str) – the output directory. options (list) – the options needed.
Returns:	a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (`bfile`).

This function calls the pyGenClean.PlinkUtils.subset_data module. The required file type for this module is bfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.

Note

The output file type is the same as the input file type.

pyGenClean.run_data_clean_up.safe_main()[source]¶

A safe version of the main function (that catches ProgramError).
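The wrapper pattern can be sketched as follows. To keep the example testable, this version takes the main function and an output stream as parameters and returns an exit code instead of exiting; those choices, the local `ProgramError` stand-in and the `failing_main` helper are assumptions, whereas the documented safe_main() takes no arguments.

```python
import io
import sys


class ProgramError(Exception):
    """Stand-in for pyGenClean.pipeline_error.ProgramError."""


def safe_main(main_func, stream=sys.stderr):
    """Run main_func; on ProgramError print the message and return 1."""
    try:
        main_func()
    except ProgramError as e:
        print(str(e), file=stream)
        return 1
    return 0


def failing_main():
    """Hypothetical main function that hits a pipeline problem."""
    raise ProgramError("no input file was specified")
```

A real entry point would typically pass the returned value to `sys.exit()`.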

pyGenClean.version module¶

Module contents¶

pyGenClean.add_file_handler_to_root(log_fn)[source]¶

Adds a file handler to the root logging.

Parameters:	log_fn (str) – the name of the log file.
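Attaching a file handler to the root logger can be sketched with the standard `logging` module. The file mode, the log format and the returned handler are assumptions made for illustration, and `demo` is a hypothetical helper that exercises the function on a temporary file.

```python
import logging
import os
import tempfile


def add_file_handler_to_root(log_fn):
    """Attach a FileHandler writing to log_fn to the root logger."""
    handler = logging.FileHandler(log_fn, mode="w")  # assumed: overwrite each run
    handler.setFormatter(
        logging.Formatter("[%(asctime)s %(name)s %(levelname)s] %(message)s")
    )
    logging.getLogger().addHandler(handler)
    return handler  # returned here only so the caller can detach it later


def demo():
    """Log one message to a temporary file and return the file's contents."""
    with tempfile.TemporaryDirectory() as tmp:
        log_fn = os.path.join(tmp, "pipeline.log")
        handler = add_file_handler_to_root(log_fn)
        logging.getLogger().warning("starting the clean up")
        # Detach and close the handler before the directory is removed.
        logging.getLogger().removeHandler(handler)
        handler.close()
        with open(log_fn) as f:
            return f.read()
```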
