pyGenClean package
Subpackages
- pyGenClean.Contamination package
- pyGenClean.DupSNPs package
- pyGenClean.DupSamples package
- pyGenClean.Ethnicity package
- pyGenClean.FlagHW package
- pyGenClean.FlagMAF package
- pyGenClean.HeteroHap package
- pyGenClean.LaTeX package
- pyGenClean.MarkerMissingness package
- pyGenClean.Misc package
- pyGenClean.NoCallHetero package
- pyGenClean.PlateBias package
- pyGenClean.PlinkUtils package
- pyGenClean.RelatedSamples package
- pyGenClean.SampleMissingness package
- pyGenClean.SexCheck package
Submodules
pyGenClean.pipeline_error module
- exception pyGenClean.pipeline_error.ProgramError(msg)
  Bases: exceptions.Exception
  An Exception raised in case of a problem.
  Parameters: msg (str) – the message to print to the user before exiting.
pyGenClean.run_data_clean_up module
- pyGenClean.run_data_clean_up.all_files_exist(file_list)
  Checks if all files exist.
  Parameters: file_list (list) – the names of the files to check.
  Returns: True if all files exist, False otherwise.
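The behaviour described for all_files_exist() can be sketched in a few lines. This is a minimal illustration built only on the standard library, not the package's actual implementation; treating "exists" as "is a regular file" is an assumption.

```python
import os


def all_files_exist(file_list):
    """Return True if every name in file_list is an existing regular file."""
    # Assumption: directories do not count as existing files.
    return all(os.path.isfile(name) for name in file_list)
```

With this sketch an empty list returns True, since no file is missing.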
- pyGenClean.run_data_clean_up.check_args(args)
  Checks the arguments and options.
  Parameters: args (argparse.Namespace) – an object containing the options and arguments of the program.
  Returns: True if everything was OK.
  If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr, and the program exits with error code 1.
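The raise, report, and exit pattern described for check_args() might look like the sketch below. The ProgramError class here is a local stand-in for pyGenClean.pipeline_error.ProgramError, and check_conf_option() and run_checks() are hypothetical helpers introduced only for illustration.

```python
import sys


class ProgramError(Exception):
    """Local stand-in for pyGenClean.pipeline_error.ProgramError."""

    def __init__(self, msg):
        super().__init__(msg)
        self.message = str(msg)


def check_conf_option(conf):
    """Hypothetical check: a configuration file name must be provided."""
    if not conf:
        raise ProgramError("no configuration file was provided")
    return True


def run_checks(conf):
    """Return exit code 0 if the check passes, 1 after reporting the error."""
    try:
        check_conf_option(conf)
    except ProgramError as e:
        # The message goes to the standard error stream.
        print(e.message, file=sys.stderr)
        return 1
    return 0
```

A caller would typically end with sys.exit(run_checks(conf)), so a failed check produces exit code 1.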
- pyGenClean.run_data_clean_up.check_input_files(prefix, the_type, required_type)
  Checks that the file is of a certain file type.
  Returns: True if everything is OK.
  Checks if the files are of the required type, according to their current type. The available types are bfile (binary), tfile (transposed) and file (normal).
- pyGenClean.run_data_clean_up.count_markers_samples(prefix, file_type)
  Counts the number of markers and samples in a Plink file.
  Returns: the number of markers and samples (in a tuple).
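One way to obtain these counts is to count lines in the Plink text files that hold one marker or one sample per line. The sketch below assumes the standard Plink layouts (bim/fam for bfile, tped/tfam for tfile, map/ped for file) and files without blank lines; EXTENSIONS and count_lines() are illustrative names, not part of the documented API.

```python
# (marker file extension, sample file extension) for each Plink file type.
EXTENSIONS = {
    "bfile": (".bim", ".fam"),
    "tfile": (".tped", ".tfam"),
    "file": (".map", ".ped"),
}


def count_lines(path):
    """Count the lines of a text file without loading it in memory."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def count_markers_samples(prefix, file_type):
    """Return (number of markers, number of samples) for a Plink data set."""
    marker_ext, sample_ext = EXTENSIONS[file_type]
    return count_lines(prefix + marker_ext), count_lines(prefix + sample_ext)
```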
- pyGenClean.run_data_clean_up.main()
  The main function.
  These are the steps performed for the data clean up:
  - Prints the version number.
  - Reads the configuration file (read_config_file()).
  - Creates a new directory with data_clean_up as prefix and the date and time as suffix.
  - Checks the input file type (bfile, tfile or file).
  - Creates an intermediate directory with the section as prefix and the script name as suffix (inside the previous directory).
  - Runs the required scripts in order (according to the configuration file sections).
  Note
  The main function is not responsible for checking if the required files exist. This should be done in the run functions.
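The last two steps can be pictured as a loop that feeds the (prefix, type) tuple returned by one run function into the next. The sketch below is an assumption about how such chaining could be written, not the package's code: run_step_a(), run_step_b(), AVAILABLE, run_pipeline() and the "clean" output file name are all hypothetical, and no directory is actually created.

```python
def run_step_a(in_prefix, in_type, out_prefix, base_dir, options):
    """Hypothetical step that would write binary files under out_prefix."""
    return out_prefix, "bfile"


def run_step_b(in_prefix, in_type, out_prefix, base_dir, options):
    """Hypothetical step with no usable output: passes its input along."""
    return in_prefix, in_type


# Script name (as written in the configuration file) -> run function.
AVAILABLE = {"step_a": run_step_a, "step_b": run_step_b}


def run_pipeline(sections, configuration, prefix, file_type, base_dir):
    """Run each section in order, chaining the returned prefix and type."""
    for section in sections:
        options = dict(configuration[section])
        script = options.pop("script")
        # Intermediate directory: section as prefix, script name as suffix.
        out_prefix = "{}/{}_{}/clean".format(base_dir, section, script)
        prefix, file_type = AVAILABLE[script](
            prefix, file_type, out_prefix, base_dir, options
        )
    return prefix, file_type
```

Keeping the dispatch in a dictionary means an unknown script name fails with a KeyError before any later step runs.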
- pyGenClean.run_data_clean_up.parse_args()
  Parses the command line options and arguments.
  Returns: An argparse.Namespace object created by the argparse module. It contains the values of the different options.

  Options              Type    Description
  --bfile              String  The input binary file prefix from Plink.
  --tfile              String  The input transposed file prefix from Plink.
  --file               String  The input file prefix from Plink.
  --conf               String  The parameter file for the data clean up.
  --report-author      String  The current project author.
  --report-number      String  The current project number.
  --report-background  String  Text or file containing the background section of the report.

  Note
  No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see check_args()).
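The option table maps directly onto argparse calls. The following sketch covers the listed options; the argv parameter, the metavar names and the help strings are assumptions added so the example can be exercised without a real command line.

```python
import argparse


def parse_args(argv=None):
    """Parse the documented options; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(description="Data clean up (sketch).")
    parser.add_argument("--bfile", type=str, metavar="FILE",
                        help="The input binary file prefix from Plink.")
    parser.add_argument("--tfile", type=str, metavar="FILE",
                        help="The input transposed file prefix from Plink.")
    parser.add_argument("--file", type=str, metavar="FILE",
                        help="The input file prefix from Plink.")
    parser.add_argument("--conf", type=str, metavar="FILE",
                        help="The parameter file for the data clean up.")
    parser.add_argument("--report-author", type=str, metavar="AUTHOR",
                        help="The current project author.")
    parser.add_argument("--report-number", type=str, metavar="NUMBER",
                        help="The current project number.")
    parser.add_argument("--report-background", type=str, metavar="TEXT",
                        help="Text or file with the report background.")
    # No validation here: options that were not given are simply None.
    return parser.parse_args(argv)
```

argparse turns the hyphens into underscores, so --report-author is read back as args.report_author.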
- pyGenClean.run_data_clean_up.read_config_file(filename)
  Reads the configuration file.
  Parameters: filename (str) – the name of the file containing the configuration.
  Returns: A tuple where the first element is a list of sections, and the second element is a map containing the configuration (options and values).
  The structure of the configuration file is important. Here is an example of a configuration file:

      [1]
      # Computes statistics on duplicated samples
      script = duplicated_samples

      [2]
      # Removes samples according to missingness
      script = sample_missingness

      [3]
      # Removes markers according to missingness
      script = snp_missingness

      [4]
      # Removes samples according to missingness (98%)
      script = sample_missingness
      mind = 0.02

      [5]
      # Performs a sex check
      script = sex_check

      [6]
      # Flags markers with MAF=0
      script = flag_maf_zero

      [7]
      # Flags markers according to Hardy Weinberg
      script = flag_hw

      [8]
      # Subsets the dataset (excludes markers and removes samples)
      script = subset
      exclude = .../filename
      remove = .../filename

  Section names are in square brackets and must be integers. The section number represents the step at which the script will be run (i.e. from the smallest number to the biggest). The section numbers must be consecutive.
  Each section contains the script name (script variable) and the options of the script (all other variables). For example, section 4 runs the sample_missingness script (run_sample_missingness()) with the option mind set to 0.02.
  Here is a list of the available scripts:
  - duplicated_samples (run_duplicated_samples())
  - duplicated_snps (run_duplicated_snps())
  - noCall_hetero_snps (run_noCall_hetero_snps())
  - sample_missingness (run_sample_missingness())
  - snp_missingness (run_snp_missingness())
  - sex_check (run_sex_check())
  - plate_bias (run_plate_bias())
  - contamination (run_contamination())
  - remove_heterozygous_haploid (run_remove_heterozygous_haploid())
  - find_related_samples (run_find_related_samples())
  - check_ethnicity (run_check_ethnicity())
  - flag_maf_zero (run_flag_maf_zero())
  - flag_hw (run_flag_hw())
  - subset (run_subset_data())
  - compare_gold_standard (run_compare_gold_standard())
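A configuration file of this shape can be read with the standard configparser module. The sketch below returns the ordered section list and the option map described above; it is written for Python 3, and raising ValueError for malformed files is an assumption (the documented function may report errors differently).

```python
import configparser


def read_config_file(filename):
    """Return (ordered section names, {section: {option: value}})."""
    parser = configparser.ConfigParser()
    with open(filename) as handle:
        parser.read_file(handle)

    # Section names must be integers; remember the original spelling.
    try:
        by_number = {int(name): name for name in parser.sections()}
    except ValueError:
        raise ValueError("section names must be integers")

    # Sections run from the smallest number to the biggest, without gaps.
    numbers = sorted(by_number)
    if numbers and numbers != list(range(numbers[0], numbers[0] + len(numbers))):
        raise ValueError("section numbers must be consecutive")

    sections = [by_number[n] for n in numbers]
    configuration = {}
    for name in sections:
        options = dict(parser.items(name))
        if "script" not in options:
            raise ValueError("section {} has no script".format(name))
        configuration[name] = options
    return sections, configuration
```

All values come back as strings, so an option such as mind = 0.02 still has to be converted by the script that uses it.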
- pyGenClean.run_data_clean_up.run_check_ethnicity(in_prefix, in_type, out_prefix, base_dir, options)
  Runs step 10 (check ethnicity).
  Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
  This function calls the pyGenClean.Ethnicity.check_ethnicity module. The required file type for this module is bfile, hence the need to use check_input_files() to check if the input file type is the correct one, or to create it if needed.
  Note
  The pyGenClean.Ethnicity.check_ethnicity module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.
- pyGenClean.run_data_clean_up.run_command(command)
  Runs a command using subprocesses.
  Parameters: command (list) – the command to run.
  Tries to run a command. If it fails, raises a ProgramError.
  Warning
  The variable command should be a list of strings (no other type).
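A helper with this contract could be written around the subprocess module as sketched below. The ProgramError class is a local stand-in for pyGenClean.pipeline_error.ProgramError, and the error message wording is an assumption.

```python
import subprocess


class ProgramError(Exception):
    """Local stand-in for pyGenClean.pipeline_error.ProgramError."""


def run_command(command):
    """Run command (a list of strings); raise ProgramError on failure."""
    try:
        # shell=False: the list is passed as-is, with no shell interpretation.
        subprocess.check_output(command, stderr=subprocess.STDOUT, shell=False)
    except (subprocess.CalledProcessError, OSError):
        # CalledProcessError: non-zero exit code; OSError: program not found.
        raise ProgramError("could not run command: " + " ".join(command))
```

Passing a list rather than a single string is what makes shell=False safe here: each element reaches the program as one argument, even if it contains spaces.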
- pyGenClean.run_data_clean_up.run_compare_gold_standard(in_prefix, in_type, out_prefix, base_dir, options)
  Compares with a gold standard data set (compare_gold_standard).
  Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
  This function calls the pyGenClean.Misc.compare_gold_standard module. The required file type for this module is bfile, hence the need to use check_input_files() to check if the input file type is the correct one, or to create it if needed.
  Note
  The pyGenClean.Misc.compare_gold_standard module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.
- pyGenClean.run_data_clean_up.run_contamination(in_prefix, in_type, out_prefix, base_dir, options)
  Runs the contamination check for samples.
  Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
- pyGenClean.run_data_clean_up.run_duplicated_samples(in_prefix, in_type, out_prefix, base_dir, options)
  Runs step 1 (duplicated samples).
  Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (tfile).
  This function calls the pyGenClean.DupSamples.duplicated_samples module. The required file type for this module is tfile, hence the need to use check_input_files() to check if the input file type is the correct one, or to create it if needed.
- pyGenClean.run_data_clean_up.run_duplicated_snps(in_prefix, in_type, out_prefix, base_dir, options)
  Runs step 2 (duplicated SNPs).
  Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (tfile).
  This function calls the pyGenClean.DupSNPs.duplicated_snps module. The required file type for this module is tfile, hence the need to use check_input_files() to check if the input file type is the correct one, or to create it if needed.
  Note
  This function creates a map file, needed for the pyGenClean.DupSNPs.duplicated_snps module.
- pyGenClean.run_data_clean_up.run_find_related_samples(in_prefix, in_type, out_prefix, base_dir, options)
  Runs step 9 (find related samples).
  Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
  This function calls the pyGenClean.RelatedSamples.find_related_samples module. The required file type for this module is bfile, hence the need to use check_input_files() to check if the input file type is the correct one, or to create it if needed.
  Note
  The pyGenClean.RelatedSamples.find_related_samples module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.
- pyGenClean.run_data_clean_up.run_flag_hw(in_prefix, in_type, out_prefix, base_dir, options)
  Runs step 12 (flag HW).
  Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
  This function calls the pyGenClean.FlagHW.flag_hw module. The required file type for this module is bfile, hence the need to use check_input_files() to check if the input file type is the correct one, or to create it if needed.
  Note
  The pyGenClean.FlagHW.flag_hw module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.
- pyGenClean.run_data_clean_up.run_flag_maf_zero(in_prefix, in_type, out_prefix, base_dir, options)
  Runs step 11 (flag MAF zero).
  Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
  This function calls the pyGenClean.FlagMAF.flag_maf_zero module. The required file type for this module is bfile, hence the need to use check_input_files() to check if the input file type is the correct one, or to create it if needed.
  Note
  The pyGenClean.FlagMAF.flag_maf_zero module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.
- pyGenClean.run_data_clean_up.run_noCall_hetero_snps(in_prefix, in_type, out_prefix, base_dir, options)
  Runs step 3 (clean no call and hetero).
  Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (tfile).
  This function calls the pyGenClean.NoCallHetero.clean_noCall_hetero_snps module. The required file type for this module is tfile, hence the need to use check_input_files() to check if the input file type is the correct one, or to create it if needed.
- pyGenClean.run_data_clean_up.run_plate_bias(in_prefix, in_type, out_prefix, base_dir, options)
  Runs step 7 (plate bias).
  Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
  This function calls the pyGenClean.PlateBias.plate_bias module. The required file type for this module is bfile, hence the need to use check_input_files() to check if the input file type is the correct one, or to create it if needed.
  Note
  The pyGenClean.PlateBias.plate_bias module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.
- pyGenClean.run_data_clean_up.run_remove_heterozygous_haploid(in_prefix, in_type, out_prefix, base_dir, options)
  Runs step 8 (remove heterozygous haploid).
  Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
  This function calls the pyGenClean.HeteroHap.remove_heterozygous_haploid module. The required file type for this module is bfile, hence the need to use check_input_files() to check if the input file type is the correct one, or to create it if needed.
- pyGenClean.run_data_clean_up.run_sample_missingness(in_prefix, in_type, out_prefix, base_dir, options)
  Runs step 4 (clean mind).
  Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
  This function calls the pyGenClean.SampleMissingness.sample_missingness module. The required file type for this module is either a bfile or a tfile, hence the need to use check_input_files() to check if the input file type is the correct one, or to create it if needed.
- pyGenClean.run_data_clean_up.run_sex_check(in_prefix, in_type, out_prefix, base_dir, options)
  Runs step 6 (sex check).
  Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
  This function calls the pyGenClean.SexCheck.sex_check module. The required file type for this module is bfile, hence the need to use check_input_files() to check if the input file type is the correct one, or to create it if needed.
  Note
  The pyGenClean.SexCheck.sex_check module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.
- pyGenClean.run_data_clean_up.run_snp_missingness(in_prefix, in_type, out_prefix, base_dir, options)
  Runs step 5 (clean geno).
  Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
  This function calls the pyGenClean.MarkerMissingness.snp_missingness module. The required file type for this module is bfile, hence the need to use check_input_files() to check if the input file type is the correct one, or to create it if needed.
- pyGenClean.run_data_clean_up.run_subset_data(in_prefix, in_type, out_prefix, base_dir, options)
  Subsets the data.
  Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
  This function calls the pyGenClean.PlinkUtils.subset_data module. The required file type for this module is bfile, hence the need to use check_input_files() to check if the input file type is the correct one, or to create it if needed.
  Note
  The output file type is the same as the input file type.
