pyGenClean package
Subpackages
- pyGenClean.Contamination package
- pyGenClean.DupSNPs package
- pyGenClean.DupSamples package
- pyGenClean.Ethnicity package
- pyGenClean.FlagHW package
- pyGenClean.FlagMAF package
- pyGenClean.HeteroHap package
- pyGenClean.LaTeX package
- pyGenClean.MarkerMissingness package
- pyGenClean.Misc package
- pyGenClean.NoCallHetero package
- pyGenClean.PlateBias package
- pyGenClean.PlinkUtils package
- pyGenClean.RelatedSamples package
- pyGenClean.SampleMissingness package
- pyGenClean.SexCheck package
Submodules
pyGenClean.pipeline_error module
-
exception pyGenClean.pipeline_error.ProgramError(msg)
Bases: exceptions.Exception
An Exception raised in case of a problem.
Parameters: msg (str) – the message to print to the user before exiting.
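A minimal Python 3 sketch of how such an exception might be defined and raised. The class body is an assumption (the documentation above only names the msg parameter), and fail_if_empty is a hypothetical helper used for illustration.

```python
class ProgramError(Exception):
    """Sketch of an exception carrying a user-facing message (assumed layout)."""

    def __init__(self, msg):
        super().__init__(msg)
        self.message = str(msg)

    def __str__(self):
        return self.message


def fail_if_empty(prefix):
    """Hypothetical helper: raise ProgramError when no prefix is given."""
    if not prefix:
        raise ProgramError("no input prefix was specified")
    return prefix
```

A caller would typically catch ProgramError at the top level, print the message to sys.stderr and exit with a non-zero code.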
pyGenClean.run_data_clean_up module
-
pyGenClean.run_data_clean_up.all_files_exist(file_list)
Checks whether all files exist.
Parameters: file_list (list) – the names of the files to check.
Returns: True if all files exist, False otherwise.
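The check described above can be sketched in a few lines. This is an illustration based on the documented behaviour, not the module's actual code; the use of os.path.isfile is an assumption.

```python
import os


def all_files_exist(file_list):
    """Return True only if every name in file_list is an existing file."""
    return all(os.path.isfile(name) for name in file_list)
```

Note that, in this sketch, an empty list yields True, because there is no missing file to report.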
-
pyGenClean.run_data_clean_up.check_args(args)
Checks the arguments and options.
Parameters: args (argparse.Namespace) – an object containing the options and arguments of the program.
Returns: True if everything was OK.
If there is a problem with an option, an exception is raised using the ProgramError class, a message is printed to sys.stderr, and the program exits with error code 1.
-
pyGenClean.run_data_clean_up.check_input_files(prefix, the_type, required_type)
Checks that the files are of a certain file type.
Returns: True if everything is OK.
Checks whether the files are of the required type, according to their current type. The available types are bfile (binary), tfile (transposed) and file (normal).
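The three file types correspond to the standard Plink file sets: .bed/.bim/.fam for bfile, .tped/.tfam for tfile and .ped/.map for file. The sketch below only tests whether the files of the required type are present for a prefix; expected_files and has_required_type are hypothetical names, and any conversion between types is left out.

```python
import os

# Standard Plink file extensions for each input type.
PLINK_EXTENSIONS = {
    "bfile": (".bed", ".bim", ".fam"),
    "tfile": (".tped", ".tfam"),
    "file": (".ped", ".map"),
}


def expected_files(prefix, file_type):
    """Hypothetical helper: list the files Plink expects for a prefix."""
    if file_type not in PLINK_EXTENSIONS:
        raise ValueError("unknown file type: {}".format(file_type))
    return [prefix + ext for ext in PLINK_EXTENSIONS[file_type]]


def has_required_type(prefix, required_type):
    """Return True if all files of the required type exist for this prefix."""
    return all(os.path.isfile(f) for f in expected_files(prefix, required_type))
```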
-
pyGenClean.run_data_clean_up.count_markers_samples(prefix, file_type)
Counts the number of markers and samples in a Plink file.
Returns: the number of markers and samples (as a tuple).
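One way to obtain these counts is to count lines in the Plink text files, since each format stores one marker per line in one file and one sample per line in another. The sketch below assumes that approach; it is not necessarily how the module does it, and count_lines is a hypothetical helper.

```python
def count_lines(path):
    """Count the non-empty lines of a text file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def count_markers_samples(prefix, file_type):
    """Return (number of markers, number of samples) for a Plink data set."""
    # For each type: (file with one line per marker, file with one line per sample).
    marker_ext, sample_ext = {
        "bfile": (".bim", ".fam"),
        "tfile": (".tped", ".tfam"),
        "file": (".map", ".ped"),
    }[file_type]
    return count_lines(prefix + marker_ext), count_lines(prefix + sample_ext)
```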
-
pyGenClean.run_data_clean_up.main()
The main function.
These are the steps performed for the data clean up:
- Prints the version number.
- Reads the configuration file (read_config_file()).
- Creates a new directory with data_clean_up as prefix and the date and time as suffix.
- Checks the input file type (bfile, tfile or file).
- Creates an intermediate directory with the section as prefix and the script name as suffix (inside the previous directory).
- Runs the required scripts in order (according to the configuration file sections).
Note
The main function is not responsible for checking that the required files exist. This should be done in the run functions.
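The steps above can be sketched as a loop that hands the (prefix, type) pair returned by one script to the next. Everything here is illustrative: the timestamp format, the "clean" output prefix, the STEPS table and passthrough_step are assumptions, and no directory is actually created.

```python
import datetime
import os


def passthrough_step(in_prefix, in_type, out_prefix, base_dir, options):
    """Placeholder step: returns its input unchanged (for illustration only)."""
    return in_prefix, in_type


# Hypothetical dispatch table from script names to step functions.
STEPS = {"sex_check": passthrough_step, "flag_hw": passthrough_step}


def output_dir_name(now=None):
    """Build the top-level directory name; the timestamp format is assumed."""
    now = now or datetime.datetime.now()
    return "data_clean_up." + now.strftime("%Y-%m-%d_%H.%M.%S")


def run_pipeline(prefix, file_type, sections, config, base_dir):
    """Run each configured script in order, chaining (prefix, type) pairs."""
    for section in sections:
        options = dict(config[section])
        script = options.pop("script")
        # Intermediate directory: section as prefix, script name as suffix.
        step_dir = os.path.join(base_dir, "{}_{}".format(section, script))
        out_prefix = os.path.join(step_dir, "clean")  # hypothetical file prefix
        prefix, file_type = STEPS[script](prefix, file_type, out_prefix,
                                          step_dir, options)
    return prefix, file_type
```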
-
pyGenClean.run_data_clean_up.parse_args()
Parses the command line options and arguments.
Returns: An argparse.Namespace object created by the argparse module. It contains the values of the different options.

Options              Type    Description
--bfile              String  The input binary file prefix from Plink.
--tfile              String  The input transposed file prefix from Plink.
--file               String  The input file prefix from Plink.
--conf               String  The parameter file for the data clean up.
--report-author      String  The current project author.
--report-number      String  The current project number.
--report-background  String  Text or file containing the background section of the report.

Note
No option check is done here (except for the one automatically done by argparse). Those need to be done elsewhere (see check_args()).
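A minimal argparse sketch declaring the documented options. The help strings are abbreviated, and the argv parameter is an addition of this sketch (the documented function takes no argument) that makes it easy to try without touching sys.argv.

```python
import argparse


def parse_args(argv=None):
    """Parse the documented options; no validation beyond argparse's own."""
    parser = argparse.ArgumentParser(description="Data clean up (sketch).")
    parser.add_argument("--bfile", type=str, help="binary file prefix")
    parser.add_argument("--tfile", type=str, help="transposed file prefix")
    parser.add_argument("--file", type=str, help="normal file prefix")
    parser.add_argument("--conf", type=str, help="parameter file")
    parser.add_argument("--report-author", type=str)
    parser.add_argument("--report-number", type=str)
    parser.add_argument("--report-background", type=str)
    # argv=None makes argparse read sys.argv[1:].
    return parser.parse_args(argv)
```

argparse turns dashes into underscores, so the report options are read as args.report_author, args.report_number and args.report_background.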
-
pyGenClean.run_data_clean_up.read_config_file(filename)
Reads the configuration file.
Parameters: filename (str) – the name of the file containing the configuration.
Returns: A tuple where the first element is a list of sections, and the second element is a map containing the configuration (options and values).
The structure of the configuration file is important. Here is an example of a configuration file:

    [1]
    # Computes statistics on duplicated samples
    script = duplicated_samples

    [2]
    # Removes samples according to missingness
    script = sample_missingness

    [3]
    # Removes markers according to missingness
    script = snp_missingness

    [4]
    # Removes samples according to missingness (98%)
    script = sample_missingness
    mind = 0.02

    [5]
    # Performs a sex check
    script = sex_check

    [6]
    # Flags markers with MAF=0
    script = flag_maf_zero

    [7]
    # Flags markers according to Hardy Weinberg
    script = flag_hw

    [8]
    # Subset the dataset (excludes markers and remove samples)
    script = subset
    exclude = .../filename
    remove = .../filename

Sections are in square brackets and must be integers. The section number represents the step at which the script will be run (i.e. from the smallest number to the biggest). The section numbers must be consecutive.
Each section contains the script name (script variable) and the options of the script (all other variables). For example, section 4 runs the sample_missingness script (run_sample_missingness()) with the mind option set to 0.02.
Here is a list of the available scripts:
- duplicated_samples (run_duplicated_samples())
- duplicated_snps (run_duplicated_snps())
- noCall_hetero_snps (run_noCall_hetero_snps())
- sample_missingness (run_sample_missingness())
- snp_missingness (run_snp_missingness())
- sex_check (run_sex_check())
- plate_bias (run_plate_bias())
- contamination (run_contamination())
- remove_heterozygous_haploid (run_remove_heterozygous_haploid())
- find_related_samples (run_find_related_samples())
- check_ethnicity (run_check_ethnicity())
- flag_maf_zero (run_flag_maf_zero())
- flag_hw (run_flag_hw())
- subset (run_subset_data())
- compare_gold_standard (run_compare_gold_standard())
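The configuration rules for this function (integer section names, consecutive numbering, one script per section) can be sketched with the standard configparser module. This is an assumption about the approach, not the module's actual code; for simplicity the hypothetical read_config below takes the configuration text rather than a file name.

```python
import configparser


def read_config(text):
    """Parse configuration text; return (section numbers, {number: options})."""
    parser = configparser.ConfigParser()
    parser.read_string(text)

    # Section names must be integers; remember the original name of each one.
    try:
        names = {int(name): name for name in parser.sections()}
    except ValueError:
        raise ValueError("section names must be integers") from None
    sections = sorted(names)

    # Section numbers must be consecutive (no gap between steps).
    if sections:
        expected = list(range(sections[0], sections[0] + len(sections)))
        if sections != expected:
            raise ValueError("section numbers must be consecutive")

    config = {}
    for number in sections:
        options = dict(parser.items(names[number]))
        if "script" not in options:
            raise ValueError("section {} has no script".format(number))
        config[number] = options
    return sections, config
```

Option values are returned as strings (for example "0.02" for mind), so each script would have to convert them as needed.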
-
pyGenClean.run_data_clean_up.run_check_ethnicity(in_prefix, in_type, out_prefix, base_dir, options)
Runs step 10 (check ethnicity).
Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
This function calls the pyGenClean.Ethnicity.check_ethnicity module. The required file type for this module is bfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.
Note
The pyGenClean.Ethnicity.check_ethnicity module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.
-
pyGenClean.run_data_clean_up.run_command(command)
Runs a command using subprocesses.
Parameters: command (list) – the command to run.
Tries to run a command. If it fails, raises a ProgramError.
Warning
The variable command should be a list of strings (no other type).
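A sketch of this behaviour using the standard subprocess module. The ProgramError class here is a stand-in for pyGenClean.pipeline_error.ProgramError, and both the explicit type check and the returned output are assumptions of this example.

```python
import subprocess


class ProgramError(Exception):
    """Stand-in for pyGenClean.pipeline_error.ProgramError."""


def run_command(command):
    """Run a command given as a list of strings; raise ProgramError on failure."""
    if not all(isinstance(part, str) for part in command):
        raise ProgramError("command must be a list of strings")
    try:
        # Capture standard output and standard error together.
        return subprocess.check_output(command, stderr=subprocess.STDOUT)
    except (subprocess.CalledProcessError, OSError) as error:
        raise ProgramError("could not run: " + " ".join(command)) from error
```

Passing a list instead of a single string avoids going through a shell, so file prefixes containing spaces need no quoting.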
-
pyGenClean.run_data_clean_up.run_compare_gold_standard(in_prefix, in_type, out_prefix, base_dir, options)
Compares with a gold standard data set (compare_gold_standard).
Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
This function calls the pyGenClean.Misc.compare_gold_standard module. The required file type for this module is bfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.
Note
The pyGenClean.Misc.compare_gold_standard module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.
-
pyGenClean.run_data_clean_up.run_contamination(in_prefix, in_type, out_prefix, base_dir, options)
Runs the contamination check for samples.
Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
-
pyGenClean.run_data_clean_up.run_duplicated_samples(in_prefix, in_type, out_prefix, base_dir, options)
Runs step 1 (duplicated samples).
Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (tfile).
This function calls the pyGenClean.DupSamples.duplicated_samples module. The required file type for this module is tfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.
-
pyGenClean.run_data_clean_up.run_duplicated_snps(in_prefix, in_type, out_prefix, base_dir, options)
Runs step 2 (duplicated SNPs).
Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (tfile).
This function calls the pyGenClean.DupSNPs.duplicated_snps module. The required file type for this module is tfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.
Note
This function creates a map file, needed for the pyGenClean.DupSNPs.duplicated_snps module.
-
pyGenClean.run_data_clean_up.run_find_related_samples(in_prefix, in_type, out_prefix, base_dir, options)
Runs step 9 (find related samples).
Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
This function calls the pyGenClean.RelatedSamples.find_related_samples module. The required file type for this module is bfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.
Note
The pyGenClean.RelatedSamples.find_related_samples module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.
-
pyGenClean.run_data_clean_up.run_flag_hw(in_prefix, in_type, out_prefix, base_dir, options)
Runs step 12 (flag HW).
Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
This function calls the pyGenClean.FlagHW.flag_hw module. The required file type for this module is bfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.
Note
The pyGenClean.FlagHW.flag_hw module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.
-
pyGenClean.run_data_clean_up.run_flag_maf_zero(in_prefix, in_type, out_prefix, base_dir, options)
Runs step 11 (flag MAF zero).
Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
This function calls the pyGenClean.FlagMAF.flag_maf_zero module. The required file type for this module is bfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.
Note
The pyGenClean.FlagMAF.flag_maf_zero module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.
-
pyGenClean.run_data_clean_up.run_noCall_hetero_snps(in_prefix, in_type, out_prefix, base_dir, options)
Runs step 3 (clean no call and hetero).
Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (tfile).
This function calls the pyGenClean.NoCallHetero.clean_noCall_hetero_snps module. The required file type for this module is tfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.
-
pyGenClean.run_data_clean_up.run_plate_bias(in_prefix, in_type, out_prefix, base_dir, options)
Runs step 7 (plate bias).
Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
This function calls the pyGenClean.PlateBias.plate_bias module. The required file type for this module is bfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.
Note
The pyGenClean.PlateBias.plate_bias module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.
-
pyGenClean.run_data_clean_up.run_remove_heterozygous_haploid(in_prefix, in_type, out_prefix, base_dir, options)
Runs step 8 (remove heterozygous haploid).
Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
This function calls the pyGenClean.HeteroHap.remove_heterozygous_haploid module. The required file type for this module is bfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.
-
pyGenClean.run_data_clean_up.run_sample_missingness(in_prefix, in_type, out_prefix, base_dir, options)
Runs step 4 (clean mind).
Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
This function calls the pyGenClean.SampleMissingness.sample_missingness module. The required file type for this module is either bfile or tfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.
-
pyGenClean.run_data_clean_up.run_sex_check(in_prefix, in_type, out_prefix, base_dir, options)
Runs step 6 (sex check).
Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
This function calls the pyGenClean.SexCheck.sex_check module. The required file type for this module is bfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.
Note
The pyGenClean.SexCheck.sex_check module doesn’t return usable output files. Hence, this function returns the input file prefix and its type.
-
pyGenClean.run_data_clean_up.run_snp_missingness(in_prefix, in_type, out_prefix, base_dir, options)
Runs step 5 (clean geno).
Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
This function calls the pyGenClean.MarkerMissingness.snp_missingness module. The required file type for this module is bfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.
-
pyGenClean.run_data_clean_up.run_subset_data(in_prefix, in_type, out_prefix, base_dir, options)
Subsets the data.
Returns: a tuple containing the prefix of the output files (the input prefix for the next script) and the type of the output files (bfile).
This function calls the pyGenClean.PlinkUtils.subset_data module. The required file type for this module is bfile, hence the need to use check_input_files() to check that the input file type is the right one, or to create it if needed.
Note
The output file type is the same as the input file type.