How to Run the Pipeline

Warning

Before using pyGenClean, be sure to activate the appropriate Python virtual environment (refer to the Linux virtualenv or conda installation section, or the Windows installation section, for more information).

Modify the First Configuration File so that it suits your needs. After following the Preprocessing Steps described in the Proposed Protocol section, run the following command:

$ run_pyGenClean \
>     --conf configuration_example_1_of_2.ini \
>     --tfile /PATH/TO/ORIGINAL/DATASET_PREFIX
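
Note that --tfile takes a dataset prefix rather than a single file. Assuming it follows the PLINK transposed fileset convention (PREFIX.tped and PREFIX.tfam), a quick check before launching a long run can catch a mistyped path. The helper below is a hypothetical sketch and is not part of pyGenClean:

```shell
# Hypothetical helper (not part of pyGenClean): verify that every file
# expected for a dataset prefix exists before starting the pipeline.
# Usage: check_plink_prefix PREFIX EXT [EXT ...]
check_plink_prefix() {
    local prefix=$1
    shift
    local ext status=0
    for ext in "$@"; do
        if [ ! -f "$prefix.$ext" ]; then
            # Report every missing file instead of stopping at the first one.
            echo "missing file: $prefix.$ext" >&2
            status=1
        fi
    done
    return $status
}
```

For example, `check_plink_prefix /PATH/TO/ORIGINAL/DATASET_PREFIX tped tfam` returns a non-zero status if either file is absent. The same helper can be called with `bed bim fam` for a binary dataset passed through --bfile.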

While the protocol is running, check the outputs according to the Proposed Protocol. If there are any problems, interrupt the analysis and make the required modifications. Completed steps can be skipped by commenting them out in the configuration file and using the last output dataset as the input for the steps that need to be run again.
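
To see which steps have already produced output, it can help to list the numbered step directories in the order they were run. The sketch below assumes the layout used on this page (one subdirectory per step, named NUMBER_script, such as 7_sex_check). The function name is hypothetical:

```shell
# Hypothetical helper: print the step directories found in a pyGenClean
# output directory, sorted by step number (so that 10_* comes after 9_*).
# Usage: list_steps OUTPUT_DIR
list_steps() {
    local dir=$1
    local d
    for d in "$dir"/[0-9]*_*/; do
        # Skip the unexpanded pattern when no step directory exists.
        [ -d "$d" ] || continue
        basename "$d"
    done | sort -n
}
```

The last line printed is the most recent step that created a directory. Whether that step actually finished should still be confirmed from its own output files.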

Once everything has been checked, locate the samples and the markers that need to be removed. For example, if the output directory from the first dataset is data_clean_up.YYYY-MM-DD_HH.MM.SS, the following commands gather the samples to remove into a single file:

$ output_dir=data_clean_up.YYYY-MM-DD_HH.MM.SS
$ cat $output_dir/7_sex_check/sexcheck.list_problem_sex_ids \
>     $output_dir/9_find_related_samples/ibs.discarded_related_individuals \
>     $output_dir/10_check_ethnicity/ethnicity.outliers \
>     > samples_to_remove.txt
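
A plain cat keeps going when one of the input files is missing, which can leave an incomplete removal list. As a more defensive variant, the hypothetical helper below refuses to write anything unless every input exists, and drops duplicate lines (a sample can be flagged by more than one step). It assumes one entry per line and that line order does not matter in a removal list:

```shell
# Hypothetical helper: merge several list files into one, failing early
# if any input is missing and removing duplicate lines.
# Usage: merge_lists OUTPUT_FILE INPUT [INPUT ...]
merge_lists() {
    local out=$1
    shift
    local f
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing list: $f" >&2
            return 1
        fi
    done
    # sort -u changes the line order but keeps each entry only once.
    cat "$@" | sort -u > "$out"
}
```

It can be called with samples_to_remove.txt as the first argument, followed by the three list files used above. If a step legitimately produced no list, leave that file out of the arguments.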

Then, modify the first subset section in the Second Configuration File so that it reads:

[11]
script = subset
remove = samples_to_remove.txt
exclude = data_clean_up.YYYY-MM-DD_HH.MM.SS/8_plate_bias/plate_bias.significant_SNPs.txt

Once everything has been checked, run the following commands to finish the data clean-up pipeline:

$ output_dir=data_clean_up.YYYY-MM-DD_HH.MM.SS
$ run_pyGenClean \
>     --conf configuration_example_2_of_2.ini \
>     --bfile $output_dir/6_sample_missingness/clean_mind

If you want to remove the markers that were flagged in the flag_maf_zero and flag_hw sections, run the following commands (using the newly created output directory data_clean_up.YYYY-MM-DD_HH.MM.SS):

$ output_dir=data_clean_up.YYYY-MM-DD_HH.MM.SS
$ cat $output_dir/13_flag_maf_zero/flag_maf_0.list \
>     $output_dir/14_flag_hw/flag_hw.snp_flag_threshold_1e-4 \
>     > markers_to_exclude.txt
$ pyGenClean_subset_data \
>     --ifile $output_dir/12_remove_heterozygous_haploid/without_hh_genotypes \
>     --is-bfile \
>     --exclude markers_to_exclude.txt \
>     --out final_dataset
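
Each run creates its own timestamped directory, so it is easy to point output_dir at the directory from the first run by mistake. Because the YYYY-MM-DD_HH.MM.SS timestamp is zero-padded, the names sort chronologically, and the newest directory can be picked automatically. The function below is a hypothetical sketch and is not part of pyGenClean:

```shell
# Hypothetical helper: print the most recent data_clean_up.* directory
# found in PARENT (default: current directory). Relies on the shell
# expanding the pattern in sorted order, which matches chronological
# order for zero-padded timestamps.
# Usage: latest_output_dir [PARENT]
latest_output_dir() {
    local parent=${1:-.}
    local d newest=""
    for d in "$parent"/data_clean_up.*/; do
        [ -d "$d" ] || continue
        newest=${d%/}   # strip the trailing slash added by the pattern
    done
    # Return a non-zero status when no output directory was found.
    [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

With this in place, `output_dir=$(latest_output_dir)` selects the directory created by the most recent run.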