How to Run the Pipeline
Warning
Before using pyGenClean, be sure to activate the appropriate Python virtual environment (refer to the Linux virtualenv or conda installation section, or to the Windows installation section, for more information).
Modify the First Configuration File so that it suits your needs. After following the Preprocessing Steps described in the Proposed Protocol section, run the following command:
$ run_pyGenClean \
> --conf configuration_example_1_of_2.ini \
> --tfile /PATH/TO/ORIGINAL/DATASET_PREFIX
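Because --tfile takes a dataset prefix rather than a file name, a mistyped path is easy to miss until the pipeline fails. The sketch below checks, before the pipeline is launched, that both files of a transposed PLINK dataset are present. It assumes the prefix resolves to PREFIX.tped and PREFIX.tfam; check_tfile_prefix is a hypothetical helper and is not part of pyGenClean.

```shell
# Hypothetical helper (not part of pyGenClean): make sure both files of a
# transposed PLINK dataset exist and are not empty.
# Assumption: the prefix given to --tfile resolves to PREFIX.tped/PREFIX.tfam.
check_tfile_prefix() {
    tfile_prefix=$1
    for ext in tped tfam; do
        if [ ! -s "$tfile_prefix.$ext" ]; then
            echo "missing or empty: $tfile_prefix.$ext" >&2
            return 1
        fi
    done
    echo "found $tfile_prefix.tped and $tfile_prefix.tfam"
}
```

For example, check_tfile_prefix /PATH/TO/ORIGINAL/DATASET_PREFIX could be run just before run_pyGenClean, so that a wrong prefix is caught immediately.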
While the protocol is running, check the outputs according to the Proposed Protocol. If there are any problems, interrupt the analysis and make the required modifications. Steps that have already completed can be skipped by commenting them out in the configuration file and using the last output dataset as the input for the steps that need to be rerun.
Once everything has been checked, locate the samples and the markers that need to be removed. For example, if the output directory of the first run is data_clean_up.YYYY-MM-DD_HH.MM.SS, the following commands gather the samples to remove into a single file:
$ output_dir=data_clean_up.YYYY-MM-DD_HH.MM.SS
$ cat $output_dir/7_sex_check/sexcheck.list_problem_sex_ids \
> $output_dir/9_find_related_samples/ibs.discarded_related_individuals \
> $output_dir/10_check_ethnicity/ethnicity.outliers \
> > samples_to_remove.txt
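A sample that fails more than one check (for example, a sex problem and an ethnicity outlier) appears several times in the concatenated file. The sketch below keeps one entry per sample. It assumes that the first two columns of each line are the family ID and the individual ID, and that extra columns can be dropped; dedup_samples is a hypothetical helper, not a pyGenClean command.

```shell
# Hypothetical helper (not part of pyGenClean): keep one line per sample.
# Assumption: columns 1 and 2 are the family ID and the individual ID.
dedup_samples() {
    samples_in=$1
    samples_out=$2
    # Skip lines with fewer than two fields, keep the first occurrence of
    # each (family ID, individual ID) pair, and write them tab separated.
    awk 'NF >= 2 && !seen[$1 "\t" $2]++ { print $1 "\t" $2 }' \
        "$samples_in" > "$samples_out"
}
```

For example, dedup_samples samples_to_remove.txt samples_to_remove.unique.txt writes the cleaned list to a new file and leaves the original untouched.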
Then, modify the first subset section in the Second Configuration File so that it reads:
[11]
script = subset
remove = samples_to_remove.txt
exclude = data_clean_up.YYYY-MM-DD_HH.MM.SS/8_plate_bias/plate_bias.significant_SNPs.txt
Once everything has been checked, run the following commands to finish the data clean up pipeline:
$ output_dir=data_clean_up.YYYY-MM-DD_HH.MM.SS
$ run_pyGenClean \
> --conf configuration_example_2_of_2.ini \
> --bfile $output_dir/6_sample_missingness/clean_mind
If you want to remove the markers that were flagged in the flag_maf_zero and flag_hw sections, run the following commands (using the newly created output directory data_clean_up.YYYY-MM-DD_HH.MM.SS):
$ output_dir=data_clean_up.YYYY-MM-DD_HH.MM.SS
$ cat $output_dir/13_flag_maf_zero/flag_maf_0.list \
> $output_dir/14_flag_hw/flag_hw.snp_flag_threshold_1e-4 \
> > markers_to_exclude.txt
$ pyGenClean_subset_data \
> --ifile $output_dir/14_remove_heterozygous_haploid/without_hh_genotypes \
> --is-bfile \
> --exclude markers_to_exclude.txt \
> --out final_dataset
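As a final sanity check, it can be useful to confirm that none of the excluded markers remain in the output. The sketch below counts the markers from the exclusion list that still appear in a PLINK .bim file, where the marker name is the second column. It assumes that the exclusion list has the marker name in its first column and that the subset step wrote final_dataset.bim; count_leftover_markers is a hypothetical helper, not a pyGenClean command.

```shell
# Hypothetical helper (not part of pyGenClean): count the markers of an
# exclusion list that are still present in a PLINK .bim file.
# Assumptions: marker names are in column 1 of the exclusion list and in
# column 2 of the .bim file.
count_leftover_markers() {
    exclude_list=$1
    bim_file=$2
    # First file: remember every marker to drop. Second file: count the
    # rows whose marker name is among them. Prints 0 when none are left.
    awk 'NR == FNR { drop[$1]; next } $2 in drop { n++ } END { print n + 0 }' \
        "$exclude_list" "$bim_file"
}
```

For example, count_leftover_markers markers_to_exclude.txt final_dataset.bim should print 0 if the exclusion worked as intended.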