.. pyGenClean documentation master file, created by sphinx-quickstart on Wed Dec 12 15:08:13 2012. You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. .. raw:: latex \listoffigures \listoftables Welcome to pyGenClean's documentation! ************************************** Introduction ============ Genetic association studies making use of high throughput genotyping arrays need to process large amounts of data in the order of millions of markers per experiment. The first step of any analysis with genotyping arrays is typically the conduct of a thorough data clean up and quality control to remove poor quality genotypes and generate metrics to inform and select individuals for downstream statistical analysis. :py:mod:`pyGenClean` is a bioinformatics tool to facilitate and standardize the genetic data clean up pipeline with genotyping array data. In conjunction with a source batch-queuing system, the tool minimizes data manipulation errors, it accelerates the completion of the data clean up process and it provides informative graphics and metrics to guide decision making for statistical analysis. :py:mod:`pyGenClean` is a command tool working on both Linux and Windows operating systems. Its usage is shown below: .. code-block:: console $ run_pyGenClean --help usage: run_pyGenClean [-h] [-v] [--bfile FILE] [--tfile FILE] [--file FILE] [--report-title TITLE] [--report-author AUTHOR] [--report-number NUMBER] [--report-background BACKGROUND] --conf FILE Runs the data clean up (pyGenClean version 1.8.3). optional arguments: -h, --help show this help message and exit -v, --version show program's version number and exit Input File: --bfile FILE The input file prefix (will find the plink binary files by appending the prefix to the .bim, .bed and .fam files, respectively). --tfile FILE The input file prefix (will find the plink transposed files by appending the prefix to the .tped and .tfam files, respectively). --file FILE The input file prefix (will find the plink files by appending the prefix to the .ped and .fam files). Report Options: --report-title TITLE The report title. [default: Genetic Data Clean Up] --report-author AUTHOR The current project number. [default: pyGenClean] --report-number NUMBER The current project author. [default: Simple Project] --report-background BACKGROUND Text of file containing the background section of the report. Configuration File: --conf FILE The parameter file for the data clean up. The tool consists of multiple standalone scripts that are linked together via a main script (``run_pyGenClean``) and a configuration file (the ``--conf`` option), the latter facilitating user customization. The :ref:`proposed_protocol_figure` shows the proposed data cleanup pipeline. Each box represents a customizable standalone script with a quick description of its function. Optional manual checks for go-no-go decisions are indicated. .. _proposed_protocol_figure: .. figure:: _static/images/Data_clean_up_schema.png :align: center :width: 100% :alt: Data clean up schema. Data clean up protocol schema Installation ============ :py:mod:`pyGenClean` is a Python package that works on both Linux and Windows operating systems. It requires a set of Python dependencies and PLINK. Complete installation procedures are available for both Linux (32 and 64 bits) and Windows in the following sections. .. toctree:: :maxdepth: 1 install_linux install_windows Input files =========== To use :py:mod:`pyGenClean`, two sets of files are required: a set of genotype files and a configuration file (using the `INI format `_). Genotype files -------------- The input files of the main program (``run_pyGenClean``) are either: * PLINK's pedfile format (use :py:mod:`pyGenClean`'s ``--file`` option) consists of two files with the following extensions: ``PED`` and ``MAP``. * PLINK's transposed pedfile format (use :py:mod:`pyGenClean`'s ``--tfile`` option) consists of two files with the following extensions: ``TPED`` and ``TFAM``. * PLINK's binary pedfile format (use :py:mod:`pyGenClean`'s ``--bfile`` option) consists of three files with the following extensions: ``BED``, ``BIM`` and ``FAM``. For more information about these file formats, have a look at PLINK's website, in the *Basic usage/data formats* section (`http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml `_). .. warning:: If the format used is the *transposed* one, the columns **must** be separated using **tabulations**, but alleles of each marker need to be separated by a single space. To create this exact transposed pedfile format, you need to use the following PLINK's options: * ``--recode`` to recode the file. * ``--transposed`` to create an output file in the transposed pedfile format. * ``--tab`` to use tabulations. Configuration file ------------------ To customized :py:mod:`pyGenClean`, a basic configuration file is required. It tells which script to use in a specific order. It also sets the different options and input files, so that the analysis is easy to replicate or modify. The configuration file uses the `INI format `_. It consists of sections, led by a ``[section]`` header (contiguous integers which gives the order of the pipeline) and followed by customization of this particular part of the pipeline. Lines preceded by a ``#`` are comments and are not read by :py:mod:`pyGenClean`. The following example first removes samples with a missing rate of 10% and more (section starting at line ``1``), then removes markers with a missing rate of 2% and more (section starting at line ``6``). Finally, it removes the samples with a missing rate of 2% and more (section starting at line ``3``). .. code-block:: lighttpd :linenos: [1] # Removes sample with a missing rate higher than 10%. script = sample_missingness mind = 0.1 [2] # Removes markers with a missing rate higher than 2%. script = snp_missingness geno = 0.02 [3] # Removes sample with a missing rate higher than 2%. script = sample_missingness mind = 0.02 For a more thorough example, complete configuration files are available for download at `http://www.statgen.org `_ and are explained in the :ref:`config_files` section. For a list of available modules and standalone script, refer to the :ref:`list_of_scripts`. .. toctree:: :maxdepth: 1 list_of_options Information about the protocol ============================== The following sections describe the proposed protocol and provides information about which file should be looked for quality control. Finally, configuration files (with all available parameters) are given. .. toctree:: :maxdepth: 2 protocol configuration_files how_to_run automatic_report The algorithm ============= All the functions used for this project are shown and explained in the following section: .. toctree:: :maxdepth: 1 module_content/modules contamination duplicated_samples duplicated_markers clean_noCall_hetero_snps sample_missingness marker_missingness sex_check plate_bias remove_heterozygous_haploid find_related_samples check_ethnicity flag_maf_zero flag_hw compare_gold_standard plink_utils Citing pyGenClean ================= If you use :py:mod:`pyGenClean` in any published work, please cite the published scientific paper describing the tool. Lemieux Perreault LP, Provost S, Legault MA, Barhdadi A, Dubé MP: **pyGenClean: efficient tool for genetic data clean up before association testing.** *Bioinformatics* 2013, 29(13):1704-1705 [DOI:`10.1093/bioinformatics/btt261 `_] Appendix ======== .. toctree:: :maxdepth: 1 result_table