.. _contamination_label:

Contamination Module
====================

The usage of the standalone module is shown below:

.. code-block:: console

    $ pyGenClean_check_contamination --help
    usage: pyGenClean_check_contamination [-h] [-v] --bfile FILE --raw-dir DIR
                                          [--colsample COL] [--colmarker COL]
                                          [--colbaf COL] [--colab1 COL]
                                          [--colab2 COL] [--sge]
                                          [--sge-walltime TIME]
                                          [--sge-nodes INT INT]
                                          [--sample-per-run-for-sge INT]
                                          [--out FILE]

    Check BAF and LogR ratio for data contamination.

    optional arguments:
      -h, --help            show this help message and exit
      -v, --version         show program's version number and exit

    Input File:
      --bfile FILE          The input file prefix (will find the plink binary
                            files by appending the prefix to the .bim, .bed and
                            .fam files, respectively).

    Raw Data:
      --raw-dir DIR         Directory containing the raw data (one file per
                            sample, where the name of the file (minus the
                            extension) is the sample identification number.
      --colsample COL       The sample column. [default: Sample Name]
      --colmarker COL       The marker column. [default: SNP Name]
      --colbaf COL          The B allele frequency column. [default: B Allele
                            Freq]
      --colab1 COL          The AB Allele 1 column. [default: Allele1 - AB]
      --colab2 COL          The AB Allele 2 column. [default: Allele2 - AB]

    SGE Options:
      --sge                 Use SGE for parallelization.
      --sge-walltime TIME   The walltime for the job to run on the cluster. Do not
                            use if you are not required to specify a walltime for
                            your jobs on your cluster (e.g. 'qsub
                            -lwalltime=1:0:0' on the cluster).
      --sge-nodes INT INT   The number of nodes and the number of processor per
                            nodes to use (e.g. 'qsub -lnodes=X:ppn=Y' on the
                            cluster, where X is the number of nodes and Y is the
                            number of processor to use. Do not use if you are not
                            required to specify the number of nodes for your jobs
                            on the cluster.
      --sample-per-run-for-sge INT
                            The number of sample to run for a single SGE job.
                            [default: 30]

    Output File:
      --out FILE            The prefix of the output files. [default:
                            contamination]


Input Files
-----------

This module uses PLINK's binary file format (``bed``, ``bim`` and ``fam``
files) for the source data set (the data of interest). It also uses intensity
files (one per sample to test), which are usually provided by the genotyping
platform.
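
As an illustration of the expected input layout, the following minimal sketch
checks that the three PLINK files exist for a given prefix and maps each raw
file to a sample identifier. The function name ``check_inputs`` is hypothetical
and is not part of *pyGenClean*; the sketch only assumes what the help message
states (one raw file per sample, named after the sample).

.. code-block:: python

    from pathlib import Path


    def check_inputs(bfile_prefix, raw_dir):
        """Check the PLINK files and return a {sample_id: path} mapping.

        Hypothetical helper, not the module's own code.
        """
        # The three PLINK binary files share the prefix given to --bfile.
        missing = [
            f"{bfile_prefix}.{ext}"
            for ext in ("bed", "bim", "fam")
            if not Path(f"{bfile_prefix}.{ext}").is_file()
        ]
        if missing:
            raise FileNotFoundError(
                "missing PLINK file(s): " + ", ".join(missing)
            )

        # One raw file per sample: the file name minus its (last) extension
        # is taken as the sample identification number.
        return {
            path.stem: path
            for path in sorted(Path(raw_dir).iterdir())
            if path.is_file()
        }

Calling ``check_inputs("mydata", "raw")`` would then return one entry per file
found in ``raw``, or raise an error naming the missing PLINK files.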


Procedure
---------

Here are the steps performed by the module:

1.  Select only the markers located on autosomes.
2.  Compute the frequency of each autosomal marker (as required by
    *bafRegress*).
3.  Execute *bafRegress* on the data set (in parallel if required).
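
The first step can be sketched as follows. This is not the module's actual
implementation: the function names are hypothetical, and the sketch assumes
human data (autosomes coded ``1`` to ``22``) and the standard whitespace
separated ``bim`` layout, where the first column is the chromosome and the
second is the marker name.

.. code-block:: python

    # Assumption: human data, so autosomes are chromosomes 1 to 22.
    HUMAN_AUTOSOMES = frozenset(str(i) for i in range(1, 23))


    def autosomal_markers(bim_path, autosomes=HUMAN_AUTOSOMES):
        """Return the names of the markers located on autosomes."""
        markers = []
        with open(bim_path) as bim:
            for line in bim:
                fields = line.split()
                if not fields:
                    continue  # skip empty lines
                chrom, name = fields[0], fields[1]
                if chrom in autosomes:
                    markers.append(name)
        return markers


    def write_extract_list(markers, out_path):
        """Write one marker name per line (a list usable for extraction)."""
        with open(out_path, "w") as out:
            for name in markers:
                out.write(name + "\n")

Markers on the sex chromosomes or on the mitochondrial genome (coded ``23`` to
``26``, or ``X``, ``Y``, ``XY`` and ``MT``) are left out of the list.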


Output Files
------------

The output files of each of the steps described above are as follows (note that
the output prefix shown is the default one [*i.e.* ``contamination``]):

1.  ``contamination.to_extract``: the autosomal markers that will be used by
    *bafRegress*.
2.  ``contamination.frq``: the frequency of each of the autosomal markers.
3.  ``contamination.bafRegress``: the *bafRegress* results for each of the
    tested samples.
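
To inspect the frequency file, a small reader such as the one below may help.
It is only a sketch: the function name ``read_frq`` is hypothetical, and it
assumes that ``contamination.frq`` follows PLINK's usual frequency report
layout (a header line containing the ``SNP`` and ``MAF`` columns, with
whitespace separated fields and ``NA`` for missing values).

.. code-block:: python

    def read_frq(frq_path):
        """Map each marker name to its frequency (None when it is 'NA').

        Assumes a PLINK-style report with 'SNP' and 'MAF' header columns.
        """
        freqs = {}
        with open(frq_path) as frq:
            # Locate the columns from the header instead of hard-coding them.
            header = frq.readline().split()
            snp_col = header.index("SNP")
            maf_col = header.index("MAF")
            for line in frq:
                fields = line.split()
                if not fields:
                    continue  # skip empty lines
                maf = fields[maf_col]
                freqs[fields[snp_col]] = None if maf == "NA" else float(maf)
        return freqs

The returned dictionary can then be compared with the marker list in
``contamination.to_extract`` to confirm that every selected marker has a
frequency.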


The Algorithm
-------------

For more information about the actual algorithms and source code, refer to the
following page.

* :py:mod:`pyGenClean.Contamination.contamination`