.. _related_samples_label:

Related Samples Module
======================

The usage of the standalone module is shown below:

.. code-block:: console

    $ pyGenClean_find_related_samples --help
    usage: pyGenClean_find_related_samples [-h] [-v] --bfile FILE [--genome-only]
                                           [--min-nb-snp INT]
                                           [--indep-pairwise STR STR STR]
                                           [--maf FLOAT] [--ibs2-ratio FLOAT]
                                           [--sge] [--sge-walltime TIME]
                                           [--sge-nodes INT INT]
                                           [--line-per-file-for-sge INT]
                                           [--out FILE]

    Finds related samples according to IBS values.

    optional arguments:
      -h, --help            show this help message and exit
      -v, --version         show program's version number and exit

    Input File:
      --bfile FILE          The input file prefix (will find the plink binary
                            files by appending the prefix to the .bim, .bed and
                            .fam files, respectively.)

    Options:
      --genome-only         Only create the genome file
      --min-nb-snp INT      The minimum number of markers needed to compute IBS
                            values. [Default: 10000]
      --indep-pairwise STR STR STR
                            Three numbers: window size, window shift and the r2
                            threshold. [default: ['50', '5', '0.1']]
      --maf FLOAT           Restrict to SNPs with MAF >= threshold. [default:
                            0.05]
      --ibs2-ratio FLOAT    The initial IBS2* ratio (the minimum value to show in
                            the plot. [default: 0.8]
      --sge                 Use SGE for parallelization.
      --sge-walltime TIME   The walltime for the job to run on the cluster. Do not
                            use if you are not required to specify a walltime for
                            your jobs on your cluster (e.g. 'qsub
                            -lwalltime=1:0:0' on the cluster).
      --sge-nodes INT INT   The number of nodes and the number of processor per
                            nodes to use (e.g. 'qsub -lnodes=X:ppn=Y' on the
                            cluster, where X is the number of nodes and Y is the
                            number of processor to use. Do not use if you are not
                            required to specify the number of nodes for your jobs
                            on the cluster.
      --line-per-file-for-sge INT
                            The number of line per file for SGE task array.
                            [default: 100]

    Output File:
      --out FILE            The prefix of the output files. [default: ibs]


Input Files
-----------

This module uses PLINK's binary file format (``bed``, ``bim`` and ``fam`` files)
for the source data set (the data of interest).
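
Before launching the module, it can be useful to check that the three files
implied by the ``--bfile`` prefix exist. The following is only a minimal sketch:
the helper name ``missing_plink_files`` and the ``my_data`` prefix are
hypothetical and are not part of pyGenClean.

.. code-block:: python

    import os


    def missing_plink_files(prefix):
        """Return the PLINK binary files that are missing for a prefix."""
        # A binary PLINK data set is made of three files sharing one prefix.
        expected = [prefix + ext for ext in (".bed", ".bim", ".fam")]
        return [name for name in expected if not os.path.isfile(name)]


    if __name__ == "__main__":
        # "my_data" is a placeholder prefix (hypothetical).
        missing = missing_plink_files("my_data")
        if missing:
            print("Missing input files: " + ", ".join(missing))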

Procedure
---------

Here are the steps performed by the module:

1.  Uses PLINK to prune markers according to LD (``--indep-pairwise``).
2.  Checks that enough markers remain after pruning (``--min-nb-snp``).
3.  Extracts the markers that were kept by the pruning.
4.  Runs PLINK with the ``genome`` option to compute the IBS values.
5.  Finds related individuals and gets values for plotting.
6.  Plots ``Z1`` as a function of the ``IBS2 ratio`` for related individuals.
7.  Plots ``Z2`` as a function of the ``IBS2 ratio`` for related individuals.
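
The PLINK-related steps (1 to 4) can be sketched as follows. This is an
illustration under stated assumptions, not the module's actual code: the
function names are hypothetical, the exact PLINK options passed by the module
are not listed on this page, the suffix of the pruning prefix is assumed to be
the r2 threshold, and the comparison with ``--min-nb-snp`` is assumed to be
"greater than or equal to".

.. code-block:: python

    def build_plink_commands(bfile, out="ibs",
                             indep_pairwise=("50", "5", "0.1"), maf="0.05"):
        """Return hypothetical PLINK command lines for steps 1, 3 and 4."""
        # Assumption: the suffix of the pruning prefix is the r2 threshold.
        pruning = "{}.pruning_{}".format(out, indep_pairwise[2])
        pruned = out + ".pruned_data"
        return [
            # Step 1: LD pruning, restricted to markers passing the MAF filter.
            ["plink", "--bfile", bfile, "--maf", maf, "--indep-pairwise"]
            + list(indep_pairwise) + ["--out", pruning],
            # Step 3: keep only the markers listed in the ``prune.in`` file.
            ["plink", "--bfile", bfile, "--extract", pruning + ".prune.in",
             "--make-bed", "--out", pruned],
            # Step 4: compute the pairwise IBS values on the pruned data set.
            ["plink", "--bfile", pruned, "--genome", "--out", out + ".genome"],
        ]


    def enough_markers(prune_in_file, min_nb_snp=10000):
        """Step 2: check that enough markers remain (one marker per line)."""
        with open(prune_in_file) as handle:
            nb_markers = sum(1 for line in handle if line.strip())
        return nb_markers >= min_nb_snp

Each returned list could be passed to ``subprocess.check_call``; the sketch
only builds the command lines and does not run PLINK.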

Output Files
------------

The output files of each of the steps described above are as follows (note that
the output prefix shown is the default one [*i.e.* ``ibs``]):

1.  One set of PLINK's result files:

    *   ``ibs.pruning_0.1``: the results of PLINK's pruning process. The value
        in the prefix (here ``0.1``) depends on the ``--indep-pairwise``
        option. The markers that are kept are listed in the file
        ``ibs.pruning_0.1.prune.in``.

2.  No file created.
3.  One set of PLINK's binary files:

    *   ``ibs.pruned_data``: the data set containing only the markers kept in
        the first step (the list is in ``ibs.pruning_0.1.prune.in``).

4.  One set of PLINK's result files (two if ``--sge`` is used):

    *   ``ibs.frequency``: PLINK's result files when computing the frequency of
        each of the pruned markers. This data set will exist only if the option
        ``--sge`` is used.
    *   ``ibs.genome``: PLINK's results including IBS values.

5.  One file provided by the
    :py:mod:`pyGenClean.RelatedSamples.find_related_samples` module and three
    files provided by the
    :py:mod:`pyGenClean.RelatedSamples.merge_related_samples` module:

    *   ``ibs.related_individuals``: a subset of the ``ibs.genome.genome`` file
        containing only samples that are considered to be related. Three columns
        are appended to the original ``ibs.genome.genome`` file: ``IBS2_ratio``
        (the value that is used to find related individuals), ``status``
        (the type of relatedness [*e.g.* twins]) and ``code`` (a numerical code
        that represents the ``status``). This file is provided by the
        :py:mod:`pyGenClean.RelatedSamples.find_related_samples` module.
    *   ``ibs.merged_related_individuals``: a file aggregating related samples
        in groups, containing the following columns: ``index`` (the group
        number), ``FID1`` (the family ID of the first sample), ``IID1`` (the
        individual ID of the first sample), ``FID2`` (the family ID of the
        second sample), ``IID2`` (the individual ID of the second sample) and
        ``status`` (the type of relatedness between the two samples). This file
        is provided by the
        :py:mod:`pyGenClean.RelatedSamples.merge_related_samples` module.
    *   ``ibs.chosen_related_individuals``: the related individuals that were
        randomly chosen from each group to be kept in the final data set. This
        file is provided by the
        :py:mod:`pyGenClean.RelatedSamples.merge_related_samples` module.
    *   ``ibs.discarded_related_individuals``: the related individuals that
        need to be discarded, so that the final data set includes only
        unrelated individuals. This file is provided by the
        :py:mod:`pyGenClean.RelatedSamples.merge_related_samples` module.

6.  One image file:

    *   ``ibs.related_individuals_z1.png``: a plot showing the :math:`Z_1` value
        as a function of the :math:`IBS2^*_{ratio}` for all sample pairs above
        a certain :math:`IBS2^*_{ratio}` (the default threshold is 0.8). See
        Figure :ref:`ibs_z1_figure`.

7.  One image file:

    *   ``ibs.related_individuals_z2.png``: a plot showing the :math:`Z_2` value
        as a function of the :math:`IBS2^*_{ratio}` for all sample pairs above
        a certain :math:`IBS2^*_{ratio}` (the default threshold is 0.8). See
        Figure :ref:`ibs_z2_figure`.
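
As an illustration of how the ``ibs.related_individuals`` file might be
inspected, the sketch below counts the related sample pairs for each value of
the ``status`` column. It assumes that the file is whitespace-delimited, that
it starts with a header line and that no field contains whitespace; the
function name ``count_pairs_by_status`` is hypothetical.

.. code-block:: python

    from collections import Counter


    def count_pairs_by_status(lines):
        """Count related sample pairs for each value of the status column."""
        # Split every non-empty line into its whitespace-delimited fields.
        rows = (line.split() for line in lines if line.strip())
        header = next(rows)
        # Locate the column by name instead of relying on its position.
        status_index = header.index("status")
        return Counter(row[status_index] for row in rows)


    if __name__ == "__main__":
        # The file name assumes the default output prefix (``ibs``).
        with open("ibs.related_individuals") as handle:
            for status, nb_pairs in count_pairs_by_status(handle).items():
                print(status, nb_pairs)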

The Plots
---------

The first plot that is created (see :ref:`ibs_z1_figure`) shows :math:`Z_1` as
a function of :math:`IBS2^*_{ratio}`, where each point represents a pair of
related individuals. The color code comes from the different values of
:math:`Z_0`, :math:`Z_1` and :math:`Z_2`, as described in the
:py:func:`pyGenClean.RelatedSamples.find_related_samples.extractRelatedIndividuals`
function. In this plot, there are four locations where related sample pairs
tend to accumulate: first degree relatives (full sibs), second degree relatives
(half-sibs, grand-parent-child or uncle-nephew), parent-child, and twins (or
duplicated samples). The unknown sample pairs represent possible undetected
related individuals.

.. _ibs_z1_figure:

.. figure:: _static/images/find_related_samples/ibs_related_individuals_z1.png
    :align: center
    :width: 50%
    :alt: Z1 as a function of the IBS2 ratio

    Z1 as a function of the IBS2 ratio

The second plot that is created (see :ref:`ibs_z2_figure`) shows :math:`Z_2` as
a function of :math:`IBS2^*_{ratio}`, where each point represents a pair of
related individuals. It is another representation of the relatedness of sample
pairs, in which the "clusters" are located at different positions.

.. _ibs_z2_figure:

.. figure:: _static/images/find_related_samples/ibs_related_individuals_z2.png
    :align: center
    :width: 50%
    :alt: Z2 as a function of the IBS2 ratio

    Z2 as a function of the IBS2 ratio

Merging Related Samples
-----------------------

A standalone script was created to regroup related samples into different
subsets. Its usage is as follows:

.. code-block:: console

    $ pyGenClean_merge_related_samples --help
    usage: pyGenClean_merge_related_samples [-h] [-v] --ibs-related FILE
                                            [--no-status] [--out FILE]

    Merges related samples according to IBS.

    optional arguments:
      -h, --help          show this help message and exit
      -v, --version       show program's version number and exit

    Input File:
      --ibs-related FILE  The input file containing related individuals according
                          to IBS value.

    Options:
      --no-status         The input file doesn't have a 'status' column.

    Output File:
      --out FILE          The prefix of the output files. [default: ibs_merged]


At the end of the analysis, two sample lists are created. The file
``*.chosen_related_individuals`` contains a list of randomly selected samples
according to their relatedness (to keep only one sample for each group of
related samples). The file ``*.discarded_related_individuals`` contains the
list of samples to exclude in order to keep only unrelated samples in a data
set.
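
The grouping and random selection described above can be sketched as follows.
This is only one possible approach (a union-find over the related pairs), shown
under the assumption that samples linked directly or indirectly belong to the
same group; it is not necessarily how the module is implemented, and the
function names are hypothetical.

.. code-block:: python

    import random


    def group_related_samples(pairs):
        """Group samples so that related samples share the same group.

        ``pairs`` is an iterable of ``(sample_1, sample_2)`` tuples, where
        each sample could be a ``(FID, IID)`` tuple.
        """
        parent = {}

        def find(sample):
            # Follow the parent links up to the representative of the group.
            parent.setdefault(sample, sample)
            while parent[sample] != sample:
                parent[sample] = parent[parent[sample]]  # path halving
                sample = parent[sample]
            return sample

        # Merge the groups of the two samples of every related pair.
        for first, second in pairs:
            parent[find(first)] = find(second)

        groups = {}
        for sample in parent:
            groups.setdefault(find(sample), set()).add(sample)
        return list(groups.values())


    def choose_and_discard(groups, rng=random):
        """Randomly keep one sample per group and discard the others."""
        chosen, discarded = [], []
        for group in groups:
            members = sorted(group)
            kept = rng.choice(members)
            chosen.append(kept)
            discarded.extend(m for m in members if m != kept)
        return chosen, discarded

Passing a seeded ``random.Random`` instance as ``rng`` makes the selection
reproducible, which can help when the analysis needs to be rerun.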

The Algorithm
-------------

For more information about the actual algorithms and source code, refer to the
following pages:

* :py:mod:`pyGenClean.RelatedSamples.find_related_samples`
* :py:mod:`pyGenClean.RelatedSamples.merge_related_samples`