Related Samples Module¶
The usage of the standalone module is shown below:
$ pyGenClean_find_related_samples --help
usage: pyGenClean_find_related_samples [-h] [-v] --bfile FILE [--genome-only]
[--min-nb-snp INT]
[--indep-pairwise STR STR STR]
[--maf FLOAT] [--ibs2-ratio FLOAT]
[--sge] [--sge-walltime TIME]
[--sge-nodes INT INT]
[--line-per-file-for-sge INT]
[--out FILE]
Finds related samples according to IBS values.
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
Input File:
--bfile FILE The input file prefix (will find the plink binary
files by appending the prefix to the .bim, .bed and
.fam files, respectively.)
Options:
--genome-only Only create the genome file
--min-nb-snp INT The minimum number of markers needed to compute IBS
values. [Default: 10000]
--indep-pairwise STR STR STR
Three numbers: window size, window shift and the r2
threshold. [default: ['50', '5', '0.1']]
--maf FLOAT Restrict to SNPs with MAF >= threshold. [default:
0.05]
--ibs2-ratio FLOAT The initial IBS2* ratio (the minimum value to show in
the plot. [default: 0.8]
--sge Use SGE for parallelization.
--sge-walltime TIME The walltime for the job to run on the cluster. Do not
use if you are not required to specify a walltime for
your jobs on your cluster (e.g. 'qsub
-lwalltime=1:0:0' on the cluster).
--sge-nodes INT INT The number of nodes and the number of processor per
nodes to use (e.g. 'qsub -lnodes=X:ppn=Y' on the
cluster, where X is the number of nodes and Y is the
number of processor to use. Do not use if you are not
required to specify the number of nodes for your jobs
on the cluster.
--line-per-file-for-sge INT
The number of line per file for SGE task array.
[default: 100]
Output File:
--out FILE The prefix of the output files. [default: ibs]
Input Files¶
This module uses PLINK’s binary file format (bed, bim and fam files) for the
source data set (the data of interest).
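Before running the module, it can help to confirm that all three binary files exist for the prefix given to --bfile. The following is a minimal sketch and is not part of pyGenClean; the helper name missing_plink_files is hypothetical.

```python
from pathlib import Path

# The three files that make up a PLINK binary data set.
PLINK_BINARY_EXTENSIONS = (".bed", ".bim", ".fam")


def missing_plink_files(prefix):
    """Return the PLINK binary files that do not exist for this prefix.

    Hypothetical helper: the extension is appended to the prefix,
    so the prefix "data" is checked as data.bed, data.bim and data.fam.
    """
    candidates = [Path(prefix + ext) for ext in PLINK_BINARY_EXTENSIONS]
    return [str(path) for path in candidates if not path.is_file()]
```

An empty list means the prefix is usable as the value of --bfile.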
Procedure¶
Here are the steps performed by the module:
- Uses PLINK to prune markers according to LD.
- Checks if there are enough markers after pruning.
- Extracts the markers kept by the pruning.
- Runs PLINK with the genome option to compute the IBS values.
- Finds related individuals and gets values for plotting.
- Plots Z1 as a function of IBS2 ratio for related individuals.
- Plots Z2 as a function of IBS2 ratio for related individuals.
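The PLINK-related steps above can be sketched as a list of command lines. This is an illustration only, not the module's actual code: the function names are hypothetical, the commands are built but not executed, and applying --maf at the pruning step is an assumption. The flags themselves (--bfile, --maf, --indep-pairwise, --extract, --make-bed, --genome, --out) are standard PLINK options.

```python
def build_plink_commands(bfile, out="ibs", window=("50", "5", "0.1"), maf="0.05"):
    """Build PLINK command lines for pruning, extraction and IBS computation.

    Hypothetical sketch: default values mirror the defaults shown in the
    help message; the exact options used by pyGenClean may differ.
    """
    r2 = window[2]
    pruning_prefix = "{}.pruning_{}".format(out, r2)
    pruned_prefix = out + ".pruned_data"
    return [
        # LD pruning: PLINK writes the kept markers to <pruning_prefix>.prune.in
        ["plink", "--bfile", bfile, "--maf", maf,
         "--indep-pairwise", *window, "--out", pruning_prefix],
        # Keep only the pruned markers in a new binary data set
        ["plink", "--bfile", bfile,
         "--extract", pruning_prefix + ".prune.in",
         "--make-bed", "--out", pruned_prefix],
        # Compute pairwise IBS values; PLINK writes <out>.genome.genome
        ["plink", "--bfile", pruned_prefix,
         "--genome", "--out", out + ".genome"],
    ]


def has_enough_markers(prune_in_lines, minimum=10000):
    """Check the marker count (one marker name per non-empty line)."""
    return sum(1 for line in prune_in_lines if line.strip()) >= minimum
```

Each command list could be passed to subprocess.check_call, with has_enough_markers applied to the lines of the .prune.in file between the first and second commands.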
Output Files¶
The output files of each of the steps described above are as follows (note that
the output prefix shown is the default one, i.e. ibs):
- One set of PLINK’s result files:
  - ibs.pruning_0.1: the results of the pruning process of PLINK. The value in the name depends on the --indep-pairwise option. The markers that are kept are in the file ibs.pruning_0.1.prune.in.
- No file created.
- One set of PLINK’s binary files:
  - ibs.pruned_data: the data set containing only the markers from the first step (the list is in ibs.pruning_0.1.prune.in).
- One set of PLINK’s result files (two if --sge is used):
  - ibs.frequency: PLINK’s result files when computing the frequency of each of the pruned markers. This data set will exist only if the option --sge is used.
  - ibs.genome: PLINK’s results including IBS values.
- One file provided by pyGenClean.RelatedSamples.find_related_samples and three files provided by pyGenClean.RelatedSamples.merge_related_samples:
  - ibs.related_individuals: a subset of the ibs.genome.genome file containing only samples that are considered to be related. Three columns are appended to the original ibs.genome.genome file: IBS2_ratio (the value used to find related individuals), status (the type of relatedness [e.g. twins]) and code (a numerical code that represents the status). This file is provided by the pyGenClean.RelatedSamples.find_related_samples module.
  - ibs.merged_related_individuals: a file aggregating related samples in groups, containing the following columns: index (the group number), FID1 (the family ID of the first sample), IID1 (the individual ID of the first sample), FID2 (the family ID of the second sample), IID2 (the individual ID of the second sample) and status (the type of relatedness between the two samples). This file is provided by the merge_related_samples module.
  - ibs.chosen_related_individuals: the related individuals that were randomly chosen from each group to be kept in the final data set. This file is provided by the merge_related_samples module.
  - ibs.discarded_related_individuals: the related individuals that need to be discarded, so that the final data set includes only unrelated individuals. This file is provided by the merge_related_samples module.
- One image file:
  - ibs.related_individuals_z1.png: a plot showing the \(Z_1\) value as a function of the \(IBS2^*_{ratio}\) for all sample pairs above a certain \(IBS2^*_{ratio}\) (the default threshold is 0.8). See Figure Z1 in function of IBS2 ratio.
- One image file:
  - ibs.related_individuals_z2.png: a plot showing the \(Z_2\) value as a function of the \(IBS2^*_{ratio}\) for all sample pairs above a certain \(IBS2^*_{ratio}\) (the default threshold is 0.8). See Figure Z2 in function of IBS2 ratio.
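The threshold used for the plots can be illustrated with a small filter over a table that has an IBS2_ratio column. This is a sketch under assumptions: the input is taken to be tab-delimited with a header line, the comparison is taken to be "greater than or equal", and the function name pairs_above_ratio and the example rows are made up for illustration.

```python
import csv
import io


def pairs_above_ratio(handle, min_ratio=0.8):
    """Yield the rows whose IBS2_ratio is at least min_ratio.

    Hypothetical helper; assumes a tab-delimited table with a header
    that includes an IBS2_ratio column.
    """
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row["IBS2_ratio"]) >= min_ratio:
            yield row


# Made-up example data: two sample pairs, one above the default threshold.
example = io.StringIO(
    "FID1\tIID1\tFID2\tIID2\tIBS2_ratio\tstatus\n"
    "F1\tS1\tF1\tS2\t0.99\ttwins\n"
    "F2\tS3\tF3\tS4\t0.75\tunknown\n"
)
kept = list(pairs_above_ratio(example))
print([(row["IID1"], row["IID2"]) for row in kept])
```

Changing min_ratio corresponds to the --ibs2-ratio option in the help message above.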
The Plots¶
The first plot (Z1 in function of IBS2 ratio figure) shows \(Z_1\) as a
function of \(IBS2^*_{ratio}\), where each point represents a pair of
related individuals. The color code comes from the different values of
\(Z_0\), \(Z_1\) and \(Z_2\), as described in the
pyGenClean.RelatedSamples.find_related_samples.extractRelatedIndividuals()
function. In this plot, there are four locations where related samples tend to
accumulate: first degree relatives (full sibs), second degree relatives
(half-sibs, grand-parent-child or uncle-nephew), parent-child, and twins (or
duplicated samples). The unknown sample pairs represent possible undetected
related individuals.
The second plot (Z2 in function of IBS2 ratio figure) shows \(Z_2\) as a function of \(IBS2^*_{ratio}\), where each point represents a pair of related individuals. It is another representation of the relatedness of sample pairs, in which the “clusters” are located differently.
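To give an intuition for why the four clusters appear, the sketch below assigns a sample pair to the relationship whose theoretical \((Z_0, Z_1, Z_2)\) values are closest. This is not the rule used by extractRelatedIndividuals(), whose thresholds are not described here; the function name closest_relationship and the max_distance cutoff are assumptions made for illustration. Only the theoretical sharing probabilities are standard.

```python
import math

# Theoretical IBD sharing probabilities (Z0, Z1, Z2) for each relationship.
EXPECTED_Z = {
    "twins or duplicated samples": (0.0, 0.0, 1.0),
    "parent-child": (0.0, 1.0, 0.0),
    "full sibs": (0.25, 0.5, 0.25),
    "second degree": (0.5, 0.5, 0.0),
}


def closest_relationship(z0, z1, z2, max_distance=0.15):
    """Return the closest theoretical relationship, or "unknown".

    Illustrative only: max_distance is an arbitrary cutoff, not a value
    taken from pyGenClean.
    """
    best_label, best_distance = "unknown", max_distance
    for label, expected in EXPECTED_Z.items():
        # Euclidean distance between observed and expected values.
        distance = math.dist((z0, z1, z2), expected)
        if distance <= best_distance:
            best_label, best_distance = label, distance
    return best_label
```

Pairs that are far from every theoretical point fall into the "unknown" category, which matches the idea of possible undetected related individuals.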
Merging Related Samples¶
A standalone script was created to regroup related samples into different subsets. The usage is as follows:
$ pyGenClean_merge_related_samples --help
usage: pyGenClean_merge_related_samples [-h] [-v] --ibs-related FILE
[--no-status] [--out FILE]
Merges related samples according to IBS.
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
Input File:
--ibs-related FILE The input file containing related individuals according
to IBS value.
Options:
--no-status The input file doesn't have a 'status' column.
Output File:
--out FILE The prefix of the output files. [default: ibs_merged]
At the end of the analysis, two files are created. The file
*.chosen_related_individuals contains a list of randomly selected samples
according to their relatedness (to keep only one sample for each group of
related samples). The file *.discarded_related_individuals contains a list of
samples to exclude in order to keep only unrelated samples in a data set.
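The grouping and random selection described above can be sketched with a union-find structure: every related pair links two samples, each connected group keeps one randomly chosen sample, and the rest are discarded. This is an assumption about how such a merge could work, not the script's actual algorithm; the function names and the (FID, IID) tuple representation are hypothetical.

```python
import random


def merge_related_pairs(pairs):
    """Group samples so that every related pair ends up in the same group."""
    parent = {}

    def find(sample):
        # Follow parent links to the group representative (path halving).
        parent.setdefault(sample, sample)
        while parent[sample] != sample:
            parent[sample] = parent[parent[sample]]
            sample = parent[sample]
        return sample

    for first, second in pairs:
        parent[find(first)] = find(second)

    groups = {}
    for sample in parent:
        groups.setdefault(find(sample), set()).add(sample)
    return list(groups.values())


def choose_and_discard(groups, seed=None):
    """Randomly keep one sample per group and discard the others."""
    rng = random.Random(seed)
    chosen, discarded = [], []
    for group in groups:
        members = sorted(group)  # sorted so a given seed is reproducible
        keep = rng.choice(members)
        chosen.append(keep)
        discarded.extend(m for m in members if m != keep)
    return chosen, discarded
```

With pairs such as (("F1", "A"), ("F1", "B")) and (("F1", "B"), ("F2", "C")), the three samples form a single group, so one is kept and two are discarded.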
The Algorithm¶
For more information about the actual algorithms and source code, refer to the following page.