View on GitHub

ERDS-exome

A hybrid approach for copy number variants detection from whole-exome sequencing data

Turorial

This is a general tutorial for using ERDS-exome. You can also try ERDS-exome simply using test data. Please read the Quick Start Guide.

General command structure

ERDS-exome is a command-line python program (erds_exome.py). There are several sub-commands, each with a set of command line options. In general, the ERDS-exome pipeline consists of the following steps:

1. Calculate RPKM values for all samples.

  python erds_exome.py rpkm [...]

2. Merge RPKM values of all samples into a matrix.

  python erds_exome.py merge_rpkm [...]

3. Filter RPKM matrix by filter parameters.

  python erds_exome.py filter [...]

4. Normalization.

     In mode 1: Single-sample based normalization, normalize RPKM matrix by mappability, GC content and exon length.

  python erds_exome.py norm_rpkm [...]

     In mode 2: Pooled-sample based normalization, normalize RPKM matrix by singular value decomposition (SVD).

  python erds_exome.py svd [...]

5. CNV calling.

     In mode 1: CNV calling by single-sample mode.

     In mode 2: CNV calling by pooled-sample mode.

  python erds_exome.py discover [...]

6. Merge CNV calling results.

  python erds_exome.py merge [...]

0. Requirements and Setup

ERDS-exome requires Python v2.7 (Python 3 is not yet supported). And NumPy package should be installed. If you want to calculate RPKM values from BAM files, pysam is required. Pysam is also used to calculate GC content. In addition, bx-python is required to calculate mappability, and PyVCF is required to generate SNV heterozygosity information and tagSNP linkage disequilibrium (LD) information.

1. Create probe/target definition file

This file defines the chr:start-stop coordinates for the probes/targets in the exome capture. It should be tab-delimited with the following header and columns:
chr	start	stop
1	14642	14882
1	14943	15063
1	15751	15990
1	16599	16719
1	16834	17074
Where the first column represents the number of chromsome, and the second and third columns refer to the start index and end index of target respectively. ERDS-exome also support target file of bed format.

2. Create RPKM files

Using pysam package, ERDS-exome can calculate RPKM from aligned and indexed BAM files.
  python erds_exome.py rpkm \
  --input $bamlist_file \
  --maq $map_quality
  --target $target_file \
  --output $output_path
Where $target_file is the probes/target created in former step, $bamlist_file includes the bam files of all samples, and $map_quality is the minimum mapping quality score of reads. The format of $bamlist is like:
  NA06984	/data/rjtan/1000GP/exome/NA06984.mapped.ILLUMINA.bwa.CEU.exome.20120522.bam	CEU
  NA06985	/data/rjtan/1000GP/exome/NA06985.mapped.ILLUMINA.bwa.CEU.exome.20130415.bam	CEU
  NA06994	/data/rjtan/1000GP/exome/NA06994.mapped.ILLUMINA.bwa.CEU.exome.20120522.bam	CEU
  NA07000	/data/rjtan/1000GP/exome/NA07000.mapped.ILLUMINA.bwa.CEU.exome.20130415.bam	CEU
Where the first column refers to sample name, the second column refers to directory and name of bam file, and the third column refers to ethnic group of the corresponding sample.

3. Merge RPKM files

Merge the RPKM files from different samples into a matirx.
  python erds_exome.py merge_rpkm \
	--rpkm_dir $rpkm_files \
	--target $target_file
Where $rpkm_files includes all RPKM files. There should not be other files except RPKM files in $rpkm_files.

4. Filter RPKM matrix

Filter the RPKM matrix by mappability,
  python erds_exome.py filter \
  --rpkm_matrix $RC_matrix.raw \
  --ref_file $ref \
  --map_file $mapability_file \
  --filter_params $filter_params_file
Where $RC_matrix.raw is the targets by samples matrix generated in former step. $ref is human reference file (hg19) which can be downloaded here and $map_file refers to mappability file which can be downloaded here. $filter_params_file contains the filter parameters. Specifically, the format of filter parameter file is like:
  23	80	2	5000	0.001
The five parameters refer to minimum GC content, maximum GC content, minimum mappability value, maximum exon length and minimum RPKM value respectively.

5. Normalization

In mode 1: Normalize RPKM matrix by three steps, including mappability, GC content and exon length.
  python erds_exome.py norm_rpkm \
  --rpkm_matrix $RPKM_matrix.raw.filtered
In mode 2: Normalize RPKM matrix by SVD.
  python erds_exome.py svd \
  --rpkm_matrix $RPKM_matrix.raw.filtered

6. CNV calling

In mode 1: CNV calling by single sample mode. User can specify the HMM parameters file by '--params'. If '--hetsnp' is set to 'True', then the SNV heterozygosity information will be included. If '--tagsnp' is set to 'True', then the tagSNP linkage disequilibrium (LD) information will be included and a tagSNP file should be specified by '--tagsnp_file'. '--hetsnp' and '--tagsnp' are optional parameters. The default options of them are 'False'. If user specify a sample name by '--sample', ERDS-exome will only call CNV for this sample. Otherwise ERDS-exome will call CNV for all samples.
  python erds_exome.py discover \
  --params params.txt \
  --rpkm_matrix $RPKM_matrix.raw.filtered.normalized \
  --mode single \
  --sample $sample_name \
  --vcf $snv_vcf_file \
  --hetsnp True \
  --tagsnp True \
  --tagsnp_file $tagsnp_file \
  --output $sample_name.single.Het.Tag.cnv
In mode 2: CNV calling by pooled sample mode. All parameters are the same with CNV calling by single sample mode, except that '--mode' is set to 'pooled'.
  python erds_exome.py discover \
  --params params.txt \
  --rpkm_matrix $RPKM_matrix.raw.filtered.SVD \
  --mode pooled \
  --sample $sample_name \
  --vcf=$snv_vcf_file \
  --hetsnp True \
  --tagsnp True \
  --tagsnp_file=$tagsnp_file \
  --output $sample_name.pooled.Het.Tag.cnv

7. Merge CNV calling results

Merge CNV calling results from different methods.
  python erds_exome.py merge \
  --datafile_svd $sample_name.pooled.Het.Tag.cnv \
  --datafile_dis $sample_name.single.Het.Tag.cnv \
  --output $sample_name.merged.Het.Tag.cnv