Turorial
This is a general tutorial for using ERDS-exome. You can also try ERDS-exome simply using test data. Please read the Quick Start Guide.
General command structure
ERDS-exome is a command-line python program (erds_exome.py). There are several sub-commands, each with a set of command line options. In general, the ERDS-exome pipeline consists of the following steps:1. Calculate RPKM values for all samples.
python erds_exome.py rpkm [...]
2. Merge RPKM values of all samples into a matrix.
python erds_exome.py merge_rpkm [...]
3. Filter RPKM matrix by filter parameters.
python erds_exome.py filter [...]
4. Normalization.
In mode 1: Single-sample based normalization, normalize RPKM matrix by mappability, GC content and exon length.
python erds_exome.py norm_rpkm [...]
In mode 2: Pooled-sample based normalization, normalize RPKM matrix by singular value decomposition (SVD).
python erds_exome.py svd [...]
5. CNV calling.
In mode 1: CNV calling by single-sample mode.
In mode 2: CNV calling by pooled-sample mode.
python erds_exome.py discover [...]
6. Merge CNV calling results.
python erds_exome.py merge [...]
0. Requirements and Setup
ERDS-exome requires Python v2.7 (Python 3 is not yet supported). And NumPy package should be installed. If you want to calculate RPKM values from BAM files, pysam is required. Pysam is also used to calculate GC content. In addition, bx-python is required to calculate mappability, and PyVCF is required to generate SNV heterozygosity information and tagSNP linkage disequilibrium (LD) information.
1. Create probe/target definition file
This file defines the chr:start-stop coordinates for the probes/targets in the exome capture. It should be tab-delimited with the following header and columns:chr start stop 1 14642 14882 1 14943 15063 1 15751 15990 1 16599 16719 1 16834 17074Where the first column represents the number of chromsome, and the second and third columns refer to the start index and end index of target respectively. ERDS-exome also support target file of bed format.
2. Create RPKM files
Using pysam package, ERDS-exome can calculate RPKM from aligned and indexed BAM files.python erds_exome.py rpkm \ --input $bamlist_file \ --maq $map_quality --target $target_file \ --output $output_pathWhere $target_file is the probes/target created in former step, $bamlist_file includes the bam files of all samples, and $map_quality is the minimum mapping quality score of reads. The format of $bamlist is like:NA06984 /data/rjtan/1000GP/exome/NA06984.mapped.ILLUMINA.bwa.CEU.exome.20120522.bam CEU NA06985 /data/rjtan/1000GP/exome/NA06985.mapped.ILLUMINA.bwa.CEU.exome.20130415.bam CEU NA06994 /data/rjtan/1000GP/exome/NA06994.mapped.ILLUMINA.bwa.CEU.exome.20120522.bam CEU NA07000 /data/rjtan/1000GP/exome/NA07000.mapped.ILLUMINA.bwa.CEU.exome.20130415.bam CEUWhere the first column refers to sample name, the second column refers to directory and name of bam file, and the third column refers to ethnic group of the corresponding sample.
3. Merge RPKM files
Merge the RPKM files from different samples into a matirx.python erds_exome.py merge_rpkm \ --rpkm_dir $rpkm_files \ --target $target_fileWhere $rpkm_files includes all RPKM files. There should not be other files except RPKM files in $rpkm_files.
4. Filter RPKM matrix
Filter the RPKM matrix by mappability,python erds_exome.py filter \ --rpkm_matrix $RC_matrix.raw \ --ref_file $ref \ --map_file $mapability_file \ --filter_params $filter_params_fileWhere $RC_matrix.raw is the targets by samples matrix generated in former step. $ref is human reference file (hg19) which can be downloaded here and $map_file refers to mappability file which can be downloaded here. $filter_params_file contains the filter parameters. Specifically, the format of filter parameter file is like:23 80 2 5000 0.001The five parameters refer to minimum GC content, maximum GC content, minimum mappability value, maximum exon length and minimum RPKM value respectively.
5. Normalization
In mode 1: Normalize RPKM matrix by three steps, including mappability, GC content and exon length.python erds_exome.py norm_rpkm \ --rpkm_matrix $RPKM_matrix.raw.filteredIn mode 2: Normalize RPKM matrix by SVD.python erds_exome.py svd \ --rpkm_matrix $RPKM_matrix.raw.filtered
6. CNV calling
In mode 1: CNV calling by single sample mode. User can specify the HMM parameters file by '--params'. If '--hetsnp' is set to 'True', then the SNV heterozygosity information will be included. If '--tagsnp' is set to 'True', then the tagSNP linkage disequilibrium (LD) information will be included and a tagSNP file should be specified by '--tagsnp_file'. '--hetsnp' and '--tagsnp' are optional parameters. The default options of them are 'False'. If user specify a sample name by '--sample', ERDS-exome will only call CNV for this sample. Otherwise ERDS-exome will call CNV for all samples.python erds_exome.py discover \ --params params.txt \ --rpkm_matrix $RPKM_matrix.raw.filtered.normalized \ --mode single \ --sample $sample_name \ --vcf $snv_vcf_file \ --hetsnp True \ --tagsnp True \ --tagsnp_file $tagsnp_file \ --output $sample_name.single.Het.Tag.cnvIn mode 2: CNV calling by pooled sample mode. All parameters are the same with CNV calling by single sample mode, except that '--mode' is set to 'pooled'.python erds_exome.py discover \ --params params.txt \ --rpkm_matrix $RPKM_matrix.raw.filtered.SVD \ --mode pooled \ --sample $sample_name \ --vcf=$snv_vcf_file \ --hetsnp True \ --tagsnp True \ --tagsnp_file=$tagsnp_file \ --output $sample_name.pooled.Het.Tag.cnv
7. Merge CNV calling results
Merge CNV calling results from different methods.python erds_exome.py merge \ --datafile_svd $sample_name.pooled.Het.Tag.cnv \ --datafile_dis $sample_name.single.Het.Tag.cnv \ --output $sample_name.merged.Het.Tag.cnv