Vignette: Total copy-number segmentation (non-paired CBS)

Author: Henrik Bengtsson
Created on: 2011-11-05
Last updated: 2011-11-05

This document explains how to do total copy-number (CN) segmentation on a CN data set, how to export the segmentation results to tabular text files.

Preprocessing

We assume that locus-specific CN estimates have already been obtained via one of many CN preprocessing methods (e.g. CRMA v2) and that those estimates are available as a AromaUnitTotalCnBinarySet data set(*). Note that a AromaUnitTotalCnBinarySet can hold CN data of any microarray technology, not only Affymetrix, which means that what follows can be used to segment for instance also Illumina and Agilent CN data.

(*) Alternatively, a CnChipEffectSet (Affymetrix only) can also be used.

Setting up an already preprocessed data set

Assume that a CEL data set named 'HapMap270,6.0,CEU,testSet' has previously been processed by doCRMAv2() and afterward R was quit. To access the results, which was automatically stored on the file system, do:

dataSet <- "HapMap270,6.0,CEU,testSet"
tags <- "ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY";  # Tags added by CRMA v2
chipType <- "GenomeWideSNP_6"
ds <- AromaUnitTotalCnBinarySet$byName(dataSet, tags=tags, chipType=chipType)

so that print(ds) gives:

AromaUnitTotalCnBinarySet:
Name: HapMap270
Tags: 6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY
Full name: HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY
Number of files: 3
Names: NA06991, NA06993, NA07000 [3]
Path (to the first file): totalAndFracBData/HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY/GenomeWideSNP_6
Total file size: 21.53 MB
RAM: 0.00MB

AromaUnitTotalCnBinarySet:
Name: HapMap270
Tags: 6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY
Full name: HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY
Number of files: 3
Names: NA06991, NA06993, NA07000 [3]
Path (to the first file): totalAndFracBData/HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY/GenomeWideSNP_6
Total file size: 21.53 MB
RAM: 0.00MB

Segmenting

Here we will use the Circular Binary Segmentation (CBS) method for partitioning the CN signals in data set ds into segments of equal underlying CN levels. To setup a CBS model for our data set, we do:

sm <- CbsModel(ds)

so that print(sm) gives:

**CbsModel**:
Name: HapMap270
Tags: 6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY
Chip type (virtual): GenomeWideSNP_6
Path: cbsData/**HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY**/GenomeWideSNP_6
Number of chip types: **1**
Sample & reference file pairs:
Chip type #1 ('**GenomeWideSNP_6**') of 1:
Sample data set:
AromaUnitTotalCnBinarySet:
Name: HapMap270
Tags: 6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY
Full name: HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY
Number of files: **3**
Names: **NA06991, NA06993, NA07000** [3]
Path (to the first file): totalAndFracBData/HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY/GenomeWideSNP_6
Total file size: 21.53 MB
RAM: 0.00MB
Reference data set/file:
**<average across arrays>**
RAM: 0.00MB

From this we can see that:

Segmentation will be done using CBS ("CbsModel")
The full name of the data set will be 'HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY'.
There is only type of chips segmented ("GenomeWideSNP_6").
There are 3 samples that will be segmented.
Total copy numbers for these 3 samples are calculated using the pool of all samples (the 3 ones) as a reference ("<average across arrays>").

Note that above we have only setup the segmentation method. We still have to run the segmentation, which we do by:

fit(sm, verbose=-10)

This will segment each sample and each chromosome independently (and store the results to binary files). Note that this may take several minutes per sample. It is possible to segment a subset of the samples and chromosomes, by specifying arguments 'arrays' and 'chromosomes'.

To use another segmentation method, all that is need is to replace CbsModel(), e.g. GladModel() to use the GLAD segmentation (and calling) method.

Exporting segmentation results to tabular text files

To write the identified segments of all samples to a tab-delimited text file, do:

pathname <- writeRegions(sm, verbose=verbose)

The pathname specifies where the file is saved, e.g. print(pathname):

[1]
"cbsData/HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY/GenomeWideSNP_6/HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY,regions.xls"

The contents of this file can be loaded using read.table(), or more conveniently as:

library("R.filesets")
db <- TabularTextFile(pathname)
df <- readDataFrame(db)

such that print(df[1:5,1:6]) gives:

   sample chromosome     start      stop   mean count  
1 NA06991          1     61736 106013377 -0.002 65870  
2 NA06991          1 106019206 106022376 -1.675     1  
3 NA06991          1 106024056 149036525 -0.002 11462  
4 NA06991          1 149040066 149256692 -0.443   141  
5 NA06991          1 149259417 149436843 -0.144    36

The content of the segmentation table is specific to the segmentation method used. Note also that it is only some "segmentation" method that also call the CN state of the segments, e.g. GLAD (GladModel). The CBS method used here does not do that. Calling CN states is a hard problem, especially when there are heterogeneity in the cell population.