Author: Henrik Bengtsson
Created on: 2011-11-05
Last updated: 2011-11-05
This document explains how to do total copy-number (CN) segmentation on a CN data set, and how to export the segmentation results to tabular text files.
We assume that locus-specific CN estimates have already been obtained via one of many CN preprocessing methods (e.g. CRMAv2) and that those estimates are available as an AromaUnitTotalCnBinarySet data set(*). Note that an AromaUnitTotalCnBinarySet can hold CN data from any microarray technology, not only Affymetrix, which means that what follows can also be used to segment, for instance, Illumina and Agilent CN data.
(*) Alternatively, a CnChipEffectSet (Affymetrix only) can also be used.
Assume that a CEL data set named 'HapMap270,6.0,CEU,testSet' has previously been processed by doCRMAv2() and that R was quit afterward. To access the results, which were automatically stored on the file system, start a new R session and do:
library("aroma.affymetrix");
dataSet <- "HapMap270,6.0,CEU,testSet";
tags <- "ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY";  # Tags added by CRMAv2
chipType <- "GenomeWideSNP_6";
ds <- AromaUnitTotalCnBinarySet$byName(dataSet, tags=tags, chipType=chipType);
so that print(ds) gives:
AromaUnitTotalCnBinarySet:
Name: HapMap270
Tags: 6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY
Full name: HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY
Number of files: 3
Names: NA06991, NA06993, NA07000 [3]
Path (to the first file): totalAndFracBData/HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY/GenomeWideSNP_6
Total file size: 21.53 MB
RAM: 0.00MB
Here we will use the Circular Binary Segmentation (CBS) method for partitioning the CN signals in data set 'ds' into segments of equal underlying CN levels. To set up a CBS model for our data set, we do:
sm <- CbsModel(ds);
so that print(sm) gives:
CbsModel:
Name: HapMap270
Tags: 6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY
Chip type (virtual): GenomeWideSNP_6
Path: cbsData/HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY/GenomeWideSNP_6
Number of chip types: 1
Sample & reference file pairs:
Chip type #1 ('GenomeWideSNP_6') of 1:
Sample data set:
AromaUnitTotalCnBinarySet:
Name: HapMap270
Tags: 6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY
Full name: HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY
Number of files: 3
Names: NA06991, NA06993, NA07000 [3]
Path (to the first file): totalAndFracBData/HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY/GenomeWideSNP_6
Total file size: 21.53 MB
RAM: 0.00MB
Reference data set/file:
<average across arrays>
RAM: 0.00MB
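From this we can see that the model covers three samples (NA06991, NA06993 and NA07000) on a single chip type, and that the reference is the average across arrays. To fit the model, that is, to segment the data, do: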
fit(sm, verbose=-10);
This will segment each sample and each chromosome independently (and store the results to binary files). Note that this may take several minutes per sample. It is possible to segment a subset of the samples and chromosomes, by specifying arguments 'arrays' and 'chromosomes'.
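As a sketch of the subset option, the call below segments only the first two samples on chromosomes 1 and 22. It assumes that 'sm' is the CbsModel set up above and that the data set is available on the file system; the particular indices are arbitrary choices for illustration.

```r
library("aroma.affymetrix")

# Assumes 'sm' was created by sm <- CbsModel(ds) as shown above.
# Arbitrary illustration: samples 1-2 (NA06991, NA06993) on
# chromosomes 1 and 22 only.
fit(sm, arrays=1:2, chromosomes=c(1, 22), verbose=-10)
```

Restricting a first run to one or two chromosomes is a cheap way to check that the setup works before spending several minutes per sample on the full genome.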
To use another segmentation method, all that is needed is to replace CbsModel() with another model, e.g. GladModel() to use the GLAD segmentation (and calling) method.
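A minimal sketch of that swap is given below. It assumes that 'ds' is the AromaUnitTotalCnBinarySet loaded earlier and that the GLAD package, on which GladModel() relies, is installed; the variable name 'smGlad' is ours.

```r
library("aroma.affymetrix")

# Assumes 'ds' was loaded with AromaUnitTotalCnBinarySet$byName() as above.
# Same workflow as for CBS; only the model constructor changes.
smGlad <- GladModel(ds)
print(smGlad)

# Segment all samples and chromosomes with GLAD
fit(smGlad, verbose=-10)
```

Using a separate variable keeps the CBS model 'sm' available, so results from the two methods can be compared side by side.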
To write the identified segments of all samples to a tab-delimited text file, do:
pathname <- writeRegions(sm, verbose=-10);
The 'pathname' specifies where the file is saved, e.g. print(pathname):
[1] "cbsData/HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY/GenomeWideSNP_6/HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY,regions.xls"
The contents of this file can be loaded using read.table(), or more conveniently as:
library("R.filesets");
db <- TabularTextFile(pathname);
df <- readDataFrame(db);
such that print(df[1:5,1:6]) gives:
   sample chromosome     start      stop   mean count
1 NA06991          1     61736 106013377 -0.002 65870
2 NA06991          1 106019206 106022376 -1.675     1
3 NA06991          1 106024056 149036525 -0.002 11462
4 NA06991          1 149040066 149256692 -0.443   141
5 NA06991          1 149259417 149436843 -0.144    36
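For the read.table() route mentioned above, here is a sketch in base R only. So that it runs without the real regions file, it first recreates the five rows shown above in a temporary tab-delimited file; the real file may contain additional columns or header comments. The names 'tmpPathname', 'width', 'fewLoci' and the threshold 'minCount' are ours, and segment width is computed under the assumption that 'start' and 'stop' are inclusive positions.

```r
# Recreate the five rows printed above as a tab-delimited text file
lines <- c(
  "sample\tchromosome\tstart\tstop\tmean\tcount",
  "NA06991\t1\t61736\t106013377\t-0.002\t65870",
  "NA06991\t1\t106019206\t106022376\t-1.675\t1",
  "NA06991\t1\t106024056\t149036525\t-0.002\t11462",
  "NA06991\t1\t149040066\t149256692\t-0.443\t141",
  "NA06991\t1\t149259417\t149436843\t-0.144\t36"
)
tmpPathname <- tempfile(fileext=".xls")
writeLines(lines, con=tmpPathname)

# Load the table with base R
df <- read.table(tmpPathname, header=TRUE, sep="\t", stringsAsFactors=FALSE)

# Segment width in base pairs (assuming inclusive start/stop positions)
df$width <- df$stop - df$start + 1

# Flag segments supported by very few loci (arbitrary threshold)
minCount <- 5
df$fewLoci <- (df$count < minCount)

print(df[, c("start", "stop", "width", "count", "fewLoci")])
```

In this excerpt only the second segment is flagged: it rests on a single locus, so its extreme mean level is more likely to reflect an outlier than a real CN change.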
The content of the segmentation table is specific to the segmentation method used. Note also that only some segmentation methods also call the CN state of the segments, e.g. GLAD (GladModel). The CBS method used here does not do that. Calling CN states is a hard problem, especially when there is heterogeneity in the cell population.
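To make concrete what "calling" means, here is a deliberately naive sketch that thresholds the segment means from the table above. This is not what GLAD does, and the thresholds are arbitrary values of ours; a fixed cutoff like this is exactly what breaks down when the cell population is heterogeneous and the mean levels shrink toward zero.

```r
# Segment means copied from the table above
segMeans <- c(-0.002, -1.675, -0.002, -0.443, -0.144)

# Hypothetical, arbitrary thresholds on the segment mean level
lossThreshold <- -0.3
gainThreshold <- +0.3

# Naive calls: below the loss threshold, above the gain threshold,
# or neutral in between
calls <- cut(segMeans,
             breaks=c(-Inf, lossThreshold, gainThreshold, +Inf),
             labels=c("loss", "neutral", "gain"))

print(data.frame(mean=segMeans, call=calls))
```

Note that the fifth segment (mean -0.144) is called neutral here, although it may well be a true loss present in only a fraction of the cells.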