aroma.affymetrix 2.4.0
aroma.cn 1.0.0
What's new?
Author: Henrik Bengtsson
Created on: 2009-01-15
Last updated: 2009-01-15
This document shows how to estimate genotypes based on the CRLMM algorithm. The setup is such that it tries to replicate the estimates of the justCRLMMv2() of the oligo package. Support for CRLMM in aroma.affymetrix is currently for the 100K and the 500K chip types. The GWS5 or GWS6 chip types are yet not supported.
If this is your first analysis in aroma.affymetrix, please make sure to first read the first few pages in the online 'Users Guide'. This will explain the importance of following a well defined directory structure and file names. Understanding this is important and will save you a lot of time.
annotationData/
chipTypes/
Mapping250K_Nsp/
Mapping250K_Nsp.cdf
Mapping250K_Nsp,na26,HB20080915.ugp
rawData/
HapMap270,500K,CEU,testSet/
Mapping250K_Nsp/
NA06985.CEL NA06991.CEL NA06993.CEL
NA06994.CEL NA07000.CEL NA07019.CEL
There should be 6 CEL files in total. This data was downloaded from the HapMap website.
library("aroma.affymetrix");
log <- verbose <- Arguments$getVerbose(-8, timestamp=TRUE);
cdf <- AffymetrixCdfFile$byChipType("Mapping50K_Nsp");
print(cdf);
AffymetrixCdfFile:
Path: annotationData/chipTypes/Mapping250K_Nsp
Filename: Mapping250K_Nsp.cdf
Filesize: 185.45MB
Chip type: Mapping250K_Nsp
RAM: 11.01MB
File format: v4 (binary; XDA)
Number of cells: 6553600
Number of units: 262338
Cells per unit: 24.98
Number of QC units: 6
csR <- AffymetrixCelSet$byName("HapMap270,500K,CEU,testSet", cdf=cdf);
print(csR);
AffymetrixCelSet:
Name: HapMap270
Tags: 500K,CEU,testSet
Path: rawData/HapMap270,500K,CEU,testSet/Mapping250K_Nsp
Platform: Affymetrix
Chip type: Mapping250K_Nsp
Number of arrays: 6
Names: NA06985, NA06991, ..., NA07019
Time period: 2005-11-16 15:17:51 -- 2005-11-18 19:08:24
Total file size: 375.88MB
RAM: 0.01MB
ces <- justSNPRMA(csR, normalizeToHapmap=TRUE, returnESet=FALSE, verbose=log); print(ces);
SnpChipEffectSet:
Name: HapMap270
Tags: 500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-
Path: plmData/HapMap270,500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-/Mapping250K_Nsp
Platform: Affymetrix
Chip type: Mapping250K_Nsp,monocell
Number of arrays: 6
Names: NA06985, NA06991, ..., NA07019
Time period: 2009-01-10 17:34:45 -- 2009-01-10 17:34:46
Total file size: 57.34MB
RAM: 0.01MB
Parameters: (probeModel: chr "pm", mergeStrands: logi FALSE)
Setting up the model:
crlmm <- CrlmmModel(ces, tags="*,oligo"); print(crlmm);
CrlmmModel:
Data set: HapMap270
Chip type: Mapping250K_Nsp,monocell
Input tags: 500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-
Output tags: 500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo
Parameters: (balance: num 1.5; minLLRforCalls: num [1:3] 5 1 5; recalibrate: logi FALSE;flavor: chr "v2").
Path: crlmmData/HapMap270,500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo/Mapping250K_Nsp
RAM: 0.00MB
Fitting the model:
units <- fit(crlmm, ram="oligo", verbose=log); str(units);
int [1:262264] 135666 261964 227992 3670 227993 159191 212754 24401 ...
Retrieving the genotype calls (without actually loading anything):
callSet <- getCallSet(crlmm); print(callSet);
AromaUnitGenotypeCallSet:
Name: HapMap270
Tags: 500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo
Full name: HapMap270,500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo
Number of files: 6
Names: NA06985, NA06991, ..., NA07019
Path (to the first file): crlmmData/HapMap270,500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo/Mapping250K_Nsp
Total file size: 3.00MB
RAM: 0.01MB
calls <- extractGenotypes(callSet, units=2001:2005); print(calls);
NA06985 NA06991 NA06993 NA06994 NA07000 NA07019
[1,] "AA" "AA" "AA" "AA" "AA" "AA"
[2,] "AA" "AB" "AB" "BB" "AB" "AB"
[3,] "AA" "AB" "AB" "AB" "AA" "AB"
[4,] "BB" "AB" "AB" "BB" "BB" "BB"
[5,] "AA" "AA" "AA" "AA" "AA" "AA"
confSet <- getConfidenceScoreSet(crlmm); print(confSet);
AromaUnitSignalBinarySet:
Name: HapMap270
Tags: 500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo
Full name: HapMap270,500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo
Number of files: 6
Names: NA06985, NA06991, ..., NA07019
Path (to the first file): crlmmData/HapMap270,500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo/Mapping250K_Nsp
Total file size: 6.01MB
RAM: 0.01MB
scores <- extractMatrix(confSet, units=2001:2005); print(scores);
NA06985 NA06991 NA06993 NA06994 NA07000 NA07019
[1,] 0.9998 0.9998 0.9998 0.9999 0.9999 0.9998
[2,] 0.9968 0.9932 0.9931 0.9998 0.9976 0.9974
[3,] 0.9997 0.9993 0.9967 0.9988 0.9998 0.9975
[4,] 0.9986 0.9961 0.9894 0.9994 0.9994 0.9989
[5,] 0.9998 0.9999 0.9998 0.9999 0.9998 0.9998
gi <- getGenomeInformation(cdf); units <- getUnitsOnChromosome(gi, chromosome=2, region=c(75,90)*1e6); pos <- getPositions(gi, units=units) / 1e6;
Get data for the last array:
array <- 6; ce <- getFile(ces, array); data <- extractTotalAndFreqB(ce, units=units); str(data);
num [1:1306, 1:2] 2312 2185 1994 3242 4156 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "total" "freqB"
Extract the (total) theta and the freqB columns:
theta <- data[,"total"]; freqB <- data[,"freqB"];
We calculate the copy-neutral reference as the robust average of all arrays:
ceR <- getAverageFile(ces); dataR <- extractTotalAndFreqB(ceR, units=units); thetaR <- dataR[,"total"];
Then we calculate the log2 copy-number ratios:
M <- log2(theta/thetaR);
Finally, we extract the corresponding genotype calls:
cf <- getFile(callSet, array); calls <- extractGenotypes(cf, units=units);
Next, we plot the (x,C) and (x,freqB) along the genome annotated with colors according to genotype:
col <- c(AA=1, AB=2, BB=3)[calls]; xlim <- range(pos, na.rm=TRUE); xlab <- "Physical position (Mb)"; Mlab <- expression(log[2](theta/theta[R])); Blab <- expression(beta == theta[B]/theta); subplots(2, ncol=1); par(mar=c(3,4,1,1)+0.1, pch="."); plot(pos, M, col=col, cex=3, xlim=xlim, ylim=c(-2,2), xlab=xlab, ylab=Mlab); stext(side=3, pos=0, getName(ce)); stext(side=3, pos=1, "Chr 2"); plot(pos, freqB, col=col, cex=3, xlim=xlim, ylim=c(0,1), xlab=xlab, ylab=Blab);

path <- file.path("oligoData", getFullName(csR), getChipType(csR, fullname=FALSE));
path <- Arguments$getWritablePathname(path);
if (!isDirectory(path)) {
mkdirs(getParent(path));
oligo:::justCRLMMv2(getPathnames(csR), tmpdir=path, recalibrate=FALSE, balance=1.5, verbose=TRUE);
}
calls0 <- readSummaries("calls", path);
dimnames(calls0) <- NULL;
units <- indexOf(cdf, pattern="^SNP"); unitNames <- getUnitNames(cdf, units=units); units <- units[order(unitNames)]; calls <- extractGenotypes(callSet, units=units, encoding="oligo"); dimnames(calls) <- NULL;
count <- 0;
for (cc in 1:ncol(calls)) {
idxs <- whichVector(calls[,cc] != calls0[,cc]);
count <- count + length(idxs);
cat(sprintf("%s: ", getNames(callSet)[cc]));
if (length(idxs) > 0) {
map <- c("AA", "AB", "BB");
cat(paste(map[calls[idxs,cc]], map[calls0[idxs,cc]], sep="!="), sep=",");
}
cat("\n");
}
We identify the following difference for each of the genotyped samples:
NA06985: AB!=AA, AB!=BB, AB!=AA, BB!=AB, BB!=AB
NA06991: AB!=AA
NA06993: AA!=AB, BB!=AB
NA06994: AB!=AA, BB!=AB
NA07000: AB!=BB, AA!=AB
NA07019: AA!=AB, AA!=AB, AA!=AB, AA!=AB
Note that these differences are only in one allele, that is, one implementation calls a SNP heterozygote whereas the other calls a homozygote.
cat(sprintf("Averages number of discrepances per array: %.1f\n", count/ncol(calls)));
Averages number of discrepances per array: 2.7
errorRate <- count/length(calls);
cat(sprintf("Concordance rate: %.5f%%\n", 100*(1-errorRate)));
Concordance rate: 99.99898%
Confidence scores according to oligo:
conf0 <- readSummaries("conf", path);
dimnames(conf) <- NULL;
Confidence scores according to aroma.affymetrix:
conf <- extractMatrix(confSet, units=units); dimnames(conf) <- NULL;
Differences between aroma.affymetrix and oligo:
delta <- conf - conf0;
avgDelta <- mean(abs(delta), na.rm=TRUE);
cat(sprintf("Averages difference: %.2g\n", avgDelta));
Averages difference: 7.1e-06
Pairwise plots:
subplots(ncol(conf));
par(mar=c(3,2,1,1)+0.1);
lim <- c(0,1);
for (cc in 1:ncol(conf)) {
plot(NA, xlim=lim, ylim=lim);
abline(a=0, b=1, col="#999999");
points(conf[,cc], conf0[,cc], pch=".", cex=3);
rho <- cor(conf[,cc], conf0[,cc]);
stext(side=3, pos=0, line=-1, sprintf("rho=%.4f", rho));
stext(side=3, pos=0, getNames(confSet)[cc]);
cat(sprintf("Array #%d: Correlation: %.4f\n", cc, rho));
}

> sessionInfo()
R version 2.8.1 Patched (2008-12-22 r47296)
i386-pc-mingw32
locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
attached base packages:
[1] splines tools stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] pd.mapping250k.nsp_0.4.1 oligoClasses_1.3.8
[4] AnnotationDbi_1.3.12 preprocessCore_1.3.4 RSQLite_0.7-0
[7] DBI_0.2-4 Biobase_2.1.7 aroma.affymetrix_1.0.0
[10] aroma.apd_0.1.4 R.huge_0.1.6 affxparser_1.15.1
[13] aroma.core_1.0.0 aroma.light_1.11.1 oligo_1.5.9
[16] digest_0.3.1 matrixStats_0.1.3 R.rsp_0.3.4
[19] R.cache_0.1.7 R.utils_1.1.3 R.oo_1.4.6
[22] R.methodsS3_1.0.3