aroma.affymetrix 1.7.0
aroma.cn 0.5.0
What's new?
Author: Elizabeth Purdom (pruned by Henrik Bengtsson)
Created on: 2008-12-19
Last updated: 2009-12-12
I am collecting here functions that Mark Robinson and I have developed to create CDFs from scratch. Ultimately the goal will be to create single R functions, but right now these are functions that make up a series of steps (not all of which are in perl).
An example of steps for Ensembl CDF for the exon array:
To filter, see example below
The main file flat2cdf.R contains flat2Cdf() for making the CDF, which is a function in R that takes a 'flat' file and converts it to a binary CDF file. The flat file is a tab-delimited file with each row corresponding to a probe to be included in the CDF. Right now the format of a 'flat' file is a bit rigid.
It must be:
Example:
Probe_ID X Y Probe_Sequence Group_ID Unit_ID
15 14 0 TCTCCAGTGAAGTGCACATTGCTCA 3029044 ENSG00000106144
17 16 0 TGATCGCCTGTCTGCAGATAGGGCA 2400195 ENSG00000090432
29 28 0 TGTAGCTACATGAGGTCTCAGCAGT 2690968 ENSG00000121577
34 33 0 TGGGATGAATGATCAGGAACTGCTG 3032901 ENSG00000130226
35 34 0 TCTGGAAATGCATCAGGGACATCTG 3087574 ENSG00000155975
51 50 0 TATGACAGGATAGCACCAGACGAGC 2808483 ENSG00000112992
58 57 0 TGAGTACGAATGTGCATCTTCAGTC 2783822 ENSG00000138738
59 58 0 TATGTGACTGGATTGCAAGTCTCTC 2403372 ENSG00000158161
62 61 0 CATGTAGATCTATGCCAGTTGAATC 2507416 ENSG00000048991
The R function takes the following arguments:
Example (within R):
flat2Cdf(file="hjay.r1.flat", chipType="hjay", tag="r1,TC");
Affymetrix generally has one file that gives the probe to probeset idenfication. Then there might be a separate file that relates probesets to genes. This information must be combined to get a single flat file for flat2Cdf(). The perl script convertProbesetCSV_differentInput.pl converts these files to an output 'flat' file. This may not work for all input (I use the file *.probe.tab from Affymetrix).
It takes 3 arguments:
Optional arguments concerning the first file:
-p column with the probeset (default 1) -g column with the genes (default 2)
Example (command line; one row):
convertProbesetCSV_differentInput.pl ensembl2affy.csv MoEx-1_0-st-v1.probe.tab MoEnsembl50.flat -p 1 -g 6
The annotateProbesets() function (annotateProbesets.R) is for taking Affymetrix's probesets ('psr') and mapping them to a different gene definition. This might be a bit exon array specific. It was designed for Human Ensembl annotation, though I used it for Mouse. It could be extended to be just probes, perhaps.
This is an R function with the following arguments:
Some strange assumptions about the files (should be fixed...):
Description of the overlapping strategy (It is run after probesets are assigned to genes based on the inputted exonBoundaries file.)
Example Usage:
path <- "HumanEnsembl50Build";
probesetName <- file.path(path, "HuEx-1_0-st-v2.na26.hg18.probeset.csv");
annotExonBoundaries <- file.path(path, "exonBoundaries_proteinCoding_20080819.txt");
colClasses <- rep("NULL", 39); # 39=number of columns in HuEx-1_0-st-v2.naXX.hg18.probeset.csv
colClasses[c(1:5,16)] <- NA; # the columns that I want to pull in
overlapType <- "probeset";
overlapCutoff <- 1.0;
pbsetHeaderNames <- c("probeset_id", "seqname", "strand", "start", "stop", "level");
exonHeaderNames <- c("gene_id", "exon_id", "chr", "strand", "start", "end");
annotateProbesets(probesetFile=probesetName,
exonBoundariesFile=annotExonBoundaries,
probesetFileParameters=list(sep=",", colClasses=colClasses, comment.char="#"),
probesetFileNames=pbsetHeaderNames,
exonBoundariesFileParameters=list(sep="\t", colClasses=NA, comment.char="#"),
exonBoundariesFileNames=exonHeaderNames,
overlapType=overlapType, overlapCutoff=overlapCutoff,
outdir=path, verbose=TRUE);
If you already have an existing flat file, you might want to create a new CDF with just a subset of the probes. selectProbes.pl takes a starting flat file and a file with the xy coordinates of the probes to keep and returns a new flat file.
The program takes 3 arguments:
Then if I want to filter out some probes based on my own analysis of my data, I can write these probes to a file
cells <- unlist(ensb50NOvHuCells, use.names=FALSE);
probeValues <- readRawData(secLgNHuOnHu_HM, indices=cells, fields=c("xy","intensities"));
probeValues$geneId <- rep(names(ensb50NOvHuCells), times=perGeneProbeCountHu);
probeValues$Indices <- cells;
write.table(probeValues[myProbes,c("x","y")], file="probeFile.txt", col.names=TRUE, row.names=FALSE, quote=FALSE, sep="\t");
Then in command line:
selectProbes.pl HuEnsembl50_nooverlap.flat probeFile.txt probeFile.flat
Then back to R:
flat2Cdf(file=probeFile.flat, chipType="HuEx-1_0-st-v2", tags="U-Ensembl50,G-Affy,customMH,joint,80PctLowMo,80PctHighHu,EP", col.class=c("integer", "character")[c(1,1,1,2,2,2)], splitn=5);
Then the CDF must be moved to the correct annotationData/ directory.
Affymetrix sometimes provides CLF and PGF files. These are not necessary fixed formats. CLF files (that I have) are tab-delimited, each line a probe with three columns corresponding to 1) Probe id 2) x position 3) y position. PGF files are not tab-delimited files, but rather nested and give the assignment of probes to probesets.
#level0=probeset_id\ttype\tprobeset_name
#level1=\tatom_id\texon_position"
#level2=\t\tprobe_id\ttype\tgc_count\tprobe_length\tinterrogation_position\tprobe_sequence"
For example:
1\t\tPSR150003628_x_st
\t1\t130
\t\t5892621\tpm:st\t16\t25\t13\tGTCCTCCTCTACGAGGCTCTCGTCC
\t2\t155
\t\t4734090\tpm:st\t17\t25\t13\tTCCGCGTCTCTCACGCCCTCGTCCT
\t3\t161
\t\t3931023\tpm:st\t15\t25\t13\tTCTCTCACGCCCTCGTCCTCTCTGA
The file combineProbeInfo.pl creates a single tab-delimited file with each row for a probe that looks like this after read into R. (This is the same format as the BGP).
## probeset_id type probeset_name atom_id exon_position probe_id type.1 gc_count probe_length
##1 1 NA PSR150003628_x_st 1 130 5892621 pm:st 16 25
##2 1 NA PSR150003628_x_st 2 155 4734090 pm:st 17 25
##3 1 NA PSR150003628_x_st 3 161 3931023 pm:st 15 25
##4 1 NA PSR150003628_x_st 4 70 3244040 pm:st 12 25
##5 1 NA PSR150003628_x_st 5 78 5124413 pm:st 13 25
## interrogation_position probe_sequence x y
##1 13 GTCCTCCTCTACGAGGCTCTCGTCC 2060 2301
##2 13 TCCGCGTCTCTCACGCCCTCGTCCT 649 1849
##3 13 TCTCTCACGCCCTCGTCCTCTCTGA 1422 1535
##4 13 CACCTTTTGTTAGTCCGGAACTCAG 519 1267
##5 13 GTTAGTCCGGAACTCAGAGGAATCG 1852 2001
The program takes 1 argument which is the prefix, then looks for the files <prefix>.pgf and <prefix>.clf. It creates the file <prefix>.probeflat with the per probe info and also <prefix>.psr which is a tab-delimited file with each row corresponding to a probeset in the CLF/PGF, and the columns giving the name and probeset id.
Currently there are not really checks that the PGF and the CLF contain the same information.
Note there are no further 'unit'/gene groupings here.addGeneId.pl can take this file plus another probeset->Gene mapping file and add the necessary column with that information.