How to: Create a CDF file from scratch (or from CLF and PGF files)
Author: Elizabeth Purdom (pruned by Henrik Bengtsson)
Created on: 2008-12-19
Last updated: 2014-01-22
I am collecting here functions that Mark Robinson and I have developed to create CDFs from scratch. Ultimately the goal will be to create single R functions, but right now these are functions that make up a series of steps (not all of which are in Perl).
An example of steps for Ensembl CDF for the exon array:
- Download the Ensembl Annotation with biomaRt and use it to map the Affymetrix probesets (exons) to the gene boundaries in Ensembl using annotateProbesets.R
- Make single file of all probes and their mapping with convertProbesetCSV_differentInput.pl (i.e. now map the corresponding probes)
- Create CDF with flat2Cdf.R
To filter, see example below.
Make CDF (Main File)
The main file
flat2Cdf.R
contains flat2Cdf()
for making the CDF, which is a function in R that
takes a 'flat' file and converts it to a binary CDF file. The flat file
is a tab-delimited file with each row corresponding to a probe to be
included in the CDF. Right now the format of a 'flat' file is a bit
rigid.
It must be:
- tab-delimited
- have a header row
- A column 'X' and 'Y' with the x,y coordinates of a file (there is an argument that allows to pass other names, but not really tested yet)
- A column identifying a probe to a group and a column identifying a probe to a unit (for exon arrays, group=exon, unit=gene; for gene arrays, unit=group=gene); these identifications will the group and unit names respectively.
- Other columns are allowed and will be ignored.
Example:
Probe_ID X Y Probe_Sequence Group_ID
Unit_ID
15 14 0 TCTCCAGTGAAGTGCACATTGCTCA 3029044
ENSG00000106144
17 16 0 TGATCGCCTGTCTGCAGATAGGGCA 2400195
ENSG00000090432
29 28 0 TGTAGCTACATGAGGTCTCAGCAGT 2690968
ENSG00000121577
34 33 0 TGGGATGAATGATCAGGAACTGCTG 3032901
ENSG00000130226
35 34 0 TCTGGAAATGCATCAGGGACATCTG 3087574
ENSG00000155975
51 50 0 TATGACAGGATAGCACCAGACGAGC 2808483
ENSG00000112992
58 57 0 TGAGTACGAATGTGCATCTTCAGTC 2783822
ENSG00000138738
59 58 0 TATGTGACTGGATTGCAAGTCTCTC 2403372
ENSG00000158161
62 61 0 CATGTAGATCTATGCCAGTTGAATC 2507416
ENSG00000048991
The R function takes the following arguments:
-
filename
: of flat file -
chipType
: the chip type of the CDF -
tags
: tags to add to the chip type -
ucol
: integer giving the column with the unit id -
gcol
: integer giving the column with the group id (can be same as unit id) -
col.class
: column classes of file (see help onread.table()
); -
splitn
: parameter that controls the number of initial chunks that are unwrapped (number of characters of unit names used to keep units together for initial chunks) -
rows
: number of rows on the array
cols
: number of columns on the array -
xynames
: names for the X,Y columns
Example (within R):
flat2Cdf(file="hjay.r1.flat", chipType="hjay", tag="r1,TC")
Convert Affymetrix annotation to Flat file (Perl)
Affymetrix generally has one file that gives the probe to probeset
identification. Then there might be a separate file that relates probesets
to genes. This information must be combined to get a single flat file
for flat2Cdf()
. The Perl script
convertProbesetCSV_differentInput.pl converts
these files to an output 'flat' file. This may not work for all input (I
use the file *.probe.tab from Affymetrix).
It takes 3 arguments:
- a file with the probeset assignments to genes (comma-delimited, one row per probeset)
- file with probe assignments to probesets (tab-delimited, one row per probe, e.g. *.probe.tab)
- outfile name (will be tab-delimited)
Optional arguments concerning the first file:
-p column with the probeset (default 1)
-g column with the genes (default 2)
Example (system command line; one row):
convertProbesetCSV_differentInput.pl ensembl2affy.csv MoEx-1_0-st-v1.probe.tab MoEnsembl50.flat -p 1 -g 6
Map probesets to outside gene definition
The annotateProbesets()
function (R
script annotateProbesets.R) is
for taking Affymetrix's probesets ('psr') and mapping them to a
different gene definition. This might be a bit exon array specific. It
was designed for Human Ensembl annotation, though I used it for Mouse.
It could be extended to be just probes, perhaps.
This is an R function with the following arguments:
-
probesetFile
: file name with the psr regions -
exonBoundariesFile
: file name of delimited file with the exon boundaries that you want to map the probesets to (new gene definitions) -
probesetFileParameters
,exonBoundariesFileParameters
: list of parameters to pass toread.table()
in reading the file; note thatheader=TRUE
andstringsAsFactors=FALSE
are hard-coded in and can not be changed by user (and should not be passed in this argument) -
probesetFileNames
: vector of the header (names) of the necessary columns to match the probeset to the annotation (see default assignments to see what these are). This allows the user to have a file with different headers, which will be detected and converted to the standard headers in the default. Note that the order of this argument is ESSENTIAL. Also, the sixth value (by defaultNA
) can be an additional column (like level); it is only used to table the matching and non-matching probesets in the verbose output, so is not essential. All columns will be returned in the final output (though in different order) -
exonFileNames
: similar to probesetFileNames, but see default to see the specifics for theexonBoundariesFile
. Note that if gene_id=exon_id, and the stop and start are the boundaries of the gene, it should just map to the gene and not coding region specific. -
overlapType
: either a character vector specifying overlap type or a user-defined function that returns the fraction of overlap of a set of genes (untested, see internal code to see operation of function). If character vector, should be either"none"
(no adjustment),"probeset"
(percentage of probeset overlap in the resulting merged gene),"coding"
(untested -- the percentage of coding overlap as fraction of coding length of resulting merged gene), or"genomic"
(untested-the percentage of genomic overlap as fraction of length of the resulting merged gene) -
overlapCutoff
: between 0 and 1; the required shared region required to merge genes together (as defined byoverlapType
function); if genes are not merged, the overlapping probesets are removed.
Some strange assumptions about the files (should be fixed...):
- 'probesets' file assumes Affymetrix format for chromosome data: chr1, chr2, etc. and +/- for strand
- assume numeric for chromosomes in exonBoundaries (1-24)...; and -1/1 for strand
- assumes that X,Y is encoded as 23,24 in exonBoundaries...very human specific, should be careful
Description of the overlapping strategy (It is run after probesets are
assigned to genes based on the inputted exonBoundariesFile
.)
- If
overlapType
is"none"
there is no further action done (and probesets can be assigned to multiple genes) - When there is overlap (defined as probesets assigned to multiple genes, and not based on the external exon boundaries), there is an initial scan that creates groups of genes that overlap (if A overlaps B and B overlaps C, but C does not overlap A, then A,B,C are still all included in an overlapping group).
- For each group, the % that are common to all of the genes in the
group is calculated based on the function given by
overlapType
. - If this % is greater or equal to
overlapCutoff
then the genes in the group (and all of their probesets, regardless of which gene they are from) are merged into one gene. - Otherwise all of the genes are kept distinct and any overlapping probesets are removed.
-
overlapCutoff=1
means that only genes with 100% overlap (as defined by the function inoverlapType
) are merged and otherwise probesets mapped to multiple genes are removed. -
overlapCutoff > 1
means there is absolutely no merging of genes, even if 100% equivalent (note: error is returned ifoverlapType
returns > 1) - The merged gene has a name that is the concatenation of the names of the genes in the set: "Gene1.Gene2.Gene3" etc.
Example Usage:
path <- "HumanEnsembl50Build"
probesetName <- file.path(path, "HuEx-1_0-st-v2.na26.hg18.probeset.csv")
annotExonBoundaries <- file.path(path, "exonBoundaries_proteinCoding_20080819.txt")
colClasses <- rep("NULL", 39); # 39=number of columns in HuEx-1_0-st-v2.naXX.hg18.probeset.csv
colClasses[c(1:5,16)] <- NA; # the columns that I want to pull in
overlapType <- "probeset"
overlapCutoff <- 1.0
pbsetHeaderNames <- c("probeset_id", "seqname", "strand", "start", "stop", "level")
exonHeaderNames <- c("gene_id", "exon_id", "chr", "strand", "start", "end")
annotateProbesets(probesetFile=probesetName,
exonBoundariesFile=annotExonBoundaries,
probesetFileParameters=list(sep=",", colClasses=colClasses, comment.char="#"),
probesetFileNames=pbsetHeaderNames,
exonBoundariesFileParameters=list(sep="\t", colClasses=NA, comment.char="#"),
exonBoundariesFileNames=exonHeaderNames,
overlapType=overlapType, overlapCutoff=overlapCutoff,
outdir=path, verbose=TRUE)
Filter probes (Perl)
If you already have an existing flat file, you might want to create a new CDF with just a subset of the probes. The Perl script selectProbes.pl takes a starting flat file and a file with the xy coordinates of the probes to keep and returns a new flat file.
The program takes 3 arguments:
- the file with the original probe to probeset and unit assignments (ie.flat file)
- file with new probes to keep (x,y in column 1 and 2 respectively)
- outfile name
Then if I want to filter out some probes based on my own analysis of my data, I can write these probes to a file
cells <- unlist(ensb50NOvHuCells, use.names=FALSE)
probeValues <- readRawData(secLgNHuOnHu_HM, indices=cells, fields=c("xy","intensities"))
probeValues$geneId <- rep(names(ensb50NOvHuCells), times=perGeneProbeCountHu)
probeValues$Indices <- cells
write.table(probeValues[myProbes,c("x","y")], file="probeFile.txt", col.names=TRUE, row.names=FALSE, quote=FALSE, sep="\t")
Then in command line:
selectProbes.pl HuEnsembl50_nooverlap.flat probeFile.txt probeFile.flat
Then back to R:
flat2Cdf(file=probeFile.flat, chipType="HuEx-1_0-st-v2", tags="U-Ensembl50,G-Affy,customMH,joint,80PctLowMo,80PctHighHu,EP", col.class=c("integer", "character")[c(1,1,1,2,2,2)], splitn=5)
Then the CDF must be moved to the correct annotationData/chipTypes/ subdirectory.
Combine Affymetrix CLF and PGF file formats
Affymetrix sometimes provides CLF and PGF files. These are not necessary fixed formats. CLF files (that I have) are tab-delimited, each line a probe with three columns corresponding to 1) Probe id 2) x position 3) y position. PGF files are not tab-delimited files, but rather nested and give the assignment of probes to probesets.
#level0=probeset_id\\ttype\\tprobeset_name
#level1=\\tatom_id\\texon_position"
#level2=\\t\\tprobe_id\\ttype\\tgc_count\\tprobe_length\\tinterrogation_position\\tprobe_sequence"
For example:
1\\t\\tPSR150003628_x_st
\\t1\\t130
\\t\\t5892621\\tpm:st\\t16\\t25\\t13\\tGTCCTCCTCTACGAGGCTCTCGTCC
\\t2\\t155
\\t\\t4734090\\tpm:st\\t17\\t25\\t13\\tTCCGCGTCTCTCACGCCCTCGTCCT
\\t3\\t161
\\t\\t3931023\\tpm:st\\t15\\t25\\t13\\tTCTCTCACGCCCTCGTCCTCTCTGA
The Perl script combineProbeInfo.pl creates a single tab-delimited file with each row for a probe that looks like this after read into R. (This is the same format as the BGP).
## probeset_id type probeset_name atom_id exon_position probe_id type.1 gc_count
## 1 1 NA PSR150003628_x_st 1 130 5892621 pm:st 16
## 2 1 NA PSR150003628_x_st 2 155 4734090 pm:st 17
## 3 1 NA PSR150003628_x_st 3 161 3931023 pm:st 15
## 4 1 NA PSR150003628_x_st 4 70 3244040 pm:st 12
## 5 1 NA PSR150003628_x_st 5 78 5124413 pm:st 13
## probe_length interrogation_position probe_sequence x y
## 1 25 13 GTCCTCCTCTACGAGGCTCTCGTCC 2060 2301
## 2 25 13 TCCGCGTCTCTCACGCCCTCGTCCT 649 1849
## 3 25 13 TCTCTCACGCCCTCGTCCTCTCTGA 1422 1535
## 4 25 13 CACCTTTTGTTAGTCCGGAACTCAG 519 1267
## 5 25 13 GTTAGTCCGGAACTCAGAGGAATCG 1852 2001
The program takes 1 argument which is the prefix, then looks for the files <prefix>.pgf and <prefix>.clf. It creates the file <prefix>.probeflat with the per probe info and also <prefix>.psr which is a tab-delimited file with each row corresponding to a probeset in the CLF/PGF, and the columns giving the name and probeset id.
Currently there are not really checks that the PGF and the CLF contain the same information.
Note there are no further 'unit'/gene groupings here. The Perl script addGeneId.pl can take this file plus another probeset->Gene mapping file and add the necessary column with that information.