Entering data into a multi-way array?

#1
Hello: I am a novice R user, but I have been working my way through the manuals / tutorials, ... I have R / Deducer up and running, and know the basics.

I want to analyze a microarray (gene expression) dataset.

I need to input the data into R as a multidimensional (multi-way) array, something on the order of

15,000 x 3 x 8 x 2 [genes x replicates x time points x treatments]

I've Google'd, etc. to find the solution, but I cannot find an answer. I am being led to the idea (I suspect) that I may need to use a RDBMS (like MySQL) for that purpose (section 4 in http://cran.r-project.org/doc/manuals/R-data.html)?

If so, that is my next challenge!

Any replies much appreciated!

Sincerely, Victoria :)
 

bryangoodrich

Probably A Mammal
#2
I would probably handle the object in a list. The list object is a very robust object in R because it is easy to perform operations on it and to extract information out of it. Matrices in R are designed for raw numeric data, and arrays I've never had a reason to use them. Dataframes are good when you have tabular data of different sorts, but it is really just a list in disguise (a list containing vectors of different modes of equal length). Depending on what you are trying to do, I'd probably do a list of time points. In each list element (time point) you can store your 3D gene x replicate x treatment object. This could be stored in a list, a dataframe, etc., depending on how you want to store it. Play around with lists to get a feel for them, and maybe some of the other data types. Compare as.list with list on different things (e.g., run each of those on the vector 1:10 to see the different result). This type of playing around will inform you of how you'll want to import your data into these data types.

As an example, consider this product.

Code:
x <- list()  # Create the object

for (i in seq(8)) x[[i]] <- rnorm(10)  # Add some data to it
print(x)

sapply(x, function(z) c(mean = mean(z), sd = sd(z)))  # Summarize some information about the list elements

lapply(x, function(z) data.frame(x = z, y = z^2))  # Manipulate each list element from a vector into a dataframe
print(x)
 
#3
Thank you for the suggestion [using a list(s)]; I am reading up on them. However, I am trying to replicate the script, appended below. My data is in OOCalc files. The script (below) synthesizes a dataset (it serves as a "tutorial"), but I will need to get my data from OOCalc into R for use in that script (which uses a multidimensional array).

I've worked my way through the script, and understand how most of it works (except the first bit - Step 1 - which is irrelevant to me, anyway).

P.S. I tried to upload this script as a text file (.txt) but I wasn't able to do so, using the Manage Attachments option. :-(

Code:
### Supplementary material with the paper
### Interpretation of ANOVA models for microarray data using PCA
### J.R. de Haan et al. Bioinformatics (2006)
### Please cite this paper when you use this code in a publication.
### Written by J.R. de Haan, December 18, 2006

### Step1: a synthetic dataset of 500 genes is generated with 5 classes
### 1 unresponsive genes (300 genes)
### 2 constant genes (50 genes)
### 3 profile 1 (50 genes)
### 4 profile 2 (50 genes)
### 5 profile 3 (50 genes)

#generate synthetic dataset with similar dimensions:
# 500 genes, 3 replicates, 10 timepoints, 4 treatments
X <- array(0, c(500, 3, 10, 4))
labs.synth <- c(rep(1, 300), rep(2, 50), rep(3, 50), rep(4, 50), rep(5, 50))
gnames <- cbind(labs.synth, labs.synth)
#print(dim(gnames))
gnames[1:300,2] <- "A"
gnames[301:350,2] <- "B"
gnames[351:400,2] <- "C"
gnames[401:450,2] <- "D"
gnames[451:500,2] <- "E"

### generate 300 "noise" genes with expressions slightly larger than
### the detection limit (class 1)
X[labs.synth==1,1,,] <- rnorm(length(X[labs.synth==1,1,,]), mean=50, sd=40)
X[labs.synth==1,2,,] <- X[labs.synth==1,1,,] +
  rnorm(length(X[labs.synth==1,1,,]), mean=0, sd=10)
X[labs.synth==1,3,,] <- X[labs.synth==1,1,,] +
  rnorm(length(X[labs.synth==1,1,,]), mean=0, sd=10)

# generate 50 stable genes at two levels (class 2)
X[301:325,1,,] <- rnorm(length(X[301:325,1,,]), mean=1500, sd=40)
X[301:325,2,,] <- X[301:325,1,,] + rnorm(length(X[301:325,1,,]), mean=0, sd=10)
X[301:325,3,,] <- X[301:325,1,,] + rnorm(length(X[301:325,1,,]), mean=0, sd=10)

X[326:350,1,,] <- rnorm(length(X[326:350,1,,]), mean=3000, sd=40)
X[326:350,2,,] <- X[326:350,1,,] + rnorm(length(X[326:350,1,,]), mean=0, sd=10)
X[326:350,3,,] <- X[326:350,1,,] + rnorm(length(X[326:350,1,,]), mean=0, sd=10)

# generate50 genes with profile 1 (class 3)
increase.range <- matrix(rep(1:50, 10), ncol=10, byrow=FALSE)
profA3 <- matrix(rep(c(10, 60, 110, 150, 150, 150, 150, 150, 150, 150) , 50),
                 ncol=10, byrow=TRUE) * increase.range
X[351:400,1,,1] <- profA3 + rnorm(length(profA3), mean=0, sd=40)
profB3 <- matrix(rep(c(10, 100, 220, 280, 280, 280, 280, 280, 280, 280), 50),
                 ncol=10, byrow=TRUE) * increase.range
X[351:400,1,1:10,2] <- profB3 + rnorm(length(profA3), mean=0, sd=40)
profC3 <- matrix(rep(c(10, 120, 300, 300, 280, 280, 280, 280, 280, 280), 50),
                 ncol=10, byrow=TRUE) * increase.range
X[351:400,1,1:10,3] <- profC3 + rnorm(length(profA3), mean=0, sd=40)
profD3 <- matrix(rep(c(100, 75, 50, 50, 50, 50, 50, 50, 75, 100), 50), ncol=10,
                 byrow=TRUE)
X[351:400,1,1:10,4] <- profD3 + rnorm(length(profA3), mean=0, sd=40)
#again replicates
X[351:400,2,,] <- X[351:400,1,,] + rnorm(length(X[351:400,2,,]), mean=0, sd=10)
X[351:400,3,,] <- X[351:400,1,,] + rnorm(length(X[351:400,3,,]), mean=0, sd=10)

# generate50 genes with profile 2 (class 4)
increase.range <- matrix(rep(1:50, 10), ncol=10, byrow=FALSE)
profA4 <- matrix(rep(c(10, 60, 110, 150, 125, 100, 75, 50, 50, 50) , 50),
                 ncol=10, byrow=TRUE) * increase.range
X[401:450,1,,1] <- profA4 + rnorm(length(profA4), mean=0, sd=40)
profB4 <- matrix(rep(c(10, 100, 220, 280, 200, 150, 100, 50, 50, 50), 50),
                 ncol=10, byrow=TRUE) * increase.range
X[401:450,1,1:10,2] <- profB4 + rnorm(length(profA4), mean=0, sd=40)
profC4 <- matrix(rep(c(10, 150, 300, 220, 150, 100, 50, 50, 50, 50), 50),
                 ncol=10, byrow=TRUE) * increase.range
X[401:450,1,1:10,3] <- profC4 + rnorm(length(profA4), mean=0, sd=40)
profD4 <- matrix(rep(c(150, 100, 50, 50, 75, 75, 75, 100, 100, 100), 50),
                 ncol=10, byrow=TRUE)
X[401:450,1,1:10,4] <- profD4 + rnorm(length(profA4), mean=0, sd=40)
#again replicates
X[401:450,2,,] <- X[401:450,1,,] + rnorm(length(X[401:450,2,,]), mean=0, sd=10)
X[401:450,3,,] <- X[401:450,1,,] + rnorm(length(X[401:450,3,,]), mean=0, sd=10)

# generate50 genes with profile 3 (class 5)
increase.range <- matrix(rep(1:25, 20), ncol=10, byrow=FALSE)
profA4 <- matrix(rep((200 - c(10, 60, 110, 150, 125, 100, 75, 50, 50, 50)), 50),
                 ncol=10, byrow=TRUE) * increase.range
X[451:500,1,,1] <- profA4 + rnorm(length(profA4), mean=0, sd=40)
profB4 <- matrix(rep((200 - c(10, 100, 180, 200, 200, 150, 100, 50, 50, 50)), 50),
                 ncol=10, byrow=TRUE) * increase.range
X[451:500,1,1:10,2] <- profB4 + rnorm(length(profA4), mean=0, sd=40)
profC4 <- matrix(rep((200 - c(10, 150, 200, 180, 150, 100, 50, 50, 50, 50)), 50),
                 ncol=10, byrow=TRUE) * increase.range
X[451:500,1,1:10,3] <- profC4 + rnorm(length(profA4), mean=0, sd=40)
profD4 <- matrix(rep((200 - c(150, 100, 50, 50, 75, 75, 75, 100, 100, 100)), 50),
                 ncol=10, byrow=TRUE)
X[451:500,1,1:10,4] <- profD4 + rnorm(length(profA4), mean=0, sd=40)
#again replicates
X[451:500,2,,] <- X[451:500,1,,] + rnorm(length(X[451:500,2,,]), mean=0, sd=10)
X[451:500,3,,] <- X[451:500,1,,] + rnorm(length(X[451:500,3,,]), mean=0, sd=10)


# Step 2: Now the effects for different factors in the ANOVA model
# can be calculated:

# subtraction of the general mean
x <- X - mean(X, na.rm=TRUE)
tpoints <- c(1, 3, 6, 12,c( 24*c(1, 2, 3, 5, 8, 12)))
nrgenes <- dim(x)[1]

# calculation of the three main effects
cat("calculating main effects\n")
timemeans <- apply(x, 3, mean, na.rm=TRUE)
treatmeans <- apply(x, 4, mean, na.rm=TRUE)
genemeans <- apply(x, 1, mean, na.rm=TRUE)

#par(mfrow=c(2, 3))


# calculation of the interaction effects
# interaction time-treatment
cat("calculating interaction time-treatment\n")
mean.ti.tr <- apply(x, c(3,4), mean, na.rm=TRUE)
#print(dim(mean.ti.tr))
time1.m <- matrix(rep(timemeans, 4), nrow=10, byrow=FALSE)
tr1.m <- matrix(rep(treatmeans, 10), nrow=10, byrow=TRUE)
int.ti.tr <- (mean.ti.tr - time1.m) - tr1.m

# interaction time-gene
cat("calculating interaction time-gene\n")
mean.ti.gene <- apply(x, c(1,3), mean, na.rm=TRUE)
#print(dim(mean.ti.gene))
time2.m <- matrix(rep(timemeans, dim(x)[1]), nrow=dim(x)[1], byrow=TRUE)
gene2.m <- matrix(rep(genemeans, 10), nrow=dim(x)[1], byrow=FALSE)
int.ti.gene <- (mean.ti.gene - time2.m) - gene2.m

# interaction gene-treatment
cat("calculating interaction gene-treatment\n")
mean.gene.tr <- apply(x, c(1,4), mean, na.rm=TRUE)
#print(dim(mean.gene.tr))
gene3.m <- matrix(rep(genemeans, 4), ncol=4, byrow=FALSE)
tr3.m <- matrix(rep(treatmeans, dim(x)[1]), ncol=4, byrow=TRUE)
int.gene.tr <- (mean.gene.tr - gene3.m) - tr3.m

# calculation of 3 factor interaction
cat("calculating 3 factor interaction\n")
mean.gene.time.tr <- apply(x, c(1, 3, 4), mean, na.rm=TRUE)
#print(dim(mean.gene.time.tr))

ar.ti.tr <- array(0, c(nrgenes, 10, 4))
for (i in 1:nrgenes){ar.ti.tr[i,,] <- int.ti.tr}

ar.ti.gene <- array(0, c(nrgenes, 10, 4))
for (i in 1:4){ar.ti.gene[,,i] <- int.ti.gene}

ar.gene.tr <- array(0, c(nrgenes, 10, 4))
for (i in 1:10){ar.gene.tr[,i,] <- int.gene.tr}

ar.ti <- array(0, c(nrgenes, 10, 4))
for (i in 1:4){ar.ti[,,i] <- time2.m}

ar.tr <- array(0, c(nrgenes, 10, 4))
for (i in 1:10){ar.tr[,i,] <- tr3.m}

ar.gene <- array(0, c(nrgenes, 10, 4))
for (i in 1:4){ar.gene[,,i] <- gene2.m}

int.gene.time.tr <- mean.gene.time.tr - ar.ti - ar.tr - ar.gene -
  ar.ti.tr - ar.ti.gene - ar.gene.tr
imat.gtt <- cbind(int.gene.time.tr[,,1],
                  int.gene.time.tr[,,2],
                  int.gene.time.tr[,,3],
                  int.gene.time.tr[,,4])

### calculation of error term
error.term1 <- abs(sweep(x, c(1, 3, 4), mean.gene.time.tr))
cell.error <- apply(error.term1, c(1, 3, 4), mean, na.rm=TRUE)
mncn.error <- scale(cbind(cell.error[,,1], cell.error[,,2],
                          cell.error[,,3], cell.error[,,4]), scale=FALSE)

### Step 3: The results of the model can now be inspected with
### different plots (PCA for the interactions)
#plot(timemeans, type="l")
#plot(treatmeans, type="l")
#plot(genemeans, type="l")

#source("biplot.R")
#jbiplot(1, 2, int.ti.gene, gnames[,2], as.character(tpoints), rep(2, 10))
#jbiplot(1, 2, int.gene.tr, gnames[,2], c("TR1", "TR2", "TR3", "UNT"), 2:5)
#jbiplot(1, 2, int.ti.tr, as.character(tpoints), c("TR1", "TR2", "TR3", "UNT"), rep(2, 4))
#jbiplot(1, 2, imat.gtt, gnames[,2], as.character(rep(tpoints, 4)), sort(rep(2:5, 10)))
Sep 30 Michael Weylandt You might want to describe how your data is kept now so we can help with the loading process, but in general, 4D arrays are loaded with the same...
Loading...
sample
Remove Ads
« Return to R help | 10 views
 
Last edited by a moderator:
#4
Solution: I figured this out on my own (below).

> gnames<-read.csv("/home/victoria/R/gnames.csv",header=FALSE,sep=",")

> gnames
V1 V2 V3 V4 V5 V6 V7
1 NM_005588 NM_004407 NM_006136 NM_004817 NM_006012 NM_001008693 NM_181435
[snip]
V497 V498 V499 V500
1 NM_152546 XM_375762 NM_020696 NM_138459

# Saved 500 genes (values only): 3 replicates of (8 time points for treatment 1) + 3 replicates of (8 time points for treatment 2) = 500 x 48 .csv file

> X<-matrix(scan("/home/victoria/R/X.csv",n=500*48),500,48,byrow=TRUE)

Read 24000 items

> dim(X)

[1] 500 48

> dim(X)<-c(500,3,8,2)

# Looks good: Checked all, below, against 'source' file "X.ods" : OK!

, , 1, 1 # = Time Point 1 replicates (columns) for Treatment 1
[,1] [,2] [,3]
[1,] -3.2093000000 -2.1463000000 0.347800000
[2,] 0.1689000000 -0.0945000000 -0.383400000
[3,] -0.0487000000 -0.8695000000 -1.990200000
[snip]

, , 2, 1 # = Time Point 2 replicates (columns) for Treatment 1
[,1] [,2] [,3]
[1,] -2.8531 -3.2044000000 -1.3212
[2,] 0.5820 0.1358000000 0.4183
[3,] 1.0075 0.3211000000 -1.7940
[snip]

, , 8, 1 # = Time Point 8 replicates (columns) for Treatment 1
[,1] [,2] [,3]
[1,] -2.3661000000 -1.8353000000 0.0682
[2,] 0.6607000000 0.0726000000 -0.1255
[3,] -0.4960000000 -1.8894000000 -0.9466
[snip]

, , 1, 2 # = Time Point 1 replicates (columns) for Treatment 2
[,1] [,2] [,3]
[1,] -2.0151000000 -1.4385 -0.9281
[2,] -0.0482000000 0.2382 -0.1319
[3,] -1.3375000000 -1.4805 -1.4416
[snip]

, , 2, 2 # = Time Point 2 replicates (columns) for Treatment 2
[,1] [,2] [,3]
[1,] -2.3212 -1.3964 -1.1155000000
[2,] 0.0264 0.2057 0.1566000000
[3,] -0.3569 -0.9750 -0.8666000000
[snip]

, , 7, 2 # = Time Point 7 replicates (columns) for Treatment 2
[,1] [,2] [,3]
[1,] -1.2615000000 -2.1288000000 -1.8844
[2,] 0.1048000000 -0.2306000000 -0.0634
[3,] -1.4571000000 0.7725000000 -1.0384
[snip]

, , 8, 2 # = Time Point 8 replicates (columns) for Treatment 2
[,1] [,2] [,3]
[1,] -0.0819 -1.8137000000 -1.4085
[2,] -0.1336 0.2744000000 1.2913
[3,] -1.1404 -0.7546000000 -1.2400
[snip]

Thanks anyway - Victoria :)
 

Dason

Ambassador to the humans
#5
Awesome. It's always nice to see somebody post a solution they came up with themselves. I work with gene expression data so I wish I could have helped you out but I didn't really understand what you were looking for and I think the first time I came across your problem I didn't have much time anyways. But thanks for posting the solution to help out anybody with a similar problem!
 
#6
Update:

1. Here is a better illustration. I assumed that the gene expression was the same for the three replicates for each gene, in the respective Time and Treatment groups, so that I could follow (verify) what was going on.

2. To better-view the following, copy and paste into a plain-text editor with a monospace font (e.g. gedit on Ubuntu with the Monospace font).
Or you could stick the code inside [noparse]
Code:
[/noparse] tags...


Sincerely, Victoria :)

------------------------------------

5 Genes x 3 Replicates x 2 Times x 2 Treatments
Code:
> x<-matrix(scan("/home/victoria/R/5.3.2.2.csv",n=5*12),5,12,byrow=TRUE)
Read 60 items

> x
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,]    1    1    1    2    2    2    3    3    3     4     4     4   # Gene1 avg. = row avg. = 2.5
[2,]    1    1    1    2    2    2    3    3    3     4     4     4   # Gene2 avg. = row avg. = 2.5
[3,]    1    1    1    2    2    2    3    3    3     4     4     4   # Gene3 avg. = row avg. = 2.5
[4,]    1    1    1    2    2    2    3    3    3     4     4     4   # Gene4 avg. = row avg. = 2.5
[5,]    1    1    1    2    2    2    3    3    3     4     4     4   # Gene5 avg. = row avg. = 2.5

#      |------- Treatment1 -------|  |-------- Treatment2 ---------| Tr1 avg. = (1+2)/2; Tr2 avg. = (3+4)/2
#      |-- Time1 --|  |-- Time2 --|  |-- Time1 --|     |-- Time2 --| Ti1 avg. = (1+3)/2; Ti2 avg. = (2+4)/2
#          (x3)           (x3)           (x3)              (x3)  

# This, above,is for illustration only; the actual data will have different values for eah of the above
# [ 5x3x2x2 = 60 data individual, independent data points ]

> dim(x)
[1]  5 12

> n<-dim(x)[1]
> n
[1] 5

> dim(x)<-c(5,3,2,2)
> dim(x)
[1] 5 3 2 2

> genemeans <- apply(x, 1, mean, na.rm=TRUE)
> genemeans
[1] 2.5 2.5 2.5 2.5 2.5  # = row averages (5 Genes)

> timemeans <- apply(x, 3, mean, na.rm=TRUE)
> timemeans
[1] 2 3  # = Time1 avg  Time2 avg  (1+1+1+3+3+3)/6; (2+2+2+4+4+4)/6

> treatmeans <- apply(x, 4, mean, na.rm=TRUE)
> treatmeans
[1] 1.5 3.5  # Treatment1 avg  Treatment2 avg  (1+1+1+2+2+2)/6; (3+3+3+4+4+4)/6
------------------------------------
 
Last edited by a moderator:
#7
One last comment: add sep="," to the read.csv assignments, to avoid "Error in scan(file ... : scan() expected 'a real', got" -type errors:

x<-matrix(scan("/home/victoria/R/5.3.2.2.csv",n=5*12,sep=","),5,12,byrow=TRUE)