Adding a new column in R data frame with values conditional on another column

#1
Suppose I have the data frame:

table<- data.frame(population=c(100, 300, 5000, 2000, 900, 2500), habitat=c(1,2,3,4,5,6))

Now I want to add a new column, table$size, with the value 1 if population < 500, 2 if 500 <= population < 1000, 3 if 1000 <= population < 2000, 4 if 2000 <= population < 3000, and 5 if 3000 <= population <= 5000.

I only know how to create a column with a binary TRUE/FALSE outcome conditional on the values in another column, e.g.

table$size <- (table$population<1000)

But I'm not sure how to get different numbers for different conditions. Can anyone help with this?
 

ledzep

Point Mass at Zero
#2
Here are two solutions. Take your pick.

Code:
## your data
table <- data.frame(population = c(100, 300, 5000, 2000, 900, 2500),
                    habitat = c(1, 2, 3, 4, 5, 6))

## Solution 1: assign each size class by logical indexing
table$size[table$population < 500] <- 1
table$size[table$population >= 500 & table$population < 1000] <- 2
table$size[table$population >= 1000 & table$population < 2000] <- 3
table$size[table$population >= 2000 & table$population < 3000] <- 4
table$size[table$population >= 3000 & table$population <= 5000] <- 5

## Solution 2: nested ifelse(); whatever is left at the end gets 5
table$size1 <- ifelse(table$population < 500, 1,
               ifelse(table$population >= 500 & table$population < 1000, 2,
               ifelse(table$population >= 1000 & table$population < 2000, 3,
               ifelse(table$population >= 2000 & table$population < 3000, 4, 5
               ))))

> table
  population habitat size size1
1        100       1    1     1
2        300       2    1     1
3       5000       3    5     5
4       2000       4    4     4
5        900       5    2     2
6       2500       6    4     4
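
The two solutions agree on this data, but they can differ on values outside the stated ranges: Solution 1 leaves unmatched rows as NA, while the final "else" in Solution 2 assigns 5 to everything not caught earlier. A minimal sketch, assuming a hypothetical data frame `pop` with an out-of-range value of 6000 that is not in the original data:

```r
## Hypothetical data: 6000 lies outside every range in the question
pop <- data.frame(population = c(100, 900, 6000))

## Solution 1 style: rows that match no condition stay NA
pop$size <- NA_real_
pop$size[pop$population < 500] <- 1
pop$size[pop$population >= 500 & pop$population < 1000] <- 2
pop$size[pop$population >= 1000 & pop$population < 2000] <- 3
pop$size[pop$population >= 2000 & pop$population < 3000] <- 4
pop$size[pop$population >= 3000 & pop$population <= 5000] <- 5

## Solution 2 style: the last argument catches everything that is left
pop$size1 <- ifelse(pop$population < 500, 1,
             ifelse(pop$population < 1000, 2,
             ifelse(pop$population < 2000, 3,
             ifelse(pop$population < 3000, 4, 5))))

pop$size   # 1 2 NA
pop$size1  # 1 2 5
```

Which behaviour is preferable depends on whether out-of-range values should be flagged or silently binned.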
 

Dason

Ambassador to the humans
#3
I think findInterval is a more appropriate function for this particular task.

Code:
table<- data.frame(population=c(100, 300, 5000, 2000, 900, 2500), 
                   habitat=c(1,2,3,4,5,6))

table$size <- findInterval(table$population, c(0, 500, 1000, 2000, 3000, 5000), rightmost.closed = TRUE)

which gives

Code:
> table
  population habitat size
1        100       1    1
2        300       2    1
3       5000       3    5
4       2000       4    4
5        900       5    2
6       2500       6    4
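
For comparison, base R's cut() can produce the same codes. This is a different function from the one used above, shown only as a possible alternative; the names `tab` and `breaks` are made up for this sketch, and `tab` is used instead of `table` to avoid masking the base function of that name.

```r
## Same data as in the thread, under a different (hypothetical) name
tab <- data.frame(population = c(100, 300, 5000, 2000, 900, 2500),
                  habitat = c(1, 2, 3, 4, 5, 6))

breaks <- c(0, 500, 1000, 2000, 3000, 5000)

## right = FALSE          -> intervals are closed on the left: [a, b)
## include.lowest = TRUE  -> with right = FALSE, the top break (5000) is included
## labels = FALSE         -> return integer bin codes instead of a factor
tab$size <- cut(tab$population, breaks = breaks,
                right = FALSE, include.lowest = TRUE, labels = FALSE)

tab$size  # 1 1 5 4 2 4
```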
 

Dason

Ambassador to the humans
#4
Aww man - not only do I find out that you crossposted at SO,

But you accepted quite possibly the worst answer for this problem...
 

trinker

ggplot2orBust
#7
econlearner,

We all start out new. I learned the same lesson myself (cross posting). Some people were very nasty about this unwritten or in some cases written rule and made me feel about 2 inches high (Dason was comical in his rebuke :)). Let me explain to you the general convention I've seen used so you don't make the mistakes I've made.

1) post your question on a site you find is most appropriate for your question.
2) life happens and sometimes people can't help you or you realize the question is better suited elsewhere
3) put a link in both places stating you've done this and why

The reasoning for this is so people don't waste time solving a question that's been solved elsewhere. It also keeps things together for future searchers with a similar problem.

Here's an example of where I've posted in 2 places and made it clear I've done so: http://stackoverflow.com/questions/9305471/zip-file-error-in-reading-in-an-https-url

You'll notice there's a link at both locations to the other and I've told everyone what I'm doing and why.

Hopefully, this is helpful.


========================
To Dason, didn't know about the findInterval. +1
 

ledzep

Point Mass at Zero
#8
Lovely one Dason and very elegant too.
I am going to be using the findInterval a lot in the future. I also like the fact they allow the option to include or not include the boundaries.
 

Dason

Ambassador to the humans
#9
I am going to be using the findInterval a lot in the future. I also like the fact they allow the option to include or not include the boundaries.
By default the left boundary of each bin is included and the right boundary is not. What I did with the rightmost.closed=TRUE parameter was to tell it that the largest bin should have its rightmost boundary closed. It wouldn't make sense to have all of the boundaries closed, because then what happens when something falls on a boundary? It needs to be able to decide whether the value goes in the lower bin or the higher bin.
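
The boundary behaviour described above can be checked on a few values that sit on or next to the break points. This is only a sketch: the test values in `x` are made up, and the last call uses the left.open argument, which is available in reasonably recent versions of R.

```r
breaks <- c(0, 500, 1000, 2000, 3000, 5000)
x <- c(499, 500, 3000, 5000)   # hypothetical values on or near boundaries

## Default: bins are [a, b), so 5000 falls past the last break into "bin 6"
findInterval(x, breaks)                            # 1 2 5 6

## rightmost.closed = TRUE: the last bin becomes [3000, 5000]
findInterval(x, breaks, rightmost.closed = TRUE)   # 1 2 5 5

## left.open = TRUE: bins are (a, b], so boundary values go to the lower bin
findInterval(x, breaks, left.open = TRUE)          # 1 1 4 5
```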
 
#10
I tried the code from post #2 and it returned an error: "Unknown or uninitialised column:". To fix it, I had to create the column first. In my case, I created a column named "Hour" and assigned values to it first, then followed Solution 1 to fill in the values according to what's in the reference column.

Thank you for making my life easier with this code!
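
The workaround described above can be sketched as follows. As far as I know, the "Unknown or uninitialised column" message comes from tibbles rather than plain data frames, so that part is an assumption about the poster's setup. The sketch uses a plain data.frame to stay self-contained, and the name `df` is made up; creating the column first is harmless in either case.

```r
## Hypothetical data frame; the original poster's data is not shown
df <- data.frame(population = c(100, 300, 5000, 2000, 900, 2500))

## Create the column up front, filled with NA, before assigning into subsets
df$size <- NA_real_

## Then follow Solution 1 as before
df$size[df$population < 500] <- 1
df$size[df$population >= 500 & df$population < 1000] <- 2
df$size[df$population >= 1000 & df$population < 2000] <- 3
df$size[df$population >= 2000 & df$population < 3000] <- 4
df$size[df$population >= 3000 & df$population <= 5000] <- 5

df$size  # 1 1 5 4 2 4
```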