big data

  1. S

    Logistic Regression vs survival analyses in large cohort studies

    Hello, I was wondering if anyone could give me some advice regarding the right statistical approach for a large cohort. I have a large prospective cohort of ~1.5 million lines of data. It contains people who were assessed as being eligible for an intervention who either went on to have the...
  2. rusty_virus

    Distance Measures, High Dimensions

    Hi all, I am learning how to handle high-dimensional data. I am trying to cluster a matrix of about 2000×5000 with log-likelihood values in its cells. As a first step, to be able to visualize my data, I used Principal Component Analysis (PCA) and Principal Coordinate Analysis (PCoA). As...
  3. L

    How to duplicate cases in one data set in order to match the other before merging

    Hi! I have two data sets which I need to merge: 1) demographic data (= one ID per row) 2) up to 5 duplicates of each ID with different entries per row. Now I can't just delete the duplicates in order to merge the sets, as each entry in (2) is important. Also, the data is quite larger...
  4. C

    Cluster Analysis on Big data

    I have a very large data set of booking prices for hotel packages, containing approximately 10 columns and 100,000 rows. The majority of the variables are categorical, and there is only one continuous variable (i.e. price). The categorical variables are Cabin class, Board basis, In...
  5. C

    Interesting abstract question - Statisticians pls chk this

    I have a massive dataset (10s of millions of rows and 100s of dimensions). The dimensions are of all conceivable data types. How do I arrive at a sample that is: 1) smallest, and 2) most representative of the population with respect to all the dimensions? If you can direct me to any...
  6. J

    Sampling Samples from a Big Data Set in R

    I have a large data set (23 million records, ~9 GB) coming into R and am trying to figure out the best way to draw a sample from it. The plan I have right now is: 1) Break down the dataset into smaller pieces of around 4 million records or 1.5 GB 2) Draw a random sample from each 3)...
  7. P

    I'm a newb to charts, graphs, and R - Ubuntu 14.04, RStudio, MySQL, CSV files, Data

    Hi guys, I just wrote an introduction with a couple of reasons as to why I am here. Reason 1 is that I'm not sure where the divide is between big data and just a big MySQL database. We have easily over 1 million records right now in our database and I'm not sure if that is considered big data...
  8. P

    Data Scientist

    Hello everyone and thank you for welcoming me into this community! I started my career as a Data Scientist exactly one year ago today. I love my job and I love what I do. I mostly create parsers and natural language processors so my day typically consists of talking with my AI programs and...
  9. Y

    Managing a very large dataset

    I am currently using a large dataset of around 30 million observations (around 30 GB), and Stata is getting very slow at running even simple commands. For example, it took one hour to merge two parts of this dataset with the append command. Any idea why it is so slow? Or how to set Stata so that...