Data Scientist

#1
Hello everyone and thank you for welcoming me into this community!

I started my career as a Data Scientist exactly one year ago today. I love my job and I love what I do. I mostly build parsers and natural language processors, so my day typically consists of talking with my AI programs and gathering a ton of data on inmates in multiple states.

I'm here because, now that I have this data, I have a couple of questions. Our database is getting so big that I'm not sure whether it counts as "Big Data". We easily have over a million case records. Is that considered big data? Should I start using Hadoop, Pig, or Hive now? Printing out a single table of over 185,000 records takes a good 2 minutes these days. Where's the divide?

My second reason for being here is to learn how to start making charts and graphs. I guess I'm headed to the R section for that.

Anyways, thanks for having me and I look forward to hearing from the members here!
 

bryangoodrich

Probably A Mammal
#2
Sounds like fun work! Big data isn't defined by the data itself but by the processes you want to execute on it. It might be that the data itself is complex, as is usually the case with text (unstructured) data. For instance, you can run SQL in a database to compute something very easily if it's numeric, but if you have to parse a string in your database, a similar task becomes dreadfully slow. That's true even if the data isn't really "big" in the sense of "we process billions of websites an hour!" It may also be "big data" if your data is arriving quickly and you need real-time analytics. This gets at the 3 V's of big data: velocity, variety, and volume. As I like to think of it: speed, complexity, and size.
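To make that concrete, here's a rough sketch in Python with pandas (the column names and the regex are made up for illustration). The numeric aggregate just scans a column; the string parse runs a regex against every row, which is the kind of work that makes text feel "big" long before the row count does:

    import pandas as pd

    # Hypothetical case-record table: one numeric column, one free-text column.
    df = pd.DataFrame({
        "sentence_months": range(185_000),
        "case_notes": ["Sentenced to 24 months; see docket 12-345."] * 185_000,
    })

    # Numeric computation: effectively instant, just a column scan.
    avg_sentence = df["sentence_months"].mean()

    # String parsing: a regex runs on every row -- far more work per record.
    dockets = df["case_notes"].str.extract(r"docket (\d+-\d+)")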

If a 2-minute wait is alright with you, then your current setup suits your computations. If you want it sub-second or sub-minute, or if you expect your data to grow, then you might want to consider distributed, massively parallel systems like Hadoop or Cassandra. The data growth is the real issue here. In computer science, your computation has a growth factor (big O notation): double the input on a linear algorithm and the runtime roughly doubles, while on a quadratic one it quadruples. That 185K records may be reasonable given your computation, but if you double it, how long does your computation take then? Is it 4 minutes or 40? Maybe you need a new algorithm, or maybe you need the parallel computing power of a Hadoop cluster.
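You can even measure your own growth factor empirically. A minimal sketch, with a stand-in function where your real query or parse would go:

    import time

    def my_computation(records):
        # Stand-in for whatever you actually run over your case records.
        return sorted(records)

    def time_it(n):
        records = list(range(n, 0, -1))
        start = time.perf_counter()
        my_computation(records)
        return time.perf_counter() - start

    t1, t2 = time_it(185_000), time_it(370_000)
    # A ratio near 2 means roughly linear growth; near 4 suggests quadratic,
    # which is when a better algorithm (or a cluster) starts to pay off.
    print(f"time ratio after doubling the input: {t2 / t1:.1f}")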

You might also consider MongoDB or CouchDB if you're focused on unstructured data, but they definitely take more work to handle, as the way they store data and the way you access it are fundamentally different. For the most part, there's no simple SQL on a document database.
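To give you a feel for the difference, here's a minimal pymongo sketch (it assumes a MongoDB server on localhost, and the database and collection names are hypothetical). You query with dictionaries of criteria rather than SQL:

    from pymongo import MongoClient

    # Assumes MongoDB running locally; the names below are made up.
    client = MongoClient("mongodb://localhost:27017")
    cases = client.corrections.cases

    # Documents can nest and vary in shape -- there's no fixed schema.
    cases.insert_one({
        "case_id": "12-345",
        "state": "CA",
        "charges": [{"code": "PC 459", "degree": 1}],
    })

    # The query is a dictionary, not a SQL statement.
    for doc in cases.find({"state": "CA"}):
        print(doc["case_id"])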

As for visualizations, you can tap into your data with R or Python and make graphics, or you can always go proprietary with something like Tableau or Qlik, which make it much easier to build dashboards. It just depends on what you want out of it, how the visuals will be used, reproduced, or updated, and how much you're willing to pay.
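If you go the free route, your first chart is only a few lines. A quick pandas/matplotlib sketch, with made-up numbers standing in for your case data:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Made-up aggregate standing in for your real case counts.
    counts = pd.DataFrame({
        "state": ["CA", "TX", "NY", "FL"],
        "cases": [52_000, 48_000, 41_000, 44_000],
    })

    counts.plot.bar(x="state", y="cases", legend=False)
    plt.ylabel("case records")
    plt.title("Case records by state")
    plt.tight_layout()
    plt.show()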

Glad to have another data scientist around here! What language do you use for your NLP work?

Cheers