How do I fit 266 unique values on the x-axis? What's the best graphic?

#1
I have a .csv file consisting of 1000 rows that I exported from MySQL. One of its columns (Charge) contains 266 unique values. Two questions:

1. How do I make a graph showing the frequency of each value (character string)? (The problem is that my plot only shows two or three of the values, and I want to see them all.)

2. What would be the best graphic for this?

Also, if you have the time to get into more detail, here is some output you can look at if you want to help me understand what other cool stuff I can do with my data.

> dim(jackson)
[1] 1000 9

> labels(jackson)[[2]]
[1] "First_Name" "Middle_Name" "Last_Name"
[4] "SO_Num" "Arrest_Agency" "Charge"
[7] "Bail_Amount" "Lodging_Date" "Release_Date"

> nrow(unique(jackson[6]))
[1] 266

> summary(jackson)
First_Name Middle_Name Last_Name
WILLIAM: 42 JAMES : 43 TAYLOR : 24
MICHAEL: 32 ALLEN : 39 GARWOOD : 21
JAMES : 27 EDWARD : 33 THORNTON: 21
JOSHUA : 27 LYNN : 32 BREWER : 18
ERNEST : 24 LUTHER : 24 JOHNSON : 17
JASON : 23 RAY : 24 GADBERRY: 16
(Other):825 (Other):805 (Other) :883
SO_Num Arrest_Agency
Registered Sex Offender: 70 MFP :407
00130781 : 24 MFS :260
00109511 : 21 MFC :125
00111128 : 17 MFO : 87
00103890 : 16 CPP : 37
00113458 : 16 TAP : 26
(Other) :836 (Other): 58
Charge
0475.894 PCS/METH / UNL POSSESS METHAMPHETAMINE - 1 : 46
0163.427 SEX AB 1 / SEX ABUSE 1ST DEG : 34
0162.205 PCS/METH / FAIL TO APPEAR 1ST DEG - 1~PCS/METH: 30
0164.055 THEFT 1 / THEFT 1ST DEG - 1 : 26
0162.205 THEFT 1 / FAIL TO APPEAR 1ST DEG - 1~THEFT I : 23
0163.160 ASSAULT 4 / ASSAULT 4TH DEG - 1 : 22
(Other) :819
Bail_Amount Lodging_Date Release_Date
$0.00 :747 01/06/2015: 55 :906
$10,000.00: 61 01/14/2015: 40 01/16/2015: 14
$5,000.00 : 53 01/07/2015: 34 01/23/2015: 7
$20,000.00: 19 01/13/2015: 33 02/06/2015: 5
$25,000.00: 19 11/11/2014: 28 01/30/2015: 4
$50,000.00: 18 01/09/2015: 27 01/20/2015: 3
(Other) : 83 (Other) :783 (Other) : 61

Thank you so much for your input!
 

trinker

ggplot2orBust
#2
Can I ask what your goals are? 266 unique groups on an axis seems to defeat the purpose of graphing the data in the first place, which is to reduce the bits of information to comprehensible geometric shapes. If you want to look up exact values for a group, a table or list of tables may be better suited to your task.
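
For what it's worth, one quick way to build such a table in base R, assuming the data frame is named jackson as in the original post:

# Frequency table of the 266 unique charges, most common first
charge_freq <- sort(table(jackson$Charge), decreasing = TRUE)
head(charge_freq, 10)          # top 10 charges
as.data.frame(charge_freq)     # or the full list as a two-column table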
 

bryangoodrich

Probably A Mammal
#5
The best I think you could hope for is something like this

http://support.sas.com/documentation/cdl/en/graphref/65389/HTML/default/images/gtilesampsrc1.png

Except your regions wouldn't even be comparable. Do you really care about the frequency of first names unrelated to last names? Are they really independent? (No. No they're not.) That is essentially the sort of frequency-table plot you're asking for, however. You should probably think about the problem you're trying to solve first, then develop a strategy to approach it. Right now, we have no idea what you're truly after, and what you're asking to do doesn't appear helpful to any problem we can imagine, at least not without a whole bunch of unstated constraints you haven't included in your initial post.

Therefore, without understanding the objective of your task, the "best" way to visualize what you're requesting cannot be determined; the "best" way depends on the data and the objective it serves. A blanket view of the frequency of all features seems more like profiling the data for usability than understanding something about it. If that is the case, looking at the top 5 or 10 or 10% of values within each feature would seem more useful, but even such summarizations should be conditioned on type: frequencies for categorical data versus descriptive statistics for actual numerical data. That's just speculation on my part, as we really do need to understand the problem you're trying to solve more than the approach you've decided on.
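
A minimal sketch of that per-feature summary, assuming every column of jackson is categorical (factor or character), as the summary() output above suggests:

# Top 5 values within each feature, one small table per column
lapply(jackson, function(col) head(sort(table(col), decreasing = TRUE), 5))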
 

noetsi

Fortran must die
#6
I can't imagine any scenario where you would learn anything from 266 IVs (independent variables). This calls out for data reduction techniques.
 
#7
bryangoodrich said:
> The best I think you could hope for is something like this
> http://support.sas.com/documentation/cdl/en/graphref/65389/HTML/default/images/gtilesampsrc1.png
> Except your regions wouldn't even be comparable. Do you really care about the frequency of first names unrelated to last names?

No. What I want to see is the frequency of charges in the area. Then, after that, I want to see the amount of bail set in association with each charge. I guess I don't need a chart or graph for that; my boss is happy just to see the numbers.

bryangoodrich said:
> You should probably think about the problem you're trying to solve first, then develop a strategy to approach it. Right now, we have no idea what you're truly after, and what you're asking to do doesn't appear helpful to any problem we can imagine, at least not without a whole bunch of unstated constraints you haven't included in your initial post.

I've learned that the most important thing in data science is the question. My question is: "How do I turn the data into something interesting and interpretable for the people I work with?" Like I said, I guess I can just show the numbers instead of the graphs. The reason I'm here, though, is to learn how to do these things (and I truly appreciate all of your help, guys).

bryangoodrich said:
> Looking at the top 5 or 10 or 10% of values within each feature would seem more useful, but even such summarizations should be conditioned on type: frequencies for categorical data versus descriptive statistics for actual numerical data.

This is the answer I was looking for. I don't need EVERY feature of the data on one graph; multiple graphs will do, and the top charges by frequency will also do.

noetsi said:
> I can't imagine any scenario where you would learn anything from 266 IVs (independent variables). This calls out for data reduction techniques.

This answers my question. Thank you, guys!
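
For the record, here is one way to get those two summaries, a sketch assuming the jackson data frame from the first post and that Bail_Amount is stored as strings like "$10,000.00" (the horizontal bar chart is just one reasonable choice):

library(ggplot2)

# Top 10 charges by frequency, drawn as a horizontal bar chart
top_charges <- head(sort(table(jackson$Charge), decreasing = TRUE), 10)
plot_dat <- data.frame(Charge = names(top_charges),
                       Freq = as.integer(top_charges))
ggplot(plot_dat, aes(x = reorder(Charge, Freq), y = Freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +   # long charge labels read better on the y-axis
  labs(x = "Charge", y = "Count")

# Bail in association with charge: strip "$" and "," before summarizing
jackson$Bail_Num <- as.numeric(gsub("[$,]", "", jackson$Bail_Amount))
aggregate(Bail_Num ~ Charge, data = jackson, FUN = median)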
 

bryangoodrich

Probably A Mammal
#9
Feature selection is one of the hardest and most important parts of data mining. While you want a quick and easy overview of your data, you should really consider which features to include in your data mining model. If names aren't important, don't include them! Looking at frequency is good for understanding the data, but not for understanding its impact on the model, because you need to relate it to other information. You also don't want to conflate your reporting with your analysis: what graphs and numbers you report to management is a wholly separate issue from exploring the data.

I recommend following the CRISP-DM approach: http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining

It doesn't align with every business process, though: reporting on what the model finds in the data and on what results from deployment isn't really discussed. Part of the reason is that this is beyond the scope of data mining; you're entering the realm of business intelligence. However, if you want to be a data scientist, you need to work across the entire spectrum.

For instance, I have no problem whipping together a dashboard (Tableau is good for this) or a PivotTable in Excel to deliver to others, and I may also use them to easily interpret the results of my analysis. But I don't use those tools to determine my approach to the modeling. How you deliver results is significantly important, but it is a separate issue to deal with.

If you want to mine your data, looking at frequency doesn't get at how the data relate. This is where visual and numerical data exploration is handy: as the analyst, you will want to look through a lot of the plots between features. If you want to use data mining techniques, you can do things like CART or clustering that try to latch onto certain attributes among the features. If you have certain dependent variables, you can use classification and regression to model how the features relate to them. You can also use PCA and other dimensionality-reduction methods to create surrogate features that more directly capture the variation in the larger data set (because 266 features is a lot to make sense of).
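
As a toy illustration of that last point, a PCA sketch with base R's prcomp(); note that this assumes numeric features, whereas the jail data above is almost entirely categorical (for factors you'd look at something like multiple correspondence analysis instead):

set.seed(1)
X <- matrix(rnorm(1000 * 20), nrow = 1000)   # stand-in for a numeric feature matrix
pc <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pc)          # proportion of variance captured by each component
head(pc$x[, 1:3])    # the first three surrogate features (component scores)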

The fact is, 266 isn't even that much, because there might be higher-order relationships you'd never think of nor be able to make sense of. For instance, if you have 3 features, X1, X2, and X3, you can also look at their squares, cubes, or powers up to the 6th, and all the possible interaction terms; thus, a couple hundred features can easily explode into thousands. Data mining methods can help take that complexity of information and reduce it, or find a structure in the data that is good at predicting dependent-variable values. Of course, doing it (not always hard) and validating it are two different things: it's easy to build a bad data mining model. And if you want an explanatory model, data mining may be the wrong way to go entirely; statistical methods may be more appropriate for that. Thus, as you acknowledged, understanding the question you're after is the most important thing. The best advice I can give is: be specific. A simple question can easily be unpacked into a dozen operational objectives you can analyze, and you usually want to treat each as a separate problem.
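
A toy illustration of that explosion in base R (using degree 3 rather than 6, to keep it small):

d <- data.frame(X1 = rnorm(10), X2 = rnorm(10), X3 = rnorm(10))
# squared and cubed terms for each feature, plus every interaction term
mm <- model.matrix(~ poly(X1, 3) + poly(X2, 3) + poly(X3, 3) +
                     X1:X2 + X1:X3 + X2:X3 + X1:X2:X3, data = d)
ncol(mm)   # 14 columns out of just 3 original features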
 

noetsi

Fortran must die
#10
Although I don't know your research at all, exploratory factor analysis is commonly helpful when trying to reduce a large group of IVs to a manageable set of factors to explore. It is not that difficult to learn, which is another advantage. :)
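
A minimal sketch with base R's factanal(), on fabricated numeric data since EFA assumes roughly continuous IVs:

set.seed(1)
f <- rnorm(100)                          # one latent factor
dat <- data.frame(a = f + rnorm(100, sd = 0.5),
                  b = f + rnorm(100, sd = 0.5),
                  c = f + rnorm(100, sd = 0.5),
                  d = rnorm(100))        # pure noise; should load on nothing
fa <- factanal(dat, factors = 1)
print(fa$loadings)                       # a, b, c load heavily; d does not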