# Chatbox and formality

#### trinker

##### ggplot2orBust
This message only pertains to those with chatbox privileges. If you do not have those privileges yet, keep posting and being a part of the community, and soon you'll have access.

I'm working on a paper right now on the speech formality of educators (this actually has a great deal to do with the formality that Greta has been discussing). One of the elements of the language teachers use is formal language vs. contextual language. Here's a fascinating (IMHO) paper regarding this:

http://pespmc1.vub.ac.be/Papers/Formality.pdf

I have written a function to measure formality in speech using R. I ran it on the chat box and thought I'd share. You can run the code yourself; just download the qdap and talkstats packages from my GitHub repo (if you already downloaded qdap, I'd update it).

I warn you that the function takes a while to run the first time, since it's generating a part of speech for every word. On my i7 quad-core Windows 7 machine it took about 10 minutes to run.

Here are the results (formality ranges from 0 to 100, with neither extreme being attainable in practice). The second visual excludes people with fewer than 300 words, as the measure isn't recommended below that word count.

Results:
Code:
          person word.count formality
1         bugman         15     86.67
2     SmoothJohn         15     73.33
3         ledzep         74     71.62
4   TheEcologist        195     67.95
5         spunky        278     63.31
6       bukharin         25     62.00
7          quark        995     61.46
8  bryangoodrich      10957     60.76
9       duskstar         92     60.33
10         vinux       2664     58.20
11         Lazar       2162     57.28
12        Dragan         84     57.14
13          Jake       8249     57.10
14       trinker       8194     57.04
15    victorxstc       8765     56.81
16      SiBorg77        985     55.69
17        noetsi       1588     55.57
18    GretaGarbo       5872     55.29
19         Dason      13415     54.16
Visuals:

Code:
# install.packages("devtools")
library(devtools)
install_github("qdap", "trinker")
install_github("talkstats", "trinker")

x <- ts_chatbox()
# the first run takes a long time, as it's parsing parts of speech
(res <- formality(x$dialogue, x$person, plot = TRUE))
formality(res, x$person, plot = TRUE, min.wrd = 300)
formality(res, x$date, plot = TRUE)
with(x, formality(res, list(person, date)))

#### Dason

Bwahah! I'm winning the informality race!

Any chance we could get some sort of standard error on those measurements?

#### Jake

I guess it's not a serious question, but it did make me think that you could in principle get bootstrapped standard errors by resampling from people's samples of words, right?

#### trinker

##### ggplot2orBust
Not that I can devise, as the formula works on the speech as a whole. No SD, sorry. Formality isn't necessarily a good thing. I'm guessing (in fact I'd bet) that in threads we're all way more formal, as there's a greater chance of being misunderstood.

#### trinker

##### ggplot2orBust
Actually, Jake, that's a very interesting concept. Thanks.

#### Dason

I thought about it for a little while, but the problem with that approach is that there is structure to our sentences that wouldn't be accounted for by a naive bootstrap. You would probably have to do some sort of block bootstrap to make it work.
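To make the block-bootstrap idea concrete, here is a minimal Python sketch (toy data and an invented `f`/`c`/`o` tag encoding; this is not qdap's implementation): resample whole sentences with replacement so each sentence's internal structure stays intact, recompute the formality score for each resample, and take the standard deviation of the replicates as the SE.

```python
import random

def formality(sentences):
    """Formality over the pooled words: 100 * (n_formal - n_contextual) / (2N) + 50,
    where each word is pre-tagged 'f' (formal), 'c' (contextual), or 'o' (other)."""
    tags = [t for s in sentences for t in s]
    n_f = tags.count("f")
    n_c = tags.count("c")
    return 100 * (n_f - n_c) / (2 * len(tags)) + 50

def block_bootstrap_se(sentences, reps=2000, seed=1):
    """Resample whole sentences with replacement, keeping each sentence's
    word composition intact (the 'block' in block bootstrap)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(reps):
        resample = [rng.choice(sentences) for _ in sentences]
        scores.append(formality(resample))
    mean = sum(scores) / reps
    return (sum((s - mean) ** 2 for s in scores) / (reps - 1)) ** 0.5

# Toy data: three sentences, words collapsed to f/c/o tags.
toy = [["f", "f", "c", "o"], ["c", "c", "f"], ["f", "o", "f", "c", "c"]]
print(formality(toy))           # point estimate: 50.0 here (5 f vs. 5 c)
print(block_bootstrap_se(toy))  # SE from sentence-level resampling
```

With real data the parts of speech would be tagged once (the slow step) and only the resampling repeated, as noted in the chat below.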

#### trinker

##### ggplot2orBust
Code:
37  2011-08-16 15:46:00 trinker          Formal language is useful in that it is good for those with little contextual knowledge but terribly inefficient. Thus the goal is to get the student to have greater context and thus be less formal.
36  2011-08-16 15:51:00 trinker          What's interesting is that Dason is less formal than Greta, who is female (females are less formal in spoken dialogue). That makes me question if bots are less formal still because their programming lacks sophistication. This could be an interesting way to detect bots.
35  2011-08-16 15:54:00 trinker          @Jake I was actually wanting to get SEs for this and bootstrapping didn't occur to me. Nice idea.
34  2011-08-16 15:55:00 Jake             based on what you said about how long it took for a single run, i guess getting the SEs would take a while
33  2011-08-16 15:55:00 trinker          Would you resample with replacement each time but use the same n?
32  2011-08-16 15:55:00 Jake             yeah
31  2011-08-16 15:56:00 trinker          No Jake, once you run it once it's easy. It saves the parts of speech in a list; that's what takes a while. After that you feed the first one to the next and it takes seconds.
30  2011-08-16 15:56:00 Jake             the key to this working is the fact that the formality algorithm just analyzes individual words, not sentence structure - that's true, right?
29  2011-08-16 15:57:00 GretaGarbo       What is the unit of investigation in this case: the word, sentence, or message?
28  2011-08-16 15:57:00 trinker          Correct, Jake, but the parts of speech algorithm needs the sentence structure. However, if I smell what you're cooking, you just sample from the parts of speech after they've been determined
27  2011-08-16 15:57:00 Dason            Like I mentioned in the thread I think you would need to do a block bootstrap - not a naive bootstrap where you just resample all of the words
26  2011-08-16 15:58:00 trinker          The formula is rather simple: [MATH]F = \frac{100\left(\sum{f_{i}} - \sum{c_{i}}\right)}{2N} + 50[/MATH]
25  2011-08-16 15:58:00 Dason            But I would probably need to learn more about the actual measure used to be sure of that
24  2011-08-16 15:58:00 trinker          bootstrap I quasi understand; now you lost me with block and naive
23  2011-08-16 15:59:00 Jake             block means basically you would resample at the level of sentences rather than words
22  2011-08-16 15:59:00 trinker          the tex doesn't come through, but you add up all formal parts of speech minus all contextual parts, plus 100, divided by 2
21  2011-08-16 15:59:00 trinker          why that way dason?
20  2011-08-16 15:59:00 trinker          It's doable but why?

#### trinker

##### ggplot2orBust
Code:
19  2011-08-16 16:00:00 GretaGarbo       Is it multilevel with: message, sentence, word?
18  2011-08-16 16:00:00 trinker          The parts of speech are actually saved in a list by turn of talk, not necessarily by sentence.
17  2011-08-16 16:00:00 Dason            Because people don't just throw out random words - there is structure that needs to be accounted for.
16  2011-08-16 16:01:00 Dason            Ok - people don't usually just throw out random words.
15  2011-08-16 16:01:00 trinker          Oh that makes sense
14  2011-08-16 16:01:00 Jake             for the simple analyses that only look at characteristics of the word (ignoring what sentence it came from) a simple bootstrap should be fine. but stuff that depends on sentence structure may need a more complicated bootstrap like dason suggests
13  2011-08-16 16:01:00 trinker          Though you've been known to...
12  2011-08-16 16:01:00 Dason            that's why I added the "usually"
11  2011-08-16 16:02:00 Dason            How do you determine if something is formal or contextual?
10  2011-08-16 16:03:00 trinker          Very interesting. I'm working on the lit review now and the analysis is a few weeks off, so I think I may add the SEs to the analysis. It's 3 subjects with 3 pre and 3 post measures, not a large enough sample to run sound statistical analysis on; however, I think the SEs add to the information conveyed.
9   2011-08-16 16:04:00 Jake             the pre and post design also adds some more complexity because now you also have to block by subject
8   2011-08-16 16:05:00 trinker          Dason it's rather simple: verbs, adverbs, pronouns and interjections are contextual, whereas nouns, articles, prepositions and adjectives are formal, and conjunctions are neither.
7   2011-08-16 16:05:00 trinker          I may start a thread on this.
6   2011-08-16 16:06:00 Jake             the more i think about it, you probably can't get away with a simple bootstrap even for the very simple measures. it is probably the case that there are correlations across sentences in the composition of words. i.e., sentences with lots of verbs also have lots of nouns, that kind of thing
5   2011-08-16 16:07:00 Jake             so sentences probably always introduce dependence
4   2011-08-16 16:07:00 trinker          You may say well in this instance... blah blah blah. This is an overall measure so that's why we need 300+ words. It's pretty robust from everything I've read on it and pretty standard among linguists.

#### trinker

##### ggplot2orBust
Figured I'd provide the formula for quick reference:

$$F = \frac{100}{2N}\sum\limits_{i=1}^{n}{f_{i}} - \frac{100}{2N}\sum\limits_{i=1}^{n}{c_{i}} + 50$$

$$f = \left \{formal \right \}= \left \{noun, adjective, preposition, article\right \}$$

$$c = \left \{contextual \right \}= \left \{pronoun, verb, adverb, interjection\right \}$$
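As a quick worked example, here is a small Python sketch of the computation (the tag list and the name `formality_score` are invented for illustration; this is not qdap's code). The count difference is scaled by 100 so that F spans the stated 0 to 100 range, matching the results table above:

```python
# Toy worked example of the formality measure:
# F = 100 * (n_f - n_c) / (2 * N) + 50, where n_f counts formal tags,
# n_c counts contextual tags, and N is the total number of words.
FORMAL = {"noun", "adjective", "preposition", "article"}
CONTEXTUAL = {"pronoun", "verb", "adverb", "interjection"}

def formality_score(tags):
    n = len(tags)                              # N: total words
    n_f = sum(t in FORMAL for t in tags)       # formal count
    n_c = sum(t in CONTEXTUAL for t in tags)   # contextual count
    return 100 * (n_f - n_c) / (2 * n) + 50

# "The quick dog runs" -> article, adjective, noun, verb
tags = ["article", "adjective", "noun", "verb"]
print(formality_score(tags))  # 100 * (3 - 1) / 8 + 50 = 75.0
```

All-formal text gives 100, all-contextual gives 0, and conjunctions (neither set) pull the score towards 50, which is why neither extreme is reachable in real speech.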

#### Dason

Is $$n$$ the same as $$N$$? Is $$f_i$$ just an indicator that the $$i^{th}$$ word is in the set $$f$$?

#### trinker

##### ggplot2orBust
Dason, no, $$N$$ and $$n$$ are not the same. $$N$$ is the total number of words, and little $$n$$ is the total formal count ($$n_f$$) or the total contextual count ($$n_c$$), respectively.

#### GretaGarbo

##### Human
I was speculating a little bit about trinker's model yesterday while doing something else. I just wrote it as a comment in the chat box since it was, and still is, no more than a speculation on a temporary idea.

Let $$f_{st}$$ be 1/0 (formal/not formal) for each individual, with s indexing the sentence and t the word within the sentence.

Trinker was referring to some authors who added 50 or so. I think that is unnecessary and just makes the situation more complicated: it is only a linear transformation, and it can be done afterwards (after estimating the proportion of formality). Besides, the formality proportion and the other proportions will add up to 100%.

One can think of individuals as a third level, but I ignore that, model one person, and treat sentences and words as a multilevel model.

$$f_{st} = \mu+\alpha_s+\omega_{st}+\epsilon_{st}$$

Where $$\alpha_s$$ is a random variable for sentence number s and $$\omega_{st}$$ is a random variable for word st. (I think of the words in a sentence like a time series, so I use the index t.) Possibly the alphas can be considered independent. The omegas have a dependence, maybe an autocorrelation. I think a first-order autocorrelation would be too simple, but maybe a second-order model:

$$\omega_{st} =a_1\omega_{s(t-1)}+ a_2\omega_{s(t-2)}+error$$

I have not run a model exactly like this, but there are standard models, like generalized linear models (with binomial errors) and repeated measurements: thus a multilevel model. Some sentences can be very short ("I know."), and that might be a complication for estimating the autocorrelation model. Models where estimates are shrunken towards the overall mean might be useful.

It was suggested to trinker to provide standard errors for the estimates. Such a model would give standard errors.
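One way to see why sentence-level dependence matters for those standard errors is a quick simulation. The Python sketch below (invented parameters; a latent-threshold stand-in for the model above, with independent word errors in place of the AR(2) term for simplicity) compares the naive binomial SE of the formality proportion with a sentence-clustered SE:

```python
import math
import random

def simulate_sentences(n_sent=200, words_per_sent=8, mu=0.0,
                       sent_sd=1.0, seed=42):
    """f_st = 1 if mu + alpha_s + eps_st > 0, with alpha_s ~ N(0, sent_sd^2)
    a random sentence effect and eps_st ~ N(0, 1) word-level noise."""
    rng = random.Random(seed)
    sentences = []
    for _ in range(n_sent):
        alpha = rng.gauss(0.0, sent_sd)  # shared within the sentence
        sentences.append([1 if mu + alpha + rng.gauss(0.0, 1.0) > 0 else 0
                          for _ in range(words_per_sent)])
    return sentences

sents = simulate_sentences()
words = [w for s in sents for w in s]
n = len(words)
p = sum(words) / n                      # overall formality proportion

# Naive SE: pretends all words are independent draws.
se_naive = math.sqrt(p * (1 - p) / n)

# Clustered SE: each sentence's proportion is one (roughly independent)
# observation, so the SE comes from the spread of sentence means.
means = [sum(s) / len(s) for s in sents]
m = len(means)
se_cluster = math.sqrt(sum((x - p) ** 2 for x in means) / (m - 1) / m)

print(round(p, 3), round(se_naive, 4), round(se_cluster, 4))
```

With a nontrivial sentence effect the clustered SE comes out larger than the naive one, which is exactly the point of modelling (or block-resampling) at the sentence level.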

First I thought of a Kalman filter with gradually changing proportions of formality. In a way I like the idea that parameters are changing, drifting, over time: like when a conversation starts out more formal, moves over to something more contextual, and ends a little bit formal.

Please correct this if it is of any use. Meanwhile, I agree with Socrates.

#### trinker

##### ggplot2orBust
Greta said:
First I thought of a Kalman filter with gradually changing proportions of formality. In a way I like the idea that parameters are changing, drifting, over time: like when a conversation starts out more formal, moves over to something more contextual, and ends a little bit formal.
This occurs (according to the linguists I'm reading) because the beginning is spent building context. When the context is built the speaker may become less formal and thus more efficient.

Thanks for your thoughts, Greta. I'd appreciate it if people offered challenges, confirmations, or new ideas.