Markov chain with covariates

Hi everyone,

My first post! :) I apologize for how long it will be, but please bear with me.
I am looking for help in a topic related to Markov chain modeling with covariates.
I have a decent knowledge of the underlying theory, but not exactly how to implement some aspects of it in practical terms.

First, a bit of exposition and background material that may be useful.

I want to build the transitions matrix for a model with agents transitioning between states a and b, where each element
in the transitions matrix is parameterized by a logistic link connecting individual features of agents to the transitions.
Hence, say, for 2 different states a and b, one must build a stochastic matrix Theta, with elements theta1, theta2, theta3, theta4,
where theta1 represents the weighted transitions between state a and itself, theta2 between a and b, and as usual theta1 + theta2 = 1, and so on.

Theta = [theta1 theta2
theta3 theta4]

According to the logistic link, we have that:

theta1 = 1/(1 + exp(Yab)),
theta2 = exp(Yab)/(1 + exp(Yab))

theta3 = 1/(1 + exp(Yba))
theta4 = exp(Yba)/(1 + exp(Yba))

where Yij = Beta_ij(0) + Beta_ij(Feature1) + Beta_ij(Feature2), i.e, it is a logistic regression expression
where the coefficients Beta are attached to individual features for the agents transitioning between states i and j.
The generalization for a higher number of states is easy enough.
From this you then derive, for each agent in your dataframe, the individual loglikelihood.
The objective is to obtain the global likelihood as the sum over all likelihoods of the agents and pass that function
along with initial parameters to an optimizer to find the Beta_ij and from there, the matrix Theta.

Lets say individual 1 transitions from state a to state b and then back to state a. Then, given this set
of consecutive states, and letting Pi_a be the proportion of individuals initially at state a, his log-likelihood will be

LL1 = log(Pi_a) + log(theta1) + log(theta4)
» log(Pi_a) + log(exp(Yab)/(1 + exp(Yab))) + log(exp(Yba)/(1 + exp(Yba)))

» log(Pi_a) + log(exp(Beta_ab(0) + Beta_ab(Feature1) + Beta_ab(Feature2))/(1 + exp(Beta_ab(0) + Beta_ab(Feature1) + Beta_ab(Feature2)))) +
log(exp(Beta_ba(0) + Beta_ba(Feature1) + Beta_ba(Feature2))/(1 + exp(Yba = Beta_ba(0) + Beta_ba(Feature1) + Beta_ba(Feature2))))

For your n agents, the log-likelihood will be
LL = Sum(LLi)
Ok, enough background material. on to more concrete things and some procedural doubts.

Say I have a data frame in panel data format in R, from which I derive the both the transitions and the set of features agents possess at the time they transition:

ID Feature1 Feature2 Transition
120421006 10000 1 ab
120421006 12000 0 ba
120421006 10000 1 ab
123884392 3000 1 ab
123884392 2000 0 ba
908747738 1000 1 ab

My questions:

1) How would you automate a function in R that returns the log-likelihood of each agent given the features, and then sums them over into one global function, where the unknowns are the Betas?
2) The function you feed into the MLE optimizer returns the estimates of the Betas. How do you build a matrix from the Betas if you need the Ys for the thetas? I suppose an approach is to take the means,
but how would you do it? and if there are dummy variables in the features, like feature2 above?
3) What software would you recommend for the MLE estimation if there are hundreds of states, thousands os possible transitions, and millions of agents? I tried a fully computational approach(namely, the msm package in R) but it chokes
up with this amount of data. Plus, it may be better to develop the code to have more control over it.

Thanks in advance!
A word of advice: you can try making this shorter for more people to read and respond. Currently this looks like a director's cut of the Lord of the Rings... Having that said, if msm cannot handle your data, you will have to do programming of your own, with some parallel computing perhaps. Alternatively, you may sample only a subset of observations at each step and grow your model like random forests are grown.
Last edited:
Ah ah, thanks :) I do realize its a bit long, but I thought it was important to provide context...I'll make a revised shorter version.
As regards your last suggestion, that seems interesting. Its possible to develop a series of models for many samples, of course, but how would you aggregate the whole thing? If you develop markov models for 50 samples, say, how do you come up with an overall model that uses all the data?
Well it´s panel data of about 6 million individuals observed over 20-25 time periods (it is unbalanced), so we´re talking dataframes with around 120 million rows
And the problem is in the optimization step...When it passes arguments to the msm.optim function, the entrance data is about 15 MB. But somehow it explodes in the function iteration and that´s when I get the dreaded RAM limitation message "Cannot Allocate Vector of size X GB"..
This paper by Breiman gives you an idea of how to sample just a subset of features and observations at each step and merge the result into the existing model. You would just use the idea for the algorithm, not the functional form, of course.

The paper is somewhat technical. You may also refer to one of the good data mining books, for more colorful and user-friendly exposition.
Last edited:
Thank you for the reference. I carefully went through it (I already knew a bit about regression and classification trees) and the RF is a natural extension. But only for RT and CTs, it seems. If you want to put together several sampled Markov chains, how would you do it? I was reading a bit on bagging, and it seems like the natural idea, to sample several Markov chains and compute their predictive power by a chi-square vs the test data. But then what do you take as an aggregate?
That is why I mentioned: "You would just use the idea for the algorithm, not the functional form." Just apply the algorithm to various defining characteristics of the Markov chain.