Why am I writing about math late on a Saturday night after a long time away from my blog? Because, I admit, I have nothing else to talk about…
I probably could [and should] write about my experience as a prospective first-time father, or I could write about how my class in Decision Making Under Uncertainty has been a much better experience for me personally and professionally this semester. I might even write about how the Wade-O Radio Show has been off the air for at least 4 weeks now, and how Wade needs to fix his mixer ASAP. But I figured I’d take a few minutes to write about a topic that almost no one I know cares about, even though I’ve been thinking about it almost exclusively for the past month or so: learning Bayesian networks from data.
Bayesian networks are remarkable graphical models for organizing a joint probability distribution according to the conditional independence relationships present in a dataset. Another way of saying this is that people usually process information according to hypotheses linking the objects they observe; we don’t connect objects intellectually unless some hypothesis connects them, and a Bayesian network makes those connections explicit as a directed graph. One of my PhD advisors at Carnegie Mellon, Mitchell Small, introduced me to Bayesian networks as a way to combine information from health effects studies to support risk assessment, but as a postdoctoral fellow I started to look at them as a potential data mining tool. However you look at them, they are elegant computational models with a compelling axiomatic basis for philosophical reasoning, to boot. They helped me understand and visualize Bayes’ rule as a graduate student, and now I’m hoping to use them more as a data mining technique to model drinking water distribution system reliability. For these applications, I am thinking that learning the networks from my datasets will be indispensable.
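To make that concrete, here’s a minimal sketch in base R (the network, names, and numbers are invented for illustration): a toy DAG with arcs A → B and A → C, where the eight-entry joint distribution factors into three small tables because B and C are conditionally independent given A.

```r
# Toy DAG: A -> B and A -> C, so B and C are conditionally independent given A.
# The joint then factors as P(A, B, C) = P(A) * P(B | A) * P(C | A).
p_a <- c(yes = 0.3, no = 0.7)                   # P(A)
p_b_a <- rbind(yes = c(yes = 0.9, no = 0.1),    # P(B | A = yes)
               no  = c(yes = 0.2, no = 0.8))    # P(B | A = no)
p_c_a <- rbind(yes = c(yes = 0.4, no = 0.6),    # P(C | A = yes)
               no  = c(yes = 0.5, no = 0.5))    # P(C | A = no)

# Any joint entry is just a product of the three local tables.
joint <- function(a, b, c) p_a[[a]] * p_b_a[a, b] * p_c_a[a, c]

# The factored entries still form a proper distribution (they sum to 1).
states <- c("yes", "no")
total <- sum(sapply(states, function(a)
  sum(sapply(states, function(b)
    sum(sapply(states, function(c) joint(a, b, c)))))))
```

The payoff is in the bookkeeping: the full joint needs 7 free parameters here, while the factored form needs only 5, and the savings grow dramatically as networks get larger.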
OK, so learning Bayesian networks hasn’t been the exclusive focus of my thoughts in preparing this research. Most of the past month has gone to a more thorough reading of the first few parts of Judea Pearl’s Probabilistic Reasoning in Intelligent Systems, but I have also found a really cool paper by an Italian geneticist who has integrated several of the most popular algorithms into an R package (bnlearn) for learning both the structure and parameters of a Bayesian network. I originally came across his article last March or so when working with some JHU colleagues on using Bayesian networks to predict missing data in a public hurricane loss model database, but we didn’t learn our network from data, and we made some simplifying assumptions that did not require the sophisticated set of techniques in the linked paper. Having read Marco Scutari’s paper several more times in the past week, I’m very impressed by the resource that he’s constructed. It also helps a lot that it is in my favorite programming environment. There are many tools that learn either the structure or the parameters of Bayesian networks, but doing both at the same time has generally been left alone. While Scutari’s package doesn’t do both at the same time, a researcher can come close, especially when using the bootstrapping or cross-validation utilities included in bnlearn.
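As I understand the package’s API, the workflow looks something like the sketch below, using the synthetic learning.test dataset that ships with bnlearn: score-based structure learning first, then parameter fitting on the learned graph, with bootstrap resampling to gauge confidence in the learned arcs.

```r
library(bnlearn)
data(learning.test)  # a synthetic discrete dataset bundled with bnlearn

# Structure learning: score-based hill climbing over candidate DAGs.
dag <- hc(learning.test)

# Parameter learning: fit conditional probability tables to the learned DAG.
fitted <- bn.fit(dag, learning.test)

# Bootstrap resampling: re-learn the structure on resampled datasets and
# measure how often each arc appears, then average the results. This is
# one way to get close to learning structure and parameters jointly.
strengths <- boot.strength(learning.test, R = 200, algorithm = "hc")
avg.dag <- averaged.network(strengths, threshold = 0.85)
```

The averaged network keeps only arcs that survive a strength threshold across bootstrap replicates, which gives a more honest picture of the structure than a single run on the raw data.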
Because of Scutari and bnlearn, I am excited to move further with the modeling I’m doing. As an environmental engineer who wants to use computer science, not necessarily create it, I’m very pleased he’s made this tool available.