In a data-deluged world, novel science depends on putting Machine Learning into practice.
That’s a lesson being taught by Joshua Bloom, and one that he also practices in his work as professor of astronomy at UC Berkeley. “We absolutely have to see as our imperative putting machine learning or machine intelligence or cognitive computing into practice,” Bloom, who is also the founder of Machine Learning platform vendor wise.io, told an audience gathered at this summer’s Cognitive Computing Forum. It’s important to do that “not because it’s cool and fun, [though] it is all of that, but because…it is a tool, a means to an end.”
For scientists, that means automating the data-driven discovery and inference stack so that they no longer need to be bogged down by the tedious knowledge work at the bottom of the stack – data handling, scheduling, observing, reducing, finding, discovery, classification and follow-up. Instead, they can go straight to the top of the stack, exploiting that backend work to push science farther ahead, faster.
The days of the old-school intimate relationship between scientists and data – exemplified in the movie Contact, in which Jodie Foster plays a radio astronomer personally listening for extraterrestrial radio transmissions – are over, he said by way of illustration. “We have to recognize as scientists in this new era that the traditional roles we’ve played are going to wind up changing,” he said. For too many scientific projects there is simply too much incoming digital data – and too great a need to observe it in real time, so that it can be followed up appropriately with the right facilities to improve the science – for that not to be the case.
As an example, he discussed the 8-meter Large Synoptic Survey Telescope that’s being built in Chile over the next six years, and which should start receiving photons from the universe around 2020. Bloom referred to it as “the 800-pound gorilla in the room for us when we start thinking about how much data we have coming down the pike.”
That “how much” translates to about 3,000 3-gigapixel images – roughly 20 terabytes of data – collected per night. If you think of astronomers as celestial cinematographers, repeatedly photographing the same part of the sky and looking for changes, he said, “we’re updating about a billion variable sources every three days,” and our understanding of what’s occurring improves the faster we can get at that insight. The Large Synoptic Survey Telescope project, for example, will observe about 1 million supernovae per year; in its first two weeks of operation alone it will observe more supernovae than humanity has observed in its entire history.
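Those figures are roughly self-consistent. As a back-of-the-envelope check in Python – assuming on the order of 2 bytes per pixel, a typical scale for raw CCD image data (that encoding is our assumption, not a figure from the talk):

    # Back-of-the-envelope check of the quoted LSST data rate,
    # assuming ~2 bytes per pixel (an illustrative assumption).
    images_per_night = 3_000
    pixels_per_image = 3e9          # 3 gigapixels
    bytes_per_pixel = 2             # assumed

    nightly_terabytes = images_per_night * pixels_per_image * bytes_per_pixel / 1e12
    print(f"~{nightly_terabytes:.0f} TB per night")   # ~18 TB, close to the quoted 20 TB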
To get a full picture of an event that evolves as a function of time, like the explosion of a massive star, it’s important to “get to a state of inference which requires all this input data that hasn’t happened yet,” he said. Unless and until the scientific community can do discovery in real time, he noted, getting to where it needs to be is going to be very hard.
Automating Data-Driven Discovery and Inference
Part of the work Bloom began a few years back to support data-driven discovery in astronomy and to automate the inference stack is building smart robotic telescopes that can do intelligent, autonomous data collection – including talking with other telescopes in a heterogeneous network so that they are informed when a sky event occurs and can decide whether to take data about it, without any humans in the loop.
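A hypothetical sketch may make that autonomous decision step concrete: a telescope in the network receives an alert about a sky event and decides, with no human involved, whether to take follow-up data. The SkyAlert fields, the decision rule, and the thresholds below are all illustrative assumptions, not details of Bloom’s system.

    # Hypothetical sketch of an autonomous follow-up decision.
    from dataclasses import dataclass

    @dataclass
    class SkyAlert:
        ra_deg: float          # right ascension of the event
        dec_deg: float         # declination of the event
        brightness_mag: float  # apparent magnitude (lower = brighter)
        real_score: float      # classifier's probability the event is real

    def should_follow_up(alert: SkyAlert, min_dec: float = -30.0,
                         limit_mag: float = 20.0, min_score: float = 0.9) -> bool:
        """Decide, with no human in the loop, whether to observe the event."""
        visible = alert.dec_deg >= min_dec          # crude visibility check
        bright_enough = alert.brightness_mag <= limit_mag
        likely_real = alert.real_score >= min_score
        return visible and bright_enough and likely_real

    print(should_follow_up(SkyAlert(ra_deg=210.9, dec_deg=54.3,
                                    brightness_mag=17.2, real_score=0.97)))  # True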
Things can get sticky when moving up the inference stack to discovery, though, because the data – the images taken by the telescopes – tends to be dirty and noisy, making it difficult to find new and real astrophysical objects. “We wanted to discover transients and variable stars in the sky without any people actually having to look at data,” Bloom said. Being able to use machines to do even such simple inferencing – to discover whether something in an image is real or bogus – can lead to great things, he said: It’s fast; it’s transparent as to why you got the answers you got; it’s deterministic so that you can go back and do the science on it without requiring humans to make potentially conflicting statements about the same data; and it’s versionable.
To discover the one real object out of roughly 1,000 bogus ones in a survey of the sky – a necessity before one can even ask whether something real is a supernova or a variable star in our own galaxy – required applying domain knowledge to image data. In creating a Machine Learning discovery engine for astronomical images, Bloom used a training set of 78,000 detections, each described by 42 features – including context features – computed from postage-stamp-sized astronomical images. One finding was that non-linear algorithms such as random forest models (which Bloom said can do quite well without a lot of training data) did better at the chosen metric: minimizing the missed-detection rate at a 1 percent false-positive rate. No surprise, of course, he added, that “when you add more training data, the classifier winds up getting better and better.”
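To make that evaluation concrete, here is a minimal sketch – not Bloom’s actual pipeline – of real/bogus classification with a random forest, measuring the missed-detection rate at a 1 percent false-positive rate. The synthetic features, the class balance, and the scikit-learn parameters are all illustrative assumptions.

    # Minimal real/bogus classification sketch with a random forest.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(78_000, 42))             # stand-in for 42 candidate features
    y = (rng.random(78_000) < 0.001).astype(int)  # ~1 real per 1,000 bogus, as in the talk

    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=300, random_state=0)
    clf.fit(X_train, y_train)

    # Evaluate the missed-detection rate at a 1% false-positive rate:
    # pick the score threshold that only 1% of bogus candidates exceed,
    # then measure what fraction of real candidates falls below it.
    scores = clf.predict_proba(X_test)[:, 1]
    threshold = np.quantile(scores[y_test == 0], 0.99)
    missed = np.mean(scores[y_test == 1] <= threshold)
    print(f"missed-detection rate at 1% FPR: {missed:.3f}")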
Now, Bloom says, he and his team are exploring deep learning approaches to the same sort of classification – no need to apply any domain knowledge, just work with the raw pixel data itself. “We’ll have to stay tuned there,” he said.
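As a speculative illustration of that raw-pixel approach, the sketch below is a small convolutional network in PyTorch that maps an image cutout directly to a real/bogus score; the 21×21 cutout size and the architecture are assumptions for illustration, not details from the talk.

    # Speculative sketch: classify raw pixel cutouts with no hand-built features.
    import torch
    import torch.nn as nn

    class RealBogusCNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                   # 21x21 -> 10x10
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                   # 10x10 -> 5x5
                nn.Flatten(),
                nn.Linear(32 * 5 * 5, 1),          # single real/bogus logit
            )

        def forward(self, x):                      # x: (batch, 1, 21, 21)
            return self.net(x)

    model = RealBogusCNN()
    logits = model(torch.randn(8, 1, 21, 21))      # 8 fake cutouts
    print(logits.shape)                            # torch.Size([8, 1])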
Right-Timing Discoveries
The important thing is that “now we can find objects in the sky with no humans in the loop,” using a robust, real-time Machine Learning framework. Among the discoveries it enabled was the nearest supernova in three decades – one that, a couple of days after the explosion, was easy for amateur astronomers with reasonable telescopes to see. A few weeks later it would be bright enough to see with just binoculars.
The difference was that, using this Machine Learning framework, it was found eleven hours after the explosion. That matters because, “had we been doing some sort of interesting Machine Learning work and sort of worked on this in a batch way and on the side found out about this a year later, what was interesting would have been completely useless scientifically,” he said. But because it was found so quickly, it was possible to essentially alert all the world’s large telescopes and commandeer large satellites for the couple of days following the discovery.
“If we hadn’t applied all of that data and all of those instruments to bear on this one part of the sky, we wouldn’t have been able to make some of the seminal inferences that we were able to about the nature of these exploding stars,” he said, noting that a number of papers were born out of the findings. “Doing Machine Learning for Machine Learning’s sake is fun. But we are really doing it because we are able to do novel science and we wouldn’t have been able to do it by other means.”
As Bloom explains it, there are “vast gems of insight and efficiency hidden in data, from the terrestrial to the celestial.” But in every field, from science to business, Machine Learning has to be able to be put into practice – and that means a number of things: benchmark datasets informed by real-world rather than lab questions; models that are not only accurate but also built to be production-capable; and models that are constantly monitored to evaluate how well they are doing against expectations. (More details can be found in the video of his presentation.)
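As a minimal sketch of that last point – monitoring a production model against expectations – the following compares a rolling live accuracy to the accuracy expected at deployment time and flags degradation; the window size and tolerance are illustrative assumptions, not anything prescribed in the talk.

    # Minimal sketch: monitor a deployed model's accuracy against expectations.
    from collections import deque

    class ModelMonitor:
        def __init__(self, expected_accuracy: float, window: int = 1000,
                     tolerance: float = 0.05):
            self.expected = expected_accuracy
            self.tolerance = tolerance
            self.outcomes = deque(maxlen=window)   # 1 = correct, 0 = wrong

        def record(self, prediction, truth) -> None:
            self.outcomes.append(int(prediction == truth))

        def degraded(self) -> bool:
            if len(self.outcomes) < self.outcomes.maxlen:
                return False                       # wait for a full window
            live = sum(self.outcomes) / len(self.outcomes)
            return live < self.expected - self.tolerance

    monitor = ModelMonitor(expected_accuracy=0.95)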
As for science, he sees great frontiers ahead not only in applying Machine Learning to astronomy, but also in whether some of these techniques could be used to find new physics that humans haven’t even envisioned. “If chance favors the prepared mind,” he posits, “an important question for data-driven scientific inquiry is, ‘How do we prepare the artificial mind for novel inference?’”