Daniela Witten, a statistics and biostatistics professor at the UW, delivered a lecture March 1 about the unique and modern research problems that stem from big data. Witten drew examples from her own work to illustrate the complicated nature of big data statistics.
“Even if your data is so big that right now it’s expensive to store, wait five minutes,” Witten said, starting off the lecture with some laughter.
As technological advances improve storage capacity and computing power, data processing is rapidly becoming easier and cheaper. Decoding the human genome, an accomplishment that cost $100 million in 2001, can now be done for as little as $1,000, a roughly 100,000-fold drop, as Witten showed in the lecture.
What this means for researchers is that with overall costs dropping rapidly, and with more data at their disposal, they can ask newer and more challenging scientific questions.
Neuroscience, for example, has benefited greatly from these technological improvements. A notable example is the advancement at the Allen Brain Observatory of calcium imaging, a form of brain imaging that tracks activity using calcium levels. Witten worked with the researchers to analyze footage of individual neurons in a mouse’s brain at a higher level of detail than previously possible.
Witten showed the audience a video of the mouse’s brain as individual neurons flashed and flickered in real time against a black background. This research provided an enormous amount of data that is potentially useful for neuroscientists.
But there is an underlying problem that arises from this amount of data: it needs to be appropriately analyzed.
Unlike certain laws of physics that remain more or less true at large scales, such as gravity or entropy, classical statistical analysis no longer applies once researchers are working with these massive amounts of data.
“All the things that you learn how to do, like in statistics 101, don’t apply anymore,” Witten said.
According to Witten, classical statistics assumes that there are more observations than there are features or measurements within a data set.
“If you’re going to predict how tall a kid is using how tall their parent is, how tall your dad is the feature,” Witten explained.
With massive amounts of data, however, the number of variables being measured is often much larger than the number of observations. Witten calls this a high-dimensional setting, and in it the old rules do not apply.
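One way to see why the old rules break down is a small simulation. The sketch below is not from Witten's lecture; it is a minimal illustration that assumes NumPy is available, and the function name, sample sizes and random seed are all hypothetical choices. It fits ordinary least squares to 10 observations with 50 features, where the outcome is pure noise with no relationship to any feature.

```python
import numpy as np


def fit_pure_noise(n_obs=10, n_features=50, n_new=1000, seed=0):
    """Fit least squares when features outnumber observations.

    All sizes and the seed are illustrative assumptions, not values
    taken from the lecture.
    """
    rng = np.random.default_rng(seed)

    # Random features and a random outcome that is unrelated to them.
    X = rng.normal(size=(n_obs, n_features))
    y = rng.normal(size=n_obs)

    # With more features than observations, lstsq returns the
    # minimum-norm coefficients that fit the observed data.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    train_mse = float(np.mean((X @ beta - y) ** 2))

    # Fresh data drawn the same way: the outcome is still pure noise.
    X_new = rng.normal(size=(n_new, n_features))
    y_new = rng.normal(size=n_new)
    new_mse = float(np.mean((X_new @ beta - y_new) ** 2))

    return train_mse, new_mse


train_mse, new_mse = fit_pure_noise()
print(f"error on the data used to fit: {train_mse:.2e}")
print(f"error on fresh data: {new_mse:.2f}")
```

The model reproduces the 10 observed outcomes almost exactly even though there is nothing to learn, and it does poorly on fresh data. A near-perfect fit that means nothing is the kind of trap a high-dimensional setting creates for classical methods.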
“Big data has its own unique challenges,” Belinda Li, an event attendee and senior majoring in computer science, said. Her interest in these challenges and the invitation from a friend led her to the crowded room.
Room 210 of Kane Hall was full for Witten’s lecture, which was organized by the UW’s department of mathematics. Once the lecture started, almost all of the room’s 240 seats were taken, and several latecomers hoping to attend sat or stood off to the sides of the room.
Computation and storage are not the most important issues involving big data, according to Witten. The real challenge is statistical analysis. Statisticians who work with big data, like Witten, face a new and rapidly growing field that requires new statistical techniques.
“You can spend millions and millions of dollars collecting data, and then you have this data that’s sitting there, but you didn’t collect the data for the data’s sake,” Witten said. “You collected it to actually glean scientific insights from it, and … getting those insights is not a trivial thing, a thing you do as an afterthought with point-and-click software. It’s something that requires a lot of deep and scientific thought, statistical thought.”
Reach contributing writer Oscar Rodriguez at firstname.lastname@example.org. Twitter: @Oscar_Rdrz