It’s good to be back blogging! I’ve been buried for the past month finishing a book I’m co-authoring on 3D printing (called Fabricated — it’s due out in February because it takes the publisher 3 months(!) to format an ebook). Also in production, an SBIR commercialization plan for a tech startup (data analytics software) in Boston. I can’t say enough good things about the SBIR program and the process of applying for Round I and Round II funds. But more about that later.
For the commercialization plan, I’ve been doing research on Big Data. Big Data is the new natural resource. Here are some interesting facts:
- The amount of new data created in the past twelve months alone would fill up 57 billion Apple iPads (according to IDC)
- Each year, the amount of data generated, worldwide, will increase 40%
- The human mind isn’t equipped to handle more than about seven pieces of information at once (this was calculated by researcher George Miller in a 1956 article in Psychological Review)
By now, nearly everyone has heard of “Big Data.” Yet there’s no formal definition, no benchmark. Instead, Big Data is a concept: a situation where the volume of digital information exceeds the capacity of today’s software tools to readily capture, store and analyze it. What constitutes Big Data obviously means different things to different people and different industries.
A useful way to think about a data situation is to consider the three V’s: volume, velocity and variety. In terms of structure, datasets fall into two broad categories: structured (databases, tabular data) and unstructured (free-form data from sensors or digital media).
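To make the structured/unstructured distinction concrete, here is a minimal sketch in Python. The sensor names, values and log format are all invented for illustration; the point is that structured data answers a query directly, while unstructured data needs parsing first.

```python
import re

# Structured: fixed fields, easy to query like a database row.
structured_reading = {"sensor_id": "A7", "timestamp": "2012-11-01T09:30", "temp_c": 21.4}

# Unstructured: the same information buried in free text, as a log line might arrive.
unstructured_reading = "Nov 1 09:30 sensor A7 reported roughly 21.4 degrees C, all nominal"

# Structured data is one key lookup away...
temp_from_structured = structured_reading["temp_c"]

# ...while unstructured data must be parsed (here with a fragile regex) before analysis.
match = re.search(r"(\d+(?:\.\d+)?) degrees", unstructured_reading)
temp_from_unstructured = float(match.group(1)) if match else None
```

The regex works only because we wrote the log line ourselves; real unstructured data rarely cooperates so neatly, which is exactly why the tooling gap exists.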
The market research firm TDWI (“Big Data Analytics,” 4th Quarter, 2011) surveyed 325 professionals in several industries. Ironically, the people in most organizations who “own” the company’s data are the IT staff who manage the technology, not the people setting strategy. Only 21% of respondents said that their individual department was the primary owner of its data.
There are several reasons for this. Current data analytics tools aren’t easy for non-quants to use. Today’s tools are geared to make sense of structured data pulled from a database. Frequently, available data tools are custom-built, and their use and configuration (creating reports) are overseen by the people who manage the database.
There is a vast unaddressed market for analytical tools that can make sense of unstructured data. Sensor data, massive digital files and GPS data will continue to grow in volume, velocity and variety. In fact, I predict that the value and volume (and probably the velocity) of unstructured data will soon surpass those of structured data. Yet aside from sophisticated facial recognition software and primitive tools for managing digital media, massive reams of unstructured data remain buried beyond our analytical reach.
Where is all this data — structured and unstructured — coming from? Well, everywhere. Mobile phones, internet clicks. An estimated 60% of the world’s population has a cell phone that’s constantly streaming data back to its network provider.
Each time you pay for something at the grocery store, you just contributed to the world’s store of Big Data. Cars and machines have tiny sensors built into them which collect a steady stream of raw unstructured data (unstructured data means it’s not captured in a nice tidy table or spreadsheet — it’s simply reams of numbers or text). The medical profession racks up massive data files in the form of medical images or real-time video feeds generated during a surgery.
In daily life, most of us experience Big Data analysis in action when we’re buying something. Retailers are among the top industries eagerly collecting and trying to gain insight from customer data. Grocery stores lure customers with discount store cards, but I personally don’t like making it easy for my grocery store to track my purchases. That’s why I don’t have a store savings card.
In the past, I would innocently sign up for a store card and then a few weeks later would start getting coupons in the mail for some product that apparently fit in with my buying patterns. Retailers call this the “Next Best Offer.” Speaking of untapped markets, there are good opportunities for an analytical tool that would help retailers improve the precision of their NBOs by analyzing point-of-sale (POS) data and making good suggestions (including setting the right price) on the spot.
The faster and more precisely a retailer can offer a customer targeted NBOs, the more goods that retailer will sell (at least in theory). Sometimes retailers’ efforts can be amusing. For example, I like old movies — the singing and dancing kind from the 1930s and 1940s. I can’t prove this, but after we bought a Roku and I started happily watching old musicals on streaming Netflix, I started getting flyers in the mail selling me retirement homes or “Golden Vacations.” I’m still a few years away from that phase of life. But… I suppose that guessing a person’s age based on the movies they like is not a completely unreasonable way to target a particular market.
I quit Facebook for the same reason I don’t get store cards or give Google my cell phone number (although it asks me on a nearly daily basis). In the long run, though, I suspect my efforts to limit my presence on the Big Data Grid are futile. Eventually I may just give in and let “Them” collect all the data they want. A friend of mine likes to give random middle initials to companies to see which one sold her out when the flyers, robot-calls and emails start coming.
Retailers are active users of data analytics. But in terms of volume and velocity, the financial services industry handles the most data per employee. Wall Street firms, on average, wrestle with the (nearly real-time) data byproducts of half a trillion stock trades a month. However, if you calculate industry data load by total data (not per employee), then according to estimates from McKinsey,(1) discrete manufacturing is number one. Government is second and communications/media third.
Manufacturing companies do a lot of quantitative, data-intensive R&D. Managing a supply chain and inventory kicks up a lot of data. The one advantage that manufacturing firms have is that much of the manufacturing process has been automated for several decades now. However, the industry’s existing software was built for a previous era, when data flowed more slowly and was mostly simple numbers or letters.
The next most data-intensive industry is our government. The government collects tax data, tracks passports, residential tax assessments, marriage certificates and so on. On the one hand, having our government collect and make sense of all this data could be a good thing (if all this data is used to create intelligent policy). On the other hand, it could be a threat to civil liberties if the government abused its power to closely monitor and track innocent citizens.
After the manufacturing and government industries come the communications and media industries, which deal with enormous pools of digital media data. These data stores are unstructured and massive. Digital information in this format can’t be captured in tidy rows and columns. It doesn’t correlate to alphanumeric codes or characters. Instead, media data encodes sound waves and visual information.
One of my first jobs was in a company’s Media Archive. In those creaky old days, most of the media we were given was on CDs or DVDs (and a few on VHS — yikes). Our job in the Media Archive was to upload those massive files into an image storage database and then log their identifying information into a sort of library catalog. Even back then our data storage systems were staggering under the weight of those CDs and DVDs, and we were logging in just a few each week.
My co-workers and I would click “upload” and then sit frozen, praying for the system not to crash. Periodically you would hear somebody howl in frustration. This meant a computer had buckled in mid-upload and its hapless user would have to start the whole process over. I know computing power and the quality of compression algorithms have improved, but I can’t imagine how organizations are managing to keep track of all the digital media they’re creating.
The world needs analytical tools that can make sense of unstructured data. Another open market is for analytical tools that can provide meaningful insight into data in real time.
Traditional spreadsheets can’t do real time analysis. Nor can custom-built reports that connect to a legacy enterprise database. Old-school static legacy analytics tools can only provide so much insight at a time. They can’t iterate results fast enough in situations where data flows in like a firehose.
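One reason streaming data defeats static reports is that a report recomputes everything from scratch, while a real-time tool must update its answer incrementally as each record arrives. Here is a minimal sketch of that idea: an online mean and variance using Welford’s algorithm, so no past data ever needs to be re-read. The readings are invented stand-ins for a live sensor feed.

```python
class RunningStats:
    """Incrementally maintained mean and variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Each new reading updates the statistics in O(1) time and memory.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two readings.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for reading in [10.0, 12.0, 11.0, 13.0]:  # stand-in for a firehose of sensor data
    stats.update(reading)
```

A spreadsheet answering the same question would re-scan the whole dataset on every refresh; at firehose volumes that recomputation is exactly what legacy tools can’t keep up with.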
The promise of real-time analysis lies in the new generation of analytical tools: software built on research in artificial intelligence and machine learning. Most low-cost data tools today offer numerical models that take a “best fit” approach based on linear regression. We’re so used to the limitations of today’s tools that it seems like fantasy to imagine a world where it’s possible to quickly and intelligently react to in-flowing data.
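For readers who haven’t seen it spelled out, the “best fit” approach the paragraph mentions is ordinary least-squares linear regression. A minimal sketch, with invented toy numbers (think ad spend versus sales):

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares best-fit line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x, both unnormalized.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Toy data: whatever the inputs, the model can only express a straight-line
# trend -- which is precisely the limitation the text points out.
slope, intercept = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
```

This is cheap and fast, but a straight line can’t capture the nonlinear, shifting patterns in real-world streams, which is where the machine-learning generation of tools comes in.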
Sure, it’s possible to load data quickly into an Oracle or SAP database. But then what? True, if you have the computing power and bandwidth and mathematical/programming ability (which most of us don’t), you could probably create a 3D model of a hypothetical situation in a digital environment. However, even a sophisticated simulation tool can only go so far.
If people could extract insight from data very quickly, it would change the way we live and work. Instant analysis of medical data could save lives or tailor a personalized-medicine regimen to an individual. People working in dangerous conditions could monitor their environment and react the instant a threat appears. Medical devices would react in response to changes in a person’s vital signs. Manufacturers could quickly fabricate custom products in response to real-time data streams. Engineers could quickly calculate the best way to distribute power resources when a massive snowstorm knocks out the power in DC, or a drought in the Midwest causes people there to crank up their air conditioning.
Someday, there will be powerful low-cost tools that can quickly crunch through all types of datasets. These tools will provide useful insight and will be easy for regular people to use. Increasingly, the raw data is available. We’re just not yet able to quickly make sense of it.
(1) “Big data: The next frontier for innovation, competition, and productivity,” McKinsey Global Institute, June 2011.
image credit: siliconrepublic.com
Melba Kurman writes and speaks about innovative tech transfer from university research labs to the commercial marketplace. Melba is the president of Triple Helix Innovation, a consulting firm dedicated to improving innovation partnerships between companies and universities.