The finishing touch for Phase 1 of the Data Analytics lifecycle is the creation of an Analytic Plan. In the same way that requirements drive all phases of a software project, the analytic plan lays the foundation for all of the work in an analytics project.
I’ve mentioned in previous posts that this part is not easy. Analytic Plans are new to me. Before starting, I need to give credit where credit is due. David Dietrich has been a driving force behind our Data Science and Big Data Analytics curriculum, and a regular contributor to this series of articles.
There are four initial components of an Analytic Plan:
1. Framing of the Business Problem
In my case I am trying to accelerate innovation within my corporation (EMC). Three problems faced by the corporation are (a) tracking knowledge growth throughout our global employee base, (b) ensuring that this knowledge is effectively transferred within the corporation, and (c) ensuring that this knowledge is most effectively converted into corporate assets. Executing on these three elements more effectively should accelerate innovation, which is the lifeblood of our company.
2. Initial Hypothesis
In my last post I described eight different initial hypotheses theorizing how analytics can assist in solving the business problem. These eight hypotheses were boiled down to one high-level hypothesis statement:
An increase in geographic knowledge transfer improves the speed of idea delivery.
This hypothesis paves the way for what data we will need and what type of analytic methods we will likely use.
3. Data

The data that the project will rely on fall into two categories.
- The first category represents five years’ worth of idea submissions into EMC’s Innovation Showcase process. The Showcase process is a formal, organic innovation process whereby employee ideas from around the globe are submitted, vetted, judged, and incubated. The data is a mix of structured (idea counts, submission dates, inventor names) and unstructured (the ideas themselves) content.
- The second category encompasses minutes and notes representing innovation and research activity from around the world. This data is also a mix of structured and unstructured. The structured data, once again, includes items such as dates, names, and geographic location. The unstructured documents contain the “who, what, when, and where” information that represents rich data about knowledge growth and transfer within the company. This type of information, however, is often stored in business silos that have little to no visibility across disparate research teams.
The first repository (the idea submissions) is centralized. The second data set (decentralized research and innovation minutes/notes) will be gathered from throughout the corporation and will contain six months’ worth of global data.
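To make the data definition concrete, here is a rough sketch of what a single idea-submission record might look like in code. Every field name here is a hypothetical illustration of the structured/unstructured mix described above, not the actual Showcase schema.

```python
from dataclasses import dataclass

@dataclass
class IdeaSubmission:
    # Structured fields: counts, dates, names, locations (hypothetical schema)
    idea_id: int
    submitted_on: str   # ISO date string, e.g. "2011-03-15"
    inventors: list     # inventor names
    location: str       # geographic location of the submitting team
    # Unstructured field: the idea text itself
    description: str = ""

# A hypothetical record mixing structured and unstructured content
record = IdeaSubmission(
    idea_id=1042,
    submitted_on="2011-03-15",
    inventors=["A. Researcher", "B. Engineer"],
    location="Cork, Ireland",
    description="Deduplicate backup streams at the block level.",
)
```

The structured fields support counting and joining across repositories; the `description` field is what the text-analytic techniques in the next section would operate on.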
4. Model Planning – Analytic Technique
Model Planning represents the conversion of the business problem into a data definition and a potential analytic approach. In other words, the rubber is beginning to hit the road in terms of creating algorithms. A model contains the initial ideas on how to frame the business problem as an analytic challenge that can be solved quantitatively. There is a strong link between the hypotheses and the analytic techniques that will eventually be chosen. Described below are a few algorithms and approaches that make sense given the hypotheses. They do not represent a complete list, but they give the reader a sense of this activity within the analytic plan.
Keep in mind that model selection is an “art form”. Some people are better at it than others. It requires iteration and overlap with Phase 2 (Data Prep). Multiple types of models can be applicable to the same business problem, and the selection of methods can vary depending on the Data Scientist’s experience and comfort zone. In other cases model selection is more strongly dictated by the problem set.
- Use Map/Reduce for extracting knowledge from unstructured documents. At the highest level, Map/Reduce imposes a structure on unstructured information by transforming the content into a series of key/value pairs. Map/Reduce can also be used to establish relationships between innovators/researchers discussing the knowledge.
- Natural language processing (NLP) can extract “features” from documents, such as strategic research themes, and can store them into vectors.
- After vectorization, several other techniques would be appropriate:
- Clustering (e.g. k-means clustering) can find “clouds” within the data (e.g. create ‘k’ types of themes from a set of documents).
- Classification can be used to place documents into different categories (e.g. university visits, idea submission, internal design meeting).
- Regression analysis can focus on the relationship between an outcome and its input variables. What happens when an independent variable changes? It can help in predicting outcomes. This could suggest where to apply resources for a given set of ideas.
- Graph theory (e.g. Social Network Analysis) will be an important way to establish relationships between employees who are submitting ideas and/or collaborating on research.
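The Map/Reduce idea above can be sketched in plain Python: the map step turns each unstructured document into a stream of key/value pairs (here, `(term, 1)`), and the reduce step aggregates values per key. This is a toy single-process illustration of the pattern, not Hadoop code, and the sample minutes are made up.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit a (key, value) pair for each term in an unstructured document."""
    for term in text.lower().split():
        yield term, 1

def reduce_phase(pairs):
    """Reduce: aggregate values by key, imposing structure on the pair stream."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# Hypothetical meeting-minutes snippets
docs = {
    "minutes-001": "knowledge transfer across research teams",
    "minutes-002": "knowledge transfer between global teams",
}

pairs = (pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text))
term_counts = reduce_phase(pairs)
# "knowledge" and "transfer" each appear in both documents
```

In a real Hadoop deployment the map and reduce phases would run in parallel across the cluster, but the key/value contract between the two phases is the same.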
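The vectorization-then-clustering pipeline can also be sketched end to end. The toy below builds bag-of-words count vectors (a stand-in for real NLP feature extraction) and runs a minimal k-means over them; production work would use a real library, and the four sample documents are fabricated for illustration.

```python
import math
import random
from collections import Counter

def vectorize(docs):
    """Turn documents into bag-of-words count vectors over a shared vocabulary."""
    vocab = sorted({t for d in docs for t in d.lower().split()})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0.0] * len(vocab)
        for t, n in Counter(d.lower().split()).items():
            v[index[t]] = float(n)
        vectors.append(v)
    return vectors, vocab

def kmeans(vectors, k, iterations=20, seed=0):
    """Minimal k-means: assign each vector to its nearest centroid, then recompute."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        for i, v in enumerate(vectors):
            labels[i] = min(range(k), key=lambda c: math.dist(v, centroids[c]))
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical document snippets: two storage ideas, two university events
docs = [
    "storage deduplication backup",
    "backup storage deduplication design",
    "university visit research seminar",
    "research seminar university talk",
]
vectors, vocab = vectorize(docs)
labels = kmeans(vectors, k=2)
```

With `k=2`, the two storage ideas and the two university events should land in separate clusters: the “clouds” of themes the bullet above describes.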
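The graph-theory bullet can be illustrated with a small collaboration graph: each co-submitted idea adds an edge between its inventors, and degree centrality then surfaces the most connected employees. The submission lists below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_collaboration_graph(submissions):
    """Edges connect employees who co-submitted an idea; weights count collaborations."""
    graph = defaultdict(lambda: defaultdict(int))
    for inventors in submissions:
        for a, b in combinations(sorted(set(inventors)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph

# Hypothetical co-submission data from the idea repository
submissions = [
    ["alice", "bob"],
    ["alice", "carol"],
    ["alice", "bob", "dave"],
]
graph = build_collaboration_graph(submissions)

# Degree centrality: how many distinct collaborators each employee has
degree = {person: len(neighbors) for person, neighbors in graph.items()}
```

Here `alice` has the highest degree (three distinct collaborators), and the edge weight between `alice` and `bob` records their two joint submissions: exactly the kind of relationship structure Social Network Analysis would examine at scale.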
At this point I have generated some hypotheses, described potential data sets, and chosen some potential models for proving or disproving the hypotheses. During this process I have been sharing my thoughts in bits and pieces with my peers, and I feel confident that I have enough data to draft a high-level analytic plan and submit it for formal review. I’ve attached a template slide below.
The last two rows in the Analytic Plan overview (Results & Key Findings, Business Impact) are a reminder to me that I am working toward Step 5 of the Analytic Lifecycle: Communicate the Results. As the business user I participate most heavily in the beginning and the end of the Lifecycle.
I’ve spent a lot of time on this first step. Any analytic project lead should do the same. With the Analytic Plan as the foundation, it’s time to move on to Step 2: Data Prep.
image credit: stevetodd.com
Steve Todd is Director at EMC Innovation Network, a high-tech inventor, and author of the book “Innovate With Global Influence”. An EMC Intrapreneur with over 180 patent applications and billions in product revenue, he writes about innovation on his personal blog, the Information Playground. Twitter: @SteveTodd