At this point in the series of articles on the Data Analytic Lifecycle, raw data has been identified and imported into a Data Analytics sandbox. The data (a mix of structured and unstructured) is depicted below. The sandbox now contains a large amount of data representing research and innovation activities occurring globally throughout my corporation (EMC).
At this point in the lifecycle, the data scientists and engineers should begin to “get used to the data”. This means using any number of tools to inspect the format, structure, and quality of the data. They have already immersed themselves in the analytic plan generated in Phase 1, and they are likely looking for data sources to validate the hypotheses in that plan.
This process will almost certainly identify “gaps” in the data that will prevent the data scientists from being able to prove or disprove the hypotheses.
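To make the idea of spotting “gaps” concrete, here is a minimal sketch of a first-pass profile, written in Python with only the standard library. The records and field names (`activity`, `month`, `university`) are hypothetical and are not the actual sandbox schema; the point is only to show how missing values surface quickly once each column is counted.

```python
# Hypothetical sample records; the field names are illustrative only.
records = [
    {"activity": "university visit", "month": "2011-11", "university": "University A"},
    {"activity": "RAT meeting", "month": "2012-01", "university": ""},
    {"activity": "employee lecture", "month": None, "university": "University B"},
    {"activity": "university visit", "month": "2012-02"},  # no university recorded
]


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or str(value).strip() == ""


def profile(rows):
    """Return, per column, the number of missing and distinct values."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]  # absent keys become None
        present = [v for v in values if not is_missing(v)]
        report[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report


report = profile(records)
for col, stats in report.items():
    print(col, stats)
```

A column with many missing values, such as `university` in this toy sample, is the kind of gap that would need to be resolved before a hypothesis depending on it could be tested.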
At my corporation we have created a volunteer team of data scientists who ensure that our approach scales globally. One of these data scientists is Vladimir Suvorov of Saint Petersburg, Russia. Vladimir described the tools and approach he used to explore the data:
R-studio provides the easiest way for initially examining the data placed within a sandbox. R-studio provides a simple connection string and SQL query on one side and a powerful statistics and graphing package on the other. Besides that, R is open-source software, so it can present additional value for the community. I recommend R for fast prototyping and quick summaries of the data.
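The pattern Vladimir describes (a connection, a SQL query, then a quick statistical summary) can be sketched in a few lines. This is not his R code: it is an illustrative Python equivalent that uses an in-memory SQLite database as a stand-in for the sandbox, and the table name, column names, and rows are all assumptions made for the example.

```python
import sqlite3
import statistics

# In-memory SQLite stands in for the real sandbox connection (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activities (activity_type TEXT, month TEXT)")
conn.executemany(
    "INSERT INTO activities VALUES (?, ?)",
    [
        ("university visit", "2011-11"),
        ("university visit", "2011-12"),
        ("university visit", "2011-12"),
        ("RAT meeting", "2012-01"),
        ("RAT meeting", "2012-01"),
        ("university visit", "2012-01"),
        ("employee lecture", "2012-02"),
        ("RAT meeting", "2012-02"),
    ],
)

# One SQL query pulls the number of activities per month.
query = """
    SELECT month, COUNT(*) AS n
    FROM activities
    GROUP BY month
    ORDER BY month
"""
monthly_counts = conn.execute(query).fetchall()
conn.close()

# A quick descriptive summary of the monthly counts.
counts = [n for _, n in monthly_counts]
summary = {
    "months": len(counts),
    "total": sum(counts),
    "mean": statistics.mean(counts),
    "max": max(counts),
}
print(monthly_counts)
print(summary)
```

The same two steps, query then summarize, are what make this kind of tooling useful for fast prototyping, whichever language is used.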
Vladimir created a chart that focuses solely on university activities. This chart uses color coding to categorize the types of university activities that have occurred in previous months. What is happening here is broadly considered “data exploration”. In this phase, people like Vladimir explore the data, assess its quality, examine basic information about it (relationships in the data, trends, etc.), and try to understand what kind of data they actually have. This lets them form ideas about what they can test in later phases, and what kinds of insights they may be able to work towards.
Vladimir’s simple visual chart shows the pace of visits to universities (yellow), as well as the emergence of a dark blue color, which represents meetings of the Research Advisory Team (RAT) for the purpose of discussing university research funding for 2012. These meetings continue throughout the first quarter of the current year (2012-03).
The relatively small number of employee lectures at universities (red) and professor visits to EMC (aqua) is a potential indicator that programs could be put in place to accelerate these types of exchanges.
This chart falls under the category of “descriptive statistics”.
It’s an important activity in Phase 2 which, as just explained, can already lead to actionable conclusions.
During the Data Science and Big Data course, I learned that bar charts like this, while helpful for the data exploration phase, may not be the best choice in later phases, and that they can be improved with a few simple tips. Stacked bar charts are good for showing and exploring a small number of aggregated data points. When looking at data over time (as above), it is generally better to use line charts, which make trends easy to see (people generally have a harder time reading trends in stacked bar charts). Likewise, best practice is to use soft colors for most of the chart and reserve strong emphasis colors for highlighting key points; red also signals things like “danger” or problem areas, especially in certain countries. In summary, a graphic like this gives us good initial information at this stage in the process. For the more polished visualizations used later in the lifecycle, data scientists need to address these kinds of aesthetic considerations.
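As a rough illustration of the line-chart tip, the sketch below reshapes a list of events into one series of monthly counts per activity type, which is the shape a line chart needs (one line per type, months on the x-axis). It is written in Python with the standard library only; the event data and the `to_series` helper are hypothetical and are not taken from Vladimir's chart.

```python
from collections import Counter

# Hypothetical (month, activity type) events; not the real sandbox data.
events = [
    ("2011-12", "university visit"),
    ("2011-12", "university visit"),
    ("2012-01", "university visit"),
    ("2012-01", "RAT meeting"),
    ("2012-02", "RAT meeting"),
    ("2012-02", "RAT meeting"),
    ("2012-02", "employee lecture"),
]


def to_series(rows):
    """Return the sorted months and one list of monthly counts per activity type."""
    counts = Counter(rows)
    months = sorted({month for month, _ in rows})
    types = sorted({activity for _, activity in rows})
    # Counter returns 0 for month/type pairs that never occurred.
    series = {t: [counts[(m, t)] for m in months] for t in types}
    return months, series


months, series = to_series(events)
print(months)
for activity, values in series.items():
    print(activity, values)
```

Each list in `series` can then be drawn as its own line by whatever plotting tool is at hand, which makes a rising trend such as the RAT meetings, or a falling one such as university visits, visible at a glance.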
During the data exploration exercise, the data scientists and engineers begin to notice that (a) certain data needs conditioning and/or normalization, and (b) some data sets that are critical to proving the analytic hypotheses cannot be found anywhere.
These activities will be further profiled in the next post. Once again, thanks to David Dietrich, who not only taught the course I attended, but continues to oversee this series of posts.
image credit: silverdane.com
Steve Todd is Director at EMC Innovation Network, a high-tech inventor, and author of the book “Innovate With Global Influence”. An EMC Intrapreneur with over 180 patent applications and billions in product revenue, he writes about innovation on his personal blog, the Information Playground. Twitter: @SteveTodd