Unstructured Data Analytics and Reporting
Traditional analytics has focused primarily on structured data. Big Data, however, is primarily unstructured, so we now have three options available. We can perform quantitative analysis on structured data as before. We can extract structure from unstructured data and perform quantitative analysis on the extracted quantifications. Last, but not least, a fair amount of non-quantitative analysis is now available for unstructured data. This section explores several techniques that are rapidly becoming popular as the volume of unstructured data grows, and looks at how they are becoming mainstream thanks to their powerful capabilities for organizing, categorizing, and analyzing Big Data.
Search and Count
Google and Yahoo rapidly became household terms because of their ability to search the web for specific topics. A typical search engine lets us search documents using a set of search terms and may find a large number of candidate documents. It prioritizes the results based on preset criteria that can be influenced by how we choose the documents. If I have a lot of unstructured data, I can count words to find the ones used most often. Wordle™ (www.wordle.net) provides word clouds for the unstructured data provided to it. For example, Figure 4.1 shows a word cloud for the text used in this book. The font size represents the number of times a word was used in the text. These counts can also be laid out against other known dimensions. For example, this summer we were working on unstructured data analytics for a CSP in India. We received a large quantity of unstructured text. Our first exercise was to use the Text Analytics capabilities in Cognos® Consumer Insight (CCI) to study how the use of key words changed over time. Figure 4.2 shows the results of this word count plotted against time.
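The two counting exercises described above can be sketched in a few lines of Python. This is only a minimal illustration, not how Wordle or CCI work internally: the stop-word list, the function names, and the sample records are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "for", "on"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def word_counts(text):
    """Count how often each non-stop word appears (the basis of a word cloud)."""
    return Counter(w for w in tokenize(text) if w not in STOP_WORDS)

def counts_by_period(records, keywords):
    """Count keyword occurrences per period; records are (period, text) pairs."""
    table = defaultdict(Counter)
    for period, text in records:
        for word in tokenize(text):
            if word in keywords:
                table[period][word] += 1
    return table

# Invented sample data standing in for a large body of unstructured text.
records = [
    ("2012-06", "The 3g network is slow. Waiting for 4g."),
    ("2012-07", "4g launch is near; 4g plans look good, 3g still slow."),
]

overall = word_counts(" ".join(text for _, text in records))
trend = counts_by_period(records, {"3g", "4g"})

print(overall.most_common(3))
for period in sorted(trend):
    print(period, dict(trend[period]))
```

The overall counts would drive the font sizes in a word cloud, while the per-period table holds the kind of data that can be plotted against time.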
Context-Sensitive and Domain-Specific Searches
Anyone with telecommunications knowledge can easily understand what “3g” and “4g” in Figure 4.2 refer to. Context-sensitive search engines can differentiate between “gold medal” (Olympics) and “gold bullion” (commodity trading). Some search engines are also fine-tuned for industry or corporate terms. Vivisimo offers the capability to specialize a search engine for a specific purpose, thereby fine-tuning it for corporate terms when used inside a corporate intranet.
Figure 4.1: Wordle™ Word Cloud
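One simple way to picture the “gold medal” versus “gold bullion” distinction is to look at the words that appear near the ambiguous term. The sketch below is a toy illustration under stated assumptions, not a description of how any commercial search engine works: the cue-word lists, the window size, and the function name are all hypothetical.

```python
import re

# Hypothetical cue words for each sense of an ambiguous term.
SENSE_CUES = {
    "olympics": {"medal", "athlete", "games", "podium", "won"},
    "commodity trading": {"bullion", "price", "ounce", "futures", "market"},
}

def disambiguate(text, term="gold", window=3):
    """Return the sense whose cue words occur most often near the term.

    Looks at up to `window` tokens on each side of every occurrence of
    `term`. Returns None when no cue word is found nearby.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = {sense: 0 for sense in SENSE_CUES}
    for i, token in enumerate(tokens):
        if token != term:
            continue
        nearby = tokens[max(0, i - window): i + window + 1]
        for sense, cues in SENSE_CUES.items():
            scores[sense] += sum(1 for w in nearby if w in cues)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(disambiguate("She won the gold medal at the games"))
print(disambiguate("Gold bullion price rose per ounce"))
```

Fine-tuning for a particular industry or company could, in this simplified picture, amount to supplying cue lists drawn from that domain's own vocabulary.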
Categories and Ontology
Often, we like to classify unstructured data into categories. This gives us an understanding of the relative distribution across a known classification scheme. Let me use an example from online purchasing. I use Slice (www.slice.com) to keep track of my online purchases. Slice scans my email for any online purchases and extracts relevant information so I can track shipments, order numbers, purchase dates, and so on. Slice also lets me “slice and dice” the orders. That is, it analyzes my purchases against a set of categories to report the number of items and money spent in each category. Figure 4.3 shows Slice’s category analysis: Travel & Entertainment, Music, Electronics & Accessories, and so on. Slice must be doing rigorous unstructured analytics to understand what is considered “Movies & TV” and how that is different from “Music.”
Figure 4.2: Word count graphical display plotted against time
The classic product categories originated from the Yellow Pages. Many of us remember the printed Yellow Pages books that arrived so regularly; their categories are now being incorporated into online Yellow Pages and other shopping and ordering tools. Categories are typically tree structured: each node is a sub-class of the node above it and can be further divided into more specialized nodes. For example, a scooter is a sub-class of two-wheeler, while an electric scooter is a sub-class of scooter. A node can be a sub-class of more than one entity. A sub-class shares the attributes of its super-class. Therefore, both scooters and electric scooters should have two wheels. While the classic product catalogs were static and were managed by administrators without organized feedback, unstructured analytics provides the ability to build a dynamic hierarchy, which can be adjusted based on usage and search criteria.
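The scooter example can be written down as a small data structure. The sketch below is illustrative only: the `Category` class, the “electric vehicle” node, and every attribute other than the wheel count are assumptions added to show a node with more than one super-class.

```python
class Category:
    """A node in a category hierarchy; it may have more than one super-class."""

    def __init__(self, name, parents=(), **attributes):
        self.name = name
        self.parents = list(parents)
        self.own_attributes = attributes

    def attributes(self):
        """Merge attributes inherited from all super-classes with the node's own."""
        merged = {}
        for parent in self.parents:
            merged.update(parent.attributes())
        merged.update(self.own_attributes)  # a node's own values take precedence
        return merged

    def is_subclass_of(self, other):
        """True if `other` is this node or one of its direct or indirect super-classes."""
        return self is other or any(p.is_subclass_of(other) for p in self.parents)

two_wheeler = Category("two-wheeler", wheels=2)
electric_vehicle = Category("electric vehicle", power="battery")  # hypothetical node
scooter = Category("scooter", parents=[two_wheeler], has_footboard=True)
electric_scooter = Category("electric scooter", parents=[scooter, electric_vehicle])

print(electric_scooter.attributes())
print(electric_scooter.is_subclass_of(two_wheeler))
```

Because the electric scooter has two parents, the structure is strictly a directed graph rather than a tree. A dynamic hierarchy would go one step further and re-parent or split nodes based on observed usage, which this sketch does not attempt.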
A more general representation of conceptual entities is found in ontology, which is an abstract view of the world for some purpose.[21] An ontology defines the terms used to describe and represent an area of knowledge. Ontologies are used by people, databases, and applications that need to share domain information (a domain is a specific subject area or area of knowledge, like medicine, tool manufacturing, real estate, automobile repair, financial management, and so on) and may include classifications, relationships, and properties.[22] With formal ontology, we can create a “Semantic Web,” which can provide structural extracts to machines, thereby providing them with the ability to extract, analyze, and manipulate the data.
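To make the idea of classifications, relationships, and properties more concrete, the sketch below stores a tiny ontology as subject-predicate-object statements and lets a program query it. It is a plain-Python illustration, not a formal ontology language; the predicate names (`subClassOf`, `hasWheelCount`, `soldBy`) and the statements themselves are invented for the example.

```python
# Each statement is a (subject, predicate, object) triple; all names are hypothetical.
TRIPLES = {
    ("ElectricScooter", "subClassOf", "Scooter"),   # classification
    ("Scooter", "subClassOf", "TwoWheeler"),        # classification
    ("TwoWheeler", "hasWheelCount", "2"),           # property
    ("Scooter", "soldBy", "Dealer"),                # relationship
}

def objects(subject, predicate):
    """Return every object directly stated for a subject and predicate."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}

def ancestors(cls):
    """Follow subClassOf links transitively to collect all super-classes."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for parent in objects(current, "subClassOf"):
            if parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found

def inherited(cls, predicate):
    """Values stated for the class itself or for any of its super-classes."""
    values = set(objects(cls, predicate))
    for ancestor in ancestors(cls):
        values |= objects(ancestor, predicate)
    return values

print(ancestors("ElectricScooter"))
print(inherited("ElectricScooter", "hasWheelCount"))
```

Because the statements are explicit and machine-readable, a program can derive facts that were never stated directly, such as the wheel count of an electric scooter.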