…completely incorrect results that would have been undetected without their recommended visual verification. One can therefore argue that the practice of machine learning depends fully on visualization for strategy, verification, and communication.

With the well-documented shortage of data scientists, some are turning to automating machine learning. The tuning and optimization required for machine learning algorithms are certainly achievable, particularly for well-defined problems or re-analyses given new data. However, the correct interpretation of a given algorithm's output still requires an understanding of the data. We have found that fully numeric tools built for a specific analysis are prone to incorrect interpretations, even with user training. A better alternative is to provide visualizations with guiding explanations that help users reach the correct interpretation.

Although the formal Big Data movement may have passed, leveraging large and diverse data sets continues to be fundamental to business intelligence. Both the business and science fields recognize that an isolated event has value, but that value can be increased with additional context obtained by integrating additional data sets and applying machine learning techniques. In aggregating data, the number of features (i.e., the dimensionality) increases substantially. However, our standard bar and pie charts can only handle one-dimensional data, allowing an examination of each feature separately. Scatter plots give two dimensions, so we can examine all pairs of features, possibly with color or other glyphs, but cannot easily see the more complex and less obvious patterns. The more advanced parallel coordinates visualization can handle about ten dimensions, but we frequently have data with hundreds or thousands of features. Machine learning methods such as principal component analysis (PCA) can map the high-dimensional space into two dimensions for a scatter plot, and newer methods such as t-SNE balance the local and global similarities displayed (a sketch of this step appears below). These approaches not only empower the user to understand the final analysis, but also provide a window into the nuances of the raw data.

The full data science pipeline is a multi-step process that evaluates hypotheses on high-dimensional data and generally conveys the results through visualization. The machine learning components of the pipeline can find the outliers, characterize the importance of each feature, find the key patterns, and provide the answers. Still, different analysis pipelines for the same problem may produce results that vary as a function of time, use case, target population, and so on. A further visualization challenge is how to present these combinatorial sets of results to the user under the constraint of Miller's Law, which states that human capacity for processing information is limited to 7±2 objects at a time. Current research in Human-Computer Interaction (HCI) continues to tackle this problem by exploring how users can navigate and effectively comprehend multiple sets of information. Not surprisingly, machine learning research is also being used to recommend sequences of views and to identify interesting regions to flag for the user.
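To make the projection step concrete, here is a minimal sketch using scikit-learn and matplotlib; the bundled digits data set (64 features) merely stands in for a real table with hundreds or thousands of columns, and the perplexity setting is an illustrative default rather than a recommendation.

```python
# Project high-dimensional data to 2-D with PCA and t-SNE for scatter plots.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)       # 64 features per sample
X = StandardScaler().fit_transform(X)     # both methods are scale sensitive

# Linear projection: fast and preserves global variance structure.
pca_2d = PCA(n_components=2).fit_transform(X)

# Nonlinear embedding: balances local and global similarities.
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(pca_2d[:, 0], pca_2d[:, 1], c=y, s=5, cmap="tab10")
ax1.set_title("PCA projection")
ax2.scatter(tsne_2d[:, 0], tsne_2d[:, 1], c=y, s=5, cmap="tab10")
ax2.set_title("t-SNE embedding")
plt.show()
```

Plotting both views side by side is deliberate: the two methods often disagree, and that disagreement is exactly the kind of nuance in the raw data that a purely numeric summary would hide.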
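The outlier-flagging and feature-importance steps of such a pipeline can be sketched just as briefly. The data set, the choice of an isolation forest and a random forest, and the cut-off of seven features (a nod to Miller's Law) are assumptions for illustration, not a prescribed design.

```python
# Sketch of two machine learning steps in an analysis pipeline:
# flag anomalous rows, then surface only a handful of important features.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest, RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target

# Flag the most anomalous rows for visual inspection (-1 marks an outlier).
outlier_flags = IsolationForest(random_state=0).fit_predict(X)
print(f"{np.sum(outlier_flags == -1)} rows flagged as outliers")

# Rank features by importance and keep only the top seven for display.
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = np.argsort(model.feature_importances_)[::-1][:7]
for i in top:
    print(f"{data.feature_names[i]:25s} {model.feature_importances_[i]:.3f}")
```

Limiting the display to a short, ranked list keeps the analyst within the 7±2 working-memory budget cited above.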
In this data science era, visualizations guide data scientists to select appropriate analysis strategies and help communicate results. Machine learning facilitates multidimensional data visualizations and identifies the salient patterns to highlight. The interdisciplinary visualization field is working to combine information-theoretic and human cognitive models into a unified framework. This will facilitate better comprehension of raw data, provide insight into machine learning techniques, and convey the correct interpretations of results.