A data lake is essentially a library, ideally staffed with experienced employees who can handle and find data in multiple media.

…catalog and data dictionary, and for non-tabular data, by creating tags. Tags are structures in which OCR, text scanning, and/or a collection of assigned attributes/keywords may be used to organize files or objects. For instance, a blog entry and a magazine subscription share many structural and content similarities, even though a sea of differences exists between them. Librarians and Data Architects can use tags to group different kinds of data together, based on both internal structure and content. Tags can be used for data management as well, with tags for security rating and type, data lifecycle, update frequency, source system (lineage), subject area, owner/steward/SME, downstream uses (impact), and business purpose.
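Those management tags amount to a simple, consistent schema attached to every object in the lake. As a minimal sketch, assuming a Python environment, the tag set above might be modeled like this (the class, field names, and example values are illustrative, not taken from any particular catalog tool):

```python
# Illustrative model of the data-management tags named in the article.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataAssetTags:
    security_rating: str   # e.g. "internal", "confidential"
    security_type: str     # e.g. "PII", "financial"
    lifecycle_stage: str   # e.g. "active", "archived"
    update_frequency: str  # e.g. "daily"
    source_system: str     # lineage: where the data came from
    subject_area: str      # e.g. "billing"
    owner: str             # accountable owner
    steward: str           # day-to-day steward
    sme: str               # subject-matter expert
    business_purpose: str
    downstream_uses: List[str] = field(default_factory=list)  # impact analysis
    keywords: List[str] = field(default_factory=list)         # free-form search tags

# Example: tagging one object in the lake (all values hypothetical).
invoices = DataAssetTags(
    security_rating="internal",
    security_type="financial",
    lifecycle_stage="active",
    update_frequency="daily",
    source_system="ERP",
    subject_area="billing",
    owner="Finance",
    steward="jdoe",
    sme="asmith",
    business_purpose="invoice reconciliation",
    downstream_uses=["monthly revenue report"],
    keywords=["invoice", "billing"],
)
```

Because every asset carries the same fields, the tags can drive security reviews, lifecycle cleanup, and impact analysis as simple queries rather than manual research.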
Finding the Right Library Team for a Data Lake

Data is everywhere, but knowing everything about the data is crucial. Create a dedicated staff that is interested in finding out where data comes from and which data is better for which use, and that is detail-oriented enough to produce good documentation from what they find. Library Science is the study of organizing data to make it easily searchable and more accessible in an orderly fashion, which is exactly what a Data Lake needs.

Every company has some undocumented data and processes, usually manual "data hammers" someone created to get around a systemic or data access issue. Create documentation that reflects the current understanding, and then try to automate any manual processes by fixing the systemic or access issue that caused them in the first place. If that is not possible, find another automated way to serve that need using the data lake. Beyond that, it is mostly a matter of being a good caretaker of the data and its users, much as librarians are good caretakers of the library's contents and patrons. A library does more than house paper documents; it also offers patron education and entertainment programs, and digital media content and creation facilities. A good data management team provides education and training on best practices, just as a library would.

Challenges of a Data Architect and Advice to Overcome Them

In some organizations, building a business case for funding centralized Data Management is difficult due to the lack of a direct ROI. The justification should be framed as an enablement investment, similar to investing in Enterprise Security, or even in clean drinking water. The main expenses in this business case are increased staff, cloud costs, and funds to acquire tools that standardize and automate data management processes.

The biggest challenge may be the lack of an enterprise-level overview: seeing how the many parts of the organization fit into a complete whole. In some organizations, departments operate within their own silos with their blinders on. The data flows across those silos, and each level may keep its own copy, resembling fractured mirrors. A lack of integration in both data and documentation can leave employees unaware of what data really exists within the organization, so they merely use whatever information they can get their hands on. Unavailable documentation is replaced by tribal knowledge, which is only as good as whom one knows and what those people remember, and research takes more time. Good Data Architects look for the big picture, and how things fit together, rather than point solutions.

Another challenge is getting all employees to realize the importance of data to an organization's success. Data needs to be accurate, consistent, relevant, and easily available. Employees also need to learn to share data rather than create their own copies. This prevents a hoarding culture, where databases become hugely bloated as each level keeps its own version of the data, leading to incredible expense and waste.

My advice to fellow Data Architects would be to prevent bad data from entering the systems. Cutting it off at the source is the first and foremost job. Of course, this is difficult for organizations that have undocumented or unsupported interfaces. In those cases, the best solution is to find (through data profiling) and fix the erroneous data in the databases where it enters the organization, before it passes to downstream systems and creates potential downstream disasters.
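As a minimal sketch of that profiling-then-fixing step, assuming a Python/pandas environment: the feed name, column names, and validation rules below are hypothetical stand-ins for whatever actually arrives at an organization's entry-point databases.

```python
# Profile an incoming feed and quarantine bad rows before they move downstream.
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: null counts, distinct values, and a sample value."""
    return pd.DataFrame({
        "nulls": df.isna().sum(),
        "null_pct": (df.isna().mean() * 100).round(1),
        "distinct": df.nunique(),
        "sample": df.apply(lambda col: col.dropna().iloc[0] if col.notna().any() else None),
    })

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Flag rows that would cause downstream problems (rules are illustrative)."""
    bad = pd.Series(False, index=df.index)
    bad |= df["customer_id"].isna()                 # required key is missing
    bad |= ~df["email"].str.contains("@", na=True)  # present but malformed email
    bad |= df["amount"] < 0                         # impossible business value
    return df[bad]

incoming = pd.read_csv("daily_feed.csv")  # hypothetical entry-point feed
print(profile(incoming))                  # inspect before trusting
quarantined = validate(incoming)          # fix or reject these rows at the source
```

The point is the sequencing: profile and quarantine at the point of entry, so downstream systems only ever see data that has already passed the checks.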
Susan Earley