The LAKE experience
Abstract: This paper is a case study of the Art Institute of Chicago's DAMS project. LAKE, the AIC DAMS, is based entirely on modern Web standards and open-source software built in collaboration with several cultural heritage institutions. Its first beta release was launched in September 2016 and a full release is planned for early 2017. In this paper we describe: 1) the scenario pre-dating LAKE; 2) the thought process and design phase that led to choosing the technology we are using; 3) the implementation steps and challenges; and 4) the current status and plans for expansion and long-term sustainability.
Keywords: DAMS, case study, open source, linked data, digital preservation, collaboration
This paper is a case study of LAKE, the Digital Asset Management System developed at the Art Institute of Chicago (AIC). Started in 2014, it will be fully in production by April 2017.
When evaluating a DAMS for the voluminous and complex knowledge base of the museum’s collections, the AIC Information Services team took a less common path to resolve a host of interconnected issues that cropped up during the initial research phase. Instead of simply resolving the isolated issue of image and media storage, AIC decided to engineer a system that could be further expanded to cover (and connect together) incrementally broader areas. Such an approach revealed this challenge to be in part unique to the AIC and in part common to other memory institutions.
After three years of research, planning, design, implementation, and dialogue with end users and developer communities, LAKE has become a reality, accessed daily by diverse museum departments and serving as the sole information source for all public-facing digital projects. In addition to this achievement, the discoveries from this three-year “journey into unknown waters” have resulted in invaluable knowledge capital for AIC staff and the institution more broadly. This knowledge allows staff to better comprehend the broader context of complex needs related to maintaining the information flow of a large museum with a deep history.
The authors will describe the technical design and implementation of LAKE in broad strokes, including the path leading to its current status. They will also share their views on how such experience can be collaboratively adopted by similar institutions with the goal of promoting, and hopefully initiating, a community-driven DAMS, both as a concrete product and as a set of shared standards that would facilitate cross-institutional information sharing.
The Art Institute of Chicago data repository is home to over 200,000 digitized collection items. The AIC’s Collection Management System, called CITI, is a custom-built application that the museum has maintained for over 25 years. From a simple cataloging application, it has grown into a complete management system for the entire life cycle of collection items and related events, agents, and operations.
CITI data are extensively used by many departments. Curatorial staff and the registrar are the most active producers while other departments are mainly consumers of these data: Imaging, which links collection objects to the imaging services they provide; Digital Experience (DE), which oversees public-facing digital projects; Finance and Strategy, for which collection information (and the quality thereof) drives top-level decisions for the museum.
Other areas of the museum depend on collection information to varying degrees but are at the moment not as seamlessly integrated with the main Collection Management System (CMS), if at all. Among these are Conservation, Publishing, and the Library. Cross-referencing collection information for these departments often means copying or syncing data from one system (frequently CITI) into an isolated system.
CITI has some built-in image and interpretive media manager modules. Due to the growing volume of digitized resources and the complexity of their creation and distribution process, it was decided that the museum would create a system specifically dedicated to the management, preservation, and access of collection images and documents. This was to better define the role of CITI as a CMS and avoid conflating too many “features” in it, and at the same time to avoid custom coding generic functionality that would be readily available in other products.
The AIC also has a custom-built image order management application called Phoenix, launched in 2014. This is a central tool for the Imaging department to receive, dispatch, and track photography orders from internal departments. Phoenix is closely integrated with its neighboring systems; however, once an order is completed, Imaging staff must go through several time-consuming steps that should ideally be automated. In addition, original images and resized derivatives for publishing and sharing with other AIC departments are kept in two different places, and the organization of images on the file server follows archiving practices that have changed repeatedly over the years.
LAKE (Linked Asset and Knowledge Ecosystem) was conceived from the outset as a central repository for images and other media related to collection operations. What this means in practice evolved considerably during the research and design phase, which spanned more than a year.
The requirement-gathering and concept-design process for LAKE has been described in a previous paper (Cossu & Wilcox, 2016). What we expand on here is how different the approaches to the problem were between the beginning and the end of the design phase. This phase, and the type of software involved in LAKE, deeply changed the AIC staff’s initial assumptions about how to build this system. Rather than retaining complete control over the product by developing a system from scratch, at the price of increased developer resources, or relying on a “turnkey solution,” which might have a more predictable cost but quite likely less flexibility, we identified a solution that married these two approaches: community-supported software.
The implementation of LAKE has presented many challenges. These have ranged from the unexpected and time-consuming difficulty of determining which of the many images on the Imaging Department’s file server to ingest into the repository (a result of the department’s workflow practices having changed over the years), to software bugs in the component applications. But perhaps the single greatest complicating factor can be traced to the project’s scale, which sometimes has the unfortunate effect of amplifying a challenge even when the underlying issue presents itself as a small, isolated occurrence.
Scale and complexity
To briefly describe this scale, the Art Institute of Chicago has nearly 300,000 image records in CITI. One image record generally represents a single view of an object. Those 300,000 image records readily translate to over 600,000 image files on disk, but the total number is actually much greater. Per image record, there may be a digital negative, which may be a DNG file for example; a preservation master, which is a largely unmodified TIFF; and what we call a production master, the primary derivative that has been color corrected, perhaps cropped, and yet remains an uncompressed TIFF. The production master is nominally the source of all future derivatives, whether that derivative will be shown on the Web or published in a print exhibition catalog. Then we have innumerable variations of production masters that have been modified to address specific needs, such as a printing requirement. Recently, the number of images has burgeoned as we work to create more 3D views of three-dimensional objects. This is frequently the case for sculpture. Hundreds of individual images—all TIFFs—of a three-dimensional object may be required to generate one 3D view. Finally, there are the many duplicate or stray images that create unnecessary noise. The total number of images is probably incalculable, even if it ever made good sense to try to calculate it. And all of that concerns only the number of images, not their size and the amount of storage they require; that is a similarly large and ever-expanding figure.
The preceding paragraph is not to elicit sympathy or to establish bragging rights, but to underscore the scale of the LAKE Project and how that scale can, and has, compounded its complexity. The complexity ranges from the intellectual exercise that is modeling various relationships between interconnected assets to the pragmatic issues of what computer ports need to be opened to ensure two systems can communicate. There are diagrams that illustrate the relationships between entities within a system and there are diagrams that describe how multiple systems interact with each other. There are tables that attempt to capture the names of servers, their function, and which communication ports need to be accessible or accessed. We have a table that records information about other tables.
At the heart of our implementation is Fedora, a flexible, open source, Java-based digital repository software solution that has been used by libraries and archives to manage digital assets for more than a decade. It found wide favor in those communities because not only did it perform the basic function of a DAMS but also it provided a means for an institution to implement digital preservation strategies that were, and remain, in keeping with best digital preservation practices. This is an important distinction from most commercial repository offerings that are largely, if not exclusively, focused on the management of digital assets for only the life-cycle of the specific DAMS product.
Two of the core tenets of the Fedora project are flexibility and scalability. “Flexibility” not only refers to the various media the software can handle, but also to how its design and implementation enable adopters to select their own components, such as their preferred search index. “Scalability” refers not only to plans to permit the Fedora software itself to scale horizontally, but also to the intentional choice not to embed components and functionality beyond what is essential to the software itself. In this way, Fedora and a search index, for example, can reside on separate machines, each with its own dedicated memory and computing power. (Fedora and other components can also very nicely reside together on the same machine, which is certainly a simpler solution assuming a project’s needs do not require greater resources.) While these decisions by the Fedora project are intended to empower Fedora’s adopters, they unfortunately shift the burden of integrating these components, and any complexity that may arise from said integration, to the implementers, who must not only install additional components but also ensure those add-ons can connect with Fedora correctly and reliably.
Messaging and asynchronous communication
Fedora manages the communication about repository events—content is added; content is modified; content is deleted—by publishing a notice to a built-in messaging service. This means that messaging between Fedora and, for example, a search index is not a simple process of transmitting content between point A (Fedora) and point B (a search index) at the time content is created or modified in Fedora (fig. 1). Instead, the process of indexing Fedora metadata in a search index is more akin to a data flow that invokes Point A, then Point B, then Point A again, possibly Point B again but perhaps Point C, and finally Point D. To describe that in more concrete terms, content is created or modified in Fedora and Fedora deposits a message in a queue of messages (fig. 2). Another piece of software listens for new messages. When it encounters a new message (line 1), it parses it and then acts accordingly. In the case of indexing the metadata about a Fedora resource in a search index, the metadata must first be retrieved from Fedora (line 2), massaged into whatever format is acceptable to the search index (line 3), and then transmitted to the search index (line 4). It should be pointed out that this is an asynchronous flow, meaning that when content is created in Fedora it may take time (in our case usually only a few seconds) to propagate to other systems. This design can protect against resource-hogging operations, but is not without its potential pitfalls.
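The four steps of this flow can be sketched in miniature. The following Python fragment is a simulation only: plain dictionaries stand in for Fedora and the search index, a `queue.Queue` stands in for the message broker, and the resource path and metadata fields are invented for illustration.

```python
import queue

# In-memory stand-ins for Fedora and the search index (illustrative only).
fedora = {"/assets/img1": {"title": "Untitled", "type": "StillImage"}}
search_index = {}

def index_listener(messages):
    """Consume repository event messages and index the described resources.

    Mirrors the asynchronous flow described above:
      (1) receive a message naming the changed resource,
      (2) retrieve its metadata from the repository,
      (3) massage it into the index's document format,
      (4) transmit the document to the search index.
    """
    while True:
        try:
            msg = messages.get_nowait()        # (1) pick up a new message
        except queue.Empty:
            break
        uri = msg["resource"]
        metadata = fedora[uri]                 # (2) fetch from the repository
        doc = {"id": uri, **metadata}          # (3) reshape for the index
        search_index[uri] = doc                # (4) send to the index

events = queue.Queue()
events.put({"event": "created", "resource": "/assets/img1"})
index_listener(events)
```

In a real deployment this listener runs as its own process, independently of Fedora, which is precisely what makes the flow asynchronous: the repository write returns immediately, and the index catches up moments later.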
The above describes one component (the search index) in a single “environment” (production). We of course have additional components integrated. A triplestore and LAKEshore itself—the front-end application—are two very important components not represented in the above graphics. We also have multiple environments, principally test and development regions. It is therefore safe to add a couple of components to figure 2 and then multiply the resulting environment by two. The result is 12 servers deployed and connected in our operation (though in reality it is only nine or ten, as we have some servers performing two functions). This description is not designed to scare, but it is intended to warn. Indeed, this complexity is entirely manageable, but it behooves the adopter to appreciate what lies ahead.
Managing complexity: documentation and replicability
At AIC, we’ve managed this complexity in part by documenting nearly everything, such as which machines are running at which IP addresses, with which ports open, and with what software installed. We’ve written scripts to test various aspects of the system, from scripts that verify the messaging routes ensuring Fedora metadata are represented in our Solr index are working properly, to scripts designed to crawl the entire repository and perform a post-migration validity check. We’ve started looking at applications that collect and analyze the various application logs to assist us with identifying issues that require our attention. We’ve created Docker containers—atomic virtual instances capable of hosting one or more applications—that can be run on stock workstations to simulate a complex environment like the one described above. Although somewhat elaborate, starting four Docker containers on a local laptop permits us to recreate the complexity of a four-server environment while isolating development work in a way that does not threaten one of the well-tuned environments discussed above.
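To give a flavor of the post-migration validity check, the sketch below walks a repository tree and counts everything it finds. A dictionary stands in for the repository's containment structure; the real script instead fetches each resource over HTTP and follows its containment links, and the paths and counts here are invented for illustration.

```python
# A dictionary standing in for the repository's containment tree;
# the real check fetches each resource over HTTP and follows its
# containment links instead.
repository = {
    "/": ["/assets", "/agents"],
    "/assets": ["/assets/img1", "/assets/img2"],
    "/agents": [],
    "/assets/img1": [],
    "/assets/img2": [],
}

def crawl(root):
    """Visit every resource reachable from `root` and return the list."""
    visited, stack = [], [root]
    while stack:
        uri = stack.pop()
        visited.append(uri)
        stack.extend(repository[uri])  # follow containment links
    return visited

# A validity check compares what the crawl finds against what the
# migration manifest says should exist.
found = crawl("/")
assert len(found) == 5
```

The essential idea is that the crawl produces an inventory that can be diffed against the migration manifest, surfacing resources that failed to land in the repository.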
An important element of these Docker containers is how they are created. An abstraction within an abstraction, the Docker containers are actually “created” by deploying Chef—a configuration management tool—to a container and then invoking a Chef script (called a “recipe”) that in turn provides instructions about how a server, for example, should be configured. This has proven to be a critical component. Not only can we replicate our server environment via disposable Docker containers, but we also hope to use the Chef recipes in the long term to configure our actual development, test, and production servers. Presently, a change to the configuration of one of these machines is manual, and such a change must be manually implemented in the remaining environments. In the future, our aim is to use a single Chef recipe, for example, to configure the server that runs Fedora. That single Chef recipe would then be used to ensure the Fedora machine is identically configured in all environments, thereby reducing the potential for human error and removing the possibility that a variable in one environment is masking a problem in another. Finally—and this has already proven crucial—the fact that the configuration is embedded in scripts means the configuration is documented. The recipe records which directories were created, where, and what their permissions are. And, because recipes are computer code, they are ideally checked in to a code repository, which means you have a record of every configuration change you’ve made.
Short-term discomfort, long-term gain
One question that rises above the others, after weighing all of the above, is whether the value of this complexity—in many ways unavoidable in our case simply because of the scale—is greater than its cost in resources. By giving each component its own machine, we certainly have some room for growth. This is important because long-term plans include much more than images; long-term goals include digitizing (as much as conveniently possible) the paperwork associated with our collection operations to finally bring it all under one searchable cover. Moreover, the separation of roles means that we can load images into Fedora on one machine, while another machine generates derivatives and a third machine indexes the associated metadata. Related to the concept of distributing the workload, Fedora’s implementation of messaging enables “event-driven application workflows” (“Setup Camel Message Integrations,” 2016). This means, for example, that a message from Fedora pertaining to a new image can be treated differently than a message about a new metadata resource. We have only really started to benefit from this concept, but it may prove very promising in the long run because it establishes a way to discriminate workflows based on any number of factors, sending some content thither while messages about another resource type are dropped altogether.
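The idea of discriminating workflows by message content can be illustrated with a small dispatcher. In our stack this role is played by Camel routes rather than hand-written code, and the resource types and handler actions below are hypothetical.

```python
def route(message, handlers):
    """Send a repository event to a type-specific handler, or drop it."""
    handler = handlers.get(message["type"])
    if handler is None:
        return None            # messages of uninteresting types are dropped
    return handler(message)

derivative_queue, index_queue = [], []
handlers = {
    # a new image triggers derivative generation...
    "Image": lambda m: derivative_queue.append(m["resource"]),
    # ...while a metadata resource only needs reindexing
    "Metadata": lambda m: index_queue.append(m["resource"]),
}

route({"type": "Image", "resource": "/assets/img1"}, handlers)
route({"type": "Metadata", "resource": "/works/w1"}, handlers)
route({"type": "AuditEvent", "resource": "/audit/1"}, handlers)  # dropped
```

The pattern matters because each handler can live on its own machine: the image handler on the derivative-generation server, the metadata handler on the indexing server, with the message broker doing the distribution.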
Fedora is a repository, nothing more. There is no discovery layer to Fedora, meaning there is no way to find content once it enters the repository unless you know, precisely, its identifier. This means that those additional components are “optional” only in the most theoretical sense. But from a practical standpoint they are absolutely essential. Taming them, while maintaining sanity, is about recognizing each component’s role, drawing clear boundaries, and then managing the configuration of these interconnected system components, in just the same way you’d manage the relationships between collection Objects, Agents, Exhibitions, and Transactions.
The act of implementing LAKE and managing it proved to be a much larger goal than initially designed. This is in part due to the discovery of possibilities offered by the technology we adopted, much of it very new to the AIC at that time, and the consequent growth of stakeholders’ expectations. What was initially dubbed “Phase 1” has been divided into more manageable milestones as the project progressed. This was mostly due to factors that created delays beyond our most pessimistic expectations, such as the development timeline of upstream projects and the complexity of extracting and filtering contents from our legacy systems.
Once set up, the first institution-wide release of LAKE will pave the way to further integration and migration milestones. The main challenge so far has been software and architecture design; the next main challenge will be adoption management. That is not to say that parts of the architecture will not be adjusted, improved, redesigned, or outright replaced; however, given the modular structure of LAKE, future change is very manageable.
The main concern going forward will be prioritizing features and migration projects according to institutional priorities. Every improvement needs to be done with an eye to possible expansion.
Each localized project may pose its own specific challenges also. Some departments have very large volumes of paper documentation that needs to be digitized: that may be a relatively modest challenge in terms of content modeling, but a significant investment in CPU power, storage, digitization equipment, personnel training, and raw processing time. Other departments may have very complex needs in terms of linking and aggregating resources, in order to make it easy to find them. That involves a greater effort in devising a flexible ontology that doesn’t paint us into a corner with short-sighted decisions, but also does not produce an over-engineered solution that is unwieldy for developers to maintain and for users to understand.
A separate challenge from legacy data migration is the introduction of new content and new users. In order to maintain high data quality within such a complex system, users need to be both well trained and able to navigate the user interface and established workflows seamlessly. The greater part of the custom work in LAKE, leaving aside the migration scripts, is the UX design and GUI implementation. This effort will likely grow even further as the number and diversity of users increase. The tight relationship with Digital Experience, which is leading the dialogue with stakeholders, has been very successful so far in informing the tech team’s workflow and interface design.
Another main concern for LAKE is its long-term sustainability as a project. LAKE was designed as a system that should serve its purpose for many years to come, and whose entire contents could easily be migrated to another system when its core design and concepts become obsolete.
LAKE beyond LAKE
Dealing with community-driven software
Community-supported software has its own advantages and challenges, which sometimes lie between custom in-house software and third-party solutions (especially with regard to cost and flexibility) and sometimes are specific to this approach.
The most obvious advantage of community-supported software is the community itself, and it is important that the community has enough participants, expertise, and commitment to guarantee a continuous advancement and support of the project. Fedora (http://fedorarepository.org) and Hydra (http://projecthydra.org), two projects behind the LAKE core components, are dependent on each other and are installed widely; Hydra is built on top of the Fedora repository, and Fedora needs Hydra (or another front end) to be usable. The two developer communities are intertwined so that several participants often contribute to both projects.
Collaboration is a critical component of community-driven software. One example of collaborative problem solving is an issue that was initially raised during the development of LAKEshore (https://github.com/aic-collections/aicdams-lakeshore/). LAKEshore is the user-facing side of LAKE, built by AIC on top of Hydra, which is itself dependent on a hefty stack of Ruby gems. AIC developers discovered that the conceptual model inherited from the upstream project was limited in the way relationships between files and other entities were handled. Addressing this properly required code changes several levels deeper than LAKEshore. If AIC were to proceed on its own, it would have to fork several dependent projects and maintain its one-off implementations. Instead, after some discussion with key contributors, we learned that other institutions shared this concern and decided to address the issue collaboratively, carrying out the necessary changes as part of the main development of the affected components. A working group was created (https://wiki.duraspace.org/display/hydra/FileSets+Working+Group) and interested parties have been involved in the design and implementation of the desired changes. The process is taking far longer than it would have had AIC proceeded alone, but it will result in significantly less maintenance in the long run.
A significant challenge of community-driven software is that, even with an extremely helpful community, one may not get an immediate response to feature requests or even bug fixes. This aspect, a trade-off for having other people help design and write code, needs to be understood and managed properly. The above-described example of collaboration would not have worked if AIC had had a tight timeline for implementing the feature in question, or had lacked a temporary workaround for it. In fact, on a separate occasion, a quite different approach was taken.
AIC needed a simple image ingest API for LAKEshore. Since this API is integrated into our custom interface, it was possible to build it directly in LAKEshore, without any direct involvement of other community members. This choice was also driven by timing constraints. Once deemed complete within the specific context of the AIC needs, the API specs (https://github.com/aic-collections/aicdams-lakeshore/wiki/REST-API:-CRUD) and code were shared and discussed with the community. Although the API is specific to AIC and although AIC currently lacks the bandwidth to generalize the code for broader applications, this project raised the interest of other institutions who have expressed a desire to create a more generic, portable API component.
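To give a flavor of what such an ingest call involves, here is a minimal client sketch. The endpoint path and payload fields are hypothetical (the authoritative specs are in the wiki page linked above), and the HTTP layer is injected as a callable so the example runs without a server.

```python
import json

def ingest_image(transport, file_path, metadata):
    """POST a new image asset to a LAKEshore-style ingest endpoint.

    `transport` is any callable taking (url, body), so the HTTP layer
    can be swapped for a real client; the endpoint and payload fields
    below are illustrative, not the actual AIC API.
    """
    body = json.dumps({"file": file_path, "metadata": metadata})
    return transport("/api/v1/assets", body)

sent = []
def fake_transport(url, body):
    """Record the request instead of sending it over the network."""
    sent.append((url, body))
    return {"status": 201}

response = ingest_image(fake_transport, "img1.tif", {"title": "Untitled"})
```

Injecting the transport is also what made the API easy to exercise in isolation, a useful property when the surrounding stack is as deep as the one described earlier.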
Although taking advantage of open-source solutions can be a strategic choice when considering budgets, it is important to recognize the need to contribute back to a project, not only because it is a form of community participation, but also because it may be in your own interest, depending on how much you rely on a given open-source project. Fedora is a good case in point for AIC: it is critical to the LAKE system design. The obvious contribution to a product is programming skill, but Fedora is written in Java, which is not a strong skill in our department. Instead, we focused on testing Fedora. Some of these were general tests to capture performance metrics, but critically, others were designed to vet Fedora releases and ensure proper functionality of the software. It was an area that was lacking in the community, so we made it our contribution.
A DAMS for museums
Given the design and purpose of LAKE, it seems natural that this project may eventually become of interest to other museums. This would be a very welcome evolution of the project. LAKE has been open source from the outset, both in the components developed by AIC and in the ones developed by third parties, so it is well poised for this step. Previous examples such as CollectiveAccess (http://www.collectiveaccess.org/) and Omeka (http://omeka.org/) have demonstrated the advantage of cross-institutional projects, especially in the humanities field, where information is made to be shared and gains value by being shared.
LAKE, as the first museum-focused DAMS based on Hydra and Fedora, has already helped shape the mission of these two projects and the communities around them. These communities have been very receptive to the museum-specific issues raised by AIC and have acknowledged the need to expand the usage of Fedora and Hydra beyond the library and research data fields. Library communities learning about museum models has strengthened Hydra, in much the same way that museums stand to acquire additional knowledge and strategies from the academic library field while building bridges within the so-called LAM (Libraries, Archives, Museums) community. The advantages are multiple, the most obvious being working together toward shared goals.
In theory it is possible for any institution with appropriate technical expertise and server resources to run LAKE, but currently there are a few caveats to bear in mind:
- Each of the LAKE components is built using a specific technology and has its own learning curve. Configuring each component requires a certain degree of knowledge about the system and its setup. In order to troubleshoot issues that arise in a running system, more in-depth knowledge may be needed.
- Those components work together in an asynchronous fashion. Each of them needs to be installed separately. For something more than a proof of concept or sandbox, most of these systems are best installed on a separate server so that each has its own devoted resources.
- Even if LAKE were successfully set up to manage digital assets at a given institution, it would not integrate any collection data. LAKE, as installed at AIC, gets non-asset resources (Works, Agents, Exhibitions, Shipments, etc.) synchronized through Combine, a custom ETL framework that is not shared with the community.
Bearing in mind these three warnings, LAKE constitutes the starting point of a possible community-wide adoption effort. All of the above are resolvable given the appropriate time and resources:
- AIC uses Docker and Chef to deploy sandbox versions of the various LAKE components. This can save a potential adopter, someone new to the system, the daunting task of figuring out what to download, how it works and how it fits together with the rest.
- The same technology can be used to create an “almost-ready-to-run” container that abstracts all the complex setup details in one configuration file, carefully crafted to only expose the relevant information. Beyond a sandbox or evaluation environment, more specific knowledge of the individual technologies involved is still required. However, it is also assumed that any DAMS of comparable size and scope would somehow need at least one in-house maintainer (or a contract with a specialized service provider).
- The external data integration point is arguably the one demanding the largest investment of resources, and the one most unlikely to become universal. The diversity (or rather, inconsistency) of data repositories (both collection data and legacy file and metadata stores) among museums is much more severe than among libraries, which have been adhering to metadata standards for a long time. In the AIC case, each specific departmental digital and non-digital asset repository requires its own migration logic. One strategy that might serve a number of museums is to consider co-developing data extraction and migration tools that target common museum products, such as TMS or KE EMu.
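As a sketch of what a shared extraction tool might look like, the fragment below maps a single hypothetical TMS-style object row to a LAKE-style resource. Every field name on both sides is invented for illustration; a co-developed tool would presumably keep such per-system mappings in configuration rather than code.

```python
def tms_row_to_resource(row):
    """Map one (hypothetical) TMS object row to a LAKE-style resource dict."""
    return {
        "type": "Work",
        "uid": "WO-{}".format(row["ObjectID"]),
        "pref_label": row["Title"],
        "artist": row.get("DisplayArtist"),  # optional in the source system
    }

resource = tms_row_to_resource({
    "ObjectID": 12345,
    "Title": "Untitled",
    "DisplayArtist": "Unknown artist",
})
```

The value of co-developing such mappings is that the target shape is agreed once, and each institution only contributes the source-specific half for the CMS it happens to run.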
The convenience of an open-source, community-maintained project lies primarily in the better guarantee of long-term sustainability. For as much effort as AIC has spent on LAKE, it is still a minuscule amount of work compared to the joint effort of the some 39 institutions and 85 individual contributors making up the Hydra community, amounting to over 3,000 commits and two million lines of code (Hydra Project, 2016); or to the Fedora codebase, maintained by 15 individuals from 11 institutions (Fedora Project, 2016). In a similar way, if enough interest and concrete collaboration efforts were gathered around LAKE, it could become a common platform specifically for museums to manage and share digital assets related to digital collections.
We are now able to at least project a balance between the time, effort, research, and risk that developing LAKE demanded and the tangible results that have been delivered since its first beta launch in September 2016. As the system becomes part of more staff members’ daily workflows, and as the build-up phase transitions into a “stable” phase in the second half of 2017, this balance will become more defined.
What we can conclude at this early point is that the main value of LAKE for the AIC is in how it connects users, resources and tools within a system that is made up of many discrete components, while presenting itself to its users as a single flow touching almost all their areas of operation.
Some institutions may not have comparably complex needs to justify a similar effort. However, a readily available version of LAKE, requiring a well-defined and contained effort to become operational, would probably be attractive even to museums with smaller budgets or less in-house expertise. To this end, although LAKE contains elements that are specific to AIC, a group of enterprising institutions with shared goals could readily modify the current LAKE codebase to fit their collective needs. If that were to happen, the AIC would enthusiastically join and support such a collective effort.
Cossu, S. & D. Wilcox. (2016). “A little sweat goes a long way, or: Building a community-driven digital asset management system for museums.” MW2016: Museums and the Web 2016. Published February 8, 2016. Consulted January 29, 2017. Available http://mw2016.museumsandtheweb.com/paper/a-little-sweat-goes-a-long-way-or-building-a-community-driven-digital-asset-management-system-for-museums/
Hydra Project. (2016). Github. Analytics for August 2015–August 2016 time span. Available https://github.com/projecthydra/
Fedora 4.x Documentation. (2016). “Setup Camel Message Integrations.” Duraspace Wiki. Last updated October 27, 2016. Consulted January 31, 2017. Available https://wiki.duraspace.org/display/FEDORA4x/Setup+Camel+Message+Integrations
Cossu, Stefano and Kevin Ford. "The LAKE experience." MW17: MW 2017. Published February 1, 2017.