Art Tracks: using Linked Open Data for object provenance in museums
David Newbury, Carnegie Museum of Art, USA
Abstract
Over the past three years, the Carnegie Museum of Art has been working on Art Tracks, an ambitious project to create a standard for representing the provenance of works of art that works for both computers and humans. We will present a way to represent provenance as structured data, modeled using both the AAM recommended provenance standard and the CIDOC-CRM. This standard will allow provenance to easily work as a human-readable text, a JSON API, and a Linked Data graph. Once a work’s provenance has been converted into structured data, we have the new ability to use this digitized information to develop insight into not only the works themselves, but the history of collecting. As this history is not institution-specific, we can use this standardized data to connect works across institutions, to tell stories that span history and geography, to enrich the discussion of the nature of collecting, and to provide a powerful example of the utility of Linked Open Data in the museum sector. We have worked to aid in the adoption of this standard by building a suite of freely released, well documented, open-source software tools that support and build on this new standard. These tools are built with a design philosophy of creating software that is usable across multiple museums, regardless of their existing technology choices. In this presentation, we will demonstrate how you can start implementing this standard in your museum, and what new capabilities this allows for art history, visitor engagement, and storytelling with the objects in your collection.
Keywords: provenance, linked data, collections
The Art Tracks project, under development since 2014 at the Carnegie Museum of Art (CMOA), is a set of standards and software tools designed to parse textual provenance records and convert them into structured information. Art Tracks was designed around the observation that the text of a provenance record has innate structure, and thus sufficiently clever software could convert that text into structured data. Structured data would enable data visualization, comparative research, bulk analysis, and other important but non-textual uses of provenance. Over the course of the project, we wrote a custom parser based on a set of regular expressions capable of extracting meaning and data from provenance texts and converting them into JSON, a form of structured data. We also developed a user interface, known as Elysa, that enabled curatorial and registrarial staff members to edit and modify the provenance records. A full discussion of these techniques and systems was presented two years ago at Museums and the Web (Berg-Fulton, Newbury, & Snyder, 2015).
As an initial prototype, the project was successful. We were able to automatically extract and visualize provenance data, which we demonstrated in a public gallery installation at CMOA. Following this, we spent several months demoing the project at conferences, universities, and other institutions. Throughout this dissemination process, many institutions, particularly mid-size museums, expressed interest in using it to help author and maintain their provenance records.
Unfortunately, while the software was sufficient for our prototyping needs and was fully open-source, it was not designed for easy industry-wide adoption. The complexity of implementing the project was such that only a single other institution was able to get the software up and running, even as a prototype, and no other institution adopted it. As we continued to internally expand the project beyond our initial test data, we also discovered that there were many events recorded within the provenance records that did not match the structure that we had defined. For instance, the distinctions between loans, gallery shows, and auctions were insufficiently defined. We also discovered that it was almost impossible to determine if a single person or location was mentioned across two different records, because the names referred to in the provenance texts were irregular. Finally, while we had always designed our provenance data around the CIDOC Conceptual Reference Model (ICOM/CIDOC, 2017), we had not exhaustively documented the model we were using, nor had we automated the production of RDF from the structured data.
Phase II of Art Tracks, funded by a National Endowment for the Humanities grant, is designed to take the existing prototype and expand it to enable reuse beyond our institution. It is also designed to take the lessons we learned and resolve some of the inconsistencies and poor decisions we made throughout our prototyping process. We have approached this process in four ways: through the development of a community of partners to advise and collaborate with us; through the development of concrete models and standards for the expression of provenance; through the development and documentation of open-source software tools; and through the practical application of these tools to facilitate a concrete research project. This paper will focus on the second of these processes. It will discuss what we have learned as we, in collaboration with partners at the Yale Center for British Art, the Freer|Sackler Galleries, and the Getty Research Institute, have developed a structure for representing textual provenance information as Linked Data.
Note that we distinguish here and throughout the paper between Linked Data, Linked Open Data, and Open Data. We define Linked Data as information that is represented in graph form as RDF, often referencing external authorities and vocabularies, but not necessarily available to be referenced externally itself. Open Data is information that has been publicly provided to external parties for their use in a structured form. Linked Open Data is the intersection of these two categories: publicly available RDF data.
Provenance as a formal grammar
As we developed Art Tracks, it became obvious that our existing parsing strategy was inadequate for the needs of the project. The regular expression-based system we initially developed was extremely powerful, but as we uncovered more complexities and added more exceptions and special cases, it was clear that we would not be able to support this technique indefinitely. The complexities enabled by the sophistication of regular expressions meant it was no longer simple to discover how a specific provenance text would be parsed, and when it didn’t work it was no longer clear where the failure was taking place. We attempted to mitigate this problem with an extensive automated test suite, but even that was insufficient.
To address this problem, we have replaced the regular expressions with an explicit Parsing Expression Grammar, or PEG, for provenance. A PEG is an explicit, unambiguous type of formal grammar, closely related to a context-free grammar, that defines a specific method of parsing a string of text; PEGs are commonly used when implementing programming languages (Ford, 2004). Using a single PEG instead of a collection of regular expressions also has the additional benefit of allowing us to leverage open-source software that can process these grammars. This means that re-implementing the software in additional programming languages may be significantly easier: rather than requiring a rewrite of the entire code base, the grammar, which is the core of the project, can be ported from one language to another with little change.
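To make the idea concrete, the following is a minimal sketch of PEG-style parsing written by hand in Python. The grammar in the comment is a hypothetical, heavily simplified one invented for this illustration (it is not the Art Tracks grammar), and the function names and the three acquisition phrases are likewise assumptions. It shows the two defining PEG operations, sequence and ordered choice, and how a failure can be reported at a known offset.

```python
import re

# Hypothetical, heavily simplified grammar (NOT the Art Tracks grammar):
#   Provenance <- Period ("; " Period)* "." EOF
#   Period     <- Method? Name ", " Year
#   Method     <- "purchased by " / "bequest to " / "gift to "
#   Name       <- [^,;.]+
#   Year       <- [0-9]{4}

METHODS = ("purchased by ", "bequest to ", "gift to ")
NAME = re.compile(r"[^,;.]+")
YEAR = re.compile(r"[0-9]{4}")


def literal(text, pos, token):
    """Match an exact string at pos; return the new position or None."""
    return pos + len(token) if text.startswith(token, pos) else None


def method(text, pos):
    """Ordered choice: the first alternative that matches wins."""
    for token in METHODS:
        end = literal(text, pos, token)
        if end is not None:
            return token.strip(), end
    return None, pos  # Method is optional, so a miss consumes nothing


def period(text, pos):
    """Sequence: every part must match in order, or the whole rule fails."""
    how, pos = method(text, pos)
    name = NAME.match(text, pos)
    if name is None:
        return None
    pos = literal(text, name.end(), ", ")
    if pos is None:
        return None
    year = YEAR.match(text, pos)
    if year is None:
        return None
    record = {"method": how, "owner": name.group(), "year": int(year.group())}
    return record, year.end()


def parse_provenance(text):
    """Parse a whole provenance string, or raise with the failing offset."""
    result = period(text, 0)
    if result is None:
        raise ValueError("expected a period at offset 0")
    records, pos = [result[0]], result[1]
    while True:
        after_separator = literal(text, pos, "; ")
        if after_separator is None:
            break
        result = period(text, after_separator)
        if result is None:
            raise ValueError(f"expected a period at offset {after_separator}")
        records.append(result[0])
        pos = result[1]
    if literal(text, pos, ".") != len(text):
        raise ValueError(f"expected a final '.' at offset {pos}")
    return records
```

Because each rule is an ordinary function that either consumes input or fails at a known position, it is possible to say which rule rejected a text and where, which is the property the regular expression approach lacked.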
It also opens up the possibility for other, equally valid grammars to be constructed. One of the major criticisms of the Art Tracks project has been that many institutions have very specific styles for their provenance, and previously, using our tools required that you format provenance texts into a specific style defined by CMOA. By treating this grammar as a formal mapping between the linguistic rules defined by the institution and the data structure defined by Art Tracks, any other grammar that expresses the same concepts should be able to be processed. This would also enable the possibility of automatic translation between provenance styles, which might be a benefit to the field.
The grammar is one-way; to convert from the structured data back into text requires a generator, the inverse of a parser. Thankfully, a host of tools exist for generating strings from data structures and templates, so we do not need to explicitly reuse the same grammar to regenerate our provenance texts. While this theoretically presents the possibility of inconsistencies between the generation and the parsing, since we have the original text we can verify that the output of the template-based system produces an identical text. When tested against a large collection of possible texts (say, all the provenances contained within a museum collection) we can be reasonably confident that such a system is sufficiently accurate.
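The round-trip check described above can be sketched as follows. This is an illustration only: the record fields, the template, and the function names are assumptions made for this example, and the parsed records are written out by hand rather than produced by a real parser.

```python
def generate_period(record):
    """Render one structured period back into text using a fixed template."""
    prefix = f"{record['method']} " if record["method"] else ""
    return f"{prefix}{record['owner']}, {record['year']}"


def generate_provenance(records):
    """Join the rendered periods in the same style the parser expects."""
    return "; ".join(generate_period(r) for r in records) + "."


def find_mismatches(corpus):
    """Return the original texts that the generator fails to reproduce.

    corpus is a list of (original_text, parsed_records) pairs.
    """
    return [
        text for text, records in corpus
        if generate_provenance(records) != text
    ]


# Hypothetical corpus: the second text has a stray space before the semicolon,
# so the regenerated text will not be identical to it.
CORPUS = [
    (
        "John Johns, 1960; purchased by Jane Doe, 1965.",
        [
            {"method": None, "owner": "John Johns", "year": 1960},
            {"method": "purchased by", "owner": "Jane Doe", "year": 1965},
        ],
    ),
    (
        "Jane Doe, 1965 ; bequest to Jill Doe, 1990.",
        [
            {"method": None, "owner": "Jane Doe", "year": 1965},
            {"method": "bequest to", "owner": "Jill Doe", "year": 1990},
        ],
    ),
]
```

Running find_mismatches over every provenance in a collection gives a list of texts to review by hand; an empty list is the evidence that the parser and the generator agree on that collection.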
Provenance as Linked Data
Once we have used a grammar to generate structured data, the next step is to map that structured data to the CIDOC-CRM. The actual mapping process, which is the work of converting data from one form to another, is not intrinsically difficult and can be easily automated; however, before the data can be mapped it must be modeled.
Modeling is the process of describing how a datum expressed within one form of structured data is correctly represented within a different form while following the set of rules defined within a specific ontology. Since there are often multiple correct ways to implement the rules of any ontology, particularly one as complex as the CIDOC-CRM, we document a profile, or a specific understanding of the ontology that describes a crosswalk between a known set of input values and a known set of output structures. This profile ensures that the transformations performed during the mapping process can be understood by others and helps maintain consistency across multiple implementations.
Linked Data is a relatively new concept in museums, so many of the tools that would enable institutions to integrate it into their existing systems are still missing. Because of this, it can be difficult for museums to model provenance information to the full specificity possible. To guide institutions with various levels of structured data and software implementations, we have defined four profiles for provenance in the CIDOC-CRM. Each profile defines a specific level of complexity for the digital encoding of provenance in Linked Data based on the amount of processing and the complexity of the input data. These levels have been designed to function as supersets, so a record expressed at a given level can also be treated as a record at each simpler level. Because of this, each level of profile provides a more complete graph representation of knowledge, enables additional search capabilities, and provides a more nuanced understanding of the provided provenance text.
This technique was heavily influenced by TEI’s profiles (Hawkins, Dalmau, & Bauman, 2011), which have a similar level system, as well as by a comment from Kate Blanch, Database Administrator at The Walters Art Museum. She referred to this sort of structure as “a roadmap for the future of an institution’s data.” By knowing how your data could be represented, you can prioritize projects that will get you closer to a known model, even if you’re not prepared to implement it fully at the current time. We will describe at a high level how each functions here; full specifications, RDF models, and additional details of these levels are available at http://www.museumprovenance.org.
When we model a provenance text we are modeling two separate but related structures. One is the written text that makes up the traditional provenance record, which we refer to as the provenance text. The other is the actual events and parties that the provenance text describes, which we refer to as provenance events. We maintain and model both of these structures in Linked Data because they both contain nuances that are difficult to capture in the other’s form.
Level One describes a provenance text in which the events described within the document are not modeled in any fashion, nor are the people, places, or any other entity. This level is intended as a model for provenance within an institution where the provenance is represented as a single text field and where no additional work is done to understand or interpret the provenance. The RDF model at this level merely models the existence of a provenance text, the existence of an event, and relates both to the work described.
At this level, a full text search is possible, but it is not possible to facet the information in any way.
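As a rough illustration of how little Level One asserts, the sketch below writes the graph as plain Python tuples. The class and property names are taken from the CIDOC-CRM, but the selection of them is an assumption made for this example and should not be read as the published Art Tracks profile, which is documented at the URL given above. The example.org URIs and the function names are hypothetical.

```python
# Hypothetical URIs, used only for illustration.
WORK = "http://example.org/object/1"
TEXT = "http://example.org/object/1/provenance-text"
EVENT = "http://example.org/object/1/provenance-event"


def level_one_triples(work, text_uri, event_uri, provenance_text):
    """Build a Level One graph: one text, one event, both tied to the work.

    Assumption: the CRM terms below are one plausible choice, not the
    official profile.
    """
    return [
        (work, "rdf:type", "crm:E22_Man-Made_Object"),
        (text_uri, "rdf:type", "crm:E31_Document"),
        (text_uri, "crm:P3_has_note", provenance_text),
        (text_uri, "crm:P129_is_about", work),
        (event_uri, "rdf:type", "crm:E5_Event"),
        (event_uri, "crm:P12_occurred_in_the_presence_of", work),
        (text_uri, "crm:P70_documents", event_uri),
    ]


def full_text_search(triples, needle):
    """Return the texts whose note contains the needle (case-insensitive)."""
    return [
        subject for subject, predicate, obj in triples
        if predicate == "crm:P3_has_note" and needle.lower() in obj.lower()
    ]
```

The only place a name such as "Jane Doe" appears in this graph is inside the literal text, which is why a string search works at this level while faceting by person, place, or date does not.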
Level Two models the provenance text, the core event, and the people, places, and URLs present within the text. This level is designed to accommodate software that is capable of processing the provenance text but doesn’t have special knowledge of provenance, such as Named Entity Recognition or string matching against authority files.
At this level, it becomes possible to connect artwork to the people and places that are mentioned in the provenance. It also becomes possible to search for provenances using alternate names (when entities are linked to an authority file) and to use URLs to link to digitized primary sources. Individual transactions and dates are not modeled at this level.
Level Three models the individual transactions as specific provenance events, but treats each transaction between periods of ownership as a single event. These transactions follow each other in time, but are not causally related. To implement this level, we assume the existence of software capable of parsing individual acquisitions or transfer events within provenance. All software developed as part of Art Tracks is designed to work at this level.
At Level Three, it becomes possible to analyze and search for specific provenance transactions, and relate those transactions to people and locations. It also becomes possible to search for transactions that occurred within specific space-time volumes, such as “Continental Europe between 1932 and 1946.”
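A space-time search of this kind can be sketched in a few lines once transactions are individual records. Everything in the example is an assumption made for illustration: the Transaction fields, the tiny place hierarchy (a real system would draw on an authority such as TGN), and the choice to treat a transaction as matching when its possible date range overlaps the search window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    work: str
    method: str
    party: str
    place: str
    earliest_year: int  # dates are often known only as a range
    latest_year: int


# Hypothetical place hierarchy, child -> broader place.
BROADER_PLACE = {
    "Paris": "France",
    "France": "Continental Europe",
    "Berlin": "Germany",
    "Germany": "Continental Europe",
    "Pittsburgh": "United States",
}


def is_within(place, region):
    """Walk up the hierarchy until the region is found or the chain ends."""
    while place is not None:
        if place == region:
            return True
        place = BROADER_PLACE.get(place)
    return False


def search(transactions, region, start_year, end_year):
    """Return transactions in the region whose date range overlaps the window."""
    return [
        t for t in transactions
        if is_within(t.place, region)
        and t.earliest_year <= end_year
        and t.latest_year >= start_year
    ]


# Hypothetical sample data.
TRANSACTIONS = [
    Transaction("Work A", "private sale", "Dealer X", "Paris", 1938, 1938),
    Transaction("Work B", "auction", "Collector Y", "Pittsburgh", 1940, 1940),
    Transaction("Work C", "forced sale", "Unknown", "Berlin", 1945, 1947),
    Transaction("Work D", "gift", "Museum Z", "Berlin", 1950, 1950),
]
```

With this sample data, search(TRANSACTIONS, "Continental Europe", 1932, 1946) selects Work A and Work C; Work C is included because its date range begins inside the window even though it may end after it.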
At Level Four, we explicitly model the sub-events that make up each individual transaction. These events are causally related: each one is motivated by and dependent on the event that directly precedes it. For example, rather than modeling a loan as a single event, you would model the transfer of custody to the trucking company, followed by a transfer to the art handlers at the new location, followed by a transfer to the exhibiting institution. This level of provenance is not currently recorded in traditional provenance texts, but it is the next logical step for integrating provenance into a larger model of the history of movement, ownership, and custody of art.
We do not fully define Level Four, and none of the Art Tracks tools are capable of modeling or mapping these events. It may seem unusual to include a level that we cannot implement or fully define; however, we understand the potential of its use. Although none of the institutions that we currently work with have data at this level of granularity, we want to make sure that the model we develop does not preclude the future existence of this data; therefore, we have designed the model to accommodate it.
Location, ownership, and custody
Throughout this process it became obvious that the traditional recording of provenance was not a perfect fit with the CIDOC-CRM. The CIDOC-CRM has enormous flexibility when it comes to mapping cultural heritage information to Linked Data ontologies, but there are conflicts between the traditional understanding of the role of provenance within museums and the event-based model described within the conceptual reference model. In particular, within museums, provenance often conflates ownership, custody, and location of a work, but these three time/space vectors are independent within the CIDOC-CRM.
We have assumed that provenance does not include the location of the work. This is counterintuitive: provenance seems very much to be about the movement of works in space and time. But upon a closer reading, and through many discussions with provenance experts and researchers, we determined that what provenance texts describe are locations associated with people or locations associated with events. An object may be present at the event, but the event is what connects people, places, objects, and dates, not the object itself. Instead, provenance texts only refer to ownership and custody. In our prototype, we understood that we had conflated these two, which we knew might be problematic, but we did not have enough experience to understand where those problems might be. In our earlier paper (Berg-Fulton, Newbury, & Snyder, 2015), we said:
We have not yet modeled an ontological hierarchy of [acquisition] terms, but it appears that such a hierarchy exists and could be modeled with sufficient domain knowledge. Additionally, this hierarchy could be used in constructing the CIDOC-CRM modeling; some of these methods indicate ownership and custody changes, some merely of custody.
As we further explored this in Phase II, we discovered that many of the exceptions that had caused trouble in our modeling were due to an insufficient understanding of this hierarchy. Additionally, it became clear that the differences between exhibition history, acquisition, destruction, and transfer were more a matter of museum record-keeping than one of ontological significance. The semantic difference between an auction where the work was not sold and an exhibition at a gallery was one of intent, not one of category. Both involved transfers of custody, but not ownership, and both were followed by a transfer back to the previous owner. As another example, when a work is created it is simultaneously owned and in the custody of someone, usually the artist responsible for creating the work. To exclude that from provenance is to lose valuable information needed to fully understand where the work has been over time.
As we began to think through the differences between ownership and custody, it became clear that defining a hierarchy of acquisition methods would be essential both in understanding how works changed hands and in determining when a transaction involved both ownership and custody, and when it involved only one or the other. Additionally, the creation of such a hierarchy would allow for significantly more faceted searches. Auctions, private sales, forced sales, and exchanges are related; by placing all of these terms under the category “exchange of value,” we could search across them without losing the nuance each contains.
To facilitate our work, we have constructed a small thesaurus of 50 different terms used to distinguish between methods of acquisition. These terms are broken into six categories of structured hierarchy: Transfers, Originations, Disappearances, Divisions of Custody, Rejoinings of Custody, and Party Transformations. These represent the terms that we have identified through an analysis of CMOA’s provenance, as well as those of YCBA and Freer|Sackler, but may not be a complete list; we continue to add additional terms as they are discovered. For a full list and description of these terms, see http://www.museumprovenance.org/reference/acquisition_methods, where the list is available as HTML, PDF, and SKOS.
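The benefit of the hierarchy for searching can be sketched with a small broader-term table. The handful of terms and their placement below are assumptions chosen for illustration and are not a copy of the published thesaurus; the function names are hypothetical.

```python
# Hypothetical fragment of a hierarchy, narrower term -> broader term.
BROADER = {
    "exchange of value": "Transfers",
    "auction": "exchange of value",
    "private sale": "exchange of value",
    "forced sale": "exchange of value",
    "exchange": "exchange of value",
    "bequest": "Transfers",
}


def ancestors(term):
    """Return every broader term above `term`, nearest first."""
    found = []
    while term in BROADER:
        term = BROADER[term]
        found.append(term)
    return found


def falls_under(term, category):
    """True if the term is the category itself or one of its narrower terms."""
    return term == category or category in ancestors(term)


def search_by_method(events, category):
    """Select events by category while leaving each specific term intact."""
    return [e for e in events if falls_under(e["method"], category)]


# Hypothetical sample events.
EVENTS = [
    {"work": "Work A", "method": "auction"},
    {"work": "Work B", "method": "bequest"},
    {"work": "Work C", "method": "forced sale"},
]
```

Searching EVENTS for "exchange of value" returns Work A and Work C, and each result still carries its specific term, so a forced sale remains distinguishable from an auction after the broader search.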
Entities and authority
An additional problem Art Tracks faced was determining a mechanism for identifying the specific entities mentioned in provenance. People, organizations, places, and named events (such as sales or exhibitions) are essential data for a provenance record, but they are traditionally referred to only by name or appellation. Additionally, much of the data that makes up a transaction record exists to unambiguously identify specific people. These include birth and death dates, signifiers of human relationships such as marriage or kinship, titles and honorifics, and geographical locations. These are not essential to recording the history of the transfer of art; they are present to provide identification and context for the parties that owned the artwork and to disambiguate between people who share names and might otherwise be confused.
Within a given context, disambiguation only requires that enough information be provided to avoid overlap with other known individuals. However, there are often multiple sets of information that could uniquely disambiguate a specific individual, and it can be difficult to determine if two unique appellations refer to the same person or not. For example, both “Dr. Jones, UCLA” and “Indiana Jones” are sufficiently unique appellations to disambiguate between Mr. Jones and any other antiquities collector currently known; however, we cannot easily determine if they both represent the same person or are instead different people.
Because this disambiguating information is both imperfectly known but relevant across many provenance records, it can be difficult to ensure that any newly acquired knowledge is appropriately updated in all relevant records. For example, when a given collector passes away, their death date should be updated in every provenance record where it occurs, but given the large number of possible appellations for that person, it can be difficult to identify which records those might be.
This is not merely an identity problem. If we were building a traditional structured data system, this problem would appear trivial: we could treat the people and places as lookups to other tables. However, the choice of which appellation to provide in a provenance record is not only a disambiguation or identity problem. Additional implicit data occurs in the choice of which name is used in provenance texts, and simply identifying the individual does not capture the nuance that is available in that text. The specific spelling and structure of the name used for a person is often drawn from the wording of the primary source used to confirm that record, and the name used is also often the correct name at the moment of the transaction, but may not be the preferred name of the individual. Specific titles may be included when they are relevant or important to the provenance record, but may be excluded when irrelevant. To reduce this human-readable information to a single primary key is to lose this nuance and expressiveness. However, to ignore the fact that these nuanced strings are references to real entities that can appear across many provenance records would lose much of the benefit of structured data. People and places both represent real entities; connecting these entities with external authority records helps avoid multiple conflicting sources of truth.
Finally, there are technical issues with the use of names. As we move towards using a PEG to parse provenance, it becomes difficult to determine which datum specific words or phrases within appellations refer to. For example, compare “His Excellency, James Jones, France” with “James Jones, Paris, France.” Both consist of three phrases separated by commas. Neither would be unusual within a provenance record, but a computer has no way to determine if the middle phrase is part of the name or part of the location. Making that determination relies on human understanding of the semantic meaning of the text. Existing techniques such as machine learning and natural language parsing attempt to resolve these problems, but the statistical models are not 100% accurate, and when they fail they do so in ways that are extremely difficult to correct.
Art Tracks initially attempted to resolve these problems by creating strict standards for how various names were expressed. Inevitably, exceptions would occur that would result in unexpected parsings, and to resolve these conflicts required constant patching and new code to work around each new exception. The reasons for these failures were difficult for non-technical staff to understand: they appeared to be correctly entering new information, yet the system would reject it. This was frustrating and became a significant barrier to adoption.
A better solution to the problems of identity management and entity disambiguation is the use of authority files through Linked Data. We have always intended to use authority files within Art Tracks, but Phase I relied on string matching to cross-reference parties to museum authority files. This proved problematic due to the large number of alternate names that needed to be entered into the authority files to allow for the various nuances and titles present in provenance texts. String matching also presents problems when reconciling between multiple authority files and does not provide a solution to people who have identical names but different identities.
We explored string matching as a solution because it required the smallest number of changes to the appearance of provenance texts. However, the additional work needed to manage external authority files and the inevitable ambiguities that the process created turned out to be more of a barrier than anticipated. In this version, we have developed a new technique that we believe provides a better solution to this problem through the addition of a new section to our written provenance, which we refer to as an “Authority” section.
Traditional provenance consists of two semantically different sections. One is the paragraph of semi-structured text that contains a list of transfers ordered from first owner until the present day. The other is a collection of footnotes which provide unstructured content and commentary. In order for provenance to remain legible for traditional scholars, we have attempted not to fundamentally change the written representation of either of these sections. However, we decided that it might be possible to add additional sections to the text. To avoid confusion and prevent this change from being overly disruptive, we gave ourselves several constraints. First, these additional sections should be human-readable without specialized software and should be able to be created and maintained by hand in a standard text editor. We also felt that they should provide additional context not only to the computer, but to a human reader, which ruled out explicitly including machine-readable data. We took inspiration from archival and bibliographic practice, and added a list of names and URLs as a new section below the two traditional sections of the provenance text. For example:
John Johns, the artist, 1960; purchased by Jane Doe [1920-1990], Boise, ID, 1965; bequest to Jill Doe, Esq., 1990.
Jill was both Jane Doe’s great-niece and a partner in her legal practice.
Authorities:
John Johns: see http://viaf.org/viaf/8189962
Jane Doe: see http://vocab.getty.edu/ulan/500217526
Boise, ID: see http://vocab.getty.edu/tgn/1234556
Jill Doe, Esq.: no record found.
The new Authority section references Linked Open Data URIs for named individuals, locations, and events, which resolves many of the problems described above. Best practices for Linked Data also assert that these URIs should be dereferenceable using HTML, which means that a human researcher can gain immediate context for the people and places represented by the IDs by copying and pasting them into a browser. This also resolves the problem of allowing the user to maintain nuance in their choice of appellation for individuals while also providing an unambiguous ID for that individual. “Jane Doe” can be explicitly linked to the same URI as “Jane Pendleton,” a maiden name, maintaining context while disambiguating identity. Additionally, titles and honorifics can be included as part of the appellation, avoiding the difficulty of determining which entity ambiguous clauses refer to. Lastly, having this section significantly eases parsing of the provenance statements, because entities referenced in the semi-structured text can be replaced with tokens as a pre-processing step, avoiding the inevitable conflicts between appellations and parsable tokens within the grammar.
There are several constraints that this system creates. The first is minor: any appellation within the text can only be used to refer to a single entity. This is not a problem, since this should be true even in a traditional provenance. Secondly, this means that both sections of provenance that contain appellations must be kept in sync. This is an additional burden on the researcher, but because they are both within the same field, it seems less difficult than the previous requirement that the authority record and the provenance text be kept in sync. It also allows for the use of authority records that the provenance record does not have the ability to modify, such as ULAN or VIAF. Also, an interface such as Elysa should be able to maintain these connections automatically, identify records that are missing authority listings, and suggest LOD IDs, all of which should relieve the burden on the researcher. The final constraint is that, because the system uses these listings as part of the parsing process, entities that do not currently have LOD URIs must still be listed, using “no record found” in place of a URI.
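The pre-processing step can be sketched as follows. This is an illustration under stated assumptions, not the Art Tracks implementation: it assumes one authority entry per line in the form shown in the example above, and the token format, the regular expression, and the function names are all hypothetical.

```python
import re

# One entry per line: "<appellation>: see <URI>" or "<appellation>: no record found."
AUTHORITY_LINE = re.compile(
    r"^(?P<name>.+?): (?:see (?P<uri>\S+)|no record found\.?)$"
)


def parse_authorities(section):
    """Map each appellation to its URI, or to None when no record was found."""
    authorities = {}
    for line in section.splitlines():
        line = line.strip()
        if not line or line == "Authorities:":
            continue
        match = AUTHORITY_LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized authority line: {line!r}")
        authorities[match.group("name")] = match.group("uri")
    return authorities


def tokenize(provenance, authorities):
    """Replace each listed appellation with an opaque token before parsing.

    Longer appellations are replaced first so that one name cannot be
    clobbered by a shorter name contained inside it.
    """
    tokens = {}
    for index, name in enumerate(sorted(authorities, key=len, reverse=True)):
        if name not in provenance:
            raise ValueError(f"{name!r} is listed but not used: sections out of sync")
        token = f"{{ENTITY_{index}}}"
        provenance = provenance.replace(name, token)
        tokens[token] = name
    return provenance, tokens


PROVENANCE = (
    "John Johns, the artist, 1960; purchased by Jane Doe [1920-1990], "
    "Boise, ID, 1965; bequest to Jill Doe, Esq., 1990."
)
AUTHORITIES = """Authorities:
John Johns: see http://viaf.org/viaf/8189962
Jane Doe: see http://vocab.getty.edu/ulan/500217526
Boise, ID: see http://vocab.getty.edu/tgn/1234556
Jill Doe, Esq.: no record found.
"""
```

After tokenization, commas inside appellations such as "Boise, ID" and "Jill Doe, Esq." no longer appear in the text handed to the grammar, so they cannot be mistaken for field separators. Listing "Jill Doe, Esq." with no URI still matters for the same reason: the entry is what allows the name to be tokenized at all.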
These techniques, along with other, less essential techniques for managing familial relationships or the representation of agents are fully described within the Art Tracks documentation. Together, they provide a consistent, documented technique for generating Linked Data from traditional provenance texts. It is a model that can be used to transition museums from traditional CMS representations of provenance into Linked Data, and as such does not fully capture the potential Linked Data could provide. However, throughout this project, we have been guided by three questions: (1) what is quantifiable? (2) what is unique to provenance? and (3) what is useful for art historians? We have described here a model that captures three years of direct work and decades of accumulated expertise in answering those three questions. We recognize that much of the work described might feel like ceremony for ceremony’s sake; much of the potential of Linked Data remains theoretical, and an enormous amount of effort is being expended to structure information and work around practices that have been in place and sufficient for hundreds of years. However, Art Tracks, and specifically this model for provenance in Linked Data, represents an initial important step forward for museums. Museums are moving away from being fortresses that protect and catalog treasures, and moving towards being willing and essential participants in a community of knowledge. As our community moves toward viewing collections as essential primary sources for an event-based model of art history, our hope is that our work models how to successfully transition existing knowledge and expertise into Linked Data. It is our goal to enable future research, improve understanding of our own collections, and further the mission of the museum in the 21st century.
Acknowledgements
Initial funding for Art Tracks was supported by the Institute of Museum and Library Services. Phase II is supported by a grant from the National Endowment for the Humanities, a grant from the Samuel H. Kress Foundation, and a grant from the Paul Mellon Centre for Studies in British Art. This project would not be possible without the collaboration of the entire Carnegie Museum of Art. In particular, we would like to thank Tracey Berg-Fulton, Costas Karakatsanis, Neil Kulas, Louise Lippincott, Akemi May, Emily Mirales, Katie Reilly, and Travis Snyder. In addition, it would not have been possible without the help of colleagues across the field, in particular, the expertise of Emmanuelle Delmas-Glass at the Yale Center for British Art, Jeffrey Smith at the Freer and Sackler Galleries, and Rob Sanderson and Ruth Cuadra at the Getty Research Institute.
References
Berg-Fulton, T., D. Newbury, & T. Snyder. (2015). “Art tracks: visualizing the stories and lifespan of an artwork.” Museums and the Web 2015: Proceedings. Chicago: The Palmer House, 2015. Consulted January 31, 2017. Available http://mw2015.museumsandtheweb.com/paper/art-tracks-visualizing-the-stories-and-lifespan-of-an-artwork/
Ford, B. (2004). “Parsing Expression Grammars: A Recognition-Based Syntactic Foundation.” Proceedings of the 31st ACM SIGPLAN-SIGACT symposium on Principles of Programming Languages. Venice, Italy. Consulted January 30, 2017. Available https://pdos.csail.mit.edu/~baford/packrat/popl04/peg-popl04.pdf
ICOM/CIDOC Documentation Standards Group. (2017). “Definition of the CIDOC Conceptual Reference Model.” Published January 25, 2017. Consulted January 30, 2017. Available http://www.cidoc-crm.org/Version/version-6.2.2
Hawkins, K., M. Dalmau, & S. Bauman. (2011). “Best Practices for TEI in Libraries.” Consulted October 15, 2016. http://www.tei-c.org/SIG/Libraries/teiinlibraries/main-driver.html
Newbury, D. "Art Tracks: using Linked Open Data for object provenance in museums." MW17: MW 2017. Published February 1, 2017.