### The Design of Historical Data Projects

*The Comédie Française Registers Project and the Laboratoire Paris XVIII*

- [http://slides.jamiefolsom.com/dh17/](http://slides.jamiefolsom.com/dh17/)

Note:
### Welcome

- Presenters:
  - Dr. Jeff Ravel of MIT
  - Dr. Pascal Bastien of UQAM
  - Jamie Folsom, Performant Software Solutions
  - Andy Stuhl, Performant Software Solutions
- Goals:
  - To share these two projects with you
  - To have your thoughts, questions and feedback on them
  - To identify some best practices in the design of historical data projects

Note:
Good morning and welcome to The Design of Historical Data Projects. My name is Jamie Folsom; I am VP for development with Performant Software Solutions. We are a software company based in Charlottesville, Virginia and Boston, Massachusetts, specializing in software engineering for the digital humanities.

I have the pleasure today of being joined by two distinguished scholars and digital humanists: Dr. Jeff Ravel, professor and head of the history department at MIT, and Dr. Pascal Bastien, professor and director of the center for the history of sociabilities at the University of Québec at Montréal.

As a starting point for a broader discussion of the design of data-centric projects in the study of history, we would like to share with you two such projects: one led by Dr. Ravel, one led by Dr. Bastien. These two projects will form the foundation of the workshop, and our goal is to build on that foundation with you, to have a conversation about the design of these projects.

Ideally, we would produce the definitive set of guidelines or best practices for anyone setting out to create such a project. That may not be within scope, to borrow a term from the programming business, for the next three hours. But we are optimistic that this conversation will, at a minimum, give us all a clearer view of the challenges, opportunities and considerations inherent in these collaborative, multidisciplinary efforts, and that we may come away with some new ideas.

So while we will take these two projects as our case studies, and we will talk about the software these and other projects may build and use, the data they produce, and the scholarship they enable, our objective is to have a discussion about how to organize these collaborations.
### Schedule

- Introduction: *10 minutes*
- Presentation of CFRP: *50 minutes*
- Hands-on with CFRP data: *40 minutes*
- Break: *10 minutes*
- Presentation of LP18: *50 minutes*
- Discussion: *40 minutes*
- Conclusion: *10 minutes*

Note:
### Overview

- The Comédie Française Registers Project (CFRP or RCF)
  - Theater ticket sales ledgers
  - 1680-1793
- Le Laboratoire Paris XVIII (LP18)
  - Social and political life in Paris
  - Various sources
- Similarities and Differences

Note:
The Comédie Française Registers Project (CFRP or RCF), led by Dr. Jeff Ravel at MIT and others, is a digital archive containing 113 years' worth of ticket sales ledgers from the Comédie Française theater in Paris, recording ticket sales data for every performance by the company for over a century. CFRP promises to shed light on an important period in the cultural production of France, using web-based software to capture, analyze and visualize those data.

Laboratoire Paris XVIII (LP18), led by Dr. Pascal Bastien at UQAM, aims to create a space for researchers to work together to understand the life of the City of Paris in the 18th century. That workspace will accommodate the ingestion of document facsimiles and the extraction of text and data from those documents, and will offer an exploratory environment in which to manage and analyze those data and to create spatial and temporal visualizations.

In some ways these two projects are very similar; in others, quite different.
### Overview

The two projects share some common features...

- 18th Century Paris
- Documents » data » tools and visualizations
- Collaborative, multidisciplinary teams

Note:
#### Both projects:

- Enable research into cultural and social phenomena in 18th Century Paris.
- Make use of documents, by digitizing them, extracting data from them, and offering tools for the management, search and visualization of those data.
- Enlist teams which are collaborative, distributed, and multidisciplinary.
### Overview

...but there are some key differences between them:

| CFRP                 | LP18                   |
| -------------------- | ---------------------- |
| mature               | in planning            |
| single archive       | many sources           |
| structured documents | unstructured documents |

Note:
- CFRP is a mature project, over 10 years in the making; LP18 is in its earliest phases.
- CFRP is focused on a single archive, LP18 on a broad range of documentary sources.
- CFRP documents are very structured; LP18 aims to accommodate unstructured sources.

We see an opportunity, in juxtaposing them, to create a productive context in which to get at some of the common obstacles, considerations, concerns, and possibilities shared by such projects more generally.
### Questions

- **Engagement**
  - How can a historical data project engage progressively with different elements of its audience to create tools that serve the needs of all its users?
- **Collaboration**
  - How can a project team build trust, and deepen collaboration, especially over time, at distance, and across disciplinary boundaries?
- **Publication**
  - When and how should a project make its results public, and to what degree should feedback from outside the team shape the direction of the project?

Note:
The two project presentations will provide the context for the final portion of our agenda, which we have set aside for discussion. To provide structure, we offer three focal questions.
### CFRP Technical Overview

- [Digitization](http://hyperstudio.mit.edu/cfrp/flip_books/R73/index.html#page/244/mode/2up)
- [Data entry](http://app.cfregisters.org/registers/42624/edit)
- [Data management](http://app.cfregisters.org/admin/registers?utf8=%E2%9C%93&q%5Bdate_gteq%5D=1717-08-08&q%5Bdate_lteq%5D=1717-08-08&q%5Bseason_eq%5D=1717-1718&commit=Filter&order=id_desc)
- [Data access](http://api2.cfregisters.org/registers?date=eq.1717-08-08)
- [Tools](http://app.cfregisters.org/registers?filter[season][]=1717-1718&filter[weekday][]=Dimanche&utf8=%25E2%259C%2593)

Note:
There have been five main phases of the technical work on this project. These phases are not strictly sequential, but iterative and overlapping, and have evolved in response to researchers’ priorities.

- Digitization
  - The Comédie Française managed the creation of digital facsimiles of the registers, which are preserved in volumes at the theater archives.
- Data entry
  - Our involvement in the project began with the development of an interface to allow people to see the documents, and to capture the data they contain. This is a completely custom web application, crafted to match the variations across time, venue, ticket categories, pricing, etc.
- Data management
  - The large amount of data, produced by a large amount of work by a team of researchers over years, required a meticulous and repetitive verification process, which we supported with another interface, for the bulk management of those data.
  - In addition to the data stored in the documents, there are authoritative datasets regarding the plays and authors, which have been introduced to the database over time, most recently to include references to information and images stored by the Bibliothèque Nationale de France.
- Data access
  - Each audience has a different set of interests, and consequently, different preferences as regards access to the data, from raw data in one of several formats, to aggregated cross sections of data, to finished visualizations and tools.
- Tools
  - Tools for search, visualization and discovery were built throughout the project, but development has taken off in the last year or two, as conferences were held, and developers and researchers have had more opportunities to collaborate more frequently, and more intensively.
### Hands-on with CFRP data

*Goal: get a feel for the data and tools the project has produced*

We'll introduce each activity, spend a few minutes on it, then move to the next one.

- CFRP Tools: easy to use
- CSV files: flexible and rapid
- JSON API: powerful and complex

Note:
Here are three activities which I hope will provide you with some insight into the CFRP data. They’re designed so you can do some or all of each in about 10 minutes, but you should feel free to spend more or less time on any of them. The goal here is to allow you to get a sense of the shape and size of the data the project has produced, and a feel for some data tools, ours and others’.

The first activity will give you a feel for the tools the project built. The second will have you using our data with tools developed by others; the third gives you an idea of how you might build your own tools, using our data. This is a menu, so feel free to do what interests you. The activities are listed in order of technical complexity.
### Activity 1: CFRP Tools

*Easy to use*

Use CFRP tools with CFRP data

- **Faceted Browser**
  - [http://cfregisters.org/en/the-data/faceted-browser](http://cfregisters.org/en/the-data/faceted-browser)
- **Discovery Tool**
  - [http://cfregisters.org/en/the-data/basic-tool](http://cfregisters.org/en/the-data/basic-tool)
- **Questions**
  - On handout

Note:
*Faceted Browser:*

- In 1717, which day or days of the week had the most two-comedy performances?
- The most two-Molière performances?
- The most revenue?

*Discovery Tool:*

- Which play was the most performed in the 113-year period?
- In which year was that play performed the most times?
- In which season did the theater earn the most revenue?

*Either/Both tools:*

- Which genre was most lucrative?
- When were cheap seats (parterre) most/least popular?
- What other patterns do you notice?
### Activity 2: CSV files

*Flexible and rapid*

Manipulate CFRP CSV data, using a third-party tool; details on handout.

- **Tool**
  - [http://hdlab.stanford.edu/palladio-app/](http://hdlab.stanford.edu/palladio-app/)
- **Projects**
  - Authors: [http://slides.jamiefolsom.com/data/authors/project.json](http://slides.jamiefolsom.com/data/authors/project.json)
  - Receipts: [http://slides.jamiefolsom.com/data/receipts/project.json](http://slides.jamiefolsom.com/data/receipts/project.json)
- Challenge: add to the second project, using data from CSV files
  - [http://slides.jamiefolsom.com/data/csv/](http://slides.jamiefolsom.com/data/csv/)

Note:
One at a time.
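For anyone who prefers a script to a visual tool, CSV data can also be summarized in a few lines of Python. What follows is a minimal sketch under stated assumptions: the `date` and `receipts` column names and the three sample rows are invented placeholders, not the actual headers or figures in the CFRP CSV files, so check the real files before adapting it.

```python
import csv
import io
from collections import defaultdict
from datetime import date

# Invented placeholder rows; NOT real CFRP column names or figures.
SAMPLE_CSV = """date,receipts
1717-08-08,100
1717-08-09,250
1717-08-15,50
"""

# Fixed English names, so the result does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def receipts_by_weekday(csv_text):
    """Sum the hypothetical `receipts` column for each day of the week."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        day = WEEKDAYS[date.fromisoformat(row["date"]).weekday()]
        totals[day] += float(row["receipts"])
    return dict(totals)


totals = receipts_by_weekday(SAMPLE_CSV)
# Print weekdays from highest to lowest total.
for day, total in sorted(totals.items(), key=lambda item: -item[1]):
    print(day, total)
```

To use a downloaded file instead of the inline sample, read its contents into a string and pass that to `receipts_by_weekday`, after renaming the two columns to match the file's actual headers.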
### Activity 3: JSON API

*Powerful, developer-focused*

A tour of the CFRP JSON data API; try it out!

- **Tool**
  - [https://www.getpostman.com/](https://www.getpostman.com/)
- **API**
  - [http://api2.cfregisters.org/](http://api2.cfregisters.org/)
- **Documentation**
  - [https://postgrest.com/en/v4.1/api.html](https://postgrest.com/en/v4.1/api.html)

Note:
Andy?
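The documentation link above is for PostgREST, whose filters take the form `column=operator.value`, as in the `registers?date=eq.1717-08-08` link on the technical overview slide. Here is a minimal sketch of building such URLs in Python with only the standard library. The `registers` endpoint and `date` column come from that link; the `build_query` helper is a hypothetical name introduced for illustration, and no assumption is made about other columns or about the shape of the response.

```python
from urllib.parse import urlencode

API_ROOT = "http://api2.cfregisters.org"  # API root listed on this slide


def build_query(table, filters=None, order=None):
    """Build a PostgREST-style URL.

    `filters` is a list of (column, operator, value) tuples; each one
    becomes a `column=operator.value` query parameter. `order` is passed
    through as PostgREST's `order` parameter, e.g. "date.desc".
    """
    params = [(column, f"{operator}.{value}")
              for column, operator, value in (filters or [])]
    if order:
        params.append(("order", order))
    query = urlencode(params)
    return f"{API_ROOT}/{table}?{query}" if query else f"{API_ROOT}/{table}"


# Reproduces the "Data access" link from the technical overview slide.
print(build_query("registers", [("date", "eq", "1717-08-08")]))
```

The resulting URL can be pasted into Postman or a browser. Other PostgREST operators such as `gte` and `lte` follow the same pattern.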
### Break

#### 10 minutes

Note:
### LP18 Technical Overview

![toolset](/img/lp18-diagram.png)

Note:
The archived documents in question are assumed to include both printed and manuscript documents, in a variety of formats and with a variety of contents. Some can be transcribed with the aid of automated optical character recognition; others cannot. Some have a regular, formulaic layout; others do not.

- Import and storage of digitized documents. In some cases, they’re already available.
- Extraction and storage of text, either by optical character recognition, or by hand, or both. Again, in some cases, digital text is already available.
- Presentation of each image with tools appropriate for capturing data from the “source type” in question.
- Capture of data and metadata from these documents, in a database, in such a way as to preserve links to the corresponding images and/or texts, and to compile reusable indices of named entities, with data value lists available for application to subsequent data entry tasks.
- Creation of metadata collections or “layers”, static or dynamic, either:
  - Manually, during transcription or expert annotation;
  - Or automatically, using full text search, facet browsers or NER tools;
  - Or both.
- Data visualization, principally on an historical map in cases in which data contain place names or other spatial markers, as:
  - Points representing important locations, with related metadata
  - Paths between places
  - Shapes, representing several data points grouped at a level of visual granularity appropriate to the resolution of the data.
- Search and filtration of data layers, across spatial, temporal and thematic layers, and superimposition of those layers.
- The project also aims to:
  - Produce and consume data from other sources, using linked open data standards
  - Make all data available to researchers and developers, using standard data access modalities
  - Create an open source platform, to allow others to create similar labs for other cities.
  - Provide a mechanism to allow owners of existing datasets to collaborate on this project, while preserving their ownership rights to their data.
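The capture and filtering steps listed above can be pictured as a small data model. This is only a sketch under stated assumptions: LP18 is described as being in planning, so the `Observation` class, its fields, and the `filter_layer` helper are hypothetical names, and the sample entries are invented placeholders rather than project data.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Observation:
    """One captured datum, kept linked to the facsimile it came from."""
    source_image: str                            # link back to the image
    theme: str                                   # thematic layer
    when: Optional[date] = None                  # temporal marker, if any
    where: Optional[Tuple[float, float]] = None  # (lat, lon), if any


def filter_layer(observations: List[Observation], theme=None,
                 start=None, end=None, bbox=None) -> List[Observation]:
    """Return the observations that satisfy every filter supplied.

    `bbox` is (min_lat, min_lon, max_lat, max_lon). Observations lacking
    a date or place are excluded when a temporal or spatial filter is set.
    """
    result = []
    for obs in observations:
        if theme is not None and obs.theme != theme:
            continue
        if start is not None and (obs.when is None or obs.when < start):
            continue
        if end is not None and (obs.when is None or obs.when > end):
            continue
        if bbox is not None:
            if obs.where is None:
                continue
            lat, lon = obs.where
            min_lat, min_lon, max_lat, max_lon = bbox
            if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
                continue
        result.append(obs)
    return result


# Invented placeholder entries, for illustration only.
SAMPLE = [
    Observation("facsimile-001.jpg", "theme-a", date(1750, 1, 1), (48.85, 2.35)),
    Observation("facsimile-002.jpg", "theme-b", date(1760, 6, 1), (48.86, 2.34)),
    Observation("facsimile-003.jpg", "theme-a"),  # no date or place captured
]

matches = filter_layer(SAMPLE, theme="theme-a", start=date(1740, 1, 1))
print([obs.source_image for obs in matches])
```

Superimposing layers then amounts to running several such filters and drawing the results together; keeping `source_image` on every record is what preserves the link back to the document.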
### Discussion (Starter) Questions

- **Engagement**
  - How can a historical data project engage progressively with different elements of its audience to create tools that serve the needs of all its users?
- **Collaboration**
  - How can a project team build trust, and deepen collaboration, especially over time, at distance, and across disciplinary boundaries?
- **Publication**
  - When and how should a project make its results public, and to what degree should feedback from outside the team shape the direction of the project?

Note:
### Conclusion

- Summary

Note:
My colleague Andy Stuhl, with whom I worked in Hyperstudio at MIT, has agreed to summarize the workshop’s findings, and to present his summary to us now.
### Thank you!

- Jamie Folsom
  - Email: [jamie@performantsoftware.com](mailto:jamie@performantsoftware.com)
  - Web site: [http://performantsoftware.com](http://performantsoftware.com)
  - Twitter: [http://twitter.com/jamiefolsom](http://twitter.com/jamiefolsom)
  - GitHub: [http://github.com/jamiefolsom](http://github.com/jamiefolsom)
  - Slides: [http://slides.jamiefolsom.com/dh17/](http://slides.jamiefolsom.com/dh17/)

Note:
Thank you!