Normality: For or Against?

I’m a historian who is currently designing and/or building four databases.  As I work through the complexities of each project, I’m struck by two thoughts.

First: I’m overworked.

Second: I like the way relational algebra makes me think.

Good database design involves breaking a data set into the smallest viable components and then linking those components back together to facilitate complex analysis.  This process, known as normalization, helps keep the data set free of duplicates and protects the data from being unintentionally deleted or unevenly updated.
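
To make the idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (person, letter) and the placeholder rows are assumptions invented for illustration; they are not the actual schema or data of any database discussed here. Each person is stored exactly once, and each letter refers to people only by id, so a correction made in one place shows up everywhere that person appears.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default

# Hypothetical schema: one row per person, one row per letter.
conn.executescript("""
CREATE TABLE person (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE letter (
    id           INTEGER PRIMARY KEY,
    sender_id    INTEGER NOT NULL REFERENCES person(id),
    recipient_id INTEGER NOT NULL REFERENCES person(id),
    year         INTEGER
);
""")

# Placeholder rows, not historical data.
conn.executemany(
    "INSERT INTO person (id, name) VALUES (?, ?)",
    [(1, "Merchant A"), (2, "Consul B")],
)
conn.executemany(
    "INSERT INTO letter (sender_id, recipient_id, year) VALUES (?, ?, ?)",
    [(1, 2, 1790), (2, 1, 1791), (1, 2, 1792)],
)


def letters(connection):
    """Link the components back together: who wrote to whom, and when."""
    return connection.execute("""
        SELECT s.name, r.name, l.year
        FROM letter AS l
        JOIN person AS s ON s.id = l.sender_id
        JOIN person AS r ON r.id = l.recipient_id
        ORDER BY l.year
    """).fetchall()


# A single correction to the person table reaches every linked letter.
conn.execute("UPDATE person SET name = 'Merchant A (corrected)' WHERE id = 1")

for sender, recipient, year in letters(conn):
    print(year, sender, "->", recipient)
```

Had the names been typed into every letter row instead, the same correction would have required finding and editing each copy, which is exactly the uneven updating that normalization is meant to prevent.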

As I research merchants in the eighteenth century and how they connected people and empires with far-flung locations and transferred goods and ideas across oceans, I find it helpful to break those multivalent connections into discrete units.  Who wrote to whom?  Who worked for whom?  Who became a diplomat or consul for the United States?  Who recommended him for that position?  And so on.  Each question has become a relationship in my design for the Early American Foreign Service Database (EAFSD), and by linking all this (and more) information together, the EAFSD will track how the U.S. Foreign Service developed over fifty years.  But there is a catch.

When the database is done, I plan on publishing it online so that other researchers can have access to its data.  However, I cannot deny that the EAFSD was designed to answer questions specific to my dissertation.  Other researchers looking at information gathered from the papers of diplomats, consuls, and merchants will (hopefully) want to ask other questions which my database may or may not be able to answer.  For example, I only focus on merchants who had a clear connection to the U.S. government (i.e., received positions in the Foreign Service), which means that a large segment of the merchant community will not appear in the database.

Along with the completed database, I plan on releasing the source code (both for the database itself and the web application that permits the data migrations and the basic query structure) under an open source license, hopefully making it easier for other scholars to create their own relational databases to track social networks and institutional development.  Once those databases are published, similar issues will arise.

When a scholar decides to use a relational database in her research, she is making a decision about methodology — not theory.  A relational database does not dictate what scholars will find in a given data set, but rather shapes their search in ways that need to remain in the forefront of all our minds, even if the methodological discussions get relegated to footnotes or appendices.  If an astronomer has to state the specifications of the telescope along with the data received, a digital humanist should be clear about the choices she made (and why) in designing a database to facilitate her analysis and the analytical limits of the final design.

I became a historian because I see the world as a complex and contingent place that doesn’t respond well to being forced into a constraining model.  While having the EAFSD is a necessary condition of my dissertation, it is not a sufficient one.

There are real-world ambiguities and unpredictable turns in my subject matter which should not be modeled in a relational data structure.  High on this list are the many mistakes made by early American diplomats: John Adams picking a fight with the French Foreign Minister in the middle of the Revolutionary War (subject of my Master’s Thesis), James Monroe being recalled by a furious George Washington after denouncing (accurate) rumors regarding a new treaty with Great Britain, Thomas Jefferson breaking the Law of Nations to help Lafayette write the Rights of Man and Citizen; the list goes on and on.  On the other hand, while the database also fails to capture the sheer brilliance of Benjamin Franklin, it does hint at John Quincy Adams’ compulsive attention to detail.  None of these stories or personalities map into the database, but they are all crucial to understanding how the newly United States interacted with the larger Atlantic World.

Designing the EAFSD has sharpened my historical analysis, but narrative prose blurs the edges back into the delightfully abnormal lives of the people I seek to understand.

In brief: I am an Early American historian, a database designer, and a photographer. I'm also sleep-deprived, but that probably isn't related . . . Current Digital Humanities Librarian at Brown University, former Presidential Fellow in the Graduate School of Arts and Sciences, a former Digital Humanities Fellow in the University of Virginia Library's Digital…

8 Comments

  1. Here’s an interesting post with some comments-section discussion about CouchDB.

    http://mooseyard.com/Jens/2009/02/what-will-web-30-be/

  2. Have you seen CouchDB?

    http://couchdb.apache.org/

  3. Just want to put in another voice for a talk on this, it is a subject that people doing work in literary studies databases will be interested in as well.

  4. Many thanks for all the great comments. To respond more specifically:

    @Bob
    I would love to hear your presentation and echo Bethany’s hope that you will deliver it at the Scholars’ Lab in the near future.

    I did consider several different data structures for each of my four projects, and in one case (a prosopography database) the project involves a document management system and the use of MarkLogic server space in conjunction with an RDBMS. As for the other three, none of them rely on large amounts of free form text, and in all cases, the crucial information fits nicely into a normalized data structure. What the Early American Foreign Service Database can’t do is retell past events in a new narrative form. Even if it could, I would rather do that in the text of my dissertation.

    @David
    The above being said, you are absolutely right — the greater availability of proven, capable tools and interfaces (at low or no cost) for relational databases made my decision much simpler; though I cannot deny that my background in relational database architecture also directed my technology choices.

    Which brings me back to my larger question: how should we, as scholars, approach using data structures in a humanities project? Whether we choose a relational, an XML, or an object oriented database, that choice will impact our results. I’m quite happy with my normalized data structure and how it contributes to my analysis, but I need to be upfront about the choices I made and why I made them. If I omit that level of transparency, I’m clouding the vision of my future users and readers.

  5. Hey, Bob! I’ve got your “right context,” right here! We’d love to bring you into the Scholars’ Lab to give a talk like this, which would be of great interest to a lot of our staff and collaborators. Some of the literary scholars with whom we work, for example, are engaged in discussions about the value of a parallel-segmented TEI approach vs. a database approach to expressing variants and relations among variants in traditional scholarly editions. Am contacting you offline…

  6. Hi David,

    Very understandable. When I first heard about Ruby on Rails, I thought it was all about building websites, but when I tried a tutorial (http://www.snee.com/bobdc.blog/2006/04/joining_the_ruby_on_rails_chor.html) I found that what it really helped with was quick development of distributed relational database applications for people with little background in it, especially the input forms. XForms sounded great when it first got some buzz, then implementations didn’t get very far for a while. In the last few months it seems to be picking up, so it’s something to keep an eye on going forward, but your point about easy, reliable distributed input is well-taken.

    I have a presentation with a somewhat grand title of “Automated Databases from the 18th to the 21st Centuries” in which the ultimate message is what XQuery can offer over relational databases for data that doesn’t fit well into tables, and I’d be happy to give it on campus when the right context comes along.

    Bob

  7. Hi Bob,

    The problem we have consistently had with XML in doing more data-centric (vs. just marking up essays or texts) humanities work has been in providing reliable and easy-to-use editing interfaces for distributed input and revisions run collaboratively across multiple scholarly centers of activity. XForms so far has not yielded a robust project at UVa as far as I know, and making a custom-built solution seems to take a lot longer than just launching a straightforward relational database. Given how radically underfunded digital humanities projects usually are, such considerations can make the difference between providing a realistic editorial interface or not. That has been a persistent sticking point around here for years and has proven very frustrating. I could cite a long list of examples just from the local context.

    Any thoughts on that?

    David

  8. Did you consider any formats besides a relational database? For example, an XML-based format to query using the W3C XQuery standard?

    Relational databases have many advantages. The normalization process provides a deterministic way to impose an efficient structure on a collection of data, but for a lot of people, an RDBMS is the hammer that makes all data collections look like nails. An XML-based format is not only more flexible (in both initial schema design and eventual schema modification) for humanities-oriented data; it’s easier to share on the internet.

    If your data fits well into normalized tables, then an RDBMS is the way to go. I would never recommend that a company store their inventory or employee database in anything but an RDBMS. I was just curious if you had considered anything else for the data you’re compiling.

    Bob
