Breathing new life into old data: how to retro-digitize a dictionary

A new digital home for granddad and grandma: breis.focloir.ie

I have recently worked on a project where we retro-digitized two Irish dictionaries and published them on the web, so I thought it would be a good idea to summarize my experience here. Hopefully somebody somewhere will find it useful.

In the slang of people who care about such things, retro-digitization is the process of taking a work that had previously been published on paper (often a long time ago, way before computers made their way into publishing) and converting it into a digital, computer-readable format. A bit like retro-fitting a house or pimping up an old car. This involves not only scanning and OCRing the pages, but also structuring and indexing the content so it can be searched and interrogated in ways that would have been impossible on paper. This is the bit that matters most if what you are retro-digitizing is a dictionary.

The dictionaries we retro-digitized are Foclóir Gaeilge-Béarla [Irish-English Dictionary] from 1977 (editor Niall Ó Dónaill), and English-Irish Dictionary from 1959 (editor Tomás de Bhaldraithe). Both are sizeable volumes which, despite their age, enjoy the respect, even adoration, of Irish speakers everywhere, and both are still widely used and widely available in bookshops. People have been saying for ages how nice it would be if we had electronic versions of these. And now we do, available freely to everybody on a website. Here’s how we got there.

Step 1: Getting your hands on the text

We were lucky to already have access to machine-readable versions of the two dictionaries, so no scanning or OCRing was necessary. In the case of English-Irish Dictionary (EID from now on), the text had already been scanned, OCRed, proof-read and even marked up into a fairly detailed XML structure before it landed on my desk. The trouble was that much of that structure had been lost again in the process of proof-reading the text: people were using Microsoft Word, which only preserved some of the structure, basically just bold and italics and not much more. Still, better than starting from paper.

Pic 1: FGB in typesetting format

Foclóir Gaeilge-Béarla (FGB from now on) was a different matter. I had access to a typesetting file from which the dictionary had originally been printed. I have no idea where that file came from but I assume it was the output of some early DTP software. It contained the entire text of the dictionary interlaced with formatting mark-up which was quite unlike anything I had seen before (see pic 1). But luckily it was not hard to make sense of and, again, this was a lot better than starting from paper. In fact, it was better than starting from OCRed text because you could take for granted that there were no spelling errors – none that wouldn’t be in the printed version too, at least.

Step 2: Structuring the entries

Modern born-digital dictionaries often have a complex and explicit structure, usually encoded in XML, where each entry is subdivided into senses, usage examples, various kinds of sub-entries, grammatical labels and so on. You can forget about all that when dealing with a paper-born dictionary though: the entries are just paragraphs of text, sometimes with surface formatting like bold and italics, but very little explicit structure beyond that. As a retro-digitizer, you need to be able to infer just enough structure from this to produce something useful, something that beats paper.

Pic 2: FGB converted into semi-structured data

In FGB, I was able to use the formatting codes from the typesetting file to infer where one sense ended and another began, where one usage example ended and another began, what was a grammatical label and what was a translation, and so on. Pic 2 shows an example entry in its original formatting (you can see it’s basically just text with bold and italics), with the inferred structure indicated by colour-coding and superscripts. The accuracy was so high that we decided not to do any manual proof-reading, at least not for now.
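To give a flavour of how structure can be inferred from surface formatting, here is a minimal Python sketch. It is illustrative only and does not reproduce the project's code: the `<b>` and `<i>` codes, the role names, the rules and the sample entry are all assumptions made up for this example, since the real typesetting mark-up looked quite different.

```python
import re

# Hypothetical formatting codes: <b>...</b> for bold, <i>...</i> for italics.
# The real typesetting mark-up is not reproduced here.
RUN = re.compile(r"<(b|i)>(.*?)</\1>|([^<]+)")

def infer_structure(entry):
    """Label each run of text in an entry by its surface formatting.

    Assumed rules (for illustration only):
      - the first bold run is the headword
      - a bold run such as "1." starts a new sense
      - any other bold run is a usage example
      - an italic run is a grammatical label
      - unformatted text is a translation
    """
    parts = []
    for match in RUN.finditer(entry):
        tag, styled, plain = match.group(1), match.group(2), match.group(3)
        text = (styled if tag else plain).strip()
        if not text:
            continue  # skip whitespace between formatted runs
        if tag == "b" and re.fullmatch(r"\d+\.", text):
            role = "sense_number"
        elif tag == "b" and not parts:
            role = "headword"
        elif tag == "b":
            role = "example"
        elif tag == "i":
            role = "label"
        else:
            role = "translation"
        parts.append((role, text))
    return parts

# Made-up sample entry, not taken from the dictionary.
sample = "<b>bád</b> <i>m.</i> <b>1.</b> boat <b>bád seoil</b> sailing boat"
for role, text in infer_structure(sample):
    print(role, "->", text)
```

Rules like these are cheap to write and easy to adjust, which matters because the inconsistencies in the source only show up once you run the rules over the whole dictionary.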

Pic 3: EID in markdown notation

EID was a tougher beast to tame. Even though I was able to infer a lot of structure here too, the accuracy was far from acceptable and we decided we needed to do a fair bit of manual proof-reading and correction. The question was, how? The lexicographers on our team, highly able as they are, couldn’t be expected to edit raw XML. There was no time to look for a suitable editorial GUI as the deadline was looming large. So, in a sudden fit of inspiration, I devised a plain-text notation that was explicit enough structurally while still being readable by humans. Pic 3 shows an example. There are various notational conventions: square brackets enclose text in the target language (Irish), double square brackets enclose text in the source language (English), an at sign @ says the rest of the line is a usage example, and so on. I think I may have accidentally invented a markdown format for dictionaries. With a plain-text editor and a team of several people, we were able to proof-read the entire dictionary in a few weeks and bring the structural mark-up to an acceptable level of accuracy.
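A notation like this is also easy to turn back into structured data. The Python sketch below handles only the three conventions mentioned above: single square brackets, double square brackets, and @ for examples. It assumes the @ comes at the start of a line. The function name, the output shape and the sample lines are hypothetical, and the real notation has more conventions than this.

```python
import re

# [[...]] is source-language (English) text, [...] is target-language (Irish) text.
# The double-bracket alternative is listed first so it wins over the single one.
SPAN = re.compile(r"\[\[(.+?)\]\]|\[(.+?)\]")

def parse_line(line):
    """Parse one line of the plain-text notation into a small dict."""
    line = line.strip()
    kind = "text"
    if line.startswith("@"):
        # Assumption: a leading @ marks the rest of the line as a usage example.
        kind = "example"
        line = line[1:].strip()
    source, target = [], []
    for match in SPAN.finditer(line):
        if match.group(1) is not None:
            source.append(match.group(1))
        else:
            target.append(match.group(2))
    return {"kind": kind, "source": source, "target": target}

# Made-up sample lines, not taken from the dictionary.
print(parse_line("[[boat]] [bád]"))
print(parse_line("@ [[a sailing boat]] [bád seoil]"))
```

The appeal of keeping the notation this simple is that a line-by-line parser is enough: nobody has to balance nested tags by hand, and a malformed line affects only itself.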

Our strategy in both dictionaries was to mark up the text without altering it. We wanted to keep the text exactly as it appeared in the printed version. The other option would have been to reinterpret and restructure the entries to bring them closer to modern lexicographic standards, but we did not dare to do that.

Step 3: Wrapping it up in a pretty website

Now your job merges with that of somebody working on a born-digital dictionary: you have a bunch of XML files and you need to build a pretty user interface where people can search and browse. Because you are now on a computer and not on paper any more, you can do things that would have been impossible on paper. Our website has things like reverse search (a search from the target language to the source language) and searches that match things deep inside the entries rather than just the headwords. The two dictionaries are very rich in examples and contain literally hundreds of thousands of example sentences. So we have a feature that lists example sentences for a given word or combination of words, even if they are under a completely unrelated headword. All of these searches are morphology-aware and spell-checked of course, as one would expect of any online dictionary worth its name.
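The example-sentence search can be pictured as an inverted index: every word points to the examples that contain it, along with the headword each example is filed under. Here is a minimal Python sketch of that idea. It is not the website's implementation, the data is made up, and it ignores morphology and spelling correction entirely.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"\w+", text.lower())

def build_index(entries):
    """Map each word to the (headword, example) pairs in which it occurs.

    entries: dict mapping a headword to a list of example sentences.
    """
    index = defaultdict(set)
    for headword, examples in entries.items():
        for example in examples:
            for word in tokenize(example):
                index[word].add((headword, example))
    return index

def find_examples(index, query):
    """Return the examples that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits

# Made-up sample data, not taken from the dictionaries.
entries = {
    "boat": ["to miss the boat", "a sailing boat"],
    "miss": ["to miss a train"],
}
index = build_index(entries)
print(find_examples(index, "miss"))       # examples filed under both headwords
print(find_examples(index, "miss boat"))  # only the example containing both words
```

A morphology-aware version would normalize each word to its lemma before indexing and before searching, so that inflected forms find each other.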

Our website was launched at the Oireachtas na Gaeilge festival this autumn (2013) and has been received with much enthusiasm – over 10,000 pageviews on the first day!

Summary

When retro-digitizing a dictionary, the most basic question is, should you do it at all? Old dictionaries are usually slightly out of date in the sense that they no longer fully describe the language as it is spoken nowadays. On the other hand, they contain a lot of potentially useful information which is still valid and which you can make available pretty quickly, probably a lot quicker than writing a new dictionary from scratch. Also, old dictionaries are often well-known “brands” and this will rub off onto your digital product, guaranteeing you a good reception from the public – provided you do a good job of the digitization.

Provided, yes. Digitizing old data can be a nightmare. The data was written entirely by humans, so you must expect annoying little inconsistencies everywhere. There will be grammatical labels whose meaning nobody knows, there will be gaps in homonym numbering, there will be dead cross-references. All this will conspire to limit what you can do in terms of structure, mark-up and search. So it’s a game of finding a good balance between what’s desirable and what’s possible. I hope the balance we have found for FGB and EID is good.
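Inconsistencies such as dead cross-references and gaps in homonym numbering are at least easy to detect once the entries are structured. The Python sketch below shows two such checks. The data shapes, function names and sample values are assumptions for illustration, not the checks that were actually run on FGB or EID.

```python
def find_dead_xrefs(xrefs):
    """Return (headword, target) pairs whose target is not a headword.

    xrefs: dict mapping each headword to the headwords it cross-references.
    """
    return sorted(
        (headword, target)
        for headword, targets in xrefs.items()
        for target in targets
        if target not in xrefs
    )

def find_homonym_gaps(homonyms):
    """Return the missing homonym numbers for each headword.

    homonyms: dict mapping a headword to the homonym numbers found for it.
    """
    gaps = {}
    for headword, numbers in homonyms.items():
        if not numbers:
            continue
        missing = sorted(set(range(1, max(numbers) + 1)) - set(numbers))
        if missing:
            gaps[headword] = missing
    return gaps

# Made-up sample data, not taken from the dictionaries.
print(find_dead_xrefs({"boat": ["ship", "vessel"], "ship": []}))
print(find_homonym_gaps({"bank": [1, 3], "bat": [1, 2]}))
```

Checks like these only report problems. Deciding what to do about each one is still a job for a lexicographer.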

13 thoughts on “Breathing new life into old data: how to retro-digitize a dictionary”

  1. Who wrote it?
    I see ‘I recently worked on…’ but I don’t see the author’s name anywhere.
    Fair play anyway.

  2. Ah, I see now.
    You are Michal Boleslav Měchura.
    Wow.
    Thanks and respect!

  3. You have done a great feat. The Irish-language community is indebted to you.

  4. Well done for sharing the back story and the recommendations!

  5. I’ve always wished I could have all my paper dictionaries in a digitalised format. Very informative post, thank you!

    _______________
    Karolina Karczmarek-Giel
    Office Assistant
    wantwords.co.uk

  6. Dear Michal,

    Your other site, Pota Focal, is also a great resource and I have used it every day for the last few years. It really is a fantastic piece of work; it has been an essential tool and an accelerator in my learning of Irish. Thank you very much. I can see you have a “bua” (a gift) in the field of computer technology and Irish, and you are a visionary in our world on how technology can assist our language. May God not weaken your hand.
    I have a question that you can either answer or consider as a project in the future.
    My daughter recently showed me how an Italian book can be read on kindle. She demonstrated that when you do not know a word you can hold your finger on that word and an English translation will appear on the screen.
    This would be a great facility for Irish students in the future, but I am not sure if anything is available right now.
    On a more general level, if I read a PDF on my Samsung TAB and hold down my finger on an Irish word, a menu appears at the top of the screen and one of the choices is a dictionary. Pressing the dictionary icon gives you a choice of Google or Wikipedia. If you choose Google, the Google search engine opens, puts that word in the search box and shows you the web results, all automatically. You then have to look through the results, find a site and navigate through that site for the translation. It’s all very clumsy, but the operation of sending the word to Google is good, and a bit like how Pota Focal sends search words to other sites at the bottom of its page, for example to An Foclóir Beag. This integration is fantastic and a great time saver.
    With regard to the PDF on my Samsung: an automated redirect with the queried word would be powerful if it was directed to http://breis.focloir.ie/en/fgb/ with the highlighted word already searched, instead of going to Google search as it does now.

    I use the online Ó Dónaill and de Bhaldraithe every day. I also love the pronunciation section with all the dialects. I hope this becomes available for every single word. One thing: the similar-word suggestions are very small and it’s hard to choose one from the other with my finger on a tablet. It would be good if they were bigger and spaced further apart.

    Many thanks to you

    • Hi Artúr, I know you had sent me an e-mail on the same topic some time ago and I feel guilty for not having answered it yet. I will, though. You’re giving us a lot of food for thought there but I’m heaving under a heavy workload at the moment. I’ll get back to you later!

  7. Delighted to see this and loooooove your blog! 🙂

  8. Thanks for this very personal account. I’m considering taking on a dictionary digitization process right now for a class; it helps to hear from someone who’s been there. I know the project’s already done for you, but I wanted to mention: University of Maryland has a project to make tools to speed up the pace of dictionaries. It might make life easier for those following in your footsteps.

    See http://www.casl.umd.edu/sites/default/files/download_4.pdf and http://elex2011.trojina.si/Vsebine/proceedings/eLex2011-40.pdf

  9. Teanglann.ie has become an indispensable resource to me since I started learning Irish a few months ago. Not just the digitized Ó Dónaill and de Bhaldraithe, but the grammar database! The whole thing works together well and is never frustrating, always helpful. Thank you very much!
