This is an abbreviated transcript of a talk I gave at a British-Irish Council conference on language technology in indigenous, minority and lesser-used languages in Dublin earlier this month (November 2015) under the title ‘Do minority languages need the same language technology as majority languages?’ I wanted to bust the myth that machine translation is necessary for the revival of minority languages. What I had to say didn’t go down well with some in the audience, especially people who work in machine translation (unsurprisingly). So beware, there is controversy ahead!
I have devoted a large chunk of my career to learning Irish, working with Irish and making a living out of Irish. So I thought it would be fair to put together a list of reasons why I think the language is worth it. Mine are proper linguistic reasons though – none of that starry-eyed sentimental nonsense about the language being ‘beautiful’ or ‘romantic’! So, put your language geek hats on, here we go!
(Many of the features mentioned here are actually common to all Celtic languages, including Scottish Gaelic and Welsh, but let’s not be splitting hairs now.)
Open data is the principle that data held internally in organizations should be made available to outsiders. This applies mainly to governments. Governments possess large amounts of data: data about the geographies of their countries, anonymized statistics about their populations, about the economy, data about transport infrastructure, about traffic, about weather. There is a growing understanding in developed countries everywhere that governments should make these data sets available, in machine-readable formats, for free reuse by anyone anywhere, without copyright or royalties. The idea is that society will benefit in two ways. First, opening up government data will encourage transparency in government: good governments have nothing to hide. Second, all that data will provide fodder for innovation and entrepreneurship: people will be able to build applications on top of the data, start businesses and create jobs or, failing that, at least build useful apps that make people’s lives easier.
That is the theory, and politicians everywhere are falling over themselves proclaiming how much they believe in it. But not everywhere are words being converted into actions. Sadly, the country where I live, the Republic of Ireland, is not a leader in this field. Very little government data is available for unrestricted reuse or in formats that lend themselves to easy reuse. I will demonstrate this with a concrete example from personal experience.
In the slang of people who care about such things, retro-digitization is the process of taking a work that had previously been published on paper (often a long time ago, way before computers made their way into publishing) and converting it into a digital, computer-readable format. A bit like retro-fitting a house or pimping up an old car. This involves not only scanning and OCRing the pages, but also structuring and indexing the content so it can be searched and interrogated in ways that would have been impossible on paper. This is the bit that matters most if what you are retro-digitizing is a dictionary.
The dictionaries we retro-digitized are Foclóir Gaeilge-Béarla [Irish-English Dictionary] from 1977 (editor Niall Ó Dónaill), and English-Irish Dictionary from 1959 (editor Tomás de Bhaldraithe). Both are sizeable volumes which, despite their age, enjoy the respect, even adoration, of Irish speakers everywhere, are still widely used and widely available in bookshops. People have been saying for ages how nice it would be if we had electronic versions of these. And now we do, available freely to everybody on a website. Here’s how we got there.
I have been going to lexicography conferences for many years now, including the Euralex congresses and the eLex series. One popular opinion that always emerges in talks and conversations at such events is that the Web is – supposedly – killing the dictionary. Now that I’m about to attend yet another instalment of the eLex conference (taking place in Tallinn, Estonia this year – great, I hear the Baltic Sea is lovely in October!) I thought it would be a good idea to dissect this opinion a little. Let’s dissect away, then.
Oh, the things I do for fun at weekends! For example, last weekend I attended the Linguistics of the Gaelic Languages conference at University College Dublin (19–20 April 2013). This was a small but focused event, with 20 to 30 people attending to discuss the latest research on Irish, Scottish Gaelic and Manx. Here is my report.
This was an occasion for information professionals to meet and discuss, well, data. You might think that that sounds too vague. Surely, what can anybody have to say about data in general except that it is the stuff that computers eat? A lot, actually. In the last couple of years, something has changed about the way we understand data: what it is, how we produce it, how much of it we produce, and how we use it. I will summarize this under two broad headings: big data and open data.