This was an occasion for information professionals to meet and discuss, well, data. You might think that that sounds too vague. Surely, what can anybody have to say about data in general except that it is the stuff that computers eat? A lot, actually. In the last couple of years, something has changed about the way we understand data: what it is, how we produce it, how much of it we produce, and how we use it. I will summarize this under two broad headings: big data and open data.
Big data is a buzzword that refers to the fact that organizations are accumulating more and more raw data and are not quite sure how to use it to their advantage. Think about it: every company, every government and every institution these days has huge deposits of data: financial data, data about sales and customer transactions, demographic data, meteorological data. Almost every organization today sits on a super-sized store of potentially useful data, if only it knew how to exploit it properly.
In a way, the phrase big data simply refers to data when you have a lot of it, and big data is not qualitatively different from “small” data. It is different quantitatively, though. Big data comes in such large volumes that even storing it and moving it around becomes a challenge. Big data comes in a variety of formats, not all of which are neatly structured, which makes processing and analysis difficult: things like normalized databases and referential integrity are optional in big data, and sometimes all you have is a semi-structured or completely unstructured “blob”. Big data people often talk about how we are shifting from classical relational databases to a new paradigm with loosely structured “NoSQL” databases and heavy-duty parallel processing. Last but not least, big data usually arrives quickly and needs to be processed quickly, in real time if possible (“real time” is computer speak for “as fast as or faster than a human can think”). These defining characteristics of big data are often summarized as “the three Vs”: volume, variety, velocity.
At the start of EDF, Marko Grobelnik from the Jožef Stefan Institute in Ljubljana gave an introductory tutorial on big data. What I’ve written in the previous two paragraphs is basically a summary of that tutorial. There were presentations later in the conference on how various businesses deal with big data, and as you might expect, database software vendors were in attendance. For me, the most impressive was Ingo Brenckmann from SAP who presented SAP’s in-memory database package SAP HANA. In-memory databases are what the name implies: instead of storing the database on disk and pulling pieces of it into memory as and when needed, the whole database is kept in memory (and backed up to disk every now and then). This speeds up data retrieval operations incredibly, allowing you to interrogate extremely large datasets in real time. In-memory computing can be an expensive affair, but SAP gives developers a chance to try HANA on a cloud server for free.
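To make the in-memory idea a little more concrete, here is a minimal sketch in Python. It uses SQLite from the standard library, not SAP HANA, so it only illustrates the general principle: the whole database lives in RAM, queries run against it there, and a copy is written to disk now and then. The `sales` table, the function names and the figures are all made up for illustration.

```python
import os
import sqlite3
import tempfile


def build_in_memory_db(rows):
    """Create an in-memory SQLite database holding (region, amount) rows."""
    conn = sqlite3.connect(":memory:")  # the whole database lives in RAM
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()
    return conn


def snapshot_to_disk(conn, path):
    """Copy the in-memory database to a file (the 'every now and then' backup)."""
    disk = sqlite3.connect(path)
    try:
        conn.backup(disk)  # Connection.backup is available from Python 3.7
    finally:
        disk.close()


def total_by_region(conn):
    """Run an aggregate query entirely against the in-memory data."""
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    return dict(conn.execute(query))


if __name__ == "__main__":
    # Hypothetical sample data, not real sales figures.
    db = build_in_memory_db([("North", 10), ("South", 5), ("North", 20)])
    print(total_by_region(db))
    snapshot_to_disk(db, os.path.join(tempfile.mkdtemp(), "snapshot.db"))
```

A real in-memory product does a great deal more than this (column storage, compression, distributing data across machines), but the trade-off is the same: fast reads from memory, with durability handled by periodic copies to disk.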
Open data

Most of the big data stuff is very corporate and actually kind of boring. Things become interesting, though, when we turn our attention to the data that publicly funded institutions accumulate: governments, state bodies, universities, international institutions and so on. There is a growing understanding that, because the data these institutions own has been paid for by the taxpayer, it should be openly available: hence our second buzzword today, open data. “Open” in this context means open-source and free from copyright: public institutions should basically give their data away for free, so everybody and anybody can re-use it, mash it up with data from other sources and build applications on top of it. People often talk up open data as a means to achieving two goals: firstly, it fosters a culture of openness and transparency in government, and secondly, it provides fodder for creative people to do new interesting things, to start up new businesses, to power the much-coveted information economy that EU heads so love talking about.
Open data means more than just sticking a couple of PDF files and Word documents on your website. Open data means making your data available in structured machine-readable formats, “raw” data that computers can work with easily, typically in formats such as XML, CSV or JSON. More and more countries are beginning to understand this and many have set up national data portals where people can go to find out what’s available. Several were presented or mentioned at EDF; here’s a random selection: Norway, France, Austria, UK, US, EU. An unofficial portal that brings European open data together is publicdata.eu, run by the Open Knowledge Foundation, a non-profit organization that promotes openness in data, information and knowledge in general.
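To show why machine-readable matters, here is a small sketch in Python of what a developer can do with a structured XML file that would be painful with a PDF. The XML layout, the element names (`dataset`, `record`, `area`, `population`) and the numbers are invented for this example; a real portal will have its own schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical dataset with made-up figures, standing in for a file
# downloaded from an open data portal.
SAMPLE_XML = """
<dataset>
  <record><area>Area A</area><population>100</population></record>
  <record><area>Area B</area><population>250</population></record>
</dataset>
"""


def load_records(xml_text):
    """Turn each <record> element into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    return [
        {
            "area": record.findtext("area"),
            "population": int(record.findtext("population")),
        }
        for record in root.iter("record")
    ]


def total_population(records):
    """Add up the population column across all records."""
    return sum(r["population"] for r in records)


if __name__ == "__main__":
    records = load_records(SAMPLE_XML)
    print(records)
    print(total_population(records))
```

A few lines of code are enough to get the data into a form you can sum, filter or combine with another source. With a PDF or a Word document, the same job starts with scraping text off a page layout.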
It is ironic that a conference like EDF should be held in Ireland because, sadly, Ireland is not exactly up to speed when it comes to open data. The sector is in its infancy here. There is an unofficial portal run by the research institute DERI but there is no official government data portal. There is no legal obligation on public institutions to make machine-readable data available, as far as I am aware. There is not much compulsion at EU level, either: the age-old Public Sector Information Directive only encourages, but does not force, governments to open their data (Ireland’s puny response can be seen here). Ireland is not a signatory to the Open Government Partnership, either.
However, things are not all bleak. The pioneer for open data in Ireland is Fingal County Council, which has its own data portal and actively encourages app developers to reuse its data. There was a presentation at EDF by Dominic Byrne, Fingal’s open data man (officially: Assistant Head of Information Technology), that showcased some of the wonderful applications people have built by re-using and mashing up open data. Things may start changing on the legal front soon, too. I hear that the EU directive is being modernised and it may well be the case that Irish institutions will have to start opening their data soon, whether they want to or not. I have also heard rumours that the Irish government is planning to join the Open Government Partnership, but I cannot vouch for that; it’s just rumours.
EDF is an annual event, held each spring in whichever EU member state currently holds the presidency: that explains its seemingly illogical location in Dublin this year. But it was a good thing it came to Dublin, as it gave me a chance to soak up some new ideas and to reflect on things. I walked away with much more in my head than what I’ve summarized here and I may return to some of the topics in a later blog post. Also, I’m seriously considering travelling to next year’s instalment, which I hear will be in Athens. That way, my future European Data Forum trip report may actually involve some travelling!