Unravelling the Data for Development Muddle
Hackers, inter-governmental organisations (IGOs), governments and NGOs are pinning very high hopes on “data for development”, but its precise definition, and how the different types of data relate to each other, are often misunderstood. So it is important to differentiate between some of the most common types in use today.
Open Data - According to the very useful Open Data Handbook, “Open data is data that can be freely used, reused and redistributed by anyone - subject only, at most, to the requirement to attribute and share alike”, which is why transparency advocates are very supportive of open data. In essence this means that the data is easily available and accessible to anyone, anywhere, and that anyone can reuse and redistribute it. (You can also check the Open Knowledge Foundation’s Open Definition for further clarification.)
Big Data - Big data can come from social networks, sensors, satellite imagery, mobile phones, GPS, cars, financial markets and many other sources. In essence, “big data” is data that exceeds the processing capacity of conventional database systems. It is fast (velocity), varied (variety), really big (volume) and very valuable. It does not fit the traditional database architecture, so it can only be analysed using newer technology. (See what O’Reilly has to say about this, as well as the recent McKinsey Report.)
Linked Data - Thanks to the Web, we can link not only related documents but related data. The term “linked data” refers to the practice of publishing and connecting structured data on the Semantic Web. The Linked Data website describes how this is supported by URIs (a generic means to identify entities or concepts in the world), HTTP (a simple yet universal mechanism for retrieving resources, or descriptions of resources), and RDF (a generic graph-based data model with which to structure and link data that describes things in the world). This type of technology can also help to link big and open data sets.
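To make the idea of linking through shared identifiers more concrete, here is a minimal sketch in plain Python. It does not use a real RDF library; it only mimics the subject, predicate, object shape of an RDF statement. Every URI under example.org, and the helper names `objects` and `subjects`, are hypothetical and chosen purely for illustration.

```python
# A library-free sketch of RDF-style triples: (subject, predicate, object).
# All URIs under example.org are hypothetical placeholders, not real resources.
EX = "http://example.org/"

triples = {
    (EX + "dataset/rainfall", EX + "vocab/publishedBy", EX + "org/weather-office"),
    (EX + "dataset/rainfall", EX + "vocab/covers", EX + "country/AR"),
    (EX + "dataset/census", EX + "vocab/covers", EX + "country/AR"),
}


def objects(subject, predicate):
    """Return every object linked from `subject` by `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def subjects(predicate, obj):
    """Return every subject that points at `obj` through `predicate`."""
    return {s for s, p, o in triples if p == predicate and o == obj}


# Two separately published datasets become connected because both
# refer to the same URI for the country they cover.
linked = subjects(EX + "vocab/covers", EX + "country/AR")
```

The point of the sketch is that neither dataset needs to know about the other: the link emerges because both use the same identifier for the same thing.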
Raw Data - As the name suggests, this refers to primary data that comes straight from its initial source. It is easy to manipulate and sort, which makes it the Holy Grail for developers. Have a look at Tim Berners-Lee, the architect of the original World Wide Web, making the case for “Raw Data Now” in this TED video.
Machine-readable data - Data that is machine-readable can be understood by a computer. It can come from stored files or a device connected to a computer.
Metadata - This is basically data about data, and it is machine-readable. For example, it provides information about a certain item's content (e.g., size, color, when a document was created, length of the document, date of origin, etc.).
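As a small illustration of metadata, the sketch below reads a few descriptive facts about a file (its name, size and last-modified time) without looking at what the file says. The function name `describe` and the field names in the returned record are assumptions made for this example, not part of any standard.

```python
import os
import tempfile
from datetime import datetime, timezone


def describe(path):
    """Return a small metadata record for the file at `path`."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified_utc": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
    }


# Create a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hello")
    sample_path = handle.name

record = describe(sample_path)  # data about the file, not its contents
os.remove(sample_path)          # clean up the throwaway file
```

The record describes the document rather than reproducing it, which is what distinguishes metadata from the data itself.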
Real-Time Data - This refers to data that is provided as it is generated, without delay. It can help us to make informed decisions as activities unfold, and it is often delivered through an API (application programming interface, a set of protocols and tools for building software applications). For example, the E-Bread Index project shows how monitoring online prices can provide real-time insights on the pricing of bread in six Latin American countries.
Crowdsourced data - This refers to data that is gathered by crowdsourcing: asking “the crowd”, i.e., the online public, to give feedback and answer questions on specific issues. The idea is that the wisdom of the crowd may help uncover answers where other methods are less fruitful. Corporations as well as NGOs have started using this data to help them improve products and services. Such data can also be gathered in real time. However, reliability and validity issues must often be addressed when using it.
Geodata - This is data about the geographic location of an object, for example a mobile phone. Part of its value lies in being generated in real time: it shows where you are at any given moment. This data can also be mapped. When used with Geographic Information Systems (GIS), it can help make maps, plot addresses and identify routes.
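One simple thing geodata makes possible is measuring how far apart two located objects are. The sketch below uses the haversine formula for great-circle distance. It assumes a spherical Earth with a mean radius of 6371 km, so results are approximate, and the function name `haversine_km` is chosen for this example. A real GIS would use more precise models and actual road networks for routing.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean radius; assumes a perfectly spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance in km between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# One degree of longitude along the equator.
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
```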
Data Visualisations - These can be used to help the public better understand any of these types of data. They prove especially valuable for very large datasets, where more data is made available than the human mind can process. Seeing things visually can also provide tremendous insights, and any type of data can be visualised.
And a brief summary of how they are linked
Open data can be part of big data, but big data is not always open data. Big data and open data are both raw data and machine-readable. Both can be transformed into data visualisations, and both can also be crowdsourced data and geodata. Linked data technology can help to connect open data and big data sets.
Big issues for data
While there are several types of data, all data in the “data for development” debate faces these challenges:
validity and reliability; privacy and security; intellectual property and liability.
It is therefore not surprising that it is becoming difficult to understand what NGOs should focus their efforts on. As data processes and services develop, this has become a learning-by-doing experiment. Some efforts have been successful and others have not. Nevertheless, information has always been essential for development. The rate at which it is now being generated and made available to many only ensures that this will remain true.