The Design of Open Data Systems: Analyzed Through Case Studies in Nairobi and Lahore

Senior Thesis presented to the Department of Computer Science & the Woodrow Wilson School of Public Policy & International Affairs, Princeton University

Abstract

Open data initiatives such as those involving government data sharing, crowdsourced crisis management and the monitoring of public service delivery are often unable to maintain relevance as solutions to the social problems they aim to solve. Drawing on fieldwork in Nairobi and Lahore, this study presents a framework for understanding the underlying network of open data systems. It argues that sustainable open data systems are defined by strong links between data providers, software owners, and action agents. These links can be strengthened through the development of technology and policy that aims to augment existing human infrastructure. This is illustrated through the development of toolkits to aid Voter Registration procedures in Kenya and Pakistan. The design methodology used in this development can be extended to other cases and locations to achieve sustainability in open data systems.

Analytical Framework

The fundamental argument of this thesis is that the study of technology must incorporate a study of its social context. I argue that all technology assumes roles for its users, just as users assume roles for technology. It is only when the assumptions of both stabilize that a technological system becomes a true instrument of knowledge. The open data movement is relevant globally, but is of particular importance to developing countries, where the need for government accountability and social movement is more pressing and where the spread of mobile technology is most profound. Traditional design approaches have often ignored the needs of the developing world. This thesis therefore frames the rest of its study along the lines of Postcolonial Computing.

First, I conduct an Engagement with the stakeholders of open data systems, interviewing software developers, government officials and activists in Kenya and Pakistan.

Second, I propose a generalized Articulation of the human infrastructure of open data systems.

Third, I provide a Translation of the needs of this system into technological toolkits, designed to aid the life of the open data ecosystem.

The goal of this study is to provide a framework through which to see open data systems, and from which to extrapolate principles for the design of technology and policy in systems of the public domain.

Engagement

My fieldwork consisted of 18 interviews with Open Data leaders in Nairobi and Lahore. The key projects in my study were the Kenya Open Data Initiative (the Kenyan Government's work to share government data), Data.gov.pk (an Open Data Initiative by the Government of the Punjab, Pakistan), Ushahidi (a crowdsourced crisis reporting tool) and some of its most successful instances, and the Citizen Feedback Model (a project in the Punjab Government to monitor public service delivery by directly calling affected citizens).

Articulation

I present a common methodology to frame open data systems as consisting of three types of actors: data providers, software owners, and action agents. I argue that the consistent failure of many open data systems to stand the test of time can be traced back to the collapse of links between these actors. It is therefore imperative that technological interventions strengthen these links by reducing the costs of interaction and by enabling the creation and upkeep of these relationships. The diagram below shows flows of data as solid lines, and other influences between actors as dotted lines.

Translation

Voter registration was a process that stood out to me in both Kenya and Pakistan, and I used it as a case study to develop toolkits that aim to strengthen the relationships of software owners to data providers and action agents. Together, these toolkits (Chutney & Grapes) could be deployed in a few hours and used to convert paper data from Election Commissions into a searchable SMS interface available to everyone with a cell phone.

Chutney:

An SMS query system that allows citizens to find out where their Voter Registration Centers are by texting in where they are located. Chutney aims to strengthen the links between software owners and action agents by providing a simple, extensible toolkit to create SMS query systems across various domains. Chutney makes use of an existing dataset of Voter Registration Centers in Kenya.

Chutney was designed to minimize conversation lengths, always provide a helpful response, prevent format traps, and be customizable & re-deployable.

At the core of Chutney is a novel approach to responding to natural language queries. Instead of creating complex language models, this algorithm searches for tokens in the query that match any keyword in the database. Upon finding a match, each keyword is attached to the column of the database in which it matches, and these column values are then used to look up another column against a specific record in the database.

Words that don't match are ignored, so a query can be formulated in any linguistic form and Chutney will still find the answer.

If the number of columns identified from the database is not enough to uniquely identify the record from which a response is formulated, Chutney responds by asking for clarification.

Shown below:
Top Left: a sample 'ideal' query and associated response
Top Right: a set of queries that would have returned the same results
Bottom Left: a query that uniquely identifies a result with only one column
Bottom Right: a query that needs clarification and is answered over two SMS messages
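
The keyword-matching approach described for Chutney can be illustrated with a minimal sketch. This is not the thesis's implementation: the sample rows, the column names (constituency, ward, centre), the function names and the reply wording are all hypothetical, and the sketch assumes single-word keywords and exact matching. It also leaves out the conversation state that would be needed to combine a clarification reply with the earlier message.

```python
import re

# Hypothetical sample rows; the real Voter Registration Center dataset is not reproduced here.
RECORDS = [
    {"constituency": "langata", "ward": "karen", "centre": "Karen Primary School"},
    {"constituency": "langata", "ward": "kibera", "centre": "Olympic Primary School"},
    {"constituency": "westlands", "ward": "parklands", "centre": "Parklands Hall"},
]
KEY_COLUMNS = ("constituency", "ward")  # columns searched for keywords (assumed)
ANSWER_COLUMN = "centre"                # column returned to the citizen (assumed)


def match_tokens(query, records, key_columns):
    """Attach each query token that appears in a key column to that column."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    matched = {}
    for token in tokens:
        for column in key_columns:
            if any(record[column] == token for record in records):
                matched[column] = token
    return matched  # tokens that match nothing are simply ignored


def answer(query, records=RECORDS, key_columns=KEY_COLUMNS,
           answer_column=ANSWER_COLUMN):
    """Return the answer column, or a request for clarification."""
    matched = match_tokens(query, records, key_columns)
    if not matched:
        return "No known place found. Please text your " + " or ".join(key_columns) + "."
    # Keep only the records that agree with every matched column value.
    candidates = [r for r in records
                  if all(r[col] == val for col, val in matched.items())]
    if len(candidates) == 1:
        return candidates[0][answer_column]
    if not candidates:
        return "Those places do not belong together. Please check and try again."
    missing = [c for c in key_columns if c not in matched]
    if missing:
        return "Found %d centres. Please also text your %s." % (
            len(candidates), " or ".join(missing))
    return "Possible centres: " + "; ".join(r[answer_column] for r in candidates)


print(answer("Where do I register to vote in Karen, Langata?"))
print(answer("langata"))
```

In this sketch a single matched column is enough whenever it narrows the table to one record, and a clarification request is sent only when several records remain.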

Grapes:

A web-based tool to digitize data sets from scanned images into a reusable format.

In my interviews with leaders of the Kenya Open Data initiative I discovered the massive cost of digitizing paper-based records. I also saw the lack of any digital dataset of Voter Registration Centers in Pakistan and witnessed the ad-hoc organization of volunteers on the ground to collect this data in paper form from government offices and share it on social networks as scanned images.

Grapes was intended to help with this process via crowdsourcing. The system is designed as a web app to which scanned images can be uploaded and in which known fields in the documents can be identified. Volunteers are then recruited to read from the images and transcribe their contents into the appropriate fields. The system administrator can define rules that check for accuracy (using the number of responses that are similar) before each task is completed.
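
One way such an accuracy rule could work is sketched below. The sketch is an assumption rather than the rule Grapes actually uses: it treats two transcriptions as "similar" when they are identical after whitespace and case normalization, and the function names and the min_agreement parameter are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Collapse whitespace and case so trivially different entries compare equal."""
    return " ".join(text.split()).lower()


def resolve_field(responses, min_agreement=2):
    """Return the agreed (normalized) transcription, or None if more volunteers are needed."""
    counts = Counter(normalize(r) for r in responses if r.strip())
    if not counts:
        return None
    value, votes = counts.most_common(1)[0]
    # The task for this field completes only once enough volunteers agree.
    return value if votes >= min_agreement else None


print(resolve_field(["Govt. High School", "govt.  high school", "Govt High Schol"]))
```

A stricter administrator could raise min_agreement, trading more volunteer effort per field for higher confidence in the transcribed value.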

This system was specifically designed to provide flexibility of data structure, to allow the bulk adding of tasks and exports to a reusable database, and to be customizable & re-deployable. The collaborative nature of the system also encourages public ownership of data that is in the public interest, ensuring accuracy and upkeep.

Shown below:
Top: the workflow for Grapes
Bottom: a screenshot of Grapes in use on a phone, set up to identify predefined fields from a scanned sheet of a document containing Lahore's voter registration centers.

Presentation

This work was presented to the Princeton University Department of Computer Science. The slides of this presentation are laid out below.

Complete Paper

The full document can be accessed here.