The Design of Open Data Systems: Analyzed Through Case Studies in Nairobi and Lahore

Thesis presented to the Department of Computer Science & the Woodrow Wilson School of Public Policy & International Affairs, Princeton University

Open data initiatives, such as those involving government data sharing, crowdsourced crisis management, and the monitoring of public service delivery, often fail to remain relevant as solutions to the social problems they aim to solve. Drawing on fieldwork in Nairobi and Lahore, this study presents a framework for understanding the underlying network of open data systems. It argues that sustainable open data systems are defined by strong links between data providers, software owners, and action agents. These links can be strengthened through the development of technology and policy that augment existing human infrastructure. This is illustrated through the development of toolkits to aid voter registration procedures in Kenya and Pakistan. The design methodology used in this development can be extended to other cases and locations to achieve sustainability in open data systems.

Analytical Framework

The fundamental argument of this thesis is that the study of technology must incorporate a study of its social context. I argue that all technology assumes roles for its users, just as users assume roles for technology. Only when the assumptions of both stabilize does a technological system become a true instrument of knowledge. The open data movement is relevant globally, but it is of particular importance to developing countries, where the need for government accountability and social movement is more pressing and where the spread of mobile technology is most profound. Traditional design approaches have often ignored the needs of the developing world. This thesis therefore frames the rest of its study along the lines of Postcolonial Computing.

Engagement with stakeholders

My fieldwork consisted of 18 interviews with software developers, government officials, and activists in Nairobi and Lahore. The key projects in my study were the Kenya Open Data Initiative (the Kenyan Government's work to share government data), Data.gov.pk (an Open Data Initiative by the Government of the Punjab, Pakistan), Ushahidi (a crowdsourced crisis reporting tool) and some of its most successful instances, and the Citizen Feedback Model (a Punjab Government project that monitors public service delivery by directly calling affected citizens).

Articulation of the Human Infrastructure of Open Data Systems

I present a common methodology to frame open data systems as consisting of three types of actors: data providers, software owners, and action agents. I argue that the consistent failure of many open data systems to stand the test of time can be traced back to the collapse of links between these actors. Technology should reduce costs of interaction, strengthening relationships and sustaining open data systems. The diagram below shows flows of data as solid lines, and other influences between actors as dotted lines.

Translation of Needs to Technology

My case study was the development of simple toolkits for voter registration, a process relevant in both Kenya and Pakistan.

Chutney is an SMS query system that allows citizens to find their Voter Registration Centers by texting in their location. The system strengthens the links between software owners and action agents by providing a simple, extensible toolkit for creating SMS query systems across various domains. Chutney makes use of an existing dataset of Voter Registration Centers in Kenya.

The design goals were to minimize conversation lengths, always provide a helpful response, prevent format traps, and be customizable and re-deployable.

A simple linguistic approach makes the system easy to install. Queries look for keywords. If unique, results are delivered automatically. Otherwise simple disambiguations are requested. No complex language models are required.

The system is powered by a novel approach to responding to natural language queries. Instead of creating complex language models, the algorithm searches the query for tokens that match any keyword in the database. When a match is found, each keyword is attached to the database column in which it matched, and these column values are then used to identify a specific record and look up another column of that record.
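The keyword-matching approach described above can be illustrated with a minimal sketch. This is not the thesis's actual implementation: the sample records, the column names (`constituency`, `ward`, `center`), the reply wording, and the function names `match_tokens` and `answer` are all assumptions made for illustration. The sketch matches single-word tokens only, so multi-word place names would need extra handling.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]

# Hypothetical sample rows; the real dataset of Voter Registration Centers
# is not reproduced here, and these column names are assumptions.
CENTERS: List[Record] = [
    {"constituency": "westlands", "ward": "kangemi", "center": "Kangemi Primary School"},
    {"constituency": "westlands", "ward": "parklands", "center": "Parklands Primary School"},
    {"constituency": "langata", "ward": "karen", "center": "Karen C Primary School"},
]

SEARCH_COLUMNS = ["constituency", "ward"]  # columns scanned for keywords
ANSWER_COLUMN = "center"                   # column returned to the citizen


def match_tokens(query: str, records: List[Record]) -> List[Tuple[str, str]]:
    """Attach each query token to every search column in which it occurs."""
    pairs = []
    for raw in query.lower().split():
        token = raw.strip(".,?!")
        for column in SEARCH_COLUMNS:
            if any(record[column] == token for record in records):
                pairs.append((column, token))
    return pairs


def answer(query: str, records: List[Record] = CENTERS) -> str:
    """Reply with a center if the match is unique, else ask to disambiguate."""
    pairs = match_tokens(query, records)
    # Keep only the records consistent with every matched (column, keyword) pair.
    candidates = [r for r in records if all(r[c] == t for c, t in pairs)]
    if not pairs or not candidates:
        # Always give a helpful response, even when nothing was recognised.
        return "No known place found. Reply with your ward or constituency."
    centers = sorted({r[ANSWER_COLUMN] for r in candidates})
    if len(centers) == 1:
        return "Your registration center: " + centers[0]
    # Several records remain: ask about the first column not yet matched.
    matched = {c for c, _ in pairs}
    for column in SEARCH_COLUMNS:
        if column not in matched:
            options = sorted({r[column] for r in candidates})
            return "Which " + column + "? " + ", ".join(options)
    return "Possible centers: " + ", ".join(centers)
```

Under these assumptions, a free-form message such as "Where do I register in Kangemi?" resolves directly to a single center, while a message containing only "westlands" prompts a follow-up question about the ward. Because tokens are matched wherever they appear, the citizen is not forced into a fixed message format.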

This was supplemented with a crowdsourcing tool to digitize voter lists. Crowdsourcing reduces costs and takes advantage of the ad hoc volunteer organization that was collecting and scanning images of these lists during the Pakistani elections.