Keyboards for Urdu conventionally have the same number of keys as a normal QWERTY keyboard. This is nonsensical because Urdu has more letters than English. Further, the standardized layout promulgated by the Pakistani government maps the Urdu alphabet onto QWERTY phonetically.
This builds into the technology the assumption that Urdu computing requires a familiarity with English computing. This is only one in a series of compromises made in the history of computing, typewriting, and printing in Urdu, other languages in the Arabic script, and non-Latin languages broadly.
The result is an inescapable belief in Pakistan today that English is the language of computing, implying that Urdu will never be sufficient for computing. Of course, the reality is that computing is not sufficient for Urdu. This project attempts to challenge not just technological designs but also social conceptions of localized technology, with an outreach initiative to test and provide language models for the keyboard. It promotes not just better language technology but a rethought ownership of technology that reflects and promotes a renewed cultural, social, and historical awareness in our practice.
Urdu has about 100 million native speakers. However, due to its official status in both Pakistan and India, and its mutual intelligibility with Hindi, it effectively addresses a much larger population.
Across modern Pakistan, however, Urdu is slowly disappearing from everyday use – in road signs, in government communication, in schools, and in software.
It is widely believed that Urdu must inevitably give way to English as the language of modernity, science, technology, and progress.
Urdu did not become the language of the past by coincidence. The erosion of cultures across the world by visions of modernity, especially those driven by technological progress, is a pattern that has repeated many times over the past few centuries. A historical analysis reveals this much: it took three centuries for printing in Europe to reach the Arabic script, and decades of experimentation with Chinese keyboards still landed us in a situation where millions of Chinese speakers type via the Latin script.
The typeface that dominates Urdu text production today was built in the early 1980s. In its first years it could not type new words, only those in an initial dictionary whose ligatures had been built into the font. I don't see the death of technical vocabulary after this moment as a coincidence.
Points of intervention have to be identified by understanding the technology through which language is represented. Urdu software is substandard in many areas (identified in red).
A keyboard underpins digital text production. So any effort to address the under-representation of Urdu and other regional languages on the internet, and in software as a whole, starts with the keyboard.
It is imperative that we do this now, as millions more in the region come online through the availability of smartphones.
A keyboard also affords the creation of underlying technological infrastructure that is beneficial to other technological components: such as dictionaries and language models.
Most engineers, when questioned on the topic of new Urdu keyboards, exclaim that it's a solved problem. However, Urdu speakers distanced from the act of building software express strong discontent with their tools. But they do not ask for better technology, instead blaming themselves for their trouble. This is the situation we teach our students to recognize, in user testing 101, as "this is a bad design".
The result of consistent keyboard failures is that convoluted human systems develop to produce functional Urdu text. Each layer of interaction makes it harder to represent true expression and correct language, leaving the resulting output inaccurate and ineffective. Urdu becomes useless as a functional tool.
Current keyboards are variations of a phonetic mapping of Urdu letters onto QWERTY.
Keyboards are traditionally laid out using a combination of simulation and empirical testing. To my knowledge, no empirical testing with humans has ever been conducted for keyboards that use the Arabic script.
Given that the most obvious problem with Urdu keyboards is legibility, an alphabetical layout becomes a great first test case. From tests with Latin layouts, we know that layouts within a ~20% range of performance do not warrant a change. That becomes a preliminary bar for judging the long-term usage cost of a keyboard. Of course, a complete understanding of how a keyboard performs includes both initial legibility and performance over time.
The alphabetical layout is not a new idea, not even in the realm of Urdu. Android offers a version under the 'Urdu (India)' language setting (which is an interesting story on its own). This layout represents characters in a more accurate nastaliq calligraphic form. Typography in the Arabic script has its own deep history of technological misrepresentation, and presents new challenges for modern developers and designers.
More curious, however, is that Google's layout skips a letter ostensibly considered part of every major formulation of the Urdu alphabet. This raises the question: who owns the Urdu alphabet? Even small polls reveal varying understandings of how Urdu is conceived, all of which is lost in the technological sphere. I show this drawing not because it is particularly high quality research, but because it also showcases how software development for modern Pakistan must proceed – via cross-border engagement not just of native audiences, but of the diaspora, which is often deeply invested and more motivated to find cultural representation.
There is a need for a body to protect and rationalize the technological representation of the Urdu language. This project attempts to provide that space and initiative.
Arabic script is cursive. As a result, letters take on many shapes as they connect forwards or backwards. Depending on the specific calligraphic form, there are countless shapes a single letter can take. These are, however, simplified into four forms: isolated, initial, medial, and final.
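The four-form simplification is visible in Unicode itself, which keeps compatibility code points for the positional forms of each letter. The following is a minimal sketch using only Python's standard `unicodedata` module, with the letter ب (beh) as an arbitrary example. The helper name `positional_form` is an illustrative assumption. Note that ordinary text stores only the base letter and leaves the choice of form to the font and shaping engine.

```python
import unicodedata

BEH = "\u0628"  # ب, the base letter as stored in ordinary text

# Compatibility code points for the positional forms of beh
# (Arabic Presentation Forms-B block).
FORMS = ["\uFE8F", "\uFE91", "\uFE92", "\uFE90"]

def positional_form(ch):
    """Return (form_tag, base_letter) read from the Unicode decomposition."""
    tag, code = unicodedata.decomposition(ch).split()
    return tag.strip("<>"), chr(int(code, 16))

for ch in FORMS:
    tag, base = positional_form(ch)
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  ->  {tag} form of U+{ord(base):04X}")
```

Each of the four code points decomposes back to the same base letter, which is the abstraction children practise in the توڑ جوڑ exercise.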
As children, Urdu speakers are taught to abstract between the isolated forms of each letter, and their various forms inside of words. This is the elementary توڑ جوڑ (disassembly - assembly) exercise, seen in a textbook below.
Keyboards tend to label keys with the isolated forms of each letter. As a result, users have to abstract unnecessarily back to isolated forms while they type words. This design gets rid of that extra mental step.
While it is the lingua franca, Urdu is the mother tongue of only a small minority of Pakistan's population. Real representation for this population means finding solutions not just for Urdu but for other languages as well, many of which are also written at times in Arabic script. Otherwise the tyranny of English is replaced with the tyranny of Urdu.
The Arabic script is made up of 21 basic letter shapes. These are then annotated using a variety of symbols to indicate different consonants in different languages.
This characteristic of the script can be used to condense the keyboard to 21 keys, down from 39 letters in Urdu and over 50 in many other languages. Software can then disambiguate between individual consonants as it auto-completes and auto-corrects. This also allows for larger hit targets and the potential to carry over to existing desktop keyboards, which may not have enough keys to accommodate languages with large alphabets.
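One way to picture the disambiguation step is a dictionary lookup keyed on letter shapes. The sketch below is a simplified illustration under stated assumptions, not the project's implementation: `SHAPE_GROUPS` is a partial, hypothetical grouping of dotted letters under a shared base shape (not the actual 21-key assignment), and the word counts in `LEXICON` are invented.

```python
from collections import defaultdict

# Hypothetical, partial grouping: each key stands for one base shape,
# and the value lists the letters that share it.
SHAPE_GROUPS = {
    "ب": "بپتٹث",
    "ح": "جچحخ",
    "د": "دڈذ",
    "ر": "رڑزژ",
    "س": "سش",
    "ص": "صض",
    "ط": "طظ",
    "ع": "عغ",
    "ک": "کگ",
}
TO_SHAPE = {letter: shape
            for shape, letters in SHAPE_GROUPS.items()
            for letter in letters}

def skeleton(word):
    """Replace every letter by its base shape; ungrouped letters stay as-is."""
    return "".join(TO_SHAPE.get(ch, ch) for ch in word)

def build_index(lexicon):
    """Map each skeleton to its words, most frequent first."""
    index = defaultdict(list)
    for word, count in lexicon.items():
        index[skeleton(word)].append((count, word))
    for bucket in index.values():
        bucket.sort(reverse=True)
    return index

def candidates(index, typed_shapes):
    """Words whose skeleton matches the sequence of shape keys typed."""
    return [word for _, word in index.get(typed_shapes, [])]

# Invented counts, for illustration only.
LEXICON = {"حال": 120, "چال": 45, "جال": 30, "خال": 5, "کتاب": 200, "کباب": 15}
INDEX = build_index(LEXICON)

print(candidates(INDEX, skeleton("جال")))
print(candidates(INDEX, skeleton("کباب")))
```

In this toy lexicon, کتاب (book) and کباب (kebab) collapse to the same key sequence, so ranking has to come from frequency here and, in a real keyboard, from a language model that considers the surrounding words.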
Diacritics, compounding with the ۶ letter, and spell correction that accounts for similar-sounding letters are other features that can be baked into the keyboard.
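Spell correction for similar-sounding letters could be approached by trying sound-alike substitutions against a word list. The following is a minimal sketch, assuming a few commonly cited sets of Urdu letters that are pronounced alike. Both the `SOUND_GROUPS` sets and the function name are illustrative assumptions, and a real keyboard would rank results with a language model rather than return every match.

```python
from itertools import product

# Assumed groups of letters that sound alike in Urdu (illustrative, not exhaustive).
SOUND_GROUPS = ["سصث", "زذضظ", "تط", "ہح"]
ALTERNATIVES = {ch: group for group in SOUND_GROUPS for ch in group}

def sound_alike_corrections(typed, lexicon):
    """Return lexicon words reachable by swapping letters for sound-alikes."""
    # For each typed letter, the set of letters it could have been meant as.
    options = [ALTERNATIVES.get(ch, ch) for ch in typed]
    variants = ("".join(combo) for combo in product(*options))
    return sorted({v for v in variants if v in lexicon and v != typed})

# "سابن" is a plausible misspelling of "صابن" (soap).
print(sound_alike_corrections("سابن", {"صابن", "کتاب"}))
```

The number of variants grows quickly with word length, so this brute-force form is only suitable for illustration.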
Beyond typing, this project is about protecting cultural heritage – generating discussion and providing collective ownership of a language's digital representation. The project was opened to public beta in late April 2018.
In addition to the public beta, the goal is to publish findings as a formal research study. The final beta will be launched as we finalize approvals for the study.
The project's website will also host developer resources that are being developed with the keyboard.
As this study is extended to software beyond the keyboard, it will concentrate on guidelines that developers and designers can use to produce high quality software in Urdu and other languages. Best practices in this area are underdeveloped. An example is how to typeset appropriately – the transition required from Latin script is non-trivial. Even minor changes in code, such as the order in which fonts are referenced in CSS, can cause huge changes in readability. And a whole idiom of common interface language will need to be developed in local tongues.
These guidelines are meant to help developers and designers through the minefield of Urdu software development. Even a simple task such as buying a domain presents an array of challenges when internationalized, such as unintuitive transformations to Punycode.
Public engagement is essential to address the biases so easily encoded in modern AI. All major corpora for the Urdu language are built in one city. More representative software requires that all of its components, especially those that are traditionally opaque such as artificial intelligence, be constructed with care and social understanding.
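To get a feel for the Punycode transformation, here is a small sketch using the `idna` codec in Python's standard library. The domain is a made-up example. The built-in codec follows the older IDNA 2003 rules, so a registry applying newer rules may treat some names differently.

```python
def to_ascii(domain):
    """Convert an internationalized domain name to its ASCII (Punycode) form."""
    return domain.encode("idna").decode("ascii")

def to_unicode(ascii_domain):
    """Convert an ASCII (Punycode) domain name back to its Unicode form."""
    return ascii_domain.encode("ascii").decode("idna")

# Hypothetical domain: the word "اردو" (Urdu) under .com.
domain = "اردو" + ".com"
ascii_domain = to_ascii(domain)

print(ascii_domain)              # an "xn--" label followed by ".com"
print(to_unicode(ascii_domain))  # back to the original name
```

The ASCII form bears no visible resemblance to the word it encodes, which is part of what makes registering and debugging such domains unintuitive.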
The project's website: https://matnsaz.net
My presentation of this work at the Harvard GSD below: