
The localization of AI for India

By Kamal Das

Limitations of Foreign AI for Building India-centric Solutions

In 2019, the central government’s standing counsel, representing the Ministry of Women and Child Development, noted that the proportion of children matched using facial recognition software (FRS) was less than one per cent! The counsel rued that the FRS was sometimes unable to identify the correct gender of the child as well.

FRS is one of the most common use cases of AI in India, with applications across know-your-customer (KYC) checks, attendance systems, employment screening, security and law enforcement. In their 2021 study of Indian faces, researchers Gaurav Jain and Smriti Parsheera noted that FRS may misclassify up to 14.68 per cent (or one in seven) of females as males. They also noted that FRS gives age predictions that are off by more than ten years for up to 42.2 per cent (or three in seven) of Indian faces.

Lack of Localization: A Key Reason for High Inaccuracies

There are many reasons for the higher inaccuracy of AI models in the Indian context. AI models need huge amounts of training data to learn the underlying patterns, and India-centric data is not as easily available. The world’s largest image database, ImageNet, has only 2% of its images from India, while the country accounts for almost 18% of the world’s population. Similarly, while six Indian languages are among the top 20 global languages by number of speakers, Microsoft India noted that none of these languages is at the top of the digital content list.
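The size of the ImageNet gap can be worked out with simple arithmetic. The short Python sketch below is only an illustration: it uses the two percentages quoted above as approximate inputs, and the function name representation_ratio is a hypothetical helper, not part of any library.

```python
# Shares quoted in the article (per cent); treated here as approximate inputs.
IMAGENET_SHARE_INDIA = 2.0     # share of ImageNet images from India
POPULATION_SHARE_INDIA = 18.0  # India's share of the world's population


def representation_ratio(dataset_share: float, population_share: float) -> float:
    """Return dataset share divided by population share (1.0 means proportional)."""
    if population_share <= 0:
        raise ValueError("population_share must be positive")
    return dataset_share / population_share


ratio = representation_ratio(IMAGENET_SHARE_INDIA, POPULATION_SHARE_INDIA)
print(f"Representation ratio: {ratio:.2f}")
print(f"Under-represented by a factor of about {1 / ratio:.0f}")
```

A ratio of 1.0 would mean Indian images appear in proportion to India's population. On the figures quoted, the ratio is roughly 0.11, so Indian images are under-represented by a factor of about nine.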

Efforts to create and integrate Indian databases are often delayed. In 2020, India’s National Crime Records Bureau issued the revised tender for the National Automated Facial Recognition System, which aims to integrate various databases, such as the Crime and Criminal Tracking Network & Systems, into a single large criminal database. This has the potential to be one of the largest facial recognition systems in the world. However, the tender deadline has been extended over a dozen times!

There has been an effort to improve demographic equality with datasets covering diverse races and groups such as White, Black, South Asian and the like. However, India is clubbed as part of South Asia in most datasets. Over 20% of the global population, with immense diversity in language and skin tones, is often classified as one monolith, even though not all people within a racial category are the same. “The Indian/South Asian category presents an excellent example of the pitfalls of racial categories,” highlights the research from Northeastern University. AI should be able to identify and embrace the diversity of Indians from Gujarat to Arunachal Pradesh and from Kashmir to Kerala.

According to a study by Deloitte and NASSCOM, India is currently home to more than 1,300 Global Capability Centres employing about 1.3 million people. While much of global AI is being developed out of India, India-centric AI has not been a priority until now. The need for diverse, country-specific AI is a recent development.

Efforts to Localize AI for India

The localization of AI relies on developing large sets of local and region-specific data, generated from user experience, to customize the AI to understand the local context. We often forget that, as per the 2011 Census, only 11 per cent of Indians understand English as a first, second or third language. Current estimates suggest less than 20 per cent of Indians are confident in English. Over 90 per cent prefer content in their mother tongue or other regional Indian languages. In the AI community, there is a realization that we do not have enough internet material to train India-centric AI.

The initial efforts in India were by Indian MNCs looking to tap the growing Indian market. They started to incorporate local languages, accents and spoken styles. Nowadays, many voice assistants can interpret and respond to queries in regional languages. In 2018, Google Assistant introduced support for Hindi. In 2019, it expanded support to eight more Indian languages. Microsoft’s Windows now works with all 22 Indian languages. However, glitches in translation persist, and comprehension of Indian regional languages needs ongoing research.

Academic institutions like IIT Madras are helping to localize AI. Faculty from the premier institute have founded AI4Bhārat, a non-profit, open-source community collaborating to build AI solutions to India’s problems. They are helping to build digital content in Indian languages, which will improve AI comprehension of those languages.

The Indian government is also focusing on improving and increasing access to India-centric datasets. Sharing of citizen data within the government, even amongst various ministries, is slow and burdensome. Efforts to increase data sharing and access have gained pace over the past years. The National Data Governance Framework and Policy was reintroduced and is available for consultation through 11 June 2022. The policy aims to make non-personal citizen data held by the government available to the public as anonymized datasets, in order to improve governance and support India-centric research.

Next Steps: Focus on Quality as well as Quantity!

Andrew Ng, Adjunct Professor at Stanford University, notes that better data, rather than better models, will lead to the next wave of improvements in AI solutions. For India to participate in this wave, we must focus on both the quality and the quantity of data. Efforts are ongoing to enhance the quantity of India-centric local data available. We should also strive to improve data quality. Data captured must be audited, and data collected at the grassroots must be accurate to ensure appropriate policy decisions. Hopefully, current efforts to localize data and promote data sharing will help India make rapid strides in AI!