African languages account for nearly a third of all languages worldwide. Yet, of the more than 2,000 languages spoken across the continent, only 49 are available on translation platforms like Google Translate. Worse still, a staggering 88% of African languages are “severely underrepresented” or “completely ignored” in computational linguistics (Joshi et al., 2020).
Artificial Intelligence (AI) offers a chance to protect underrepresented languages, but guidance and safeguards are essential. Without them, large language models (LLMs) risk reinforcing institutional languages and accelerating the decline of others. The consequences are dire: 40% of languages worldwide are at risk of extinction, hundreds of which are spoken in Africa (UNESCO, 2022).
The African Languages Lab (All Lab) is a youth-led collaboration committed to preserving African languages by documenting, digitizing, translating, and empowering them through advanced AI and natural language processing (NLP) systems. Together with partners like Smartling, we are making substantial strides to address the digital divide in African languages. Here’s how.
The need for linguistic documentation in Africa
Linguistic diversity is one of the African continent’s greatest assets, but it also presents monumental challenges. Many communities, especially smaller ones, speak unique languages that are not well documented. These “low-resource” languages lack the data sets needed for computational use, making machine translation (MT), speech processing, automated transcription, and other NLP applications difficult, if not impossible.
The challenge is widespread: less than 5% of African languages have significant digital resources (Association for Computational Linguistics, 2019). It is clear that we need to better document these languages, but the process is not an easy task.
The challenge of documenting resource-poor African languages (Issaka et al., 2024)
Data scarcity: Historically, most African cultures have placed a strong emphasis on oral traditions. As a result, many languages exist primarily in oral form, and written documentation is often sparse or nonexistent. Without written language, assembling corpus data—a collection of written and spoken language needed to train machine learning models—becomes complicated.
Government policies and limited research funding:
Most African governments have prioritized official languages such as English and French, often remnants of colonial rule, while providing little institutional support to document, preserve, and develop indigenous languages. Insufficient academic funding due to low interest also constrains research and development of indigenous language technologies.
Early childhood education:
Some African countries aim to preserve indigenous languages in education, but efforts are often insufficient. For example, in Ghana, a policy requires instruction in a child’s first language from kindergarten through grade 3, before transitioning to English. However, it restricts instruction to 11 government-sponsored languages, resulting in even fewer resources, attention, and speakers for the remaining languages. Even with these policies, educators often use English as the primary medium of instruction due to limited resources and training.
Lack of standardized orthographies:
Collecting data for many under-resourced African languages, such as Hausa and Fulani, is highly challenging due to their wide geographic distribution and significant dialectal variations. Creating unified digital resources for these languages therefore requires careful coordination and standardization.
Data collection barriers:
In some regions, active conflict or marginalization of certain language groups adversely affects data collection and language development initiatives. In addition, many speakers of low-resource languages live in rural or remote communities with limited access to the Internet and digital technologies, making linguistic data collection even more difficult.
Innovating for linguistic equity
At the African Languages Lab, we are using AI and NLP systems to digitize, translate, and preserve African languages to create positive outcomes for people across the continent. Our four-pillar approach currently supports 40 languages, from widely spoken Bantu languages to lesser-known Khoisan languages, representing diverse cultures, regions, and language families across the continent.
How African Languages Lab supports under-resourced languages
Data collection, extraction, cleaning, and storage: We collect linguistic data from a variety of sources, organize and standardize it by removing inconsistencies, and store it securely for use in AI models.
Research and model development: We conduct research to create AI models that improve the understanding and application of African languages.
Community Engagement and Crowdsourcing: We collaborate with institutions, communities, and native speakers to collect and translate data, ensuring authentic representation and long-term sustainability through our innovative AI-powered technologies.
Technology Deployment: In partnership with industry leaders and academic institutions, we use AI and NLP systems to translate our data into usable linguistic outputs that power platforms like our All Voices app and a multilingual chatbot integrated into the Base mobile app.
Countries that integrate local languages into education and digital content tend to have higher literacy rates and greater cultural retention.
The technology that makes our work possible
Executing our four pillars requires the right technology and collaborative partners. As such, we have formed a strategic partnership with Smartling, a leader in translation and localization technology. This partnership allows us to leverage Smartling’s cutting-edge tools for language translation, management, and contextual accuracy, transforming the way low-resource languages are documented and shared digitally.
See how technology is driving our progress in digitizing and translating African languages.
Compiling existing data: corpus aggregation
For many African languages, centralized linguistic data simply does not exist. We collect data from multiple sources and use Python scripts to clean, standardize, and convert it into a common format, with the goal of creating a centralized corpus for widespread use. Consolidating and refining linguistic data ensures consistency and accessibility, empowering communities to create educational resources.
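As a rough illustration of what such a cleaning-and-aggregation step can look like, the sketch below normalizes Unicode, collapses whitespace, and deduplicates lines before merging sources into one corpus. The function names and the tiny inline sample are illustrative assumptions, not the All Lab’s actual pipeline.

```python
# Minimal sketch of a corpus cleaning/aggregation step, assuming plain-text
# sources. Names (clean_line, aggregate) and sample data are hypothetical.
import re
import unicodedata


def clean_line(text: str) -> str:
    """Normalize Unicode to NFC, collapse whitespace runs, and trim."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"\s+", " ", text)  # collapse tabs/newlines/double spaces
    return text.strip()


def aggregate(sources: list[list[str]]) -> list[str]:
    """Merge lines from several sources, dropping duplicates and empties."""
    seen: set[str] = set()
    corpus: list[str] = []
    for source in sources:
        for raw in source:
            line = clean_line(raw)
            if line and line not in seen:
                seen.add(line)
                corpus.append(line)
    return corpus


# Example: two small "sources" with inconsistent spacing and a duplicate.
sources = [
    ["Ẹ káàbọ̀  sí ilé wa.", "Ẹ káàbọ̀ sí ilé wa."],
    ["  Báwo ni?  ", ""],
]
corpus = aggregate(sources)
```

In a real pipeline the merged lines would then be written out in a single agreed-upon format (plain text, TSV, or similar) so downstream MT and NLP tooling can consume one consistent corpus.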
Making progress — and looking to the future
At the African Languages Lab, we have made substantial progress in addressing the digital divide in African languages through data collection, aggregation, standardization, crowdsourcing, and model development and deployment. We are proud of our growing language data corpus, now roughly half a terabyte in size, our advanced translation tools, and our expanding access to language resources.