Arabic is spoken by more than 440 million people worldwide and is the fourth most-common language used on the Internet today. Yet the Arabic language is seriously underrepresented online.
Digital content in Arabic accounts for only 1 to 3 percent of all content online, according to a paper, “Digital Arabic Content,” produced by the International Telecommunication Union for a summit in 2012. A recent study by the W3Techs survey firm found that Arabic was the language of fewer than 1 percent of websites it surveyed.
Kareem Darwish, a senior scientist at the Arabic Language Technologies Group at the Qatar Computing Research Institute, in Doha, is part of a team working on tools that use artificial intelligence to change that.
The problem is twofold, Darwish says.
“A limited number of people have the intellectual capacity, the time and financial means to invest in providing high-quality content on a voluntary basis,” he said. “On the other hand, the lack of technological tools that account for the specific characteristics of the Arabic language makes it difficult to retrieve the content when it exists.”
Developing better tools for automatic processing of the Arabic language is not an easy task.
In Arabic, one “root,” a combination of several consonants in a certain order, can generate numerous words with different meanings. The shape of a letter also differs depending on its position within the word. Moreover, symbols placed above or below the letters, called diacritics, change a word’s pronunciation, its grammatical function, and sometimes even its meaning. These features confuse search systems and produce poor search results.
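To make the search problem concrete, here is a minimal sketch in Python of how a search tool might fold these variations before comparing text. It is an illustration only, not a description of any tool mentioned in this article. The function name normalize_for_search is hypothetical, and the sketch assumes that Unicode combining marks are a fair stand-in for Arabic diacritics and that position-specific letter shapes appear as legacy "presentation form" characters (in most modern text, letter shaping is handled by the font instead).

```python
import unicodedata

TATWEEL = "\u0640"  # decorative elongation character; carries no meaning


def normalize_for_search(text: str) -> str:
    """Fold Arabic text into a plain form suitable for matching (a sketch)."""
    # NFKC maps legacy position-specific letter shapes to the base letter.
    folded = unicodedata.normalize("NFKC", text)
    # Drop combining marks (the diacritics) and the tatweel.
    return "".join(
        ch for ch in folded
        if unicodedata.combining(ch) == 0 and ch != TATWEEL
    )


kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "he wrote", fully vowelled
kutub = "\u0643\u064F\u062A\u064F\u0628"         # "books", vowelled
plain = "\u0643\u062A\u0628"                     # the same three letters, bare

print(normalize_for_search(kataba) == plain)  # True
print(normalize_for_search(kutub) == plain)   # True
```

The two prints show the trade-off: once the diacritics are removed, a query matches text however it was vowelled, but "he wrote" and "books" become the same string, so the search tool can no longer tell them apart without looking at context.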
Another challenge is that Arabic letters do not have upper- and lower-case forms, which makes identifying proper names difficult.
An Open-Source Toolkit
Hamdy Soliman Mubarak, a senior software engineer at the Qatar Computing Research Institute, says that the absence of common research collaboration in this field forces researchers to always start from scratch, which delays the development of more-accurate processing tools.
“Open-source mentality is absent in the Arab world, especially among companies,” he said.
Defying this trend, the Qatar institute recently released “Farasa,” an open-source text-processing toolkit for Arabic text.
Using artificial intelligence, Darwish and his colleagues improved the accuracy and speed of word segmentation, that is, dividing words into meaningful units. Better segmentation matters because it improves output quality in “natural language processing” tasks such as machine translation and information retrieval.
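For readers unfamiliar with segmentation, the toy Python sketch below splits a word into prefix, stem and suffix using fixed rules. It is not how Farasa works; the article says only that Farasa relies on artificial intelligence. The short PREFIXES and SUFFIXES lists, the MIN_STEM threshold and the segment function are all hypothetical choices made for illustration.

```python
# Toy rule-based segmenter. The lists below are tiny, hypothetical samples.
PREFIXES = ["ال", "و", "ب", "ل"]  # "the", "and", "with/by", "for/to"
SUFFIXES = ["ها", "هم", "ه"]      # "her", "their", "his"
MIN_STEM = 3                      # assumed minimum stem length, in letters


def segment(word: str) -> list[str]:
    """Greedily peel known prefixes, then at most one known suffix."""
    prefixes = []
    stem = word
    stripped = True
    while stripped:
        stripped = False
        for p in PREFIXES:
            if stem.startswith(p) and len(stem) - len(p) >= MIN_STEM:
                prefixes.append(p)
                stem = stem[len(p):]
                stripped = True
                break
    suffixes = []
    for s in SUFFIXES:
        if stem.endswith(s) and len(stem) - len(s) >= MIN_STEM:
            suffixes.append(s)
            stem = stem[:-len(s)]
            break
    return prefixes + [stem] + suffixes


print("+".join(segment("والكتاب")))  # "and the book": three units
print("+".join(segment("كتابها")))   # "her book": two units
print("+".join(segment("وليد")))     # the name Walid, wrongly split in two
```

The last call shows why fixed rules fall short: the name Walid begins with the same letter as the word for "and," so the toy splitter cuts it apart. Resolving cases like this requires context, which is where a trained system can help.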
“We achieved a breakthrough when we allowed artificial intelligence to analyze all elements in the text and not to be restricted by specific elements,” Darwish said. “This improved accuracy from 87 percent to 95 percent.”
Today, Farasa is able to process one billion words in less than five hours, which makes it faster than other segmentation tools.
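As a rough back-of-the-envelope check on the figures quoted above, the short Python calculation below converts them into a per-second rate and an error reduction. It uses only the numbers in this article; the variable names are illustrative, and "less than five hours" is treated as exactly five hours, so the rate is a lower bound.

```python
# Throughput: one billion words in (at most) five hours.
words = 1_000_000_000
seconds = 5 * 3600
min_words_per_second = words / seconds

# Accuracy rising from 87 to 95 percent means the error rate fell
# from 13 to 5 percent.
error_before = 1 - 0.87
error_after = 1 - 0.95
relative_error_reduction = (error_before - error_after) / error_before

print(f"at least {min_words_per_second:,.0f} words per second")
print(f"errors cut by about {relative_error_reduction:.0%}")
```

In other words, the reported speed works out to at least roughly 55,000 words per second, and the eight-point gain in accuracy corresponds to removing a little over 60 percent of the segmentation errors.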
In addition to its use in language education and the diacritization of Arabic texts, Farasa is used by media organizations such as Al Jazeera Network to help editors locate and classify proper names in a text.
Other tools developed by the Qatar Computing Research Institute include an automatic transcription and translation system for online multimedia content, a multi-language e-reader and a platform for microblog search and filtering.
Enhancing Arabic Content Online
In 2011, the institute collaborated with the Wikimedia Foundation to generate 10,000 articles in Arabic. It also signed an agreement with the Mayo Clinic to translate some of the medical articles on the clinic’s website into Arabic.
Other initiatives to boost Arabic content were launched in different Arab countries over the past few years, but Mubarak says that the increase in volume doesn’t guarantee the quality of the content.
“There are one million articles in Arabic on Wikipedia, compared to almost seven million in English,” he said. “But the number of pages is not an accurate indicator because sometimes the Arabic page is just one or two lines.”
Mubarak attributes the poor quality of most Arabic online content to the lack of institutional and financial support for content developers and to violations of intellectual property rights.
“Failure to observe intellectual property rights in Arab countries makes authors reluctant to publish their production,” Mubarak said.
Mahmoud Abdel Raziq Jumaa, a poet and journalist who founded a popular Facebook page about Arabic grammar, sees the problem differently.
“Arabic suffers from severe neglect from Arabic speakers,” said Jumaa, whose page has nearly 600,000 followers. “Poor Arabic digital content is a result of poor education systems which reduce the Arabic language to abstract rules that students study just to pass their exams.”
The latest “Media Use in the Middle East” survey released by Northwestern University in Qatar found that 79 percent of Internet users in the Arab countries covered by the survey use the Internet in Arabic. (See a related article, “Study of Arab Media Use: Facebook Down. Podcasts Up. But Don’t Criticize the Government.”)
The survey results indicate a demand for Arabic digital content, but it seems this is an opportunity that content providers have yet to capitalize on.
“Having more high-quality digital Arabic content ensures access of Arabic speakers to this content and allows them to express themselves and spread their cultural production in their own language,” Jumaa said.
The Arabic Language Technologies Group at the Qatar research institute has now shifted its focus to colloquial Arabic, the everyday language spoken by Arabs. There are many local varieties of colloquial Arabic, and these dialects are often not comprehensible to speakers of other dialects. (See a related article, “Why the Split Between Classical and Everyday Arabic Endures.”)
According to Darwish, most Arabic online content is in these local dialects, which poses a whole new set of challenges.
There is no standard way of writing words colloquially, and spelling mistakes are more common in this form of the Arabic language.
Also, training artificial intelligence to understand the relations between words requires a huge amount of encoded data. Currently, there is a shortage of such resources.
“We are developing tools to understand these dialects and know the origin of its words,” Darwish said. “Our dream is to be able to take a text in any local dialect and turn it into Standard Arabic.”