The Serious Flaw of ‘Multilingual’ AI Content Moderation

Three parts Bosnian text. Thirteen parts Kurdish. Fifty-five parts Swahili. Eleven thousand parts English.

This is part of the data recipe for Facebook’s new large language model, which the company says can detect and act on harmful content in more than 100 languages. Bumble uses similar technology to detect rude and spam messages in at least 15 languages. Google uses it for everything from translation to filtering newspaper comment sections. They all have comparable recipes and the same dominant ingredient: English-language data.
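To make the skew concrete, here is a back-of-the-envelope calculation using the “parts” figures above. The script is an illustrative sketch of the arithmetic, not anything from Facebook’s actual pipeline:

```python
# Rough sketch of the skew in the training mix described above.
# The "parts" figures are the ones quoted for Facebook's model; the
# script itself is illustrative arithmetic, not any real pipeline.
mix = {"English": 11_000, "Swahili": 55, "Kurdish": 13, "Bosnian": 3}

total = sum(mix.values())
for lang, parts in mix.items():
    print(f"{lang:>8}: {parts / total:.3%} of the mix")

# English comes out above 99% of this four-language slice;
# Bosnian sits under 0.03%.
```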

For years, social media companies have focused their automated content detection and removal efforts more on English-language content than on the world’s 7,000 other languages. Facebook left almost 70 percent of Italian- and Spanish-language Covid misinformation unflagged, compared with just 29 percent of similar misinformation in English. Leaked documents reveal that Arabic-language posts are regularly flagged as hate speech in error. Poor moderation of local-language content has contributed to human rights abuses, including genocide in Myanmar, ethnic violence in Ethiopia, and election disinformation in Brazil. At scale, decisions to host, demote, or remove content directly affect people’s fundamental rights, particularly those of marginalized people who have few other avenues to organize or express themselves freely.

The problem is partly one of political will, but it is also a technical challenge. Building systems that can detect spam, hate speech, and other unwanted content in all the world’s languages is hard enough. What makes it harder is that many languages are “low-resource,” meaning they have little digitized text data available to train automated systems. Some of these low-resource languages have few speakers and internet users, but others, like Hindi and Indonesian, are spoken by hundreds of millions of people, multiplying the harm done by errant systems. Even if companies were willing to invest in building individual algorithms for each type of harmful content in every language, they may not have enough data to make those systems work effectively.

A new technology called “multilingual large language models” has fundamentally changed the way social media companies approach content moderation. Multilingual language models, as we describe in a new paper, are similar to GPT-4 and other large language models (LLMs), except that they learn more general rules of language by training on texts in dozens or hundreds of different languages. They are specifically designed to make connections between languages, allowing them to extrapolate from the languages for which they have a lot of training data, such as English, to better handle those for which they have less, such as Bosnian.
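To give a flavor of this cross-lingual transfer, here is a minimal sketch using the open-source Hugging Face transformers library. The XNLI-fine-tuned XLM-RoBERTa checkpoint named below is a public research model, not any platform’s production moderation system, and the example message and candidate labels are our own:

```python
from transformers import pipeline

# A public multilingual model fine-tuned on the XNLI dataset; most of its
# task supervision comes from high-resource languages, yet it can score
# text in dozens of others.
classifier = pipeline(
    "zero-shot-classification",
    model="joeddav/xlm-roberta-large-xnli",
)

# A Bosnian-language message ("This is an offensive message.") scored
# against English candidate labels: the cross-lingual extrapolation
# described above.
result = classifier(
    "Ovo je uvredljiva poruka.",
    candidate_labels=["hate speech", "spam", "benign"],
)
print(result["labels"][0], round(result["scores"][0], 3))
```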

These models have been shown to be capable of performing simple semantic and syntactic tasks across a wide range of languages, such as parsing grammar and analyzing sentiment, but how capable they are at the much more language- and context-specific task of content moderation, particularly in languages on which they are barely trained, is unclear. And beyond the occasional self-congratulatory blog post, social media companies have revealed little about how well their systems work in the real world.

Why might multilingual models be less able to identify harmful content than social media companies suggest?

One reason is the quality of the data they are trained on, particularly in low-resource languages. In the large text datasets often used to train multilingual models, the least-represented languages are also the ones that most frequently contain text that is offensive, pornographic, poorly machine-translated, or just plain gibberish. Developers sometimes try to compensate for poor data by filling the gap with machine-translated text, but again, this means the model will still have a hard time understanding language the way people actually speak it. For example, if a language model has only been trained on text machine-translated from English into Cebuano, a language spoken by 20 million people in the Philippines, the model may never have seen the term “kuan,” slang used by native speakers that has no comparable term in other languages.
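One rough way to probe whether a multilingual model has likely seen a term like “kuan” is to check how its tokenizer splits it: heavy subword fragmentation is a common hint that a word was rare or absent in the training data. A minimal sketch, again using a public checkpoint (the probe is our own heuristic, not a method any platform has described):

```python
from transformers import AutoTokenizer

# xlm-roberta-base is a public multilingual checkpoint; the probe below
# is a heuristic sketch, not a platform's documented method.
tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

for word in ["kuan", "misinformation"]:
    pieces = tok.tokenize(word)
    # Words the tokenizer rarely saw tend to shatter into many pieces.
    print(f"{word!r} -> {pieces} ({len(pieces)} subword pieces)")
```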

