How to use BERT in Arabic and other languages · Chris McCormick (2023)

So far, our tutorials have focused almost exclusively on NLP applications in English. While the general algorithms and ideas extend to all languages, the large body of resources supporting English-language NLP does not extend to all languages. For example, BERT and BERT-like models are an incredibly powerful tool, but model releases are almost always in English, perhaps followed by Chinese, Russian, or Western European language variants.

For this reason, we will look at an interesting category of BERT-like models, referred to as multilingual models, which help extend the power of large BERT-like models to languages beyond English.

by Chris McCormick and Nick Ryan

  • Contents
  • S1. Multilingual models
    • 1.1. Multilingual model approach
    • 1.2. Cross-lingual transfer
    • 1.3. Why multilingual?
    • 1.4. Languages by resource
    • 1.5. Using machine translation
    • 1.6. XLM-R vocabulary
  • S2. Comparing approaches
    • 2.1. Natural Language Inference
    • 2.2. Overview of MNLI and XNLI
    • 2.3. Monolingual approach
    • 2.4. Multilingual approach
    • 2.5. Summary of results
    • 2.6. Which approach to use?
  • Example notebooks

1.1. Multilingual model approach

Multilingual models take a rather bizarre approach to addressing multiple languages...

Rather than treating each language individually, a multilingual model is pre-trained on text from a mixture of languages!

In this post and the accompanying notebooks, we play with a specific multilingual model called XLM-R from Facebook (short for "Cross-Lingual Language Model – RoBERTa").

While the original BERT was pre-trained on English Wikipedia and BooksCorpus (a collection of self-published books), XLM-R was pre-trained on Wikipedia and Common Crawl data from 100 different languages! Not 100 different models trained on 100 different languages, but a single BERT-type model pre-trained on all of this text together.


There's really nothing here that tries to consciously differentiate between languages. For example in XLM-R:

  • There is a single, common vocabulary (with 250,000 tokens) covering all 100 languages.
  • No special marker is added to the input text to indicate what language it is.
  • It wasn't trained on "parallel data" (the same sentence in multiple languages).
  • The training objective wasn't changed to encourage the model to learn to translate.

And yet, rather than producing nonsense or acquiring only the shallowest understanding of its many input languages, XLM-R performs surprisingly well, even compared to models trained on a single language!
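To make the single-shared-vocabulary idea concrete, here is a toy sketch of greedy longest-match subword tokenization with one vocabulary that mixes English and Arabic pieces. This is not XLM-R's actual SentencePiece tokenizer; the vocabulary and the matching rule are simplified stand-ins:

```python
# Toy illustration of one shared subword vocabulary covering two languages.
# This is NOT XLM-R's real tokenizer (which uses SentencePiece with 250k
# tokens); it is a hypothetical greedy longest-match sketch.

SHARED_VOCAB = {
    "play", "ing", "the", "child", "ren",  # English pieces
    "يلعب", "الأط", "فال",                  # Arabic pieces
}

def tokenize(word: str, vocab=SHARED_VOCAB) -> list[str]:
    """Greedily split `word` into the longest matching vocabulary pieces."""
    pieces = []
    while word:
        for end in range(len(word), 0, -1):
            if word[:end] in vocab:
                pieces.append(word[:end])
                word = word[end:]
                break
        else:
            pieces.append("<unk>")  # no piece matched; emit unknown
            word = word[1:]
    return pieces

# The same function, with the same vocabulary, handles both scripts --
# nothing in the input indicates which language it is.
print(tokenize("playing"))   # ['play', 'ing']
print(tokenize("children"))  # ['child', 'ren']
print(tokenize("الأطفال"))
```

The point of the sketch is only that one lookup table serves every language; XLM-R's real vocabulary is learned from the mixed-language corpus rather than hand-picked.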

1.2. Cross-lingual transfer

If your application is in another language (we'll use Arabic as the running example from here on), you can use XLM-R just like you would BERT: fine-tune it on your Arabic training text and then use it to make predictions in Arabic.

However, with XLM-R you can use another technique that is even more surprising...

Let's say you're trying to build a model to automatically identify nasty (or "toxic") user comments in Arabic. There's already a great dataset for that called "Wikipedia Toxic Comments" with about 225,000 tagged comments - only it's all in English!

What are your options? Gathering a similarly large dataset in Arabic would be costly. Machine translation could help in various ways, but it has its limitations (more on translation in a later section).

XLM-R offers another option called "cross-lingual transfer". You can fine-tune XLM-R on the English Wikipedia Toxic Comments dataset and then apply it to Arabic comments!


XLM-R is able to apply its task-specific knowledge, learned in English, to Arabic, even though we never showed it labeled Arabic examples! This is the concept of transfer learning applied from one language to another, i.e. "cross-lingual transfer".

In the notebooks accompanying this post, we see that fine-tuning XLM-R on only ~400,000 English samples actually yields better results than fine-tuning a "monolingual" Arabic model on a (much smaller) Arabic dataset.

This impressive feat is referred to as zero-shot learning or cross-lingual transfer.

1.3. Why multilingual?

Multilingual models and cross-language transfer are cool tricks, but wouldn't it be better if Facebook just trained and released a separate model for each of these different languages?

That would probably produce the most accurate models, yes, if only there were as much text available online in every language as there is in English!

A model pre-trained on text from a single language is called monolingual, while one pre-trained on text from multiple languages is called multilingual.


The following bar chart shows how much text data the authors of XLM-R were able to collect for pre-training for a small selection of languages.

(Figure: pre-training data collected for a small selection of languages, log scale.)

Note that the scale is logarithmic, so there is approximately 10 times more English data than Arabic or Turkish, and 1,000 times more English data than Yiddish.
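Because the chart is log-scaled, equal visual gaps correspond to multiplicative gaps. A small sketch of that arithmetic, using rough round numbers read off the chart (the exact figures are in the XLM-R paper; these are approximations for illustration):

```python
import math

# Approximate pre-training data sizes in GiB, read off the log-scale chart.
# These are rough, hypothetical round numbers, not the paper's exact values.
sizes_gib = {"English": 300.0, "Arabic": 28.0, "Yiddish": 0.3}

for lang in ("Arabic", "Yiddish"):
    ratio = sizes_gib["English"] / sizes_gib[lang]
    print(f"English has ~{ratio:,.0f}x more data than {lang} "
          f"(~{math.log10(ratio):.0f} order(s) of magnitude on the chart)")
```

So a one-notch gap on the chart is roughly a 10x gap in data, and a three-notch gap is roughly 1,000x.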

1.4. Languages by resource

Different languages have different amounts of training data available for building large, BERT-like models; languages are accordingly referred to as high-, medium-, and low-resource. High-resource languages such as English, Chinese, and Russian have large amounts of freely available text online that can be used as training data, so NLP researchers have largely focused on developing large language models and benchmarks in those languages.

I adapted the bar chart above from Figure 1 of the XLM-R paper. Here's their full bar chart, showing the amount of data they collected for 88 of the 100 languages.

(Figure 1 from the XLM-R paper: pre-training data per language, log scale.)

The languages are labeled with two-letter ISO codes; you can look them up in the table here.

Here are the first ten codes in the bar chart (note that after German, there are another ~10 languages ​​with a similar amount of data).

Code  Language
en    English
ru    Russian
id    Indonesian
vi    Vietnamese
fa    Persian / Farsi
uk    Ukrainian
sv    Swedish
th    Thai
ja    Japanese
de    German

Note that this ranking by "amount of data" is not the same as a ranking by "number of users" of each language on the internet. See this table on Wikipedia. Chinese (code zh) is number 21 in the bar chart, but has by far the most internet users after English.

Likewise, the amount of effort and attention given by NLP researchers to different languages does not follow the ranking in the bar chart either, otherwise Chinese and French would be in the top 5.

There is a current project called OSCAR which provides large amounts of text for pre-training BERT-like models in different languages. It is definitely worth checking out if you're looking for unlabeled text to use for pre-training in your language!

1.5. Use machine translation

It is also possible to bring in "machine translation" (machine learning models that automatically translate text) to try to work around this limited-resource problem. Here are two common approaches.

Approach #1 – Translate everything

You could rely entirely on English models and translate any Arabic text in your application into English.


This approach has the same problems as the monolingual model approach. The best translation tools use machine learning and have the same limitations in terms of available training data. In other words, the translation tools for medium and low resource languages ​​are not good enough to be a simple solution to our problem - currently a multilingual BERT model like XLM-R is probably the better way.

Approach #2 – Augment training data

If a large amount of labeled English text already exists for your task, you could translate this labeled text into Arabic and use it to augment your available Arabic training data.


If your language has a decent monolingual model and a large English dataset for your task, then this is a great technique. We applied this technique to Arabic in one of our companion notebooks, and it outperformed XLM-R (at least in our initial results - we didn't run a rigorous benchmark).

1.6. XLM-R vocabulary

As you can imagine, XLM-R has a very different vocabulary than the original BERT to accommodate 100 different languages.


XLM-R has a vocabulary of 250,000 tokens, compared to BERT's 30,000 tokens.

I published a notebook here where I browse the XLM-R vocabulary to get a feel for what it contains and to gather various statistics.

Here are some highlights:

  • It contains an "alphabet" of 13,828 characters.
  • It consists of 62% whole words and 38% partial words.
  • To count English words, I tried looking up all whole words in WordNet (a kind of comprehensive English dictionary) and found ~11,400 English words, which is only 5% of XLM-R's vocabulary.
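In a SentencePiece vocabulary like XLM-R's, tokens that begin a word carry a leading "▁" (whitespace) marker, while continuation pieces do not. Here is a minimal sketch of one simple proxy for the whole-word vs. partial-word split above, using a tiny hypothetical vocabulary in place of the real 250,000-token file:

```python
# Sketch: classify SentencePiece tokens into word-initial pieces (which
# begin with the '▁' whitespace marker) and continuation pieces.
# A tiny hypothetical vocabulary stands in for XLM-R's real 250k tokens.
toy_vocab = ["▁the", "▁play", "ing", "▁كتاب", "s", "▁child", "ren"]

word_initial = [t for t in toy_vocab if t.startswith("▁")]
continuations = [t for t in toy_vocab if not t.startswith("▁")]

print(f"word-initial pieces: {len(word_initial) / len(toy_vocab):.0%}")
print(f"continuation pieces: {len(continuations) / len(toy_vocab):.0%}")
```

The real statistics, of course, depend on the actual vocabulary file; the notebook linked above does this kind of tallying on the full XLM-R vocabulary.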

2.1. Natural Language Inference

The most commonly used task for evaluating multilingual models is called Natural Language Inference (NLI). The reason for this is that there is an excellent multilingual benchmark dataset called XNLI.

We'll look at XNLI in the next section, but here's an explanation of the basic NLI task in case you're unfamiliar.

In NLI, we are given two sentences, (1) a "premise" and (2) a "hypothesis", and asked to determine whether:

  • 2 follows logically from 1 ("entailment")
  • 2 contradicts 1 ("contradiction")
  • 2 has no bearing on 1 ("neutral")

Here are some examples:

Premise                                      Label          Hypothesis
The man inspects his uniform.                contradiction  The man sleeps.
An older and a younger man smile.            neutral        Two men smile and laugh at the cats.
A football game in which several men play.   entailment     Some men play sports.
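Under the hood, NLI is just three-way sentence-pair classification. Here is a minimal sketch (a hypothetical data structure, not any particular library's API) of how examples like those above are represented, using the integer label convention that appears in the MNLI/XNLI examples later in this post (0 = entailment, 1 = neutral, 2 = contradiction):

```python
# NLI examples are (premise, hypothesis, label) triples; the integer
# label ids below follow the MNLI/XNLI convention used in this post.
LABELS = {0: "entailment", 1: "neutral", 2: "contradiction"}

examples = [
    ("The man inspects his uniform.", "The man sleeps.", 2),
    ("An older and a younger man smile.",
     "Two men smile and laugh at the cats.", 1),
    ("A football game in which several men play.",
     "Some men play sports.", 0),
]

for premise, hypothesis, label in examples:
    print(f"{LABELS[label]:>13}: {premise!r} -> {hypothesis!r}")
```

A classifier for this task (like the `num_labels=3` models loaded later in this post) simply outputs one of these three label ids per sentence pair.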

As I understand it, NLI is primarily a benchmarking task rather than a practical application: it requires the model to develop some sophisticated skills, so we use it to evaluate and compare models like BERT.

2.2. Overview of MNLI and XNLI

Multilingual models are benchmarked on NLI using a combination of two datasets called "MNLI" and "XNLI".

MNLI will provide us with a large number of English training examples for fine-tuning XLM-RoBERTa on the general task of NLI.

XNLI will provide us with a small number of NLI test examples in different languages. We take our XLM-RoBERTa model (fine-tuned only on the English MNLI examples) and apply it to the Arabic test cases from XNLI.

About MNLI

The Multi-Genre Natural Language Inference (MultiNLI, or MNLI) corpus was released in 2018 and is a collection of more than 400,000 English sentence pairs annotated with textual entailment information.

In MNLI, "multi" refers to multi-genre, not multilingual. Confusing, I know! It's called "multi-genre" because it's intended as the successor to the Stanford NLI corpus (SNLI), which consists entirely of somewhat simple sentences drawn from image captions. MNLI increases the difficulty of the task by adding multiple, more difficult genres of text, such as transcribed conversations, government documents, travel guides, etc.

This corpus contains 392,000 training samples, 20,000 "development samples" (test samples to use during the development of your model), and 20,000 "test samples" (the final test set for which benchmark results are reported).

Here are a few randomly selected training examples:

Premise: If I had told you my ideas, the very first time you saw Mr Alfred Inglethorp, that shrewd gentleman would have 'smelt hate in your eloquence'!
Hypothesis: In the event that I had revealed my ideas to you, Mr. Alfred, you would have been completely unaware of your knowledge of my ideas.
Label: 2 (Contradiction)
----------------
Premise: Like federal agencies, the organizations we study must protect the integrity, confidentiality, and availability of the information resources on which they rely.
----------------
Premise: Well? There was no change in expression on the dark, melancholy face.
Hypothesis: He just looked at me and said: Well, what is that?
Label: 0 (Entailment)
----------------

About XNLI

"XNLI" stands for Cross-lingual Natural Language Inference corpus. The paper (here) was first submitted to arXiv in September 2018.

This dataset consists of a smaller subset of samples from the MNLI dataset, translated by humans into 14 different languages (15 languages total if you include English):


Index  Code  Language
0      ar    Arabic
1      bg    Bulgarian
2      de    German
3      el    Greek
4      en    English
5      es    Spanish
6      fr    French
7      hi    Hindi
8      ru    Russian
9      sw    Swahili
10     th    Thai
11     tr    Turkish
12     ur    Urdu
13     vi    Vietnamese
14     zh    Chinese

XNLI does not provide training data for these different languages, so it is intended as a benchmark for the cross-lingual transfer approach we are taking here.

For each language, there are 5,000 test-set sentence pairs and 2,500 development-set sentence pairs.

Sam Bowman at NYU was behind both the MNLI and XNLI datasets; XNLI was created in collaboration with Facebook.

Here are a few random examples from the Arabic test set.

Premise: Even in social play, opportunities for action and coordination of different roles can help children to understand the similarities and differences between people in desires, beliefs, and feelings
Hypothesis: Children cannot learn anything
Label: 2 (contradiction)
----------------
Premise: Why, as I have said here, to his lordship, who, like you, thought Miss Bishop's presence on the ship would keep us safe, not for the sake of his mother, held this filthy slaver silence about what is due to him
Hypothesis: I haven't spoken to His Lordship for a long time
Label: 2 (contradiction)
----------------
Premise: I threw a Coca-Cola ad over there
Hypothesis: Run a soft drink ad
Label: 1 (neutral)
----------------

2.3. Monolingual approach

We have created two notebooks for this post - one for applying a monolingual model and one for applying a multilingual model (XLM-R).

For the monolingual approach, I used a community-submitted model, asafaya/bert-base-arabic, from here. The documentation for this model shows that it was pre-trained on a large amount of Arabic text and that it has a high number of downloads in the last 30 days (meaning it's a popular choice).

I fine-tuned this model using two different approaches.

Approach #1 - Use a small, labeled data set

We can use XNLI's small validation set (2,500 human-translated Arabic examples) as our training set. That's a pretty small training set, especially compared to the ~400,000 examples in the English MNLI! I imagine this approach is most similar to what you could expect if you collected a labeled dataset yourself.

This approach resulted in an accuracy of 61.0% on the Arabic XNLI test set. This is the lowest score of the different approaches we tried (there is a table of results in a later section).

Approach #2 – Using machine translated examples

The XNLI authors also provide machine-translated copies of the large English MNLI dataset for each of the 14 non-English languages.

This will give us plenty of training data, but presumably the quality of the data will be lower since the examples were translated by an imperfect machine learning model, not a human.

This approach gave us an accuracy of 73.3% on the Arabic XNLI test set.

2.4. Multilingual approach

For the multilingual approach, I fine-tuned XLM-R on the full English MNLI training set.

With the huggingface/transformers library, using XLM-R is almost identical to using BERT; you just use different class names.

To use the monolingual approach, you can load the model and tokenizer like this:

```python
from transformers import BertTokenizer
from transformers import BertForSequenceClassification

# Load the tokenizer.
tokenizer = BertTokenizer.from_pretrained("asafaya/bert-base-arabic")

# Load the model.
model = BertForSequenceClassification.from_pretrained("asafaya/bert-base-arabic", num_labels=3)
```

For XLM-R this becomes:

```python
from transformers import XLMRobertaTokenizer
from transformers import XLMRobertaForSequenceClassification

# Load the tokenizer.
xlmr_tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

# Load the model.
xlmr_model = XLMRobertaForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=3)
```

Learning rate

The rest of the code is identical. However, we encountered a key difference in parameter selection... We found that XLM-R requires a lower learning rate than BERT - we used 5e-6. When we tried 2e-5 (the smallest of the recommended learning rates for BERT), XLM-R training failed completely (the model's performance never improved over random guessing). Note that 5e-6 is a quarter of 2e-5.

Cross-lingual results

Using this cross-lingual transfer approach, we achieved 71.6% accuracy on the Arabic XNLI test set. Compare that to the monolingual model fine-tuned on Arabic examples, which scored only 61.0%!
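The accuracy numbers in this section are simply the fraction of test pairs classified correctly. A minimal sketch of that metric, with hypothetical predictions standing in for a real model's output:

```python
# Sketch of the accuracy metric reported throughout this section:
# the fraction of test pairs whose predicted label matches the gold label.
# `gold` and `pred` are tiny hypothetical stand-ins for the 5,000-example
# Arabic XNLI test set and a model's predictions on it.
gold = [2, 0, 1, 1, 0, 2, 0, 1, 2, 0]
pred = [2, 0, 1, 2, 0, 2, 0, 0, 2, 0]

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"accuracy: {accuracy:.1%}")  # 8 of 10 correct -> 80.0%
```

Note that with three balanced classes, random guessing sits around 33%, which is why the 61.0% and 71.6% figures both represent real learning.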

The authors of XLM-RoBERTa reported a score of 73.8% for Arabic in Table 1 of their paper:

(Table 1 from the XLM-R paper: XNLI accuracy by language.)

The model in the bottom row of the table is larger - it corresponds to the scale of BERT-large. In our example, we used the smaller "Base" size.

Our lower accuracy may have to do with choices of parameters such as batch size and learning rate, or with overfitting.

2.5. Summary of results

Again, my intention with these notebooks was to provide working sample code; not to conduct strict benchmarks. To really compare approaches, more hyperparameter tuning should be done and the results should be averaged over multiple runs.

But here are the results we got with minimal tuning!

Approach                                      Training data                              Accuracy (Arabic XNLI test)
Monolingual (arabic-bert-base)                2,500 Arabic XNLI validation examples      61.0%
Monolingual (arabic-bert-base)                Machine-translated Arabic MNLI examples    73.3%
Multilingual (XLM-R), cross-lingual transfer  ~392,000 English MNLI examples             71.6%

For the approaches that don't already train on them, you can further improve these results by additionally fine-tuning on the Arabic XNLI validation examples. (I quickly tried this with XLM-R and confirmed that the score went up to 74.2%!)

2.6. Which approach to use?

Since I was able to get good results with arabic-bert-base, and since it uses less memory (due to its smaller vocabulary), I would go with the monolingual model in this case.

However, this is only possible because a team has pre-trained and published a good monolingual model for Arabic!

I originally thought of using Indonesian as my sample language for this project, but

  1. Indonesian is not one of the 15 XNLI languages.
  2. The best Indonesian model I found, cahya/bert-base-indonesian-522M, was pre-trained on a relatively modest amount of text (~0.5 GB), so I'm more skeptical of its performance.

For Indonesian, I'd still want to try both approaches, but I suspect XLM-R would come out on top.

The two notebooks referenced in this post (one implementing the multilingual experiments, the other the monolingual experiments) can be obtained from my website here. I also provide a walkthrough of these notebooks on YouTube here.

