Improving Rule-Based Stemmers to Solve Some Special Cases of the Arabic Language

The analysis of the Arabic language has become a necessity because of its considerable growth. In this paper we propose a rule-based root extraction method for Arabic text that addresses some weaknesses found in previous research. Our approach is divided into three phases: a preprocessing phase, in which we tokenize the text and format it by removing punctuation, diacritics, and non-letter characters; a treatment phase, based on the elimination of several sets of affixes (diacritics, prefixes, and suffixes) and on the application of several patterns; and a check phase, which verifies that the extracted root is correct by searching for it in root dictionaries.


Introduction and Related Works
Arabic is a Semitic language spoken natively by more than 400 million people, and it ranked seventh by number of Internet users in 2010. However, information retrieval for Arabic is very challenging because of many aspects, such as polysemy, irregular and inflected derived forms, variant spellings of certain words, variant writings of certain character combinations, short vowels (diacritics) and long vowels, and the abundance of affixes in Arabic words [1,2]. Different methods and approaches have been introduced to retrieve Arabic information [2,3,4,5,6].
To study Arabic morphology effectively, we divide Arabic words into three self-contained categories:
✓ اِسْم: nouns, pronouns, adjectives, adverbs, etc.
✓ فِعْل: verbs
✓ حَرْف: particles, articles, and conjunctions
Particles are completely unpredictable; they do not fall into the templatic system (i.e. they have no patterns), nor do they undergo any morphophonemic changes. They are what they are and must be memorized. The upside is that there are relatively few of them in the language: fewer than one hundred.
Nouns and verbs do fall into the templatic system and have very systematic morphophonemic rules that govern them. This includes the study of how verbs are conjugated, how they move from pattern to pattern to enhance their meanings, how the participles and other nouns are derived, how nouns pluralize, etc.
Each declinable noun and each verb is made up of a certain set of base letters, called its root (جذر). Nouns that are always indeclinable, such as pronouns, usually do not follow this system.
Verbs can have either 3 base letters, or 4. Nouns can have 3, 4 or 5. Now these base letters can be augmented with extra letters, and they can be dropped or changed due to morphophonemic rules as well.
Particles: the third part of speech in Arabic mentioned above is the particle. The meaning of a particle is often understood from the context of the sentence and from the words before and after the particle. The sign of the particle is that it does not accept the signs of nouns or verbs.
Many researchers have worked on analyzing Arabic text, all trying to extract an exact root or stem from a word. There are two ways to treat a text: morphological analysis, which consists of finding roots, and statistical stemming, which groups word variants using clustering techniques.
The first approach to morphological analysis relies on manually constructed dictionaries of roots. Kharashi and Evans worked with small text collections, for which they manually built dictionaries of roots for each word to be indexed [7]. Tim Buckwalter developed a set of lexicons of Arabic stems, prefixes, and suffixes, with truth tables indicating legal combinations [1].
Nehar [5] and Taghva [8] introduced new stemming techniques that do not rely on any dictionary; the first is based on the use of transducers. Nehar [5] also proposed a heavy stemmer that does not use any dictionary of roots. Khoja and R. Garside [2] developed a dictionary-based stemmer, and Larkey [5] developed a Java program based on their own Arabic stemmer, intended to evolve to take into account some of the noun and verb categories described above. Taghva [8] proposed the ISRI Arabic stemmer algorithm, which does not use a root dictionary. The ISRI stemmer performs better than the other approaches on shorter title queries. For long texts and narrative queries, stemming made a difference: the Khoja, ISRI, and Light stemmers were significantly better than no stemming. Ghwanmeh [9] presents an Arabic root-based algorithm based on patterns; this stemmer is restricted to native Arabic words that consist of four or more Arabic letters.
All the algorithms mentioned above have some weaknesses. In this paper, we will show that the best stemming approach is one that has a strong preprocessing phase and is based on both a "patterns check" and a "root list". This paper is an extension of work originally presented at a conference [10].
More precisely, we present the weaknesses of the Heavy and Light stemming algorithms and propose new solutions for each point treated; we then compare the results of our new stemmer with those of other stemmers.
In Section II, we present the different areas for improvement in Arabic text classification; in Section III, we present our approach; and in Section IV, we present some tested examples and compare our solution with others.

Improvement Areas in Arabic Classification
Light stemming algorithms remove suffixes and prefixes from words, producing a form of the word called the "stem" [11]. There have been many versions of the light stemming algorithm, the latest being light10 [12]. After removing punctuation, non-letters, diacritics, and the Hamza from the letter "أ", this algorithm replaces the final letter "ة" with "ه" and then the final letter "ى" with "ي". After that, the algorithm searches an irregular word list to find out whether the word exists in it. Then the algorithm removes the letter "و" from the beginning of the word if the word is longer than three characters, because it considers that this letter is usually a conjunction.
This step generates several errors in stem extraction. When the letter "و" is removed from the beginning of some words, the meaning of the word changes: for example, the word "وَبِيل" means calamitous or disastrous, and once the letter "و" is removed the remaining word means torch.
Khoja's approach, the TC system proposed by M. Hadni, A. Lachkar, and S. Alaoui Ouatik [13], and the evolution of Khoja's algorithm proposed by Mohammed N. Al-Kabi [14] all share this issue. Our algorithm therefore first checks that the word does not exist in the list of words that begin with "و", and only then removes the "و" (primarily a weak vowel). This list is built manually and must be maintained as the Arabic language evolves.
The second point we have improved is the removal of the letter "أ". In the Light approach, Khoja's approach, and M. Hadni's, this letter is deleted because it is considered a prefix. The issue arises when this letter is part of the word, as in "أباح", which means "permit"; when we delete it, the word means "confide". To solve this, we built a list of words that start with the letter "أ".
The third point we address in this paper is the stemming of the five nouns (أب, أخ, ذُو, فو, حمو). These nouns are set apart from other singular nouns with respect to syntactic case: they carry case marks that other nouns do not have. A singular noun normally follows rules that indicate its syntactic case through al Harakat (the vowelization system), whereas the five nouns indicate case with letters instead. They must satisfy two preconditions to be treated differently from other singular nouns:
• The noun has to be adjunct to another noun; in other words, it must be followed by a noun in the genitive.
• The noun that follows must not be (ي) indicating the speaker [15].
Therefore, our algorithm handles the five nouns separately.
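One way to picture the separate handling of the five nouns is a lookup from case forms to the base noun. The sketch below is in Python rather than the authors' Java, and both the table `FIVE_NOUN_FORMS` and the function `five_noun_base` are hypothetical names introduced here. It relies on the standard rule that these nouns mark the nominative with "و", the accusative with "ا", and the genitive with "ي". Only three of the five nouns are listed, because the short forms of the other two coincide with frequent particles, and the two contextual preconditions are not checked.

```python
# Hypothetical table: case forms of three of the five nouns mapped to
# the base noun. The entries are an assumption of this sketch, not the
# authors' data.
FIVE_NOUN_FORMS = {
    # nominative (waw), accusative (alif), genitive (ya)
    "أبو": "أب", "أبا": "أب", "أبي": "أب",
    "أخو": "أخ", "أخا": "أخ", "أخي": "أخ",
    "حمو": "حمو", "حما": "حمو", "حمي": "حمو",
}


def five_noun_base(word):
    """Return the base noun if `word` is a listed case form, else None.

    The lookup is context-free: it does not verify that the noun is
    adjunct to a following genitive noun, nor that the following
    element is not the first-person (ي).
    """
    return FIVE_NOUN_FORMS.get(word)
```

A stemmer could call such a lookup before any prefix or suffix removal, and before normalization, since the keys still carry the Hamza.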
It is true that Arabic orthography is less ambiguous and more phonetic when diacritics are used. For example, a word can be written using the same characters and be pronounced differently. The main purpose of diacritics, including the vowel marks known as Harakat "حركات", is to provide a phonetic aid that shows the correct pronunciation. Arabic vowel marks include Fatha "فتحة", Kasra "كسرة", Damma "ضمة", Sukun "سكون", Shadda "شدة", and Tanwin "تنوين". The pronunciation of these vowel marks is shown in Table 2 (Arabic Diacritics). However, in Modern Standard Arabic (MSA), vowel marks are not usually included in printed and electronic text, and the understanding and correct pronunciation of a word is determined from its context by the reader, so we decide not to remove the vowel marks (if they exist) as a step of the preprocessing phase.

Important Steps
The method we propose consists of a preprocessing step, a treatment step, and a check step. Each is described below:

Preprocessing
In this step we proceed to:

Searching in strange words list
In this step our algorithm checks whether the word is a strange word, that is, a word that comes from a language other than Arabic and is used especially in modern Arabic. These words are kept in a manually built list of strange words (Table 4).

Table 4. Strange Words
ألمانيا, فرنسا, ديسمبر, أوروبا, إفريقيا, اكلنيكي, ميكانيك, ناهوند, أوباما, بنكيران

If the word exists in the list, the algorithm returns the word; otherwise the treatment continues.
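The strange-word lookup can be sketched as a simple set membership test. This is a minimal Python illustration, not the authors' Java code; the set holds the ten entries of Table 4, and the function name `check_strange_word` and its convention of returning `None` when processing should continue are assumptions of this sketch.

```python
# The ten entries of Table 4, held in a set for constant-time lookup.
STRANGE_WORDS = {
    "ألمانيا", "فرنسا", "ديسمبر", "أوروبا", "إفريقيا",
    "اكلنيكي", "ميكانيك", "ناهوند", "أوباما", "بنكيران",
}


def check_strange_word(word, strange_words=STRANGE_WORDS):
    """Return the word unchanged if it is a listed strange word.

    Returns None otherwise, which signals (by the convention of this
    sketch) that the remaining stemming steps should run.
    """
    return word if word in strange_words else None
```

Because the lookup runs before normalization, the entries keep their original Hamza forms.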

Check if the word exists in the list of words that begin with "و"
The stemmer removes the letter "و" ("and") from the beginning of a word if the word is longer than three characters and does not appear in the list "Words_begins_by_Waw.txt". The list is needed because many common Arabic words begin with this character.
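The rule for the leading "و" might look as follows in Python (a sketch only; the authors' implementation is in Java). The function name `strip_leading_waw` is hypothetical, and the `waw_words` argument stands in for the contents of "Words_begins_by_Waw.txt", whose actual entries are not reproduced here.

```python
WAW = "\u0648"  # Arabic letter waw


def strip_leading_waw(word, waw_words):
    """Remove a leading waw treated as the conjunction 'and'.

    The letter is removed only if the word is longer than three
    characters and is not in `waw_words`, the set of words in which
    the waw belongs to the word itself. Length is counted in code
    points, so any diacritics present in the input are counted too.
    """
    if len(word) > 3 and word.startswith(WAW) and word not in waw_words:
        return word[1:]
    return word
```

With a set containing "وبيل", that word is returned unchanged, while a word of more than three letters that is absent from the set loses its first letter.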

Check if the word exists in the list of words that begin with "ال"
The stemmer removes the prefix "ال" from the beginning of a word if the word is longer than three characters and does not appear in the list "Words_begins_by_AL" (Table 6). The list is needed because many common Arabic words begin with these characters.
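The same guard applies to the two-letter prefix "ال". The Python sketch below is illustrative and is not the authors' Java code; `strip_definite_article` is a hypothetical name, and `al_words` stands in for the "Words_begins_by_AL" list, whose real entries are not shown in the text.

```python
AL = "\u0627\u0644"  # alif followed by lam


def strip_definite_article(word, al_words):
    """Remove a leading alif-lam treated as the definite article.

    The prefix is removed only if the word is longer than three
    characters and is not in `al_words`, the set of words whose first
    two letters belong to the word itself.
    """
    if len(word) > 3 and word.startswith(AL) and word not in al_words:
        return word[len(AL):]
    return word
```

A word such as "التهاب" (inflammation), in which the two letters are part of the word, is the kind of entry the exception list would need; that specific entry is an assumption of this example.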

Normalization
The third step in the stemmer is normalization of the words. The normalization process in the proposed stemmer is similar to that of the Light10 stemmer and runs as follows:
• Remove the Hamza from the letter "أ" (replace "أ", "إ", and "آ" with "ا").
• Replace the final letter "ة" with "ه".
• Replace the final letter "ى" with "ي".
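The three normalization rules translate directly into a few string operations. The following Python sketch is illustrative (the authors' program is in Java); the function name `normalize` and the translation table `ALIF_VARIANTS` are names chosen here, and the letters are written as Unicode escapes to keep the code unambiguous.

```python
# Map alif with hamza above, hamza below, and madda to bare alif.
ALIF_VARIANTS = str.maketrans({
    "\u0623": "\u0627",  # alif with hamza above
    "\u0625": "\u0627",  # alif with hamza below
    "\u0622": "\u0627",  # alif with madda
})

TA_MARBUTA = "\u0629"
HA = "\u0647"
ALIF_MAQSURA = "\u0649"
YA = "\u064A"


def normalize(word):
    """Apply the three normalization rules to a single word."""
    word = word.translate(ALIF_VARIANTS)
    if word.endswith(TA_MARBUTA):
        # Final ta marbuta becomes ha.
        word = word[:-1] + HA
    elif word.endswith(ALIF_MAQSURA):
        # Final alif maqsura becomes ya.
        word = word[:-1] + YA
    return word
```

The alif replacement applies anywhere in the word, whereas the other two rules touch only the last letter.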

Check if the word matches any of the patterns
The last step, after deleting the prefixes and suffixes of a word, is to correct any word whose meaning has changed. In some cases, a letter in the pattern of the word is deleted, which affects the root extraction process. For example, for (رأى), whose pattern is (فعل), the letter "أ" is deleted: the present tense of the pattern (فعل) is (يفعل), which would give (يرأى) and not (يرى). If we take (يرى) as the present-tense verb, the past will be (رى), so the letter 'ع' of the pattern is deleted, and the pattern becomes (يفل) instead of (يفعل). The stemmer applies three rules to correct words whose meaning was affected:
1. Add "ي" to the end of the word if the suffix "يه" was deleted.
2. Add "ه" to the end of the word if the suffix "ته" was deleted.
3. Replace the letter "ئ" at the end of the word with "ء" if the suffix of the word was deleted.
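The three correction rules can be sketched as a function of the stem and the suffix that was removed from it. This Python version is an illustration, not the authors' Java code: the name `correct_after_suffix_removal` is hypothetical, and applying only the first rule that matches is an assumption, since the text does not say how the rules interact.

```python
YA = "\u064A"
HA = "\u0647"
YA_HAMZA = "\u0626"  # ya with hamza above
HAMZA = "\u0621"     # hamza on the line


def correct_after_suffix_removal(stem, removed_suffix):
    """Repair a stem after a suffix was stripped from it.

    `removed_suffix` is the suffix that was deleted, or an empty
    string if none was. The first matching rule wins (an assumption
    of this sketch).
    """
    if removed_suffix == YA + HA:
        return stem + YA                # rule 1: restore the ya
    if removed_suffix == "\u062A" + HA:
        return stem + HA                # rule 2: restore the ha
    if removed_suffix and stem.endswith(YA_HAMZA):
        return stem[:-1] + HAMZA        # rule 3: final hamza form
    return stem
```

Passing the removed suffix explicitly keeps the function independent of how the suffix-stripping step is implemented.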

Detailed Algorithm
To implement our algorithm, we wrote a Java program based on Khoja's.

Results
A stemming result is correct if the output form of the word is the same as its target form; otherwise, the result is incorrect. We used Khoja's Java code and applied our approach to it. The program takes a text file as input and outputs a list of words with their stems and the type of each word. As a first test, we used a list of 524 words that begin with the letter "waw", and compared the results of our stemmer with those of Khoja's.
Since the text did not contain words that begin with "waw", strange words, or the five nouns, the difference between the two results is not large. We have also reviewed and modified the stop word list to solve some issues detected in our tests; for example, we added 'ففي' and 'منهم' to this list.
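The correctness criterion used in this section, an exact match between the output form and the target form, leads to a simple accuracy measure. The Python sketch below shows one way to compute it; the function name `stemming_accuracy` and its list-based interface are assumptions made for illustration and do not describe the authors' evaluation code.

```python
def stemming_accuracy(outputs, targets):
    """Fraction of words whose output form equals the target form.

    `outputs` and `targets` are parallel sequences: the stems the
    stemmer produced and the stems that were expected.
    """
    if len(outputs) != len(targets):
        raise ValueError("outputs and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(1 for out, tgt in zip(outputs, targets) if out == tgt)
    return correct / len(targets)
```

Running the same word list through two stemmers and comparing the two accuracies gives the kind of side-by-side comparison described above.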

Conclusion
In this paper we proposed new methods to improve stem detection for the Arabic language. In particular, specific cases related to the five nouns and to words starting with a vowel are processed successfully by the algorithm.
In future work we will test the accuracy of our algorithm and compare it with Light and Heavy stemmers.