Sentence retrieval using Stemming and Lemmatization with Different Length of the Queries

Article history: Received: 25 March, 2020; Accepted: 13 May, 2020; Online: 11 June, 2020

Abstract: In this paper we focus on sentence retrieval, which is similar to document retrieval but with a smaller unit of retrieval. Using data pre-processing in document retrieval is generally considered useful. When it comes to sentence retrieval, the situation is not that clear. In this paper we use the TF−ISF (term frequency - inverse sentence frequency) method for sentence retrieval. As pre-processing steps, we use stop word removal and the language modeling techniques stemming and lemmatization. We also experiment with different query lengths. The results show that data pre-processing with stemming and lemmatization is as useful for sentence retrieval as it is for document retrieval. Lemmatization produces better results with longer queries, while stemming shows worse results with longer queries. For the experiment we used data from the Text Retrieval Conference (TREC) novelty tracks.


Introduction
Sentence retrieval consists of retrieving relevant sentences from a document base in response to a query [1]. The main objective of this research is to present the results of sentence retrieval with the TF−ISF (term frequency - inverse sentence frequency) method using data pre-processing consisting of stop word removal and the language modeling techniques stemming and lemmatization. Stemming and lemmatization are data reduction methods [2].
Previous work mentions the usefulness of pre-processing steps in document retrieval. In contrast, when it comes to sentence retrieval the usefulness of pre-processing is not clear. Some papers mention it vaguely, without concrete results. Therefore, we will try to clarify the impact of stemming and lemmatization on sentence retrieval and present it through test results. As an additional contribution, we will test and discuss how pre-processing impacts sentence retrieval with different query lengths. Because sentence retrieval is similar to document retrieval, and stemming and lemmatization have shown a positive effect on document retrieval, we expect these procedures to have a beneficial effect on sentence retrieval as well.
In our tests we use the state-of-the-art TF−ISF method in combination with stemming and lemmatization. For testing and evaluation, data from the TREC novelty tracks [3-6] were used. This paper is organised as follows. Previous work is shown in Section 2, an overview of methods and techniques is given in Section 3, the data set and experiment setup are presented in Section 4, results and discussion are presented in Sections 5 and 6, and the conclusion is given in Section 7.

Sentence retrieval in document retrieval
Sentence retrieval is similar to document retrieval, and document retrieval methods can be adapted for sentence retrieval [7]. When it comes to document retrieval, the state-of-the-art TF−IDF (term frequency - inverse document frequency) method is commonly combined with the pre-processing steps stemming and stop word removal. However, the sentences of a document have an important role in retrieval procedures. In paper [8], research results have shown that the traditional TF−IDF algorithm can be improved with sentence-based processing of keywords, helping to improve precision and recall.

Document retrieval with stemming and lemmatization
Stemming and lemmatization are language modeling techniques used to improve document retrieval results [9]. In [10] the authors showed the impact of stemming on document retrieval using short and long queries. Paper [10] shows that stemming has a positive effect on IR (the ranking of retrieved documents was computed using TF−IDF). Paper [11] compares document retrieval precision based on the language modeling techniques stemming and lemmatization. Papers [9,11] show that language modeling techniques (stemming and lemmatization) can improve document retrieval.

TF−ISF sentence retrieval with stemming and lemmatization
When it comes to stemming and lemmatization and their impact on the TF−ISF method, the results are not clearly presented, unlike the TF−IDF method, where the impact is clear. In paper [12] stemming is mentioned in the context of sentence retrieval. The paper states that stemming can improve recall but can hurt precision, because words with distinct meanings may be conflated to the same form (such as "army" and "arm"), and these mistakes are costly when performing sentence retrieval. Furthermore, paper [12] states that terms from queries that are completely clear and unambiguous can, after stop word removal and stemming, match sentences that are not even from the same topic.

Using the TF−ISF method for sentence retrieval in combination with stemming and lemmatization
For sentence retrieval in this paper we use the TF−ISF method, based on the vector space model of information retrieval.
The ranking function is as follows:

tfisf(q, s) = Σ_{t ∈ q} log(tf_{t,q} + 1) · log(tf_{t,s} + 1) · log((n + 1) / (0.5 + sf_t))        (1)

Where:
• tf_{t,q} is the number of appearances of the term t in a query,
• tf_{t,s} is the number of appearances of the term t in a sentence,
• n is the number of sentences in the collection and
• sf_t is the number of sentences in which the term t appears.

The search engine for sentence retrieval with ranking function (1) uses data pre-processing consisting of the following three steps: stop word removal, stemming and lemmatization. In information retrieval, there are many words that carry no useful information. Such words are called stop words. Stop words are specific to each language and make the language functional, but they do not carry any information (e.g. pronouns, prepositions, conjunctions) [15]. For example, there are around 400 to 500 stop words in the English language [15]. Words that appear often at the collection level can be eliminated through tools like RapidMiner or programmatically, so as not to have an impact on ranking.
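Ranking function (1) can be sketched in code as follows. This is a minimal illustration only, assuming a hypothetical whitespace tokenizer and a small stop word list; it is not the implementation used in our experiments.

```python
import math

# Hypothetical, deliberately small stop word list for illustration.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is"}

def tokenize(text):
    # Lowercase, split on whitespace and drop stop words.
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_isf_score(query, sentence, sentences):
    """Score one sentence against a query with ranking function (1).

    n    ... number of sentences in the collection
    sf_t ... number of sentences in which term t appears
    """
    n = len(sentences)
    tokenized = [tokenize(s) for s in sentences]
    q_tokens = tokenize(query)
    s_tokens = tokenize(sentence)
    score = 0.0
    for t in set(q_tokens):
        tf_q = q_tokens.count(t)
        tf_s = s_tokens.count(t)
        sf_t = sum(1 for toks in tokenized if t in toks)
        score += (math.log(tf_q + 1) * math.log(tf_s + 1)
                  * math.log((n + 1) / (0.5 + sf_t)))
    return score
```

Note that a term absent from the sentence contributes log(0 + 1) = 0, so only matching terms affect the score.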
There are several different methods for removing stop words presented in [16], such as:
• Z-Methods;
• The Mutual Information Method (MI);
• Term Based Random Sampling (TBRS).
In this paper we used the classic method of removing stop words based on a previously compiled list of words. Part of the list of words removed in pre-processing is shown in Figure 1, which is a snippet from our source code.

Stemming refers to the process of removing prefixes and suffixes from words. There are many different stemming algorithms. Some of them use a so-called "bag of words" that contains words that are semantically identical or similar but written as different morphological variants. By applying stemming algorithms, words are reduced to their root, allowing documents to be represented by the stems of words instead of the original words. In information retrieval, stemming is used to avoid mismatches that may undermine recall. For example, if a user searches for a document entitled "How to write" with the query "writing", the query will not match the terms in the title. However, after the stemming process, the word "writing" is reduced to its root (stem) "write", which then matches the term in the title. We use Porter's stemmer, one of the most commonly used stemmers, which applies a set of rules and eliminates suffixes iteratively. Porter's stemmer has a well-documented set of constraints, so words such as "fisher", "fishing" and "fished" get reduced to the word "fish" [17]. Porter's stemmer algorithm is divided into five steps that are executed linearly until the final word shape is obtained [18]. In paper [19] a modified version of the Porter stemmer was proposed.
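To make the suffix-stripping idea concrete, here is a deliberately simplified stemmer sketch. It only strips a few common suffixes and is not the five-step Porter algorithm described above, which additionally restores trailing letters (e.g. "writing" to "write").

```python
def simple_stem(word):
    # Toy suffix stripper: remove the first matching suffix, but only
    # if at least three characters of the stem would remain.
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

With this sketch, "fishing" and "fished" both reduce to "fish", so a query and a sentence using different morphological variants can still match.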
Lemmatization is an important pre-processing step for many applications of text mining and is also used in natural language processing [20]. Lemmatization is similar to stemming: both reduce a word variant to a base form, the "stem" in stemming and the "lemma" in lemmatization [21]. It uses vocabulary and morphological analysis to return words to their dictionary form [11,20]. Lemmatization converts each word to its basic form, the lemma [22]. In the English language, lemmatization and stemming often produce the same results. Sometimes the normalized/basic form of a word differs from the stem: e.g. "computes", "computing" and "computed" are stemmed to "comput", but the lemma of those words is "compute" [20]. Stemming and lemmatization play an important role in increasing recall [23,24].
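The contrast with stemming can be illustrated with a toy dictionary-based lemmatizer. Real lemmatizers rely on full vocabularies and morphological analysis; this hypothetical lookup table only covers the example words from the text.

```python
# Hypothetical lemma dictionary covering only the examples above.
LEMMA_DICT = {
    "computes": "compute",
    "computing": "compute",
    "computed": "compute",
}

def lemmatize(word):
    # Return the dictionary form if known, otherwise the word itself.
    word = word.lower()
    return LEMMA_DICT.get(word, word)
```

Here "computing" is mapped to the valid dictionary word "compute", whereas a stemmer would produce the non-word "comput".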

Data set used and experiment setup
Testing was performed on data from the TREC Novelty tracks [3]-[5]. Three Novelty Tracks were used in the experiment: TREC 2002, TREC 2003 and TREC 2004. Each of the three Novelty Tracks has 50 topics, and each topic consists of a "title", a "description" and a "narrative". Figure 3 shows a snippet from one of the 25 documents assigned to topic N56; the document contains multiple sentences in the format: <s docid="xxx" num="x">Content of Sentence</s>. In our experiment we extract single sentences from the collection. During the extraction we assign a docid (document identifier) and num (sentence identifier) to each sentence.
Three data collections were used (Table 1 and Table 2). For result evaluation one file is available which contains a list of relevant sentences [25]. Figure 4 shows a snippet from the relevant sentence file. The entry "N56 NYT19990409.0104: 16" defines sentence "16" of document "NYT19990409.0104" as relevant to topic "N56".
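The extraction of sentences and relevance judgments can be sketched as follows, assuming exactly the two formats shown above; the function names and parsing details are illustrative, not the code used in the experiment.

```python
import re

# Matches sentences in the form <s docid="xxx" num="x">Content</s>.
SENT_RE = re.compile(r'<s docid="([^"]+)" num="(\d+)">(.*?)</s>', re.S)

def parse_sentences(text):
    # Return (docid, num, sentence) triples from the tagged document text.
    return [(d, int(n), s.strip()) for d, n, s in SENT_RE.findall(text)]

def parse_relevance_line(line):
    # e.g. 'N56 NYT19990409.0104: 16' -> ('N56', 'NYT19990409.0104', 16)
    topic, rest = line.split(maxsplit=1)
    docid, num = rest.split(":")
    return topic, docid.strip(), int(num.strip())
```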
Using the presented TREC data, we first test the TF−ISF method without any pre-processing. Then we test the same TF−ISF method with stemming and with lemmatization. We run all three tests twice: first with short queries and then with long queries. In all of our tests we use stop word removal.
We denote the baseline method as TF−ISF, the method with stemming as TF−ISF_stem and the method with lemmatization as TF−ISF_lem.

Result and discussion
As already mentioned, we wanted to test whether the data pre-processing steps stemming and lemmatization affect sentence retrieval. We also wanted to analyse whether the effect of pre-processing differs with different query lengths. For test evaluation we used standard measures: P@10, R-precision and Mean Average Precision (MAP) [26,27].
Precision at x, or P@x, can be defined as:

P@x = r_x / x

where r_x is the number of relevant sentences among the top x sentences of the result. The P@10 values shown in this paper refer to the average P@10 over 50 queries.
R-precision can be defined as [26]:

R-precision = r / |Rel|

Where:
• |Rel| is the number of sentences relevant to the query,
• r is the number of relevant sentences in the top |Rel| sentences of the result.
As with P@10, we also calculate the average R-precision over 50 queries. Another well-known measure is Mean Average Precision, which gives similar results to R-precision.
Mean Average Precision and R-precision are used to test high recall. High recall means it is more important to find all relevant sentences, even if that means searching through many sentences, including many non-relevant ones. In contrast, P@10 is used for testing precision. Precision in terms of information retrieval means it is more important to return only relevant sentences than to find all of the relevant sentences.
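The three measures can be sketched as follows, assuming a ranked list of sentence identifiers and a set of relevant identifiers for one query; averaging over the 50 queries is left out for brevity.

```python
def precision_at(ranked, relevant, x=10):
    # Fraction of the top-x retrieved sentences that are relevant.
    return sum(1 for s in ranked[:x] if s in relevant) / x

def r_precision(ranked, relevant):
    # Precision at rank |Rel|: relevant items among the top |Rel| results.
    r = len(relevant)
    return sum(1 for s in ranked[:r] if s in relevant) / r

def average_precision(ranked, relevant):
    # Mean of the precision values at the rank of each relevant item;
    # MAP is this value averaged over all queries.
    hits, total = 0, 0.0
    for i, s in enumerate(ranked, start=1):
        if s in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)
```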
For result comparison we used a two-tailed paired t-test with significance level α=0.05. Statistically significant improvements in relation to the base TF−ISF method (without data pre-processing) are marked with an asterisk (*). The results of our tests on the different data sets are presented below in tabular form.

Table 3 shows the results of our tests on the TREC 2002 collection with the short queries presented in Figure 2 and labeled with <title>. From Table 3 we see that the method with stemming (TF−ISF_stem) and the method with lemmatization (TF−ISF_lem) show statistically significantly better results in comparison to the baseline method when it comes to the MAP measure. Table 4 shows the results of our tests on the TREC 2002 collection with the longer queries presented in Figure 2 and labeled with <desc>. Only TF−ISF_lem provides better results and statistically significant differences in relation to the base TF−ISF method (without data pre-processing) when the MAP measure is used. We can see that stemming performs a little worse with longer queries in relation to the base TF−ISF method.

Table 5 and Table 6 show the results of our tests using the TREC 2003 collection with short and longer queries respectively. Table 5 shows that TF−ISF_stem and TF−ISF_lem provide better results and statistically significant differences in relation to the base TF−ISF method when the MAP and R-prec. measures are used. Table 6 shows that lemmatization keeps producing statistically significantly better results even with long queries, unlike the method that uses stemming.

Table 7 and Table 8 show the results of our tests on the TREC 2004 collection with short and longer queries. Table 7 shows that the TF−ISF_stem method with short queries provides better results in comparison to the baseline when it comes to the MAP measure. Table 8 shows that with longer queries both methods show statistically significantly better results. Looking at all the tables above, we see that stemming and lemmatization often give statistically significantly better results when it comes to MAP and R-prec.
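The significance test above can be sketched as a paired t-statistic computed over per-query scores for two methods. This is an illustrative implementation, assuming per-query AP values as input; the default critical value 2.0096 is the two-tailed t quantile for α=0.05 at df=49 (i.e. 50 queries) and must be adjusted for other sample sizes.

```python
import math

def paired_t_statistic(scores_a, scores_b):
    # t statistic of the per-query score differences (paired design).
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

def significant(scores_a, scores_b, t_crit=2.0096):
    # Two-tailed test: reject the null hypothesis if |t| exceeds t_crit.
    return abs(paired_t_statistic(scores_a, scores_b)) > t_crit
```

In practice a library routine such as a paired t-test from a statistics package would also report the exact p-value rather than comparing against a fixed critical value.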
Therefore, we can assume that these pre-processing steps have a similar positive effect on sentence retrieval as they have on document retrieval. Let us analyse how query length impacts our two methods (TF−ISF_stem and TF−ISF_lem). Table 9 shows an overview of the overall number of statistically significantly better results for four pairs (stem with short queries, stem with long queries, lem with short queries, lem with long queries).
As we can see, TF−ISF_stem seems to work better with short queries and TF−ISF_lem seems to work better with long queries. At the moment we do not have enough data to examine this behaviour further, but it will be a topic of our future research.
In the examples that follow, every match of words between query and sentence is marked bold. Matches that occurred with stemming or lemmatization but not with the baseline are marked bold and underlined.
In Table 10 and Table 11 we can clearly see some words that could be matched thanks to stemming and lemmatization. For example, if we look at a short query and a sentence through the three different methods shown in Table 10, we can see how the words "dies" and "died" in the query and the sentence are reduced by stemming and lemmatization to the form "die", which makes an overlap between the query and the sentence possible. The tables also show a few more examples of words that could be matched thanks to stemming and lemmatization, and why a sentence gets a better position in the search result.

Conclusion
In this paper we showed through multiple tests that the pre-processing steps stemming and lemmatization have clear benefits when it comes to sentence retrieval. In most of our tests we got better results when combining TF−ISF with stemming or lemmatization. However, the positive effects only appeared with the measures MAP and R-prec., which emphasize recall. At the same time, the pre-processing steps did not show any negative effects on sentence retrieval. Therefore, we think that stemming and lemmatization are generally beneficial to sentence retrieval. We saw that stemming tends to show better results with short queries while lemmatization tends to show better results with longer queries, which we will explore in more detail in the future.