Author Identification for Marathi Language

ARTICLE INFO
Article history: Received: 22 July, 2019; Accepted: 09 December, 2019; Online: 04 April, 2020

ABSTRACT
This is an era of new technology in which most information is collected from the internet and websites. Some people use material from research papers, theses, and websites as it is and publish it as their own research without proper acknowledgement. This practice is known as plagiarism. There are two types of plagiarism detection methods: i) extrinsic plagiarism detection and ii) intrinsic plagiarism detection. Extrinsic detection finds plagiarism by comparing the text against a reference corpus, while intrinsic detection identifies it from the author's writing style. If an anonymous text is written by an unknown author, authorship analysis can be used to find the original author of the text. Authorship analysis has three types: i) author identification, ii) author characterization, and iii) similarity detection. This paper mainly focuses on author identification for the Marathi language. To calculate the projection between two files, we used the feature vectors of the main author file and of the summary files written by other authors. The average projection shows that the main author file and the summary files of different authors are similar, and that each author's summary file is influenced by the main author file.


Introduction
Plagiarism includes copying material, word for word or as a paraphrase, from any source: books, websites, course notes, oral or visual presentations, lab reports, computer assignments, or artistic works. It includes reproducing someone else's work, whether a published article, a chapter of a book, a paper from a friend, or any other file. In addition, plagiarism includes employing another person, whoever that may be, to alter or revise work that a student submits as his or her own. Authorship identification is the ability to identify unknown authors based on their previous work and statements. The main method in authorship identification is to examine and identify an author's characteristic traits using stylometric features. We can determine the writing style of an author by identifying the textual features that he or she used while writing a document [1].

Authorship Analysis
Authorship analysis is a method of analyzing the features of a piece of writing in order to draw conclusions about its authorship [1]. Authorship analysis has three types: i) authorship identification, ii) authorship characterization, and iii) similarity detection.
A. Authorship identification: It determines the likelihood that a piece of writing was produced by a specific author by examining the author's other writings.
B. Authorship characterization: It reviews the characteristics of an author and produces an author profile based on his or her writing.
C. Similarity detection: It examines several pieces of writing and judges whether they were written by a single author, without actually identifying the author [1].

Literature Survey
The PAN workshop brought together experts and researchers around the exciting and future-oriented topics of plagiarism detection, authorship identification, and the detection of social software misuse. The features and techniques used in the surveyed work are summarized below.
The author follows the unmasking approach.
[11] Features: 1. length of the sentences, 2. variety of vocabulary, 3. word n-grams and character n-grams, 4. punctuation marks. Technique: the author compares all documents inside a corpus using the cosine similarity, the Euclidean distance, or the correlation coefficient. For the task of author verification, the Classification and Regression Trees (CART) algorithm is used, which constructs binary trees using the features and thresholds that yield the largest information gain at each node.
[12] Features: profiles of character 3-grams representing information about the different categories of authors. Technique: baseline accuracy obtained in cross-genre classification by age and gender using Naive Bayes with a tf-idf word representation.
[13] Features: word bag, stop word bag, punctuation bag, part of speech (POS) bag. Technique: the KNN algorithm is used.
[14] Features: 1. counting text elements, 2. constructing syntactic n-grams. Technique: an integrated syntactic graph is used.
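The three document-comparison measures named in the survey (cosine similarity, Euclidean distance, and the correlation coefficient) can be illustrated with a minimal sketch. It assumes each document has already been reduced to a numeric feature vector of equal length, and it takes the correlation coefficient to be Pearson's; the function names are illustrative and do not come from the cited work.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation coefficient (assumed meaning of 'correlation')."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # convention chosen here for a constant vector
    return cov / math.sqrt(var_a * var_b)
```

Cosine similarity and correlation ignore the overall magnitude of the vectors, whereas Euclidean distance does not, so the measures can rank the same pair of documents differently when text lengths differ.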

Text Corpus
As with other languages, there is appreciable work in the Marathi language. However, this work is not accessible as an online resource; so far it exists only offline, and no generic Marathi text corpus is available. To develop a text corpus, we selected 10 paragraphs and collected summaries of them from 50 users, each written in the user's own words. We used the resulting 500 summary files from 50 users as the database for author identification.

Proposed System
We propose a system for author identification in the Marathi language. The system workflow is given below:

Feature Extraction
Feature extraction can be defined as the process of deriving a set of new features from the set of features generated in the feature selection stage. It is a basic and fundamental step in pattern recognition and machine learning problems. There is no text corpus available for the Marathi language.
We concentrated on two major feature groups: lexical features and vocabulary richness features. These include average sentence length by word, average sentence length by character, AvgWordFrequencyClass, average sentence length, hapax legomena, and hapax dislegomena.
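As an illustration of the two sentence-length features, the following is a minimal sketch and not the system's actual implementation. It assumes that sentences end at a full stop, question mark, exclamation mark, or the Devanagari danda (U+0964), and that words are separated by whitespace; the function names are hypothetical.

```python
import re

# Assumed sentence terminators: ".", "!", "?" and the Devanagari danda.
SENTENCE_END = re.compile(r"[.!?\u0964]+")


def split_sentences(text):
    """Split text into non-empty, stripped sentences."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def average_sentence_length_by_word(text):
    """Mean number of whitespace-separated words per sentence."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def average_sentence_length_by_character(text):
    """Mean number of Unicode code points per sentence.

    Spaces are counted, and Devanagari vowel signs count as separate
    code points, so this is not a count of visible syllables.
    """
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(s) for s in sentences) / len(sentences)
```

For example, `average_sentence_length_by_word("a b c. d e.")` averages a three-word and a two-word sentence.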
We have extracted the following features:

Hapax Legomena and Hapax Dislegomena
A hapax legomenon is a term that appears only once in a given context, either in the written record of a whole language or in a single text. Hapax legomenon is a Greek phrase meaning something said only once.
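Counting words by frequency in a single text gives both vocabulary richness features: words that occur exactly once, and, by the same procedure, words that occur exactly twice. The sketch below assumes whitespace tokenization and strips a small, assumed set of punctuation marks from each token; `hapax_counts` is a hypothetical name.

```python
from collections import Counter

# Assumed punctuation to strip from token edges, including the danda marks.
PUNCTUATION = ".,!?;:\"'()\u0964\u0965"


def hapax_counts(text):
    """Return (words occurring exactly once, words occurring exactly twice)."""
    tokens = [w.strip(PUNCTUATION) for w in text.split()]
    freq = Counter(t for t in tokens if t)
    once = sum(1 for count in freq.values() if count == 1)
    twice = sum(1 for count in freq.values() if count == 2)
    return once, twice
```

In the text `"a b b c c c d."`, the words `a` and `d` occur once and `b` occurs twice, so the function returns `(2, 1)`.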
Similarly, a hapax dislegomenon is a word that is used twice. Table 3 shows the features of the original sample files from the database.

O10 S10    A2 S10    1242.74
O10 S10    A3 S10    1230.19
O10 S10    A4 S10    1245.55
O10 S10    A5 S10    1240.36

Figure 4 shows the average projection of 10 files. We calculated the feature vector of the main author file and the feature vector of each summary file written by an author, and then calculated the projection of these two vectors for 10 different sample summary files of five authors. The result shows that the main author file and each author's summary file are similar; that is, each author's summary file is influenced by the main author file. The graph shows that files 4, 5, 6, and 7 have a larger projection on the main author file.
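The text does not state how the projection of two feature vectors is computed. One common reading is the scalar projection of one vector onto another, which the following minimal sketch illustrates. The choice of scalar projection, the direction (summary vector projected onto the main author vector), and the function names are all assumptions.

```python
import math


def scalar_projection(a, b):
    """Scalar projection of vector a onto vector b: (a . b) / |b|."""
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_b == 0:
        raise ValueError("cannot project onto a zero vector")
    return sum(x * y for x, y in zip(a, b)) / norm_b


def average_projection(main_vector, summary_vectors):
    """Mean scalar projection of each summary vector onto the main vector."""
    if not summary_vectors:
        raise ValueError("at least one summary vector is required")
    total = sum(scalar_projection(s, main_vector) for s in summary_vectors)
    return total / len(summary_vectors)
```

Under this reading the projection is not normalized by the length of the summary vector, so its value depends on the scale of the features as well as on the angle between the two vectors.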

Conclusion
Authorship identification is the ability to identify unknown authors based on their previous work and statements. We created a database of 500 summary files from 50 users for author identification. After a literature survey of the features used for author identification, we selected lexical features and vocabulary richness features. Using the feature vectors of the main author file and of the authors' summary files, we calculated the projection for 10 files. The average projection shows that the main author file and the summary files of different authors are similar. Figure 4 shows that each author's summary file is influenced by the main author file, and that summary files 4, 5, 6, and 7 have a larger projection on the main author file. Currently, many native Marathi speakers contribute research on various topics in the Marathi language, but some researchers use information from sources such as research papers, books, and theses without acknowledgement. There is a need to restrict such practices. No author identification tool is available for the Marathi language. This tool will be helpful for carrying out quality research in the Marathi language.

Figure 4: Average projection of each file