N-grams
An n-gram is a sequence of symbols extracted from a long string [13]. The symbol can be a byte, character or word [14]. In simple terms, an n-gram is a collection of tokens where « n » equals the number of tokens contained within the collection. A token in this context can be any portion of data divided into smaller logical pieces by the n-gram creator. One of the simplest examples to consider is a text document. A token within a text document might represent each individual word, delimited by spaces, with all punctuation characters removed. However, many alternative tokenization strategies could be devised to produce tokens from a given text document. A word n-gram is a sequence of n successive words. Considering the text document example, a 3-gram would contain 3 word tokens and a 4-gram would contain 4 word tokens. Once the n-gram parameter « n » and a tokenization method have been decided upon, the first n-gram is produced from the first « n » tokens.
From that point on, additional n-grams are produced by removing the first token of the current n-gram and appending the next token to its end. This process continues until the last « n » tokens of the document are reached and the last n-gram is created. Similarly, an n-gram of characters is a sequence of n successive characters [15]. Extracting character n-grams from a document is like moving an n-character-wide ‘window’ across the document character by character [14]. Each window position covers n characters, defining a single n-gram. It is a bi-gram for n = 2, a tri-gram for n = 3, a quadri-gram for n = 4, etc. [16]. For example, cutting the word ‘computer’ into tri-grams gives ‘com’, ‘omp’, ‘mpu’, ‘put’, ‘ute’, ‘ter’. Nowadays, character n-grams are used in many types of applications, including computational linguistics, DNA sequencing, protein sequencing and data compression, to name a few [17] [18] [19] [20]. Character n-grams have previously been used for subjectivity recognition in text and speech [21], classification of images [16], language-independent categorization of text [22] and so on.
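The sliding-window extraction described above can be made concrete with a short sketch. The snippet below is not the authors' implementation; it simply reproduces the ‘computer’ tri-gram example, and any preprocessing (lower-casing, punctuation removal) is left to the caller.

# A minimal sketch (not the authors' code) of the sliding-window extraction of
# character n-grams described above; preprocessing such as lower-casing or
# punctuation removal is left to the caller.
def char_ngrams(text: str, n: int = 3) -> list[str]:
    # Slide an n-character window over the text, one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Tri-grams of 'computer', matching the example above:
print(char_ngrams("computer", 3))
# ['com', 'omp', 'mpu', 'put', 'ute', 'ter']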
On the other hand, Wilson et al. have shown that there is value in using very shallow linguistic representations, such as character n-grams, for recognizing subjective utterances, in particular gains in the recall of subjective utterances [23]. Besides, Kanaris et al. presented a comparison of words and character n-grams in the framework of content-based anti-spam filtering [24]. The most important property of the character n-gram approach is that it avoids the use of tokenizers, lemmatizers and other language-dependent tools. In addition to character n-grams of fixed length, variable-length n-grams can be used, based on a technique originally developed for extracting multi-word terms in information retrieval applications [24]. Results of cost-sensitive evaluation indicate that the variable-length n-gram model is more effective in any of the three examined cost scenarios (i.e. low, medium or high cost). Although the majority of the variable-length n-grams consists of 3-grams, they have only a few members in common with the fixed-length 3-gram set. Hence, the information included in the variable-length n-grams is quite different from the information represented by case-sensitive 3-grams. In this paper, we develop a notion of similarity and distance between character n-grams. We build our own methodology for measuring the distance between character n-grams, which is a special case of n-gram distance and similarity respectively. Besides, for measuring similarity, we use five popular co-occurrence text similarity measures. We also use a (0, 1) scale to interpret the measures of similarity and dissimilarity more precisely: on this scale, a value close to 1 symbolizes intense similarity or dissimilarity between two files.
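Since the five co-occurrence measures named above are used throughout the examples that follow, a hedged sketch of their standard set-based forms may help. The formulas below are the usual definitions over n-gram sets A and B, not a reproduction of equations (i)–(vii); in particular, Simple Matching is computed over the universe A ∪ B (an assumption), in which case it coincides with Jaccard, which is consistent with the equal Jaccard and Simple Matching values reported later.

# Hedged sketch of the five co-occurrence measures on sets of character
# n-grams, using the standard set-based definitions (not the paper's exact
# equations (i)-(vii)).
import math

def similarity_measures(a: set[str], b: set[str]) -> dict[str, float]:
    inter = len(a & b)
    union = len(a | b)
    return {
        "jaccard": inter / union if union else 0.0,
        "dice": 2 * inter / (len(a) + len(b)) if (a or b) else 0.0,
        "overlap": inter / min(len(a), len(b)) if a and b else 0.0,
        "cosine": inter / math.sqrt(len(a) * len(b)) if a and b else 0.0,
        # Universe restricted to A | B (assumption), so co-absences are zero
        # and Simple Matching reduces to Jaccard.
        "simple_matching": inter / union if union else 0.0,
    }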
Several Examples
For another evaluation, let us consider character tri-grams. First, we collect our text from different kinds of web-pages, blogs, journals, etc. We applied our algorithm on more than 30 sentences collected from a web-page such as « Wikipedia » and obtained a satisfactory result. We took 15 sentences from Wikipedia for each of the two keywords, « Apple Inc. » as a company name and « Apple » as a fruit name, separately. There are many other sources for large text files, but because of its versatile collection of sentences we chose Wikipedia. Besides, our local system has some limitations in processing large files, which is why we took only 30 sentences for the further procedure. Using the steps and equations (i), (ii), (iii), (iv), (v), (vi) and (vii) described above, we obtain the following result. In our opinion, these two keywords are totally different, so their texts must be different as well. From the above table, the dissimilarity coefficient is 0.99 ≈ 1. That means these two large sets of sentences are quite dissimilar to each other. On the other hand, the similarity values are less than the threshold value (α). So, we can conclude that our methodology provides a satisfactory result. Now, we will provide some examples to verify that our methodology is free from the grammatical rules of the English and French languages respectively. As we know, speakers of any language use many forms of grammatical rules to give clear instructions as well as descriptions in all sources of information. Because the instructions of grammar are followed step by step, written language tends to be more standardized nowadays. So, we can say that every source of information follows the rules of grammar and contains a large vocabulary. There are many categorizations of grammatical rules. First, we will demonstrate the English grammatical rules with examples, and afterwards we will discuss French with examples, in the following way:
Noun in a simple sentence / phrase / clause – consists of the name of a person, place or thing, etc. For example, let us consider two noun phrases, i.e. text1 – ‘Karen rode on a yellow skate board.’ and text2 – ‘Karen lives in the yellow house.’ Result: using the steps and the equations (i), (ii), (iii), (iv), (v), (vi) and (vii) described above, we get the following result. In our opinion, these two sentences are totally different. Besides, from the above table, the dissimilarity section contains the highest values – the dissimilarity coefficients of Jaccard and Dice are 0.95 and 0.9 respectively, which is ≈ 1, meaning the given sentences are dissimilar. Besides, the similarity values of Jaccard, Dice, Overlap, Cosine and Simple Matching are less than the threshold value (α ≥ 0.3). So, it has been shown that these two sentences are dissimilar.
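For illustration, the two noun phrases above can be fed through the helpers sketched earlier. The lower-casing and punctuation stripping below, as well as the decision rule, are assumptions about the preprocessing, so the resulting numbers may differ slightly from those in the table.

# Illustrative use of char_ngrams and similarity_measures (defined in the
# earlier sketches) on the two noun phrases above. Preprocessing is assumed.
import re

def preprocess(text: str) -> str:
    # Assumed preprocessing: lower-case and keep only letters and spaces.
    return re.sub(r"[^a-z ]", "", text.lower())

text1 = "Karen rode on a yellow skate board."
text2 = "Karen lives in the yellow house."

g1 = set(char_ngrams(preprocess(text1), 3))
g2 = set(char_ngrams(preprocess(text2), 3))
scores = similarity_measures(g1, g2)
alpha = 0.3  # threshold value used in the paper
print({k: round(v, 2) for k, v in scores.items()})
# One possible decision rule (assumption): similar if any measure clears alpha.
print("similar" if max(scores.values()) >= alpha else "dissimilar")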
Declarative Sentence – simply makes a statement or expresses an opinion. In other words, it makes a declaration. This kind of sentence ends with a period. For example, let us consider two texts: text1 – ‘I want to be a good writer.’ and text2 – ‘My friend is a really good writer.’ Now, using the steps as well as equations (i), (ii), (iii), (iv), (v), (vi) and (vii), we get the following results. In our opinion, these two sentences are quite different, but the quality, i.e. ‘good writer’, of the two different subjects is the same. Between these two sentences, the first expresses the future aim of someone whereas the second expresses the present situation of someone. At the same time, our method gives a result which exactly matches our opinion. From our table, the dissimilarity values are higher than the similarity values. We got the similarity values 0.17, 0.29, 0.52, 0.32 and 0.17 given by Jaccard, Dice, Overlap, Cosine and Simple Matching respectively, where Jaccard and Dice give the dissimilarity values 0.83 and 0.71 respectively. Because of the shared phrase ‘good writer’ in these two sentences, Overlap and Cosine provide similarity values of 0.52 and 0.32, which are greater than our threshold value (α ≥ 0.3). But, from the table, the similarity values of Jaccard, Dice and Simple Matching are much lower than our threshold value (α ≥ 0.3), which satisfies our expectation.
Imperative Sentences – gives a command or makes a request. It usually ends with a period but can, under certain circumstances, end with an exclamation point. For example, let us consider two sentences: text1 – ‘Please sit down.’ and text2 – ‘I need you to sit down now!’ Using the seven noted equations as well as the steps, we get the following results. According to our results, the similarity values are lower than the dissimilarity values, except for the Overlap coefficient. Here, the Overlap coefficient gives the similarity value 1, which means this coefficient shows intense similarity between the given sentences, satisfying our expectation.
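The Overlap value of 1 follows from the definition of the coefficient: whenever every n-gram of the shorter text also appears in the longer one, |A ∩ B| = min(|A|, |B|) and the score is exactly 1 even though Jaccard and Dice remain low. A toy illustration with hypothetical strings, reusing the earlier helpers:

# Toy illustration (hypothetical strings, not the paper's exact sentences):
# the smaller tri-gram set is fully contained in the larger one, so the
# Overlap coefficient is 1 while the other measures stay well below it.
small = set(char_ngrams("sit down", 3))
large = set(char_ngrams("you need to sit down now", 3))
print(small <= large)                                 # True: full containment
print(similarity_measures(small, large)["overlap"])   # 1.0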
EXPERIMENTATION
In this section, we categorize the similarity results obtained from texts collected from different web-pages. Using four categories of sentence pairs, namely ‘positive similarities when the similarity of the texts should be positive’, ‘positive similarities when the similarity of the texts should be negative’, ‘negative similarities when the similarity of the texts should be positive’ and ‘negative similarities when the similarity of the texts should be negative’, we illustrate the comparison between our methodology and the methodology of the most recent work of Akermi and Faiz [2]. The comparison of our method with some classified examples is given below. We will discuss each section of the earlier table. For the first section, entitled ‘Positive similarities when the similarities of the texts should be positive’, we took two similar texts whose meaning is positive. Here, each text contains two sentences, but in a different order, and the meaning of these two texts is positive as well. Using our method, Dice, Overlap and Cosine provide the same similarity value, i.e. 0.3985507, which is greater than the threshold value (α ≥ 0.3). So, according to our method, we can say that these two texts are similar. On the other hand, Akermi and Faiz’s algorithm [2] is unable to identify the name of a person (‘Saima Sultana’) and also a place (‘Canada’) through either the dictionary or the web search engine. So, we can say that their method was unable to detect the similarity between these two texts.
For the second section, entitled ‘Positive similarities when the similarities of the texts should be negative’, we took two similar texts whose meaning is negative. Using our method, the similarity values of Dice, Overlap and Cosine are 0.4081633, 0.4761905 and 0.4123931 respectively, and these values are greater than the threshold value (α ≥ 0.3). Besides, we got the same similarity value, 0.2564103, from Jaccard and Simple Matching, and this value is lower than the threshold value (α ≥ 0.3). So, only because of the word ‘girl’ do we get some similarity from Dice, Overlap and Cosine, and we get an idea of how similar these two sentences are on the (0, 1) scale, while the meaning of these two sentences is different. On the other hand, the lower similarity values from Jaccard and Simple Matching show the extent to which the meanings of these two sentences differ. Meanwhile, Akermi and Faiz’s algorithm [2] gives a similarity value of 1, meaning it shows 100% similarity between these two sentences, while the meaning of these two sentences is totally different. For the third section, entitled ‘Negative similarities when the similarities of the texts should be positive’, we took two different texts whose meaning is positive. Using our method, we get a similarity value of 0, which means the method considers these two sentences totally dissimilar; we consider this to be a limitation of our methodology. On the other hand, the similarity value from Akermi and Faiz’s algorithm [2] for these two sentences is 0.4173559.
For the fourth section, entitled ‘Negative similarities when the similarities of the texts should be negative’, we took two different texts with negative meaning. With our method, we found that Jaccard and Simple Matching give the same similarity value, i.e. 0.006479482. Besides, the other similarity values are 0.0129, 0.0158 and 0.0131, given by Dice, Overlap and Cosine respectively. These values are lower than the threshold value (α ≥ 0.3), which signifies the dissimilarity between these two texts. On the other hand, the similarity score from Akermi and Faiz’s algorithm [2] is 0.2, which is greater than the similarity values of our method; moreover, their algorithm is unable to detect the university name « UQTR » and also the word « master’s ». We have used around 150 sentences in total for experimenting with the methodology, and each time the selected n-gram size was 3 (the nGramSize parameter). For the experiments, we use plain text as the input file, but our real application can also process pdf, doc, docx, ppt, pptx, plain text, web-pages and so on. Although opinions may vary from person to person, we will try to determine the accuracy of our results based on the actual relationship between the experimented texts. For this, we have four parameters – True Similar (TS), False Similar (FS), True Dissimilar (TD) and False Dissimilar (FD) – to describe the actual relationship between two texts. According to our concept, we first determine the relationship between the given texts and then, based on the result of our method, determine the relationship predicted for those texts. Here, we will find the accuracy of the grammatical part of both English and French as well as of the multilingual English-to-French cases respectively. We take only the similarity scores that satisfy our expectations when comparing those values with the threshold value (α ≥ 0.3), and then we are able to determine the relationship between the given texts according to our method. Now, we will briefly demonstrate how we obtain TS, TD, FS and FD from the results of our methodology –
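The TS, FS, TD and FD counts lend themselves to the usual confusion-matrix treatment. The authors' exact definitions are demonstrated next, so the sketch below only mirrors the standard convention as an assumption: a pair is classified by comparing its similarity score with the threshold α ≥ 0.3, and accuracy is taken as the assumed analogue of (TP + TN) / total.

# Hedged sketch: classify a text pair by comparing its similarity score with
# the threshold, then aggregate the four counts into an accuracy figure.
# Both the decision rule and the accuracy formula are assumptions modelled on
# the standard confusion matrix, not the authors' exact definitions.
def classify(score: float, truly_similar: bool, alpha: float = 0.3) -> str:
    predicted_similar = score >= alpha
    if predicted_similar and truly_similar:
        return "TS"   # True Similar
    if predicted_similar and not truly_similar:
        return "FS"   # False Similar
    if not predicted_similar and not truly_similar:
        return "TD"   # True Dissimilar
    return "FD"       # False Dissimilar

def accuracy(ts: int, fs: int, td: int, fd: int) -> float:
    # Assumed analogue of (TP + TN) / (TP + TN + FP + FN).
    return (ts + td) / (ts + fs + td + fd)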
ACKNOWLEDGEMENTS