Download the Persian stemming datasets for evaluation
The Persian stemming dataset is a collection of Persian words paired with the stems obtained for them through morphological analysis. The words are provided as a text file and are typically used in natural language processing or for building machine learning models. It is regarded as one of the important datasets in the field of Persian language processing.
About stemming
Stemming is a natural language processing technique that applies linguistic rules and algorithms to reduce words to their base form, or stem. It is typically used in text analysis to map different word forms that share a common root or meaning onto a single form.
For example, stemming converts the words "I'm going", "I'm gone" and "we went" into the single form "went". This is very useful for text processing and analysis: reducing the number of distinct word forms makes linguistic rules and patterns easier to identify, so text can be analyzed and processed faster and more accurately.
Stemming can be carried out with different algorithms, including the Porter algorithm, lemmatization-based approaches and the closest path algorithm. These algorithms reduce words to their base form or stem according to linguistic rules and lexical patterns.
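To make the idea of rule-based suffix stripping concrete, here is a minimal sketch of step 1a of the Porter algorithm, which handles plural endings. Note the assumptions: the Porter algorithm was designed for English, so this is not a Persian stemmer; only the first of its several steps is shown; and the function name `porter_step_1a` and the sample word list are made up for illustration.

```python
def porter_step_1a(word):
    """Apply step 1a of the Porter algorithm (English plural suffixes).

    Rules are tried in order and only the first match is applied:
        SSES -> SS, IES -> I, SS -> SS, S -> (removed)
    """
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # caress stays caress
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word


# Illustrative word list (not taken from the dataset).
words = ["cats", "cat", "caresses", "caress", "ponies"]
stems = {porter_step_1a(w) for w in words}
print(len(words), "word forms ->", len(stems), "distinct stems")
```

The output of a step such as this is a stem rather than a dictionary word ("poni" instead of "pony"), which is typical of rule-based stemmers.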
There is no standard dataset for evaluating the accuracy of Persian stemming algorithms. Building one requires a set of words together with their stems. The datasets offered here were extracted automatically from two manually stemmed corpora. The first dataset consists of words and their stems extracted from the PerTreeBank collection and contains 4,689 distinct words. To allow a more thorough evaluation, a larger corpus was chosen for the second dataset: its words and stems were extracted from the Persian Dependency Treebank, and it contains 26,913 distinct words. Both datasets are of good quality in terms of the variety of part-of-speech tags they cover.
Each stemming dataset consists of three columns: the first is the inflected word, the second is its stem, and the third is its part of speech. To evaluate your own stemmer, add the stems it produces as a fourth column. Once that column is filled in, the file is ready for the evaluation command.
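As a rough sketch of what such an evaluation could compute (this is not the dataset's own tooling), the snippet below reads a four-column file and reports how often the stem in the fourth column matches the reference stem in the second. It assumes the columns are tab-separated, which the description above does not state; the function names, the part-of-speech tags and the sample rows are all hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_rows(handle):
    """Read rows of (word, reference stem, POS, predicted stem).

    Assumes tab-separated columns; rows without exactly four fields are skipped.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if len(row) == 4]


def evaluate(rows):
    """Return (overall accuracy, accuracy per part-of-speech tag)."""
    correct = 0
    per_pos = defaultdict(lambda: [0, 0])  # tag -> [correct, total]
    for _word, reference, pos, predicted in rows:
        hit = int(predicted == reference)
        correct += hit
        per_pos[pos][0] += hit
        per_pos[pos][1] += 1
    overall = correct / len(rows) if rows else 0.0
    by_pos = {pos: c / t for pos, (c, t) in per_pos.items()}
    return overall, by_pos


# Made-up sample rows in the assumed layout (not real dataset content).
SAMPLE = (
    "books\tbook\tN\tbook\n"
    "walked\twalk\tV\twalk\n"
    "running\trun\tV\trunn\n"
)

overall, by_pos = evaluate(read_rows(io.StringIO(SAMPLE)))
print(f"overall accuracy: {overall:.2f}")
for pos, acc in sorted(by_pos.items()):
    print(f"{pos}: {acc:.2f}")
```

To run it on a real file, pass an open file handle to `read_rows`, for example `open(path, encoding="utf-8", newline="")`, in place of the in-memory sample. Exact string matching is the simplest possible criterion; a real evaluation might normalize characters before comparing.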
Sample images of the dataset
Dear users, we recommend downloading this dataset.