Short link: https://en.magicfile.ir/?p=2453
Download the source code for detecting the language of a text, written in VB.NET
For the users of the MagicFile website, we have prepared the source code for detecting the language of a text, written in VB.NET. The language detection solution presented here is based on comparing n-gram and word occurrences. It is suitable for any language that uses words (which is actually not true of all languages). Depending on the model and the length of the input text, accuracy ranges from 70% (short Norwegian, Swedish, and Danish texts only, classified with the "all" model) to 99.8% with the "default" model.
Background
Detecting the language of a written text is probably one of the most basic tasks in natural language processing (NLP). For any language-dependent processing of an unknown text, the first thing you need to know is which language the text is written in. Fortunately, this is one of the easier NLP challenges. The approach I chose to implement is widely known and quite simple. The idea is that each language has a characteristic pattern of character occurrences and co-occurrences.
Sample screenshots of the running program
The first step is to collect those statistics for all the languages that should be recognized. This is not as easy as it may seem at first. The problem is collecting a large set of test data (plain text) that contains only one language and is not domain specific. Newspaper articles alone may lack the word "I" and direct speech. Shakespeare's plays would not be the best basis for recognizing contemporary texts. Medical articles usually contain many domain-specific terms that are not even language-specific (major, minor, arterial, etc.). And if that is not hard enough, the texts should not be copyrighted.

I chose Wikipedia as my main source, and I had to do some filtering. Wikipedia contains many proper names (e.g. names of groups) that often contain a "the" or an "and". That is why those words show up in many languages even though they are not part of the language. This is not necessarily a disadvantage, as Anglicisms have spread widely across many languages. I created three statistics for each language:
- Character set: Some languages have a very specific character set (such as Chinese, Japanese, and Russian). For others, some characters give a good hint of the target language (e.g., German umlauts).
- N-grams: After tokenizing the text into words (where necessary), the occurrences of 1-, 2-, and 3-grams were counted. Some n-grams are very language-specific (for example, "TH" in English).
- Word list: A final source of disambiguation is the words that are actually used. Some languages (such as Portuguese and Spanish) are almost identical in the characters they use and in the occurrence of certain n-grams. However, they use different words at different frequencies.
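The three statistics above can be sketched in a few lines. This is a minimal Python illustration, not the article's VB.NET source: the function names (`char_counts`, `tokenize`, `ngram_counts`, `build_statistics`) are hypothetical, and the regex tokenizer assumes a language that separates words with non-letter characters.

```python
import re
from collections import Counter

def char_counts(text):
    """Count every letter in the text (digits, punctuation, spaces ignored)."""
    return Counter(ch for ch in text.lower() if ch.isalpha())

def tokenize(text):
    """Split the text into lowercase words made of Unicode letters only."""
    return re.findall(r"[^\W\d_]+", text.lower())

def ngram_counts(words, max_n=3):
    """Count all 1-, 2- and 3-grams that occur inside each word."""
    counts = Counter()
    for word in words:
        for n in range(1, max_n + 1):
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    return counts

def build_statistics(text):
    """Build the three per-language tables from one training text."""
    words = tokenize(text)
    return {
        "chars": char_counts(text),    # character set
        "ngrams": ngram_counts(words),  # 1-, 2- and 3-grams
        "words": Counter(words),        # word list
    }

stats = build_statistics("The thin thief thanked the other man.")
print(stats["ngrams"]["th"], stats["words"]["the"])  # prints: 6 2
```

In a real setup, each table would be built from a large single-language corpus rather than one sentence, and only the most frequent entries would likely be kept.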
A set of statistics is called a model. I created subsets of the "all" model that best fit my needs (see the table below). The "common" model contains the 10 most spoken languages in the world. "small" and "default" are based on my own usage scenarios. If you are from another part of the world, your preferences may differ, so please take no offense at my choice of which languages are in which model.
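One way to picture a subset model is as a filter over the per-language statistics of the "all" model. The sketch below is a Python illustration under that assumption, not the article's VB.NET code. The helper `subset_model` and the placeholder statistics are hypothetical. The codes `en`, `fr`, `es`, and `de` are used because they are marked in every model column of the table.

```python
def subset_model(all_model, codes):
    """Return a new model restricted to the given language codes."""
    missing = set(codes) - set(all_model)
    if missing:
        raise KeyError(f"languages not in the 'all' model: {sorted(missing)}")
    return {code: all_model[code] for code in codes}

# Placeholder statistics: a real model would hold the three tables per language.
all_model = {
    code: {"chars": {}, "ngrams": {}, "words": {}}
    for code in ["nl", "en", "ca", "fr", "es", "de", "pt"]
}

core = subset_model(all_model, ["en", "fr", "es", "de"])
print(sorted(core))  # prints: ['de', 'en', 'es', 'fr']
```

Restricting the candidate set this way gives the classifier fewer similar languages to confuse, which may be one reason a smaller model can score better than "all".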
All statistics are sorted and ranked by frequency of occurrence. In the demo program, all models can be inspected in detail. Classifying an unknown text is simple: the text is tokenized and the three statistics tables are generated for it. The resulting tables are compared with the tables of each language in the model, and a distance is calculated. The language whose tables have the smallest distance to those of the unknown text is most likely the language of the text.
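The ranking and comparison step can be sketched as follows. The article does not spell out its distance formula, so this Python sketch assumes the common "out-of-place" rank distance. For brevity it uses only a bigram table instead of all three tables. All names (`ranked`, `rank_distance`, `classify`) and the toy training sentences are hypothetical.

```python
from collections import Counter

def bigrams(text):
    """Count the 2-grams inside each whitespace-separated word."""
    counts = Counter()
    for word in text.lower().split():
        for i in range(len(word) - 1):
            counts[word[i:i + 2]] += 1
    return counts

def ranked(counts, top=300):
    """Map each of the `top` most frequent items to its rank (0 = most frequent)."""
    return {item: rank for rank, (item, _) in enumerate(counts.most_common(top))}

def rank_distance(text_ranks, model_ranks, penalty):
    """Sum of rank differences; items missing from the model cost `penalty`."""
    total = 0
    for item, rank in text_ranks.items():
        if item in model_ranks:
            total += abs(rank - model_ranks[item])
        else:
            total += penalty
    return total

def classify(text, models, top=300):
    """Return the language with the smallest distance, plus all distances."""
    text_ranks = ranked(bigrams(text), top)
    distances = {
        lang: rank_distance(text_ranks, model_ranks, penalty=top)
        for lang, model_ranks in models.items()
    }
    return min(distances, key=distances.get), distances

# Toy training data; real models need far larger corpora.
training = {
    "en": "the quick brown fox jumps over the lazy dog and then the other dog",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann der andere hund",
}
models = {lang: ranked(bigrams(text)) for lang, text in training.items()}

best, distances = classify("the dog and the fox", models)
print(best)  # prints: en
```

A fuller version would compute one distance per table (characters, n-grams, words) and combine them, for example as a weighted sum. The weighting is a design choice that the article does not describe.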
Language code | Language | Quality | Default | Common | Large | Small |
---|---|---|---|---|---|---|
nl | Dutch | 13 | x | x | ||
en | English | 13 | x | x | x | x |
ca | Catalan | 13 | ||||
fr | French | 13 | x | x | x | x |
es | Spanish | 13 | x | x | x | x |
no | Norwegian | 13 | x | x | ||
da | Danish | 13 | x | x | ||
it | Italian | 13 | x | x | ||
sv | Swedish | 13 | x | x | ||
de | German | 13 | x | x | x | x |
pt | Portuguese | 13 | x | x | x | |
ro | Romanian | 13 | ||||
vi | Vietnamese | 13 | ||||
tr | Turkish | 13 | x | |||
fi | Finnish | 12 | x | |||
hu | Hungarian | 12 | x | |||
cs | Czech | 12 | x | |||
pl | Polish | 12 | x | |||
el | Greek | 12 | x | |||
fa | Persian | 12 | ||||
he | Hebrew | 12 | ||||
sr | Serbian | 12 | ||||
sl | Slovenian | 12 | ||||
ar | Arabic | 12 | x | |||
nn | Norwegian, Nynorsk (Norway) | 12 | ||||
ru | Russian | 11 | x | x | ||
et | Estonian | 11 | ||||
ko | Korean | 10 | ||||
hi | Hindi | 10 | x | |||
is | Icelandic | 10 | ||||
th | Thai | 9 | ||||
bn | Bengali (Bangladesh) | 9 | x | |||
ja | Japanese | 9 | x | |||
zh | Chinese (Simplified) | 8 | x | |||
se | Sami (Northern) (Sweden) | 5 |