
Download the source code for detecting the language of a text, written in vb.net
We have prepared the source code for recognizing the language of a text, written in vb.net, for users of the MagicFile website. The language recognition solution is based on comparing n-gram and word occurrences. It is suitable for any language that uses words (which is not actually true of all languages). Depending on the model and the length of the input text, the accuracy ranges from 70% (short Norwegian, Swedish, and Danish texts classified with the "all" model) to 99.8% with the "default" model.
Background
Recognizing the language of a written text is probably one of the most fundamental tasks in natural language processing (NLP). For any language-dependent processing of an unknown text, the first thing you need to know is what language the text is written in. Fortunately, this is one of the easier NLP challenges. The approach I have chosen to implement is widely known and very simple. The idea is that each language has a unique set of character (co-)occurrences.
The first step is to collect those statistics for all the languages that should be recognized. This is not as easy as it may seem at first. The problem is collecting a large set of test data (plain text) that contains only one language and is not domain specific. (Using only newspaper articles may miss the word "I" and direct speech. Using Shakespeare's plays would not be the best approach for recognizing contemporary texts. Medical articles usually contain many domain-specific terms that are not even language specific (major, minor, arterial, etc.). And if that is not hard enough, the texts should not be copyrighted.) I chose Wikipedia as my main source. I had to do some filtering. Wikipedia contains many proper names (i.e., names of groups) that often contain a "the" or an "and". This is why those words show up in the data for many languages even though they are not part of the language. This is not necessarily a disadvantage, as Anglicisms have spread widely across many languages. I created three statistics for each language:
- Character set: Some languages have a very specific character set (such as Chinese, Japanese, and Russian). For others, some characters give a good hint of the target language (e.g., German umlauts).
- N-grams: After converting the text into words (if necessary), the occurrences of 1-, 2-, and 3-grams were counted. Some n-grams are very language specific (e.g., "th" in English).
- Word list: A final source of disambiguation is the words that are actually used. Some languages (such as Portuguese and Spanish) are almost identical in the characters used as well as in the occurrence of certain n-grams. However, different words are used at different frequencies.
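
The three statistics can be sketched roughly as follows. This is a minimal Python illustration, not the vb.net source offered here: the function name `build_statistics`, the `top` cut-off, and the regular-expression tokenizer are assumptions, and the tokenizer only suits scripts that separate words with spaces.

```python
import re
from collections import Counter


def build_statistics(text, max_n=3, top=300):
    """Return ranked character, n-gram and word tables for one text.

    Hypothetical helper for illustration; `top` limits each table
    to its most frequent entries.
    """
    # Split into lowercase words made of letters only (no digits, no underscore).
    words = re.findall(r"[^\W\d_]+", text.lower())

    chars = Counter()
    ngrams = Counter()
    for word in words:
        chars.update(word)  # a string is counted character by character
        for n in range(1, max_n + 1):
            # Count every n-gram that lies completely inside the word.
            ngrams.update(word[i:i + n] for i in range(len(word) - n + 1))
    word_counts = Counter(words)

    def ranked(counter):
        # Most frequent entries first; the counts themselves are dropped.
        return [item for item, _ in counter.most_common(top)]

    return {
        "chars": ranked(chars),
        "ngrams": ranked(ngrams),
        "words": ranked(word_counts),
    }


stats = build_statistics("the cat and the hat")
print(stats["words"][0], stats["chars"][0])
```

Running the same function over a large, single-language corpus would give one set of tables per language.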
The set of statistics is called a model. I have created subsets of the "all" model that best meet my needs (see table below). The "common" model includes the 10 most spoken languages in the world. "Small" and "default" are based on my usage scenarios. If you are from another part of the world, your preferences may differ, so please don't take offense at my choice of which languages are in which model.
All statistics are sorted and ranked according to their occurrence. In the demo program, all models can be studied in detail. Classifying an unknown text is simple. The text is tokenized and the three statistics tables are generated for it. The resulting tables are compared with the tables of every language in the model, and a distance is calculated for each. The language whose tables have the smallest distance to those of the unknown text is most likely the language of the text.
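
The text does not say which distance measure is used. One common choice for ranked tables is an "out-of-place" distance, which sums how far each entry of the unknown text has moved in rank compared with the model. The Python sketch below uses it purely as an assumption; the function names and the tiny hand-written tables are hypothetical and are not real language statistics.

```python
def out_of_place(sample, model):
    """Sum of rank differences between two ranked lists.

    Entries of `sample` that are missing from `model` get the maximum
    penalty, here assumed to be the length of the model table.
    """
    model_rank = {item: rank for rank, item in enumerate(model)}
    penalty = len(model)
    total = 0
    for rank, item in enumerate(sample):
        if item in model_rank:
            total += abs(rank - model_rank[item])
        else:
            total += penalty
    return total


def classify(sample_tables, models):
    """Return the language with the smallest summed distance, plus all distances."""
    distances = {
        lang: sum(out_of_place(sample_tables[name], tables[name])
                  for name in sample_tables)
        for lang, tables in models.items()
    }
    return min(distances, key=distances.get), distances


# Toy tables for illustration only (hypothetical, far too small for real use).
MODELS = {
    "en": {"chars": ["e", "t", "a", "o"],
           "ngrams": ["th", "he", "the", "an"],
           "words": ["the", "and", "of"]},
    "de": {"chars": ["e", "n", "i", "r"],
           "ngrams": ["en", "er", "ch", "de"],
           "words": ["der", "die", "und"]},
}
SAMPLE = {"chars": ["e", "t", "o", "h"],
          "ngrams": ["th", "the", "he", "in"],
          "words": ["the", "of", "in"]}

best, distances = classify(SAMPLE, MODELS)
print(best, distances)  # prints: en {'en': 15, 'de': 37}
```

Summing the three table distances with equal weight is also an assumption; a real implementation might weight the character, n-gram, and word tables differently.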
Language code | Language | Quality | Default | Common | Big | Small |
---|---|---|---|---|---|---|
nl | Dutch | 13 | x | x | ||
en | English | 13 | x | x | x | x |
ca | Catalan | 13 | ||||
fr | French | 13 | x | x | x | x |
es | Spanish | 13 | x | x | x | x |
no | Norwegian | 13 | x | x | ||
da | Danish | 13 | x | x | ||
it | Italian | 13 | x | x | ||
sv | Swedish | 13 | x | x | ||
de | German | 13 | x | x | x | x |
pt | Portuguese | 13 | x | x | x | |
ro | Romanian | 13 | ||||
vi | Vietnamese | 13 | ||||
tr | Turkish | 13 | x | |||
fi | Finnish | 12 | x | |||
hu | Hungarian | 12 | x | |||
cs | Czech | 12 | x | |||
pl | Polish | 12 | x | |||
el | Greek | 12 | x | |||
fa | Persian | 12 | ||||
he | Hebrew | 12 | ||||
sr | Serbian | 12 | ||||
sl | Slovenian | 12 | ||||
ar | Arabic | 12 | x | |||
nn | Norwegian, Nynorsk (Norway) | 12 | ||||
ru | Russian | 11 | x | x | ||
et | Estonian | 11 | ||||
ko | Korean | 10 | ||||
hi | Hindi | 10 | x | |||
is | Icelandic | 10 | ||||
th | Thai | 9 | ||||
bn | Bengali (Bangladesh) | 9 | x | |||
ja | Japanese | 9 | x | |||
zh | Chinese (Simplified) | 8 | x | |||
se | Sami (Northern) (Sweden) | 5 |