
Download the source code for recognizing the language of a text, written in VB.NET

Short description and download link
We have prepared the source code of a text language recognizer written in VB.NET for you, dear users of the Magic File website.


Short link: https://en.magicfile.ir/?p=2453

Full description of the file


The language recognition solution provided here is based on comparing n-gram and word occurrences. It is suitable for any language that uses words (which is actually not true of all languages). Depending on the model and the length of the input text, the accuracy ranges from 70% (short Norwegian, Swedish and Danish texts classified with the "all" model) to 99.8% with the "default" model.

Background

Recognizing the language of a written text is probably one of the most fundamental tasks in natural language processing (NLP). For any language-dependent processing of an unknown text, the first thing you need to know is what language the text is written in. Fortunately, this is one of the easier NLP challenges. The approach I have chosen to implement is widely known and very simple. The idea is that each language has a characteristic set of character (co-)occurrences.


The first step is to collect those statistics for all the languages that should be recognized. This is not as easy as it may seem at first. The problem is collecting a large set of test data (plain text) that contains only one language and is not domain specific. (Newspaper articles alone may lack the word "I" and direct speech. Shakespeare's plays would not be the best basis for recognizing contemporary texts. Medical articles usually contain many domain-specific terms that are not even language specific (major, minor, arterial, etc.). And if that is not hard enough, the texts should not be copyrighted.) I chose to use Wikipedia as my primary source. I had to do some filtering. Wikipedia contains many proper names (e.g. group names) that often contain a "the" or an "and". That is why those words appear in the data for many languages even though they are not part of the language. This is not necessarily a disadvantage, as Anglicisms have spread widely across many languages. I created three statistics for each language:

  • Character set
    • Some languages have a very specific character set (such as Chinese, Japanese, and Russian). For others, some characters give a good hint of the target language (e.g. German umlauts).
  • N-grams
    • After the text is split into words (if necessary), the occurrences of 1-, 2-, and 3-grams are counted. Some n-grams are very language specific (e.g. "TH" in English).
  • Word list
    • A final source of disambiguation is the words that are actually used. Some languages (such as Portuguese and Spanish) are almost identical in the characters used as well as in the occurrence of certain n-grams. However, different words are used at different frequencies.
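The three statistics listed above can be sketched in a few lines. The example below is written in Python for brevity and is not taken from the downloadable VB.NET source; the function name `build_statistics`, the lowercasing step, and the simple letter-based word splitting are assumptions made for illustration. The splitting only works for languages that separate words with spaces or punctuation.

```python
import re
from collections import Counter


def build_statistics(text, max_n=3):
    """Return (character, n-gram, word) frequency tables for one text.

    Hypothetical helper for illustration; not the API of the VB.NET project.
    """
    text = text.lower()
    # Assumption: a "word" is a run of letters (digits and underscores excluded).
    words = re.findall(r"[^\W\d_]+", text)

    # Statistic 1: which characters occur, and how often.
    chars = Counter(ch for word in words for ch in word)

    # Statistic 2: 1-, 2- and 3-grams counted inside each word.
    ngrams = Counter()
    for word in words:
        for n in range(1, max_n + 1):
            for i in range(len(word) - n + 1):
                ngrams[word[i:i + n]] += 1

    # Statistic 3: the words themselves.
    word_counts = Counter(words)
    return chars, ngrams, word_counts


chars, ngrams, word_counts = build_statistics("The thing the THE")
print(ngrams.most_common(5))
print(word_counts.most_common(2))
```

Running this over a large single-language corpus would give one set of tables per language, which is what the text calls a model.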

The set of statistics is called a model. I have created subsets of the "all" model that best meet my needs (see table below). The "common" model includes the 10 most spoken languages in the world. "Small" and "Default" are based on my usage scenarios. If you are from another part of the world, your preferences may be different. So please don't take offense at my choice of which languages are in which model.

All statistics are sorted and ranked according to their occurrence. In the demo program, all models can be studied in detail. Classifying an unknown text is simple. The text is tokenized and the three statistics tables are generated for it. The resulting tables are compared with the tables of every language in the model and a distance is calculated. The language whose tables have the smallest distance to the unknown text is most likely the language of the text.
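The description does not say which distance is used. As a minimal sketch, the example below assumes the common "out-of-place" rank distance for n-gram profiles: each n-gram's rank in the unknown text is compared with its rank in the language table, and n-grams missing from the language table receive a fixed penalty. It is written in Python, uses only the n-gram table instead of all three statistics, and the names `rank_table`, `distance`, `classify` and the tiny training sentences are hypothetical.

```python
from collections import Counter


def rank_table(text, max_n=3, limit=300):
    """Map each of the most frequent n-grams to its rank (0 = most frequent)."""
    counts = Counter()
    for word in text.lower().split():
        for n in range(1, max_n + 1):
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    # Sort by descending count; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda g: (-counts[g], g))
    return {gram: rank for rank, gram in enumerate(ordered[:limit])}


def distance(sample, table, limit=300):
    """Assumed out-of-place distance: sum of rank differences,
    with a fixed penalty for n-grams the language table does not contain."""
    return sum(abs(rank - table[gram]) if gram in table else limit
               for gram, rank in sample.items())


def classify(text, models):
    """Return the language code whose table is closest to the text."""
    sample = rank_table(text)
    return min(models, key=lambda lang: distance(sample, models[lang]))


# Toy tables built from one sentence each; real tables need large corpora.
models = {
    "en": rank_table("the quick brown fox jumps over the lazy dog and the cat"),
    "de": rank_table("der schnelle braune fuchs springt über den faulen hund und die katze"),
}
print(classify("the dog and the fox", models))
```

With tables this small the result is only a demonstration of the mechanism. The accuracy figures quoted in the description refer to the author's models, not to this sketch.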

x = language is included in the model, - = not included

Code  Language                     Quality  Default  Common  Large  Small
nl    Dutch                        13       x        -       x      -
en    English                      13       x        x       x      x
ca    Catalan                      13       -        -       -      -
fr    French                       13       x        x       x      x
es    Spanish                      13       x        x       x      x
no    Norwegian                    13       x        -       x      -
da    Danish                       13       x        -       x      -
it    Italian                      13       -        -       x      x
sv    Swedish                      13       x        -       x      -
de    German                       13       x        x       x      x
pt    Portuguese                   13       x        x       x      -
ro    Romanian                     13       -        -       -      -
vi    Vietnamese                   13       -        -       -      -
tr    Turkish                      13       -        -       x      -
fi    Finnish                      12       -        -       x      -
hu    Hungarian                    12       -        -       x      -
cs    Czech                        12       -        -       x      -
pl    Polish                       12       -        -       x      -
el    Greek                        12       -        -       x      -
fa    Persian                      12       -        -       -      -
he    Hebrew                       12       -        -       -      -
sr    Serbian                      12       -        -       -      -
sl    Slovenian                    12       -        -       -      -
ar    Arabic                       12       -        x       -      -
nn    Norwegian, Nynorsk (Norway)  12       -        -       -      -
ru    Russian                      11       -        x       x      -
et    Estonian                     11       -        -       -      -
ko    Korean                       10       -        -       -      -
hi    Hindi                        10       -        x       -      -
is    Icelandic                    10       -        -       -      -
th    Thai                         9        -        -       -      -
bn    Bengali (Bangladesh)         9        -        x       -      -
ja    Japanese                     9        -        x       -      -
zh    Chinese (Simplified)         8        -        x       -      -
se    Sami (Northern) (Sweden)     5        -        -       -      -
