3-gram | bolbolkod


To create n-gram profile data from XML-formatted Wikipedia abstract files, for use with the language-detection library for Java/Processing:

Download the Wikipedia abstract database file for the language of interest, e.g. http://dumps.wikimedia.org/trwiki/latest/trwiki-latest-abstract.xml for Turkish.

Run the following command from the directory that holds the downloaded file (the -d ./ argument points at the current directory), replacing the last argument with the language code of your file (here tr is used for Turkish):

java -jar /[PathToLangDetectFile]/langdetect.jar --genprofile -d ./ tr
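
For intuition about the kind of data such a profile holds, here is a minimal sketch that counts character n-grams of length 1 to 3 in a string. This is not the library's own code: the class name NgramSketch, the method countNgrams, and the sample word are all hypothetical, and the library's profile generation may apply extra normalization that is omitted here.

```java
import java.util.Map;
import java.util.TreeMap;

public class NgramSketch {

    // Count every character n-gram of length 1..maxN in the given text.
    // Illustrative only; not the language-detection library's implementation.
    static Map<String, Integer> countNgrams(String text, int maxN) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int n = 1; n <= maxN; n++) {
            // Slide a window of width n across the text.
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample input; a real profile is built from the whole abstract file.
        Map<String, Integer> counts = countNgrams("merhaba", 3);
        counts.forEach((gram, count) -> System.out.println(gram + "\t" + count));
    }
}
```

Running main prints one n-gram per line with its count, sorted alphabetically because a TreeMap is used.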

By aloha | Also tagged bigram, detection, java, language detection, n-gram, processing, statistics, text
