
Is there a description of the mecab (Japanese word parser) algorithm?

Tags:

mecab

Is there a document somewhere that describes the Mecab algorithm?

Or could someone give a simple one-paragraph or one-page description?

I'm finding it too hard to understand the existing code, and what the databases contain.

I need this functionality in my free website and phone apps for teaching languages (www.jtlanguage.com). I also want to generalize it for other languages, and make use of the conjugation detection mechanism I've already implemented, and I also need it without license encumbrance. Therefore I want to create my own implementation (C#).

I already have a dictionary database derived from EDICT. What else is needed? A frequency-of-usage database?

Thank you.

asked Oct 19 '25 by jtsoftware

1 Answer

Some thoughts that are too long to fit in a comment.

§ What license encumbrances? MeCab is dual-licensed including BSD, so that's about as unencumbered as you can get.

§ There's also a Java rewrite of Mecab called Kuromoji that's Apache licensed, also very commercial-friendly.

§ MeCab implements a machine learning technique called conditional random fields (CRFs) for morphological parsing (separating free text into morphemes) and part-of-speech tagging (labeling those morphemes) of Japanese text. It can use various dictionaries as training data, as you've seen—IPADIC, UniDic, etc. Those dictionaries are compilations of morphemes and parts of speech, and represent many human-years' worth of linguistic research. The linked paper is by the authors of MeCab.
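To make the segmentation idea concrete, here is a toy Python sketch of the dictionary-lattice plus cheapest-path search that this family of parsers is built on. The dictionary and costs below are made up for illustration; real MeCab scores paths with CRF-trained feature weights and bigram connection costs, so this shows only the shape of the search, not MeCab's actual scoring.

```python
# Toy dictionary: word -> (part of speech, unigram cost); lower cost = likelier.
# These entries and costs are invented, not taken from IPADIC/UniDic.
TOY_DICT = {
    "東京": ("noun", 10),
    "都": ("suffix", 5),
    "東京都": ("noun", 8),
    "に": ("particle", 2),
    "住む": ("verb", 6),
}

def segment(text):
    """Viterbi-style dynamic programming: find the cheapest sequence of
    dictionary entries that exactly covers `text`."""
    n = len(text)
    INF = float("inf")
    best = [INF] * (n + 1)   # best[i] = min cost to cover text[:i]
    back = [None] * (n + 1)  # back[i] = (start, word, pos) of last morpheme
    best[0] = 0
    for i in range(n):
        if best[i] == INF:
            continue  # position i is unreachable with this dictionary
        for j in range(i + 1, n + 1):
            word = text[i:j]
            if word in TOY_DICT:
                pos, cost = TOY_DICT[word]
                if best[i] + cost < best[j]:
                    best[j] = best[i] + cost
                    back[j] = (i, word, pos)
    # Walk the backpointers to recover the winning morpheme sequence.
    morphemes, i = [], n
    while i > 0:
        start, word, pos = back[i]
        morphemes.append((word, pos))
        i = start
    return list(reversed(morphemes))

print(segment("東京都に住む"))
# → [('東京都', 'noun'), ('に', 'particle'), ('住む', 'verb')]
```

Note that 東京都 as one morpheme (cost 8) beats 東京+都 (cost 15), which is the kind of ambiguity the path costs exist to resolve; in real MeCab those costs come from the trained model, not hand-tuned numbers.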

§ Others have applied other powerful machine learning algorithms to the problem of Japanese parsing.

  • KyTea applies both support vector machines and logistic regression to the same problem. C++, Apache licensed, and the papers are there to read.
  • Rakuten MA is in JavaScript, also liberally licensed (Apache again), and comes with a regular dictionary plus a lightweight one for constrained apps—though it won't give you readings of kanji. You can find the academic papers describing the algorithm there.

§ Given the above, I think you can see that simple dictionaries like EDICT and JMDICT are insufficient to do the advanced analysis that these morphological parsers do. (EDIT See Ichiran and https://ichi.moe for a pure dictionary approach to parsing Japanese!) And these algorithms are likely way overkill for other, easier-to-parse languages (i.e., languages with spaces).

If you need the power of these libraries, you're probably better off writing a microservice that runs one of these systems (I wrote a REST frontend to Kuromoji called clj-kuromoji-jmdictfurigana) instead of trying to reimplement them in C#.

Though note that it appears C# bindings to MeCab exist: see this answer.

In several small projects I just shell out to MeCab, then read and parse its output. Here's my TypeScript example using UniDic for Node.js.
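The shell-out approach is simple because MeCab's default output is easy to parse: one morpheme per line as `surface<TAB>feature,feature,...`, with an `EOS` line terminating each sentence. Here is a small Python sketch of such a parser; the sample output is hardcoded (as if produced by `mecab` with an IPADIC-style dictionary) so the sketch runs without MeCab installed, and in a real project you would feed it the stdout of a `mecab` subprocess instead.

```python
def parse_mecab(output):
    """Parse MeCab's default output format: 'surface\\tfeatures' lines
    terminated by 'EOS'. Returns a list of (surface, [features]) pairs."""
    tokens = []
    for line in output.splitlines():
        if line == "EOS" or not line.strip():
            continue  # sentence terminator or blank line
        surface, _, features = line.partition("\t")
        tokens.append((surface, features.split(",")))
    return tokens

# Hardcoded sample in IPADIC's feature layout
# (POS, POS subdivisions, conjugation info, base form, reading, pronunciation):
sample = "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\nEOS"
for surface, feats in parse_mecab(sample):
    print(surface, feats[0])  # surface form and top-level part of speech
```

Keep in mind that the number and meaning of the feature columns depends on the dictionary (IPADIC vs. UniDic, etc.), so a parser like this should treat the feature list as opaque fields rather than assuming fixed positions across dictionaries.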

§ But maybe you don't need full morphological parsing and part-of-speech tagging? Have you ever used Rikaichamp, the Firefox add-on that uses JMDICT and other low-weight publicly-available resources to put glosses on website text? (A Chrome version also exists.) It uses a much simpler deinflector that quite frankly is awful compared to MeCab et al. but can often get the job done.
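To show how crude-but-useful a deinflector can be, here is a minimal sketch in that style: a rule table of (inflected ending → dictionary-form ending) rewrites that generates candidate dictionary forms. The rules below are a tiny invented subset; real deinflectors chain many rules, track which word classes each rule may apply to, and filter the candidates against a dictionary to discard the bogus ones.

```python
# Hypothetical rule table: (inflected ending, dictionary-form ending, note).
# A real deinflector has far more rules and applies them recursively.
RULES = [
    ("ました", "る", "polite past"),
    ("ます", "る", "polite"),
    ("った", "う", "past (u-verb)"),
    ("た", "る", "past (ru-verb)"),
    ("ない", "る", "negative"),
]

def deinflect(word):
    """Return candidate dictionary forms for `word`, including the word
    itself. Candidates are guesses: the caller should keep only those
    that actually appear in a dictionary like JMDICT."""
    candidates = [(word, "as-is")]
    for ending, repl, note in RULES:
        if word.endswith(ending) and len(word) > len(ending):
            candidates.append((word[: -len(ending)] + repl, note))
    return candidates

print(deinflect("食べました"))  # includes ('食べる', 'polite past')
```

Note the rule set also produces junk like 食べましる (from the bare た rule); that's expected, and it's exactly why dictionary lookup does the final filtering. This blind rewriting with no statistical model is the gap between this approach and MeCab-style parsing.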

§ You had a question about the structure of the dictionaries (you called them "databases"). This note from Kimtaro (the author of Jisho.org) on how to add custom vocabulary to IPADIC may clarify at least how IPADIC works: https://gist.github.com/Kimtaro/ab137870ad4a385b2d79. Other more modern dictionaries (I tend to use UniDic) use different formats, which is why the output of MeCab differs depending on which dictionary you're using.
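As a concrete illustration of that IPADIC layout, here is a Python sketch that parses one user-dictionary CSV row of the shape described in that gist: surface form, left context id, right context id, word cost, then the feature columns. The ids and cost below are illustrative values, not real IPADIC numbers.

```python
def parse_ipadic_row(row):
    """Split one IPADIC-style user-dictionary CSV row into named fields.
    Layout (per the gist linked above): surface, left context id,
    right context id, word cost, then feature columns (POS hierarchy,
    base form, reading, pronunciation)."""
    fields = row.split(",")
    return {
        "surface": fields[0],
        "left_id": int(fields[1]),   # row in the connection-cost matrix
        "right_id": int(fields[2]),  # column in the connection-cost matrix
        "cost": int(fields[3]),      # lower = more likely to be chosen
        "features": fields[4:],      # POS, base form, reading, ...
    }

# Example row; the ids (1288) and cost (737) are made-up placeholders.
row = "朝青龍,1288,1288,737,名詞,固有名詞,人名,一般,*,*,朝青龍,アサショウリュウ,アサショーリュー"
entry = parse_ipadic_row(row)
print(entry["surface"], entry["cost"], entry["features"][0])
```

The connection ids are what tie a single entry into the bigram cost matrix, which is also why a dictionary isn't just a word list: each entry carries the context information the lattice search needs.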

§ EDIT See Ichiran and https://ichi.moe for a pure dictionary approach to parsing Japanese that is very impressive and might be very useful! It's written in Common Lisp but has a Dockerfile.

answered Oct 21 '25 by Ahmed Fasih

