Lingua Verbum Adds Japanese and Chinese as Supported Languages

Lingua Verbum now supports Japanese and Chinese, our first Asian languages, with advanced segmentation and learner-controlled overrides.

We’re excited to share that Lingua Verbum now supports both Japanese and Chinese, bringing our total to 25 supported languages. These are the first Asian languages on the platform, so this is a significant step forward, and it should lay the foundation for a rapid expansion in the languages we support.


Why Japanese and Chinese Are Different… and Challenging

For most European languages, word segmentation is straightforward: words are clearly separated by spaces. This structure makes it easy for our platform to identify individual words, allowing learners to click on any word to track whether they know it or not. It also enables access to our AI assistant, which provides contextual definitions and explanations when you click on a specific word.

But Japanese and Chinese don’t work that way. In both writing systems, text appears as a continuous stream of characters, with no visual markers to indicate where one word ends and another begins. Since our platform relies on accurate word recognition to help you track vocabulary and use the AI assistant, getting segmentation right was imperative.
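A quick way to see the problem is to run a plain whitespace split over a sentence in each language. The sketch below uses only Python's built-in `str.split()`; the sample sentences and the `naive_words` helper are illustrative assumptions, not part of the platform.

```python
# Whitespace splitting finds words in space-delimited languages, but it
# returns a Japanese or Chinese sentence as a single unsplit chunk.
# The sample sentences are illustrative only.

def naive_words(text: str) -> list[str]:
    """Split on whitespace, as a simple space-delimited tokenizer would."""
    return text.split()

english = "I read a book yesterday"
japanese = "私は昨日本を読みました"
chinese = "我昨天读了一本书"

print(len(naive_words(english)))   # 5
print(len(naive_words(japanese)))  # 1
print(len(naive_words(chinese)))   # 1
```

With nothing to split on, every clickable "word" would be an entire sentence, which is why a dedicated segmenter is needed.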

Cracking Japanese Segmentation

Japanese word segmentation (i.e., the process of determining where one word ends and another begins) is a notoriously difficult problem, one with no universally “correct” solution. To find the most practical approach for learners, we evaluated several leading segmentation tools, each with different tradeoffs:

MeCab: A well-established, high-performance morphological analyzer known for its speed, stability, and wide adoption in Japanese NLP pipelines.
SudachiPy: A modern tokenizer offering multi-granular segmentation modes (A/B/C) that adapt to different use cases.
Hanabira: A deep learning-based segmenter with strong performance in resolving context-dependent ambiguities.
ChatGPT-based segmentation: An experimental LLM-driven method designed to infer word boundaries through contextual reasoning.

MeCab delivered consistently strong performance and low latency, making it an excellent fit for real-time use within our platform. Its long track record and wide adoption made integration relatively seamless. However, like most dictionary-based tokenizers, it has limitations. MeCab can struggle with informal, newly coined, or domain-specific vocabulary, and it often errs on the side of over-segmentation. For example, compound words like 代表者 (daihyōsha, “representative”) may be broken down into 代表 (daihyō, “representative”) and 者 (sha, “person”), which can hinder a learner’s ability to associate meaning effectively. These issues make it less than ideal in edge cases, especially for beginners who benefit from seeing whole vocabulary items rather than morphemes.

SudachiPy initially looked promising due to its flexible segmentation modes, A (short), B (medium), and C (long), which allow for different levels of granularity depending on the use case. This could have been valuable for tailoring the experience to different learner levels. However, in practice, our testing found that its segmentation choices were often inconsistent or unintuitive from a learner’s perspective. Ultimately, the variability in segmentation quality made it difficult to trust as a default solution.

Hanabira showed potential in theory for handling ambiguous or context-rich sentences. But in our testing, it frequently made segmentation choices that felt unnatural or overly fragmented.

ChatGPT-based segmentation was our most experimental approach. In theory, a large language model’s contextual understanding could produce highly accurate segmentations, especially for informal text where traditional tokenizers tend to break down. However, in practice, we found the results to be inconsistent. The model would occasionally hallucinate nonexistent words, misinterpret meaning, or apply segmentation in ways that were linguistically incorrect. This unreliability, combined with the inherent unpredictability of LLM outputs, made it unsuitable for production use.
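One class of LLM error, inventing or altering characters, can at least be detected mechanically: a valid segmentation must reproduce the input exactly when its tokens are joined. The check below is a minimal sketch of that idea. The function name and sample tokens are hypothetical, and the post does not say whether such a check was used in the evaluation.

```python
# Sanity check for any segmenter output, including LLM output: the tokens,
# joined in order, must equal the original text. This catches added, dropped,
# or rewritten characters. It does NOT catch boundaries that are merely
# placed in the wrong position.

def is_faithful_segmentation(text: str, tokens: list[str]) -> bool:
    """True if no token is empty and the joined tokens equal the text."""
    return all(tokens) and "".join(tokens) == text

text = "代表者が来ました"

# A segmentation that only inserts boundaries.
faithful = ["代表者", "が", "来", "まし", "た"]

# A hypothetical bad output where the model rewrote 来 as 来る.
altered = ["代表者", "が", "来る", "まし", "た"]

print(is_faithful_segmentation(text, faithful))  # True
print(is_faithful_segmentation(text, altered))   # False
```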

In the end, we selected MeCab as our default tokenizer for Japanese because it struck the best balance between accuracy, speed, and operational simplicity. But we want to be clear: no automated segmentation is perfect, especially in Japanese. That’s why we’ve built a manual override system directly into the platform. When learners spot a segmentation that feels off, such as a word split incorrectly or a phrase broken in the wrong place, they can fix it with just a few clicks. This level of control is crucial: with the ability to adjust segmentations as needed, learners aren’t stuck with the tokenizer’s mistakes.
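Conceptually, an override comes down to two operations on a token list: merging adjacent tokens and splitting one token in two. The sketch below shows one way to model that. The function names and list-based representation are assumptions made for illustration, and the platform's actual implementation is not described here.

```python
# Hypothetical model of a manual segmentation override: merge adjacent
# tokens into one, or split a single token at a character offset.
# Both functions return a new list and leave the input untouched.

def merge_tokens(tokens: list[str], start: int, end: int) -> list[str]:
    """Merge tokens[start:end] (at least two tokens) into a single token."""
    if not 0 <= start < end <= len(tokens) or end - start < 2:
        raise ValueError("merge range must cover at least two tokens")
    return tokens[:start] + ["".join(tokens[start:end])] + tokens[end:]

def split_token(tokens: list[str], index: int, offset: int) -> list[str]:
    """Split tokens[index] into two tokens at the given character offset."""
    if not 0 <= index < len(tokens):
        raise ValueError("token index out of range")
    token = tokens[index]
    if not 0 < offset < len(token):
        raise ValueError("offset must fall strictly inside the token")
    return tokens[:index] + [token[:offset], token[offset:]] + tokens[index + 1:]

# Over-segmented output like the 代表者 example above.
tokens = ["代表", "者", "が", "来", "まし", "た"]

merged = merge_tokens(tokens, 0, 2)
print(merged)  # ['代表者', 'が', '来', 'まし', 'た']

# Splitting at offset 2 undoes the merge.
print(split_token(merged, 0, 2) == tokens)  # True
```

Because both operations preserve the underlying text, an override never changes what the learner is reading, only where the word boundaries fall.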

Segmenting Chinese

Our experience with Japanese segmentation gave us a strong foundation for tackling Chinese, which poses many of the same challenges: no spaces between words, high ambiguity, and a variety of valid segmentation strategies depending on context.

We tested several popular segmentation libraries, including Jieba, pkuseg, and THULAC, along with academic models trained on standardized datasets. After testing each on sample content, we ultimately chose to implement a model based on the PKU ConvSeg architecture, trained on the SIGHAN 2005 PKU corpus.

This model seemed to offer the most reliable balance of accuracy and alignment with how native speakers and learners tend to mentally segment words in real-world usage.

As with Japanese, we’ve incorporated manual override functionality into our Chinese pipeline, allowing learners to adjust segmentations whenever the model gets it wrong. This ensures that no matter how good the underlying model is, learners remain in control of their experience.
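For readers curious how segmenters are typically compared, a common approach is to score each one against a reference segmentation using word-level precision, recall, and F1. The sketch below shows that calculation. The function names, the sample sentence, and the "gold" segmentation are illustrative assumptions, not the data or method used in the testing described above.

```python
# Word-level precision / recall / F1 for a predicted segmentation against a
# reference ("gold") segmentation of the same text. A predicted word counts
# as correct only if both its start and end offsets match a gold word.

def word_spans(tokens: list[str]) -> set[tuple[int, int]]:
    """Convert a token list into a set of (start, end) character spans."""
    spans = set()
    position = 0
    for token in tokens:
        spans.add((position, position + len(token)))
        position += len(token)
    return spans

def segmentation_scores(gold: list[str], predicted: list[str]) -> tuple[float, float, float]:
    """Return (precision, recall, f1) for predicted vs. gold tokens."""
    if "".join(gold) != "".join(predicted):
        raise ValueError("gold and predicted must segment the same text")
    gold_spans = word_spans(gold)
    predicted_spans = word_spans(predicted)
    if not gold_spans:
        return 0.0, 0.0, 0.0
    correct = len(gold_spans & predicted_spans)
    precision = correct / len(predicted_spans)
    recall = correct / len(gold_spans)
    # Equivalent to the harmonic mean of precision and recall.
    f1 = 2 * correct / (len(gold_spans) + len(predicted_spans))
    return precision, recall, f1

# Hypothetical reference and prediction for 我昨天读了一本书.
gold = ["我", "昨天", "读", "了", "一", "本", "书"]
predicted = ["我", "昨天", "读了", "一本", "书"]

precision, recall, f1 = segmentation_scores(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A score like this depends heavily on the segmentation standard behind the reference data, which is one reason a high benchmark number does not always match what feels natural to learners.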

New Paths Unlocked & Feedback

Building support for Japanese and Chinese has deepened our technical toolkit and laid the groundwork for future languages with complex scripts and no word delimiters, such as Thai, Lao, or Khmer. We plan to expand to dozens of new languages within the next few weeks as a result of this effort.

We’d love for you to dive into our new Japanese and Chinese support and put it to the test. Try reading a manga chapter, watching a C-drama clip, or working through your favorite podcast. See how the segmentation, vocabulary tracking, and AI explanations hold up.

If something feels off, whether a word split incorrectly, a translation that doesn’t quite land, or just a feature that could be smoother, let us know. Your feedback is how we get better.

Next update soon,
The Lingua Verbum Team