Consider the kanjis in the following sign:
As the English translation in the above sign shows, these eight kanjis represent four words, but which kanjis represent each of the words? Some words might be a single kanji long, like 日 (‘hi’, sun); some other words might be two kanjis long, like 日本 (‘ni-hon’, Japan); while still others might be three kanjis long, like 日本語 (‘ni-hon-go’, the Japanese language). So how do we know which characters are part of a word, and which ones are not? How do Japanese people split a sequence of characters – kanjis or kanas – into separate words?
Parsing using the reader’s knowledge
The process of separating a sequence of characters or sounds into a meaningful sequence of individual units, a.k.a. tokens, is called ‘parsing’, and it’s one of the main difficulties of understanding a new language, especially when listening. The reason we pronounce words slowly and individually when we are trying to be clear is to make it easier for our listener to parse our words; with pauses in spoken language we reproduce the role that spaces play in written language. We start getting really good at understanding a new language when we are able to do this parsing on the fly.
English separates words with spaces, but Japanese doesn’t use any separators:
English | romaji | Japanese
Tokyo Med. Univ. Hospital | toukyou ika daigaku byouin | 東京医科大学病院
thank you very much | doumo arigatou gozaimasu | どうもありがとうございます
vanilla ice cream | banira aisu kuriimu | バニラアイスクリーム
Sequences of words written in either kanji or hiragana, like ‘Tokyo Medical University Hospital’ or ‘thank you very much’, are never separated and, instead, the reader must already know how to split the text, i.e., how to parse the sentence into words. It is similar to reading ‘tokyomedicaluniversityhospital’ and ‘thankyouverymuch’; it’s cumbersome but doable, and with practice we would get used to it.
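To make ‘the reader must already know how to split the text’ a bit more concrete, here is a minimal Python sketch of dictionary-based splitting. The tiny word list stands in for the reader’s vocabulary and is an assumption made for this example, as are the names LEXICON and segment; greedy longest-match is just one simple strategy, not a claim about how people actually read.

```python
# Toy lexicon standing in for the reader's vocabulary (an assumption for this example).
LEXICON = {
    "東京", "医科", "大学", "病院",          # Tokyo, medical, university, hospital
    "どうも", "ありがとう", "ございます",      # thank you very much
    "バニラ", "アイス", "クリーム",            # vanilla, ice, cream
}


def segment(text, lexicon=LEXICON):
    """Split text greedily, always taking the longest known word.

    Characters that start no known word become single-character tokens.
    """
    longest = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon or size == 1:
                tokens.append(candidate)
                i += size
                break
    return tokens


print(segment("東京医科大学病院"))
print(segment("どうもありがとうございます"))
print(segment("バニラアイスクリーム"))
```

With a lexicon this small the greedy choice is always right; with a realistic vocabulary, the longest match can be the wrong one, which is part of why the reader’s knowledge matters.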
We usually don’t separate words written in katakana, either. However, we often separate a person’s first and last names with a dot, ‘・’, the Japanese equivalent of the English dash; we can also add dots when we want to make something extra clear, but in general there is no separation:
English | romaji | katakana
John Wayne | jon uein | ジョン・ウェイン
Frank Sinatra | furanku shinatora | フランク・シナトラ
Australian dollar (v1, w/ dot) | oosutoraria doru | オーストラリア・ドル
Australian dollar (v2) | oosutorariadoru | オーストラリアドル
Buenos Aires | buenosuairesu | ブエノスアイレス
Los Angeles | rosanzerusu | ロサンゼルス
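When the dot is present, splitting is trivial; when it is absent, nothing in the text marks the boundary. A small Python sketch, where the names NAKAGURO and split_on_dot are hypothetical and chosen only for this example:

```python
# The katakana middle dot, U+30FB.
NAKAGURO = "\u30fb"


def split_on_dot(text):
    """Split a katakana string on the middle dot, if it contains any."""
    return text.split(NAKAGURO)


print(split_on_dot("ジョン" + NAKAGURO + "ウェイン"))        # two parts
print(split_on_dot("オーストラリア" + NAKAGURO + "ドル"))    # two parts
print(split_on_dot("オーストラリアドル"))                    # one part: no dot to split on
```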
Writing systems as word separators
We can write any Japanese word using only hiragana or katakana; either of them would suffice. However, normally we write nouns, adjectives, and verb roots in kanji, while we write verb suffixes and particles in hiragana. Thus, given the absence of spaces, a combination of kanjis and kanas helps us parse sentences because we roughly know what kanjis and kanas generally represent. For example:
English | romaji | Japanese
Mr. Tanaka drinks coffee. | tanaka san wa kouhii wo nomi-masu | 田中さんはコーヒーを飲みます
In this sentence, the noun ‘Tanaka’ (田中) is written in kanji; the stem of the verb ‘drinks’ (飲み) is written in kanji and hiragana; the honorific ‘Mr.’ (さん), the particles that mark the topic (は) and object (を), and the verb’s suffix of ‘drinks’ (ます) are written in hiragana; and the foreign-origin word ‘coffee’ (コーヒー) is written in katakana. Hence, the writing systems cue us about the nature of the words. To make this clear, let’s see the sentence again, labeling each run of characters with its writing system:
English | romaji | Japanese
Mr. Tanaka drinks coffee. | tanaka san wa kouhii wo nomi-masu | 田中 (kanji) さんは (hiragana) コーヒー (katakana) を (hiragana) 飲 (kanji) みます (hiragana)
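The same grouping by writing system can be sketched in a few lines of Python. The function names script_of and script_runs are hypothetical, and the Unicode ranges cover only the basic hiragana, katakana, and common kanji blocks, which is an assumption that is good enough for this sentence.

```python
from itertools import groupby


def script_of(ch):
    """Classify one character by its Unicode block (basic blocks only)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:   # includes the long-vowel mark ー
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs
        return "kanji"
    return "other"


def script_runs(text):
    """Group consecutive characters that share a writing system."""
    return [("".join(chars), script) for script, chars in groupby(text, key=script_of)]


for run, script in script_runs("田中さんはコーヒーを飲みます"):
    print(run, script)
```

Note that the runs are not words: さん and は end up in one hiragana run, and 飲 is separated from みます. The script changes are cues, not word boundaries.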
Word-particle parsing
We can also parse a sentence knowing that, in Japanese, words are often clustered in word-particle sets. A particle defines the role of the word that precedes it, so, in general, the two form a unit, called a phrase, that we can move around while the sentence stays grammatically correct. For example, は and を indicate the topic and the object, respectively; the verb stands on its own, without a particle, but we can identify it because it starts with a kanji and ends in a syllable that ends in ‘u’. Thus, for topic, object and verb we have the following phrases:
Sentence part | romaji | Japanese
Topic | (Tanaka san wa) | (田中さんは)
Object | (kouhii wo) | (コーヒーを)
Verb | (nomi-masu) | (飲みます)

English | Romaji
(Mr. Tanaka) (coffee) (drinks) | (Tanaka san wa) (kouhii wo) (nomi-masu)
(coffee) (Mr. Tanaka) (drinks) | (kouhii wo)、(Tanaka san wa) (nomi-masu)
(coffee) (drinks) (Mr. Tanaka) | (kouhii wo) (nomi-masu)、(Tanaka san wa)
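As a rough illustration of word-particle parsing, the Python sketch below cuts a sentence right after each particle. It assumes that は and を appear only as particles in the input, which holds for this example but not for Japanese in general; the names PARTICLES and split_phrases are hypothetical.

```python
import re

# Only the topic and object markers; a deliberate simplification for this example.
PARTICLES = "はを"


def split_phrases(sentence):
    """Cut the sentence after every particle; the remainder is the final phrase."""
    pattern = f"[^{PARTICLES}]*[{PARTICLES}]|[^{PARTICLES}]+"
    return re.findall(pattern, sentence)


topic, obj, verb = split_phrases("田中さんはコーヒーを飲みます")
print(topic, obj, verb)

# The phrases can be reordered as whole units, as in the table above.
print(obj + "、" + topic + verb)
```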
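The first issue with using particles to parse sentences is that the same kana can be either a particle or part of a word. Consider the following sentence, written only in hiragana:
あににあいににほんににがつにいきます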
This is a sentence that even a Japanese person would have to read slowly to make sense of; indeed, it is easier to parse it if we introduce kanjis because the particles stand out:
hiragana | あににあいににほんににがつにいきます (Google translate fails)
Japanese | 兄に会いに日本に二月に行きます (Google translate succeeds)
romaji | ani ni ai ni nihon ni nigatsu ni ikimasu
English | In February I am going to Japan to see my big brother
Or good luck parsing this extreme case of a fun tongue-twister:
すもももももももものうち
Let’s see:
hiragana | すもももももももものうち
Japanese | すももも桃も桃のうち
romaji | sumomo mo momo mo momo no uchi
English | Both the plum and the peach are in the peach family
The second issue with using particles to parse sentences is that it only works well with formal speech, because in casual speech we drop many particles:
Speech | romaji | Japanese
formal speech | Tanaka san wa kouhii wo nomimasu | 田中さんはコーヒーを飲みます
casual speech | Tanaka san kouhii nomu | 田中さんコーヒー飲む
These two sentences are identical in meaning, but the casual one has no particles at all, so we cannot find phrases using particles; instead, we have to parse the sentence relying only on our knowledge of Japanese and its writing systems.
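Without particles, one fallback is to lean on the writing systems alone. The Python sketch below uses a simple heuristic, which is an assumption of this example rather than a rule of Japanese: a run of kanji plus any hiragana that directly follows it is treated as one chunk, and runs of katakana or of hiragana are chunks of their own. The name script_chunks is hypothetical.

```python
import re

# kanji run + trailing hiragana | katakana run | hiragana run
# Characters outside these ranges (e.g. punctuation) are simply skipped.
CHUNK = re.compile(r"[\u4e00-\u9fff]+[\u3040-\u309f]*|[\u30a0-\u30ff]+|[\u3040-\u309f]+")


def script_chunks(sentence):
    """Chunk a sentence using only changes of writing system."""
    return CHUNK.findall(sentence)


print(script_chunks("田中さんコーヒー飲む"))            # casual sentence
print(script_chunks("田中さんはコーヒーを飲みます"))    # formal sentence
```

For the casual sentence this yields 田中さん, コーヒー and 飲む. For the formal one the topic particle は gets glued onto 田中さん, which shows that the heuristic gives chunks, not a full grammatical analysis.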
Multiple writing systems in a word
Consider the following newspaper article:
Usually, we write the stem of a verb in kanji (or kanji and hiragana) and the suffix in hiragana, as in 飲みます (nomi-masu – to drink). The same happens with adjectives: the stem is in kanji (or kanji and hiragana) and the suffix is in hiragana, as in 大きい (ooki-i – big). Nouns are often in kanji, but sometimes part of a noun is in kanji and the rest in hiragana. For example, in a red box in the article we find the word ‘kodomo’ (children); we could write ‘ko-domo’ using two kanjis, 子供, but the author has chosen to write the first character of the word, 子 (ko), in kanji and the rest of the word, ども (domo), in hiragana. So even though ‘children’ is a noun, in this case the use of kanjis and kanas doesn’t help us identify the end of the word. In all of these cases, the hiragana characters that finish a word started in kanji are called okurigana, e.g.,
English | romaji | kanji + okurigana
to drink | nomi-masu | 飲みます
big | ooki-i | 大きい
children | kodomo | 子ども
Although not so common, a single word can also be a combination of kanji, hiragana, and katakana. For example, the word ‘keshigomu’ (eraser) combines the Japanese word ‘keshi’ (消し – to erase), in kanji and hiragana, and the foreign-origin word ‘gomu’ (ゴム – gum, or rubber), in katakana; the word ‘denshirenji’ (microwave oven) combines the Japanese word ‘denshi’ (電子 – electronic) in kanji, and the foreign-origin word ‘renji’ (レンジ – range), in katakana. Also, in some cases the word might begin with hiragana and end with a kanji, e.g., ‘tameiki’ (sigh) starts with ‘tame’ (to collect) and ends in ‘iki’ (息 – breath).
English | romaji | Japanese
eraser | keshigomu | 消しゴム (kanji + hiragana + katakana)
microwave oven | denshirenji | 電子レンジ (kanji + katakana)
sigh | tameiki | ため息 (hiragana + kanji)
Roman characters have also found their way into a few Japanese words, e.g., we can write T-shirt either entirely in katakana, or using the roman letter:
English | romaji | Japanese
T-shirt | tiishatsu | ティーシャツ (katakana)
T-shirt | tiishatsu | Tシャツ (roman + katakana)
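The script make-up of these mixed words can be checked mechanically. The following Python sketch lists, in order, the writing systems a word passes through; the function names and the Unicode ranges (basic kana blocks, common kanji, ASCII and full-width roman letters) are assumptions chosen for this example.

```python
def script_of(ch):
    """Classify one character as kanji, hiragana, katakana, roman, or other."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    # ASCII letters, plus the full-width forms often used in Japanese text.
    if ch.isascii() and ch.isalpha():
        return "roman"
    if 0xFF21 <= code <= 0xFF3A or 0xFF41 <= code <= 0xFF5A:
        return "roman"
    return "other"


def composition(word):
    """List the writing systems of a word in order, merging repeats."""
    scripts = []
    for ch in word:
        script = script_of(ch)
        if not scripts or scripts[-1] != script:
            scripts.append(script)
    return scripts


for word in ["消しゴム", "電子レンジ", "ため息", "ティーシャツ", "Tシャツ"]:
    print(word, " + ".join(composition(word)))
```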
Hence, in general we can use kanjis to parse the beginning of nouns, but there are plenty of exceptions.
Punctuation marks
In addition to the use of kanji vs. kanas for word parsing, Japanese uses punctuation marks to parse sentences and sentence fragments, a.k.a. clauses. We can see in the annotated newspaper article some of the following Japanese equivalents of the roman punctuation marks:
- periods: 。。。 (little circles instead of dots)
- commas: 、、、 (Japanese commas point forward, roman commas point backwards)
- dash marks: ・・・ (dots play the role of dashes in a telephone number)
- asterisk: ※ (an ‘x’ with four dots around it)
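Because the period and the comma are explicit characters, splitting a text into sentences and clauses is much easier than splitting it into words. A minimal Python sketch, where split_sentences and split_clauses are hypothetical names and only 。 and 、 are handled:

```python
import re


def split_sentences(text):
    """Cut after each Japanese period, keeping the period with its sentence."""
    return re.findall(r"[^。]+。?", text)


def split_clauses(sentence):
    """Cut a sentence at Japanese commas and drop the final period."""
    return [clause for clause in re.split(r"[、。]", sentence) if clause]


text = "田中さんはコーヒーを飲みます。コーヒーを、田中さんは飲みます。"
for sentence in split_sentences(text):
    print(sentence, split_clauses(sentence))
```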
We write the following punctuation marks in the direction of the text, either horizontally or vertically:
- tilde: 〜
- quotation marks: 「 」
- parentheses: ( )
- brackets: [ ]
Japanese doesn’t use hyphens to indicate that a word is split between two consecutive text rows or columns. For example, in the article, the word 大学生 (dai-gaku-sei – ‘college student’) is split between two different columns, but in the first column there is no punctuation mark similar to the hyphen to indicate that the word is split and ends in the second column.