Parsing using the reader’s knowledge
Consider the kanjis in the following sign:
As the English translation in the above sign shows, these eight kanjis represent four words, but which kanjis correspond to each word? Some words are a single kanji long, like 日 (hi – sun); others are two kanjis long, like 日本 (ni-hon – Japan); and others are three kanjis long, like 日本語 (ni-hon-go – the Japanese language). So how do we know which characters are part of a word, and which ones are not? How do Japanese people split a sequence of characters – kanjis or kanas – into separate words?
The process of separating a sequence of characters or sounds into a meaningful sequence of individual units, a.k.a. tokens, is called ‘parsing’, and it’s one of the main difficulties of understanding a new language, especially when listening. The reason we pronounce words separately when we are trying to be very clear about what we are saying is to make it easier for our listener to parse our words; we are reproducing the role of the spaces in written language with pauses in the spoken language. We start getting really good at understanding a new language when we are able to do this parsing on the fly.
English separates words with spaces, but Japanese doesn’t use any separators:
- 東京医科大学病院 (toukyou ika daigaku byouin – Tokyo Med. Univ. Hospital)
- どうもありがとうございます (doumo arigatou gozai-masu – thank you very much)
- バニラアイスクリーム (banira aisu kuriimu – vanilla ice cream)
Sequences of words written in either kanji or hiragana, like ‘Tokyo Medical University Hospital’ or ‘thank you very much’, are never separated; instead, the reader must already know how to split the text, i.e., how to parse the sentence into words. It is like reading ‘tokyomedicaluniversityhospital’ and ‘thankyouverymuch’ in English: cumbersome but doable, and with practice we would get used to it.
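One way to picture what “already knowing how to split the text” means is a dictionary lookup. The sketch below is only an illustration, not a description of how readers or real Japanese tokenizers work: it assumes a tiny, made-up vocabulary (`VOCABULARY`) and a hypothetical helper `segment` that, at each position, greedily takes the longest word it knows.

```python
# Toy vocabulary standing in for the reader's knowledge (made up for this example).
VOCABULARY = {
    "tokyo", "medical", "university", "hospital",
    "thank", "you", "very", "much",
    "日", "日本", "日本語",
}


def segment(text, vocabulary=VOCABULARY):
    """Split unspaced text by taking the longest known word at each position."""
    max_len = max(len(word) for word in vocabulary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in vocabulary:
                words.append(candidate)
                i += size
                break
        else:
            # Nothing matched: emit the single character and move on.
            words.append(text[i])
            i += 1
    return words


print(segment("tokyomedicaluniversityhospital"))
# ['tokyo', 'medical', 'university', 'hospital']
print(segment("thankyouverymuch"))
# ['thank', 'you', 'very', 'much']
print(segment("日本語"))
# ['日本語']
```

Greedy longest match is a simplification: it picks 日本語 over 日 or 日本 because it always prefers the longest known word, which is not guaranteed to be the right reading in every sentence.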
Words in katakana are usually not separated either. However, the parts of a foreign person’s name written in katakana are normally separated with a middle dot, ‘・’, which plays a role similar to the English dash; we can also add dots when we want to make something extra clear, but in general there is no separation:
Australian dollar (v1)
Australian dollar (v2)
Writing systems as word separators
We can write any Japanese word using only hiragana or katakana; either of them would suffice. Normally, however, nouns, adjectives, and verb roots are written in kanji, while verb conjugations and particles are written in hiragana. Thus, given the absence of spaces, a combination of kanjis and kanas helps us to parse sentences because we roughly know what kanjis and kanas generally represent. For example:
Mr. Tanaka drinks coffee.
tanaka san wa kouhii wo nomi-masu
In this sentence, the noun ‘Tanaka’ (田中) is written in kanji; the stem of the verb ‘drinks’ (飲み) is written in kanji and hiragana; the honorific ‘Mr.’ (さん), the particles that mark the topic (は) and object (を), and the verb conjugation of ‘drinks’ (ます) are written in hiragana; and the foreign-origin word ‘coffee’ (コーヒー) is written in katakana. Hence, the writing systems give us cues about the nature of the words. To make this clear, let’s see the sentence again, color-coding each writing system (kanji, hiragana, and katakana):
Mr. Tanaka drinks coffee.
tanaka san wa kouhii wo nomi-masu
The parsing is not perfect; in our example, using multiple writing systems doesn’t separate ‘san’ and ‘wa’ into different words, and it splits ‘nomi’, which is a verb stem, into ‘no’ and ‘mi’. Thus, it is now much easier to parse the sentence, but we still need some previous knowledge of Japanese.
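The color-coding idea can be reproduced mechanically by cutting the sentence wherever the writing system changes. The following is a minimal sketch under stated assumptions: the helper names `script_of` and `split_by_script` are made up for this example, and the code-point ranges cover only the basic hiragana, katakana, and common kanji blocks of Unicode.

```python
from itertools import groupby


def script_of(ch):
    """Classify one character by its Unicode block (basic blocks only)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:  # includes the long-vowel mark ー
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def split_by_script(text):
    """Group consecutive characters that share a writing system."""
    return [(script, "".join(chars)) for script, chars in groupby(text, key=script_of)]


for script, chunk in split_by_script("田中さんはコーヒーを飲みます"):
    print(script, chunk)
# kanji 田中
# hiragana さんは
# katakana コーヒー
# hiragana を
# kanji 飲
# hiragana みます
```

The output shows the same two imperfections described above: さん and は stay glued together, and the stem 飲み is cut between 飲 and み.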
Multiple writing systems in a word
Consider the following newspaper article:
Usually, we write the root of the verb in kanji (or kanji and hiragana) and the conjugation in hiragana, as in 飲みます (nomi-masu – to drink); the same happens with adjectives, i.e., the root is in kanji and the conjugation is in hiragana, as in 大きい (ooki-i – big). Nouns are often in kanji, but sometimes part of a noun is in kanji and the rest in hiragana. For example, in a red box in the article we find the word ‘kodomo’ (children); normally we write ‘ko-domo’ using two kanjis, 子供, but the writer has chosen to write the first character of the word, 子, in kanji, and the rest, ども (domo), in hiragana. So even though ‘children’ is a noun, in this case the use of kanjis and kanas doesn’t help us to identify the end of the word. In verbs and adjectives like these, the hiragana characters that finish a word started with a kanji are called okurigana, e.g.:
kanji + okurigana
Although not so common, a single word can also be a combination of hiragana and katakana, or even kanji, hiragana, and katakana. For example, the word ‘keshigomu’ (eraser) combines the Japanese word ‘keshi’ (けし – to erase), in hiragana, and the foreign-origin word ‘gomu’ (ゴム – gum, or rubber), in katakana. Furthermore, since ‘keshi’ (to erase) is a verb, we can also write it with a kanji and hiragana: 消し. Thus, we can write ‘eraser’ in two ways. Roman characters have also found their way into a few Japanese words, e.g., we can write ‘T-shirt’ either entirely in katakana or using the roman letter:
けしゴム (hiragana + katakana)
消しゴム (kanji + hiragana + katakana)
Ｔシャツ (roman + katakana)
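To check which writing systems a single word mixes, we can look at the official Unicode name of each character. This is a rough sketch, assuming that the prefix of the name is enough to tell the scripts apart; `script_of` and `scripts_in_word` are hypothetical helper names, and only Python’s standard `unicodedata` module is used.

```python
import unicodedata


def script_of(ch):
    """Guess the writing system of a character from its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("KATAKANA"):  # also covers the long-vowel mark ー
        return "katakana"
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    if "LATIN" in name:  # plain and full-width roman letters
        return "roman"
    return "other"


def scripts_in_word(word):
    """List the writing systems of a word in order, merging consecutive repeats."""
    order = []
    for ch in word:
        script = script_of(ch)
        if not order or order[-1] != script:
            order.append(script)
    return order


print(scripts_in_word("けしゴム"))   # ['hiragana', 'katakana']
print(scripts_in_word("消しゴム"))   # ['kanji', 'hiragana', 'katakana']
print(scripts_in_word("Ｔシャツ"))   # ['roman', 'katakana']
```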
In addition to the use of kanji vs. kanas for word parsing, Japanese uses punctuation marks to parse sentences and sentence fragments. As we can see in the annotated newspaper article, the following are the Japanese equivalents of some of the roman punctuation marks:
- periods: 。。。 (little circles instead of dots)
- commas: 、、、 (Japanese commas point forward, roman commas point backwards)
- single quotes: 「 」 (type them with [ and ] when in hiragana mode)
- dash marks: ・・・ (for example, when writing down a telephone number)
- tilde: 〜 (as shown in the schedule-box, similar to the English ~)
- parentheses: （ ） (oriented in the direction of the text)
- brackets: [ ] (oriented in the direction of the text)
Japanese doesn’t use hyphens to indicate that a word is split between two consecutive lines of text. For example, in the article, the word 大学生 (dai-gaku-sei – ‘college student’) is split between two columns, but nothing at the end of the first column, no mark similar to a hyphen, indicates that the word continues in the second column.
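Both points, punctuation as a separator and the absence of hyphens at line breaks, can be sketched in a few lines. The sample text below is made up for this example (it is not the newspaper article), and the helper names are hypothetical: wrapped lines are rejoined by plain concatenation because there is no hyphen to remove, sentences are cut after each 。, and clauses are cut at each 、.

```python
import re


def join_wrapped_lines(lines):
    """Rejoin wrapped lines by simple concatenation (there is no hyphen to strip)."""
    return "".join(line.strip() for line in lines)


def split_sentences(text):
    """Cut the text after each Japanese period, keeping the period."""
    return re.findall(r"[^。]+。?", text)


def split_clauses(sentence):
    """Cut one sentence at each Japanese comma, dropping the final period."""
    return [clause for clause in sentence.rstrip("。").split("、") if clause]


# Made-up sample: 大学生 and コーヒー are both broken across lines.
lines = ["田中さんは大学", "生です。毎朝、コー", "ヒーを飲みます。"]

text = join_wrapped_lines(lines)
sentences = split_sentences(text)
print(sentences)
# ['田中さんは大学生です。', '毎朝、コーヒーを飲みます。']
print(split_clauses(sentences[1]))
# ['毎朝', 'コーヒーを飲みます']
```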