
Introduction

This article implements the natural programming method. It assumes the background knowledge from my previous article. In the list below, the first entry is the previous article. The second entry is the article you are now reading, which is the sequel to the first.

Natural Programming Method

A Natural Programming Method - Programming with Natural Language
Implementation of Natural Programming Method

In this article, the method is realized through the English, Japanese, Korean, and Taiwanese languages. The method can be coded in various computer languages, and a variety of target languages can be chosen for the generated code. All romanizations of hanja/hanji/kanji are in uppercase. All spellings of English words, Japanese and Korean forms, and Taiwanese tones are in lowercase, either alphabetic or alphanumeric.

In the following code snippet:

Parser.new(Matcher.new("writes | eighteen & red").match([_WRITE, _EIGHT, _RED])).\
to_ast.to_javascript(ctx)

Here "writes | eighteen & red" is the source-language expression, [_WRITE, _EIGHT, _RED] is the source text, and to_javascript() emits the target-language code.

Below is the legend for forms, tones, and words. Each form, tone, or word is one element of the forms, tones, or words array held by a Hanja, Hanji, Ideogram, or Kanji object:

Legend form/tone/word
	HANJA	HANJI	IDEOGRAM	KANJI
slot n	form	tone	word	form

Below is a legend for statements. Each statement represents an element of an array of statements of a Hanja/Hanji/Ideogram/Kanji object:

Legend statement
	HANJA/HANJI/IDEOGRAM/KANJI
statement n	statement

Below is a dictionary for all the romanizations of Hanja/Hanji/Kanji:

Dictionary － Theme of Xmas Tree
English	Hanja	Hanji	Kanji
gold		KIM
red	PPALGANG	ANG	AKA
silver		GIN
eight		PEH
write	SSEUDA	SIA	KAKU
green	PUREUN	CINN	AO

Background

In this article, we use the terms lookup or select for the basic operations on arrays or sets.

Hanja, Hanji, and Kanji Are Ideograms

The method is all about how we compose and decompose and group a sequence of ideograms.

The Japanese, Korean, and Taiwanese languages all use ideograms. Japanese and Korean have an intrinsic mechanism, inflection, that can compose and decompose a series of ideograms. The Taiwanese language uses tone sandhi instead of inflection.

The romanization in uppercase is used to denote the ideogram of a hanja or kanji, while the romanization in lowercase is used to denote the inflectional forms of the ideogram of a hanja or kanji.

The romanization in uppercase without tone mark (diacritic) is used to denote the ideogram of a hanji, while the romanization in lowercase with tone mark is used to denote the tones of the ideogram of a hanji.

Since there are no ideograms in the English language, we translate them to the corresponding English words instead of romanizing them.

Series and Sequence

A sequence of hanjas/hanjis/ideograms/kanjis is a series of hanjas/hanjis/ideograms/kanjis without combination operators. A series returns a sum of its members.

This is a sequence of numbers: 1, 2, 3, 4, 5
This is a series of numbers: 1 + 2 + 3 + 4 + 5
This is a sequence of ideograms: WRITE GREEN RED or WRITE-GREEN-RED
This is a series of ideograms: WRITE|GREEN&RED
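
The difference between a sequence and a series can be sketched in a few lines of Ruby. This is only an illustration of the two terms, with plain strings standing in for ideogram objects. The variable names are ours and do not come from the source code.

```ruby
# A sequence is an ordered list of members with no operators between them.
sequence = [1, 2, 3, 4, 5]

# A series combines the same members with an operator. With "+" as the
# combination operator, the series returns the sum of its members.
series_value = sequence.reduce(:+)

puts sequence.join(", ")   # prints "1, 2, 3, 4, 5"
puts sequence.join(" + ")  # prints "1 + 2 + 3 + 4 + 5"
puts series_value          # prints "15"

# The same idea with ideograms: interleave the members with operators.
ideograms = %w[WRITE GREEN RED]
operators = %w[| &]
series_text = ideograms.zip(operators).flatten.compact.join

puts ideograms.join(" ")   # prints "WRITE GREEN RED"
puts series_text           # prints "WRITE|GREEN&RED"
```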

Expression and Match Data

An expression is used to match a sequence of Hanjas/Hanjis/Ideograms/Kanjis. A matched sequence is a series. A series is an array of expression nodes returned by the matcher as match data. A series can then be shunted by the parser to build an abstract syntax tree.

This is an expression: writes greener-red
This is an expression: writes|greener&red
This is a matched sequence: WRITE|GREEN&RED
A matched sequence is a series: WRITE|GREEN&RED

Sequence, Group, and Hanja/Hanji/Ideogram/Kanji

Below is the shortest sequence with only one hanji:

The shortest sequence and group.
HANJI
GROUP

The shortest group is equal to the shortest sequence in length. Below is the second shortest sequence with a group of 2 hanjis or just 2 individual hanjis:

The second shortest sequence and group.
HANJI	HANJI
GROUP

Below is a sequence of hanjis wrapped to 4 lines to show how many hanjis each group has. The 1st, 2nd, and 3rd groups are each 2 hanjis long. The 4th group is 3 hanjis long. The 5th group is 4 hanjis long. There is a total of 20 hanjis in the table below (4 lines of 5 cells each):

A folded view of a sequence of hanjis
HANJI	HANJI	HANJI	HANJI	HANJI
HANJI	GROUP		GROUP
GROUP		GROUP
GROUP				HANJI

Below is a linear view of the above table:

A linear view of a sequence of hanjis

HANJI	HANJI	HANJI	HANJI	HANJI	HANJI	GROUP		GROUP		GROUP		GROUP			GROUP				HANJI

The intrinsic mechanisms of the English, Japanese, Korean, and Taiwanese languages are already able to make groups.

A Group is a Lengthened Hanja/Hanji/Ideogram/Kanji

When a hanji is in a group, it is a grouped hanji. The same applies to hanja and kanji. If not, it is an ungrouped hanji. Lookup or selection can be done bidirectionally in a group. It is from a left-hand hanji to a right-hand hanji or vice versa. Lookup or selection across group boundaries is masked.

When we look at a group from a series' perspective, a group is just a lengthened hanji. What a series would do to its member hanjis could also be applied to its member groups.

When evaluating a group, the default statement is returned by each group member and combined accordingly if no lookup or selection rule is assigned for that member. The collective statements are returned by a group to the series.

A group is not a phrase. A group is composed of Hanjas/Hanjis/Ideograms/Kanjis, while a phrase is composed of English/Japanese/Korean/Taiwanese words.

Coherence of a Group

There is no such thing as setting up a combination type for 2 member hanjis across group boundaries. A tone of a hanji is determined for that hanji to become a member of a group. All of the hanjis together in that group thus form a coherent unit. A single hanji is also coherent in itself. Just use a group as a whole for any combinations.

A similar mechanism could be applied to Japanese and Korean. An inflectional form of a hanja or kanji is determined for that hanja or kanji to become a member of a group. All of the hanjas or kanjis together in that group thus form a coherent unit.

Although there are no ideograms in English, a similar inflectional mechanism can still be used to make groups.

The Purpose of a Series

A member in a series is either a hanja/hanji/ideogram/kanji or a group. The first statement, which is stored at index 0 of an array in a Hanji object, is the default statement for that hanji. The statements of a group are the collective returned values of its members. When evaluating a Series object, the default statement is returned by each series member and combined accordingly if no lookup or selection rule is assigned for that member.

The collective statements returned by the evaluated series is the target language code.

The Mechanism of Grouping: Inflection and Tone Sandhi

According to Oxford Dictionaries, sandhi is the process whereby the form of a word changes as a result of its position in an utterance.

Any given hanji can have more than 1 tone. The original tone of a hanji can be stored at index 0 of an array. The non-original tones can be stored at indexes 1, 2, and 3 of the same array. The original tone is the tone we use to look up a hanji in a dictionary. If a hanji is in its original tone, it must be either in a series or at the end of a group. A hanji in a non-original tone must be in a group, anywhere except at the end of it.
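
The positional rule above can be read as a grouping procedure: a non-original tone keeps a group open, and an original tone closes it. Below is a minimal sketch of that reading. The TONES table and the split_into_groups helper are hypothetical names introduced for illustration and are not taken from the source code.

```ruby
# Hypothetical tone table: index 0 holds the original tone, later indexes
# hold non-original tones (values copied from the tables in this article).
TONES = {
  "KIM"  => %w[kim kimz],
  "GIN"  => %w[ginx ginz],
  "CINN" => %w[cinn cinnz],
  "ANG"  => %w[angx angz]
}.freeze

# Splits [hanji, tone] pairs into members. A non-original tone keeps the
# current group open. An original tone closes it, so a lone hanji in its
# original tone becomes a member of length 1.
def split_into_groups(pairs)
  members = []
  current = []
  pairs.each do |hanji, tone|
    tones = TONES.fetch(hanji)
    raise ArgumentError, "#{tone} is not a tone of #{hanji}" unless tones.include?(tone)
    current << hanji
    if tone == tones.first
      members << current
      current = []
    end
  end
  raise ArgumentError, "sequence ends in a non-original tone" unless current.empty?
  members
end

p split_into_groups([%w[KIM kimz], %w[GIN ginx], %w[CINN cinn]])
# prints [["KIM", "GIN"], ["CINN"]]
```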

Linguistic inflection exhibits the same functionality as tone sandhi. It can hence be used by an expression to match a series of ideograms. English inflection can make use of the change in the stem, prefix, and suffix of a word. Japanese and Korean inflection can make use of the change in the suffix based on a dictionary/plain form or a stem. No matter how the inflectional mechanism may differ, there is always an original form and many non-original forms which we can make use of to make a group. What we can do with tone sandhi can also be done with inflection, since they both come with an original form from which many non-original forms derive.

Grouping and Inflection of English Language

In the following table, we put green, which is a stem, in the first slot of the GREEN table. We can then put greener, which is green suffixed with -er, in the second slot. Finally we can put green suffixed with -est in the third slot. The GREEN's words array is then populated as below:

青
	GREEN
slot 0	green
slot 1	greener
slot 2	greenest

We can use "green", "greener", or "greenest" expression to match GREEN sequence as below:

Matcher.new("green").match([_GREEN])
Matcher.new("greener").match([_GREEN])
Matcher.new("greenest").match([_GREEN])
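
To make the slot idea concrete, here is a small sketch of how a word could be looked up in an ideogram's words array. IdeogramEntry and slot_of are hypothetical stand-ins, not the classes from the source code. The sketch only shows that every inflected word maps back to the same ideogram and to one slot.

```ruby
# Hypothetical stand-in for an Ideogram object with its words array.
IdeogramEntry = Struct.new(:name, :words) do
  # Returns the slot index of the word, or nil when the word is not
  # one of this ideogram's words.
  def slot_of(word)
    words.index(word)
  end
end

green_entry = IdeogramEntry.new("GREEN", %w[green greener greenest])

p green_entry.slot_of("green")     # prints 0
p green_entry.slot_of("greener")   # prints 1
p green_entry.slot_of("greenest")  # prints 2
p green_entry.slot_of("redder")    # prints nil
```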

We can also populate RED's words array with red, redder, and reddest:

紅
	RED
slot 0	red
slot 1	redder
slot 2	reddest

We can use "red", "redder", or "reddest" expression to match RED sequence as below:

Matcher.new("red").match([_RED])
Matcher.new("redder").match([_RED])
Matcher.new("reddest").match([_RED])

We can now use green and red to make groups:

青紅
	GREEN		RED
slot 0	green	&	red
slot 1	greener		redder
slot 2	greenest		reddest

We can use "greener & red" expression to match GREEN-RED sequence as below:

Matcher.new("greener & red").match([_GREEN, _RED])

紅青
	RED		GREEN
slot 0	red	&	green
slot 1	redder		greener
slot 2	reddest		greenest

We can use "redder & green" expression to match RED-GREEN sequence as below:

Matcher.new("redder & green").match([_RED, _GREEN])

寫青紅
	WRITE		GREEN		RED
slot 0	write	|	green	&	red
slot 1	writes	|	greener	&	redder

We can use "writes | greener & red" expression to match WRITE-GREEN-RED sequence as below:

Matcher.new("writes | greener & red").match([_WRITE, _GREEN, _RED])

The English language has its own grouping mechanism. A group validator can be incorporated to validate a group.

Grouping and Inflection of Japanese Language

The same can be applied to Japanese as follows:

青赤
	AO		AKA
slot 0	ao	&	aka
slot 1	aoi	&	akai

We can use "aoi & aka" expression to match AO-AKA sequence as below:

Matcher.new("aoi & aka").match([_AO, _AKA])

赤青
	AKA		AO
slot 0	aka	&	ao
slot 1	akai	&	aoi

We can use "akai & ao" expression to match AKA-AO sequence as below:

Matcher.new("akai & ao").match([_AKA, _AO])

書青赤
	KAKU		AO		AKA
slot 0	kaku	|	ao	&	aka
slot 1	kaki	|	aoi	&	akai

We can use "kaki | aoi & aka" expression to match KAKU-AO-AKA sequence as below:

Matcher.new("kaki | aoi & aka").match([_KAKU, _AO, _AKA])

The Japanese language has its own grouping mechanism. A group validator can be incorporated to validate a group.

Grouping and Inflection of Korean Language

The same can be applied to Korean as follows:

青赤
	PUREUN		PPALGANG
slot 0	pureun	&	ppalgang
slot 1	pureuda	&	ppalgan

We can use "pureuda & ppalgang" expression to match PUREUN-PPALGANG sequence as below:

Matcher.new("pureuda & ppalgang").match([_PUREUN, _PPALGANG])

赤青
	PPALGANG		PUREUN
slot 0	ppalgang	&	pureun
slot 1	ppalgan	&	pureuda

We can use "ppalgan & pureun" expression to match PPALGANG-PUREUN sequence as below:

Matcher.new("ppalgan & pureun").match([_PPALGANG, _PUREUN])

書青赤
	SSEUDA		PUREUN		PPALGANG
slot 0	sseuda	|	pureun	&	ppalgang
slot 1	sseugi		pureuda		ppalgan
slot 2	sseugo

We can use "sseugo | pureuda & ppalgang" expression to match SSEUDA-PUREUN-PPALGANG sequence as below:

Matcher.new("sseugo | pureuda & ppalgang").match([_SSEUDA, _PUREUN, _PPALGANG])

The Korean language has its own grouping mechanism. A group validator can be incorporated to validate a group.

Grouping and Tone Sandhi of Taiwanese Language

Below is a table for CINN populated with original tone "cinn" and non-original tone "cinnz".

青
	CINN
slot 0	cinn
slot 1	cinnz

The tones can also be transcribed to alphanumeric, which makes cinn act as a stem and numbers 1 and 7 act as suffixes to the stem:

青
	CINN
slot 0	cinn1
slot 1	cinn7

A matched hanji is either in a series or in a group. When it is in a series, it is a series member. When it is in a group, it is a group member. When a hanji is in a series or at the end of a group, it is determined with its original tone, which is stored at index 0 of the tones array:

青
	CINN
slot 0	cinn
slot 1	cinnz

Any 2 given hanjis can be grouped together. When 2 or more hanjis are grouped together, each group member is determined with one of its tones. A group of hanjis can be viewed as a lengthened hanji. A specific tone of each member hanji is determined. When you group at least 2 hanjis together, you also determine a specific pronunciation for this group. In other words, when you determine a specific tone for at least 2 hanjis, you group them together. The pronunciation of a group is the concatenation of the tone of each of its member hanjis.

The kimz at slot 1 is concatenated with the ginx at slot 0. KIM must be in a group since its non-original tone is determined as below:

金銀
	KIM		GIN
slot 0	kim	&	ginx
slot 1	kimz	&	ginz

We can use "kimz & ginx" expression to match KIM-GIN sequence as below:

Matcher.new("kimz & ginx").match([_KIM, _GIN])

KIM-GIN-CINN sequence is a group of 3 hanjis:

金銀青
	KIM		GIN		CINN
slot 0	kim	&	ginx	&	cinn
slot 1	kimz	&	ginz	&	cinnz

We can use "kimz & ginz & cinn" expression to match KIM-GIN-CINN sequence as below:

Matcher.new("kimz & ginz & cinn").match([_KIM, _GIN, _CINN])

KIM-GIN-CINN-ANG sequence is a group of 4 hanjis:

金銀青紅
	KIM		GIN		CINN		ANG
slot 0	kim	&	ginx	&	cinn	&	angx
slot 1	kimz	&	ginz	&	cinnz	&	angz

We can use "kimz & ginz & cinnz & angx" expression to match KIM-GIN-CINN-ANG sequence as below:

Matcher.new("kimz & ginz & cinnz & angx").match([_KIM, _GIN, _CINN, _ANG])

We can incorporate a Taiwanese group validator to validate the groups.
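
One possible shape for such a validator is sketched below. It assumes the rule stated earlier: every member of a group except the last carries a non-original tone, and the last member carries its original tone. MatchedHanji and valid_group? are hypothetical names and are not part of the source code.

```ruby
# Hypothetical stand-in for a matched Hanji object: the original tone sits
# at index 0 of tones, and tone is the one determined by the matcher.
MatchedHanji = Struct.new(:name, :tones, :tone) do
  def original_tone?
    tone == tones.first
  end
end

# A group is valid when every member except the last is in a non-original
# tone and the last member is in its original tone.
def valid_group?(members)
  return false if members.empty?

  *leading, last = members
  leading.none?(&:original_tone?) && last.original_tone?
end

kim  = MatchedHanji.new("KIM",  %w[kim kimz],   "kimz")
gin  = MatchedHanji.new("GIN",  %w[ginx ginz],  "ginz")
cinn = MatchedHanji.new("CINN", %w[cinn cinnz], "cinnz")
ang  = MatchedHanji.new("ANG",  %w[angx angz],  "angx")

p valid_group?([kim, gin, cinn, ang])   # prints true

# With "ginx" (original tone) in the middle, the four hanjis no longer
# form a single group.
gin_original = MatchedHanji.new("GIN", %w[ginx ginz], "ginx")
p valid_group?([kim, gin_original, cinn, ang])   # prints false
```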

Grouping and Combination

Combination takes place between any hanji-hanji pair, hanji-group pair, and group-group pair. You can combine any hanjis or groups with combination operators and they can form either a sequential or nesting combination. The same applies to hanja and kanji.

In the example below, SIA, PEH, and CINN are 3 individual Hanji objects and are not in a group. PEH is combined with CINN via an And operator (&). SIA is combined with PEH-CINN via an Or operator (|). A series SIA|PEH&CINN is thus generated by the Matcher. The tones table is as below:

寫[]八[]青
	SIA	[]	PEH	[]	CINN
slot 0	siay	|	peh	&	cinn
slot 1	sia	|	pehy	&	cinnz

We can use "siay | peh & cinn" expression to match SIA-PEH-CINN sequence as below:

m = Matcher.new("siay | peh & cinn").match([_SIA, _PEH, _CINN])

In the example below, SIA is a hanji and PEH-CINN is a group of 2 hanjis.

寫[]八青
	SIA	[]	PEH		CINN
slot 0	siay	|	peh	&	cinn
slot 1	sia	|	pehy	&	cinnz

We can use "siay | pehy & cinn" expression to match SIA-PEH-CINN sequence as below:

m = Matcher.new("siay | pehy & cinn").match([_SIA, _PEH, _CINN])

When SIA is grouped with PEH-CINN sequence:

寫八青
	SIA		PEH		CINN
slot 0	siay	|	peh	&	cinn
slot 1	sia	|	pehy	&	cinnz

We can use "sia | pehy & cinn" expression to match SIA-PEH-CINN sequence as below:

m = Matcher.new("sia | pehy & cinn").match([_SIA, _PEH, _CINN])

When you neither group nor select, the hanji is just a symbol or an ideogram, as if printed on paper or shown on a screen.

寫
	SIA
slot 0	siay
slot 1	sia

sia = HanjiExpression.new("寫")
sia.tones = %w[siay sia]

寫
	SIA
statement 0	`function write() {}`

sia.statements = ["function write() {}"]

Associativity

Associativity of evaluation of a Series object is assumed to be from right to left, whether in groups or in parentheses.

Context

Lookup or selection rules are stored in context. A rule is an array stored as an item in a Set. The convention of a rule is:

rule[_SELF, _OTHER, index]

The above rule means: rule[0] wants statements[index] from rule[1]. This can then be read as: _SELF wants statements[index] from _OTHER.

For example, PEH wants the element at statements[2] from ANG:

ctx.add([_PEH, _ANG, 2])

A local copy of context for a group can be extracted from the global one when evaluating a Group object.
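
The rule convention can be sketched with a plain Ruby Set. In this sketch, symbols stand in for the _PEH and _ANG objects, and wanted_index and local_context are hypothetical helpers that do not come from the source code. The local copy is assumed to keep only the rules whose two parties are both group members, in line with the masking of lookups across group boundaries.

```ruby
require "set"

# Symbols stand in for the article's _PEH, _ANG, ... objects (an assumption).
context = Set.new
context.add([:PEH, :ANG, 2])   # PEH wants statements[2] from ANG
context.add([:SIA, :KIM, 1])   # SIA wants statements[1] from KIM

# Hypothetical helper: which statement index does `me` want from `other`?
# Falls back to 0, the default statement, when no rule is stored.
def wanted_index(context, me, other)
  rule = context.find { |wanter, source, _index| wanter == me && source == other }
  rule ? rule[2] : 0
end

# Hypothetical helper: the local copy for one group keeps only the rules
# whose two parties are both members of that group.
def local_context(context, members)
  context.select do |wanter, source, _index|
    members.include?(wanter) && members.include?(source)
  end.to_set
end

ang_statements = ["stmt 0", "stmt 1", "stmt 2"]
puts ang_statements[wanted_index(context, :PEH, :ANG)]    # prints "stmt 2"
puts ang_statements[wanted_index(context, :CINN, :ANG)]   # prints "stmt 0"
p local_context(context, [:PEH, :ANG]).to_a               # prints [[:PEH, :ANG, 2]]
```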

Combination Operators

Combinations take place between any two adjacent series members and between any two adjacent group members. The two basic combination types are nesting and sequential. Sequential combination is easier to implement than nesting combination, in that we just emit the statements one by one in sequential order. For nesting combination, we can split up the collective statements from the left-hand node and enclose the collective statements from the right-hand node.
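
The two combination types can be sketched on arrays of statement strings. This is a simplified illustration only. The helper names are hypothetical, the nesting helper assumes that the left-hand statement contains an empty "{}" body to split at, and the console.log statement is an invented example. The sketch does not say which operator maps to which combination type.

```ruby
# Sequential combination: emit the left statements, then the right ones.
def combine_sequential(left, right)
  left + right
end

# Nesting combination: split the first left statement at its (assumed)
# empty "{}" body and enclose the right statements inside it.
def combine_nesting(left, right)
  head, tail = left.first.split("{}", 2)
  ["#{head}{ #{right.join(' ')} }#{tail}"]
end

write_statements = ["function write() {}"]
red_statements   = ["console.log('red');"]   # invented example statement

p combine_sequential(write_statements, red_statements)
# prints ["function write() {}", "console.log('red');"]
p combine_nesting(write_statements, red_statements)
# prints ["function write() { console.log('red'); }"]
```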

Breakdown of the Source Code

Expressions
- And expression
- Or expression
- Ideogram expression
  - Hanja expression
  - Hanji expression
  - Kanji expression
- AST wrapper
  - Series
  - Group
Context
Matching
- scanning
- lexing
- casting
- creating node objects
- returning expression nodes
Parsing
- building AST
- shunting
- evaluating/compiling/interpreting to target language code

Expressions

Expressions are implemented with the Interpreter design pattern. The Hanja/Hanji/Ideogram/Kanji classes are terminal expressions. The And class and the Or class are non-terminal expressions. The ASTWrapper class is implemented with the Wrapper design pattern. The Series class and the Group class are implemented as wrappers of the abstract syntax tree.

Series object and Group object both have a member variable representing the abstract syntax tree. A Series object also has Hanja/Hanji/Ideogram/Kanji objects and Group objects from the abstract syntax tree as its series members. A Group object also has Hanji objects from the abstract syntax tree as its group members.

Context

Context is a Ruby Set object.

Matching

The implementation of the Matcher class places its emphasis on matching. Scanning and lexing are each reduced to a single line of code in this implementation:

@lexer = expression.scan(/[[:alnum:]]+|\&|\|/)

and:

@lexer.shift
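
The effect of these two lines can be checked in isolation. The snippet below applies the same regular expression to one of the example expressions. A local variable stands in for the @lexer instance variable.

```ruby
# scan returns every alphanumeric run and every "&" or "|" as one token.
lexer = "writes | greener & red".scan(/[[:alnum:]]+|\&|\|/)
p lexer          # prints ["writes", "|", "greener", "&", "red"]

# shift removes and returns the next token from the front of the array.
token = lexer.shift
p token          # prints "writes"
p lexer          # prints ["|", "greener", "&", "red"]
```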

We always have to match a form with a hanja or kanji, a tone with a hanji, and a word with an ideogram. The matcher is responsible for matching an expression with a sequence of Hanja/Hanji/Ideogram/Kanji objects. When a match is found, the matcher casts the form to the Hanja or Kanji object, the tone to the Hanji object, and the word to the Ideogram object. A form/tone/word of a Hanja/Hanji/Ideogram/Kanji object is then determined by casting.

Each token is either matched with a Hanja/Hanji/Ideogram/Kanji object or used to create an And or Or expression object. The array of expression nodes consists of Hanja/Hanji/Ideogram/Kanji objects as operands and And and Or objects as operators. The array of expression nodes is then returned as match data by the matcher.
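
The matching step described above might look roughly like the following. This is a simplified sketch, not the Matcher class from the source code. IdeogramNode, AndNode, OrNode, and match_expression are hypothetical names, and the sketch assumes that operands appear in the expression in the same order as in the sequence.

```ruby
# Hypothetical stand-ins for the expression classes.
IdeogramNode = Struct.new(:name, :words, :slot)
AndNode = Struct.new(:left, :right)
OrNode  = Struct.new(:left, :right)

# Matches an expression against a sequence and returns the match data:
# ideogram nodes as operands, AndNode and OrNode objects as operators.
def match_expression(expression, sequence)
  operands = sequence.dup
  nodes = expression.scan(/[[:alnum:]]+|\&|\|/).map do |token|
    case token
    when "&" then AndNode.new
    when "|" then OrNode.new
    else
      node = operands.shift
      slot = node && node.words.index(token)
      raise ArgumentError, "no match for #{token.inspect}" if slot.nil?

      node.slot = slot   # casting: the word of this ideogram is now determined
      node
    end
  end
  raise ArgumentError, "unmatched ideograms remain" unless operands.empty?

  nodes
end

write = IdeogramNode.new("WRITE", %w[write writes])
green = IdeogramNode.new("GREEN", %w[green greener greenest])
red   = IdeogramNode.new("RED",   %w[red redder reddest])

match_data = match_expression("writes | greener & red", [write, green, red])
summary = match_data.map { |n| n.respond_to?(:name) ? "#{n.name}[#{n.slot}]" : n.class.name }
p summary   # prints ["WRITE[1]", "OrNode", "GREEN[1]", "AndNode", "RED[0]"]
```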

Parsing

The parser constructor takes match data as its argument. The match data must be shunted by the parser to build an abstract syntax tree. If you want to add the ability to handle parentheses to the shunting-yard algorithm, you can also do it in the shunt method.

Method to_ast calls method shunt to shunt the match data:

def to_ast
  return shunt
end

The shunted match data is actually an abstract syntax tree. An abstract syntax tree is ready to be evaluated/compiled/interpreted to target language code. A visitor pattern can be incorporated to perform various operations on an abstract syntax tree.
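
For readers unfamiliar with the shunting-yard algorithm, here is a minimal sketch of how a flat series can be turned into a tree. It is not the shunt method from the source code. Strings stand in for expression nodes, shunt_series and reduce_top are hypothetical names, and the precedence table ("&" binds tighter than "|") is an assumption. Operators are treated as right-associative to follow the Associativity section.

```ruby
# Assumed precedence (not stated in the article): "&" binds tighter than "|".
PRECEDENCE = { "&" => 2, "|" => 1 }.freeze

# Pops one operator and its two operands, then pushes the combined subtree.
def reduce_top(operators, output)
  op = operators.pop
  right = output.pop
  left = output.pop
  output.push([op, left, right])
end

# Shunts a flat series into a nested [operator, left, right] tree.
# Because operators are right-associative here, an operator of equal
# precedence already on the stack is not reduced first.
def shunt_series(series)
  output = []
  operators = []
  series.each do |node|
    if PRECEDENCE.key?(node)
      reduce_top(operators, output) while operators.any? && PRECEDENCE[operators.last] > PRECEDENCE[node]
      operators.push(node)
    else
      output.push(node)
    end
  end
  reduce_top(operators, output) while operators.any?
  output.first
end

p shunt_series(%w[WRITE | GREEN & RED])
# prints ["|", "WRITE", ["&", "GREEN", "RED"]]
p shunt_series(%w[KIM & GIN & CINN])
# prints ["&", "KIM", ["&", "GIN", "CINN"]]
```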

Conclusion

This article is a basic realization of the method. There is still more to be done. Every language has its own limitations and abilities, and we will try to make the best of them.

History

Jan 8, 2015: Add a link to my previous article at the beginning of the Introduction section
Jan 9, 2015: Add a tiny list to give readers an overview of the 2 articles in series. Correct the definition of series and sequence. A series should be a matched sequence. Add some examples of sequence and series. Rephrase the list of 2 articles in the Introduction section.