Morphology Algorithm
Internal Revision 4.1, 8 June 1992

The following will become the official baseline algorithm for resolution of
Lojban text into individual words from sounds, stress, and pause. As such,
it is the ultimate standard of Lojban's unambiguous resolvability, which
may make Lojban speech recognition by computers more feasible than for
other languages. While the algorithm looks very complicated, almost all of
it is resolving special cases and performing what error detection and
correction may be possible.

We have a string representing the speech stream, marked with stress and
pauses. We want to break it up into words.

1. First, break at all pauses (cannot pause in the middle of a word).
2. Then, pick the first piece that has not been uniquely resolved.
  A. The first thing is to deal with some constructs which are required to
     end with a pause:
    1) Names:
      a) If the last letter of the piece is a consonant, we have a name. A
         name must have a pause before it UNLESS it is immediately preceded
         by /la/, /lai/, /la'i/ or /doi/ as a marker, and it cannot contain
         any of these markers unless the marker is immediately preceded by a
         consonant. So, look backwards from the end of the piece for any of
         the allowed markers. If we don't find one (e.g. /jonz/), then the
         whole piece has been resolved as a name.
      b) If we do find such a marker, then check what immediately precedes
         it. If there is nothing (e.g. /ladjAn/), or if a vowel precedes
         (e.g. /mivIskaladjAn/), break off the marker as a resolved piece
         (/la/), and what follows it is also a resolved piece, a name
         (/djAn/), leaving us with whatever preceded the marker, if
         anything, as still unresolved (/mivIska/).
      c) If what precedes the marker is a consonant (e.g. /karoslAInas/),
         then ignore the marker and continue looking backwards. This
         exception is allowed because /karos/ with no following pause cannot
         represent a separate word.
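
The name rule above can be sketched in Python as follows. This is only an
illustration, not part of the baseline algorithm: the function name
resolve_name and its return shape are hypothetical, and the sketch assumes
that stressed vowels are written in upper case, as in the examples.

```python
VOWELS = set("aeiouy")
# Longest markers first so /la'i/ and /lai/ win over /la/ at the same position.
MARKERS = ("la'i", "lai", "doi", "la")


def is_consonant(ch):
    return ch.isalpha() and ch.lower() not in VOWELS


def resolve_name(piece):
    """Split a consonant-final piece into (unresolved, marker, name).

    marker is None when the whole piece resolves as a name.
    """
    low = piece.lower()  # stress marking is irrelevant to this rule
    if not low or not is_consonant(low[-1]):
        raise ValueError("piece does not end in a consonant")
    # Try each start position, working backwards from the end of the piece.
    for i in range(len(low) - 1, -1, -1):
        for marker in MARKERS:
            end = i + len(marker)
            if low.startswith(marker, i) and end < len(low):
                if i == 0 or not is_consonant(low[i - 1]):
                    return piece[:i], piece[i:end], piece[end:]
                break  # marker follows a consonant: ignore it, keep looking
    return "", None, piece
```

With the examples from the text, resolve_name("mivIskaladjAn") gives
("mivIska", "la", "djAn"), while "jonz" and "karoslAInas" come back whole.
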
    2) ".y.", the hesitation:
       If the piece consists solely of /y/, then it resolves as the
       hesitation word (which is required to be surrounded by pauses).
    3) If the piece ends in "y", check for some lerfu words: specifically,
       the last lerfu word of a string, if it ends in a "y" (e.g.
       /abubycydy/ or /y'y/), must be followed by a pause:
      a) If the "y" is preceded by a consonant, break off the consonant+"y"
         as a resolved lerfu word (e.g. /abubycydy/ gives /abubycy/
         unresolved, and /dy/ resolved as a lerfu word). Continue breaking
         off any Cy pieces as lerfu words if they're there (e.g. unresolved
         /abubycy/ gives unresolved /abuby/ + resolved /cy/; then /abuby/
         gives unresolved /abu/ plus resolved /by/).
         Note that the Cy-type lerfu words will NEVER come before the other
         lerfu word pieces in a breath-group (the "abu" and "y'y" types):
         since those begin with vowels, they MUST be preceded by pauses; and
         Cy followed by anything but another Cy must be followed by a pause
         (because "y" is used as glue in lujvo, it could cause resolvability
         problems if not separate; e.g. /micybusmAbru/ would not uniquely
         resolve).
      b) If the "y" is preceded by "V'" or "y'" (e.g. /y'y/), break before
         the "V", and the "V'y" is resolved as a lerfu word.
      c) If the "y" is preceded by an "i" or "u" ("iy" and "uy" are
         reserved), the piece cannot be resolved.
      d) If the "y" is preceded by a vowel (V) other than "i" or "u", the
         piece is in error and cannot be further resolved.
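
As a rough illustration of rules a) through d), the following Python sketch
peels lerfu words off the end of a piece that ends in "y". The function name
split_trailing_lerfu is hypothetical, and the sketch assumes the bare
hesitation /y/ has already been handled by the preceding step. Whatever it
returns as the front part is still unresolved and goes back through the
algorithm.

```python
CONSONANTS = set("bcdfgjklmnprstvxz")
VOWELS = set("aeiouy")


def split_trailing_lerfu(piece):
    """Return (unresolved_front, lerfu_words) for a piece ending in "y"."""
    if not piece.endswith("y"):
        raise ValueError('piece does not end in "y"')
    rest, words = piece, []
    # Rule a: peel Cy words off the end, keeping them in spoken order.
    while len(rest) >= 2 and rest[-1] == "y" and rest[-2] in CONSONANTS:
        words.insert(0, rest[-2:])
        rest = rest[:-2]
    if words:
        return rest, words
    # Rule b: V'y (including y'y) is a single lerfu word.
    if len(rest) >= 3 and rest[-2] == "'" and rest[-3].lower() in VOWELS:
        return rest[:-3], [rest[-3:]]
    # Rules c and d: a vowel directly before the "y" cannot be resolved.
    raise ValueError("cannot resolve the final 'y' of " + repr(piece))
```

For the example in the text, split_trailing_lerfu("abubycydy") yields
("abu", ["by", "cy", "dy"]).
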
  B. Next, see if the piece is composed entirely of cmavo.
    1) Check the piece to see if there are any consonant clusters (a
       consonant cluster is of one of the forms CC or CyC). If there are
       none, break up the piece before each consonant, resolving each piece
       as a cmavo (e.g. /alenumibaca'a/ breaks into the cmavo /a/ + /le/ +
       /nu/ + /mi/ + /ba/ + /ca'a/). If there are no consonants, the piece
       is a single cmavo. In either case, the piece is completely resolved.
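
A minimal Python sketch of this cmavo check, under the assumption that
consonants are always written in lower case and that names and lerfu words
have already been removed by step A. The name split_cmavo is hypothetical;
it returns None when the piece contains a cluster and so must go on to
step C.

```python
import re

C = "bcdfgjklmnprstvxz"
# A consonant cluster is CC or CyC.
CLUSTER = re.compile(f"[{C}]y?[{C}]")
# A cmavo here is an optional consonant followed by everything up to the
# next consonant (vowels and apostrophes).
CMAVO = re.compile(f"[{C}]?[^{C}]+")


def split_cmavo(piece):
    """Return the list of cmavo in a cluster-free piece, else None."""
    if CLUSTER.search(piece):
        return None
    return CMAVO.findall(piece)
```
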
  C. Now we have a piece which we are sure contains a brivla (a gismu, a
     lujvo or a le'avla). We know that a brivla must have a consonant
     cluster (CC or CyC) within the first five letters (ignoring apostrophes
     in the count), and must have penultimate stress (ignoring "y"
     syllables, which are not allowed to be stressed).
    1) First, let's check for a potential error (a form which shouldn't
       arise):
      a) If the piece contains no stress, but has a consonant cluster (CC
         or CyC), it is in error. The consonant cluster indicates it
         contains a brivla (gismu, lujvo or le'avla), which requires
         penultimate stress. The only place this MIGHT validly occur is
         inside a zoi-quote (and therefore need not be resolved at all).
      b) However, if stress information is not available, assume the
         brivla ends at the end of the piece. (This rule gives the right
         behavior with canonical written Lojban, where spaces separate all
         words except for some cmavo compounds and stress is normally not
         marked.)
    2) Next, we need to find THE penultimate stress for the first brivla in
       the piece (the brivla is expected to end after the syllable following
       the stress, ignoring "y" syllables). Starting from the first
       consonant cluster (CC or CyC):
      a) If the previous letter is a stressed vowel, take that as THE
         penultimate stress of the brivla.
      b) If the previous letter is an unstressed vowel, but the letter
         before that is a stressed vowel, then it is a stressed diphthong;
         treat the entire diphthong as stressed (so that "find the next
         vowel" will not get just the second half of the diphthong). Take
         that as THE penultimate stress.
      c) Otherwise, find the first stress after the consonant cluster. If
         the stress is on a diphthong, treat the entire diphthong as
         stressed (so that "find the next vowel" will not get just the
         second half of the diphthong). Take that as THE penultimate stress.
    3) Next, let's find the end of the first brivla in the piece:
      a) If there is no vowel in the piece after the stress, it can't be a
         penultimate stress, so the piece is in error (unresolvable). This
         is also true if "y" is the only vowel after the stress (e.g.
         */stAsy/ is not a valid breath-group).
      b) If the NEXT vowel following the stress (skipping over any "y") is
         immediately followed by "'V" (as in /mlAtyci'a/), then the
         syllable following the stress cannot be the last syllable of a
         word (since the 'V cannot begin the next word). Ordinarily we
         would count this as an error, but let's instead assume that this
         was a secondary stress and ignore the fact that there is some
         stress on it. Go find the next stress to use as THE penultimate
         stress for this brivla (e.g. in /mlAtyci'abrIjuti/, assume the
         penultimate stress is "I", not "A").
      c) Having eliminated all the potential problems with finding the end,
         let's cut the piece after the end of the brivla:
         Find the first vowel (not counting "y") after the stress. If it is
         part of a diphthong, break after the diphthong; otherwise, break
         after the vowel itself.
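
Steps 2) and 3) can be sketched together in Python. This is a hedged
illustration only: the function names are hypothetical, stressed vowels are
assumed to be written in upper case as in the examples, and the diphthong
list (ai, ei, oi, au) is an assumption not stated in this section. The
sketch checks only for the apostrophe in rule 3b, on the assumption that an
apostrophe is always followed by a vowel.

```python
import re

C = "bcdfgjklmnprstvxz"
CLUSTER = re.compile(f"[{C}]y?[{C}]")
STRESSED = set("AEIOU")
DIPHTHONGS = ("ai", "ei", "oi", "au")  # assumption, see lead-in


def nucleus_end(piece, i):
    """Index just past the vowel at i, absorbing a diphthong's second half."""
    return i + 2 if piece[i:i + 2].lower() in DIPHTHONGS else i + 1


def next_stress_end(piece, start):
    """End of the first stressed nucleus at or after index start."""
    for i in range(start, len(piece)):
        if piece[i] in STRESSED:
            return nucleus_end(piece, i)
    raise ValueError("no usable stress")


def first_brivla_end(piece):
    """Return the index at which to cut after the first brivla."""
    m = CLUSTER.search(piece)
    if m is None:
        raise ValueError("no consonant cluster")
    k = m.start()
    if k >= 1 and piece[k - 1] in STRESSED:
        stress_end = k  # 2a: stressed vowel right before the cluster
    elif k >= 2 and piece[k - 1] in "aeiou" and piece[k - 2] in STRESSED:
        stress_end = k  # 2b: stressed diphthong right before the cluster
    else:
        stress_end = next_stress_end(piece, k)  # 2c
    while True:
        # First vowel after the stress, skipping "y".
        j = next((i for i in range(stress_end, len(piece))
                  if piece[i].lower() in "aeiou"), None)
        if j is None:
            raise ValueError("no vowel after the stress")  # 3a
        end = nucleus_end(piece, j)
        if piece[end:end + 1] == "'":
            # 3b: treat the stress as secondary and look for the next one.
            stress_end = next_stress_end(piece, end)
            continue
        return end  # 3c
```

For example, cutting "mlAtyci'abrIjuti" at first_brivla_end gives
"mlAtyci'abrIju" with "ti" left over, and "stAsy" raises an error.
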
    4) Now let's find the beginning of the brivla in the front part of the
       piece we just broke off:
      a) First, break as many obvious cmavo pieces off the front as we can:
        1] If there is no consonant cluster (CC or CyC) in the first 5
           letters (ignoring apostrophes in the count), then, if the piece
           starts with a vowel, break off before the first consonant (e.g.
           /alekArce/ becomes /a/ = cmavo + /lekArce/ = unresolved);
           otherwise break off before the second consonant (e.g.
           /vilekArce/ becomes /vi/ = cmavo + /lekArce/ = unresolved). The
           front piece is then resolved as a cmavo.
        2] Repeat the above as many times as we can (so, /lekArce/ becomes
           /le/ = cmavo + /kArce/ = unresolved. Since /kArce/ has a
           consonant cluster in the first five letters, we can't go any
           further).
        3] If the piece we have left starts with a vowel, find the first
           consonant. If the first consonant is part of a consonant cluster
           (only CC-form this time), and this consonant cluster is NOT a
           valid initial cluster (i.e., one in which each adjacent pair of
           consonants is a valid initial pair), then we can resolve the
           entire piece as a le'avla (e.g. /antipAsto/); otherwise (if the
           first consonant is NOT part of a consonant cluster, or the
           consonant cluster IS a valid initial cluster), break off before
           the first consonant as a cmavo (e.g. /a'ofArlu/ becomes /a'o/ =
           cmavo + /fArlu/ = unresolved; or, /aismAcu/ becomes /ai/ = cmavo
           + /smAcu/ = unresolved).
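
The cmavo-peeling loop of step 4a might look like this in Python. The
function names are hypothetical. The table of valid initial pairs is not
given in this document; the 48 pairs below are an assumption taken from the
standard Lojban list and may differ from what this revision intended. The
sketch also assumes that a cluster counts as "in the first 5 letters" only
if it lies entirely within them, which is what the /lekArce/ example
implies.

```python
import re

C = "bcdfgjklmnprstvxz"
CONS = set(C)
CLUSTER = re.compile(f"[{C}]y?[{C}]")
# Assumed table of valid initial pairs (not taken from this document).
INITIAL_PAIRS = set(
    "bl br cf ck cl cm cn cp cr ct dj dr dz fl fr gl gr jb jd jg jm jv "
    "kl kr ml mr pl pr sf sk sl sm sn sp sr st tc tr ts vl vr xl xr "
    "zb zd zg zm zv".split()
)


def cluster_in_first_five(piece):
    return CLUSTER.search(piece.replace("'", "")[:5]) is not None


def valid_initial_cluster(piece, start):
    run = re.match(f"[{C}]+", piece[start:]).group()
    return all(run[i:i + 2] in INITIAL_PAIRS for i in range(len(run) - 1))


def peel_front_cmavo(piece):
    """Return (cmavo_list, remainder, kind) for a piece ending in a brivla."""
    if not CLUSTER.search(piece):
        raise ValueError("piece contains no consonant cluster")
    cmavo = []
    # Rules 1] and 2]: repeat while no cluster sits in the first 5 letters.
    while not cluster_in_first_five(piece):
        consonants = [i for i, ch in enumerate(piece) if ch in CONS]
        cut = consonants[0] if piece[0] not in CONS else consonants[1]
        cmavo.append(piece[:cut])
        piece = piece[cut:]
    # Rule 3]: a vowel-initial remainder.
    if piece[0] not in CONS:
        first = next(i for i, ch in enumerate(piece) if ch in CONS)
        in_cc = piece[first + 1:first + 2] in CONS
        if in_cc and not valid_initial_cluster(piece, first):
            return cmavo, piece, "le'avla"
        cmavo.append(piece[:first])
        piece = piece[first:]
    return cmavo, piece, "unresolved"
```

Running it on /vilekArce/ gives (["vi", "le"], "kArce", "unresolved"), and
/antipAsto/ is returned whole as a le'avla.
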
      b) What's left begins with a consonant and has a consonant cluster
         (CC or CyC) in the first 5 letters. The whole thing may be a
         brivla, or there may be (at most) one consonant-initial cmavo in
         front. Here are the possibilities for the start of the piece, and
         their resolutions:
        1] CC... or CVCyC...:
           Resolve the whole thing as a brivla (a gismu, lujvo, or le'avla).
        2] CyC...:
           Invalid form. Unresolvable.
        3] CVVCC...:
           (Note: stressing a cmavo on the final syllable before a brivla is
           not allowed.)
          a] If there is no stress on the VV and the consonant cluster
             beginning with the CC is a valid initial cluster (i.e., each
             adjacent pair of consonants is a valid initial pair), then
             break off the CVV, and resolve it as a cmavo; the remaining
             piece can then be resolved as a brivla (see "CC...", above).
             For example, /leiprEnu/ becomes /lei/ = cmavo + /prEnu/ =
             brivla.
          b] Otherwise (i.e. there IS a stress on the VV, or the first
             consonant cluster is not a valid initial cluster), resolve the
             whole thing as a brivla (e.g. /cAItro/ = brivla).
        4] CV'VCC...:
           (Note: stressing a cmavo on the final syllable before a brivla is
           not allowed.)
          a] If there is no stress on the final vowel of the V'V and the
             consonant cluster beginning with the CC is a valid initial
             cluster (i.e., each adjacent pair of consonants is a valid
             initial pair), then break off the CV'V, and resolve it as a
             cmavo; the remaining piece can then be resolved as a brivla
             (see "CC...", above). For example, /so'iprEnu/ becomes /so'i/
             = cmavo + /prEnu/ = brivla.
          b] Otherwise (i.e. there is a stress on the final vowel of the
             V'V, or the first consonant cluster is not a valid initial
             cluster), resolve the whole thing as a brivla (e.g. /cA'Itro/
             = brivla).
        5] CVCC... (This is the hard one. Is the front CV a separate word?):
          a] If the whole piece is CVCCV, then the whole thing resolves as a
             gismu.
          b] If the consonant cluster beginning with the CC is not a valid
             initial cluster (i.e., not every adjacent pair of consonants
             is a valid initial pair), then the whole piece can be resolved
             as a brivla (gismu, lujvo, or le'avla). For example,
             /selfArlu/, /cidjrspagEti/.
          c] If the penultimate stress is on the first vowel of the CVCC
             (e.g. /mAtcti/), then resolve the whole thing as a brivla (a
             lujvo or le'avla).
          d] If there is a "y", we need to look at the sub-piece up to the
             first "y":
            1> If the sub-piece consists entirely of CVC's repeating (at
               least 2 needed: e.g. /cacric/), and all the CC's of the
               sub-piece are valid initial clusters, then resolve the
               initial CV as a cmavo, and the rest of the whole piece is a
               brivla (a lujvo or le'avla).
            2> Otherwise, if the sub-piece can be broken down into any
               number (including 0) of valid lujvo "front-middles" in front
               and exactly one valid lujvo "end" thereafter, resolve the
               whole piece as a brivla.
              a> Valid front-middles (we've eliminated all but those
                 starting with CV): CVC CVV CV'V CCV
              b> Valid ends: CVC CCVC CVCC
            3> Otherwise, the front CV should be resolved as a cmavo, and
               the remaining piece is resolved as a brivla (a lujvo or
               le'avla).
          e] If there is no "y":
            1> If the piece consists of CVC's repeating (at least 2 needed)
               up to a final CV (e.g. /cacricfu/), and all the CC's of the
               piece are valid initial clusters, then resolve the initial
               CV as a cmavo, and the rest of the piece is a brivla (a
               lujvo).
            2> Otherwise, if the piece can be broken down into any number
               (including 0) of valid lujvo "front-middles" in front and
               exactly one valid lujvo "end", then resolve the whole piece
               as a brivla (a lujvo).
              a> Valid front-middles (we've eliminated all but those
                 starting with CV): CVC CVV CV'V CCV
              b> Valid ends: CVV CV'V CCV CCVCV CVCCV
            3> Otherwise, the front CV should be resolved as a cmavo, and
               the remaining piece is resolved as a brivla (a le'avla).
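
The "front-middles plus one end" test of rule 2> for the no-"y" case can be
sketched as a small recursive check on the consonant/vowel shape of the
piece. The names shape and is_lujvo_shape are hypothetical. The
front-middle forms are the ones listed under d] above (CVC, CVV, CV'V,
CCV), assumed here to apply to the no-"y" case as well. The sketch covers
only rule 2>; rules a] through c] and rule 1> must be applied first.

```python
CONS = set("bcdfgjklmnprstvxz")
FRONT_MIDDLES = ("CVC", "CVV", "CV'V", "CCV")
ENDS_NO_Y = ("CVV", "CV'V", "CCV", "CCVCV", "CVCCV")


def shape(piece):
    """Map a "y"-free piece to its pattern of C, V and apostrophes."""
    out = []
    for ch in piece:
        if ch in CONS:
            out.append("C")
        elif ch == "'":
            out.append("'")
        elif ch.lower() in "aeiou":
            out.append("V")
        else:
            raise ValueError("unexpected letter " + repr(ch))
    return "".join(out)


def is_lujvo_shape(s):
    """True if s is zero or more front-middles followed by exactly one end."""
    if s in ENDS_NO_Y:
        return True
    return any(
        s.startswith(f) and is_lujvo_shape(s[len(f):]) for f in FRONT_MIDDLES
    )
```

For instance, the shape CVCCVCCV splits as CVC + CVCCV and passes, while
CVCCVCV (the shape of a made-up piece such as "lebrIju") does not, so its
front CV would be split off as a cmavo under rule 3>.
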
        6] Any other beginning (e.g. CVVCyC):
           Resolve the whole as an error.
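
To tie the simpler cases of step 4b together, here is a hedged Python
sketch that dispatches on the start of the piece. It handles cases 1]
through 4] and 6], and deliberately leaves the hard case 5] (CVCC...)
unimplemented. The function names are hypothetical, stress is assumed to be
marked with upper-case vowels, and the table of valid initial pairs is an
assumption taken from the standard Lojban list rather than from this
document.

```python
import re

C = "bcdfgjklmnprstvxz"
CONS = set(C)
# Assumed table of valid initial pairs (not taken from this document).
INITIAL_PAIRS = set(
    "bl br cf ck cl cm cn cp cr ct dj dr dz fl fr gl gr jb jd jg jm jv "
    "kl kr ml mr pl pr sf sk sl sm sn sp sr st tc tr ts vl vr xl xr "
    "zb zd zg zm zv".split()
)


def shape(piece):
    """Map a piece to its pattern of C, V, y and apostrophes."""
    return "".join(
        "C" if ch in CONS else ch if ch in "y'" else "V" for ch in piece
    )


def valid_initial_cluster(piece, start):
    run = re.match(f"[{C}]+", piece[start:]).group()
    return all(run[i:i + 2] in INITIAL_PAIRS for i in range(len(run) - 1))


def resolve_start(piece):
    """Return (cmavo_or_None, brivla) for a consonant-initial piece."""
    s = shape(piece)
    if s.startswith(("CC", "CVCyC")):
        return None, piece  # case 1]
    if s.startswith("CVVCC"):  # case 3]: any stress on the VV counts
        n, stressed = 3, any(ch.isupper() for ch in piece[1:3])
    elif s.startswith("CV'VCC"):  # case 4]: stress on the final vowel counts
        n, stressed = 4, piece[3].isupper()
    elif s.startswith("CVCC"):
        raise NotImplementedError("case 5] needs the lujvo analysis")
    else:
        raise ValueError("invalid beginning")  # cases 2] and 6]
    if not stressed and valid_initial_cluster(piece, n):
        return piece[:n], piece[n:]
    return None, piece
```

With the examples from the text, resolve_start("leiprEnu") returns
("lei", "prEnu"), while "cAItro" and "cA'Itro" come back as whole brivla.
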