title: otype
¶
node type
Types for text objects. As text objects are represented by nodes in Text-Fabric, we shall use both object and node without much consistency.
type | kind | description |
---|---|---|
word |
slot | single word, fills a slot; sometimes words are not separated by a space |
lex |
-- | lexeme, contains all slots of occupied by its occurrences |
subphrase |
functional | part of a phrase |
phrase |
functional | phrase, maybe with gaps |
phrase_atom |
distributional | maximal consecutive part of a phrase |
clause |
functional | clause, maybe with gaps |
clause_atom |
distributional | maximal consecutive part of a clause |
sentence |
functional | clause, maybe with gaps |
sentence_atom |
distributional | maximal consecutive part of a sentence |
half_verse |
section | main division of the verse, usually into two, sometimes into three parts |
verse |
section | numbered unit of a chapter |
chapter |
section | numbered unit of a book |
book |
section | named part of the Bible |
All objects have a type, which is just a label.
Objects and their slots are represented in Text-Fabric as nodes.
The information which object occupies which slot is stored in the edge feature
oslots
.
type | description |
---|---|
Section types | division in books, chapters, etc |
Word type | all about the individual words |
Linguistic types | phrases, clauses, etc |
Section types¶
The section types correspond to the various divisional units in the Bible.
The Hebrew Bible is divided in books, books are divided in chapters, chapters
are divided in verses, and verses in half-verses.
The sectional types
book
, chapter
, verse
, and half_verse
specify features which indicate which book, chapter, verse, half-verse their
objects refer to.
A book
object carries the book
feature, which contains the name
of the book.
A chapter
object carries the chapter
feature, which contains
the number of the chapter.
It carries also the book
feature to indicate the book of which it
is a chapter.
Analogously, the verse
object carries the verse
feature, which
contains the number of the chapter, and the book
and
chapter
features.
Additionally, the verse
object also carries label
, which
contains a label string indicating the passage.
However, the half_verse
object only carries the half_verse
feature, which contains a key for the half-verse.
Word type¶
There is only one type for words, the word
type.
Word objects correspond to the smallest divisional units in the BHSA dataset.
They are also identified with slots, because each slot is filled by a word
and each word fills a slot.
Words are not identified with strings, because there are various
string representations of the words, none of which is canonical. All word
occurrences are numbered with a slot number.
There are many features that have related forms, e.g. vbe
, g_vbe
and g_vbe_utf8
.
The g_
versions have graphical values, meaning that it contains the pointing,
i.e. all diacritics that occur in the full text.
For the purpose if this documentation, we shall use the contrast consonantal
(without diacritics) and pointed (with diacritics).
The _utf8
versions contain UNICODE representations of the values, using the
Hebrew code block.
The non _utf8
versions contain ASCII representations of the values, according to the
BHSA transliteration.
The text of a word occurrence is in
g_word
(pointed, transliterated) and
g_word_utf8
(pointed, Hebrew),
g_cons
(consonantal, transliterated) and
g_cons_utf8
(consonantal, Hebrew).
None of these features contains material from in between words.
In order to get inter-word material, use
trailer_utf8
.
Word occurrences corresponds to lexemes, i.e. dictionary entries, for which we have a separate object type. For the textual representation of lexemes we have a variety of features, in order to get their consonantal values:
code | description |
---|---|
lex |
transcription |
lex0 |
transcription without disambiguation characters at the end |
lex_utf8 |
Hebrew |
or their vocalized values:
code | description |
---|---|
g_lex |
transcription |
g_lex_utf8 |
Hebrew |
Lexeme type¶
The type lex
corresponds to lexemes. A lexeme object occupies the slots of
all its occurrences.
It does not fit into the hierarchy, because these objects will very rarely lie
embedded in another object. Except if a lexeme is rare.
Hint¶
Have a look at
start
. so see how you could exploit this object type to find lexemes that are unique to books or chapters very easily.
Caution¶
Precisely because of the non-embedding of lexemes in other object types, its use in MQL queries is limited. In Text-Fabric there are no problems. See the note in
gloss
.
Linguistic types¶
Linguistic types correspond to syntactical entities such as sentences, clauses
and phrases.
The BHSA distinguishes between functional and distributional variants of them.
The functional object types are sentence
, clause
, and phrase
.
They correspond to possibly discontinuous stretches of text that function as a unit.
The distributional object types are sentence_atom
, clause_atom
, and phrase_atom
.
They are continuous stretches of text within their functional counterparts.
So the functional objects consist of sequences of the corresponding
distributional objects, and any gaps in the functional object fall neatly
between their distributional atoms.
Note by Cody Kingham (on the ETCBC-VU slack)¶
If you are looking for a sort of neat and tidy definition of what constitutes a “phrase” or “clause” in the ETCBC, you will probably come away disappointed. In its database methodology, the ETCBC purposely avoided strict linguistic definitions and sought to build up phrase and clause boundaries with a bottom-up method. There are a handful of helpful formal rules that were discovered and integrated into the programs. For instance, one rule used by the data creation programs for detecting clause endings is to examine parts of speech on either side of a waw conjunction. If the part of speech to the left of the conjunction was different than the one to the right, it likely indicates a clause boundary. For both clause and phrase segmentation, there is a kind of default list of part of speech patterns called a phrase set. As new patterns are found in the text during an encoding, they were added to the phrase set to be utilized in the next analysis. But with all of that said, here is my best try at summarizing a kind of definition of clauses and phrases for the ETCBC: Clauses and phrases are functional linguistic units made up of their distributional parts, i.e. atoms, which are themselves recognizable through regular patterns in the language that can be detected through computer-assisted cataloguing and analysis. The most comprehensive and informative summary on how clause/phrases are defined and identified in the ETCBC is Eep Talstra 2003 Text segmentation and linguistic levels - Preparing data for SESB. Cody Kingham (Slack message)
Note¶
More explanation needed about the distributional and functional objects hierarchies and how they hang together.
- Is
subphrase
functional or distributional? - Are atoms always maximal continuous stretches, or can you have two adjacent atoms of the same type?
See the AtomsAndMothers
notebook
which makes some basic explorations into these matters.
Note¶
If you are writing an MQL query, there is not a feature as such in which the type is stored. Rather you refer to the type when you write the building blocks such as
[word ...]
or[clause_atom [phrase ]]
.
The otype
feature has the same values as the possible names of the MQL blocks.
Hint¶
In Text-Fabric we have developed a new way of querying. Read more in search.