Types for text objects. As text objects are represented by nodes in Text-Fabric, we use the terms object and node interchangeably.
|word||slot||single word, fills a slot; sometimes words are not separated by a space|
|lex||--||lexeme, contains all slots occupied by its occurrences|
|subphrase||functional||part of a phrase|
|phrase||functional||phrase, maybe with gaps|
|phrase_atom||distributional||maximal consecutive part of a phrase|
|clause||functional||clause, maybe with gaps|
|clause_atom||distributional||maximal consecutive part of a clause|
|sentence||functional||sentence, maybe with gaps|
|sentence_atom||distributional||maximal consecutive part of a sentence|
|half_verse||section||main division of the verse, usually into two, sometimes into three parts|
|verse||section||numbered unit of a chapter|
|chapter||section||numbered unit of a book|
|book||section||named part of the Bible|
All objects have a type, which is just a label. Objects and their slots are represented in Text-Fabric as nodes. The information about which object occupies which slots is stored in the edge feature oslots.
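As a rough illustration of the node and slot model described above, here is a minimal sketch in plain Python. It does not use the Text-Fabric API: the node numbers, the toy `otype` and `oslots` dictionaries and the helper `objects_at` are all invented for this example.

```python
# Toy model, NOT the Text-Fabric API: node numbers and data are invented.
# Nodes 1..4 are slots (words); higher node numbers are non-slot objects.
otype = {
    1: "word", 2: "word", 3: "word", 4: "word",
    5: "phrase", 6: "phrase", 7: "clause",
}

# For each non-slot node, the set of slots it occupies (the role of oslots).
oslots = {5: {1, 2}, 6: {3, 4}, 7: {1, 2, 3, 4}}


def objects_at(slot):
    """Return the non-slot nodes that occupy the given slot, in node order."""
    return sorted(node for node, slots in oslots.items() if slot in slots)


if __name__ == "__main__":
    # List every object that contains slot 3, together with its type.
    for node in objects_at(3):
        print(node, otype[node])
```

Storing only the slots per object keeps the model simple: embedding between objects follows from comparing slot sets and need not be stored separately.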
|Section types||division in books, chapters, etc|
|Word type||all about the individual words|
|Linguistic types||phrases, clauses, etc|
The section types correspond to the various divisional units in the Bible.
The Hebrew Bible is divided into books, books are divided into chapters, chapters into verses, and verses into half-verses.
The sectional types book, chapter, verse and half_verse
specify features which indicate which book, chapter, verse or half-verse their objects refer to.
A book object carries the book feature, which contains the name of the book.
A chapter object carries the chapter feature, which contains the number of the chapter.
It also carries the book feature to indicate the book of which it is a chapter.
A verse object carries the verse feature, which contains the number of the verse,
and the book and chapter features.
A verse object also carries label, which contains a label string indicating the passage.
A half_verse object carries only the half_verse feature, which contains a key for the half-verse.
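The feature assignments above can be summarised in a small sketch. It is plain Python with toy data, not the Text-Fabric API. The helper `reference` and the output format it produces are assumptions made for illustration; in particular, they do not reproduce the format of the label feature.

```python
# Which features each section type carries, as listed in the text above.
SECTION_FEATURES = {
    "book": {"book"},
    "chapter": {"book", "chapter"},
    "verse": {"book", "chapter", "verse", "label"},
    "half_verse": {"half_verse"},
}


def reference(otype, features):
    """Build a readable reference from a toy object's features.

    Hypothetical helper: it first checks that the object carries every
    feature its section type is expected to have.
    """
    missing = SECTION_FEATURES[otype] - set(features)
    if missing:
        raise ValueError(f"{otype} object lacks features: {sorted(missing)}")
    if otype == "book":
        return features["book"]
    if otype == "chapter":
        return f"{features['book']} {features['chapter']}"
    if otype == "verse":
        return f"{features['book']} {features['chapter']}:{features['verse']}"
    return str(features["half_verse"])


if __name__ == "__main__":
    # Toy verse object; the label value is a placeholder, not real data.
    toy_verse = {"book": "Genesis", "chapter": 1, "verse": 1, "label": "toy label"}
    print(reference("verse", toy_verse))
```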
There is only one type for words: word.
Word objects correspond to the smallest divisional units in the BHSA dataset.
They are also identified with slots, because each slot is filled by a word and each word fills a slot.
Words are not identified with strings, because there are various
string representations of the words, none of which is canonical. All word occurrences are numbered
with a slot number.
There are many features that come in related forms, distinguished by a prefix or suffix.
g_ versions have graphical values, meaning that they contain the pointing,
i.e. all diacritics that occur in the full text.
For the purpose of this documentation, we shall use the contrast consonantal (without diacritics)
and pointed (with diacritics).
_utf8 versions contain Unicode representations of the values, using the Hebrew code block.
Versions without _utf8 contain ASCII representations of the values, i.e. a transliteration.
The text of a word occurrence is in g_word (pointed, transliterated) and g_word_utf8 (pointed, Hebrew), g_cons (consonantal, transliterated) and g_cons_utf8 (consonantal, Hebrew). None of these features contains material from in between words. In order to get inter-word material, use trailer_utf8.
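To make the role of the inter-word material concrete, here is a minimal sketch in plain Python. It does not call Text-Fabric; the word list holds toy values chosen for illustration (not copied from the dataset), and `full_text` is a hypothetical helper.

```python
# Toy data: illustrative values, not copied from the BHSA.
# Each word has a consonantal Hebrew form and the material that follows it.
words = [
    {"g_cons_utf8": "ב", "trailer_utf8": ""},       # prefixed word, no space after it
    {"g_cons_utf8": "ראשית", "trailer_utf8": " "},
    {"g_cons_utf8": "ברא", "trailer_utf8": " "},
]


def full_text(word_list, feature="g_cons_utf8"):
    """Concatenate each word's text with the inter-word material after it."""
    return "".join(w[feature] + w["trailer_utf8"] for w in word_list)


if __name__ == "__main__":
    print(full_text(words))
```

Joining the word features with a fixed space instead would wrongly separate words that are written together, which is why the trailer is taken per word.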
Word occurrences correspond to lexemes, i.e. dictionary entries, for which we have a separate object type. For the textual representation of lexemes we have a variety of features, in order to get their consonantal values:
|lex0||transcription without disambiguation characters at the end|
or their vocalized values:
The type lex corresponds to lexemes. A lexeme object occupies the slots of all its occurrences.
It does not fit into the hierarchy, because these objects will very rarely lie embedded in another object,
except when a lexeme is rare.
Have a look at start to see how you could exploit this object type to find lexemes that are unique to books or chapters very easily.
Precisely because lexemes are not embedded in other object types, their use in MQL queries is limited. In Text-Fabric there are no problems. See the note in gloss.
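The idea of finding lexemes that are unique to a book can be sketched with plain sets. This is a toy model rather than Text-Fabric code: the book names, lexeme names, slot numbers and the helper `lexemes_unique_to` are all invented for illustration. A lexeme is unique to a book exactly when the slots of all its occurrences lie inside that book's slots.

```python
# Toy data: books, lexemes and slot numbers are invented for illustration.
book_slots = {
    "BookA": set(range(1, 6)),    # slots 1..5
    "BookB": set(range(6, 11)),   # slots 6..10
}

# A lexeme object occupies the slots of all its occurrences.
lex_slots = {
    "lexeme_x": {1, 7},   # occurs in both books
    "lexeme_y": {2, 3},   # occurs only in BookA
    "lexeme_z": {9},      # occurs only in BookB
}


def lexemes_unique_to(book):
    """Return the lexemes all of whose occurrences fall inside the book."""
    slots = book_slots[book]
    return sorted(lex for lex, occ in lex_slots.items() if occ <= slots)


if __name__ == "__main__":
    for book in book_slots:
        print(book, lexemes_unique_to(book))
```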
Linguistic types correspond to syntactical entities such as sentences, clauses and phrases.
The BHSA distinguishes between functional and distributional variants of them.
The functional object types are sentence, clause and phrase.
They correspond to possibly discontinuous stretches of text that function as a unit.
The distributional object types are sentence_atom, clause_atom and phrase_atom.
They are continuous stretches of text within their functional counterparts.
So the functional objects consist of sequences of the corresponding distributional objects, and any gaps in
the functional object fall neatly between their distributional atoms.
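The notion of a "maximal consecutive part" can be illustrated with a short sketch. It is an assumption-laden toy: in the dataset the atoms are encoded as objects in their own right and are not computed this way, and the helper `consecutive_runs` and the slot numbers are invented. The sketch only shows how a functional object with a gap falls apart into consecutive stretches.

```python
def consecutive_runs(slots):
    """Split a collection of slot numbers into maximal runs of consecutive slots."""
    runs = []
    for slot in sorted(set(slots)):
        if runs and slot == runs[-1][-1] + 1:
            # The slot directly follows the previous one: extend the current run.
            runs[-1].append(slot)
        else:
            # There is a gap (or this is the first slot): start a new run.
            runs.append([slot])
    return runs


if __name__ == "__main__":
    # A toy phrase occupying slots 3-4 and 7-9, with a gap at slots 5-6.
    phrase_slots = {3, 4, 7, 8, 9}
    print(consecutive_runs(phrase_slots))
```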
Note by Cody Kingham (on the etcbc-vu slack)
If you are looking for a sort of neat and tidy definition of what constitutes a “phrase” or “clause” in the ETCBC, you will probably come away disappointed. In its database methodology, the ETCBC purposely avoided strict linguistic definitions and sought to build up phrase and clause boundaries with a bottom-up method.
There are a handful of helpful formal rules that were discovered and integrated into the programs. For instance, one rule used by the data creation programs for detecting clause endings is to examine parts of speech on either side of a waw conjunction. If the part of speech to the left of the conjunction was different than the one to the right, it likely indicates a clause boundary. For both clause and phrase segmentation, there is a kind of default list of part of speech patterns called a phrase set. As new patterns are found in the text during an encoding, they were added to the phrase set to be utilized in the next analysis.
But with all of that said, here is my best try at summarizing a kind of definition of clauses and phrases for the etcbc: Clauses and phrases are functional linguistic units made up of their distributional parts, i.e. atoms, which are themselves recognizable through regular patterns in the language that can be detected through computer-assisted cataloguing and analysis.
The most comprehensive and informative summary on how clause/phrases are defined and identified in the ETCBC is Eep Talstra 2003 Text segmentation and linguistic levels - Preparing data for SESB.
Cody Kingham (Slack message)
More explanation is needed about the distributional and functional object hierarchies and how they hang together.
* Is subphrase functional or distributional?
* Are atoms always maximal continuous stretches, or can you have two adjacent atoms of the same type?
See the AtomsAndMothers notebook which makes some basic explorations into these matters.
If you are writing an MQL query, there is no feature as such in which the type is stored. Rather, you refer to the type when you write the building blocks such as
[clause_atom [phrase ]].
The otype feature has the same values as the possible names of the MQL blocks.
In Text-Fabric we have developed a new way of querying. Read more in search.