title: `otype`¶

node type

Types for text objects. As text objects are represented by nodes in Text-Fabric, we shall use both object and node without much consistency.

type	kind	description
`word`	slot	single word, fills a slot; sometimes words are not separated by a space
`lex`	--	lexeme, contains all slots of occupied by its occurrences
`subphrase`	functional	part of a phrase
`phrase`	functional	phrase, maybe with gaps
`phrase_atom`	distributional	maximal consecutive part of a phrase
`clause`	functional	clause, maybe with gaps
`clause_atom`	distributional	maximal consecutive part of a clause
`sentence`	functional	clause, maybe with gaps
`sentence_atom`	distributional	maximal consecutive part of a sentence
`half_verse`	section	main division of the verse, usually into two, sometimes into three parts
`verse`	section	numbered unit of a chapter
`chapter`	section	numbered unit of a book
`book`	section	named part of the Bible

All objects have a type, which is just a label. Objects and their slots are represented in Text-Fabric as nodes. The information which object occupies which slot is stored in the edge feature oslots.

type	description
Section types	division in books, chapters, etc
Word type	all about the individual words
Linguistic types	phrases, clauses, etc

Section types¶

The section types correspond to the various divisional units in the Bible. The Hebrew Bible is divided in books, books are divided in chapters, chapters are divided in verses, and verses in half-verses. The sectional types book, chapter, verse, and half_verse specify features which indicate which book, chapter, verse, half-verse their objects refer to.

A book object carries the book feature, which contains the name of the book. A chapter object carries the chapter feature, which contains the number of the chapter. It carries also the book feature to indicate the book of which it is a chapter. Analogously, the verse object carries the verse feature, which contains the number of the chapter, and the book and chapter features. Additionally, the verse object also carries label, which contains a label string indicating the passage. However, the half_verse object only carries the half_verse feature, which contains a key for the half-verse.

Word type¶

There is only one type for words, the word type. Word objects correspond to the smallest divisional units in the BHSA dataset. They are also identified with slots, because each slot is filled by a word and each word fills a slot. Words are not identified with strings, because there are various string representations of the words, none of which is canonical. All word occurrences are numbered with a slot number.

There are many features that have related forms, e.g. vbe, g_vbe and g_vbe_utf8. The g_ versions have graphical values, meaning that it contains the pointing, i.e. all diacritics that occur in the full text. For the purpose if this documentation, we shall use the contrast consonantal (without diacritics) and pointed (with diacritics). The _utf8 versions contain UNICODE representations of the values, using the Hebrew code block. The non _utf8 versions contain ASCII representations of the values, according to the BHSA transliteration.

The text of a word occurrence is in g_word (pointed, transliterated) and g_word_utf8 (pointed, Hebrew), g_cons (consonantal, transliterated) and g_cons_utf8 (consonantal, Hebrew). None of these features contains material from in between words. In order to get inter-word material, use trailer_utf8.

Word occurrences corresponds to lexemes, i.e. dictionary entries, for which we have a separate object type. For the textual representation of lexemes we have a variety of features, in order to get their consonantal values:

code	description
`lex`	transcription
`lex0`	transcription without disambiguation characters at the end
`lex_utf8`	Hebrew

or their vocalized values:

code	description
`g_lex`	transcription
`g_lex_utf8`	Hebrew

Lexeme type¶

The type lex corresponds to lexemes. A lexeme object occupies the slots of all its occurrences. It does not fit into the hierarchy, because these objects will very rarely lie embedded in another object. Except if a lexeme is rare.

Hint¶

Have a look at start. so see how you could exploit this object type to find lexemes that are unique to books or chapters very easily.

Caution¶

Precisely because of the non-embedding of lexemes in other object types, its use in MQL queries is limited. In Text-Fabric there are no problems. See the note in gloss.

Linguistic types¶

Linguistic types correspond to syntactical entities such as sentences, clauses and phrases. The BHSA distinguishes between functional and distributional variants of them. The functional object types are sentence, clause, and phrase. They correspond to possibly discontinuous stretches of text that function as a unit. The distributional object types are sentence_atom, clause_atom, and phrase_atom. They are continuous stretches of text within their functional counterparts. So the functional objects consist of sequences of the corresponding distributional objects, and any gaps in the functional object fall neatly between their distributional atoms.

Note by Cody Kingham (on the ETCBC-VU slack)¶

If you are looking for a sort of neat and tidy definition of what constitutes a “phrase” or “clause” in the ETCBC, you will probably come away disappointed. In its database methodology, the ETCBC purposely avoided strict linguistic definitions and sought to build up phrase and clause boundaries with a bottom-up method. There are a handful of helpful formal rules that were discovered and integrated into the programs. For instance, one rule used by the data creation programs for detecting clause endings is to examine parts of speech on either side of a waw conjunction. If the part of speech to the left of the conjunction was different than the one to the right, it likely indicates a clause boundary. For both clause and phrase segmentation, there is a kind of default list of part of speech patterns called a phrase set. As new patterns are found in the text during an encoding, they were added to the phrase set to be utilized in the next analysis. But with all of that said, here is my best try at summarizing a kind of definition of clauses and phrases for the ETCBC: Clauses and phrases are functional linguistic units made up of their distributional parts, i.e. atoms, which are themselves recognizable through regular patterns in the language that can be detected through computer-assisted cataloguing and analysis. The most comprehensive and informative summary on how clause/phrases are defined and identified in the ETCBC is Eep Talstra 2003 Text segmentation and linguistic levels - Preparing data for SESB. Cody Kingham (Slack message)

Note¶

More explanation needed about the distributional and functional objects hierarchies and how they hang together.

Is subphrase functional or distributional?
Are atoms always maximal continuous stretches, or can you have two adjacent atoms of the same type?

See the AtomsAndMothers notebook which makes some basic explorations into these matters.

Note¶

If you are writing an MQL query, there is not a feature as such in which the type is stored. Rather you refer to the type when you write the building blocks such as [word ...] or [clause_atom [phrase ]].

The otype feature has the same values as the possible names of the MQL blocks.

Hint¶

In Text-Fabric we have developed a new way of querying. Read more in search.

title: otype¶

Section types¶

Word type¶

Lexeme type¶

Hint¶

Caution¶

Linguistic types¶

Note by Cody Kingham (on the ETCBC-VU slack)¶

Note¶

Note¶

Hint¶

title: `otype`¶