Media Resources

Even before I started this project, when I was first learning Hebrew, I was able to get audio recordings and text for the Tanakh on the internet.

Text

Hebrew text is available from a number of online sources, and the Sefaria library is one of the best. For this project, though, I took the text from Biblehub, because its "Translation Table" layout is convenient and fits the read-along process well.

Audio

I obtained the audio from Talking Bibles International and cut out the parts that have no counterpart in the text, such as the book and chapter introductions.

Lexicon & Dictionary

Biblehub

Biblehub makes its translation tables available for download. They contain all of the information needed for the word-by-word matching in the read-along process, including a phonetic transliteration of each word, as in the sample below:

wə·ḏe·reḵ

which has to be mapped to the Montreal Forced Aligner's phonetic system:

W AE1 D EH0 R EH0 JH

For this, I wrote a Python script that does the "translation" using the following equivalences, which I worked out through manual testing:

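# Maps Biblehub transliteration symbols to ARPAbet-style phones
# for the Montreal Forced Aligner.
# chr(8216) is ‘, chr(8217) is ’, chr(703) is ʿ, and chr(7713) is ḡ.
PHONETIC_EQUIVALENCES = {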
   "a": "AH0",
   chr(8216) + "a": "AH1",
   chr(8217) + "a": "AH2",
   "ā": "AA0",
   chr(8216) + "ā": "AA1",
   chr(8217) + "ā": "AA2",
   "å": "AO0",

   chr(703) + "a": "AY0",
   chr(703) + "ă": "AY1",
   chr(703) + "å": "AY2",

   "b": "B",
   "ḇ": "V",

   "d": "D",
   "ḏ": "D",
   "ṣ": "DH",

   "ə": "AE0",
   chr(8216) + "ə": "AE1",
   "ɛ": "AE2",

   "e": "EH0",
   chr(8216) + "e": "EH1",
   chr(8217) + "e": "EH2",
   "ê": "EY0",
   chr(8216) + "ê": "EY1",
   chr(8217) + "ê": "EY2",
   "ĕ": "ER0",
   chr(8216) + "ĕ": "ER1",
   chr(8217) + "ĕ": "ER2",
   "ē": "EH0",

   "g": "G",
   chr(7713): "G",

   "h": "HH",
   "ḥ": "JH",

   "i": "IH0",
   chr(8217) + "i": "IH1",
   chr(8216) + "i": "IH2",
   "î": "IY0",
   chr(8217) + "î": "IY1",
   chr(8216) + "î": "IY2",
   "ī": "IH0",

   "k": "K",
   "ḵ": "JH",

   "l": "L",
   "m": "M",
   "n": "N",

   "o": "OY0",
   chr(8216) + "o": "OY1",
   chr(8217) + "o": "OY2",
   "ō": "OW0",
   chr(8216) + "ō": "OW1",
   chr(8217) + "ō": "OW2",

   "p": "P",
   "q": "K",
   "r": "R",

   "s": "S",
   "ś": "S",
   "š": "SH",

   "t": "T",
   "ṯ": "TH",
   "ṭ": "T",

   "u": "UH0",
   chr(8216) + "u": "UH1",
   chr(8217) + "u": "UH2",
   "ū": "UW0",
   chr(8216) + "ū": "UW1",
   chr(8217) + "ū": "UW2",

   "w": "W",
   "y": "IY0",
   "z": "Z",
}
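
The script itself is not reproduced here, so the following is only a minimal sketch of how such a table could be applied, assuming a greedy longest-match scan. Longest-match matters because a quote mark followed by a vowel has to be read as one two-character key. The names SAMPLE_EQUIVALENCES, SYLLABLE_SEPARATORS and to_phones are hypothetical, and the table is a small excerpt of the one above. With the table as listed, a plain ə maps to AE0, so the stress digit on that vowel differs from the hand-written sample shown earlier. How the original script assigns stress is not shown.

```python
import unicodedata

# Small excerpt of the equivalence table, for illustration only.
SAMPLE_EQUIVALENCES = {
    "w": "W",
    "ə": "AE0",
    chr(8216) + "ə": "AE1",
    "ḏ": "D",
    "e": "EH0",
    "r": "R",
    "ḵ": "JH",
}

# Characters that only mark syllable boundaries (an assumption).
SYLLABLE_SEPARATORS = {"·", "-"}


def to_phones(word, table):
    """Convert a transliterated word to a list of phones by greedy longest match."""
    # Normalize keys and input so precomposed and combining forms compare equal.
    table = {unicodedata.normalize("NFC", k): v for k, v in table.items()}
    text = unicodedata.normalize("NFC", word)
    longest = max(len(k) for k in table)
    phones = []
    i = 0
    while i < len(text):
        if text[i] in SYLLABLE_SEPARATORS:
            i += 1
            continue
        # Try the longest candidate first so quote + vowel wins over the bare vowel.
        for size in range(longest, 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                phones.append(table[chunk])
                i += len(chunk)
                break
        else:
            raise KeyError(f"no equivalence for {text[i]!r} in {word!r}")
    return phones


print(" ".join(to_phones("wə·ḏe·reḵ", SAMPLE_EQUIVALENCES)))
# prints: W AE0 D EH0 R EH0 JH
```

Raising on an unknown symbol, rather than skipping it silently, makes gaps in the table show up during manual testing.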

The Translation Tables were available as an Excel file of over 40 MB that Google Sheets and OpenOffice could not open, so I used a Python script to split it into JSON files.
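
The splitting script is not shown either. As a rough sketch, a large workbook can be streamed row by row and written out in fixed-size chunks. Everything here is an assumption made for illustration: the use of the third-party openpyxl package, a header row on the first worksheet, the chunk size, the output file names, and the function names chunk_rows, write_json_chunks and split_workbook. The real files may well have been split differently, for example by book.

```python
import json
from itertools import islice
from pathlib import Path


def chunk_rows(rows, size):
    """Yield lists of at most `size` rows from any iterable of rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def write_json_chunks(header, rows, out_dir, size=5000):
    """Write rows as numbered JSON files, each holding a list of dicts."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, chunk in enumerate(chunk_rows(rows, size)):
        records = [dict(zip(header, row)) for row in chunk]
        path = out_dir / f"part_{index:04d}.json"
        # ensure_ascii=False keeps Hebrew and transliteration characters readable.
        path.write_text(
            json.dumps(records, ensure_ascii=False, default=str),
            encoding="utf-8",
        )
        paths.append(path)
    return paths


def split_workbook(xlsx_path, out_dir, size=5000):
    """Stream the first worksheet of an .xlsx file into JSON chunks."""
    from openpyxl import load_workbook  # third-party; imported lazily

    # read_only mode avoids loading the whole workbook into memory.
    workbook = load_workbook(xlsx_path, read_only=True)
    try:
        sheet = workbook.worksheets[0]
        rows = sheet.iter_rows(values_only=True)
        header = [str(cell) for cell in next(rows)]  # assumes a header row
        return write_json_chunks(header, rows, out_dir, size)
    finally:
        workbook.close()
```

Keeping the chunking separate from the workbook reader means the same writer can be reused if the rows come from another source, such as a CSV export.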

The OSHB Hebrew Lexicon

I considered using data from the OSHB Hebrew Lexicon, which is licensed under the Creative Commons Attribution 4.0 International license. However, its pronunciation data covers only the entries in its dictionary, which leaves out many of the conjugated forms that appear in the Tanakh. Biblehub provides the complete information.

Forced Alignment