One result... several questions
At this point we have tokenized the whole text of the “World Messianic Bible” and have a first report of the most used words that are also semantically valuable: “LORD”, “shall” and “said” top the list of 2362 unique word tokens found.
The API service that provided us with the text lists 23 available English versions of the Bible, and it claims to provide 2500 versions in over 1600 languages.
I wonder how the token list will vary depending on the version we use, and even across other languages. I’m particularly interested in checking whether NLTK provides support for Hebrew.
Another thing we can get is the most used words in each book, which could give us some insight into each book’s main topic.
For the word-token frequency by book, we can run:
SELECT book, token, COUNT(*) AS 'count' FROM wordtokens GROUP BY book, token ORDER BY 1, 3 DESC;
This gives us the following results, shown here as the three most frequent tokens per book (f_e_print_word_frequency_by_book.py):
1CH sons(303) David(181) LORD(172)
1CO things(76) Messiah(66) Lord(66)
1JN know(36) us(35) love(35)
1KI LORD(252) said(207) Israel(194)
1PE Messiah(19) may(14) also(13)
1SA LORD(309) Saul(288) David(281)
1TH Lord(25) Yeshua(17) us(16)
1TI good(22) faith(17) Messiah(16)
2CH king(248) house(194) God(181)
2CO Messiah(47) also(45) us(43)
2JN Father(4) us(3) teaching(3)
2KI LORD(276) said(221) Israel(149)
2PE things(12) day(10) Yeshua(9)
2SA David(274) said(230) LORD(142)
2TH God(19) Yeshua(12) us(9)
2TI Yeshua(14) Messiah(14) God(13)
3JN brothers(3) God(3) write(2)
ACT said(149) Paul(126) Lord(98)
AMO says(46) Israel(26) house(25)
COL God(22) things(20) also(15)
DAN Daniel(72) God(48) kingdom(47)
DEU LORD(545) God(341) land(179)
ECC God(39) time(38) heart(36)
EPH God(31) one(24) Lord(24)
EST Mordecai(56) Esther(53) Haman(51)
EXO LORD(385) Moses(283) said(199)
EZK Lord(221) says(209) LORD(191)
EZR God(95) king(63) hundred(60)
GAL law(30) God(30) faith(19)
GEN father(260) God(214) land(190)
HAB nations(7) like(7) violence(5)
HAG says(20) Hosts(13) son(10)
HEB things(38) made(30) faith(30)
HOS LORD(37) Israel(37) Ephraim(24)
ISA shall(192) like(162) says(109)
JAS God(17) faith(16) brothers(14)
JDG Israel(172) LORD(169) children(128)
JER says(337) king(226) land(190)
JHN Yeshua(238) therefore(100) answered(77)
JOB man(77) Job(54) know(53)
JOL children(11) great(10) God(10)
JON said(18) Jonah(18) God(14)
JOS children(190) Joshua(155) Israel(153)
JUD ungodly(5) Yeshua(5) God(5)
LAM daughter(20) like(14) day(14)
LEV LORD(284) offering(268) priest(183)
LUK Yeshua(96) son(89) saying(72)
MAL says(27) Hosts(18) say(13)
MAT Yeshua(158) saying(94) came(93)
MIC like(19) come(15) people(12)
MRK Yeshua(84) came(75) went(47)
NAM like(12) away(7) make(5)
NEH son(112) God(71) hundred(49)
NUM LORD(385) children(258) offering(242)
OBA possess(6) LORD(6) Esau(6)
PHM Messiah(6) fellow(4) Paul(4)
PHP things(25) God(22) Yeshua(19)
PRO wicked(75) LORD(74) shall(61)
PSA God(321) shall(183) like(146)
REV earth(76) great(61) like(60)
ROM law(73) Messiah(63) also(60)
RUT Naomi(22) Boaz(20) LORD(18)
SNG beloved(32) love(20) beautiful(13)
TIT things(10) good(10) may(9)
ZEC Hosts(51) says(42) Jerusalem(35)
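A listing like the one above could be produced along these lines. This is only a minimal sketch, not the actual f_e_print_word_frequency_by_book.py: it assumes the tokens live in a SQLite table named wordtokens with book and token columns, and the helper names top_tokens_by_book and format_line are made up for illustration. Ties are broken alphabetically by token, which is also an assumption.

```python
import sqlite3
from itertools import groupby

# Same grouping as the query in the text, with an explicit tie-breaker on token.
QUERY = """
SELECT book, token, COUNT(*) AS n
FROM wordtokens
GROUP BY book, token
ORDER BY book, n DESC, token
"""

def top_tokens_by_book(conn, limit=3):
    """Return {book: [(token, count), ...]} keeping the `limit` most frequent tokens."""
    rows = conn.execute(QUERY).fetchall()
    result = {}
    # Rows arrive sorted by book, so groupby yields one group per book.
    for book, group in groupby(rows, key=lambda row: row[0]):
        result[book] = [(token, n) for _, token, n in list(group)[:limit]]
    return result

def format_line(book, tokens):
    """Render one book in the 'BOOK token(count) token(count)' style used above."""
    return book + " " + " ".join(f"{token}({n})" for token, n in tokens)

# Tiny in-memory stand-in for the real table (hypothetical data, not the real counts).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wordtokens (book TEXT, token TEXT)")
sample = [("RUT", "Naomi")] * 3 + [("RUT", "Boaz")] * 2 + [("JON", "Jonah")] * 2
conn.executemany("INSERT INTO wordtokens VALUES (?, ?)", sample)

for book, tokens in top_tokens_by_book(conn).items():
    print(format_line(book, tokens))
```

Cutting each group to three rows in Python keeps the SQL identical to the query shown earlier; a window function could do the same inside the database, at the cost of a more involved query.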
Next, let’s try another English translation. For that we just need to change the BIBLE_ID parameter and run all our scripts again.
In c_a_parameters_bible.py we put the id for the JPS TaNaKH 1917: bf8f1c7f3f9045a5-01
And we run the following scripts:
python c_b_retrieve_bible_books.py
python d_b_retrieve_books_chapters.py
python e_b_retrieve_chapters_text.py
python f_b_tokenize_documents.py
Which results in the following list of tokens by frequency:
('shall', 7404)
('unto', 6820)
('LORD', 6535)
('thou', 3356)
('thy', 3317)
('said', 2899)
('thee', 2879)
('God', 2654)
('Israel', 2505)
('upon', 2451)
('king', 2410)
('ye', 2314)
('son', 1897)
('land', 1838)
This shows that NLTK does not have archaic words like “thou” or “thy” in its list of stop-words. It does allow for editing that list, though, so it is a solvable issue.
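As a rough sketch of that fix, the stop-word list can be extended with a few archaic words before filtering. Only “thou” and “thy” come from the observation above; the other extra words are picked from the frequency list for illustration, and the helper names load_base_stopwords and remove_stopwords are hypothetical rather than part of f_b_tokenize_documents.py. The fallback list is only there so the example still runs when NLTK or its stopwords corpus is not installed.

```python
# Extra words to drop on top of the base list. "thou" and "thy" are the ones
# noted in the text; "thee", "ye" and "unto" are illustrative additions.
ARCHAIC_STOPWORDS = {"thou", "thy", "thee", "ye", "unto"}

def load_base_stopwords():
    """Return NLTK's English stop-words, or a tiny stand-in list if unavailable."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        # NLTK missing, or the stopwords corpus has not been downloaded.
        return {"the", "and", "of", "to", "a", "in"}

def remove_stopwords(tokens, extra=ARCHAIC_STOPWORDS):
    """Drop tokens whose lowercase form is in the base list or in `extra`."""
    stop = load_base_stopwords() | set(extra)
    return [token for token in tokens if token.lower() not in stop]

tokens = ["Thou", "shalt", "love", "the", "LORD", "thy", "God"]
print(remove_stopwords(tokens))
```

Comparing on the lowercase form while keeping the original token preserves distinctions such as “LORD” versus “Lord” in the final counts.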
Next, a tougher test: Hebrew