Developing an automatic tag-based computational model for text corpus processing

Summary: This PhD thesis presents a detailed analysis of the automatic tagging of text corpora and of various ways to disambiguate words that, depending on context, may belong to more than one grammatical category, relying primarily on the probability theory of hidden Markov models. The thesis describes the structure and operation of the proposed computational model, called ETIPROCT (a Spanish acronym for tagger and text corpus processor), and its two components: automatic text tagging and linguistic information processing. The system’s effectiveness is demonstrated by applying ETIPROCT to two lexically different text corpora. It processed 52,051 words from 358 texts written by junior high school students from eight Cuban provinces with 98.15% efficacy, and 51,252 words from 131 texts from the Cuban press (Granma, Trabajadores, and Juventud Rebelde) with 97.16% efficacy. Another new feature of the system is the automatic coding of spelling errors in students’ written compositions, which previously had to be done manually. The recognition of compound words, continued vocabulary enrichment, the introduction of the semantic element, and numerous linguostatistical results are among the most significant contributions of the first automatic grammatical tagger of text corpora developed in Cuba, whose development is the fundamental aim of this thesis.

Author:

Dr. Leonel Ruiz Miyares
