Semi-automated construction of English-Malay machine-readable dictionary for technical terms
Abstract
This project presents a method for the semi-automated construction of an English-Malay
machine-readable dictionary for technical terms. We propose to use keyword density
to classify each term into a category by measuring the weight of the term, implemented
in Visual Basic using Visual Studio. In addition, the cosine similarity
algorithm, implemented in C, is used to measure the similarity between two sentences: the
definition of a term and a sentence from a journal. To calculate the category, a training
set of 523 items, consisting of a set of journals for each term, was collected. We then
preprocessed the journals using Brill's Tagger with the Penn Treebank tag set. We
assigned 50 terms to test the algorithm. We counted the occurrences of each term using a
word extraction method and also calculated the total number of words in the journals of
each category. To categorize a term, we calculated its keyword density. To extract
example sentences, we measured the cosine similarity between the definition and each
sentence from the journals, and the system extracted the sentence with the highest
similarity as the example sentence. With this algorithm, the precision for
example sentence extraction is 79%, the recall is 90%, and the F-measure is 84%. We
consider the method successful because these scores are high. In conclusion, the results
suggest that the proposed method shows great potential, given further improvement.
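
The two measures named in the abstract can be illustrated with a minimal sketch. It is written in Python for brevity and is not the Visual Basic or C implementation used in the project. It assumes that keyword density is the number of occurrences of a term divided by the total number of words in a category's journals, that a term is assigned to the category with the highest density, and that cosine similarity is computed over simple word-count vectors. The function names and the tokenizer are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_density(term, category_text):
    """Assumed definition: occurrences of a single-word term / total words."""
    tokens = tokenize(category_text)
    if not tokens:
        return 0.0
    return tokens.count(term.lower()) / len(tokens)


def classify_term(term, category_texts):
    """Return the category whose journal text gives the highest density."""
    return max(category_texts,
               key=lambda name: keyword_density(term, category_texts[name]))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    vec_a, vec_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(vec_a[word] * vec_b[word] for word in vec_a)
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_example_sentence(definition, sentences):
    """Return the sentence most similar to the term's definition."""
    return max(sentences, key=lambda s: cosine_similarity(definition, s))


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

With the reported precision of 0.79 and recall of 0.90, `f_measure` gives about 0.84, which agrees with the F-measure stated in the abstract.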