Dr Mustafa Jarrar, Computer Science Department, Birzeit University, Palestine | +972 599 662258

About: Text was collected from Facebook, Twitter, scripts of the TV show "Watan Aa Watar", and other sources. The corpus contains about 56K words/tokens. Every word in the corpus was then manually annotated, in context, with a set of metadata attributes describing its orthographic, morphological, and semantic features, such as part of speech, prefixes, stem, suffixes, dialect lemma, MSA lemma, CODA surface, gender, number, mode, and an English gloss; see our article. Because there are no spelling rules for dialect, people write the same word in different forms (e.g., بيكتب، بكتب), so we had to develop a set of spelling rules for the Palestinian dialect (called PAL-CODA). When annotating a word, we also specified its "CODA surface", which is the "correct" / "standard" spelling of the word according to the PAL-CODA guidelines.
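To make the annotation idea concrete, here is a minimal Python sketch of how one annotated word might be represented and how different raw spellings could be grouped under a shared CODA surface. The class name, field names, and sample values are assumptions made for illustration; they are not the corpus's actual schema or data, and the choice of which spelling PAL-CODA prefers is assumed for the example only.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedToken:
    """One word in context. Field names are hypothetical, not the corpus schema."""
    surface: str        # spelling as written in the source text
    coda_surface: str   # normalized spelling under PAL-CODA
    pos: str            # part of speech
    gloss: str          # English gloss


def group_by_coda(tokens):
    """Map each CODA surface to the set of raw spellings observed for it."""
    variants = defaultdict(set)
    for tok in tokens:
        variants[tok.coda_surface].add(tok.surface)
    return dict(variants)


# Illustrative values only (not copied from the corpus); the preferred
# CODA spelling is an assumption made for the sake of the example.
tokens = [
    AnnotatedToken("بيكتب", "بيكتب", "verb", "he writes"),
    AnnotatedToken("بكتب", "بيكتب", "verb", "he writes"),
]
variants_by_coda = group_by_coda(tokens)
```

Grouping by CODA surface rather than by raw spelling is what lets a search or lexicon lookup treat both written forms as the same word.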

Why this corpus: (i) Language learners can use it as a trilingual Palestinian-Standard Arabic-English lexicon; (ii) linguists can use it for research purposes; (iii) developers can use it to build IT applications. Dialectal content is rapidly increasing on the web, especially on social media, and there are no computer applications currently available to process and understand this content, e.g., for automatic translation, effective search and retrieval, spell checking, speech recognition, and many others.

Funding: This project was funded by the Palestinian Ministry of Higher Education, Scientific Research Council.

Researchers and Collaborators: Mustafa Jarrar (main researcher and contact person), Faeq Alrimawi, Diyam Akra (Birzeit University), and Nizar Habash (New York University Abu Dhabi).

Acknowledgement: We wish to thank the "Watan Aa Watar" team for sharing the scripts of their TV show. Special thanks to Rami Asia for developing the Curras portal, and to Mohammad Dwaikat, Bahya Mustafa, Nasser Zalmout, Mahdi Arar and several other colleagues and students for their help and support. We also wish to thank Owen Rambow, Ramy Eskandar and Faisal Al-Shargi for their support with DIWAN and MADAMIRA.

Birzeit University launches Curras, the computerized corpus of the Palestinian colloquial dialect | Online Corpus for Palestinian Dialect Launched by...

Posted by Mustafa Jarrar on Saturday, January 16, 2016