This page presents some ideas for students and researchers who wish to extend our work or develop applications based on our corpus. Please cite this article if you decide to implement any of these suggestions:
1. Annotating Phrases
One may extend our corpus by re-annotating phrases rather than individual words. Our current annotations are unigram-based (each word is a token), which does not capture the accurate meaning of multi-word phrases. For example, we currently annotate “ان شاء الله” as three separate words: ان/if, شاء/wills, الله/Allah. However, the intended meaning of this phrase in a given sentence, taken as a whole, might be “ok” rather than the literal “if Allah wills”.
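One way to re-annotate phrases is to merge known multi-word expressions into single tokens during tokenization. The sketch below assumes a hand-built dictionary of phrases and glosses; the dictionary contents and function names are illustrative, not part of the corpus itself.

```python
# Illustrative phrase dictionary: multi-word expressions mapped to an
# idiomatic gloss. The entries here are examples, not corpus data.
MWE_GLOSSES = {
    ("ان", "شاء", "الله"): "ok / hopefully",  # idiomatic, not literal "if Allah wills"
}

def tokenize_with_phrases(words, mwes=MWE_GLOSSES, max_len=3):
    """Greedily merge known multi-word expressions into single tokens."""
    tokens, i = [], 0
    while i < len(words):
        for n in range(max_len, 1, -1):           # try the longest match first
            cand = tuple(words[i:i + n])
            if cand in mwes:
                tokens.append((" ".join(cand), mwes[cand]))
                i += n
                break
        else:
            tokens.append((words[i], None))       # plain unigram, no phrase gloss
            i += 1
    return tokens
```

With this, a sentence containing “ان شاء الله” yields one phrase token carrying the idiomatic gloss instead of three unrelated word tokens.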
2. Sentiment Analysis
Automatically judging whether a text (e.g., a hotel review) is positive, negative, or neutral is called sentiment analysis. Such applications use internal knowledge (e.g., lists of words annotated as positive, negative, or neutral) to judge a given text. Annotating each Palestinian word in our corpus as positive, negative, or neutral would enable people to develop sentiment analysis applications that can process text (hotel reviews, political opinions, etc.) written in the Palestinian dialect.
3. Speech Recognition
Applications that convert speech to text are typically trained on annotated audio corpora. Our corpus brings the development of an annotated audio corpus for the Palestinian dialect within one step. Because most of our corpus consists of transcripts of a TV show (Watan_Aa_Water) whose audio is already available, one may link each annotation in our corpus to the corresponding audio segment. For example, and for simplicity, one may extend the annotation of a word in our corpus with the time interval during which that word is pronounced. Such an audio corpus can then be used to train speech-to-text systems.
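The extended annotation could be as simple as a record pairing each word with its episode and time interval. The field names below are illustrative assumptions about how such an alignment might be stored.

```python
# Sketch of a word annotation extended with audio alignment.
# Field names and the example values are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class AlignedToken:
    word: str          # the dialectal word as written in the corpus
    annotation: str    # the existing corpus annotation for this word
    episode: str       # which episode's audio the segment comes from
    start_ms: int      # start of the spoken word, in milliseconds
    end_ms: int        # end of the spoken word, in milliseconds

    def duration_ms(self):
        """Length of the audio segment for this word."""
        return self.end_ms - self.start_ms
```

A list of such records per episode, sorted by `start_ms`, is already a usable training format for forced-alignment and acoustic-model tools.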
4. Probabilistic Language Model
Prediction-based applications (e.g., spelling and grammar checkers, autocomplete, and others) are typically based on so-called n-gram probabilistic language models. To build, for example, a 5-gram model for the Palestinian dialect, one may tokenize our corpus into sequences of 1, 2, 3, 4, and 5 words and store these n-grams with their frequencies, so as to estimate predictions. See this tutorial.
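The counting step described above can be sketched in a few lines: collect all 1- to 5-gram frequencies, then predict the next word as the most frequent continuation of a context. This is a bare maximum-likelihood sketch with no smoothing; function names are ours.

```python
# Sketch of n-gram frequency counting and next-word prediction
# (maximum likelihood, no smoothing or backoff).
from collections import Counter

def ngram_counts(tokens, max_n=5):
    """Count all n-grams of length 1..max_n in a token sequence."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def predict_next(counts, context):
    """Return the most frequent continuation of `context`, or None."""
    candidates = {g[-1]: c for g, c in counts.items()
                  if len(g) == len(context) + 1 and g[:-1] == tuple(context)}
    return max(candidates, key=candidates.get) if candidates else None
```

A production model would add smoothing (e.g., Kneser-Ney) and backoff to shorter contexts, but the stored data is exactly these n-gram/frequency pairs.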
5. Dialect Morphological Analyser and Part-of-Speech Tagging
To analyze a given dialectal sentence and understand its morphology and structure (which is important in, e.g., machine translation and spell/grammar checking), one needs to build a treebank (i.e., patterns of sentence structures). Our annotated corpus makes building such a treebank for the Palestinian dialect much easier.
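As a first step toward tagging, the corpus annotations can already train a most-frequent-tag baseline: for each word, remember the tag it most often carries. The tiny training data and tag names below are illustrative, not drawn from the corpus.

```python
# Most-frequent-tag POS baseline, trainable from (word, tag) pairs.
# The training pairs and the "NOUN" fallback tag are illustrative.
from collections import Counter, defaultdict

def train_baseline(tagged_words):
    """Map each word to the tag it most frequently appears with."""
    freq = defaultdict(Counter)
    for word, tag in tagged_words:
        freq[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in freq.items()}

def tag(words, model, default="NOUN"):
    """Tag each word with its learned tag, or a default if unseen."""
    return [(w, model.get(w, default)) for w in words]
```

Such a baseline is a standard sanity check before training a full morphological analyzer or treebank-based parser on the same annotations.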
6. Build Jordanian, Lebanese, Syrian Corpora
Our corpus can be used to build other Levantine corpora easily. Because of the very large overlap between Palestinian and the other Levantine dialects, one needs to annotate only the words specific to the target dialect and can reuse the rest of the annotations from our corpus. Note that some people might think there are large differences between Levantine dialects, but this is not the case. For example, our experiment in [2] shows a 75% overlap between the Palestinian and Egyptian dialects; hence, the overlap with the other Levantine dialects is expected to be much higher.
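The reuse strategy amounts to a simple merge: start from the Palestinian annotations and override only the entries annotated by hand for the target dialect. The lexicon entries below are illustrative placeholders.

```python
# Sketch of bootstrapping a lexicon for another Levantine dialect:
# dialect-specific entries override the shared Palestinian annotations.
def bootstrap_lexicon(palestinian, dialect_specific):
    """Merge annotations, letting the target dialect take precedence."""
    merged = dict(palestinian)
    merged.update(dialect_specific)
    return merged
```

The hand-annotation effort is then proportional to the (small) non-overlapping part of the target dialect, not to the whole vocabulary.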