© Copyright and user license Note

  • Data downloadable from this page is copyrighted by Birzeit University.
  • People are free to use our data only for research and academic purposes; for any other use, a special license can be granted (please contact Mustafa Jarrar at Birzeit University).
  • No one is allowed to share or re-publish any part of the data, for whatever reason and by whatever means.
  • Users of the data must acknowledge and cite this article [1] in a proper and clear way.
  • When downloading any file from this page, you will be prompted to specify your full name, organization, professional email, and purpose of use. This information must be specified accurately.
  • Not agreeing to the above means that your use of the data is illegal.


Whole Corpus:

The whole Curras corpus consists of 46 raw documents that are well annotated:

  • Download the CurrasRawCorpus.rar, which contains the 46 raw documents in text format.
  • Download the CurrasAnnotations_Full.csv, which contains the annotations of the 46 documents. To understand each attribute in the annotation, please see this Readme.pdf.
  • Download this DIWAN_db_Curras.rar, which is a compressed directory containing the output files of the DIWAN tool. Please note that this version might be useful for people familiar with DIWAN, but it is not recommended, and it is not the latest version of the annotations.
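
Once the annotation file has been downloaded under the license terms above, a quick way to get oriented is to look at its header row alongside the Readme.pdf. The following is a minimal sketch, not part of the Curras distribution: the helper name `peek_csv` is hypothetical, and it assumes the file is comma-delimited and UTF-8 encoded (possibly with a byte-order mark). It makes no assumption about the attribute names, which are documented only in the Readme.pdf.

```python
import csv
from itertools import islice


def peek_csv(path, n_rows=5, encoding="utf-8-sig"):
    """Return (header, first n_rows data rows) of a CSV file.

    Assumptions (not confirmed by the corpus page): comma delimiter and
    UTF-8 encoding; "utf-8-sig" also tolerates a leading byte-order mark.
    """
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader, [])           # first row: attribute names
        rows = list(islice(reader, n_rows))  # a few rows to inspect
    return header, rows


# Example use, assuming the file sits in the working directory:
#   header, rows = peek_csv("CurrasAnnotations_Full.csv")
#   print(header)
#   for row in rows:
#       print(row)
```

If the header comes back as a single long string, the delimiter assumption is probably wrong for this file and should be adjusted.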

Parts of the Corpus:

From the whole corpus above we isolated some parts in a way that might be useful for some researchers:

  • Download only the Curras_wordsFrequencies.csv, which is a file containing every word (i.e. token) and its frequency in the whole corpus. Please note that some words are MSA words and not necessarily dialect words. The list is called Palestinian because the words appear in a Palestinian corpus.
  • Download only the PalestinianWords_MSA_Lemma.csv, which is a file containing every word (i.e. token) and its corresponding MSA lemma.
  • Download only the PalestinianWords_MSAlemma_Gloss.csv, which is a file containing every word (i.e. token), its corresponding MSA lemma, and a gloss describing its meaning in English.
  • Download each document with its full annotation:

No. | Raw document | Arabic title | Annotation file | Video
1 | Doc1.Facebook Collected Posts.txt | فيسبوك منشورات مختارة | Doc1Annotations.csv | -
2 | Doc2.Twitter Collected Tweets.txt | تويتر تغريدات مختارة | Doc2Annotations.csv | -
3 | Doc3.Abd Al-Hamid Collected Blogs.txt | مختارات من مدونة عبدالحميد عبدالعاطي | Doc3Annotations.csv | -
4 | Doc4.Palestinian Collected Terms.txt | مصطلحات فلسطينية مجمعة | Doc4Annotations.csv | -
5 | Doc5.Palestinian Collected Stories.txt | قصص فلسطينية من منتديات حلم فلسطين | Doc5Annotations.csv | -
6 | Doc6.Palestinian Networks Collected comments.txt | تعليقات من منتديات شبكة فلسطين للحوار | Doc6Annotations.csv | -
7 | Doc7.Watan3watar Saddest Arabi.txt | وطن على وتر أتعس عربي | Doc7Annotations.csv | Video
8 | Doc8.Watan3watar Unemplyed Song.txt | وطن على وتر - اغنية احنا ما عنا شباب تتوظف بشهادتها | Doc8Annotations.csv | Video
9 | Doc9.Watan3watar The Program P3.txt | وطن على وتر - البرنامج جزء 3 | Doc9Annotations.csv | -
10 | Doc10.Watan3watar The Program P4.txt | وطن على وتر - البرنامج جزء | Doc10Annotations.csv | Video
11 | Doc11.Watan3watar The Program Final.txt | وطن على وتر - البرنامج النهائي | Doc11Annotations.csv | Video
12 | Doc12.Watan3watar The Program P5.txt | وطن على وتر - البرنامج جزء 5 | Doc12Annotations.csv | -
13 | Doc13.Watan3watar Palestinian Dream.txt | وطن على وتر - الحلم الفلسطيني | Doc13Annotations.csv | -
14 | Doc14.Watan3watar Taxes.txt | وطن على وتر - الضرائب | Doc14Annotations.csv | Video
15 | Doc15.Watan3watar The Family.txt | وطن على وتر - العائلة | Doc15Annotations.csv | Video 1 (3:27 - 4:20), Video 2
16 | Doc16.Watan3watar The Program Movies.txt | وطن على وتر - برنامج البرنامج / افلام | Doc16Annotations.csv | -
17 | Doc17.Watan3watar TheProgram Palestine.txt | وطن على وتر - برنامج البرنامج من فلسطين | Doc17Annotations.csv | -
18 | Doc18.Watan3watar Banzeen.txt | وطن على وتر - البنزين | Doc18Annotations.csv | -
19 | Doc19.Watan3watar Takhareef.txt | وطن على وتر - تخاريف | Doc19Annotations.csv | -
20 | Doc20.Watan3watar Turkish.txt | وطن على وتر - تركي | Doc20Annotations.csv | Video
21 | Doc21.Watan3watar Standup Comedy.txt | وطن على وتر - ستاند اب كوميدي | Doc21Annotations.csv | -
22 | Doc22.Watan3watar Obama Car.txt | وطن على وتر - سيارة اوباما | Doc22Annotations.csv | Video
23 | Doc23.Watan3watar Fakher and the Family.txt | وطن على وتر - فاخر والعائلة | Doc23Annotations.csv | Video
24 | Doc24.Watan3watar Fafies.txt | وطن على وتر - فافيز و طنطات | Doc24Annotations.csv | Video
25 | Doc25.Watan3watar American Movie.txt | وطن على وتر - فيلم امريكي | Doc25Annotations.csv | Video
26 | Doc26.Watan3watar Coffee Shop.txt | وطن على وتر - كوفي شوب لاجئ | Doc26Annotations.csv | -
27 | Doc27.Watan3watar Correcting the Path.txt | وطن على وتر - تصحيح المسار | Doc27Annotations.csv | Video
28 | Doc28.Watan3watar Alia School.txt | وطن على وتر - مدرسة علياء المهدي | Doc28Annotations.csv | -
29 | Doc29.Watan3watar Friends.txt | - | Doc29Annotations.csv | -
30 | Doc30.Watan3watar Mawaqef.txt | وطن على وتر - مواقف محرجه | Doc30Annotations.csv | Video
31 | Doc31.Watan3watar Nakshat.txt | وطن على وتر - نكشات | Doc31Annotations.csv | -
32 | Doc32.Watan3watar Calls StandUp.txt | وطن على وتر - الاتصالات | Doc32Annotations.csv | -
33 | Doc33.Watan3watar A3ras StandUp.txt | وطن على وتر - الأعراس | Doc33Annotations.csv | Video
34 | Doc34.Watan3watar Calandia StandUp.txt | وطن على وتر - قلنديا | Doc34Annotations.csv | -
35 | Doc35.Watan3watar Media FM.txt | وطن على وتر - إعلام إف إم | Doc35Annotations.csv | Video
36 | Doc36.Watan3watar Travel1980.txt | وطن على وتر - سفر1980 | Doc36Annotations.csv | -
37 | Doc37.Watan3watar Jack and Fakher Family3.txt | وطن على وتر - جاك و فاخر فاميلي3 | Doc37Annotations.csv | Video
38 | Doc38.Watan3watar Jack and Fakher Family2.txt | وطن على وتر - جاك و فاخر فاميلي2 | Doc38Annotations.csv | -
39 | Doc39.Watan3watar Jack and Fakher Family.txt | وطن على وتر - جاك و فاخر فاميلي | Doc39Annotations.csv | -
40 | Doc40.Watan3watar Ramadan.txt | وطن على وتر - رمضان | Doc40Annotations.csv | Video
41 | Doc41.Watan3watar Gaza Electricity.txt | وطن على وتر - غزة والكهرباء | Doc41Annotations.csv | -
42 | Doc42.Watan3watar High Prices.txt | وطن على وتر - غلاء الأسعار | Doc42Annotations.csv | -
43 | Doc43.Watan3watar Mwaten Interview.txt | وطن على وتر - لقاء مع مواطن | Doc43Annotations.csv | -
44 | Doc44.Watan3watar Collections.txt | وطن على وتر - المواصلات في البلد | Doc44Annotations.csv | -
45 | Doc45.Watan3watar_Watar.txt | وطن على وتر - المياه | Doc45Annotations.csv | Video
46 | Doc46.Watan3watar_Tashri3iMember.txt | وطن على وتر - يوميات عضو مجلس تشريعي | Doc46Annotations.csv | Video


The Gold Standard

The gold standard is a small corpus of only 3 documents, which were annotated and verified carefully by two experts working together (please read Section 8 in [1] for more information about how the gold standard was developed). These 3 documents contain about 1529 words:

Experiment Files:

The full description of the experiment that was conducted to evaluate both the accuracy and the inter-annotator agreement can be found in Section 8 of this article [1]. Two annotators were asked to annotate the 3 documents that we used to build the gold standard, with each annotator working independently:
