Polish Parliamentary Corpus (Korpus Dyskursu Parlamentarnego)

The Polish Parliamentary Corpus (Korpus Dyskursu Parlamentarnego) is a collection of linguistically annotated texts from plenary sittings of the Sejm and Senate of the Republic of Poland, interpellations and questions submitted by Members of Parliament, and committee sittings, covering the period from 1919 to the present. The collection is continually extended with materials from new sittings. The texts are described with metadata and processed automatically with linguistic tools (for segmentation, morphosyntactic analysis, and recognition of syntactic groups and named entities), and they are available for searching and download.

Website: www.kdp.nlp.ipipan.waw.pl

Search engine and annotation:

MTAS. Annotation layers: lemmatization, morphosyntactic tags, syntactic (dependency) parses, named entities.

Institution responsible for the corpus:

Instytut Podstaw Informatyki PAN (Institute of Computer Science, Polish Academy of Sciences)

Corpus size: 781 million segments (as of June 2023)

Development period: 2011 to present

Publications:

Maciej Ogrodniczuk. Polish Parliamentary Corpus. In: Darja Fišer, Maria Eskevich, Franciska de Jong (eds.) Proceedings of the LREC 2018 Workshop ParlaCLARIN: Creating and Using Parliamentary Corpora, pp. 15–19, Paris, European Language Resources Association (ELRA).
Maciej Ogrodniczuk. The Polish Sejm Corpus. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 2219–2223, Istanbul, European Language Resources Association (ELRA).

@inproceedings{ogrodniczuk-2012-polish,
    title = "The {P}olish Sejm Corpus",
    author = "Ogrodniczuk, Maciej",
    booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
    month = may,
    year = "2012",
    address = "Istanbul, Turkey",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/653_Paper.pdf",
    pages = "2219--2223",
    abstract = "This document presents the first edition of the Polish Sejm Corpus -- a new specialized resource containing transcribed, automatically annotated utterances of the Members of Polish Sejm (lower chamber of the Polish Parliament). The corpus data encoding is inherited from the National Corpus of Polish and enhanced with session metadata and structure. The multi-layered stand-off annotation contains sentence- and token-level segmentation, disambiguated morphosyntactic information, syntactic words and groups resulting from shallow parsing and named entities. The paper also outlines several novel ideas for corpus preparation, e.g. the notion of a live corpus, constantly populated with new data or the concept of linking corpus data with external databases to enrich content. Although initial statistical comparison of the resource with the balanced corpus of general Polish reveals substantial differences in language richness, the resource makes a valuable source of linguistic information as a large (300 M segments) collection of quasi-spoken data ready to be aligned with the audio/video recording of sessions, currently being made publicly available by Sejm.",
}

@inproceedings{ogr:2018:parlaclarin,
    editor = "Fišer, Darja and Eskevich, Maria and de Jong, Franciska",
    author = "Ogrodniczuk, Maciej",
    publisher = "European Language Resources Association (ELRA)",
    isbn = "979-10-95546-02-3",
    title = "Polish {P}arliamentary {C}orpus",
    url = "http://lrec-conf.org/workshops/lrec2018/W2/summaries/11_W2.html",
    booktitle = "Proceedings of the {LREC} 2018 Workshop \emph{{P}arla{CLARIN}: {C}reating and {U}sing {P}arliamentary {C}orpora}",
    year = "2018",
    location = "Miyazaki, Japan",
    address = "Paris, France",
    pdf = "http://lrec-conf.org/workshops/lrec2018/W2/pdf/11_W2.pdf",
    pages = "15--19"
}