Identification of social scientifically relevant topics in an interview repository. A natural language processing experiment

Gárdos, Judit and Egyed-Gergely, Júlia and Horváth, Anna and Pataki, Balázs and Vajda, Róza and Micsik, András (2023) Identification of social scientifically relevant topics in an interview repository. A natural language processing experiment. Journal of Documentation. ISSN 0022-0418

Social topics in interview repository 2023.pdf - Accepted Version
Available under license Creative Commons Attribution Non-commercial.

Original publication URL: https://doi.org/10.1108/JD-12-2022-0269

Abstract

Purpose: The present study is about generating metadata to enhance thematic transparency and to facilitate research on interview collections at the Research Documentation Centre, Centre for Social Sciences (TK KDK) in Budapest. It explores the use of artificial intelligence in producing, managing and processing social science data and its potential to generate useful metadata that describe the contents of such archives on a large scale.

Study design/methodology/approach: We combined manual and automated/semi-automated methods of metadata development and curation. We developed a domain-oriented taxonomy to classify a large text corpus of semi-structured interviews. To this end, we adapted the European Language Social Science Thesaurus (ELSST) to produce a concise, hierarchical structure of topics relevant to the social sciences. We identified and tested the most promising NLP tools supporting the Hungarian language. The results of manual and machine coding will be presented in a user interface.

Findings: The study describes how an international social scientific taxonomy can be adapted to a specific local setting and tailored for use by automated NLP tools. We show the potential and limitations of existing and new NLP methods for thematic assignment. We also discuss the current possibilities of multi-label classification in social scientific metadata assignment, i.e. the problem of automatically selecting relevant labels from a large pool.

Originality/value: Interview materials have not yet been used for building manually annotated training datasets for the automated indexing of scientifically relevant topics in a data repository. Comparing various automated indexing methods, this study shows a possible implementation of a researcher tool supporting custom visualizations and the faceted search of interview collections.


Item Type: Article
Title in English: Identification of social scientifically relevant topics in an interview repository. A natural language processing experiment
Keywords in English: sociology, research data repository, natural language processing (NLP), thesaurus, multi-label classification, exploratory UI, text visualization
Subjects: H Social Sciences > HM Sociology; Z Bibliography. Library Science. Information Resources > Z665 Library Science. Information Science
Divisions: Research Documentation Centre (KDK)
Research funder: European Union project RRF-2.3.1-21-2022-00004 within the framework of the Artificial Intelligence National Laboratory
Depositing User: Judit Gárdos
Date Deposited: 12 Oct 2023 06:26
Last Modified: 12 Oct 2023 06:56
URI: https://openarchive.tk.mta.hu/id/eprint/601
