
Semantic interpretation of dataless tables: A metadata-driven approach for Findable, Accessible, Interoperable and Reusable restricted access data

  • Margherita Martorana

Research output: PhD Thesis (PhD-Thesis - Research and graduation internal)


Abstract

The rapid expansion of digital information has transformed how data is shared and reused, driving innovation across science, policy, and industry. Yet many datasets contain confidential information that prevents them from being openly shared, making it essential to balance the protection of sensitive data with the need for scientific progress. This thesis investigates how restricted access tabular data (i.e., datasets that cannot be openly shared due to privacy, licensing, or ethical concerns) can be made more reusable and discoverable while remaining compliant with legal and ethical frameworks such as the General Data Protection Regulation (GDPR). Guided by the FAIR Principles (Findable, Accessible, Interoperable, Reusable), the work advances metadata-driven approaches that treat metadata not as a secondary by-product but as a central component of research data management. Unlike raw data, metadata provides structured descriptions that enable discovery and reuse while safeguarding confidentiality, offering a pathway for both technical innovation and regulatory compliance. The work begins with a systematic review of existing methods for handling restricted access data within the FAIR framework. This leads to the creation of a Data Methods framework and ontology, which captures practices for secure storage, sharing, and reuse in formats that are both human- and machine-readable. The review highlights the centrality of metadata representation in ensuring compliance, interoperability, and future reuse. Building on this, the thesis introduces the DataSet-Variable Ontology (DSV), a model designed to represent restricted tabular metadata across three interconnected layers: structural, which describes the organization and format of tables; statistical, which captures properties such as completeness and data types; and semantic, which expresses meaning by linking variables to external vocabularies and codebooks.
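To make the three layers concrete, the following is a minimal sketch of a metadata-only record with structural, statistical, and semantic parts. It is not the DSV ontology itself: every class name, field name, and the example.org URI are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StructuralMetadata:
    # Organization and format of the table (hypothetical fields).
    table_name: str
    column_names: List[str]
    file_format: str


@dataclass
class StatisticalMetadata:
    # Summary properties of one variable; no raw values are stored.
    completeness: float  # fraction of non-missing cells, between 0 and 1
    data_type: str


@dataclass
class SemanticMetadata:
    # Link from a variable to an external vocabulary term or codebook entry.
    vocabulary_uri: Optional[str] = None
    codebook_label: Optional[str] = None


@dataclass
class VariableDescription:
    name: str
    statistical: StatisticalMetadata
    semantic: SemanticMetadata = field(default_factory=SemanticMetadata)


@dataclass
class DatasetDescription:
    structural: StructuralMetadata
    variables: List[VariableDescription]

    def unlinked_variables(self) -> List[str]:
        """Names of variables that have no external vocabulary link yet."""
        return [v.name for v in self.variables
                if v.semantic.vocabulary_uri is None]


# Illustrative record: the table itself is never loaded, only described.
description = DatasetDescription(
    structural=StructuralMetadata("cohort", ["age", "bmi_cat"], "csv"),
    variables=[
        VariableDescription(
            "age",
            StatisticalMetadata(completeness=0.98, data_type="integer"),
            SemanticMetadata(vocabulary_uri="http://example.org/vocab/age"),
        ),
        VariableDescription(
            "bmi_cat",
            StatisticalMetadata(completeness=0.75, data_type="string"),
        ),
    ],
)
```

In this sketch, `description.unlinked_variables()` lists the variables that still lack a semantic link, which is the kind of gap that metadata enrichment aims to fill.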
Case studies show that DSV improves dataset discoverability and interpretability without exposing raw data. Next, the thesis investigates how metadata can be enriched automatically through Large Language Models (LLMs). By mapping column headers to controlled vocabularies, LLMs such as GPT, Gemini, and Llama demonstrate the capacity to enhance metadata even when direct access to raw data is not possible. To evaluate performance in this context, the thesis introduces three new metrics: internal consistency, measuring whether a single model gives stable outputs; inter-model alignment, comparing results across different LLMs; and human-computer agreement, assessing how closely automated mappings align with human annotations. The findings suggest that LLMs can effectively leverage their background knowledge to classify and enrich metadata, with results influenced by factors such as configuration settings and integration with retrieval-augmented generation systems. This work also introduces two key concepts: Column Vocabulary Association (CVA), which links column headers to external knowledge without exposing sensitive values, and Dataless Tables, which show that even metadata alone can support structured interpretation of restricted datasets. The final part of the thesis addresses metadata-driven data integration. A new method, Metadata Union Search (MUS), adapts the Table Union Search (TUS) problem to situations where raw data cannot be accessed. By comparing structured metadata through embeddings and similarity measures, MUS achieves high precision and recall when tested against TUS benchmarks, outperforming many state-of-the-art algorithms and demonstrating that metadata alone can support effective integration across datasets. Taken together, these contributions show that metadata-driven strategies can significantly enhance the FAIRness of restricted access data.
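The idea of comparing metadata through embeddings and a similarity measure can be illustrated with a small sketch. This is not the MUS method from the thesis: it assumes a toy bag-of-words "embedding" over column-header text in place of a learned embedding model, and the function names and example tables are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(metadata_text: str) -> Counter:
    """Toy stand-in for an embedding: token counts of lower-cased metadata."""
    return Counter(metadata_text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_unionable(query: str,
                   candidates: Dict[str, str]) -> List[Tuple[str, float]]:
    """Rank candidate tables by metadata similarity to the query table."""
    q = embed(query)
    scored = [(name, cosine(q, embed(text)))
              for name, text in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical column-header metadata; no cell values are involved.
query_metadata = "patient age sex diagnosis code"
candidate_metadata = {
    "hospital_b": "patient age sex diagnosis date",
    "weather": "station temperature humidity date",
}
ranking = rank_unionable(query_metadata, candidate_metadata)
```

Here `ranking` places `hospital_b` ahead of `weather`, since four of its five header tokens overlap with the query while `weather` shares none. A realistic setup would replace `embed` with a proper text-embedding model and evaluate the ranking against a benchmark.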
While motivated by the challenges of sensitive datasets, the approaches developed here are equally relevant to open data, where metadata is often underrepresented. By positioning metadata as a central asset, this research contributes conceptual frameworks, technical tools, and evaluation strategies that strengthen data interoperability and enable secure data reuse. Beyond the technical domain, these advances also promote collaboration across disciplines and foster innovation in both academic and commercial contexts, ultimately contributing to a more informed and connected society.
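The three evaluation metrics named in the abstract (internal consistency, inter-model alignment, and human-computer agreement) are described only in words here. The sketch below shows one simple way such quantities could be computed, assuming exact-match agreement over column-to-term mappings. The formulas, function names, and example mappings are assumptions for illustration and may differ from the definitions used in the thesis.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List


def internal_consistency(runs: List[str]) -> float:
    """Share of repeated runs of one model that return its most common answer."""
    if not runs:
        raise ValueError("at least one run is required")
    return Counter(runs).most_common(1)[0][1] / len(runs)


def agreement(a: Dict[str, str], b: Dict[str, str]) -> float:
    """Fraction of shared columns that two annotators map to the same term."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    return sum(a[col] == b[col] for col in shared) / len(shared)


def inter_model_alignment(models: Dict[str, Dict[str, str]]) -> float:
    """Mean pairwise agreement across the mappings of several models."""
    pairs = list(combinations(models.values(), 2))
    if not pairs:
        raise ValueError("at least two models are required")
    return sum(agreement(x, y) for x, y in pairs) / len(pairs)


# Hypothetical mappings from column headers to vocabulary terms.
model_a = {"dob": "birthDate", "sex": "gender"}
model_b = {"dob": "birthDate", "sex": "sex"}
human = {"dob": "birthDate", "sex": "gender"}

consistency = internal_consistency(["age", "age", "years"])
alignment = inter_model_alignment({"model_a": model_a, "model_b": model_b})
human_agreement_a = agreement(model_a, human)
human_agreement_b = agreement(model_b, human)
```

Exact matching is the simplest choice; a chance-corrected statistic or a similarity over vocabulary terms could be substituted where near-matches should count.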
Original language: English
Qualification: PhD
Awarding Institution
  • Vrije Universiteit Amsterdam
Supervisors/Advisors
  • van Ossenbruggen, Jacco, Supervisor
  • Kuhn, Tobias, Co-supervisor
Award date: 31 Oct 2025
DOIs
Publication status: Published - 31 Oct 2025
