Journal of Data Science


Leveraging Survey Metadata for LLM Reasoning via Knowledge Graphs
Irina Belyaeva, Christopher Carino, Liang-Chi Wang

https://doi.org/10.6339/26-JDS1230
Pub. online: 21 May 2026      Type: Statistical Data Science      Open Access

Received: 9 September 2025
Accepted: 9 April 2026
Published: 21 May 2026

Abstract

Statistical survey metadata contains essential contextual information that underpins the accurate interpretation, discovery, and reuse of statistical data. However, traditional metadata formats are not optimized for consumption by large language models (LLMs), which increasingly function as interfaces for data exploration, question answering, and decision support. This work introduces a knowledge-graph-based approach to modeling survey metadata using semantic web standards and linked data principles, specifically designed to make metadata machine-understandable and LLM-compatible. The core metadata entities, including surveys, datasets, variables, concepts, populations, and provenance, are modeled as rich interlinked nodes that allow reasoning, contextual enrichment, and structured prompting. The graph builds on established semantic web standards such as the Resource Description Framework (RDF) to promote interoperability and alignment with global standards. We demonstrate how this structure allows LLMs to surface relevant metadata, ground their outputs in authoritative sources, and generate semantically precise responses. This approach enhances transparency, facilitates metadata reuse, and supports the development of artificial intelligence (AI) applications powered by statistical products.
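
As a rough illustration of the idea in the abstract, the sketch below stores a few survey metadata entities as subject-predicate-object triples, serializes them in an N-Triples-style form, and assembles the facts about one variable into a structured prompt. It is not the authors' implementation. The `http://example.org/survey-kg/` namespace, the property names (`type`, `label`, `partOf`, `inDataset`, `measures`), the entity identifiers, and the prompt wording are all assumptions made for this example.

```python
# Hypothetical namespace; the paper's actual vocabulary is not shown here.
EX = "http://example.org/survey-kg/"


def iri(local_name):
    """Wrap a local name as an N-Triples-style IRI term."""
    return f"<{EX}{local_name}>"


def literal(text):
    """Wrap text as a quoted literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


# Illustrative entities: a survey, one dataset, one variable, one concept.
SURVEY = iri("acs")
DATASET = iri("acs1_2023")
VARIABLE = iri("total_population")
CONCEPT = iri("concept_population")

TRIPLES = [
    (SURVEY, iri("type"), iri("Survey")),
    (SURVEY, iri("label"), literal("American Community Survey")),
    (DATASET, iri("type"), iri("Dataset")),
    (DATASET, iri("partOf"), SURVEY),
    (VARIABLE, iri("type"), iri("Variable")),
    (VARIABLE, iri("inDataset"), DATASET),
    (VARIABLE, iri("measures"), CONCEPT),
    (CONCEPT, iri("label"), literal("Total population")),
]


def to_ntriples(triples):
    """Serialize triples one per line in the form 's p o .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


def context_for(triples, node):
    """Return the triples in which the node is the subject or the object."""
    return [t for t in triples if node in (t[0], t[2])]


def build_prompt(question, node):
    """Place the node's one-hop metadata facts ahead of the question."""
    facts = to_ntriples(context_for(TRIPLES, node))
    return (
        "Answer using only the metadata facts below.\n"
        f"{facts}\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    print(build_prompt("Which dataset contains this variable?", VARIABLE))
```

Restricting the context to the one-hop neighborhood of a node keeps the prompt short. A fuller system would presumably follow more links, for example to the concept label or to provenance records, before prompting the model.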

Supplementary material

Appendices A-C.



Copyright
2026 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
large language models; linked data; link prediction; metadata interoperability; retrieval-augmented generation; semantic search; statistical knowledge graphs


Journal of data science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X
