Evaluating Data Suitability for RAG Systems in Manufacturing: A Comparison Between Human and LLM Judgments


Tom Keller ¹, Carsten Wohlgemuth ¹, Eike Permin ¹

¹ Institute of General Mechanical Engineering, Faculty of Computer Science and Engineering Science, TH Köln University of Applied Science, Germany

Submission date: 2026-02-05

Final revision date: 2026-02-27

Acceptance date: 2026-03-02

Online publication date: 2026-04-16

Publication date: 2026-06-22

Corresponding author

Tom Keller

Institute of General Mechanical Engineering, Faculty of Computer Science and Engineering Science, TH Köln University of Applied Science, Steinmüllerallee 1, 51643, Gummersbach, Germany

Journal of Machine Engineering 2026;26(2):163-174

DOI: https://doi.org/10.36897/jme/218709

KEYWORDS

artificial intelligence

information retrieval

large language models

TOPICS

Mechanical Engineering

ABSTRACT

Retrieval-Augmented Generation (RAG) is widely used for manufacturing assistance, but its effectiveness depends on selecting retrievable text units. We test whether humans or Large Language Models (LLMs) can judge which case descriptions are better suited as RAG inputs. We constructed 100 synthetic manufacturing service cases, each paired with a realistic query and two comparable problem–solution variants differing in contextual completeness, granularity, and quality. Five engineers and five LLMs chose the variant expected to be more retrievable and useful. As a reference, both variants were indexed in a minimal retrieval setup with one chunk per case and evaluated with MRR@3, treating the case-matching chunk as the only relevant item among distractors. LLMs showed much higher within-group agreement than humans, yet neither cohort consistently matched retrieval-derived winners. Ties were frequent; on non-tied cases, majority decisions fell below chance and were significantly worse than random guessing in one embedding setting, while no individual rater achieved above-chance performance. Overall, the findings indicate that perceived RAG-fitness is not a reliable proxy for retrieval performance and should be grounded in retrieval-based evaluation under the target deployment setup.
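The retrieval-derived reference described in the abstract can be illustrated with a minimal Python sketch of MRR@3, where exactly one chunk (the case-matching one) counts as relevant for each query. The function names, chunk identifiers, toy rankings, and the tie tolerance below are assumptions made for illustration only; the abstract does not specify the embedding models, the distractor set, or how the authors aggregated scores into per-case winners.

```python
from typing import Sequence


def reciprocal_rank_at_k(ranked_ids: Sequence[str], relevant_id: str, k: int = 3) -> float:
    """Return 1/rank of the single relevant chunk if it is in the top k, else 0."""
    for rank, chunk_id in enumerate(ranked_ids[:k], start=1):
        if chunk_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mrr_at_k(rankings: Sequence[Sequence[str]], relevant_ids: Sequence[str], k: int = 3) -> float:
    """Mean reciprocal rank over queries, each with exactly one relevant chunk."""
    if len(rankings) != len(relevant_ids):
        raise ValueError("need one relevant id per ranking")
    if not rankings:
        return 0.0
    total = sum(reciprocal_rank_at_k(r, rel, k) for r, rel in zip(rankings, relevant_ids))
    return total / len(rankings)


def compare_variants(score_a: float, score_b: float, tol: float = 1e-9) -> str:
    """Label the better variant, or 'tie' when scores match (tolerance is an assumption)."""
    if abs(score_a - score_b) <= tol:
        return "tie"
    return "A" if score_a > score_b else "B"


# Hypothetical toy data: three queries, each retrieving three chunks.
# "case*" ids are case-matching chunks; "d*" ids are distractors.
relevant = ["case1", "case2", "case3"]
rankings_a = [["case1", "d1", "d2"], ["d3", "case2", "d4"], ["d5", "d6", "d7"]]
rankings_b = [["d1", "case1", "d2"], ["d3", "case2", "d4"], ["d5", "d6", "case3"]]

mrr_a = mrr_at_k(rankings_a, relevant)  # (1 + 1/2 + 0) / 3
mrr_b = mrr_at_k(rankings_b, relevant)  # (1/2 + 1/2 + 1/3) / 3
winner = compare_variants(mrr_a, mrr_b)
```

Because the reciprocal rank at a cutoff of 3 can take only the values 1, 1/2, 1/3, or 0, two variants of the same case often receive identical scores, which is consistent with the frequent ties reported in the abstract.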


Indexes

eISSN:	2391-8071
ISSN:	1895-7595
