Unsupervised Anomaly Detection in OpenStack Logs via Fine-Tuned RoBERTa Embeddings

Authors

  • Janit Rajkarnikar, School of Computer Sciences, University of Southern Mississippi, Hattiesburg, MS, USA
  • Nishan Poudel, School of Computer Sciences, University of Southern Mississippi, Hattiesburg, MS, USA
  • Nick Rahimi, School of Computer Sciences, University of Southern Mississippi, Hattiesburg, MS, USA

DOI:

https://doi.org/10.65879/3070-5789.2025.01.02

Keywords:

Anomaly Detection, Transformer Embeddings, RoBERTa, Isolation Forest, One-Class SVM, OpenStack Logs, Unsupervised Learning, PEFT, LoRA

Abstract

As organizations increasingly depend on the reliable operation of cloud infrastructure, proactive detection of anomalies in system logs has become indispensable. Conventional log analysis techniques often produce high false-alarm rates and exhibit limited semantic understanding. To address these limitations, we develop an unsupervised anomaly detection framework that uses embeddings from a fine-tuned RoBERTa-base model to capture contextual patterns within OpenStack log event sequences. A filtering step first removes high-frequency, non-discriminative events, so that the models learn from contextual signals rather than from simple indicators. From these refined sequences, we construct a custom vocabulary and fine-tune RoBERTa with Parameter-Efficient Fine-Tuning (PEFT) using Low-Rank Adaptation (LoRA). The resulting contextualized embeddings are passed to unsupervised detectors, including Isolation Forest and One-Class SVM, that are trained solely on normal data. On a holdout test set, the approach reaches an anomaly-class F1-score of up to 0.97, significantly outperforming traditional LSTM-based baselines on the same task. These results indicate that contextualized transformer embeddings provide a strong foundation for log-based anomaly detection, reducing false alarms and improving detection accuracy in complex cloud environments.
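
The filtering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact criterion used, so this example assumes one plausible rule: drop any event ID that appears in more than a chosen fraction of sequences. The function names `sequence_frequency` and `filter_common_events`, the `max_fraction` parameter, and the event IDs are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set, Tuple


def sequence_frequency(sequences: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Return, for each event ID, the fraction of sequences that contain it."""
    if not sequences:
        return {}
    counts: Counter = Counter()
    for seq in sequences:
        counts.update(set(seq))  # count each event at most once per sequence
    total = len(sequences)
    return {event: n / total for event, n in counts.items()}


def filter_common_events(
    sequences: Sequence[Sequence[str]],
    max_fraction: float = 0.9,  # assumed threshold, not a value from the paper
) -> Tuple[List[List[str]], Set[str]]:
    """Remove events that occur in more than `max_fraction` of sequences.

    Returns the filtered sequences (order preserved) and the dropped event IDs.
    """
    freq = sequence_frequency(sequences)
    too_common = {event for event, f in freq.items() if f > max_fraction}
    filtered = [[e for e in seq if e not in too_common] for seq in sequences]
    return filtered, too_common


# Hypothetical event-ID sequences, e.g. one per OpenStack instance.
sequences = [
    ["E1", "E2", "E3", "E1"],
    ["E1", "E2", "E4"],
    ["E1", "E5", "E2"],
    ["E1", "E6"],
]
filtered, dropped = filter_common_events(sequences, max_fraction=0.9)
print(dropped)
print(filtered)
```

In this toy input, `E1` occurs in every sequence and so carries no information for separating normal from anomalous behavior, while `E2` (three of four sequences) stays below the assumed threshold and is kept. In practice the set of dropped events would presumably be derived from training data and then applied unchanged to the test data.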

Published

2025-11-04

Section

Articles