PypiGuard: A novel meta-learning approach for enhanced malicious package detection in PyPI through static-dynamic feature fusion

CCF-C

Key Points

Combines static analysis with dynamic analysis.

Problems

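Research questions posed by the authors:

RQ1: How can static metadata be combined with dynamic API call behavior to improve the accuracy and reliability of malicious package detection in open-source repositories?

RQ2: How does a hybrid ensemble meta-learning framework compare with traditional machine learning and deep learning methods in detecting malicious packages?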
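
To make RQ1 concrete, here is a minimal sketch of static-dynamic feature fusion by simple concatenation. It is not the paper's pipeline: the metadata fields (`name`, `dependencies`, `author_email`), the `API_VOCAB` list, and all function names are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical API vocabulary; a real system would derive it from observed traces.
API_VOCAB = ["open", "socket.connect", "subprocess.Popen", "os.remove"]


def static_features(metadata):
    """Turn package metadata (assumed dict layout) into a small numeric vector."""
    return [
        float(len(metadata.get("name", ""))),          # package name length
        float(len(metadata.get("dependencies", []))),  # number of declared dependencies
        1.0 if metadata.get("author_email") else 0.0,  # is an author email present?
    ]


def dynamic_features(api_calls):
    """Relative frequency of each vocabulary API in a recorded call trace."""
    counts = Counter(api_calls)
    total = len(api_calls) or 1  # avoid division by zero on an empty trace
    return [counts[api] / total for api in API_VOCAB]


def fuse(metadata, api_calls):
    """Fuse both views by concatenating the static and dynamic vectors."""
    return static_features(metadata) + dynamic_features(api_calls)


vector = fuse(
    {"name": "demo", "dependencies": ["requests"], "author_email": ""},
    ["open", "socket.connect", "open", "os.remove"],
)
print(vector)
```

Concatenation is the simplest fusion strategy. A classifier trained on the fused vector sees both views at once, which is the premise RQ1 asks about.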

Notes

References

Identifies malicious packages by spotting names that closely resemble those of popular packages.

Neupane S, Holmes G, Wyss E, Davidson D, De Carli L. Beyond typosquatting: an in-depth look at package confusion. In: Proceedings of the 32nd USENIX conference on security symposium. USA: USENIX Association; 2023.
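
As a rough illustration of name-similarity checks, the sketch below flags candidate names within a small edit distance of a popular package. It is not the method of the cited paper: the `POPULAR` list, the `max_distance` threshold, and the function names are assumptions chosen for the example.

```python
# Hypothetical list of popular package names; a real check would load
# a much larger list (for example, the most-downloaded PyPI projects).
POPULAR = ["requests", "numpy", "pandas", "urllib3", "django"]


def normalize(name):
    """Lowercase and unify separators so 'Foo_Bar' and 'foo-bar' compare equal."""
    return name.lower().replace("_", "-").replace(".", "-")


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def lookalikes(candidate, popular=POPULAR, max_distance=2):
    """Return (name, distance) pairs for popular names close to, but not equal to, candidate."""
    cand = normalize(candidate)
    hits = []
    for name in popular:
        d = edit_distance(cand, normalize(name))
        if 0 < d <= max_distance:
            hits.append((name, d))
    return sorted(hits, key=lambda h: h[1])


print(lookalikes("reqeusts"))
print(lookalikes("requests"))
```

Edit distance only catches character-level confusion. The cited paper's title suggests package confusion goes beyond typosquatting, so a check like this would cover just one of several signals.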

"Killing two birds with one stone": static analysis that examines metadata such as package names, author details, and dependency structure.

Zhang J, Huang K, Huang Y, Chen B, Wang R, Wang C, et al. Killing two birds with one stone: Malicious package detection in NPM and PyPI using a single model of malicious behavior sequence. ACM Trans Softw Eng Methodol 2024. http://dx.doi.org/10.1145/3705304.

Several studies explore metadata-based detection.

Combines machine learning with static metadata.

Halder S, Bewong M, Mahboubi A, Jiang Y, Islam MR, Islam MZ, et al. Malicious package detection using metadata information. New York, NY, USA: Association for Computing Machinery; 2024, p. 1779–89. http://dx.doi.org/10.1145/3589334.3645543.

Deep learning for identifying Android malware.

Manzil R, Haidros H, Naik S M. DeepMetaDroid: Real-time android malware detection using deep learning and metadata features. Cloud Comput Data Sci 2024;203–25. http://dx.doi.org/10.37256/ccds.5220244503.

Integrating machine learning further improves static analysis.

Charoenwet W, Thongtanunam P, Pham V-T, Treude C. An empirical study of static analysis tools for secure code review. In: Proceedings of the 33rd ACM SIGSOFT international symposium on software testing and analysis. ISSTA 2024, New York, NY, USA: Association for Computing Machinery; 2024, p. 691–703. http://dx.doi.org/10.1145/3650212.3680313.

Dynamic analysis

OSCAR

Zheng X, Wei C, Wang S, Zhao Y, Gao P, Zhang Y, et al. Towards robust detection of open source software supply chain poisoning attacks in industry environments. In: Proceedings of the 39th IEEE/ACM international conference on automated software engineering. New York, NY, USA: Association for Computing Machinery; 2024, p. 1990–2001. http://dx.doi.org/10.1145/3691620.3695262.

DONAPI

Huang C, Wang N, Wang Z, Siqi, Li L, Chen J, et al. DONAPI: Malicious NPM packages detector using behavior sequence knowledge mapping. 2024, arXiv:2403.08334.

Advanced preprocessing with TF-IDF and sliding windows has proven effective for preparing API call sequences for deep learning models.

Kim M, Kim H. A dynamic analysis data preprocessing technique for malicious code detection with TF-IDF and sliding windows. Electronics 2024;13(5). http://dx.doi.org/10.3390/electronics13050963, [Online]. Available: https://www.mdpi.com/2079-9292/13/5/963.
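
The combination of sliding windows and TF-IDF can be sketched as follows. This is a generic illustration and not the cited paper's exact procedure: the window size, the unsmoothed IDF formula `log(N / df)`, and the toy traces are all assumptions.

```python
import math
from collections import Counter


def sliding_windows(calls, size=3, step=1):
    """Split one API call trace into overlapping windows of `size` calls."""
    if len(calls) < size:
        return [tuple(calls)] if calls else []
    return [tuple(calls[i:i + size]) for i in range(0, len(calls) - size + 1, step)]


def tfidf(documents):
    """Compute TF-IDF weights per document; each document is a list of hashable terms."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy traces (made up): each trace becomes a "document" of 2-call windows.
benign = ["open", "read", "close", "open", "read", "close"]
suspicious = ["open", "read", "socket", "connect", "send", "close"]
docs = [sliding_windows(trace, size=2) for trace in (benign, suspicious)]
weights = tfidf(docs)
print(weights[1])
```

Windows shared by every trace, such as `("open", "read")`, receive weight zero, while windows unique to one trace, such as `("socket", "connect")`, stand out. That down-weighting of common behavior is what makes the representation useful as model input.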

Extends the feature set beyond API calls.

Ilić S, Gnjatović M, Tot I, Jovanović B, Maček N, Gavrilović Božović M. Going beyond API calls in dynamic malware analysis: A novel dataset. Electronics 2024;13(17). http://dx.doi.org/10.3390/electronics13173553, [Online]. Available: https://www.mdpi.com/2079-9292/13/17/3553.

CTIMD: demonstrates the effectiveness of monitoring API call sequences with parameters.

Chen T, Zeng H, Lv M, Zhu T. CTIMD: Cyber threat intelligence enhanced malware detection using API call sequences with parameters. Comput Secur 2024;136:103518. http://dx.doi.org/10.1016/j.cose.2023.103518, [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167404823004285.

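Hybrid meta-learning frameworks have proven effective at detecting malware with limited training data, which makes them well suited to integrating static and dynamic detection techniques.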

Tapu SU, Shopnil SAA, Tamanna RB, Dewan MAA, Alam MGR. Malicious data classification in packet data network through hybrid meta deep learning. IEEE Access 2023;11:140609–25. http://dx.doi.org/10.1109/ACCESS.2023.3341911.

Cross-language

Cross-language detection across JavaScript and Python.

Ladisa P, Ponta SE, Ronzoni N, Martinez M, Barais O. On the feasibility of cross-language detection of malicious packages in npm and PyPI. In: Proceedings of the 39th annual computer security applications conference. New York, NY, USA: Association for Computing Machinery; 2023, p. 71–82. http://dx.doi.org/10.1145/3627106.3627138.

Cross-language detection: the Cerebro model applies a fine-tuned BERT model to unified behavior sequences to detect malicious packages.

Zhang J, Huang K, Huang Y, Chen B, Wang R, Wang C, et al. Killing two birds with one stone: Malicious package detection in NPM and PyPI using a single model of malicious behavior sequence. ACM Trans Softw Eng Methodol 2024. http://dx.doi.org/10.1145/3705304.

Feasibility of cross-language detection.

Ohm M, Boes F, Bungartz C, Meier M. On the feasibility of supervised machine learning for the detection of malicious software packages. In: Proceedings of the 17th international conference on availability, reliability and security. New York, NY, USA: Association for Computing Machinery; 2022, http://dx.doi.org/10.1145/3538969.3544415.

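Hybrid frameworks combining metadata and code analysis

Amalfi: a machine learning approach that combines classifiers with metadata verification.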

Sejfia A, Schäfer M. Practical automated detection of malicious npm packages. New York, NY, USA: Association for Computing Machinery; 2022, p. 1681–92. http://dx.doi.org/10.1145/3510003.3510104.

Ea4mp ("1+1>2"): fuses deep code behavior analysis with call graphs.

Sun X, Gao X, Cao S, Bo L, Wu X, Huang K. 1+1>2: Integrating deep code behaviors with metadata features for malicious PyPI package detection. In: Proceedings of the 39th IEEE/ACM international conference on automated software engineering. New York, NY, USA: Association for Computing Machinery; 2024, p. 1159–70. http://dx.doi.org/10.1145/3691620.3695493.

MalHyStack: uses a stacked ensemble approach that combines decision trees with deep learning models.

Roy KS, Ahmed T, Udas PB, Karim ME, Majumdar S. MalHyStack: A hybrid stacked ensemble learning framework with feature engineering schemes for obfuscated malware analysis. Intell Syst Appl 2023;20:200283. http://dx.doi.org/10.1016/j.iswa.2023.200283, [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2667305323001084.
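
The general idea of a stacked ensemble (base models feed a meta-learner) can be sketched with the standard library alone. This is not the architecture of MalHyStack or PypiGuard: decision stumps stand in for the real base models, the meta-learner is a small logistic regression, the hold-out split is a simplification of cross-validated stacking, and the two-feature data is made up.

```python
import math


class Stump:
    """Decision stump: thresholds one feature at the midpoint of the class means.
    Assumes the training data contains both classes."""

    def __init__(self, feature):
        self.feature = feature

    def fit(self, X, y):
        pos = [x[self.feature] for x, label in zip(X, y) if label == 1]
        neg = [x[self.feature] for x, label in zip(X, y) if label == 0]
        self.pos_mean, self.neg_mean = sum(pos) / len(pos), sum(neg) / len(neg)
        self.threshold = (self.pos_mean + self.neg_mean) / 2
        return self

    def predict(self, x):
        above = x[self.feature] > self.threshold
        return 1.0 if above == (self.pos_mean > self.neg_mean) else 0.0


class LogisticMeta:
    """Meta-learner: logistic regression trained by stochastic gradient descent."""

    def fit(self, Z, y, epochs=500, lr=0.5):
        self.w, self.b = [0.0] * len(Z[0]), 0.0
        for _ in range(epochs):
            for z, label in zip(Z, y):
                err = self.prob(z) - label
                self.w = [w - lr * err * zi for w, zi in zip(self.w, z)]
                self.b -= lr * err
        return self

    def prob(self, z):
        s = self.b + sum(w * zi for w, zi in zip(self.w, z))
        s = max(-30.0, min(30.0, s))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-s))


def fit_stack(X, y, base_models, meta):
    """Hold-out stacking: even rows train the base models, odd rows train the meta-learner."""
    for model in base_models:
        model.fit(X[0::2], y[0::2])
    Z = [[m.predict(x) for m in base_models] for x in X[1::2]]
    meta.fit(Z, y[1::2])
    return base_models, meta


def predict_stack(stack, x):
    base_models, meta = stack
    z = [m.predict(x) for m in base_models]
    return 1 if meta.prob(z) >= 0.5 else 0


# Made-up data: [static_score, dynamic_score], label 1 = malicious.
X = [[0.1, 0.2], [0.9, 0.8], [0.8, 0.9], [0.2, 0.1],
     [0.15, 0.3], [0.85, 0.7], [0.7, 0.95], [0.3, 0.2]]
y = [0, 1, 1, 0, 0, 1, 1, 0]
stack = fit_stack(X, y, [Stump(0), Stump(1)], LogisticMeta())
print(predict_stack(stack, [0.95, 0.9]), predict_stack(stack, [0.05, 0.1]))
```

Training the meta-learner on rows the base models never saw keeps it from simply trusting overfit base predictions, which is the main reason stacking separates the two training stages.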

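BERT-based frameworks have also become a powerful tool for cross-ecosystem malware detection.

Accuracy is high, but computational overhead remains a challenge.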

Ladisa P, Sahin M, Ponta SE, Rosa M, Martinez M, Barais O. The hitchhiker’s guide to malicious third-party dependencies. In: Proceedings of the 2023 workshop on software supply chain offensive research and ecosystem defenses. New York, NY, USA: Association for Computing Machinery; 2023, p. 65–74. http://dx.doi.org/10.1145/3605770.3625212.

LLM-based detection frameworks that use semantic code analysis show high precision in identifying complex attack patterns.

Zahan N, Burckhardt P, Lysenko M, Aboukhadijeh F, Williams L. Shifting the lens: Detecting malicious npm packages using large language models. 2024, arXiv:2403.12196.

Deep learning-based cross-platform methods achieve higher accuracy in detecting obfuscated malware.

Bhavya RA, Bindhu Shree GV, Chandan Gowda N, Sanjana S, ShwethaShree KV. ML-based cross-platform malware detection. In: 2024 international conference on knowledge engineering and communication systems, vol. 1. 2024, p. 1–6. http://dx.doi.org/10.1109/ICKECS61492.2024.10616557.

Datasets

MalwareBench: combines static metadata features with dynamic API behavior.

Zahan N, Burckhardt P, Lysenko M, Aboukhadijeh F, Williams L. MalwareBench: Malware samples are not enough. In: Proceedings of the 21st international conference on mining software repositories. New York, NY, USA: Association for Computing Machinery; 2024, p. 728–32. http://dx.doi.org/10.1145/3643991.3644883.

PyRadar introduces a dataset that addresses the widespread metadata inaccuracies in PyPI.

Gao K, Xu W, Yang W, Zhou M. Pyradar: Towards automatically retrieving and validating source code repository information for PyPI packages. Proc ACM Softw Eng 2024. http://dx.doi.org/10.1145/3660822.

BadSnakes

Vu D-L, Newman Z, Meyers JS. Bad snakes: Understanding and improving python package index malware scanning. In: 2023 IEEE/ACM 45th international conference on software engineering. 2023, p. 499–511. http://dx.doi.org/10.1109/ICSE48619.2023.00052.

In-the-wild analysis of OSS malicious packages

Zhou X, Zhang Y, Niu W, Liu J, Wang H, Li Q. OSS malicious package analysis in the wild. 2024, arXiv:2404.04991.

LLM-based malicious package detection

Zahan N, Burckhardt P, Lysenko M, Aboukhadijeh F, Williams L. Shifting the lens: Detecting malicious npm packages using large language models. 2024, arXiv:2403.12196.