Modelling, classification, and detection of vulnerabilities and their variants in software code bases using AI

Submitted by Paul TEMPLE on Mon 27/01/2025 - 09:31

Team

Place

Rennes

Laboratory

IRISA - UMR 6074

Description of the subject

Several vulnerability databases exist. Starting from existing vulnerability catalogues, the objective of
this thesis is to model, classify, and generalize software vulnerabilities, in order to find these
vulnerabilities or their variants, or to discover new related vulnerabilities, in code bases (software
code repositories, public or private).

The thesis consists of building an appropriate model from a database (catalogue) that classifies
vulnerabilities, and of learning from that model in a way that can be exploited for detection.
Vulnerabilities may take various forms (anti-patterns, incorrect ASTs, use of deprecated APIs,
etc.).
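As a purely illustrative sketch of one such form (calls to catalogued, risky or deprecated APIs), the snippet below walks a Python abstract syntax tree and reports matching calls. The `CATALOGUE` dictionary, its labels, and the `scan` helper are hypothetical names introduced for this example; they do not describe the method the thesis will develop, and a real catalogue entry would be far richer than a function name.

```python
import ast

# Hypothetical mini-catalogue: dotted call name -> short label.
# The entries and labels are illustrative assumptions only.
CATALOGUE = {
    "tempfile.mktemp": "deprecated API (race-prone temporary file name)",
    "pickle.loads": "anti-pattern when applied to untrusted data",
    "eval": "anti-pattern (dynamic code execution)",
}

def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None

def scan(source):
    """Return sorted (line, call name, label) triples for catalogued calls."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in CATALOGUE:
                findings.append((node.lineno, name, CATALOGUE[name]))
    return sorted(findings)
```

A purely name-based lookup like this is exactly what fails on variants (aliased imports, wrappers, renamed helpers), which is why the proposal aims at learned, more abstract representations.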

In particular, the goal of the learning is to abstract and generalize vulnerabilities into families
(a vulnerability and its variants). This will make detection robust to the code contexts in which
the vulnerabilities to be detected are embedded.
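To make the idea of a family concrete, here is a minimal sketch, under strong simplifying assumptions, of one hand-written abstraction: identifiers and literals are replaced by placeholders so that two syntactic variants of the same flaw map to the same fingerprint. The `Normalizer` class and `fingerprint` function are hypothetical names for this illustration; the thesis targets learned abstractions, not this fixed rule.

```python
import ast
import builtins
import hashlib

class Normalizer(ast.NodeTransformer):
    """Replace user-chosen identifiers and literals with placeholders."""

    def __init__(self):
        self.names = {}

    def _alias(self, name):
        # Keep built-in names (eval, open, ...) since they carry API meaning.
        # Caveat: a variable that shadows a built-in is kept as well.
        if hasattr(builtins, name):
            return name
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        node.id = self._alias(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        return node

    def visit_FunctionDef(self, node):
        node.name = "f"  # the function's own name is treated as irrelevant
        self.generic_visit(node)
        return node

    def visit_Constant(self, node):
        # Collapse every literal to one placeholder value.
        return ast.copy_location(ast.Constant(value="CONST"), node)

def fingerprint(source):
    """Hash the normalized AST; attribute names such as .execute are kept."""
    tree = Normalizer().visit(ast.parse(source))
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()[:16]
```

With this rule, two functions that build an SQL query by string concatenation but use different variable names and table names share a fingerprint, while a parameterized query does not. Any structural change (an extra statement, a reordered expression) breaks the match, which illustrates why a more general, learned abstraction is needed.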

This last point is especially crucial because the design of modern computer systems increasingly
relies on modularization into a multitude of micro-services, each offering smaller functionality
but interacting intensively with the others. This structure is particularly favourable to new
multi-step attacks [13], which correspond to vulnerabilities that may be scattered across various
places in the system. Examples include anti-patterns or the use of incorrect abstract syntax trees.

Detecting such vulnerabilities is therefore also harder, since they are buried in and scattered
throughout the rest of the code. The databases (i.e., catalogues) that reference vulnerabilities do
not necessarily take this aspect into account, which makes effective detection difficult. Moreover,
because systems and micro-services are heterogeneous, it is important to gain in generalization in
order to account for possible variants or to anticipate the appearance of future variants.

To analyse and counter as many of these vulnerabilities (and their associated variants) as
possible, artificial intelligence techniques (mainly those based on machine learning) are a
promising direction.

However, for these techniques to be effective, one must on the one hand be able to abstract the way
in which these vulnerabilities materialize, and on the other hand ensure that the chosen abstraction
can capture their variants. In addition, the techniques must be efficient and able to scale so that
they are usable in practice.

Bibliography

[1] Piergiorgio Ladisa, Henrik Plate, Matias Martinez, and Olivier Barais. “Taxonomy of Attacks on
Open-Source Software Supply Chains”. In: Proceedings of the 44th IEEE Symposium on Security and Privacy,
SP 2023, May 22-26, 2023, San Francisco, CA. Ed. by IEEE Computer Society Technical Committee on
Security and Privacy. IEEE, 2023, To appear. URL: https://doi.org/10.48550/arXiv.2204.04008.
[2] Djamel Eddine Khelladi, Benoît Combemale, Mathieu Acher, Olivier Barais, and Jean-Marc Jézéquel.
“Co-evolving code with evolving metamodels”. In: ICSE ’20: 42nd International Conference on Software Engineering, Seoul, South Korea, 27 June - 19 July, 2020. Ed. by Gregg Rothermel and Doo-Hwan Bae. ACM,
2020, pp. 1496–1508. DOI: 10.1145/3377811.3380324. URL: https://doi.org/10.1145/3377811.3380324.
[3] Elliot Chikofsky and James Cross II. “Reverse Engineering and Design Recovery: A Taxonomy”. In:
IEEE Software 7.1 (Jan. 1990), pp. 13–17. DOI: 10.1109/52.43044. URL: http://dx.doi.org/10.1109/52.43044.
[4] European Commission. EU-FOSSA 2 - Free and Open Source Software Auditing. June 2020. URL:
https://joinup.ec.europa.eu/collection/eu-fossa-2/news/eu-fossa-2-project-close (visited on
05/04/2022).
[5] Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus.
“Fine-grained and accurate source code differencing”. In: ACM/IEEE International Conference on Automated
Software Engineering, ASE ’14, Vasteras, Sweden - September 15 - 19, 2014. Ed. by Ivica Crnkovic, Marsha
Chechik, and Paul Grünbacher. ACM, 2014, pp. 313–324. DOI: 10.1145/2642937.2642982. URL:
https://doi.org/10.1145/2642937.
White House. Readout of White House Meeting on Software Security. Jan. 2022. URL:
https://www.whitehouse.gov/briefing-room/statements-releases/2022/01/13…-
meeting-on-software-security/ (visited on 05/04/2022).
[6] Hugo Martin, Mathieu Acher, Juliana Alves Pereira, et al. “Transfer Learning Across Variants and
Versions: The Case of Linux Kernel Size”. In: IEEE Transactions on Software Engineering (2021), pp. 1–17.
URL: https://hal.inria.fr/hal-03358817.
[7] Marc Ohm, Henrik Plate, Arnold Sykosch, and Michael Meier. “Backstabber’s Knife Collection: A
Review of Open Source Software Supply Chain Attacks”. In: Detection of Intrusions and Malware, and
Vulnerability Assessment. Ed. by Clémentine Maurice, Leyla Bilge, Gianluca Stringhini, and Nuno Neves.
Cham: Springer International Publishing, 2020, pp. 23–43. ISBN: 978-3-030-52683-2.
[9] OpenSSF. ossf/scorecard: Security Scorecards - Security health metrics for Open Source. Dec. 2020.
URL: https://github.com/ossf/scorecard (visited on 05/04/2022).
[10] Henning Perl, Sergej Dechand, Matthew Smith, et al. “VCCFinder: Finding Potential Vulnerabilities in
Open-Source Projects to Assist Code Audits”. In: Proceedings of the 22nd ACM SIGSAC Conference on
Computer and Communications Security, Denver, CO, USA, October 12-16, 2015. Ed. by Indrajit Ray,
Ninghui Li, and Christopher Kruegel. ACM, 2015, pp. 426–437. DOI: 10.1145/2810103.2813604. URL:
https://doi.org/10.1145/2810103.2813604.
[11] Serena Elisa Ponta, Henrik Plate, and Antonino Sabetta. “Detection, assessment and mitigation of
vulnerabilities in open source dependencies”. In: Empir. Softw. Eng. 25.5 (2020), pp. 3175–3215. DOI:
10.1007/s10664-020-09830-x. URL: https://doi.org/10.1007/s10664-020-09830-x.
[12] Marc Schönefeld. “Anti-patterns in JDK security and refactorings”. In: Detection of intrusions and
malware & vulnerability assessment, GI SIG SIDAR workshop, DIMVA 2004. Gesellschaft für Informatik eV.
2004.
[13] Fabio Pierazzi, Feargus Pendlebury, Jacopo Cortellazzi, and Lorenzo Cavallaro. “Intriguing properties of
adversarial ML attacks in the problem space”. In: 2020 IEEE Symposium on Security and Privacy (SP). IEEE.
[14] MITRE CVE: https://cve.mitre.org/index.html
[15] Software Heritage: https://www.softwareheritage.org/
[16] R. Russell et al., "Automated Vulnerability Detection in Source Code Using Deep Representation
Learning," 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA),
Orlando, FL, USA, 2018, pp. 757-762, doi: 10.1109/ICMLA.2018.00120.
[17] Asaf Shabtai, Robert Moskovitch, Yuval Elovici, Chanan Glezer, Detection of malicious code by applying
machine learning classifiers on static features: A state-of-the-art survey, Information Security Technical
Report, Volume 14, Issue 1, 2009, Pages 16-29, ISSN 1363-4127, https://doi.org/10.1016/j.istr.2009.03.003.
[18] J. Zhang, X. Wang, H. Zhang, H. Sun, K. Wang and X. Liu, "A Novel Neural Source Code Representation
Based on Abstract Syntax Tree," 2019 IEEE/ACM 41st International Conference on Software Engineering
(ICSE), Montreal, QC, Canada, 2019, pp. 783-794, doi: 10.1109/ICSE.2019.00086.
[19] Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., & Monfardini, G. (2008). The graph neural network
model. IEEE transactions on neural networks, 20(1), 61-80.
[20] Steenhoek, B., Rahman, M. M., Jiles, R., & Le, W. (2023, May). An empirical study of deep learning
models for vulnerability detection. In 2023 IEEE/ACM 45th International Conference on Software
Engineering (ICSE) (pp. 2237-2248). IEEE.
[21] P. Zeng, G. Lin, L. Pan, Y. Tai, and J. Zhang. Software vulnerability analysis and discovery using deep
learning techniques: A survey. IEEE Access, 8:197158–197172, 2020.
[22] G. Lu, X. Ju, X. Chen, W. Pei, and Z. Cai. GRACE: Empowering LLM-based software vulnerability
detection with graph structure and in-context learning. J. Syst. Softw., 212:112031, 2024.