Context
It is widely recognized that some of the data collected, processed and stored is of a personal
nature, making it sensitive and requiring appropriate protection. Organizations, whether
public or private, e.g., in domains like healthcare, finance, logistics, that handle such data are
subject to increasingly stringent regulations. It is then crucial to equip information systems
with mechanisms that control and restrict access to sensitive data to authorized users only.
While experts in cybersecurity are involved in the design of advanced mechanisms to ensure
data protection, an increasing number of individuals and organizations with limited technical
expertise are tasked with managing large volumes of sensitive data. The challenge then lies in
bridging the gap between sophisticated data security practices and the practical, user-friendly
tools that non-specialists can implement and understand.
The strategy followed in this project is to enable non-experts to express their security needs
in natural language. By developing systems that can automatically translate natural language
security requirements into formal, enforceable policies, we could make data security
(particularly specification of security requirements) accessible to a wider range of users. Such
an approach will also provide users with a high degree of transparency and confidence in the
protection of their data, making complex security concepts more intuitive. Furthermore, users
should be able to audit and understand how their policies are being enforced, increasing trust
in data integration and automated data management systems.
With this thesis, we seek to democratize data security by designing an approach and
developing tools that allow users to specify data usage and access control policies in a way
that is understandable and manageable without the need for deep technical expertise.
Objective
When security policies are defined within an information system, ensuring their effective
enforcement becomes paramount. In this context, our objective is to enhance data protection
by detecting and blocking suspicious queries in real time. This will be based on user behavior
patterns such as unusual query sequences, high-volume access within a short timeframe, or
repeated attempts to access sensitive data from a single user. These patterns will be codified
into templates of suspicious behavior, allowing the system to quickly check and intercept
queries that match.
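The template idea described above can be illustrated with a minimal sketch. It is not the system the thesis will build: the class names (QueryEvent, HighVolumeTemplate, RepeatedSensitiveTemplate, QueryMonitor), the thresholds, and the attribute names are all hypothetical, chosen only to show how two of the listed patterns (high-volume access in a short timeframe, repeated requests for sensitive data) might be checked against a per-user history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryEvent:
    user: str
    timestamp: float          # seconds since some reference point
    attributes: frozenset     # attributes the query touches


@dataclass(frozen=True)
class HighVolumeTemplate:
    """Matches when a user exceeds max_queries within window seconds."""
    max_queries: int
    window: float

    def matches(self, history, event):
        recent = [e for e in history
                  if event.timestamp - e.timestamp <= self.window]
        return len(recent) + 1 > self.max_queries


@dataclass(frozen=True)
class RepeatedSensitiveTemplate:
    """Matches when a user keeps requesting sensitive attributes."""
    sensitive: frozenset
    max_attempts: int

    def matches(self, history, event):
        if not (event.attributes & self.sensitive):
            return False
        prior = sum(1 for e in history if e.attributes & self.sensitive)
        return prior + 1 > self.max_attempts


class QueryMonitor:
    def __init__(self, templates):
        self.templates = list(templates)
        self.history = {}     # user -> list of past QueryEvent

    def allow(self, event):
        """Return True if the query may run, False if it is blocked."""
        past = self.history.setdefault(event.user, [])
        blocked = any(t.matches(past, event) for t in self.templates)
        past.append(event)    # blocked attempts are logged as well
        return not blocked


# Hypothetical thresholds: at most 3 queries per 10 seconds,
# at most 2 requests touching the "salary" attribute.
monitor = QueryMonitor([
    HighVolumeTemplate(max_queries=3, window=10.0),
    RepeatedSensitiveTemplate(frozenset({"salary"}), max_attempts=2),
])
decisions = [monitor.allow(QueryEvent("alice", t, frozenset({"name"})))
             for t in (0.0, 1.0, 2.0, 3.0, 20.0)]
print(decisions)
```

In this sketch the fourth query arrives while three earlier ones are still inside the 10-second window, so it is the only one blocked. Logging blocked attempts as well is a design assumption: it lets repeated attempts count against the user even when they fail.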
A policy cache will store these templates for efficient runtime evaluation. Additionally, for each
user, past queries will be logged and used to assess new queries. If a combination of past and
new queries forms a potentially dangerous transaction, the system will block the new query.
We will explore two approaches for this combination: one attribute-based, which links queries
via shared attributes, and another based on auditing of query history.
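As a rough illustration of the attribute-based combination, the sketch below links a new query to past queries through shared attributes and blocks it when the combined attribute set covers a forbidden combination. The function names, the notion of a "forbidden combination", the transitive linking, and the attribute names are assumptions made for this example only. The approach actually developed in the thesis may differ.

```python
def combined_attributes(history, new_query):
    """Union of the new query's attributes with every past query that is
    linked to it, directly or transitively, through a shared attribute."""
    combined = set(new_query)
    remaining = [set(q) for q in history]
    changed = True
    while changed:
        changed = False
        for past in list(remaining):
            if past & combined:          # linked via a shared attribute
                combined |= past
                remaining.remove(past)
                changed = True
    return combined


def is_dangerous(history, new_query, forbidden_combinations):
    """True if past and new queries together expose a forbidden combination."""
    combined = combined_attributes(history, new_query)
    return any(set(combo) <= combined for combo in forbidden_combinations)


# Hypothetical example: names must never be associated with diagnoses.
history = [{"patient_id", "name"}, {"ward", "bed"}]
forbidden = [{"name", "diagnosis"}]

blocked = is_dangerous(history, {"patient_id", "diagnosis"}, forbidden)
harmless = is_dangerous(history, {"diagnosis", "drug"}, forbidden)
print(blocked, harmless)
```

The first new query shares patient_id with a logged query, so name and diagnosis become joinable and the query would be blocked. The second shares no attribute with the history and passes.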
Positioning
Access control (AC) is a vital aspect of safeguarding information systems. Various types of AC
have been proposed in the literature, such as Role-Based (RBAC), Organization-Based (OrBAC),
Attribute-Based (ABAC), etc. These access control systems provide a formal means of
specifying a security policy, usually using logical or constraint languages. Recently, several
research works have addressed the extraction of (ABAC) access control policies from natural
language texts using machine learning and natural language processing (NLP) techniques
[1,2,3]. In a recent work [4], the authors recognize that human involvement is essential to validate
access control policy prediction. An interactive (log-based) approach for adaptive policies has
been proposed in [5], but it does not consider natural language inputs. Furthermore, while AC
blocks direct access, it does not prevent inference attacks. Approaches like [8] handle statistical
inference by limiting access to aggregated data, and [9] addresses semantic inference risks.
The authors of [14] propose an auditing module that focuses on historical query logs in
conjunction with the current query, aiming to identify the inference channel which potentially
leads to violations of access policies.
Caching is a common acceleration strategy in computer science. Researchers in the database
community have been interested in data caching for decades. Software acceleration in this
field can be found in data warehouses, distributed databases, and web search engines, for example
[10]. Semantic caching [11] is a caching technique that uses semantic information to improve
the efficiency of the cache. In semantic caching, the cache stores not only the data but also
the semantics or meaning of the data, which can help to reduce cache misses and improve
cache hit rates. By understanding the semantics of the data, the cache can be more intelligent
in predicting and pre-fetching the data that is likely to be needed in the future. Semantic
caching has been seen as a solution to consider response time and energy consumption in
mobile cloud computing [12] or to efficiently rely on FPGA [13]. Exploring semantic caching,
which stores data along with its associated meaning, could serve a dual purpose: on the one
hand, it may empower the detection of inference channels, and on the other hand, its
integration could optimize query processing while enhancing security. This makes it necessary
to rethink the different strategies that caches may rely on. To the best of our knowledge,
security and optimization have never been addressed jointly in this manner before. Finally, despite
recent progress in AI's explainability [10], especially in machine learning, text-based
generation of explanations for access control models handling both permissions and
prohibitions is almost non-existent.
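To make the semantic caching idea discussed in this section more concrete, here is a minimal sketch in which each cache entry stores a result set together with the predicate that produced it (here, a closed range on a single attribute). A new query is answered from the cache when its range is contained in a cached one. The Range and SemanticCache names, the single-attribute restriction, and the sample rows are assumptions for illustration. Real semantic caches, as in [11], also handle partial overlap and replacement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    low: int
    high: int                 # bounds are inclusive

    def contains(self, other):
        return self.low <= other.low and other.high <= self.high


class SemanticCache:
    def __init__(self):
        self.regions = []     # list of (Range, rows) pairs

    def put(self, rng, rows):
        """Store rows together with the predicate describing them."""
        self.regions.append((rng, list(rows)))

    def get(self, rng, key):
        """Answer a range query from a containing region, or return None."""
        for cached, rows in self.regions:
            if cached.contains(rng):
                return [r for r in rows if rng.low <= r[key] <= rng.high]
        return None           # miss: the query must go to the database


cache = SemanticCache()
cache.put(Range(20, 40), [
    {"id": 1, "age": 25},
    {"id": 2, "age": 30},
    {"id": 3, "age": 38},
])

hit = cache.get(Range(28, 40), key="age")    # contained in the cached region
miss = cache.get(Range(10, 30), key="age")   # only partially covered
print(hit, miss)
```

Because the cache knows which predicate each region answers, the same descriptions could in principle be inspected to see which combinations of data a user has already obtained, which is the dual use for inference detection suggested above.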
Organization
The PhD thesis will be organized as follows to consider run-time policy enforcement:
• Design an approach that exploits the interaction between security rules and data
dependencies to generate additional rules designed to avoid the problem of inferring
sensitive data.
• Develop a module for policy enforcement and run-time monitoring of queries using
semantic cache techniques.
• Perform dynamic policy adjustment by refining templates so as to improve the system’s
ability to block suspicious queries.
• Use available synthetic corpora of security policy requirements, such as IBM-CM, iTrust,
and CyberChair, to test our approach.
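The first item of this plan, generating additional rules from the interaction between security rules and data dependencies, can be sketched as follows. The sketch assumes that dependencies are functional dependencies of the form "left-hand side determines right-hand side" and that a security rule simply denies an attribute. Both the representation and the function name derive_prohibitions are hypothetical, as are the attribute names.

```python
def derive_prohibitions(denied, dependencies):
    """Derive attribute sets that must not be disclosed together.

    denied:        set of attribute names the user may not read
    dependencies:  iterable of (lhs, rhs) pairs, lhs a set of attributes
                   that functionally determines the attribute rhs
    """
    derived = set()
    for lhs, rhs in dependencies:
        lhs = frozenset(lhs)
        # If rhs is protected and the whole lhs is still readable,
        # reading lhs would let the user infer rhs.
        if rhs in denied and not (lhs & denied):
            derived.add(lhs)
    return derived


# Hypothetical example: diagnosis is protected, but the prescribed drug
# and dosage together determine it.
denied = {"diagnosis"}
dependencies = [
    ({"drug", "dosage"}, "diagnosis"),
    ({"zip"}, "city"),
]
extra_rules = derive_prohibitions(denied, dependencies)
print(extra_rules)
```

Each derived set is an additional prohibition: the attributes in it may be read individually, but not all together. A dependency whose left-hand side already contains a denied attribute yields no new rule, since the existing policy already prevents that inference.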
[1] Nobi, M. & Gupta, M. & Praharaj, L. & Abdelsalam, M. & Krishnan, R. & Sandhu, R. (2022).
Machine Learning in Access Control: A Taxonomy and Survey. 10.48550/arXiv.2207.01739.
[2] Xiao, X, Paradkar A, Thummalapenta S, Xie T (2012) Automated extraction of security
policies from natural-language software documents In: Proc. of the ACM SIGSOFT, NY, USA, FSE
’12, 12:1–12:11.
[3] Narouei, M, Khanpour H, Takabi H, Parde N, Nielsen R (2017) Towards a top-down policy
engineering framework for attribute-based access control In: Proc. of SACMAT ’17, ACM, 103–
114.
[4] John Heaps, Ram Krishnan, Yufei Huang, Jianwei Niu, and Ravi Sandhu (2021) Access
Control Policy Generation from User Stories Using Machine Learning. In Proc. of DBSec 2021,
LNCS, Springer, 19–20.
[5] Karimi, L., Abdelhakim, M., & Joshi, J.B. (2021). Adaptive ABAC Policy Learning: A
Reinforcement Learning Approach. ArXiv, abs/2105.08587.
[6] Farkas, C., & Jajodia, S. (2002). The inference problem: a survey. ACM SIGKDD Explorations
Newsletter, 4(2), 6-11.
[7] Bindschaedler, V., Grubbs, P., Cash, D., Ristenpart, T., & Shmatikov, V. (2017). The Tao of
inference in privacy-protected databases. Cryptology ePrint Archive.
[8] Stanley RM Oliveira and Osmar R Zaiane. “Privacy preserving clustering by data
transformation”. In: Journal of Information and Data Management 1.1 (2010), pp. 37–37.
[9] Sabrina De Capitani di Vimercati et al. “Confidentiality protection in large databases”. In: A
Comprehensive Guide Through the Italian Database Research Over the Last 25 Years. Springer,
2018, pp. 457– 472.
[10] Carlos Barrios, Mohan Kumar: Service Caching and Computation Reuse Strategies at the
Edge: A Survey. ACM Comput. Surv. 56(2): 43:1-43:38 (2024)
[11] Shaul Dar, Michael J. Franklin, Björn Þór Jónsson, Divesh Srivastava, Michael Tan: Semantic
Data Caching and Replacement. VLDB 1996: 330-341
[12] Mikael Perrin, Jonathan Mullen, Florian Helff, Le Gruenwald, Laurent d'Orazio: Time-,
Energy-, and Monetary Cost-Aware Cache Design for a Mobile-Cloud Database System.
Big-O(Q)/DMAH@VLDB 2015: 71-85
[13] Van Long Nguyen Huu, Laurent d'Orazio, Emmanuel Casseau, Julien Lallet: MASCARA-FPGA
cooperation model: Query Trimming through accelerators. SSDBM 2021: 203-208
[14] Agoun, J., Terras, J., Hacid, M. S., & Hariri, S. (2023, December). Empowering Data
Federation Security in Polystore Systems. In 2023 20th ACS/IEEE International Conference on
Computer Systems and Applications (AICCSA) (pp. 1-8). IEEE.