Publications

2023

  • Exploring the Perception of Pain in Virtual Reality through Perceptual Manipulations
    • Clavelin Gaëlle
    • Bouhier Mickael
    • Tseng Wen-Jie
    • Gugenheimer Jan
    , 2023. Perceptual manipulations (PMs) in Virtual Reality (VR) can steer users' actions (e.g., redirection techniques) and amplify haptic perceptions (e.g., weight). However, their ability to amplify or induce negative perceptions such as physical pain is not well understood. In this work, we explore if PMs can be leveraged to induce the perception of pain, without modifying the physical stimulus. We implemented a VR experience combined with a haptic prototype, simulating the dislocation of a finger. A user study (n=18) compared three conditions (visual-only, haptic-only and combined) on the perception of physical pain and physical discomfort. We observed that using PMs with a haptic device resulted in a significantly higher perception of physical discomfort and an increase in the perception of pain compared to the unmodified sensation (haptic-only). Finally, we discuss how perception of pain can be leveraged in future VR applications and reflect on ethical concerns. (10.1145/3544549.3585674)
    DOI : 10.1145/3544549.3585674
  • On Selective, Mutable and Dialogic XAI: a Review of What Users Say about Different Types of Interactive Explanations
    • Bertrand Astrid
    • Viard Tiphaine
    • Belloum Rafik
    • Eagan James R
    • Maxwell Winston
    , 2023, pp.1-21. Explainability (XAI) has matured in recent years to provide more human-centered explanations of AI-based decision systems. While static explanations remain predominant, interactive XAI has gathered momentum to support the human cognitive process of explaining. However, the evidence regarding the benefits of interactive explanations is unclear. In this paper, we map existing findings by conducting a detailed scoping review of 48 empirical studies in which interactive explanations are evaluated with human users. We also create a classification of interactive techniques specific to XAI and group the resulting categories according to their role in the cognitive process of explanation: "selective", "mutable" or "dialogic". We identify the effects of interactivity on several user-based metrics. We find that interactive explanations improve perceived usefulness and performance of the human+AI team but take longer. We highlight conflicting results regarding cognitive load and overconfidence. Lastly, we describe underexplored areas including measuring curiosity or learning or perturbing outcomes. (10.1145/3544548.3581314)
    DOI : 10.1145/3544548.3581314
  • Memory Manipulations in Extended Reality
    • Bonnail Elise
    • Lecolinet Eric
    • Tseng Wen-Jie
    • Mcgill Mark
    • Huron Samuel
    • Gugenheimer Jan
    , 2023. Human memory has notable limitations (e.g., forgetting) which have necessitated a variety of memory aids (e.g., calendars). As we grow closer to mass adoption of everyday Extended Reality (XR), which is frequently leveraging perceptual limitations (e.g., redirected walking), it becomes pertinent to consider how XR could leverage memory limitations (forgetting, distorting, persistence) to induce memory manipulations. As memories highly impact our self-perception, social interactions, and behaviors, there is a pressing need to understand XR Memory Manipulations (XRMMs). We ran three speculative design workshops (n=12), with XR and memory researchers creating 48 XRMM scenarios. Through thematic analysis, we define XRMMs, present a framework of their core components and reveal three classes (at encoding, pre-retrieval, at retrieval). Each class differs in terms of technology (AR, VR) and impact on memory (influencing quality of memories, inducing forgetting, distorting memories). We raise ethical concerns and discuss opportunities of perceptual and memory manipulations in XR. (10.1145/3544548.3580988)
    DOI : 10.1145/3544548.3580988
  • Understanding Physical Breakdowns in Virtual Reality
    • Tseng Wen-Jie
    , 2023 (506), pp.1-5. Virtual Reality (VR) moves away from well-controlled laboratory environments into public and personal spaces. As users are visually disconnected from the physical environment, interacting in an uncontrolled space frequently leads to collisions and raises safety concerns. In my thesis, I investigate this phenomenon which I define as the physical breakdown in VR. The goal is to understand the reasons for physical breakdowns, provide solutions, and explore future mechanisms that could perpetuate safety risks. First, I explored the reasons for physical breakdowns by investigating how people interact with the current VR safety mechanism (e.g., Oculus Guardian). Results show one reason for breaking out of the safety boundary is when interacting with large motions (e.g., swinging arms), the user does not have enough time to react although they see the safety boundary. I proposed a solution, FingerMapper, that maps small-scale finger motions onto virtual arms and hands to enable whole-body virtual arm motions in VR to avoid physical breakdowns. To demonstrate future safety risks, I explored the malicious use of perceptual manipulations (e.g., redirection techniques) in VR, which could deliberately create physical breakdowns without users noticing. Results indicate further open challenges about the cognitive process of how users comprehend their physical environment when they are blindfolded in VR. (10.1145/3544549.3577064)
    DOI : 10.1145/3544549.3577064
  • Improved alpha-information bounds for higher-order masked cryptographic implementations
    • Liu Yi
    • Béguinot Julien
    • Cheng Wei
    • Guilley Sylvain
    • Masure Loïc
    • Rioul Olivier
    • Standaert François-Xavier
    , 2023. Embedded cryptographic devices are usually protected against side-channel attacks by masking strategies. In this paper, the security of protected cryptographic implementations is evaluated for any masking order, using alpha-information measures. Universal upper bounds on the probability of success of any type of side-channel attack are derived. These also provide lower bounds on the minimum number of queries required to achieve a given success rate. An important issue, solved in this paper, is to remove the loss factor due to the masking field size.
  • Making with Data (and Beyond)
    • Oehlberg Lora
    • Huron Samuel
    • Willett Wesley
    • Nagel Till
    • Thudt Alice
    • Ijeoma Ekene
    • Offenhuber Dietmar
    • Hornecker Eva
    , 2023, pp.1-5. In this proposed panel, we will discuss the practice of Making with Data—the practice of creating physical artifacts that represent a dataset. This topic lies at the intersection of several CHI communities: data visualization, fabrication, and tangible interaction. Our goal is to discuss contemporary practices, but also to envision future ways that we might continue to make physical representations of data in the future, given emerging fabrication techniques, data representation practices, and desired interactions and experiences with data. (10.1145/3544549.3583748)
    DOI : 10.1145/3544549.3583748
  • Machine learning techniques for automatic knowledge graph completion
    • Boschin Armand
    , 2023. A knowledge graph is a directed graph in which nodes are entities and edges, typed by a relation, represent known facts linking two entities. These graphs can encode a wide variety of information, but their construction and exploitation can be complex. Historically, symbolic methods have been used to extract rules about entities and relations, to correct anomalies or to predict missing facts. More recently, techniques of representation learning, or embeddings, have attempted to solve these same tasks. Initially purely algebraic or geometric, these methods have become more complex with deep neural networks and have sometimes been combined with pre-existing symbolic techniques. In this thesis, we first focus on the problem of implementation. Indeed, the diversity of libraries used makes the comparison of results obtained by different models a complex task. In this context, the Python library TorchKGE was developed to provide a unique setup for the implementation of embedding models and a highly efficient inference evaluation module. This library relies on graphic acceleration of tensor computation provided by PyTorch, is compatible with widespread optimization libraries and is available as open source. We then consider the automatic enrichment of Wikidata by typing the hyperlinks linking Wikipedia pages. A preliminary study showed that the graph of Wikipedia articles is much denser than the corresponding knowledge graph in Wikidata. A new training method involving relations and an inference method using entity types were proposed and experiments showed the relevance of the combined approach, including on a new dataset. Finally, we explore automatic entity typing as a hierarchical classification task. That led to the design of a new hierarchical loss used to train tensor-based models along with a new type of encoder. Experiments on two datasets have allowed a good understanding of the impact a prior knowledge of class taxonomy can have on a classifier but also reinforced the intuition that the hierarchy can be learned from the features if the dataset is large enough.
  • Structured Prediction with Output Regularization : Improving Statistical and Computational Efficiency
    • Motte Luc
    , 2023. Supervised learning algorithms aim at identifying relationships between inputs and outputs from training sets of (input, output) pairs. The most studied setting of supervised learning deals with high-dimensional inputs but low-dimensional outputs, such as real numbers in the case of regression, and the values zero or one in the case of binary classification. Nevertheless, being able to predict complex outputs, such as graphs, sequences, or images, allows for addressing much more practical tasks. This is the so-called structured output prediction setting. The question that has motivated this thesis is the following: How to take advantage of the structure of the output space in order to obtain statistically and computationally efficient structured prediction methods? We try to answer this question through the lens of the structured prediction framework of surrogate methods. More precisely, this manuscript starts by considering the problem of graph prediction. We propose to leverage the Gromov-Wasserstein (GW) distance, carrying a natural geometry for graph spaces, as a loss function. From this idea, we derive a new family of models for graph prediction: GW barycentric models. In a second contribution, we propose a generalization of reduced-rank regression which allows handling non-linear output spaces. It consists in solving the surrogate regression problems appearing in surrogate methods thanks to a reduced-rank regression estimator. We carry out a theoretical study of the reduced-rank estimator, taking values in a Hilbert space of possibly infinite dimension, and prove under output regularity assumptions that the rank regularization is statistically and computationally beneficial. Our results extend the interest of reduced-rank regression beyond the standard setting where the optimum is assumed to be low-rank. In a third contribution, we propose the principle of loss regularization. The method aims at obtaining a statistical and computational gain in structured prediction, by exploiting additional output data, and regularity information on the loss function. We study theoretically under which setting the method is beneficial. Our results show, intuitively, that one should adapt the level of detail of the predicted structured outputs to the quantity of training data, to reduce the effects of the output variance (or labeling noise), and also to alleviate the computational complexity of the pre-image in surrogate methods.
  • Impact of the saturable absorber on the linewidth enhancement factor of hybrid silicon quantum dot comb lasers (Student paper)
    • Renaud Thibaut
    • Huang Heming
    • Kurczveil Geza
    • Beausoleil R G
    • Grillot Frédéric
    , 2023. This work investigates the effects of the saturable absorber on the linewidth enhancement factor of hybrid silicon quantum dot comb lasers, which is a key parameter involved in frequency comb generation. Experiments have been performed on two carefully chosen laser devices sharing the same gain material and cavity design, with and without saturable absorber. The results reveal that increasing the reverse bias on the absorber drives up the linewidth enhancement factor, which gives birth to the comb spectrum. This paper offers insights into the fundamental aspects of comb lasers and provides design guidelines for future on-chip light sources for integrated wavelength-division multiplexing applications. Keywords: Quantum dots, frequency combs, linewidth enhancement factor, silicon photonics.
  • Méthodologie de développement des véhicules autonomes sûrs à partir d’exigences fonctionnelles et non fonctionnelles
    • Assioua Yasmine
    , 2023. The automotive industry is changing rapidly, with digital technology gradually replacing mechanical systems. The advent of autonomous and connected cars increases the number and complexity of the embedded electronic and computer systems, which poses new challenges and calls for new development processes. Indeed, compared with conventional vehicles, these high-technology objects play a greater role in the safety of their passengers and their environment. Reliability and safety requirements are correspondingly higher. To enter this new era, manufacturers must improve and find better production methods. The thesis proposes a method to address some of the challenges tied to the reliability and safety imperative, which the limitations of the classical development approach do not resolve satisfactorily. It consists in introducing validation as early as possible in the software development life cycle. The method lays the foundations of an iterative approach to validating and verifying requirements and textual statements in order to detect possible errors, omissions, or inconsistencies before implementation. This requirements-qualification approach relies on modeling and formal verification techniques. It also uses simulations to analyze traces and scenarios. It is largely automated.
  • Contrastive Learning for Regression in Multi-Site Brain Age Prediction
    • Barbano Carlo Alberto
    • Dufumier Benoit
    • Duchesnay Edouard
    • Grangetto Marco
    • Gori Pietro
    , 2023. Building accurate Deep Learning (DL) models for brain age prediction is a very relevant topic in neuroimaging, as it could help better understand neurodegenerative disorders and find new biomarkers. To estimate accurate and generalizable models, large datasets have been collected, which are often multi-site and multi-scanner. This large heterogeneity negatively affects the generalization performance of DL models since they are prone to overfit site-related noise. Recently, contrastive learning approaches have been shown to be more robust against noise in data or labels. For this reason, we propose a novel contrastive learning regression loss for robust brain age prediction using MRI scans. Our method achieves state-of-the-art performance on the OpenBHB challenge, yielding the best generalization capability and robustness to site-related noise.
  • Learning to diagnose cirrhosis from radiological and histological labels with joint self and weakly-supervised pretraining strategies
    • Sarfati Emma
    • Bone Alexandre
    • Rohe Marc-Michel
    • Gori Pietro
    • Bloch Isabelle
    , 2023. Identifying cirrhosis is key to correctly assessing the health of the liver. However, the gold-standard diagnosis of cirrhosis requires a medical intervention to obtain histological confirmation, e.g. the METAVIR score, as the radiological presentation can be equivocal. In this work, we propose to leverage transfer learning from large datasets annotated by radiologists, which we consider as a weak annotation, to predict the histological score available on a small annex dataset. To this end, we propose to compare different pretraining methods, namely weakly-supervised and self-supervised ones, to improve the prediction of cirrhosis. Finally, we introduce a loss function combining both supervised and self-supervised frameworks for pretraining. This method outperforms a baseline classifier of the METAVIR score, reaching an AUC of 0.84 and a balanced accuracy of 0.75, compared to 0.77 and 0.72 for the baseline.
  • Sparse non-negative matrix factorization for preclinical bioluminescent imaging
    • Dereure Erwan
    • Kervazo Christophe
    • Seguin Johanne
    • Garofalakis Anikitos
    • Angelini Elsa
    • Mignet Nathalie
    • Olivo-Marin Jean-Christophe
    , 2023.
  • Multi-View CNN for total lung volume inference on cardiac computed tomography
    • Wysoczanski Artur
    • Angelini Elsa
    • Sun Yifei
    • Smith Benjamin
    • Hoffman Eric
    • Stukovsky Karen
    • Budoff Matthew
    • Watson Karol
    • Carr Jeffrey
    • Oelsner Elizabeth
    • Barr R Graham
    • Laine Andrew
    , 2023.
  • Highlighting Two EM Fault Models While Analyzing a Digital Sensor Limitations
    • Nabhan Roukoz
    • Dutertre Jean-Max
    • Rigaud Jean-Baptiste
    • Danger Jean-Luc
    • Sauvage Laurent
    , 2023, pp.1-2. Fault injection attacks can be carried out against an operating circuit by exposing it to EM perturbations. These attacks can be detected using embedded digital sensors based on the EM fault injection mechanism, as the one introduced by El Baze et al. [1] which uses the sampling fault model [2], [3]. We tested on an experimental basis the efficiency of this sensor embedded in the AES accelerator of an FPGA. It proved effective when the target was clocked at moderate frequency (the injected faults were consistent with the sampling fault model). As the clock frequency was progressively increased, faults started to escape detection, which raises warnings about possible limitations of the sampling model. Further tests at frequencies close to the target maximal frequency revealed faults injected according to a timing fault model. Both series of experimental results ascertain that EM injection can follow at least two different fault models. Undetected faults and the existence of different fault injection mechanisms cast doubt upon the use of sensors based on a single model. (10.23919/DATE56975.2023.10137124)
    DOI : 10.23919/DATE56975.2023.10137124
  • Microstrip Antenna Array Design for Unmanned Aerial Vehicles Detection Radar
    • Mendes Ruiz Pedro
    • Begaud Xavier
    • Magne François
    • Leder Etienne
    • Khy Antoine
    Advanced Electromagnetics, Advanced Electromagnetics, 2023, 12 (3), pp.1-9. This work presents the design and realization of four linear arrays of microstrip rectangular patch antennas. This linear array is one of the elements of a passive radar using signals from 4G base stations for UAV detection. The arrays have been validated and operate from 2.62 GHz to 2.69 GHz, with a HPBW of 82° in H-plane and a maximal gain going from 11.1 dB to 12.2 dB in the required bandwidth, with a cosecant squared pattern in the E-plane. (10.7716/aem.v12i3.2066)
    DOI : 10.7716/aem.v12i3.2066
  • Wangiri Fraud: Pattern Analysis and Machine-Learning-Based Detection
    • Ravi Akshaya
    • Msahli Mounira
    • Qiu Han
    • Memmi Gérard
    • Bifet Albert
    • Qiu Meikang
    IEEE Internet of Things Journal, IEEE, 2023, 10 (8), pp.6794-6802. The rapid growth of the telecommunication landscape leads to a rapid rise of frauds in such networks. In this article, Wangiri fraud in which users are deceived by being charged for services without their knowledge during a call is tackled. In fact, Wangiri fraud has significant negative financial and reputation consequences for the mobile service providers and also has a bad psychological impact on the victims. In order to identify this fraudulent behavior, three Wangiri fraud patterns are defined by analyzing call records of over a year. Then, the security and performance of unsupervised and supervised machine learning (ML) methods in detecting one Wangiri pattern are evaluated using a large real-world Call Detail Records (CDRs) data set. In the context of Wangiri fraud detection, classification algorithms outperformed the others based on the chosen security and performance metrics. Finally, the performance evaluation of these algorithms is extended in detecting the other two real-world Wangiri fraud patterns. This article provides a detailed definition of the Wangiri fraud patterns and outlines the implementation and evaluation of ML algorithms in the context of detecting Wangiri fraud. The security analysis and experimental results demonstrate that depending on fraud patterns the best ML algorithm to detect Wangiri fraud may also vary. (10.1109/JIOT.2022.3174143)
    DOI : 10.1109/JIOT.2022.3174143
  • Side-Channel Security. How Much Are You Secure? Mrs. Gerber’s Lemma and Majorization
    • Béguinot Julien
    • Rioul Olivier
    • Guilley Sylvain
    • Cheng Wei
    • Liu Yi
    , 2023.
  • Diverse Paraphrasing with Insertion Models for Few-Shot Intent Detection
    • Chevasson Raphaël
    • Laclau Charlotte
    • Gravier Christophe
    , 2023, 13876, pp.65-76. In contrast to classic autoregressive generation, insertion-based models can predict multiple tokens at a time in an order-free way, which makes their generation uniquely controllable: it can be constrained to strictly include an ordered list of tokens. We propose to exploit this feature in a new diverse paraphrasing framework: first, we extract important tokens or keywords in the source sentence; second, we augment them; third, we generate new samples around them by using insertion models. We show that the generated paraphrases are competitive with state-of-the-art autoregressive paraphrasers, not only in diversity but also in quality. We further investigate their potential to create new pseudo-labelled samples for data augmentation, using a meta-learning classification framework, and find equally competitive results. In addition to proving non-autoregressive (NAR) viability for paraphrasing, we contribute our open-source framework as a starting point for further research into controllable NAR generation. (10.1007/978-3-031-30047-9_6)
    DOI : 10.1007/978-3-031-30047-9_6
  • An Investigation of Structures Responsible for Gender Bias in BERT and DistilBERT
    • Leteno Thibaud
    • Gourru Antoine
    • Laclau Charlotte
    • Gravier Christophe
    , 2023, 13876, pp.249-261. In recent years, large Transformer-based Pre-trained Language Models (PLM) have changed the Natural Language Processing (NLP) landscape, by pushing the performance boundaries of the state-of-the-art on a wide variety of tasks. However, this performance gain goes along with an increase in complexity, and as a result, the size of such models (up to billions of parameters) represents a constraint for their deployment on embedded devices or short-inference time tasks. To cope with this situation, compressed models emerged (e.g. DistilBERT), democratizing their usage in a growing number of applications that impact our daily lives. A crucial issue is the fairness of the predictions made by both PLMs and their distilled counterparts. In this paper, we propose an empirical exploration of this problem by formalizing two questions: (1) Can we identify the neural mechanism(s) responsible for gender bias in BERT (and by extension DistilBERT)? (2) Does distillation tend to accentuate or mitigate gender bias (e.g. is DistilBERT more prone to gender bias than its uncompressed version, BERT)? Our findings are the following: (I) one cannot identify a specific layer that produces bias; (II) every attention head uniformly encodes bias, except in the context of underrepresented classes with a high imbalance of the sensitive attribute; (III) this subset of heads is different as we re-fine-tune the network; (IV) bias is more homogeneously produced by the heads in the distilled model. (10.1007/978-3-031-30047-9_20)
    DOI : 10.1007/978-3-031-30047-9_20
  • Nouvelles technologies de réception cohérente pour la mesure et le contrôle des paramètres physiques de transmission sur fibres optiques
    • May Alix
    , 2023. In optical fiber networks, large-scale monitoring has attracted considerable interest as a way to make networks more autonomous and elastic. Over the years, various monitoring techniques based on receiver-side digital signal processing have been proposed. These techniques are particularly attractive because they require no additional hardware and are less costly. In my thesis, I focused on techniques for monitoring the longitudinal power of an optical link, based on the analysis of nonlinear propagation effects. First, I proposed using an existing technique to estimate the value of a power loss in a point-to-point optical link and validated it experimentally. Next, I generalized the power-loss estimation method and applied it to a mesh network. Using different lightpaths across this network allowed me to show an increase in the accuracy of the estimated loss values. To broaden the possibilities for deploying this method, I then studied experimentally the application of the power-profile estimation technique to a long optical link, to validate its use for submarine systems. Finally, I propose using this technique to monitor another type of power loss, polarization-dependent loss (PDL). PDL is present in optical components such as switches and amplifiers. Usually, only the accumulated amount is monitored. I proposed a method similar to the one used for power losses, which makes it possible to locate a variable PDL element and estimate its variation. This last work brings us closer to identifying the type of event, which is important for adopting intelligent and effective solutions.
  • All you ever wanted to know about side-channel attacks and protections
    • Guilley Sylvain
    • Rioul Olivier
    , 2023.
  • Query Evaluation: Enumeration, Maintenance, Reliability
    • Amarilli Antoine
    , 2023.
  • Removing the Field Size Loss from Duc et al.’s Conjectured Bound for Masked Encodings
    • Béguinot Julien
    • Cheng Wei
    • Guilley Sylvain
    • Liu Yi
    • Masure Loïc
    • Rioul Olivier
    • Standaert François-Xavier
    , 2023, 13979, pp.86-104. At Eurocrypt 2015, Duc et al. conjectured that the success rate of a side-channel attack targeting an intermediate computation encoded in a linear secret-sharing, a.k.a. masking with d+1 shares, could be inferred by measuring the mutual information between the leakage and each share separately. This way, security bounds can be derived without having to mount the complete attack. So far, the best proven bounds for masked encodings were nearly tight with the conjecture, up to a constant factor overhead equal to the field size, which may still give loose security guarantees compared to actual attacks. In this paper, we improve upon the state-of-the-art bounds by removing the field size loss, in the cases of Boolean masking and arithmetic masking modulo a power of two. As an example, when masking in the AES field, our new bound outperforms the former ones by a factor 256. Moreover, we provide theoretical hints that similar results could hold for masking in other fields as well. (10.1007/978-3-031-29497-6_5)
    DOI : 10.1007/978-3-031-29497-6_5
  • StreamMLOps: Operationalizing Online Learning for Big Data Streaming & Real-Time Applications
    • Barry Mariam
    • Montiel Jacob
    • Bifet Albert
    • Wadkar Sameer
    • Manchev Nikolay
    • Halford Max
    • Chiky Raja
    • Jaouhari Saad El
    • Shakman Katherine B.
    • Fehaily Joudi Al
    • Deit Fabrice Le
    • Tran Vinh-Thuy
    • Guerizec Eric
    , 2023, pp.3508-3521. Continuously learning from evolving streaming data and serving predictions in real time is a challenging problem. Traditionally, data is partitioned and processed in batches to train machine learning (ML) models. In industrial applications, static models’ performance drops over time (model degradation, concept drift), requiring new models to be trained with recent data and redeployed in production. The scientific community has been studying online and adaptive methods to address batch-learning limitations and continuously train AI tasks for industrial applications such as cyber-security, AIOps, anomaly scoring, and drift detection in stock markets. This paper deals with the MLOps aspects of deploying such online and dynamic models to address the requirements in the production systems for real-time applications. Our architectures - based on open-source tools such as Kafka and River - demonstrated how online learning methods could be scaled horizontally in production to meet the demands of a high-velocity streaming pipeline. We demonstrate an MLOps strategy to perform incremental learning from streaming data and continuously deploy the online learning model without pausing the inference pipeline. Indeed, the design satisfies requirements such as model versioning, monitoring, auditability and reproducibility of prediction in both a supervised and semi-supervised setting. Our experiments - for a malicious URL detection task - performed on high-dimensional and feature-evolving streaming data (more than 3 million features) establish the effectiveness and efficiency of online learning models compared to batch (static) machine learning regarding both time and space complexity. Finally, we provide some best practices on data engineering for deploying online models to process a real-time feature stream in production environments. Code is publicly available for reproducibility. (10.1109/ICDE55515.2023.00272)
    DOI : 10.1109/ICDE55515.2023.00272