Léopold Maytié

Ph.D. Student
CerCo
Université de Toulouse
Toulouse (France)


About

I am a third-year Ph.D. student at the Université de Toulouse and ANITI, under the supervision of Rufin VanRullen in the NeuroAI Team at CerCo. My research centers on adapting a cognitive science theory, Global Workspace Theory, to artificial intelligence. The goal of my work is to develop and improve multimodal models that integrate several modalities, particularly in the context of robotics. Specifically, I aim to build models that learn in a way more aligned with human cognition, requiring less supervision and less data. I am especially interested in demonstrating the advantages of such models for reinforcement learning and robotics applications.

As part of my thesis, I maintain a strong interest in multimodal AI models, perception and control learning in robotics, reinforcement learning, and world models. I have also developed an interest in embodiment theory, which I would like to explore as a bridge between cognitive science and robotics, two central topics in my thesis.

News

Research

Zero-shot cross-modal transfer of Reinforcement Learning policies through a Global Workspace
Léopold Maytié, Benjamin Devillers, Alexandre Arnold, Rufin VanRullen
RLC 2024.
@article{maytie2024zero,
    title={Zero-shot cross-modal transfer of Reinforcement Learning policies through a Global Workspace},
    author={Mayti{\'{e}}, L{\'{e}}opold and Devillers, Benjamin and Arnold, Alexandre and VanRullen, Rufin},
    journal={Reinforcement Learning Journal},
    volume={1},
    number={1},
    year={2024}
}

Humans perceive the world through multiple senses, enabling them to create a comprehensive representation of their surroundings and to generalize information across domains. For instance, when a textual description of a scene is given, humans can mentally visualize it. In fields like robotics and Reinforcement Learning (RL), agents can also access information about the environment through multiple sensors; yet redundancy and complementarity between sensors are difficult to exploit as a source of robustness (e.g. against sensor failure) or generalization (e.g. transfer across domains). Prior research demonstrated that a robust and flexible multimodal representation can be efficiently constructed based on the cognitive science notion of a 'Global Workspace': a unique representation trained to combine information across modalities, and to broadcast its signal back to each modality. Here, we explore whether such a brain-inspired multimodal representation could be advantageous for RL agents. First, we train a 'Global Workspace' to exploit information collected about the environment via two input modalities (a visual input, or an attribute vector representing the state of the agent and/or its environment). Then, we train an RL agent policy using this frozen Global Workspace. In two distinct environments and tasks, our results reveal the model's ability to perform zero-shot cross-modal transfer between input modalities, i.e. to apply to image inputs a policy previously trained on attribute vectors (and vice-versa), without additional training or fine-tuning. Variants and ablations of the full Global Workspace (including a CLIP-like multimodal representation trained via contrastive learning) did not display the same generalization abilities.
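To make the transfer mechanism concrete, here is a minimal sketch of the idea, assuming hypothetical module names and dimensions (GWEncoder, GW_DIM, a 4-action policy head); it illustrates the mechanism described above and is not the paper's actual code. The unimodal encoders and Global Workspace encoders are frozen, the policy is trained only on workspace latents computed from attribute vectors, and at test time the same policy weights consume workspace latents computed from images.

# Illustrative sketch (hypothetical names and dimensions), not the paper's code.
import torch
import torch.nn as nn

GW_DIM, ATTR_DIM, VISION_LATENT = 12, 11, 64

class GWEncoder(nn.Module):
    """Projects a frozen unimodal latent into the shared workspace."""
    def __init__(self, dim_in: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, 256), nn.ReLU(),
                                 nn.Linear(256, GW_DIM))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)

# Stand-ins for pretrained unimodal encoders (e.g. a VAE image encoder).
vision_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, VISION_LATENT))
attr_encoder = nn.Identity()

gw_from_vision = GWEncoder(VISION_LATENT)
gw_from_attr = GWEncoder(ATTR_DIM)

# Once the Global Workspace is trained, everything upstream of the policy is frozen.
for module in (vision_encoder, gw_from_vision, gw_from_attr):
    for p in module.parameters():
        p.requires_grad_(False)

# The policy (trained with any standard RL algorithm) only ever sees workspace latents.
policy = nn.Sequential(nn.Linear(GW_DIM, 128), nn.ReLU(), nn.Linear(128, 4))

def act_from_attributes(attr_obs: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        gw = gw_from_attr(attr_encoder(attr_obs))
    return policy(gw)  # modality used during RL training

def act_from_image(img_obs: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        gw = gw_from_vision(vision_encoder(img_obs))
    return policy(gw)  # zero-shot: same policy weights, different input modality

print(act_from_attributes(torch.randn(1, ATTR_DIM)).shape)  # torch.Size([1, 4])
print(act_from_image(torch.randn(1, 3, 32, 32)).shape)      # torch.Size([1, 4])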

Semi-supervised Multimodal Representation Learning through a Global Workspace
Benjamin Devillers, Léopold Maytié, Rufin VanRullen
IEEE TNNLS 2024.
@article{10580966,
    author={Devillers, Benjamin and Maytié, Léopold and VanRullen, Rufin},
    journal={IEEE Transactions on Neural Networks and Learning Systems},
    title={Semi-Supervised Multimodal Representation Learning Through a Global Workspace},
    year={2024},
    pages={1-15},
    keywords={Task analysis;Training;Visualization;Trajectory;Synchronization;Representation learning;Annotations;Cycle-consistency;global workspace (GW) theory;multimodal learning;semi-supervised learning},
    doi={10.1109/TNNLS.2024.3416701}
}

Recent deep learning models can efficiently combine inputs from different modalities (e.g., images and text) and learn to align their latent representations, or to translate signals from one domain to another (as in image captioning, or text-to-image generation). However, current approaches mainly rely on brute-force supervised training over large multimodal datasets. In contrast, humans (and other animals) can learn useful multimodal representations from only sparse experience with matched cross-modal data. Here we evaluate the capabilities of a neural network architecture inspired by the cognitive notion of a "Global Workspace": a shared representation for two (or more) input modalities. Each modality is processed by a specialized system (pretrained on unimodal data, and subsequently frozen). The corresponding latent representations are then encoded to and decoded from a single shared workspace. Importantly, this architecture is amenable to self-supervised training via cycle-consistency: encoding-decoding sequences should approximate the identity function. For various pairings of vision-language modalities and across two datasets of varying complexity, we show that such an architecture can be trained to align and translate between two modalities with very little need for matched data (from 4 to 7 times less than a fully supervised approach). The global workspace representation can be used advantageously for downstream classification tasks and for robust transfer learning. Ablation studies reveal that both the shared workspace and the self-supervised cycle-consistency training are critical to the system's performance.
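The self-supervised part of this training can be sketched in a few lines; the snippet below is a minimal illustration under assumed names and dimensions (enc_v/dec_v, enc_t/dec_t, GW_DIM), not the published implementation. Demi-cycle and full-cycle losses only need unimodal latents, while the supervised translation loss is computed on the small matched subset.

# Illustrative sketch (assumed names and dimensions), not the published code.
import torch
import torch.nn as nn
import torch.nn.functional as F

GW_DIM, V_DIM, T_DIM = 12, 64, 32

# Encoders/decoders between frozen unimodal latents and the shared workspace.
enc_v, dec_v = nn.Linear(V_DIM, GW_DIM), nn.Linear(GW_DIM, V_DIM)
enc_t, dec_t = nn.Linear(T_DIM, GW_DIM), nn.Linear(GW_DIM, T_DIM)

def demi_cycle(z, enc, dec):
    # Encode into the workspace and decode back to the same modality.
    return F.mse_loss(dec(enc(z)), z)

def full_cycle(z, enc_a, dec_a, enc_b, dec_b):
    # A -> workspace -> B, then B -> workspace -> A: the round trip should
    # approximate the identity, without any paired target.
    z_b = dec_b(enc_a(z))
    return F.mse_loss(dec_a(enc_b(z_b)), z)

def supervised_translation(z_v, z_t):
    # Only available for the small matched (paired) subset of the data.
    return F.mse_loss(dec_t(enc_v(z_v)), z_t) + F.mse_loss(dec_v(enc_t(z_t)), z_v)

# One illustrative training step: large unimodal batches, few matched pairs.
z_v, z_t = torch.randn(128, V_DIM), torch.randn(128, T_DIM)
z_v_pair, z_t_pair = torch.randn(8, V_DIM), torch.randn(8, T_DIM)

loss = (demi_cycle(z_v, enc_v, dec_v) + demi_cycle(z_t, enc_t, dec_t)
        + full_cycle(z_v, enc_v, dec_v, enc_t, dec_t)
        + full_cycle(z_t, enc_t, dec_t, enc_v, dec_v)
        + supervised_translation(z_v_pair, z_t_pair))
loss.backward()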

Teaching

2023-2025

Talks

2024
2023