Multimodal ToM Framework for Advanced Social Deduction Game AI

by pixel_artist · December 1, 2025

Social deduction games (SDGs) like One Night Ultimate Werewolf represent a unique challenge for AI systems, requiring not only strategic reasoning but also the ability to interpret social cues and model opponents’ mental states. SocialMind introduces a novel multimodal framework that integrates visual, vocal, and textual cues with advanced Theory of Mind (ToM) modeling to create AI agents capable of human-level social reasoning in gaming environments.

Multimodal Perception Architecture

Traditional game AI systems typically rely on text alone, ignoring critical communication channels like facial expressions and speech tone. SocialMind’s perception engine processes multiple data streams simultaneously: computer vision analysis converts facial expressions, and audio processing converts speech patterns, into structured emotional descriptors. This multimodal approach allows AI agents to detect subtle deception cues and emotional states that text-based systems miss entirely. The framework achieves this through a fusion network that aligns visual, vocal, and textual features into a unified representation space, enabling comprehensive social signal processing.
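To make the idea of a unified representation concrete, here is a minimal sketch of late fusion over per-modality emotional descriptors. It is not SocialMind’s learned fusion network: the emotion labels, the `fuse` function, and the fixed modality weights are all assumptions chosen for illustration, and a plain weighted average stands in for the learned alignment.

```python
from typing import Dict

# Hypothetical descriptor set; the article does not list the actual labels.
EMOTIONS = ("calm", "nervous", "angry")


def fuse(descriptors: Dict[str, Dict[str, float]],
         weights: Dict[str, float]) -> Dict[str, float]:
    """Combine per-modality emotion scores into one vector.

    descriptors maps a modality name (e.g. "visual") to emotion scores.
    weights maps the same modality names to non-negative importances.
    A weighted average is used here as a stand-in for a learned fusion network.
    """
    total = sum(weights[m] for m in descriptors)
    if total <= 0:
        raise ValueError("modality weights must sum to a positive value")
    return {
        e: sum(weights[m] * descriptors[m].get(e, 0.0) for m in descriptors) / total
        for e in EMOTIONS
    }


# Assumed example inputs: text looks calm, face and voice look nervous.
example = fuse(
    {
        "visual": {"calm": 0.2, "nervous": 0.8},
        "vocal": {"calm": 0.4, "nervous": 0.6},
        "text": {"calm": 0.9, "nervous": 0.1},
    },
    {"visual": 1.0, "vocal": 1.0, "text": 2.0},
)
```

With these assumed weights the text channel counts twice as much as either non-verbal channel, so the fused vector still leans calm; shifting weight toward the visual and vocal channels would surface the nervousness that a text-only system never sees.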

Hierarchical Theory of Mind Modeling

SocialMind implements a sophisticated ToM system that operates at multiple reasoning levels. First-order ToM involves inferring other players’ roles and basic intentions, while second-order ToM enables reasoning about how players perceive each other’s beliefs. This hierarchical structure allows AI agents to track evolving belief distributions across all players and anticipate how their communications will influence others’ suspicions. The system represents these mental state distributions as dynamic belief matrices that update in real time based on game events and social interactions.
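One way to picture a dynamic belief matrix is as a table of role distributions indexed by observer and target, updated with Bayes’ rule whenever a game event arrives. The sketch below assumes exactly that; the role list, the `BeliefMatrix` class, and the likelihood values are hypothetical and are not taken from SocialMind itself. Rows where the observer is the agent hold first-order beliefs, and rows for other observers hold the agent’s estimate of what those players believe (second-order).

```python
from typing import Dict, Iterable

ROLES = ("villager", "werewolf", "seer")  # assumed role set for illustration


def bayes_update(prior: Dict[str, float],
                 likelihood: Dict[str, float]) -> Dict[str, float]:
    """Posterior over roles, proportional to prior * likelihood of the evidence."""
    unnormalized = {role: prior[role] * likelihood[role] for role in prior}
    z = sum(unnormalized.values())
    if z == 0:
        return dict(prior)  # evidence ruled out every role; keep the prior
    return {role: value / z for role, value in unnormalized.items()}


class BeliefMatrix:
    """beliefs[observer][target] is a distribution over the target's role."""

    def __init__(self, players: Iterable[str], roles=ROLES):
        players = list(players)
        uniform = 1.0 / len(roles)
        self.beliefs = {
            observer: {
                target: {role: uniform for role in roles}
                for target in players if target != observer
            }
            for observer in players
        }

    def observe(self, observer: str, target: str,
                likelihood: Dict[str, float]) -> None:
        """Update one cell given P(evidence | target has role)."""
        cell = self.beliefs[observer][target]
        self.beliefs[observer][target] = bayes_update(cell, likelihood)


matrix = BeliefMatrix(["agent", "p1", "p2"])
# Assumed evidence: p1's statement is three times likelier from a werewolf.
matrix.observe("agent", "p1", {"villager": 0.2, "werewolf": 0.6, "seer": 0.2})
```

Starting from a uniform prior, this single update moves the agent’s belief that p1 is a werewolf from one third to 0.6, while the cell holding the agent’s model of p1’s view of p2 is left unchanged until evidence about it arrives.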

Proactive Strategy Optimization

Unlike reactive systems that respond to immediate stimuli, SocialMind employs Monte Carlo Tree Search (MCTS) to plan multi-step communication strategies that minimize suspicion toward the AI agent. The system simulates potential conversation paths and their impact on other players’ belief states, selecting utterances that advance the agent’s objectives while maintaining credible social positioning. This strategic planning capability enables sophisticated behaviors like deliberate misdirection and calculated trust-building, essential for success in social deduction games.
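The planning loop can be illustrated with a small, self-contained MCTS over a toy conversation model. Everything about the model is an assumption made for this sketch: the three utterance types, the integer suspicion score, and the hand-written `step` rules stand in for SocialMind’s simulation of other players’ belief states, which the article does not specify. Only the four MCTS phases (selection, expansion, simulation, backpropagation) follow the standard algorithm.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

ACTIONS = ("quiet", "share", "accuse")   # hypothetical utterance types
State = Tuple[int, Optional[str], int]   # (suspicion 0-100, last action, turns left)


def step(state: State, action: str) -> State:
    """Toy transition rules (assumed): how an utterance shifts suspicion."""
    suspicion, last, turns = state
    if action == "quiet":
        delta = 5                                   # silence looks mildly suspicious
    elif action == "share":
        delta = 10 if last == "share" else -10      # over-sharing backfires
    else:
        delta = -15 if suspicion < 50 else 15       # accusations from a suspect backfire
    return (max(0, min(100, suspicion + delta)), action, turns - 1)


def reward(state: State) -> float:
    """Higher reward means lower final suspicion toward the agent."""
    return 1.0 - state[0] / 100.0


@dataclass
class Node:
    state: State
    parent: Optional["Node"] = None
    children: Dict[str, "Node"] = field(default_factory=dict)
    visits: int = 0
    value: float = 0.0

    def untried(self) -> List[str]:
        return [a for a in ACTIONS if a not in self.children]


def ucb1(parent: Node, child: Node, c: float = 1.4) -> float:
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(parent.visits) / child.visits)
    return exploit + explore


def mcts(root_state: State, iterations: int = 2000, seed: int = 0) -> str:
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend through fully expanded nodes using UCB1.
        while node.state[2] > 0 and not node.untried():
            parent = node
            node = max(parent.children.values(), key=lambda ch: ucb1(parent, ch))
        # 2. Expansion: add one untried utterance if the conversation continues.
        if node.state[2] > 0:
            action = rng.choice(node.untried())
            child = Node(step(node.state, action), parent=node)
            node.children[action] = child
            node = child
        # 3. Simulation: random utterances until no turns remain.
        state = node.state
        while state[2] > 0:
            state = step(state, rng.choice(ACTIONS))
        result = reward(state)
        # 4. Backpropagation: credit the result to every node on the path.
        while node is not None:
            node.visits += 1
            node.value += result
            node = node.parent
    # Recommend the most-visited first utterance.
    return max(root.children, key=lambda a: root.children[a].visits)


best_first_utterance = mcts((60, None, 3))
```

In this toy model, starting at suspicion 60 with three turns left, enumerating all 27 paths shows that opening with "share" is the only way to reach the lowest final suspicion (45, via share, quiet, share), and the search settles on that opening. A real system would replace `step` with a learned model of how each utterance shifts the other players’ beliefs.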

Cross-Modal Alignment and Fusion

A key innovation in SocialMind is its ability to identify correlations between different communication modalities. The system detects when verbal claims contradict non-verbal cues (such as a player claiming confidence while displaying nervous facial expressions) and uses these discrepancies to assess credibility. This cross-modal alignment is achieved through attention mechanisms that weight each signal type according to the context and how reliable its source is. The framework’s fusion network learns optimal weighting schemes through reinforcement learning, continuously improving its deception detection capabilities.
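The contradiction check described above can be sketched as follows, under stated assumptions. The `credibility` function, the modality names, and the reliability logits are hypothetical; in particular, the logits are fixed by hand here, whereas the article says the weighting is learned through reinforcement learning. A softmax turns the logits into attention-style weights, and credibility drops as the verbal claim diverges from the weighted non-verbal reading.

```python
import math
from typing import Dict


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over a dict of logits."""
    top = max(scores.values())
    exps = {name: math.exp(value - top) for name, value in scores.items()}
    z = sum(exps.values())
    return {name: value / z for name, value in exps.items()}


def credibility(claimed_confidence: float,
                observed_confidence: Dict[str, float],
                reliability_logits: Dict[str, float]) -> float:
    """Score in [0, 1]; 1.0 means the claim matches the non-verbal cues.

    claimed_confidence: confidence expressed in the player's words, in [0, 1].
    observed_confidence: per-modality confidence read from face, voice, etc.
    reliability_logits: assumed per-modality reliability (hand-set, not learned).
    """
    weights = softmax({m: reliability_logits[m] for m in observed_confidence})
    observed = sum(weights[m] * observed_confidence[m] for m in observed_confidence)
    return 1.0 - abs(claimed_confidence - observed)


# Assumed example: the player sounds sure in words, but face and voice disagree.
score = credibility(
    claimed_confidence=0.9,
    observed_confidence={"face": 0.2, "voice": 0.4},
    reliability_logits={"face": 0.0, "voice": 0.0},
)
```

With equal logits the two non-verbal channels are averaged to 0.3, so a claimed confidence of 0.9 yields a credibility of 0.4. Raising the logit of a channel known to be more reliable in a given context (for example, voice when the camera feed is poor) shifts the weighted reading toward that channel.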

Implementation and Performance

SocialMind demonstrates significant advancements over previous approaches, achieving 94% accuracy in predicting player roles and 89% precision in identifying deceptive communications. In agent-versus-agent simulations, SocialMind-powered players achieved win rates 35% higher than text-only systems and performed comparably to human experts in complex bluffing scenarios. The framework’s modular architecture allows for deployment across various social deduction games with minimal adaptation, making it a versatile solution for advancing AI social reasoning capabilities.

The integration of multimodal perception with sophisticated ToM modeling represents a significant step toward AI systems that can navigate the complex social dynamics inherent in human interactions. SocialMind provides a foundation for developing more sophisticated game AI, virtual characters, and social computing applications that require deep understanding of human communication patterns.
