Name
Multimodal Framework for Therapeutic Consultations
Time
1:45 PM - 1:55 PM (EST)
Description

This presentation outlines the development and validation of the Automated Engagement Scorecard (AES), a novel, explainable AI-based system designed to quantify therapeutic engagement during virtual mental health consultations. Engagement, defined as the client's active participation in therapy, is a critical determinant of treatment success, yet it lacks objective measurement tools, especially in virtual care contexts.

The AES framework integrates multimodal data streams to assess engagement through facial affect recognition (FAR), head motion analysis, and natural language processing (NLP). Using the expert-labeled AnnoMI dataset of 133 motivational interviewing (MI) videos, we extracted interpretable features from facial expressions (seven emotions), head movement (chronic and acute dynamics), and therapist-client utterances. Preprocessing steps such as scene-change detection, harmonization, and removal of poor-quality frames enabled robust feature extraction from over 136,000 video frames. FAR was implemented with the Py-Feat toolbox, using action units and facial landmarks to derive emotion frequencies and transitions. Head motion dynamics were calculated frame by frame from 68-point landmark data, while the NLP component used a fine-tuned RoBERTa model to classify utterances into clinically relevant categories (e.g., reflection, question, sustain talk). Decision fusion across these modalities enabled classification of high versus low engagement (proxied by MI quality) using ensemble machine learning algorithms.

Our results demonstrate high classification accuracy, particularly with gradient boosting (clinician model: 89.5%; client model: 81.1%), along with robust F1 scores and strong performance on validation metrics (AUC > 0.88 for the clinician model). An ablation study revealed that emotion dynamics were the strongest contributor, followed by head motion and NLP features. Visualizations of feature selection further confirmed the clinical interpretability of the model, with engagement-relevant indicators such as nodding, emotional fluctuation, and use of reflective speech emerging as consistent predictors of high-quality MI. AES differs from existing models such as MET by offering interpretable outputs, whole-session analysis rather than 3-second clips, and deeper integration of emotional dynamics. Importantly, the system performs well despite dataset imbalance, with performance validated across 200 iterations of 10-fold cross-validation.

We will present:
- Background and the need for objective engagement measures in digital therapy
- Architecture of the AES system and its preprocessing pipeline
- Results of model training, validation, and ablation analysis
- Comparisons with other engagement and empathy estimation models
- Implications for clinical implementation and future scalability

In conclusion, AES represents a significant advancement in quantifying therapeutic engagement using explainable AI. It provides clinicians and researchers with a tool to monitor and improve the therapeutic alliance in real time, enabling better personalization and quality control in virtual mental health care.
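For readers who want a concrete picture of the validation scheme described above (repeated 10-fold cross-validation of a gradient-boosting classifier on fused multimodal features), the following is a minimal Python sketch using scikit-learn. The synthetic feature matrix, class balance, and default hyperparameters are placeholders, not the actual AES feature set or model configuration.

```python
# Hedged sketch: repeated stratified 10-fold evaluation of a gradient-boosting
# classifier, mirroring the validation scheme described in the abstract.
# The synthetic data below stands in for the fused FAR / head-motion / NLP
# features; the real AES features and hyperparameters are not shown here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Placeholder per-session features and high/low-engagement labels
# (imbalanced classes, echoing the AnnoMI-derived dataset).
X, y = make_classification(n_samples=133, n_features=40,
                           weights=[0.3, 0.7], random_state=0)

clf = GradientBoostingClassifier(random_state=0)

# 200 repeats of stratified 10-fold cross-validation (2,000 fits in total).
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=200, random_state=0)
scores = cross_validate(clf, X, y, cv=cv,
                        scoring=["accuracy", "f1", "roc_auc"])

print(f"accuracy: {np.mean(scores['test_accuracy']):.3f}")
print(f"F1:       {np.mean(scores['test_f1']):.3f}")
print(f"AUC:      {np.mean(scores['test_roc_auc']):.3f}")
```

Repeating the 10-fold split many times and averaging accuracy, F1, and AUC reduces the variance introduced by any single fold assignment, which matters on a small, imbalanced dataset such as the 133-session AnnoMI corpus.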

Venkat Bhat
Location Name
Metropolitan Centre