SensorLM: Learning the language of wearable sensors

Wearable devices, from smartwatches to fitness trackers, have become ubiquitous, continuously capturing a rich stream of data about our lives. They record our heart rate, count our steps, track our fitness and sleep, and much more. This deluge of information holds immense potential for personalized health and wellness. However, while we can easily see what our body is doing (e.g., a heart rate of 150 bpm), the crucial context of why (say, "a brisk uphill run" vs. "a stressful public speaking event") is often missing. This gap between raw sensor data and its real-world meaning has been a major barrier to unlocking the full potential of these devices.

The primary challenge lies in the scarcity of large-scale datasets that pair sensor recordings with rich, descriptive text. Manually annotating millions of hours of data is prohibitively expensive and time-consuming. To solve this, and to truly let wearable data "speak for itself", we need models that can learn the intricate connections between sensor signals and human language directly from the data.

In “SensorLM: Learning the Language of Wearable Sensors”, we introduce SensorLM, a family of sensor–language foundation models that bridges this gap. Pre-trained on an unprecedented 59.7 million hours of multimodal sensor data from over 103,000 individuals, SensorLM learns to interpret and generate nuanced, human-readable descriptions from high-dimensional wearable data, setting a new state of the art in sensor data understanding.
Training the SensorLM models
To create the sensor dataset needed for SensorLM, we sampled nearly 2.5M person-days of de-identified data from 103,643 people across 127 countries. This data was collected between March 1st and May 1st, 2024, from Fitbit or Pixel Watch devices, with participants consenting to the use of their de-identified data for research to contribute to general knowledge about health and science.

To overcome the annotation bottleneck, we developed a novel hierarchical pipeline that automatically generates descriptive text captions by calculating statistics, identifying trends, and describing events from the sensor data itself. This process allowed us to curate the largest-known sensor-language dataset to date, orders of magnitude larger than those used in previous studies.
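To give a concrete sense of what such a pipeline involves, the minimal sketch below turns statistics, trends, and events for a single sensor window into a caption. The signal names, thresholds, and phrasing are illustrative assumptions, not the actual SensorLM implementation.

import numpy as np

def caption_sensor_window(heart_rate: np.ndarray, steps: np.ndarray) -> str:
    """Hypothetical sketch of hierarchical captioning for one sensor window.

    The real pipeline is described only at a high level in the paper; the
    signals, thresholds, and wording here are illustrative assumptions.
    """
    # Level 1: statistics over the window.
    mean_hr, max_hr = float(np.mean(heart_rate)), float(np.max(heart_rate))
    total_steps = int(np.sum(steps))

    # Level 2: a simple trend description.
    trend = "rising" if heart_rate[-1] > heart_rate[0] + 10 else "stable"

    # Level 3: a coarse event label derived from the statistics.
    if mean_hr > 120 and total_steps > 1000:
        event = "a sustained bout of brisk activity"
    elif total_steps < 50:
        event = "a largely sedentary period"
    else:
        event = "light everyday movement"

    return (
        f"Average heart rate {mean_hr:.0f} bpm (peak {max_hr:.0f} bpm), "
        f"{total_steps} steps, heart rate {trend}; consistent with {event}."
    )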
The SensorLM architecture builds on and unifies prominent multimodal pre-training strategies, such as contrastive learning and generative pre-training.

Contrastive learning: The model learns to match a segment of sensor data with its corresponding text description from a set of options. This teaches it to discriminate between different activities and states (e.g., distinguishing a "light swim" from a "strength workout").

Generative pre-training: The model learns to generate text captions directly from the sensor data. This equips it with the ability to produce rich, context-aware descriptions from understanding the high-dimensional sensor signals.

By integrating these approaches into a single, cohesive framework, SensorLM develops a deep, multimodal understanding of the relationship between sensor signals and language.
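As a rough illustration of how these two objectives can be combined in one framework, the sketch below mixes a CLIP-style contrastive loss with a next-token caption-prediction loss. The embedding shapes, loss weighting, and temperature are assumptions for illustration, not the configuration reported in the paper.

import torch
import torch.nn.functional as F

def sensorlm_pretraining_loss(sensor_emb, text_emb, caption_logits, caption_tokens,
                              temperature=0.07, alpha=0.5):
    """Illustrative combination of contrastive and generative objectives.

    `sensor_emb` and `text_emb` are L2-normalized [batch, dim] embeddings from
    hypothetical sensor and text encoders; `caption_logits` ([batch, seq, vocab])
    and `caption_tokens` ([batch, seq]) come from a hypothetical decoder that
    generates captions conditioned on the sensor signal.
    """
    # Contrastive objective: match each sensor window to its own caption,
    # against all other captions in the batch (and vice versa).
    logits = sensor_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))

    # Generative objective: next-token prediction of the caption from sensor data.
    generative = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_tokens.reshape(-1),
    )

    # The weighting between the two terms is an assumption.
    return alpha * contrastive + (1 - alpha) * generative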
Key capabilities and scaling behaviors
We evaluated SensorLM on a wide range of real-world tasks in human activity recognition and healthcare. The results demonstrate significant advances over previous state-of-the-art models.
Activity recognition and retrieval
SensorLM shines in tasks with limited labeled data. It achieves remarkable zero-shot activity recognition, accurately classifying among 20 activities without any fine-tuning, and excels in few-shot learning, quickly adapting from just a handful of examples. This makes the model highly adaptable to new tasks and users with minimal data. Furthermore, SensorLM enables powerful cross-modal retrieval between sensor data and language descriptions: we can query descriptions using sensor input, or find specific sensor patterns using natural language, facilitating expert-driven analysis (further results can be found in the paper).
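The sketch below illustrates, under assumptions about the model interface, how zero-shot activity recognition works with a sensor-language model: the sensor embedding is compared against text embeddings of candidate activity prompts, and the closest prompt wins. The encoder names and the prompt template are hypothetical.

import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(sensor_encoder, text_encoder, sensor_window, class_names):
    """Sketch of zero-shot activity recognition with a sensor-language model.

    `sensor_encoder` and `text_encoder` stand in for the model's two towers;
    the prompt template is a guess, not the one used in the paper.
    """
    prompts = [f"a segment of wearable sensor data recorded during {name}"
               for name in class_names]
    text_emb = F.normalize(text_encoder(prompts), dim=-1)           # [num_classes, dim]
    sensor_emb = F.normalize(sensor_encoder(sensor_window), dim=-1)  # [1, dim]

    # Cosine similarity between the sensor embedding and every class prompt;
    # the highest-scoring prompt is the predicted activity.
    scores = (sensor_emb @ text_emb.t()).squeeze(0)
    return class_names[scores.argmax().item()]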
Generative capabilities
Beyond its classification power, SensorLM demonstrates impressive caption generation capabilities. Given only the high-dimensional sensor signals from a wearable device, SensorLM can produce hierarchical and contextually relevant captions. Experimental results indicate that these generated captions are more coherent and factually accurate than those produced by powerful non-specialist LLMs.
Scaling behavior
Our experiments also revealed that SensorLM's performance consistently improves with more data, larger model sizes, and increased computation, aligning with established scaling laws. This sustained growth suggests we have only scratched the surface of what is possible with large-scale sensor-language pre-training, indicating that further investigation into this paradigm is highly valuable.
Conclusion
Our research establishes a foundation for unlocking the understanding of wearable sensor data through natural language, enabled by a novel hierarchical captioning pipeline and the largest sensor-language dataset to date. The SensorLM family of models represents a major advance in making personal health data understandable and actionable. By teaching AI to comprehend the language of our bodies, we can move beyond simple metrics and toward truly personalized insights.

Looking forward, we plan to scale pre-training data into new domains, including metabolic health and detailed sleep analysis, to address the messy reality of consumer health devices. We envision SensorLM leading to a future generation of digital health coaches, clinical monitoring tools, and personal wellness applications that can offer advice through natural language query, interaction, and generation. Any future products or applications inspired by this foundational research may require further assessment of any clinical and regulatory considerations that may be applicable.
Acknowledgements
The research described here is joint work across Google Research, Google Health, Google DeepMind, and partnering teams. The following researchers contributed to this work: Yuwei Zhang, Kumar Ayush, Siyuan Qiao, A. Ali Heydari, Girish Narayanswamy, Maxwell A. Xu, Ahmed Metwally, Shawn Xu, Jake Garrison, Xuhai Xu, Tim Althoff, Yun Liu, Pushmeet Kohli, Jiening Zhan, Mark Malhotra, Shwetak Patel, Cecilia Mascolo, Xin Liu, Daniel McDuff, and Yuzhe Yang. We would also like to thank participants who contributed their data for this study.