Diving Into the Data
We continue our quest to demystify how the Apple Watch estimates VO2 Max. Let’s take the plunge into the data and prepare it for analysis. If you’re tuning in for the first time, I’d recommend checking out the previous post to get up to speed. It’s worth the detour.
Apple Health Export Data
Thanks to the script we discussed last time, we converted the daunting export.xml file from HealthKit into a much friendlier CSV file.
Here’s a link to the Python script: Apple Health
Note, if you’ve been playing along at home, your CSV may have a date suffix.
Now, let’s talk about the CSV itself. It’s fairly large; mine was about 1.3 GB (which isn’t crazy for nearly a decade of data). Within this file, you’ll find rows and rows of HealthKit entries. There are a bunch of columns, covering the type of data, the source, the value, the unit, and the creation, start, and end timestamps. (There are many other columns, but we will ignore them because they hold more sparsely populated metadata.)
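To get a feel for how much of a file this size is actually relevant, it can help to tally rows by record type without loading everything into memory. Here is a minimal sketch of that idea. The column name `type` and the helper name `count_record_types` are my assumptions for illustration, so adjust them to match the header of your own CSV.

```python
import csv
from collections import Counter


def count_record_types(csv_path, type_column="type"):
    """Stream the CSV one row at a time and count rows per record type.

    `type_column` is an assumed header name; change it to match your export.
    """
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[row[type_column]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical file name; yours may carry a date suffix.
    for record_type, n in count_record_types("apple_health_export.csv").most_common(10):
        print(f"{record_type}: {n}")
```

Because the file is read row by row, memory use stays flat even for a multi-gigabyte export.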
Only some of that data pertains to VO2 Max. Stupid ChatGPT joke:
Much of it is like that gym equipment you buy with great intentions – it’s there, but you’re not going to use it.
Here’s a sneak peek at what we’re dealing with:
|Type|Source|Value|Unit|Start|End|Creation|
|---|---|---|---|---|---|---|
|VO2Max|Erkin’s Apple Watch|45.0789|mL/min·kg|2020-01-08 19:59:01-04:00|2020-01-08 19:59:01-04:00|2020-01-08 19:59:02 -0400|
|DistanceWalkingRunning|Erkin’s Apple Watch|0.289404|mi|2020-01-08 19:42:40-04:00|2020-01-08 19:47:45-04:00|2020-04-09 07:19:11 -0400|
|DistanceWalkingRunning|Erkin’s iPhone 6s|0.616122|mi|2020-01-08 19:46:19-04:00|2020-01-08 19:56:19-04:00|2020-01-08 19:57:22 -0400|
|DistanceWalkingRunning|Erkin’s Apple Watch|0.306078|mi|2020-01-08 19:47:45-04:00|2020-01-08 19:52:49-04:00|2020-04-09 07:19:11 -0400|
|DistanceWalkingRunning|Erkin’s Apple Watch|0.319039|mi|2020-01-08 19:52:49-04:00|2020-01-08 19:57:53-04:00|2020-04-09 07:19:12 -0400|
|DistanceWalkingRunning|Erkin’s Apple Watch|0.0363016|mi|2020-01-08 19:57:53-04:00|2020-01-08 19:58:55-04:00|2020-04-09 07:19:12 -0400|
|ActiveEnergyBurned|Erkin’s Apple Watch|39.915|Cal|2020-01-08 19:42:33-04:00|2020-01-08 19:47:37-04:00|2020-04-09 07:19:13 -0400|
So, we need a way to extract only the data related to workouts.
HealthKit is robust, and I’m sure that if I were doing this directly as part of an iOS application, I could use some of Apple’s APIs (like this).
However, we’re not in Apple’s beautiful walled garden anymore, so we need a different way to extract the workout-related data.
I was stymied at first because the extracted HealthKit data don’t have any flag or metadata that indicates workout status.
I know that specific sensors (like the heart rate monitor) sample at an increased frequency when a workout is started; however, I didn’t feel confident with an approach that tried to determine workout status implicitly.
Then, I realized that the HealthKit zip contains a directory called workout-routes, which holds a bunch of .gpx files. I had never seen this type of file before.
They’re also known as GPS Exchange Format files and store geographic information such as waypoints, tracks, and routes.
So, they’re an ideal file format to store recordings of your position throughout a walk or run.
If you’re curious about these files, take a gander at these links:
In short, this directory contains a record of every run and walk that I’ve been on! And in addition to exercises having GPS coordinates, they have timestamps!
These files are a flavor of XML and contain a ton of trackpoints with timestamps.
I asked ChatGPT to whip up some code for extracting the first and last timestamps from the files (prompt: “could you help me parse a gpx file? I would like to get the first and last time stamp from all the trkpts in trkseg”).
With that little script, we can filter out the extraneous data.
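For readers who want to see what that kind of timestamp extraction might look like, here is a minimal sketch using only the Python standard library. It is not the script ChatGPT produced. The function names are my own, and I’m assuming each trkpt holds a time child with an ISO 8601 timestamp such as 2020-01-08T23:42:40Z.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def _local_name(tag):
    """Drop any '{namespace}' prefix so we match tags regardless of GPX version."""
    return tag.rsplit("}", 1)[-1]


def parse_gpx_time(text):
    """Parse an ISO 8601 timestamp; a trailing 'Z' is rewritten as a UTC offset."""
    return datetime.fromisoformat(text.strip().replace("Z", "+00:00"))


def gpx_time_span(gpx_source):
    """Return (earliest, latest) trackpoint timestamps, or None if there are none.

    `gpx_source` may be a file path or a binary file object.
    """
    times = []
    # iterparse yields each element once it is fully read ("end" events).
    for _, elem in ET.iterparse(gpx_source):
        if _local_name(elem.tag) == "trkpt":
            for child in elem:
                if _local_name(child.tag) == "time" and child.text:
                    times.append(parse_gpx_time(child.text))
            elem.clear()  # free the trackpoint's children once processed
    if not times:
        return None
    # min/max equals first/last when trackpoints are stored in time order.
    return min(times), max(times)
```

Calling `gpx_time_span` on each file in workout-routes gives one (start, end) pair per workout, which is all the later filtering step needs.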
Workout Health Data
I wrote a simple script that uses the workout-routes directory to filter down the exported CSV.
By matching the start and end timestamps of the GPX files with HealthKit data streams, I could isolate just the sensor measurements associated with each workout.
To do this, I read through all the GPX files in the workout-routes directory and got the workout timestamps.
Then, I opened apple_health_export.csv and filtered out all rows that did not occur between the start and end timestamps of a workout.
You can find the workout health data extraction script here.
The Python script takes in the workout-routes directory and the apple_health_export.csv file and returns workout_health_export.csv. Optionally, it takes in a parameter for the file path of this new CSV.
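The filtering idea can be sketched roughly as follows. This is not the linked script, just an illustration under some assumptions: the column names `startDate` and `endDate` are guesses at the CSV header, the function names are mine, and I treat a row as part of a workout only when both its start and end fall inside a single workout window. All timestamps are assumed to carry a UTC offset, like the ones in the sample rows above.

```python
import csv
from bisect import bisect_right
from datetime import datetime


def merge_intervals(intervals):
    """Sort (start, end) pairs and merge any that overlap."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def within_workout(sample_start, sample_end, merged, starts):
    """True if the sample lies entirely inside one merged workout window."""
    i = bisect_right(starts, sample_start) - 1  # last window starting at or before the sample
    return i >= 0 and sample_end <= merged[i][1]


def filter_workout_rows(in_csv, out_csv, intervals,
                        start_col="startDate", end_col="endDate"):
    """Copy rows that fall inside a workout window; return how many were kept.

    `start_col` and `end_col` are assumed header names; adjust to your CSV.
    """
    merged = merge_intervals(intervals)
    starts = [start for start, _ in merged]
    kept = 0
    with open(in_csv, newline="", encoding="utf-8") as src, \
            open(out_csv, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            sample_start = datetime.fromisoformat(row[start_col])
            sample_end = datetime.fromisoformat(row[end_col])
            if within_workout(sample_start, sample_end, merged, starts):
                writer.writerow(row)
                kept += 1
    return kept
```

Sorting the workout windows once and using a binary search per row keeps the pass over a gigabyte-scale CSV cheap, since each row costs a lookup rather than a scan over every workout.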
With this code, we now have a dataset of all the HealthKit samples that directly pertain to a running or walking workout (the workout types for which Apple calculates VO2 Max).
Jumping the (Data Analysis) Gun
At this point, I got excited because I had data! So, I jumped directly to machine learning; I did some more initial workout data preprocessing and called scikit-learn to make some models. The results were… OK (a mean absolute error, or MAE, of ~1 for a value usually in the 30s).
Several hours into model selection, I realized I had jumped the gun. I decided to call back the cavalry and do a thorough job of data exploration before training models. This data exploration process is what we will focus on in the next post.
Go ÖN Home
I want to thank Emily A. Balczewski for reviewing this post and providing feedback on it and the project!