
SuperAnimal pretrained pose estimation models for behavioral analysis


Datasets

We collected publicly available datasets from the community and provide two new datasets, iRodent and MausHaus (described below), to show how to build models with the SuperAnimal method. Thereby, we sought to cover diverse individuals, backgrounds, scenarios, and postures. We did not otherwise modify the source data. In the following, we detail the references for these datasets.

TopViewMouse-5k

3CSI, BM, EPM, LDB, OFT: See full details at ref. 15 and in ref. 57.

BlackMice: See full details at ref. 24.

WhiteMice: See details in SIMBA, ref. 25. Courtesy of Prof. Sam Golden and Nastacia Goodwin.

TriMouse: See full details at ref. 10.

DLC-Openfield: See full details at ref. 9.

Kiehn-Lab-Openfield, Swimming, and treadmill: See details at ref. 58. Courtesy of Prof. Ole Kiehn, Dr. Jared Cregg, and Prof. Carmelo Bellardita.

MausHaus: We collected video data from five single-housed C57BL/6J male and female mice in an extended home cage, carried out in the laboratory of Mackenzie Mathis at Harvard University and also EPFL (housing temperature 20–25 °C, humidity 20–50%). Data were recorded at 30 Hz with 640 × 480 pixel resolution, acquired with White Matter, LLC eV cameras. Annotators localized 26 keypoints across 322 frames sampled from within DeepLabCut using the k-means clustering approach59. All experimental procedures for mice were in accordance with the National Institutes of Health Guide for the Care and Use of Laboratory Animals and approved by the Harvard Institutional Animal Care and Use Committee (IACUC) (n = 1 mouse) and by the Veterinary Office of the Canton of Geneva (Switzerland; license GE01) (n = 4 mice). MausHaus data are banked on Zenodo60.

For ease of use, we packaged these datasets into one directory that can be accessed at https://zenodo.org/records/10618947 (ref. 61).

Quadruped-80K

AwA-Pose: Quadruped dataset; see full details at ref. 62.

AnimalPose: See full details at ref. 28.

AcinoSet: See full details at ref. 26.

Horse-30: Horse-30 dataset, whose benchmark task is called Horse-10; see full details at ref. 16.

StanfordDogs: See full details at refs. 63,64.

AP-10K: See full details at ref. 31.

APT-36K: See full details at ref. 32.

iRodent: We utilized the iNaturalist API functions for scraping observations with the taxon ID of Suborder Myomorpha65. The functions allowed us to filter the large number of observations down to the ones with photos under the CC BY-NC creative license. The most common types of rodents in the collected observations are Muskrat (Ondatra zibethicus), Brown Rat (Rattus norvegicus), House Mouse (Mus musculus), Black Rat (Rattus rattus), Hispid Cotton Rat (Sigmodon hispidus), Meadow Vole (Microtus pennsylvanicus), Bank Vole (Clethrionomys glareolus), Deer Mouse (Peromyscus maniculatus), White-footed Mouse (Peromyscus leucopus), and Striped Field Mouse (Apodemus agrarius). We then generated segmentation masks over target animals in the data by processing the media through an algorithm we designed that uses a Mask Region-Based Convolutional Neural Network (Mask R-CNN)66 model with a ResNet-50-FPN backbone45, pretrained on the COCO dataset40. The processed 443 images were then manually labeled with pose annotations, and bounding boxes were generated by running MegaDetector67 on the images and then manually verified. iRodent data are banked at https://zenodo.org/record/8250392.

For ease of use, we packaged these datasets into one directory, which is banked at https://zenodo.org/records/10619173 (ref. 68).

Additional OOD Videos

In Fig. 3, for video testing we additionally used the following data:

Golden Lab mouse: see details at ref. 69.

Smear Lab mouse: see details at ref. 70.

Mathis Lab MausHaus: new video conditions, under the same MausHaus ethics approval as above.

BlackDog: video from https://www.pexels.com/video/unleashing-the-pet-dog-outdoors-4763071/.

Elk: video from https://www.pexels.com/video/a-deer-looking-for-food-in-the-ground-covered-with-snow-3195531/.

Horse-30 videos: we used the ground-truth annotations for 30 horse videos as described in ref. 16.

Benchmarking: data splits and training ratios

Pre-training datasets: For every OOD test dataset, we created a pre-training dataset comprising all datasets except the held-out OOD dataset. Within the pre-training datasets, we used 100% of the images and annotations, and we used the OOD datasets for performance evaluation.

OOD datasets: For AP-10K, we used the official training and validation sets. For AnimalPose, iRodent, and DLC-Openfield, we created our own splits and shuffles, using an 80:20 train/test ratio for AnimalPose and iRodent and a 95:5 train/test ratio for DLC-Openfield.

Note that in our data release, each leave-one-out dataset is noted in the metadata such that others can easily benchmark their models in the future.

Panoptic pose estimation

We cast animal pose estimation as panoptic segmentation71 on the animal body; i.e., every pixel on the body is potentially a semantically meaningful keypoint with an individual identity. Ideally, an infinite collection of diverse pose datasets covers this, and the union of keypoints defined across datasets forms the label space of panoptic pose estimation.

Data conversion and panoptic vocabulary mapping (generalized data converter)

Data came from multiple sources and in multiple formats. To homogenize the different annotation formats (COCO-style, DeepLabCut format, etc.), we implemented a generalized data converter. We parsed more than 20 public datasets and re-formatted them into DeepLabCut projects. Besides data conversion, the generalized data converter also implements key steps of the panoptic animal pose estimation task formulation (a minimal code sketch follows the list below). These steps include:

  1. Hand-crafted conversion mapping. The same anatomical keypoint might be named differently in different datasets, or different anatomical locations might correspond to different labels in different datasets. Thus, the generalized data converter used a hand-crafted conversion mapping (see Supplementary Figs. S1a and S5) to enforce a shared vocabulary among datasets. We checked the visual appearance of keypoints to determine whether two keypoints (in different datasets) should be regarded as identical. In such cases, the model had to learn (possible) dataset bias in a data-driven way. We can also think of this as a form of data augmentation that randomly shifts keypoint coordinates by a small magnitude, which is the case for keypoints that most dataset creators agree on (e.g., keypoints on the face). For keypoints on the body, the quality of the conversion table can be critical for the model to learn a stable representation of animal bodyparts.

  2. Vocabulary projection. After the conversion mapping was made, keypoints from the various datasets were projected into a super-set keypoint space. Every keypoint became a one-hot vector in the union of the keypoint spaces of all datasets. Thereby the animal pose vocabularies were unified.

  3. Dataset merging. After annotations were unified into the super-set annotation space, we merged annotations from the datasets by concatenating them into a collection of annotation vectors. Note that if the images only displayed a single species, we essentially built a specialized dataset for that species in different cage and camera settings. If multiple species were present, we grouped them in a species-invariant way to encourage the model to learn species-agnostic keypoint representations, as is the case for our SuperAnimal-Quadruped model.
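To make steps 1 and 2 concrete, here is a minimal sketch of a conversion mapping and vocabulary projection. The dataset names, keypoint names, and conversion table are hypothetical illustrations, not the actual SuperAnimal vocabulary or the released converter code:

import numpy as np

SUPERSET_KEYPOINTS = ["nose", "left_ear", "right_ear", "tail_base"]  # hypothetical super-set

# hand-crafted conversion mapping: dataset-specific keypoint name -> super-set name
CONVERSION_TABLE = {
    "dataset_A": {"snout": "nose", "earL": "left_ear", "earR": "right_ear"},
    "dataset_B": {"nose": "nose", "tailbase": "tail_base"},
}

def project_to_superset(dataset_name, annotation):
    # Project one image's annotation ({keypoint_name: (x, y)}) into the super-set space.
    # Keypoints a dataset does not define stay NaN, so the corresponding loss terms can
    # later be masked (see keypoint gradient masking below).
    projected = np.full((len(SUPERSET_KEYPOINTS), 2), np.nan)
    for src_name, (x, y) in annotation.items():
        if src_name in CONVERSION_TABLE[dataset_name]:
            target = CONVERSION_TABLE[dataset_name][src_name]
            projected[SUPERSET_KEYPOINTS.index(target)] = (x, y)
    return projected

# Dataset merging (step 3) then simply concatenates the projected annotations of all datasets.
merged = [
    project_to_superset("dataset_A", {"snout": (10.0, 12.0), "earL": (8.0, 9.0)}),
    project_to_superset("dataset_B", {"nose": (5.0, 6.0), "tailbase": (20.0, 25.0)}),
]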

The SuperAnimal algorithmic enhancements for training and inference

Keypoint gradient masking

First, we manually verified a semantic mapping of the datasets with diverse naming (e.g., nose in dataset 1 and snout in dataset 2). Then, we defined a master keypoint space, where no single dataset needed to have all the keypoints defined. This yielded sparse keypoint annotations in the super-set keypoint space (Supplementary Fig. S1b, c). Training naively on these projected annotations would harm training stability, as the loss function would penalize undefined keypoints as if they were not visible (i.e., occluded).

For stable training of our panoptic pose estimation model, we mask components of the loss function across keypoints. The keypoint mask nk is set to 1 if the keypoint k is present in the annotation of the image and to 0 if the keypoint is absent. We denote the predicted probability for keypoint k at pixel (i, j) as pk(i, j) ∈ [0, 1) and the respective label as tk(i, j) ∈ {0, 1}, and formulate the masked Lk error loss function as

$$\mathcal{L}_{\mathrm{L}_{k}}=\sum_{k=1}^{m}\sum_{i,j} n_{k}\cdot \left\Vert p_{k}(i,j)-t_{k}(i,j)\right\Vert_{z},$$

(1)

with z = 2 for mean square error and z = 1 for L1 loss (e.g., used for locref maps in DLCRNet10) and the masked cross-entropy loss function as

$$\mathcal{L}_{\mathrm{CE}}=-\sum_{k=1}^{m}\sum_{i,j} n_{k}\, t_{k}(i,j)\log p_{k}(i,j).$$

(2)

Note that we distinguish between keypoints that are not annotated and keypoints that are not defined in the original dataset, and we only mask undefined keypoints. This is important because, in the case of side-view animals, “not annotated” could also mean occluded/invisible; masking not-annotated keypoints would encourage the model to assign high likelihood to occluded keypoints.

Also note that the network predictions pk(i, j) are generated by applying a softmax to the logits lk(i, j) across all possible keypoints, including masked ones:

$$p_{k}(i,j)=\frac{\exp l_{k}(i,j)}{\sum_{k'=1}^{M}\exp l_{k'}(i,j)}.$$

(3)

M is the total number of keypoints. The masking in the loss function then ensures that the probability assigned to undefined keypoints is neither penalized nor encouraged during training.
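For illustration, a minimal PyTorch-style sketch of the masked softmax and cross-entropy of Eqs. (2)-(3) is shown below; the tensor shapes and variable names are our assumptions and not taken from the released implementation:

import torch

def masked_cross_entropy(logits, targets, keypoint_mask, eps=1e-12):
    # logits, targets: (B, M, H, W) heatmaps; keypoint_mask: (B, M) with 1 = defined, 0 = undefined.
    # The softmax is taken over the keypoint dimension (Eq. 3), so masked keypoints still
    # compete for probability mass, but their loss terms are excluded (Eq. 2).
    probs = torch.softmax(logits, dim=1)
    ce_per_keypoint = -(targets * torch.log(probs + eps)).sum(dim=(2, 3))  # (B, M)
    return (keypoint_mask * ce_per_keypoint).sum()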

Automatic keypoint matching

In cases where users want to apply our models to an existing, annotated pose dataset, we recommend using our keypoint matching algorithm. This step is important because our models define their own vocabulary of keypoints, which might differ from that of the novel pose dataset. To minimize the gap between the models’ vocabulary and the dataset vocabulary, we propose a matching algorithm. First, we use our model to perform zero-shot inference on the whole dataset, which gives pairs of predictions and ground truth for every image. Then, we cast the matching between the model’s predictions (2D coordinates) and the ground truth as bipartite matching, using the Euclidean distance as the cost between pairs of keypoints, and solve it with the Hungarian algorithm. For every image, we thus obtain a matching matrix, where 1 denotes a match and 0 a non-match. Because the model’s predictions can be noisy from image to image, we average this matching matrix across all images and perform another bipartite matching, resulting in the final keypoint conversion table between the model and the dataset (example affinity matrices are shown in Supplementary Fig. S2a, b).

Note that the quality of the matching will impact the performance of the model, especially for zero-shot inference. If, for example, the annotation nose is mistakenly converted to the keypoint tail and vice versa, the model will have to unlearn the channels that correspond to nose and tail (see also the case study in Mathis et al.7). For evaluation metrics such as mAP, where a per-keypoint sigma is used, we sample the sigmas from the SuperAnimal sigmas (see Supplementary Table S1).
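A minimal sketch of this two-stage matching using scipy’s Hungarian solver (variable names are ours; the released implementation may differ):

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_keypoint_vocabularies(pred_kpts, gt_kpts):
    # pred_kpts: (num_images, M_model, 2) zero-shot predictions; gt_kpts: (num_images, M_dataset, 2) ground truth
    num_images = pred_kpts.shape[0]
    match_counts = np.zeros((pred_kpts.shape[1], gt_kpts.shape[1]))
    for i in range(num_images):
        # Euclidean distance between every (model keypoint, dataset keypoint) pair
        cost = np.linalg.norm(pred_kpts[i][:, None, :] - gt_kpts[i][None, :, :], axis=-1)
        rows, cols = linear_sum_assignment(cost)  # per-image bipartite matching
        match_counts[rows, cols] += 1
    match_counts /= num_images  # average the per-image matching matrices
    # second bipartite matching on the averaged matrix (negated to maximize agreement)
    rows, cols = linear_sum_assignment(-match_counts)
    return list(zip(rows, cols))  # final keypoint conversion table (model index, dataset index)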

Memory replay fine tuning

Catastrophic forgetting72 describes a classic problem in continual learning38. Indeed, a model gradually loses its ability to solve previous tasks after it learns to solve new ones.

Fine-tuning a SuperAnimal model falls into the category of continual learning: the downstream dataset potentially defines different keypoints than those learned by the models. Thus, the models might forget the keypoints they learned and only pick up those defined in the target dataset. Retraining on both the original datasets and the new one is not a feasible option, as datasets cannot always be easily shared and more computational resources would be required.

To counter this, we treat zero-shot inference of the model as a memory buffer that stores knowledge from the original model. When we fine-tune a SuperAnimal model, we replace the model-predicted keypoints with the ground-truth annotations wherever the latter are defined, resulting in hybrid learning of old and new knowledge. Because the quality of the zero-shot predictions can vary, we use the prediction confidence (0.7) as a threshold to filter out low-confidence predictions. With the threshold set to 1, memory replay fine-tuning reduces to naive fine-tuning.

Memory replay pseudo-code:

def is_defined(keypoints):
    # Check whether the original dataset defines this keypoint. We use the flag -1 to denote
    # that a given keypoint is not defined in the original dataset. Note this is different
    # from "not annotated" in the COCO convention, which uses flag 0.
    return keypoints[2] >= 0

def load_pseudo_keypoints(image_ids):
    # Get the pseudo keypoints by image IDs.
    # Note: pseudo keypoints are loaded from disk and fixed throughout the process,
    # so no label drifting is expected, unlike in typical online pseudo-labeling.
    return pseudo_keypoints

def get_confidence(keypoints):
    # Get the model confidence of a predicted keypoint. Unlike ground-truth data, which have
    # 3 discrete flags, predicted keypoints have a confidence that can be used as a likelihood
    # readout for post-inference analysis.
    return keypoints[2]

def memory_replay(model, superset_gt_data_loader, optimizer, threshold):
    # Ground-truth data are preprocessed such that annotations are in the super-set keypoint space.
    # We extended the COCO visibility flag to the following: -1: not defined in the target dataset,
    # 0: not labeled, 1: labeled but not visible, 2: labeled and visible.
    for batch_data in superset_gt_data_loader:
        gt_keypoints = batch_data["keypoints"]
        image_ids = batch_data["image_ids"]
        images = batch_data["images"]
        # model() is a PyTorch-style forward function
        preds = model(images)
        pseudo_keypoints = load_pseudo_keypoints(image_ids)
        # the last dimension (3) is (x, y, flag)
        batch_size, num_kpts, _ = gt_keypoints.shape
        # iterate through the batch
        for b_id in range(batch_size):
            # iterate through keypoints
            for kpt_id in range(num_kpts):
                # If this bodypart is not defined in the new dataset, we use the saved pseudo-labels
                # (zero-shot predictions) as ground truth. This prevents catastrophic forgetting and
                # drifting. We also use the confidence to filter the pseudo keypoints.
                if not is_defined(gt_keypoints[b_id][kpt_id]) and \
                        get_confidence(pseudo_keypoints[b_id][kpt_id]) > threshold:
                    # We assume a single-animal scenario for simplicity. For multiple animals,
                    # matching between ground-truth and pseudo keypoints needs to be done.
                    gt_keypoints[b_id][kpt_id] = pseudo_keypoints[b_id][kpt_id]
        loss = criterion(preds, gt_keypoints)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Model architectures

For SuperAnimal-TopViewMouse we used a bottom-up model (DLCRNet), a top-down model (HRNet-w32), and a transformer (AnimalTokenPose) (see below), whereas for SuperAnimal-Quadruped we only used the top-down HRNet-w32. Please refer to Supplementary Fig. S6 and the Supplementary Discussion for why we use only top-down models for quadrupeds.

Bottom-Up model

DLCRNet

The SuperAnimal-TopViewMouse model used the bottom-up approach as described in DeepLabCut9,10. We used DLCRNet_ms510 as the baseline network architecture for its excellent performance on animal pose estimation. A batch size of 8 was used, and SuperAnimal-TopViewMouse was trained for a total of 750k iterations. In the fine-tuning stage, a batch size of 8 was used for 70k iterations. The Adam optimizer73 was used for all training instances, and we otherwise used default parameters. We follow DeepLabCut’s multi-step learning rate scheduler to drop the learning rate three times from 1e-4 to 1e-5. Cross-entropy is used for learning heatmaps. For fine-tuning experiments, we keep the same optimizer, batch size, and learning rate scheduler; the total number of training steps is adjusted to 70k iterations. During video adaptation, we keep the same optimizer and learning rate scheduler, but with batch size 1 and a total of 1000 training steps. We observe that this low computational budget is sufficient for the model to adapt.

Top-Down models

Object detectors

For the object detectors, we trained Faster R-CNN with a ResNet-50 backbone74 and incorporated Feature Pyramid Networks45 for enhanced feature extraction. Training was conducted over 100 epochs using the AdamW optimizer and LRListScheduler. We initiated training with a learning rate of 0.0001, which was decreased to 1e-5 at the 90th epoch. The batch size was set to 4 for both the SuperAnimal-TopViewMouse and SuperAnimal-Quadruped detectors.

We processed the TopViewMouse-5K and Quadruped-80K datasets to ensure that each dataset contains only one animal category, namely top-view mice or quadrupeds. This approach was adopted to train the model to detect generic animal types effectively. During training, image resizing to 1333 × 800 pixels, random flipping, normalization, and padding were applied as part of the data augmentation pipeline.
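As a rough, hedged illustration only (the snippet below uses torchvision and generic settings, not the authors’ detector training pipeline), a Faster R-CNN with a ResNet-50-FPN backbone and a single generic animal class can be set up as follows:

import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Faster R-CNN with a ResNet-50-FPN backbone, COCO-pretrained weights
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# replace the box predictor with a two-class head: background + one generic animal category
num_classes = 2
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)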

HRNet-w32

HRNet-w3220 is used for the top-down SuperAnimal-Quadruped models. The training protocol follows that described in the AP-10K paper31. Specifically, we employed the Adam optimizer73 with an initial learning rate of 5e-4. The total training duration was set to 210 epochs, with a step decay applied to the learning rate at epochs 170 and 200. A batch size of 64 was used. Consistent with the AP-10K protocol, random flipping, half-body transformation, and random scale rotation were applied during training, along with flip testing during evaluation.

For fine-tuning models with a very small number of unique images (e.g., fewer than 64 images in the training set), we fixed the running statistics of the batch normalization layers and used a smaller initial learning rate of 5e-5. This setting improves training stability.

HRNet-w32 was also employed for the top-down SuperAnimal-TopViewMouse models, adhering to exactly the same training protocol as SuperAnimal-Quadruped.

AnimalTokenPose

Inspired by recent results of Vision Transformers21 on human pose estimation tasks23, we assessed ViT’s zero-shot performance. We conducted experiments with the original ViT architecture in three setups: with masked auto-encoder (MAE)75 initialization, DeiT76 initialization, and truncated normal initialization (standard deviation 0.02, mean 0). Following the original setup21, we did not use a convolutional backbone. The input image of size 224 × 224 was split into patches of 16 × 16 pixels, the depth of the transformer encoder was 12, and each attention layer had 12 heads with a feature dimension of 768. It was crucial to use a pre-trained vision transformer; without pre-training, the model did not converge for either dataset (data not shown).

We also adapted the TokenPose model22, which adds information about each keypoint in learnable queries called keypoint embeddings. The model was originally used for human pose estimation with a fixed number of keypoints. Combining TokenPose and panoptic animal pose estimation, we obtain AnimalTokenPose models that achieve high zero-shot performance on the OOD datasets we prepared (Figs. 1 and 2).

For keypoint estimation, 12 transformer encoder blocks with a feature vector of size 192 were stacked. While the ViT encoder received raw pixels as input, in TokenPose22 images of size 256 × 256 are first processed by a convolutional backbone, and the captured abstract features are then split into patches of size 4 × 4. As in TokenPose22, we used the first three stages of HRNet77 and 2 stacked residual blocks from a ResNet78.
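The keypoint-embedding idea can be sketched as follows; this is a simplified illustration of learnable keypoint tokens attended jointly with image tokens, not the actual AnimalTokenPose code (the head count, heatmap size, and the default of 39 keypoints, taken from the quadruped super-set, are illustrative):

import torch
import torch.nn as nn

class KeypointTokenHead(nn.Module):
    # learnable keypoint tokens are concatenated with patch tokens and processed jointly
    def __init__(self, num_keypoints=39, embed_dim=192, depth=12, heads=8, heatmap_size=(64, 64)):
        super().__init__()
        self.kpt_tokens = nn.Parameter(torch.randn(1, num_keypoints, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.to_heatmap = nn.Linear(embed_dim, heatmap_size[0] * heatmap_size[1])
        self.heatmap_size = heatmap_size

    def forward(self, patch_tokens):  # patch_tokens: (B, N, embed_dim) from a CNN/ViT stem
        batch = patch_tokens.shape[0]
        tokens = torch.cat([self.kpt_tokens.expand(batch, -1, -1), patch_tokens], dim=1)
        tokens = self.encoder(tokens)
        kpt = tokens[:, : self.kpt_tokens.shape[1]]  # keep only the keypoint tokens
        return self.to_heatmap(kpt).view(batch, -1, *self.heatmap_size)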

The training procedure for AnimalTokenPose is identical to HRNet-w32 detailed above.

Video inference methods and considerations

Domain shifts and unsupervised adaptation

Domain shifts79 describe a classic vulnerability of neural networks, where a model receives inputs from a data domain that is dissimilar from the training data domain, usually leading to large performance degradation. We empirically observe three types of domain shifts when applying our models in a zero-shot manner, ranging from pixel statistics shift80, to spatial shift81, to semantic shift79,80. To mitigate these, we applied two methods: test-time spatial-pyramid search and video adaptation.

Handling the train and test time resolution discrepancy for bottom-up models

One notable challenge our bottom-up models face at inference time is the discrepancy in animal appearance sizes and image resolutions between the train and test stages. Even though scale-jitter augmentation is part of most pose estimation frameworks’ data augmentation pipelines, including DeepLabCut’s10,59,82, the model can still have trouble handling dramatic changes in image resolution or animal appearance size. To further deal with scale changes, we employ spatial-pyramid search at test time (see below). The same challenge arises in the fine-tuning stage: the downstream dataset (and the animals present in it) can have very different animal sizes from the pre-training datasets, causing a distribution shift for the pre-trained models. We therefore resize downstream datasets (to a height of 400 pixels, preserving the aspect ratio) if their sizes are drastically different from our training images.

Test time spatial-pyramid search for bottom-up models

As bottom-up models do not enforce standardization of the animal size seen by the pose estimator, the relative animal size (ratio between the animal’s bounding-box area and the image area) seen in the pre-training stage and the testing stage can be very different. In other words, the bottom-up model performs best with the animal sizes seen during training. The relative animal size at test time is unknown and can therefore cause performance degradation due to spatial distribution shift. We propose to apply multiple rescaling factors to the test image and aggregate the model’s predictions.

Therefore, during inference, we build a spatial pyramid composed of the model’s predictions for multiple copies of the original image at different resolutions. We use the model’s confidence as the criterion to filter out resolutions that give sub-optimal performance and aggregate (taking the median) the predictions from resolutions with above-threshold confidence as our final prediction.

The train-test resolution discrepancy83 has been studied actively, and most approaches address it through multi-resolution fusion10,45,77. Previous work mostly focuses on IID settings where the resolution of the testing images does not vary considerably from the training images. Moreover, prior work approaches multi-resolution fusion via deep features, requiring modifications of the architecture and adding more parameters. In contrast, the proposed spatial-pyramid search is designed to aid SuperAnimal models in zero-shot scenarios where the resolutions of the testing images are most likely out of distribution with respect to our training images. We did not apply multi-resolution fusion via deep features, as that requires fixing the choice of architecture. Moreover, commonly used multi-scale testing in the IID setting does not need to carefully filter out very noisy predictions. This method can also be used for calibration, to find the optimal scale.

Spatial-pyramid pseudo-code:

def spatial_pyramid_search(images, model, scale_list, confidence_threshold, cosine_threshold):
    # generate rescaled versions of the original images with multiple scaling factors
    rescaled_images = rescale_images(images, scale_list)
    preds_per_scale = []
    # gather the model's predictions, assuming the final pred_keypoints are projected back
    # to the original image space by the forward function
    for rescaled_image in rescaled_images:
        pred_keypoints = model(rescaled_image)
        # use the median to get a robust estimate of the expected keypoint positions
        median_keypoints = get_median_keypoints(pred_keypoints)
        # if the rescaled image is not suitable for the model, we expect the model to have
        # a confidence lower than the given threshold
        pred_keypoints = filter_by_confidence(pred_keypoints, confidence_threshold)
        # confidence filtering alone does not remove all outliers; we additionally compare the
        # remaining predictions to the median keypoints and drop low-quality predictions
        pred_keypoints = filter_by_cosine_similarity(pred_keypoints, median_keypoints, cosine_threshold)
        preds_per_scale.append(pred_keypoints)
    return get_median_keypoints(preds_per_scale)

Video adaptation

To aid SuperAnimal models in adapting to novel videos, we run inference with the model on the videos and treat these predictions as pseudo ground-truth84 labels to train on. We remove predictions with low confidence to filter out unreliable pseudo-labels. We found it critical to fix the running statistics of the batch normalization layers during adaptation training (see Supplementary Information for more details). Empirically, 1000 iterations with batch size 1 are sufficient to greatly reduce jitter. The optimal number of iterations and the confidence threshold are video-specific hyperparameters.

Video adaptation pseudo-code:

def get_pseudo_predictions(frame_id):
    # return the pseudo predictions (the model's zero-shot predictions saved to disk) by frame id
    return pseudo_keypoints

def video_adaptation(model, video_data_loader, optimizer, threshold):
    for data in video_data_loader:
        # fix the running stats of the batch-normalization layers
        model.eval()
        frame_id = data["frame_id"]
        image = data["image"]
        pseudo_keypoints = get_pseudo_predictions(frame_id)
        preds = model(image)
        # predictions with low confidence are masked from the loss calculation
        loss = criterion(preds, pseudo_keypoints, mask_by_threshold=threshold)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Evaluation metrics

Supervised metrics for pose estimation

RMSE

Root mean squared error (RMSE) is a metric measuring the distance between predictions and ground-truth annotations in pixel space7,9. However, for pose estimation it does not take the scale of the image and the individuals into consideration, and the distance is thus non-normalized. As our data are highly variable, we also sometimes use normalized errors. We use RMSE for the DLC-Openfield benchmarking, as this was the original authors’ main reported metric. Note that when evaluating RMSE, we do not remove predictions that have low confidence due to occlusion; therefore, all predictions, including outliers, are penalized by RMSE.

Normalized error

For the Horse-10 experiments, we compute RMSE between ground-truth annotations and predictions with a confidence cutoff of 0 (to include all predictions and ensure that low-confidence predictions are also penalized). The resulting RMSE is then normalized by the eye-to-nose ground-truth distance provided by ref. 16.
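As a simple illustration (our own helpers, not the benchmark code), RMSE and the eye-to-nose-normalized error can be computed as:

import numpy as np

def rmse(pred, gt):
    # pred, gt: (N, K, 2) keypoint coordinates in pixels
    return np.sqrt(np.mean(np.sum((pred - gt) ** 2, axis=-1)))

def normalized_error(pred, gt, eye_idx, nose_idx):
    # per-keypoint Euclidean error divided by each frame's ground-truth eye-to-nose distance
    scale = np.linalg.norm(gt[:, eye_idx] - gt[:, nose_idx], axis=-1)  # (N,)
    err = np.linalg.norm(pred - gt, axis=-1)                           # (N, K)
    return np.mean(err / scale[:, None])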

mAP

Mean average precision (mAP) is the average precision computed from the object keypoint similarity (OKS)85:

$$OKS=\frac{\sum_{i=1}^{n}\left[\exp\left(-d_{i}^{2}/2s^{2}k_{i}^{2}\right)\delta\left(v_{i} > 0\right)\right]}{\sum_{i=1}^{n}\left[\delta\left(v_{i} > 0\right)\right]}$$

(4)

where di is the Euclidean distance between each corresponding ground-truth and detected keypoint, vi is the visibility flag of the ground truth, s is the object scale, and ki is a per-keypoint constant that controls falloff (see full implementation details at ref. 40). For lab mice, we used 0.1 for all keypoints, following ref. 10. For quadrupeds, we used the sigmas (per-keypoint constants) of the 17 keypoints shared with AP-10K31 and used 0.067 for the rest of the animal keypoints (see below). s is the square root of the bounding-box area (width × height of the bounding box).

The body parts along with their corresponding k in pixels are: nose (0.026), upper_jaw (0.067), lower_jaw (0.067), mouth_end_right (0.067), mouth_end_left (0.067), right_eye (0.025), right_earbase (0.067), right_earend (0.067), right_antler_base (0.067), right_antler_end (0.067), left_eye (0.025), left_earbase (0.067), left_earend (0.067), left_antler_base (0.067), left_antler_end (0.067), neck_base (0.035), neck_end (0.067), throat_base (0.067), throat_end (0.067), back_base (0.067), back_end (0.067), back_middle (0.035), tail_base (0.067), tail_end (0.079), front_left_thai (0.072), front_left_knee (0.062), front_left_paw (0.079), front_right_thigh (0.072), front_right_knee (0.062), front_right_paw (0.089), back_left_paw (0.107), back_left_thigh (0.107), back_right_thai (0.087), back_left_knee (0.087), back_right_knee (0.089), back_right_paw (0.067), belly_bottom (0.067), body_middle_right (0.067), body_middle_left (0.067).
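A minimal numpy sketch of Eq. (4) for a single animal is shown below for orientation; the reported numbers use the standard COCO-style evaluation tooling (ref. 40) rather than this helper:

import numpy as np

def oks(pred, gt, visibility, per_keypoint_k, bbox_area):
    # pred, gt: (K, 2); visibility: (K,) ground-truth flags; per_keypoint_k: (K,) falloff constants
    d2 = np.sum((pred - gt) ** 2, axis=-1)
    s2 = bbox_area  # s is the square root of the bounding-box area, so s**2 = area
    valid = visibility > 0
    if not valid.any():
        return 0.0
    return np.mean(np.exp(-d2[valid] / (2 * s2 * per_keypoint_k[valid] ** 2)))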

Unsupervised metrics for video prediction smoothness

Convex hull body area measurement

To evaluate the smoothness of SuperAnimal model predictions in video, we utilize a simple unsupervised heuristic: we compute the area of a polygon encompassing all keypoints, the idea being that the smoother the detections, the lower the variance of this polygon’s area. Formally, we estimate the animal body area Abody as the area of the convex hull containing all keypoints over time. Let \(\mathcal{K}\) represent the set of all keypoints for the animal at each time step, and \(\mathrm{H}(\mathcal{K})\) denote the convex hull containing all keypoints. The animal body area, Abody, is then given by the area of the convex hull:

$$A_{\mathrm{body}}=\mathrm{Area}(\mathrm{H}(\mathcal{K}))$$

(5)

where \(\mathrm{Area}(\mathrm{H}(\mathcal{K}))\) is the function that calculates the area of the convex hull \(\mathrm{H}(\mathcal{K})\) containing all keypoints over time.
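A sketch of Eq. (5) using scipy, assuming keypoints of shape (K, 2) per frame:

import numpy as np
from scipy.spatial import ConvexHull

def body_area(keypoints):
    # keypoints: (K, 2) predicted keypoints for one frame; returns the convex-hull area (Eq. 5)
    # for 2D point sets, ConvexHull.volume is the enclosed area (ConvexHull.area is the perimeter)
    return ConvexHull(keypoints).volume

def body_area_variance(keypoints_per_frame):
    # lower variance of the hull area over time indicates smoother detections
    return np.var([body_area(k) for k in keypoints_per_frame])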

Jittering metric

We define jittering, denoted by J, as the average of the absolute values of centered, non-signed speeds across all examples and all keypoints. For a given keypoint k and example e, the jittering value is computed as follows:

$$\mathrm{J}_{k,e}=\frac{1}{N_{k,e}}\sum_{i=1}^{N_{k,e}}\left\vert v_{k,e,i}\right\vert$$

(6)

where Jk,e is the jittering value for keypoint k in example e; Nk,e is the total number of centered, non-signed speed measurements for keypoint k in example e; and vk,e,i is the i-th centered, non-signed speed measurement for keypoint k in example e.
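In code, one plausible reading of Eq. (6), assuming "centered, non-signed speeds" means frame-to-frame speeds with their mean subtracted, is:

import numpy as np

def jitter(trajectory):
    # trajectory: (T, 2) coordinates of one keypoint over time
    speeds = np.linalg.norm(np.diff(trajectory, axis=0), axis=-1)  # frame-to-frame speeds
    centered = speeds - speeds.mean()                              # center (remove the mean speed)
    return np.mean(np.abs(centered))                               # average absolute value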

Keypoint dropping metric

Keypoint dropping counts the number of keypoints with predicted likelihood below a set threshold in every predicted frame (the cutoff was set to 0.1 for bottom-up models and 0.05 for top-down models). In practice, low-confidence predictions are dropped to remove predictions that are unreliable or occluded.

In this work, keypoint dropping is used to complement metrics such as RMSE to evaluate the jitter of predictions or catastrophic forgetting. Note that keypoint dropping is only valid for top-view, openfield videos (with almost no occlusion), where every keypoint is expected to be predicted with relatively high confidence. For side-view poses, keypoint dropping is not suitable, as many bodyparts are occluded.

Let Ktotal be the total number of keypoints in the video sequence, and Kdropped be the count of keypoints that are below a defined threshold Tthreshold and considered for dropping in environments with little occlusion and a top view.

$$K_{\mathrm{dropped}}(t)=\sum_{i=1}^{K_{\mathrm{total}}}\delta_{i}(t)$$

(7)

where Kdropped(t) is the count of keypoints dropped at time t, and δi(t) is an indicator function that returns 1 if the i-th keypoint is below the threshold at time t, and 0 otherwise:

$$\delta_{i}(t)=\begin{cases}1, & \text{if } \mathrm{score}_{i}(t) < T_{\mathrm{threshold}}\\ 0, & \text{otherwise}\end{cases}$$

(8)

where \(\mathrm{score}_{i}(t)\) is the confidence score or measurement of the i-th keypoint at time t.
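In code, Eqs. (7)-(8) amount to a thresholded count per frame:

import numpy as np

def keypoints_dropped_per_frame(confidences, threshold):
    # confidences: (T, K) predicted likelihoods; returns (T,) counts of keypoints below the threshold
    return np.sum(confidences < threshold, axis=1)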

Adaptation gain (or loss) in mAP

Denotes the adapted model’s change in mAP on the adapted video. A negative number means a performance degradation after adaptation.

Every video in the Horse-30 dataset is densely annotated; thus we can calculate the mAP gain on a video after the model is adapted to it. We use the pre-adaptation zero-shot mAP as the reference and calculate the difference between the post-adaptation mAP and the pre-adaptation mAP.

Robustness gain (or loss) in mAP

Calculates the mAP gain on all videos from the same dataset. This helps to identify whether the model overfits the single video it trains on or performs successful domain adaptation with respect to the whole video dataset. We use this robustness gain to complement the adaptation gain. We calculate the mAP of the adapted models on all 30 videos of the Horse-30 dataset16. A positive robustness gain also suggests that the method can be used on one video and benefit all other videos in the same dataset.

Video adaptation compared to baselines using supervised metrics (mAP)

We use HRNet-w32 together with the detector we trained to run inference on the videos and obtain the pseudo-labels used for video adaptation.

For the video adaptation algorithm, the prediction confidence threshold is set to 0.5, and we perform video adaptation for 4 epochs on each video it adapts to. The learning rate scheduler and augmentations are identical to HRNet-w32’s.

PPLO

Progressive Pseudo-label-based Optimization (PPLO)33 implements iterative pseudo-labeling that follows a curriculum: pseudo-labeling starts with high-confidence predictions and then trains with lower-confidence predictions, following an easy-to-hard curriculum. We initialize three confidence levels as [0.9, 0.7, 0.5] and sequentially apply pseudo-labeling to the model for four epochs at each confidence level, for a total of 12 epochs of training with PPLO.

The full PPLO algorithm also requires training on both labeled source data and labeled target data, which our video adaptation does not do. For fairness, we only performed the iterative pseudo-labeling step.

Kalman filtering

We apply a constant-velocity Kalman filter (implemented in filterpy v1.4.5) as post-processing to our pre-adaptation zero-shot pose predictions. As Kalman filtering does not modify the model weights, we do not report the general robustness gain on it.
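For reference, a constant-velocity Kalman filter for a single keypoint can be set up with filterpy as sketched below; the noise parameters are illustrative, as the exact values used here are not specified:

import numpy as np
from filterpy.kalman import KalmanFilter

def constant_velocity_filter(dt=1.0, meas_noise=5.0, process_noise=0.1):
    # state: [x, y, vx, vy]; measurement: [x, y]
    kf = KalmanFilter(dim_x=4, dim_z=2)
    kf.F = np.array([[1, 0, dt, 0],
                     [0, 1, 0, dt],
                     [0, 0, 1, 0],
                     [0, 0, 0, 1]], dtype=float)  # constant-velocity transition
    kf.H = np.array([[1, 0, 0, 0],
                     [0, 1, 0, 0]], dtype=float)  # only positions are observed
    kf.R *= meas_noise      # measurement noise
    kf.Q *= process_noise   # process noise
    kf.P *= 100.0           # initial state uncertainty
    return kf

# usage per keypoint trajectory: for each frame, kf.predict(); kf.update(xy); filtered_xy = kf.x[:2]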

Statistical analysis

Linear mixed-effects models were fitted in R86 using the lme4 package (v1.1.31)87. Training data ratio (or, equivalently, the number of images) and fine-tuning methods were defined as fixed effects, whereas the various datasets and shuffles were treated as random effects; random intercepts and slopes were also added at the dataset level. The best models were selected based on the Akaike Information Criterion (AIC); adding complexity did not result in lower AIC and even led to singular fits, indicative of overfitting. The weight of evidence for an effect was computed using likelihood ratio tests, as well as with p values provided by lmerTest (v3.1.3). Two-sided pairwise contrasts and Cohen’s d standardized effect sizes were computed with the emmeans package (v1.8.9), and degrees of freedom were estimated with the Kenward–Roger method. Distributions of prediction errors with and without spatial-pyramid search were compared with the two-sample, one-sided (alternative hypothesis: “less”) Kolmogorov–Smirnov test. The significance threshold was set at 0.05.

Behavioral action segmentation, OFT

As our benchmark dataset, we used the openfield test (OFT) task from Sturman et al.15. We calculated the same skeleton-based features by concatenating 10 distances between keypoints, six angles, four body areas, and two additional boolean variables coding whether the nose and the head center were inside the arena, resulting in a 22-dimensional vector at each time step. For the action classifier, we used an MLP neural network as the action decoder, which acted as a sliding window across 31 time steps to perform action segmentation, and we used the F1 score on supported and unsupported rears as the evaluation metric. As in the original paper, we performed leave-one-out cross-validation on 20 videos and across three annotators.

Note that the original model for the OFT task from Sturman et al. includes the center and four corners of the mouse cage, which is critical for their handcrafted features to determine the relative distance between the mouse and the walls. As our SuperAnimal models focus on animal bodyparts only, we take the corner coordinates from their released data for the sake of comparison. In practice, those static environmental keypoints can be provided by users via an interactive GUI for videos.
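A hedged sketch of the sliding-window action decoder: the 22-dimensional features and 31-frame window come from the description above, while the window stacking, classifier library, and layer size are our illustrative choices:

import numpy as np
from sklearn.neural_network import MLPClassifier

def make_windows(features, window=31):
    # features: (T, 22) per-frame features; returns (T - window + 1, window * 22) stacked windows
    idx = np.arange(window)[None, :] + np.arange(len(features) - window + 1)[:, None]
    return features[idx].reshape(len(idx), -1)

window = 31
clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200)
# labels are assigned to the center frame of each window:
# clf.fit(make_windows(train_features, window), train_labels[window // 2 : -(window // 2)])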

For CEBRA48, we used the model architecture “offset10-model”. The output dimension was set to 32, as found via a simple grid search over the following values: [4, 8, 16, 32]. We trained it for 5000 iterations with batch size 4096, the Adam optimizer, and learning rate 1e-4.
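Sketched with the cebra package’s sklearn-style API (the feature matrix is the 22-dimensional skeleton feature described above; parameter names reflect our understanding of the package):

from cebra import CEBRA

cebra_model = CEBRA(model_architecture="offset10-model",
                    output_dimension=32,
                    batch_size=4096,
                    learning_rate=1e-4,
                    max_iterations=5000)
# features: (num_timesteps, 22) skeleton-based feature matrix
cebra_model.fit(features)
embedding = cebra_model.transform(features)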

Behavioral action segmentation, MABe

MABe has two rounds, and since only round 2 released videos, we use the round 2 videos as inputs for our pretrained SA-TVM (SuperAnimal-TopViewMouse) model. Since our paper focuses on pretrained pose models, we use the recommended baselines49,50 from round 1 that build representations based on pose trajectories rather than RGB-based representation learning baselines (RGB-based representation learning is known to outperform pose-trajectory-based representations88). Videos from MABe round 2 contain three mice; therefore we used our top-down version of SA-TVM. The procedure is as follows: we ran inference with our pretrained top-down SA-TVM on all 1830 videos from round 2, converted the pose results into the MABe keypoint file format, and ran the PointNet code to obtain embeddings. Finally, we used the official evaluation code to compare performance between the official MABe poses, obtained from fully supervised learning, and the poses obtained via our models’ zero-shot predictions.

Behavioral action segmentation, Horse Gait Analysis

Our SA-Q model was run on the videos from Horse-3016. The start (2 s) and end (2 s) of each of the 30 videos were removed from the analysis to ignore instants when the horse is only partially visible. Front and back hoof contacts and lifts were identified using peak and valley detection, respectively, from the 2D kinematic traces of the front and back hooves. Beforehand, these trajectories were smoothed using a 2nd-order, low-pass, zero-lag Butterworth filter (cutoff = 3 Hz) and centered on a keypoint located on the animal’s back; this effectively expresses keypoint coordinates in a reference frame stationary relative to the moving horse, facilitating event detection. We extracted fore- and hindlimb strides between consecutive ground contacts, and stance phases from the contact of one hoof until it is lifted off the ground. Stride lengths (in pixels), stances, and the number of identified hoof contacts were then computed and qualitatively compared to those obtained using the densely annotated (ground-truth) keypoints (Fig. 4g–i).
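A plausible sketch of the smoothing and event-detection steps with scipy is given below; the choice of coordinate and the mapping of peaks to contacts versus lifts are our assumptions:

import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def detect_hoof_events(hoof_xy, back_xy, fps, cutoff_hz=3.0):
    # hoof_xy, back_xy: (T, 2) keypoint trajectories
    # express the hoof in a reference frame moving with the horse (centered on the back keypoint)
    rel_x = (hoof_xy - back_xy)[:, 0]
    # 2nd-order, zero-lag (forward-backward) low-pass Butterworth filter, cutoff 3 Hz
    b, a = butter(2, cutoff_hz / (fps / 2.0), btype="low")
    smoothed = filtfilt(b, a, rel_x)
    contacts, _ = find_peaks(smoothed)   # hoof maximally forward relative to the back
    lifts, _ = find_peaks(-smoothed)     # hoof maximally backward relative to the back
    return contacts, lifts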

Code API

High-level inference API (with optional spatial-pyramid search) for using SuperAnimal models in DeepLabCut:

import deeplabcut

video_path = 'demo-video.mp4'
superanimal_name = 'superanimal_topviewmouse'

scale_list = range(200, 600, 50)  # image height (pixels) range and increment

deeplabcut.video_inference_superanimal(
    [video_path],
    superanimal_name,
    scale_list=scale_list,
    video_adapt=True)

Web App

Many labs use DeepLabCut to define, annotate, and refine animal bodyparts, resulting in high-quality, diverse keypoint annotations for animals in different contexts10,59. To enable a positive feedback loop that turns the collection of animal pose data and models into a community effort, we developed a Web App.

The app is available at https://contrib.deeplabcut.org/. It allows anyone, within their browser, to (a) upload their own images and labels, (b) annotate community images, (c) run inference with available community models on their own data, and (d) share models to be hosted. The website is written in JavaScript with the Svelte framework, and the models are run on cloud servers.

Data collection

The website has an upload portal for groups to upload their models and labeled data in DeepLabCut format to help grow the pre-training datasets and allow researchers to build on top of varied models and data.

Annotation

Additionally, the website hosts a labeling web app that allows users to annotate curated images. The datasets currently available for annotation are from iNaturalist89 and the Open Images Dataset90. After selecting which dataset to label, images are displayed successively with the target animal prominently shown in front of an opaque masked background (which can be toggled off). The keypoint set is selected taking into account the species’ morphology and the value of each keypoint for subsequent analysis. Once the annotation is complete, the data are saved to the database and made available for use in further research.

Online inference

To allow testing DeepLabCut models in the browser, the user selects a few images and which model to run, and receives predictions along with confidence scores for each keypoint. Users can then adjust or delete keypoints, as well as download the model weights from HuggingFace. This allows for a quick and hassle-free evaluation of DeepLabCut’s capabilities and suitability for specific tasks, making it available to a wider range of users.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Source: https://www.nature.com/articles/s41467-024-48792-2
