Introduction
============

This data repository contains the data underpinning the following paper:

Title: Group-personalized regression models for prediction of mental health scores from objective mobile phone data streams
Authors: N. Palmius; K. E. A. Saunders; O. Carr; J. R. Geddes; G. M. Goodwin; and M. De Vos
Journal: Journal of Medical Internet Research

Full citation details are available in the metadata for the repository.

Data Overview
=============

The following data are available in this repository:

  * Demographic data from Table 2.
  * QIDS scores and calculated features for participants included in the analysis.
  * Mean regression results underpinning the summary statistics in Table 3.

Data format and notation
------------------------

All data are provided as standard comma-delimited CSV files inside the compressed data.zip file.

Each participant is identified by a unique random ID field, different from the one used to identify participants in the associated AMoSS study.
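
As an illustration of working with this format, the following minimal Python sketch reads one CSV file directly out of the archive using only the standard library. It assumes that the CSV files sit at the top level of data.zip and are UTF-8 encoded; neither is confirmed here, so the member name may need adjusting. The function name read_csv_from_zip is illustrative only.

```python
import csv
import io
import zipfile


def read_csv_from_zip(archive, member):
    """Return the rows of one CSV file inside a zip archive as a list of dicts.

    `archive` is a path or a binary file object. `member` is the file name
    inside the archive (ASSUMED to be at the top level of data.zip).
    """
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as raw:
            # utf-8-sig tolerates a byte-order mark if one happens to be present.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            # Materialise the rows before the archive is closed.
            return list(csv.DictReader(text))
```

For example, read_csv_from_zip("data.zip", "demographics.csv") would return one dictionary per participant, keyed by the column names described below. All values are returned as strings and need converting before numerical use.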

Full Data Description
=====================

Demographic data
----------------

Demographic data from Table 2 are available in the demographics.csv file. This includes the following data for each participant:

  * ParticipantID: Unique identifier for the participant.
  * Cohort: The diagnosis of the participant: healthy control (HC), bipolar disorder (BD) or borderline personality disorder (BPD).
  * Gender: Gender of the participant - m or f.
  * Age: The age of the participant in whole years on entry to the study.
  * BMI: The BMI of the participant on entry to the study.
  * WeeksOfData: The number of valid labelled weeks of data available for processing from the participant.
  * QIDSMean: The mean value of the QIDS scores in the valid labelled weeks of data available for processing from the participant.
  * QIDSRange: The range of the QIDS scores in the valid labelled weeks of data available for processing from the participant (maximum - minimum QIDS scores).
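
The WeeksOfData, QIDSMean and QIDSRange fields can in principle be recomputed from the weekly QIDS labels. The sketch below shows one way to do so, assuming that every row of data.csv corresponds to exactly one valid labelled week and that rows are dictionaries with ParticipantID and QIDS keys. The function name summarise_qids is illustrative, and small differences from demographics.csv (for example due to rounding) are possible.

```python
from collections import defaultdict
from statistics import mean


def summarise_qids(weekly_rows):
    """Group weekly rows by ParticipantID and summarise their QIDS scores."""
    scores = defaultdict(list)
    for row in weekly_rows:
        scores[row["ParticipantID"]].append(float(row["QIDS"]))
    return {
        pid: {
            # ASSUMPTION: one row per valid labelled week.
            "WeeksOfData": len(values),
            "QIDSMean": mean(values),
            # Range as defined above: maximum - minimum QIDS score.
            "QIDSRange": max(values) - min(values),
        }
        for pid, values in scores.items()
    }
```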

QIDS scores and feature values
------------------------------

Calculated feature values for the valid labelled weeks of data from the participants are available in the data.csv file. This contains the following details for each week of valid data:

  * ParticipantID: Unique identifier for the participant.
  * WeekBeginning: The first day of the week (Monday) for the data from which the features were calculated.
  * QIDS: The QIDS score label for the week.
  * Features: The next 10 columns contain the raw calculated features. Features are notated using the feature abbreviation given in Table 1 of the paper, which also gives details of how each feature is calculated.
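
A row of data.csv can be separated into its identifiers, label and feature values as in the sketch below. It treats every column other than ParticipantID, WeekBeginning and QIDS as a feature, and assumes that all feature cells are numeric and non-empty. WeekBeginning is left as a string because the date format is not specified here. The function name split_week_row is illustrative.

```python
def split_week_row(header, row):
    """Split one data.csv row into (participant, week, qids, features)."""
    record = dict(zip(header, row))
    fixed = ("ParticipantID", "WeekBeginning", "QIDS")
    # ASSUMPTION: every remaining column is one of the calculated features.
    features = {
        name: float(value)
        for name, value in record.items()
        if name not in fixed
    }
    return (
        record["ParticipantID"],
        record["WeekBeginning"],  # kept as text; date format not assumed
        float(record["QIDS"]),
        features,
    )
```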

Regression model results
------------------------

Individual regression model results summarised in Table 3 are available in the remaining .csv files, named results_<model>_predictions.csv. There are six files as follows:

  * results_population_level_model_predictions.csv: Results from the population-level model.
  * results_fully_personalized_cv_model_predictions.csv: Results from the fully personalized model using cross validation over all data points.
  * results_fully_personalized_model_predictions.csv: Results from the fully personalized model trained on calibration data.
  * results_group_personalized_model_predictions.csv: Results from the group-personalized model with optimized clusters.
  * results_group_personalized_model_calibration_predictions.csv: Results from the group-personalized model with clusters allocated using calibration data.
  * results_community_similarity_network_predictions.csv: Results from the Community Similarity Network clustering model based on Lane et al. / Abdullah et al.

In all six files, the following fields identify the data:

  * ParticipantID: Unique identifier for the participant.
  * WeekBeginning: The first day of the week (Monday) for the data for which the results were calculated.

The following additional fields are used in specific data files:

  * results_population_level_model_predictions.csv:
    - TrainTest: Identifies the calibration data (0) used in the training of the model; and the test data (1) used to present the results in Table 3.
    - PopulationLevelModelPrediction: The mean predicted QIDS scores over all 1000 iterations of Gibbs sampling of the Bayesian Lasso model.

  * results_fully_personalized_model_predictions.csv:
    - TrainTest: Identifies the calibration data (0) used in the training of the model; and the test data (1) used to present the results in Table 3.
    - FullyPersonalizedModelPrediction: The mean predicted QIDS scores over all 1000 iterations of Gibbs sampling of the Bayesian Lasso model.

  * results_fully_personalized_cv_model_predictions.csv:
    - TrainTest<n>: Identifies the calibration data (0) used in the training of the model in cross-validation iteration n; and the test data (1) used to evaluate the model.
    - FullyPersonalizedModelPrediction<n>: The mean predicted QIDS scores in cross-validation iteration n, taken over all 1000 iterations of Gibbs sampling of the Bayesian Lasso model. The results in Table 3 show the mean MAE value of the predictions on the test participants in each iteration.

  * results_group_personalized_model_predictions.csv / results_group_personalized_model_calibration_predictions.csv:
    - TrainTest: Identifies the calibration data (0) used in the training of the model; and the test data (1) used to present the results in Table 3.
    - GroupAllocated: The group ID to which the participant was allocated.
    - GroupPersonalizedModelPrediction: The mean predicted QIDS scores over all 1000 iterations of Gibbs sampling of the Bayesian Lasso model.

  * results_community_similarity_network_predictions.csv:
    - CommunitySimilarityNetworkPrediction: The mean predicted QIDS scores using the Community Similarity Network clustering model.
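
The prediction files can be compared against the QIDS labels in data.csv by matching rows on ParticipantID and WeekBeginning. The sketch below computes a mean absolute error (MAE) pooled over all test rows (TrainTest equal to 1). It is an illustration only: the function name pooled_test_mae is hypothetical, and the paper may aggregate errors differently (for example per participant), so the output is not guaranteed to reproduce Table 3.

```python
def pooled_test_mae(data_rows, prediction_rows, prediction_column,
                    split_column="TrainTest"):
    """Pooled MAE of predictions against QIDS labels on test rows."""
    truth = {
        (r["ParticipantID"], r["WeekBeginning"]): float(r["QIDS"])
        for r in data_rows
    }
    errors = []
    for row in prediction_rows:
        # Keep only test rows (1). Files without a split column, such as the
        # Community Similarity Network file, are used in full (ASSUMPTION).
        if split_column in row:
            flag = row[split_column]
            if flag == "" or float(flag) != 1:
                continue
        key = (row["ParticipantID"], row["WeekBeginning"])
        if key in truth and row[prediction_column] != "":
            errors.append(abs(float(row[prediction_column]) - truth[key]))
    if not errors:
        raise ValueError("no matching test rows found")
    return sum(errors) / len(errors)
```

For the cross-validated file, the same function could be called once per iteration by passing the TrainTest<n> and FullyPersonalizedModelPrediction<n> column names and then averaging the results. The exact form of the <n> suffix should be read from the file header rather than assumed.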