The Oxford digital multiple errands test (OxMET): Validation of a simplified computer tablet based multiple errands test

ABSTRACT Impairments in executive functioning are common following acquired brain injury, yet few screening tools offer a time-efficient and ecologically valid approach to assessing the consequences of executive impairments. We present the Oxford Digital Multiple Errands Test (OxMET), a novel and simplified computer-tablet version of the Multiple Errands Test. We recruited 124 neurologically healthy controls and 105 stroke survivors to complete the OxMET task. Normative data and internal consistency were established from the healthy control data. Convergent and divergent validity was assessed in a mixed subset of 158 participants who completed the OxMET and OCS-Plus. Test-retest reliability was examined in a mixed subset of 39 participants. Finally, we investigated the known-group discriminability of the OxMET. The OxMET demonstrated very high internal consistency and stable group-level test-retest performance, as well as good convergent and divergent validity. It showed high sensitivity and good specificity in overall differentiation of stroke survivors from controls. The Oxford Digital Multiple Errands Test is a brief, easy-to-administer tool, designed to quickly screen for potential consequences of executive impairments using a virtual shopping task on a computer tablet. Initial normative data and validation within a chronic stroke cohort are presented.


Introduction
Cognitive impairment in executive function is common after acquired brain injury, including stroke (Jokinen et al., 2015; Merriman et al., 2019; Millis et al., 2001). Executive function refers to higher-order cognitive abilities such as planning, shifting tasks, and inhibiting behaviour in order to adapt to novel situations in everyday life (Gilbert & Burgess, 2008). Impairments in executive functioning have been shown to lead to worse functional outcomes, including impairments in instrumental activities of daily living (e.g., Connor & Maeir, 2011; Goverover & Josman, 2004; Mole & Demeyere, 2020; Pohjasvaara et al., 2002). Josman and colleagues suggested that accurate examination of executive functioning following brain injury can reduce the burden on costly and hard-to-access social and health services, through early signposting and support (Josman et al., 2014).
Executive functioning is a notoriously hard-to-define cognitive phenomenon (Goldstein et al., 2014), with many theories and models attempting to define what constitutes executive function and how it is linked to the frontal lobes (Gilbert & Burgess, 2008; Luria et al., 1966). One established model is the Supervisory Attentional System (SAS) developed by Norman and Shallice (1980). In brief, the theory posits that everyday human behaviour is automatic and efficient, except where novelty or difficulty is encountered and behavioural schemas must be updated (Norman & Shallice, 1980). In that case, a proposed contention-scheduling mechanism chooses a new course of action, and an overarching supervisory control system biases the choice where planning is required (Norman & Shallice, 1980; Van der Linden & Andres, 2001).
Many widely used tests have been developed on the basis of the SAS theory, including the Tower of London (Shallice, 1982) and the Hayling and Brixton tests (Bielak et al., 2006; Burgess & Shallice, 1997). The Tower of London assesses planning and problem-solving ability; the Hayling test assesses prepotent response inhibition, response initiation, and strategy use (Robinson et al., 2015); and the Brixton task assesses updating of responses and abstraction of rules (Van Den Berg et al., 2009; Van der Linden & Andres, 2001). Meta-analyses of neuroimaging studies have found frontal lobe involvement in performance on tasks associated with the SAS (Cieslik et al., 2015), although frontal lobe damage does not always lead to failure on SAS tasks (see Vordenberg et al., 2014). Though executive functions, however defined, are now most often framed as supported by a diffuse network of white and grey matter (e.g., Sasson et al., 2013), this network is thought to be mediated by the frontal lobes (Antoniak et al., 2019; Bettcher et al., 2016).
One important issue with executive function, and the neuropsychological tests designed to assess it, is the frontal lobe paradox (George & Gilbert, 2018). This paradox arises where an individual with frontal brain damage performs well on tests of executive function yet has profound executive impairments in everyday life (Shallice & Burgess, 1991). Shallice and Burgess reported the cases of three patients with neurological damage specific to the frontal cortex, who each had high intelligence quotients and performed well on cognitive tests, including tests of executive function and language (Shallice & Burgess, 1991). To bridge the gap between neuropsychological testing and activities of everyday life (Steverson et al., 2017), Shallice and Burgess developed an ecological task in which the three patients were taken to a shopping centre and given multiple errands to complete. There were rules to follow, as well as set limits on the money to spend and the time to take. The goal was for the participants to complete the errands efficiently before reporting back to the researcher, assessing their problem-solving, planning, and monitoring abilities (Antoniak et al., 2019; Shallice & Burgess, 1991). The patients made different types of errors, in terms of rule breaks, inefficiencies, interpretation failures, and task failures. Further, when compared with nine IQ- and age-matched controls on the task, each patient performed below the 5th centile of control performance, suggesting the test was effective at detecting impairment (Shallice & Burgess, 1991). This test subsequently became known as the Multiple Errands Test (MET), characterized as a naturalistic and ecologically valid assessment of executive function. It has since been suggested that the MET may not directly measure executive function, but rather the effects of executive dysfunction (Antoniak et al., 2019).
Alternative views, however, suggest the MET in fact assesses executive function to a greater extent than traditional abstract executive assessments, on which dysexecutive patients can perform well despite daily-life dysfunction (Shallice & Burgess, 1991). On balance, we suggest the MET can be thought of as a test of executive functioning related to functional outcomes and activities of daily life. As an overall broad-spectrum assessment, it is not a highly controlled domain-specific test, but it has huge potential as a screening test. The purpose of a screening test is to determine with high probability that a problem is present; further assessment is then required to understand the nature of the problem and its constituent impairments (Roebuck-Spencer et al., 2017).
In a clinical context, however, a MET is often not feasible. Most prominently, the need to take patients out into the real world raises practical concerns regarding transport, staff time, and patient safety. Several versions of the MET have since been created to work around some of these issues (see Rotenberg et al., 2020, for a review). Real-world versions have been adapted for patients in hospitals (Dawson et al., 2009; Knight et al., 2002), shopping centres (Alderman et al., 2003), and large stores (Antoniak et al., 2019), and a home-based version also exists (Burns et al., 2018), addressing some of the barriers. However, patients with neurological conditions often have co-occurring motor impairments restricting their ability to complete the task. Virtual reality/computerized versions may provide a solution here, and several have been developed (e.g., Cipresso et al., 2014; Jovanovski et al., 2012; Rand et al., 2005; Raspelli et al., 2012), though these often bring high costs and a need for technical equipment and expertise that are not readily available. In a recent systematic review, 33 articles reporting a version of the MET were identified and their psychometric properties assessed. The MET was commonly scored by accuracy of task completion, task omissions and partial omissions, as well as rule breaks (Rotenberg et al., 2020), with partial omissions being most sensitive to impairment (Dawson et al., 2009). Furthermore, this review found that many versions of the MET showed good internal consistency, good inter-rater reliability, sufficient test-retest reliability, good to adequate convergent validity, and good ability to differentiate clinical groups.
With regard to convergent validity, the MET has been found to converge with a variety of standardized neuropsychological measures of executive functioning. For example, performance on the MET has been found to correlate with Trail Making Test performance (Trail A time, Trail B, and the Trail B/A ratio; Alderman et al., 2003; Jovanovski et al., 2012; La Paglia et al., 2014) and with inhibition impairment (Burgess et al., 1998). A 2014 systematic review summarized nine papers that included the MET, validated measures of executive function, and participants with acquired brain injury (Quinn, 2014). The review showed that each paper used a diverse range of neuropsychological tests, commonly including subsets of the Behavioural Assessment of Dysexecutive Syndrome (BADS; Wilson et al., 1996), digit span tests, fluency tests, figure drawing tests, story recall tasks, attention tests, and activities of daily living assessments. Convergent associations were found between performance on the MET and the following: the Modified Six Elements Test (Jovanovski et al., 2012) and Zoo Map test (Rand et al., 2009) of the BADS (although note that no convergence was found in Erez et al., 2013, or Okahashi et al., 2013), the Rivermead Behavioural Memory Test (Wilson et al., 1999), the Comprehensive Assessment of Prospective Memory (Waugh, 1999), and the Instrumental Activities of Daily Living Scale (Lawton & Brody, 1969).
Evidence for divergent validity would come from comparing the MET with non-executive measures (Rotenberg et al., 2020), such as memory and intelligence (Hanberg et al., 2018). Explicitly testing for divergent validity has, however, not been common practice in studies of the Multiple Errands Test. The systematic review by Rotenberg et al. (2020) highlighted that no examinations of divergent validity were conducted in the included studies.
Though, broadly speaking, convergence with executive tasks and divergence from non-executive tasks characterizes validation in most cases, the heterogeneity of versions of the Multiple Errands Test and the wide variation in scoring methods make direct comparison of psychometric properties difficult (Rotenberg et al., 2020).
So far, previous versions of the MET have not been framed as a feasible, short, ecologically valid screening tool for the executive aspects of activities of daily life, inclusive of the clinical reality after acquired brain injury, which includes individuals with pre-existing dementia as well as mobility and upper-limb impairments. Specifically, in acute brain injury and in-patient neurorehabilitation settings, cognitive screening must be time-efficient and easy to administer to prevent failure to assess cognition appropriately (Demeyere et al., 2015). Up to now, virtual versions of the MET have used joysticks (see, for example, Titov & Knight, 2005), virtual reality wands (Kourtesis et al., 2020a), and desktop keyboard set-ups (see, for example, Law et al., 2006), but they are often too long to complete and require complex set-ups. These can be expensive, even if the cost of virtual reality technology has decreased in recent years (Kourtesis et al., 2020b).
Outside of the Multiple Errands Test, other short, ecologically valid tests of executive function have been developed, including the 15-min Hotel task (Manly et al., 2002), which is similar to a six elements task (Shallice & Burgess, 1991). This example is a table-top task, which requires a set-up of complex materials and a skilled examiner for administration and scoring. The computer tablet is an alternative format, not yet tried with the MET, that can address these issues. A computer-tablet app version of the MET could provide guided administration and automatic scoring, remove the complexity of material set-up, and shorten the time needed to test, fitting with the increasing drive to improve cost-effectiveness with computer-tablet technology in healthcare settings (Bauer et al., 2012; Koski et al., 2011; Pew Research Centre, 2019).
Older adults have become comfortable performing tasks on computer tablets, owing to wider adoption of the format (Anderson & Perrin, 2017). A computer-tablet-based version of the MET would shorten testing time, making it more appropriate in time-pressed clinical settings, and allow assessment of otherwise difficult-to-assess patients (e.g., those with mobility impairments and upper-limb weakness).
Here, we present a new version of the Multiple Errands Test, the Oxford Digital Multiple Errands Test (OxMET), which is performed on a simple computer-tablet app interface with a stylus pen and takes less than 10 min to conduct, with the majority of controls completing the task within 3 min. The test aims to improve the usability and feasibility of examining impairments in healthcare settings through its easy administration. Ultimately, we developed the OxMET to serve as a brief screening tool for consequences of executive dysfunction that may impact activities of everyday life. We established normative performance scores and clinical cut-offs, and examined the internal reliability, test-retest reliability, convergent and divergent validity, known-group discriminability, and sensitivity to impairment of the OxMET in a mixed healthy control and unselected stroke cohort.

Methods
We established the normative data for our test using an English-speaking, neurologically healthy cohort and assessed the validity of the OxMET outcome measures in an unselected stroke survivor cohort between 2014 and 2019.
We examined the measurement properties of the Oxford Digital Multiple Errands Test (OxMET) and established internal consistency, test-retest reliability, and initial convergent and divergent validity against executive function measures in the OCS-Plus. Approval for the study was gained from the Medical Sciences InterDivisional Research Ethics Committee (R51993/RE001) and the National Research Ethics Committee South Central - Oxford C Research Ethics Committee (REC reference: 18/SC/0044, IRAS project ID: 241571). Data are stored on the Open Science Framework (doi 10.17605/OSF.IO/8SUT); due to copyright, the tests used are not open access.

Participants
A convenience sample of 124 healthy controls with no self-reported neurological history and 105 chronic stroke survivors were recruited from established participant databases from the Translational Neuropsychology Research group at the University of Oxford. All 124 controls completed the OxMET to establish normative data. All stroke survivors had a confirmed diagnosis of stroke or TIA and completed the OxMET. Lesion information was taken from clinical notes and confirmed by visual inspection of clinical brain scans. No selection criteria regarding behaviour or lesion location or size were used. The only two exclusion criteria were an inability to stay alert for the duration of testing and incapacity to provide informed consent.
A mixed subset of 158 participants, consisting of 78 controls and 80 stroke survivors, completed the validation tests alongside the OxMET. Finally, 39 participants (11 controls and 28 stroke survivors) were retested on the OxMET to provide test-retest reliability. The initial phase of the project gathered OxMET data only for norming and feasibility, with convergent and divergent validation starting in a later phase. Two measures were added later in data collection as part of a further validation project, and as such only 76-79 participants completed the Zoo Map test from the Behavioural Assessment of Dysexecutive Syndrome (BADS; Wilson et al., 1996) and the Pill Box task (Zartman et al., 2013). The participants who were retested were selected by opportunity sampling, when they took part in other studies for the lab and additional time was available (Table 1).

Materials
The Oxford digital multiple errands test (OxMET)
The OxMET computer-tablet shopping task requires participants to buy six items and to answer two questions. Participants are allowed to complete the errands in any order. Following an explanation of the tablet use and practice using the pen, the participants are given standardized instructions: "On the following screen you will see a street with shops on it. Your task is to buy six items and to answer two questions. You can enter any shop by tapping on its picture. Once inside a shop you can tap on any item to buy it; the price tag will turn green once selected. Once you know the answer to a question you can tap on the question to answer it. There are some rules to follow. You must take as little time as possible and spend as little money as possible. You can only enter a shop in order to buy an item or to answer a question. You must avoid entering a shop more than once. The errands can be done in any order."
Next, the screen in Figure 1 appears. To reduce memory demands, the items on the shopping list can be struck through to keep track, and the instructions remain on screen at all times. No further elaboration on the instructions is given during task completion, in order to standardize administration; where participants ask questions regarding the task, the researcher simply points them to the instructions, which stay on the screen at all times. Note that where there are technical questions from participants, such as how to exit a shop, the examiner can provide help. The design and "watercolour on paper" look and feel of the task were developed following feedback from stroke survivors, who expressed a preference for an adult drawing style over more digitally created elements, which were felt to come across as more suitable for children. Participants tap on the image of a shop front with a stylus pen to enter a shop. Inside a shop, a static image is presented, with six options of items to buy shown on the right of the screen with price tags (see Figure 2 for an example).
In every shop there are two options of the same type of item at different prices. For instance, where the participant is required to buy apples, there are red and green apples among the distractor items, and one type is more expensive than the other. Participants can tap on the items they desire, whose price tags turn green when selected, and then pay to leave the store. Alternatively, the participant can buy more than one item, or deselect an item before paying and leaving. There is no requirement to buy an item in order to leave the store. There is no indication of the total money spent at any time; participants are implicitly expected to buy only the cheaper option of each item on their shopping list. Ten shops are presented; the "fish and chips" shop and the "books" shop are distractors and are not to be entered during the task. A full run-through video of the task is available on the Open Science Framework (doi 10.17605/OSF.IO/8SUT).
The OxMET was run through an application created in MATLAB 2014b on a Microsoft Surface Pro computer tablet (Windows 10 Pro, version 1511) in landscape orientation. The application can run on any Windows computer tablet with a touch screen; specific tablet requirements can be found in the supplementary materials (doi 10.17605/OSF.IO/8SUT).

Task scoring
The scoring of the task is completed automatically: no assessor input is required to either save or score the main outcome data. The main outcome measure of the task is accuracy, which ranges from −10 to +10 based on a score obtained in each shop. For each of the target buying shops: if the correct shop was entered only once, and once inside only the correct item was bought, this scored an accuracy point (+1). If the correct shop was not entered, or was entered more than once, or was entered and a distractor item was bought or no item was bought at all, then the participant scored −1. For the two task-related shops, an accuracy point was scored if the shop was entered only once and was left without buying an item, and the task question was correctly answered. For the distractor shops, a point was scored if the shop was not entered.
In sum, if a single error is made in any shop, or a distractor shop is entered, or a question is not answered, a point is deducted for the shop related to that task. For instance, the first question asks for the florist's name; if the florist's shop is not entered, the question cannot be scored as correct even if attempted, and so the participant is deducted a point for not entering that shop.
In addition to this overall score, the application also stores the breakdown of each type of error and correct move per shop, as well as time stamps for the full duration of the task and for time spent in each shop. We calculated additional error scores for comparison with other work using the MET (see Table 2), as well as a total error score, which was the sum of the frequencies of rule breaks, omissions, and commissions. It is possible, however, to generate further scores if desired by individual assessors. No requirements were made regarding the order in which the shops needed to be visited.
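The per-shop scoring rules described above can be sketched as follows. This is an illustrative reconstruction only: the shop roles, field names, and event-log format are assumptions for the sketch, not the actual OxMET application code.

```python
# Hypothetical sketch of the OxMET accuracy scoring rules described in the
# text; shop/field names and the visit-log format are illustrative assumptions.

def score_shop(shop, visits):
    """Return +1 or -1 for one shop given its list of visit records."""
    if shop["role"] == "distractor":
        # Distractor shops score a point only if never entered.
        return 1 if len(visits) == 0 else -1
    if len(visits) != 1:
        # Not entered at all, or entered more than once: lose a point.
        return -1
    visit = visits[0]
    if shop["role"] == "buy":
        # Exactly the one correct (cheaper) target item must be bought.
        ok = visit["items_bought"] == [shop["target_item"]]
    else:  # "question" shop
        # Must leave without buying and answer the question correctly.
        ok = visit["items_bought"] == [] and visit["answer_correct"]
    return 1 if ok else -1

def oxmet_accuracy(shops, log):
    # Summing over the ten shops yields the -10..+10 accuracy score.
    return sum(score_shop(s, log.get(s["name"], [])) for s in shops)
```

A perfect run over six buy shops, two question shops, and two untouched distractor shops yields the maximum score of 10.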

Validation tests
Participants in the validation group completed brief measures from the Oxford Cognitive Screen - Plus (OCS-Plus; see Demeyere et al., 2020), which examined both executive and non-executive abilities. In addition, a more complex measure of planning and executing a set of tasks was taken from the Zoo Map test of the Behavioural Assessment of Dysexecutive Syndrome (BADS; Wilson et al., 1996), and an ecological measure of executive functioning from the Pill Box task (Zartman et al., 2013). To establish convergent validity, we compared the OxMET with the executive tasks from the OCS-Plus and with the Zoo Map and Pill Box tasks. Brief measures of the language and memory domains from the OCS-Plus were used to establish divergent validity. Note that we considered associations between partial omissions and perseverations and the word memory task of the OCS-Plus to be convergent, because memory deficits would play a part both in forgetting to fully complete tasks and, potentially, in repeatedly completing an action. An overview of the sub-tasks included in the validation is given in Table 3. The validation tasks were completed in a session lasting a maximum of one hour.

Procedure
Participants completed the OxMET on a Microsoft Surface Pro computer tablet (Windows 10 Pro, version 1511) in landscape orientation. The OCS-Plus was run in portrait orientation in a separate application on the same device. All tests, including the subtest from the BADS and the Pill Box task, were conducted by a trained research assistant. Both the OxMET and OCS-Plus were completed within a one-hour session at the Department of Experimental Psychology, Oxford, or at the participant's home if they were a stroke survivor. Participants who completed the validation tests completed an additional one-hour session. The researcher sat next to or opposite the participant when verbally explaining the instructions and during task completion, similar to the video demonstration.

Analysis
Normative data and impairment thresholds for each OxMET measure were calculated as the 5th and 95th centiles of performance in the healthy control sample. We assessed age and education effects on OxMET measures in the control group to determine whether the normative data should be stratified. Next, we assessed the psychometric properties of the OxMET. Using the normative and stroke survivor data combined for greater power, we established the internal reliability of the OxMET using a split-half method and Cronbach's alpha. Further, we assessed test-retest reliability at the group and individual level using the mixed stroke and control subset. Finally, we examined the associations of sub-measures from the OxMET with the validation tasks (see Table 3) to establish convergent and divergent validity.
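The centile-based impairment thresholds can be illustrated with a minimal sketch; the data below are simulated for demonstration, not the study's own, and the direction of each cut-off follows the convention stated above (5th centile for accuracy, 95th for error and time measures).

```python
import numpy as np

# Illustrative computation of centile-based impairment thresholds, on
# simulated control scores (NOT the study data).
rng = np.random.default_rng(0)
accuracy = rng.integers(6, 11, size=124)      # higher = better
total_errors = rng.integers(0, 4, size=124)   # higher = worse

# 5th centile for accuracy (impaired if below it) and 95th centile for
# error/time measures (impaired if above it).
accuracy_cutoff = np.percentile(accuracy, 5)
error_cutoff = np.percentile(total_errors, 95)

def impaired(acc, errs):
    """Flag a participant as outside the normative range."""
    return acc < accuracy_cutoff or errs > error_cutoff
```

In practice such cut-offs would be computed per age band, as described in the Normative data section.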
Finally, we assessed the preliminary ability of performance on the app to differentiate healthy controls from stroke survivors, through group comparisons and ROC analysis.
An a priori power analysis was not conducted for the inferential tests; instead, we established a smallest effect size of interest for our correlations (r = .31, alpha adjustment described later, 80% power) and calculated power for the one-sided Wilcoxon signed rank tests following Shieh et al. (2006), revealing a power of one at our Bonferroni-corrected alpha levels per analysis. All analyses were computed in R (version 3.5.1; 2018-07-02; R Core Team, 2018); the data and analysis scripts used to generate this manuscript are openly available (doi 10.17605/OSF.IO/8SUT). We used the following packages for analysis and data visualization: readxl (Wickham & Bryan, 2019), pROC (Robin et al., 2011), rcompanion (Mangiafico, 2019), and sjstats (Lüdecke, 2018).

Normative data
Both age and education (with no imputation for missing data) significantly correlated with most OxMET measures (Bonferroni-corrected alpha level for 16 comparisons, p = .003), except omissions, frequency of rule breaks, partial omissions, and perseverations. Correlations can be found in Supplemental Table 1. Using the unbiased "split" function in base R (R Core Team, 2018), we formed three age groups of approximately equal size: 21-63.49 (n = 45), 63.50-71.70 (n = 38), and 71.70-91 (n = 41). This method is unbiased in so far as the age-group divides are not based on any performance data. The age groups differed statistically on several OxMET measures, which justified separate impairment cut-offs: Time (H(2) = 14, p = .001), Accuracy (H(2) = 13.03, p = .002), and Commission errors (H(2) = 12.59, p = .002). Given these differences in performance between age groups, we split the normative data into the age categories. Despite an overall correlation, we found no significant differences when splitting the groups by a variety of education levels, and therefore we do not present separate education-level cut-offs.
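The performance-blind age banding can be reproduced, in essence, by sorting ages and cutting the sorted vector into three roughly equal parts. This sketch mirrors the idea of the base R approach described above, using simulated ages rather than the study sample:

```python
import numpy as np

# Sketch of performance-blind age banding: sort ages and cut into three
# roughly equal-sized groups, without reference to any performance data.
# Ages are simulated, NOT the study sample.
rng = np.random.default_rng(1)
ages = rng.integers(21, 92, size=124)

bands = np.array_split(np.sort(ages), 3)      # ~equal group sizes
boundaries = [(int(b.min()), int(b.max())) for b in bands]
```

Because the split depends only on the age distribution, the resulting boundaries are unbiased with respect to OxMET scores.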
Normative data for the OxMET are presented in Table 4, based on the 124 neurologically healthy participants. On average, controls were at ceiling in accuracy and made no errors, except in the over-71 group, who on average made two non-specific errors. The main age differences were apparent in the time taken to complete the task, with the older age groups taking longer.

Reliability
Internal consistency
A split-half reliability estimate with 5,000 bootstrapped random samples was conducted on OxMET accuracy. Cronbach's alpha revealed an average internal consistency statistic of α = .79 (SD = .13). We further performed internal consistency analyses on raw scores for all OxMET measures (with accuracy reverse-coded to be consistent with error scoring) and found a standardized Cronbach's alpha of α = .87 (average item correlation r = .45, Guttman's lambda-6 = 1). This demonstrates very high internal consistency of the OxMET.
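The two reliability statistics above can be sketched as follows: a bootstrapped split-half estimate with the Spearman-Brown step-up correction, and Cronbach's alpha from item variances. The item scores are simulated, and this is a generic sketch of the techniques, not the study's exact pipeline.

```python
import numpy as np

# Simulated binary item scores (229 participants x 10 shop items) sharing
# a common latent factor; NOT the study data.
rng = np.random.default_rng(2)
latent = rng.normal(size=(229, 1))
items = (latent + rng.normal(size=(229, 10)) > 0).astype(float)

def cronbach_alpha(x):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total variance)."""
    k = x.shape[1]
    return k / (k - 1) * (1 - x.var(axis=0, ddof=1).sum()
                          / x.sum(axis=1).var(ddof=1))

def split_half(x, n_boot=5000):
    """Mean Spearman-Brown-corrected correlation over random half-splits."""
    k = x.shape[1]
    rs = []
    for _ in range(n_boot):
        cols = rng.permutation(k)
        a = x[:, cols[: k // 2]].sum(axis=1)
        b = x[:, cols[k // 2:]].sum(axis=1)
        r = np.corrcoef(a, b)[0, 1]
        rs.append(2 * r / (1 + r))  # Spearman-Brown step-up
    return float(np.mean(rs))
```

The Spearman-Brown correction compensates for each half containing only half the items of the full scale.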

Test-retest reliability
Test-retest reliability was assessed at the individual and group level. See Table 5 for Wilcoxon signed rank tests comparing test and retest for differences at the group level; interpretations of p-values were Bonferroni corrected (significant if below p = .00625). Performance across time for each of the OxMET measures is graphically presented in Supplemental Figure 1, and all intraclass correlation coefficients are presented in Supplemental Table 2.

Validity
The results of the convergent and divergent validity analyses are given in Table 6; interpretations were corrected for multiple comparisons using the meff function in R (Derringer, 2018). This alpha-correction method is specifically designed for non-independent test statistics such as correlations (Derringer, 2018), and we therefore used it for our large correlational analyses, where Bonferroni corrections may be inappropriate. The correction takes into account the effective number of outcomes from both the OxMET (Meff1 = 6.05519) and the validation measures (Meff2 = 22.9746729). The corrected alpha level for interpretation is αMeff = .05 / (Meff1 × Meff2) = .00036. Following the definition of convergent validity used by Rotenberg et al. (2020), we interpret convergence where correlations are significant and above .30. With regard to the main outcome measure, accuracy on the OxMET correlated convergently with Trail B accuracy and the executive score from the Oxford Cognitive Screen - Plus (OCS-Plus; Demeyere et al., 2020) battery, as well as with word encoding, both attentional tests from the OCS-Plus, and the Zoo Map raw score from the Behavioural Assessment of Dysexecutive Syndrome (BADS; Wilson et al., 1996), but not the Rule Finding test from the OCS-Plus. Accuracy was not associated with comprehension, orientation, delayed memory, or other non-executive tasks, demonstrating divergent validity.
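The effective-number-of-tests idea can be sketched from the eigenvalues of the outcome correlation matrix. The sketch below uses the Nyholt (2004) eigenvalue-variance formula, a common variant that may differ in detail from the meff R function cited above:

```python
import numpy as np

def m_eff(corr):
    """Effective number of independent tests from a correlation matrix,
    via the Nyholt (2004) eigenvalue-variance formula. This is one common
    variant and may differ in detail from the R meff function in the text."""
    lam = np.linalg.eigvalsh(corr)
    m = corr.shape[0]
    return 1 + (m - 1) * (1 - lam.var(ddof=1) / m)

def corrected_alpha(corr1, corr2, alpha=0.05):
    """Combined corrected alpha, as in the text: alpha / (Meff1 * Meff2)."""
    return alpha / (m_eff(corr1) * m_eff(corr2))
```

Uncorrelated outcomes give Meff equal to the number of tests (recovering Bonferroni), while perfectly correlated outcomes give Meff of one (no correction needed).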
With regard to the different measures from the OxMET, the time taken to complete the OxMET was relatively indiscriminate in its relations, relating to many of the OCS-Plus accuracy measures; crucially, however, Trail B time and Rule Finding time were the only two time measures that OxMET time related to, which discriminates it from the Trail A baseline and other non-executive tasks. Partial omissions were related to measures of working memory, namely immediate word encoding and invisible cancellation accuracy. Commissions related to the encoding and delayed recall tasks, as well as Trail B accuracy, the executive score, and the cancellation tasks, suggesting a relationship similar to that of accuracy. The total error score behaved similarly to accuracy and commissions, as these are interdependent measures. Full task omissions, frequency of rule breaks, and perseveration scores did not correlate with any OCS-Plus measure, possibly due to a lack of variance and the small numbers who made these errors.

Group comparisons
Wilcoxon rank sum tests with Bonferroni corrections for multiple comparisons (p = .00625) and continuity correction were carried out between controls and stroke survivors on each of the OxMET outcome measures (see the variance checks in the analysis code justifying this choice of statistic). The groups differed on all OxMET measures except omissions (see Figure 3 and Table 7). Note that we also ran ANCOVAs with age and education as covariates to examine their influence on the group comparisons; these remained statistically significant in the same direction even when controlling for the covariates.
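The group comparison can be sketched with a rank-sum test using the normal approximation and continuity correction mentioned above. The scores below are simulated, and this hand-rolled version (ties ignored) is a didactic sketch rather than the study's R implementation:

```python
import numpy as np
from math import erf, sqrt

def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney) p-value via the normal
    approximation with continuity correction. Ties are ignored for
    simplicity in this sketch."""
    n1, n2 = len(x), len(y)
    # Ranks of the pooled sample (1-based; assumes no ties).
    ranks = np.argsort(np.argsort(np.concatenate([x, y]))) + 1
    u = ranks[:n1].sum() - n1 * (n1 + 1) / 2      # Mann-Whitney U for x
    mu = n1 * n2 / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (abs(u - mu) - 0.5) / sd                  # 0.5 = continuity correction
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2)))) # two-sided p from normal CDF
```

With eight OxMET outcome measures, a Bonferroni threshold of .05 / 8 = .00625 applies, matching the text.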

Sensitivity and specificity
We computed a sensitivity analysis of the main OxMET metric, accuracy, for differentiating healthy controls from stroke survivors. The OxMET control 5th centile cut-off for accuracy showed good sensitivity of 74.29% and a specificity of 64.52%. A ROC curve analysis gave an area under the curve of 71.94% (see Figure 4). Dawson et al. (2009) suggested that partial omissions may be the best measure to differentiate participant groups, and we therefore also computed a ROC analysis on partial omissions, finding a sensitivity of 52.38%, a specificity of 77.87%, and an area under the curve of 66.54%. We compared the two ROC results using the roc.test function (bootstrap test for two correlated ROC curves) in the pROC package and found that the measures did not differ statistically in differentiating the groups (D = −2.79, p = 1). ROC curves for all other OxMET metrics can be found in Supplementary Figure 2. We further computed ROC analyses for the validation measures' ability to distinguish stroke survivors from controls and found that only two measures had a greater area under the curve (AUC): the OCS-Plus Encoding 1 and Rule Finding tasks. For sensitivity, the only validation measure with a greater value was the OCS-Plus delayed recall. In comparison with the Zoo Map (AUC = 66.19%) and Pill Box Test (AUC = 70.41%), the OxMET struck a better balance between high sensitivity and specificity; the Pill Box Test, for example, demonstrated very high specificity (97.4%) but low sensitivity (29.74%).
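How sensitivity, specificity, and AUC relate to a screening cut-off can be shown with a minimal sketch. The scores are invented for illustration, lower accuracy is taken to indicate impairment, and the rank-based AUC below is the standard Mann-Whitney formulation rather than the pROC implementation used in the study:

```python
import numpy as np

def sens_spec(controls, patients, cutoff):
    """Sensitivity/specificity of 'accuracy below cutoff = impaired'."""
    sensitivity = np.mean(patients < cutoff)   # patients correctly flagged
    specificity = np.mean(controls >= cutoff)  # controls correctly passed
    return sensitivity, specificity

def auc(controls, patients):
    """P(control score > patient score), counting ties as half; this
    rank-based quantity equals the area under the ROC curve."""
    c = np.asarray(controls)[:, None]
    p = np.asarray(patients)[None, :]
    return float((c > p).mean() + 0.5 * (c == p).mean())
```

An AUC of 0.5 means the measure is at chance; values toward 1 indicate better group separation, as with the OxMET accuracy AUC of about 0.72 reported above.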

Discussion
We presented a standardized new test, the Oxford Digital Multiple Errands Test (OxMET): a computer tablet app version of the Multiple Errands Test. Following the description of the test, psychometric data were provided. We established the effects of age and education on the main outcome metric of overall accuracy as well as on a range of specific error and time scores. Age-based normative cut-offs for performance were derived from a neurologically healthy control cohort of 124 older adults.
The neurologically healthy older adults performed at ceiling on the OxMET metrics, with the exception of the oldest age group (older than 71), and no significant differences were found between education groups. The main difference between age groups was the increase in total error score in the oldest group, which made two non-specific errors on average, compared to zero in the younger (<63) and middle (63–71) age groups. These relative ceiling effects make the OxMET a potentially strong screening measure: near-errorless performance is expected, so any difficulty with this straightforward shopping task is likely to indicate an impairment in the executive functions tapped by the test. Further ecological validity data are required to demonstrate whether such difficulties also flag a significant impact on activities of daily life.
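Deriving age-banded normative cut-offs of the kind described above amounts to taking the 5th centile of healthy-control scores within each age band. A minimal sketch on simulated data, with band boundaries taken from the text (<63, 63–71, >71):

```python
# Hypothetical sketch of age-banded normative cut-off derivation.
# Ages and accuracy scores are simulated, not the study's control data.
import numpy as np

rng = np.random.default_rng(3)
age = rng.integers(50, 90, size=124)
accuracy = rng.integers(6, 9, size=124)  # simulated near-ceiling scores

bands = {"<63": age < 63,
         "63-71": (age >= 63) & (age <= 71),
         ">71": age > 71}
cutoffs = {name: np.percentile(accuracy[mask], 5)
           for name, mask in bands.items()}
for name, c in cutoffs.items():
    print(f"{name}: 5th-centile cut-off = {c:.1f}")
```

A patient scoring below the cut-off for their own age band would then be flagged as potentially impaired.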
Next, we assessed the reliability of each outcome metric, both internally and across time, and found high internal consistency (trial level α = .79; metric level α = .87). Furthermore, performance on the OxMET across all metrics demonstrated good test–retest stability at the group level, even with a wide retest interval (21.31 months on average), in a combined test–retest cohort of neurologically healthy adults and stroke survivors.
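The internal-consistency coefficients above are Cronbach's alpha, which can be computed directly from an items-by-participants score matrix. A minimal sketch on simulated data (the trial- and metric-level structure of the real OxMET data may differ):

```python
# Hedged sketch of Cronbach's alpha on a simulated score matrix.
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = participants, columns = items/trials."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total score
    return (k / (k - 1)) * (1 - item_vars / total_var)

rng = np.random.default_rng(2)
ability = rng.normal(size=(100, 1))                       # latent ability
scores = ability + rng.normal(scale=0.8, size=(100, 8))   # 8 related items
print(f"alpha = {cronbach_alpha(scores):.2f}")
```

Because the simulated items share a common latent factor, the resulting alpha is high, mirroring the pattern reported for the OxMET metrics.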
In addition, we compared performance on the OxMET to a neuropsychological screening tool and two tests of executive function to establish convergent and divergent validity. Time for completion was indiscriminately related to many OCS-Plus measures and the Zoo Map test, reflecting the often-found non-specific performance difference in processing speed and motor slowness in stroke survivors that affects all tasks (e.g., Su et al., 2015). In contrast, the main outcome metric of overall accuracy was more selective in its correlations and revealed good convergent and divergent associations relative to the coefficient of .30 suggested by Rotenberg et al. (2020). This suggests the accuracy measure is a more specific executive and attentional measure, discriminable from lower-level basic comprehension and understanding tasks.
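The convergent/divergent check described above reduces to correlating the OxMET accuracy score with each comparator measure and judging the coefficients against the .30 benchmark. A sketch on simulated data; the variable names are illustrative only and do not correspond to the actual comparator tasks:

```python
# Hedged sketch of the convergent/divergent validity check using
# Spearman correlations against the .30 coefficient benchmark
# (Rotenberg et al., 2020). All data are simulated.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(4)
n = 158  # size of the validation subset reported in the paper
exec_ability = rng.normal(size=n)  # shared latent factor
oxmet_accuracy = exec_ability + rng.normal(scale=0.7, size=n)
comparator_exec = exec_ability + rng.normal(scale=0.7, size=n)  # convergent
comparator_basic = rng.normal(size=n)                           # divergent

results = {}
for name, x in [("executive comparator", comparator_exec),
                ("basic comprehension", comparator_basic)]:
    rho, _ = spearmanr(oxmet_accuracy, x)
    results[name] = rho
    tag = "convergent (> .30)" if abs(rho) > .30 else "divergent"
    print(f"{name}: rho = {rho:.2f}, {tag}")
```

In the real analysis this loop would run over each OCS-Plus and executive comparator measure in turn.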
The error metrics from the OxMET, which overlap with accuracy but are scored differently, revealed similar relationships to other variables. The perseveration metric related distinctly and convergently to delayed memory recall, and the commissions metric related to false positives on an attentional task, suggesting convergence between measures tapping inhibition and longer-term memory. The frequency with which people made errors did not relate to any comparator measure, suggesting either that this metric is not well understood or that it adds little to the interpretation of the OxMET over and above the specific error types. Our results fit the mixed picture in the systematic review by Rotenberg et al. (2020), which found inconsistent validity of different outcome measures from multiple errands tasks across 33 studies.
Finally, we compared the healthy control and stroke cohorts on all performance metrics and found statistically significant group differences (after correction for multiple comparisons) on all OxMET outcome measures bar omissions, which did not survive the correction. When further exploring the sensitivity of the OxMET scores through data visualization, we found that most of the stroke survivor cohort clustered near the bottom of most error metrics, with a substantial subset of patients making many more errors. The ROC analyses showed moderate to good sensitivity of the task in differentiating this heterogeneous stroke survivor sample from healthy controls. Though the aim of our test is to screen for executive impairment and not for the presence of a stroke, such pathological group differentiation has been shown in other versions of the MET (Rotenberg et al., 2020, with 12 of the 14 studies examining discriminability showing significant differences). The present results align this digital OxMET version with the known-group literature on multiple errands tasks.

Study limitations and future research
This paper has presented only a subset of the theoretically motivated performance metrics that can be derived from the app. For instance, the app stores time-stamped information as well as audio recordings, so other strategies and metrics for completion of the task could be derived and evaluated. We provide all data from this project openly on the Open Science Framework (doi 10.17605/OSF.IO/8SUT5) and would welcome further exploration by other researchers.
We did not establish the ecological validity of the OxMET in this investigation, and this is a key next step: further validation is required to firmly establish the link between performance in this virtual environment and real-world activities of daily life. Linking the task to wider functional outcomes will also be needed to establish the informational and clinical value of the OxMET.
Finally, an important issue affecting most Multiple Errands Tests is that they are not easily translatable to different cultures or countries; obvious examples are hospital-specific adaptations (Knight et al., 2002) and ethnocentric requirements such as that in Alderman et al. (2003), where participants must answer "What is the headline from either today's 'Daily Mail', 'Daily Mirror' or 'The Sun' newspaper?" (p. 44). The OxMET is still biased towards Western culture, and the design, developed with input from UK stroke survivors, intentionally depicts a familiar shopping scene. However, the app has built-in support for translation to other languages, with instructions pulled flexibly from text files. The shop and shopkeeper images could similarly be replaced within the code, and cultural adaptations and translations would be encouraged, with new norming and acceptability testing required for these versions. There has been a push towards a standardized and adaptable scoring system that can be used across many settings and cultures (see Antoniak et al., 2019 and Burns et al., 2018 for examples), and although the initial, normed design is UK-centric, there is definite scope for adaptation. This approach fits strongly with similar approaches taken by our research group on translations of the Oxford Cognitive Screen (e.g., Robotham et al., 2020; Shendyapina et al., 2019) and OCS-Plus.

Next steps and clinical use
With normative data and an initial psychometric investigation of the OxMET now completed, the next steps are to examine the clinical applicability of the app and to extend validation to specific clinical groups and subgroups (e.g., groups with damage to frontal-executive networks, as in Shallice & Burgess, 1991), as well as to further establish ecological and predictive validity for instrumental activities of daily living. The present investigation sets the foundation for future clinical studies to provide the necessary evidence base for clinical adoption.
Once established, the OxMET could fill a clinical need as a brief screen for impairments in executive function that may affect everyday life and that should then be assessed further with functional, observational assessments of everyday tasks. The brief and inclusive nature of the digital screen means that all patients can complete it, detecting potentially hidden impacts of executive impairments on everyday tasks (e.g., the frontal paradox; George & Gilbert, 2018). If shown to be ecologically valid for real-life activities, this screen has the potential to inform decisions around rehabilitation pathways as well as discharge and care-package decisions, with an especially important contribution when considering home discharge with independent living. This will, however, require further research on predictive validity and clinical utility.

Conclusions
The current study presented a novel, tablet-based Multiple Errands Test with normative data from a large healthy ageing cohort and initial reliability and validation data in chronic stroke survivors. We aim for this assessment to provide a quick screen for the daily life consequences of executive impairment. Future research should establish further clinical subgroup validation, links to broader functional outcomes, and the feasibility of the OxMET assessment in clinical settings.