Navigation of a virtual exercise environment with Microsoft Kinect by people post-stroke or with cerebral palsy

ABSTRACT One approach to encourage and facilitate exercise is through interaction with virtual environments. The present study assessed the utility of Microsoft Kinect as an interface for choosing between multiple routes within a virtual environment through body gestures and voice commands. The approach was successfully tested on 12 individuals post-stroke and 15 individuals with cerebral palsy (CP). Participants rated their perception of difficulty in completing each gesture using a 5-point Likert scale questionnaire. The “most viable” gestures were defined as those with average success rates of 90% or higher and perceptions of difficulty ranging between easy and very easy. For those with CP, hand raises, hand extensions, and head nods were found most viable. For those post-stroke, the most viable gestures were torso twists and head nods, as well as hand raises and hand extensions using the less impaired hand. Voice commands containing two syllables were viable (>85% successful) for those post-stroke; however, participants with CP were unable to complete any voice commands with a high success rate. This study demonstrated that Kinect may be useful for persons with mobility impairments to interface with virtual exercise environments, but the effectiveness of the various gestures depends upon the disability of the user.


Introduction
The World Health Organization (WHO) estimates that 15% of the world's population lives with a moderate or severe disability (WHO, 2004). These populations include people with cerebral palsy (CP) and those post-stroke, many of whom require use of a wheelchair or other mobility device. Coincident with such mobility impairments is a tendency toward decreased physical activity and the onset of secondary conditions including pain, fatigue, depression, and anxiety, as well as obesity, diabetes, and cardiovascular disease (Armour, Courtney-Long, Campbell, & Wethington, 2013; Centers for Disease Control and Prevention, 2014a, 2014b; Finch, Owen, & Price, 2001; Walsh, 2011). Exercise, however, has been shown to decrease the prevalence and magnitude of secondary conditions in people with disabilities.
Despite the benefits of exercise, only 25% of people with disabilities attain even a moderate amount of physical activity each week, compared to 43% of adults without disabilities (Boslaugh & Andresen, 2006). Known barriers to exercise faced by people with disabilities include physical, economic, psychological, and knowledge-related barriers (Malone, Barfield, & Brasher, 2012; Rimmer, Riley, Wang, Rauworth, & Jurkowski, 2004; Rimmer, Wang, & Smith, 2008). These barriers include difficulties with physical access; high costs associated with membership at exercise facilities or with the purchase of exercise equipment; and gaps in knowledge, both among individuals with disabilities regarding how to exercise and among staff at exercise facilities regarding how to assist them. Interventions that address these barriers are more likely to increase physical activity in individuals with mobility impairments.
Our research team is currently working toward development of active video game devices and technologies for people with disabilities. Other studies have shown that exergaming can produce beneficial exercise intensities and higher program adherence (Warburton et al., 2007; Widman, McDonald, & Abresch, 2006). Building on these encouraging results, the present efforts seek to overcome barriers that inhibit exercise among people with disabilities and to capitalize on the attentional capture of virtual reality to distract from the tedium and fatigue of exercising. Specifically, the present study investigated methods to allow participants to interface with, and navigate through, a virtual environment using marker-less tracking and gesture recognition with Microsoft Kinect.
Since its release in 2010, the Kinect has opened the door for many new applications in rehabilitation and exercise research. It has been compared to standard sensors for gait estimation (Gabel, Gilad-Bachrach, Renshaw, & Schuster, 2012) and used to provide continuous gait analysis and safety information for the elderly (Stone & Skubic, 2013). Additionally, it has been used to analyze positional information for specific body parts (Mentiplay et al., 2013), as well as for whole-body analysis in older adults (Obdrzalek et al., 2012). Other applications include rehabilitation programs (Pastor, Hayes, & Bamberg, 2012; Roy, Soni, & Dubey, 2013) and functional brain mapping (Scherer, Wagner, Moitzi, & Muller-Putz, 2012). Some studies have piloted the use of the Kinect for people with motor disabilities such as CP (Y. J. Chang, Chen, & Huang, 2011; Y. J. Chang, Han, & Tsai, 2013). Of particular relevance to the present study, the Kinect has previously been used by healthy individuals to navigate Google Earth and Google Street View, demonstrating seamless navigation of a virtual environment (Boulos et al., 2011). Other benefits of the Kinect include its low price, accuracy comparable to higher-cost technologies, and marker-less tracking (Obdrzalek et al., 2012).
The purpose of this study was to develop and test directional navigation options (i.e., body movements, voice commands) within a virtual environment using Microsoft Kinect for people with mobility impairments, specifically those post-stroke or with CP. Since exercise may involve arm ergometers, voice commands were included as a means of navigation that may allow users to make path choices without interrupting the exercise motion. The body gestures and voice commands were evaluated according to degree of success and time to complete. These data, along with results from a questionnaire rating perceived difficulty, provide initial evidence as to which interactions were most viable for individuals post-stroke and for those with CP.

Microsoft Kinect
The Microsoft Kinect is composed of a red-green-blue (RGB) camera, an infrared emitter and receiver, and a multi-array microphone. The two camera systems, RGB and infrared, capture color and depth image data, respectively. Utilizing both images, the Kinect identifies human skeletons within a scene according to its proprietary algorithm. Kinect for Windows includes a "Near Range" tracking mode that permits tracking at distances as close as 0.4 m (versus the 0.8 m minimum of the default mode) and adds a "Seated" tracking mode, which was used in this study, that tracks only the upper body, including head, shoulders, arms, wrists, and hands (Figure 1). A comparison of the Kinect to a commonly used laser scanner established its accuracy as within 0.4 in. (approximately 1 cm).

Virtual environment
In order to test the various interactions with the Kinect, a simple virtual environment was developed, featuring video recorded from a walking, first-person perspective that moved along a path until it arrived at a single four-way intersection. At that point the system required input from the user to proceed (Figure 2, left) and, upon completion of a successful gesture or voice command, played the corresponding continuation video (Figure 2, right). If the user did not perform the appropriate movement within 10 seconds, the program restarted. The pre-choice video would then play again, and the program would continue to loop through the directional choices until all interactions had been exhausted. From beginning to end, the entire simulation lasted 30 seconds, which allowed multiple Kinect interactions to be tested quickly. The virtual environment utilized real-world video captured in 720p resolution using a Canon T1i camera. These videos were motion corrected using the Deshaker plugin (http://www.guthspot.se/video/deshaker.htm) for VirtualDub (http://www.virtualdub.org) to eliminate camera shake caused by walking while recording.
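The timeout-and-loop behavior described above can be summarized in a few lines of C#. The following is a minimal sketch under assumptions, not the study's actual source; PlayVideo and PollInteraction are hypothetical stand-ins for the video playback and Kinect input handling.

```csharp
using System;
using System.Diagnostics;

enum Direction { Left, Right, Forward }

abstract class IntersectionLoop
{
    protected abstract void PlayVideo(string clip);     // hypothetical video playback
    protected abstract Direction? PollInteraction();    // hypothetical: gesture or voice input

    public void RunIntersection(Direction instructed)
    {
        PlayVideo("pre-choice");                        // walk up to the intersection
        Stopwatch timer = Stopwatch.StartNew();
        while (timer.Elapsed.TotalSeconds < 10)         // 10-second response window
        {
            Direction? choice = PollInteraction();
            if (choice == instructed)
            {
                PlayVideo("continuation-" + instructed); // play the matching branch
                return;
            }
        }
        RunIntersection(instructed);                    // timeout: replay the pre-choice video
    }
}
```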
The program that connected the videos with the Kinect and user input was developed in Microsoft Visual C# 2010 Express. Upon start-up, the Kinect and real-world environment videos were loaded and the Kinect RGB and Skeleton streams were enabled. RGB video was captured by the Kinect in 1,280 × 960 resolution at 12 frames per second (FPS). Display of the Kinect-captured RGB video allowed for proper positioning of the user in front of the Kinect before testing with the virtual environment.
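For illustration, enabling these streams with the Kinect for Windows SDK v1.x requires only a few calls. The following is a minimal initialization sketch consistent with the configuration described above, rather than the study's actual source:

```csharp
using System.Linq;
using Microsoft.Kinect;

static class KinectSetup
{
    // Returns a started sensor configured as described above,
    // or null if no Kinect is connected.
    public static KinectSensor Start()
    {
        KinectSensor sensor = KinectSensor.KinectSensors
            .FirstOrDefault(s => s.Status == KinectStatus.Connected);
        if (sensor == null) return null;

        // Color stream at 1,280 x 960 and 12 FPS, matching the study.
        sensor.ColorStream.Enable(ColorImageFormat.RgbResolution1280x960Fps12);

        // Seated mode restricts tracking to the 10 upper-body joints.
        sensor.SkeletonStream.Enable();
        sensor.SkeletonStream.TrackingMode = SkeletonTrackingMode.Seated;

        sensor.Start();
        return sensor;
    }
}
```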
Fourteen interactions were recognized: a left or right vertical hand raise over the head; a left or right hand extended horizontally to the side; a torso twist to the left or the right; a vertical shrug of the left or right shoulder; a head nod to the left or the right (Figure 3); and four voice commands ("left," "right," "forward," and "reset"). The nonverbal choices were implemented using hard-coded joint comparisons and angle calculations based on the Kinect's skeleton array data. For example, the hand extend gesture was recognized by comparing the distance between the hand and the corresponding shoulder. If this distance was greater than the distance between the shoulders, then the hand was considered to be extended.
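As a concrete illustration, the hand-extend benchmark described above reduces to a short joint comparison over the SDK's skeleton data. This is a sketch of that rule using the SDK v1.x types, not the study's actual source:

```csharp
using System;
using Microsoft.Kinect;

static class Gestures
{
    // A hand counts as "extended" when it is farther from its own shoulder
    // than the two shoulders are from each other (the benchmark described above).
    public static bool IsHandExtended(Skeleton s, bool leftHand)
    {
        SkeletonPoint hand = s.Joints[leftHand ? JointType.HandLeft : JointType.HandRight].Position;
        SkeletonPoint shoulder = s.Joints[leftHand ? JointType.ShoulderLeft : JointType.ShoulderRight].Position;
        SkeletonPoint shL = s.Joints[JointType.ShoulderLeft].Position;
        SkeletonPoint shR = s.Joints[JointType.ShoulderRight].Position;
        return Distance(hand, shoulder) > Distance(shL, shR);
    }

    static double Distance(SkeletonPoint a, SkeletonPoint b)
    {
        double dx = a.X - b.X, dy = a.Y - b.Y, dz = a.Z - b.Z;
        return Math.Sqrt(dx * dx + dy * dy + dz * dz);
    }
}
```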

Data collection
Thirty participants (14 male), including 15 individuals with CP (ages 24-59 years) and 15 individuals post-stroke (ages 43-86 years), volunteered to participate in the study (Table 1). All participants were recruited at Lakeshore Foundation (Birmingham, AL). Inclusion criteria were male or female gender, age 19 years or older, and living with CP or post-stroke. Exclusion criteria included any cognitive disability that would prevent the participant from communicating over the phone, or the inability to speak English. Approval was obtained from the Institutional Review Board at UAB (IRB #00000726, protocol X130717007). Following informed consent, each participant completed a Box and Block Test as a measure of unilateral gross manual dexterity (Mathiowetz, Volland, Kashman, & Weber, 1985). Participants were instructed to use one hand to move as many small cubes, 1 in.³ (16.4 cm³) in size, as possible from one side of a divided container to the other within 1 minute. Afterward, they repeated the procedure with the other hand. The number of blocks moved was recorded as the score for each hand, with a higher score indicating higher functional ability.
Upon completion of the Box and Block Test, each participant completed the Kinect computer interface test. At each intersection, the screen displayed text informing the participant which gesture to perform. Participants performed the 14 unique interactions (10 gestures and four voice commands) three times each in randomized order. Joint position data in three dimensions were collected at 12 FPS for the 10 seated-mode data points (head, shoulders, elbows, wrists, and hands). Coordinate data were output to a text file along with the time in milliseconds, the interaction the user was instructed to perform, and an identifier for when the Kinect registered that the movement was completed. Upon completion of the Kinect test, each participant completed a 5-point Likert questionnaire for each gesture, rating their perception of difficulty as very easy (1), easy (2), neutral (3), hard (4), or very hard (5).
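A per-frame log record of this kind might be written as below. This is a hedged sketch: the column order and formatting are assumptions, not the study's actual output format.

```csharp
using System.IO;
using System.Text;
using Microsoft.Kinect;

static class FrameLogger
{
    // The 10 joints reported in seated tracking mode.
    static readonly JointType[] SeatedJoints =
    {
        JointType.Head, JointType.ShoulderCenter,
        JointType.ShoulderLeft, JointType.ShoulderRight,
        JointType.ElbowLeft, JointType.ElbowRight,
        JointType.WristLeft, JointType.WristRight,
        JointType.HandLeft, JointType.HandRight
    };

    // One record per skeleton frame: time (ms), instructed interaction,
    // completion flag, then x/y/z for each joint.
    public static void LogFrame(TextWriter log, Skeleton s, long ms,
                                string instructed, bool completed)
    {
        var sb = new StringBuilder();
        sb.AppendFormat("{0},{1},{2}", ms, instructed, completed ? 1 : 0);
        foreach (JointType jt in SeatedJoints)
        {
            SkeletonPoint p = s.Joints[jt].Position;
            sb.AppendFormat(",{0:F4},{1:F4},{2:F4}", p.X, p.Y, p.Z);
        }
        log.WriteLine(sb.ToString());
    }
}
```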

Data processing
A successful interaction was recorded when the Kinect recognized a participant's movement or voice command as matching the required interaction. The success score for each interaction was defined as the number of movements successfully performed divided by the number of movements attempted, multiplied by 100%. Recognition time for movement completion was output automatically as a time stamp by the Kinect. Logistic regression with generalized estimating equations was used to compare the odds of success within the CP and post-stroke groups. In addition, repeated measures analysis of variance was used to compare performance time scores within the groups. p-values ≤ 0.05 (two-tailed) were considered statistically significant. All statistical analyses were conducted with SAS v. 9.3 (SAS Institute, Cary, NC).
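In equation form, the success score for an interaction is

\[
\text{Success score} = \frac{\text{movements successfully performed}}{\text{movements attempted}} \times 100\%
\]

so, for example, a participant who succeeded on two of three attempts of a gesture would score 66.7% for that gesture.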

Box and Block Test
Participants in the post-stroke group scored higher on average in the Box and Block Test than those with CP for both hands (Table 2). Large standard deviations reflected the range of ability levels among participants. For the CP participants, the upper extremities with the higher and lower scores were called "dominant" and "non-dominant," respectively, in subsequent analyses. For those post-stroke, "less impaired" and "more impaired" described the upper extremities with the higher and lower scores, respectively.

Movement analysis
Three post-stroke participants were excluded from the analyses due to incomplete data, as each moved out of view of the Kinect camera during data collection. Thus, the present analyses included 12 individuals post-stroke and 15 individuals with CP, with resulting sample sizes for each gesture of n = 36 (post-stroke) and n = 45 (CP). The statistical analyses accounted for the correlation among repeated measurements from the same participant. The resulting p-values for each paired comparison are listed in Tables 3 and 4.
Individuals with CP showed a significant effect of movement type on success (p = 0.0174, Figure 4A). The average success scores were 90% or higher for the hand extend, hand raise, and head nod gestures on both the dominant and non-dominant sides. These scores were significantly higher (Table 3) than that of the shoulder shrug on the dominant side (62% success): hand extend (p = 0.0052), hand raise (p = 0.001), and head nod (p = 0.0053). In addition, the success scores for the head nod (98%) and hand raise (98%) were significantly higher (p = 0.0032 and p = 0.0087, respectively) than the torso twist toward the dominant side (73% success). For the non-dominant side, significant differences were observed comparing the success score for the shoulder shrug (71%) against the hand extend (91%, p = 0.0159), hand raise (91%, p = 0.0372), and head nod (93%, p = 0.0397).
The post-stroke group also showed a significant effect of movement type on success (p = 0.0003, Figure 5A). Mean success scores exceeding 90% were observed for the hand extend, hand raise, and head nod on the less impaired side, as well as the torso twist on both sides. These scores were significantly higher (Table 4) than that of the shoulder shrug on the less impaired side (55% success): hand raise (97%, p = 0.0010), hand extend (94%, p = 0.0011), head nod (91%, p = 0.0027), and torso twist (94%, p = 0.0011). For the more impaired side, the success score for the torso twist gesture (97%) was significantly higher than those for the hand extend (72%, p = 0.0320), hand raise (58%, p = 0.0111), and shoulder shrug (63%, p = 0.0152). The head nod on the more impaired side was associated with an 86% success rate, significantly higher than the hand raise (p = 0.0068) and the shoulder shrug (p = 0.0313). Comparing success scores across sides, the hand extend and hand raise were the only gestures for which the less impaired side scored significantly higher than the more impaired side (p = 0.0114 and p = 0.0001, respectively).
Both groups were able to complete all gestures within 5 seconds. For the CP participants, there were no significant differences in performance time between the dominant and non-dominant sides (Figure 4B, Table 3). The hand extend gesture was significantly faster on average (2.4 sec dominant; 2.5 sec non-dominant) than the hand raise (3.1 sec dominant, p = 0.0256; 3.1 sec non-dominant, p = 0.0367), torso twist (3.9 sec dominant, p = 0.0005; 3.7 sec non-dominant, p = 0.0001), and shoulder shrug (3.1 sec dominant, p = 0.0158; 3.0 sec non-dominant, p = 0.0296). The nod gesture was significantly faster on average (2.6 and 2.7 sec, respectively) than the torso twist for both the dominant and non-dominant sides (p = 0.0024 and p = 0.0012, respectively). On the non-dominant side, the hand raise was faster than the torso twist gesture (p = 0.0452).
For the post-stroke group, the average performance time for the hand extend on the more impaired side was substantially longer (3.8 sec) than for all other gestures, though not statistically significantly so due to the large variance (Figure 5B, Table 4). For the less impaired side, the hand extend gesture was significantly faster than the torso twist (3.3 sec, p = 0.0207), shoulder shrug (3.6 sec, p = 0.0014), and head nod (3.3 sec, p = 0.0087). The hand raise gesture was also significantly faster (2.8 sec) than the shoulder shrug (3.6 sec, p = 0.0068) and head nod (3.3 sec, p = 0.0452) on the less impaired side. The torso twist gesture was significantly faster (3.1 sec) than the head nod gesture (3.7 sec, p = 0.0450) on the more impaired side.

Voice command analysis
The speech recognition software from Microsoft was found to be more accurate in distinguishing two-syllable words ("forward" and "reset") than one-syllable words ("left" and "right"). For example, participants post-stroke were highly successful with two-syllable words (86-97% success) as compared to one-syllable words (47-72%, Figure 5C). Participants with CP were generally not successful with voice commands, with success rates averaging 43-70% (Figure 4C).

Questionnaire results
Questionnaire responses were averaged by group and by Kinect interaction type (Figures 4D and 5D). Participants with CP rated the dominant-side hand extend (mean score = 1.07), hand raise (1.13), and head nod (1.20), as well as the non-dominant-side hand raise (1.27) and head nod (1.13), as relatively easy. In this group, the non-dominant torso twist and shoulder shrug were considered more difficult (mean scores of 2.40 and 2.73, respectively). For the post-stroke participants, the more difficult gestures included the less impaired hand raise (2.00), the more impaired hand extend (2.58), and the torso twist toward the more impaired side (2.17). The less impaired hand extend and shoulder shrug were both rated as very easy (1.00).

Discussion
The accuracy of the Kinect system has been tested and found adequate for the depth range utilized in this study (Dutta, 2012; Khoshelham & Elberink, 2012); therefore, the present efforts focused on defining the gestures in ways that allowed the Kinect to correctly judge their performance. The difficulty of defining gestures has been investigated, and multiple strategies have been proposed (Ibañez, Soria, Teyseyre, & Campo, 2014; Suma et al., 2013). The present study took the approach of defining gestures based on benchmarks between joints rather than implementing a machine learning algorithm. Although the benchmarking approach was perhaps coarser, it was sufficient for the scope of this study: for every gesture, there was at least one group and hand combination able to perform the required motion successfully at a rate greater than 90%. These high success rates indicate that the Kinect and the benchmarking approach reliably identified proper gesture performance, and that unsuccessful attempts reflected an inability to complete an action rather than a failure of recognition.
The practical application of the present results is to suggest gestures that could be used by people with mobility impairments to navigate virtual environments. The viability of each gesture may be established based upon success rate, completion time, and perceived ease of execution. Based upon the present data, the "most viable" gestures were selected as those with mean success rates of approximately 90% or better and questionnaire scores averaging easy to very easy. According to these criteria, the gestures that were most viable for those with CP were the hand extend, hand raise, and head nod, with hand extensions and head nods associated with the fastest performance times. Similarly, the most viable gestures for those post-stroke were the head nod, the torso twist, and the hand extend and hand raise of the less impaired extremity. Hand extensions and raises of the less impaired side were associated with the fastest performance times for this group. In general, the questionnaire results revealed that gestures that were less successful or took longer to complete were perceived as more difficult. One may envision interaction with a virtual environment during an exercise activity that requires rapid reaction times, which would limit the time available to complete each gesture; however, no such limits were explored in the present study.
For participants with CP, the shoulder shrug gesture had significantly lower success scores compared to the other gestures. Although not significant in every comparison, the shoulder shrug gesture was less successful for the post-stroke group as well. Recognition of this gesture relied on a height difference between the two shoulder joints: it required the participant to raise one shoulder while simultaneously lowering the opposite shoulder, which may not have been apparent from the instruction provided. As such, the results for this gesture may not accurately represent the performance abilities of these groups.
The torso twist gesture was associated with low success scores for the CP group. Postural control deficit is one of the defining characteristics of CP (Bax et al., 2005); therefore, trying to maintain posture while initiating a twist may have led to the lower success rates and longer performance times observed. The torso twist gesture was also ranked higher in difficulty on the questionnaire. Additionally, this movement relied on the depth measurement of the shoulder joints, which is the least accurate measurement of the sensor (Dutta, 2012; Khoshelham & Elberink, 2012). Together, these factors may explain the lower success rates for this gesture in the group with CP.
For the post-stroke group, the shoulder shrug gesture was less successful as compared to the other gestures. Of the remaining gestures, the hand raise motion using the more impaired side had the lowest success rate. This result is consistent with the nature of post-stroke conditions and suggests that, for participants with severe hemiplegia, raising a hand above the head was a more challenging gesture to accomplish than extending the hand outward.
Voice commands were not generally viable for individuals with CP. Individuals post-stroke were highly successful, however, when the command had more than one syllable. Including a keyword identifier before all voice commands, e.g., "turn" or "proceed," may improve the success rate in future studies. This strategy would also reduce false positives when words like "left" or "right" are spoken conversationally and are not intended for interpretation by the virtual exercise environment. In addition, success appeared to be linked to the absence of an accent or speech impediment; Microsoft speech recognition supports only the expected pronunciations of United States English, and adding more syllables may aid proper recognition. Finally, the use of a headset in place of the Kinect microphone array could reduce the impact of background noise by bringing the receiver closer to the speaker.
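A prefixed grammar of this kind can be sketched with the Microsoft.Speech API as below. The study's exact recognizer configuration is not documented here, and the confidence threshold and command handling are assumptions:

```csharp
using System;
using System.Linq;
using Microsoft.Speech.Recognition;

static class VoiceCommands
{
    public static SpeechRecognitionEngine Build()
    {
        RecognizerInfo info = SpeechRecognitionEngine.InstalledRecognizers()
            .First(r => r.Culture.Name == "en-US");
        var engine = new SpeechRecognitionEngine(info);

        // Keyword "turn" precedes each directional word, e.g., "turn left";
        // "reset" could be prefixed in the same way.
        var builder = new GrammarBuilder { Culture = info.Culture };
        builder.Append("turn");
        builder.Append(new Choices("left", "right", "forward"));
        engine.LoadGrammar(new Grammar(builder));

        engine.SpeechRecognized += (sender, e) =>
        {
            if (e.Result.Confidence > 0.7)          // threshold is an assumption
                Console.WriteLine(e.Result.Text);   // e.g., "turn left"
        };

        engine.SetInputToDefaultAudioDevice();      // or the Kinect audio stream
        engine.RecognizeAsync(RecognizeMode.Multiple);
        return engine;
    }
}
```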
A potential limitation of this study was the small sample size, especially given the missing data for three individuals post-stroke. Certain statistical comparisons showing strong trends did not meet the threshold for significance, and expanding the sample size may provide a stronger base for the statistical tests performed. In addition, the present study did not stratify by impairment level within either group when analyzing hand gestures (raise, extend) versus body movements (nod, twist, shrug). Including these distinctions in future analyses may allow for a better understanding of movement usability.
The Box and Block Test, while a good indicator of gross manual dexterity (Mathiowetz et al., 1985), was not a good indicator of trunk ability. Including a functional test for body movements, such as the Trunk Impairment Scale (Verheyden et al., 2004) or the expanded Trunk Control Measurement Scale (Heyrman et al., 2011), would allow for quantification and comparison of trunk ability against Kinect movement performance. Because this study enrolled adult participants, and most measures of balance in people with CP have been validated only for children (Saether, Helbostad, Riphagen, & Vik, 2013), a test validated for adults such as the Posture and Postural Ability Scale may be useful (Rodby-Bousquet et al., 2014); however, this scale requires trained reviewers. Regardless, such a test may help clarify the relationship between trunk stability and the difficulty the CP group had accomplishing certain gestures. A more comprehensive functional ability test would also allow gesture libraries to be designated according to level of impairment.
It is worth noting that the Kinect used in the present study is now 5 years old. Recently, Apple bought PrimeSense, the company that developed the depth-sensing technology in the Kinect for Microsoft, and was awarded a patent for gesture control devices (A. Chang, 2015; U.S. Patent No. 8,933,876, 2015). Emerging technologies could improve the precision of skeleton tracking and of interfaces for virtual exercise environments.

Conclusion
The purpose of the present study was to investigate the use of Microsoft Kinect for navigating a simple virtual environment by individuals with mobility impairments, specifically CP and post-stroke. Fourteen unique interactions were studied for interfacing with the environment. As users approached an intersection, they were instructed which interaction to perform in order to turn or continue straight. The study demonstrated that the Kinect was useful for navigating this virtual environment and identified the most viable gestures for future implementation with the two groups studied. Additionally, this study provides a first step in developing movement-based user interfaces for virtual exercise environments for people with mobility impairments, and lays the groundwork for larger gesture libraries that will allow these groups to use the Kinect in new ways.