Accuracy of Laryngoscopy for Quantitative Vocal Fold
Analysis in Combination with AI, A Cohort Study of Manual
Artefacts

Mette Pedersen; Christian F Larsen

ISSN: 2641-1709

Scholarly Journal of Otolaryngology

Research Article(ISSN: 2641-1709)

Volume 6, Issue 3

Mette Pedersen¹*, Christian F Larsen²

¹Medical Centre, Østergade, Copenhagen, Denmark
²Copenhagen Business School, Solbjerg Plads, Denmark

Received: April 01, 2021; Published: April 15, 2021

Corresponding author: Mette Pedersen, Medical Centre, Østergade 18, Copenhagen, Denmark

DOI: 10.32474/SJO.2021.06.000237

Abstract

Introduction: A cohort of high-speed videoendoscopies was evaluated for usability for deep learning. The aim of our study was to find the percentage of our high-speed videos (15,732) that could be used for deep learning (AI). A screening of the material showed that some videos had artefacts that made them non-usable for deep learning.

Material: A randomisation was made with the Wolfram Alpha random number generator, selecting among 15,732 videos from 7,909 patients. The categories of non-usable videos are described: the rear part of the vocal folds not seen, the epiglottis or uvula blocking the view, parts of the vocal folds not seen, no vibration of the vocal folds, persistently constricted larynx, picture taken from an oblique angle, the front part of the vocal folds not seen, and parts of the arytenoid region not seen.

Method: Assuming the assessments are independent with regard to whether there is a finding, the total number of assessments with a given finding is binomially distributed. With 100 assessments, an observed incidence of 1, 10 and 25 findings will result in estimated 95% confidence intervals of [0%-3%], [4%-16%] and [17%-33%], respectively. The 95% confidence intervals were calculated as Wald intervals, using the asymptotic normal approximation for the estimated proportion in the binomial distribution. Assuming the incidence of each of the different findings was below 25%, the expected length of the 95% confidence interval is at most 16 percentage points (33 - 17) with 100 assessments; with 200 and 500 assessments, the corresponding lengths are 14 and 8 percentage points, respectively. Based on these calculations, 100 randomised videos were judged sufficient for the analysis.

Results and Conclusion: The prospective cohort study of high-speed videos covered 12 years, from February 2007 to January 2019, in an otorhinolaryngology medical centre. A total of 15,732 high-speed videos of the larynx, including the vocal folds, from 7,909 patients had been consecutively sampled (4,000 frames per second, Richard Wolf Ltd. Endocam 5562). Observations of usable versus non-usable videos, with 95% confidence intervals, showed that only 51% were usable. Notably, pictures taken from an oblique angle (10%) and insufficient pictures of the front of the vocal folds and of the arytenoids (14% each) were the largest non-usable groups. These can be corrected by the examiner in future recordings. Various video and deep learning programs are discussed.

Keywords: Manual artifacts; deep learning; vocal fold analysis; quantitative measures

Abbreviations: AI: Artificial Intelligence; OCT: Optical Coherence Tomography

Introduction

Deep learning, a branch of Artificial Intelligence (AI), is a future possibility for quantitative measurements of the vocal folds and for documenting treatment effects. We were interested in a co-operation on deep learning, since we wanted our prospective cohort of high-speed video endoscopy results to be analysed [1]. In many cases the videos had artefacts that made them non-usable for deep learning. Analysing our prospective cohort material of 15,732 videos therefore seemed to be of interest, and several categories of non-usable videos were defined by inspecting the videos. The vocal folds and the arytenoids must be fully visible for optimal evaluation. We recorded high-speed videos with rigid scopes for indirect laryngoscopy, but the results also apply to nasal endoscopy of the larynx and to stroboscopy in combination with AI. A discussion of the statistics led to a plan for randomisation before the calculations. High-speed video is usable for quantifying vocal fold measurements [2-9]. Difficulties with light and spatial cameras have been discussed [10-13]. Videos can be affected by patient movement, as well as by nonlinear distortions and phase asymmetry [14,15]. Artefacts arise when parts of the glottis are concealed, and parts of the arytenoid cartilage can cover other structures. In the future, deep learning might be trained to some extent to take abnormal videos into account and correct them, but that will take some time, and there is a risk of bias [16,17]. The aim of our study was to find the percentage of our clinical material of 15,732 high-speed videos that could be used for deep learning (AI) and later for Optical Coherence Tomography (OCT). Clinicians should be advised to take the necessary precautions to ensure that laryngoscopies are usable with AI.

Material

A cohort of high-speed videos was evaluated for usability for deep learning. Regularity of the high-speed recordings of the vocal folds was required. The ideal video showed vibrating vocal folds from front to rear without any hidden parts, including well-defined arytenoids, which are relevant for pathology. The following distortions could occur: the vocal folds could be hidden in the front or in the rear part; the video could be recorded from an oblique angle; the epiglottis or uvula could block the view; the larynx could be persistently constricted; the arytenoid region could be insufficiently shown; and vibration of the vocal folds could be lacking.

Recordings of high-speed videos (Figures 1 & 2) were made with the HRES Endocam 5562 from Richard Wolf Ltd. at 4,000 frames per second and 256 x 256 pixels, using rigid 90-degree scopes. For randomisation, the Wolfram Alpha random number generator was used. The patients' high-speed videos were stored in 4 folders, and the number generator was used to choose between the folders. For each folder, the random number generator was used to generate a random number within the total number of patients in that folder: first folder 1,470 patients, 2,638 videos; second folder 2,350 patients, 4,790 videos; third folder 2,232 patients, 4,605 videos; fourth folder 1,857 patients, 3,699 videos. Some patients had multiple recordings, and the random number generator was then used to choose among them.

To relate our study to the literature, we searched for comparable studies. The search strategy in databases for papers on inspection results of larynx videos was developed with the library of the Royal Society of Medicine, UK. The planned search words were: Vocal folds, Larynx, Stroboscopy, High-speed digital imaging, Voice assessment, Recording, Examiner, Error, Video endoscopy, Observational error, Measurement error, Distortion. The final, ninth search (S9) included 13 publications, three of which concerned the larynx [14,17,18].
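The three-stage draw described above (folder, then patient, then recording) can be sketched in a few lines. This is only an illustration, not the authors' procedure, which used the Wolfram Alpha generator. The function `recordings_for` is a hypothetical stand-in for looking up how many recordings a patient has. Uniform choice at each stage and redrawing duplicates are assumptions.

```python
import random

# Folder sizes as reported in the text: folder -> (patients, videos)
FOLDERS = {1: (1470, 2638), 2: (2350, 4790), 3: (2232, 4605), 4: (1857, 3699)}


def recordings_for(folder, patient):
    """Hypothetical lookup of a patient's number of recordings (stubbed as 2)."""
    return 2


def draw_sample(k, seed=0):
    """Draw k distinct (folder, patient, recording) triples in three stages."""
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < k:
        folder = rng.randint(1, len(FOLDERS))          # stage 1: folder
        n_patients, _ = FOLDERS[folder]
        patient = rng.randint(1, n_patients)           # stage 2: patient in folder
        recording = rng.randint(1, recordings_for(folder, patient))  # stage 3
        chosen.add((folder, patient, recording))       # duplicates are redrawn
    return sorted(chosen)


sample = draw_sample(100)
print(len(sample), sample[:3])
```

Under these assumptions, a staged draw does not give every video exactly the same selection probability, since folders and patients differ in size. A single draw over all 15,732 video indices would be the equal-probability alternative.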

Method

Figure 1: Presentation of distorted videos. A: Rear part of the vocal folds not seen. B: Epiglottis or uvula blocks the view. C: Parts of the vocal folds are not seen. D: No vibration of vocal folds. E: Persistently constricted larynx. F: Picture taken from an oblique angle. G: Front part of the vocal folds not seen. H: Parts of the arytenoids are not seen.

Assuming the assessments are independent with regard to whether there is a finding, the total number of assessments with a given finding is binomially distributed. With 100 assessments, an observed incidence of 1, 10 and 25 findings will result in estimated 95% confidence intervals of [0%-3%], [4%-16%] and [17%-33%], respectively. The 95% confidence intervals were calculated as Wald intervals, using the asymptotic normal approximation for the estimated proportion in the binomial distribution. Assuming the incidence of each of the different findings was below 25%, the expected length of the 95% confidence interval is at most 16 percentage points (33 - 17) with 100 assessments; with 200 and 500 assessments, the corresponding lengths are 14 and 8 percentage points, respectively. Based on these calculations, 100 randomised videos were judged sufficient for the analysis.

The randomisation was made as mentioned, with the Wolfram Alpha random number generator selecting among 15,732 videos from 7,909 patients. The total size of all recordings was 515 GB. To evaluate whether a video was usable, two experienced observers went through the randomised recordings and carefully categorised each finding; in some cases, more than one non-usable finding was seen in the same video. The normal findings included clear presentation of the different parts of the vocal folds and the arytenoid region, with well-defined vibrations. For use in deep learning, it is not possible to extrapolate from a partial view what the full picture would have shown. To illustrate the various findings, we have drawn examples of the groups of non-usable videos, presented in Figure 1. Figure 2 shows an example of a usable video, with and without a drawing of the relevant areas. The non-usable groups were compared by age and gender, and the influence of the number of frames on the results was examined.
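As a minimal sketch of the Wald calculation described above, the following computes the interval for an observed count and the interval width for a planned sample size. The function names are illustrative only, and clipping the bounds to [0, 1] is an assumption that matches the reported lower bound of 0%.

```python
import math


def wald_ci(findings, n, z=1.96):
    """Wald 95% interval for a binomial proportion, clipped to [0, 1]."""
    p = findings / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)


def wald_width(p, n, z=1.96):
    """Unclipped width of the Wald interval at proportion p and sample size n."""
    return 2 * z * math.sqrt(p * (1 - p) / n)


# Planning values from the text: 1, 10 and 25 findings in 100 assessments
for k in (1, 10, 25):
    lo, hi = wald_ci(k, 100)
    print(f"{k:>2}/100: [{100 * lo:.1f}%, {100 * hi:.1f}%]")

# Interval width at a 25% incidence for 100, 200 and 500 assessments
for n in (100, 200, 500):
    print(f"n={n}: width {100 * wald_width(0.25, n):.1f} percentage points")
```

With the plain formula, the unrounded widths at a 25% incidence are about 17, 12 and 7.6 percentage points. The figures quoted in the text are rounded, and the value for 200 assessments differs somewhat from this calculation.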

Results

The prospective cohort study of high-speed videos covered 12 years, from February 2007 to January 2019, in an otorhinolaryngology medical centre. A total of 15,732 high-speed videos of the larynx, including the vocal folds, from 7,909 patients had been consecutively sampled (4,000 frames per second, Richard Wolf Ltd. Endocam 5562). The Wald 95% confidence interval method allowed 100 randomised videos to be used to extrapolate to the total material. Using the Wolfram Alpha random number generator for randomisation and JMP 16, 2021 (SAS Institute) for statistical calculations, the following results for age, gender and number of frames were found.

The mean age for all groups together was 44 years (range 9-82 years; 95% CI 40.6-47.5; SD 17.2) (Figure 3). The mean age for men was 46 years (range 9-74 years; 95% CI 40.5-51.6; SD 15.5), and the mean age for women was 43 years (range 13-82 years; 95% CI 38.6-47.5; SD 18.1) (Figure 4). ANOVA showed no statistically significant difference for age groups and number of frames (Figure 5). A one-way analysis of age by usable versus non-usable videos showed no statistical difference between groups (Figure 6). The age distribution for usable versus non-usable videos is shown in Figure 7.

In an analysis of usability versus gender, 56.72% of the female videos and 39.39% of the male videos were usable. Fisher's exact two-sided test (p = 0.1369) showed no significant difference between genders (Table 1). We also checked whether videos improved with repeated examinations. One hypothesis was that patients became more familiar with the equipment and relaxed more, allowing a better recording to be made. Another hypothesis was that as patients' symptoms improved, it became easier to get a good recording. In an analysis of usability by examination number within one patient (1st, 2nd, 3rd examination, etc.), Pearson's test (p = 0.8185) showed no statistical difference between groups (Table 2). The usable videos show regular movements of the vocal folds and clear presence of the arytenoids and the false vocal folds (Figure 2).

Table 1: Contingency analysis of usability versus gender. For the total number of female videos (F) and the total number of male videos (M), the percentage that were usable (Y) and the percentage that were non-usable (N) are shown. Fisher's exact two-sided test (p = 0.1369) showed no statistical difference between groups (JMP 16, 2021, SAS Institute).
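The gender comparison can be re-checked with a short pure-Python implementation of Fisher's exact test. The cell counts below are not given in the text. They are inferred from the reported percentages in a sample of 100, on the assumption that 56.72% corresponds to 38 of 67 female videos and 39.39% to 13 of 33 male videos. This is a sketch for illustration, not the JMP computation used in the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Inferred counts (assumption): rows = female, male; columns = usable, non-usable
p = fisher_exact_two_sided(38, 29, 13, 20)
print(f"two-sided p = {p:.4f}")
```

The result can be compared with the value of 0.1369 reported in Table 1.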

Table 2: Contingency analysis of usable (Y) and non-usable (N) videos by examination number within one patient (1st, 2nd, 3rd examination, etc.). Pearson's test (p = 0.8185) showed no statistical difference between groups (JMP 16, 2021, SAS Institute).

Figure 2: Clinical high-speed video with visible vocal folds and arytenoid regions (usable).

Figure 3: The mean age for all groups together was 44 years (range 9-82 years; 95% CI 40.6-47.5; SD 17.2) (JMP 16, 2021, SAS Institute).

Figure 4: The mean age for men (M) was 46 years (range 9-74 years; 95% CI 40.5-51.6; SD 15.5). The mean age for women (F) was 43 years (range 13-82 years; 95% CI 38.6-47.5; SD 18.1) (JMP 16, 2021, SAS Institute).

Figure 5: Bivariate fit of number of frames by age. ANOVA showed no statistically significant difference for age groups and number of frames (Prob > F = 0.2735) (JMP 16, 2021, SAS Institute).

Observations of usable versus non-usable videos, with 95% confidence intervals, showed that only 51% were usable. Notably, pictures taken from an oblique angle (10%) and insufficient pictures of the front of the vocal folds (14%) and of the arytenoids (14%) were the largest non-usable groups. These can be corrected by the examiner in future recordings. Another notable result was that so many videos showed a persistently constricted larynx (9%). The examiner should secure vibration of the vocal folds if possible, since it was absent in 7% of the videos. In some cases parts of the vocal folds were not visible (5%) because the laryngoscope was not centred, owing to anatomical variation among other causes. In some cases the epiglottis or uvula blocked the view (4%). We also included an overview of the cases where the arytenoids were insufficiently visible (14%), which is of special interest in mucosal disorders of the larynx (e.g., reflux, allergy, infection) (Figure 8).

The observed proportions, with Wald 95% confidence intervals, were: rear part of the vocal folds not seen, 3% (0% to 6.3%); epiglottis or uvula blocking the view, 4% (0.1% to 7.8%); parts of the vocal folds not seen, 5% (0.7% to 9.3%); no vibration of the vocal folds, 7% (2% to 12%); persistently constricted larynx, 9% (3.4% to 14.6%); picture taken from an oblique angle, 10% (4.1% to 15.9%); front part of the vocal folds not seen, 14% (7.2% to 20.8%); parts of the arytenoids not seen, 14% (7.2% to 20.8%); indirect video endoscopy with visible vocal folds and arytenoid regions (usable), 51% (41.2% to 60.8%).
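As a worked example of how these intervals arise, the usable proportion of 51 out of 100 videos gives the following under the Wald formula. The symbols \hat{p} and n are introduced here for illustration, and 1.96 is the usual normal quantile for a 95% interval.

```latex
\hat{p} \pm 1.96\sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}
  = 0.51 \pm 1.96\sqrt{\frac{0.51 \times 0.49}{100}}
  \approx 0.51 \pm 0.098
  \;\Rightarrow\; [41.2\%,\ 60.8\%]
```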

Figure 6: One-way analysis of age at recording by usable versus non-usable videos, showing no statistical difference between groups (Prob > F = 0.17) (JMP 16, 2021, SAS Institute).

Figure 7: Age distribution for usable (Y) versus non usable (N) videos is shown (JMP 16, 2021 SAS institute).

The statistical advice to randomise 100 videos illustrates our point sufficiently. The results show that the examiner needs to focus on optimising the recordings so that they can be used for quantification and AI. In the literature search we found 3 larynx-related papers among the 13 papers in S9. Ghasemzadeh and Deliyski discuss the effect of flexible fiberoptic endoscope distortions on the calibration of images acquired with the laser projection system. The first distortion arises from the wide-angle lens, with higher spatial resolution in the centre of the field of view; the second arises from variation in the imaging angle [14]. Adamian, Naunheim and Jowett discuss automatic quantitative tracking of vocal fold motion from video laryngoscopy, focusing on the glottal opening angles [17]. A thesis by Deng concludes that the camera frame rate, spatial resolution and angle of view can all modify the resulting video of the vocal folds, and various algorithms are discussed [18]. Based on the results, it must be underlined that good-quality videos are a prerequisite for any statistical quantitative evaluation of the vocal folds. The results show that this prerequisite is probably not met in daily clinical work.

Discussion

The background for this study was a potential collaboration on deep learning that initially required 20 normal videos, which proved to be challenging [1]. Statistically, a random sample of 100 of the 15,732 videos was sufficient to find the percentage of videos usable for deep learning. 51% of the videos were usable for reproduction with deep learning. Some of the non-usable videos could have been improved by the examiner. In Figure 8 this applies to the pictures taken from an oblique angle, the front and the rear of the vocal folds not seen, and parts of the arytenoids not seen, all of which can be corrected by adjusting the laryngoscope. In a few cases the larynx will remain constricted, or the epiglottis or uvula will block the view. It is noted that age and gender did not differ significantly between the groups in the study. In Figure 8, indirect videoendoscopy with visible vocal folds and arytenoid regions (usable) accounted for 51% (Wald 95% CI: 41.2% to 60.8%). The examination number did not influence the probability of a video being usable. A higher percentage of videos from women than from men were usable, but the difference was not statistically significant in this study (Tables 1 & 2). There is an ongoing discussion of videostroboscopy and high-speed videoendoscopy for evaluation of the amplitude and edge of the mucosal wave and of left-right phase asymmetry [19]. A possible solution has been suggested for handling artefacts [20]. Various deep learning software packages have been discussed [21-23]. Since stroboscopy is a major clinical diagnostic tool for functional larynx evaluation, its use in deep learning and optical coherence tomography probably needs to be elucidated further in the future [24,25].

Figure 8: Bar chart of observations on high-speed videos for the usable versus non-usable videos with 95% confidence intervals. Notably, only 51% are usable (Excel version).

Conclusion

Only 51% of high-speed videos in a clinical setting were sufficient, with full pictures of the vocal folds and arytenoids. Clinicians need to focus on optimising videos while recording so that they can be used for quantitative measurements and deep learning; this is also the case for optical coherence tomography. The comparison of videostroboscopy with high-speed video favours high-speed video, since stroboscopy pictures do not include all consecutive vocal fold movements.

References
