Early efforts at having computers understand and produce language were inspired by the Turing test, which asks whether a computer, in its communication with a person, could plausibly be mistaken for a human.1 These efforts were informed by Chomsky's hierarchy of classes of formal grammars (Figure 1), among which the most basic grammar (syntax) is strictly rule-based, relying on fixed relationships with limited dictionaries (semantics).2 This basic grammar was used by SHRDLU, an early computer “language parser” (Figure 2) that enabled users to manipulate objects in a “block world” through written commands.3 SHRDLU could make inferences about which object “it” might refer to and learn definitions of new words for objects. It had some understanding of physics (knowing balls could not be stacked), could ask for clarification of color or size, and could answer questions about its history of actions and the current configuration of objects. It could even say “okay” and “thank you.” However, with its limited vocabulary and context, and its inability to deal with ambiguity, SHRDLU was rudimentary compared with today's language algorithms.
Chomsky's schema of hierarchy of classes of formal grammars. Reprinted (permission is not required) from Wikimedia Commons (https://commons.wikimedia.org/wiki/File:Chomsky-hierarchy.svg) under the Creative Commons Attribution–ShareAlike 3.0 unported license.
SHRDLU example. Reprinted (permission is not required) from Wikimedia Commons (https://commons.wikimedia.org/wiki/File:Ailect15shrdlu1.gif) under the Creative Commons Attribution–ShareAlike 4.0 International license.
Another early computer program for the study of natural language communication between people and machine was ELIZA.4 Its name was inspired by Eliza Doolittle, a fictional character whose language ability improved with training (Figure 3). ELIZA scanned for keywords in text entered by users, and then reorganized that text according to grammatical rules, using minimal context. If no keyword was identified, ELIZA would make a content-free remark or retrieve an earlier transformation. The resulting dialogue between human and machine resembled Rogerian client-centered therapy,5 with mirroring of emotional content and default open-ended questions. Although ELIZA would not pass the Turing test, its context-sensitive grammar and seeming empathy led many users to confide in it. ELIZA was the forerunner of modern-day “chatbots,” which use similar scanning of keywords, with preprogrammed responses to simulate conversations.
Natural language processing timeline.
Natural Language Processing and Machine Learning
In Chomsky's hierarchy,2 a Turing machine would have to master the multidimensional unrestricted grammar characteristic of human language to be mistaken for a human. The strictly rule-based approaches of SHRDLU and ELIZA, with hard “if-then” rules, could not work given the irregularity of human grammar. A more promising alternative is to have computers do what humans do—learn about language through exposure and then make “soft” probabilistic decisions about what is said or heard, with weights given to different possible meanings and prioritization of what is common and therefore likely. This approach constitutes natural language processing (NLP), which has increased the illusion of actual understanding by modern-day chatbots as compared with ELIZA. Computers have become increasingly adept at NLP by using probabilistic approaches based on acquisition of vocabulary (semantics) and learning of grammar (syntax) through machine learning (ML) algorithms trained on large bodies of text, all enabled by exponential increases in computing power and the flood of text that came with the arrival of the Internet. NLP includes probabilistic parsing of grammar, with part-of-speech tagging used for sentence boundary detection and estimates of usage,6 and latent semantic analysis (LSA), in which a word's meaning is learned from its co-occurrence with other words, similar to how children acquire vocabulary. These and other NLP analyses can be performed using the open-source Natural Language Toolkit (NLTK) (available at www.nltk.org).
ML algorithms are probabilistic in that they learn from experience and transfer that learning to new, previously unseen inputs.7 ML can be supervised: patterns for classifying predetermined groups are learned by iteratively dividing a dataset into training and test sets (N-fold cross-validation), which requires prior annotation of group membership for each person, with each person represented as a feature vector. Support vector machines learn by maximizing the margin, or distance, between predefined classes; other supervised approaches include logistic regression, decision trees, and “random forest.” ML can also be unsupervised, which is useful for discovering latent (or hidden) subgroups, features, and patterns in data. ML algorithms also typically learn from their own mistakes; this is the case for artificial neural networks, models of biological networks composed of a mesh of “perceptrons,” analogous to circuits of neurons, with an input layer that receives data, inner layers that process them, and an output layer that produces a result (eg, “patient” or “control”). Learning is implemented by computing the error between the inferred value in the output layer and the actual value for a training example; this error is “back-propagated” to the inner layers to correct future outputs. The extreme version of ML with back-propagation is “deep learning,” in which artificial neural networks with multiple hidden layers can be trained to solve problems based on feedback alone over large numbers of iterations.
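The error-correction loop described above can be sketched with a single perceptron, the simplest unit of such a network. The AND-gate data, learning rate, and epoch count below are illustrative assumptions, not drawn from any study discussed here; the update rule is the simplest analogue of back-propagation.

```python
# Minimal sketch of error-driven learning in a single perceptron.
# The AND-gate data and learning rate are illustrative assumptions.

def train_perceptron(data, epochs=20, lr=0.1):
    """Learn weights by correcting the output error on each example."""
    w = [0.0, 0.0]  # one weight per input feature
    b = 0.0         # bias term
    for _ in range(epochs):
        for (x1, x2), target in data:
            output = 1 if (w[0] * x1 + w[1] * x2 + b) > 0 else 0
            error = target - output   # error between inferred and actual value
            w[0] += lr * error * x1   # propagate the error back into the
            w[1] += lr * error * x2   # weights to correct future outputs
            b += lr * error
    return w, b

def predict(w, b, x1, x2):
    return 1 if (w[0] * x1 + w[1] * x2 + b) > 0 else 0

# Logical AND: output 1 only when both inputs are 1
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_data)
```

A multilayer network generalizes this idea by propagating the error backward through hidden layers rather than adjusting a single set of weights.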
Tagging words as parts of speech (POS) and creating a syntax tree will be familiar to anyone who learned grammar at the blackboard in elementary school. For corpus-based linguistics, large bodies of text can be tagged and annotated. For instance, a given clause (eg, The/cat/is/under/the/table) has a POS tag series (ie, “determiner/noun/verb/preposition/determiner/noun”). A challenge for automated POS tagging is that words can have different grammatical functions: eg, “fair” (noun or adjective), “dog” (noun or verb), or “out” (preposition or noun). This has been addressed by probabilistic models (eg, “hidden Markov” models) that estimate the probabilities of different sequences of POS tags given the grammar model learned.8 The open-source NLTK software is commonly used to parse text and identify the grammatical functions of words using the University of Pennsylvania's “treebank tag-set,” which has 36 POS tags encompassing nouns, verbs, prepositions, and other parts of speech.6
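A hidden-Markov-style tagger of the kind cited above can be sketched as a search for the most probable tag sequence. The tiny emission and transition tables below are made-up assumptions, not probabilities estimated from the Penn Treebank, and the tag set is reduced to four tags for readability.

```python
# Minimal sketch of hidden-Markov-model POS tagging. All probabilities
# here are illustrative assumptions, not estimated from a real treebank.

# P(word | tag) — emission probabilities
emit = {
    "DT": {"the": 1.0},
    "NN": {"dog": 0.6, "barks": 0.1, "fair": 0.3},
    "VB": {"dog": 0.1, "barks": 0.9},
    "JJ": {"fair": 1.0},
}
# P(tag_i | tag_{i-1}) — transition probabilities; "<s>" marks sentence start
trans = {
    "<s>": {"DT": 0.8, "NN": 0.1, "VB": 0.05, "JJ": 0.05},
    "DT": {"NN": 0.7, "JJ": 0.25, "VB": 0.05},
    "JJ": {"NN": 0.8, "VB": 0.2},
    "NN": {"VB": 0.6, "NN": 0.4},
    "VB": {"NN": 1.0},
}

def viterbi(words):
    """Most probable tag sequence; assumes every word is in the toy lexicon."""
    states = {"<s>": (1.0, [])}  # tag -> (probability, tag sequence so far)
    for w in words:
        nxt = {}
        for prev, (p, seq) in states.items():
            for tag, pt in trans.get(prev, {}).items():
                pe = emit.get(tag, {}).get(w, 0.0)
                cand = p * pt * pe
                if cand > nxt.get(tag, (0.0, None))[0]:
                    nxt[tag] = (cand, seq + [tag])
        states = nxt
    return max(states.values())[1]  # highest-probability path
```

Given “the dog barks,” the search resolves the ambiguous “dog” to a noun because the noun reading makes the whole sequence more probable, which is the essence of the probabilistic disambiguation described above.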
POS tagging is fundamental to the annotation of texts for NLP analyses. It has been used for author verification and attribution in literature, as different authors have unique patterns, or “fingerprints,” of grammatical use.9,10 Its use in clinical research has been limited, although it holds promise for psychosis prediction and for characterization of language in Parkinson's disease.11 In an initial proof-of-principle study, Bedi et al.12 found that among teens and young adults at clinical risk for psychosis, later onset of schizophrenia was predicted by baseline reduction in speech complexity, indexed by shorter sentences and reduced use of determiners (eg, “that,” “which”) that introduce dependent clauses. Reduction in syntactic complexity was associated with severity of negative symptoms, such as low motivation and anhedonia.
“You shall know a word by the company it keeps.”13 Thus said Firth and Sila,13 who introduced the idea that the meaning, or semantic value, of a word could be learned from the contexts in which it occurs and its rate of co-occurrence with other words. This makes inherent sense. We think of “cat” and “dog” as similar, as these words frequently co-occur, in contrast to pairings of “cat” with “toothpaste,” “pencil,” or “factory.” If computers are to communicate effectively with us, they must be able to ascribe meaning or semantic value to our words. Furnas et al.14 laid out the problem in 1987, and a decade later Landauer and Dumais15 published its proposed solution, operationalizing “distributional semantics” in the article “A solution to Plato's problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge.” “Plato's problem” is that people seem to have more knowledge than would be expected from the information they are given. Landauer and Dumais15 argued that humans, including children, exploit large numbers of weak interrelations to learn language through inference, and they believed computers could do the same. They described LSA as a high-dimensional associative model that takes as input only a large corpus of natural text, without any prior knowledge of grammatical rules or vocabulary, and produces as output degrees of semantic similarity among words in a lexicon. In LSA, a matrix is created in which rows are words in a lexicon, columns are instances of text, and cells are the (weighted) frequencies of each word in each instance of text. Just as factor analysis can reduce a large number of variables to a few descriptive factors, singular value decomposition (SVD) can reduce the data in this matrix to a few descriptive multidimensional vectors.
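The word-by-document matrix at the heart of LSA can be sketched as follows. The toy “documents” are assumptions, and the SVD dimensionality-reduction step is omitted, so similarity here is computed directly on the raw frequency vectors rather than on reduced latent dimensions.

```python
# Sketch of the word-by-document matrix underlying LSA: rows are words,
# columns are instances of text, cells are word frequencies. The toy
# documents are illustrative assumptions; the SVD step is omitted.
from math import sqrt

docs = [
    "the cat chased the dog",
    "the dog chased the cat",
    "the factory made toothpaste",
]

def word_vector(word):
    """One row of the matrix: frequency of `word` in each document."""
    return [d.split().count(word) for d in docs]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# "cat" and "dog" occur in the same documents, so their vectors align;
# "cat" and "toothpaste" never co-occur, so their vectors are orthogonal
sim_cat_dog = cosine(word_vector("cat"), word_vector("dog"))
sim_cat_toothpaste = cosine(word_vector("cat"), word_vector("toothpaste"))
```

In full LSA, SVD would compress these rows into a few hundred latent dimensions, so that words that never co-occur directly can still end up with similar vectors through shared contexts.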
Overall, LSA provides a construction of meaning in language that resembles what the human mind does: learning the meaning of words from prior experience with those words in different contexts. Words that co-occur more frequently have greater semantic similarity, and their semantic vector directions will be more closely aligned. Similar algorithms have been developed to create multidimensional semantic vectors of words from large text corpora. One example is “word2vec,” which uses neural networks to examine windows of surrounding context words, either without paying attention to word order (“continuous bag of words”) or keeping the order but skipping over words (“skip grams”).
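The context windows that word2vec trains on can be sketched as follows. The sentence and window size are illustrative assumptions, and the neural-network training itself is omitted; the sketch only shows how the two kinds of training examples are extracted from text.

```python
# Sketch of how word2vec-style training examples are built from a
# context window around each target word. The sentence and window
# size are illustrative assumptions; network training is omitted.

def cbow_examples(tokens, window=2):
    """Continuous bag of words: (set of context words, target) pairs."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((set(context), target))  # word order discarded
    return examples

def skipgram_pairs(tokens, window=2):
    """Skip grams: one (target, context word) pair per word in the window."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + 1 + window)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
```

Either kind of example is then fed to a shallow neural network, whose learned weights become the semantic vectors.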
Clinical Examples of Natural Language Processing Analyses
Among different NLP tools, LSA is commonly used in research to understand language disturbances in neuropsychiatric disorders, primarily the disruption in the flow of meaning in language in schizophrenia and its risk states. This disruption is operationalized as semantic coherence: the semantic similarity of successive words (and their vectors), aggregated at any level—single word, sentence, paragraph, full narrative, skip grams. The semantic vector for an aggregate is the sum of its word semantic vectors. Using LSA, transcripts are converted from sequences of words to sequences of semantic vectors, maintaining the original order of the text, as in POS tagging. Semantic coherence is indexed as the cosine between successive semantic vectors for words or aggregates (the dot product of the normalized vectors), which ranges from −1 for complete incoherence to 1 for complete coherence. Elvevag et al.16,17 pioneered studies in schizophrenia, finding that LSA semantic coherence differentiates speech in schizophrenia from that of healthy people with 82% accuracy, and from that of unaffected siblings with 86% accuracy. Decreased semantic coherence in schizophrenia is associated with clinical ratings,16 functional impairment,18 and decreased activation in the language network.19
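The coherence measure described above can be sketched as the mean cosine between successive aggregate vectors. The three-dimensional word vectors below are made-up assumptions standing in for LSA-derived vectors, which would typically have hundreds of dimensions.

```python
# Sketch of semantic coherence: the cosine between successive (summed)
# semantic vectors. The tiny word vectors are made-up assumptions
# standing in for vectors learned by LSA.
from math import sqrt

word_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "pet": [0.85, 0.15, 0.1],
    "factory": [0.0, 0.1, 0.9],
}

def sentence_vector(words):
    """Aggregate vector: the sum of its word semantic vectors."""
    vec = [0.0, 0.0, 0.0]
    for w in words:
        for k, x in enumerate(word_vectors[w]):
            vec[k] += x
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def coherence(sentences):
    """Mean cosine between successive sentence vectors."""
    vecs = [sentence_vector(s) for s in sentences]
    sims = [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    return sum(sims) / len(sims)

coherent = coherence([["cat", "dog"], ["pet", "dog"]])       # related topics
incoherent = coherence([["cat", "dog"], ["factory"]])        # topic jump
```

A transcript that jumps between unrelated topics yields a lower mean cosine, which is what the clinical studies cited above exploit.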
We extended these LSA analyses to transcribed open-ended interviews with young people at risk for schizophrenia (of whom about 15% to 30% develop psychosis within 2 years),20,21 adding POS tagging and using a three-dimensional “convex hull” ML classifier to estimate onset of later psychosis. Of note, a convex hull is the minimal convex polyhedron that contains all data points in a set, essentially a “shrink-wrapping” of the data. The convex hull was defined by all persons who remained free of psychosis, whereas the patients who later developed psychosis fell outside it.12 The three-dimensional convex hull classifier included reduced semantic coherence and two parameters identified by POS tagging—shortened sentence length and reduced determiner use. The automated linguistic variables correlated with symptom severity but outperformed symptom ratings in predicting psychosis. In a Brazilian Portuguese-speaking cohort, the classifier discriminated speech in patients with schizophrenia from that of healthy people, suggesting it was robust across languages and illness stages.22 The findings were replicated in a second psychosis-risk cohort, generating an ML classifier with intra-protocol accuracy of 83% and cross-protocol accuracy of 79% in predicting psychosis, and 72% accuracy in discriminating speech in recent-onset psychosis from the norm.22 Convex hull classification consistently showed that psychosis led to deviation from the norm in independent recent-onset psychosis and schizophrenia cohorts.23
A recent systematic review and meta-analysis shows that “semantic space” NLP models, primarily LSA, have medium-to-large effect sizes for discriminating neuropsychiatric diagnoses from the norm, including mood disorders, but especially the autism and psychosis spectra.24 Accuracy is highest when LSA is applied to transcripts of natural speech rather than to word lists from verbal fluency tasks.24 Of note, LSA has been used in an innovative way to distinguish states of intoxication by different substances within the same people in a laboratory setting. Specifically, the drug MDMA (3,4-methylenedioxy-methamphetamine, also known as “ecstasy”) led to language with semantic similarity to concepts such as “friend,” “support,” “intimacy,” and “rapport” as compared with placebo.25 Language analysis, therefore, could plausibly provide an alternative to toxicology testing as an index of intoxication.
One other approach to language analysis that merits attention is the use of graph theory to construct speech graphs of language structure. These are illustrated by “word trajectory graphs,” with vertices or nodes (words) and edges (arrows) that lead from one word to the next in succession; self-loops form when a speaker returns to a word. Such graphs yield measures of recurrence and connectivity within speech. Mota et al.26 have used speech graphs to distinguish the verbosity and flight of ideas of manic speech from the relatively sparse, disjointed speech of schizophrenia. In a prospective study, they found that decreased connectedness, defined as fewer nodes (words) in the “largest connected component,” was associated with anhedonia and decreased motivation, and predicted schizophrenia diagnosis 6 months later.26
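A word trajectory graph of this kind can be sketched directly from a token sequence; the sample utterance below is an illustrative assumption. Connectedness is summarized here by the size of the largest connected component, treating edges as undirected for the reachability check.

```python
# Sketch of a "word trajectory graph": nodes are words, directed edges
# link successive words, and repeating a word immediately creates a
# self-loop. The sample utterance is an illustrative assumption.

def build_graph(tokens):
    nodes = set(tokens)
    edges = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
    self_loops = [(a, b) for a, b in edges if a == b]
    return nodes, edges, self_loops

def largest_connected_component(nodes, edges):
    """Size of the largest set of mutually reachable words."""
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # depth-first traversal of one component
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(neighbors[n] - comp)
        seen |= comp
        best = max(best, len(comp))
    return best

tokens = "i went home then then i slept".split()
nodes, edges, loops = build_graph(tokens)
```

On real transcripts, sparser, more fragmented speech produces smaller connected components, which is the connectedness deficit the prospective study above reports.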
The Use of Natural Language Processing Analysis in Electronic Health Records
Electronic health records (EHRs) provide a wealth of information over time about large numbers of people, including structured data such as laboratory tests, symptom checklists, lists of medications, and diagnoses. But EHRs also contain unstructured narrative that conveys important information about clinicians' observations and insights. NLP methods have provided a means to tap into these unstructured data to improve “computational phenotyping” of cases. As described in a recent review,27 NLP can assist with diagnosis categorization, clinical trial screening, detection of drug-drug interactions and adverse events, and associations of genes with medication response (pharmacogenomics) and phenotype (EHR-based phenome-wide association studies, which are important for genome-wide association studies). Other uses include staging and prediction of recurrence in cancer, prediction of violence, and tracking of nosocomial infections.
Most NLP-ML analyses of EHRs are supervised and largely rule-based, scanning for keywords much as ELIZA and modern-day chatbots do, but with a medical focus: keywords are mapped to defined medical concepts in databases, or “clinical health terminology products,” such as SNOMED CT (Systematized Nomenclature of Medicine–Clinical Terms) or the UMLS (Unified Medical Language System) (Figure 3). An example is the use of the keywords “tobacco,” “pack-year,” and “cigarettes,” which all refer to the concept of “smoking.” Although basic (essentially a word count) and reliant on expert opinion and extensive data annotation, these rule-based keyword/concept searches of clinical narratives can greatly improve the accuracy of classification of disease and outcome compared with the use of structured data alone. For example, in psychiatry, counts of expert terms related to dimensional psychopathology (and their synonyms) in discharge summaries have shown that terms related to arousal and cognition are associated with specific genetic variants,28 as well as with length of hospital stay and neurocognitive performance.29 Remarkably, doctors' use of nonexpert words with positive valence (eg, “glad,” “lovely”) in discharge summaries is associated with a 30% reduction in suicide risk.30
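A rule-based keyword/concept search of this kind can be sketched as follows. The keyword-to-concept map is an illustrative assumption, not an excerpt from SNOMED CT or the UMLS, and a production system would also handle negation and abbreviations.

```python
# Minimal sketch of rule-based keyword/concept matching in clinical
# notes. The keyword-to-concept map is an illustrative assumption,
# not an excerpt from SNOMED CT or the UMLS.
import re

concept_keywords = {
    "smoking": ["tobacco", "pack-year", "cigarettes", "smoker"],
    "positive_valence": ["glad", "lovely", "pleasant"],
}

def concept_counts(note):
    """Count keyword hits per concept in one clinical note."""
    text = note.lower()
    counts = {}
    for concept, keywords in concept_keywords.items():
        counts[concept] = sum(
            len(re.findall(r"\b" + re.escape(kw) + r"\b", text))
            for kw in keywords
        )
    return counts

note = "Pleasant 62-year-old with a 20 pack-year history; quit cigarettes in 2010."
```

The per-concept counts become features for downstream classification, which is why this approach amounts to “essentially a word count.”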
There have been recent efforts to use unsupervised and deep learning in NLP approaches to EHRs,27 using time-series data primarily to find latent phenotypes or clusters; one example is the identification of clusters of predominant symptoms in autism spectrum disorder (seizures, psychiatric, gastrointestinal). Unsupervised and deep learning have also been used to analyze protein-protein interactions and to identify youth depression from unstructured clinical notes. Other advantages of unsupervised learning with NLP include greater scalability, owing to less need for annotation, especially if an autoencoder can be used. However, NLP approaches to EHRs are not without challenges. One is the interpretability of revealed latent phenotypes or clusters, which still require expert review. Others include heterogeneity in narratives, the use of rare terms and medical acronyms (“out of vocabulary” words that are assigned default random vectors), “rule-outs” (a source of false-positive keyword identification), and ungrammatical constructions.
NLP approaches to computational phenotyping thus far have largely been applied to the analysis of EHRs, with new efforts to enhance performance of ML models through the inclusion of multiple data modalities beyond EHRs, including public databases, social media, the scientific literature, and also biological data such as protein and gene sequences, which have been amenable to semantic analysis using word2vec (Figure 3). Yet, computational phenotyping can be expanded to include not just how patients can be described, but also to directly operationalize and analyze their behavior. Herein, we have reviewed approaches to the computational phenotyping of people through analysis of the semantics and syntax of their language.
Communication and behavior, however, extend beyond the lexical content of speech to include acoustic features of language such as prosody, rate, volume, and pauses, as well as gesture and facial expression, which have a semantics and syntax of their own. These features can be measured from audio and video and supplemented by data obtained from wearable sensors and mobile phones (eg, heart rate, blood pressure, sleep/wake patterns, geolocation, social interaction). Collectively, these data can be aggregated as a “bag of features” to obtain synergies that improve inference and learning, especially if aligned in time series.31
Further, in respect to language and communication, higher-level features can be constructed from lower-level ones in an informed manner, exemplified by the way complex messages such as sarcasm can be conveyed through aligning of emotional prosody with literal semantic content of words, and specific facial expressions and gestures. This has real-world applications for illnesses such as schizophrenia that are characterized by deficits in social communication (including sarcasm), which may be manifest as deviation from the norm in the semantics, syntax, and alignment of language and behavioral features.
We also hope to use NLP to understand the form of discourse that is one of psychiatry's main therapeutic tools, specifically psychotherapy. Thus far, we are limited to modeling dialogue by transforming speakers' entries into abstract representations or “topics” used to learn “best practices” from successful therapy sessions or interviews, which requires expert definitions of success. Although the concepts of “working alliance” and “turn-taking” can be formalized using NLP techniques, computational modeling of dialogue thus far is mostly generative, in that it aims to create fully formed sentences, not unlike ELIZA and chatbots.
Therefore, the interactive systems that can be implemented now have only a limited set of options for triage and recommendation, and today, or even in the near future, it is unlikely that we will have an automated, unconstrained conversational system. We do not yet have systems that can clearly pass the Turing test under real conversational demands, as in psychotherapy. However, we have seen that computational development, like biological evolution, can proceed through a mixture of steady progression and quantum leaps, such that it may not be too soon to think about, and to prepare for, the day on which “Turing will pass the Freud test.”
- Turing AM. Computing machinery and intelligence. Mind. 1950;59:433–460. doi:10.1093/mind/LIX.236.433 [CrossRef]
- Chomsky N. Three models for the description of language. IRE Trans Inf Theor. 1956;2:113–124. doi:10.1109/TIT.1956.1056813 [CrossRef]
- Winograd T. Understanding natural language. Cogn Psychol. 1972;3(1):191. doi:10.1016/0010-0285(72)90002-3 [CrossRef]
- Weizenbaum J. ELIZA—a computer program for the study of natural language communication between man and machine. Commun ACM. 1966;9:36–45.
- Rogers CR, Carmichael L. Counseling and Psychotherapy: Newer Concepts in Practice. Boston, MA: Houghton Mifflin; 1942.
- Santorini B. Part-of-speech tagging guidelines for the Penn Treebank Project (3rd revision). https://repository.upenn.edu/cis_reports/570/. Accessed April 11, 2019.
- Johnson M. How the statistical revolution changes (computational) linguistics. In: Baldwin T, Kordoni V, eds. Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous? Athens, Greece: Association for Computational Linguistics; 2009:3–11.
- Kupiec J. Robust part-of-speech tagging using a hidden Markov model. Comput Speech Lang. 1992;6:225–242. doi:10.1016/0885-2308(92)90019-Z [CrossRef]
- Zhao Y, Zobel J. Searching with style: authorship attribution in classic literature. In: Proceedings of the Thirtieth Australasian Conference on Computer Science. Vol 62. Ballarat, Victoria, Australia: Australian Computer Society; 2007:59–68.
- Pokou YJM, Fournier-Viger P, Moghrabi C. Authorship attribution using variable length part-of-speech patterns. Int Conf Agent Artific Intell. 2016;2:354–361. doi:10.5220/0005710103540361 [CrossRef]
- García AM, Carrillo F, Orozco-Arroyave JR, et al. How language flows when movements don't: an automated analysis of spontaneous discourse in Parkinson's disease. Brain Lang. 2016;162:19–28. doi:10.1016/j.bandl.2016.07.008 [CrossRef]
- Bedi G, Carrillo F, Cecchi GA, et al. Automated analysis of free speech predicts psychosis onset in high-risk youths. NPJ Schizophr. 2015;1:15030. doi:10.1038/npjschz.2015.30 [CrossRef]
- Firth JR, Sila J. A synopsis of linguistic theory, 1930–1955. In: Firth JR, ed. Studies in Linguistic Analysis. Special volume of the Philological Society. Oxford, UK: Blackwell; 1957:1–32.
- Furnas GW, Landauer TK, Gomez LM, Dumais ST. The vocabulary problem in human-system communication. Commun ACM. 1987;30:964–971. doi:10.1145/32206.32212 [CrossRef]
- Landauer TK, Dumais ST. A solution to Plato's problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol Rev. 1997;104:211–240. doi:10.1037/0033-295X.104.2.211 [CrossRef]
- Elvevag B, Foltz PW, Weinberger DR, Goldberg TE. Quantifying incoherence in speech: an automated methodology and novel application to schizophrenia. Schizophr Res. 2007;93:304–316. doi:10.1016/j.schres.2007.03.001 [CrossRef]
- Elvevag B, Foltz PW, Rosenstein M, DeLisi LE. An automated method to analyze language use in patients with schizophrenia and their first-degree relatives. J Neurolinguistics. 2010;23:270–284. doi:10.1016/j.jneuroling.2009.05.002 [CrossRef]
- Holshausen K, Harvey PD, Elvevåg B, Foltz PW, Bowie CR. Latent semantic variables are associated with formal thought disorder and adaptive behavior in older inpatients with schizophrenia. Cortex. 2014;55:88–96. doi:10.1016/j.cortex.2013.02.006 [CrossRef]
- Tagamets MA, Cortes CR, Griego JA, Elvevåg B. Neural correlates of the relationship between discourse coherence and sensory monitoring in schizophrenia. Cortex. 2014;55:77–87. doi:10.1016/j.cortex.2013.06.011 [CrossRef]
- Fusar-Poli P, Bonoldi I, Yung AR, et al. Predicting psychosis: meta-analysis of transition outcomes in individuals at high clinical risk. Arch Gen Psychiatry. 2012;69(3):220–229. doi:10.1001/archgenpsychiatry.2011.1472 [CrossRef]
- DeVylder JE, Muchomba FM, Gill KE, et al. Symptom trajectories and psychosis onset in a clinical high-risk cohort: the relevance of subthreshold thought disorder. Schizophr Res. 2014;159(2–3):278–283. doi:10.1016/j.schres.2014.08.008 [CrossRef]
- Corcoran CM, Carrillo F, Fernández-Slezak D, et al. Prediction of psychosis across protocols and risk cohorts using automated language analysis. World Psychiatry. 2018;17:67–75. doi:10.1002/wps.20491 [CrossRef]
- Cecchi G, Corcoran C. O2.3. Automated analysis of recent-onset and prodromal schizophrenia. Schizophr Bull. 2018;44(suppl 1):S76. doi:10.1093/schbul/sby015.193 [CrossRef]
- de Boer J, Voppel A, Begemann M, Schnack H, Wijnen F, Sommer IEC. Clinical use of semantic space models in psychiatry and neurology: a systematic review and meta-analysis. Neurosci Biobehav Rev. 2018;93:85–92. doi:10.1016/j.neubiorev.2018.06.008 [CrossRef]
- Bedi G, Cecchi GA, Slezak DF, Carrillo F, Sigman M, de Wit H. A window into the intoxicated mind? Speech as an index of psychoactive drug effects. Neuropsychopharmacology. 2014;39:2340–2348. doi:10.1038/npp.2014.80 [CrossRef]
- Mota NB, Copelli M, Ribeiro S. Thought disorder measured as random speech structure classifies negative symptoms and schizophrenia diagnosis 6 months in advance. NPJ Schizophr. 2017;3:18. doi:10.1038/s41537-017-0019-3 [CrossRef]
- Zeng Z, Deng Y, Li X, Naumann T, Luo Y. Natural language processing for EHR-based computational phenotyping. IEEE/ACM Trans Comput Biol Bioinform. 2019;16(1):139–153. doi:10.1109/TCBB.2018.2849968 [CrossRef]
- McCoy TH Jr, Castro VM, Hart KL, et al. Genome-wide association study of dimensional psychopathology using electronic health records. Biol Psychiatry. 2018;83:1005–1011. doi:10.1016/j.biopsych.2017.12.004 [CrossRef]
- McCoy TH Jr, Yu S, Hart KL, et al. High throughput phenotyping for dimensional psychopathology in electronic health records. Biol Psychiatry. 2018;83:997–1004. doi:10.1016/j.biopsych.2018.01.011 [CrossRef]
- McCoy TH Jr, Castro VM, Roberson AM, Snapper LA, Perlis RH. Improving prediction of suicide and accidental death after discharge from general hospitals with natural language processing. JAMA Psychiatry. 2016;73:1064–1071. doi:10.1001/jamapsychiatry.2016.2172 [CrossRef]
- Baker JT, Germine LT, Ressler KJ, Rauch SL, Carlezon WA Jr. Digital devices and continuous telemetry: opportunities for aligning psychiatry and neuroscience. Neuropsychopharmacology. 2018;43(13):2499–2503. doi:10.1038/s41386-018-0172-z [CrossRef]