Specifically, we use the part-of-speech (POS) tagging module from NLTK to retain only adjectives, nouns, and verbs, discarding all other sentence parts. The remaining words are then lemmatised and represented by their lemmas in order to normalise variations of the same word. Once the text is processed in this manner, we use the Python library wordcloud to create word clouds from frequency lists of common 2-gram and 3-gram word groups. The resulting word clouds present distinct, interpretable topics.
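A minimal sketch of this preprocessing and word-cloud step is given below, assuming NLTK and the wordcloud package are installed and the relevant NLTK models have been downloaded (e.g., punkt, averaged_perceptron_tagger, wordnet); the tag prefixes, lemmatiser settings and word-cloud options are illustrative, not our exact configuration.

```python
# Illustrative sketch: POS-filter, lemmatise, and build a word cloud from n-gram counts.
from collections import Counter

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.util import ngrams
from wordcloud import WordCloud

lemmatizer = WordNetLemmatizer()
KEEP = {"JJ": "a", "NN": "n", "VB": "v"}  # adjectives, nouns, verbs (Penn tag prefixes)

def preprocess(text):
    """POS-tag the text, keep adjectives/nouns/verbs, and return their lemmas."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text.lower()))
    kept = []
    for word, tag in tagged:
        for prefix, wordnet_pos in KEEP.items():
            if tag.startswith(prefix):
                kept.append(lemmatizer.lemmatize(word, pos=wordnet_pos))
                break
    return kept

def cluster_wordcloud(documents, n=2):
    """Word cloud from the n-gram frequency list of a cluster of documents."""
    counts = Counter()
    for doc in documents:
        counts.update(" ".join(gram) for gram in ngrams(preprocess(doc), n))
    return WordCloud(background_color="white").generate_from_frequencies(counts)

wc = cluster_wordcloud(["The patient fell on the ward at night.",
                        "A nurse reported the patient falling near the ward door."])
wc.to_file("cluster_wordcloud.png")
```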
Indeed, one of our aims is to explore the relevance of the fixed external classes as compared to the content-driven groupings obtained in an unsupervised manner. Hence we quantify the quality of the clusters along two complementary routes: an intrinsic measure of topic coherence and a measure of similarity to the external hand-coded categories, defined as follows.
Topic coherence of the content clusters: The pointwise mutual information (PMI) is an information-theoretical score that captures how much more likely two words are to appear together in the same group of documents than would be expected by chance. The PMI score for a pair of words $(w_1, w_2)$ is

$$\mathrm{PMI}(w_1, w_2) = \log \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)},$$

where $P(w_1, w_2)$ is the probability that the two words co-occur in a document of the group, and $P(w_1)$, $P(w_2)$ are the probabilities of each word occurring on its own.
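As an illustration, PMI scores of this form can be estimated from document-level co-occurrence counts as in the sketch below; the toy corpus and the simple maximum-likelihood estimates are assumptions made only for illustration.

```python
# Illustrative PMI from document-level co-occurrence counts.
import math
from collections import Counter
from itertools import combinations

def pmi_scores(documents):
    """documents: list of token lists (e.g., the lemmatised words of each record)."""
    n_docs = len(documents)
    word_counts, pair_counts = Counter(), Counter()
    for doc in documents:
        words = sorted(set(doc))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    scores = {}
    for (w1, w2), c in pair_counts.items():
        p_joint = c / n_docs
        p1, p2 = word_counts[w1] / n_docs, word_counts[w2] / n_docs
        scores[(w1, w2)] = math.log(p_joint / (p1 * p2))
    return scores

print(pmi_scores([["fall", "ward"], ["fall", "ward"], ["drug", "dose"]]))
```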
The PMI score has been shown to perform well in practice (Newman et al.; see also Agirre et al.). Similarity between the obtained partitions and the hand-coded categories: To compare against the external classification a posteriori, we use the normalised mutual information (NMI), a widely used information-theoretical score that quantifies the similarity between clusterings, taking into account both correct and incorrect assignments in terms of the information shared between the clusterings.
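A minimal sketch of this comparison using scikit-learn's implementation of NMI follows; the two label arrays are placeholders for the MS communities and the operator-assigned categories.

```python
# NMI between a content clustering and the external hand-coded categories.
from sklearn.metrics import normalized_mutual_info_score

ms_labels = [0, 0, 1, 1, 2, 2]     # community assignment per document (placeholder)
hand_coded = [0, 0, 1, 2, 2, 2]    # operator-assigned category per document (placeholder)

nmi = normalized_mutual_info_score(hand_coded, ms_labels)
print(f"NMI = {nmi:.3f}")  # 1.0 for identical partitions, 0.0 for independent ones
```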
We use the NMI to compare the partitions obtained by MS and other methods against the hand-coded classification assigned by the operator. At each Markov time, we ran multiple independent optimisations of the Louvain algorithm and selected the best partition found.
Repeating the optimisation from different initial starting points enhances the robustness of the outcome and allows us to quantify how sensitive the partition is to the optimisation procedure. To quantify this robustness, we computed the average variation of information VI(t), a measure of dissimilarity, between the top 50 partitions for each Markov time t. Once the full scan across Markov time was finalised, we carried out a final comparison of all the optimal partitions obtained, so as to assess whether any of the optimised partitions was also optimal at any other Markov time, in which case it was selected.
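The sketch below illustrates this repeated optimisation and robustness check; it uses the python-louvain package and standard modularity as a stand-in for the Markov Stability quality function (which varies with Markov time), with a placeholder graph and an illustrative number of runs.

```python
# Repeated Louvain runs and average variation of information among the top partitions.
import numpy as np
import networkx as nx
import community as community_louvain   # python-louvain
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

def variation_of_information(labels_a, labels_b):
    # VI(P1, P2) = H(P1) + H(P2) - 2 I(P1; P2), a metric of partition dissimilarity
    h_a = entropy(np.bincount(labels_a))
    h_b = entropy(np.bincount(labels_b))
    return h_a + h_b - 2.0 * mutual_info_score(labels_a, labels_b)

def louvain_runs(graph, n_runs=50):
    # Independent optimisations from different random initialisations, best first
    runs = []
    for _ in range(n_runs):
        partition = community_louvain.best_partition(graph, weight="weight")
        labels = [partition[node] for node in graph.nodes()]
        quality = community_louvain.modularity(partition, graph, weight="weight")
        runs.append((quality, labels))
    runs.sort(key=lambda r: r[0], reverse=True)
    return runs

graph = nx.les_miserables_graph()       # placeholder for the document similarity graph
top = [labels for _, labels in louvain_runs(graph)[:50]]
avg_vi = np.mean([variation_of_information(a, b)
                  for i, a in enumerate(top) for b in top[i + 1:]])
print(f"average VI among top partitions: {avg_vi:.3f}")
```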
This layered process of optimisation enhances the robustness of the outcome given the NP-hard nature of MS optimisation, which prevents guaranteed global optimality. Figure 3 presents a summary of our analysis. We plot the number of clusters of the optimal partition and the two metrics of variation of information across all Markov times.
To illustrate the multi-scale features of the method, we choose several of these robust partitions, from finer (44 communities) to coarser (3 communities), obtained at five Markov times, and examine their structure and content. We also present a multi-level Sankey diagram to summarise the relationships and relative node membership across the levels. The MS analysis of the graph of incident reports reveals a rich multi-level structure of partitions, with a strong quasi-hierarchical organisation, as seen in the graph layouts and the multi-level Sankey diagram.
It is important to remark that, although the Markov time acts as a natural resolution parameter from finer to coarser partitions, our process of optimisation does not impose any hierarchical structure a priori. Hence the observed consistency of communities across levels is intrinsic to the data and suggests the existence of content clusters that naturally integrate with each other as sub-themes of larger thematic categories.
The detection of intrinsic scales within the graph provided by MS thus enables us to obtain clusters of records with high content similarity at different levels of granularity. This capability can be used by practitioners to tune the level of description to their specific needs. To ascertain the relevance of the different layers of content clusters found in the MS analysis, we examined in detail the five levels of resolution presented in Fig. For each level, we prepared word clouds (lemmatised for increased intelligibility), as well as a Sankey diagram and a contingency table linking content clusters, i.e., graph communities, to the external hand-coded categories.
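For illustration, such a contingency table can be computed directly from the two label assignments, e.g., with pandas; the community and category labels below are placeholders, not records from the dataset.

```python
# Contingency table between content clusters (communities) and hand-coded categories.
import pandas as pd

df = pd.DataFrame({
    "community": [0, 0, 1, 1, 2, 2],
    "hand_coded_category": ["Accident", "Accident", "Medication",
                            "Labour ward", "Labour ward", "Labour ward"],
})
contingency = pd.crosstab(df["community"], df["hand_coded_category"])
print(contingency)
```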
We note again that this comparison was only done a posteriori, i.e., the hand-coded categories were not used in obtaining the content clusters. The results are shown in Figs. [Figure caption: Analysis of the results of the community partition of documents obtained by MS based on their text content and their correspondence to the external categories. Some communities and categories are clearly matched, while other communities reflect strong medical content.] [Figure caption: Results for the coarser MS partitions of the document similarity graph into (a) 7 communities and (b) 3 communities, showing in each case their correspondence to the external hand-coded categories. Some of the MS communities retain strong medical content.]

The partition into 44 communities presents content clusters with well-defined characterisations, as shown by the Sankey diagram and the highly clustered structure of the contingency table. The content labels for the communities were derived by us from the word clouds, presented in detail in the Figure in Additional file 1 in the SI.
Compared to the 15 hand-coded categories, this community partition provides finer groupings of records with several clusters corresponding to sub-themes or more specific sub-classes within large, generic hand-coded categories. In other cases, however, the content clusters cut across the external categories, or correspond to highly specific content.
Examples of the former are the content communities of records from labour ward, chemotherapy, radiotherapy and infection control, whose reports are grouped coherently based on content by our algorithm, yet belong to highly diverse external classes. At this level of resolution, our algorithm also identified highly specific topics as separate content clusters. These include blood transfusions, pressure ulcer, consent, mental health, and child protection.
We have studied two levels of resolution where the number of communities (12 and 17) is close to the number of hand-coded categories (15). The results of the community partition are presented in Fig. As expected from the quasi-hierarchical nature of our multi-resolution analysis, we find that some of the communities in these coarser partitions emerge from the consistent aggregation of smaller communities in the 44-way partition.
In terms of topics, this means that some of the sub-themes observed in Fig. are aggregated into larger themes. This is apparent in the case of Accidents: seven of the communities in the finer partition become one larger community (community 2) in Fig. Other communities cut across a few external categories. The high specificity of the Radiotherapy, Pressure ulcer and Labour ward communities means that they are still preserved as separate groups at the next level of coarseness, given by the 7-way partition (Fig. 6a).
Figure 6b shows the final level of agglomeration into 3 communities: a community of records referring to accidents; another community broadly referring to procedural matters (referrals, forms, staffing, medical procedures), cutting across many of the external categories; and the labour ward community, still on its own as a subgroup of incidents with distinctive content. This process of agglomeration of content, from sub-themes into larger themes, as a result of the multi-scale hierarchy of graph partitions obtained with Markov Stability, is shown explicitly with word clouds in Fig.
The word clouds of the partitions into 17, 12 and 7 communities show a multi-resolution coarsening of descriptive power that mirrors the multi-level, quasi-hierarchical community structure found in the document similarity graph.

Our framework consists of a series of steps for which there are choices and alternatives. Although it is not possible to provide comparisons to the myriad of methods and possibilities available, we have examined quantitatively the robustness of the results to parametric and methodological choices at different steps of the framework: (i) the use of Doc2Vec embeddings instead of BoW vectors; (ii) the size of the training corpus for Doc2Vec; and (iii) the sparsity of the MST-kNN similarity graph construction.
We have also carried out quantitative comparisons to other methods, including: (i) LDA-BoW; and (ii) clustering with other community detection methods.
We provide a brief summary here and additional material in the SI.

Quantifying the importance of Doc2Vec compared to BoW: The use of fixed-sized vector embeddings (Doc2Vec) instead of standard bag-of-words (BoW) vectors is an integral part of our pipeline. Doc2Vec produces lower-dimensional vector representations than BoW, with higher semantic and syntactic content. It has been reported that Doc2Vec outperforms BoW representations in practical benchmarks of semantic similarity, and is also less sensitive to hyper-parameters (Dai et al.).
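For reference, a minimal sketch of learning such document embeddings with gensim's Doc2Vec is given below, assuming the records have already been tokenised; the toy corpus and hyper-parameter values (vector size, window, epochs) are illustrative placeholders rather than the settings used in our pipeline.

```python
# Illustrative Doc2Vec training on a placeholder corpus of tokenised records.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

records = [["patient", "fall", "ward"],
           ["medication", "dose", "error"],
           ["delay", "referral", "clinic"]]          # placeholder tokenised records

tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(records)]
model = Doc2Vec(tagged, vector_size=300, window=5, min_count=1, workers=4, epochs=10)

# Fixed-size embedding for a new, unseen report
vector = model.infer_vector(["patient", "fall", "night"])
print(vector.shape)
```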
Robustness to the size of the dataset used to train Doc2Vec: As shown in Table 1, we have tested the effect of the size of the training corpus on the Doc2Vec model. The results, presented in the Figure in Additional file 3 in the SI, show that the performance is only mildly affected by the size of the Doc2Vec training set.
Robustness of the MS results to the level of sparsification: To examine the effect of sparsification in the graph construction, we studied the dependence of the quality of the partitions on the number of neighbours, k, in the MST-kNN graph. The partitions obtained are largely insensitive to this parameter; hence our results are robust to the choice of k, provided it is not too small. For computational efficiency, we thus favour a relatively small value of k, though not so small that robustness is compromised.
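A brief sketch of one possible MST-kNN construction is shown below, under the assumption that the graph combines a minimum spanning tree (to guarantee global connectivity) with each document's k nearest neighbours by cosine similarity; the random embeddings and the default value of k are placeholders.

```python
# Illustrative MST-kNN similarity graph from document embeddings.
import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

def mst_knn_graph(vectors, k=5):
    sim = cosine_similarity(vectors)
    dist = 1.0 - sim                    # distances for the spanning tree
    np.fill_diagonal(sim, 0.0)          # ignore self-similarity when picking neighbours

    mst = nx.minimum_spanning_tree(nx.from_numpy_array(dist))

    graph = nx.Graph()
    graph.add_nodes_from(range(len(vectors)))
    for u, v in mst.edges():            # MST edges keep the graph connected
        graph.add_edge(u, v, weight=float(sim[u, v]))
    for i in range(len(vectors)):       # add the k most similar documents to each node
        for j in np.argsort(sim[i])[::-1][:k]:
            graph.add_edge(i, int(j), weight=float(sim[i, j]))
    return graph

vectors = np.random.default_rng(0).normal(size=(100, 16))   # placeholder embeddings
graph = mst_knn_graph(vectors, k=5)
print(graph.number_of_nodes(), graph.number_of_edges())
```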
Comparison of MS to LDA topic modelling: A key difference between standard LDA and our MS method is that a different LDA model needs to be trained separately for each number of topics, which is pre-determined by the user. To offer a comparison across the methods, we obtained five LDA models corresponding to the five MS levels we considered in detail. To give an indication of the computational cost, we ran both methods on the same servers. Our method takes approximately 13 h in total to compute both the Doc2Vec model on 13 million records (11 h) and the full MS scan with partitions across all resolutions (2 h).
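For reference, a single LDA-BoW model for one pre-determined number of topics can be trained with gensim as sketched below; the toy corpus and the chosen number of topics are placeholders, and one such model would be trained per MS level for the comparison.

```python
# Illustrative LDA-BoW baseline for a fixed, user-chosen number of topics.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

records = [["patient", "fall", "ward"], ["medication", "dose", "error"],
           ["delay", "referral", "clinic"], ["patient", "fall", "fracture"]]  # placeholder

dictionary = Dictionary(records)
bow_corpus = [dictionary.doc2bow(doc) for doc in records]

lda = LdaModel(bow_corpus, num_topics=3, id2word=dictionary, passes=5, random_state=0)
for topic_id, words in lda.show_topics(num_topics=3, num_words=4, formatted=False):
    print(topic_id, [w for w, _ in words])
```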
This comparison also highlights the conceptual difference between our multi-scale methodology and LDA topic modelling. While LDA computes topics at a pre-determined level of resolution, our method obtains partitions at all resolutions in one sweep of the Markov time, from which relevant partitions are chosen based on their robustness. However, the MS partitions at all resolutions are available for further investigation if so needed.
Comparison of MS to other partitioning and community detection algorithms: We have used several algorithms readily available in standard code libraries. The Figure in Additional file 5 in the SI shows the comparison against several well-known partitioning methods, including Modularity Optimisation (Clauset et al.) and Infomap.
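As an illustration, off-the-shelf baselines of this kind can be applied to the document similarity graph as sketched below; the placeholder graph and the specific library calls are illustrative choices, not necessarily those used in the reported comparison.

```python
# Two readily available community detection baselines on a placeholder graph.
import networkx as nx
from networkx.algorithms import community as nx_comm

graph = nx.les_miserables_graph()   # placeholder for the MST-kNN document graph

# Greedy modularity optimisation (Clauset-Newman-Moore)
modularity_communities = nx_comm.greedy_modularity_communities(graph, weight="weight")
print("modularity optimisation:", len(modularity_communities), "communities")

# Asynchronous label propagation as a second off-the-shelf baseline
label_prop = list(nx_comm.asyn_lpa_communities(graph, weight="weight", seed=0))
print("label propagation:", len(label_prop), "communities")
```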
Only at very fine resolutions, with more than 50 clusters, does Infomap, which partitions graphs into small clique-like subgraphs (Schaub et al.), provide partitions of comparable quality. Therefore, Markov Stability allows us to find relevant, good-quality clusterings across all scales by sweeping the Markov time parameter.

This work has applied a multi-scale graph partitioning algorithm (Markov Stability) to extract content-based clusters of documents from a textual dataset of healthcare safety incident reports in an unsupervised manner at different levels of resolution.
The method uses paragraph vectors to represent the records and constructs a similarity graph of documents based on their content. The framework brings the advantage of multi-resolution algorithms capable of capturing clusters without imposing a priori their number or structure. Since different levels of resolution of the clustering can be relevant, the practitioner can choose the level of description and detail that suits the requirements of a specific task.
Our a posteriori analysis evaluating the similarity against the hand-coded categories and the intrinsic topic coherence of the clusters showed that the method performed well in recovering meaningful categories. The clusters of content capture topics of medical practice, thus providing complementary information to the externally imposed classification categories. Our analysis shows that some of the most relevant and persistent communities emerge because of their highly homogeneous medical content, although they are not easily mapped to the standardised external categories.
A solution today will most likely be a combination of several technologies to solve critical issues. The healthcare industry still faces many challenges on the road to embracing structured data elements and the ultimate goal of one complete, accurate EHR per patient.
Healthcare organizations continue to implement EHRs and HIEs, going back to optimize practices that have not captured structured data and changing their approaches. As these challenges are addressed, we can anticipate a major change over the next five years in the quality of data and the seamless exchange of patient data.
Big data is now an essential part of healthcare. However, various tools are needed to pull out the hidden value in the data stored in an EHR, and choosing the most suitable tool for effective data extraction requires expertise.
The essence of the framework is to solve problems associated with EHR data extraction, such as missing values, unstructured data, multiple data types, and irregularities in sampled data. Using a preprocessor for EHR data extraction helps prepare the data into a format suitable for established machine learning techniques.

Data entry is usually made by selecting a concept and navigating through the branches, and there is also the possibility of adding free text to a description.
The final output can be exported to MS Word and used as a letter or summary. In other studies, the authors have reported that the pediatric OpenSDE is accepted by pediatricians and provides completeness and uniformity of data. Comments: Most data obtained from the medical history and physical examination is recorded as narrative text, which compromises its secondary use because of the difficulty of analysing and codifying it. The possibility of structuring this information can be useful for more advanced electronic health records that can exploit the potential benefits of CPOE and CDSS.
Care delivery organizations aspire to increase the content of coded data in their Clinical Information Systems to enhance processes such as: coding for billing purposes; abstracting clinical data for Performance Improvement Efforts; and use in Clinical Decision Support.
Unfortunately, care providers find coded data entry cumbersome because it interferes with individualized patient care and workflow. In addition, when compared to traditional natural language narrative, coded entry captures only a fraction of the information produced in the clinical encounter and hence there is a trade-off between sensitivity and specificity in physician coding. There now exist systems which are capable of coding clinical notes.