Post on 07-Jan-2023
Challenges and methodology for indexing the computerized patient record
Frédéric Ehrler
KIM Lab, 2007
©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Plan
• Problems• Information retrieval
• Indexing• Retrieval• Evaluation
• Retrieval in patient records• Specificity of our task• Acquiring required resources
• Documents collection• Queries and relevance assessments
• Results• Indexing efficiency• Retrieval efficiency and effectiveness
©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Problematic
Improvement of the care providers efficiency
• Fact• Patient records contain most crucial documents for managing the
treatments and healthcare of patients in the hospital
• Problem• Care providers waste precious time searching and browsing the
patient record to collect all information pertinent to the actual situation
• Proposed solution• Indexing the patient record to retrieve efficiently and effectively
relevant information from patient documents Information retrieval
©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Information Retrieval: Introduction
• Selecting from a relatively large collection of documents a manageable number of documents that is likely to satisfy an expressed need for information (Query)
• Two type of input• The document collection asynchronous processing • The queries real time processing
©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Information Retrieval: Indexing
• Indexing• Creation of inverted files to improve the retrieval
speed• Inverted files
• Contains all the words of the collection• Link each word with the list of documents that contain it
©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Information Retrieval: Retrieval
• Retrieval• The user perform a query• Comparison between every document and the query
in the vector space • Text represented by a vector of terms• Cosine distance
• The system return the documents that are the most similar to the query
• The documents returned contain the important terms of the query
©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Information Retrieval: Key Techniques
• Weighting schema• Not all the words have the same significance level
• Words occurring with high frequency in a document are better discriminators than words of low frequency
• Words occurring in many documents of a corpus are less discriminative than rare words
• Two metrics reflect these intuitions• TF: frequency of the word in the document• IDF: frequency of the word in the corpus
• Text normalization• Stemming techniques
• Suffixes transformation rules• Reduction of the language variation
• Query extension• Synonym
• Improve the coverage
©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Information Retrieval: Evaluation
• Two key metrics: precision and recall• Precision: measures the proportion of retrieved
documents which are relevant • Recall: measures the proportion of relevant
documents retrieved
• Typical balance between recall and precision• F-measure
©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Retrieval in Patient Records
• Information retrieval technologies reply to our needs
• Applying IR in patient records• Specificity of our task• Acquiring required resources
• Documents collection• Queries and relevance assessments
• Results• Indexing efficiency• Retrieval efficiency and effectiveness
©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Specificity of our Task
• Usually in common IR task • All the documents are stored in a unique and large
corpus• Indexing is asynchronous
• Our task• Numerous small corpora must be indexed
independently • The indexing efficiency is crucial• The data must be always up to date
©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Test Collection Construction
• OSHUMED corpus share important common properties with patient record corpus
• Prediction of the behavior of our system on the patient records by looking at the results obtained on the OHSUMED corpus
• Rue to his Homogeneity, OSHUMED facilitate measures and reliability
• Documents share similar properties like length, word distribution, and word frequency
• Avoids biasing the experiments and allows focusing only on the quality of retrieval
• The final application is dedicated to run on the patient records however, the documents used in the experiment are selected from the OHSUMED corpus
©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Documents Collection
• Composed of three parts• The collection of documents
• Must reflect the specificity of the patient record corpus• Organized in a two level structure
• The documents are spited in groups that represent a patient records
• The number of group vary in order two study the consequence of the structure on the retrieval process
• The retrieval is performed only at the group level
• The queries• Automatically built
• The relevance assessments
©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Structure of the Patient Record Corpus
• Few large patient records• 3% of the records contain more than 100 documents• Returning a small subset of documents containing the
answer could be sufficient• Reducing the possible research space brings a
significant gain of time for the care providers
• Many small patient records• 50% of the records have less
than 7 documents• Little interest in using a tool
not returning the exact answer
©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Document Collection Structure
• Structure of the experimental corpus• Study the impact of the corpus structure on the
performance • Variation of the number of groups • Keeping a constant number of documents
• Each increase in groups is accompanied by a proportional decrease in the number of documents per group
• Variation of 1 to 2048 groups for 8192 documents
©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Queries and Relevance Assessments
• Usual situation• Experts construct relevant and interesting queries for
specific domains
• Chosen approach• Automatic generation of the queries and their related
relevance assessments by considering “known-item-search”
• Queries build randomly from documents• Must retrieve only the unique document that has been used
to build the query• Simulates a user seeking for a particular, partially
remembered document in the collection
©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Indexing Strategies
• The specificity of the application • Numerous documents are daily added in the patient records and
queries can be performed immediately after• The generated data must be process in real time
• Possible indexation triggering strategies• Indexation performed at fixed, but short, intervals
• Lead to a lot of useless processing• Don’t ensure proper indexation when needed
• Indexation launched when a query is performed• Requires less indexation process• Induce a delayed answer when queries are performed• Severe impact on perceived performance
• Indexation triggered by a notification when a document is saved• Retrained strategy
©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Indexing and Retrieval Efficiency
• Indexation• Linear increase of required time given the number of
groups• Total number of document does not change
increase in time due to indexation initialization overhead
• Retrieval• Inverted tendency
• It is faster to perform queries on small indexes (Numerous groups)
©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Indexing Overhead
• A word can not occur in a larger number of groups than its document frequency
• The closer we approach this threshold, the lower the probability of this word occur in an additional separated group and increase the size of the total index
• Once a word occurs in a number of groups equal to its frequency, increasing the number of group will not bring any further increase in size of the total index
• The number of created entries grows in a logarithmic manner regarding to the number of groups• Consequence of the distribution of the frequencies of the words
in the corpus that follows a logarithmic decrease (there are many infrequent words and few very frequent words)
©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Overhead in Indexing Process
1. Tokenization of the documents in order to build the vocabulary• No Overhead
• Task complexity is only dependent from the total number of words in the whole corpus
2. The term frequency and inverse document frequency values extraction • Required to compute the weights• Overhead
• Document frequency is dependant of the groups• Done once per entry
3. storing the indexes in the database• Overhead
• The size of the total index is bigger with a large number of groups
©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Retrieval Effectiveness for 1024 Queries
Number of document per group recall at first retrieved document
8 96%
16 95%
32 94%
64 92%
128 88%
256 86%
512 82%
1’024 80%
2’048 74%
4’096 73%
8’192 69%
©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Situation with Patient Records
• Given the structure of the patient records corpus • Indexing efficiency
• Suffer of a consequent overhead
• Retrieval efficiency• Very quick answer
• Retrieval effectiveness • Good effectiveness on most of the records• The 3% of the patient records containing more than 100
documents will be problematic
• Significant computational power will be required to offer acceptable efficiency
©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Conclusion
• Highly dedicated tools are needed to answers requirements for real-time and sensitivity
• Indexing in patient records is time-consuming, the finest tuning possible should be done in order to increase the efficiency
23©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Similarity OSHUMED and Patient Record Corpuses
24©2007 Hôpitaux Universitaires de Genève Ehrler Frédéric
Record Size Distribution in Patient Records