Glossary of Search Terms
January 2023
Absolute boosting
Ensuring that a specified document always appears at the same point in a results set, or always appears on the first page of results
Access control list (ACL)
Defines access permissions at a user or group level (often based on Active Directory) to a specific repository, a set of documents, or a section of a document
Advanced search
The provision of a search user interface which prompts the user to enter additional terms to assist in retrieving results, often using Boolean operators.
Apache
The Apache Foundation provides support for a wide range of open source applications, including Lucene and Solr
Appliance
A search application pre-installed on a server ready for insertion into a standard server rack
Aggregated search
The presentation of related content items (often referred to as verticals) from a single index in a specific area of a page of search results
Artificial intelligence
A set of technologies that enable machines to sense, comprehend, act and learn in a manner that seeks to emulate a human response to a situation
Auto-categorization
An automated process for creating a classification system (or taxonomy) from a collection of nominally related documents
Auto-classification
An automated process for assigning metadata or index values to documents, usually in conjunction with an existing taxonomy
Average response time
An average of the time taken for the search engine to respond to a query, or the average end-to-end time of a query
BERT
Bidirectional Encoder Representations from Transformers (BERT) is a machine learning technique for natural language processing in which a language model is pre-trained to use the context on both sides of each word.
Best bets
Results that are selected to appear at the top of a list of results that provide a context for other documents generated and ranked by the search application
BM25 (Best Match 25)
A ranking algorithm developed in the 1990s of which there are now multiple variants. It has its origins in the tf.idf ranking function and is widely used as the basis for enterprise search applications
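As a rough illustration of the ranking function described above, the following Python sketch scores tokenized documents against a query using one common textbook form of BM25. The function name `bm25_score`, the toy corpus, and the parameter defaults `k1=1.2` and `b=0.75` are assumptions made for this example; search applications use their own variants and tuning.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query (illustrative BM25 variant)."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        # Number of documents in the corpus that contain the term
        df = sum(1 for d in corpus if term in d)
        if df == 0 or tf[term] == 0:
            continue
        # Smoothed inverse document frequency (always positive in this form)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Term-frequency saturation with document-length normalization
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score

# Hypothetical toy corpus of already-tokenized documents
corpus = [
    ["enterprise", "search", "ranking"],
    ["search", "index", "crawler", "search"],
    ["taxonomy", "metadata"],
]
scores = [bm25_score(["search", "ranking"], d, corpus) for d in corpus]
```

In this toy corpus the first document ranks highest because it contains the rarer term "ranking", and the third document scores zero because it contains neither query term.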
Boolean Operators
A widely used approach to creating search queries; examples include AND, OR, and NOT (for example, information AND management)
Boolean search
A search query using Boolean operators
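One way to picture how Boolean operators combine results is as set operations over the documents that contain each term. The sketch below is a minimal illustration in Python; the `postings` data and the helper `docs` are hypothetical and do not represent any particular search product.

```python
# Hypothetical toy data: term -> set of ids of documents containing it
postings = {
    "information": {1, 2, 4},
    "management": {2, 3, 4},
    "retrieval": {4, 5},
}

def docs(term):
    """Return the set of document ids containing the term (empty if unknown)."""
    return postings.get(term, set())

and_hits = docs("information") & docs("management")  # AND: intersection
or_hits = docs("information") | docs("retrieval")    # OR: union
not_hits = docs("information") - docs("retrieval")   # NOT: set difference
```

Here `information AND management` matches documents 2 and 4, while `information NOT retrieval` matches documents 1 and 2.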
Boosting
Changing search ranking parameters to ensure that certain documents or categories of documents appear higher in the results than the raw algorithm would suggest.
Chatbot
A chatbot application is able to conduct a voice query against a search index in lieu of providing direct contact with (for example) a call-centre operator
Categorization
The placing of boundaries around objects that share similarities (e.g., taxonomy)
Clustering
A process employed to generate groupings of related words by identifying patterns in a document index
Cognitive search
A description loosely applied by search vendors to applications using machine learning and AI techniques to determine the work context of the user and deliver personalized results
Collection
A group of objects methodically sorted and placed into a category
Computational linguistics
The use of computer-based statistical analysis of language to determine patterns and rules that aid semantic understanding
Concept extraction
The process of determining concepts from text using linguistic analysis
Connector
A software application that enables a search application to index content in another application
Controlled vocabulary
An organized list of words, phrases, or some other set employed to identify and retrieve documents
COTS
Commercial off-the-shelf software
Conversational search
Conversational search applications respond to a spoken request or query with a spoken response. See also Chatbot.
Crawler
A program used to index documents
Cross-language search
A query in one language is translated into other indexed languages (often using a multi-lingual thesaurus) so that all documents relevant to the concept of the query are returned no matter what language is used for the content
Deep learning
Deep learning builds on machine learning principles but makes use of artificial neural networks to be able to manage very large collections of data with real-time responses
Description
A brief summary, often generated automatically, that provides a description of a document in the list of results
See also Key sentence
Document
A structured sequence of text information, but often used as a generic description of any content item in an information-based application such as a content management system or enterprise search
Document processing
The deconstruction of a document into a form that can be tokenised and indexed
Document repository
A site where source documents or other content objects are stored, generally a folder or folders
See also Information source
Early binding
A search conducted only across documents that a user has permission to access
See also Late binding
Entity extraction
The automatic detection of defined items in a document, such as dates, times, locations, names, and acronyms
Exact match
Two or more words considered mutually inclusive in a search, often by enclosing them in quotation marks—for example, “United Nations”
Exploratory search
In exploratory search the search goal is imprecise and open-ended and there is no unique single answer that meets the user’s information needs and no clear criterion on when to end the search.
Facet
Presentation on the search user interface of topic categories and content metadata, generated by the search index, to support the refinement of a search query as the process of query exploration proceeds
Fallout
A quantity representing the percentage of all irrelevant documents in a collection that are retrieved in a search
Federated search
A search carried out across multiple repositories, indexes and/or applications
Field query
A search that is limited to a specific field in a document (e.g., a title or date)
Filter
A function that offers specific criteria for search result selection that are independent of the query, for example file format or publication date
Freshness
The time period between a document being crawled and the index being updated so that a user will be able to find the document
Fuzzy search
A search allowing a degree of flexibility for generating hits (i.e., matches that are phonetically or typographically similar)
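A minimal sketch of typographic fuzzy matching, using the `get_close_matches` function from Python's standard `difflib` module. The vocabulary, the wrapper name `fuzzy_lookup`, and the similarity cutoff of 0.6 are assumptions for illustration; search applications typically use other techniques, such as edit distance computed against the index.

```python
from difflib import get_close_matches

# Hypothetical list of indexed terms
vocabulary = ["search", "speech", "index", "taxonomy"]

def fuzzy_lookup(term, cutoff=0.6):
    """Return indexed terms that are typographically similar to `term`, best first."""
    return get_close_matches(term, vocabulary, n=3, cutoff=cutoff)

suggestions = fuzzy_lookup("serch")
```

With this vocabulary the misspelled query "serch" returns "search" as its closest match, while dissimilar terms such as "index" fall below the cutoff.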
Golden set
A set of queries and documents already marked as relevant by topic experts, used to benchmark search performance that is representative of content that will be searched on a regular basis
Guided search
A search in which the system prompts the user for information that will refine the search results
Hit
A search result matching given criteria; sometimes used to denote the number of occurrences of a search term in a document
Index
List containing data and/or metadata indicating the identity and location of a given file or document
Index file
A file that stores data in a format capable of retrieval by a search engine
Ingestion rate
The rate at which documents can be indexed, usually specified in Gb/sec
Inverse document frequency (IDF)
A measure of the rarity of a given term in a file or document collection
Inverted file
A list of the words contained within a set of documents, together with the documents in which each word appears, so that each entry acts as a pointer to those documents
Inverted index
An index created as an outcome of a crawl of every word, entity and associated metadata in a way that facilitates the very fast retrieval of documents.
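The idea behind an inverted file or inverted index can be sketched in a few lines of Python: each word maps to the set of documents that contain it, so a lookup avoids scanning every document. The function name `build_inverted_index` and the sample documents are hypothetical, and the whitespace splitting used here is a simplification of real document processing.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lower-cased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical documents keyed by id
docs = {
    1: "search engine index",
    2: "index file format",
    3: "search results",
}
index = build_inverted_index(docs)
```

Looking up `index["search"]` returns the ids of documents 1 and 3 directly, which is what makes retrieval fast.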
Key sentence
A brief statement that effectively summarizes a document, often employed to annotate search results
Keyword
A word used in a query to search for documents
Keyword search
A search that compares an input word against an index and returns matching results
Knowledge graph
A knowledge graph is a representation of entities, their attributes, and the relationships between them
Language detection
The indexing process identifies the language (or languages) of the content and assigns it to appropriate language specific indexes
Late binding
Access permission checking carried out immediately before the presentation of the document to the user
See also Early binding
Learning to rank (LTR)
Learning to Rank is a class of techniques that apply supervised machine learning to solve ranking problems by presenting a relative re-ordering of relevant items
Lemmatization
A process that identifies the root form of words contained within a given document based on grammatical analysis (e.g., run from running)
See also Stemming
Lexical analysis
An analysis that reduces text to a set of discrete words, sentences, and paragraphs
Linguistics
The study of the structure, use, and development of language
Linguistic indexing
The classification of a set of words into grammatical classes, such as nouns or verbs
Long tail
A feature of text-based search in which there are a significant number of low-use queries forming a long tail, which is difficult to optimize for an individual query. The distribution of query use is an example of a Zipf curve.
Machine learning
Machine learning is a method of data analysis that automates analytical model building.
Meta tag
An HTML element located within the head section of a web page that supplies additional or referential data not shown on the page itself
Metadata
Data that supplements and/or clarifies index terms generated by text in the document, for example the date of publication, the author, or specific controlled terms.
Morphologic analysis
The analysis of the structure of language
Natural language processing
A process that identifies content through the use of grammatical and semantic rules to understand the intent of a sequence of words in a specified language
Natural language query
A search input entered using conventional language (e.g., a sentence)
Neural IR
Neural ranking models for information retrieval (IR) use shallow or deep neural networks to rank search results in response to a query.
Parametric search
A search that adheres to predefined attributes present within a given data source
Parsing
The process of analyzing text to determine its semantic structure
Pattern matching
A type of matching that recognizes naturally occurring patterns (word usage, frequency of use, etc.) within a document
Phrase extraction
The procurement of linguistic concepts, generally phrases, from a given document
Precision
A percentage representing the proportion of the documents returned in a given search that are relevant
Professional search
A term applied to groups of professionals (for example, lawyers and patent agents) who spend a significant proportion of their time using search applications, often in situations where high levels of recall are required.
Proximity searching
A search whose results are returned based on the proximity of given words (e.g., ‘pressure’ within four words of ‘testing’)
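The "within four words" example can be sketched as a check on word positions. The function name `within_proximity` is hypothetical, and the whitespace tokenization is a simplifying assumption; a search engine would normally read positions from its index rather than re-scan the text.

```python
def within_proximity(text, first, second, max_distance):
    """Return True if `first` and `second` occur within `max_distance` words of each other."""
    words = text.lower().split()
    first_pos = [i for i, w in enumerate(words) if w == first]
    second_pos = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= max_distance for i in first_pos for j in second_pos)

near = within_proximity(
    "pressure vessels require regular testing", "pressure", "testing", 4
)
far = within_proximity(
    "pressure is measured daily and logged before any testing", "pressure", "testing", 4
)
```

In the first sentence the two words are four positions apart, so it matches; in the second they are eight positions apart, so it does not.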
Query by example
A search in which a previously returned result is used to obtain similar results
Query transformation
The process of analyzing the semantic structure of a query prior to processing in order to improve search performance
Ranking
Search applications calculate a relevance score for each content item and return results in decreasing order of relevance
Recall
A percentage representing the relationship between correct results generated by a query and the total number of correct results within an index
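Precision and recall are often computed together from the set of results returned and the set of documents judged relevant. The sketch below takes precision as the fraction of returned documents that are relevant and recall as the fraction of relevant documents that are returned; the function name `precision_recall` and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as fractions between 0 and 1."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
```

Two of the four returned documents are relevant, giving a precision of 0.5, and two of the three relevant documents were found, giving a recall of about 0.67.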
Relevance
The value that a user places on a specific document or item of information. Both precision and recall are defined in terms of relevance.
Search results
The documents or data that are returned from a search
Search terms
The terms used within a search query. Sometimes incorrectly referred to as ‘keywords’
Semantic analysis
An analysis based upon grammatical or syntactical constraints that attempts to decipher information contained in a document
Sentiment analysis
The use of natural language processing, computational linguistics, and text analytics to identify and extract subjective information in documents
Session
The duration of the time spent by a user between entering a query term, reviewing results and then closing down the application.
Snippet
The text that is presented to give a concise representation of the content of a search result sufficient for a user to assess its relevance to their query. It may be generated by the author of the document, extracted from text associated with a specific index term or derived algorithmically from the text of the document
Soundex search
A search in which users receive results that are phonetically similar to their query
Spider
An automated process that presents documents to a data extraction or parsing engine by following links on web pages
See also Crawler
Stemming
A process based on a set of heuristic rules that identifies the root form of words contained within a given document (e.g., run from running)
See also Lemmatization
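To show what "heuristic rules" means in practice, here is a deliberately small suffix-stripping sketch in Python. It is not the Porter stemmer or any other published algorithm; the suffix list, the minimum stem length of three, and the name `toy_stem` are all assumptions made for this example.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # assumed rule list, tried in this order

def toy_stem(word):
    """Strip one common English suffix using simple heuristic rules."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run" (but keep "ll", "ss")
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

stems = [toy_stem(w) for w in ["running", "indexed", "searches", "run"]]
```

Rules of this kind are fast but imperfect: for instance, this sketch would reduce "files" to "fil", which is the sort of error that lemmatization, with its grammatical analysis, is intended to avoid.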
Stop words
Words that are deemed to have no value in an index
See also Word exclusion
Stopping distance
The point in a search query session where the user decides that time and effort spent in examining further results is not going to result in additional relevant results
Structured data
Data that can be represented according to specific descriptive parameters—for example, rows and columns in a relational database, or hierarchical nodes in an XML document or fragment
Summarization
An automated process for producing a short summary of a document and presenting it in the list of results
Synonym expansion
Automatically expanding a search by adding synonyms of the query terms derived from a thesaurus
Syntactic analysis
An analysis capable of associating a word with its respective part of speech by determining its context in a given statement
Taxonomy
In respect to search, the broad categorization of objects (typically a tree structure of classifications for a given set of objects) in order to make them easier to retrieve and possibly sort
Term frequency
A quantity representing how often a term appears in a document
TF.IDF
The “term frequency.inverse document frequency” formulation gives a score that is proportional to the number of times a word appears in the document, offset by the frequency of the word in the collection of documents.
See also BM25
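A minimal sketch of one common form of the calculation, in Python. There are several weighting variants in use; this one assumes term frequency normalized by document length and an unsmoothed logarithmic inverse document frequency. The function name `tf_idf` and the toy corpus are hypothetical.

```python
import math

def tf_idf(term, doc, corpus):
    """Score a term in one tokenized document relative to the whole corpus."""
    tf = doc.count(term) / len(doc)                    # term frequency in this document
    df = sum(1 for d in corpus if term in d)           # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0    # rarer terms get a higher idf
    return tf * idf

# Hypothetical corpus of already-tokenized documents
corpus = [
    ["search", "index", "search"],
    ["search", "results"],
    ["taxonomy", "index"],
]
common = tf_idf("search", corpus[0], corpus)
rare = tf_idf("taxonomy", corpus[2], corpus)
```

Although "search" appears twice in the first document, it occurs in two of the three documents, so the rarer term "taxonomy" receives the higher score.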
Thesaurus
A collection of words in a cross-reference system that refers to multiple taxonomies and provides a meta-classification, thereby facilitating document retrieval
Thumbnail
An HTML rendition of a page from a document, displayed in response to a user action (often a mouse roll-over), to provide the user with additional information about the potential relevance of the result.
Tokenizing
The process of identifying the elements of a sentence, such as phrases, words, abbreviations, and symbols, prior to the creation of an index
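A simple regular-expression tokenizer illustrates the idea. The pattern below is an assumption for this example: it keeps letters and digits together and allows internal full stops, apostrophes, and hyphens, so that abbreviations and hyphenated words survive as single tokens. Real tokenizers are language specific and considerably more elaborate.

```python
import re

# Runs of letters/digits, optionally joined by an internal ".", "'" or "-"
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[.'-][A-Za-z0-9]+)*")

def tokenize(text):
    """Split text into lower-cased tokens, dropping surrounding punctuation."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]

tokens = tokenize("The U.S. index isn't up-to-date.")
```

The trailing punctuation is discarded, while "isn't" and "up-to-date" are each kept as one token.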
Transformer
See BERT
Truncation
Removal of a prefix or suffix
Unstructured information
Information that is without document or data structure (i.e., cannot be effectively decomposed into constituent elements or chunks for atomic storage and management)
Vector space
A model that enables documents to be ranked for relevance against a query by comparing an algebraic expression of a set of documents with that of the query
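One widely used comparison in the vector space model is cosine similarity between term-count vectors. The sketch below uses raw term counts as the vector weights, which is a simplifying assumption (weights such as TF.IDF are more usual); the function name `cosine_similarity` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(query_tokens, doc_tokens):
    """Cosine of the angle between two term-count vectors (0 = unrelated, 1 = same direction)."""
    q, d = Counter(query_tokens), Counter(doc_tokens)
    dot = sum(q[term] * d[term] for term in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0

query = ["enterprise", "search"]
documents = {
    "a": ["enterprise", "search", "platform"],
    "b": ["search", "results", "page", "layout"],
    "c": ["taxonomy", "metadata"],
}
# Rank document ids by decreasing similarity to the query
ranking = sorted(documents, key=lambda k: cosine_similarity(query, documents[k]), reverse=True)
```

Document "a" shares both query terms and ranks first, "b" shares one, and "c" shares none and scores zero.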
Weight
The process of boosting index terms in specific areas of a document (for example the title) or on specific topics
Wildcard
A notation, generally an asterisk or question mark, that when used in a query, represents all possible characters (e.g., a search for boo* would return book, boom, boot, etc.)
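The `boo*` example can be reproduced with the shell-style pattern matching in Python's standard `fnmatch` module, where `*` matches any run of characters and `?` matches exactly one. The vocabulary list is hypothetical, and query syntax in a real search application may differ from these conventions.

```python
from fnmatch import fnmatchcase

# Hypothetical list of indexed terms
vocabulary = ["book", "boom", "boot", "bone", "boo", "taboo"]

# "*" matches zero or more characters after the prefix
star_matches = [w for w in vocabulary if fnmatchcase(w, "boo*")]

# "?" matches exactly one character
single_matches = [w for w in vocabulary if fnmatchcase(w, "boo?")]
```

Here `boo*` matches book, boom, boot, and boo, while `boo?` excludes boo because it requires exactly one further character.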
Word exclusion
A list containing words that will not be indexed; this usually consists of words that are excessively common (e.g., a, an, the). See also Stop words
xAI
eXplainable AI is a set of machine learning techniques that produce more explainable models while maintaining a high level of learning performance and enable humans to understand, appropriately trust, and effectively use AI applications.