Glossary of Search Terms

January 2023

Absolute boosting

Ensuring that a specified document always appears at the same point in a results set, or always appears on the first page of results

Access control list (ACL)

Defines access permissions at a user or group level (often based on Active Directory) to a specific repository, a set of documents, or a section of a document

Advanced search

The provision of a search user interface which prompts the user to enter additional terms to assist in retrieving results, often using Boolean operators.

Apache

The Apache Software Foundation provides support for a wide range of open source applications, including Lucene and Solr

Appliance

A search application pre-installed on a server ready for insertion into a standard server rack

Aggregated search

The presentation of related content items (often referred to as verticals) from a single index in a specific area of a page of search results

Artificial intelligence

A set of technologies that enable machines to sense, comprehend, act and learn in a manner that seeks to emulate a human response to a situation

Auto-categorization

An automated process for creating a classification system (or taxonomy) from a collection of nominally related documents

Auto-classification

An automated process for assigning metadata or index values to documents, usually in conjunction with an existing taxonomy

Average response time

An average of the time taken for the search engine to respond to a query, or the average end-to-end time of a query

BERT

Bidirectional Encoder Representations from Transformers (BERT) is a machine learning technique in which a language model is pre-trained on large volumes of text, improving the performance of applications based on natural language processing.

Best bets

Results that are selected to appear at the top of a list of results that provide a context for other documents generated and ranked by the search application

BM25 (Best Match 25)

A ranking algorithm developed in the 1990s of which there are now multiple variants. It has its origins in the tf.idf ranking function and is widely used as the basis for enterprise search applications
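
As a rough illustration of how a BM25-style score combines term frequency, term rarity and document length, the sketch below implements one commonly cited variant in Python. The function name bm25_score, the parameter defaults (k1 = 1.5, b = 0.75) and the smoothed form of the IDF are assumptions made for this example; individual search applications differ in detail.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, corpus, k1=1.5, b=0.75):
    """Score one tokenised document against a query (illustrative variant)."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        # Smoothed IDF that is never negative; an assumption, other forms exist
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # k1 controls term-frequency saturation, b controls length normalisation
        norm = tf[term] + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score

corpus = [
    ["enterprise", "search", "ranking"],
    ["search", "engine", "index", "search"],
    ["records", "management", "policy"],
]
scores = [bm25_score(["search", "ranking"], doc, corpus) for doc in corpus]
```

In this toy collection the first document scores highest because it contains both query terms, including the rarer one, and the third scores zero because it contains neither.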

Boolean operators

A widely used approach to creating search queries; examples include AND, OR, and NOT. For example, information AND management

Boolean search

A search query using Boolean operators
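
One way to picture a Boolean search is as set operations over the documents that contain each term. The following Python sketch assumes a small, hypothetical table of postings (term to document IDs); it is not how any particular search application stores its index.

```python
# Hypothetical postings: each term maps to the set of document IDs containing it
postings = {
    "information": {1, 2, 4},
    "management": {2, 3, 4},
    "records": {4, 5},
}

# information AND management: documents containing both terms
both = postings["information"] & postings["management"]

# information OR records: documents containing either term
either = postings["information"] | postings["records"]

# (information AND management) NOT records: remove documents containing "records"
excluded = both - postings["records"]
```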

Boosting

Changing search ranking parameters to ensure that certain documents or categories of documents appear higher in the results than the raw algorithm would suggest.

Chatbot

A chatbot application is able to conduct a voice query against a search index in lieu of providing direct contact with (for example) a call-centre operator

Categorization

The placing of boundaries around objects that share similarities (e.g., taxonomy)

Clustering

A process employed to generate groupings of related words by identifying patterns in a document index

Cognitive search

A description loosely applied by search vendors to applications using machine learning and AI techniques to determine the work context of the user and deliver personalized results

Collection

A group of objects methodically sorted and placed into a category

Computational linguistics

The use of computer-based statistical analysis of language to determine patterns and rules that aid semantic understanding

Concept extraction

The process of determining concepts from text using linguistic analysis

Connector

A software application that enables a search application to index content in another application

Controlled vocabulary

An organized list of words, phrases, or some other set employed to identify and retrieve documents

COTS

Commercial off-the-shelf software

Conversational search

Conversational search applications respond to a spoken request or query with a spoken response. See also Chatbot.

Crawler

A program used to index documents

Cross-language search

A query in one language is translated into other indexed languages (often using a multi-lingual thesaurus) so that all documents relevant to the concept of the query are returned no matter what language is used for the content

Deep learning

Deep learning builds on machine learning principles but makes use of artificial neural networks to be able to manage very large collections of data with real-time responses

Description

A brief summary, often generated automatically, that provides a description of a document in the list of results

See also Key sentence

Document

A structured sequence of text information, but often used as a generic description of any content item in an information-based application such as a content management system or enterprise search

Document processing

The deconstruction of a document into a form that can be tokenised and indexed

Document repository

A site where source documents or other content objects are stored, generally a folder or folders

See also Information source

Early binding

A search conducted only across documents that a user has permission to access

See also Late binding

Entity extraction

The automatic detection of defined items in a document, such as dates, times, locations, names, and acronyms

Exact match

Two or more words considered mutually inclusive in a search, often by enclosing them in quotation marks—for example, “United Nations”

Exploratory search

In exploratory search the search goal is imprecise and open-ended and there is no unique single answer that meets the user’s information needs and no clear criterion on when to end the search.

Facet

Presentation of topic categories and content metadata, generated by the search index, on the search user interface to support the refinement of a search query as the process of query exploration proceeds

Fallout

A quantity representing the percentage of the irrelevant documents in a collection that are retrieved in a search

Federated search

A search carried out across multiple repositories, indexes and/or applications

Field query

A search that is limited to a specific field in a document (e.g., a title or date)

Filter

A function that offers specific criteria for search result selection that is independent of the query. For example, file format or publication date

Freshness

The time period between a document being crawled and the index being updated so that a user will be able to find the document

Fuzzy search

A search allowing a degree of flexibility for generating hits (i.e., matches that are phonetically or typographically similar)
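
Typographic similarity is often measured with an edit distance. The sketch below, in Python, uses the Levenshtein distance as one possible measure; the function names and the max_edits threshold are assumptions for illustration, and phonetic similarity is not covered here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def fuzzy_matches(query, vocabulary, max_edits=1):
    """Return the index terms within max_edits of the query term."""
    return [term for term in vocabulary
            if edit_distance(query.lower(), term.lower()) <= max_edits]

matches = fuzzy_matches("serch", ["search", "seraph", "index"])
```

Here the misspelt query "serch" is one insertion away from "search", so only that term is returned.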

Golden set

A set of queries and documents already marked as relevant by topic experts, used to benchmark search performance that is representative of content that will be searched on a regular basis

Guided search

A search in which the system prompts the user for information that will refine the search results

Hit

A search result matching given criteria; sometimes used to denote the number of occurrences of a search term in a document

Index

List containing data and/or metadata indicating the identity and location of a given file or document

Index file

A file that stores data in a format capable of retrieval by a search engine

Ingestion rate

The rate at which documents can be indexed, usually specified in Gb/sec

Inverse document frequency (IDF)

A measure of the rarity of a given term in a file or document collection

Inverted file

A list of the words contained within a set of documents, and which document each word is present in, so acting as a pointer to a document
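
A minimal Python sketch of this structure is shown below. It assumes whitespace-separated words and a hypothetical pair of documents; a production index would also hold positions, entities and metadata.

```python
from collections import defaultdict

def build_inverted_file(documents):
    """Map each word to the sorted IDs of the documents that contain it."""
    inverted = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            inverted[word].add(doc_id)
    return {word: sorted(ids) for word, ids in inverted.items()}

documents = {
    "d1": "search engines index documents",
    "d2": "users search documents",
}
inverted_file = build_inverted_file(documents)
```

Looking up a word then returns the documents to fetch without scanning any document text.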

Inverted index

An index created as an outcome of a crawl of every word, entity and associated metadata in a way that facilitates the very fast retrieval of documents.

Key sentence

A brief statement that effectively summarizes a document, often employed to annotate search results

Keyword

A word used in a query to search for documents

Keyword search

A search that compares an input word against an index and returns matching results

Knowledge graph

A knowledge graph is a representation of entities and related attributes

Language detection

The indexing process identifies the language (or languages) of the content and assigns it to appropriate language specific indexes

Late binding

Access permission checking carried out immediately before the presentation of the document to the user

See also Early binding
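
The difference between early and late binding can be sketched in Python as follows. The index contents, group names and the check_access function are all hypothetical; check_access stands in for whatever live permission check a source repository would provide.

```python
# Hypothetical index: document ID -> (text, groups allowed to read the document)
INDEX = {
    "d1": ("budget forecast", {"finance"}),
    "d2": ("budget guidance for staff", {"finance", "all-staff"}),
    "d3": ("canteen menu", {"all-staff"}),
}

def early_binding_search(term, user_groups):
    """Search only the documents the user is permitted to access."""
    permitted = {doc_id: text for doc_id, (text, acl) in INDEX.items()
                 if acl & user_groups}
    return [doc_id for doc_id, text in permitted.items() if term in text]

def late_binding_search(term, user_groups, check_access):
    """Search everything, then check access just before presenting each hit."""
    hits = [doc_id for doc_id, (text, _acl) in INDEX.items() if term in text]
    return [doc_id for doc_id in hits if check_access(doc_id, user_groups)]

def check_access(doc_id, user_groups):
    """Stand-in for a live permission check against the source repository."""
    return bool(INDEX[doc_id][1] & user_groups)

early = early_binding_search("budget", {"all-staff"})
late = late_binding_search("budget", {"all-staff"}, check_access)
```

Both approaches return the same result here; in practice they differ in where the cost of permission checking falls and in how quickly permission changes take effect.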

Learning to rank (LTR)

Learning to Rank is a class of techniques that apply supervised machine learning to solve ranking problems by presenting a relative re-ordering of relevant items

Lemmatization

A process that identifies the root form of words contained within a given document based on grammatical analysis (e.g., run from running)

See also Stemming

Lexical analysis

An analysis that reduces text to a set of discrete words, sentences, and paragraphs

Linguistics

The study of the structure, use, and development of language

Linguistic indexing

The classification of a set of words into grammatical classes, such as nouns or verbs

Long tail

A feature of text-based search in which there are a significant number of low-use queries forming a long tail, which makes it difficult to optimize for an individual query. The distribution is an example of a Zipf curve.

Machine learning

Machine learning is a method of data analysis that automates analytical model building.

Meta tag

An HTML command located within the header of a website that displays additional or referential data not present on the page itself

Metadata

Data that supplements and/or clarifies index terms generated by text in the document, for example the date of publication, the author, or specific controlled terms.

Morphologic analysis

The analysis of the structure of language

Natural language processing

A process that identifies content through the use of grammatical and semantic rules to understand the intent of a sequence of words in a specified language

Natural language query

A search input entered using conventional language (e.g., a sentence)

Neural IR

Neural ranking models for information retrieval (IR) use shallow or deep neural networks to rank search results in response to a query.

Parametric search

A search that adheres to predefined attributes present within a given data source

Parsing

The process of analyzing text to determine its semantic structure

Pattern matching

A type of matching that recognizes naturally occurring patterns (word usage, frequency of use, etc.) within a document

Phrase extraction

The procurement of linguistic concepts, generally phrases, from a given document

Precision

The proportion of the documents returned in a given search that are relevant

Professional search

A term applied to groups of professionals (for example, lawyers and patent agents) who spend a significant proportion of their time using search applications, often in situations where high levels of recall are required.

Proximity searching

A search whose results are returned based on the proximity of given words (e.g., ‘pressure’ within four words of ‘testing’)
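
A simple Python sketch of a proximity check follows. It assumes whitespace-separated words with no punctuation handling, and it interprets "within n words" as a difference in word position of at most n; search applications may count differently.

```python
def within_proximity(text, first, second, max_distance):
    """True if `first` and `second` occur within max_distance words of each other."""
    words = text.lower().split()
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= max_distance
               for i in first_positions for j in second_positions)

near = within_proximity("pressure vessels require regular testing",
                        "pressure", "testing", 4)
far = within_proximity("pressure is monitored daily but destructive testing is rare",
                       "pressure", "testing", 4)
```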

Query by example

A search in which a previously returned result is used to obtain similar results

Query transformation

The process of analyzing the semantic structure of a query prior to processing in order to improve search performance

Ranking

Search applications calculate a relevance score for each content item and return results in decreasing order of relevance

Recall

A percentage representing the relationship between correct results generated by a query and the total number of correct results within an index

Relevance

The value that a user places on a specific document or item of information. Both precision and recall are defined in terms of relevance.
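
Given a set of documents judged relevant for a query, precision, recall and fallout can be calculated as in the Python sketch below. The document IDs and collection size are invented for the example, and fallout is computed here as the share of all non-relevant documents in the collection that were retrieved, which is an assumption about the intended definition.

```python
def evaluate(retrieved, relevant, collection_size):
    """Return (precision, recall, fallout) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)    # relevant and retrieved
    false_positives = len(retrieved - relevant)   # retrieved but not relevant
    non_relevant_total = collection_size - len(relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    fallout = false_positives / non_relevant_total if non_relevant_total else 0.0
    return precision, recall, fallout

precision, recall, fallout = evaluate(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d7"],
    collection_size=13,
)
```

With these inputs, two of the four retrieved documents are relevant (precision 0.5), two of the three relevant documents were found (recall about 0.67), and two of the ten non-relevant documents were retrieved (fallout 0.2).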

Search results

The documents or data that are returned from a search

Search terms

The terms used within a search query. Sometimes incorrectly referred to as ‘keywords’

Semantic analysis

An analysis based upon grammatical or syntactical constraints that attempts to decipher information contained in a document

Sentiment analysis

The use of natural language processing, computational linguistics, and text analytics to identify and extract subjective information in documents

Session

The duration of the time spent by a user between entering a query term, reviewing results and then closing down the application.

Snippet

The text that is presented to give a concise representation of the content of a search result sufficient for a user to assess its relevance to their query. It may be generated by the author of the document, extracted from text associated with a specific index term or derived algorithmically from the text of the document

Soundex search

A search in which users receive results that are phonetically similar to their query
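
The sketch below is a Python rendering of the widely documented American Soundex coding, in which a name is reduced to its first letter plus three digits so that similar-sounding names share a code. The function name and the handling of empty input are assumptions for this example.

```python
# Consonant groups that share a Soundex digit
CODES = {letter: digit
         for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                                "4": "l", "5": "mn", "6": "r"}.items()
         for letter in letters}

def soundex(name: str) -> str:
    """Return the four-character Soundex code for a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = CODES.get(first, "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters that share a code
        code = CODES.get(c, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets this, so a repeated code is kept
    return (first.upper() + "".join(digits) + "000")[:4]

same_code = soundex("Robert") == soundex("Rupert")
```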

Spider

An automated process that presents documents to a data extraction or parsing engine by following links on web pages

See also Crawler

Stemming

A process based on a set of heuristic rules that identifies the root form of words contained within a given document (e.g., run from running)

See also Lemmatization
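
To show what heuristic rules can look like, here is a deliberately naive suffix-stripping sketch in Python. The suffix list and the consonant-doubling rule are assumptions invented for this example; it is not the Porter stemmer or any other published algorithm.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # illustrative rules only

def naive_stem(word: str) -> str:
    """Strip one common suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

stems = [naive_stem(w) for w in ["running", "searched", "indexes", "ran"]]
```

The irregular form "ran" is left unchanged, which is the kind of case where lemmatization, with its grammatical analysis, would be expected to do better.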

Stop words

Words that are deemed to have no value in an index

See also Word exclusion

Stopping distance

The point in a search query session where the user decides that time and effort spent in examining further results is not going to result in additional relevant results

Structured data

Data that can be represented according to specific descriptive parameters—for example, rows and columns in a relational database, or hierarchical nodes in an XML document or fragment

Summarization

An automated process for producing a short summary of a document and presenting it in the list of results

Synonym expansion

Automatically expanding a search by adding synonyms of the query terms derived from a thesaurus

Syntactic analysis

An analysis capable of associating a word with its respective part of speech by determining its context in a given statement

Taxonomy

In respect to search, the broad categorization of objects (typically a tree structure of classifications for a given set of objects) in order to make them easier to retrieve and possibly sort

Term frequency

A quantity representing how often a term appears in a document

TF.IDF

The “term frequency.inverse document frequency” formulation gives a score that is proportional to the number of times a word appears in a document, offset by how frequently the word appears across the collection of documents.

See also BM25
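
A minimal Python sketch of the calculation is given below, using raw term counts and a logarithmic inverse document frequency. This is only one of several weighting schemes in use, and the corpus is invented for the example.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Raw term frequency multiplied by log inverse document frequency."""
    tf = Counter(doc_tokens)[term]
    df = sum(1 for doc in corpus if term in doc)  # documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

corpus = [
    ["the", "taxonomy", "defines", "the", "taxonomy", "terms"],
    ["the", "index", "stores", "terms"],
    ["the", "crawler", "reads", "documents"],
]
common = tf_idf("the", corpus[0], corpus)
rare = tf_idf("taxonomy", corpus[0], corpus)
```

The word "the" appears in every document, so its score is zero however often it occurs, while "taxonomy" appears twice in one document only and scores highly there.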

Thesaurus

A collection of words in a cross-reference system that refers to multiple taxonomies and provides a meta-classification, thereby facilitating document retrieval

Thumbnail

An HTML rendition of a page from a document, displayed (often through a mouse roll-over) to provide the user with additional information about the potential relevance of the result.

Tokenizing

The process of identifying the elements of a sentence, such as phrases, words, abbreviations, and symbols, prior to the creation of an index
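
As a small illustration, the Python sketch below splits text into lower-case word tokens with a regular expression and then drops stop words. The pattern, which keeps hyphenated words together, and the stop word list are assumptions for the example; real tokenizers are language-specific and considerably more involved.

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "and"}  # illustrative list only

# Runs of letters or digits, optionally joined by a hyphen or apostrophe
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:[-'][a-z0-9]+)*")

def tokenize(text):
    """Split text into lower-case tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def index_terms(text):
    """Tokenize and remove stop words before indexing."""
    return [token for token in tokenize(text) if token not in STOP_WORDS]

tokens = tokenize("The index of e-mail records, 2023.")
terms = index_terms("The index of e-mail records, 2023.")
```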

Transformer

See BERT

Truncation

Removal of a prefix or suffix

Unstructured information

Information that is without document or data structure (i.e., cannot be effectively decomposed into constituent elements or chunks for atomic storage and management)

Vector space

A model that enables documents to be ranked for relevance against a query by comparing an algebraic expression of a set of documents with that of the query
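
One common way to make this comparison is to treat the query and each document as vectors of term counts and rank by the cosine of the angle between them. The Python sketch below assumes raw counts as the vector weights and uses invented documents; weighted schemes such as tf.idf are often used instead.

```python
import math
from collections import Counter

def cosine_similarity(a_tokens, b_tokens):
    """Cosine of the angle between two term-count vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[term] * b[term] for term in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

query = ["search", "index"]
docs = {
    "d1": ["search", "index", "search"],
    "d2": ["index", "crawler"],
    "d3": ["records", "policy"],
}
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
```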

Weight

The process of boosting index terms in specific areas of a document (for example the title) or on specific topics

Wildcard

A notation, generally an asterisk or question mark, that when used in a query, represents all possible characters (e.g., a search for boo* would return book, boom, boot, etc.)
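
Wildcard matching can be tried out with Python's standard fnmatch module, as in the sketch below. In that module an asterisk matches any run of characters and a question mark matches exactly one character; the wildcard syntax of a given search application may differ, and the vocabulary is invented for the example.

```python
from fnmatch import fnmatchcase

vocabulary = ["book", "boom", "boot", "boots", "bob", "cook"]

# "*" matches any run of characters, including none
star = [term for term in vocabulary if fnmatchcase(term, "boo*")]

# "?" matches exactly one character
single = [term for term in vocabulary if fnmatchcase(term, "boo?")]
```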

Word exclusion

A list containing words that will not be indexed; it usually comprises words that are excessively common (e.g., a, an, the, etc.)

See also Stop words

xAI

eXplainable AI is a set of machine learning techniques that produce more explainable models while maintaining a high level of learning performance and enable humans to understand, appropriately trust, and effectively use AI applications.