Have you ever wondered how search engines work? Engines like Google, Yahoo, Ask, and Bing retrieve information in milliseconds and satisfy the user's information need. All of these search engines incorporate various techniques to optimize their results, but none of them is the perfect search engine everyone is looking for, which drives us towards the development of the semantic web. TF-IDF is one of the mechanisms search engines use to judge the relevance of the information being retrieved, based on certain general assumptions.
If you are a web developer, have you ever wondered whether it is possible to tag certain words in a document automatically? Have you ever tried to get your website to the top of the search results through search engine optimization? These are the kinds of problems that can be addressed by TF-IDF.
This blog contains all the information you need to understand what TF-IDF is all about. The contents are as follows.
What is TF-IDF
Why TF-IDF
How TF-IDF is calculated
Document Ranking for a Given Query
Applications of TF-IDF
Variants of TF and IDF Weights
1) What is TF-IDF?
TF-IDF, short for Term Frequency - Inverse Document Frequency, is a text mining technique that gives a numeric statistic indicating how important a word is to a document in a collection or corpus. It is used to categorize documents according to certain words and their importance to each document.
This is somewhat similar to Bag of Words (BoW), an algorithm that counts how many times a word appears in a document. If a particular term appears many times in a document, there is a good chance that it is an important word in that document. This is the basic concept of BoW: word counts allow us to compare and rank documents by similarity for applications like search, document classification, and text mining. Documents are modeled in a vector space so that terms and documents can be categorized easily. However, the general BoW technique does not omit the common words that appear across documents; they are modeled in the vector space as well.
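The BoW idea described above can be sketched in a few lines of Python. The two example sentences are illustrative assumptions, not from the original corpus:

```python
from collections import Counter

# A minimal Bag-of-Words sketch: each document becomes a mapping
# from word to raw count.
docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
]

# Tokenize naively on whitespace (real systems also lowercase,
# strip punctuation, remove stop words, etc.)
bow = [Counter(doc.split()) for doc in docs]

print(bow[0]["the"])  # 2 - common words get high counts too
print(bow[0]["car"])  # 1
```

Note how plain BoW gives the common word "the" a higher count than the genuinely distinctive word "car"; this is exactly the weakness TF-IDF addresses.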
TF-IDF is also a measure for categorizing documents based on the terms that appear in them, but unlike BoW, it provides a weight for each term rather than just a count. The TF-IDF value measures relevance, not mere frequency.
Did I just make it confusing? Well, let's look at this intuitively, with the following example. Example: Assume you are trying to finish an assignment for your information retrieval class. The last question says something about "lemmatizing" and you have no clue what it is. So you wish to look this up in your textbook, which has around 50 chapters. How can you find which chapter has the right information? This is where TF-IDF comes into play. With TF-IDF you can easily identify which word is important in which document. In this example, you can simply find out which chapters include the word "lemmatizing" and read only those chapters.
2) Why TF-IDF?
TF-IDF allows us to score the importance of words in a document, based on how frequently they appear within that document and across the whole collection.
If a word appears frequently in a document - assign a high score to that word (term frequency - TF).
If a word appears in a lot of documents - assign a low score to that word (inverse document frequency - IDF).
The second point above is the main reason why common words like "the" and "were" are given low scores: they appear everywhere and have no unique importance to any particular document. The TF-IDF values act as weights, and search engines often use different variations of TF-IDF weighting as a central tool in ranking a document's relevance to a given user query.
This is by far the best-known weighting scheme used in information retrieval. TF-IDF gives the importance of each term in a document not only as an isolated term, but also as a term within the entire document collection. This enables ranking and scoring documents against a query, as well as classifying documents and modeling documents and terms within a vector space.
3) How TF-IDF is calculated
TF-IDF is the product of two main statistics: term frequency and inverse document frequency. Different information retrieval systems use various calculation mechanisms, but here we present the most general mathematical formulas. TF-IDF is calculated for all the terms in a document. Sometimes a threshold is applied and words that score below it are omitted.
The formulas given here for calculating term frequency and inverse document frequency are for a basic understanding of the concept; other systems may differ. The variants of the term-frequency and inverse-document-frequency weights are given in Section 6 (Variants of TF and IDF Weights).
Calculating Term Frequency

tf(t, d) = (number of times term t appears in document d) / (total number of terms in document d)

Calculating Inverse Document Frequency

idf(t) = log( N / df(t) ), where N is the total number of documents in the collection and df(t) is the number of documents that contain term t.
Example:
Consider a document d in which the word "car" appears 6 times. The document contains 600 words in total.

tf("car", d) = 6 / 600 = 0.01

Now, suppose we have a collection of 10,000 documents, and the word "car" appears in 300 of these documents.

idf("car") = log( 10000 / 300 ) = 1.523

tf-idf("car", d) = 0.01 x 1.523 = 0.01523
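The worked example above can be reproduced in a few lines of Python (note the base-10 logarithm, which matches the numbers in the example):

```python
import math

# Reproducing the worked example: all numbers come from the article.
tf = 6 / 600                    # "car" appears 6 times in a 600-word document
idf = math.log10(10000 / 300)   # "car" appears in 300 of 10,000 documents
tfidf = tf * idf

print(round(tf, 2))      # 0.01
print(round(idf, 3))     # 1.523
print(round(tfidf, 5))   # 0.01523
```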
4) Document Ranking for a Given Query
Using this concept, we can rank documents for a given query. When a user queries for certain information, the system needs to retrieve the most relevant documents to satisfy the user's information need. Document ranking orders the documents by relevance, with the most relevant document ranked first.
To rank documents for a query, we calculate a score for each document against the query. We can use the following formula:

score(q, d) = sum of tf-idf(t, d) over every term t that appears in both the document d and the query q

The score is calculated by taking the terms present in both the document d and the query q, looking up the TF-IDF value of each of those terms, and summing them. This sum is the score of document d for query q. The exact technique varies with the application and the parameters being used. For more on this, read: SMART Information Retrieval System.
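This overlap-scoring scheme can be sketched as follows. The three-document corpus, the whitespace tokenizer, and the query are illustrative assumptions, not part of the original text:

```python
import math

# Sketch of query scoring: score(q, d) = sum of tf-idf(t, d)
# over terms t present in both the query q and the document d.
docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "a bicycle shares the road with the car",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf(term, doc_tokens):
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    df = sum(1 for d in tokenized if term in d)
    return math.log10(N / df) if df else 0.0

def score(query, doc_tokens):
    # Only terms appearing in both query and document contribute.
    return sum(tf(t, doc_tokens) * idf(t) for t in query.split() if t in doc_tokens)

query = "car road"
ranked = sorted(range(N), key=lambda i: score(query, tokenized[i]), reverse=True)
print(ranked[0])  # 0 - the first document matches the query best
```

Document 1 contains neither query term, so it scores zero; documents 0 and 2 both contain "car" and "road", but document 0 is shorter, so its term frequencies (and hence its score) are higher.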
5) Applications of TF-IDF
Automatic Website Tagging
TF-IDF becomes very useful when you have a very large document set and you wish to build an information retrieval system on it, or simply categorize the documents. In this case, TF-IDF picks the best and most distinctive tags (words) from each document and assigns them a score within that document. Manual document tagging is always a time-consuming task and is rarely as accurate as we expect. Many web developers add tags to their websites in order to achieve good SEO (Search Engine Optimization), though some are reluctant to do so. This is an excellent use case for TF-IDF, which can generate the necessary tags for each document.
Web Search Engines
Information retrieval systems are the most common application of TF-IDF. Whenever a user queries for a word or a text, the system looks at the TF-IDF values and retrieves the documents most relevant to the user. Company-internal information retrieval systems, web search engines, and website search bars use different variations of TF-IDF weighting to achieve the best quality results with minimal trade-offs on other factors such as response time.
Digital Libraries
TF-IDF is also used in fields like text mining and user modeling, where a weighting factor is attached to the TF-IDF value. The applications of TF-IDF grow daily; according to Wikipedia, 83% of text-based recommender systems in the domain of digital libraries use TF-IDF. A digital library is an information hub that contains electronic resources such as texts, books, images, and graphs. Once TF-IDF is incorporated into a digital library's search engine, students and other users can retrieve the most relevant information quickly and conveniently, rather than moving from shelf to shelf trying to figure out whether a given book will satisfy their information need.
6) Variants of TF and IDF Weights
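Common variants include sublinear (log-normalized) term frequency, boolean term frequency, and smoothed inverse document frequency. These standard textbook weightings can be sketched as follows; the function names here are illustrative, not from any particular library:

```python
import math

# Term-frequency variants
def tf_raw(count, doc_len):
    # raw frequency, as in the formula in Section 3
    return count / doc_len

def tf_log(count):
    # sublinear scaling: 1 + log tf, damping very frequent terms
    return 1 + math.log10(count) if count > 0 else 0.0

def tf_boolean(count):
    # 1 if the term occurs at all, else 0
    return 1.0 if count > 0 else 0.0

# Inverse-document-frequency variants
def idf_plain(N, df):
    return math.log10(N / df)

def idf_smooth(N, df):
    # adding 1 inside the log keeps the weight positive and dampens it
    return math.log10(1 + N / df)

print(tf_log(100))                      # 3.0 - grows much slower than raw counts
print(round(idf_plain(10000, 300), 3))  # 1.523 - matches the Section 3 example
```

The right combination depends on the application: sublinear TF suits long documents where raw counts over-reward repetition, while smoothed IDF avoids problems when a term appears in every document.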
References

[1] "Tf-idf", Wikipedia.