In the digital age, where information is readily available and easily copied, there is a growing need for techniques that can detect plagiarism (intentional or unintentional), flag content duplication, and enhance natural language processing capabilities. What sets shingling apart is how it extends to a range of applications, including document clustering, information retrieval, and content recommendation systems.
Shingling is a widely used technique for detecting and mitigating textual similarities. This article introduces the concept of shingling, the basics of the technique, Jaccard similarity, and advanced techniques and optimizations. Shingling is the process of converting the text of a document into a set of overlapping sequences of words or characters. Programmatically, think of this as a list of substrings drawn from a string value.
Let's take a string: "Generative AI is evolving rapidly." Denote the length of a shingle as k and set k to 5.
The result is a set of overlapping five-character sequences:
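As a minimal sketch, character-level shingling with k=5 can be written in a few lines of Python (the `char_shingles` helper name is ours, not from the article):

```python
def char_shingles(text, k=5):
    """Return the set of overlapping k-character shingles of `text`."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

# The first few shingles of the example string, in sorted order:
print(sorted(char_shingles("Generative AI is evolving rapidly.", k=5))[:4])
```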
These overlapping sequences are called "shingles" or "n-grams." Shingles consist of consecutive words or characters from the text, forming a series of overlapping segments. The shingle length, denoted above as k, varies depending on the specific requirements of the analysis; a common practice is to use shingles of three to five words or characters.
Shingling is part of a three-step process.
If you are familiar with prompt engineering, you have likely heard of tokenization. It is the process of breaking a sequence of text into smaller units called tokens. Tokens can be words, subwords, characters, or other meaningful units. This step prepares the text data for further processing by models. With word tokenization, the example "Generative AI is evolving rapidly" is tokenized into ["Generative", "AI", "is", "evolving", "rapidly"].
For tokenization, you can use a simple Python method or a regular expression. Libraries such as NLTK (Natural Language Toolkit) and spaCy provide more advanced options, such as stopword removal.
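For instance, a minimal regex-based tokenizer (one possible approach; NLTK or spaCy would handle punctuation and edge cases more robustly):

```python
import re

def tokenize(text):
    """Split text into word tokens using a simple regex."""
    return re.findall(r"\w+", text)

print(tokenize("Generative AI is evolving rapidly."))
# → ['Generative', 'AI', 'is', 'evolving', 'rapidly']
```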
As you know by now, shingling, also known as n-gramming, is the process of creating sets of contiguous sequences of tokens (n-grams or shingles) from the tokenized text. For example, with k=3, the sentence "Generative AI is evolving rapidly." produces shingles like ["Generative AI is", "AI is evolving", "is evolving rapidly"].
This is a list of shingles. Shingling helps capture local word order and context.
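A minimal sketch of word-level shingling (our own helper, assuming simple whitespace tokenization):

```python
def word_shingles(tokens, k=3):
    """Return the list of overlapping k-word shingles."""
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]

print(word_shingles("Generative AI is evolving rapidly".split(), k=3))
# → ['Generative AI is', 'AI is evolving', 'is evolving rapidly']
```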
Hashing means using special functions to turn any kind of data, such as text or shingles, into fixed-size codes. Popular hashing methods include MinHash, SimHash, and Locality-Sensitive Hashing (LSH). Hashing enables efficient comparison, indexing, and retrieval of similar text segments. Once you turn documents into sets of shingle codes, it is much simpler to compare them and spot similarities or possible plagiarism.
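As a simple illustration, here is one way to map shingles to fixed-size codes (MD5 is our arbitrary choice here; MinHash and SimHash build on families of such functions):

```python
import hashlib

def shingle_code(shingle, bits=32):
    """Map a shingle string to a fixed-size integer code via MD5."""
    digest = hashlib.md5(shingle.encode("utf-8")).hexdigest()
    return int(digest, 16) % (1 << bits)

# A document becomes a set of integer codes, cheap to store and compare:
codes = {shingle_code(s) for s in ["Generative AI is", "AI is evolving"]}
print(codes)
```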
Let's consider two short text passages that are widely used to explain simple shingling.
With a word size of 4, using the w-shingle Python function above, the shingles for Passage 1 would be:
By comparing the sets of shingles, you can see that the first four shingles are identical, indicating a high degree of similarity between the two passages.
Shingling sets the stage for more detailed analysis, such as measuring similarity with metrics like Jaccard similarity. Picking the right shingle size k is crucial: smaller shingles catch fine-grained language details, while larger ones surface bigger-picture connections.
In text analysis, Jaccard similarity is a key metric. It measures the similarity between two text samples as the ratio of the number of shared shingles to the total number of unique shingles across both samples.
J(A, B) = |A ∩ B| / |A ∪ B|
Jaccard similarity is defined as the size of the intersection divided by the size of the union of the shingle sets from each text. Though it sounds straightforward, the technique is powerful: it quantifies textual similarity, offering insight into how closely related two pieces of text are based on their content. Jaccard similarity lets researchers and AI models compare text data with precision, and it is used in tasks like document clustering, similarity detection, and content categorization.
Shingling can also be used to cluster similar documents together. By representing each document as a set of shingles and calculating the similarity between these sets (e.g., using the Jaccard coefficient or cosine similarity), you can group documents with high similarity scores into clusters. This approach is useful in various applications, such as search engine result clustering, topic modeling, and document categorization.
When implementing Jaccard similarity in a language like Python, the choice of shingle size (k) and conversion to lowercase ensure a consistent basis for comparison, showcasing the technique's utility in discerning textual similarities.
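A minimal sketch of such an implementation (helper names and example sentences are ours; two-word shingles, lowercased as described):

```python
def shingle_set(text, k=2):
    """Lowercase, tokenize on whitespace, and build k-word shingles."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

s1 = shingle_set("Generative AI is evolving rapidly")
s2 = shingle_set("Generative AI evolves rapidly")
print(round(jaccard(s1, s2), 4))
# → 0.1667  (1 shared shingle out of 6 unique shingles)
```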
Let's calculate the Jaccard similarity between two sentences:
J(A, B) = |A ∩ B| / |A ∪ B| = 12 / 57 ≈ 0.2105
So the Jaccard similarity is approximately 0.2105, which means the two sets are about 21.05% (0.2105 × 100) similar.
Instead of passages, let's consider two sets of numbers:
Calculate the Jaccard similarity to see how similar these two sets of numbers are:
|A ∩ B| / |A ∪ B| = 2 / 8 = 0.25
To calculate dissimilarity, just subtract the score from 1.
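The original sets are not shown here, so assume for illustration A = {1, 2, 3, 4, 5} and B = {4, 5, 6, 7, 8}, which reproduce the 2/8 ratio above:

```python
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7, 8}  # hypothetical sets chosen so |A ∩ B| = 2 and |A ∪ B| = 8

similarity = len(A & B) / len(A | B)
dissimilarity = 1 - similarity
print(similarity, dissimilarity)
# → 0.25 0.75
```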
Advanced shingling and hashing techniques and optimizations are crucial for efficient similarity detection and plagiarism detection in large datasets. Here are some advanced techniques and optimizations, along with examples and links to code implementations:
Locality-Sensitive Hashing (LSH) is an advanced technique that improves the efficiency of shingling and hashing for similarity detection. It involves creating a signature matrix and using multiple hash functions to reduce the dimensionality of the data, making it efficient to find similar documents.
The key idea behind LSH is to hash similar items into the same bucket with high probability, while dissimilar items are hashed into different buckets. This is achieved by using a family of locality-sensitive hash functions that hash similar items to the same value with higher probability than dissimilar items.
Consider two documents, A and B, represented as sets of shingles. We can apply LSH by:
1. Computing a MinHash signature for each document's shingle set.
2. Splitting each signature into bands of several rows each.
3. Hashing every band into a bucket.
4. Treating any two documents that land in the same bucket for at least one band as a candidate pair.
This process significantly reduces the number of document pairs that need to be compared, making similarity detection more efficient.
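A toy sketch of the bucketing step (the signatures and helper names are ours for illustration): two near-identical signatures, split into bands, collide in most buckets and thus surface as a candidate pair.

```python
from collections import defaultdict

def add_to_buckets(buckets, doc_id, sig, bands, rows):
    """Hash each band (a tuple of `rows` signature values) into a bucket."""
    for b in range(bands):
        key = (b, tuple(sig[b * rows:(b + 1) * rows]))
        buckets[key].add(doc_id)

buckets = defaultdict(set)
sig_a = [3, 1, 4, 1, 5, 9, 2, 6]   # toy 8-value MinHash signature
sig_b = [3, 1, 4, 1, 5, 9, 2, 7]   # nearly identical signature
add_to_buckets(buckets, "A", sig_a, bands=4, rows=2)
add_to_buckets(buckets, "B", sig_b, bands=4, rows=2)

# Any bucket holding more than one document yields a candidate pair:
candidates = {tuple(sorted(d)) for d in buckets.values() if len(d) > 1}
print(candidates)
# → {('A', 'B')}
```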
Minhashing is a technique used to quickly estimate the similarity between two sets by using a set of hash functions. It's commonly applied in large-scale data processing tasks where calculating the exact similarity between sets is computationally expensive. Minhashing approximates the Jaccard similarity between sets, which measures the overlap between two sets.
Here's how Minhashing works: several hash functions are applied to every shingle in a set, and for each function only the minimum hash value is kept. The resulting vector of minimums is the set's minhash signature, and the fraction of positions at which two signatures agree estimates the Jaccard similarity of the underlying sets.
Minhashing allows for a significant reduction in the amount of data needed to represent sets while providing a good approximation of their similarity.
We can represent these sets as shingles:
Now, let's generate the signature matrix using Minhashing:
Now, let's estimate the similarity between sets A and B:
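A self-contained sketch of the whole procedure (example shingles, hash coefficients, and helper names are ours; real implementations such as datasketch are more careful and much faster):

```python
import random
import zlib

# Illustrative MinHash: N universal hash functions h(x) = (a*x + b) % P
# applied to integer shingle ids; the minimum per function forms the signature.
random.seed(42)
P = 2_147_483_647  # a large prime
N = 200            # number of hash functions (signature length)
COEFFS = [(random.randrange(1, P), random.randrange(P)) for _ in range(N)]

def signature(shingle_ids):
    """MinHash signature: the minimum hash value per hash function."""
    return [min((a * x + b) % P for x in shingle_ids) for a, b in COEFFS]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def ids(shingles):
    """Deterministic integer ids for shingle strings."""
    return {zlib.crc32(s.encode("utf-8")) for s in shingles}

A = ids(["generative ai", "ai is", "is evolving", "evolving rapidly"])
B = ids(["generative ai", "ai is", "is advancing", "advancing rapidly"])
print(round(estimate_jaccard(signature(A), signature(B)), 2))  # true Jaccard is 2/6 ≈ 0.33
```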
Code Implementation: You can implement Minhashing in Python using libraries like NumPy and datasketch.
Banding and bucketing are advanced optimization techniques used in conjunction with Minhashing to efficiently identify similar sets within large datasets. These techniques are particularly valuable when dealing with massive collections of documents or data points.
Banding involves dividing the Minhash signature matrix into multiple bands, each containing several rows. By partitioning the matrix vertically into bands, we reduce the number of comparisons needed between sets. Instead of comparing every pair of rows across the entire matrix, we only compare rows within the same band. This significantly reduces the computational overhead, especially for large datasets, as we only need to consider a subset of rows at a time.
Bucketing complements banding by further narrowing down the comparison process within each band. Within each band, we hash the rows into a fixed number of buckets. Each bucket contains a subset of rows from the band. When comparing sets for similarity, we only need to compare pairs of sets that hash to the same bucket within each band. This drastically reduces the number of pairwise comparisons required, making the process more efficient.
Example
Let's say we have a Minhash signature matrix with 100 rows and 20 bands. Within each band, we hash the rows into 10 buckets. When comparing sets, instead of comparing all 100 rows, we only need to compare pairs of sets that hash to the same bucket within each band. This greatly reduces the number of comparisons needed, leading to significant performance gains, especially for large datasets.
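Those numbers can be sketched as follows (the signatures are randomly generated for illustration; note that Python's `hash` on tuples of ints is deterministic):

```python
import random

ROWS, BANDS, BUCKETS = 100, 20, 10
ROWS_PER_BAND = ROWS // BANDS  # 5 signature rows per band

def bucket_keys(sig):
    """One (band, bucket) key per band of a 100-value signature."""
    return [
        (b, hash(tuple(sig[b * ROWS_PER_BAND:(b + 1) * ROWS_PER_BAND])) % BUCKETS)
        for b in range(BANDS)
    ]

random.seed(0)
sig_a = [random.randrange(1_000) for _ in range(ROWS)]
sig_b = list(sig_a)
sig_b[0] += 1  # a near-duplicate differing in a single signature row

# The pair shares a bucket in 19 of 20 bands, so it becomes a candidate
# pair without any row-by-row comparison across the full matrix.
shared = set(bucket_keys(sig_a)) & set(bucket_keys(sig_b))
print(len(shared) >= 1)
# → True
```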
Benefits
Banding and bucketing dramatically cut the number of pairwise comparisons, scale to very large datasets, and expose a tunable trade-off between detection accuracy and computational cost.
Several open-source implementations, such as the datasketch library in Python and the lsh library in Java, provide functionality for shingling, minhashing, and banded LSH with bucketing.
Candidate pairs is an advanced technique used in conjunction with shingling and minhashing for efficient plagiarism detection and near-duplicate identification. In the context of shingling, candidate pairs work as follows:
Documents are first converted into sets of k-shingles, which are contiguous sequences of k tokens (words or characters) extracted from the text. This step represents documents as sets of overlapping k-grams, enabling similarity comparisons.
The shingle sets are then converted into compact minhash signatures, which are vectors of fixed length, using the minhashing technique. Minhash signatures preserve similarity between documents, allowing efficient estimation of Jaccard similarity.
The minhash signatures are split into multiple bands, where each band is a smaller sub-vector of the original signature.
Within each band, the sub-vectors are hashed into buckets using a hash function. Documents with the same hash value for a particular band are placed in the same bucket.
Two documents are considered a candidate pair for similarity comparison if they share a bucket in at least one band; in other words, if their sub-vectors collide in at least one band, they are flagged for a full comparison.
The key advantage of using candidate pairs is that it significantly reduces the number of document pairs that need to be compared for similarity, as only candidate pairs are considered. This makes the plagiarism detection process much more efficient, especially for large datasets.
By carefully choosing the number of bands and the band size, a trade-off can be made between the accuracy of similarity detection and the computational complexity. More bands generally lead to higher accuracy but also increase the computational cost.
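This trade-off can be made concrete with the standard candidate-pair probability for b bands of r rows, P = 1 − (1 − s^r)^b, where s is the Jaccard similarity of the pair (a textbook formula; the parameter values below are ours):

```python
def candidate_probability(s, bands, rows):
    """Probability that a pair with Jaccard similarity s shares a bucket
    in at least one band: 1 - (1 - s**rows)**bands."""
    return 1 - (1 - s ** rows) ** bands

# With 20 bands of 5 rows, dissimilar pairs are rarely candidates while
# highly similar pairs are almost always caught:
for s in (0.2, 0.5, 0.8):
    print(s, round(candidate_probability(s, bands=20, rows=5), 3))
```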
In conclusion, the combination of shingling, minhashing, banding, and Locality Sensitive Hashing (LSH) provides a powerful and efficient approach for plagiarism detection and near-duplicate identification in large document collections.
Shingling converts documents into sets of k-shingles, which are contiguous sequences of k tokens (words or characters), enabling similarity comparisons. Minhashing then compresses these shingle sets into compact signatures, preserving similarity between documents.
To further improve efficiency, banding splits the minhash signatures into multiple bands, and bucketing hashes each band into buckets, grouping similar documents together. This process generates candidate pairs, pairs of documents that share a bucket in at least one band, significantly reducing the number of document pairs that need to be compared for similarity.
The actual similarity computation is then performed only on the candidate pairs, using the original minhash signatures to estimate the Jaccard similarity. Pairs with similarity above a specified threshold are considered potential plagiarism cases or near-duplicates.
This approach offers several advantages: it avoids exhaustive pairwise comparison, scales to large document collections, and lets the number of bands and rows tune the balance between accuracy and computational cost.
Several open-source implementations, such as the datasketch library in Python and the lsh library in Java, provide functionality for shingling, minhashing, and banded LSH with bucketing and candidate pair generation, making it easier to integrate these techniques into plagiarism detection systems or other applications requiring efficient similarity search.
Overall, the combination of shingling, minhashing, banding, and LSH offers a powerful and efficient solution for plagiarism detection and near-duplicate identification, with applications across academia, publishing, and content management systems.