
Content Duplication Finder

Use cases

  • Finding duplicate product descriptions
  • Identifying thin content to consolidate
  • Content audit for large sites
  • E-commerce duplicate content cleanup

Uses PolyFuzz TF-IDF vectorisation for similarity matching with hierarchical clustering to group related content.

Three configurable settings: minimum similarity score (0.5-1.0), URL filter pattern, and group link similarity (0.5-1.0), which controls cluster formation.
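As a rough illustration of what TF-IDF similarity matching with a minimum score does, here is a minimal standard-library sketch. It is not PolyFuzz's code: the character trigrams, the smoothed IDF formula, and the function names (`char_ngrams`, `tfidf_vectors`, `similar_pairs`) are assumptions made for this example only.

```python
import math
from collections import Counter
from itertools import combinations


def char_ngrams(text, n=3):
    """Split lowercased, whitespace-normalised text into overlapping character n-grams."""
    cleaned = " ".join(text.lower().split())
    if len(cleaned) < n:
        return [cleaned] if cleaned else []
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Build one L2-normalised TF-IDF vector (stored as a dict) per document."""
    counts = [Counter(char_ngrams(doc, n)) for doc in docs]
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(docs)
    vectors = []
    for c in counts:
        # Smoothed IDF (an assumption): n-grams found in every document keep a non-zero weight.
        weights = {
            g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1)
            for g, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({g: w / norm for g, w in weights.items()} if norm else {})
    return vectors


def cosine(a, b):
    """Dot product of two normalised sparse vectors, i.e. their cosine similarity."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def similar_pairs(pages, min_similarity=0.9):
    """pages maps URL -> copy text. Returns (url_a, url_b, score) at or above the threshold."""
    urls = list(pages)
    vectors = tfidf_vectors([pages[u] for u in urls])
    pairs = []
    for i, j in combinations(range(len(urls)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= min_similarity:
            pairs.append((urls[i], urls[j], round(score, 3)))
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

Raising `min_similarity` towards 1.0 keeps only near-identical copy, while lowering it towards 0.5 also surfaces loosely related pages. Comparing every pair is quadratic in the number of URLs, so this sketch only suits small crawls.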

Requires a Screaming Frog CSV export with Address, H1-1, and Copy 1 columns, covering at least 2 URLs.

Streamlit App, Crawl Data

Platform

Browser-based (no installation required)

Input

Screaming Frog CSV with columns: Address, H1-1, Copy 1

Minimum 2 URLs with content required
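The input requirements above could be checked before any matching runs. The following is a sketch only, assuming the export has already been decoded to text; `load_crawl_rows` and the error messages are hypothetical and not taken from the app's source.

```python
import csv
import io

# Column names as listed in the input requirements.
REQUIRED_COLUMNS = ("Address", "H1-1", "Copy 1")


def load_crawl_rows(csv_text):
    """Parse crawl CSV text, check required columns, and keep rows that have copy."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"Missing required column(s): {', '.join(missing)}")

    # Short rows yield None for absent fields, so fall back to an empty string.
    rows = [
        r for r in reader
        if (r["Address"] or "").strip() and (r["Copy 1"] or "").strip()
    ]
    if len(rows) < 2:
        raise ValueError("At least 2 URLs with content are required")
    return rows
```

Rows with an empty Copy 1 cell are dropped before counting, which is one reading of "URLs with content".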

Output

CSV with duplicate pairs and clusters


Features

  • PolyFuzz TF-IDF vectorisation
  • Hierarchical clustering for duplicate grouping
  • Minimum similarity score slider (0.5-1.0)
  • Group link similarity threshold (0.5-1.0)
  • URL filter for targeting specific sections
  • UTF-8 and Latin-1 encoding support
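Supporting both UTF-8 and Latin-1 usually means trying one decoder and falling back to the other. A minimal sketch of that pattern follows; the function name `decode_csv_bytes` and the order of attempts are assumptions, not the app's confirmed behaviour.

```python
def decode_csv_bytes(raw):
    """Decode uploaded bytes as UTF-8, falling back to Latin-1.

    Returns (text, encoding_used). "utf-8-sig" also strips a leading
    byte-order mark if the export contains one.
    """
    try:
        return raw.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this fallback never raises.
        return raw.decode("latin-1"), "latin-1"
```

Because Latin-1 accepts any byte sequence, it has to come last: tried first, it would silently mis-decode valid UTF-8 files.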

How to use

  1. Crawl your site with Screaming Frog (enable custom extraction for Copy 1)
  2. Export internal_html.csv
  3. Upload and set minimum similarity score (0.9 = 90% threshold)
  4. Optionally filter by URL pattern
  5. Adjust group link similarity for cluster tightness
  6. Download CSV with duplicate clusters
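The filtering, grouping, and download steps can be sketched with the standard library. This is an illustration under stated assumptions: the URL filter is treated as a plain substring match, grouping uses connected components (equivalent to single-linkage clustering cut at the link threshold) as a stand-in for the app's hierarchical clustering, and the output column names `cluster_id` and `url` are hypothetical.

```python
import csv
import io


def filter_urls(urls, pattern=""):
    """Keep URLs containing the pattern; an empty pattern keeps everything."""
    return [u for u in urls if pattern in u]


def group_pairs(pairs, link_similarity=0.8):
    """Group URLs linked by pairs scoring at or above the link threshold.

    pairs is an iterable of (url_a, url_b, score). Uses union-find, so
    /a~/b and /b~/c land in one cluster even if /a~/c was never scored.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b, score in pairs:
        if score >= link_similarity:
            parent[find(a)] = find(b)

    clusters = {}
    for url in list(parent):
        clusters.setdefault(find(url), []).append(url)
    # Only groups with at least two members count as duplicate clusters.
    return sorted(sorted(m) for m in clusters.values() if len(m) > 1)


def clusters_to_csv(clusters):
    """Serialise clusters as CSV text with one row per (cluster_id, url)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["cluster_id", "url"])
    for cluster_id, members in enumerate(clusters, start=1):
        for url in members:
            writer.writerow([cluster_id, url])
    return buffer.getvalue()
```

A higher `link_similarity` gives tighter, smaller clusters, and a lower value lets chains of loosely similar pages merge into one group.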
