Content Duplication Finder
Use cases
Uses PolyFuzz TF-IDF vectorisation to score the similarity between pages, then hierarchical clustering to group related content.
Three configurable settings: minimum similarity score (0.5-1.0), URL filter pattern, and group link similarity (0.5-1.0) for cluster formation.
Requires a Screaming Frog CSV export with Address, H1-1, and Copy 1 columns.
Minimum of 2 URLs required.
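To make the matching step concrete, here is a minimal sketch of TF-IDF similarity scoring in plain Python. It does not call PolyFuzz. It is a stand-in that assumes character trigrams and a smoothed IDF, and the names `char_ngrams`, `tfidf_vectors`, `duplicate_pairs`, and the `pages` dictionary are illustrative rather than part of the tool.

```python
import math
from collections import Counter
from itertools import combinations


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams (lowercased, whitespace collapsed)."""
    cleaned = " ".join(text.lower().split())
    if len(cleaned) < n:
        return [cleaned] if cleaned else []
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Build L2-normalised TF-IDF vectors (as dicts) for a list of documents."""
    counts = [Counter(char_ngrams(doc, n)) for doc in docs]
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(docs)
    vectors = []
    for c in counts:
        # Smoothed IDF (an assumption): n-grams found in every document keep a small weight.
        vec = {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1) for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()} if norm else {})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def duplicate_pairs(pages, min_similarity=0.9):
    """Return (url_a, url_b, score) for page pairs scoring at or above the threshold."""
    urls = list(pages)
    vectors = tfidf_vectors([pages[u] for u in urls])
    pairs = []
    for i, j in combinations(range(len(urls)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= min_similarity:
            pairs.append((urls[i], urls[j], round(score, 3)))
    return pairs
```

Here `pages` maps each Address to its Copy 1 text, and `min_similarity` plays the role of the minimum similarity score setting. Comparing every pair is quadratic in the number of URLs, which is acceptable for a sketch but would need a sparse matrix approach on large crawls.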
Platform
Browser-based (no installation required)
Input
Screaming Frog CSV with columns: Address, H1-1, Copy 1
Minimum 2 URLs with content required
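The input requirements can be checked before any matching runs. The sketch below is one possible validation routine, not the tool's own code. It assumes the upload arrives as raw bytes in UTF-8 or Latin-1, and the helper names `decode_export` and `load_pages` are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ("Address", "H1-1", "Copy 1")


def decode_export(raw):
    """Decode export bytes as UTF-8, falling back to Latin-1 (which accepts any byte)."""
    try:
        return raw.decode("utf-8-sig")  # also strips a byte order mark if present
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def load_pages(raw):
    """Return rows that have an Address and non-empty Copy 1, or raise ValueError."""
    reader = csv.DictReader(io.StringIO(decode_export(raw), newline=""))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing columns: {', '.join(missing)}")
    rows = [
        row for row in reader
        if (row["Address"] or "").strip() and (row["Copy 1"] or "").strip()
    ]
    if len(rows) < 2:
        raise ValueError("At least 2 URLs with content are required")
    return rows
```

Rows with an empty Copy 1 cell are dropped before counting, so a crawl where custom extraction captured nothing fails early with a clear message.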
Output
CSV with duplicate pairs and clusters
Features
- PolyFuzz TF-IDF vectorisation
- Hierarchical clustering for duplicate grouping
- Minimum similarity score slider (0.5-1.0)
- Group link similarity threshold (0.5-1.0)
- URL filter for targeting specific sections
- UTF-8 and Latin-1 encoding support
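As a rough illustration of how a group link similarity threshold can shape clusters, the sketch below merges URLs connected by any pair that meets the threshold. This is equivalent to cutting a single-linkage hierarchy at that value. The linkage method is an assumption, since the tool does not state which one it uses, and `cluster_by_link` is a hypothetical name.

```python
def cluster_by_link(pairs, link_similarity=0.8):
    """Group URLs whose pairwise similarity meets the link threshold.

    `pairs` is an iterable of (url_a, url_b, score) tuples.
    """
    parent = {}

    def find(url):
        # Union-find lookup with path halving.
        parent.setdefault(url, url)
        while parent[url] != url:
            parent[url] = parent[parent[url]]
            url = parent[url]
        return url

    for url_a, url_b, score in pairs:
        root_a, root_b = find(url_a), find(url_b)
        if score >= link_similarity and root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for url in parent:
        clusters.setdefault(find(url), []).append(url)
    # Keep only groups with at least two members, i.e. actual duplicates.
    return sorted(sorted(group) for group in clusters.values() if len(group) > 1)
```

Raising the threshold splits loose groups into tighter ones: a chain such as /a to /b at 0.95 and /b to /c at 0.85 forms one cluster at 0.8 but leaves /c out at 0.9.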
How to use
1. Crawl your site with Screaming Frog (enable custom extraction for Copy 1)
2. Export internal_html.csv
3. Upload and set minimum similarity score (0.9 = 90% threshold)
4. Optionally filter by URL pattern
5. Adjust group link similarity for cluster tightness
6. Download CSV with duplicate clusters
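The filtering and download steps above could look something like the following sketch. It assumes the URL filter is a plain substring match, which the tool may or may not use, and the output column names (`url_a`, `url_b`, `similarity`, `cluster`) are invented for illustration since the actual CSV layout is not documented here.

```python
import csv
import io


def filter_pairs(pairs, url_pattern=""):
    """Keep pairs where both URLs contain the pattern (an empty pattern keeps all)."""
    return [p for p in pairs if url_pattern in p[0] and url_pattern in p[1]]


def pairs_to_csv(pairs, clusters):
    """Serialise duplicate pairs with a cluster id per row (hypothetical column names)."""
    cluster_id = {url: i for i, group in enumerate(clusters, start=1) for url in group}
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["url_a", "url_b", "similarity", "cluster"])
    for url_a, url_b, score in pairs:
        writer.writerow([url_a, url_b, f"{score:.2f}", cluster_id.get(url_a, "")])
    return buffer.getvalue()
```

For example, passing `"/blog/"` as `url_pattern` restricts the report to duplicate pairs inside the blog section.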
Want me to run this for you?
I offer this as a managed service. You get the insights without touching the tool.
Related Tools
Competitor Content Gap Finder
Content: Discover which descriptive words competitors use in titles that you are missing.
Content Block Extractor
Content: Extract content blocks and XPath patterns using Claude Haiku for template analysis.
Content Consolidation Analyser
Content: Find cannibalising pages by clustering URLs that share SERP overlap.