Content Duplication Finder

Use cases

Finding duplicate product descriptions

Identifying thin content to consolidate

Content audit for large sites

E-commerce duplicate content cleanup

Uses PolyFuzz TF-IDF vectorisation to score the similarity between pages, then hierarchical clustering to group related content.
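To make the matching step concrete, here is a minimal standard-library sketch of TF-IDF similarity scoring. It is not PolyFuzz's implementation: the character trigram tokenisation, the smoothed IDF formula, and the function names (char_ngrams, tfidf_vectors, pairwise_similarity) are assumptions chosen for illustration, and which column the tool actually compares is not stated here.

```python
import math
from collections import Counter
from itertools import combinations


def char_ngrams(text, n=3):
    """Split lowercased, whitespace-normalised text into character n-grams."""
    text = " ".join(text.lower().split())
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Build L2-normalised TF-IDF vectors (as dicts) for a list of strings."""
    counts = [Counter(char_ngrams(doc, n)) for doc in docs]
    # Document frequency: in how many documents each n-gram appears.
    df = Counter(gram for count in counts for gram in count)
    total = len(docs)
    vectors = []
    for count in counts:
        # Smoothed IDF (an assumption) so n-grams found everywhere keep weight.
        vec = {
            gram: tf * (math.log((1 + total) / (1 + df[gram])) + 1)
            for gram, tf in count.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()} if norm else {})
    return vectors


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b.get(gram, 0.0) for gram, weight in a.items())


def pairwise_similarity(pages):
    """pages maps URL -> text. Returns {(url_a, url_b): similarity score}."""
    urls = list(pages)
    vectors = tfidf_vectors([pages[url] for url in urls])
    return {
        (urls[i], urls[j]): cosine(vectors[i], vectors[j])
        for i, j in combinations(range(len(urls)), 2)
    }


# Hypothetical sample data, not taken from a real crawl.
sample_pages = {
    "/a": "Blue cotton t-shirt with pocket",
    "/b": "Blue cotton t-shirt with pocket",
    "/c": "zzzz qqqq 9999",
}
sample_scores = pairwise_similarity(sample_pages)
```

Identical copy scores close to 1.0 and copy with no shared trigrams scores 0, which is the range the similarity settings operate on.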

Three configurable settings: a minimum similarity score (0.5-1.0), a URL filter pattern, and a group link similarity (0.5-1.0) that controls cluster formation.
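The sketch below shows one way these three settings could interact. It rests on several assumptions: the Settings class, its field names and its default values are hypothetical; the URL filter is treated as a regular expression that keeps matching URLs (the tool may exclude them instead); and grouping is done as single-linkage clustering cut at the group link similarity, which may differ from the linkage the tool uses.

```python
import re
from dataclasses import dataclass


@dataclass
class Settings:
    """Hypothetical container for the three settings (defaults are assumed)."""
    min_similarity: float = 0.8
    url_filter: str = ""            # empty pattern keeps every URL
    group_link_similarity: float = 0.9

    def __post_init__(self):
        for name in ("min_similarity", "group_link_similarity"):
            value = getattr(self, name)
            if not 0.5 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0.5 and 1.0, got {value}")
        re.compile(self.url_filter)  # fail early on an invalid pattern


def duplicate_pairs(scores, settings):
    """Keep pairs at or above min_similarity whose URLs both match the filter."""
    pattern = re.compile(settings.url_filter)
    return {
        pair: score
        for pair, score in scores.items()
        if score >= settings.min_similarity
        and all(pattern.search(url) for url in pair)
    }


def group_urls(scores, settings):
    """Single-linkage grouping: link pairs at or above group_link_similarity."""
    parent = {}

    def find(url):
        parent.setdefault(url, url)
        while parent[url] != url:
            parent[url] = parent[parent[url]]  # path halving
            url = parent[url]
        return url

    for (a, b), score in duplicate_pairs(scores, settings).items():
        if score >= settings.group_link_similarity:
            parent[find(a)] = find(b)

    groups = {}
    for url in list(parent):
        groups.setdefault(find(url), set()).add(url)
    return sorted((sorted(group) for group in groups.values()), key=lambda g: g[0])


# Hypothetical scores, as a similarity step might produce them.
example_scores = {
    ("/shop/a", "/shop/b"): 0.95,
    ("/shop/b", "/shop/c"): 0.92,
    ("/shop/a", "/shop/c"): 0.60,
    ("/blog/x", "/shop/a"): 0.99,
}
example_settings = Settings(
    min_similarity=0.7, url_filter=r"^/shop/", group_link_similarity=0.9
)
example_pairs = duplicate_pairs(example_scores, example_settings)
example_groups = group_urls(example_scores, example_settings)
```

With single linkage, /shop/a and /shop/c end up in the same group through /shop/b even though their direct score is below the minimum. A stricter linkage would split them, so the choice of linkage matters when reading the clusters.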

Requires Screaming Frog CSV with Address, H1-1, and Copy 1 columns.

Minimum 2 URLs required.

Streamlit App

Crawl Data

Platform

Browser-based (no installation required)

Input

Screaming Frog CSV with columns: Address, H1-1, Copy 1

Minimum 2 URLs with content required
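The input requirements above can be checked before uploading. This is a minimal sketch using Python's csv module; load_pages is a hypothetical helper, and reading "with content" as "non-empty Address and Copy 1" is an assumption about how the tool counts usable rows.

```python
import csv
import io

REQUIRED_COLUMNS = ("Address", "H1-1", "Copy 1")


def load_pages(csv_file):
    """Read a crawl export and return the rows that have a URL and copy."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"Missing required column(s): {', '.join(missing)}")

    pages = [
        {col: (row[col] or "").strip() for col in REQUIRED_COLUMNS}
        for row in reader
    ]
    # Assumption: a row only counts if both the URL and the copy are present.
    pages = [page for page in pages if page["Address"] and page["Copy 1"]]
    if len(pages) < 2:
        raise ValueError("At least 2 URLs with content are required")
    return pages


# Hypothetical export contents; a real file would be opened with open(path, newline="").
sample_export = io.StringIO(
    "Address,H1-1,Copy 1\n"
    "https://example.com/a,Title A,Some copy\n"
    "https://example.com/b,Title B,Other copy\n"
    "https://example.com/c,Title C,\n"
)
valid_pages = load_pages(sample_export)
```

In this sample the third row has no copy, so only two pages are kept, which is just enough to meet the minimum.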

Output

CSV with duplicate pairs and clusters
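As an illustration of what a pairs-and-clusters CSV might look like, the sketch below writes one row per duplicate pair with an optional cluster number. The column names (url_a, url_b, similarity, cluster_id) and the build_report helper are assumptions; the tool's actual output layout is not documented here.

```python
import csv
import io


def build_report(pairs, clusters):
    """Return CSV text with one row per pair, highest similarity first."""
    # Number clusters from 1 and map every URL to its cluster.
    cluster_of = {
        url: idx
        for idx, urls in enumerate(clusters, start=1)
        for url in urls
    }
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["url_a", "url_b", "similarity", "cluster_id"])
    for (a, b), score in sorted(pairs.items(), key=lambda item: -item[1]):
        # Only report a cluster when both URLs belong to the same one.
        same = cluster_of.get(a) is not None and cluster_of.get(a) == cluster_of.get(b)
        writer.writerow([a, b, f"{score:.2f}", cluster_of[a] if same else ""])
    return out.getvalue()


# Hypothetical pairs and clusters.
report = build_report(
    {("/a", "/b"): 0.95, ("/a", "/c"): 0.72},
    [["/a", "/b"]],
)
```

Keeping pairs and cluster membership in one file makes it easy to sort by similarity in a spreadsheet while still seeing which pages belong together.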

Features

I offer this as a managed service. You get the insights without touching the tool.

Monthly retainers or one-off projects. No lengthy reports that sit in a drawer.