Clean Up Your Family Tree: Remove Duplicates, Fix Dates and Standardise Places
Inheriting a family tree from a relative, or merging research with a cousin, almost always leaves you with a tangle: the same person entered three times, places written eight different ways, "Müller" mangled into "MÃ¼ller", and nothing to tell you where to start, in what order, or how much progress you've actually made.
GEDminer is built for exactly this cleanup phase. It produces a prioritised, exportable list of every cleanup opportunity in your GEDCOM file, ranked by impact on your tree's overall data quality score. Work through the list, re-export, and watch the score climb. There is no signup, the file never leaves your browser, and every report can be exported as CSV or XLSX so you can track progress outside the app.
The cleanup workflow below mirrors the order most professional genealogists recommend: encoding first, then duplicates, then dates and places, then sourcing. The deep-dive sections explain why that order matters and how each step works under the hood.
How GEDminer solves it
Garbled accented characters from old GEDCOM files (Müller → MÃ¼ller).
A multi-pass decoder restores the correct characters automatically when the file is parsed.
Encoding Recovery →
Duplicate individuals from multiple imports or contributors.
The Duplicate Finder uses combined phonetic + fuzzy matching with adjustable confidence so you can merge confidently.
Duplicate Finder →
"Yorkshire", "Yorks", "Yorkshire, England": the same place written ten ways.
The Location Standardiser groups place-name variants and proposes a canonical form for each.
Location Standardiser →
Vague dates ("abt 1850") that block downstream research.
The Vital Sharpener ranks imprecise dates by how much they'd improve your tree if tightened.
Vital Sharpener →
No idea how well-sourced your tree actually is.
A sourcing percentage (Total Sources / Total Facts × 100) and per-fact source coverage tell you where citations are missing.
Tree Health Score →
Surname variants ("Smith", "Smyth", "Smithe") fragmenting your analysis.
Soundex + Levenshtein clustering groups likely surname variants so you can see the true frequency and decide on a canonical spelling.
Surname Variants →
The right cleanup order, and why it matters
Cleaning a family tree in the wrong order means doing the same work twice. A practical order:
- Encoding first. If accented names are mangled, every search and every comparison from this point forward will work on the wrong strings. Fix the encoding before you fix anything else.
- Duplicates next. If you clean records before merging, you end up making the same fix on every copy of a person. Find and merge first; clean afterwards.
- Structural errors. Broken parent-child links, families pointing to nonexistent IDs — these distort every downstream report and have to be fixed before sourcing or sharpening makes sense.
- Dates and places. Run the Vital Sharpener and Location Standardiser to tighten and normalise vital facts.
- Sourcing last. Once the data is clean, attaching citations is far less likely to need redoing.
Re-export your file and re-run GEDminer after each step. The Tree Health Score will tell you, in a single number, whether your work moved the needle.
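To make the "encoding first" step concrete, here is a minimal Python sketch of one common repair: text that was stored as UTF-8 but read as Windows-1252 can often be recovered by reversing that mistake, repeating the pass in case the file was mis-decoded more than once. This is an illustration only, not GEDminer's decoder; the function name `repair_mojibake` and the `max_passes` parameter are assumptions made for the example.

```python
def repair_mojibake(text: str, max_passes: int = 3) -> str:
    """Undo UTF-8 text that was wrongly decoded as Windows-1252.

    Hypothetical helper for illustration. Each pass re-encodes the string
    as cp1252 and decodes the bytes as UTF-8. If either step fails, the
    text is assumed to be correct already and is returned unchanged.
    """
    for _ in range(max_passes):
        try:
            candidate = text.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # not mojibake (or no longer mojibake): stop here
        if candidate == text:
            break  # plain ASCII round-trips unchanged
        text = candidate
    return text


if __name__ == "__main__":
    for name in ["MÃ¼ller", "Müller", "Smith"]:
        print(name, "->", repair_mojibake(name))
```

A correctly encoded "Müller" is left alone because its cp1252 bytes are not valid UTF-8, so the decode step fails and the loop stops. Files in other legacy encodings (ANSEL, for instance) need different handling that this sketch does not attempt.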
Standardising place names without losing nuance
Place names are messy because they evolve: counties merge, parishes are absorbed, countries change name and borders move. The Location Standardiser does not try to "correct" history — it groups variants of the same place so you can pick a canonical form for each cluster.
Variants are grouped by:
- Component matching — comma-separated location pieces are matched piece by piece, so "Manchester, England" matches "Manchester, Lancashire, England".
- Case and accent folding — "Müller" matches "Mueller" matches "MULLER".
- Common abbreviations — "Yorks" → "Yorkshire", "Co. Cork" → "County Cork".
- Ordering tolerance — "England, Manchester" matches "Manchester, England".
Each cluster shows the count of records using each variant, so you can pick the canonical form (usually the longest, fully-qualified version) and apply it back in your editor.
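The four grouping rules above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Location Standardiser's actual logic: the `ABBREVIATIONS` table holds only the two examples from the list, and `fold`, `components` and `same_place` are hypothetical names.

```python
import unicodedata

# Assumed abbreviation table containing only the examples given in the text.
ABBREVIATIONS = {"yorks": "yorkshire", "co. cork": "county cork"}


def fold(text: str) -> str:
    """Case and accent folding: strip diacritics, then casefold."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()


def components(place: str) -> frozenset[str]:
    """Split on commas, fold each piece, expand known abbreviations.

    Returning a set makes the comparison order-tolerant.
    """
    parts = (fold(piece) for piece in place.split(","))
    return frozenset(ABBREVIATIONS.get(p, p) for p in parts if p)


def same_place(a: str, b: str) -> bool:
    """Component matching: one place's pieces are all found in the other."""
    ca, cb = components(a), components(b)
    return bool(ca) and bool(cb) and (ca <= cb or cb <= ca)


if __name__ == "__main__":
    print(same_place("Manchester, England", "Manchester, Lancashire, England"))
    print(same_place("England, Manchester", "Manchester, England"))
    print(same_place("Yorks", "Yorkshire, England"))
```

The subset test is deliberately loose: a bare "England" would match every English place, so a real implementation needs a guard on how specific the shorter form is. Diacritic stripping also turns "Müller" into "muller" but does not equate it with "Mueller"; transliteration pairs such as ü/ue would need an extra mapping.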
How the Vital Sharpener decides what to fix first
Not every vague date is worth fixing. A vague death date for a peripheral cousin matters less than a vague birth date for a great-great-grandparent who anchors a whole branch.
The Vital Sharpener scores every imprecise vital fact by:
- Centrality — how many descendants are downstream of this person in your tree?
- Recency — more recent vague dates are usually easier to research and produce more downstream improvement.
- Type — birth dates rank above marriage dates which rank above death dates.
- Source potential — facts in eras with strong record coverage (e.g. 1841+ in England) are worth fixing first because the records exist.
The result is a prioritised work list. Spend an hour at the top of the list and your tree's quality score will jump noticeably; spend it at the bottom and the change is barely measurable.
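As a rough illustration of how four such factors could combine into a single ranking, the Python sketch below multiplies them together. Only the factor names and the birth > marriage > death ordering come from the text; the numeric weights, the recency curve, the `VagueFact` structure and the `priority` function are all assumptions made for the example, not GEDminer's formula.

```python
from dataclasses import dataclass

# Assumed weights: the text gives only the ordering birth > marriage > death.
TYPE_WEIGHT = {"birth": 3.0, "marriage": 2.0, "death": 1.0}


@dataclass
class VagueFact:
    person: str
    fact_type: str           # "birth", "marriage" or "death"
    approx_year: int         # e.g. 1850 for "abt 1850"
    descendants: int         # people downstream of this person in the tree
    well_recorded_era: bool  # e.g. 1841 or later in England


def priority(fact: VagueFact, current_year: int = 2025) -> float:
    """Higher score = more worthwhile to research first (hypothetical formula)."""
    centrality = 1 + fact.descendants
    # More recent facts score closer to 1.0; older ones decay gently.
    recency = 1.0 / (1 + (current_year - fact.approx_year) / 100)
    source_potential = 1.5 if fact.well_recorded_era else 1.0
    return centrality * TYPE_WEIGHT[fact.fact_type] * recency * source_potential


if __name__ == "__main__":
    facts = [
        VagueFact("Peripheral cousin", "death", 1900, 0, True),
        VagueFact("Great-great-grandfather", "birth", 1850, 40, True),
    ]
    for fact in sorted(facts, key=priority, reverse=True):
        print(f"{fact.person}: {priority(fact):.1f}")
```

With these assumed weights the anchoring ancestor's birth date sorts well above the cousin's death date, which matches the intuition described above.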
Sourcing: the slowest but most valuable cleanup step
Adding citations is slow because every fact needs a real source attached. But sourcing is what turns a tree from a private collection into a research-grade asset that other people can verify.
GEDminer measures sourcing as a single percentage: total citations divided by total facts, multiplied by 100. A tree at 0% has no sources. A tree at 100% averages one citation per fact; because the figure is a ratio of totals, the per-fact source coverage report is what shows whether any individual fact is still uncited. Most published, research-grade trees land in the 50–80% range.
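Taking that definition literally, the arithmetic looks like the short Python sketch below. The input format (a list holding the number of citations attached to each fact) and both function names are assumptions for illustration; how GEDminer counts a citation shared by several facts is not stated here.

```python
def sourcing_percentage(citations_per_fact: list[int]) -> float:
    """Total citations divided by total facts, times 100."""
    total_facts = len(citations_per_fact)
    if total_facts == 0:
        return 0.0
    return 100.0 * sum(citations_per_fact) / total_facts


def fact_coverage(citations_per_fact: list[int]) -> float:
    """Share of facts that have at least one citation, as a percentage."""
    total_facts = len(citations_per_fact)
    if total_facts == 0:
        return 0.0
    cited = sum(1 for count in citations_per_fact if count > 0)
    return 100.0 * cited / total_facts


if __name__ == "__main__":
    tree = [2, 0, 1, 1]  # four facts; the second has no citation
    print(sourcing_percentage(tree), fact_coverage(tree))
```

The two numbers can diverge: in the example, one doubly cited fact lifts the ratio of totals while one fact out of four still has no source, which is why the per-fact view is worth checking alongside the headline percentage.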
The cleanup pattern that works:
- start by attaching citations to vital facts (birth, marriage, death) of every direct ancestor,
- then to vital facts of every collateral ancestor within four generations,
- then to non-vital facts (occupation, residence, immigration) for direct ancestors,
- and only then to peripheral relatives.
This pattern produces the largest jump in the sourcing score for the smallest investment of time.
When to re-export and re-analyse
Re-export and re-analyse after each significant cleanup pass — after a duplicate merge session, after a place-name standardisation pass, after a sourcing sprint. Comparing scores between exports is the only objective way to know whether your work is producing measurable improvement.
A useful rhythm: end every research session with a quick re-export and a glance at the score components. Set a target (e.g. "I want sourcing above 60% by the end of the year") and let the score guide your effort allocation between the cleanup categories.
Step-by-step guides
Fix Garbled Names in GEDCOM Files: Character Encoding Guide
If accented characters, apostrophes, or non-Latin scripts look broken in your family tree, the problem is almost always character encoding. Here is how to diagnose and fix it.
Finding and Merging Duplicate Individuals
Find potential duplicate individuals in your tree using smart matching, compare their records side-by-side, and learn best practices for merging them.
Standardising Locations in Your Family Tree
Inconsistent place names cause mapping errors and missed connections. Learn how to identify and fix location formatting issues across your tree.
Using the Vital Sharpener to Improve Date Precision
The Vital Sharpener helps you identify estimated, incomplete, or missing vital records and prioritise which ones to research first for maximum impact.
Understanding Your Tree Health and Data Quality Score
Learn how GEDminer evaluates your tree's data quality, what the health score means, and practical steps to improve your tree's completeness and accuracy.
How to Cite Genealogy Sources Properly
Source citations turn a family tree into evidence. Learn the standard format, how to record citations efficiently, and how GEDminer measures your sourcing.
How to Organise Your Genealogy Research
Genealogy generates hundreds of files, screenshots, and notes. A simple organisation system - plus regular GEDCOM analysis - keeps your research efficient and shareable.
10 Common Genealogy Mistakes and How to Avoid Them
Even experienced researchers make these mistakes. Learn the 10 most common genealogy errors — from unsourced facts to name assumptions — and how GEDminer helps you catch them.
How to Merge Two GEDCOM Files Without Losing Data
Combining two family trees is one of the most error-prone tasks in genealogy. This guide walks through a safe merge workflow using GEDminer to spot duplicates, conflicting dates, and overlapping branches before you commit.
Frequently asked questions
Where should I start when cleaning up an inherited tree?
Encoding first (so names render correctly), then duplicates (so you don't fix the same record twice), then dates and places, and finally sourcing. GEDminer presents tools in roughly that order.
Will cleaning up my tree change the GEDCOM file?
GEDminer is read-only. It tells you what to clean and shows you the records involved; the actual edits happen in your usual genealogy program before you re-export.
How does GEDminer decide which places are duplicates?
The Location Standardiser groups places by canonical form (case folding, comma-separated component matching, common abbreviations) and shows the variant counts so you can pick a master spelling.
Does the data quality score actually mean anything?
Yes. It's a weighted sum of completeness (40%), sourcing (30%) and consistency (30%). The score is comparable across exports, so you can tell whether a cleanup pass genuinely improved the tree.
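As a worked example of that weighting, the Python sketch below combines three component scores into one figure. The 40/30/30 weights come from the answer above; the assumption that each component is reported on a 0 to 100 scale, and the names `WEIGHTS` and `quality_score`, are illustrative only.

```python
# Weights from the text; component scores are assumed to be on a 0-100 scale.
WEIGHTS = {"completeness": 0.40, "sourcing": 0.30, "consistency": 0.30}


def quality_score(components: dict[str, float]) -> float:
    """Weighted sum of the three component scores."""
    return sum(weight * components[name] for name, weight in WEIGHTS.items())


if __name__ == "__main__":
    before = {"completeness": 80, "sourcing": 50, "consistency": 90}
    after = {"completeness": 80, "sourcing": 60, "consistency": 90}
    print(round(quality_score(before), 1), round(quality_score(after), 1))
```

Under these assumptions, a ten-point gain in sourcing moves the overall score by three points, which gives a sense of how much each cleanup category can contribute.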
How long does a typical cleanup take?
For a 5,000-person tree, expect a few evenings' work to clear the high-impact items. The tool surfaces the top issues first so you get visible improvements quickly.
How do I merge two GEDCOM files without creating duplicates?
Run each file through the analyzer separately first to clean the obvious issues, then merge in your editor. Re-import the merged file into GEDminer and use the Duplicate Finder to catch the duplicates the merge created — there will always be some.
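This page describes the Duplicate Finder as combining phonetic and fuzzy matching, and the surname clustering as Soundex plus Levenshtein. The Python sketch below shows one way those two techniques can be blended into a single confidence value for a pair of names. The equal weighting and the `match_confidence` function are assumptions for illustration; a real duplicate check would also compare dates, places and relatives, which this sketch ignores.

```python
_CODES = {
    **dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"), "L": "4",
    **dict.fromkeys("MN", "5"), "R": "6",
}


def soundex(name: str) -> str:
    """Standard four-character Soundex code (ASCII letters only)."""
    letters = [c for c in name.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not separate repeated codes
            prev = code
    return (result + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Edit distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def match_confidence(a: str, b: str) -> float:
    """Blend of fuzzy and phonetic similarity in [0, 1] (assumed 50/50 weights)."""
    a, b = a.casefold(), b.casefold()
    longest = max(len(a), len(b)) or 1
    fuzzy = 1 - levenshtein(a, b) / longest
    code = soundex(a)
    phonetic = 1.0 if code and code == soundex(b) else 0.0
    return 0.5 * fuzzy + 0.5 * phonetic


if __name__ == "__main__":
    for pair in [("Smith", "Smyth"), ("Smith", "Smithe"), ("Smith", "Jones")]:
        print(pair, round(match_confidence(*pair), 2))
```

An adjustable confidence setting then amounts to a threshold on this value: raise it to see only near-certain pairs, lower it to surface more speculative ones for manual review.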
My GEDCOM has thousands of place-name variants — can I bulk-standardise them?
GEDminer surfaces the variants and the canonical suggestion for each cluster, but the actual edits happen in your editor. Many editors support bulk find-and-replace on places, which is the fastest way to apply the standardised forms.
How do I know which sources to add first?
The sourcing report ranks unsourced facts by centrality (how many descendants depend on the person) and by era (facts in well-recorded eras are easier to source). Working top-down gives the largest score improvement per hour.
What's the difference between cleaning and merging a tree?
Cleaning means improving the data already in your tree (fix errors, tighten dates, add sources). Merging means combining two trees into one. You should clean both trees before merging them, then re-clean the merged result — merge always creates new duplicates.
Will cleaning my tree improve the privacy of my data?
Indirectly, yes. A clean, well-structured tree is easier to selectively share. GEDminer also has a Presenter Mode that obscures names and specific dates if you want to demo or screenshot the tree.
Can the analyzer suggest a recommended target score for my tree size?
Yes. Each analysis shows your score against a community percentile for trees of similar size, so you can see whether your tree is above, around, or below typical and set a realistic target.
Related tools
Find and Fix Family Tree Errors Automatically
Detect impossible dates, duplicate ancestors, missing parents and broken relationships in your family tree in seconds. Free, browser-based GEDCOM error checker.
Free GEDCOM Analyzer: Inspect, Validate and Visualise Your Family Tree Online
Upload a .ged file and get instant analysis: errors, duplicates, missing dates, migration maps, census gaps and a data quality score. 100% in-browser, no signup required.
Genealogy Data Analysis: Statistics, Maps and Patterns from Your GEDCOM File
Turn your family tree into insight: birth-year charts, migration maps, surname distributions, occupation breakdowns and lifespan trends. Free GEDCOM analytics.
Ready to analyse your tree?
Drop your .ged file into GEDminer and get a full diagnostic in seconds. Your file never leaves your browser.
Upload GEDCOM file