Avoiding Dataset Leakage: Annotation & Versioning Across MVCreate's Defect Library
The biggest moat of an AI model is its dataset. But bigger is not automatically better: the data must be non-conflated by type, consistently labeled, and traceable across versions. We have been accumulating PV defect data since 2018, and every pitfall we hit up to 2024 traces back to one situation: "two labelers look at the same EL image and disagree". This article describes how MVCreate's annotation and versioning practice evolved as the library grew from 50K images / 3 cell types to 2M images / 9 cell types.
1. Why multi-cell-type libraries can't be merged
Originally we organized the library by defect type — microcracks together, broken fingers together, all cell types blended. This worked through 2019 but broke in 2021, because:
| Defect | PERC | TOPCon | HJT | IBC |
|---|---|---|---|---|
| Microcrack | Dark line, 50–200 μm wide | 30–150 μm | 20–100 μm | Often shorter, branched |
| Broken finger | Visible busbar break | Busbar + finger together | Dark patches on transparent TCO | No busbars |
| Black spot | 200 μm – 2 mm | 100–500 μm | Invisible in EL (PL only) | Different morphology entirely |
Key insight: "the same defect" looks like different objects across cell types. Mixing PERC microcracks with IBC microcracks makes the model learn an "averaged microcrack" — accurate for neither.
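We restructured in 2021: each defect type is now subdivided by cell type. 29 defects × 9 cell types gives 261 nominal combinations, of which about 180 are valid sub-libraries, because not every defect/cell-type pairing yields a usable sub-library (black spots on HJT, for instance, are invisible in EL). Training selects per-type subsets, combining per-type fine-tuning with a shared representation.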
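The partitioning above can be sketched in a few lines of Python. This is a minimal illustration, not our production schema: the record layout, the `INVALID_PAIRS` set and the function names are all hypothetical, and the single invalid pair shown is only the example suggested by the table.

```python
from collections import defaultdict

# Hypothetical record layout: (image_id, defect_type, cell_type).
records = [
    ("img_001", "microcrack", "PERC"),
    ("img_002", "microcrack", "IBC"),
    ("img_003", "black_spot", "TOPCon"),
    ("img_004", "microcrack", "PERC"),
]

# Assumed example of a pairing that is not a valid EL sub-library
# (the table lists black spots on HJT as invisible in EL).
INVALID_PAIRS = {("black_spot", "HJT")}


def partition(records):
    """Group image IDs into sub-libraries keyed by (defect_type, cell_type)."""
    sublibs = defaultdict(list)
    for image_id, defect, cell in records:
        if (defect, cell) in INVALID_PAIRS:
            raise ValueError(f"{defect} is not a valid sub-library for {cell}")
        sublibs[(defect, cell)].append(image_id)
    return dict(sublibs)


def select_for_cell_type(sublibs, cell_type):
    """Pick the per-type training subset: every sub-library of one cell type."""
    return {key: ids for key, ids in sublibs.items() if key[1] == cell_type}


sublibs = partition(records)
perc_subset = select_for_cell_type(sublibs, "PERC")
```

Keying on the pair rather than on the defect name alone is what keeps a PERC microcrack and an IBC microcrack from ever landing in the same training bucket by accident.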
2. The "drift problem" of annotation consistency
The second killer is annotation drift — the same labeler's judgments slowly shift over time:
- New-hire phase (first 3 months): conservative — "suspicious" labeled as "non-defect";
- Mature phase (3–12 months): aggressive — highest detection;
- Fatigue phase (12 months+): boundaries soften — "edge brightness" mislabeled as "black edge".
After 3 years, old vs new labels disagree at the boundaries, and models "fit old data well, lose accuracy on new data."
Our solution: three-layer verification + calibrator rotation:
Layer 1: Labelers
Each image is initially labeled by one labeler, and we record the labeler ID, timestamp, label, and self-reported confidence.
Layer 2: Verifiers
Each day, 5% of that day's labels are randomly sampled and reviewed by senior verifiers, who issue a pass, reject, or relabel decision. Verifiers rotate every 6 months to prevent verifier drift.
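As a rough sketch of the daily 5% sampling step, assuming labels are identified by string IDs (the function name, the ID format and the seeding choice are illustrative assumptions, not a description of our tooling):

```python
import random


def sample_for_verification(label_ids, rate=0.05, seed=None):
    """Return a random subset of label IDs to send to senior verifiers.

    `rate` is the sampled fraction (5% in the text). A seed makes the
    draw reproducible, which helps when auditing a past review queue.
    """
    label_ids = list(label_ids)
    if not label_ids:
        return []
    # Always review at least one label, even on very small days.
    k = max(1, round(len(label_ids) * rate))
    rng = random.Random(seed)
    return rng.sample(label_ids, k)


# Hypothetical day with 1,000 new labels.
daily_labels = [f"label_{i:04d}" for i in range(1000)]
review_queue = sample_for_verification(daily_labels, rate=0.05, seed=20240101)
```

Sampling without replacement (`random.Random.sample`) guarantees that no label is reviewed twice in the same day.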
Layer 3: Gold-standard set
We maintain a 5,000-image gold-standard set (jointly labeled by 5 industry experts, contested samples removed). Every month all labelers are tested against the gold set; labelers below 90% pause for a week of retraining.
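The monthly gold-set test reduces to an accuracy check against a threshold. The sketch below assumes labels are stored as `{image_id: label}` dictionaries; the data layout, names and toy data are hypothetical, and only the 90% threshold comes from the text.

```python
GOLD_PASS_THRESHOLD = 0.90  # from the text: below 90% means a week of retraining


def gold_set_accuracy(gold_labels, labeler_labels):
    """Fraction of gold-set images on which the labeler matches the gold label."""
    if not gold_labels:
        raise ValueError("gold set is empty")
    hits = sum(
        1
        for image_id, gold in gold_labels.items()
        if labeler_labels.get(image_id) == gold  # a missing answer counts as a miss
    )
    return hits / len(gold_labels)


def labelers_needing_retraining(gold_labels, submissions):
    """Return the sorted IDs of labelers who scored below the threshold."""
    return sorted(
        labeler_id
        for labeler_id, labels in submissions.items()
        if gold_set_accuracy(gold_labels, labels) < GOLD_PASS_THRESHOLD
    )


# Toy gold set of 10 images (the real set has 5,000).
gold = {f"g{i}": "microcrack" if i % 2 == 0 else "non_defect" for i in range(10)}
submissions = {
    "labeler_A": dict(gold),                                          # 10/10
    "labeler_B": {**gold, "g0": "non_defect", "g2": "non_defect"},    # 8/10
    "labeler_C": {**gold, "g1": "microcrack"},                        # 9/10
}
paused = labelers_needing_retraining(gold, submissions)
```

In this toy data only `labeler_B` falls below the threshold; a labeler at exactly 90% passes, which is one reasonable reading of "below 90%".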
Annotation consistency (Cohen's kappa) rose from 0.71 (2019) to 0.91 (2024).
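For readers unfamiliar with the metric, Cohen's kappa compares observed agreement between two labelers with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal from-scratch sketch for two labelers on the same images; the toy labels are made up for illustration and are unrelated to the figures above.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two labelers who labeled the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each labeler's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both labelers used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy example: 10 images, one disagreement.
labeler_1 = ["crack"] * 5 + ["ok"] * 5
labeler_2 = ["crack"] * 4 + ["ok"] * 6
kappa = cohens_kappa(labeler_1, labeler_2)
```

Here p_o = 0.9 and p_e = 0.5, so kappa works out to 0.8: noticeably lower than the raw 90% agreement, because half of that agreement would be expected by chance.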
3. Git-like version control
The third challenge is versioning. Our earliest scheme tagged datasets by month (dataset_2020_03 ...) and broke quickly. One monthly update introduced 200 mislabels, and by the time they were caught it had already been used to train 3 deployed models. Rolling back meant retraining all three.
In 2022 we built git-like version control:
3.1 The "commit" concept
Each new batch is a "commit" recording committer, timestamp, affected types, source (production / customer / synthetic), verification status.
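A commit record of this kind could look like the sketch below. The class name, field names, allowed values and the SHA-256 content hash are assumptions made for illustration; they are not DVC's internal format, and the `parent` field is added here only to make the git analogy concrete.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple

VALID_SOURCES = {"production", "customer", "synthetic"}  # sources named in the text


@dataclass(frozen=True)
class DataCommit:
    committer: str
    timestamp: str                   # ISO 8601 string, e.g. "2022-06-01T08:30:00Z"
    affected_types: Tuple[str, ...]  # e.g. ("microcrack/PERC",)
    source: str                      # one of VALID_SOURCES
    verification_status: str         # e.g. "verified" or "pending" (assumed values)
    parent: Optional[str] = None     # hash of the previous commit, if any

    def __post_init__(self):
        if self.source not in VALID_SOURCES:
            raise ValueError(f"unknown source: {self.source}")

    def commit_hash(self) -> str:
        """Deterministic hash of the commit metadata."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


# Hypothetical example commit.
commit = DataCommit(
    committer="labeler_017",
    timestamp="2022-06-01T08:30:00Z",
    affected_types=("microcrack/PERC", "broken_finger/TOPCon"),
    source="production",
    verification_status="verified",
)
digest = commit.commit_hash()
```

Serializing with `sort_keys=True` keeps the hash stable regardless of field order, so the same metadata always maps to the same identifier.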
3.2 Branches
main accepts only fully verified data; experimental branches accept under-verified data. Models train against main at a specific commit hash.
3.3 Rollback
A bad commit can be rolled back; downstream models retrain against the rolled-back version.
3.4 Tooling
Built on DVC + S3 + a custom annotation-quality dashboard. Every deployed model carries the training-data commit hash in metadata — any issue traces to a specific data version.
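To illustrate why the commit hash in model metadata matters, here is a small sketch of rollback plus impact analysis. It assumes a linear history on main and a simple name-to-hash registry; the short hashes, model names and function names are all made up, and a real setup would read both from the versioning tool rather than from literals.

```python
# Hypothetical linear history on main, oldest first.
history = ["a1f3", "b27c", "c9d0", "d4e8"]

# Hypothetical registry: deployed model name -> training-data commit hash.
model_registry = {"perc_v7": "b27c", "topcon_v3": "c9d0", "hjt_v2": "d4e8"}


def rollback(history, bad_commit):
    """Return the history as it was just before `bad_commit` was added."""
    if bad_commit not in history:
        raise ValueError(f"unknown commit: {bad_commit}")
    return history[: history.index(bad_commit)]


def models_to_retrain(model_registry, history, bad_commit):
    """Models trained at the bad commit or at any later commit that contains it."""
    if bad_commit not in history:
        raise ValueError(f"unknown commit: {bad_commit}")
    tainted = set(history[history.index(bad_commit):])
    return sorted(
        name for name, commit in model_registry.items() if commit in tainted
    )


clean_history = rollback(history, "c9d0")
affected = models_to_retrain(model_registry, history, "c9d0")
```

With these toy values, rolling back `c9d0` leaves `perc_v7` untouched and flags `topcon_v3` and `hjt_v2` for retraining, which is exactly the question the old month-tagged scheme could not answer.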
4. Avoiding leakage with synthetic data
Customers ask: "can you synthesize training data?" — yes, with discipline.
Synthetic data (GAN/Diffusion-generated EL images) helps in three ways:
- Long-tail classes can be padded quickly;
- Privacy concerns are minimal;
- It is cheap.
But it carries a leakage risk — if both train and test contain images from the same generator, the model overfits the generator's quirks and real-line performance degrades.
Our rules:
- Real-first: synthetic ≤ 30% of any single class's training set;
- Pure-real test sets: all benchmark test sets are 100% real production data;
- Multi-generator mixing: synthetic data must come from ≥ 3 different GAN/Diffusion models;
- Physical consistency check: every synthetic image passes a "physical plausibility" filter (intensity distribution, contrast, defect-edge sharpness); failures are discarded.
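The first three rules are mechanical enough to check automatically before training. The sketch below is one possible check, with a hypothetical `Sample` type in which `generator=None` marks a real image; it applies the generator-count rule across the whole training set, which is an assumption, and it leaves out the physical plausibility filter, which depends on image content.

```python
from dataclasses import dataclass
from typing import Optional

MAX_SYNTHETIC_PERCENT = 30  # rule: synthetic <= 30% of any single class
MIN_GENERATORS = 3          # rule: at least 3 different generators


@dataclass(frozen=True)
class Sample:
    image_id: str
    defect_class: str
    generator: Optional[str] = None  # None means real production data


def check_split(train, test):
    """Return a list of rule violations (empty if the split is acceptable)."""
    violations = []
    # Pure-real test sets: no synthetic image may appear in the test split.
    if any(s.generator is not None for s in test):
        violations.append("test set contains synthetic images")
    # Real-first: per-class synthetic share, compared with integer arithmetic.
    by_class = {}
    for s in train:
        by_class.setdefault(s.defect_class, []).append(s)
    for cls, samples in sorted(by_class.items()):
        synthetic = sum(1 for s in samples if s.generator is not None)
        if synthetic * 100 > MAX_SYNTHETIC_PERCENT * len(samples):
            violations.append(f"{cls}: synthetic share above {MAX_SYNTHETIC_PERCENT}%")
    # Multi-generator mixing: only enforced when synthetic data is present.
    generators = {s.generator for s in train if s.generator is not None}
    if generators and len(generators) < MIN_GENERATORS:
        violations.append(f"fewer than {MIN_GENERATORS} generators in training set")
    return violations


# Toy split: 7 real + 3 synthetic microcrack images from 3 generators.
train = [Sample(f"real_{i}", "microcrack") for i in range(7)] + [
    Sample(f"syn_{i}", "microcrack", generator=f"gen_{i}") for i in range(3)
]
test = [Sample("real_test_0", "microcrack")]
violations = check_split(train, test)
```

The toy split passes with exactly 30% synthetic data; adding one more synthetic image, or reusing a single generator, would produce a violation message instead of an empty list.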
5. How customer feedback enters the model
Each customer line's data has unique value — process signature, defect distribution, scenario context — that lab-synthesized data can't replicate.
Our customer-feedback mechanism:
- Raw images stay on the customer line and are never uploaded;
- Local annotation: customer engineers label their own results;
- Aggregated "label + feature vector" upload: only labels and feature vectors leave the site, never images;
- Federated update: the central model updates from the aggregated vectors, and the new model is redistributed.
This preserves data privacy while letting MVCreate's central model learn from every line. In Q4 2024, 14 customers participated, averaging ~120K new valid labels per month.
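The text does not spell out the update rule, so the following is only one simple way such an aggregation could work, shown for intuition: each site reduces its (label, feature vector) pairs to per-class sums and counts, and the center merges them into count-weighted class means. All names and the toy vectors are hypothetical, and a production system would typically add secure aggregation and a proper model update on top.

```python
def aggregate_site(labeled_features):
    """On-site step: reduce (label, vector) pairs to per-class (sum, count)."""
    summary = {}
    for label, vec in labeled_features:
        total, count = summary.get(label, ([0.0] * len(vec), 0))
        summary[label] = ([t + v for t, v in zip(total, vec)], count + 1)
    return summary


def merge_sites(site_summaries):
    """Central step: combine site summaries into count-weighted class means."""
    merged = {}
    for summary in site_summaries:
        for label, (total, count) in summary.items():
            acc_total, acc_count = merged.get(label, ([0.0] * len(total), 0))
            merged[label] = (
                [a + t for a, t in zip(acc_total, total)],
                acc_count + count,
            )
    return {
        label: [t / count for t in total]
        for label, (total, count) in merged.items()
    }


# Toy data: two customer sites, 2-dimensional feature vectors.
site_a = aggregate_site([("microcrack", [1.0, 0.0]), ("microcrack", [3.0, 0.0])])
site_b = aggregate_site([("microcrack", [2.0, 6.0])])
class_means = merge_sites([site_a, site_b])
```

Because only sums and counts are exchanged, the center recovers the same class mean it would get from pooling all vectors, without any single image or per-image vector leaving the site in this simplified variant.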
6. Recommendations for the industry
PV inspection AI is still young. Maturity varies. A few suggestions:
- Start versioning early: if you wait until the library reaches 1M images, it is too late;
- Labeler rotation + gold-set calibration is mandatory — don't skimp;
- Cell-type partitioning is mandatory — never merge for convenience;
- Customer privacy + federated learning will be table-stakes by 2027 — plan ahead.
For dataset management methodology exchange or federated-learning onboarding, contact MVCreate at +86 159-5048-9233.
Originally published by Vision Potential (Nanjing MVCreate Intelligent Technology Co., Ltd.). Reproductions must credit the source.
