Avoiding Dataset Leakage: Annotation & Versioning Across MVCreate's Defect Library

An AI model's biggest moat is its dataset. But bigger is not automatically better: the data must be kept separate by cell type, labeled consistently, and traceable across versions. We have been accumulating PV defect data since 2018, and every pitfall we hit through 2024 traces back to one sentence: "two labelers look at the same EL image and disagree." This article describes how MVCreate's annotation and versioning practice evolved as the library grew from 50K images across 3 cell types to 2M images across 9 cell types.

1. Why multi-cell-type libraries can't be merged

Originally we organized the library by defect type — microcracks together, broken fingers together, all cell types blended. This worked through 2019 but broke in 2021, because:

Defect        | PERC                      | TOPCon                   | HJT                             | IBC
Microcrack    | Dark line, 50–200 μm wide | 30–150 μm                | 20–100 μm                       | Often shorter, branched
Broken finger | Visible busbar break      | Busbar + finger together | Dark patches on transparent TCO | No busbars
Black spot    | 200 μm – 2 mm             | 100–500 μm               | Invisible in EL (PL only)       | Different morphology entirely

Key insight: "the same defect" looks like different objects across cell types. Mixing PERC microcracks with IBC microcracks makes the model learn an "averaged microcrack" — accurate for neither.

We restructured in 2021 so that each defect type is subdivided by cell type. 29 defect types × 9 cell types gives 261 possible combinations, of which about 180 are valid sub-libraries. Training selects per-cell-type subsets, combining a shared representation with per-type fine-tuning.
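The partitioning can be sketched in a few lines of Python. This is only an illustration: the sample fields (`image_id`, `defect`, `cell_type`) and both function names are assumptions, not MVCreate's actual schema.

```python
from collections import defaultdict

def partition(samples):
    """Group image IDs into sub-libraries keyed by (defect, cell_type)."""
    sublibs = defaultdict(list)
    for s in samples:
        # Hypothetical sample schema: dicts with defect, cell_type, image_id.
        sublibs[(s["defect"], s["cell_type"])].append(s["image_id"])
    return dict(sublibs)

def training_subset(sublibs, cell_type):
    """Select every sub-library that belongs to one cell type."""
    return {key: ids for key, ids in sublibs.items() if key[1] == cell_type}

samples = [
    {"image_id": "a1", "defect": "microcrack", "cell_type": "PERC"},
    {"image_id": "a2", "defect": "microcrack", "cell_type": "IBC"},
    {"image_id": "a3", "defect": "black_spot", "cell_type": "PERC"},
]
sublibs = partition(samples)
perc_only = training_subset(sublibs, "PERC")
```

Keying on the pair rather than on the defect alone is what prevents PERC and IBC microcracks from ever landing in the same training pool by accident.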

2. The "drift problem" of annotation consistency

The second killer is annotation drift — the same labeler's judgments slowly shift over time:

  • New-hire phase (first 3 months): conservative — "suspicious" labeled as "non-defect";
  • Mature phase (3–12 months): aggressive — highest detection;
  • Fatigue phase (12 months+): boundaries soften — "edge brightness" mislabeled as "black edge".

After 3 years, old vs new labels disagree at the boundaries, and models "fit old data well, lose accuracy on new data."

Our solution combines three-layer verification with calibrator rotation:

Layer 1: Labelers

Each image is first labeled by one labeler. The record stores the labeler ID, timestamp, label, and the labeler's self-reported confidence.

Layer 2: Verifiers

Each day, a random 5% of that day's labels is sampled and reviewed by senior verifiers, who issue a pass, reject, or relabel decision. Verifiers rotate every 6 months to prevent verifier drift.

Layer 3: Gold-standard set

We maintain a 5,000-image gold-standard set (jointly labeled by 5 industry experts, with contested samples removed). Every month all labelers are tested against the gold set; labelers who score below 90% pause labeling for a week of retraining.
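A minimal sketch of the monthly gold-set check follows. It assumes the score is plain accuracy against the gold labels (the exact metric is not stated above) and that an unlabeled gold image counts as a miss; the function names and data layout are hypothetical.

```python
RETRAIN_THRESHOLD = 0.90  # from the text: below 90% triggers retraining

def gold_set_score(answers, gold):
    """Fraction of gold images the labeler matched (assumed metric: accuracy)."""
    if not gold:
        raise ValueError("gold set is empty")
    correct = sum(1 for image_id, label in gold.items()
                  if answers.get(image_id) == label)
    return correct / len(gold)

def labelers_to_retrain(all_answers, gold, threshold=RETRAIN_THRESHOLD):
    """Return the IDs of labelers whose gold-set score falls below the threshold."""
    return sorted(labeler for labeler, answers in all_answers.items()
                  if gold_set_score(answers, gold) < threshold)
```

Called once a month with every labeler's answers on the gold images, `labelers_to_retrain` yields the list of people to pull from the queue.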

Annotation consistency (Cohen's kappa) rose from 0.71 (2019) to 0.91 (2024).
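For readers unfamiliar with the metric: Cohen's kappa compares the observed agreement p_o between two raters with the agreement p_e expected by chance, as kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two label lists over the same images. How the figures above were aggregated across many labelers is not stated, so this shows only the two-rater case.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)
```

Unlike raw percent agreement, kappa does not reward two labelers for both marking almost everything "non-defect", which matters when defects are rare.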

3. Git-like version control

The third challenge is versioning. Our earliest scheme tagged datasets by month (dataset_2020_03 ...) and broke quickly. One monthly update introduced 200 mislabels, and by the time they were found it had already been used to train 3 deployed models. Rolling back required retraining all three.

In 2022 we built git-like version control:

3.1 The "commit" concept

Each new batch is a "commit" that records the committer, timestamp, affected types, source (production / customer / synthetic), and verification status.
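One way to picture such a commit record is a small immutable structure whose hash is derived from its own fields. The class name, field names, status strings, and the choice of SHA-256 over a JSON dump are all assumptions made for illustration; the actual record format is not described here.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple

SOURCES = {"production", "customer", "synthetic"}  # the three sources named in the text

@dataclass(frozen=True)
class DataCommit:
    committer: str
    timestamp: str                   # ISO 8601 string, e.g. "2022-05-01T08:00:00Z"
    affected_types: Tuple[str, ...]  # e.g. ("microcrack/PERC",)
    source: str
    verification_status: str         # hypothetical values: "verified", "pending"
    parent: Optional[str] = None     # hash of the previous commit, if any

    def __post_init__(self):
        if self.source not in SOURCES:
            raise ValueError(f"unknown source: {self.source}")

    @property
    def commit_hash(self) -> str:
        """Deterministic hash over all recorded fields."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()
```

Because the parent hash is part of the hashed payload, changing any earlier commit changes every later hash, which is the property that makes a commit hash a reliable pointer to one exact dataset state.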

3.2 Branches

main accepts only fully verified data; experimental branches accept under-verified data. Models train against main at a specific commit hash.

3.3 Rollback

A bad commit can be rolled back; downstream models retrain against the rolled-back version.
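The branch rule and rollback can be sketched with an in-memory history. This toy `DatasetRepo` class is hypothetical, and it models rollback as truncating the branch at the bad commit (similar to a hard reset); a revert-style rollback that keeps later commits would be an equally plausible reading.

```python
class DatasetRepo:
    """Toy commit history: main takes verified data only (illustrative sketch)."""

    def __init__(self):
        self.branches = {"main": [], "experimental": []}

    def commit(self, branch, commit_id, verified):
        if branch == "main" and not verified:
            raise ValueError("main accepts only fully verified data")
        self.branches[branch].append(commit_id)

    def head(self, branch="main"):
        history = self.branches[branch]
        return history[-1] if history else None

    def rollback(self, bad_commit, branch="main"):
        """Drop bad_commit and every later commit; return the removed IDs."""
        history = self.branches[branch]
        idx = history.index(bad_commit)  # raises ValueError if not on this branch
        removed = history[idx:]
        del history[idx:]
        return removed
```

Any model whose recorded training commit appears in the list returned by `rollback` is a candidate for retraining against the new head.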

3.4 Tooling

Built on DVC + S3 + a custom annotation-quality dashboard. Every deployed model carries the training-data commit hash in metadata — any issue traces to a specific data version.
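To illustrate the traceability idea only: the sketch below stores a data commit hash in a JSON metadata blob per model and looks up which models were trained on a given set of commits. The metadata keys and function names are assumptions; no DVC or S3 calls are shown because the actual integration is not described.

```python
import json

def model_metadata(model_name, data_commit, branch="main"):
    """Serialize the metadata a deployed model would carry (hypothetical keys)."""
    return json.dumps({
        "model": model_name,
        "data_branch": branch,
        "data_commit": data_commit,
    })

def models_trained_on(metadata_blobs, commits):
    """Return names of models whose training data commit is in `commits`."""
    wanted = set(commits)
    affected = []
    for blob in metadata_blobs:
        meta = json.loads(blob)
        if meta["data_commit"] in wanted:
            affected.append(meta["model"])
    return sorted(affected)
```

With this in place, the question "which deployed models saw the bad batch?" becomes a lookup rather than an investigation.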

4. Avoiding leakage with synthetic data

Customers ask: "can you synthesize training data?" — yes, with discipline.

Synthetic data (GAN/Diffusion-generated EL images) helps:

  • Long-tail classes can be quickly padded;
  • Data privacy concerns minimal;
  • Cheap.

But it carries a leakage risk: if both train and test contain images from the same generator, the model overfits the generator's quirks, so test scores look good while real-line performance degrades.

Our rules:

  1. Real-first: synthetic ≤ 30% of any single class's training set;
  2. Pure-real test sets: all benchmark test sets are 100% real production data;
  3. Multi-generator mixing: synthetic data must come from ≥ 3 different GAN/Diffusion models;
  4. Physical consistency check: every synthetic image passes a "physical plausibility" filter (intensity distribution, contrast, defect-edge sharpness); failures are discarded.
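Rules 1 to 3 lend themselves to an automated check before any training run. The sketch below is one possible form: the sample schema (`label`, `source`, `generator`) is hypothetical, and it assumes the generator-count rule applies to the training set as a whole, which the rule as written leaves open. Rule 4 is omitted because its thresholds are not given.

```python
MAX_SYNTHETIC_SHARE = 0.30  # rule 1
MIN_GENERATORS = 3          # rule 3

def check_split(train, test):
    """Return a list of rule violations for a train/test split (empty if clean)."""
    violations = []
    # Rule 2: benchmark test sets must be 100% real.
    if any(s["source"] == "synthetic" for s in test):
        violations.append("test set contains synthetic images")
    # Rule 1: synthetic share per class in the training set.
    per_class = {}
    for s in train:
        total, synth = per_class.get(s["label"], (0, 0))
        per_class[s["label"]] = (total + 1, synth + (s["source"] == "synthetic"))
    for label, (total, synth) in sorted(per_class.items()):
        if synth / total > MAX_SYNTHETIC_SHARE:
            violations.append(f"class {label}: synthetic share {synth}/{total} exceeds 30%")
    # Rule 3: synthetic images must come from at least three generators.
    generators = {s.get("generator") for s in train if s["source"] == "synthetic"}
    if generators and len(generators) < MIN_GENERATORS:
        violations.append(f"only {len(generators)} generator(s) used")
    return violations
```

Running the check as a gate in the training pipeline means a leaky split fails loudly instead of producing an inflated benchmark number.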

5. How customer feedback enters the model

Each customer line's data has unique value — process signature, defect distribution, scenario context — that lab-synthesized data can't replicate.

Our customer-feedback mechanism:

  1. Raw images stay on the customer line and are never uploaded;
  2. Local annotation: customer engineers label their own results;
  3. Aggregated "label + feature vector" upload: only labels and feature vectors leave the site;
  4. Federated update: the central model updates from the aggregated vectors, and the new model is redistributed.

This preserves data privacy while letting MVCreate's central model learn from every line. In Q4 2024, 14 customers participated, averaging ~120K new valid labels per month.
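As a rough illustration of steps 3 and 4, the sketch below has each site reduce its labeled feature vectors to per-class sums and counts, and the center combine them into count-weighted class means. This is a simple prototype-averaging scheme chosen for clarity, not a description of the update rule MVCreate actually uses, and both function names are hypothetical.

```python
def site_summary(labeled_features):
    """On site: reduce (label, vector) pairs to per-class vector sums and counts."""
    summary = {}
    for label, vec in labeled_features:
        total, count = summary.get(label, ([0.0] * len(vec), 0))
        summary[label] = ([t + v for t, v in zip(total, vec)], count + 1)
    return summary

def aggregate(site_summaries):
    """Center: merge site summaries into one count-weighted mean vector per class."""
    merged = {}
    for summary in site_summaries:
        for label, (total, count) in summary.items():
            m_total, m_count = merged.get(label, ([0.0] * len(total), 0))
            merged[label] = ([a + b for a, b in zip(m_total, total)], m_count + count)
    return {label: [t / count for t in total]
            for label, (total, count) in merged.items()}
```

Only the summaries cross the site boundary in this sketch; weighting by count keeps a line that contributes many labels from being averaged on equal footing with one that contributes few.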

6. Recommendations for the industry

PV inspection AI is still young. Maturity varies. A few suggestions:

  1. Start versioning early — wait until 1M images and it's too late;
  2. Labeler rotation + gold-set calibration is mandatory — don't skimp;
  3. Cell-type partitioning is mandatory — never merge for convenience;
  4. Customer privacy + federated learning will be table stakes by 2027, so plan ahead.

For dataset management methodology exchange or federated-learning onboarding, contact MVCreate at +86 159-5048-9233.

Originally published by Vision Potential (Nanjing MVCreate Intelligent Technology Co., Ltd.). Reproductions must credit the source.