The false positive problem
You're building a deal detector. The logic: if the listed price is significantly below the median price for that item, flag it as a deal.
You compute the median across all listings for a search keyword. An item at €25 gets flagged as a deal because the median is €60.
But the €60 median is the median across mixed sizes. Size XL listings average €70; size XS listings average €25. The €25 item is median for its size — not a deal at all. Your detector is producing noise.
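The failure mode is easy to reproduce. A minimal sketch with invented listings (prices chosen for illustration, not from real data):

```python
import statistics

# Hypothetical listings returned for one search keyword, mixing sizes.
listings = [
    {"size": "xl", "price": 68}, {"size": "xl", "price": 70},
    {"size": "xl", "price": 72}, {"size": "xs", "price": 24},
    {"size": "xs", "price": 25}, {"size": "xs", "price": 26},
]

pooled = statistics.median(i["price"] for i in listings)
xs_only = statistics.median(i["price"] for i in listings if i["size"] == "xs")

print(pooled)   # 47.0 — against this, the €25 XS item looks like a steep deal
print(xs_only)  # 25 — within its own size, it is exactly the median
```

The pooled median sits between the two size clusters and describes neither of them.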
Why aggregating across attributes destroys signal
Price is not a single distribution across a product category — it's a family of distributions, one per meaningful attribute combination. Items returned by the same keyword search can differ in:
- Size: a size W28 jean and a size W38 jean are not comparable price benchmarks
- Condition: new with tags vs heavily used are different markets
- Brand tier: luxury vs high street vs fast fashion have different price floors
- Gender target: men's and women's versions of the same garment often price differently
If you compute a median across all of these, you're averaging apples and oranges. The result is a number that doesn't accurately describe any individual market.
The fix: segment first, score within segment
```python
def compute_segment_key(item: dict) -> str:
    """Create a segment key from the attributes that drive price."""
    size = normalise_size(item.get("size", ""))
    condition = item.get("condition", "unknown")
    brand_tier = classify_brand(item.get("brand", ""))  # brand-tier classifier, defined elsewhere
    return f"{condition}:{size}:{brand_tier}"
```
```python
import statistics

def score_item(item: dict, all_items: list[dict]) -> float | None:
    segment = compute_segment_key(item)
    comparable = [i for i in all_items if compute_segment_key(i) == segment]
    if len(comparable) < 5:  # not enough data for a reliable median
        return None
    median_price = statistics.median(i["price"] for i in comparable)
    if median_price == 0:
        return None
    return (median_price - item["price"]) / median_price  # positive = below median
```
A positive score means the item is priced below its segment's median. Now you're comparing like with like.
Size normalisation is non-trivial
Size labels are inconsistent. "M", "Medium", "38", "EU 38", "UK 10" can all mean the same thing — or different things depending on brand and gender targeting.
Before you can segment by size, you need a normalisation step:
```python
SIZE_MAP = {
    "xs": ["xs", "extra small", "34", "eu 34"],
    "s": ["s", "small", "36", "eu 36"],
    "m": ["m", "medium", "38", "eu 38"],
    "l": ["l", "large", "40", "eu 40"],
    "xl": ["xl", "extra large", "42", "eu 42"],
}

def normalise_size(raw: str) -> str:
    clean = raw.lower().strip()
    for canonical, variants in SIZE_MAP.items():
        if clean in variants:
            return canonical
    return "unknown"
```
Items with size = "unknown" should be scored against other unknown items — not mixed into the overall population.
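One refinement worth considering (a sketch, not part of the original design): invert the variant lists into a flat dict once, so each lookup is a single dict access instead of a scan over every list.

```python
SIZE_MAP = {
    "xs": ["xs", "extra small", "34", "eu 34"],
    "s": ["s", "small", "36", "eu 36"],
    "m": ["m", "medium", "38", "eu 38"],
    "l": ["l", "large", "40", "eu 40"],
    "xl": ["xl", "extra large", "42", "eu 42"],
}

# Built once at import time: every variant maps straight to its canonical size.
_VARIANT_TO_CANONICAL = {
    variant: canonical
    for canonical, variants in SIZE_MAP.items()
    for variant in variants
}

def normalise_size_fast(raw: str) -> str:
    return _VARIANT_TO_CANONICAL.get(raw.lower().strip(), "unknown")

print(normalise_size_fast("EU 38"))  # m
print(normalise_size_fast("UK 10"))  # unknown — UK sizing is not in the map yet
```

The behaviour is identical as long as no variant string appears under two canonical sizes.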
The minimum sample threshold
A segment median is only meaningful above a minimum sample size. With 3 items in a segment, a single anomalous listing can shift the median dramatically.
A practical floor: require at least 5–10 comparable items in a segment before computing a score. Below this threshold, return None and do not display a deal rating. A "no data" result is better than a false one.
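The instability is easy to demonstrate with invented prices. In a 3-item segment, changing one listing can move the median by a multiple; in a 10-item segment, the same single-item change barely registers:

```python
import statistics

tiny = [20, 25, 90]
print(statistics.median(tiny))  # 25

# Swap one listing and the median more than triples.
tiny[1] = 85
print(statistics.median(tiny))  # 85

# With ten comparable items, the identical single-item change
# nudges the median by one euro.
larger = [20, 22, 24, 25, 26, 27, 28, 30, 32, 90]
print(statistics.median(larger))  # 26.5
larger[4] = 85
print(statistics.median(larger))  # 27.5
```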
What this looks like at category level
For electronics, size is irrelevant — score by condition and brand tier.
For clothing, size, condition, and brand tier all matter.
For collectibles, condition and specific model matter; size is irrelevant.
The attributes that define a segment are category-specific. A generic segmentation key won't work across all categories. You need a per-category definition of what makes two items comparable.
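One way to express per-category comparability is a declarative spec mapping each category to the attributes that define its segments. `SEGMENT_ATTRS` and `segment_key` below are hypothetical names, sketched under the category rules listed above:

```python
# Hypothetical per-category spec: which attributes make two items comparable.
SEGMENT_ATTRS: dict[str, tuple[str, ...]] = {
    "clothing": ("condition", "size", "brand_tier"),
    "electronics": ("condition", "brand_tier"),
    "collectibles": ("condition", "model"),
}

def segment_key(category: str, item: dict) -> str:
    # Unknown categories fall back to condition-only segmentation.
    attrs = SEGMENT_ATTRS.get(category, ("condition",))
    return ":".join(str(item.get(a, "unknown")) for a in attrs)

print(segment_key("clothing", {"condition": "new", "size": "m", "brand_tier": "luxury"}))
# new:m:luxury
print(segment_key("electronics", {"condition": "used", "brand_tier": "mid"}))
# used:mid
```

Keeping the spec as data rather than branching logic means adding a category is a one-line change.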
The underlying principle
Before computing any aggregate statistic (median, mean, standard deviation), ask: is this population actually homogeneous? If not, segment it until it is. Statistical measures are only meaningful within comparable groups.