The false positive problem
You're building a deal detector. The logic: if the listed price is significantly below the median price for that item, flag it as a deal.
You compute the median across all listings for a search keyword. An item at €25 gets flagged as a deal because the median is €60.
But the €60 median is computed across mixed sizes. Size XL listings cluster around €70; size XS listings cluster around €25. The €25 item sits at the typical price for its size, so it is not a deal at all. Your detector is producing noise.
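To make the effect concrete, here is a minimal sketch with made-up listings (the prices and the `discount` helper are illustrative assumptions, not real data). The same €25 item looks heavily discounted against the pooled median and not discounted at all against its own size:

```python
import statistics

# Hypothetical listings for one keyword search; prices in euros.
listings = [
    {"size": "xs", "price": 24}, {"size": "xs", "price": 25}, {"size": "xs", "price": 26},
    {"size": "m", "price": 55}, {"size": "m", "price": 60}, {"size": "m", "price": 62},
    {"size": "xl", "price": 68}, {"size": "xl", "price": 70}, {"size": "xl", "price": 72},
]

def discount(price: float, median: float) -> float:
    """Fraction below the median; positive means cheaper than the median."""
    return (median - price) / median

# Median over every listing, regardless of size.
pooled_median = statistics.median(l["price"] for l in listings)
# Median over XS listings only.
xs_median = statistics.median(l["price"] for l in listings if l["size"] == "xs")

pooled_score = discount(25, pooled_median)  # large apparent discount
xs_score = discount(25, xs_median)          # no discount within its own size
```

With this data the pooled median is €60 and the XS median is €25, so the pooled score is about 0.58 while the within-size score is 0.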
Why aggregating across attributes destroys signal
Price is not a single distribution across a product category; it's a family of distributions, one per meaningful attribute combination. Items in the same keyword search may differ in:
- Size: a size W28 jean and a size W38 jean are not comparable price benchmarks
- Condition: new with tags vs heavily used are different markets
- Brand tier: luxury vs high street vs fast fashion have different price floors
- Gender target: men's and women's versions of the same garment often price differently
If you compute a median across all of these, you're pooling apples and oranges. The result is a number that doesn't describe any individual market.
The fix: segment first, score within segment
import statistics


def compute_segment_key(item: dict) -> str:
    """Create a segment key from the attributes that drive price."""
    # normalise_size is defined further down; classify_brand is assumed to exist elsewhere.
    size = normalise_size(item.get("size") or "")  # tolerate a missing or None size
    condition = item.get("condition", "unknown")
    brand_tier = classify_brand(item.get("brand", ""))
    return f"{condition}:{size}:{brand_tier}"


def score_item(item: dict, all_items: list[dict]) -> float | None:
    """Return the fraction by which the item is priced below its segment median."""
    segment = compute_segment_key(item)
    comparable = [i for i in all_items if compute_segment_key(i) == segment]
    if len(comparable) < 5:  # not enough data for a reliable median
        return None
    median_price = statistics.median(i["price"] for i in comparable)
    if median_price == 0:  # avoid dividing by zero
        return None
    return (median_price - item["price"]) / median_price  # positive = below median
A positive score means the item is priced below its segment's median. Now you're comparing like with like.
Size normalisation is non-trivial
Size labels are inconsistent. "M", "Medium", "38", "EU 38", "UK 10" can all mean the same thing, or different things depending on brand and gender targeting.
Before you can segment by size, you need a normalisation step:
SIZE_MAP = {
    "xs": ["xs", "extra small", "34", "eu 34"],
    "s": ["s", "small", "36", "eu 36"],
    "m": ["m", "medium", "38", "eu 38"],
    "l": ["l", "large", "40", "eu 40"],
    "xl": ["xl", "extra large", "42", "eu 42"],
}


def normalise_size(raw: str) -> str:
    clean = raw.lower().strip()
    for canonical, variants in SIZE_MAP.items():
        if clean in variants:
            return canonical
    return "unknown"
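Items whose size normalises to "unknown" should be scored only against other unknown-size items, not mixed into the overall population. Because the normalised size is part of the segment key, this happens automatically.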
The minimum sample threshold
A segment median is only meaningful above a minimum sample size. With only 3 items in a segment, adding or removing a single listing can shift the median substantially.
A practical floor: require at least 5-10 comparable items in a segment before computing a score. Below this threshold, return None and do not display a deal rating. A "no data" result is better than a false one.
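A small illustration of why the floor matters, using made-up prices (the `median_shift` helper is a hypothetical name): adding one expensive listing moves the median of a 3-item segment far more than that of a 10-item segment.

```python
import statistics


def median_shift(prices: list[float], new_price: float) -> float:
    """Absolute change in the median when one extra listing is added."""
    before = statistics.median(prices)
    after = statistics.median(prices + [new_price])
    return abs(after - before)


small = [20, 25, 60]                              # 3 listings
large = [20, 22, 24, 25, 25, 26, 27, 28, 30, 60]  # 10 listings

small_shift = median_shift(small, 65)  # median goes from 25 to 42.5
large_shift = median_shift(large, 65)  # median goes from 25.5 to 26
```

In the small segment the benchmark moves by €17.50 from a single new listing; in the larger one it moves by €0.50.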
What this looks like at category level
For electronics, size is irrelevant: score by condition and brand tier.
For clothing, size, condition, and brand tier all matter.
For collectibles, condition and specific model matter; size is irrelevant.
The attributes that define a segment are category-specific. A generic segmentation key won't work across all categories. You need a per-category definition of what makes two items comparable.
The underlying principle
Before computing any aggregate statistic (median, mean, standard deviation), ask: is this population actually homogeneous? If not, segment it until it is. Statistical measures are only meaningful within comparable groups.