Building a touchpoint taxonomy for lead source accuracy

Why a taxonomy matters

Lead source accuracy depends on consistent naming. Without a taxonomy, one team may use utm_medium=social, another may send utm_medium=paid-social, and a third may omit the parameter entirely. Each variation fragments reporting, inflates direct traffic, and hides the channels that drive awareness. A simple, enforceable taxonomy turns scattered labels into a single source of truth.

Core design principles

Define the smallest set of allowed values that meets current needs.
Prefer prescriptive lists over free text for source and medium.
Separate how campaigns are tagged from how they are grouped into channels.
Enforce normalization at ingest so the warehouse stores only clean data.

Standard UTM policy

Use lowercase, hyphen-separated words. Keep fields short and unambiguous.

utm_source: vendor or property, e.g., google, instagram, newsletter
utm_medium: one of cpc, email, social, referral, affiliate, display, video, audio, organic, direct
utm_campaign: marketing initiative, e.g., fall-sale-202509
utm_content: creative or placement id, e.g., carousel-a
utm_term: paid search keyword when applicable

If a parameter is missing, supply a default at ingestion. For example, an empty medium from a known paid partner can be mapped to display. Do not apply these fixes inside dashboards; apply them once at ingestion so every downstream consumer sees the same values.
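
A minimal sketch of ingest-time normalization, assuming UTMs arrive as query parameters on a landing URL. The names normalize_utms, clean_value, and DEFAULT_MEDIUM_BY_SOURCE, and the partner example-ad-network, are hypothetical and used only for illustration.

```python
import re
from urllib.parse import parse_qs, urlsplit

UTM_FIELDS = ("source", "medium", "campaign", "content", "term")

# Hypothetical: known paid partners whose links sometimes omit utm_medium.
DEFAULT_MEDIUM_BY_SOURCE = {"example-ad-network": "display"}


def clean_value(raw: str) -> str:
    """Lowercase, trim, and collapse non-alphanumeric runs into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", raw.strip().lower()).strip("-")


def normalize_utms(url: str) -> dict:
    """Return the five UTM fields in normalized form; missing fields become ''."""
    query = parse_qs(urlsplit(url).query)
    utms = {}
    for field in UTM_FIELDS:
        # parse_qs returns a list per key; take the first value if present.
        values = query.get(f"utm_{field}", [""])
        utms[field] = clean_value(values[0])
    # Apply the default once, here, rather than in each dashboard.
    if not utms["medium"]:
        utms["medium"] = DEFAULT_MEDIUM_BY_SOURCE.get(utms["source"], "")
    return utms
```

This sketch applies the same cleaning to every field. A team that needs utm_term to keep the original keyword spacing would likely exempt that field.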

Channel mapping rules

Translate source and medium pairs into a canonical channel dimension. Keep mappings explicit and audited in version control.

Examples:

google + cpc -> paid search
bing + cpc -> paid search
instagram + social -> paid social
newsletter + email -> email
empty source + direct -> direct

Ambiguous or missing inputs should map to unclassified until governance decisions are made. This makes gaps visible so the team can fix upstream links or expand the dictionary.
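
The mapping rules above could be expressed as an explicit lookup table kept in version control. This is a sketch rather than a complete dictionary: CHANNEL_MAP holds only the example pairs listed in this section, and map_channel is a hypothetical helper name.

```python
# Explicit (source, medium) -> channel pairs, taken from the examples above.
CHANNEL_MAP = {
    ("google", "cpc"): "paid search",
    ("bing", "cpc"): "paid search",
    ("instagram", "social"): "paid social",
    ("newsletter", "email"): "email",
    ("", "direct"): "direct",
}


def map_channel(source, medium) -> str:
    """Look up the canonical channel; anything not listed is 'unclassified'."""
    key = (source or "", medium or "")
    return CHANNEL_MAP.get(key, "unclassified")
```

Because the fallback is unclassified rather than a best guess, a new source or medium shows up in the QA report instead of silently landing in an existing channel.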

Event and property schema

Adopt a compact set of events for acquisition and conversion. Recommended names:

page_view
lead_form_viewed
lead_form_started
lead_form_submitted
signup_started
signup_submitted
account_created

Common properties:

source, medium, campaign, content, term
channel
session_id, device_id, user_id when available
touchpoint_id (hash of device, timestamp, and URL)
first_touch_ref, last_touch_ref objects when known
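
One way to derive touchpoint_id is sketched below. The choice of SHA-256, the millisecond timestamp precision, and the field separator are assumptions; the text only specifies a hash of device, timestamp, and URL.

```python
import hashlib
from datetime import datetime, timezone


def touchpoint_id(device_id: str, timestamp: datetime, url: str) -> str:
    """Hash device, timestamp, and URL into a stable hex identifier."""
    if timestamp.tzinfo is None:
        # Naive datetimes are ambiguous; require an explicit timezone.
        raise ValueError("timestamp must be timezone-aware")
    # Convert to UTC so the same instant always hashes to the same id.
    ts = timestamp.astimezone(timezone.utc).isoformat(timespec="milliseconds")
    # Join with a separator that should not occur inside the fields.
    payload = "\x1f".join((device_id, ts, url))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

A deterministic id means a retried or replayed event produces the same touchpoint_id, which is what makes deduplication by this key possible.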

Naming conventions

Keep event names lowercase with underscores.
Use ISO 8601 timestamps in UTC.
Avoid spaces and punctuation other than hyphens in parameter values.
For recurring campaigns, append a period marker like yyyymm or qN.

Governance workflow

Data quality improves when ownership is clear and feedback is quick. A lightweight process is sufficient for most teams:

Maintain the UTM and channel dictionary in a shared repository.
Validate incoming parameters at the collector or ETL tier.
Reject or quarantine events with invalid values.
Review the unclassified share, top sources, and new patterns weekly.
Announce changes with effective dates to keep dashboards consistent.
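
The validate and quarantine steps might look like the following sketch. ALLOWED_MEDIUMS mirrors the utm_medium list in the UTM policy; VALUE_PATTERN, validate_event, and route are hypothetical names, and the event shape (a flat dict with source, medium, campaign, content) is an assumption.

```python
import re

ALLOWED_MEDIUMS = frozenset({
    "cpc", "email", "social", "referral", "affiliate",
    "display", "video", "audio", "organic", "direct",
})

# Lowercase alphanumeric words separated by single hyphens.
VALUE_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def validate_event(event: dict) -> list:
    """Return a list of problems; an empty list means the event is valid."""
    problems = []
    medium = event.get("medium", "")
    if medium not in ALLOWED_MEDIUMS:
        problems.append(f"medium not allowed: {medium!r}")
    for field in ("source", "campaign", "content"):
        value = event.get(field, "")
        # Empty values are allowed here; only malformed ones are flagged.
        if value and not VALUE_PATTERN.fullmatch(value):
            problems.append(f"{field} is not lowercase-hyphenated: {value!r}")
    return problems


def route(events):
    """Split events into accepted and quarantined (annotated with problems)."""
    accepted, quarantined = [], []
    for event in events:
        problems = validate_event(event)
        if problems:
            quarantined.append({**event, "problems": problems})
        else:
            accepted.append(event)
    return accepted, quarantined
```

Quarantining with the reasons attached, rather than dropping the event, gives the weekly review concrete examples to fix upstream.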

Implementation checklist

Create a campaign dictionary with allowed values and examples.
Add a server endpoint that normalizes UTMs and enriches each event with its channel.
Store a durable first_touch_ref and pass it through conversion events.
Dedupe events by touchpoint_id plus event key.
Build a QA report that flags unclassified or missing medium.
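
The dedupe step could be sketched as below. It assumes the "event key" is the event name stored under an event field; a team that already has a separate event key would substitute it. The function name dedupe is hypothetical.

```python
def dedupe(events):
    """Keep the first event seen for each (touchpoint_id, event name) pair."""
    seen = set()
    unique = []
    for event in events:
        # Assumption: the event name serves as the event key.
        key = (event["touchpoint_id"], event["event"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(event)
    return unique
```

Keeping the first occurrence preserves arrival order, so a retried submission does not overwrite the original record.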

Measuring success

Track three ratios over time:

the share of conversions with a valid source and medium
the share of traffic mapped to a canonical channel
the share of traffic in the unclassified and direct buckets

When the first two shares rise, the unclassified and direct buckets shrink, and the numbers stay there, the taxonomy is working. Campaign review becomes faster, budget shifts are grounded in consistent data, and teams develop a shared language for acquisition and retention.
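
The ratios above can be computed from a flat list of event rows, as in this sketch. The row shape (source, medium, channel, is_conversion) and the names share and taxonomy_health are assumptions, and "valid" is simplified here to mean that both source and medium are non-empty.

```python
def share(part: int, whole: int) -> float:
    """Return part / whole, or 0.0 when there is nothing to divide by."""
    return part / whole if whole else 0.0


def taxonomy_health(rows) -> dict:
    """Compute the taxonomy quality ratios for a batch of event rows."""
    rows = list(rows)
    conversions = [r for r in rows if r.get("is_conversion")]
    # Simplified validity check: both source and medium are present.
    valid = [r for r in conversions if r.get("source") and r.get("medium")]
    mapped = [r for r in rows if r.get("channel") not in (None, "", "unclassified")]
    unclassified = [r for r in rows if r.get("channel") == "unclassified"]
    direct = [r for r in rows if r.get("channel") == "direct"]
    return {
        "valid_conversion_share": share(len(valid), len(conversions)),
        "mapped_share": share(len(mapped), len(rows)),
        "unclassified_share": share(len(unclassified), len(rows)),
        "direct_share": share(len(direct), len(rows)),
    }
```

Running this per week and storing the result gives a trend line for each ratio, which is easier to act on than a single snapshot.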