
When Your Data Catalog Collects Dust: Diagnosing a Broken Metadata Spine


Here is the thing: your data catalog was supposed to be the single source of truth. Instead, it is a dumping ground. People cannot find the Customer_Address_V3 column because someone labeled it Addr_Final_FINAL_v2. The business glossary has 47 terms for "revenue." The last automated scan ran 14 months ago.

If this sounds familiar, you are not alone. But the diagnosis matters more than the symptom. Why did the catalog fail? More importantly: what do you do about it now? This article is for the person who has to decide by the end of the quarter whether to double down, rip out, or start from scratch.

The Decision Moment: Who Must Choose and by When

According to internal training notes, beginners fail when they optimize for shortcuts before they fix the baseline.

How a compliance audit turned a catalog question into a board-level decision

The three-week window that separates orchestrated adoption from panic buying

'The catalog wasn't broken technically. It was broken socially — nobody had the mandate to say no to bad definitions.'

— A field service engineer, OEM equipment support

Why the data steward who owns the catalog is rarely the one who can fix it

The trap is obvious once you see it: the steward has the tool access but not the organizational leverage. The IT director has the budget but not the domain knowledge. Legal has the compliance mandate but no understanding of how the data moves. The 'decision moment' I'm describing, that specific Friday afternoon audit panic, is when these three silos collide. And honestly, it's almost never the steward who escalates. They try to fix it quietly. They build workarounds. I once watched a team duct-tape their catalog with a shared Google Sheet of 'known working columns' because the formal governance process required six approvals to update a business term. The catch is that by the time the problem surfaces at the executive level, the technical solution is already irrelevant: you need organizational triage, not a feature request. One question worth asking yourself right now: if your catalog went dark tomorrow, would your board notice within a week, or would it take an audit finding to surface the gap?

The Real Option Landscape: Beyond Buy vs. Build

Open-source foundations like Apache Atlas and why they break at scale

I have watched teams bet big on Apache Atlas. The appeal is obvious: free, open, backed by a big ecosystem. They install it, point it at a few Hive tables, and for two weeks everything feels clean. Then they add Kafka topics, a few dbt models, and suddenly the JanusGraph backend starts returning 30-second queries for a simple lineage look-up. The catch is brutal: Atlas was designed for Hadoop-era metadata volumes. Modern data stacks, with hundreds of ephemeral schemas and streaming pipelines, crush its graph traversal logic. The failure mode is not a crash; it's a slow bleed. Crawlers stop pushing updates, UI pages hang, and your governance team starts pasting lineage screenshots into Confluence because the catalog is too slow to use. That hurts. You saved on license cost and paid in engineer-hours debugging thrashing garbage collectors.

Managed solutions that promise zero-config but deliver high-latency onboarding

Collibra, Alation, Atlan. Each one starts with a smooth demo: three clicks, a sample Salesforce schema, lineage that draws itself. The trap is the "zero-config" myth. Configuring a catalog for fifty source systems is not zero-config; it is fifty distinct negotiations with different APIs, auth protocols, and data-dictionary conventions. I've seen an Alation rollout stall for six months because the team could not get the Oracle ETL crawler to handle partitioned tables correctly. The vendor's response? "You need a custom integration." So much for no config. The trade-off is subtle: these tools handle your metadata beautifully once it is inside, but the pipeline into the tool is a failure factory. Connectors break when a source schema changes. Dedup logic collapses on table renames. And because the stack is proprietary, you cannot hot-patch the ingestion layer; you wait for a release cycle. That waiting game kills trust.

The neglected third way: in-house lightweight cataloging on a wiki-plus-schema approach

What if you just put your data dictionary in Notion, link it to a YAML file of table schemas, and call it a day? Sounds fragile, and it is. But I have seen teams with fewer than forty data assets outrun teams with expensive catalogs by doing exactly this. The trick is the human workflow piece. A wiki-based catalog fails when nothing forces updates. The tool is not the issue; the discipline is. One shop I worked with automated a weekly Slack bot that pings table owners: "Did column revenue_type change last sprint? Click yes/no." The response updates a Google Sheet that feeds a mkdocs site. Ugly? Yes. Working? Also yes. The failure mode here is scale: when you cross fifty contributors, the wiki becomes an orphaned mess of stale entries unless you enforce a merge-request approval gate. But the cost, the latency, and the control all trade differently. Most teams skip this option because it sounds too hacky. That is a mistake.

"A catalog's real product is not the metadata but the process that keeps metadata from going stale."

— Data engineer reflecting after scrapping their fourth commercial onboarding

Why the "human workflow" layer is the real product, not the metadata storage

Here is a bold claim: the storage engine of your catalog matters far less than the system that forces humans to label and re-label their data. Apache Atlas stores metadata fine. Collibra stores it fine. A Postgres table stores it fine. What usually breaks first is the feedback loop: a person creates a new table, nobody certifies it, and the catalog now shows "Unknown" for that schema for months. The product you actually need is a lightweight, opinionated notification engine that complains to the right people at the right cadence. The best catalogs I have seen are not the ones with the sexiest lineage graph; they are the ones that ping a Slack channel when a table goes three weeks without a freshness check. That is a governance spine. The rest is paint. If your catalog rollout skips defining exactly who approves a metadata change, and how they are pestered until they do, the tool itself is irrelevant. Picking the storage first is the wrong order. Start with the pestering logic, then pick the storage.
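The pestering logic can be small. Here is a minimal sketch, assuming a hypothetical in-memory record per table that carries an owner and the date of its last freshness check. The names `TableRecord`, `stale_tables`, and `reminder_messages` are illustrative and not part of any catalog product; actually delivering the message (Slack, email) is left out.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical record shape; a real catalog would supply these fields.
@dataclass
class TableRecord:
    name: str
    owner: str
    last_freshness_check: date

def stale_tables(records, today, max_age=timedelta(weeks=3)):
    """Return (owner, table) pairs whose last check is older than max_age."""
    return [
        (r.owner, r.name)
        for r in records
        if today - r.last_freshness_check > max_age
    ]

def reminder_messages(records, today):
    """Build one reminder line per stale table; sending is up to the caller."""
    return [
        f"@{owner}: {table} has gone more than three weeks without a freshness check."
        for owner, table in stale_tables(records, today)
    ]

# Illustrative data only.
records = [
    TableRecord("billing.invoices", "dana", date(2024, 1, 2)),
    TableRecord("crm.accounts", "lee", date(2024, 1, 25)),
]
messages = reminder_messages(records, today=date(2024, 2, 1))
```

The point of the sketch is that the reminder is addressed to a named owner, which is the part most rollouts leave undefined.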

Comparison Criteria That Actually Separate Shelfware from Workhorses

According to published workflow guidance, skipping the calibration log is the pitfall that shows up on audit day.

Adoption Friction: Can a Junior Analyst Use It, or Does It Need a Dedicated Curator?

The fastest way to tell shelfware from a workhorse is to watch who actually touches the thing after day three. If only one person (usually the person who installed it) ever opens the catalog, you have a very expensive hobby, not a tool. I have watched teams burn six months building a metadata paradise that nobody visited because the UI demanded SQL skills for a simple description update. Ask one question: can someone with two months of experience and no training add a tag, fix a typo, or flag a stale dataset? If the answer requires a ticket, a Slack message, or a prayer, the catalog is already dead. The catch is that enterprise vendors love to promise role-based workflows but ship a curator-only experience. Check this during the trial: hand a junior analyst a test account and tell them to document an unimportant table. Time them. If it takes more than three minutes, walk away.

Schema Drift Tolerance: What Happens When a Column Disappears at 2 AM?

Tables change. They rename columns, drop fields, split into partitions, often without telling anyone. A workhorse catalog notices this drift and surfaces it; shelfware pretends everything is fine until a downstream dashboard breaks. Most teams skip this: they test the catalog against a stable schema and never simulate a production fire drill. Here is the concrete test. Pick a table in your trial environment, delete one column, rename another, and add two new ones. Then ask the catalog what changed. Does it flag the missing column as a warning? Does it show the rename as a lineage break? Worst case: does it silently ingest the new schema and bury the old metadata? I have seen catalogs that simply orphan descriptions, leaving users to discover stale definitions weeks later. That hurts. A catalog that cannot surface drift is not a metadata spine; it is a paperweight with a login screen.
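To know what a good answer looks like before you run the drill, it helps to compute the expected drift yourself. The sketch below is one way to do that, assuming each schema snapshot is a plain `{column: type}` dictionary. The function name `diff_schema` and the rename heuristic are assumptions for illustration: a rename is only guessed when exactly one dropped column and one added column share a type, and the guess still needs a human to confirm it.

```python
def diff_schema(old, new):
    """Compare two {column: type} mappings and report drift.

    Renames cannot be proven from names and types alone, so they are
    reported separately as guesses for a person to confirm.
    """
    dropped = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    retyped = sorted(c for c in set(old) & set(new) if old[c] != new[c])

    possible_renames = []
    for d in dropped:
        same_type_added = [a for a in added if new[a] == old[d]]
        same_type_dropped = [x for x in dropped if old[x] == old[d]]
        # Only guess when the pairing is unambiguous in both directions.
        if len(same_type_added) == 1 and len(same_type_dropped) == 1:
            possible_renames.append((d, same_type_added[0]))

    return {
        "dropped": dropped,
        "added": added,
        "retyped": retyped,
        "possible_renames": possible_renames,
    }

# The fire drill from the text: delete one column, rename one, add two.
before = {"id": "int", "email": "text", "signup_ts": "timestamp", "region": "char(2)"}
after = {"id": "int", "email": "text", "created_at": "timestamp", "plan": "text", "score": "float"}
drift = diff_schema(before, after)
```

If the catalog under trial reports less than this diff does, descriptions attached to the dropped and renamed columns are likely to be orphaned.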

Search vs. Browse: Which Pattern Dominates, and Why Browse-First Catalogs Last

Search is a crutch. Every vendor demos a beautiful search bar that finds your data in milliseconds, but that only works if you already know what to search for. The reality of data governance is that most users discover assets by accident: they poke around a domain, follow a lineage path, or stumble into a dataset they didn't know existed. Browse-first catalogs expose that exploration layer well: taxonomies, domain trees, and curated collections. Search-first catalogs, however, expect you to type the exact table name from memory. That sounds fine until your organization has 12,000 tables named sales_final_v3_final_actual. The trade-off is brutal: browse-first costs more to set up (someone must classify and tag), but it stays relevant because users find things without help. Search-first feels fast and breaks quietly. Ask the vendor: on day 200, with no curator left, can a new hire find revenue data by clicking around? If the demo mentions autocomplete three times without showing a browse tree, be suspicious.

'A catalog that only works for experts is not a catalog — it is a velvet rope around your data.'

— data platform lead, after triaging a third-party tool that survived exactly one analyst's resignation

Cost of a Bad Label: Does the System Surface an Error or Bury It?

Bad metadata is worse than no metadata because it misleads with confidence. A description that says "this is customer churn data" when the underlying columns actually track onboarding completion will cause a week of bad analysis before anyone suspects the catalog. The real filter here is error propagation. When someone corrects a bad label, does the catalog retroactively notify everyone who used that dataset? Or does it silently overwrite the old description and leave the mistaken reports intact? I have seen both. The cheap answer is 'we log an audit trail', which means nobody reads it. The workhorse answer is 'we show a warning banner on any dashboard that referenced the now-corrected asset'. That requires lineage tracking that connects documentation to consumption. Most buyers skip this test because they focus on ingestion speed. Wrong priority. A catalog that buries errors ensures that every bad label lives forever, quietly poisoning decisions. Choose the tool that screams when you get it wrong.

Trade-Offs You Cannot Avoid: The Structured Comparison

Granularity vs. maintenance burden: every extra field you ask for creates a reason not to fill it

I watched a team build a catalog with fifty-three custom attributes per asset. Data type, owner, refresh cadence, sensitivity label, retention policy, PII flag, lineage source, transformation logic, business term FK, steward approval date; the list went on. The result? Nine months later, 82% of those fields were empty. Dead columns. Digital furniture no one dusts. The trade-off is brutal: you can model every nuance of your data landscape, but each additional field is a tax on human attention. That sounds fine until you ask a data engineer to fill in the 'business justification' field at 4:45 PM on a Friday. They won't. They'll paste 'n/a' or skip the row entirely.

The catch is that coarse metadata (just a name, a source, and a description) feels too thin to be useful. But thin metadata actually gets completed. We fixed this by treating every optional field as a liability. Ask: Will this column be 90% populated after six months? If the answer isn't immediate, delete the column. Better to have five reliable fields than fifty ghost fields.
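The 90% question is a prediction, but the same threshold can be checked after the fact against an export of the catalog. The following is a rough sketch, assuming each asset is a dictionary of attribute values; `fill_rates` and `fields_to_drop` are hypothetical helper names, and treating 'n/a' as empty is an assumption taken from the Friday-afternoon example above.

```python
def fill_rates(assets, fields):
    """Fraction of assets with a non-empty value for each field.

    None, blank strings, and the placeholder 'n/a' all count as empty.
    """
    def is_filled(value):
        return value is not None and str(value).strip().lower() not in ("", "n/a")

    total = len(assets)
    return {
        f: (sum(is_filled(a.get(f)) for a in assets) / total if total else 0.0)
        for f in fields
    }

def fields_to_drop(assets, fields, threshold=0.9):
    """Fields whose fill rate falls below the threshold (90% by default)."""
    return [f for f, rate in fill_rates(assets, fields).items() if rate < threshold]

# Illustrative export; real data would come from the catalog itself.
assets = [
    {"name": "orders", "owner": "dana", "business_justification": "n/a"},
    {"name": "users", "owner": "lee", "business_justification": ""},
    {"name": "events", "owner": "sam"},
    {"name": "refunds", "owner": "kim", "business_justification": "Finance close"},
]
fields = ["name", "owner", "business_justification"]
rates = fill_rates(assets, fields)
candidates = fields_to_drop(assets, fields)
```

Running this once a quarter gives a concrete list of ghost fields to delete rather than a debate about which attributes feel important.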

Automated discovery vs. human curation: who corrects the AI when it guesses 'SSN' for a zip code column?

Auto-scanning tools are seductive. Point them at a database, and they vomit up a catalog in hours: columns, types, foreign keys, even inferred classifications. I have seen a tool label a column 'National Identifier' because the word 'ID' appeared in the comment field. The column held a part code. Wrong.

Automation reduces the upfront grind, but it kicks the accuracy problem down the road. Humans must review every automated guess, and that review loop is boring. Boring work gets deferred. Deferred reviews rot the catalog's credibility. Even a 5% error rate propagates: users spot one wrong label and distrust every label. The trade-off is speed versus trust. Automated discovery gets you started fast; human curation keeps the catalog alive. Most teams flip this order: they automate everything and never schedule review cycles. That's backwards. Schedule review days before you turn on the scanner.

Governance rigor vs. team autonomy: locked-down terms reduce errors but kill grassroots tagging

'We made everyone use the approved business glossary. Tagging dropped 70% in two quarters.'

— Data platform lead, mid-market SaaS

That hurts. Controlled vocabularies prevent chaos: no more calling the same field 'customer_id', 'cust_id', and 'c_id' across three datasets. But centralized control chills contribution. When a junior analyst discovers a useful dataset but can't tag it because they lack 'steward' permissions, they stop caring about the catalog entirely. The catalog becomes an elite club. Grassroots tagging, where anyone can add tags freely, creates noise but also captures what people actually use. The trick is permitting free tags and curating a smaller set of canonized terms, then mapping one to the other. We run a weekly reconciliation: all unapproved tags get reviewed in thirty minutes. If a tag appears three times, it gets promoted to the glossary. If it appears once, it stays as folklore: no harm, no governance lockdown.
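The weekly reconciliation is easy to pre-sort before the thirty-minute review. Below is a minimal sketch, assuming free tags arrive as a flat list of strings and the glossary is a set of approved terms. The function name `reconcile_tags` is hypothetical. The text does not say what happens to a tag used exactly twice; this sketch leaves it as folklore, which is an assumption.

```python
from collections import Counter

def reconcile_tags(tag_events, glossary, promote_at=3):
    """Split unapproved tags into promotion candidates and folklore.

    Tags are normalized (trimmed, lowercased) before counting so that
    'Churn' and 'churn ' are treated as the same tag.
    """
    counts = Counter(t.strip().lower() for t in tag_events)
    unapproved = {t: n for t, n in counts.items() if t not in glossary}
    promote = sorted(t for t, n in unapproved.items() if n >= promote_at)
    folklore = sorted(t for t, n in unapproved.items() if n < promote_at)
    return promote, folklore

# Illustrative week of tagging activity.
tag_events = ["churn", "Churn", "churn ", "q4-forecast", "revenue", "revenue"]
glossary = {"revenue"}
promote, folklore = reconcile_tags(tag_events, glossary)
```

The reviewers then only have to confirm the promotion list, which keeps the meeting short enough that it actually happens.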

Integration depth vs. migration cost: deeply coupled catalogs are harder to install and harder still to replace

Some catalogs embed themselves into your data stack like a parasite. They inject lineage collectors into Airflow, install column-level listeners in Snowflake, index every Glue table, and hook into your BI layer for usage stats. That depth yields rich metadata, but you cannot extract it. Switching catalogs means rebuilding every connector, rewriting every sync job, and praying the lineage export format matches. We chose a catalog with shallow integrations: it reads system tables and little else. Setup took two afternoons. The metadata is less detailed, yes. But I can migrate the entire thing to another platform over a long weekend. Integration depth is a hidden lock-in. Ignore it at your peril.

The structural choice is this: accept shallower detail now and preserve optionality, or commit to deep coupling and plan to stay married to that vendor. There is no middle ground—every connector you write is a decision not to remain portable.

Implementation Path After You Pick a Direction

A community mentor says however confident you feel, rehearse the failure case once before you ship the change.

Week 1: Inventory the metadata you already own, even if it is scattered across emails and Confluence pages

Stop shopping for metadata tools. You already own metadata; it is just hiding in places nobody calls a catalog: the subject line of that email chain about the customer churn table, the field descriptions buried in a six-month-old Confluence page, the data dictionary that lives as a Google Sheet no one can find. I have seen teams spend three months evaluating Datum360 while their actual metadata rotted in someone's Sent folder. Your job in week one is not to buy anything. Your job is to dump every credible source into a single shared document (raw, unformatted, ugly) and label which datasets people actually query. Most teams skip this. That hurts.

'We spent $80k on a catalog tool and then realized we didn't know what fields our own billing table contained.'

— Data lead at a mid-market SaaS company, three months post-implementation

Do not clean it yet. Do not deduplicate. Just gather. The act of collection reveals the first fracture: someone owns the data but no one owns the description of the data. That gap, the ownership vacuum, is what kills catalogs, not missing automation.

Week 2–3: Identify the three highest-value datasets and label them manually before any automation

Pick three. Three datasets that, if misdescribed, would cause a visible business mistake: a wrong revenue report, a broken compliance check, a product team shipping features on bad assumptions. Manually label those. Column by column. Add a business description, a source, and a contact person. No scripts, no crawlers, no AI suggestions. This feels like a waste. It is not. The act of manual labeling forces someone to actually look at the data. Automation hides the mess; manual work reveals it. The catch is that most teams skip manual labeling because it is boring. They want the dopamine hit of an automated scan. That is how you end up with a catalog that describes 4,000 tables but nobody trusts a single one.

Wrong order. Fix the three that matter. Then automate.

Month 2: Build a single feedback loop: a button that says 'this label is wrong' and a person who actually sees it

Here is where most implementations stall. You build a catalog. Week one works. People use it. Then someone spots an error: maybe the 'customer_opt_in' field is described as 'boolean' when it actually stores 'Y/N/Maybe' strings. They want to flag it. But there is no button. Or there is a button but it fires into a Slack channel nobody monitors. That is a broken feedback loop, and it will kill adoption faster than a bad UI. One concrete fix: add a single 'flag this description' link that emails a real human (the data steward, not a shared inbox). That person must respond within 48 hours, even if the response is 'I will fix it next sprint.' What usually breaks first is not the metadata; it is the trust that a flag actually gets seen. You lose that trust, you lose the catalog.

Honestly — this one thing separates shelfware from workhorses. A feedback loop that has a pulse.
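Whether the loop has a pulse is measurable. Here is a minimal sketch of the 48-hour check, assuming flags are stored as simple records with a named steward and timestamps. The `Flag` dataclass and `overdue_flags` function are hypothetical shapes for illustration, not a feature of any specific catalog.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

RESPONSE_SLA = timedelta(hours=48)

@dataclass
class Flag:
    asset: str
    note: str
    steward: str                      # a named person, not a shared inbox
    raised_at: datetime
    responded_at: Optional[datetime] = None

def overdue_flags(flags, now):
    """Flags with no response at all after the 48-hour window."""
    return [
        f for f in flags
        if f.responded_at is None and now - f.raised_at > RESPONSE_SLA
    ]

# Illustrative flags only.
flags = [
    Flag("crm.customer_opt_in", "described as boolean, stores Y/N/Maybe",
         "dana", datetime(2024, 3, 1, 9, 0)),
    Flag("billing.invoices", "currency column undocumented",
         "lee", datetime(2024, 3, 3, 9, 0)),
    Flag("crm.accounts", "owner left the company",
         "dana", datetime(2024, 3, 1, 9, 0),
         responded_at=datetime(2024, 3, 2, 10, 0)),
]
overdue = overdue_flags(flags, now=datetime(2024, 3, 4, 9, 0))
```

A response counts even if it only says the fix is scheduled, so the check looks at `responded_at` rather than at whether the description changed.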

Month 3–6: Expand to secondary datasets only after the core is stable and someone is accountable for freshness

Do not add the next ten datasets until you confirm that the first three are still accurate. Set a calendar reminder. Once a month, someone checks whether the 'last updated' timestamp on your core datasets has gone stale. If it has, that dataset gets a red badge until someone re-labels it. Expansion feels productive; it is often a trap. I have seen teams add 200 datasets in a quarter and then discover that six months later, half of them describe tables that no longer exist. The pitfall is that expansion creates the illusion of value (look at all this coverage!) while the core rots. Push back. Demand that each secondary dataset passes a simple check: does it have an owner who is contactable? Is its description less than 90 days old? If the answer to either is no, it stays out.

That sounds fine until the CTO asks why the catalog does not cover the new data lake. Stay firm. A small, fresh catalog beats a large, dusty one every time. You can always add more next quarter. You cannot re-earn trust once the metadata spine snaps.
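The two-question admission check can be written down so that nobody has to argue it case by case. This is a small sketch, assuming each candidate dataset is a dictionary with hypothetical `owner_contact` and `description_updated` keys. "Contactable" is reduced here to "a contact is recorded", which is a simplifying assumption; confirming the person still answers is a manual step.

```python
from datetime import date, timedelta

MAX_DESCRIPTION_AGE = timedelta(days=90)

def admission_problems(dataset, today):
    """Return the reasons a dataset should stay out; empty means admit."""
    problems = []
    if not dataset.get("owner_contact"):
        problems.append("no contactable owner")
    described_on = dataset.get("description_updated")
    if described_on is None or today - described_on > MAX_DESCRIPTION_AGE:
        problems.append("description missing or older than 90 days")
    return problems

# Illustrative candidates only.
today = date(2024, 6, 1)
fresh = {"owner_contact": "dana@example.com", "description_updated": date(2024, 4, 15)}
dusty = {"owner_contact": "", "description_updated": date(2023, 12, 1)}
fresh_result = admission_problems(fresh, today)
dusty_result = admission_problems(dusty, today)
```

Returning the reasons rather than a bare yes or no gives you something concrete to send back to whoever asked for the dataset to be added.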


Risks If You Choose Wrong or Skip Steps

The compliance cliff: a mislabeled PII field can cost more than the catalog itself

One wrong tag on a customer record (calling it 'anonymous analytics' when it's actually a Social Security number) and your data catalog becomes legal evidence against you. I have watched a mid-size fintech burn six months of engineering time defending a GDPR audit after their AI-powered catalog labeled employment income as 'non-sensitive metadata.' The fine alone exceeded the cost of their entire data platform. That sounds dramatic until you realize most catalogs let users tag fields manually with zero validation hooks. No review step. No second look. The regulator does not care that your catalog vendor promised 'built-in compliance.' They care about the spreadsheet dump showing 14,000 rows of misclassified PII sitting in a staging table for 11 months.

Most teams skip this: treat your catalog's sensitivity labels as production code. Test them. Version them. Audit who changed what. If your tool does not support a review workflow for tag changes, you are not running a catalog; you are running a liability queue.
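One way to read "test them" literally is a check that compares declared labels against sampled values. The sketch below is deliberately narrow, assuming a hypothetical column export with `name`, `sensitivity`, and `samples` keys, and it only looks for values shaped like a US Social Security number (three digits, two digits, four digits). A real validation suite would need more patterns and a review step; this shows the shape of the test, not its full coverage.

```python
import re

# Format check only: matches the NNN-NN-NNNN shape, not real validity.
SSN_SHAPE = re.compile(r"\d{3}-\d{2}-\d{4}")

def suspicious_labels(columns):
    """Names of columns whose samples look like SSNs but are not labeled 'pii'."""
    findings = []
    for col in columns:
        looks_like_ssn = any(SSN_SHAPE.fullmatch(str(v)) for v in col["samples"])
        if looks_like_ssn and col["sensitivity"] != "pii":
            findings.append(col["name"])
    return findings

# Illustrative export with obviously fake sample values.
columns = [
    {"name": "tax_ref", "sensitivity": "anonymous analytics",
     "samples": ["000-12-3456", "000-98-7654"]},
    {"name": "ssn", "sensitivity": "pii", "samples": ["000-11-2222"]},
    {"name": "zip_code", "sensitivity": "non-sensitive", "samples": ["94107", "10001"]},
]
findings = suspicious_labels(columns)
```

Run as part of a scheduled job or a merge check on label changes, a non-empty result is a reason to block the change until someone looks.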

The adoption death spiral: empty fields discourage input, which makes the catalog emptier, which kills it

Here is the pattern I see every quarter: the data team spends three weeks populating descriptions, lineage links, and ownership tags. Engineering ignores it for two sprints. A new hire checks the catalog for a column called cust_risk_score_rounded, finds a blank description, and walks to a senior analyst instead. That analyst never opens the catalog again. Now nobody adds metadata because nobody uses it. The tool becomes a ghost town with a login screen.

'We spent eighty thousand on Collibra last year. Nobody touches it except the three people who were forced to export a field list for the SOC 2 audit.'

— Director of Data Engineering, SaaS company with 400 employees

The vicious cycle is self-fueling. One missing description begets five unanswered questions, begets a Slack channel where all the answers live—outside the catalog forever. The fix is brutal but necessary: block access to raw tables for anyone who fails to complete a one-hour training that ends with them adding one genuine description. Pain upfront beats rot later.

The sunk-cost trap: staying with a bad tool because you paid for it is often more expensive than switching

Your team has twelve months sunk into configuration. Custom extractors. Hand-rolled connectors to old Salesforce objects. An elaborate governance workflow that nobody understands. The catalog is slow, search returns garbage, and the UI crashes when you join three lineage views. But you already paid the license, right? Wrong. A friend at a logistics company kept patching Alation for two years because 'we already bought the enterprise plan.' Their data team churn hit 40%; people quit because they could not get reliable schema diffs. The opportunity cost of stuck engineers trying to work around a broken tool often exceeds the catalog's annual cost within three months.

That hurts because switching feels like admitting failure. But the real failure is letting a bad tool erode your group's trust in data itself. Sometimes the cheapest move is a cold cut-and-migrate, even if the old license has six months left.

The over-automation trap: trusting AI to label everything without review creates a data swamp with a pretty interface

The vendor demo shows magic: upload 10,000 tables, and the machine learning model tags columns, guesses descriptions, draws lineage. Beautiful. Two months later you have a catalog where is_deleted is labeled 'active customer indicator' and last_login_ip is marked as 'product preference score.' Nobody reviewed the suggestions because the tool called them '95% accurate.' The catch is that 5% error rate across 50,000 columns yields 2,500 mislabeled fields—enough to poison every downstream report. I know a healthcare analytics group that spent a sprint debugging a sudden jump in patient readmissions only to discover the catalog had silently merged two physically separate schemas under one logical entity because their names looked similar.

Do not automate what you cannot verify. Use AI suggestions as draft annotations, force a human review for any column touching compliance or finance, and build a manual overrides log. A smart catalog catches errors fast—but only if you look.
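The arithmetic and the routing rule above fit in a few lines. This is a hedged sketch, assuming each AI suggestion is a dictionary carrying a hypothetical `domains` list; the domain names `compliance` and `finance` come from the text, while `route_suggestion` and `expected_mislabels` are illustrative names.

```python
REVIEW_DOMAINS = {"compliance", "finance"}

def expected_mislabels(column_count, error_rate):
    """Expected number of wrong labels at a given per-column error rate."""
    return round(column_count * error_rate)

def route_suggestion(suggestion):
    """Keep every AI suggestion as a draft; mark which ones need sign-off."""
    needs_review = bool(REVIEW_DOMAINS & set(suggestion["domains"]))
    return {**suggestion, "status": "draft", "needs_human_review": needs_review}

# The figure from the text: a '95% accurate' tool over 50,000 columns.
mislabeled = expected_mislabels(50_000, 0.05)

# Illustrative suggestions only.
routed = [
    route_suggestion({"column": "is_deleted", "label": "soft-delete flag", "domains": []}),
    route_suggestion({"column": "invoice_total", "label": "billed amount", "domains": ["finance"]}),
]
```

Nothing is published automatically in this sketch: the status stays `draft` either way, and the review flag only decides which drafts cannot be accepted without a person.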

Mini-FAQ: The Questions Nobody Asks Until the Catalog Is on Fire


What do we do when two teams label the same column differently?

Fight it out in Slack? Escalate to a data council? No. That takes weeks and everyone resents the outcome. You need a brutal triage rule: whoever pays for the pipeline that writes the column gets naming authority. The consuming group gets an alias, a synonym, in the catalog's business glossary. I once watched two engineering groups deadlock for three months over 'customer_id' vs 'account_holder_id'. The fix? One pipeline owned it; the other wrote a two-line transformation and renamed it on read. The catalog mapped both. Issue dissolved in two hours. The catch is, this rule only works if you enforce it before the catalog goes live. After launch, everyone claims ownership. Designate a single 'column governor' per source system, not a committee. Committees produce PDFs. Governors produce decisions.

How long should we keep a deprecated dataset in the catalog?

Zero days after you stop writing to it. Seriously. A deprecated dataset with no writes, no queries, and no documented consumers is a corpse. Leaving it in the catalog poisons trust. New analysts find it, run a join, get stale results, and blame 'the catalog' for lying. We soft-delete everything: set a 'retired_on' timestamp, keep the metadata for 90 days so historians can trace lineage, then purge. The 90-day window gives laggard teams time to scream. Nobody ever screams. What usually breaks first is someone asking 'But what about the quarterly report that still references that table?' Kill that report, not the cleanup. A catalog with five dead entries looks abandoned. A catalog with zero dead entries feels lived-in. That trust is worth far more than the historical curiosity of an old schema.
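The soft-delete-then-purge routine is simple enough to sketch. The version below assumes catalog entries are dictionaries and uses the `retired_on` field named above; `retire` and `purge_expired` are hypothetical helper names, and entries that reach exactly 90 days are treated as expired.

```python
from datetime import date, timedelta

RETENTION = timedelta(days=90)

def retire(entry, today):
    """Soft-delete: stamp the entry instead of removing it."""
    return {**entry, "retired_on": today}

def purge_expired(entries, today):
    """Keep active entries and retired ones still inside the 90-day window."""
    return [
        e for e in entries
        if e.get("retired_on") is None or today - e["retired_on"] < RETENTION
    ]

# Illustrative entries only.
entries = [
    {"name": "sales.orders"},
    retire({"name": "legacy.crm_accounts"}, date(2024, 1, 1)),
    retire({"name": "tmp.campaign_export"}, date(2024, 3, 20)),
]
remaining = purge_expired(entries, today=date(2024, 4, 1))
```

In practice the retired entries would also be hidden from default search during the window, so they stay traceable without being discoverable by accident.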

Who owns the description of a table nobody remembers creating?

This is the dirtiest fight in data governance. Technical answer: the system owner of the source database owns the metadata. Practical answer: nobody wants it, so you own it. The data platform team. Not forever, just long enough to write one sentence. 'Sales aggregation from legacy CRM, decommissioned 2019.' That's it. One sentence beats a blank description every time. The real problem is the psychological trap: teams think descriptions must be exhaustive. Wrong. A bad description is better than no description because a bad description can be corrected. A blank field cannot be corrected; it must be created. That slight friction kills adoption. I have seen catalogs with 40% empty descriptions. Rot starts there. Assign a rotating 'metadata janitor' for orphan tables. One person, one week, fifty tables. Describe them like you're explaining to a tired colleague at 5 PM.

'We pause metadata curation until the pipeline stabilizes. Then we never resume it. The catalog becomes a landfill.'

— infrastructure lead at a mid-market retail analytics shop, reflecting on their adoption crash

Is it better to have no catalog than a catalog with bad data?

No. That's the wrong question. The right question is how bad, and how visible. A catalog with ten perfectly curated tables and fifty unknown blobs is a liability, because users assume curation extends everywhere. It does not. They'll trust a join that pulls from an untagged source and produce wrong numbers. Better to segment. Flag curated tables with a green badge, 'Known Good.' Tag the fifty unknowns with a yellow badge, 'Unreviewed. Verify before use.' That simple signal changes behavior. I have seen query error rates drop 35% just from color-coding trust levels in the catalog UI. If you cannot even badge them, then yes, turn the catalog off. A bad catalog without trust signals is worse than none because it creates false confidence. But a catalog that honestly says 'I do not know' is a catalog you can fix incrementally. Start with the green list. Grow it. Let the yellow list shrink naturally as people clean what they use. That beats both extremes, perfect or empty, every time.
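The badge rule is a lookup against the green list. A minimal sketch follows, assuming the curated set is maintained by hand and that the tables touched by a query are already known; `badge_for` and `join_warnings` are hypothetical names, and parsing table names out of SQL is out of scope here.

```python
GREEN = "Known Good"
YELLOW = "Unreviewed. Verify before use."

def badge_for(table, curated):
    """Green only for tables on the curated list; everything else is yellow."""
    return GREEN if table in curated else YELLOW

def join_warnings(tables_in_query, curated):
    """Tables in a query that carry the yellow badge."""
    return [t for t in tables_in_query if badge_for(t, curated) == YELLOW]

# Illustrative green list and query.
curated = {"finance.revenue_daily", "crm.accounts"}
warnings = join_warnings(["finance.revenue_daily", "staging.orders_raw"], curated)
```

Defaulting to yellow matters: a table nobody has reviewed should never look trusted just because nobody got around to labeling it.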
