Running the NuGet Classification Pipeline - Part 5

Part 4 ended with the pipeline wired but mute: three Dagster assets that ingest, reference, and classify, and a list of numbers I’d deferred. How often do we land in unknown? What’s the real open-source-versus-proprietary split? This part deploys the thing, runs it against the full NuGet catalog, and reads what the data actually says, which is also how I found a bug in the classifier that the unit tests were perfectly happy with.

Deployment

Deployment is a four-service docker-compose stack: Postgres 16, a one-shot migrate job, the Dagster webserver, and the Dagster daemon that drives schedules and sensors. The webserver and daemon both wait for migrate to finish cleanly (condition: service_completed_successfully), so nothing starts against an un-migrated database. The app data and Dagster’s own run storage live in two separate logical databases on the same server (nuget_pipeline and dagster_storage), so a wipe of one never touches the other. The daemon runs a QueuedRunCoordinator capped at two concurrent runs, telemetry is off, and run and sensor history is purged on a retention schedule so the system DB doesn’t grow without bound. The one non-obvious bit is the UI binding: the webserver’s published port binds to 127.0.0.1:3000 unless TAILSCALE_IP is set. By default the UI is loopback-only, and pointing TAILSCALE_IP at a tailnet address publishes it over the private network instead of the public internet. The Dagster UI launches jobs and has no auth of its own, so it should never be a public port.

  dagster-webserver:
    depends_on:
      migrate:
        condition: service_completed_successfully
    ports:
      - "${TAILSCALE_IP:-127.0.0.1}:3000:3000"   # private by default

Run stats

Numbers in this part are a snapshot taken on 2026-05-02; the catalog grows daily, so treat them as a reading rather than a constant.

Ingestion was the expensive part. With the NUGET_MAX_PAGES cap lifted, the sync walks the catalog from INITIAL_WATERMARK forward at NUGET_CONCURRENCY=50. That number isn’t a guess: throughput peaked around 50 in testing, and HTTP was never the bottleneck. The per-leaf transactions were, which is the whole reason the upsert batches into one bulk INSERT ... ON CONFLICT. What sits in Postgres now is 827,512 packages and 14.1 million versions, and the six-hourly incremental run keeps it current. A routine tick is cheap: ingestion lands in under three minutes. AutomationCondition.eager() fires the classifier the moment ingestion lands, and the enrichment watermark scopes that run to just the rows that changed, so enrichment is a short tail on the end of it rather than a second full pass over the catalog.
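The bulk INSERT ... ON CONFLICT idea mentioned above can be sketched in a few lines. This is a minimal illustration, not the pipeline’s code: it uses the standard library’s sqlite3 as a stand-in for Postgres (both accept INSERT ... ON CONFLICT ... DO UPDATE), and the table, its columns, and the upsert_versions helper are all assumptions made for the example.

```python
import sqlite3

def upsert_versions(conn, rows):
    """Write a batch of (package_id, version, listed) rows in one statement.

    Hypothetical helper. With a Postgres driver the placeholder style would
    differ, but the statement has the same shape.
    """
    if not rows:
        return 0
    placeholders = ", ".join(["(?, ?, ?)"] * len(rows))
    sql = (
        "INSERT INTO nuget_versions (package_id, version, listed) "
        f"VALUES {placeholders} "
        "ON CONFLICT (package_id, version) DO UPDATE SET listed = excluded.listed"
    )
    params = [value for row in rows for value in row]
    with conn:  # one transaction for the whole batch, not one per row
        conn.execute(sql, params)
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nuget_versions ("
    " package_id TEXT NOT NULL, version TEXT NOT NULL, listed INTEGER NOT NULL,"
    " PRIMARY KEY (package_id, version))"
)
upsert_versions(conn, [("Foo", "1.0.0", 1), ("Foo", "1.1.0", 1)])
# The second batch updates one existing row and inserts one new row.
upsert_versions(conn, [("Foo", "1.1.0", 0), ("Bar", "2.0.0", 1)])
```

The point of the shape is that the commit cost is paid once per batch rather than once per catalog leaf, which is where the text says the time was going.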

Those versions aren’t free to keep. The raw.nuget_versions table is about 43 GB on disk, and only a third of that is the typed columns; the rest is the full raw_metadata leaf bodies sitting in TOAST. That’s the escape hatch from Part 4 showing up on the invoice, and I’d still pay it, because a re-read of a column beats re-walking the catalog over the wire.

catalog        packages=827,512   versions=14,146,927   classified=827,512
storage        raw.nuget_versions=43 GB (13 GB heap + 30 GB toast/indexes)
6h increment   ingest ~168s       classify ~60s over 4,595 changed rows

Collection stats

This is the payoff: the run metadata I deferred in Part 4, now filled in. The headline isn’t the one I expected. Unknown is the majority verdict at 65%, not open source. Open source is a quarter of the catalog, proprietary is under a tenth, and everything else is what the classifier couldn’t decide.

classification   open_source=25.4%  proprietary=9.3%   unknown=65.3%
signal           expression=25.0%   url=22.1%          none=52.9%

The reason is sitting in the signal row. More than half of all packages declare no license at all, no expression and no URL, and that single fact is the biggest contributor to unknown: the classifier’s single most common reasoning string, stamped on 53% of all rows, is literally no license declared. Only a quarter carry a parseable SPDX expression, and another fifth fall through to the URL fallback, which earns its keep rather than being the afterthought the diagram made it look like. The open-source and proprietary numbers are really computed over the minority of packages that declare anything at all.
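One way to make that last point concrete is to re-base the published percentages onto the packages that declare something. The sketch below uses only the rounded figures from the two stats rows above, and it assumes every no-license package lands in unknown, which the reasoning string suggests but this snippet does not verify.

```python
# Rounded shares of the whole catalog, copied from the stats rows above.
verdicts = {"open_source": 25.4, "proprietary": 9.3, "unknown": 65.3}
no_license = 52.9  # packages with neither an expression nor a URL

declared = 100.0 - no_license                        # share that declares anything
unknown_declared = verdicts["unknown"] - no_license  # unknowns that did declare something

# Assumption: every no-license package is classified unknown.
share_of_unknown_from_no_license = no_license / verdicts["unknown"]

# The same three verdicts, as a share of declaring packages only.
rebased = {
    "open_source": verdicts["open_source"] / declared,
    "proprietary": verdicts["proprietary"] / declared,
    "unknown": unknown_declared / declared,
}

print(f"no-license share of unknown: {share_of_unknown_from_no_license:.0%}")
for verdict, share in rebased.items():
    print(f"{verdict:<12} {share:.1%} of declaring packages")
```

On these rounded inputs, roughly four in five unknowns come from packages with no license at all, and among packages that do declare something, open source is a little over half.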

The proprietary slice has an equally mundane explanation. Almost all of that 9% is Microsoft’s own packages pointing their license URL at a go.microsoft.com/fwlink EULA redirect, which the hardcoded branch from Part 4 already catches. That branch is also the template for the easiest win against unknown. The URL index only knows SPDX’s seeAlso links plus those few patterns, so the cheapest lever is to curate the highest-volume unrecognized URLs and map each one to its license: the single most common unmapped URL alone accounts for ~3,675 packages, and a short hand-built list would move tens of thousands of rows out of unknown in one pass. The harder, bigger prize is the 53% that declare nothing at all, which only the v2 license-file probe can reach, by cracking open the package and reading its LICENSE file directly.
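The curation step can be sketched as a frequency count followed by a lookup table. Everything here is illustrative: normalize_url is a simplified stand-in for whatever normalization Part 4 settled on, CURATED_URLS and top_unmapped are hypothetical names, and the URLs are placeholders rather than entries from the real catalog.

```python
from collections import Counter
from urllib.parse import urlsplit

def normalize_url(url):
    """Collapse scheme, case, 'www.' and trailing slashes so near-duplicates count together.

    Simplified assumption: lowercasing the path is acceptable for license URLs.
    """
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/").lower()

def top_unmapped(urls, known, limit=10):
    """Rank normalized URLs that no existing rule recognizes, most frequent first."""
    counts = Counter(normalize_url(u) for u in urls)
    return [(u, n) for u, n in counts.most_common() if u not in known][:limit]

# Hypothetical curated entries: normalized URL -> SPDX id.
CURATED_URLS = {"example.com/licenses/mit": "MIT"}

sample = [
    "https://example.com/licenses/MIT",
    "http://www.example.com/licenses/mit/",
    "https://example.org/eula",
]
print(top_unmapped(sample, known=CURATED_URLS))
```

Run against the real unknown rows, the head of that ranking is the hand-built list: each entry added to the map moves its whole count out of unknown on the next classification pass.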

There’s a consequence I have to own. At 65%, unknown sailed past the 60% WARN ceiling I set in Part 4, so the check went yellow on the very first real run: I guessed where the data would land, and guessed a little low. Most of that 65% is packages with no license to find, which no amount of classifier cleverness can fix, so I’m raising the ceiling to 70% to keep the check a regression alarm rather than the permanent yellow that teaches everyone to ignore it.
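The gate itself reduces to one comparison. A minimal sketch, assuming the check is a plain function: the name unknown_rate_status and its signature are hypothetical, and in the pipeline this logic would sit inside whatever Dagster check Part 4 defined.

```python
def unknown_rate_status(unknown, total, warn_ceiling=0.70):
    """Return (status, rate), where status is "WARN" above the ceiling and "OK" otherwise."""
    if total <= 0:
        raise ValueError("total must be positive")
    rate = unknown / total
    return ("WARN" if rate > warn_ceiling else "OK"), rate

# The same reading against the old and the new ceiling.
print(unknown_rate_status(65.3, 100.0, warn_ceiling=0.60))
print(unknown_rate_status(65.3, 100.0, warn_ceiling=0.70))
```

With the ceiling at 70% the current reading passes, and a regression that pushes unknown up by five points would still trip the warning.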

The bug

Spot-checking the open_source bucket is what caught it. A handful of packages declared MIT AND a second, non-OSI license and came back open_source with the reasoning MIT is OSI-approved. That’s wrong: AND means you must satisfy both licenses, so one OSI term doesn’t make the whole thing freely usable.

The cause is upstream in the tokenizer. _extract_license_terms deliberately drops the AND/OR operators and returns a flat list of license ids, and classify then applies a single rule to that list: “open source if any term is OSI-approved.” That rule is correct for a disjunction (GPL-2.0 OR Commercial, where a consumer can just pick GPL) and silently wrong for a conjunction.

# classify(): one rule applied to a list that has lost its operators
osi_terms = [t for t in terms if spdx_dict[t].is_osi_approved]
if osi_terms:                                  # any OSI term -> open_source
    return Classification("open_source", primary, normalized, True,
                          f"{primary} is OSI-approved")

The tests never caught it because the AND fixtures I’d written, MIT AND Apache-2.0 among them, all had OSI-approved licenses on both sides, so the flattening was lossless for everything I thought to check. The conjunction case existed in the test suite; the conjunction case that disagrees with the disjunction rule didn’t. The blast radius is small: AND shows up in just 0.29% of declared expressions, 603 packages in all, and every one of them came back open_source, which is the exact set worth auditing. Most are probably fine, since a dual-permissive MIT AND Apache-2.0 really is open source, and the bug only bites when one side isn’t OSI. But wrong at any frequency is still wrong, and a classifier that calls a proprietary-encumbered package “open source” is failing at precisely its one job.
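The gap in the fixtures is easy to see side by side. In this sketch the OSI flags are a three-entry stand-in for the SPDX dictionary, and any_osi and all_osi are hypothetical names for the disjunction and conjunction rules.

```python
# Stand-in for the SPDX dictionary: only the OSI flag matters here.
OSI_APPROVED = {"MIT": True, "Apache-2.0": True, "CC-BY-NC-4.0": False}

def any_osi(terms):
    """The rule classify applied to every expression."""
    return any(OSI_APPROVED[t] for t in terms)

def all_osi(terms):
    """The rule an AND expression actually needs."""
    return all(OSI_APPROVED[t] for t in terms)

fixtures = {
    "MIT AND Apache-2.0": ["MIT", "Apache-2.0"],      # the kind of fixture the suite had
    "MIT AND CC-BY-NC-4.0": ["MIT", "CC-BY-NC-4.0"],  # the kind it was missing
}
for expression, terms in fixtures.items():
    print(expression, "| any:", any_osi(terms), "| all:", all_osi(terms))
```

The two rules agree on the first fixture and disagree on the second, so only a fixture with a non-OSI side can tell them apart.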

The fix

The flattening threw away the one thing the verdict depends on, so the fix is to stop throwing it away. The tokenizer now returns the combinator alongside the term list (AND, OR, or MIXED when both appear), and classify branches on it: OR stays “any term OSI,” and AND becomes “every term OSI.” A consumer of an AND expression is bound by all of it, so the package is only open source if nothing in it is encumbered. One thing the snippet relies on: by this point classify has already routed unrecognized terms to unknown (and bare LicenseRef-* to proprietary), so the AND branch only ever sees licenses SPDX knows.

# classify(): the combinator now decides the rule
if combinator == "AND":
    non_osi = [t for t in terms if not spdx_dict[t].is_osi_approved]
    if non_osi:
        return Classification("proprietary", non_osi[0], normalized, False,
                              f"{non_osi[0]} in an AND conjunction is not OSI-approved")
    return Classification("open_source", terms[0], normalized, True,
                          "every term in the AND conjunction is OSI-approved")
# OR / single term: any OSI-approved term wins, unchanged

This handles the uniform OR/AND shapes that cover virtually all real expressions. Expressions that mix AND and OR depend on grouping, from parentheses or from SPDX’s operator precedence (AND binds tighter than OR), and tokenization discards both. Those fall back to the old any-OSI rule: a v3 problem I’m explicitly deferring, with a test that pins the deferral so it can’t change unnoticed.
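For readers who want the shape of the tokenizer change, here is a minimal sketch of a combinator-aware tokenizer and the rule that branches on it. It is a stand-in, not the real _extract_license_terms: the function names, the None combinator for a single term, and the small OSI set are assumptions, and WITH exceptions are ignored entirely.

```python
import re

OPERATORS = {"AND", "OR"}

def extract_license_terms(expression):
    """Return (terms, combinator); combinator is "AND", "OR", "MIXED", or None."""
    tokens = re.findall(r"[^\s()]+", expression)  # parentheses are dropped here
    operators_seen = {t for t in tokens if t in OPERATORS}
    terms = [t for t in tokens if t not in OPERATORS]
    if len(operators_seen) == 2:
        combinator = "MIXED"
    elif operators_seen:
        combinator = next(iter(operators_seen))
    else:
        combinator = None
    return terms, combinator

def is_open_source(terms, combinator, osi_approved):
    """AND needs every term OSI-approved; everything else keeps the any-OSI rule."""
    if combinator == "AND":
        return all(t in osi_approved for t in terms)
    return any(t in osi_approved for t in terms)  # OR, single term, deferred MIXED

OSI = {"MIT", "Apache-2.0", "BSD-3-Clause"}  # illustrative subset
for expr in ["MIT", "MIT OR Apache-2.0", "MIT AND CC-BY-NC-4.0",
             "(MIT OR Apache-2.0) AND CC-BY-NC-4.0"]:
    terms, combinator = extract_license_terms(expr)
    print(expr, "->", combinator, is_open_source(terms, combinator, OSI))
```

The last expression is the deferred case: it comes back MIXED and still passes the any-OSI rule even though its AND side is encumbered, which is exactly the behaviour a pinning test would assert until the grouping is parsed properly.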

Shipping it is almost free, which is the payoff of keeping enrichment a derived asset. The fix changes one pure function, and nothing upstream moves: no re-ingest, no re-walk of the catalog, no touching the 43 GB of raw versions. I reset the enrichment watermark, re-materialize the classifier, and it streams back over the raw tables already sitting in Postgres in a single pass that takes minutes. Re-running the classification is the whole deployment.

The honest footnote is that the reasoning string is what made this debuggable at all. Because every verdict recorded why, I could grep the open_source rows for AND in the expression and find the lie without re-running anything.
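That audit is a one-line filter once the rows are in hand. A small sketch, assuming each row exposes classification and expression fields (the field names, the helper names, and the sample packages are hypothetical):

```python
import re

def has_and_operator(expression):
    """True when AND appears as a standalone token, not inside a license id."""
    return "AND" in re.findall(r"[^\s()]+", expression or "")

def rows_to_audit(rows):
    """Keep open_source verdicts whose expression contains an AND operator."""
    return [
        r for r in rows
        if r["classification"] == "open_source" and has_and_operator(r["expression"])
    ]

rows = [
    {"package": "Example.A", "classification": "open_source", "expression": "MIT AND Apache-2.0"},
    {"package": "Example.B", "classification": "open_source", "expression": "MIT"},
    {"package": "Example.C", "classification": "unknown", "expression": None},
    {"package": "Example.D", "classification": "open_source", "expression": "(MIT AND CC-BY-NC-4.0)"},
]
for row in rows_to_audit(rows):
    print(row["package"], row["expression"])
```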

Conclusion

At the start of this series this was a list of requirements I’d hold any orchestrator to. It became a comparison that picked Dagster, then an architecture diagram, then the code behind the diagram, and finally a deployment that ran against the whole NuGet catalog and reported back. The thread running through all of it is the gap between the diagram and the code, and then between the code and the data. Each step looked clean one level up and turned out to hide a judgement call, an edge case, or an outright bug one level down. URL normalization was a single arrow. “Classify” hid an AND/OR bug the unit tests were happy with. And the tidy three-way split I’d pictured came back two-thirds unknown.

That unknown number is where the interesting work still lives, so it’s worth being concrete about how the classifier’s coverage gets better from here, in rough order of effort:

1. Ship the AND conjunction fix above, so the 603 multi-license packages are judged on every term instead of any single OSI one. Correctness before coverage.
2. Curate the highest-volume unrecognized license URLs and map them to SPDX ids, the same move the Microsoft EULA branch already makes. The top unmapped URL alone is ~3,675 packages; a short hand-built list claws tens of thousands of rows back from unknown for almost no code.
3. Probe package contents for the 53% that declare nothing at all, reading the LICENSE file straight out of the .nupkg. This is the v2 feature that actually moves the needle, and the hardest, since it means fetching and unpacking archives rather than reading a metadata field.
4. Teach the tokenizer the nested-parenthesised expressions it currently punts on. A small tail, but a real one.

None of that needs a new architecture. The asset graph, the watermarks, the quality gates, and the eager enrichment all stay exactly as they are; the classifier is a pure function with a reasoning string on every verdict, so each item above is a contained change with a test and a number to watch. The same shape was always meant to wrap npm, PyPI, or Maven by swapping the ingest asset and keeping the rest, and that is the experiment I’d reach for first if this were to continue.

But the series ends here. The pipeline is built, deployed, and honest about what it does and doesn’t know, which was the point all along: not a classifier that is always right, but one that records why it reached every verdict, so the wrong ones stay findable. That is a good place to stop.