How to evaluate AI and machine learning products before acquisition — model ownership, data moats, inference costs, retraining pipelines, benchmark integrity, and the difference between real AI and GPT wrappers.
AI products are the most opaque category to evaluate in a technical acquisition. The surface-level demo is impressive; the engineering reality underneath — model ownership, data provenance, inference economics, and benchmark integrity — is where the real value (or risk) lives.
This guide is for acquirers evaluating AI-native products, ML-enhanced software, and LLM-based applications. It covers what the demo doesn't show you.
Before evaluating anything else, establish what layer of the AI stack the product actually occupies:
| Layer | What it means | Value and risk |
|---|---|---|
| Foundation model provider | OpenAI, Anthropic, Google — the model itself | Not acquirable as a product |
| Fine-tuned model | Pre-trained model adapted on proprietary data | Valuable if the data is owned |
| Retrieval-augmented application | LLM + proprietary document store (RAG) | Valuable if the data corpus is proprietary |
| Orchestration layer | Prompts, chains, agents built on foundation models | Valuable if the workflow design is defensible |
| Data collection business | Product that generates training data as a byproduct | Highest long-term defensibility |
Most acquired "AI products" are in the orchestration or RAG categories. That's not inherently bad — but the valuation should reflect it.
The legal risk landscape for AI training data is actively evolving, and provenance questions that seem theoretical today can become liabilities after the acquisition closes.
Ask for the training data documentation. If there isn't any, that's a red flag.
AI product benchmarks are frequently misleading. Common problems:
The test set used to report accuracy should be held out — never used to train or tune the model. If the team that built the model also built the benchmark, ask how they ensured the test data wasn't leaked into training.
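A first-pass contamination check is straightforward to run yourself during diligence. The sketch below (illustrative only; the function and normalisation scheme are my own, and real leakage detection also needs fuzzy or n-gram matching to catch paraphrased duplicates) flags test examples that appear near-verbatim in the training data:

```python
import hashlib

def _fingerprint(text: str) -> str:
    # Normalise case and whitespace so trivial edits don't hide a duplicate
    return hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()

def leakage_rate(train_texts, test_texts) -> float:
    """Fraction of test examples that also appear (near-verbatim) in training data."""
    train_hashes = {_fingerprint(t) for t in train_texts}
    leaked = sum(1 for t in test_texts if _fingerprint(t) in train_hashes)
    return leaked / max(len(test_texts), 1)

# Toy example: one of three test examples duplicates a training example
train = ["The cat sat on the mat.", "Dogs bark loudly."]
test = ["the cat sat  on the mat.", "Fish swim.", "Birds fly."]
print(f"{leakage_rate(train, test):.0%}")  # prints "33%"
```

Any non-trivial leakage rate means the reported benchmark numbers should be discounted heavily.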
Demo outputs are selected to be impressive. Request a random sample of 50 production outputs from the last 30 days. Evaluate those.
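Insist that the sample be drawn with a fixed random seed so the vendor can reproduce exactly the same records you reviewed. A minimal sketch, assuming the vendor logs one JSON record per inference call (the log format and field names here are hypothetical):

```python
import json
import random

def sample_outputs(log_path: str, n: int = 50, seed: int = 42) -> list:
    """Draw a reproducible random sample of production outputs for manual review.

    Assumes a JSONL log, one record per inference call; adapt the parsing
    to whatever logging format the vendor actually uses.
    """
    with open(log_path) as f:
        records = [json.loads(line) for line in f]
    random.seed(seed)  # fixed seed: vendor and acquirer see the same sample
    return random.sample(records, min(n, len(records)))
```

If the vendor cannot produce 30 days of production outputs at all, that absence is itself a finding: it usually means there is no output logging, and therefore no monitoring.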
A model can achieve high accuracy on a benchmark by exploiting statistical patterns in the test set rather than demonstrating general capability. Ask for the failure cases — what does the model get wrong, and how often?
What is the performance improvement over a baseline (GPT-4o with a basic prompt, or a rule-based system)? A 2% improvement over a baseline that costs 10x less to operate is not a moat.
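The comparison is worth writing down explicitly, because a headline accuracy number hides the cost side. A toy calculation (all figures hypothetical):

```python
def cost_adjusted_lift(model_acc: float, model_cost: float,
                       base_acc: float, base_cost: float) -> tuple:
    """Accuracy lift over the baseline, alongside the relative cost of getting it."""
    return model_acc - base_acc, model_cost / base_cost

# Hypothetical: the "proprietary" model beats a cheap baseline by 2 points
# while costing 10x more per call to operate.
lift, cost_ratio = cost_adjusted_lift(0.92, 1.00, 0.90, 0.10)
print(f"+{lift:.0%} accuracy at {cost_ratio:.0f}x the cost")  # prints "+2% accuracy at 10x the cost"
```

If that is the trade on the table, the moat is the workflow and distribution, not the model.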
This is the hidden cost most acquirers miss.
Model inference costs as a function of user scale, not as a snapshot of today's bill.
A product at ₹50L ARR with 40% gross margins might have 15% gross margins at ₹5Cr ARR if inference spend grows faster than revenue (heavier per-user usage, flat-rate pricing) and compute isn't optimised.
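To make the margin-compression risk concrete, here's a toy model. All figures are hypothetical; plug in the target's actual unit economics and token pricing:

```python
def gross_margin(arr: float, annual_inference_cost: float) -> float:
    """Gross margin after inference spend (other COGS ignored for simplicity)."""
    return (arr - annual_inference_cost) / arr

# Hypothetical: ARR grows 10x but inference spend grows ~14x, because
# later-cohort power users consume far more tokens per rupee of revenue.
print(f"{gross_margin(5_000_000, 3_000_000):.0%}")          # ₹50L ARR → prints "40%"
print(f"{gross_margin(50_000_000, 3_000_000 * 14.2):.0%}")  # ₹5Cr ARR → prints "15%"
```

The question to ask the vendor: what is inference cost per active user per month, and how has it trended as usage deepened?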
The most valuable AI businesses have data flywheels — the product generates training data that improves the model, which improves the product, which attracts more users, which generates more data.
Evaluate whether that loop actually closes: does usage generate labelled data, does that data reach the training pipeline, and has the model measurably improved as a result?
A product without a data flywheel is a point-in-time implementation. One with a working flywheel has compounding value.
Without monitoring, a model can silently degrade in production for months before anyone notices.
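One standard drift check worth asking about is the Population Stability Index, which compares the distribution of recent model scores against a baseline window. A self-contained sketch (my own implementation; the cut-off values cited are an industry rule of thumb, not a standard):

```python
import math

def psi(expected, actual, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a recent score sample.

    Rule of thumb: < 0.1 stable, 0.1-0.25 investigate, > 0.25 significant drift.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a degenerate range

    def dist(xs):
        counts = [0] * bins
        for x in xs:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        return [max(c / len(xs), 1e-6) for c in counts]  # floor avoids log(0)

    e, a = dist(expected), dist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

If the vendor cannot show a dashboard or alert wired to something like this, assume nobody would notice a silent regression.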
For AI products where accuracy matters (legal, medical, financial, factual Q&A), hallucination rates must be measured, not assumed. Where hallucinations are business-critical risks (wrong legal advice, incorrect financial calculations), the product needs explicit hallucination mitigation, not just disclaimers.
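For RAG products, one basic mitigation to look for is a groundedness check: is each claim in the answer actually supported by the retrieved sources? The sketch below is a deliberately naive lexical proxy (my own construction; production systems typically use NLI models or an LLM judge), but even this catches fully unsupported sentences:

```python
def grounding_score(answer: str, sources: list[str], threshold: float = 0.5) -> float:
    """Fraction of answer sentences with substantial word overlap in the sources."""
    source_words = set(" ".join(sources).lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    grounded = 0
    for sentence in sentences:
        words = set(sentence.lower().split())
        # A sentence counts as grounded if most of its words appear in the sources
        if words and len(words & source_words) / len(words) >= threshold:
            grounded += 1
    return grounded / max(len(sentences), 1)

sources = ["The contract term is 24 months with auto-renewal."]
answer = "The contract term is 24 months. Penalties are waived entirely."
print(grounding_score(answer, sources))  # prints 0.5 — second sentence is unsupported
```

Ask whether anything like this gates outputs before they reach the user, and what happens when the score is low.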
If the product depends on a third-party foundation model, treat the provider relationship as a core risk: pricing changes, model deprecations, and migration cost all land on the acquirer.
| Finding | Implication |
|---------|-------------|
| No proprietary training data | Value is workflow, not AI — adjust multiple |
| Inference margins below 30% | Profitability risk at scale |
| No data flywheel | Point-in-time moat, not compounding |
| Single model provider dependency | Provider risk; factor in migration cost |
| Test set contamination suspected | Discount performance claims |
| No retraining pipeline | Model will stagnate; budget for MLOps buildout |
| User data used for training without clear consent | Legal risk in EU/UK markets |
Evaluating an AI or ML product for acquisition and want to understand what's actually under the hood? Contact us — we conduct AI-specific technical due diligence covering model ownership, inference economics, data provenance, and MLOps maturity.