Buyer guide

How to evaluate an AI vendor

A criteria-first framework for evaluating AI partners — the questions to ask, the answers to listen for, the red flags to avoid, and a scoring rubric you can run across multiple vendors.

The AI services market is crowded with firms that pitch well and ship little — tool resellers calling themselves consultants, strategy shops that hand you a deck and disappear, startups with no track record, and body shops that put senior people in the sales meeting and juniors on your account. This guide is a pragmatic framework for telling them apart: the questions to ask, the answers to listen for, the red flags to walk away from, and a scoring rubric you can run across every vendor on your shortlist.

It assumes you’ve already decided to bring in an outside partner rather than hire internally. If you’re still on that question, that’s a different decision — work it out first, then come back here.

The Five Evaluation Dimensions

A good partner scores well on all five. A bad one usually fails on the same one or two. Weight them in this order — substance carries the most weight because everything else is downstream of whether the firm can actually build and run a working system.

1. Substance (40% weight)

What you’re testing: have they actually shipped AI work in production? Or are they reselling tools, repackaging slides, or pivoting in from another consulting domain?

Questions to ask:

  • How many mid-market AI engagements have you delivered in the last 18 months?
  • Show me a working system you shipped — a real one running in production, not a demo.
  • What’s your longest-running client relationship, and why has it lasted?
  • What AI work have you turned down recently, and why?

Listen for:

  • Specific stories with specific clients (or blinded ones with specific outcomes)
  • Examples of failures and what they learned from them
  • Confident “no” answers — firms that turn work down know what they’re good at
  • Working systems in production, not a pile of pilots that never shipped

Red flags:

  • Every example is hypothetical, or a “case study” with no numbers attached
  • Every answer is about strategy and roadmaps; nothing about shipped builds
  • They keep steering back to their tool or platform
  • A recent founding date paired with claims of “decades of AI expertise”

2. Senior Team (20% weight)

What you’re testing: will the senior people in the sales meeting be the ones doing the work? Or do you get juniors after you sign?

Questions to ask:

  • Who specifically will be on my account?
  • What’s their background, and can I meet them before signing?
  • What share of your work is delivered by senior staff vs. contractors or offshore?
  • Will my main contact change over the course of the relationship?

Listen for:

  • Specific names, with profiles you can verify
  • A lead with real years behind them, not a title invented for the pitch
  • A stated policy on offshoring, not a maybe
  • A pattern of long tenure on the team

Red flags:

  • “We have a network of contractors we tap as needed”
  • The salesperson can’t tell you who will actually deliver
  • A large offshore footprint that quietly does the real work
  • High turnover hidden behind “we’re growing fast”

3. Method (15% weight)

What you’re testing: do they have a real delivery method, or are they figuring it out as they go on your dime?

Questions to ask:

  • Walk me through a typical first 90 days.
  • How do you decide what to build first?
  • How do you measure ROI?
  • What happens when the AI gets something wrong in production?
  • How do you transfer knowledge to my team?

Listen for:

  • A named, repeatable process — not improvisation
  • Concrete deliverables in the first 30 days
  • Quantitative ROI tracking, not “we’ll show value”
  • Clear human-in-the-loop and error-handling protocols
  • A defined knowledge-transfer plan, not “we’ll work alongside your team”

Red flags:

  • Every answer is “it depends” with no follow-up
  • No defined onboarding process
  • ROI measurement is vague — “we’ll know it when we see it”
  • No plan for what happens when the AI is wrong

4. Commercial (15% weight)

What you’re testing: is the firm aligned with your results, or with its own billable hours?

Questions to ask:

  • What’s your pricing model — fixed-scope, retainer, or hourly?
  • Where’s your pricing published?
  • What are your cancellation terms?
  • What happens if we don’t see the payback you projected?
  • How do you handle scope changes?

Listen for:

  • Published pricing on the website, not a quote form and a sales call
  • Fixed-scope or retainer as the primary model, not open-ended hourly
  • Reasonable cancellation terms — 30 days’ notice is typical
  • A clear scope-change process you can see coming
  • A firm willing to put real terms behind its results — a payback target it stands behind on a retainer, a KPI commitment on a larger fixed-scope program

Red flags:

  • Everything is hourly
  • No published pricing — quote forms only
  • Long minimum commitments with no justification
  • Vague answers on what happens if results don’t materialize
  • Constant re-quoting that turns every small request into a change order

5. Cultural Fit (10% weight)

What you’re testing: will you actually want to work with these people every week for the next year?

Questions to ask:

  • How would you describe your firm’s culture?
  • What kind of clients do you not work well with?
  • How do you handle disagreement with a client’s direction?
  • Tell me about a client you had to push back on hard.

Listen for:

  • Real personality, not just polish
  • Honesty about who they’re not for
  • Willingness to disagree — you want this, even when it’s uncomfortable
  • Stories where they pushed back and were right, and where they were wrong

Red flags:

  • Overly polished, no edges, no opinions
  • “Every client is a great fit” answers
  • Conflict-avoidant — “the customer is always right”
  • Can’t name a single time they pushed back

The Scoring Rubric

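Run this for every vendor in active consideration. Score each dimension out of 10, multiply the score by the dimension’s weight in percentage points, and divide by 10, so a 7 on Substance contributes 28 points. Add the five results for a total out of 100.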

Dimension      | Weight | Vendor A | Vendor B | Vendor C
Substance      | 40%    | /10      | /10      | /10
Senior Team    | 20%    | /10      | /10      | /10
Method         | 15%    | /10      | /10      | /10
Commercial     | 15%    | /10      | /10      | /10
Cultural Fit   | 10%    | /10      | /10      | /10
Weighted Total | 100%   | /100     | /100     | /100

How to read the score:

  • Below 70: pass on the vendor.
  • 70–85: proceed with caution. Get references and start with a small first engagement before committing further.
  • Above 85: green light.
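
If you are scoring several vendors, the arithmetic is easy to put in a short script. The Python below is a minimal sketch, not part of the framework itself: the function names and the example scores are hypothetical, and it assumes each score is on a 0 to 10 scale and that totals of exactly 70 or exactly 85 fall in the "proceed with caution" band.

```python
# Weights from the rubric, in percentage points (they sum to 100).
WEIGHTS = {
    "Substance": 40,
    "Senior Team": 20,
    "Method": 15,
    "Commercial": 15,
    "Cultural Fit": 10,
}


def weighted_total(scores):
    """Return a 0-100 total from per-dimension scores on a 0-10 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= scores[name] <= 10:
            raise ValueError(f"{name} score must be between 0 and 10")
    # score (0-10) x weight (points) / 10 keeps the maximum at 100.
    return sum(scores[name] * weight for name, weight in WEIGHTS.items()) / 10


def verdict(total):
    """Map a 0-100 total to the guide's three bands (boundary handling is assumed)."""
    if total < 70:
        return "pass on the vendor"
    if total <= 85:
        return "proceed with caution"
    return "green light"


# Hypothetical scores for one vendor, for illustration only.
vendor_a = {
    "Substance": 8,
    "Senior Team": 7,
    "Method": 6,
    "Commercial": 9,
    "Cultural Fit": 7,
}
total = weighted_total(vendor_a)
print(f"{total:.1f} -> {verdict(total)}")  # prints "75.5 -> proceed with caution"
```

Because Substance carries 40 of the 100 points, a vendor that scores 5 there can reach at most 80 even with perfect marks everywhere else, which keeps it out of the green-light band.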

Reference Calls: What to Ask

Always do two or three reference calls. If a firm can’t or won’t connect you with past clients, that’s your answer — walk away.

Questions for another mid-market operator who has used the vendor:

  1. Walk me through your engagement — when did you start, what did you buy, what did you build?
  2. Did the work actually pay for itself?
  3. What did the vendor do well that surprised you?
  4. What didn’t go well?
  5. Did you renew? Why or why not?
  6. Who specifically did your work — was it the team they pitched?
  7. Was the pricing what they quoted?
  8. Were there any unexpected costs?
  9. What advice would you give me before signing?
  10. Would you hire them again, or do it differently?

Common Vendor Archetypes (and What to Watch For)

The Tool Reseller

“We’re partners with [AI platform]. We’ll implement it for you.”

The revenue model is tool licensing, not your outcome. They’ll push the tool whether or not it’s the right fit. Ask what they’d recommend if the tool weren’t on the table.

The Strategy Consultant

“We’ve built AI roadmaps for hundreds of companies.”

They hand you a deck and disappear at exactly the moment the work gets hard. Ask: have you actually shipped a working system, and can I see it running?

The Engineering Boutique

“We’re the engineers. We build.”

They build what you ask for, even when what you asked for is wrong. Ask: what would you tell me not to build?

The AI Startup

“We were built for the AI era.”

They don’t have years of mid-market software experience behind them, and their playbook is unproven. Ask: who on your team has done this before, with a company like mine?

The Offshore Body Shop

“We have 200 engineers and very competitive rates.”

Senior people in the sales meeting, junior people on your account. Ask: who specifically will deliver, and can I meet them?

The Real Partner (rare)

Senior team, real shipped work, retainer or fixed-scope pricing with reasonable terms, and a culture you’d actually want in the room. This is who you’re looking for.

A Two-Step Approach

Even after a clean evaluation, the safest play is to earn trust in stages rather than betting everything up front.

Step 1: A small, scoped first engagement

  • A short, fixed-scope Value Sprint ($2K–$25K, up to roughly $95K for the largest), or a low-commitment on-ramp like a workshop or trial
  • 30 to 90 days
  • Real work, a real deliverable, real payback you can measure
  • Low commitment, low downside

Step 2: Scale up if it works

  • Move to an AI Office retainer — the tier that pays for itself, targeting at least 3X its cost — or a larger fixed-scope program
  • Commit based on actual performance, not promises

Staging the commitment caps your downside. For at most 90 days and the price of one small fixed-scope engagement, you find out whether a partner actually delivers, instead of betting six figures and a year on someone who simply pitched well.

The Bottom Line

Most AI vendors fail to deliver. The market is full of tool resellers pretending to be consultants, strategy firms pretending to be builders, startups pretending to have track records, and body shops pretending to have senior teams.

The right partner is rare and worth finding. Use the rubric. Always do references. Always start with a small first engagement.

If you’d like to evaluate Frogslayer against this framework, book a short intro and ask us every question on this list. We’ll answer straight.
