Does anyone have experience using a local LLM with MCP to control ERPNext?
I feel that most LLMs — even the big ones — are too dumb to use the provided tools properly. It kind of defeats the whole purpose of having the AI if I have to explain every single step in detail.
Does anyone have recommendations for models that actually handle tool usage well? Or is there perhaps a dataset I could fine-tune on?
I’m happy to share my own experiences in return.
For reference, I’m running an RTX 4090, so I should have more than enough horsepower to run a decent-sized model.
Yeah, I’ve tinkered a bit with local LLM + MCP (Model Context Protocol) setups for ERPNext automation, so I can share both my experience and some ideas to get better tool usage.
You’re right — most LLMs, even the high-end ones, tend to either under-use tools or misuse them unless you spoon-feed the steps. That’s partly because general-purpose models aren’t trained to act like a reliable “agent” out of the box — they’re trained to chat, not to robustly call structured tools.
1. Models That Handle Tools Well (Locally)
Here’s what I’ve found for RTX 4090 running local inference:
| Model | Strength for MCP / Tool Use | VRAM Req | Notes |
|---|---|---|---|
| Mistral-Nemo-Instruct-2407 | Very good reasoning + compact context-window handling | ~15 GB (16-bit) | Tends to follow JSON tool schemas without much fuss. |
| LLaMA 3.1 70B Instruct (quantized to Q4_K_M) | Strong reasoning, better at multi-step tool chains | ~12 GB (Q4), ~40 GB FP16 | Needs good prompting; slower than smaller models. |
| Nous-Hermes 2 Mistral 7B | Surprisingly obedient to function-calling patterns | ~8 GB | Lower hallucination rate for ERPNext API workflows. |
| Deepseek-Coder-V2 16B | Excellent at ERPNext Python/JS code generation for tools | ~16 GB | Combine with Mistral for hybrid reasoning+coding. |
If you want plug-and-play MCP + local model, Ollama or LM Studio are easiest to run with your GPU and to hook into MCP servers.
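To make this concrete, here is a minimal sketch of what a tool-enabled request to Ollama's `/api/chat` endpoint looks like. The `get_purchase_invoice` tool is a hypothetical ERPNext wrapper, not something Ollama ships; the payload just shows the OpenAI-style `tools` field Ollama accepts:

```python
import json

# Hypothetical ERPNext tool definition, in the schema format Ollama's
# /api/chat endpoint accepts in its "tools" field.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_purchase_invoice",
            "description": "Fetch a Purchase Invoice from ERPNext by ID.",
            "parameters": {
                "type": "object",
                "properties": {
                    "invoice_id": {
                        "type": "string",
                        "description": "e.g. PINV-0001",
                    },
                },
                "required": ["invoice_id"],
            },
        },
    }
]

# Request body you would POST to http://localhost:11434/api/chat
payload = {
    "model": "mistral-nemo",
    "messages": [{"role": "user", "content": "Show me invoice PINV-0001"}],
    "tools": tools,
    "stream": False,
}

print(json.dumps(payload, indent=2))
```

The model's reply then carries a `tool_calls` entry instead of prose, which your MCP server executes and feeds back as a `tool` message.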
2. Why They Fail at Tools
Even strong models fail in 3 common ways:
- Over-responding — returns chatty explanations instead of structured tool calls.
- Under-chaining — fails to combine multiple tool calls to reach the final goal.
- Schema drift — produces invalid JSON or mismatched field names.
This happens because:
- Training data lacks ERPNext-style API + MCP usage examples.
- Models optimize for sounding right, not completing the task with minimal steps.
3. How to Improve Tool Use
a. Prompt Engineering
Use a strict system message in MCP:
```
You are an autonomous ERPNext operator.
Only use provided tools to take actions.
Never explain steps, just execute them.
If a tool is missing, ask for it.
```
And enforce JSON-only responses with a parser that rejects bad output.
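A minimal sketch of such a parser, assuming the single-object `{"name": ..., "arguments": ...}` call format used below (a real guardrail would also check `arguments` against each tool's schema, e.g. with `jsonschema` or `pydantic`):

```python
import json


def parse_tool_call(raw: str) -> dict:
    """Accept only a single JSON object with exactly 'name' and 'arguments'."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    if set(call) != {"name", "arguments"}:
        raise ValueError("expected exactly the keys 'name' and 'arguments'")
    if not isinstance(call["arguments"], dict):
        raise ValueError("'arguments' must be an object")
    return call
```

A chatty response like `"Sure! Here's the call: ..."` fails `json.loads` immediately, which is exactly what you want before triggering a retry.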
b. Few-shot Examples
Feed examples of correct ERPNext API / MCP calls:
```json
{"name": "get_purchase_invoice", "arguments": {"invoice_id": "PINV-0001"}}
```
Show both simple and multi-step scenarios so it learns to chain.
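For the multi-step case, a few-shot target might look like this (the tool names here are hypothetical, for illustration only):

```json
[
  {"name": "list_purchase_orders", "arguments": {"status": "Overdue", "older_than_days": 60}},
  {"name": "update_purchase_order", "arguments": {"po_id": "PUR-ORD-0001", "status": "Closed"}}
]
```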
c. Fine-tuning / LoRA
You could fine-tune on:
- ERPNext REST API calls and responses.
- MCP conversation logs where you correct tool misuse.
- Public datasets like ToolBench + ERPNext-specific samples.
QLoRA fine-tuning on your 4090 works well — you only need a few thousand well-curated examples to make a big difference in tool obedience.
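A sketch of how you might turn corrected MCP logs into training records. The `messages` layout is an assumption: most SFT tooling (Axolotl, TRL, etc.) accepts a chat-style list like this, but adjust the keys to your trainer's template:

```python
import json


def to_sft_record(user_request: str, tool_calls: list) -> dict:
    """Turn one corrected MCP interaction into a supervised fine-tuning record.

    The assistant target is the exact tool-call JSON, one call per line, so
    the model learns to emit structured calls instead of prose.
    """
    return {
        "messages": [
            {"role": "system",
             "content": "You are an autonomous ERPNext operator."},
            {"role": "user", "content": user_request},
            {"role": "assistant",
             "content": "\n".join(
                 json.dumps(c, separators=(",", ":")) for c in tool_calls
             )},
        ]
    }


record = to_sft_record(
    "Fetch invoice PINV-0001",
    [{"name": "get_purchase_invoice",
      "arguments": {"invoice_id": "PINV-0001"}}],
)
# Append records like this, one per line, to a .jsonl file for QLoRA training.
```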
4. My Working Setup
- Ollama running Mistral-Nemo-Instruct locally.
- Custom MCP server exposing ERPNext REST and whitelisted DB queries.
- Preloaded prompt context with:
- ERPNext doctype → API method mapping.
- Tool usage examples.
- Guardrail layer that validates tool JSON and retries up to 3 times if invalid.
This now lets me say:
“Close all overdue Purchase Orders older than 60 days”
and the model just loops through MCP calls without me handholding it.
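The guardrail/retry layer is the piece that makes this hands-off. A minimal skeleton, assuming `generate` and `validate` are stand-ins for your model client and schema check:

```python
import json

MAX_RETRIES = 3


def call_with_guardrail(generate, validate, prompt: str) -> dict:
    """Retry the model up to MAX_RETRIES times until it emits valid tool JSON.

    generate(prompt) returns the raw model output; validate(call) raises
    ValueError on a bad call. This is only the retry skeleton, not an agent.
    """
    last_error = None
    for _ in range(MAX_RETRIES):
        raw = generate(prompt)
        try:
            call = json.loads(raw)
            validate(call)
            return call
        except (json.JSONDecodeError, ValueError) as exc:
            last_error = exc
            # Feed the error back so the model can self-correct on retry.
            prompt = (f"{prompt}\n\nPrevious output was invalid "
                      f"({exc}). Respond with JSON only.")
    raise RuntimeError(f"model failed after {MAX_RETRIES} attempts: {last_error}")
```

Feeding the validation error back into the prompt is what makes the retries effective: the model usually fixes malformed JSON on the second attempt.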
Thank you very much for your reply!
Could you perhaps share for which purposes or application areas you found MCP to be most suitable, and with which commands you achieved the best results?
It feels like you can do almost anything with MCP. My hope is to use it to help new, inexperienced ERPNext users get started more easily — to bridge the gap between knowing what they want to do but not knowing where or what to click in the system.
I tried my best by choosing different LLMs, using pre-prompts, few-shot examples, etc., but I don’t think the technology is quite there yet. Maybe I’m trying to do something that’s not really feasible — like pasting in some information about a resistor and expecting the LLM to create a new item out of it. I need to spoon-feed the AI to get usable results, be patient, and know exactly what I’m doing. Often, it’s still faster to complete most tasks myself.

Perhaps if there were a well-curated training set of user prompts and MCP commands specifically for ERPNext/Frappe, this technology would work better. For now, I cannot use it reliably in a production environment.