LangAlpha

Financial research agent with PTC (Programmatic Tool Calling) — the LLM writes and runs Python code in a sandbox to call MCP-backed financial data tools, produce charts, and do multi-step analysis.

Repo: ginlix-ai/LangAlpha · Python/FastAPI · LangGraph · React frontend

Evaluated 2026-04-14–15. Three flash models (qwen3.5 9B, qwen3.5 27B, Gemma 4 31B), docker sandbox PTC with Gemma 4. 17 experiments. Full log: /workspace/group/projects/langalpha/LAB_NOTEBOOK.md.

What it is

LangAlpha is a self-hostable AI research assistant for stock analysis. Two modes:

Flash — lightweight, no sandbox. Web search, real-time market data, SEC filings, analyst ratings. Quick answers, direct tool dispatch.

PTC (Programmatic Tool Calling) — the differentiator. Instead of calling financial tools directly, the agent writes Python code that executes in an isolated sandbox. The sandbox has the full MCP financial data library pre-loaded as importable modules. Ask for a DCF model; get back a five-year projection table and executable code.

The market data stack is real: Yahoo Finance (no API key), SEC EDGAR, analyst price targets, revenue breakdowns, real-time quotes. Worth extracting as a standalone library.

What we ran

Custom docker-compose.local.yml (no bind mounts, DooD-compatible)
OPENAI_BASE_URL pointed at local Ollama; OPENAI_API_KEY=ollama
Flash: SANDBOX_PROVIDER=memory, three models, 15 experiments
PTC: SANDBOX_PROVIDER=docker, langalpha-sandbox:latest, Gemma 4 31B, 2 experiments
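
The two configurations above differ only in the sandbox provider. A minimal sketch of the environment overrides follows. The variable names and values come from the list above. The helper `sandbox_env` is hypothetical, and the base URL is an assumption: it is Ollama's default OpenAI-compatible endpoint, not a value recorded in this evaluation.

```python
# Hypothetical helper: builds the environment overrides used for each mode.
# The default base_url is an assumption (Ollama's standard local endpoint).
def sandbox_env(mode: str, base_url: str = "http://localhost:11434/v1") -> dict[str, str]:
    """Return the environment overrides for one evaluation mode."""
    providers = {"flash": "memory", "ptc": "docker"}
    if mode not in providers:
        raise ValueError(f"unknown mode: {mode!r}")
    return {
        "OPENAI_BASE_URL": base_url,      # points the backend at local Ollama
        "OPENAI_API_KEY": "ollama",       # placeholder key, as used in the runs
        "SANDBOX_PROVIDER": providers[mode],
    }


for mode in ("flash", "ptc"):
    print(mode, sandbox_env(mode)["SANDBOX_PROVIDER"])
```

The PTC run additionally needs /var/run/docker.sock mounted and the langalpha-sandbox:latest image available, which this sketch does not cover.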

Flash mode findings

qwen3.5 (both sizes): ★☆☆☆☆

Both the 9.7B and 27B variants were blocked by the secretary skill onboarding loop. Every query (for example "AAPL price" or "use get_company_overview for AAPL") was intercepted and replaced with a canned greeting. Zero financial tool calls across all trials, regardless of context size (32K or 128K). Raising num_ctx to 131072 made instruction-following worse and did not break the loop.

The secretary skill prompt structure is the bottleneck. It's calibrated for frontier models that can follow multi-step conditional instructions. Local models at this size pattern-match the examples and treat the onboarding state as terminal.

Gemma 4 31B: ★★★☆☆

Makes tool calls and dispatches them to the financial tools. It populates schemas correctly, including the get_user_data entity field that qwen3.5 silently gets wrong.

One persistent issue turned out to be environmental: for several experiments Gemma 4 returned NVDA regardless of the queried ticker. The cause was not a model-weight bias but the shared flash-workspace checkpointer. All flash queries for a given user share a deterministic workspace ID (uuid5(namespace, user_id)), and LangGraph's Postgres checkpointer had accumulated ~87 checkpoint_writes entries from prior NVDA-returning sessions. The model was loading its own earlier wrong outputs as in-context examples and pattern-matching forward.

After clearing checkpoints and checkpoint_writes: Gemma 4 correctly routes "What is AAPL trading at?" → get_company_overview(symbol="AAPL"). Confirmed with direct Ollama tests at full 24-tool context.
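
To illustrate why every flash query landed in the same checkpoint history, here is a minimal sketch using Python's standard uuid module. The namespace value and the function names are hypothetical; only the uuid5(namespace, user_id) shape and the per-session variant proposed under "What's fixable" come from this write-up.

```python
import uuid

# Hypothetical namespace; the real one used by LangAlpha is not recorded here.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "langalpha.example")


def per_user_workspace(user_id: str) -> uuid.UUID:
    """Current behavior: one workspace per user, shared by all sessions."""
    return uuid.uuid5(NAMESPACE, user_id)


def per_session_workspace(user_id: str, session_id: str) -> uuid.UUID:
    """Proposed behavior: a distinct workspace for each session."""
    return uuid.uuid5(NAMESPACE, f"{user_id}/{session_id}")


# uuid5 is a pure function of its inputs, so the same user always maps
# to the same workspace and inherits whatever checkpoints it holds.
print(per_user_workspace("alice") == per_user_workspace("alice"))
# Adding the session ID separates histories while staying deterministic.
print(per_session_workspace("alice", "s1") == per_session_workspace("alice", "s2"))
```

The per-session form keeps IDs reproducible for a given session, so resuming that session still finds its own checkpoints.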

Remaining real issues on Gemma 4: instruction-following degrades under vague prompts (scope creep, query rewriting). The secretary skill loop does not appear, but only because Gemma 4's stronger tool-calling avoids it; the root cause in the prompt remains for smaller models.

PTC mode findings

Sandbox boot: ✓ (~3s)

Docker provider (SANDBOX_PROVIDER=docker, /var/run/docker.sock mounted) works without Daytona. langalpha-sandbox:latest spawns in ~3 seconds. Code executes in ~4 seconds. Files persist in /home/workspace/work/.

Experiment 16 — AAPL DCF, vague prompt

Query: "Build a DCF model for AAPL using the last 3 years of FCF."

Result: 7 execution rounds, 700 seconds, no final answer. The agent self-corrected a tool API error (called get_cash_flow_statement → auto-read tool docs → retried with correct get_cash_flow). But it scope-crept from "DCF model" into a full research package: news, analyst upgrades, price targets, a Tavily web search. Data was fetched correctly (real FCFs: $108.8B 2024, $99.6B 2023, $111.4B 2022). The DCF was never computed.

Experiment 17 — AAPL DCF, explicit prompt

Query: "Build a DCF. Use get_cash_flow(ticker='AAPL', quarterly=False). 10% WACC, 3% terminal growth. Print intrinsic value per share."

Result: Clean answer in two execution rounds. Self-corrected a dict indexing error (fcf_data[0] → navigated dict keys). Final output:

Year	FCF	PV
1	$103.7B	$94.3B
2	$108.9B	$90.0B
3	$114.3B	$85.9B
4	$120.1B	$82.0B
5	$126.1B	$78.3B
Terminal	$1,854.8B	$1,151.7B

Enterprise Value: $1,582B → Intrinsic value per share: $106.16

Real data. Real calculation. The PTC pattern works.
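
The table from Experiment 17 can be checked by hand. The sketch below is not the code the agent wrote. It assumes a year-1 FCF of $103.7B taken from the table and a 5% annual growth rate inferred from the year-over-year ratios in the FCF column; the 10% WACC and 3% terminal growth come from the prompt. The terminal value uses the standard Gordon growth formula.

```python
def dcf(first_year_fcf, growth, wacc, terminal_growth, years=5):
    """Discount projected FCFs and a Gordon-growth terminal value (in $B)."""
    if wacc <= terminal_growth:
        raise ValueError("WACC must exceed terminal growth")
    rows = []
    fcf = first_year_fcf
    for year in range(1, years + 1):
        pv = fcf / (1 + wacc) ** year          # present value of that year's FCF
        rows.append((year, fcf, pv))
        fcf *= 1 + growth                      # project the next year's FCF
    last_fcf = rows[-1][1]
    terminal_value = last_fcf * (1 + terminal_growth) / (wacc - terminal_growth)
    terminal_pv = terminal_value / (1 + wacc) ** years
    enterprise_value = sum(pv for _, _, pv in rows) + terminal_pv
    return rows, terminal_value, terminal_pv, enterprise_value


# 5% growth is inferred from the table, not stated in the experiment log.
rows, tv, tv_pv, ev = dcf(103.7, growth=0.05, wacc=0.10, terminal_growth=0.03)
for year, fcf, pv in rows:
    print(f"{year}\t${fcf:.1f}B\t${pv:.1f}B")
print(f"Terminal\t${tv:,.1f}B\t${tv_pv:,.1f}B")
print(f"Enterprise value: ${ev:,.0f}B")

# Share count backed out of the reported figures (EV / $106.16), in billions.
implied_shares = ev / 106.16
print(f"Implied share count: {implied_shares:.1f}B")
```

Under these assumptions the figures agree with the reported table to within rounding of the inputs, which suggests the agent's arithmetic is internally consistent.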

What does Daytona get you?

Nothing you need. The docker provider (SANDBOX_PROVIDER=docker, /var/run/docker.sock mounted, langalpha-sandbox:latest) delivers working PTC: 3-second sandbox boot, real code execution, real financial data, file persistence per session. That's all PTC requires.

Daytona is a 14-service stack (including API, runner, ssh-gateway, dex, postgres, redis, minio, registry, proxy, pgadmin, jaeger, otel-collector, maildev) that adds SSH access into sandboxes, MinIO-backed persistence across restarts, and multi-user auth scoping. These matter at scale. For local or single-user deployment they're overhead with no payoff. Use the docker provider.

What's fixable

Checkpointer scope — flash workspace IDs should be per-session, not per-user. One line: replace uuid5(namespace, user_id) with uuid5(namespace, f"{user_id}/{session_id}"). Prevents accumulation of stale context across separate research sessions.
Secretary skill priority — onboarding state machine overrides user queries. A message-count check or "skip_onboarding": true preference would fix repeat-user UX.
Token budget — 120K summarization threshold in agent_config.yaml is wrong for 32K models. Should read from model config, not be hardcoded.
Model-aware examples — flash_identity.md.j2 uses NVDA as the example ticker in 4 locations. Local models anchor on examples; frontier models ignore them. Strip or randomize for local deployments.
Scope discipline on PTC — Gemma 4 rewrites vague tasks. Either tighten the system prompt or enforce a planning step that confirms task scope before execution.
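
Two of the fixes above, the onboarding gate and the model-aware token budget, can be sketched in a few lines. Everything here is hypothetical: the `UserState` class, the function names, and the 0.75 fraction are illustrative assumptions, not LangAlpha's code. Only the 120K hardcoded value, the message-count check, and the skip_onboarding preference come from the list above.

```python
from dataclasses import dataclass

HARDCODED_THRESHOLD = 120_000  # value reported in agent_config.yaml


@dataclass
class UserState:
    """Hypothetical per-user state consulted before the secretary skill runs."""
    message_count: int = 0
    skip_onboarding: bool = False


def should_onboard(state: UserState) -> bool:
    """Run onboarding only for a first-time user who has not opted out."""
    return state.message_count == 0 and not state.skip_onboarding


def summarization_threshold(context_window: int, fraction: float = 0.75) -> int:
    """Derive the summarization trigger from the model's context window."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return int(context_window * fraction)


print(should_onboard(UserState()))                 # first contact
print(should_onboard(UserState(message_count=3)))  # returning user
print(summarization_threshold(32_768), HARDCODED_THRESHOLD > 32_768)
```

The last line shows the mismatch: a 120K trigger is larger than a 32K model's entire window, while a threshold derived from the window stays below it.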

Verdict

The PTC pattern is real and works. A DCF model from a natural-language prompt, with real Yahoo Finance data, in a sandboxed Python environment, is a genuine capability, not a demo. The sandbox (docker or Daytona) takes about 3 seconds to boot and about 4 seconds to run code.

The model is the constraint. Gemma 4 31B gets you working PTC with explicit prompting, but it scope-creeps on vague requests: the vague-prompt DCF ran for about 12 minutes (700 seconds) without producing an answer. A frontier model (Claude 3.5+, GPT-4o) would likely resolve both, though none was tested here. qwen3.5 is blocked entirely by the secretary skill loop.

The market data layer is production-quality and reusable. The deployment story (docker provider path) is straightforward once you know the Dockerfile.backend is missing two COPY lines. The Daytona path works but adds 13 services you probably don't need until you have multiple users.

Component	Rating	Notes
Market data API	★★★★☆	Real, clean, no key required
Flash on Gemma 4	★★★☆☆	Works on clean checkpointer; scope-creeps
Flash on qwen3.5	★☆☆☆☆	Secretary loop, no tool calls
PTC (explicit prompt)	★★★★☆	Real calculations, self-correcting
PTC (vague prompt)	★★☆☆☆	Scope-creep, 12 min, no answer
Docker sandbox	★★★★☆	Works, 3s boot, free — use this
Daytona sandbox	★☆☆☆☆	14 services for zero additional value at solo scale