
Running AI On-Site: TurboQuant, Local LLMs, and the End of Cloud Dependency

The Three Problems with Cloud AI on a Job Site

You're standing on the 4th floor of a building under construction. You need to check something in the specs. You pull out your phone and open your AI assistant... and it spins. And spins. One bar of signal. The nearest cell tower is behind a concrete core wall.

This is the reality of cloud-based AI on construction sites:

Problem 1: Connectivity. Job sites have terrible internet. Basements, concrete structures, and remote locations kill cell signal. WiFi from the trailer doesn't reach the 4th floor. If your AI lives in the cloud, it's useless when you need it most.

Problem 2: Cost. Cloud AI services like Claude and GPT charge per use. A busy construction AI agent might cost $200-500/month in API fees. For a small contractor running on thin margins, that's real money — every month, forever.

Problem 3: Privacy. When you send a question to a cloud AI, your data travels to someone else's server. Your bid numbers, client information, subcontractor pricing, and project details are now on Anthropic's or OpenAI's infrastructure. For many contractors — especially those working on government or military projects — this is a non-starter.

What if you could run the AI locally — on a laptop sitting in the job trailer? No internet needed. No monthly bills. Your data never leaves the device.

Two recent breakthroughs make this possible.


Breakthrough 1: TurboQuant (6x Memory Reduction)

Why Memory Matters

AI models are huge. A powerful model like Llama 70B needs about 140 GB of memory (RAM) to run. Your laptop probably has 16-32 GB. That's why most people use cloud APIs — the models don't fit on their computers.
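Those memory numbers come from simple arithmetic: each parameter is stored as a 16-bit number, so a 70-billion-parameter model needs roughly 70B × 2 bytes of RAM just for its weights. A quick back-of-the-envelope check (weights only — real figures vary a bit with overhead and which layers get compressed):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough memory needed just to hold the weights: parameters x bits, in GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(weight_memory_gb(70, 16))  # 140.0 -> a 70B model at standard 16-bit precision
print(weight_memory_gb(70, 3))   # 26.25 -> the same model squeezed to 3 bits
```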

Compression makes models smaller so they DO fit on normal hardware. But compression usually comes with a catch: the model gets dumber. Like compressing a high-res photo into a blurry thumbnail.

TurboQuant's breakthrough is compression that makes models 6x smaller with zero quality loss. Not "almost as good." Actually the same quality. That's the key.

TurboQuant is a new compression algorithm from Google Research (published at ICLR 2026, one of the top AI conferences). Here's what it does:

Without TurboQuant                      With TurboQuant
Llama 70B needs 140 GB of memory        Llama 70B needs ~23 GB of memory
Requires a $10,000+ GPU server          Runs on a MacBook Pro with 32 GB
$2,000+/month for cloud GPU rental      $0/month — it's your laptop
Requires internet connection            Works completely offline

The technical magic (simplified): Traditional compression reduces the precision of the numbers inside the model (like rounding 3.14159 to 3.1). TurboQuant does something clever — it converts the model's internal data into a different mathematical format (polar coordinates) where compression is much more efficient. The result: 3-bit precision with the same accuracy as 16-bit. Hence the name: turbo compression with quant(ization).
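To make the coordinate-change idea concrete, here's a toy Python sketch. This is only an illustration of "quantize in polar coordinates, then convert back" — not the actual TurboQuant algorithm, whose details are in the paper. Note that the coordinate transform itself is lossless; only the rounding step discards information.

```python
import math

def to_polar(x: float, y: float) -> tuple[float, float]:
    """Cartesian -> polar (radius, angle). The conversion itself loses nothing."""
    return math.hypot(x, y), math.atan2(y, x)

def quantize_polar(x: float, y: float, angle_bins: int = 16) -> tuple[float, float]:
    """Toy quantization in polar coordinates: keep the radius,
    snap the angle to one of `angle_bins` directions, convert back."""
    r, theta = to_polar(x, y)
    step = 2 * math.pi / angle_bins
    theta_q = round(theta / step) * step
    return r * math.cos(theta_q), r * math.sin(theta_q)

# Round-tripping without any rounding recovers the original exactly
r, theta = to_polar(3.0, 4.0)
assert math.isclose(r * math.cos(theta), 3.0)
assert math.isclose(r * math.sin(theta), 4.0)

print(quantize_polar(3.0, 4.0))  # angle snapped to the nearest of 16 directions
```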

# Install TurboQuant
pip install turboquant
 
# Compress a model (one-time process, takes 30-60 minutes)
turboquant compress \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --bits 3 \
  --output ./models/llama-70b-turbo/
 
# The compressed model is now ~23 GB instead of 140 GB
# It fits on a 32 GB MacBook Pro

Breakthrough 2: mac-code (Free Local AI Agent)

mac-code is an open-source project that gives you a Claude Code-like AI agent experience running entirely on your Mac — for free. No API key needed. No internet needed. No subscription.

It runs a 35-billion-parameter AI model locally at 30 tokens per second on Apple Silicon (M1/M2/M3/M4 Macs). That's fast enough for real-time conversation.

# Install mac-code
git clone https://github.com/walter-grace/mac-code.git
cd mac-code
./install.sh
 
# Start the agent (downloads the model on first run — ~20 GB)
mac-code

That's it. You now have an AI agent running on your laptop. No internet. No API key. No monthly cost.

How Fast Is "30 Tokens Per Second"?

A "token" is roughly 3/4 of a word. So 30 tokens/second ≈ 22 words/second. That's fast enough that the AI's response appears to stream in real-time, just like typing in ChatGPT. For practical purposes, it feels instantaneous for short answers and takes 10-20 seconds for long, detailed responses.
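The arithmetic behind those estimates, as a quick sanity check:

```python
TOKENS_PER_SECOND = 30    # measured local generation speed from the text
WORDS_PER_TOKEN = 0.75    # rough rule of thumb: a token is ~3/4 of a word

def response_seconds(word_count: int) -> float:
    """How long a response of `word_count` words takes to stream out."""
    tokens = word_count / WORDS_PER_TOKEN
    return tokens / TOKENS_PER_SECOND

print(f"{TOKENS_PER_SECOND * WORDS_PER_TOKEN:.1f} words/second")  # 22.5
print(f"50-word answer:  {response_seconds(50):.1f} s")   # ~2.2 s
print(f"400-word answer: {response_seconds(400):.1f} s")  # ~17.8 s
```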


Combining Them: A Powerful AI on a Laptop

TurboQuant + mac-code together means you can run a larger, more capable model locally:

Setup for the maximum-power local agent:

# Step 1: Compress a 70B model with TurboQuant
pip install turboquant
turboquant compress \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --bits 3 \
  --output ./models/llama-70b-turbo/
 
# Step 2: Point mac-code at the compressed model
mac-code --model ./models/llama-70b-turbo/
 
# You now have a 70-billion-parameter AI agent
# running on your MacBook. Offline. Free.

Construction Use Cases for Local AI

1. Offline Spec Lookup in the Field

The #1 use case: looking things up in the specs when you're on-site without reliable internet.

# Load the project specs into the local agent
mac-code --documents ~/projects/main-st/specs/
 
# Now ask questions — no internet needed
> What paint system is specified for exterior steel?
 
"Section 09 91 00, 2.1.B: Exterior structural steel shall receive
a 3-coat system: zinc-rich primer (3-5 mils DFT), epoxy intermediate
coat (4-6 mils DFT), and polyurethane topcoat (2-3 mils DFT).
Color per Architect's selection."

2. Daily Report Generation

At the end of the day, feed the agent your notes and get a formatted daily report:

> Write a daily report from these notes:
> Weather: clear 38F
> 22 workers: 10 iron, 8 carps, 4 laborers
> Poured 3rd floor slab section A, 45 CY
> Steel delivery from Allied, 12 tons for 4th floor
> RFI 247 still pending
> No safety incidents
 
"DAILY CONSTRUCTION REPORT
Date: March 28, 2026
Project: Main St. Commercial Build
 
WEATHER: Clear, 38°F. No weather delays.
 
MANPOWER (22 total):
- Ironworkers: 10
- Carpenters: 8
- Laborers: 4
 
WORK COMPLETED:
- Completed concrete pour, 3rd floor slab Section A (45 CY)
 
DELIVERIES:
- Structural steel: 12 tons from Allied Steel (4th floor package)
 
OPEN ITEMS:
- RFI #247: Pending architect response
 
SAFETY: No incidents or near-misses.
 
Prepared by: [Superintendent Name]"

All of this happened on the laptop. No data sent anywhere.

3. Quick Calculations and Checks

> I need 4000 PSI concrete for a 30' x 40' x 8" slab.
> How many cubic yards? What's the rough cost at $180/CY delivered?
 
"Calculation:
- Volume: 30' × 40' × (8/12)' = 800 cubic feet
- Conversion: 800 ÷ 27 = 29.6 cubic yards
- Order quantity: 30 CY (round up, always order extra)
- Plus 5% waste: ~31.5 CY → order 32 CY
 
Cost estimate:
- 32 CY × $180/CY = $5,760 for concrete delivery
- Add pump truck if needed: ~$1,200-1,500
- Total: approximately $7,000-7,300"
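If you want to double-check the agent's arithmetic yourself, the slab calculation is a few lines of Python (dimensions from the example above; the $180/CY price is illustrative):

```python
import math

def slab_cubic_yards(length_ft: float, width_ft: float,
                     thickness_in: float, waste: float = 0.05) -> float:
    """Slab volume in cubic yards, including a waste allowance."""
    cubic_feet = length_ft * width_ft * (thickness_in / 12)
    return cubic_feet / 27 * (1 + waste)

cy = slab_cubic_yards(30, 40, 8)        # 30' x 40' x 8" slab, +5% waste
order = math.ceil(cy)                   # always round up the order
print(f"Net volume + 5% waste: {cy:.1f} CY, order {order} CY")
print(f"Concrete at $180/CY: ${order * 180:,}")  # $5,760
```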

4. Safety Toolbox Talk Generator

> Generate a toolbox talk for concrete pour day.
> Focus on: pump truck safety, vibrator use, fall protection at slab edge.
 
The agent generates a complete toolbox talk script on the laptop, offline, in about 15 seconds.

Cost Comparison: Cloud vs. Local

For a small-to-mid contractor running AI for one project team:

                       Cloud AI (API)                   Local AI (mac-code + TurboQuant)
Hardware               Any computer with internet       MacBook Pro 32 GB ($2,499 one-time)
Monthly AI cost        $200-500/month                   $0/month
Internet required      Yes — always                     No
Annual cost (Year 1)   $2,400-6,000 + existing laptop   $2,499 (laptop) + $0 (AI)
Annual cost (Year 2+)  $2,400-6,000                     $0
Data privacy           Data sent to cloud               Data stays on device
Works offline          No                               Yes
Model quality          Best (Claude Opus, GPT-4)        Very good (70B local)
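A quick break-even check using the numbers above (the $2,499 laptop price and $200-500/month API range are the figures from the comparison, not universal constants):

```python
LAPTOP_COST = 2499           # one-time, from the comparison above
API_COST_RANGE = (200, 500)  # $/month in cloud API fees, from the comparison above

for monthly in API_COST_RANGE:
    months = LAPTOP_COST / monthly
    print(f"At ${monthly}/mo in API fees, the laptop pays for itself "
          f"in {months:.1f} months")
```

In other words, the hardware pays for itself in roughly 5 to 12 months, and everything after that is savings.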

Honest Assessment: Cloud Models Are Still Smarter

Let me be real: a local 70B model is very capable, but it's not as smart as Claude Opus or GPT-4. For complex multi-step reasoning (like analyzing a 200-page spec for contradictions), cloud models are significantly better.

The practical sweet spot: Use local AI for routine field tasks (spec lookup, daily reports, calculations, toolbox talks) and cloud AI for complex office tasks (estimating, contract analysis, multi-document synthesis). This hybrid approach gives you the best of both worlds — offline field capability and maximum intelligence for office work.


The Hybrid Setup: Local + Cloud

For most contractors, the ideal setup is both — local AI for the field, cloud AI for the office:

How the sync works: The field laptop runs locally all day. When the superintendent is back at the trailer (or has cell signal), a quick sync pushes the day's data to the office systems — daily logs, safety observations, delivery records — where the cloud-based agents (managed by Paperclip) can process them at full intelligence.
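One way to sketch that sync step. This is hypothetical — mac-code has no built-in sync as described here, and the host, port, and paths below are placeholders you'd swap for your own:

```python
import socket
import subprocess

OFFICE_HOST = "office.example.invalid"   # placeholder office server
LOCAL_DATA = "~/local-ai/daily-logs/"    # placeholder local data folder
REMOTE_DATA = "user@office.example.invalid:/projects/main-st/field/"

def online(host: str = OFFICE_HOST, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if we can open a TCP connection to the office server."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if online():
    # rsync only transfers files that changed since the last sync
    subprocess.run(["rsync", "-az", LOCAL_DATA, REMOTE_DATA], check=True)
else:
    print("Offline; will retry on next run")
```

Run it from a cron job or a login hook, and the day's logs land in the office systems whenever the laptop regains a connection.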


Getting Started: The 30-Minute Setup

Get the Hardware

You need a Mac with Apple Silicon (M1 or newer) and at least 16 GB of RAM. 32 GB is recommended for the 70B model. If you already have a MacBook Pro from the last 3 years, you're probably good.

Install mac-code

git clone https://github.com/walter-grace/mac-code.git
cd mac-code
./install.sh

This downloads the default 35B model (~20 GB). Takes about 15-30 minutes on a decent connection.
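That 15-30 minute estimate is just download size over bandwidth — here's the arithmetic so you can plug in your own connection speed:

```python
def download_minutes(size_gb: float, mbps: float) -> float:
    """Download time in minutes for size_gb gigabytes at mbps megabits/second."""
    return size_gb * 8_000 / mbps / 60

print(f"{download_minutes(20, 100):.0f} min at 100 Mbps")  # 27 min
print(f"{download_minutes(20, 200):.0f} min at 200 Mbps")  # 13 min
```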

(Optional) Upgrade with TurboQuant

If you have 32 GB RAM and want a smarter model:

pip install turboquant
turboquant compress \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --bits 3 \
  --output ./models/llama-70b-turbo/

Load Your Project Documents

# Copy your specs to a local folder
cp -r ~/Dropbox/MainSt/Specs/ ~/local-ai/specs/
 
# Start mac-code with your documents
mac-code --documents ~/local-ai/specs/

Test It

> What concrete strength is specified for the foundations?
> What are the liquidated damages in the contract?
> Generate a daily report from: 18 workers, clear 45F, poured grade beams

Who Is This For?

Contractor Type                  Local AI Value                                        Recommendation
Solo / 1-3 person operation      High — saves $200-500/mo in API costs                 Start with mac-code for daily reports and spec lookup
Small GC (5-20 employees)        High — offline field access + cost savings            mac-code for field, cloud for estimating
Mid-size GC (20-100 employees)   Medium — field access matters, cost less critical     Hybrid setup (local field + cloud office)
Large GC / ENR Top 400           Lower — they can afford cloud, need max intelligence  Cloud primary, local as backup for remote sites
Government / military projects   Very high — data privacy requirements                 Local AI may be required (data can't leave the device)

Conclusion

Cloud AI is powerful but comes with strings attached: monthly costs, internet dependency, and data leaving your control. For construction field operations — where connectivity is unreliable, budgets are tight, and sensitive project data is at stake — local AI is no longer a compromise. It's a legitimate option.

TurboQuant compressed a 70-billion-parameter model to fit on a laptop. mac-code made it free and easy to use. Together, they put a capable AI agent in every job trailer — no internet required, no monthly bill, no data leaving the device.

The smartest models still live in the cloud. But the "good enough" models now live on your laptop, and for 80% of field tasks, good enough is all you need.