Learning AI Vision

What is Multi-Modal AI?
Most people know AI through chatbots like ChatGPT—you type text, it responds with text. But modern AI has evolved far beyond simple text conversations. Today's Large Language Models (LLMs) are multi-modal, meaning they can reason about multiple types of input and output.
Text is just the beginning. Depending on how they were trained, models can also work with:
- Structured text — Markdown, JSON, CSV, and other formatted text. AI can parse, generate, and transform between these formats effortlessly (more on Markdown in a future post!)
- Code — Coding assistants like GitHub Copilot and Claude Code can read, write, and debug code across dozens of programming languages based on natural language commands
- Vision — Upload an image and the AI can describe it, read text within it, or answer questions about what it "sees"
- Audio — Some models can transcribe speech, understand spoken commands, or even reason about sounds and music
Each of these modalities comes from training the model in a different way. What makes multi-modal AI powerful is the ability to combine context across formats.
These capabilities are usually cloud-hosted and accessible via API, which means developers can build applications that combine several of these modalities at once.
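For example, sending an image alongside a text prompt is typically just another API call. Here's a minimal sketch using the OpenAI Python SDK; the model name and image file are placeholders, and it assumes an API key is already set in the environment.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Encode a local image so it can be sent inline with the text prompt
with open("keyboard.jpg", "rb") as f:  # placeholder image path
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any vision-capable model works
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what you see in this photo."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)

print(response.choices[0].message.content)
```

The same request shape works for mixing modalities: text instructions, structured data, and one or more images can all ride along in a single prompt.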
Vision APIs: Impressive, But Limited
Upload an image to a vision-enabled LLM and it can describe what it sees with surprising accuracy. It can read text in photos, identify objects, describe scenes, and even interpret charts and diagrams.
But how finely can it see? What does it actually know?
I ran into those limits quickly while using vision APIs to build my AI Typing Tutor. The idea was simple: could AI vision look at an image of hands on a keyboard and guess which keys the fingers were touching?
Even with crystal-clear images—good lighting, high resolution, perfect angle—the results were way off. The vision API could tell me "there are hands on a keyboard," but when I asked which specific keys the fingertips were touching? Not even close.
This was my first real lesson in the boundaries of general-purpose AI vision. These models are trained on massive datasets to understand the concept of things, but they lack the precision needed for fine-grained spatial tasks. Basic cloud vision would never work for something so detailed.
Enter Local Inference
This is where things get interesting. Instead of relying on cloud APIs, you can run AI models locally on your own hardware.
So what is inference?
Training a model teaches it to recognize patterns from data. Inference is using that trained model to make predictions from new input. When you send a photo to an AI and it tells you what's in it—that's LLM vision inference.
Think of it like this: training is learning to recognize dogs by looking at thousands of dog photos. Vision inference is looking at a new photo and saying "that's a dog."
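To make that concrete, here's a minimal sketch of pure inference with an off-the-shelf image classifier (assuming torchvision; the photo path is a placeholder). The training already happened on ImageNet, so all we do is hand the model a new image and read off its prediction.

```python
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

# Load a model that was already trained on ImageNet -- the "learning" is done
weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights)
model.eval()

# Inference: show the trained model a new photo and ask what it sees
image = Image.open("new_photo.jpg").convert("RGB")  # placeholder path
batch = weights.transforms()(image).unsqueeze(0)

with torch.no_grad():
    probabilities = model(batch).softmax(dim=1)[0]

top = probabilities.argmax().item()
print(weights.meta["categories"][top], float(probabilities[top]))
```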
Context for Reasoning
The AI hand tracking demonstrated runs in a modern browser, so finding fingertip locations doesn't require the compute of an inference server. That hand tracking data helps the user and adds context, but the tutor analysis and other features require additional modes of AI to supplement it.
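The browser demo uses an in-browser hand-tracking library, but the same idea is easy to show in Python. Here's a rough sketch with MediaPipe's Hands solution (the image path is a placeholder): it returns 21 normalized landmarks per hand, and the fingertip landmarks are the kind of context this layer provides.

```python
import cv2
import mediapipe as mp

# Fingertip landmark indices in MediaPipe's 21-point hand model
FINGERTIPS = [4, 8, 12, 16, 20]  # thumb, index, middle, ring, pinky

image = cv2.imread("hands_on_keyboard.jpg")  # placeholder path

with mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2) as hands:
    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.multi_hand_landmarks:
    h, w, _ = image.shape
    for hand in results.multi_hand_landmarks:
        for idx in FINGERTIPS:
            lm = hand.landmark[idx]
            # Landmarks are normalized 0..1; convert to pixel coordinates
            print(f"fingertip {idx}: ({int(lm.x * w)}, {int(lm.y * h)})")
```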
Hot Dog or Not Hot Dog

If you've seen the show Silicon Valley, you might remember the app that could only identify whether something was a hot dog or not a hot dog. Ridiculous premise, but it perfectly illustrates how inference works for object detection. (Fun fact: HBO actually built the real app using TensorFlow and Keras—trained on 150,000 images.)
A model trained specifically to detect hot dogs can do that one task extremely well. It doesn't need to know about cats, cars, or keyboards—just hot dogs.
A more practical example: security cameras that detect people. The model is trained specifically to recognize human forms, and it runs inference on every frame to determine: person detected, or not?
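Here's a hedged sketch of that pattern with a pretrained detector (assuming the ultralytics package; the video path is a placeholder): run inference on each frame and check whether any "person" boxes come back.

```python
import cv2
from ultralytics import YOLO

# A small pretrained detector; COCO class 0 is "person"
model = YOLO("yolov8n.pt")

cap = cv2.VideoCapture("camera_feed.mp4")  # placeholder; use 0 for a live webcam
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Run inference on this frame, keeping only the "person" class
    results = model(frame, classes=[0], conf=0.5, verbose=False)
    print("person detected" if len(results[0].boxes) > 0 else "no person")
cap.release()
```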
Specialized Training
Before these projects, I had experience tweaking what LLMs had access to through tools, dynamic context, or Retrieval-Augmented Generation (RAG), but I had never trained a model.
Training a model on labeled image datasets was entirely new ground. Using YOLO (You Only Look Once) to build custom object detection meant labeling images, running training cycles, and watching the model learn to see specific things.
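With ultralytics, the training call itself is only a few lines; the real work is in the labeled dataset. This is a sketch under assumed names: the dataset YAML, image path, and output path are placeholders.

```python
from ultralytics import YOLO

# Start from pretrained weights and fine-tune on a custom, labeled dataset.
# "keyboard_hands.yaml" is a placeholder config pointing at the labeled
# train/val images and naming the classes the model should learn.
model = YOLO("yolov8n.pt")
model.train(data="keyboard_hands.yaml", epochs=100, imgsz=640)

# After training, run local inference with the specialized model
# (default output location; the run folder name can vary between runs)
best = YOLO("runs/detect/train/weights/best.pt")
results = best("hands_on_keyboard.jpg")
results[0].show()
```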
This is the key insight: specialized models running local inference can outperform general-purpose cloud APIs for specific tasks. They're faster (no network latency), more private (data stays on your machine), and can be optimized for exactly what you need.
Looking Ahead
In upcoming posts, I'll dive deeper into the local inference setup—the hardware, the software stack, and how to get models running efficiently on consumer GPUs. I'll cover incorporating AI vision into mobile apps, but I won't be focusing on SEE FOOD.
I'll also post more about projects I'm working on that use trained models and edge inference. This experiment has given me an appreciation for frontier models, but it has also changed the way I approach model decisions and think about AI economics and privacy.