Webcam to AI (LLaVA)

Bridging Real-Time Visual Input with AI-Driven Descriptions

The Idea

This project aimed to bridge the gap between real-time visual input and AI-driven language descriptions. By capturing images from a user’s webcam and processing them through a locally hosted LLaVA model (via Ollama), it generated on-the-fly scene descriptions. The ultimate goal was to connect the user’s immediate environment with meaningful textual narratives, potentially aiding memory formation through the pairing of visual data and descriptive language.

Development

The pipeline was built in Python, using OpenCV to capture webcam frames at fixed intervals. Each frame was encoded and sent to the LLaVA model through the Ollama API, and the textual annotations returned by the model were saved, forming a corpus of scene-based descriptions. While the hardware requirements were minimal, coordinating Python's request handling with the Ollama server and keeping LLaVA inference stable required careful orchestration.
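
A minimal sketch of this capture-and-describe loop is given below, assuming a local Ollama server on its default port (11434) with the llava model already pulled; the capture interval, prompt, and log filename are illustrative placeholders rather than the project's exact values.

  import base64
  import time
  from datetime import datetime

  import cv2
  import requests

  OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint (assumed)
  CAPTURE_INTERVAL_S = 10  # hypothetical fixed capture interval

  def describe_frame(frame) -> str:
      """Encode a frame as JPEG and ask the local LLaVA model for a description."""
      ok, jpeg = cv2.imencode(".jpg", frame)
      if not ok:
          raise RuntimeError("JPEG encoding failed")
      payload = {
          "model": "llava",
          "prompt": "Describe this scene in one or two sentences.",
          "images": [base64.b64encode(jpeg.tobytes()).decode("utf-8")],
          "stream": False,  # return a single JSON object per request
      }
      response = requests.post(OLLAMA_URL, json=payload, timeout=120)
      response.raise_for_status()
      return response.json()["response"].strip()

  def main() -> None:
      cap = cv2.VideoCapture(0)  # default webcam
      if not cap.isOpened():
          raise RuntimeError("Could not open webcam")
      try:
          # Append timestamped descriptions to a simple log, building the corpus
          with open("descriptions.log", "a", encoding="utf-8") as log:
              while True:
                  ok, frame = cap.read()
                  if not ok:
                      continue
                  description = describe_frame(frame)
                  log.write(f"{datetime.now().isoformat()}\t{description}\n")
                  log.flush()
                  time.sleep(CAPTURE_INTERVAL_S)
      finally:
          cap.release()

  if __name__ == "__main__":
      main()

Setting stream to False makes each request return one complete JSON object, which keeps the logging loop simple at the cost of blocking until the full description arrives before the next capture.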

Reflection

The system successfully enriched the raw camera feed with semantic detail, producing meaningful textual descriptors in real time. However, it remained a static pipeline: users could collect images and read descriptions, but these insights were not integrated into a broader spatial or contextual framework. Nonetheless, the proof-of-concept demonstrated that on-demand, local AI scene analysis was achievable, setting the stage for more immersive, context-aware applications.

What Worked

  • Seamless integration between webcam input and model inference
  • Reliable text output that enriched raw imagery with semantic details
  • Organized data logging for historical review and analysis

What Did Not Work

  • Lack of interactive features for user-driven refinement of descriptions
  • Potential performance bottlenecks with rapid image capture frequency
