01  ·  2026  ·  Computer Vision  ·  Python, YOLOv8, Custom IoU Tracker

La Mano Vision

Computer vision that counts how many tortilla bags leave the prep table — without anyone having to watch.

mAP50
91.7%
Classes
2
Training frames
268
Inference speed
2–5 ms
Baseline accuracy
15/15
Inference hardware
RTX 5060 Ti

La Mano is a family-owned tortillería in Los Angeles. Every day, workers at the prep table wrap tortilla stacks into plastic bags, then carry them to inventory. The owner — my father — has always tracked daily production by memory and rough observation. That means no reliable count, no trend data, and no way to know whether a busy day actually produced more bags or just felt busier.

The ask was simple: know how many bags leave the table each session. The constraints were equally simple: one existing Hikvision camera, no additional hardware budget, and a system that runs without a developer present to operate or debug it.

Standard off-the-shelf counting solutions require either dedicated hardware or cloud inference with ongoing per-frame costs. Neither fits a small family business. The system needed to be on-prem, fast enough to run on a consumer GPU, and honest about its limitations: if it gets the count wrong, the owner will notice, because he is in that room every day.

This shaped every technical decision: model size (nano, not large), inference method (sampled frames, not every frame), and tracking approach (custom logic I can debug and explain, not a black-box library).

I extracted training frames from NVR-exported footage using a custom script, then labeled 268 frames in Roboflow across two classes:

tortilla_stack — unbagged stack on the prep table
bagged_stack  — stack wrapped in plastic, ready to sell

# Class IDs assigned alphabetically by Roboflow on export.
# bagged_stack = 0, tortilla_stack = 1.
# Always read from model.names at runtime — never hardcode.
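A minimal sketch of the "read from model.names" advice. It assumes the mapping has the {id: name} shape that Ultralytics exposes as model.names; the helper name class_ids_by_name and the literal dict standing in for a loaded model are illustrative, not taken from the project.

```python
def class_ids_by_name(names):
    """Invert an {id: name} mapping (the shape of Ultralytics' model.names)."""
    return {name: class_id for class_id, name in names.items()}


# With a loaded model this would be `names = model.names`.
# A literal dict stands in here so the sketch runs without weights.
names = {0: "bagged_stack", 1: "tortilla_stack"}

ids = class_ids_by_name(names)
BAGGED_ID = ids["bagged_stack"]      # looked up by name, never hardcoded
UNBAGGED_ID = ids["tortilla_stack"]
```

If a later Roboflow export adds or renames a class, the lookup by name fails loudly with a KeyError instead of silently swapping the two classes.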

I fine-tuned YOLOv8n on these 268 frames in Google Colab for 150 epochs (T4 GPU, imgsz=640, batch=16), then built a custom tracking pipeline that uses IoU-based greedy matching across sampled frames, one frame every 0.25 seconds. A region-of-interest (ROI) polygon constrains detections to the prep table surface, so objects outside the table area do not generate false events.
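IoU-based greedy matching can be sketched as follows. This is a generic illustration of the technique, not the project's tracker: the box format (x1, y1, x2, y2), the function names, and the min_iou threshold of 0.3 are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, detections, min_iou=0.3):
    """Pair existing tracks with new detections, best IoU first.

    tracks: {track_id: box}; detections: list of boxes.
    Returns (matches, unmatched_track_ids, unmatched_detection_indexes),
    where matches maps track_id -> detection index.
    """
    pairs = sorted(
        ((iou(tbox, dbox), tid, di)
         for tid, tbox in tracks.items()
         for di, dbox in enumerate(detections)),
        key=lambda p: p[0],
        reverse=True,
    )
    matches, used_tracks, used_dets = {}, set(), set()
    for score, tid, di in pairs:
        if score < min_iou:
            break  # every remaining pair overlaps even less
        if tid in used_tracks or di in used_dets:
            continue  # each track and detection is used at most once
        matches[tid] = di
        used_tracks.add(tid)
        used_dets.add(di)
    unmatched_tracks = [t for t in tracks if t not in used_tracks]
    unmatched_dets = [i for i in range(len(detections)) if i not in used_dets]
    return matches, unmatched_tracks, unmatched_dets
```

Unmatched detections would start new tracks and unmatched tracks would age out. Because matching depends only on overlap between consecutive sampled frames, every step can be inspected by hand.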

Screenshot: ROI polygon overlay on prep table (placeholder; to be replaced with outputs/tracking/<clip>/frame with ROI polygon drawn in green)
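A sketch of the ROI test, assuming a detection is kept when the centroid of its box falls inside the polygon. The ray-casting routine is a standard point-in-polygon check; the function names and the example polygon coordinates are hypothetical.

```python
def centroid(box):
    """Center point of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def point_in_polygon(point, polygon):
    """Ray casting: count polygon edges crossed by a ray going right."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's y level
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def filter_to_roi(boxes, polygon):
    """Keep only boxes whose centroid lies inside the ROI polygon."""
    return [b for b in boxes if point_in_polygon(centroid(b), polygon)]


# Hypothetical ROI in pixel coordinates (a simple rectangle here).
ROI = [(100, 100), (500, 100), (500, 400), (100, 400)]
kept = filter_to_roi([(200, 200, 300, 300), (0, 0, 50, 50)], ROI)
```

In this example only the first box survives, because the second box's centroid lies outside the polygon.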

A bag is counted when it satisfies both conditions: the track received a PACKAGED event (the model reclassified it from tortilla_stack to bagged_stack), and its final classification at exit is still bagged_stack. Bags that sit on the table for 10+ seconds without a PACKAGED event also count. This handles pre-existing bags that were already wrapped when the session began.
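The counting rule can be written as a small predicate. This is a sketch under assumptions: the TrackSummary fields are hypothetical names, and the dwell rule is assumed to apply only to tracks whose final class is bagged_stack, which is how "bags that sit on the table" is read here.

```python
from dataclasses import dataclass

DWELL_SECONDS = 10.0  # the "10+ seconds" threshold from the text


@dataclass
class TrackSummary:
    had_packaged_event: bool   # a PACKAGED event fired for this track
    final_class: str           # class name at the moment the track exited
    seconds_on_table: float    # time between first and last sighting


def counts_as_bag(track):
    """Apply the two counting rules to a finished track."""
    if track.final_class != "bagged_stack":
        return False  # ended unbagged: never counted
    if track.had_packaged_event:
        return True   # rule 1: packaged and still bagged at exit
    # Rule 2: pre-existing bag that stayed long enough to trust.
    return track.seconds_on_table >= DWELL_SECONDS
```

Requiring the final class to be bagged_stack means a brief false PACKAGED event on an unbagged stack is not counted if the track ends as tortilla_stack.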

Screenshot: bounding box detections on frame (placeholder; to be replaced with detection output showing bagged_stack and tortilla_stack labels with confidence scores)

Screenshot: events.csv output (placeholder; to be replaced with events.csv showing APPEARED, PACKAGED, EXITED events with timestamps)
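A sketch of how an event log with APPEARED, PACKAGED, and EXITED rows could be written using Python's csv module. The column names (timestamp_s, track_id, event) and the sample rows are assumptions for illustration; the real events.csv layout is not shown in this write-up.

```python
import csv
import io

EVENT_TYPES = ("APPEARED", "PACKAGED", "EXITED")
FIELDS = ("timestamp_s", "track_id", "event")  # hypothetical column names


def write_events(rows, fileobj):
    """Write event rows as CSV, rejecting unknown event types."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        if row["event"] not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {row['event']}")
        writer.writerow(row)


# Illustrative rows only, not real output from the system.
buffer = io.StringIO()
write_events(
    [
        {"timestamp_s": 0.25, "track_id": 1, "event": "APPEARED"},
        {"timestamp_s": 6.5, "track_id": 1, "event": "PACKAGED"},
        {"timestamp_s": 9.0, "track_id": 1, "event": "EXITED"},
    ],
    buffer,
)
csv_text = buffer.getvalue()
```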

Several approaches were tried and reverted. Each failure showed me where the real constraint was.

Rejected

ByteTrack (model.track())

Loses track IDs when frames are skipped. With 0.25s sampling intervals, ByteTrack generated 84+ track IDs for ~15 physical stacks. The ID discontinuity between sampled frames is fundamental — not a tuning problem.

Reverted

Class lock after PACKAGED event

Once a PACKAGED event fired, lock the track's class_id permanently. This caused overcounting: brief false PACKAGED events on unbagged stacks permanently marked them as bags.

Reverted

Immediate removal on ROI exit

Remove tracks the moment their centroid leaves the ROI polygon. Caused double counting: if a worker shifts a bag to the edge, the track is removed and counted — then when the bag comes back into frame, a new track is created and counted again.

Reverted

Count on PACKAGED event alone (regardless of final class_id)

Recovered some missed bags where class_id reverted post-packaging, but also counted pre-existing bags that oscillated between classes without real packaging. Net effect: the count went from 17 to 20 on the April 18 clip, against a manually verified 15.

The honest conclusion: the remaining accuracy gap (~2 bags per session) is a model consistency problem, not a tracking logic problem. More diverse training data — specifically footage from high-volume days where larger stacks and faster packaging are common — is the right fix. Tuning the tracker further risks overfitting to one day’s footage.

The system runs on a dedicated Windows gaming PC (RTX 5060 Ti, 16 GB VRAM) permanently installed at the store. Inference takes 2–5 ms per sampled frame, fast enough to run in real time on the live RTSP stream. On the April 18 baseline clip (15 bags, manually verified), the system's count lands within ±2 bags of the manual count.
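The real-time claim follows from simple arithmetic: sampling one frame every 0.25 seconds means four inferences per second of video. A sketch of that budget, where the camera frame rate of 20 fps is an assumed example value and both function names are hypothetical:

```python
def sample_stride(fps, interval_s=0.25):
    """How many source frames to advance between sampled frames."""
    return max(1, round(fps * interval_s))


def gpu_busy_fraction(inference_ms, interval_s=0.25):
    """Share of wall-clock time spent on inference at one sample per interval."""
    return (inference_ms / 1000.0) / interval_s


stride = sample_stride(fps=20)          # assumed 20 fps stream
worst_case = gpu_busy_fraction(5.0)     # slowest reported inference time
```

At the slow end of the reported range (5 ms), inference occupies about 2% of each 250 ms sampling interval, which leaves ample headroom for decoding and tracking.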

Validation mAP50
91.7%
bagged_stack mAP
88.4%
tortilla_stack mAP
95.1%
Apr 18 baseline
15/15 bags
Typical count error
±2 bags
Root cause
Track splitting in bursts

The system is honest about what it doesn’t know. During busy packaging bursts, bounding box instability causes the tracker to occasionally split one physical bag into two tracks — both get counted. This is documented in the roadmap and will be addressed by retraining on April 19 footage (larger stacks, faster packaging), not by patching the tracker.

01

Retrain on April 19 footage

Label ~100 frames from a high-volume day with larger stacks and faster packaging. Expected to close the remaining accuracy gap.

02

Live RTSP integration

Wrap the script in a session manager with auto-reconnect on camera disconnect, start/stop controls, and daily summary JSON output.

03

Dashboard integration

Connect events.csv output to La Mano Dashboard so daily counts update automatically. Already partially built — see project 02.
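As a rough idea of what the daily summary JSON from roadmap item 02 might contain, here is a sketch that rolls event rows into one record per day. The field names (date, bags_counted, events), the function name, and the sample values are hypothetical; the actual format is still to be designed.

```python
import json
from collections import Counter


def daily_summary(date, events, bags_counted):
    """Build a JSON string with the day's bag count and event totals."""
    by_type = Counter(e["event"] for e in events)
    record = {
        "date": date,
        "bags_counted": bags_counted,
        "events": dict(sorted(by_type.items())),
    }
    return json.dumps(record, indent=2)


# Illustrative input only, not real data from the store.
summary = daily_summary(
    "2026-04-18",
    [
        {"event": "APPEARED"},
        {"event": "APPEARED"},
        {"event": "PACKAGED"},
        {"event": "EXITED"},
    ],
    bags_counted=1,
)
```

A flat record like this would be easy for a dashboard to poll or append to a daily history.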