I Gave AI a Smartphone. It Did Everything a QA Engineer Does.
We gave an AI agent the ability to see and touch a smartphone screen via USB, then had it analyze three competitor apps. What would take a human weeks was done in a day.
What we did
- Treated apps as graphs. Screens = nodes, buttons = edges. DFS to visit every screen without exception (a traversal sketch follows this list)
- 291 screenshots across 3 apps. Every dropdown, toggle, scroll-to-bottom. Checklist proved zero gaps
- Executed 5 real transactions. Swaps, bridges, futures — captured confirmation/success/error screens and actual fee structures
- Auto-generated a 37-axis comparison + HTML report. Every claim linked to screenshot evidence (see the report sketch below)
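Here is a minimal sketch of that traversal, assuming a hypothetical `Driver` wrapper around the USB connection (fingerprint the current screen, screenshot it, list tappable elements, tap one, go back). The names are illustrative, not the open-sourced tooling itself.

```python
from typing import Protocol

class Driver(Protocol):
    """Hypothetical device interface; in practice this would wrap adb/automation calls."""
    def fingerprint(self) -> str: ...            # stable ID for the current screen
    def screenshot(self, name: str) -> None: ...
    def tappable_elements(self) -> list[str]: ...  # the "edges" leaving this screen
    def tap(self, element: str) -> None: ...
    def back(self) -> None: ...

def explore(driver: Driver, visited: set[str], depth: int = 0, max_depth: int = 8) -> None:
    """Depth-first traversal: each screen is a node, each tappable element an edge."""
    screen = driver.fingerprint()
    if screen in visited or depth > max_depth:
        return
    visited.add(screen)
    driver.screenshot(f"{len(visited):03d}_{screen}.png")  # one piece of evidence per screen
    for element in driver.tappable_elements():
        driver.tap(element)                    # follow the edge to the next screen
        explore(driver, visited, depth + 1, max_depth)
        driver.back()                          # backtrack to the current node
```

A real run also needs scroll handling, dropdown/toggle expansion, and a cap on cyclic edges; those are omitted here for brevity.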
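And a similarly hedged sketch of how the comparison report could be rendered, with each cell linking back to the screenshot that backs the claim. The `render_report` function, the axis names, and the file layout are assumptions for illustration, not the actual report generator.

```python
from html import escape

def render_report(axes: dict[str, dict[str, tuple[str, str]]]) -> str:
    """axes maps comparison axis -> app -> (observed value, screenshot path backing the claim)."""
    apps = sorted({app for per_app in axes.values() for app in per_app})
    rows = ["<tr><th>Axis</th>" + "".join(f"<th>{escape(app)}</th>" for app in apps) + "</tr>"]
    for axis, per_app in axes.items():
        cells = []
        for app in apps:
            value, shot = per_app.get(app, ("n/a", ""))
            link = f' (<a href="{escape(shot)}">evidence</a>)' if shot else ""
            cells.append(f"<td>{escape(value)}{link}</td>")
        rows.append(f"<tr><td>{escape(axis)}</td>{''.join(cells)}</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"

# Hypothetical usage: one axis, one app, one evidence link.
print(render_report({"Swap fee": {"App A": ("0.4%", "shots/017_swap_confirm.png")}}))
```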
What we found
- AI caught what QA passed. A 0.4% fee was within spec, so QA approved it. The AI flagged it as high for the industry and compared it against 5 competitors
- Humans cover about 30%. You think you checked everything, but against the AI's checklist-based exhaustive exploration the coverage gap is overwhelming
- It makes domain judgments. It detected anomalies not defined in any spec. A model trained on thousands of app patterns knows "this number is unusual" before any human does
Why this matters
- Observe, judge, report: there are roles where those three things are the whole job. QA engineers, researchers, analysts, consultants, auditors. The AI did all three
- The cost structure changes. Weeks of human work become a day of AI time. And the AI holds all 291 screenshots in memory while comparing consistently across 37 axes; humans can't
We open-sourced the methodology. Give this document to an LLM and it starts exploring on its own.