
We Made AI Control a Phone 10x Faster

"Bridge $5 on Rabby."

The AI grabbed the phone, opened Bridge, entered the amount, waited for the quote, hit confirm, cancelled biometric auth, typed the password, and hit final confirm. Two MCP calls. Done.

A week ago the same task took over 20 calls. And React Native apps didn't work at all.

What changed

The biggest change was merging 6 separate tools into one called mobile_do. Screen reading, tapping, typing, swiping, and screenshots are all handled by a single tool. Call it without actions and it tells you what's on screen. Pass actions and it executes them, then returns the updated screen.

```text
mobile_do(actions: ["tap TEXT:Bridge", "wait 2000", "type 5", "tap BUTTON:Confirm"])
```

```text
> tapped TEXT:Bridge
> typed "5"
> tapped BUTTON:Confirm
---
TEXT:Authentication required
BUTTON:Cancel
```

Four actions in one call, with the final screen state returned immediately.
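The dual read/act behavior can be sketched as a single entry point. Everything below is illustrative — the function names and the screen model are assumptions, not the actual implementation, which runs against a live device:

```python
# Sketch of a combined read/act tool. read_screen() and execute() are
# stand-ins for the real device bridge; names are hypothetical.

def read_screen():
    # Stand-in for dumping the device's current accessibility tree.
    return ["TEXT:Bridge", "BUTTON:Confirm"]

def execute(action, log):
    # Stand-in for dispatching one action (tap / type / swipe / wait).
    log.append(f"did {action}")

def mobile_do(actions=None):
    """Called without actions: describe what's on screen.
    Called with actions: run them in order, then return the
    updated screen state along with an execution log."""
    log = []
    for action in actions or []:
        execute(action, log)
    return {"log": log, "screen": read_screen()}

mobile_do(["tap TEXT:Bridge", "type 5"])
# → {'log': ['did tap TEXT:Bridge', 'did type 5'],
#    'screen': ['TEXT:Bridge', 'BUTTON:Confirm']}
```

The key design point is that every call, with or without actions, ends by returning the fresh screen — so the model never needs a separate "read" round-trip.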

We also got rid of coordinates. Previously you had to pass pixel coordinates like tap 541 2316, but keyboards and modals would shift everything by hundreds of pixels, causing taps to land in the wrong place. Now it's tap BUTTON:Confirm — targeting elements by name. Each tap reads the screen fresh and resolves the element's current position, so layout changes don't matter.

When the same label appears multiple times, @N disambiguates. Two "Confirm" buttons on a password sheet? The outer one is BUTTON:Confirm@1, the password confirm is BUTTON:Confirm@2. These numbers stay consistent as long as the screen structure is the same.
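One way to resolve a `ROLE:Label@N` selector against a freshly-read element list — a sketch only; the parsing grammar and element schema here are assumptions, not the project's actual code:

```python
import re

def resolve(selector, elements):
    """Resolve e.g. 'BUTTON:Confirm@2' to the 2nd matching element's
    current bounds. Run against a fresh screen read before every tap,
    so keyboards and modals shifting the layout can't stale the target."""
    m = re.fullmatch(r"(\w+):(.+?)(?:@(\d+))?", selector)
    role, label, idx = m.group(1), m.group(2), int(m.group(3) or 1)
    matches = [e for e in elements if e["role"] == role and e["label"] == label]
    return matches[idx - 1]["bounds"]  # @N is 1-based; no @N means the first match

screen = [
    {"role": "BUTTON", "label": "Confirm", "bounds": (541, 1200)},
    {"role": "BUTTON", "label": "Confirm", "bounds": (541, 2316)},
]
resolve("BUTTON:Confirm@2", screen)  # → (541, 2316)
```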

React Native apps work now

D'CENT wallet is built with React Native. The standard approach (uiautomator dump) hangs forever on this app — carousel animations fire screen-change events every 100ms, and the "is the screen stable?" check never passes.

We built a tiny 7KB program (DEX) that runs directly on the device and skips the stability check entirely. The standard approach is tried first, and if it fails, the system switches automatically. The program ships bundled in the package, installs itself on first run, and requires zero configuration.
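The fallback amounts to: attempt the standard dump with a timeout, and switch paths when it stalls. A minimal sketch with injected callables standing in for the two dump mechanisms (the function names are illustrative):

```python
def dump_screen(uiautomator_dump, dex_dump, timeout_s=3.0):
    """Try the standard uiautomator dump first. If it hangs past the
    timeout (e.g. a React Native carousel firing screen-change events
    every ~100ms, so the stability check never passes), fall back to
    the on-device DEX helper, which skips the stability check."""
    try:
        return uiautomator_dump(timeout=timeout_s)
    except TimeoutError:
        return dex_dump()

# Simulated: the standard path never stabilizes, the DEX path answers.
def hanging_dump(timeout):
    raise TimeoutError("screen never reported stable")

dump_screen(hanging_dump, lambda: "<hierarchy .../>")  # falls back to the DEX helper
```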

What we actually did with it

Rabby Bridge — $5 bridged in 2 MCP calls:

The first call handled everything from the home screen to the Bridge confirmation — navigating, entering the amount, waiting for a quote, and confirming (6 actions). The second call cancelled biometric auth, entered the password, and hit final confirm (6 actions). $5 moved from Polygon to Base.

D'CENT Swap — 1 MCP call, 13 actions:

Navigate to Swap → select token → enter amount → wait for quote → tap Swap → pass price impact warning → acknowledge terms → confirm → enter 6-digit PIN. 1 USDC exchanged for 10.29 POL. A week ago we couldn't even open this app.

DFS exploration — 1 MCP call, 4 screens + 4 screenshots:

```json
["tap TEXT:Security", "wait 1000", "screenshot /path/021.png", "press BACK",
 "wait 500", "tap TEXT:Notifications", "wait 1000", "screenshot /path/022.png", "press BACK"]
```

Visit each screen, capture a screenshot, and backtrack — all in a single call. Building a complete screenshot catalog of an app became several times faster.
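Batches like the one above can be generated mechanically from a list of menu entries. A sketch — the helper name, paths, and wait durations are illustrative:

```python
def dfs_batch(entries, shot_dir="/path", start=21, settle_ms=1000, back_ms=500):
    """Build one action batch that visits each entry from the same parent
    screen, waits for it to settle, screenshots it, and backtracks."""
    actions = []
    for i, entry in enumerate(entries):
        actions += [
            f"tap TEXT:{entry}",
            f"wait {settle_ms}",
            f"screenshot {shot_dir}/{start + i:03d}.png",
            "press BACK",
        ]
        if i < len(entries) - 1:
            actions.append(f"wait {back_ms}")  # let the parent screen return
    return actions

dfs_batch(["Security", "Notifications"])
# → ["tap TEXT:Security", "wait 1000", "screenshot /path/021.png", "press BACK",
#    "wait 500", "tap TEXT:Notifications", "wait 1000",
#    "screenshot /path/022.png", "press BACK"]
```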

| Scenario | Before | After |
|---|---|---|
| Single tap | 4 calls | 1 call |
| 4-screen DFS | 24+ calls | 1 call |
| Rabby Bridge E2E | ~20 calls | 2 calls |
| D'CENT Swap E2E | impossible | 1 call |

Transition tables

After exploring an app once, you end up with a map of screen + action → next screen. We call it TRANSITIONS.md. With this map, you can pre-plan an entire flow and execute it in one batch — no need to check the screen between steps.

```text
Bridge → tap BUTTON:Confirm → BiometricPrompt
BiometricPrompt → tap BUTTON:Cancel → PasswordSheet
PasswordSheet → type {password} + tap BUTTON:Confirm@2 → Done
```

Since mobile_do returns the new screen after every action, the transition map builds itself naturally during exploration.
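As a sketch, such a map is just a dict of (screen, action) → next screen, and a planned batch can be checked against it before dispatching anything to the device. Names and structure here are assumptions for illustration:

```python
# Transition map learned from a previous exploration (illustrative subset).
TRANSITIONS = {
    ("Bridge", "tap BUTTON:Confirm"): "BiometricPrompt",
    ("BiometricPrompt", "tap BUTTON:Cancel"): "PasswordSheet",
    ("PasswordSheet", "type {password} + tap BUTTON:Confirm@2"): "Done",
}

def simulate(start, actions, transitions=TRANSITIONS):
    """Walk the map step by step; raise on any unmapped transition,
    otherwise return the final screen. Lets a whole flow be validated
    offline and then sent as a single batch."""
    screen = start
    for action in actions:
        key = (screen, action)
        if key not in transitions:
            raise KeyError(f"unknown transition: {key}")
        screen = transitions[key]
    return screen

simulate("Bridge", [
    "tap BUTTON:Confirm",
    "tap BUTTON:Cancel",
    "type {password} + tap BUTTON:Confirm@2",
])  # → "Done"
```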

Limitations

Text matching struggles with elements that change in real-time, like prices that update every second. Each tap re-reads the screen to find the element, so single taps aren't faster than the old coordinate approach — just far more reliable. Since we read the accessibility tree rather than pixels, unlabeled icons and purely visual states (greyed-out buttons, color indicators) are invisible.

Code

© 2025 Forrest Kim.

contact: humblefirm@gmail.com