# Oya Browser

> The browser built for AI agents. A real browser that agents control via MCP.

## Overview

Oya Browser is a desktop browser (Electron) that connects to the Oya Browser server via WebSocket. AI tools (Claude Desktop, Cursor, Windsurf, any MCP client) connect to the server via MCP to control the browser.

Users run the browser on their machines with their real cookies, logins, and sessions. The server is hosted at browser.oya.ai. Each user creates an API key from the dashboard and uses it to connect their browser.

Architecture:
- User's machine: Oya Browser app → WebSocket → Oya Server
- AI tool: MCP client → Oya Server → WebSocket → Browser

## Quickstart

1. Go to https://browser.oya.ai/dashboard and click Generate to create an API key
2. Download Oya Browser (macOS .dmg or Linux .AppImage)
3. macOS: after installing, run `xattr -cr /Applications/Oya\ Browser.app` (the app is not notarized yet). To open multiple instances, run `open -n "/Applications/Oya Browser.app" --args --user-data-dir=/tmp/oya-2`
4. Open the app, enter `wss://browser.oya.ai/ws` as server URL and paste your API key
5. Your browser appears in the dashboard. Connect AI tools to it via MCP.

## MCP Endpoint

Each connected browser gets an MCP endpoint:

    https://browser.oya.ai/mcp/{BROWSER_ID}

MCP config for Cursor / Claude Desktop:

    {
      "mcpServers": {
        "oya-browser": {
          "url": "https://browser.oya.ai/mcp/BROWSER_ID",
          "transport": "streamable-http",
          "headers": {
            "Authorization": "Bearer YOUR_API_KEY"
          }
        }
      }
    }
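
For clients without built-in MCP support, the config above amounts to an HTTP POST carrying a bearer token. The sketch below builds (but does not send) a JSON-RPC 2.0 `tools/call` request, which is the standard MCP method for invoking a tool. The helper name `build_tool_call` and the placeholder IDs are illustrative assumptions; a real client would normally use an MCP SDK, which also performs the `initialize` handshake before any tool call.

```python
import json
import urllib.request
from typing import Optional

MCP_BASE = "https://browser.oya.ai/mcp"  # endpoint prefix from this section


def build_tool_call(browser_id: str, api_key: str, tool: str,
                    arguments: Optional[dict] = None,
                    request_id: int = 1) -> urllib.request.Request:
    """Build, without sending, a JSON-RPC 2.0 `tools/call` request (hypothetical helper)."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments or {}},
    }
    return urllib.request.Request(
        url=f"{MCP_BASE}/{browser_id}",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            # Streamable HTTP servers may answer with plain JSON or an SSE stream.
            "Accept": "application/json, text/event-stream",
        },
    )


# Example: a navigate call for a placeholder browser ID and key.
request = build_tool_call("BROWSER_ID", "YOUR_API_KEY", "navigate",
                          {"url": "https://example.com"})
```

Passing `request` to `urllib.request.urlopen` would send it; that step is left out here because the response framing depends on the server.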

## MCP Tools

### analyze_page
Analyzes the current page and returns it as structured markdown, with every interactive element numbered as [#id type "label"]. The output includes viewport size, scroll position, and visibility flags.

No parameters.

IMPORTANT: Always call analyze_page BEFORE click or type. Element IDs only exist after analysis and reset on every call. After navigating or clicking a link that changes the page, call analyze_page again.

Returns:
- Page metadata (url, title, viewport, scroll position)
- Full page content as markdown with inline element annotations
- Element index grouped by visible/off-screen

### navigate
Navigate the browser to a URL.
Parameters: url (string, required)

### click
Click an interactive element by its ID number from analyze_page.
Parameters: element_id (number, required)

### type
Type text into an input element. Clears existing content first, types character by character.
Parameters: element_id (number, required), text (string, required)

### press_key
Press a keyboard key.
Parameters: key (string, required) — "Enter", "Escape", "Tab", "Backspace", "ArrowDown", "ArrowUp", or any character

### screenshot
Capture the visible tab as a base64 PNG image.
No parameters.

### scroll
Scroll the page up or down.
Parameters: direction ("up" or "down", required), amount (number, optional, default 500)

### list_tabs
List all open tabs with ID, title, URL, and which is active.
No parameters.

### open_tab
Open a new browser tab.
Parameters: url (string, optional)

### switch_tab
Switch to a different tab.
Parameters: tab_id (number, required)

### close_tab
Close a tab. Closes active tab if no tab_id specified.
Parameters: tab_id (number, optional)

### wait
Wait for an element matching a CSS selector to appear.
Parameters: selector (string, required), timeout (number, optional, default 10000ms)

### read_elements
List interactive elements on the page. Lighter than analyze_page.
Parameters: selector (string, optional), limit (number, optional, default 50)

## Workflow Pattern

1. analyze_page → understand the page, get element IDs
2. Act: click(element_id), type(element_id, text), press_key("Enter"), scroll("down")
3. If page changed → analyze_page again (old IDs are invalid)
4. Repeat until task is done
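
The steps above can be sketched as a small driver function. `call_tool` is a hypothetical callable standing in for whatever your MCP client exposes (tool name and arguments in, result out), and `pick_input` stands in for the agent's decision about which element to use; neither is part of Oya Browser.

```python
from typing import Any, Callable

# Hypothetical signature: (tool_name, arguments) -> tool result
ToolCaller = Callable[[str, dict], Any]


def search_from_page(call_tool: ToolCaller,
                     pick_input: Callable[[Any], int],
                     query: str) -> Any:
    """Analyze, type into a chosen input, press Enter, then re-analyze."""
    page = call_tool("analyze_page", {})        # step 1: fresh element IDs
    element_id = pick_input(page)               # choose an ID from this analysis only
    call_tool("type", {"element_id": element_id, "text": query})  # step 2: act
    call_tool("press_key", {"key": "Enter"})
    # Step 3: the page has probably changed, so the old IDs are invalid.
    return call_tool("analyze_page", {})
```

The second `analyze_page` result is what the next iteration of the loop should work from.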

## Element Annotation Format

analyze_page returns elements inline in the markdown:

    [#5 button "Submit"]
    [#9 input:email placeholder="you@example.com" required]
    [#12 link "Settings" → /settings]
    [#8 ☑ "Remember me"]

Element IDs are real attributes (data-ac-id) on the DOM — click(5) resolves via querySelector('[data-ac-id="5"]').
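
A client that wants to work with these annotations programmatically could extract them with a regular expression. The sketch below is inferred only from the four example lines above, so the grammar it assumes (ID, then a type token, then an optional quoted label) may not cover every annotation the browser emits. The names `Element` and `parse_elements` are illustrative.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed shape: [#<id> <type> ...rest...]
ANNOTATION = re.compile(r'\[#(\d+)\s+([^\s\]]+)([^\]]*)\]')
ATTRIBUTE = re.compile(r'[\w-]+="[^"]*"')   # e.g. placeholder="you@example.com"
LABEL = re.compile(r'"([^"]*)"')


@dataclass
class Element:
    id: int
    kind: str
    label: Optional[str]

    @property
    def selector(self) -> str:
        """CSS selector matching the data-ac-id attribute described above."""
        return f'[data-ac-id="{self.id}"]'


def parse_elements(markdown: str) -> List[Element]:
    """Collect every [#id type ...] annotation found in analyze_page output."""
    elements = []
    for match in ANNOTATION.finditer(markdown):
        # Drop key="value" attributes first so they are not mistaken for the label.
        rest = ATTRIBUTE.sub("", match.group(3))
        label = LABEL.search(rest)
        elements.append(Element(int(match.group(1)), match.group(2),
                                label.group(1) if label else None))
    return elements
```

The `selector` property is useful with the REST command API, which takes a selector rather than an element ID.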

## Browser Pool (scale mode)

Run hundreds or thousands of browsers as a single pool with round-robin dispatch and automatic cookie sync.

### Setup

Set the FLEET_TOKEN environment variable on the server. All browsers that use that token as their API key auto-join the pool.

### Pool MCP Endpoint

    POST /mcp/pool
    Authorization: Bearer FLEET_TOKEN

Same tools as per-browser MCP. navigate and analyze_page advance the round-robin. click/type/screenshot stay pinned to the last-used browser so element IDs remain valid.
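
To make the pinning behavior concrete, here is a toy model of the dispatch rule. It is one reading of the description above, not the server's actual code: it assumes every `navigate` or `analyze_page` call moves to the next browser, and that every other tool (not only click, type, and screenshot) stays on the last-used one. `PoolRouter` is a hypothetical name.

```python
from typing import List, Optional

ADVANCING = {"navigate", "analyze_page"}  # tools that advance the round-robin


class PoolRouter:
    """Toy model of round-robin dispatch with pinning (illustrative only)."""

    def __init__(self, browser_ids: List[str]) -> None:
        if not browser_ids:
            raise ValueError("pool is empty")
        self._browsers = list(browser_ids)
        self._next = 0
        self._current: Optional[str] = None

    def route(self, tool: str) -> str:
        """Return the browser ID that should receive this tool call."""
        if tool in ADVANCING or self._current is None:
            self._current = self._browsers[self._next]
            self._next = (self._next + 1) % len(self._browsers)
        return self._current


router = PoolRouter(["browser-a", "browser-b"])
targets = [router.route(t) for t in ["analyze_page", "click", "type", "analyze_page"]]
```

Under these assumptions the `click` and `type` calls land on the same browser as the `analyze_page` that produced their element IDs.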

Additional tool: pool_status — shows pool size and connected browsers.

### Cookie Sync

Cookies are scoped to the API key. Browsers sharing a key share cookies;
different API keys are fully isolated from each other.
- On connect: browser sends its cookies, receives the jar for its API key
- On change: cookie changes are broadcast to other browsers with the same key
- Login once with key K → every browser using K gets the session; other keys are unaffected
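
The scoping rules above can be illustrated with a small in-memory model. This is a simplified sketch, not the server's implementation: cookies are keyed by name only (real cookies also carry domain and path), and whether a newly connected browser's cookies are pushed to browsers that are already connected is not specified here, so the model does not do that. `CookieHub` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict

Cookies = Dict[str, str]


class CookieHub:
    """Toy per-API-key cookie jar with broadcast on change (illustrative only)."""

    def __init__(self) -> None:
        self._jars: Dict[str, Cookies] = defaultdict(dict)
        # api_key -> browser_id -> that browser's local view of its cookies
        self._local: Dict[str, Dict[str, Cookies]] = defaultdict(dict)

    def connect(self, api_key: str, browser_id: str, cookies: Cookies) -> Cookies:
        """On connect: merge the browser's cookies, hand back the jar for its key."""
        jar = self._jars[api_key]
        jar.update(cookies)
        self._local[api_key][browser_id] = dict(jar)
        return dict(jar)

    def change(self, api_key: str, name: str, value: str) -> None:
        """On change: update the jar and every browser sharing the same key."""
        self._jars[api_key][name] = value
        for local in self._local[api_key].values():
            local[name] = value

    def cookies_of(self, api_key: str, browser_id: str) -> Cookies:
        return dict(self._local[api_key][browser_id])


hub = CookieHub()
hub.connect("K", "b1", {})
hub.connect("K", "b2", {})
hub.connect("J", "b3", {})
hub.change("K", "session", "abc")   # login happens once, under key K
```

After the last line, both browsers on key K hold the session cookie while the browser on key J is untouched.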

### Pool REST API

    GET  /pool                      Pool status (size + browser list)
    POST /pool/command              Round-robin command dispatch
    GET  /pool/cookies              View shared cookie jar
    DELETE /pool/cookies            Clear shared cookie jar
    POST /fleet/provision?count=N   Batch-generate N API keys (admin only)

### Pool MCP Config (for AI clients)

    {
      "mcpServers": {
        "oya-pool": {
          "url": "https://browser.oya.ai/mcp/pool",
          "transport": "streamable-http",
          "headers": {
            "Authorization": "Bearer FLEET_TOKEN"
          }
        }
      }
    }

## REST API

All endpoints require Authorization: Bearer API_KEY header (except /health).
API keys are created via the dashboard after sign-in (POST /auth/keys).
Interactive API testing (Swagger UI): GET /swagger

    GET  /health                    Server status + browser count
    POST /auth/signup               Create user account
    POST /auth/login                Sign in → access_token
    POST /auth/keys                 Create API key (user JWT required)
    GET  /browsers                  List connected browsers (scoped to your key)
    POST /browsers/:id/command      Send command (body: { "action": "...", "params": {} })
    POST /browsers/:id/chat         Chat with LLM (body: { "messages": [...] })
    GET  /live/:id?key=...          SSE live view frame stream
    GET  /mcp/:id                   MCP Streamable HTTP endpoint
    POST /mcp/:id                   MCP Streamable HTTP endpoint
    POST /mcp/pool                  Pool MCP endpoint (round-robin)
    GET  /pool                      Pool status
    POST /pool/command              Pool round-robin command
    GET  /pool/cookies              Per-key cookie jar (admin sees all)
    DELETE /pool/cookies            Clear shared cookie jar
    POST /fleet/provision?count=N   Batch-generate API keys (admin)
    GET  /config                    Get server settings (admin)
    POST /config                    Update server settings (admin)

### Command API — POST /browsers/:id/command

Each action uses only specific params. Send { "action": "...", "params": { ... } }.

Navigation actions:
  navigate    — params: url (required)              — Navigate to a URL
  open_tab    — params: url (optional)              — Open a new tab
  switch_tab  — params: tab_id (required)           — Activate a tab by ID
  close_tab   — params: tab_id (optional)           — Close a tab (defaults to active)
  list_tabs   — no params                           — List all open tabs

Page analysis actions:
  analyze     — no params                           — Full page as markdown + numbered elements
  read_page   — params: selector, limit (default 50) — Lightweight element listing
  screenshot  — no params                           — Capture page as PNG

Interaction actions:
  click       — params: selector (e.g. [data-ac-id="3"])       — Click an element
  type        — params: selector + text                        — Type into an input
  press_key   — params: key (Enter, Tab, Escape, etc.)         — Press a keyboard key
  scroll      — params: direction (up/down), amount (default 500) — Scroll the page
  wait        — params: selector, timeout (ms, default 10000)  — Wait for element to appear

Examples:
  { "action": "navigate", "params": { "url": "https://google.com" } }
  { "action": "analyze" }
  { "action": "click", "params": { "selector": "[data-ac-id=\"3\"]" } }
  { "action": "type", "params": { "selector": "[data-ac-id=\"9\"]", "text": "hello" } }
  { "action": "press_key", "params": { "key": "Enter" } }
  { "action": "scroll", "params": { "direction": "down", "amount": 500 } }
  { "action": "screenshot" }
  { "action": "list_tabs" }
  { "action": "open_tab", "params": { "url": "https://gmail.com" } }
  { "action": "switch_tab", "params": { "tab_id": 2 } }
  { "action": "close_tab", "params": { "tab_id": 3 } }
  { "action": "wait", "params": { "selector": ".results", "timeout": 10000 } }
  { "action": "read_page", "params": { "limit": 20 } }
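
Because each action accepts only specific params, it can help to validate a command body before sending it. The sketch below transcribes the tables above into a lookup and builds the JSON body. Which interaction params are required is partly an assumption carried over from the MCP tool list (for example, `selector` for click and `direction` for scroll), and `make_command` is a hypothetical helper.

```python
from typing import Any, Dict, Set, Tuple

# action -> (required params, optional params)
ACTIONS: Dict[str, Tuple[Set[str], Set[str]]] = {
    "navigate":   ({"url"}, set()),
    "open_tab":   (set(), {"url"}),
    "switch_tab": ({"tab_id"}, set()),
    "close_tab":  (set(), {"tab_id"}),
    "list_tabs":  (set(), set()),
    "analyze":    (set(), set()),
    "read_page":  (set(), {"selector", "limit"}),
    "screenshot": (set(), set()),
    "click":      ({"selector"}, set()),
    "type":       ({"selector", "text"}, set()),
    "press_key":  ({"key"}, set()),
    "scroll":     ({"direction"}, {"amount"}),
    "wait":       ({"selector"}, {"timeout"}),
}


def make_command(action: str, **params: Any) -> Dict[str, Any]:
    """Return a body for POST /browsers/:id/command, or raise ValueError."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    required, optional = ACTIONS[action]
    given = set(params)
    missing = required - given
    unknown = given - required - optional
    if missing:
        raise ValueError(f"{action}: missing params {sorted(missing)}")
    if unknown:
        raise ValueError(f"{action}: unexpected params {sorted(unknown)}")
    # Param-less actions omit "params", matching the examples above.
    return {"action": action, "params": params} if params else {"action": action}


body = make_command("click", selector='[data-ac-id="3"]')
```

Serializing `body` with `json.dumps` yields the same shape as the click example above.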

## WebSocket Protocol

Browsers connect via WebSocket at wss://browser.oya.ai/ws

Auth: { "type": "auth", "api_key": "...", "browser_id": "...", "browser_name": "..." }
Response: { "type": "auth_ok", "browser_id": "..." }

Commands (server → browser): { "type": "cmd", "id": "uuid", "action": "analyze", "params": {} }
Results (browser → server): { "type": "cmd_result", "id": "uuid", "ok": true, "data": { ... } }

Ping/pong: Both sides send { "type": "ping" } every 15-20s, and each side answers a received ping with { "type": "pong" }.
Live stream: { "type": "stream_start", "fps": 2 } / { "type": "frame", "data": "data:image/jpeg;base64,..." } / { "type": "stream_stop" }
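
The browser side of this protocol can be sketched as two pure functions over JSON strings, independent of any particular WebSocket library. `run_action` is a hypothetical callable that executes a command in the browser. The `error` field on failed results is an assumption: only the success shape is shown above.

```python
import json
from typing import Any, Callable, Optional


def auth_message(api_key: str, browser_id: str, browser_name: str) -> str:
    """First frame a browser sends after the socket opens."""
    return json.dumps({"type": "auth", "api_key": api_key,
                       "browser_id": browser_id, "browser_name": browser_name})


def handle_message(raw: str,
                   run_action: Callable[[str, dict], Any]) -> Optional[str]:
    """Return the reply frame for an incoming server message, or None."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "ping":
        return json.dumps({"type": "pong"})
    if kind == "cmd":
        try:
            data = run_action(msg["action"], msg.get("params", {}))
            result = {"type": "cmd_result", "id": msg["id"], "ok": True, "data": data}
        except Exception as exc:
            # Assumed failure shape: the "error" key is not documented above.
            result = {"type": "cmd_result", "id": msg["id"], "ok": False,
                      "error": str(exc)}
        return json.dumps(result)
    return None  # auth_ok, pong, and stream control frames need no reply here
```

Echoing the command's `id` in the result is what lets the server match a `cmd_result` to the `cmd` that triggered it.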

## Key Scoping

Each API key only sees browsers connected with that key. Users cannot see or control other users' browsers. Admin keys (set via API_KEYS env var) can see all browsers.

## OpenAPI Schema

Machine-readable OpenAPI 3.0 spec: GET /openapi.json