---
title: Transcription
description: Convert speech to text with the /v1/listen endpoint
---


## Overview

The transcription endpoint converts audio files to text using Lightning ASR. It supports both batch processing and streaming.

## Endpoint

```
POST /v1/listen
```

## Authentication

Requires token authentication with your license key, passed via the `Authorization` header:

```http
Authorization: Token YOUR_LICENSE_KEY
```

See [Authentication](/waves/self-host/api-reference/authentication) for details.

## Request

### From URL

Transcribe audio from a publicly accessible URL:

```json
{
  "url": "https://example.com/audio.wav"
}
```

### From File Upload

Upload audio directly:

```bash
curl -X POST http://localhost:7100/v1/listen \
  -H "Authorization: Token ${LICENSE_KEY}" \
  -F "audio=@/path/to/audio.wav"
```

### Parameters

<ParamField path="url" type="string">
  URL to audio file (mutually exclusive with file upload)

  Supported protocols: `http://`, `https://`, `s3://`
</ParamField>

<ParamField path="audio" type="file">
  Audio file upload (mutually exclusive with URL)

  Supported formats: WAV, MP3, FLAC, OGG, M4A
</ParamField>

<ParamField path="language" type="string" default="en">
  Language code (ISO 639-1)

  Examples: `en`, `es`, `fr`, `de`, `zh`
</ParamField>

<ParamField path="punctuate" type="boolean" default="true">
  Add punctuation to transcript
</ParamField>

<ParamField path="diarize" type="boolean" default="false">
  Enable speaker diarization (identify different speakers)
</ParamField>

<ParamField path="num_speakers" type="integer">
  Expected number of speakers (for diarization)

  If not specified, automatically detected
</ParamField>

<ParamField path="timestamps" type="boolean" default="false">
  Include word-level timestamps
</ParamField>

<ParamField path="callback_url" type="string">
  Webhook URL for async results delivery

  If provided, the request returns immediately with a job ID
</ParamField>

## Response

### Successful Response

```json
{
  "request_id": "req_abc123",
  "text": "Hello, this is a sample transcription.",
  "confidence": 0.95,
  "duration": 3.2,
  "language": "en",
  "words": [
    {
      "word": "Hello",
      "start": 0.0,
      "end": 0.5,
      "confidence": 0.98
    },
    {
      "word": "this",
      "start": 0.6,
      "end": 0.8,
      "confidence": 0.97
    }
  ]
}
```

### Response Fields

<ResponseField name="request_id" type="string">
  Unique identifier for this transcription request
</ResponseField>

<ResponseField name="text" type="string">
  Complete transcription text
</ResponseField>

<ResponseField name="confidence" type="float">
  Overall confidence score (0.0 to 1.0)
</ResponseField>

<ResponseField name="duration" type="float">
  Audio duration in seconds
</ResponseField>

<ResponseField name="language" type="string">
  Detected or specified language
</ResponseField>

<ResponseField name="words" type="array">
  Word-level details (if `timestamps: true`)

  Each word object contains:

  * `word`: The word text
  * `start`: Start time in seconds
  * `end`: End time in seconds
  * `confidence`: Word confidence score
</ResponseField>

## Examples

### Basic Transcription

<Tabs>
  <Tab title="cURL">
    ```bash
    curl -X POST http://localhost:7100/v1/listen \
      -H "Authorization: Token ${LICENSE_KEY}" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://example.com/audio.wav"
      }'
    ```
  </Tab>

  <Tab title="Python">
    ```python
    import requests

    response = requests.post(
        "http://localhost:7100/v1/listen",
        headers={
            "Authorization": f"Token {LICENSE_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "url": "https://example.com/audio.wav"
        }
    )

    result = response.json()
    print(result['text'])
    ```
  </Tab>

  <Tab title="JavaScript">
    ```javascript
    const response = await fetch('http://localhost:7100/v1/listen', {
      method: 'POST',
      headers: {
        'Authorization': `Token ${LICENSE_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        url: 'https://example.com/audio.wav'
      })
    });

    const result = await response.json();
    console.log(result.text);
    ```
  </Tab>
</Tabs>

### With Punctuation and Timestamps

```json
{
  "url": "https://example.com/audio.wav",
  "punctuate": true,
  "timestamps": true
}
```

Response:

```json
{
  "request_id": "req_abc123",
  "text": "Hello, this is a sample transcription.",
  "confidence": 0.95,
  "duration": 3.2,
  "words": [
    {"word": "Hello", "start": 0.0, "end": 0.5, "confidence": 0.98},
    {"word": ",", "start": 0.5, "end": 0.5, "confidence": 1.0},
    {"word": "this", "start": 0.6, "end": 0.8, "confidence": 0.97}
  ]
}
```
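
If you need captions, the `words` array can be grouped into timed cues. A minimal sketch (the `make_captions` helper and its 5-second window are illustrative, not part of the API):

```python
def make_captions(words, window=5.0):
    """Group word timings into caption cues of at most `window` seconds."""
    cues, current, cue_start = [], [], None
    for w in words:
        if cue_start is None:
            cue_start = w["start"]
        current.append(w["word"])
        if w["end"] - cue_start >= window:
            cues.append({"start": cue_start, "end": w["end"], "text": " ".join(current)})
            current, cue_start = [], None
    if current:
        # Flush the final partial cue.
        cues.append({"start": cue_start, "end": words[-1]["end"], "text": " ".join(current)})
    return cues

words = [
    {"word": "Hello", "start": 0.0, "end": 0.5, "confidence": 0.98},
    {"word": "this", "start": 0.6, "end": 0.8, "confidence": 0.97},
]
print(make_captions(words))
```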

### With Speaker Diarization

```json
{
  "url": "https://example.com/conversation.wav",
  "diarize": true,
  "num_speakers": 2
}
```

Response:

```json
{
  "request_id": "req_abc123",
  "text": "Hello. Hi there!",
  "speakers": [
    {
      "speaker": "SPEAKER_00",
      "text": "Hello.",
      "start": 0.0,
      "end": 0.8
    },
    {
      "speaker": "SPEAKER_01",
      "text": "Hi there!",
      "start": 1.0,
      "end": 1.8
    }
  ]
}
```
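
The `speakers` array is straightforward to render as a readable transcript. A small sketch (the `format_turns` helper is illustrative):

```python
def format_turns(speakers):
    """Render diarized segments as 'SPEAKER [start-end]: text' lines."""
    return "\n".join(
        f"{s['speaker']} [{s['start']:.1f}-{s['end']:.1f}]: {s['text']}"
        for s in speakers
    )

speakers = [
    {"speaker": "SPEAKER_00", "text": "Hello.", "start": 0.0, "end": 0.8},
    {"speaker": "SPEAKER_01", "text": "Hi there!", "start": 1.0, "end": 1.8},
]
print(format_turns(speakers))
```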

### File Upload

```bash
curl -X POST http://localhost:7100/v1/listen \
  -H "Authorization: Token ${LICENSE_KEY}" \
  -F "audio=@recording.wav" \
  -F "punctuate=true" \
  -F "language=en"
```
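
The same upload can be done from Python using `requests` multipart form data. A sketch (the `transcribe_file` wrapper is illustrative, not an official client):

```python
import requests

def transcribe_file(path, license_key, base_url="http://localhost:7100", **params):
    """Upload a local audio file as multipart form data and return the parsed JSON."""
    with open(path, "rb") as f:
        response = requests.post(
            f"{base_url}/v1/listen",
            headers={"Authorization": f"Token {license_key}"},
            files={"audio": f},
            data=params,  # e.g. punctuate="true", language="en"
        )
    response.raise_for_status()
    return response.json()

# Example:
# result = transcribe_file("recording.wav", LICENSE_KEY, punctuate="true", language="en")
```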

### Async with Callback

```json
{
  "url": "https://example.com/long-audio.wav",
  "callback_url": "https://myapp.com/webhook/transcription"
}
```

Immediate response:

```json
{
  "job_id": "job_xyz789",
  "status": "processing"
}
```

Later, webhook receives:

```json
{
  "job_id": "job_xyz789",
  "status": "completed",
  "result": {
    "text": "...",
    "confidence": 0.95
  }
}
```
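
Your webhook endpoint should check `status` before reading `result`. A minimal handler sketch (the function name and the choice to ignore non-completed payloads are illustrative):

```python
def handle_transcription_webhook(payload):
    """Process a /v1/listen callback payload; return the transcript or None."""
    if payload.get("status") != "completed":
        # Still processing, or failed; nothing to store yet.
        return None
    return payload["result"]["text"]

payload = {
    "job_id": "job_xyz789",
    "status": "completed",
    "result": {"text": "Full transcript here.", "confidence": 0.95},
}
print(handle_transcription_webhook(payload))
```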

## Error Responses

### 400 Bad Request

```json
{
  "error": "Missing required parameter: url or audio file",
  "code": "MISSING_PARAMETER"
}
```

### 415 Unsupported Media Type

```json
{
  "error": "Unsupported audio format",
  "code": "UNSUPPORTED_FORMAT",
  "supported_formats": ["wav", "mp3", "flac", "ogg", "m4a"]
}
```

### 422 Unprocessable Entity

```json
{
  "error": "Audio file too large",
  "code": "FILE_TOO_LARGE",
  "max_size_mb": 100
}
```

### 503 Service Unavailable

```json
{
  "error": "No ASR workers available",
  "code": "SERVICE_UNAVAILABLE",
  "retry_after": 30
}
```
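
For the 503 case, the response body includes a `retry_after` hint you can honor before retrying. A sketch (the `post_with_retry` helper is illustrative):

```python
import time
import requests

def post_with_retry(url, headers, json_body, max_attempts=3):
    """POST a request, honoring the server's `retry_after` hint on 503 responses."""
    for attempt in range(max_attempts):
        response = requests.post(url, headers=headers, json=json_body)
        if response.status_code != 503:
            return response
        # Fall back to exponential backoff if the hint is missing.
        wait = response.json().get("retry_after", 2 ** attempt)
        time.sleep(wait)
    return response
```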

## Audio Format Requirements

### Supported Formats

<table>
  <thead>
    <tr>
      <th>
        Format
      </th>

      <th>
        Extension
      </th>

      <th>
        Notes
      </th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>
        WAV
      </td>

      <td>
        <code>.wav</code>
      </td>

      <td>
        Recommended for best quality
      </td>
    </tr>

    <tr>
      <td>
        MP3
      </td>

      <td>
        <code>.mp3</code>
      </td>

      <td>
        Widely supported
      </td>
    </tr>

    <tr>
      <td>
        FLAC
      </td>

      <td>
        <code>.flac</code>
      </td>

      <td>
        Lossless compression
      </td>
    </tr>

    <tr>
      <td>
        OGG
      </td>

      <td>
        <code>.ogg</code>
      </td>

      <td>
        Open format
      </td>
    </tr>

    <tr>
      <td>
        M4A
      </td>

      <td>
        <code>.m4a</code>
      </td>

      <td>
        Apple format
      </td>
    </tr>
  </tbody>
</table>

### Recommended Specifications

* **Sample Rate**: 16 kHz or higher (44.1 kHz recommended)
* **Bit Depth**: 16-bit or higher
* **Channels**: Mono or stereo
* **Max Duration**: 2 hours
* **Max File Size**: 100 MB

### Audio Preprocessing

For best results:

* Remove background noise
* Normalize audio levels
* Use mono audio when possible
* Encode at 16 kHz or 44.1 kHz
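
You can verify a WAV file against these recommendations before uploading, using Python's standard `wave` module. A sketch (the `check_wav` helper and its `meets_minimum` field are illustrative):

```python
import wave

def check_wav(path_or_file):
    """Report a WAV file's sample rate, bit depth, and channel count."""
    with wave.open(path_or_file, "rb") as wf:
        specs = {
            "sample_rate": wf.getframerate(),
            "bit_depth": wf.getsampwidth() * 8,
            "channels": wf.getnchannels(),
        }
    # 16 kHz / 16-bit are the documented minimums.
    specs["meets_minimum"] = specs["sample_rate"] >= 16000 and specs["bit_depth"] >= 16
    return specs
```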

## Rate Limits

Default rate limits:

* **Requests per minute**: 60
* **Concurrent requests**: 10
* **Audio hours per day**: 100

<Note>
  Contact [support@smallest.ai](mailto:support@smallest.ai) to increase limits for your license.
</Note>

## Performance

Typical performance metrics:

<table>
  <thead>
    <tr>
      <th>
        Metric
      </th>

      <th>
        Value
      </th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>
        Real-time Factor
      </td>

      <td>
        0.05-0.15x
      </td>
    </tr>

    <tr>
      <td>
        Latency (1 min audio)
      </td>

      <td>
        3-9 seconds
      </td>
    </tr>

    <tr>
      <td>
        Concurrent capacity
      </td>

      <td>
        100+ requests
      </td>
    </tr>

    <tr>
      <td>
        Throughput
      </td>

      <td>
        100+ hours/hour
      </td>
    </tr>
  </tbody>
</table>

<Tip>
  Performance varies based on:

  * Audio duration and complexity
  * Number of speakers
  * GPU instance type
  * Current load
</Tip>

## Best Practices

<AccordionGroup>
  <Accordion title="Optimize Audio Quality">
    * Use lossless formats (WAV, FLAC) when possible
    * Ensure clear audio with minimal background noise
    * Use appropriate sample rate (16 kHz minimum)
  </Accordion>

  <Accordion title="Handle Errors Gracefully">
    Implement retry logic with exponential backoff:

    ```python
    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    session = requests.Session()
    retry = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504]
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    ```
  </Accordion>

  <Accordion title="Use Async for Long Audio">
    For audio longer than 5 minutes, use a callback URL:

    ```json
    {
      "url": "https://example.com/podcast.mp3",
      "callback_url": "https://myapp.com/webhook"
    }
    ```
  </Accordion>

  <Accordion title="Cache Results">
    Cache transcription results to avoid duplicate processing:

    ```python
    import hashlib

    cache = {}  # in-memory for illustration; use Redis or similar in production

    def get_cache_key(audio_url):
        return hashlib.md5(audio_url.encode()).hexdigest()

    def transcribe_cached(audio_url):
        cache_key = get_cache_key(audio_url)
        if cache_key in cache:
            return cache[cache_key]
        result = transcribe(audio_url)  # your transcription call
        cache[cache_key] = result
        return result
    ```
  </Accordion>
</AccordionGroup>

## What's Next?

<CardGroup cols={2}>
  <Card title="Health Check" href="/waves/self-host/api-reference/endpoints/health-check">
    Monitor service availability
  </Card>

  <Card title="Examples" href="/waves/self-host/api-reference/examples">
    Complete integration examples
  </Card>
</CardGroup>