Speech Overview

You can use the Speech service to convert media files to readable text that's stored in JSON and SRT formats.

Speech harnesses the power of spoken language, enabling you to easily convert media files containing human speech into highly accurate text transcriptions. The service is an Oracle Cloud Infrastructure (OCI) native application that you can access using the Console, REST API, CLI, and SDK. In addition, you can use the Speech service in a Data Science notebook session.
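
For example, you can create a Speech client from a Python script or notebook session using the OCI Python SDK. The following is a minimal sketch: it assumes a standard ~/.oci/config file, and the ai_speech client class name is an assumption to verify against the SDK reference for your version.

    # Minimal sketch, assuming the OCI Python SDK's ai_speech module and a
    # default ~/.oci/config file; verify class names against the SDK reference.
    import oci

    config = oci.config.from_file()
    speech_client = oci.ai_speech.AIServiceSpeechClient(config)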

Speech uses automatic speech recognition (ASR) technology to provide grammatically correct transcriptions. Speech handles low-fidelity media recordings and transcribes challenging recordings such as meetings or call center calls. Using Speech, you can turn files stored in Object Storage or a data asset into accurate, normalized, timestamped, and profanity-filtered text. You can also pass that output to downstream services for further processing; for example, you could index the output of Speech (a text file) using Data Lake. Without such downstream services, that capability doesn't exist in Speech itself.

Figure: The Speech engine process, from media input through the front end and back end to the transcription results.

The Speech models are robust across acoustic environments and recording channels, ensuring high-quality transcriptions.

Multiple Media Format Support per Language

The Speech service supports these media formats for all of its supported languages:

  • AAC
  • AC3
  • AMR
  • AU
  • FLAC
  • M4A
  • MKV
  • MP3
  • MP4
  • OGA
  • OGG
  • WAV
  • WEBM

The following languages and minimum sample rates are supported:

Language                  Language Code    Sample Rate
English—United States     en-US            >= 8 kHz
Spanish—Spain             es-ES            >= 8 kHz
Portuguese—Brazil         pt-BR            >= 8 kHz
English—Great Britain     en-GB            >= 16 kHz
English—Australia         en-AU            >= 16 kHz
English—India             en-IN            >= 16 kHz
Hindi—India               hi-IN            >= 16 kHz
French—France             fr-FR            >= 16 kHz
German—Germany            de-DE            >= 16 kHz
Italian—Italy             it-IT            >= 16 kHz

For best results:

  • Use a lossless format such as FLAC or WAV with PCM 16-bit encoding.
  • Use a sample rate of 8,000 Hz for low-fidelity media and 16,000 to 48,000 Hz for high-fidelity media.

You can use single-channel, 16-bit PCM WAV media files with an 8 kHz or 16 kHz sample rate. We recommend Audacity (GUI) or FFmpeg (command line) for media transcoding. Media files up to four hours long and up to 2 GB in size are supported.
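
For example, the following sketch invokes FFmpeg from Python to transcode a media file into a single-channel, 16-bit PCM WAV file at 16 kHz. It assumes FFmpeg is installed and on your PATH; the input and output file names are placeholders.

    # Transcoding sketch; assumes ffmpeg is installed and on the PATH.
    # "input.mp4" and "output.wav" are placeholder file names.
    import subprocess

    subprocess.run(
        [
            "ffmpeg",
            "-i", "input.mp4",    # source media file
            "-ac", "1",           # mix down to a single channel (mono)
            "-ar", "16000",       # resample to 16 kHz
            "-c:a", "pcm_s16le",  # 16-bit PCM encoding
            "output.wav",
        ],
        check=True,
    )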

Speech is sensitive to the quality of the input media files. Different accents, background noise, switching from one language to another, using fusion languages, or multiple speakers talking at the same time can affect the quality of the transcription.

Speech Provides These Capabilities

  • Accurate transcriptions—Produces accurate, easy-to-use JSON and SubRip Subtitle (SRT) files written directly to the Object Storage bucket that you choose. You can integrate the transcription directly with applications, and use it for subtitles or for content search and analysis.

  • Whisper model—Trained on multilingual data collected from the web, the Whisper model supports file-based speech-to-text transcription in more than 50 languages.

  • Timestamped JSON—The transcription provides a timestamp for each token (word). You can use the timestamps to search for text within the media file and then quickly jump to that location.

  • Multilingual—Produces accurate transcriptions in English (United States, Great Britain, Australia, and India), Spanish, Portuguese, French, Italian, German, and Hindi.

  • Asynchronous API—Straightforward asynchronous APIs with transcription task batching. The APIs enable canceling jobs that aren't yet processed, saving time and money.

  • Text normalization—Provides text normalization for numbers, addresses, currencies, and so on. With text normalization, you get a higher-quality transcription that's easier to read and understand.

  • Profanity filtering—Allows you to remove, mask, or tag offensive words in the transcription.

  • Confidence score per word and transcription—Produces word-level and transcription-level confidence scores in the generated JSON file. You can use the confidence scores to quickly identify words that require attention (see the sketch after this list).

  • Closed captions—Provides you with an SRT file as an extra output format. Use the SRT to add closed captions to video files.

  • Punctuation—Long text requires punctuation, so Speech punctuates the transcribed content automatically.

  • Telephony ready—Files can be 8 kHz or 16 kHz, and the sample rate is detected automatically so that the correct model is applied. With this capability, you can transcribe telephone recordings.

  • Speaker diarization—Associates transcription text with specific speakers, enabling natural-language understanding scenarios such as extracting a prescription from medical audio by distinguishing the service provider from the patient. Speaker diarization is a combination of speaker segmentation and speaker clustering. Speaker segmentation finds the speaker change points in an audio stream, and speaker clustering groups speech segments based on speaker characteristics.
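
The sketch below illustrates how you might consume the timestamped, confidence-scored JSON output, flagging low-confidence tokens for review. The field names used here (transcriptions, tokens, startTime, confidence) are hypothetical placeholders; inspect an actual output file in your bucket for the exact schema.

    # Hypothetical post-processing sketch; the JSON field names are
    # assumptions, so confirm them against a real transcription file.
    import json

    with open("transcription.json") as f:  # placeholder file name
        result = json.load(f)

    for transcription in result.get("transcriptions", []):
        for token in transcription.get("tokens", []):
            if token.get("confidence", 1.0) < 0.5:  # flag low-confidence words
                print(token.get("startTime"), token.get("token"))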

Key Concepts

These are the key Speech service concepts:

Transcription Jobs

A job is a single asynchronous request from the Console or the Speech API. Each job is uniquely identified by an ID, which you can use to retrieve the job status and results.

Jobs in a tenancy are processed in strict first-in, first-out order. Each job can contain up to 100 tasks. If you submit a job that exceeds the maximum number of tasks, the job fails. Jobs are retained for 90 days.
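
As a rough illustration, the following sketch submits a job for a single Object Storage file using the OCI Python SDK. The model class names and fields are assumptions based on the SDK's ai_speech module, so verify them against the SDK reference, and replace the compartment OCID, namespace, and bucket names with your own.

    # Sketch only: the ai_speech model and client names below are assumptions
    # to verify against the OCI Python SDK reference; all OCIDs, namespaces,
    # and bucket names are placeholders.
    import oci

    config = oci.config.from_file()
    client = oci.ai_speech.AIServiceSpeechClient(config)

    details = oci.ai_speech.models.CreateTranscriptionJobDetails(
        compartment_id="ocid1.compartment.oc1..example",
        input_location=oci.ai_speech.models.ObjectListInlineInputLocation(
            object_locations=[
                oci.ai_speech.models.ObjectLocation(
                    namespace_name="my-namespace",
                    bucket_name="input-bucket",
                    object_names=["meeting.wav"],  # one file becomes one task
                )
            ],
        ),
        output_location=oci.ai_speech.models.OutputLocation(
            namespace_name="my-namespace",
            bucket_name="output-bucket",
            prefix="transcriptions/",
        ),
    )

    job = client.create_transcription_job(details).data
    print(job.id, job.lifecycle_state)  # use the job ID to poll for results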

Live Transcribe Jobs
Live Transcribe jobs allow you to send an audio stream to the service and receive the results as text (in JSON and SRT formats) in real time.
Tasks

A task is the result of a single file processed in a job. Jobs can have multiple tasks based on the files in the Object Storage bucket that you specify for the job.

Models

Pretrained acoustic and language models, including Whisper models, power the job transcription process.

Authentication and Authorization

Each service in OCI integrates with IAM for authentication and authorization, for all interfaces (the Console, SDK or CLI, and REST API).

An administrator in your organization needs to set up groups, compartments, and policies that control which users can access which services, which resources, and the type of access. For example, the policies control who can create new users, create and manage the cloud network, launch instances, create buckets, download objects, and so on. For more information, see Getting Started with Policies.

If you're a regular user (not an administrator) who needs to use the OCI resources that your company owns, contact your administrator to set up a user ID for you. The administrator can confirm which compartment or compartments you should be using.
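
For example, an administrator might write a policy statement along these lines to let a group work with Speech resources. The group and compartment names are placeholders, and the resource-type name shown (ai-service-speech-family) is an assumption to confirm in the IAM policy reference.

    allow group SpeechUsers to manage ai-service-speech-family in compartment MyCompartment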

Resource Identifiers

The Speech service supports jobs and tasks as OCI resources. Most types of resources have a unique, Oracle-assigned identifier called an Oracle Cloud ID (OCID). For information about the OCID format and other ways to identify your resources, see Resource Identifiers.

Regions and Availability Domains

Speech is available in all OCI commercial regions. See About Regions and Availability Domains for the list of available regions for OCI, along with associated locations, region identifiers, region keys, and availability domains.

Note

Text to Speech is only available in the US West (Phoenix) commercial region.

Ways to Access

You can access Speech using the Console (a browser-based interface), the command line interface (CLI), or the REST API. Instructions for the Console, CLI, and API are included in topics throughout this guide.

To access the Console, you must use a supported browser. To go to the Console sign-in page, open the navigation menu at the top of this page and click Infrastructure Console. You are prompted to enter your cloud tenant, your user name, and your password.

For a list of available SDKs, see SDKs and the CLI. For general information about using the APIs, see REST API.

Service Limits

In each region that is enabled for your tenancy, these limits apply:

File Limits

  • The maximum file size is 2 GB.

  • File duration is a maximum of 4 hours.

Job Limits

  • Each job can have up to 100 tasks.

  • Jobs are retained for 90 days.

Text to Speech

Text to Speech supports a maximum of 10,000 characters per request.