Skip to main content
Scrape functions extract data from HTML pages using CSS selectors, similar to Home Assistant’s scrape integration. They’re useful for websites without APIs.

Configuration

function:
  type: scrape
  resource: https://example.com
  value_template: "{{ value }}"
  sensor:
    - name: sensor_name
      select: ".css-selector"
      value_template: "{{ value }}"
resource
string
required
The URL of the web page to scrape
sensor
array
required
List of data points to extract from the page
value_template
string
required
Template that combines scraped sensor values into the function result

Sensor Configuration

Each sensor extracts a piece of data from the page:
name
string
required
Variable name to use in the main value_template
select
string
required
CSS selector to find the element
value_template
string
Template to process the selected element’s text
attribute
string
Extract an HTML attribute instead of text (e.g., href, src)

CSS Selectors

.classname        /* Select by class */
#idname          /* Select by ID */
tagname          /* Select by tag */
[attribute]      /* Select by attribute */
parent child           /* Descendant */
parent > child         /* Direct child */
element + sibling      /* Adjacent sibling */
element ~ sibling      /* General sibling */
[href^="https"]        /* Starts with */
[href$=".pdf"]         /* Ends with */
[href*="example"]      /* Contains */
[data-id="123"]        /* Exact match */
:first-child           /* First child */
:last-child            /* Last child */
:nth-child(2)          /* Nth child */
:not(.excluded)        /* Negation */

Advanced Features

Get HTML attributes instead of text:
sensor:
  - name: image_url
    select: "img.product-image"
    attribute: src
  - name: link
    select: "a.read-more"
    attribute: href
Extract all matching elements:
sensor:
  - name: all_prices
    select: ".price"
    index: all  # Returns a list
Add custom headers (e.g., for authentication):
function:
  type: scrape
  resource: https://example.com
  headers:
    User-Agent: "Mozilla/5.0"
    Cookie: "session=abc123"
  sensor:
    - name: data
      select: ".content"

Use Cases

Version Tracking

Monitor software versions and releases

Price Monitoring

Track product prices and availability

News Aggregation

Extract headlines and articles

Status Pages

Monitor service status pages

Best Practices

1

Inspect the page

Use browser DevTools to find the right CSS selectors. Right-click an element and select “Inspect” to see its HTML structure.
2

Handle missing elements

Pages may change. Use safe filters and defaults:
{{ value | default('Not found') }}
3

Respect robots.txt

Check the website’s robots.txt to ensure scraping is allowed. Add appropriate delays for rate limiting.
4

Add User-Agent

Some sites block requests without a User-Agent header:
headers:
  User-Agent: "Home Assistant Extended OpenAI Conversation"

Debugging

Test CSS selectors in browser console:
document.querySelector('.css-selector')
document.querySelectorAll('.css-selector')
Enable debug logging:
logger:
  logs:
    custom_components.extended_openai_conversation: debug

Examples

Get Home Assistant Version

Scrape the current version from home-assistant.io:
- spec:
    name: get_ha_version
    description: Use this function to get Home Assistant version
    parameters:
      type: object
      properties:
        dummy:
          type: string
          description: Not used (placeholder)
  function:
    type: scrape
    resource: https://www.home-assistant.io
    value_template: "version: {{version}}, release_date: {{release_date}}"
    sensor:
      - name: version
        select: ".current-version h1"
        value_template: '{{ value.split(":")[1] }}'
      - name: release_date
        select: ".release-date"
        value_template: '{{ value.lower() }}'
Scrape Example

Scrape Product Price

- spec:
    name: get_product_price
    description: Get current price of a product
    parameters:
      type: object
      properties:
        url:
          type: string
          description: Product page URL
      required:
      - url
  function:
    type: scrape
    resource: "{{ url }}"
    value_template: "Price: {{ price }}, In Stock: {{ stock }}"
    sensor:
      - name: price
        select: ".price-current"
        value_template: '{{ value | replace("$", "") | float }}'
      - name: stock
        select: ".stock-status"
        value_template: '{{ "yes" if "in stock" in value.lower() else "no" }}'

Extract Multiple Items

- spec:
    name: get_news_headlines
    description: Get latest news headlines
    parameters:
      type: object
      properties:
        dummy:
          type: string
  function:
    type: scrape
    resource: https://news.ycombinator.com
    value_template: >-
      Top Stories:
      {% for i in range(1, 6) %}
      {{ i }}. {{ headlines[i-1] }}
      {% endfor %}
    sensor:
      - name: headlines
        select: ".titleline > a"
        value_template: '{{ value }}'
        index: all

Comparison: Scrape vs REST

FeatureScrapeREST
Data formatHTMLJSON/XML
Requires APINoYes
Selector typeCSSJSON path
Best forPublic websitesAPIs with keys
MaintenanceHigher (HTML changes)Lower (stable APIs)

Next Steps