Parsing HTML with Soup Objects
Learn to parse HTML content using gazpacho’s Soup class. Master finding elements by tags, classes, and attributes, navigating HTML structure, and extracting text and data from web pages.
- Create and work with gazpacho Soup objects
- Find elements by tag names, classes, and attributes
- Navigate HTML document structure effectively
- Extract text content and attribute values
- Use attribute dictionaries for precise element targeting
- Handle multiple elements and nested structures
- How do I convert HTML into a searchable Soup object?
- What are the different ways to find elements in HTML?
- How do I extract text content vs. attribute values?
- How can I navigate parent-child relationships in HTML?
This tutorial is based on the gazpacho library by Max Humber (MIT License) and incorporates concepts from the calmcode.io gazpacho course (CC BY 4.0 License).
Introduction to Soup Objects
After fetching HTML with gazpacho.get(), you need to parse it to extract specific data. The Soup class transforms raw HTML into a searchable, navigable object.
Creating a Soup Object
from gazpacho import get, Soup
# Fetch HTML
url = "https://example.com"
html = get(url)
# Create Soup object
soup = Soup(html)
print(f"Created Soup object from {len(html)} characters of HTML")
Basic Soup Structure
A Soup object represents the entire HTML document as a tree structure:
from gazpacho import Soup
# Simple HTML example
html = """
<html>
<head><title>Example Page</title></head>
<body>
<h1 class="main-title">Welcome</h1>
<p class="intro">This is an introduction.</p>
<div class="content">
<p>First paragraph</p>
<p>Second paragraph</p>
</div>
</body>
</html>
"""
soup = Soup(html)
print("Soup object created successfully")
Finding Elements
Finding by Tag Name
Use .find() to locate elements by their HTML tag:
from gazpacho import Soup
html = """
<html>
<body>
<h1>Main Title</h1>
<p>First paragraph</p>
<p>Second paragraph</p>
</body>
</html>
"""
soup = Soup(html)
# The single h1 element is returned directly
title = soup.find('h1')
print(f"Title: {title}")
# With two <p> matches, .find() returns a list by default;
# pass mode='first' to get just the first paragraph
first_paragraph = soup.find('p', mode='first')
print(f"First paragraph: {first_paragraph}")
Finding by Class
Search for elements with specific CSS classes:
from gazpacho import Soup
html = """
<div class="header">Header content</div>
<p class="intro">Introduction text</p>
<p class="content">Main content</p>
<div class="footer">Footer content</div>
"""
soup = Soup(html)
# Find element by class
intro = soup.find('p', {'class': 'intro'})
print(f"Intro: {intro}")
# Find div with header class
header = soup.find('div', {'class': 'header'})
print(f"Header: {header}")
Finding by Multiple Attributes
Combine multiple attributes for precise targeting:
from gazpacho import Soup
html = """
<input type="text" name="username" class="form-input">
<input type="password" name="password" class="form-input">
<input type="submit" value="Login" class="btn">
"""
soup = Soup(html)
# Find by multiple attributes
username_input = soup.find('input', {
    'type': 'text',
    'name': 'username'
})
print(f"Username input: {username_input}")
# Find by type and class
submit_button = soup.find('input', {
    'type': 'submit',
    'class': 'btn'
})
print(f"Submit button: {submit_button}")
Practice finding elements with different selectors:
from gazpacho import Soup
html = """
<article class="post">
<h2 class="post-title">Article Title</h2>
<div class="post-meta">
<span class="author">John Doe</span>
<time datetime="2023-01-15">January 15, 2023</time>
</div>
<div class="post-content">
<p>First paragraph of content.</p>
<p class="highlight">Important paragraph.</p>
</div>
</article>
"""
soup = Soup(html)
# Practice different finding methods
title = soup.find('h2', {'class': 'post-title'})
author = soup.find('span', {'class': 'author'})
date = soup.find('time')
highlight = soup.find('p', {'class': 'highlight'})
print(f"Title: {title.text}")
print(f"Author: {author.text}")
print(f"Date: {date.text}")
print(f"Highlight: {highlight.text}")
Working with Multiple Elements
Finding All Matching Elements
By default, .find() returns a single element when there is one match and a list when several elements match. Pass mode='all' to always get a list (or mode='first' to always get a single element):
from gazpacho import Soup
html = """
<ul>
<li class="item">Item 1</li>
<li class="item">Item 2</li>
<li class="item special">Item 3</li>
<li class="item">Item 4</li>
</ul>
"""
soup = Soup(html)
# Find all list items as a list
all_items = soup.find('li', mode='all')
print(f"Found {len(all_items)} items")
for item in all_items:
    print(f"- {item.text}")
Extracting Data
Getting Text Content
Extract text content from elements:
from gazpacho import Soup
html = """
<div class="product">
<h2 class="name">Laptop Computer</h2>
<span class="price">$999.99</span>
<div class="description">
High-performance laptop with <strong>16GB RAM</strong>
and <em>1TB SSD</em>.
</div>
</div>
"""
soup = Soup(html)
# Extract text content
name = soup.find('h2', {'class': 'name'})
price = soup.find('span', {'class': 'price'})
description = soup.find('div', {'class': 'description'})
print(f"Product: {name.text}")
print(f"Price: {price.text}")
print(f"Description: {description.text}")
Getting Attribute Values
Extract HTML attributes using .attrs:
from gazpacho import Soup
html = """
<div class="profile">
<img src="profile.jpg" alt="User Avatar" width="100" height="100">
<a href="mailto:user@example.com" class="contact">Contact</a>
<time datetime="2023-01-15T10:30:00">January 15, 2023</time>
</div>
"""
soup = Soup(html)
# Extract attribute values
image = soup.find('img')
link = soup.find('a', {'class': 'contact'})
timestamp = soup.find('time')
print(f"Image source: {image.attrs['src']}")
print(f"Image alt text: {image.attrs['alt']}")
print(f"Link href: {link.attrs['href']}")
print(f"DateTime: {timestamp.attrs['datetime']}")
# Handle missing attributes safely
width = image.attrs.get('width', 'Not specified')
print(f"Image width: {width}")
Working with Complex Structures
Handle nested and complex HTML structures:
from gazpacho import Soup
html = """
<table class="data-table">
<thead>
<tr>
<th>Name</th>
<th>Age</th>
<th>City</th>
</tr>
</thead>
<tbody>
<tr class="row">
<td class="name">Alice</td>
<td class="age">25</td>
<td class="city">New York</td>
</tr>
<tr class="row">
<td class="name">Bob</td>
<td class="age">30</td>
<td class="city">London</td>
</tr>
</tbody>
</table>
"""
soup = Soup(html)
# Extract table data
table = soup.find('table', {'class': 'data-table'})
print(f"Table found: {table is not None}")
# Extract the first row's cells; both rows have these classes,
# so use mode='first' to avoid getting lists back
first_name = soup.find('td', {'class': 'name'}, mode='first')
first_age = soup.find('td', {'class': 'age'}, mode='first')
first_city = soup.find('td', {'class': 'city'}, mode='first')
print(f"First row: {first_name.text}, {first_age.text}, {first_city.text}")
Let’s parse a real website structure:
from gazpacho import get, Soup
def parse_example_com():
    """Parse the example.com homepage."""
    url = "https://example.com"
    html = get(url)
    soup = Soup(html)
    # Extract key elements; mode='first' guards against getting a
    # list back when a tag appears more than once
    title = soup.find('title')
    h1 = soup.find('h1')
    first_paragraph = soup.find('p', mode='first')
    first_link = soup.find('a', mode='first')
    print("Parsing example.com:")
    print(f"Page title: {title.text if title else 'Not found'}")
    print(f"Main heading: {h1.text if h1 else 'Not found'}")
    print(f"First paragraph: {first_paragraph.text if first_paragraph else 'Not found'}")
    if first_link:
        print(f"First link text: {first_link.text}")
        print(f"First link href: {first_link.attrs.get('href', 'No href')}")
# Run the parser
parse_example_com()
Real-World Example: PyPI Project Parsing
Let’s implement the calmcode tutorial example - parsing PyPI project pages:
from gazpacho import get, Soup
def parse_pypi_project(package_name):
    """Parse PyPI project page for package information."""
    url = f"https://pypi.org/project/{package_name}/"
    try:
        # Fetch and parse HTML
        html = get(url)
        soup = Soup(html)
        # Extract project name
        title_element = soup.find('h1', {'class': 'package-header__name'}, mode='first')
        project_name = title_element.text.strip() if title_element else "Unknown"
        # Extract description
        desc_element = soup.find('p', {'class': 'package-description__summary'}, mode='first')
        description = desc_element.text.strip() if desc_element else "No description"
        # Note: the version number is part of the package-header title
        # text on PyPI rather than a dedicated element
        # Extract installation command
        install_element = soup.find('span', {'id': 'pip-command'}, mode='first')
        install_cmd = install_element.text.strip() if install_element else f"pip install {package_name}"
        return {
            'name': project_name,
            'description': description,
            'install_command': install_cmd,
            'url': url,
            'success': True
        }
    except Exception as e:
        return {
            'name': package_name,
            'error': str(e),
            'url': url,
            'success': False
        }
# Test with popular packages
packages = ['requests', 'beautifulsoup4', 'pandas']
for package in packages:
    result = parse_pypi_project(package)
    if result['success']:
        print(f"\n--- {result['name']} ---")
        print(f"Description: {result['description']}")
        print(f"Install: {result['install_command']}")
    else:
        print(f"\nError parsing {package}: {result['error']}")
Advanced Parsing Techniques
Custom Element Selection
Create functions for common parsing patterns:
from gazpacho import Soup
def find_by_text(soup, tag, text_content):
    """Find the first element with this tag whose text contains text_content."""
    # gazpacho has no built-in text search, so fetch every match
    # with mode='all' and check each element's text manually
    for element in soup.find(tag, mode='all') or []:
        if text_content in element.text:
            return element
    return None
def extract_links(soup):
    """Extract all links with their text and href."""
    links = []
    for link in soup.find('a', mode='all') or []:
        links.append({
            'text': link.text,
            'href': link.attrs.get('href', ''),
            'title': link.attrs.get('title', '')
        })
    return links
# Example usage
html = """
<div>
<a href="https://example.com" title="Example Site">Visit Example</a>
<a href="mailto:contact@example.com">Contact Us</a>
</div>
"""
soup = Soup(html)
links = extract_links(soup)
print("Extracted links:")
for link in links:
    print(f" {link['text']}: {link['href']}")
Error-Safe Parsing
Handle missing elements gracefully:
from gazpacho import Soup
def safe_extract(element, attribute=None):
    """Safely extract text or an attribute value from an element."""
    if element is None:
        return None
    if attribute:
        return element.attrs.get(attribute, None)
    return element.text.strip() if element.text else None
def parse_article(html):
    """Parse an article with error handling."""
    soup = Soup(html)
    # Safely extract elements
    title_elem = soup.find('h1', {'class': 'article-title'})
    author_elem = soup.find('span', {'class': 'author'})
    date_elem = soup.find('time')
    content_elem = soup.find('div', {'class': 'content'})
    return {
        'title': safe_extract(title_elem),
        'author': safe_extract(author_elem),
        'date': safe_extract(date_elem, 'datetime'),
        'content': safe_extract(content_elem)
    }
# Test with incomplete HTML
incomplete_html = """
<article>
<h1 class="article-title">Sample Article</h1>
<div class="content">Article content here.</div>
</article>
"""
result = parse_article(incomplete_html)
print("Parsed article:")
for key, value in result.items():
    print(f" {key}: {value if value else 'Not found'}")
Create a parser for extracting structured data from news articles:
from gazpacho import Soup
class NewsArticleParser:
    def __init__(self):
        self.parsed_articles = []
    def parse_article(self, html):
        """Parse a news article from HTML."""
        soup = Soup(html)
        # Try multiple selectors for common news article structures
        title_selectors = [
            ('h1', {'class': 'headline'}),
            ('h1', {'class': 'title'}),
            ('h1', {}),
            ('title', {})
        ]
        author_selectors = [
            ('span', {'class': 'author'}),
            ('div', {'class': 'byline'}),
            ('p', {'class': 'author'})
        ]
        date_selectors = [
            ('time', {}),
            ('span', {'class': 'date'}),
            ('div', {'class': 'publish-date'})
        ]
        title = self._find_with_selectors(soup, title_selectors)
        author = self._find_with_selectors(soup, author_selectors)
        date = self._find_with_selectors(soup, date_selectors)
        content_paragraphs = self._extract_content(soup)
        return {
            'title': title,
            'author': author,
            'date': date,
            'content': content_paragraphs,
            'word_count': len(' '.join(content_paragraphs).split()) if content_paragraphs else 0
        }
    def _find_with_selectors(self, soup, selectors):
        """Try multiple selectors to find an element."""
        for tag, attrs in selectors:
            element = soup.find(tag, attrs, mode='first')
            if element and element.text.strip():
                return element.text.strip()
        return None
    def _extract_content(self, soup):
        """Extract article content paragraphs."""
        # Try to find a content container
        content_selectors = [
            ('div', {'class': 'article-content'}),
            ('div', {'class': 'content'}),
            ('article', {}),
            ('main', {})
        ]
        # If no container is found, search the whole document
        content_container = soup
        for tag, attrs in content_selectors:
            container = soup.find(tag, attrs, mode='first')
            if container:
                content_container = container
                break
        # Collect every paragraph in the container
        paragraphs = content_container.find('p', mode='all') or []
        return [p.text.strip() for p in paragraphs]
# Test the parser
test_html = """
<article>
<h1 class="headline">Breaking News: Important Discovery</h1>
<div class="byline">
<span class="author">Jane Reporter</span>
<time datetime="2023-01-15T10:30:00">January 15, 2023</time>
</div>
<div class="article-content">
<p>This is the first paragraph of the news article with important information.</p>
<p>This is the second paragraph with more details about the story.</p>
</div>
</article>
"""
parser = NewsArticleParser()
article_data = parser.parse_article(test_html)
print("Parsed News Article:")
print(f"Title: {article_data['title']}")
print(f"Author: {article_data['author']}")
print(f"Date: {article_data['date']}")
print(f"Content: {len(article_data['content'])} paragraphs")
print(f"Word count: {article_data['word_count']}")
Common Parsing Patterns
Extracting Lists
Handle lists and repeated elements:
from gazpacho import Soup
html = """
<ul class="menu">
<li><a href="/home">Home</a></li>
<li><a href="/about">About</a></li>
<li><a href="/contact">Contact</a></li>
</ul>
"""
soup = Soup(html)
# Extract every menu link with mode='all'
menu = soup.find('ul', {'class': 'menu'})
if menu:
    for link in menu.find('a', mode='all') or []:
        print(f"Menu item: {link.text} -> {link.attrs['href']}")
Extracting Tables
Parse tabular data:
from gazpacho import Soup
html = """
<table class="scores">
<tr>
<th>Player</th>
<th>Score</th>
</tr>
<tr>
<td>Alice</td>
<td>95</td>
</tr>
<tr>
<td>Bob</td>
<td>87</td>
</tr>
</table>
"""
soup = Soup(html)
# Extract all data cells from the table
table = soup.find('table', {'class': 'scores'})
if table:
    cells = table.find('td', mode='all') or []
    print(f"Data cells: {[cell.text for cell in cells]}")
Best Practices for HTML Parsing
Robust Element Selection
- Always check if elements exist before accessing properties
- Use multiple fallback selectors for important data
- Handle missing attributes gracefully
Error Handling
- Wrap parsing operations in try/except blocks
- Provide default values for missing data
- Log parsing errors for debugging
Performance Considerations
- Parse HTML once and reuse the Soup object
- Extract only the data you need
- Consider using more specific selectors to reduce search time
- Create Soup objects from HTML using Soup(html)
- Use .find() to locate elements by tag, class, and attributes
- Extract text with .text and attributes with .attrs
- Handle missing elements gracefully with error checking
- Combine multiple selectors for robust data extraction
- Practice with real websites to understand HTML structures
- Build reusable parsing functions for common patterns
← Previous: Getting Data | Next: Strict Mode and Attributes →