An intelligent web scraping tool that extracts specific information from websites using natural language descriptions. Simply describe what data you want to extract, and the AI will find it for you.
- Natural Language Parsing: Describe what you want to extract in plain English
- Interactive Web Interface: Built with Streamlit for easy use
- AI-Powered Extraction: Uses Ollama with the Gemma3 model for intelligent content parsing
- Content Preview: View cleaned DOM content before parsing
- Batch Processing: Handles large websites by chunking content
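
The batch-processing feature can be illustrated with a minimal chunking sketch. The function name `split_dom_content` and the 6000-character default are assumptions made for illustration, not necessarily what `ai_scraper.py` uses; the idea is only that each piece stays small enough to fit in a single model prompt.

```python
from typing import List


def split_dom_content(dom_content: str, max_length: int = 6000) -> List[str]:
    """Split page text into consecutive chunks of at most max_length characters.

    Hypothetical helper: the name and the default size are illustrative.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    # Step through the text in fixed-size windows; the last chunk may be shorter.
    return [
        dom_content[start:start + max_length]
        for start in range(0, len(dom_content), max_length)
    ]
```

Fixed-size slicing is the simplest option but can cut a sentence or a value in half at a chunk boundary; splitting on line breaks or overlapping adjacent chunks are possible refinements.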
- Python 3.8+
- Chrome browser installed
- Ollama installed with the Gemma3 model (you can substitute a different model if you prefer)
- Clone the repository:
    git clone <repo_url>
    cd ai_web_scraper
- Install dependencies:
    pip install -r requirements.txt
- Install and run Ollama with Gemma3:
    # Install Ollama from https://ollama.ai
    ollama pull gemma3
- Start the application:
    streamlit run ai_scraper.py
- Enter a website URL
- Click "Scrape Website" to extract content
- Describe what information you want to parse (e.g., "all email addresses", "product prices", "contact information")
- Click "Parse Content" to get AI-extracted results
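
What happens behind the "Parse Content" step can be sketched roughly as follows: each chunk of page text is combined with your description into a prompt, sent to the model, and the non-empty answers are joined. The prompt wording, `parse_chunks`, and the `ask_model` callable are all assumptions for illustration; in the real application the model call would go through LangChain and Ollama rather than a plain function.

```python
from typing import Callable, Iterable, List

# Hypothetical prompt wording; the application's actual prompt may differ.
PROMPT_TEMPLATE = (
    "Extract only the information matching this description: {description}\n"
    "Return an empty string if nothing matches.\n\n"
    "Content:\n{chunk}"
)


def parse_chunks(
    chunks: Iterable[str],
    description: str,
    ask_model: Callable[[str], str],
) -> str:
    """Ask the model about each chunk and join the non-empty answers."""
    results: List[str] = []
    for chunk in chunks:
        prompt = PROMPT_TEMPLATE.format(description=description, chunk=chunk)
        answer = ask_model(prompt).strip()
        if answer:  # Skip chunks where the model found nothing.
            results.append(answer)
    return "\n".join(results)
```

Passing the model in as a callable keeps the loop easy to test with a stub function before wiring it to a real model.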
- Extract contact information from business websites
- Gather product details from e-commerce sites
- Collect news headlines and summaries
- Parse job listings for specific requirements
- Extract research paper abstracts
- Frontend: Streamlit
- Web Scraping: Selenium, BeautifulSoup
- AI Processing: LangChain + Ollama (Gemma3)
- Language: Python
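
To give a sense of what "cleaned DOM content" means, here is a minimal sketch that strips scripts, styles, and tags from HTML and keeps only the visible text. It deliberately uses the standard library's `html.parser` as a dependency-free stand-in; the project itself lists BeautifulSoup for this job, and the names `clean_html` and `_TextExtractor` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self) -> str:
        # Trim each line and drop the empty ones.
        lines = (line.strip() for line in "\n".join(self._parts).splitlines())
        return "\n".join(line for line in lines if line)


def clean_html(html: str) -> str:
    """Return the visible text of an HTML document, one fragment per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For example, `clean_html("<h1>Title</h1><script>var x = 1;</script><p> Hello </p>")` yields the two lines `Title` and `Hello`.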