# Firecrawl

Firecrawl turns any website into LLM-ready structured data. It is a platform for web scraping, crawling, search, and data extraction.

## Features

- **Single Page Scraping**: Convert any URL to Markdown, HTML, screenshots, or structured JSON
- **Multi-Page Crawling**: Recursively scrape entire websites with intelligent link filtering
- **URL Discovery**: Discover all URLs on a website instantly via sitemaps, index queries, or search
- **Web Search**: Search the web and get full page content from results in a single call
- **AI Extraction**: LLM-powered structured data extraction with schema validation
- **Autonomous Agent**: AI research agent that automatically navigates and extracts data
- **Remote Browser**: Remote browser sessions with CDP access and code execution
- **Batch Operations**: Asynchronous bulk scraping of multiple URLs
- **Self-Hosted**: Fully open source, supports local deployment with complete data control

## Usage

### Default Port

- API Service: 3002
- Queue Admin UI: `http://your-ip:3002/admin/YOUR_BULL_AUTH_KEY/queues` (replace `YOUR_BULL_AUTH_KEY` with the value of `BULL_AUTH_KEY`)
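
Because the admin URL embeds the auth key, it can be assembled from the host and the key. This is a minimal sketch: the helper name `queue_admin_url` is illustrative and not part of Firecrawl, and it assumes the default port 3002.

```bash
# Print the queue admin URL for a given host and auth key.
# Assumes the API service listens on the default port 3002.
queue_admin_url() {
  host="$1"
  key="$2"
  printf 'http://%s:3002/admin/%s/queues\n' "$host" "$key"
}

# Falls back to the placeholder when BULL_AUTH_KEY is not set in the shell.
queue_admin_url "localhost" "${BULL_AUTH_KEY:-YOUR_BULL_AUTH_KEY}"
```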

### API Access

After deployment, the API is available at `http://your-ip:3002`, where `your-ip` is the address of the host running the service.

Test the crawl endpoint from the host itself:
```bash
curl -X POST http://localhost:3002/v1/crawl \
    -H 'Content-Type: application/json' \
    -d '{
      "url": "https://firecrawl.dev"
    }'
```
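
A crawl runs asynchronously, so a natural follow-up is to check on the job after submitting it. The sketch below wraps the request above in small shell functions. The function names and the `FIRECRAWL_URL` variable are illustrative, and the `GET /v1/crawl/<id>` status path is an assumption based on the Firecrawl v1 API, so confirm it against the documentation for your version.

```bash
# Base URL of the self-hosted API; adjust host and port to your deployment.
FIRECRAWL_URL="${FIRECRAWL_URL:-http://localhost:3002}"

# Build the JSON body for a crawl request.
# The URL is inserted verbatim, so it must not contain double quotes.
crawl_payload() {
  printf '{"url": "%s"}' "$1"
}

# Submit a crawl job and print the raw JSON response.
start_crawl() {
  curl -sS -X POST "$FIRECRAWL_URL/v1/crawl" \
    -H 'Content-Type: application/json' \
    -d "$(crawl_payload "$1")"
}

# Fetch the state of a crawl job by id.
# Assumption: the status endpoint is GET /v1/crawl/<id>.
crawl_status() {
  curl -sS "$FIRECRAWL_URL/v1/crawl/$1"
}
```

For example, `start_crawl https://firecrawl.dev` submits the job, and `crawl_status <id>` can then be run with the job id taken from that response.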

### Data Directories

Application data is stored in the following directories, relative to the deployment directory:
- `./data/api` - API service data
- `./data/postgres` - PostgreSQL database data
- `./data/redis` - Redis cache data
- `./data/playwright` - Playwright browser cache
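
If you want this layout to exist before the first start, for example when restoring a backup, it can be created up front. This is a small sketch: `init_data_dirs` and `DATA_ROOT` are hypothetical names, not something Firecrawl provides.

```bash
# Create the four data directories under a base path (default: ./data).
init_data_dirs() {
  base="${1:-./data}"
  for name in api postgres redis playwright; do
    mkdir -p "$base/$name"
  done
}

# Uses DATA_ROOT when set, otherwise ./data in the current directory.
init_data_dirs "${DATA_ROOT:-./data}"
```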

### Environment Variables

- `POSTGRES_USER` / `POSTGRES_PASSWORD`: PostgreSQL database credentials
- `BULL_AUTH_KEY`: Access key for the queue admin UI
- `OPENAI_API_KEY`: OpenAI API key for AI-powered features (optional)
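
Before starting the services, it can help to confirm that the required variables are set in the current shell. The check below is only a sketch: `check_env` is a hypothetical helper, and it treats `OPENAI_API_KEY` as optional, as described above.

```bash
# Report which of the required variables are unset or empty.
# Returns 0 when all are present, 1 otherwise.
check_env() {
  missing=0
  for name in POSTGRES_USER POSTGRES_PASSWORD BULL_AUTH_KEY; do
    # Indirect lookup of the variable named by $name (POSIX sh compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_env` prints one line per missing variable to standard error and returns a non-zero status if any are missing.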

### Architecture

The self-hosted version includes the following service components:
- **API Service**: Main API server handling all requests (limited to 4 CPU cores and 8 GB RAM)
- **Playwright Service**: Browser automation service (limited to 2 CPU cores and 4 GB RAM)
- **Redis**: Job queue and cache backend
- **RabbitMQ**: NuQ message broker
- **PostgreSQL**: Job state management database

## Links

- Website: https://www.firecrawl.dev
- GitHub: https://github.com/firecrawl/firecrawl
- Documentation: https://docs.firecrawl.dev
- Discord: https://discord.gg/firecrawl