# Firecrawl
Turn any website into LLM-ready structured data. A powerful web scraping, crawling, search and data extraction platform.
## Features
- **Single Page Scraping**: Convert any URL to Markdown, HTML, screenshots, or structured JSON
- **Multi-Page Crawling**: Recursively scrape entire websites with intelligent link filtering
- **URL Discovery**: Discover all URLs on a website instantly via sitemaps, index queries, or search
- **Web Search**: Search the web and get full page content from results in a single call
- **AI Extraction**: LLM-powered structured data extraction with schema validation
- **Autonomous Agent**: AI research agent that automatically navigates and extracts data
- **Remote Browser**: Remote browser sessions with CDP access and code execution
- **Batch Operations**: Asynchronous bulk scraping of multiple URLs
- **Self-Hosted**: Fully open source, supports local deployment with complete data control
## Usage
### Default Port
- API Service: 3002
- Queue Admin UI: http://your-ip:3002/admin/YOUR_BULL_AUTH_KEY/queues
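
As a small illustration, the queue admin URL can be assembled from the host and the `BULL_AUTH_KEY` value. The function name `queue_admin_url` and the example host are illustrative assumptions, not part of Firecrawl or 1Panel.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the queue admin UI URL from a host and the
# BULL_AUTH_KEY value. The port defaults to 3002, the API port listed above.
queue_admin_url() {
  local host="$1" key="$2" port="${3:-3002}"
  printf 'http://%s:%s/admin/%s/queues\n' "$host" "$port" "$key"
}
```

For example, `queue_admin_url 192.0.2.10 "$BULL_AUTH_KEY"` prints the address to open in a browser. Pass a third argument if the port was changed during installation.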
### API Access
After deployment, access the API at `http://your-ip:3002`.
Test the crawl endpoint:
```bash
curl -X POST http://localhost:3002/v1/crawl \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://firecrawl.dev"
  }'
```
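
A crawl runs as a background job, so the request above is expected to return a job `id` rather than the crawled pages. The sketch below wraps the submit and status calls in two shell functions. It assumes the upstream v1 behaviour, where status is read from `GET /v1/crawl/<id>` and `limit` caps the number of pages. The names `FIRECRAWL_URL`, `crawl_payload`, `start_crawl`, and `crawl_status` are illustrative, and the payload builder does not JSON-escape its input.

```bash
#!/usr/bin/env bash
# Sketch only: the variable and function names here are illustrative.
FIRECRAWL_URL="${FIRECRAWL_URL:-http://localhost:3002}"

# Build the JSON body for a crawl request; "limit" caps the number of pages.
# Assumes the URL contains no characters that need JSON escaping.
crawl_payload() {
  local url="$1" limit="${2:-10}"
  printf '{"url": "%s", "limit": %d}' "$url" "$limit"
}

# Submit the crawl job; the response is expected to contain a job "id".
start_crawl() {
  curl -sS -X POST "$FIRECRAWL_URL/v1/crawl" \
    -H 'Content-Type: application/json' \
    -d "$(crawl_payload "$@")"
}

# Fetch the current status (and any results so far) for a job id.
crawl_status() {
  curl -sS "$FIRECRAWL_URL/v1/crawl/$1"
}
```

Running `start_crawl https://firecrawl.dev 5` would submit the job. Calling `crawl_status` with the `id` from the response can then be repeated until the job reports completion.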
### Data Directories
Application data is stored in the following directories:
- `./data/api` - API service data
- `./data/postgres` - PostgreSQL database data
- `./data/redis` - Redis cache data
- `./data/playwright` - Playwright browser cache
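
Because all persistent state lives under `./data`, a backup can be as simple as archiving that directory. The following is a minimal sketch, not something shipped with the app: `backup_data` and the archive naming are hypothetical. Stopping the containers first is advisable so that the PostgreSQL files are copied in a consistent state.

```bash
#!/usr/bin/env bash
# Sketch only: backup_data is a hypothetical helper.
# Archives a data directory (default ./data) into a timestamped tarball
# and prints the path of the archive it created.
backup_data() {
  local src="${1:-./data}" dest_dir="${2:-.}"
  local archive
  archive="$dest_dir/firecrawl-data-$(date +%Y%m%d-%H%M%S).tar.gz"
  # -C keeps paths in the archive relative to the parent of the data directory.
  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" \
    && printf '%s\n' "$archive"
}
```

Called with no arguments from the application directory, it writes the archive next to `./data`. A different destination can be given as the second argument, for example `backup_data ./data /mnt/backups`.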
### Environment Variables
- `POSTGRES_USER` / `POSTGRES_PASSWORD`: PostgreSQL database credentials
- `BULL_AUTH_KEY`: Access key for the queue admin UI
- `OPENAI_API_KEY`: OpenAI API key for AI-powered features (optional)
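
Since `BULL_AUTH_KEY` and `POSTGRES_PASSWORD` act as secrets, random values are a reasonable choice. One possible way to generate them is sketched below. `random_hex` is a hypothetical helper, and where the values are entered (for example the 1Panel install form or an environment file) depends on the deployment.

```bash
#!/usr/bin/env bash
# Sketch only: random_hex is a hypothetical helper for generating secrets.
# Reads N random bytes from /dev/urandom and prints them as 2N hex characters.
random_hex() {
  local bytes="${1:-16}"
  od -An -tx1 -N"$bytes" /dev/urandom | tr -d ' \n'
  printf '\n'
}
```

For instance, `BULL_AUTH_KEY="$(random_hex 16)"` yields a 32-character key made only of URL-safe hex digits, which suits its use in the queue admin path.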
### Architecture
The self-hosted version includes the following service components:
- **API Service**: Main API server handling all requests (4 CPU cores, 8GB RAM limit)
- **Playwright Service**: Browser automation service (2 CPU cores, 4GB RAM limit)
- **Redis**: Job queue and cache backend
- **RabbitMQ**: NuQ message broker
- **PostgreSQL**: Job state management database
## Links
- Website: https://www.firecrawl.dev
- GitHub: https://github.com/firecrawl/firecrawl
- Documentation: https://docs.firecrawl.dev
- Discord: https://discord.gg/firecrawl