Data Scraping Requirements
1. Data Sources
Data can be obtained from the following platforms:
- Desktop web version (such as websites accessed through browsers)
- Mobile web version (such as pages accessed through mobile browsers)
- Android applications (Apps)
- iOS applications (Apps)
Most websites today are available on desktop web, mobile web, and Apps at the same time, and the data content is generally consistent across them. The difficulty of scraping, however, varies:
- Desktop web and mobile web: Simplest to scrape, lowest cost.
- Android App: Medium difficulty, more comprehensive data.
- iOS App: Highest difficulty, suitable for specific needs (such as geographic location data).
Recommendation: Unless there are special requirements (such as restaurant coordinates from food delivery platforms), we usually prioritize scraping from desktop web for higher efficiency.
2. Data to Be Scraped
Clearly defining the data you need is very important, because more data generally means more scraping time and higher cost. For example, a product page on an e-commerce website may contain price, reviews, store information, and so on, but these come from different sections of the site and may each need a different scraping method.
Taking the JD desktop web version as an example, common product data includes:
Figure: JD product page showing price and reviews
- Product Link: Such as https://item.jd.com/100162191634.html
- Product ID: Such as 100162191634
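- Category: Such as "运动户外 > 运动鞋 > 阿迪达斯 GW3774" (Sports & Outdoors > Sports Shoes > Adidas GW3774)
- Store Name: Such as "Adidas 京东自营旗舰店" (Adidas JD self-operated flagship store)
- Main Image Link: URL of the first product image
- Review Count: Such as "5万+" (50,000+)
- Positive Rating: Such as "97% 买家好评" (97% positive buyer reviews)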
- Product Title: Such as "阿迪达斯 Yeezy350 暴龙兽椰子 42.5"
- Original Price: Such as 835.36 元
- Current Price: Such as 708.93 元
- Color: Such as GW3774
- Size: Such as 42.5
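As an illustration of how the product fields above might be normalized once scraped, here is a minimal Python sketch. The `ProductRecord` class and both helper functions are hypothetical names introduced for this example, and the URL pattern is assumed from the single sample link shown above rather than from any documented JD format.

```python
import re
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class ProductRecord:
    """Hypothetical container for a few of the product fields listed above."""
    product_id: str
    title: str
    original_price: Decimal
    current_price: Decimal


def product_id_from_link(link: str) -> str:
    """Pull the numeric ID out of a link shaped like https://item.jd.com/<id>.html."""
    match = re.search(r"/(\d+)\.html", link)
    if match is None:
        raise ValueError(f"no product ID found in {link!r}")
    return match.group(1)


def parse_price(text: str) -> Decimal:
    """Turn a display price such as '835.36 元' into a Decimal."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    return Decimal(match.group(0))


record = ProductRecord(
    product_id=product_id_from_link("https://item.jd.com/100162191634.html"),
    title="阿迪达斯 Yeezy350 暴龙兽椰子 42.5",
    original_price=parse_price("835.36 元"),
    current_price=parse_price("708.93 元"),
)
print(record.product_id)                              # 100162191634
print(record.original_price - record.current_price)   # 126.43
```

Prices are kept as `Decimal` rather than `float` so that later arithmetic on them (such as the discount above) does not pick up rounding noise.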
Review Data (requires separate scraping):
Figure: JD review page showing user comments
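- Review Tags: Such as "穿起来超舒服 320" (super comfortable to wear, 320) "尺码很准确 24" (sizing is very accurate, 24)
- Reviewer: Such as "依***q"
- Review Content: Such as "这双 Yeezy 350 真的太戳我了..." (These Yeezy 350s really won me over...)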
- Review Time: Such as 2025-08-01
- Rating: Such as 5 stars
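The review fields above could be turned into structured values along the following lines. This is only a sketch: the `Review` class and the helper names are hypothetical, and it assumes a tag is always a label followed by a space and a count, and that the date arrives in ISO format, as in the examples.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple


def parse_review_tag(text: str) -> Tuple[str, int]:
    """Split a tag such as '穿起来超舒服 320' into (label, count)."""
    label, _, count = text.strip().rpartition(" ")
    if not label or not count.isdecimal():
        raise ValueError(f"unexpected tag format: {text!r}")
    return label, int(count)


def parse_rating(text: str) -> int:
    """Read the leading number from a rating such as '5 stars'."""
    return int(text.split()[0])


@dataclass
class Review:
    """Hypothetical container for one scraped review."""
    reviewer: str
    content: str
    reviewed_on: date
    rating: int


review = Review(
    reviewer="依***q",
    content="这双 Yeezy 350 真的太戳我了...",
    reviewed_on=date.fromisoformat("2025-08-01"),
    rating=parse_rating("5 stars"),
)
print(parse_review_tag("穿起来超舒服 320"))  # ('穿起来超舒服', 320)
print(review.reviewed_on.year, review.rating)  # 2025 5
```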
Store Data (requires separate scraping):
Figure: JD store page showing store information
- Store Name: Such as "Adidas 京东自营旗舰店" (Adidas JD self-operated flagship store)
- Store Review Count: Such as "5万+" (50,000+)
- Store Followers: Such as "1011.2万" (10.112 million)
- Product Details: Such as 品牌、货号、功能 (brand, item number, function)
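Counts such as "5万+" and "1011.2万" are display strings, not numbers, so they usually need converting before analysis (万 means 10,000). The sketch below shows one possible conversion; `parse_count` is a hypothetical helper, and it assumes only the two shapes seen in the examples: an optional 万 unit and an optional trailing "+".

```python
import re
from decimal import Decimal
from typing import Tuple


def parse_count(text: str) -> Tuple[int, bool]:
    """Convert a display count into (value, is_lower_bound).

    is_lower_bound is True when the text ends in '+', meaning the site
    only reports "at least this many".
    """
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(万?)(\+?)", text.strip())
    if match is None:
        raise ValueError(f"unexpected count format: {text!r}")
    number, unit, plus = match.groups()
    value = Decimal(number) * (10000 if unit else 1)
    return int(value), plus == "+"


print(parse_count("5万+"))      # (50000, True)
print(parse_count("1011.2万"))  # (10112000, False)
```

Keeping the lower-bound flag alongside the number avoids treating "5万+" as an exact figure in later analysis.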
JD iOS App Example:
Figure: JD iOS App product page
Figure: JD iOS App review page
Figure: JD iOS App store page
Figure: JD iOS App product details
Web and App data are largely the same, but App data is more comprehensive. In particular, geographic coordinates for map or food delivery services can only be scraped from Apps.
3. Data Specifications
After determining the data to be scraped, we recommend listing the data fields and example values in an Excel spreadsheet so that both parties can confirm the requirements easily. You can prepare the spreadsheet yourself and send it to us, or we can draft it and confirm it with you. Download the Data Specification Example to view the template.
Recommendation: Before scraping starts, make sure the spreadsheet includes all fields (such as product title, price, reviews) with clearly defined example data, to avoid later modifications.
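To make the idea of a field specification concrete, the following Python sketch keeps a small field list with example values, writes it out as CSV text (which Excel can open), and checks a scraped record against it. The `SPEC` layout, the two column names, and both function names are assumptions made for this example, not the format of the downloadable template.

```python
import csv
import io
from typing import Dict, List

# Hypothetical specification: one row per required field, with an example value.
SPEC: List[Dict[str, str]] = [
    {"field": "Product Title", "example": "阿迪达斯 Yeezy350 暴龙兽椰子 42.5"},
    {"field": "Current Price", "example": "708.93 元"},
    {"field": "Review Count", "example": "5万+"},
]


def spec_to_csv(spec: List[Dict[str, str]]) -> str:
    """Render the specification as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["field", "example"])
    writer.writeheader()
    writer.writerows(spec)
    return buffer.getvalue()


def missing_fields(record: Dict[str, str], spec: List[Dict[str, str]]) -> List[str]:
    """List the specified fields that are absent or empty in a scraped record."""
    return [row["field"] for row in spec if not record.get(row["field"])]


scraped = {"Product Title": "阿迪达斯 Yeezy350 暴龙兽椰子 42.5", "Current Price": "708.93 元"}
print(missing_fields(scraped, SPEC))  # ['Review Count']
```

A check like `missing_fields` can be run on a handful of sample records before full scraping begins, which is when a gap in the specification is cheapest to fix.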
4. Data Delivery Methods
After scraping, data can be delivered through various methods, depending on your technical capabilities and requirements:
- Excel/CSV: Suitable for users familiar with Excel, simple and easy to use.
- JSON: Suitable for users with basic programming skills, flexible and universal.
- Database (such as MySQL): Suitable for large data volumes and professional teams, requires programming skills.
- Backend Management System: Suitable for users without programming background who need visualization.
- Others: Such as file downloads or interface services (API).
For detailed explanations, please see Data Delivery Methods.
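To show how the same scraped records look in two of the delivery formats above, here is a minimal Python sketch that serializes one list of records as CSV and as JSON using only the standard library. The record keys and function names are illustrative assumptions, not a fixed delivery schema.

```python
import csv
import io
import json
from typing import Dict, List

# Illustrative records; the keys are assumptions for this example.
records: List[Dict[str, str]] = [
    {
        "product_id": "100162191634",
        "store_name": "Adidas 京东自营旗舰店",
        "current_price": "708.93",
    },
]


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize records as CSV text, using the first record's keys as columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows: List[Dict[str, str]]) -> str:
    """Serialize records as JSON, keeping Chinese text readable instead of escaped."""
    return json.dumps(rows, ensure_ascii=False, indent=2)


print(to_csv(records))
print(to_json(records))
```

When the CSV text is saved to a file intended for Excel, the `utf-8-sig` encoding is commonly used so that Chinese characters display correctly.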
5. Data Collection Frequency
According to project requirements, data can be scraped at the following frequencies:
- Daily: Suitable for time-sensitive scenarios, such as price monitoring.
- Weekly: Suitable for regular analysis, such as market trends.
- Monthly: Suitable for long-term data collection, such as industry reports.
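As a rough illustration of how these frequencies translate into a schedule, the sketch below computes the next scrape date from the previous one. The `next_run` function and the frequency labels are hypothetical, and the rule for monthly runs (same day next month, clamped to the month's last day) is one assumption among several reasonable choices.

```python
import calendar
from datetime import date, timedelta


def next_run(last_run: date, frequency: str) -> date:
    """Return the date of the next scrape for a 'daily', 'weekly', or 'monthly' job."""
    if frequency == "daily":
        return last_run + timedelta(days=1)
    if frequency == "weekly":
        return last_run + timedelta(weeks=1)
    if frequency == "monthly":
        # Roll over to January of the next year after December.
        year = last_run.year + last_run.month // 12
        month = last_run.month % 12 + 1
        # Clamp the day so that, e.g., January 31 maps to the last day of February.
        day = min(last_run.day, calendar.monthrange(year, month)[1])
        return date(year, month, day)
    raise ValueError(f"unknown frequency: {frequency!r}")


print(next_run(date(2025, 8, 1), "daily"))     # 2025-08-02
print(next_run(date(2025, 8, 1), "weekly"))    # 2025-08-08
print(next_run(date(2025, 1, 31), "monthly"))  # 2025-02-28
```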
Summary and Recommendations
Clearly defining data scraping requirements is the key to successful collaboration. Here are some recommendations:
- Choose data sources: Prioritize desktop web for simplicity and efficiency; choose Apps when special data is needed (such as coordinates).
- Define data fields clearly: Use Excel to list required data to avoid omissions or duplicate work.
- Choose delivery method: Select Excel, JSON, database, or backend system based on technical capabilities.
- Determine frequency: Choose daily, weekly, or monthly scraping based on requirements.