Challenge: The client wanted easy access to a complete product listing from specific categories, with all product specifications and prices presented together. The client previously had a data team that manually gathered data from various web sources, but the results were limited and the effort was high. Even with that manual effort, structuring the data for import into their database was a challenge. The client therefore needed clean data that could be uploaded into their DB to run their comparison engine and perform other monitoring activities. The client provided us with the list of sources to be crawled and the data points required.
Solution: Once we were provided with the list of source websites and data points, our team started working on the project. Because the data fed a live comparison engine, the crawl frequency had to be high: fresh data sets had to be delivered every day. Since each site in the list had a different structure and design, site-specific crawl and extraction setups were the solution used for this case. Once our team finished setting up the web crawlers, the data started flowing in. The data was then cleaned, formatted as XML, and uploaded to the client's Dropbox. More than 300,000 records were delivered per day.
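The "clean, format as XML, deliver" step can be sketched as follows. This is a minimal illustrative example, not the client's actual pipeline: the `record_to_xml` helper, the field names, and the sample record are all hypothetical, and a real deployment would batch records into daily files before upload.

```python
# Minimal sketch: serialize one cleaned, site-extracted product record
# as an XML element ready for the daily delivery file.
# Field names and values below are illustrative assumptions.
import xml.etree.ElementTree as ET

def record_to_xml(record: dict) -> str:
    """Serialize one cleaned product record as an XML <product> element."""
    product = ET.Element("product")
    for field, value in record.items():
        child = ET.SubElement(product, field)
        child.text = str(value)
    return ET.tostring(product, encoding="unicode")

# One hypothetical cleaned record on its way into the daily XML batch.
sample = {"name": "Example Widget", "category": "widgets", "price": "19.99"}
print(record_to_xml(sample))
```

In practice each source site would have its own extraction rules producing dicts in this common schema, so the serialization and delivery code stays the same regardless of which site a record came from.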