Thursday 22 September 2016

Things to take care of while doing Web Scraping

In the present day and age, the term web scraping has become very popular in data science. Basically, web scraping is the extraction of information from websites using pre-written programs and scripts. Many organizations have successfully used web scraping to build relevant and useful databases that they use daily to further their business interests. This is the age of big data, and web scraping is one of the trending techniques in data science.

Throughout my journey of learning web scraping and implementing many successful scraping projects, I have picked up some lessons worth sharing. In this post, I’m going to discuss some of the approaches to take and approaches to avoid while web scraping.

Use Proxies: Anonymously scraping data from websites

You should not scrape a website from a single IP address. When you repeatedly request pages, the remote web server may block your IP address and prevent further requests. To overcome this, scrape websites through proxy servers (anonymous scraping), which hide your network details from the remote server and reduce the risk of being detected and blacklisted. You may also use a VPN instead of proxies to scrape websites anonymously.
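As a rough illustration of spreading requests over several proxies, here is a minimal sketch using only the Python standard library. The proxy addresses and the helper names (`next_proxy`, `build_opener_for`, `fetch`) are hypothetical; substitute proxies you are actually permitted to use.

```python
import itertools
import urllib.request

# Hypothetical proxy addresses, used here only for illustration.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

# Endless round-robin iterator over the proxy list.
_proxy_cycle = itertools.cycle(PROXIES)


def next_proxy():
    """Return the next proxy URL in round-robin order."""
    return next(_proxy_cycle)


def build_opener_for(proxy_url):
    """Build a urllib opener that sends HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, timeout=10):
    """Fetch url through the next proxy in the rotation and return the raw bytes."""
    opener = build_opener_for(next_proxy())
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Each call to `fetch` goes out through a different proxy, so no single address carries all the requests.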

Take maximum data and store it.

Do not follow the “process the web page as it comes from the remote server” approach. Instead, download all the content in one go, store it on disk, and do the processing afterwards. If your scraping algorithm breaks midway, you can then resume from the stored pages instead of starting to scrape again. Never download the same content more than once, as that just wastes bandwidth.
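One way to do this is a small on-disk cache keyed by URL. The following is a minimal sketch, not a complete crawler; `cache_path`, `get_page` and the `download` callable are hypothetical names, and `download` stands for whatever function actually fetches a page in your project.

```python
import hashlib
from pathlib import Path


def cache_path(url, cache_dir):
    """Map a URL to a stable file name inside cache_dir."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / (digest + ".html")


def get_page(url, cache_dir, download):
    """Return the page bytes, calling download(url) only if no stored copy exists."""
    path = cache_path(url, cache_dir)
    if path.exists():
        # Already on disk: reuse it instead of requesting the page again.
        return path.read_bytes()
    content = download(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return content
```

If parsing crashes halfway through, rerunning the script reads the stored pages from disk and requests only the pages that are still missing.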

Follow strict rules in parsing:

Apply validation rules while parsing information from a website. For example, if you expect a value to be a date, check that it really is a date. This can greatly improve the quality of the extracted information. When you get unexpected data, change the algorithm accordingly.
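The date check mentioned above could look like the sketch below. The function name `parse_date` and the default ISO format are assumptions made for illustration; use whatever format the target site really uses.

```python
from datetime import datetime


def parse_date(value, fmt="%Y-%m-%d"):
    """Return a date if value matches fmt exactly, otherwise None."""
    if not isinstance(value, str):
        return None
    try:
        return datetime.strptime(value.strip(), fmt).date()
    except ValueError:
        # Not a valid date in the expected format: report it instead of storing it.
        return None
```

A `None` result is a signal that the page did not contain what the scraper expected, which is usually worth logging and investigating.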

Respect Robots.txt

Robots.txt specifies the rules that web crawlers and robots should follow. I strongly advise you to adjust your crawler to fully respect it. Robots.txt states which pages you are allowed to crawl, which user agents each rule applies to and, on some sites, the required interval between page requests. Following these instructions minimizes the chance of getting blacklisted or banned by the website owner.
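Python's standard library can evaluate these rules for you. The sketch below parses an inline robots.txt so that it runs without network access; the file contents, the `my-crawler` user agent and the example.com URLs are made up for illustration. Against a live site you would typically call `set_url()` and `read()` on the parser instead.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt used only for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a given user agent may fetch a given URL.
allowed = parser.can_fetch("my-crawler", "http://example.com/products/")
blocked = parser.can_fetch("my-crawler", "http://example.com/private/data")

# Seconds to wait between requests, or None if the file does not say.
delay = parser.crawl_delay("my-crawler")
```

Here `allowed` is `True`, `blocked` is `False` and `delay` is `5`. Not every site sets `Crawl-delay`, so a crawler should still pick a polite default of its own.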

Use XPath Smartly

XPath lets you select elements of an HTML document more flexibly than CSS selectors. Be careful, though: the HTML structure can change from page to page, so an XPath expression written for one page may fail to extract data on another.
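One defensive pattern is to keep a list of candidate expressions and use the first one that matches. The sketch below uses `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup, so real-world HTML usually calls for a dedicated HTML parser. The expressions, the sample pages and the `first_text` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET


def first_text(root, xpaths):
    """Return the stripped text of the first element matched by any expression, or None."""
    for xpath in xpaths:
        node = root.find(xpath)
        if node is not None and node.text:
            return node.text.strip()
    return None


# Candidate expressions for the same field, tried in order.
PRICE_XPATHS = [
    ".//span[@class='price']",
    ".//div[@id='price']/b",
]

# Two pages that present the same field with different markup.
page_a = ET.fromstring("<html><body><span class='price'> 19.99 </span></body></html>")
page_b = ET.fromstring("<html><body><div id='price'><b>24.50</b></div></body></html>")

price_a = first_text(page_a, PRICE_XPATHS)
price_b = first_text(page_b, PRICE_XPATHS)
```

When no expression matches, `first_text` returns `None`, which tells you the page layout has changed and the expressions need updating.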

Obey Website T&C:

Some websites make it absolutely clear in their terms and conditions that they are opposed to web scraping of their content. Scraping such sites can expose you to ethical and legal consequences.

Test a sample scrape and verify the data with the actual scrape

Once your web scraping project is set up, you need to test it for some time. Check the extracted data. If something is wrong, find the cause and make changes accordingly until the project works reliably.
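Part of this checking can be automated on a small sample before running the full scrape. The following is a minimal sketch that flags records with missing or empty fields; the field names and the `validate_records` helper are hypothetical.

```python
def is_blank(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def validate_records(records, required_fields):
    """Return a list of (index, missing_fields) for every record that fails the check."""
    problems = []
    for index, record in enumerate(records):
        missing = [field for field in required_fields if is_blank(record.get(field))]
        if missing:
            problems.append((index, missing))
    return problems
```

An empty result on the sample does not prove the scraper is correct, but a non-empty one points directly at the records to inspect by hand.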

Source: http://webdata-scraping.com/things-take-care-web-scraping/

Monday 12 September 2016

Calculate your ROI on Web Scraping using our ROI Calculator

Staying ahead of the competition is vital for the survival and growth of businesses these days. Ever since big data came into the picture, web scraping has become something businesses in every industry have to invest in. If your company is not in a technically advanced industry, web scraping can even be a nightmare to start with. Wondering if in-house web scraping is right for you? In-house or outsourced, in the end it’s all about the return on investment.

ROI Calculator

Considering the numerous factors that determine how much web scraping can cost you, it’s not easy to calculate the ROI on your in-house web scraping.

In-house web scraping is certainly a challenging process. If you plan on going down this route, here is a brief list of prerequisites.

Engineers

Technically skilled labour is an essential requirement for web scraping. Since web scraping techniques are complicated, good programming skills are needed to write, run and maintain the scraping bots. The cost of labour can be one of the drawbacks of doing web scraping in-house.

Hardware Resources

Web scraping is a resource-hungry process that requires high-end servers and lots of bandwidth. Without adequate resources, you might end up losing important data. The cost of quality servers could easily make you reconsider doing web scraping on your own, not to mention the need to double up on these resources to keep the data intact, especially if you’re operating at large scale.

Maintainability and upkeep of your tech stack

Once you have your servers and other technical components set up, the real work is only beginning. You have to ensure the availability of your servers and handle data backups, restoring previous states and failovers, among the many other complications of managing servers and fixing them when something goes wrong. You need to allocate resources (both people and hardware) to take care of all this.

Time

Time is something that we cannot really include in the equation when calculating the returns, but it definitely helps determine whether in-house web scraping is worth it. Although web scraping is the fastest way to acquire data, the initial setup and maintenance are time-consuming and complicated. This can easily lead to conflicts when you have to divide your time between web scraping and other business activities that are crucial for your company.
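The cost factors listed above can be combined into a rough return figure. Here is a minimal sketch, assuming the common definition ROI = (value gained - total cost) / total cost. The function names are hypothetical, and the formula is not taken from PromptCloud's calculator, which may work differently.

```python
def in_house_cost(engineers, hardware, maintenance):
    """Total cost of in-house scraping over a period, in one currency."""
    return engineers + hardware + maintenance


def roi(value_gained, total_cost):
    """Return on investment as a fraction: 0.5 means a 50% return."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (value_gained - total_cost) / total_cost
```

Running the same calculation once with your in-house costs and once with a service's subscription price gives two figures you can compare directly. As noted above, the time spent on setup and maintenance is not captured by this formula.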

Try the ROI Calculator

We came up with an ROI calculator to help you calculate your return on investment with our web scraping services. Using it, you can easily compare the cost of in-house web scraping with PromptCloud’s dedicated web scraping services. Find out how much you can save by going the PromptCloud way.

Source: https://www.promptcloud.com/blog/calculate-roi-on-web-scraping

Thursday 1 September 2016

Why is a Web scraping service better than Scraping tools

Web scraping has been making ripples across various industries in the last few years. Newer businesses can employ web scraping to gain quick market insights and equip themselves to take on their competitors. This works like clockwork if you know how to do the analysis right. Before we jump into that, there is the technical aspect of web scraping. Should your company use a scraping tool to get the required data from the web? Although this sounds like an easy solution, there is more to it than meets the eye. We explain why it’s better to go with a dedicated web scraping service to cover your data acquisition needs rather than going the scraping tool route.

Cost is lowered

Although this might come as a surprise, the cost of getting data by employing a data scraping tool, along with the IT personnel to run it, would exceed the cost of a good subscription-based web scraping service. Not every company has the resources needed to run web scraping in-house. By depending on a data service provider, you save the cost of the software, resources and labour required to run web crawling in the firm. Besides, you will end up with more time and fewer worries. More of your time and effort can therefore go into the analysis, which is crucial to you as a business owner.

Accessibility is high with a service

Multifaceted websites make it difficult for scraping tools to extract data. A good web scraping service, on the other hand, can easily deal with bottlenecks in the scraping process when they arise. Websites being scraped often undergo changes in their structure, which call for the crawler to be modified accordingly. Unlike a scraping tool, a dedicated service will be able to extract data from complex sites that use Ajax, JavaScript and the like. By going with a subscription-based service, you are doing yourself the favour of not being involved in this constant headache.

Accuracy in results

A DIY scraping tool might be able to get you data, but the accuracy and relevance of the acquired data will vary. You might be able to get it right with a particular website, but that might not be the case with another. This introduces uncertainty into the results of your data acquisition and could even be disastrous for your business. On the other hand, a good scraping service will give you highly refined data in a ready-to-consume form.

Outcomes are instant with a service

Considering the high resource requirements of the web scraping process, your scraping tool is likely to be much slower than a reputable service that has the right infrastructure and resources to scrape data from the web efficiently. It might not be feasible for your firm to acquire and manage the same setup, since that could affect the focus of your business.

Tidying up of Data is an exhausting process

Web scrapers collect data into a dump file, which can be huge. You will have to do a lot of tidying up to get the data into a usable format. With the scraping tools route, you would be looking for more tools to clean up the collected data. This is a waste of time and effort that you could put to much better use in your business. With a web scraping service, on the other hand, you won’t have to worry about cleaning up the data, as that comes with the service. You get the data in a plug-and-use format, which gives you more time to do better things.

Many sites have policies for data scraping

Sometimes, websites that you want to scrape data from might have policies discouraging the act. You wouldn’t want to violate their policies out of ignorance and get into legal trouble. With a web scraping service, you don’t have to worry about this. A well-established data scraping provider will follow the rules and policies set by the website. This means you can be relieved of such worries and go ahead with finding trends and ideas in the data that they provide.

More time to analyse the data

This is by far the biggest advantage of going with a scraping service rather than a tool. Since everything related to data acquisition is handled by the scraping service provider, you have more time for analysing the data and deriving useful business decisions from it. As the business owner, analysing the data with care should be your highest priority. Since using a scraping tool to acquire data will cost you more time and effort, the analysis is bound to suffer, which defeats your whole purpose.

Bottom line

It is up to you to choose between a web scraping tool and a dedicated scraping service. As the business owner, it is much better for you to stay away from the technical aspects of web scraping and focus on deriving a better business strategy from the data. Once you have made up your mind to go with a data scraping service, it is important to choose the right one for maximum benefit.

Source: https://www.promptcloud.com/blog/web-scraping-services-better-than-scraping-tools