Friday 27 February 2015

An Easy Way For Data Extraction

There are many data scraping tools available on the internet. With these tools you can download large amounts of data without any stress. Over the past decade, the internet revolution has turned the entire world into an information center. You can obtain almost any type of information online. However, if you want specific information on one topic, you often need to search many websites, and if you want to keep all of that information, you have to copy it and paste it into your own documents by hand. That is tedious work for anyone. Scraping tools save you time and money and reduce the manual effort involved.

A web data extraction tool pulls data out of the HTML pages of different websites and compares it. New websites are hosted on the internet every day, and it is impossible to visit them all yourself. With a data mining tool you can cover far more web pages than you ever could manually, and if you work with a wide range of applications, these scraping tools become even more useful.

Data extraction software is used to compare structured data across the internet. Search engines can help you find websites on a particular topic, but the data on different sites appears in different styles. A scraping tool helps you compare the data from different sites and structure it into consistent records.

A web crawler tool indexes web pages and copies data from the internet to your hard disk, so you can browse it much faster, even offline. It is especially useful when downloading data during off-peak hours would otherwise take a long time; with a crawler you can download the data far more quickly. Another tool aimed at business users is the email extractor. With this tool you can easily collect the email addresses of targeted customers and send advertisements for your product to them at any time. It is one of the best ways to build a customer database.
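As a rough illustration of how such an email extractor works under the hood, the sketch below downloads a page and pulls out anything that looks like an email address. The URL is a placeholder and the regular expression is a simplification; a real tool would crawl many pages and handle far more edge cases.

```python
import re
import requests

# Hypothetical page to harvest addresses from (placeholder URL).
URL = "https://example.com/contact"

# A simple pattern that matches most common email addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(url):
    """Download a page and return the unique email addresses found in it."""
    html = requests.get(url, timeout=10).text
    return sorted(set(EMAIL_RE.findall(html)))

if __name__ == "__main__":
    for address in extract_emails(URL):
        print(address)
```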

There are many more scraping tools available on the internet, and several reputable websites provide information about them. You can download most of these tools by paying a nominal amount.

Source: http://ezinearticles.com/?An-Easy-Way-For-Data-Extraction&id=3517104

Wednesday 25 February 2015

Know the Different Types of Mining Processes

Mining has become a controversial industry because of its "devastating" effect on the environment and the ecosystem. However, it has contributed so much to civilization that, without it, we could never be where we are today in many respects.

There are two basic methods of mining. These are the surface and the underground mining processes:

1. Surface Mining

This involves the mining of minerals located at or near the surface of the earth. This encompasses at least six processes and these are:

• Strip Mining - this involves the stripping of the earth's surface by heavy machinery. This method is generally targeted at extracting coal or sedimentary rocks that lie near the earth's surface.

• Placer Mining - this involves the extraction of sediments in sand or gravel. It is a simple, old-fashioned way of mining. This method is generally applicable to gold and precious gems that are carried by the flow of water.

• Mountain Top Mining - this is a new method which involves blasting of a mountain top to expose coal deposits that lie underneath the mountain crest.

• Hydraulic Mining - this is an obsolete method that involves jetting the side of a mountain or hill with high pressure water to expose gold and other precious metals.

• Dredging - it involves the removal of rocks, sand and silt underneath a body of water to expose the minerals.

• Open Pit - this is the most common mining method. It involves the removal of the top layers of soil in search of gold or buried treasure. The miner digs deeper and deeper until a large open pit is created.

2. Underground Mining

This is the process in which a tunnel is made into the earth to find the mineral ore. The mining operation is usually performed with the use of underground mining equipment. Underground mining is done through the following methods:

• Slope Mining - it involves the creation of slopes into the ground in order to reach the ore or mineral deposit. This process is generally applied in coal mining.

• Hard rock - this method uses dynamite or giant drills to create large, deep tunnels. The miners support the tunnels with pillars to prevent them from collapsing. This is a large-scale mining process and is usually applied in the extraction of large copper, tin, lead, gold or silver deposits.

• Drift mining - this method is applicable only when the target mineral is accessible from the side of a mountain. It involves the creation of a tunnel that's slightly lower than the target mineral. Gravity makes the deposit fall into the tunnel, where miners can collect it.

• Shaft method - this involves the creation of a vertical passageway that goes deep down underground where the deposit is located. Because of the depth, miners are brought in and out of the pit with elevators.

• Borehole method - this involves the use of a large drill and high pressure water to eject the target mineral.

These are the basic methods used in the extraction of common minerals. There are more complex systems, but still, they are based on these fundamental processes.

Source: http://ezinearticles.com/?Know-the-Different-Types-of-Mining-Processes&id=7932442

Saturday 21 February 2015

Data Mining in the 21st Century: Business Intelligence Solutions Extract and Visualize

When you think of the term data mining, what comes to mind? If an image of a mine shaft and miners digging for diamonds or gold comes to mind, you're on the right track. Data mining involves digging for gems or nuggets of information buried deep within data. While the miners of yesteryear used manual labor, modern data miners use business intelligence solutions to extract and make sense of data.

As businesses have become more complex and more reliant on data, the sheer volume of data has exploded. The term "big data" is used to describe the massive amounts of data enterprises must dig through in order to find those golden nuggets. For example, imagine a large retailer with numerous sales promotions, inventory, point of sale systems, and a gift registry. Each of these systems contains useful data that could be mined to make smarter decisions. However, these systems may not be interlinked, making it more difficult to glean any meaningful insights.

Data warehousing tools extract information from various legacy systems, transform the data into a common format, and load it into a data warehouse. This process is known as ETL (Extract, Transform, and Load). Once the information is standardized and merged, it becomes possible to work with that data.
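To make the ETL idea concrete, here is a minimal sketch in Python. The file names, column names, and table layout are invented for illustration; a real pipeline would extract from each legacy system, apply far richer transformations, and load into a proper warehouse.

```python
import csv
import sqlite3

# Hypothetical sources and targets; a real ETL job points at actual legacy systems.
LEGACY_CSV = "point_of_sale_export.csv"
WAREHOUSE_DB = "warehouse.db"

def extract(path):
    """Extract: read raw rows from a legacy CSV export."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: normalize field names and types into a common format."""
    for row in rows:
        yield (row["store_id"].strip(),
               row["sold_on"],                  # assumes ISO dates in the export
               round(float(row["amount"]), 2))  # normalize the sale amount

def load(records, db_path):
    """Load: append the cleaned records into the warehouse table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (store_id TEXT, sold_on TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract(LEGACY_CSV)), WAREHOUSE_DB)
```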

Originally, all of this behind-the-scenes consolidation took place at predetermined intervals such as once a day, once a week, or even once a month. Intervals were often needed because the databases needed to be offline during these processes. A business running 24/7 simply couldn't afford the down time required to keep the data warehouse stocked with the freshest data. Depending on how often this process took place, the data could be old and no longer relevant. While this may have been fine in the 1980s or 1990s, it's not sufficient in today's fast-paced, interconnected world.

Real-time ETL has since been developed, allowing for continuous, non-invasive data warehousing. While most business intelligence solutions today are capable of mining, extracting, transforming, and loading data continuously without service disruptions, that's not the end of the story. In fact, data mining is just the beginning.

After mining data, what are you going to do with it? You need some form of enterprise reporting in order to make sense of the massive amounts of data coming in. In the past, enterprise reporting required extensive expertise to set up and maintain. Users were typically given a selection of pre-designed reports detailing various data points or functions. While some reports may have had some customization built in, such as user-defined date ranges, customization was limited. If a user needed a special report, it required getting someone from the IT department skilled in reporting to create or modify a report based on the user's needs. This could take weeks - and it often never happened due to the hassles and politics involved.

Fortunately, modern business intelligence solutions have taken enterprise reporting down to the user level. Intuitive controls and dashboards make creating a custom report a simple matter of drag and drop while data visualization tools make the data easy to comprehend. Best of all, these tools can be used on demand, allowing for true, real-time ad hoc enterprise reporting.

Source: http://ezinearticles.com/?Data-Mining-in-the-21st-Century:-Business-Intelligence-Solutions-Extract-and-Visualize&id=7504537

Thursday 19 February 2015

Internet Data Mining - How Does it Help Businesses?

The internet has become an indispensable medium for conducting different types of business and transactions. This has given rise to the use of different internet data mining tools and strategies, which help businesses better serve their main purpose online and increase their customer base manifold.

Internet data mining encompasses the processes of collecting and summarizing data from various websites and webpage contents, or of using different login procedures, in order to identify patterns. With the help of internet data mining it becomes much easier to spot a potential competitor, improve the customer support service on a website, and make it more customer-oriented.

There are three main types of internet data mining techniques: content, usage, and structure mining. Content mining focuses on the subject matter present on a website, including video, audio, images, and text. Usage mining focuses on the server access logs, which record which parts of the site users have accessed; this data helps in creating an effective and efficient website structure. Structure mining focuses on how websites are linked together and is effective in finding similarities between various websites.
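As a small illustration of usage mining, the sketch below tallies page hits from a web server access log. The log file name and format are assumed (a common Apache/Nginx-style log); real usage mining would also look at sessions, referrers, and navigation paths.

```python
import re
from collections import Counter

# A simplified pattern for lines in a common Apache/Nginx-style access log.
LOG_LINE = re.compile(r'"(?:GET|POST) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

def page_hits(log_path):
    """Count successful requests per page path from a server access log."""
    hits = Counter()
    with open(log_path) as f:
        for line in f:
            m = LOG_LINE.search(line)
            if m and m.group("status").startswith("2"):
                hits[m.group("path")] += 1
    return hits

if __name__ == "__main__":
    # "access.log" is a placeholder file name.
    for path, count in page_hits("access.log").most_common(10):
        print(f"{count:6d}  {path}")
```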

Also known as web data mining, these tools and techniques make it possible to predict the potential growth of a specific product in a selective market. Data gathering has never been so easy: with the help of data mining tools, screen scraping, web harvesting, and web crawling have become simple, and the requisite data can be put readily into a usable style and format. Gathering data from anywhere on the web has become as simple as saying 1-2-3. Internet data mining tools are therefore effective predictors of the future trends a business might take.

If you are interested in learning more about Web Data Mining and other details, you are welcome to visit the Screen Scraping Technology site.

Source: http://ezinearticles.com/?Internet-Data-Mining---How-Does-it-Help-Businesses?&id=3860679

Tuesday 17 February 2015

Coal Seam Gas - Extraction and Processing

With rapidly depleting natural resources, people around the globe are looking for new sources of energy. Many people don't think much of it, but this is an excellent ecological step forward and may even be a lucrative endeavour. Australia has one of the most significant deposits of a recently discovered gas known as coal seam gas. The deposit present in areas such as New South Wales is far more significant than the others since it contains little methane and much more carbon dioxide.

What is coal seam gas?

Coal bed methane is the more general term for this substance. It is a form of natural gas extracted from substantial coal beds. The presence of this material once spelled hazard for many sites; that changed in recent decades, when specialists discovered its potential as an energy source. It is now among the most important sources of energy in a number of countries, particularly in North America. Extraction within Australia is developing rapidly because of rich deposits in various parts of the country.

Extraction

The extraction procedure is reasonably challenging. It calls for heavy drilling, water pumping, and tubing. Though there are a variety of different processes, pipeline construction (an initial step) is perhaps the most important. How this foundational stage is handled can spell the difference between the failure and success of your undertaking.

Working with a Contractor

Pipeline construction and design is serious business. Seasoned contractors may be hard to get considering the fact that Australia's coal seam gas industry is still fairly young. You'll find only a limited number of completed and working projects across the country. There are several things to consider when getting a contractor for the project.

Find one with substantial experience in the industry. Some service providers have operations outside the country, especially in Canada and America. This is something to look for, as development of the gas originated there. Providers with completed projects in those regions are likely to have the solutions required for your project to take off.

The construction process involves several basic steps. It is important that the service provider you work with addresses all of your needs. Below are a few of the important supplementary services to look for.

- Pipeline design, production, and installation

- Custom ploughing (to achieve specialized trenching requirements)

- Protection and repair of pipelines with the use of various liners

- Pressure assessment and commissioning

These are only the fundamentals of pipeline construction. Sourcing coal seam gas involves many others. Do thorough research to ensure the service provider you employ is capable of completing all the necessary tasks. Other elements of the undertaking include engineering plus site preparation and rehabilitation. This industrial sector may be profitable if one makes all of the proper moves.

Avoid making uninformed decisions by doing as much research as you possibly can. Use the web to your advantage to look into a company's profile. Look for a portfolio of the projects they have completed in the past. You can gauge their trustworthiness based on their volume of clients. Check out the scope of their operations and the projects they finished.

You should also think about company policies concerning the quality of their work, safety and health, along with their policies concerning communities and the environment. These are seemingly minute but important details when searching for a contractor for pipeline construction projects.

Source: http://ezinearticles.com/?Coal-Seam-Gas---Extraction-and-Processing&id=6954936

Friday 13 February 2015

Why Common Measures Taken To Prevent Scraping Aren't Effective

Bots became more powerful in 2014. As the war continues, let’s take a closer look at why common strategies to prevent scraping didn’t pay off.

With the market for online businesses expanding rapidly, the development teams behind these online portals are under great amounts of pressure to keep up in the race. Scalability, availability and responsiveness are some of the commonly faced problems for a growing online business portal. As the value of content is increasing, content theft has become an increasing problem in the form of web scraping.

Competitors have learned to stay ahead of the race by using bots to scrape. While the ways these bots can be harmful are worth talking about, they are not the main scope of this article. This article discusses some of the commonly used weapons for fighting bots and examines how effective they are in reality.

We come across many developers who claim to have taken measures to prevent their sites from being scraped. A common belief is that the techniques listed below significantly reduce scraping activity on a website. While some of these methods could work in concept, we were interested to explore how effective they are in practice.

Most Commonly Used Techniques to Prevent Scraping:

•    Setting up robots.txt – Surprisingly, this technique is used against malicious bots! Why it doesn't work is pretty straightforward: robots.txt is an agreement between websites and search engine bots that keeps search engine bots away from sensitive information. No malicious bot (or the scraper behind it) in its right mind would obey robots.txt. This is the most ineffective method of preventing scraping.

•    Filtering requests by user agent – The user agent string of a client is set by the client itself. One method is to read it from the HTTP header of a request, so that a request can be filtered even before any content is served. We observed that very few bots (fewer than 10%) used a default user agent string that belonged to a scraping tool or was empty. Once their requests were filtered based on the user agent, it didn't take long for scrapers to notice and switch their user agent to that of a well-known browser, as the sketch after this list illustrates. This method merely stops new bots written by inexperienced scrapers for a few hours.

•    Blacklisting the IP address – Turning to an IP blacklisting service is much easier than the laborious process of capturing additional metrics from page requests and analyzing server logs. There are plenty of third party services which maintain a database of blacklisted IPs. In our hunt for a suitable blacklisting service, we found that using a third party DNSBL/RBL service was not effective, as these services blacklisted only email spambot servers and did little to stop scraping bots. Less than 2% of scraping bots were detected for one of our customers when we ran a trial.

•    Throwing CAPTCHA – A well-known practice for stopping bots is to throw a CAPTCHA on pages with sensitive content. Although effective against bots, the CAPTCHA is shown to every client requesting the page, human or bot. This often antagonizes users and hence reduces traffic to the website. More insights into Google's new No CAPTCHA reCAPTCHA can be found in our previous blog post.

•    Honey pot or honey trap – Honey pots are a brilliant trap mechanism for capturing new bots (scrapers who are not well versed with the structure of every page) on the website. But this approach poses a lesser-known threat of reducing the page rank on search engines. Here's why: search engine bots visit these links and might get trapped accidentally. Even if exceptions were made by disallowing a set of known user agents, the links to the traps might still be indexed by a search engine bot. Search engines interpret these as dead, irrelevant or fake links, and with more such traps the ranking of the website decreases considerably. Furthermore, filtering requests based on user agent can be exploited, as discussed above. In short, honey pots are risky business and must be handled very carefully.
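To see why filtering on the user agent achieves so little, consider this minimal sketch (the target URL is a placeholder): a scraper only has to send a browser-like User-Agent header with each request, which is a one-line change in most HTTP libraries.

```python
import requests

# Hypothetical target; the point is that the User-Agent header is entirely
# under the client's control, so filtering on it is trivial to bypass.
URL = "https://example.com/products"

headers = {
    # Present the bot as an ordinary desktop browser instead of a script.
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/40.0.2214.93 Safari/537.36"),
}

response = requests.get(URL, headers=headers, timeout=10)
print(response.status_code, len(response.text))
```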

To summarize, the prevention strategies listed above are either weak or require constant monitoring and regular maintenance to remain effective. In practice, bots are far more challenging to deal with than they might seem.

What to expect in 2015?

With the increasing demand for scraping, the number of scraping tools and expert scrapers is also growing, which simply means bots are going to be an increasing problem. In fact, the use of headless browsers, i.e. browser-like bots used to scrape, is on the rise, and scrapers are no longer relying on wget, curl and HTML parsers. Preventing malicious bots from stealing content without disturbing genuine traffic from humans and search engine bots is only going to get harder. By the end of the year we could infer from our database that almost half of an average website's traffic is caused by bots, and a whopping 30-40% by malicious bots. We believe this is only going to increase if we do not step up to take action!

p.s. If you think you are facing similar problems, why not request more information? Also, if you do not have the time or bandwidth to take such measures yourself, scraping prevention and stopping malicious bots is something we provide as a service. How about a free trial?

Source:http://www.shieldsquare.com/why-common-measures-taken-to-prevent-scraping-arent-effective/

Sunday 1 February 2015

I Don’t Need No Stinking API: Web Scraping For Fun and Profit

If you’ve ever needed to pull data from a third party website, chances are you started by checking to see if they had an official API. But did you know that there’s a source of structured data that virtually every website on the internet supports automatically, by default?
That’s right, we’re talking about pulling our data straight out of HTML — otherwise known as web scraping. Here’s why web scraping is awesome:

Any content that can be viewed on a webpage can be scraped. Period.

If a website provides a way for a visitor’s browser to download content and render that content in a structured way, then almost by definition, that content can be accessed programmatically. In this article, I’ll show you how.

Over the past few years, I’ve scraped dozens of websites — from music blogs and fashion retailers to the USPTO and undocumented JSON endpoints I found by inspecting network traffic in my browser.

There are some tricks that site owners will use to thwart this type of access — which we’ll dive into later — but they almost all have simple work-arounds.

Why You Should Scrape

But first, some great reasons why you should consider web scraping before you start looking for APIs or RSS feeds or other, more traditional forms of structured data.

Websites are More Important Than APIs

The biggest one is that site owners generally care way more about maintaining their public-facing visitor website than they do about their structured data feeds.

We’ve seen it very publicly with Twitter clamping down on their developer ecosystem, and I’ve seen it multiple times in my projects where APIs change or feeds move without warning.

Sometimes it’s deliberate, but most of the time these sorts of problems happen because no one at the organization really cares or maintains the structured data. If it goes offline or gets horribly mangled, no one really notices.

Whereas if the website goes down or is having issues, that’s a more of an in-your-face, drop-everything-until-this-is-fixed kind of problem, and gets dealt with quickly.

No Rate-Limiting

Another thing to think about is that the concept of rate-limiting is virtually non-existent for public websites.

Aside from the occasional captchas on sign up pages, most businesses generally don’t build a lot of defenses against automated access. I’ve scraped a single site for over 4 hours at a time and not seen any issues.

Unless you’re making concurrent requests, you probably won’t be viewed as a DDOS attack, you’ll just show up as a super-avid visitor in the logs, in case anyone’s looking.

Anonymous Access

There are also fewer ways for the website’s administrators to track your behavior, which can be useful if you want to gather data more privately.

With APIs, you often have to register to get a key and then send along that key with every request. But with simple HTTP requests, you’re basically anonymous besides your IP address and cookies, which can be easily spoofed.

The Data’s Already in Your Face

Web scraping is also universally available, as I mentioned earlier. You don’t have to wait for a site to open up an API or even contact anyone at the organization. Just spend some time browsing the site until you find the data you need and figure out some basic access patterns — which we’ll talk about next.

Let’s Get to Scraping

So you’ve decided you want to dive in and start grabbing data like a true hacker. Awesome.

Just like reading API docs, it takes a bit of work up front to figure out how the data is structured and how you can access it. Unlike APIs however, there’s really no documentation so you have to be a little clever about it.

I’ll share some of the tips I’ve learned along the way.

Fetching the Data

So the first thing you’re going to need to do is fetch the data. You’ll need to start by finding your “endpoints” — the URL or URLs that return the data you need.

If you know you need your information organized in a certain way — or only need a specific subset of it — you can browse through the site using their navigation. Pay attention to the URLs and how they change as you click between sections and drill down into sub-sections.

The other option for getting started is to go straight to the site’s search functionality. Try typing in a few different terms and again, pay attention to the URL and how it changes depending on what you search for. You’ll probably see a GET parameter like q= that always changes based on your search term.

Try removing other unnecessary GET parameters from the URL, until you’re left with only the ones you need to load your data. Make sure that there’s always a beginning ? to start the query string and a & between each key/value pair.
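For example, once you’ve worked out which parameters matter, an HTTP library can assemble the query string for you. A minimal sketch with the Python Requests library (the endpoint and parameter names here are made up):

```python
import requests

# Hypothetical search endpoint discovered by watching the URL change
# as you use the site's search box.
BASE_URL = "https://example.com/search"

# requests builds the ?q=...&page=... query string for you,
# handling the '?' and '&' separators and URL-encoding.
params = {"q": "vintage guitars", "page": 1}

response = requests.get(BASE_URL, params=params, timeout=10)
print(response.url)          # the full URL that was actually requested
print(response.status_code)
```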

Dealing with Pagination

At this point, you should be starting to see the data you want access to, but there’s usually some sort of pagination issue keeping you from seeing all of it at once. Most regular APIs do this as well, to keep single requests from slamming the database.

Usually, clicking to page 2 adds some sort of offset= parameter to the URL, which is usually either the page number or else the number of items displayed on the page. Try changing this to some really high number and see what response you get when you “fall off the end” of the data.

With this information, you can now iterate over every page of results, incrementing the offset parameter as necessary, until you hit that “end of data” condition.
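Here’s a hedged sketch of that loop, assuming a made-up endpoint that takes an offset parameter and returns JSON (swap in HTML parsing if the site returns markup instead):

```python
import requests

BASE_URL = "https://example.com/search"   # hypothetical endpoint
PAGE_SIZE = 50                            # assumed number of items per page

def fetch_all(query):
    """Yield every result by walking the offset parameter until the data runs out."""
    offset = 0
    while True:
        resp = requests.get(BASE_URL,
                            params={"q": query, "offset": offset},
                            timeout=10)
        items = resp.json()               # assuming the endpoint returns a JSON list
        if not items:                     # we "fell off the end" of the data
            break
        yield from items
        offset += PAGE_SIZE

if __name__ == "__main__":
    results = list(fetch_all("vintage guitars"))
    print(len(results), "items scraped")
```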

The other thing you can try doing is changing the “Display X Per Page” which most pagination UIs now have. Again, look for a new GET parameter to be appended to the URL which indicates how many items are on the page.

Try setting this to some arbitrarily large number to see if the server will return all the information you need in a single request. Sometimes there’ll be some limits enforced server-side that you can’t get around by tampering with this, but it’s still worth a shot since it can cut down on the number of pages you must paginate through to get all the data you need.

AJAX Isn’t That Bad!

Sometimes people see web pages with URL fragments # and AJAX content loading and think a site can’t be scraped. On the contrary! If a site is using AJAX to load the data, that probably makes it even easier to pull the information you need.

The AJAX response is probably coming back in some nicely-structured way (probably JSON!) in order to be rendered on the page with JavaScript.

All you have to do is pull up the network tab in Web Inspector or Firebug and look through the XHR requests for the ones that seem to be pulling in your data.

Once you find it, you can leave the crufty HTML behind and focus instead on this endpoint, which is essentially an undocumented API.
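Once you’ve spotted such an endpoint in the network tab, calling it directly is usually just a matter of replaying the request. A sketch, with an invented endpoint and response shape:

```python
import requests

# Hypothetical XHR endpoint spotted in the browser's network tab.
API_URL = "https://example.com/api/listings"

# Some endpoints check that the request "looks like" AJAX.
headers = {"X-Requested-With": "XMLHttpRequest"}

data = requests.get(API_URL, params={"category": "guitars"},
                    headers=headers, timeout=10).json()

# The "results", "title", and "price" keys are assumptions about the response.
for item in data.get("results", []):
    print(item.get("title"), item.get("price"))
```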

(Un)structured Data?

Now that you’ve figured out how to get the data you need from the server, the somewhat tricky part is getting the data you need out of the page’s markup.

Use CSS Hooks

In my experience, this is usually straightforward since most web designers litter the markup with tons of classes and ids to provide hooks for their CSS.

You can piggyback on these to jump to the parts of the markup that contain the data you need.

Just right click on a section of information you need and pull up the Web Inspector or Firebug to look at it. Zoom up and down through the DOM tree until you find the outermost <div> around the item you want.

This <div> should be the outer wrapper around a single item you want access to. It probably has some class attribute which you can use to easily pull out all of the other wrapper elements on the page. You can then iterate over these just as you would iterate over the items returned by an API response.
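Here’s roughly what that looks like with BeautifulSoup (more on parsing libraries below); the markup and class names are invented stand-ins for whatever hooks you find in the inspector:

```python
from bs4 import BeautifulSoup

# 'html' would normally be the raw page you downloaded; this stub stands in for it,
# and the "item"/"price" class names are whatever hooks you found in the inspector.
html = """
<div class="item"><h2>First listing</h2><span class="price">$10</span></div>
<div class="item"><h2>Second listing</h2><span class="price">$12</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Iterate over the wrapper <div>s just like items from an API response.
for wrapper in soup.find_all("div", class_="item"):
    title = wrapper.h2.get_text(strip=True)
    price = wrapper.find("span", class_="price").get_text(strip=True)
    print(title, price)
```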

A note here though: the DOM tree that is presented by the inspector isn’t always the same as the DOM tree represented by the HTML sent back by the website. It’s possible that the DOM you see in the inspector has been modified by JavaScript — or sometimes even by the browser, if it’s in quirks mode.

Once you find the right node in the DOM tree, you should always view the source of the page (“right click” > “View Source”) to make sure the elements you need are actually showing up in the raw HTML.

This issue has caused me a number of head-scratchers.

Get a Good HTML Parsing Library

It is probably a horrible idea to try parsing the HTML of the page as a long string (although there are times I’ve needed to fall back on that). Spend some time doing research for a good HTML parsing library in your language of choice.

Most of the code I write is in Python, and I love BeautifulSoup for its error handling and super-simple API. I also love its motto:

    You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help. :)

You’re going to have a bad time if you try to use an XML parser since most websites out there don’t actually validate as properly formed XML (sorry XHTML!) and will give you a ton of errors.

A good library will read in the HTML that you pull in using some HTTP library (hat tip to the Requests library if you’re writing Python) and turn it into an object that you can traverse and iterate over to your heart’s content, similar to a JSON object.
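Putting the two pieces together, Requests to pull the HTML down and BeautifulSoup to turn it into something traversable, looks roughly like this (the URL is a placeholder):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; in practice this is whatever endpoint you settled on earlier.
html = requests.get("https://example.com/listings", timeout=10).text

# BeautifulSoup tolerates the messy, non-validating HTML most sites serve.
soup = BeautifulSoup(html, "html.parser")

print(soup.title.get_text(strip=True))       # the page title
for link in soup.find_all("a", href=True):   # iterate like you would an API response
    print(link["href"])
```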

Some Traps To Know About

I should mention that some websites explicitly prohibit the use of automated scraping, so it’s a good idea to read your target site’s Terms of Use to see if you’re going to make anyone upset by scraping.

For two-thirds of the websites I’ve scraped, the above steps are all you need. Just fire off a request to your “endpoint” and parse the returned data.

But sometimes, you’ll find that the response you get when scraping isn’t what you saw when you visited the site yourself.

When In Doubt, Spoof Headers

Some websites require that your User Agent string is set to something they allow, or you need to set certain cookies or other headers in order to get a proper response.

Depending on the HTTP library you’re using to make requests, this is usually pretty straightforward. I just browse the site in my web browser and then grab all of the headers that my browser is automatically sending. Then I put those in a dictionary and send them along with my request.
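With Requests, that dictionary of headers is passed straight into the call; the header values below are illustrative, not what any particular site requires:

```python
import requests

# Headers copied from the browser's network tab (example values only).
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10) AppleWebKit/537.36",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.8",
    "Referer": "https://example.com/",
}

# Placeholder URL standing in for the page that needs the spoofed headers.
response = requests.get("https://example.com/protected-listing",
                        headers=headers, timeout=10)
print(response.status_code)
```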

Note that this might mean grabbing some login or other session cookie, which might identify you and make your scraping less anonymous. It’s up to you how serious of a risk that is.

Content Behind A Login

Sometimes you might need to create an account and log in to access the information you need. If you have a good HTTP library that handles logins and automatically sends session cookies (did I mention how awesome Requests is?), then you just need to have your scraper log in before it gets to work.
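With Requests this usually means a Session object: post the login form once and let the session carry the cookies for every later request. A sketch with invented URLs and field names:

```python
import requests

# All names here (URLs, form fields, credentials) are hypothetical; copy the
# real ones from the site's login form.
LOGIN_URL = "https://example.com/login"

session = requests.Session()   # remembers and resends session cookies for you
session.post(LOGIN_URL, data={"username": "me@example.com",
                              "password": "hunter2"})

# Subsequent requests through the session are made as the logged-in user.
page = session.get("https://example.com/members/dashboard")
print(page.status_code)
```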

Note that this obviously makes you totally non-anonymous to the third party website so all of your scraping behavior is probably pretty easy to trace back to you if anyone on their side cared to look.

Rate Limiting

I’ve never actually run into this issue myself, although I did have to plan for it one time. I was using a web service that had a strict rate limit that I knew I’d exceed fairly quickly.

Since the third party service conducted rate-limiting based on IP address (stated in their docs), my solution was to put the code that hit their service into some client-side Javascript, and then send the results back to my server from each of the clients.

This way, the requests would appear to come from thousands of different places, since each client would presumably have their own unique IP address, and none of them would individually be going over the rate limit.

Depending on your application, this could work for you.

Poorly Formed Markup

Sadly, this is the one condition that there really is no cure for. If the markup doesn’t come close to validating, then the site is not only keeping you out, but also serving a degraded browsing experience to all of their visitors.

It’s worth digging into your HTML parsing library to see if there’s any setting for error tolerance. Sometimes this can help.

If not, you can always try falling back on treating the entire HTML document as a long string and do all of your parsing as string splitting or — God forbid — a giant regex.
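If it comes to that, the fallback looks something like this; the markup and pattern below are invented, and this approach breaks the moment the page layout changes:

```python
import re

# Last-resort parsing: treat the page as one big string.
# The snippet and pattern are illustrative; a real page needs its own regex.
html = '<li class="product">Widget A - $10</li><li class="product">Widget B - $12</li>'

for name, price in re.findall(r'<li class="product">(.*?) - \$(\d+)</li>', html):
    print(name, price)
```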



Well there’s 2000 words to get you started on web scraping. Hopefully I’ve convinced you that it’s actually a legitimate way of collecting data.

It’s a real hacker challenge to read through some HTML soup and look for patterns and structure in the markup in order to pull out the data you need. It usually doesn’t take much longer than reading some API docs and getting up to speed with a client. Plus it’s way more fun!

Source: https://blog.hartleybrody.com/web-scraping/