Friday 31 May 2013

Effectiveness of Web Data Mining Through Web Research

Web data mining is a systematic approach to keyword-based and hyperlink-based web research for gaining business intelligence. It requires analytical skills to understand the hyperlink structure of a given website. Hyperlinks carry an enormous amount of implicit human annotation that can help automatically determine a page's authority. If a webmaster provides a hyperlink pointing to another website or web page, that action is perceived as an endorsement of the target page. Search engines rely heavily on such endorsements to gauge the importance of a page and place it higher in organic search results.

However, not every hyperlink represents an endorsement, since the webmaster may have added it for other purposes, such as navigation or rendering paid advertisements. It is also important to note that authoritative pages rarely provide informative self-descriptions. For instance, Google's homepage does not explicitly describe itself as a "Web search engine."

These features of hyperlink systems have led researchers to evaluate another important category of webpage, the hub. A hub is an informative webpage that offers a collection of links to authorities. It may have only a few links pointing to it, but it links to a collection of prominent sites on a single topic. A hub effectively confers authority status on the sites it points to within that topic. Typically, a quality hub points to many quality authorities, and, conversely, a web page that many such hubs link to can be deemed a superior authority.
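
To picture how hub and authority scores reinforce each other, here is a minimal sketch in Python of the kind of iterative scoring popularized by the HITS algorithm; the toy link graph, the fixed iteration count, and the normalization are illustrative assumptions, not details taken from the article.

    # Minimal sketch of hub/authority scoring in the spirit of HITS.
    # The toy link graph below is a made-up example for illustration only.
    links = {
        "hub_page": ["site_a", "site_b", "site_c"],  # a hub pointing to several authorities
        "site_a": ["site_b"],
        "site_b": [],
        "site_c": ["site_a"],
    }

    pages = set(links) | {p for targets in links.values() for p in targets}
    hub = {p: 1.0 for p in pages}
    authority = {p: 1.0 for p in pages}

    for _ in range(20):  # a fixed number of iterations is enough for a toy graph
        # Authority score: sum of hub scores of the pages that link to you.
        authority = {p: sum(hub[q] for q in pages if p in links.get(q, [])) for p in pages}
        # Hub score: sum of authority scores of the pages you link to.
        hub = {p: sum(authority[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay comparable between iterations.
        a_norm = sum(v * v for v in authority.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        authority = {p: v / a_norm for p, v in authority.items()}
        hub = {p: v / h_norm for p, v in hub.items()}

    print(sorted(authority.items(), key=lambda kv: -kv[1]))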

Such an approach to identifying authoritative pages has resulted in the development of various popularity algorithms, most notably PageRank. Google uses the PageRank algorithm to measure the authority of each webpage for a relevant search query. By analyzing hyperlink structures together with web page content, such search engines can return better-quality results than term-index engines such as Ask and topic directories such as DMOZ.
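
As a rough illustration of how a link-based popularity score can be computed, here is a minimal power-iteration sketch of PageRank in Python; the toy graph, the 0.85 damping factor, and the handling of pages without outlinks are common textbook conventions used purely for illustration, not a description of Google's production system.

    # Minimal power-iteration sketch of PageRank over a toy link graph.
    # The graph and parameters are illustrative only.
    links = {
        "a": ["b", "c"],
        "b": ["c"],
        "c": ["a"],
        "d": ["c"],
    }

    damping = 0.85
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(50):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += damping * share
            else:
                # One common convention: a page with no outlinks spreads its rank evenly.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank

    print(sorted(rank.items(), key=lambda kv: -kv[1]))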


Source: http://ezinearticles.com/?Effectiveness-of-Web-Data-Mining-Through-Web-Research&id=5094403

Tuesday 28 May 2013

Copart Auto Auction

What are you looking for? A sedan, an SUV, a luxury car, a foreign make, a domestic car, a pickup, a classic car, a motorcycle or boat? Or maybe a salvage or rebuildable car or truck that still has plenty of good miles left in it for the right buyer? AutoBidMaster can help you out. ABM has an easily searchable database of cars and trucks that are available at auction, and has auctions in most of the lower 48 states, as well as auctions located globally. Signing up for an AutoBidMaster account to get access to Copart auto auctions is easy, and allows you to search and bid on over 50,000 vehicles every day, with access to view any Copart Virtual Auction in real time. By becoming an AutoBidMaster member, you're automatically cleared to bid up to $1,000 on a vehicle before you need to increase your buying power by using the convenient deposit system.

When bidding on a Copart auction through ABM, you have the option of either personally inspecting the cars you're interested in, or having a designated licensed mechanic do the inspection and then give you a detailed report on the vehicle (complete with photographs). This, of course, is far preferable to buying a vehicle sight unseen on eBay and relying on the seller's good faith in the purchase!

AutoBidMaster bidders include professionals as well as ordinary car buyers. The only way to get access to Copart's dealer-only auctions is through a registered broker such as ABM. You can bid from around the world, and contract with several haulers to arrange delivery of the vehicle from the auction directly to your door or office. For international buyers, we can arrange shipping and take care of all the export paperwork as well. The only exception is Canadian buyers; in that case, we can arrange shipping to the nearest location on the US side of the border.

Dealer-only sales of worn-out ten-year-old cars are still around, but now there are other options as well when it comes to auto auctions. You don't have to resort to a sight-unseen eBay auction or a government-auction gamble on beat-up Highway Department trucks and seized drug-dealer cars. With AutoBidMaster, we can take a lot of the guesswork and hassle out of the buying process, and you can come away with the kind of car or truck you were looking for, often for thousands of dollars off the blue book value!

Source: http://www.articlesbase.com/cars-articles/copart-auto-auction-2970131.html

Saturday 25 May 2013

Flooded vehicles for sale? State offers database for consumers to check

A state website that lists vehicles whose titles show they have been flooded or ruined in a crash has reached 26,000 entries.

It is located at www.njconsumeraffairs.gov/floodedcars. The database allows users to enter the Vehicle Identification Number (VIN) unique to each vehicle and check whether a flood or salvage title has been issued.

A vehicle is issued a salvage title if an insurer has declared it a total loss or if it is cost prohibitive to repair. Salvaged vehicles can only be driven to and from an inspection appointment.

It is against the law to sell or transfer a salvage vehicle without a title.

A vehicle ruined by flood will be issued a flood salvage title.

A car that has been flooded but is not a total loss and can be repaired will be issued a “Flood Vehicle” title. The flooded status remains through the life of a vehicle.

“Consumers who are thinking about purchasing a used motor vehicle need to be vigilant, in the wake of superstorm Sandy,” said Eric T. Kanefsky, Acting Director of the State Division of Consumer Affairs. “Storm damaged vehicles may be offered for sale for years to come.”

The state Motor Vehicle Commission does not have an estimate of how many of those 26,000 entries are related to superstorm Sandy.

Insurers have reported 60,000 claims for personal or commercial vehicles as a result of Sandy.


Source: http://blogs.app.com/hurricanesandy/blog/2013/03/14/flooded-vehicles-for-sale-state-offers-database-for-consumers-to-check/

Friday 17 May 2013

Copart Auto Auction

What are you looking for? A sedan, an SUV, a luxury car, a foreign make, a domestic car, a pickup, a classic car, a motorcycle or boat? Or maybe a salvage or rebuildable car or truck that still has plenty of good miles left in it for the right buyer? AutoBidMaster can help you out. ABM has an easily searchable database of cars and trucks that are available at auction, and has auctions in most of the lower 48 states, as well as auctions located globally. Signing up for an AutoBidMaster account to get access to Copart auto auctions is easy, and allows you to search and bid on over 50,000 vehicles every day, with access to view any Copart Virtual Auction in real time. By becoming an AutoBidMaster member, you’re automatically cleared to bid up to $1,000 on a vehicle before you need to increase your buying power by using the convenient deposit system.

When bidding on a Copart auction through ABM, you have the option of either personally inspecting the cars you’re interested in, or having a designated licensed mechanic do the inspection and then give you a detailed report on the vehicle (complete with photographs). This, of course, is far preferable to buying a vehicle sight unseen on eBay and relying on the seller’s good faith in the purchase!

AutoBidMaster bidders include professionals as well as ordinary car buyers. The only way to get access to Copart’s dealer-only auctions is through a registered broker such as ABM. You can bid from around the world, and contract with several haulers to arrange delivery of the vehicle from the auction directly to your door or office. For international buyers, we can arrange shipping and take care of all the export paperwork as well. The only exception is Canadian buyers; in that case, we can arrange shipping to the nearest location on the US side of the border.

Dealer-only sales of worn-out ten-year-old cars are still around, but now there are other options as well when it comes to auto auctions. You don’t have to resort to a sight-unseen eBay auction or a government-auction gamble on beat-up Highway Department trucks and seized drug-dealer cars. With AutoBidMaster, we can take a lot of the guesswork and hassle out of the buying process, and you can come away with the kind of car or truck you were looking for, often for thousands of dollars off the blue book value!

Source: http://www.123articleonline.com/articles/141506/copart-auto-auction

Tuesday 7 May 2013

WP Web Scraper

What is web scraping? Why do I need it?

Web scraping (or Web harvesting, Web data extraction) is a computer software technique for extracting information from websites. Web scraping focuses on the transformation of unstructured Web content, typically in HTML format, into structured data that can be formatted and displayed or stored and analyzed. Web scraping is also related to Web automation, which simulates human Web browsing using computer software. Typical uses of Web scraping include online price comparison, weather data monitoring, market data tracking, Web content mashups and Web data integration.
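
As a concrete, deliberately simple illustration of turning unstructured HTML into structured data, the following Python sketch fetches a page and extracts its title using only the standard library; the URL, the "My Bot" user agent and the output format are assumptions chosen for the example and are separate from the plugin described below.

    # Minimal sketch: fetch a page and extract its <title> as structured data.
    # Uses only the Python standard library; the URL is an illustrative example.
    from html.parser import HTMLParser
    from urllib.request import Request, urlopen


    class TitleParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.in_title = False
            self.title = ""

        def handle_starttag(self, tag, attrs):
            if tag == "title":
                self.in_title = True

        def handle_endtag(self, tag):
            if tag == "title":
                self.in_title = False

        def handle_data(self, data):
            if self.in_title:
                self.title += data


    req = Request("http://example.com", headers={"User-Agent": "My Bot"})
    html = urlopen(req, timeout=10).read().decode("utf-8", errors="replace")

    parser = TitleParser()
    parser.feed(html)
    print({"url": "http://example.com", "title": parser.title.strip()})
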
Sounds interesting, but how do I actually use it?

WP Web Scraper can be used through a shortcode (for posts, pages or sidebar) or template tag (for direct integration in your theme) for scraping and displaying web content. Here's the actual usage detail:

For use directly in posts, pages or sidebar (text widget): [wpws url="" selector=""]

Example usage as a shortcode: [wpws url="http://google.com" selector="title" user_agent="My Bot" on_error="error_show"] (Displays the title tag of Google's home page, using My Bot as the user agent)

For use within themes: <?php echo wpws_get_content($url, $selector, $xpath, $wpwsopt)?>

Example usage in a theme: <?php echo wpws_get_content('http://google.com','title','','user_agent=My Bot&on_error=error_show&')?> (Displays the title tag of Google's home page, using My Bot as the user agent)

For usage of other advanced parameters, refer to the Usage Manual.

Further details about selector syntax can be found in Selectors.
Wow! I can actually create a complete mashup using this!

Yes, you can. However, you should consider the copyright of the content owner. It's best to at least attribute the content owner with a linkback, or, better, obtain written permission. Apart from rights, scraping in general is a very resource-intensive task. It will consume the bandwidth of your host as well as that of the content owner's host. It is best not to overdo it. Ideally, find single pages with enough content to create your mashup.
Okay. Then what's the best way to optimize its usage?

Here are some tips to help you optimize the usage:

    Keep the timeout as low as possible (the minimum is 1 second). A higher timeout might impact your page processing time if you are dealing with content on slow servers.
    If you plan to use multiple scrapers on a single page, make sure you set the cache timeout to a larger period, possibly as long as a day (i.e. 1440 minutes) or even more. This will cache content on your server and reduce scraping (see the sketch after this list).
    Use fast-loading pages as your content source. Also prefer smaller pages to optimize performance.
    Keep a close watch on your scraper. If the website changes its page layout, your selector may fail to fetch the right content.
    If you are scraping a lot, keep a watch on your cache size too. Clear the cache occasionally.
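
To picture the caching advice above, here is a generic time-based cache sketch in Python; it is not the plugin's own Transients-based implementation, and the fetch callback and the 1440-minute default are illustrative assumptions.

    # Generic sketch of time-based caching for scraped content, in the spirit of
    # caching scraper output for a long period to reduce repeated fetches.
    import time

    _cache = {}  # url -> (timestamp, content)

    def get_cached(url, fetch, cache_timeout_minutes=1440):
        """Return cached content for url while it is still fresh, else fetch and cache it."""
        now = time.time()
        entry = _cache.get(url)
        if entry and now - entry[0] < cache_timeout_minutes * 60:
            return entry[1]
        content = fetch(url)  # fetch() is whatever function actually scrapes the page
        _cache[url] = (now, content)
        return content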

What libraries are used? What are the minimum requirements apart from WordPress?

For scraping, the plugin primarily uses the WP_HTTP classes. For caching it uses the Transients API. For parsing HTML using CSS-style selectors, the plugin uses phpQuery, a server-side, chainable, CSS3-selector-driven Document Object Model (DOM) API based on the jQuery JavaScript library, and for XPath parsing it uses JS_Extractor.

Source: http://wordpress.org/extend/plugins/wp-web-scrapper/faq/

Friday 3 May 2013

'Scrapers' Dig Deep for Data on Web

At 1 a.m. on May 7, the website PatientsLikeMe.com noticed suspicious activity on its "Mood" discussion board. There, people exchange highly personal stories about their emotional disorders, ranging from bipolar disease to a desire to cut themselves.

It was a break-in. A new member of the site, using sophisticated software, was "scraping," or copying, every single message off PatientsLikeMe's private online forums.

PatientsLikeMe managed to block and identify the intruder: Nielsen Co., the privately held New York media-research firm. Nielsen monitors online "buzz" for clients, including major drug makers, which buy data gleaned from the Web to get insight from consumers about their products, Nielsen says.

"I felt totally violated," says Bilal Ahmed, a 33-year-old resident of Sydney, Australia, who used PatientsLikeMe to connect with other people suffering from depression. He used a pseudonym on the message boards, but his PatientsLikeMe profile linked to his blog, which contains his real name.

After PatientsLikeMe told users about the break-in, Mr. Ahmed deleted all his posts, plus a list of drugs he uses. "It was very disturbing to know that your information is being sold," he says. Nielsen says it no longer scrapes sites requiring an individual account for access, unless it has permission.

The market for personal data about Internet users is booming, and in the vanguard is the practice of "scraping." Firms offer to harvest online conversations and collect personal details from social-networking sites, résumé sites and online forums where people might discuss their lives.

The emerging business of web scraping provides some of the raw material for a rapidly expanding data economy. Marketers spent $7.8 billion on online and offline data in 2009, according to the New York management consulting firm Winterberry Group LLC. Spending on data from online sources is set to more than double, to $840 million in 2012 from $410 million in 2009.

The Wall Street Journal's examination of scraping—a trade that involves personal information as well as many other types of data—is part of the newspaper's investigation into the business of tracking people's activities online and selling details about their behavior and personal interests.

Some companies collect personal information for detailed background reports on individuals, such as email addresses, cell numbers, photographs and posts on social-network sites.

Others offer what are known as listening services, which monitor in real time hundreds or thousands of news sources, blogs and websites to see what people are saying about specific products or topics.

One such service is offered by Dow Jones & Co., publisher of the Journal. Dow Jones collects data from the Web—which may include personal information contained in news articles and blog postings—that help corporate clients monitor how they are portrayed. It says it doesn't gather information from password-protected parts of sites.

The competition for data is fierce. PatientsLikeMe also sells data about its users. PatientsLikeMe says the data it sells is anonymized, no names attached.

Nielsen spokesman Matt Anchin says the company's reports to its clients include publicly available information gleaned from the Internet, "so if someone decides to share personally identifiable information, it could be included."

Internet users often have little recourse if personally identifiable data is scraped: There is no national law requiring data companies to let people remove or change information about themselves, though some firms let users remove their profiles under certain circumstances.

California has a special protection for public officials, including politicians, sheriffs and district attorneys. It makes it easier for them to remove their home address and phone numbers from these databases, by filling out a special form stating they fear for their safety.

Data brokers long have scoured public records, such as real-estate transactions and courthouse documents, for information on individuals. Now, some are adding online information to people's profiles.

Many scrapers and data brokers argue that if information is available online, it is fair game, no matter how personal.

"Social networks are becoming the new public records," says Jim Adler, chief privacy officer of Intelius Inc., a leading paid people-search website. It offers services that include criminal background checks and "Date Check," which promises details about a prospective date for $14.95.

"This data is out there," Mr. Adler says. "If we don't bring it to the consumer's attention, someone else will."

New York-based PeekYou LLC has applied for a patent for a method that, among other things, matches people's real names to the pseudonyms they use on blogs, Twitter and other social networks. PeekYou's people-search website offers records of about 250 million people, primarily in the U.S. and Canada.

PeekYou says it also is starting to work with listening services to help them learn more about the people whose conversations they are monitoring. It says it hands over only demographic information, not names or addresses.

Employers, too, are trying to figure out how to use such data to screen job candidates. It's tricky: Employers legally can't discriminate based on gender, race and other factors they may glean from social-media profiles.

One company that screens job applicants for employers, InfoCheckUSA LLC in Florida, began offering limited social-networking data—some of it scraped—to employers about a year ago. "It's slowly starting to grow," says Chris Dugger, national account manager. He says he's particularly interested in things like whether people are "talking about how they just ripped off their last employer."

Scrapers operate in a legal gray area. Internationally, anti-scraping laws vary. In the U.S., court rulings have been contradictory. "Scraping is ubiquitous, but questionable," says Eric Goldman, a law professor at Santa Clara University. "Everyone does it, but it's not totally clear that anyone is allowed to do it without permission."

Scrapers and listening companies say what they're doing is no different from what any person does when gathering information online—they just do it on a much larger scale.

"We take an incomprehensible amount of information and make it intelligent," says Chase McMichael, chief executive of InfiniGraph, a Palo Alto, Calif., "listening service" that helps companies understand the likes and dislikes of online customers.

Scraping services range from dirt cheap to custom-built. Some outfits, such as 80Legs.com in Texas, will scrape a million Web pages for $101. One Utah company, screen-scraper.com, offers do-it-yourself scraping software for free. The top listening services can charge hundreds of thousands of dollars to monitor and analyze Web discussions.

Some scrapers-for-hire don't ask clients many questions.

"If we don't think they're going to use it for illegal purposes—they often don't tell us what they're going to use it for—generally, we'll err on the side of doing it," says Todd Wilson, owner of screen-scraper.com, a 10-person firm in Provo, Utah, that operates out of a two-room office. It is one of at least three firms in a scenic area known locally as "Happy Valley" that specialize in scraping.

Screen-scraper charges between $1,500 and $10,000 for most jobs. The company says it's often hired to conduct "business intelligence," working for companies who want to scrape competitors' websites.

One recent assignment: A major insurance company wanted to scrape the names of agents working for competitors. Why? "We don't know," says Scott Wilson, the owner's brother and vice president of sales. Another job: attempting to scrape Facebook for a multi-level marketing company that wanted email addresses of users who "like" the firm's page—as well as their friends—so they all could be pitched products.

Scraping often is a cat-and-mouse game between websites, which try to protect their data, and the scrapers, who try to outfox their defenses. Scraping itself isn't difficult: Nearly any talented computer programmer can do it. But penetrating a site's defenses can be tough.

One defense familiar to most Internet users involves "captchas," the squiggly letters that many websites require people to type to prove they're human and not a scraping robot. Scrapers sometimes fight back with software that deciphers captchas.

Some professional scrapers stage blitzkrieg raids, mounting around a dozen simultaneous attacks on a website to grab as much data as quickly as possible without being detected or crashing the site they're targeting.

Raids like these are on the rise. "Customers for whom we were regularly blocking about 1,000 to 2,000 scrapes a month are now seeing three times or in some cases 10 times as much scraping," says Marino Zini, managing director of Sentor Anti Scraping System. The company's Stockholm team blocks scrapers on behalf of website clients.

At Monster.com, the jobs website that stores résumés for tens of millions of individuals, fighting scrapers is a full-time job, "every minute of every day of every week," says Patrick Manzo, global chief privacy officer of Monster Worldwide Inc. Facebook, with its trove of personal data on some 500 million users, says it takes legal and technical steps to deter scraping.

At PatientsLikeMe, there are forums where people discuss experiences with AIDS, supranuclear palsy, depression, organ transplants, post-traumatic stress disorder and self-mutilation. These are supposed to be viewable only by members who have agreed not to scrape, and not by intruders such as Nielsen.

"It was a bad legacy practice that we don't do anymore," says Dave Hudson, who in June took over as chief executive of the Nielsen unit that scraped PatientsLikeMe in May. "It's something that we decided is not acceptable, and we stopped."

Mr. Hudson wouldn't say how often the practice occurred, and wouldn't identify its client.

The Nielsen unit that did the scraping is now part of a joint venture with McKinsey & Co. called NM Incite. It traces its roots to a Cincinnati company called Intelliseek that was founded in 1997. One of its most successful early businesses was scraping message boards to find mentions of brand names for corporate clients.

In 2001, the venture-capital arm of the Central Intelligence Agency, In-Q-Tel Inc., was among a group of investors that put $8 million into the business.

Intelliseek struggled to set boundaries in the new business of monitoring individual conversations online, says Sundar Kadayam, Intelliseek's co-founder. The firm decided it wouldn't be ethical to use automated software to log into private message boards to scrape them.

But, he says, Intelliseek occasionally would ask employees to do that kind of scraping if clients requested it. "The human being can just sign in as who they are," he says. "They don't have to be deceitful."

In 2006, Nielsen bought Intelliseek, which had revenue of more than $10 million and had just become profitable, Mr. Kadayam says. He left one year after the acquisition.

At the time, Nielsen, which provides television ratings and other media services, was looking to diversify into digital businesses. Nielsen combined Intelliseek with a New York startup it had bought called BuzzMetrics.

The new unit, Nielsen BuzzMetrics, quickly became a leader in the field of social-media monitoring. It collects data from 130 million blogs, 8,000 message boards, Twitter and social networks. It sells services such as "ThreatTracker," which alerts a company if its brand is being discussed in a negative light. Clients include more than a dozen of the biggest pharmaceutical companies, according to the company's marketing material.

Like many websites, PatientsLikeMe has software that detects unusual activity. On May 7, that software sounded an alarm about the "Mood" forum.

David Williams, the chief marketing officer, quickly determined that the "member" who had triggered the alert actually was an automated program scraping the forum. He shut down the account.

The next morning, the holder of that account e-mailed customer support to ask why the login and password weren't working. By the afternoon, PatientsLikeMe had located three other suspect accounts and shut them down. The site's investigators traced all of the accounts to Nielsen BuzzMetrics.

On May 18, PatientsLikeMe sent a cease-and-desist letter to Nielsen. Ten days later, Nielsen sent a letter agreeing to stop scraping. Nielsen says it was unable to remove the scraped data from its database, but a company spokesman later said Nielsen had found a way to quarantine the PatientsLikeMe data to prevent it from being included in its reports for clients.

PatientsLikeMe's president, Ben Heywood, disclosed the break-in to the site's 70,000 members in a blog post. He also reminded users that PatientsLikeMe sells its data in an anonymous form, without attaching users' names to it. That sparked a lively debate on the site about the propriety of selling sensitive information. The company says most of the 350 responses to the blog post were supportive. But it says a total of 218 members quit.

In total, PatientsLikeMe estimates that the scraper obtained about 5% of the messages in the site's forums, primarily in "Mood" and "Multiple Sclerosis."

"We're a business, and the reality is that someone came in and stole from us," says PatientsLikeMe's chairman, Jamie Heywood.

Source: http://online.wsj.com/article/SB10001424052748703358504575544381288117888.html