Saturday, 29 June 2013

Web Data Extraction

The Internet as we know it today is a repository of information that can be accessed across geographical boundaries. In just over two decades, the Web has moved from a university curiosity to a fundamental research, marketing and communications vehicle that touches the everyday life of most people around the world. It is accessed by over 16% of the world's population, spanning more than 233 countries.

As the amount of information on the Web grows, that information becomes ever harder to keep track of and use. Compounding the problem, this information is spread over billions of Web pages, each with its own independent structure and format. So how do you find the information you're looking for in a useful format - and do it quickly and easily without breaking the bank?

Search Isn't Enough

Search engines are a big help, but they can do only part of the work, and they are hard-pressed to keep up with daily changes. For all the power of Google and its kin, all a search engine can do is locate information and point to it. Search engines typically go only two or three levels deep into a Web site to find information and then return URLs. They cannot retrieve information from the deep Web - information that is available only after filling in a registration form and logging in - and they cannot store it in a desired format. To save the information in a particular format or application, after using the search engine to locate the data, you still have to do the following to capture the information you need:

· Scan the content until you find the information.

· Mark the information (usually by highlighting with a mouse).

· Switch to another application (such as a spreadsheet, database or word processor).

· Paste the information into that application.

It's not all copy and paste

Consider a company looking to build an email marketing list of over 100,000 names and email addresses from a public group. Even if a person manages to copy and paste each name and email address in one second, the job will take over 28 man-hours, translating to over $500 in wages alone, not to mention the other costs associated with it. The time needed to copy a record is directly proportional to the number of data fields that have to be copied and pasted.

Is there any Alternative to copy-paste?

A better solution, especially for companies aiming to exploit the broad swath of data about markets or competitors available on the Internet, lies in the use of custom Web harvesting software and tools.

Web harvesting software automatically extracts information from the Web and picks up where search engines leave off, doing the work search engines can't. Extraction tools automate the reading, copying and pasting necessary to collect information for further use. The software mimics human interaction with the website, gathering data as if the site were being browsed, but it navigates the site to locate, filter and copy the required data at speeds far higher than is humanly possible. Advanced software can even browse a website and gather data silently, without leaving any footprint of access.
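As a rough illustration of what such a tool does under the hood, here is a minimal sketch using the open-source Jsoup library (which is covered in a later post on this blog); the URL, user-agent string and CSS selector below are hypothetical placeholders. It requests a page the way a browser would and pulls out only the pieces of interest.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Minimal harvesting sketch: fetch a page with a browser-like user agent,
// then select and print only the elements we care about.
public class HarvestSketch {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.example.com/news")   // hypothetical URL
                .userAgent("Mozilla/5.0")                             // look like a normal browser
                .timeout(10000)
                .get();

        // ".headline" is an assumed selector; a real tool is configured per site.
        for (Element headline : doc.select(".headline")) {
            System.out.println(headline.text());
        }
    }
}

A real harvesting product wraps this same fetch-and-select loop in navigation logic, scheduling and export formats, but the core idea is no more than this.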

The next article in this series will go into more detail about how such software works and will dispel some myths about web harvesting.


Source: http://ezinearticles.com/?Web-Data-Extraction&id=575212

Thursday, 27 June 2013

Optimize Usage of Twitter With Data Mining

Twitter has become hugely popular and is often described as addictive. As more and more people get hooked on it, Twitter becomes an ever more important medium for driving traffic to your website, marketing your products and services, or simply building brand recognition. As an internet marketer, you will always be interested in what's going on inside Twitter, but with 40 million people located all over the world, it would be impossible to keep track of it all unless you use additional tools to help you achieve this goal.

Twitter is a microblogging platform that most people use to tell their friends and loved ones what is currently going on in their lives. Tweeters can also engage in discussions, and more recently, more and more internet marketers use it to inform everyone about their company, business, products and services.

As an internet marketer, you will need to maximize your usage of Twitter. It is not enough to know how to tweet efficiently or how to broadcast your tweets [http://moneymakingonlinetip.blogspot.com/2010/01/broadcast-your-tweets.html]. You really need to know the most talked-about topics on Twitter over a given period of time and for a given geographical location. Knowing this, you will be able to define a good marketing strategy and blend in well with these people. Advertising at the right time and place promises a higher conversion rate, translating to higher sales and greater profits.

This can be achieved with the proper use of data mining tools and software. Such tools may not all exist yet, but extracting useful information from data gathered on Twitter with the help of data mining tools and software will certainly be an excellent strategy for succeeding in business.


Source: http://ezinearticles.com/?Optimize-Usage-of-Twitter-With-Data-Mining&id=3589673

Tuesday, 25 June 2013

Data Recovery Services - Be Wary of Cheap Prices

Data recovery is a specialized, complicated process. Proper hard drive recovery can require manipulation of data at the sector level, transplantation of internal components and various other procedures. These techniques are very involved and require not only talented, knowledgeable technicians, but also an extensive inventory of disk drives to use for parts when necessary and clean facilities to conduct the work.

Unfortunately these factors mean that, in most cases, recovery services are quite expensive. Technician training, hard drive inventories and special equipment all come with a cost.

If you search for disk recovery services, you will likely find several smaller companies that offer hard disk data recovery for a fraction of the prices usually quoted by larger, more experienced organizations. These companies often operate from small offices or, in some cases, private homes. They do not possess clean room facilities, large disk drive inventories or many other pieces of equipment necessary to successfully complete most hard drive recovery cases.

When you take into account all of the training, parts and facilities necessary, you can see how it is impossible for a company to charge $200 for a hard drive recovery and not operate at a loss.

What these companies usually do is run a software program on the hard disk drive. Sometimes, if there is no physical damage to the disk drive, this program is able to recover some of the data. However, hard disk data recovery is much more than just running software. No software can recover data from a hard drive that has failed heads, damaged platters or electrical damage. In fact, attempting to run a hard drive that is physically damaged can make the problem worse. Trying to use software to recover data from a hard drive with failed read/write heads, for example, can lead to the heads scraping the platters of the drive and leaving the data unrecoverable.

Another way these smaller companies conduct business is by forwarding data recovery cases they cannot recover to larger organizations. Of course, the smaller companies charge their clients a premium for this service. In these cases it would have actually been cheaper to use the larger company in the first place.

You will also likely find that many smaller recovery companies charge evaluation or diagnostic fees upfront. They charge these fees regardless of whether or not any data is recovered. In many cases clients desperate to recover their data pay these fees and get nothing in return but a failed recovery. Smaller data recovery services simply do not have the skills, training, facilities and parts to successfully recover most disk drives. It is more cost efficient for them to make one attempt at running a software program and then call the case unsuccessful.

Sometimes you may get lucky working with a smaller data recovery company, but in most cases you will end up paying for a failed recovery. In the worst case scenario you could end up with a damaged hard drive that is now unrecoverable by any data recovery service.

You will waste time and money working with these services. You could even lose your valuable data for good.

If your data is important enough to consider data recovery, it is important enough to seek a reputable, skilled data recovery company. All major data recovery services offer free evaluations and most do not charge clients for unsuccessful recoveries. Sometimes you only have one shot to recover data on a disk drive before the platters are seriously damaged and the data is lost for good. Taking chances with inexperienced companies is not worth the risk.



Source: http://ezinearticles.com/?Data-Recovery-Services---Be-Wary-of-Cheap-Prices&id=4706055

Saturday, 22 June 2013

Customer Relationship Management (CRM) Using Data Mining Services


In today's globalized marketplace, customer relationship management (CRM) is deemed a crucial business activity for competing efficiently and outdoing the competition. CRM strategies depend heavily on how effectively you can use customer information to meet customers' needs and expectations, which in turn leads to more profit.

Some basic questions include: what are their specific needs, how satisfied are they with your products or services, and is there scope for improvement in an existing product or service? For a better CRM strategy you need predictive data mining models fueled by the right data and analysis. Let me give you a basic idea of how you can use data mining for your CRM objectives.

Basic process of CRM data mining includes:
1. Define business goal
2. Construct marketing database
3. Analyze data
4. Visualize a model
5. Explore model
6. Set up model & start monitoring

Let me explain the last three steps in detail.

Visualize a Model:
Building a predictive data model is an iterative process. You may need two or three models to discover the one that best suits your business problem. In searching for the right data model you may need to go back, make some changes or even change your problem statement.

In building a model you start with customer data for which the result is already known. For example, you may run a test mailing to discover how many people will reply to your mail. You then divide this information into two groups. On the first group, you build your model; you then apply it to the remaining data to test it. Once you finish the estimation and testing process you are left with a model that best suits your business idea.
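To make the two-group idea concrete, here is a minimal sketch in Java. The customer fields, the 70/30 split and the age-threshold "model" are all hypothetical simplifications: a trivial model is fitted on one group of mailing results and then checked against the remaining records.

import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of the estimation/testing split described above.
public class MailingModelSketch {

    record Customer(int age, boolean responded) {}

    public static void main(String[] args) {
        List<Customer> mailingResults = new ArrayList<>(sampleData());
        Collections.shuffle(mailingResults, new Random(42));

        int split = (int) (mailingResults.size() * 0.7);
        List<Customer> training = mailingResults.subList(0, split);
        List<Customer> holdout  = mailingResults.subList(split, mailingResults.size());

        // "Model": the age threshold that best separates responders in the training group.
        int bestAge = 0;
        double bestScore = -1;
        for (int threshold = 20; threshold <= 70; threshold++) {
            double score = accuracy(training, threshold);
            if (score > bestScore) { bestScore = score; bestAge = threshold; }
        }

        // Apply the model to the remaining (holdout) data to estimate real-world accuracy.
        System.out.printf("Threshold age %d, training accuracy %.2f, holdout accuracy %.2f%n",
                bestAge, bestScore, accuracy(holdout, bestAge));
    }

    static double accuracy(List<Customer> data, int ageThreshold) {
        long correct = data.stream()
                .filter(c -> (c.age() >= ageThreshold) == c.responded())
                .count();
        return (double) correct / data.size();
    }

    static List<Customer> sampleData() {
        Random r = new Random(7);
        return IntStream.range(0, 500)
                .mapToObj(i -> {
                    int age = 20 + r.nextInt(50);
                    boolean responded = r.nextDouble() < (age > 45 ? 0.6 : 0.2);
                    return new Customer(age, responded);
                })
                .collect(Collectors.toList());
    }
}

If the holdout accuracy is close to the training accuracy, the model generalizes; if it drops sharply, you go back and revise the model or the problem statement, exactly as described above.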

Explore Model:
Accuracy is the key in evaluating your outcomes. For example, predictive models acquired through data mining may be combined with the insights of domain experts and used in a large project that serves various kinds of people. The way data mining is used in an application is decided by the nature of the customer interaction; in most cases either the customer contacts you or you contact them.

Set up Model & Start Monitoring:
To analyze customer interactions you need to consider factors like who originated the contact, whether it was a direct or social media campaign, brand awareness of your company, and so on. Then you select a sample of users to be contacted by applying the model to your existing customer database. In the case of advertising campaigns, you match the profiles of potential users discovered by your model to the profile of the users your campaign will reach.

In either case, if the input data contains income, age and gender demographics, but the model demands a gender-to-income or age-to-income ratio, then you need to transform your existing database accordingly.
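A minimal sketch of that kind of transformation, using hypothetical field names, might look like this: derive the ratio feature the model expects from the demographic columns the database already has.

import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: derive the ratio feature the model expects
// from the demographic columns already in the database.
public class FeatureTransformSketch {

    record CustomerRow(int age, double income, String gender) {}
    record ModelInput(double ageToIncomeRatio, String gender) {}

    public static void main(String[] args) {
        List<CustomerRow> existing = List.of(
                new CustomerRow(35, 52000, "F"),
                new CustomerRow(48, 70000, "M"));

        List<ModelInput> transformed = existing.stream()
                .map(r -> new ModelInput(r.age() / r.income(), r.gender()))
                .collect(Collectors.toList());

        transformed.forEach(System.out::println);
    }
}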


Source: http://ezinearticles.com/?Customer-Relationship-Management-%28CRM%29-Using-Data-Mining-Services&id=4641198

Thursday, 20 June 2013

Data Loss Symptoms, Causes, and Implications of Downtime

A number of failures can cause data files to disappear or become corrupted. Symptoms of data loss appear immediately; it causes that panicky sinking feeling in the stomach when previously accessible data is suddenly out of reach. Today more data is being stored in smaller and smaller spaces, with hard drives of 2011 having more than 500 times the capacity of those in 2001. This makes a greater, costlier impact when hardware and software malfunction. Hardware malfunctions alone account for nearly 40% of all data loss.

If the hard drive of a computer isn't spinning or won't work at all, if you hear a scraping or rattling sound, or if an error message lets you know a device is not recognized then the hardware is failing and your data is at risk. You may see file or folder names that are scrambled or disappear. A hard disk may be silent for a long time after you request data by opening a folder or file.

Hard drive damage can be caused by power surges, dust in the computer, crashes, and controller failure. Other problems are caused by human error - accidentally deleted files, damage caused from dropping a device, or spilled liquids. Do-it-yourself repairs by inexperienced people can further destroy the drive and its cargo of important data. An estimated 32% of data loss is caused by human error.

Although virus protection has become increasingly sophisticated, 7% of all data loss is caused by computer viruses. The computer may display strange and unpredictable behavior that gets more and more pronounced, the screen may go blank, or a taunting message may appear announcing the arrival of the malevolent virus within your hard drive. Once infected, the files will need to be processed by a data retrieval company if they are of substantial value.

Backups should be performed routinely, of course, but they don't usually contain all the up-to-date data; the files may already be corrupted, or the hardware and storage media may not be working. Companies rely heavily on their computer systems for accounting, inventory, payroll, and many other time-sensitive activities. Backing up data is critically important but not foolproof, especially if a great amount of data is created daily and some of it is lost.

When the computer systems go down, the operation of a company is bogged down. The potential loss caused by this downtime will motivate business owners to have the best data recovery company they can find to take the case and save the day with advanced technology; established data retrieval services will also have the highest ethics when it comes to handling your confidential information.

Companies that have suffered extensive data loss caused by problems in hard drives, servers, hard disks, tapes and media devices can find consolation in the fact that there's a good chance the data can be retrieved in a short period of time. This allows operations to return to normal after only a few days, reducing the loss of productivity. Daily downtime losses for large companies can run into the millions. The data recovery industry is there to bring their computer operations back to life in as short a time as possible.


Source: http://ezinearticles.com/?Data-Loss-Symptoms,-Causes,-and-Implications-of-Downtime&id=6277522

Wednesday, 19 June 2013

A Cheaper and Effective Solution For Spanish Data Entry Projects

With Spanish spoken by more than 400 million people in 22 countries around the world, the need for Spanish data entry services is growing constantly. While most businesses have in-house providers for their Spanish data entry projects, this proves to be both expensive and time-consuming. A cheaper and better alternative is to outsource Spanish data entry projects to India.
Indian outsourcing companies offering Spanish data entry services employ experienced and certified Spanish language experts who are fluent in the language. To ensure the highest quality of service, outsourcing companies follow the four-step process listed below:

o All data to be entered is captured using OCR (optical character recognition), ICR (intelligent character recognition), MICR (magnetic ink character recognition) and barcode recognition systems in order to minimize mistakes and maximize speed.

o Any additional data that could not be captured in the previous stage is typed out and verified. The captured data is then evaluated by validation and verification experts who check each and every word and mark out any inconsistencies that may appear in the language.

o A certified Spanish language expert proofreads the entire document and cross-checks it with the original. This is done to make sure that there are no errors.

o The processed data is then formatted, arranged and indexed and sent to the client as per their specific requirements.

Having been in the foreign language data entry and transcription industry for more than a decade, Indian companies have the expertise and skill needed to see a project through to completion. Apart from Spanish data entry services, Spanish transcription support is also offered if needed. A few of the services offered by the outsourcing companies are:

o Spanish data entry from hard copies to digital web-based systems

o Spanish data entry from hard/soft copy to any format

o Spanish business document and web-based indexing

o Spanish survey forms entry

o Spanish publications data entry

o Custom data export/import and interfaces with audits

o Data Cleansing of databases in Spanish

o Web Extraction and Data Mining in Spanish

o Creation and Maintenance of Directory Services in Spanish

o Spanish Data Capture and Document Imaging

o Spanish data entry through OCR from images

o Spanish Website Language Translation

With the huge savings that businesses make (sometimes up to 50%), they are able to shift their valuable time, energy and resources towards other core competencies. Indian outsourcing companies are also backed by hi-tech, reliable infrastructure and secure networks to ensure data safety. Outsourcing Spanish data entry services to India gives businesses the added benefits of:

o Cost-effective pricing

o Certified Spanish language experts

o Stringent quality checks

o Round the clock customer support

o Computer-assisted data capture

o State-of-the-art technology

o Quick turnaround time

o Secure and safe networks

The reasonable prices, fast turnaround times and high level of data accuracy have made India the destination of choice for overseas clients in Spain, Latin America, Mexico, Europe and the United States. By outsourcing to India, they gain a much cheaper and more effective solution for their Spanish data entry projects.


Source: http://ezinearticles.com/?A-Cheaper-and-Effective-Solution-For-Spanish-Data-Entry-Projects&id=1558394

Monday, 17 June 2013

Web Data Extraction Services and Data Collection Form Website Pages

For any business, market research and surveys play a crucial role in strategic decision-making. Web scraping and data extraction techniques help you find relevant information and data for your business or personal use. Most of the time, professionals manually copy and paste data from web pages or download a whole website, which wastes time and effort.

Instead, consider using web scraping techniques that crawl through thousands of website pages to extract specific information and simultaneously save it into a database, CSV file, XML file or any other custom format for future reference.
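As an illustrative sketch of that extract-and-save step (using the Jsoup library described later on this blog; the URL and the .product, .name and .price selectors are hypothetical), the following program pulls product names and prices from a listing page and writes them to a CSV file:

import java.io.FileWriter;
import java.io.PrintWriter;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical sketch: scrape product data from one page and write it to CSV.
public class ProductScraperSketch {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.example.com/products")  // placeholder URL
                .userAgent("Mozilla/5.0")
                .timeout(10000)
                .get();

        try (PrintWriter csv = new PrintWriter(new FileWriter("products.csv"))) {
            csv.println("name,price");
            // The selectors below are assumptions; adjust them to the target site's HTML.
            for (Element product : doc.select(".product")) {
                String name  = product.select(".name").text();
                String price = product.select(".price").text();
                csv.println(name.replace(",", " ") + "," + price);
            }
        }
    }
}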

Examples of web data extraction process include:
• Spider a government portal, extracting names of citizens for a survey
• Crawl competitor websites for product pricing and feature data
• Use web scraping to download images from a stock photography site for website design

Automated Data Collection
Web scraping also allows you to monitor website data changes over a stipulated period and to collect the data automatically on a scheduled basis. Automated data collection helps you discover market trends, determine user behavior and predict how data will change in the near future.

Examples of automated data collection include:
• Monitor price information for select stocks on an hourly basis
• Collect mortgage rates from various financial firms on a daily basis
• Check weather reports on a constant basis as and when required

Using web data extraction services you can mine any data related to your business objective and download it into a spreadsheet so that it can be analyzed and compared with ease.

In this way you get accurate and quicker results saving hundreds of man-hours and money!

With web data extraction services you can easily fetch product pricing information, sales leads, mailing databases, competitors' data, profile data and much more on a consistent basis.


Source: http://ezinearticles.com/?Web-Data-Extraction-Services-and-Data-Collection-Form-Website-Pages&id=4860417

Friday, 14 June 2013

Various Data Mining Techniques

Also called Knowledge Discovery in Databases (KDD), data mining is the process of automatically sifting through large volumes of data for patterns, using tools such as clustering, classification, association rule mining and many more. There are several major data mining techniques developed and known today, and this article will briefly cover them, along with tools for increased efficiency, including phone look up services.

Classification is a classic data mining technique. Based on machine learning, it is used to classify each item in a data set into one of a predefined set of groups or classes. This method uses mathematical techniques such as linear programming, decision trees, neural networks and statistics. For instance, you can apply this technique in an application that predicts which current employees will most probably leave in the future, based on the past records of those who have resigned or left the company.

Association is one of the most used techniques; a pattern is discovered based on a relationship between a specific item and other items within the same transaction. Market basket analysis, for example, uses association to figure out what products or services are purchased together by clients. Businesses use the data produced to devise their marketing campaigns.
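A minimal sketch of the co-occurrence counting behind market basket analysis (the transactions below are made up for illustration): pairs that appear together in many baskets are candidates for association rules.

import java.util.*;

// Illustrative sketch: count how often pairs of products appear in the same basket.
public class MarketBasketSketch {
    public static void main(String[] args) {
        List<Set<String>> transactions = List.of(
                Set.of("bread", "milk"),
                Set.of("bread", "diapers", "beer"),
                Set.of("milk", "diapers", "beer"),
                Set.of("bread", "milk", "diapers", "beer"),
                Set.of("bread", "milk", "diapers"));

        Map<String, Integer> pairCounts = new TreeMap<>();
        for (Set<String> basket : transactions) {
            List<String> items = new ArrayList<>(basket);
            Collections.sort(items);
            for (int i = 0; i < items.size(); i++)
                for (int j = i + 1; j < items.size(); j++)
                    pairCounts.merge(items.get(i) + " + " + items.get(j), 1, Integer::sum);
        }

        // Frequently co-occurring pairs suggest products that are often bought together.
        pairCounts.forEach((pair, count) -> System.out.println(pair + " : " + count));
    }
}

Real association rule miners such as Apriori add support and confidence thresholds on top of this counting idea.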

Sequential pattern mining, too, aims to discover similar patterns in transaction data over a given business phase or period. These findings are used in business analysis to see relationships among data.

Clustering builds useful groups of objects that share similar characteristics, using an automatic method. While classification assigns objects to predefined classes, clustering defines the classes and puts objects in them. Prediction, on the other hand, is a technique that digs into the relationships between independent variables and between dependent and independent variables. It can be used to predict future profits - for example, a regression curve fitted to historical sales and profit data can be used for profit prediction.

Of course, it is highly important to have high-quality data for all of these data mining techniques. A multi-database web service, for instance, can be incorporated to provide the most accurate telephone number lookup. It delivers real-time access to a range of public, private and proprietary telephone data. This type of phone look up service is fast becoming a de facto standard for cleaning data, and it communicates directly with telco data sources as well.

Phone number look up web services - just like lead, name, and address validation services - help make sure that information is always fresh, up-to-date, and in the best shape for data mining techniques to be applied.

Equip your business with better leads and get better conversion rates by using phone look up and similar real-time web services.



Source: http://ezinearticles.com/?Various-Data-Mining-Techniques&id=6985662

Wednesday, 12 June 2013

Data Discovery vs. Data Extraction

Looking at screen-scraping at a simplified level, there are two primary stages involved: data discovery and data extraction. Data discovery deals with navigating a web site to arrive at the pages containing the data you want, and data extraction deals with actually pulling that data off of those pages. Generally when people think of screen-scraping they focus on the data extraction portion of the process, but my experience has been that data discovery is often the more difficult of the two.

The data discovery step in screen-scraping might be as simple as requesting a single URL. For example, you might just need to go to the home page of a site and extract out the latest news headlines. On the other side of the spectrum, data discovery may involve logging in to a web site, traversing a series of pages in order to get needed cookies, submitting a POST request on a search form, traversing through search results pages, and finally following all of the "details" links within the search results pages to get to the data you're actually after. In cases of the former a simple Perl script would often work just fine. For anything much more complex than that, though, a commercial screen-scraping tool can be an incredible time-saver. Especially for sites that require logging in, writing code to handle screen-scraping can be a nightmare when it comes to dealing with cookies and such.

In the data extraction phase you've already arrived at the page containing the data you're interested in, and you now need to pull it out of the HTML. Traditionally this has involved creating a series of regular expressions that match the pieces of the page you want (e.g., URLs and link titles). Regular expressions can be a bit complex to deal with, so most screen-scraping applications will hide these details from you, even though they may use regular expressions behind the scenes.
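For a sense of what that traditional approach looks like, here is a minimal sketch (the HTML snippet is invented, and real pages are far messier, which is exactly why dedicated tools hide this step): a regular expression that pulls link URLs and titles out of HTML.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: extract link URLs and titles from an HTML snippet with a regex.
public class LinkExtractorSketch {
    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">First story</a>"
                    + "<a href='http://example.com/b'>Second story</a>";

        Pattern link = Pattern.compile(
                "<a\\s+href=[\"']([^\"']+)[\"'][^>]*>(.*?)</a>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        Matcher m = link.matcher(html);
        while (m.find()) {
            System.out.println(m.group(2) + " -> " + m.group(1));
        }
    }
}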

As an addendum, I should probably mention a third phase that is often ignored, and that is, what do you do with the data once you've extracted it? Common examples include writing the data to a CSV or XML file, or saving it to a database. In the case of a live web site you might even scrape the information and display it in the user's web browser in real-time. When shopping around for a screen-scraping tool you should make sure that it gives you the flexibility you need to work with the data once it's been extracted.



Source: http://ezinearticles.com/?Data-Discovery-vs.-Data-Extraction&id=165396

Monday, 10 June 2013

Data Mining


Data mining is the retrieval of hidden information from data using algorithms. It helps extract useful information from great masses of data, which can be used to make practical interpretations for business decision-making. It is basically a technical and mathematical process that involves the use of software and specially designed programs. Data mining is thus also known as Knowledge Discovery in Databases (KDD), since it involves searching for implicit information in large databases. The main kinds of data mining software are: clustering and segmentation software, statistical analysis software, text analysis, mining and information retrieval software, and visualization software.

Data mining is gaining a lot of importance because of its vast applicability. It is being used increasingly in business applications for understanding and then predicting valuable information, like customer buying behavior and buying trends, profiles of customers, industry analysis, etc. It is basically an extension of some statistical methods like regression. However, the use of some advanced technologies makes it a decision making tool as well. Some advanced data mining tools can perform database integration, automated model scoring, exporting models to other applications, business templates, incorporating financial information, computing target columns, and more.

Some of the main applications of data mining are in direct marketing, e-commerce, customer relationship management, healthcare, the oil and gas industry, scientific tests, genetics, telecommunications, financial services and utilities. The different kinds of data mining include: text mining, web mining, social network data mining, relational database mining, pictorial data mining, audio data mining and video data mining.

Some of the most popular data mining tools are: decision trees, information gain, probability, probability density functions, Gaussians, maximum likelihood estimation, Gaussian Bayes classification, cross-validation, neural networks, instance-based/case-based/memory-based/non-parametric learning, regression algorithms, Bayesian networks, Gaussian mixture models, K-Means and hierarchical clustering, Markov models, support vector machines, game tree search and alpha-beta search algorithms, game theory, artificial intelligence, A-star heuristic search, hill climbing, simulated annealing and genetic algorithms.

Some popular data mining software includes: Connexor Machines, Copernic Summarizer, Corpora, DocMINER, DolphinSearch, dtSearch, DS Dataset, Enkata, Entrieva, Files Search Assistant, FreeText Software Technologies, Intellexer, Insightful InFact, Inxight, ISYS:desktop, Klarity (part of Intology tools), Leximancer, Lextek Onix Toolkit, Lextek Profiling Engine, Megaputer Text Analyst, Monarch, Recommind MindServer, SAS Text Miner, SPSS LexiQuest, SPSS Text Mining for Clementine, Temis-Group, TeSSI®, Textalyser, TextPipe Pro, TextQuest, Readware, Quenza, VantagePoint, VisualText(TM), by TextAI, Wordstat. There is also free software and shareware such as INTEXT, S-EM (Spy-EM), and Vivisimo/Clusty.



Source: http://ezinearticles.com/?Data-Mining&id=196652

Thursday, 6 June 2013

Usefulness of Web Scraping Services

For any business or organization, surveys and market research play important roles in the strategic decision-making process. Data extraction and web scraping techniques are important tools for finding relevant data and information for your personal or business use. Many companies employ people to copy and paste data manually from web pages. This process is reliable but very costly, as it wastes time and effort: the data collected is small compared to the resources and time spent gathering it.

Nowadays, various data mining companies have developed effective web scraping techniques that can crawl over thousands of websites and their pages to harvest particular information. The information extracted is then stored in a CSV file, database, XML file or any other source with the required format. After the data has been collected and stored, the data mining process can be used to extract the hidden patterns and trends contained in the data. By understanding the correlations and patterns in the data, policies can be formulated, thereby aiding the decision-making process. The information can also be stored for future reference.

The following are some of the common examples of data extraction process:

• Scraping a government portal to extract the names of citizens for a given survey
• Scraping competitor websites for feature data and product pricing
• Using web scraping to download videos and images for a stock photography site or for website design

Automated Data Collection
It is important to note that the web scraping process allows a company to monitor website data changes over a given time frame and to collect the data automatically on a routine basis. Automated data collection techniques are quite important as they help companies discover customer trends and market trends. By determining market trends, it is possible to understand customer behavior and predict how the data is likely to change.
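A minimal sketch of such scheduled collection (the URL and the .price selector are placeholders), using the Jsoup library covered elsewhere on this blog together with a standard Java scheduler: the same page is re-scraped every hour and the extracted value is logged with a timestamp.

import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.jsoup.Jsoup;

// Illustrative sketch: automated, scheduled data collection from a single page.
public class ScheduledScraperSketch {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        Runnable collect = () -> {
            try {
                String price = Jsoup.connect("http://www.example.com/stock/XYZ")  // placeholder URL
                        .userAgent("Mozilla/5.0")
                        .get()
                        .select(".price")   // assumed selector
                        .text();
                System.out.println(Instant.now() + "  XYZ price: " + price);
            } catch (Exception e) {
                e.printStackTrace();      // in real use, log and retry
            }
        };

        // Run immediately, then once every hour.
        scheduler.scheduleAtFixedRate(collect, 0, 1, TimeUnit.HOURS);
    }
}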

The following are some of the examples of the automated data collection:

• Monitoring price information for particular stocks on an hourly basis
• Collecting mortgage rates from various financial institutions on a daily basis
• Checking weather reports on a regular basis as required

By using web scraping services it is possible to extract any data that is related to your business. The data can then be downloaded into a spreadsheet or a database to be analyzed and compared. Storing the data in a database or in a required format makes it easier to interpret it, understand the correlations and identify the hidden patterns.

Through web scraping it is possible to get quicker and more accurate results, saving many resources in terms of money and time. With data extraction services, it is possible to fetch information about pricing, mailing databases, profile data and competitors' data on a consistent basis. With the emergence of professional data mining companies, outsourcing this work will greatly reduce your costs while still assuring high-quality service.


Source: http://ezinearticles.com/?Usefulness-of-Web-Scraping-Services&id=7181014

Tuesday, 4 June 2013

Web scraping in Java with Jsoup, Part 2 (How-to)

Web scraping refers to programmatically downloading a page and traversing its DOM to extract the data you are interested in. I wrote a parser class in Java to perform the web scraping for my blog analyzer project. In Part 1 of this how-to I explained how I set up the calling mechanism for executing the parser against blog URLs. Here, I explain the parser class itself.

But before getting into the code, it is important to take note of the HTML structure of the document that will be parsed. The pages of The Dish are quite heavy, full of menus, JavaScript and other material, but the area of interest is the set of blog posts themselves. This example shows the HTML structure of each blog post on The Dish:
   
<article>
    <aside>
        <ul class="entryActions" id="meta-6a00d83451c45669e2014e885e4354970d">
            <li class="entryEmail ir">
                <div class="st_email_custom maildiv" st_url="http://andrewsullivan.thedailybeast.com/2011/05/fac-5.html" st_title="Face Of The Day">email</div>
            </li>
            <li class="entryLink ir">
                <a href="http://andrewsullivan.thedailybeast.com/2011/05/fac-5.html" title="permalink this entry">permalink</a>
            </li>
            <li class="entryTweet"></li>
            <li class="entryLike"></li>
        </ul>

        <time datetime="2011-05-12T23:37:00-4:00" pubdate>12 May 2011 07:37 PM</time>
    </aside>

    <div class="entry">
        <h1>
            <a href="http://andrewsullivan.thedailybeast.com/2011/05/fac-5.html">Face Of The Day</a>
        </h1>
        <p>
            <a href="http://dailydish.typepad.com/.a/6a00d83451c45669e2014e885e4233970d-popup" onclick="window.open( this.href, &#39;_blank&#39;, &#39;width=640,height=480,scrollbars=no,resizable=no,toolbar=no,directories=no,location=no,menubar=no,status=no,left=0,top=0&#39; ); return false" style="display: inline;">
                <img alt="GT_WWII-VET-JEWISH-110511" class="asset  asset-image at-xid-6a00d83451c45669e2014e885e4233970d" src="http://dailydish.typepad.com/.a/6a00d83451c45669e2014e885e4233970d-550wi" style="width: 515px;" title="GT_WWII-VET-JEWISH-110511" />
            </a>
        </p>
        <p>
        A decorated  veteran takes part [truncated]
        </p>
    </div>
</article>

Blog posts are each contained within an HTML5 article tag. There is a time tag holding the date and time the post was published. A div with class entry holds both the title and body of the post. The title is within an h1, which also contains the permalink for the post.

Now, the code to parse this page.

The simple blog parser interface again:
   
public interface BlogParser {
    public List<Link> parseURL(URL url) throws ParseException;
}

Now to talk about the implementation class: DishBlogParser. The goal is to return a list of Link objects (a “Link” in this context represents one blog URL and its associated data). DishBlogParser will extract the title and body text of each blog post along with the post date, images, videos, and links contained therein. I’ll go through the class a section at a time. Starting from the top:
   
@Component("blogParser")
public class DishBlogParser implements BlogParser {

    @Value("${config.excerptLength}")
    private int excerptLength;
    @Autowired
    private DateTimeFormatter blogDateFormat;
    private final Cleaner cleaner;
    private final UrlValidator urlvalidator;

    public DishBlogParser() {
        Whitelist clean = Whitelist.simpleText().addTags("blockquote", "cite", "code", "p", "q", "s", "strike");
        cleaner = new Cleaner(clean);
        urlvalidator = new UrlValidator(new String[]{"http","https"});
    }

The excerptLength field defines the maximum length for post body excerpts. The @Value annotation pulls in the value from a properties file configured in applicationContext.xml.

The blogDateFormat is a Joda formatter configured also in applicationContext.xml to match the date/time format used on The Dish. It will be used to parse dates from HTML into Joda DateTime objects. Here is how blogDateFormat is configured in applicationContext.xml:

<bean id="blogDateFormat"
         class="org.joda.time.format.DateTimeFormat"
         factory-method="forPattern">
    <constructor-arg value="dd MMM yyyy hh:mm aa"/>
</bean>

The Cleaner object is a Jsoup class that applies a whitelist filter to HTML. In this case, the cleaner is used to whitelist tags that will be allowed to appear in blog body excerpts.

Finally, the UrlValidator comes from Apache Commons and will be used to validate the syntax of URLs contained within blog posts.

Now, for the parseURL method:
   
public List<Link> parseURL(URL url) throws ParseException {
    try {
        // retrieve the document using Jsoup
        Connection conn = Jsoup.connect(url.toString());
        conn.timeout(12000);
        conn.userAgent("Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)");
        Document doc = conn.get();

        // select all article tags
        Elements posts = doc.select("article");

        // base URI will be used within the loop below
        String baseUri = (new StringBuilder())
            .append(url.getProtocol())
            .append("://")
            .append(url.getHost())
            .toString();

        // initialize a list of Links
        List<Link> links = new ArrayList<Link>();

Here, Jsoup is used to connect to the URL. I set a generous connection timeout, because at times The Dish server is not very snappy. I also set a common user agent, just as a general practice when requesting a web page programmatically.

The call to conn.get() retrieves the Document, which is a DOM representation of the entire page. For this project, only the blog posts themselves are needed. Because each blog post is contained in an article tag, the set of posts is obtained by calling doc.select("article"). We’re about to loop through them, but first we need to define the base URI of our URL for something a bit further down, and also initialize the List which will hold our extracted Link objects.

Now, the loop. It starts like this:
   
// loop through, extracting relevant data
for (Element post : posts) {
    Link link = new Link();

    // extract the title of the post
    Elements elms = post.select(".entry h1");
    String title = (elms.isEmpty() ? "No Title" : elms.first().text().trim());
    link.setTitle(title);

First, an empty Link object is initialized. Then we extract the title. Recall that “post” is a Jsoup element pointing to the article tag in the DOM. post.select(".entry h1") grabs the h1 title tag, from which we get the title string.

In a similar fashion, we grab the URL and the date:
   
// extract the URL of the post
elms = post.select("aside .entryLink a");
if (elms.isEmpty()) {
    Logger.getLogger(DishBlogParser.class.getName()).log(Level.WARNING, "UNABLE TO LOCATE PERMALINK, TITLE = "+ title +", URL = "+ url);
    continue;
}
link.setUrl(elms.first().attr("href"));

// extract the date of the post
elms = post.select("aside time");
if (elms.isEmpty()) {
    Logger.getLogger(DishBlogParser.class.getName()).log(Level.WARNING, "UNABLE TO LOCATE DATE, TITLE = "+ title +", URL = "+ url);
    continue;
}
// parse the date string into a Joda DateTime object
DateTime dt = blogDateFormat.parseDateTime(elms.first().text().trim());
link.setLinkDate(dt);

Note that failure to extract the URL or date is unacceptable: a warning is logged and the rest of the processing for that post is skipped. Note also that blogDateFormat is used to parse the date string from the HTML into a DateTime object.

Next, let’s grab the body of the post and create an excerpt from it:
   
// extract the body of the post (includes title tag at this point)
Elements body = post.select(".entry");
// remove the "more" link
body.select(".moreLink").remove();

// remove the title (h1) now from the body
body.select("h1").remove();
// set full text on link, used for indexing/searching (not stored)
link.setFullText(body.text());

// create a body "Document"
Document bodyDoc = Document.createShell(baseUri);
for (Element bodyEl : body)
    bodyDoc.body().appendChild(bodyEl);
// remove unwanted tags by applying a tag whitelist
// the whitelisted tags will appear when displaying excerpts
String bodyhtml = cleaner.clean(bodyDoc).body().html();

if (bodyhtml.length() > excerptLength) {
    // we need to trim it down to excerptLength
    bodyhtml = trimExcerpt(bodyhtml, excerptLength);
    // we need to parse this again now to fix possible unclosed tags caused by trimming
    bodyhtml = Jsoup.parseBodyFragment(bodyhtml).body().html();
}
link.setExerpt(bodyhtml);

Recall the body is contained in a div classed entry. The body may contain a “read on” link that expands the content. That link, if present, is removed by selecting .moreLink and calling remove(). The title h1 tag is also removed, and the remaining text is stored with setFullText. This full text is not destined to be stored in the database; instead it will be indexed by our search engine.

To create the excerpt, unwanted HTML tags must be removed. This is where the Jsoup Cleaner comes in. Because the Cleaner only processes Document objects, a dummy Document is created for the post (this is also where baseUri is used).

If, after processing the post body through the Cleaner, the length exceeds the excerptLength, it must be trimmed down to size. The trimExcerpt method does this. Because trimming might truncate closing HTML tags, Jsoup is used once more to parse the excerpt string, correcting any unbalanced tags. Finally, we have our excerpt.

This is the trimExcerpt method called above:
   
private String trimExcerpt(String str, int maxLen) {
    if (str.length() <= maxLen)
        return str;

    int endIdx = maxLen;
    while (endIdx > 0 && str.charAt(endIdx) != ' ')
        endIdx--;

    return str.substring(0, endIdx);
}

The idea is to use maxLen as a suggestion, and keep backing up until a space character is found. In this way, words will not be cut off in the middle.

Continuing the loop, next the links are extracted. They are represented by InnerLink objects. Any invalid or self links are skipped.
   
// extract the links within the post
List<InnerLink> inlinks = new ArrayList<InnerLink>();
Elements innerlinks = body.select("a[href]");              

// loop through each link, discarding self-links and invalids
for (Element innerlink : innerlinks) {
    String linkUrl = innerlink.attr("abs:href").trim();
    if (linkUrl.equals(link.getUrl()))
        continue;
    else if (urlvalidator.isValid(linkUrl)) {
        //System.out.println("link = "+ linkUrl);
        InnerLink inlink = new InnerLink();
        inlink.setUrl(linkUrl);
        inlinks.add(inlink);
    }
    else
        Logger.getLogger(DishBlogParser.class.getName()).log(Level.INFO, "INVALID URL: "+ linkUrl);
}
link.setInnerLinks(inlinks);

Next, extract any images:

   
// extract the images from the post
List<Image> linkimgs = new ArrayList<Image>();
Elements images = body.select("img");
for (Element image : images) {
    Image img = new Image();
    img.setOrigUrl(image.attr("src"));
    img.setAltText(image.attr("alt").replaceAll("_", " "));
    linkimgs.add(img);
}
link.setImages(linkimgs);

Finally, extract any Youtube or Vimeo videos (the two most popular types). Note that this requires a more complex selector syntax, in particular because several different HTML embed codes have been used over the years:

// extract Youtube and Vimeo videos from the post
elms = body.select("iframe[src~=(youtube\\.com|vimeo\\.com)], object[data~=(youtube\\.com|vimeo\\.com)], embed[src~=(youtube\\.com|vimeo\\.com)]");
List<Video> videos = new ArrayList<Video>(2);
for (Element video : elms) {
    String vidurl = video.attr("src");
    // Jsoup's attr() returns an empty string (not null) when the attribute is missing,
    // so fall back to the data attribute when src is absent or blank.
    if (vidurl == null || vidurl.trim().equals(""))
        vidurl = video.attr("data");
    if (vidurl == null || vidurl.trim().equals(""))
        continue;
    Video vid = new Video();
    vid.setUrl(vidurl);
    if (vidurl.toLowerCase().contains("vimeo.com"))
        vid.setProvider(VideoProvider.VIMEO);
    else
        vid.setProvider(VideoProvider.YOUTUBE);
    videos.add(vid);
}
link.setVideos(videos);

Finally, the loop is finished; all data has been gathered. So this Link object is added to the List, end loop, and return:

            links.add(link);
        }
        return links;
    }
    catch (IOException ex) {
        Logger.getLogger(DishBlogParser.class.getName()).log(Level.SEVERE, "IOException when attempting to parse URL "+ url, ex);
        throw new ParseException(ex);
    }
}

In conclusion…

This post has demonstrated web scraping using the open-source Jsoup library. Specifically, we loaded a page from a URL and used Jsoup’s selector syntax to extract the desired pieces of data. In a future post, I will write about what happens next: the list of Links is processed by a service bean and stored in the database.


Source: http://www.gotoquiz.com/web-coding/programming/java-programming/web-scraping-in-java-with-jsoup-part-2-how-to/