Groupon Data Scraping: 2014

Wednesday, 31 December 2014

Data Extraction, Web Screen Scraping Tool, Mozenda Scraper

Web Scraping

Web scraping, also known as Web data extraction or Web harvesting, is a software method of extracting data from websites. Web scraping is closely related and similar to Web indexing, which indexes Web content. Web indexing is the method used by most search engines. The difference with Web scraping is that it focuses more on the translation of unstructured content on the Web, characteristically in rich text format like that of HTML, into controlled data that can be analyzed stored and in a spreadsheet or database. Web scraping also makes Web browsing more efficient and productive for users. For example, Web scraping automates weather data monitoring, online price comparison, and website change recognition and data integration.

This clever method that uses specially coded software programs is also used by public agencies. Government operations and Law enforcement authorities use data scrape methods to develop information files useful against crime and evaluation of criminal behaviors. Medical industry researchers get the benefit and use of Web scraping to gather up data and analyze statistics concerning diseases such as AIDS and the most recent strain of influenza like the recent swine flu H1N1 epidemic.

Data scraping is an automatic task performed by a software program that extracts data output from another program, one that is more individual friendly. Data scraping is a helpful device for programmers who have to generate a line through a legacy system when it is no longer reachable with up to date hardware. The data generated with the use of data scraping takes information from something that was planned for use by an end user.

One of the top providers of Web Scraping software, Mozenda, is a Software as a Service company that provides many kinds of users the ability to affordably and simply extract and administer web data. Using Mozenda, individuals will be able to set up agents that regularly extract data then store this data and finally publish the data to numerous locations. Once data is in the Mozenda system, individuals may format and repurpose data and use it in other applications or just use it as intelligence. All data in the Mozenda system is safe and sound and is hosted in a class A data warehouses and may be accessed by users over the internet safely through the Mozenda Web Console.

One other comparative software is called the Djuggler. The Djuggler is used for creating web scrapers and harvesting competitive intelligence and marketing data sought out on the web. With Dijuggles, scripts from a Web scraper may be stored in a format ready for quick use. The adaptable actions supported by the Djuggler software allows for data extraction from all kinds of webpages including dynamic AJAX, pages tucked behind a login, complicated unstructured HTML pages, and much more. This software can also export the information to a variety of formats including Excel and other database programs.

Web scraping software is a ground-breaking device that makes gathering a large amount of information fairly trouble free. The program has many implications for any person or companies who have the need to search for comparable information from a variety of places on the web and place the data into a usable context. This method of finding widespread data in a short amount of time is relatively easy and very cost effective. Web scraping software is used every day for business applications, in the medical industry, for meteorology purposes, law enforcement, and government agencies.

Source:http://www.articlesbase.com/databases-articles/data-extraction-web-screen-scraping-tool-mozenda-scraper-3568330.html

Saturday, 27 December 2014

So What Exactly Is A Private Data Scraping Services To Use You?

If your computer connects to the Internet or resources on the request for this information, and queries to different servers. If you have a website to introduce to the site server recognizes your computer's IP address and displays the data and much more. Many e - commerce sites use to log your IP address, and the browsing patterns for marketing purposes.

Related Articles

Follow Some Tips For Data Scraping Services

Web Data Scraping Assuring Scraping Success Proxy Data Services

Data Scraping Services with Proxy Data Scraping

Web Data Extraction Services for Data Collection - Screen Scrapping Services, Data Mining Services

The Scraping server you connect to your destination or to process your information and make a filter. For example, IP address or protocol filtering traffic through a Scraping service. As you might guess, there are many types of Scraping services. including the ability to a high demand for the software. Email messages are quickly sent to businesses and companies to help you search for contacts.

Although there are Sanding free Scraping IP addresses in this way can work, the use of payment services, and automatic user interface (plug and play) are easy to give. Scraping web information services, thus offering a variety of relevant sources of data. Scraping information service organizations are generally used where large amounts of data every day. It is possible for you to receive efficient, high precision is also affordable.

Information on the various strategies that companies, Scraping excellent information services, and use the structure planned out and has led to the introduction of more rapid relief of the Earth.

In addition, the application software that has flexibility as a priority. In addition, there is a software that can be tailored to the needs of customers, and satisfy various customer requirements play a major role. Particular software, allows businesses to sell, a customer provides the features necessary to provide the best experience.

If you do not use a private Data Scraping Services suggest that you immediately start your Internet marketing. It is an inexpensive but vital to your marketing company. To choose how to set up a private Scraping service, visit my blog for more information. Data Scraping Services software as the activity data and provides a large amount of information, Sorting. In this way, the company reduced the cost and time savings and greater return on investment will be a concept.

Without the steady stream of data from these sites to get stopped? Scraping HTML page requests sent by argument on the web server, depending on changes in production, it is very likely to break their staff.

Data Scraping Services is common in the respective outsourcing company. Many companies outsource Data Scraping Services service companies are increasingly outsourcing these services, and generally dealing with the Internet business-related activities, in particular a lot of money, can earn.

Web Data Scraping Services, pull information from a structured plan format. Informal or semi-structured data source from the source.They are there to just work on your own server to extract data to execute. IP blocking is not a problem for them when they switch servers in minutes and back on track, scraping exercise. Try this service and you'll see what I mean.

It is an inexpensive but vital to your marketing company. To choose how to set up a private Scraping service, visit my blog for more information. Data Scraping Services software as the activity data and provides a large amount of information, Sorting. In this way, the company reduced the cost and time savings and greater return on investment will be a concept.

Source:http://www.articlesbase.com/outsourcing-articles/so-what-exactly-is-a-private-data-scraping-services-to-use-you-5587140.html

Wednesday, 24 December 2014

Data Mining for Dollars

The more you know, the more you're aware you could be saving. And the deeper you dig, the richer the reward.

That's today's data mining capsulation of your realization: awareness of cost-saving options amid logistical obligations.

According to global trade group Association for Information and Image Management (AIIM), fewer than 25% of organizations in North America and Europe are currently utilizing captured data as part of their business process. With high ease and low cost associated with utilization of their information, this unawareness is shocking. And costly.

Shippers - you're in prime position to benefit the most by data mining and assessing your electronically-captured billing records, by utilizing a freight bill processing provider, to realize and receive significant savings.

Whatever your volume, the more you know about your transportation options, throughout all modes, the easier it is to ship smarter and save. A freight bill processor is able to offer insight capable of saving you 5% - 15% annually on your transportation expenditures.

The University of California - Los Angeles states that data mining is the process of analyzing data from different perspectives and summarizing it into useful information - knowledge that can be used to increase revenue, cuts costs, or both. Data mining software is an analytical tool that allows investigation of data from many different dimensions, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations among dozens of fields in large relational databases. Practically, it leads you to noticeable shipping savings.

Data mining and subsequent reporting of shipping activity will yield discovery of timely, actionable information that empowers you to make the best logistics decisions based on carrier options, along with associated routes, rates and fees. This function also provides a deeper understanding of trends, opportunities, weaknesses and threats. Exploration of pertinent data, in any combination over any time period, enables you the operational and financial view of your functional flow, ultimately providing you significant cost savings.

With data mining, you can create a report based on a radius from a ship point, or identify opportunities for service or modal shifts, providing insight regarding carrier usage by lane, volume, average cost per pound, shipment size and service type. Performance can be measured based on overall shipping expenditures, variances from trends in costs, volumes and accessorial charges.

The easiest way to get into data mining of your transportation information is to form an alliance with a freight bill processor that provides this independent analytical tool, and utilize their unbiased technologies and related abilities to make shipping decisions that'll enable you to ship smarter and save.

Source:http://ezinearticles.com/?Data-Mining-for-Dollars&id=7061178

Monday, 22 December 2014

Scraping Fantasy Football Projections from the Web

In this post, I show how to download fantasy football projections from the web using R. In prior posts, I showed how to scrape projections from ESPN, CBS, NFL.com, and FantasyPros. In this post, I compile the R scripts for scraping projections from these sites, in addition to the following sites: Accuscore, Fantasy Football Nerd, FantasySharks, FFtoday, Footballguys, FOX Sports, WalterFootball, and Yahoo.

Why Scrape Projections?

Scraping projections from multiple sources on the web allows us to automate importing the projections with a simple script. Automation makes importing more efficient so we don’t have to manually download the projections whenever they’re updated. Once we import all of the projections, there’s a lot we can do with them, like:

•    Determine who has the most accurate projections
•    Calculate projections for your league
•    Calculate players’ risk levels
•    Calculate players’ value over replacement
•    Identify sleepers
•    Calculate the highest value you should bid on a player in an auction draft
•    Draft the best starting lineup
•    Win your auction draft
•    Win your snake draft

The R Scripts

To scrape the projections from the websites, I use the readHTMLTable function from the XML package in R. Here’s an example of how to scrape projections from FantasyPros:

1 2 3 4 5 6 7 8

#Load libraries

library("XML")

#Download fantasy football projections from FantasyPros.com

qb_fp <- readHTMLTable("http://www.fantasypros.com/nfl/projections/qb.php", stringsAsFactors = FALSE)$data

rb_fp <- readHTMLTable("http://www.fantasypros.com/nfl/projections/rb.php", stringsAsFactors = FALSE)$data

wr_fp <- readHTMLTable("http://www.fantasypros.com/nfl/projections/wr.php", stringsAsFactors = FALSE)$data

te_fp <- readHTMLTable("http://www.fantasypros.com/nfl/projections/te.php", stringsAsFactors = FALSE)$data

view raw FantasyPros projections hosted with ❤ by GitHub

The R Scripts for scraping the different sources are located below:

1.    Accuscore
2.    CBS - Jamey Eisenberg
3.    CBS – Dave Richard
4.    CBS – Average
5.    ESPN
6.    Fantasy Football Nerd
7.    FantasyPros
8.    FantasySharks
9.    FFtoday
10.    Footballguys – David Dodds
11.    Footballguys – Bob Henry
12.    Footballguys – Maurile Tremblay
13.    Footballguys – Jason Wood
14.    FOX Sports
15.    NFL.com
16.    WalterFootball
17.    Yahoo

Density Plot

Below is a density plot of the projections from the different sources:Calculate projections

Conclusion

Scraping projections from the web is fast, easy, and automated with R. Once you’ve downloaded the projections, there’s so much you can do with the data to help you win your league! Let me know in the comments if there are other sources you want included (please provide a link).

Source:http://fantasyfootballanalytics.net/2014/06/scraping-fantasy-football-projections.html

Thursday, 18 December 2014

Extractions and Skin Care

As an esthetician or skin care professional, you may have heard some controversy over the matter of performing extractions during a routine facial service. What may seem like a relatively simple procedure can actually raise great controversy in the world of esthetics. Some estheticians regard extractions as a matter of providing a complete service while others see this as inflicting trauma to the skin. Learning more about both sides of the issue can help you as a professional in making an informed decision and explaining the issue to your clients.

What is an extraction?

As a basic review, an extraction is removing impurity (plug of dead skin or oil) from a pore or pimple. It is the removal of both blackheads and whiteheads from the skin. Extractions occur after the skin has been thoroughly cleansed, exfoliated and sometimes steamed to soften the area prior to extraction.

Why Do It?

Extractions are considered a "must" by many estheticians when performing a routine facial because they want to leave their clients skin looking and feeling it's best. When done correctly, a simple extraction should be quick and relatively painless. As a trained esthetician it is important to know if your client has sensitive skin which would make them more prone to the damage that can be caused by extractions.

Why Not?

Extractions should only be performed by a trained esthetician and should not be done in excess. Extractions can cause broken capillaries or sin irritations that can lead to more (not less) breakouts. Extractions can also cause discomfort for your client when done incorrectly so you should seek their permission before performing any type of extraction during their facial. Remember your client has the right to know any product or procedure being performed on their skin and make an informed choice.

Who Decides?

As an esthetician it may be entirely up to you or it may be a procedure within your salon to do or not do extractions. It is important to check the guidelines of your employer and know their policies before performing any procedure. Remember to explain extractions and their benefits and possible complications to your client. Trust is an important part of any relationship and your client needs to know you are being open and honest with them. The last thing you want as a professional is a reputation for inflicting unnecessary and unwanted procedures or damage to your client's skin.

Bellanina Institute's owner and director, Nina Howard, is a multi-talented, forward-thinking entrepreneur who has built the Bellanina brand form the ground up to a successful million-dollar spa, spa training business, and skin care product line. Nina is a Licensed Esthetician with Para-Medical studies, Massage Therapist, Polarity Therapist, Skin Care Educator, Artist, and Professional Interior Designer.

Source:http://ezinearticles.com/?Extractions-and-Skin-Care&id=5271715

Thursday, 11 December 2014

A quick guide on web scraping: Why and how

Web scraping, which is the collection and cleaning of online data, is the first step in any
data-driven project. Here’s a short video that explains what scraping is, and how to create
automated scraping jobs using a digital tool.

This is a 15-minute video created by an instructor at Ohio State University. In the first six
minutes, the instructor talks about why we need web scraping; he then shows how to use a
scraping tool, OutWit Hub, to collect data scattered in a large database.

FYI: read reviews by Reporters’ Lab of OutWit Hub and other web scraping tools.

Source: http://www.mulinblog.com/quick-guide-web-scraping/

Thursday, 4 December 2014

The Hubcast #4: A Guide to Boston, Scraping Local Leads, & Designers.Hubspot.com

The Hubcast Podcast Episode 004

Welcome back to The Hubcast folks! As mentioned last week, this will be a weekly podcast all about HubSpot news, tips, and tricks. Please also note the extensive show notes below including some new HubSpot video tutorials created by George Thomas.

Show Notes:

Inbound 2014

THE INSIDER’S GUIDE TO BOSTON

Boston Guide

On September 15-18, the Boston Convention & Exhibition Center will be filled with sales and marketing professionals for INBOUND 2014. Whether this will be your first time visiting Boston, you’ve visited Boston in the past, or you’ve lived in the city for years, The Insider’s Guide to Boston is your go-to guide for enjoying everything the city has to offer. Click on a persona below to get started.

Are you the The Brewmaster – The Workaholic – The Chillaxer?
Check out the guide here
HubSpot Tips & Tricks

Prospects Tool – Scrape Local Leads
Prospects Tool

This weeks tip / trick is how to silence some of the noise in your prospect tool. Sometimes you might have need to just look at local leads for calls or drop offs. We show you how to do that and much more with the HubSpot Prospects Tool.

Watch the tutorial here
HubSpot Strategy
Crack down on your sites copy.

We talk about how your home page and about pages are talking to your potential customers in all the wrong ways. Are you the me, me, me person at the digital party? Or are you letting people know how their problems can be solved by your products or services.

HubSpot Updates
(Each week on the Hubcast, George and Marcus will be looking at HubSpot’s newest updates to their software. And in this particular episode, we’ll be discussing 2 of their newest updates)
Default Contact Properties

You can now choose a default option on contact properties that sets a default value for that property that can be applied across your entire contacts database. When creating or editing a new contact property in Contacts Settings, you’ll see a new default option next to the labels on properties with field types “Dropdown,” “Radio Select” and “Single On/Off Checkbox”.

Default Contact Properties

When you set a contact property as “default”, all contacts who don’t have any value set for this property will adopt the default value you’ve selected. In the example above, we’re creating a property to track whether your contact uses a new feature. Initially, all of them would be “No,” and that’s the default property that will be applied database-wide. As a result, this’ll get stamped on each contact record the value wasn’t present on.

Now, when you want to apply a contact property across multiple contacts, you don’t have to create a list of those contacts and then create a workflow that stamps that contact property across those contacts. This new feature allows you to bypass those steps by using the “default” option on new contact properties you create.

Watch the tutorial here
RSS Module with Images

Now available is a new option within modules in the template builder that will allow you to easily add a featured image to an RSS module. This module will show a blog post’s featured image next to the feed of recent blog content. If you are a marketer, all you need to do is simply check the “Featured Image” box off in the RSS Listing module to display a list of recent COS blog posts with images on any page. No developers or code necessary to do this!

If you are a designer and want to add additional styling to an RSS module with images, you can do so using HubL tokens.

Here is documentation on how to get started.

Default Contact Properties
Watch the tutorial here

HubSpot Wishlist

The HubSpot Keywords Tool

Why oh why!!!! Hubspot why can we only have 1,000 keywords in our keywords tool? We talk about how for many companies a 1,000 keywords dont just cut it. For example Yale applaince can easily blow through those keywords.

Source: http://www.thesaleslion.com/hubcast-podcast-004/

Sunday, 30 November 2014

Web Scraping’s 2013 Review – part 1

Here we are, almost having ended another year and having the chance to analyze the aspects of the Web scraping market over the last twelve months. First of all i want to underline all the buzzwords on the tech field as published in the Yahoo’s year in review article . According to Yahoo, the most searched items wore

iPhone (including 4, 5, 5s, 5c, and 6)
Samsung (including Galaxy, S4, S3, Note)
Siri
iPad Cases
Snapchat
Google Glass
Apple iPad
BlackBerry Z10
Cloud Computing

It’s easy to see that none of this terms regards in any way with the field of data mining, and they rather focus on the gadgets and apps industry, which is just one of the ways technology can evolve to. Regarding actual data mining industry there were a lot of talks about it in this year’s MIT’s Engaging Data 2013 Conference. One of the speakers Noam Chomsky gave an acid speech relating data extraction and its connection to the Big Data phenomena that is also on everyone’s lips this year. He defined a good way to see if Big Data works by following a series of few simple factors: 1. It’s the analysis, not the raw data, that counts. 2. A picture is worth a thousand words 3. Make a big data portal (Not sure if Facebook is planning on dominating in cloud services some day) 4. Use a hybrid organizational model (We’re asleep already, soon) let’s move 5. Train employees Other interesting declaration was given by EETimes saying, “Data science will do more for medicine in the next 10 years than biological science.” which says a lot about the volume of required extracted data.

Because we want to cover as many as possible events about data mining this article will be a two parter, so don’t forget to check our blog tomorrow when the second part of this article will come up!

Source:http://thewebminer.com/blog/2013/12/

Thursday, 27 November 2014

Scraping R-bloggers with Python – Part 2

In my previous post I showed how to write a small simple python script to download the pages of R-bloggers.com. If you followed that post and ran the script, you should have a folder on your hard drive with 2409 .html files labeled post1.html , post2.html and so forth. The next step is to write a small script that extract the information we want from each page, and store that information in a .csv file that is easily read by R. In this post I will show how to extract the post title, author name and date of a given post and store it in a .csv file with a unique id.

To do this open a document in your favorite python editor (I like to use aquamacs) and name it: extraction.py. As in the previous post we start by importing the modules that we will use for the extraction:

from BeautifulSoup import BeautifulSoup

import os
import re

As in the previous post we will be using the BeautifulSoup module to extract the relevant information from the pages. The os module is used to get a list of file from the directory where we have saved the .html files, and finally the re module allows us to use regular expressions to format the titles that include a comma value or a newline value (\n). We need to remove these as they would mess up the formatting of the .csv file.

After having read in the modules, we need to get a list of files that we can iterate over. First we need to specify the path were the files are saved, and then we use the os module to get all the filenames in the specified directory:

path = "/Users/thomasjensen/Documents/RBloggersScrape/download"

listing = os.listdir(path)

It might be that there are other files in the given directory, hence we apply a filter, in shape of a list comprehension, to weed out any file names that do not match our naming scheme:

listing = [name for name in listing if re.search(r"post\d+\.html",name) != None]

Notice that a regular expression was used to determine whether a given name in the list matched our naming scheme. For more on regular expressions have a look at this site.

The final steps in preparing our extraction is to change the working directory to where we have our .html files, and create an empty dictionary:

os.chdir(path)
data = {}

Dictionaries are one of the great features of Python. Essentially a dictionary is a mapping of a key to a specific value, however the fact that dictionaries can be nested within each other, allows us to create data structures similar to R’s data frames.

Now we are ready to begin extracting information from our downloaded pages. Much as in the previous post, we will loop over all the file names, read each file into Python and create a BeautifulSoup object from the file:

for page in listing:
    site = open(page,"rb")
    soup = BeautifulSoup(site)

In order to store the values we extract from a given page, we update the dictionary with a unique key for the page. Since our naming scheme made sure that each file had a unique name, we simply remove the .html part from the page name, and use that as our key:

key = re.sub(".html","",page)

data.update({key:{}})

This will create a mapping between our key and an empty dictionary, nested within the data dictionary. Once this is done we can start extract information and store it in our newly created nested dictionary. The content we want is located in the main column, which has the id tag “leftcontent” in the html code. To get at this we use find() function on soup object created above:

content = soup.find("div", id = "leftcontent")

The first “h1” tag in our content object contains the title, so again we will use the find() function on the content object, to find the first “h1” tag:

title = content.findNext("h1").text

To get the text within the “h1” tag the .text had been added to our search with in the content object.

To find the author name, we are lucky that there is a class of “div” tags called “meta” which contain a link with the author name in it. To get the author name we simply find the meta div class and search for a link. Then we pull out the text of the link tag:

author = content.find("div",{"class":"meta"}).findNext("a").text

Getting the date is a simple matter as it is nested within div tag with the class “date”:

date = content.find("div",{"class":"date"}).text

Once we have the three variables we put them in dictionaries that are nested within the nested dictionary we created with the key:

data[key]["title"] = title
data[key]["author"] = author
data[key]["date"] = date

Once we have run the loop and gone through all posts, we need to write them in the right format to a .csv file. To begin with we open a .csv file names output:

output = open("/Users/thomasjensen/Documents/RBloggersScrape/output.csv","wb")

then we create a header that contain the variable names and write it to the output.csv file as the first row:

variables = unicode(",".join(["id","date","author","title"]))
header = variables + "\n"
output.write(header.encode("utf8"))

Next we pull out all the unique keys from our dictionary that represent individual posts:

keys = data.keys()

Now it is a simple matter of looping through all the keys, pull out the information associated with each key, and write that information to the output.csv file:

for key in keys:
    print key
    id = key
    date = re.sub(",","",data[key]["date"])
    author = data[key]["author"]
    title = re.sub(",","",data[key]["title"])
    title = re.sub("\\n","",title)
    linelist = [id,date,author,title]
    linestring = unicode(",".join(linelist))
    linestring = linestring + "\n"
    output.write(linestring.encode("utf-8"))

Notice that we first create four variables that contain the id, date, author and title information. With regards to the title we use two regular expressions to remove any commas and “\n” from the title, as these would create new columns or new line breaks in the output.csv file. Finally we put the variables together in a list, and turn the list into a string with the list items separated by a comma. Then a linebreak is added to the end of the string, and the string is written to the output.csv file. As a last step we close the file connection:

output.close()

And that is it. If you followed the steps you should now have a csv file in your directory with 2409 rows, and four variables – ready to be read into R. Stay tuned for the next post which will show how we can use this data to see how R-bloggers has developed since 2005. The full extraction script is shown below:

from BeautifulSoup import BeautifulSoup

import os
import re

path = "/Users/thomasjensen/Documents/RBloggersScrape/download"
listing = os.listdir(path)

listing = [name for name in listing if re.search(r"post\d+\.html",name) != None]
os.chdir(path)
data = {}
for page in listing:
site = open(page,"rb")
soup = BeautifulSoup(site)
key = re.sub(".html","",page)
print key
data.update({key:{}})
content = soup.find("div", id = "leftcontent")
title = content.findNext("h1").text
author = content.find("div",{"class":"meta"}).findNext("a").text
date = content.find("div",{"class":"date"}).text
data[key]["title"] = title
data[key]["author"] = author
data[key]["date"] = date

output = open("/Users/thomasjensen/Documents/RBloggersScrape/output.csv","wb")

keys = data.keys()
variables = unicode(",".join(["id","date","author","title"]))
header = variables + "\n"
output.write(header.encode("utf8"))
for key in keys:
print key
id = key
date = re.sub(",","",data[key]["date"])
author = data[key]["author"]
title = re.sub(",","",data[key]["title"])
title = re.sub("\\n","",title)
linelist = [id,date,author,title]
linestring = unicode(",".join(linelist))
linestring = linestring + "\n"
output.write(linestring.encode("utf-8"))
output.close()

Source: http://www.r-bloggers.com/scraping-r-bloggers-with-python-part-2/

Sunday, 23 November 2014

Data Mining Outsourcing in a Better and Unique Approach

Data mining outsourcing services are ideal for clarity in various decision making processes. It is the ultimate goal of any organization and business to increase on its profits as well as strengthen the bond with its customers. Equipping the business in such a way that it’s very easy to detect frauds and manage risks in a convenient manner is equally important. Volumes of data that are irrelevant or cannot be used when raw needs to be converted to a more useful form. The data mining outsourcing services can greatly help you to analyze and interpret data in a more diligent way.

This service to reliable, experienced and qualified hands is very important. Your research project or engineering project can be easily and conveniently handled by experienced staff who guarantees you an accuracy level of about 98% and a massive reduction in operating costs. The quality of work is unsurpassed and the presentation is done in a format that is easy and simple for you. The project is done in a very short time alleviating you delays as well as ensuring on-time completion of your projects. To enjoy a successful outsourcing experience, you need to bank on a famous and reliable expertise.

The only time to rely with data mining outsourcing services is when you do not have a reliable, experienced expertise in your business. Statistics indicate that it’s very easy to lose business intelligence or expose the privacy of the customers through this process. However companies which offer secure outsourcing process are on the increase as a result of massive competition. It’s an opportunity to develop your potential of sourced data and improve your business in all fields.

Data mining potential applications are infinite. However major applications are in the marketing research and scientific projects. It’s done both on large and small quantities of data by experienced staff well known for their best analytical procedures to guarantee you accurate and easy to use information. Data mining outsourcing services are the only perfect way to profitability.

Source:http://www.e-edge.biz/Data_Mining_Outsourcing_in_a_Better_and_Unique_Approach.html

Wednesday, 19 November 2014

Online Data Entry & Web Scraping Services

To operate any type of organization smoothly, it is essential to have precise data that is accurate and reliable. When your business expands, data entry on an ongoing basis is a tedious job. It’s a very time consuming task that can often distract employees focusing on core business areas.

Webpop offers all forms of online data entry services that are quick and accurate. We provide data entry services across all verticals that can be completely customized to your business requirements.

Database Population Services

Database population involves content collection from various database sources. This requires a lot of attention to detail, dedication and awareness and can prove a formidable task, especially for websites that largeley depend on it.

Webpop offer a quick and efficient database population service that helps relieve the stress from an extremely laborius task and leaves you more time to focus on more important aspects of your business. By investing just a fraction of the cost, you can outsource your database population tasks to us.

Web Scraping Services

Webpop have been assisting clients in searching, extracting and collecting data from the web for the past 5 years using the latest techniques in web scraping techology. We can scrape all types of information from a variety of sources such as websites, blogs, online directories, e-commerce websites and podcasts to name a few. We use a varied selection of automated and manual web scraping technologies to extract, gather and collect all of the required data you require from any chosen website(s) on the World Wide Web.

We can simplify the whole process from collection to population, converting your scraped data in to structured formats that are applicable to your website. This can be offered as a one time service or an ongoing basis that will assist you in constantly keeping your website’s content fresh and up to date. We can crawl competitors websites, gather sales leads, product details, pricing methodologies and also creat custom campaigns to suit your project’s requirements.

Over the years Webpop has grown from strength-to-strength by providing all types of data entry, database population and web scraping services. All of our data entry services are performed with care, due dilligence and attention to detail. We enjoy a challenge and pride ourselves on delivering results whilst working on precarious projects that require precision and total commitment.

Source:http://www.webpopdesign.com/services/data-entry/

Saturday, 15 November 2014

Is Web Scraping Legal?

Web scraping might be one of the best ways to aggregate content from across the internet, but it comes with a caveat: It’s also one of the hardest tools to parse from a legal standpoint.

For the uninitiated, web scraping is a process whereby an automated piece of software extracts data from a website by “scraping” through the site’s many pages. While search engines like Google and Bing do a similar task when they index web pages, scraping engines take the process a step further and convert the information into a format which can be easily transferred over to a database or spreadsheet.

It’s also important to note that a web scraper is not the same as an API. While a company might provide an API to allow other systems to interact with its data, the quality and quantity of data available through APIs is typically lower than what is made available through web scraping. In addition, web scrapers provide more up-to-date information than APIs and are much easier to customize from a structural standpoint.

The applications of this “scraped” information are widespread. A journalist like Nate Silver might use scrapers to monitor baseball statistics and create numerical evidence for a new sports story he’s working on. Similarly, an eCommerce business might bulk scrape product titles, prices, and SKUs from other sites in order to further analyze them.

Legality of Web ScrapingWhile web scraping is an undoubtedly powerful tool, it’s still undergoing growing pains when it comes to legal matters. Because the scraping process appropriates pre-existing content from across the web, there are all kinds of ethical and legal quandaries that confront businesses who hope to do leverage scrapers for their own processes.

In this “wild west” environment, where the legal implications of web scraping are in a constant state of flux, it helps to get a foothold on where the legal needle currently falls. The following timeline outlines some of the biggest cases involving web scrapers in the United States, and allows us to achieve a greater understanding on the precedents that surround the court rulings.

Terms of Use Tug-of-War—2000-2009

For years after they first came into use, web scrapers went largely unchallenged from a legal standpoint. In 2000, however, the use of scrapers came under heavy and consistent fire when eBay fired the first shot against an auction data aggregator called Bidder’s Edge. In this very early case, eBay argued that Bidder’s Edge was using scrapers in a way that violated Trespass to Chattels doctrine. While the lawsuit was settled out of court, the judge upheld eBay’s original injunction, stating that heavy bot traffic could very well disrupt eBay’s service.

Then in 2003’s Intel Corp. v. Hamidi, the California Supreme court overturned the basis of eBay v. Bidder’s Edge, ruling that Trespass to Chattels could not extend to the context of computers if no actual damage to personal property occurred.

So in terms of legal action against web scraping, Tresspass to Chattels no longer applied, and things were back to square one. This began a period in which the courts consistently rejected Terms of Service as a valid means of prohibiting scrapers, including cases like Perfect 10 v. Google, and Cvent v. Eventbrite.

The Takeaway: The earliest cases against scrapers hinged on Trespass to Chattels law, and were successful. However, that doctrine is no longer a valid approach.

Facebook Web Scraping2009—Facebook Steps In

In 2009, Facebook turned the tides of the web scraping war when Power.com, a site which aggregated multiple social networks into one centralized site, included Facebook in their service. Because Power.com was scraping Facebook’s content instead of adhering to their established standards, Facebook sued Power on grounds of copyright infringement.

In denying Power.com’s motion to dismiss the case, the Judge ruled that scraping can constitute copying, however momentary that copying may be. And because Facebook’s Terms of Service don’t allow for scraping, that act of copying constituted an infringement on Facebook’s copyright. With this decision, the waters regarding the legality of web scrapers began to shift in favor of the content creators.

The Takeaway: Even if a web scraper ignores infringing content on its way to freely-usable content, it might qualify as copyright infringement by virtue of having technically “copied” the infringing content first.

2011-2014— U.S. v Auernheimer

In 2010, hacker Andrew “Weev” Auernheimer found a security flaw in AT&T’s website, which would display the email addresses of users who visited the site via their iPads. By exploiting the flaw using some simple scripts and a scraper, Auernheimer was able to gather thousands of emails from the AT&T site.

Although these email addresses were publicly available, Auernheimer’s exploit led to his 2012 conviction, where he was charged with identity fraud and conspiracy to access a computer without authorization.

Data ScrapingEarlier this year, the court vacated Auernheimer’s conviction, ruling that the trial’s New Jersey venue was improper. But even though the case turned out to be mostly inconclusive, the court noted the fact that there was no evidence to show that “any password gate or code-based barrier was breached.” This seems to leave room for the web scraping of publicly-available personal information, although it’s still very much open to interpretation and not set in stone.

The Takeaway: Using a web scraper to aggregate sensitive personal information can lead to a conviction, even if that information was technically available to the public. While there is hope in the court’s observation that no passwords or barriers were broken to retrieve this information, the waters here are still very volatile.

2013—Associated Press vs. Meltwater

Meltwater is a software company whose “Global Media Monitoring” product uses scrapers to aggregate news stories for paying clients. The Associated Press took issue with Meltwater’s scraping of their original stories, some of which had been copyrighted. In 2012, AP filed suit against Meltwater for copy infringement and hot news misappropriation.

While it’s already been established that facts cannot be copyrighted, the court decided that the AP’s copyrighted articles—and more specifically, the way in which the facts within those articles were arranged—were not fair game for copying. On top of this, Meltwater’s use of the articles failed to meet the established fair use standards, and could not be defended on that front either.

The Takeaway: Fair use is limited when it comes to web scrapers, and copyrighted content is not always open to be scraped.

~~

By closely observing the outcomes of previous rulings, you’ll find that there are a few guidelines that a scraper should attempt to adhere to:

    Content being scraped is not copyright protected
    The act of scraping does not burden the services of the site being scraped
    The scraper does not violate the Terms of Use of the site being scraped
    The scraper does not gather sensitive user information
    The scraped content adheres to fair use standards

While all of these guidelines are important to understand before using scrapers, there are other ways to acclimate to the legal nuances. In many cases, you’ll find that a simple conversation with a business software developer or consultant will lead to some satisfying conclusions: Odds are, they’ve used scrapers in the past and can shed light on any snags they’ve hit in the process. And of course, talking with a lawyer is always an ideal course of action when treading into questionable legal territory.

Source:http://blog.icreon.us/2014/09/12/web-scraping-and-you-a-legal-primer-for-one-of-its-most-useful-tools/

Thursday, 13 November 2014

Big Data Democratization via Web Scraping

Big Data Democratization via Web Scraping

If we had to put democratization of data inline with the classroom definition of democracy, it would read- Data by the people, for the people, of the people. Makes a lot of sense, doesn’t it? It resonates with the generic feeling we have these days with respect to easy access to data for our daily tasks. Thanks to the internet revolution, and now the social media.

Big-data-crawling

Big Data web Crawling

By the people- most of the public data on the web is a user group’s sentiments, analyses and other information.

Of the people- Although the “of” here does not literally mean that the data is owned, all such data on the internet either relates to the user group itself or its views on things.

For the people- Most of this data is presented via channels (either social media, news, etc.) for public benefit be it travel tips, daily news feeds, product price comparisons, etc.

Essentially, data democratization has come to mean that by leveraging cloud computing, data that’s mostly user-generated on the internet has become accessible by all industries- big or small for their own internal use (commercial or not). This democratization has been put to use for unearthing hidden patterns from big blobs of datasets. Use cases have evolved with the consumer internet landscape and Big Data is now being used for various other means quite unanticipated.

With respect to the democratization, we’ve also heard enough about how data analytics is paving way beyond data analysts within companies and becoming available to even the non-tech-savvies. But did anyone mention DaaS providers who aid in the very first phase of data acquisition? Data scraping or web crawling (whatever your lingo is) has come to become an indivisible part of data democratization, especially when talking large-scale. The first step into bringing the public data to use is acquiring it which is where setting up web crawlers internally or partnering with DaaS providers comes to play. This blog guides towards making a choice. Its not always all the data that companies crunch or should crunch from the web. There’s obviously certain channels that are of more interest to the community than the rest and there lies the barrier- to identify sources of higher ROI and acquire data in a machine-readable format.

DaaS providers usually come to help with the entire data acquisition pipeline- starting from picking the right sources through crawl, extraction, dedup as well as data normalization based on specific requirements. Once the data has been acquired, its most likely published on another channel. Such network effect bolsters the democracy.

Steps in Data Acquisition Pipeline

crawl-extract-norm

Note- PromptCloud only delivers structured data as per the schema provided.

So while democratization may refer to easy access of computing resources in order to draw patterns from Big Data, it could also be analogous to ensuring right data in the right format at right intervals. In fact, DaaS providers have themselves used this democracy to empower it further.

Source:https://www.promptcloud.com/blog/big-data-democratization-using-web-scraping-2/

Monday, 10 November 2014

Data Scraping vs. Data Crawling

One of our favorite quotes has been- ‘If a problem changes by an order, it becomes a totally different problem’ and in this lies the answer to- what’s the difference between scraping and crawling?

Crawling usually refers to dealing with large data-sets where you develop your own crawlers (or bots) which crawl to the deepest of the web pages. Data scraping on the other hand refers to retrieving information from any source (not necessarily the web). It’s more often the case that irrespective of the approaches involved, we refer to extracting data from the web as scraping (or harvesting) and that’s a serious misconception.

=>Below are some differences in our opinion- both evident and subtle

1.    Scraping data does not necessarily involve the web. Data scraping could refer to extracting information from a local machine, a database, or even if it is from the internet, a mere “Save as” link on the page is also a subset of the data scraping universe. Crawling on the other hand differs immensely in scale as well as in range. Firstly, crawling = web crawling which means on the web, we can only “crawl” data. Programs that perform this incredible job are called crawl agents or bots or spiders (please leave the other spider in spiderman’s world). Some web spiders are algorithmically designed to reach the maximum depth of a page and crawl them iteratively (did we ever say scrape?).

2.    Web is an open world and the quintessential practising platform of our right to freedom. Thus a lot of content gets created and then duplicated. For instance, the same blog might be posted on different pages and our spiders don’t understand that. Hence, data de-duplication (affectionately dedup) is an integral part of data crawling. This is done to achieve two things- keep our clients happy by not flooding their machines with the same data more than once, and saving our own servers some space. However, dedup is not necessarily a part of data scraping.

3.    One of the most challenging things in the web crawling space is to deal with coordination of successive crawls. Our spiders have to be polite with the servers that they hit so that they don’t piss them off and this creates an interesting situation to handle. Over a period of time, our intelligent spiders have to get more intelligent (and not crazy!) and learn to know when and how much to hit a server in order to crawl data on its web pages while complying with its politeness policies.

4.    Finally, different crawl agents are used to crawl different websites and hence you need to ensure they don’t conflict with each other in the process. This situation never arises when you intend to just scrape data.

On a concluding note, scraping represents a very superficial node of crawling which we call extraction and that again requires few algorithms and some automation in place.

Source:https://www.promptcloud.com/blog/data-scraping-vs-data-crawling/

Friday, 7 November 2014

Web Scraping and Copyright

There are rigorous debates on the topic of copyright issues of a scraped data. Whether it is legal to use web scraping to extract data from a website or it’s just an illegal act that will lead you to a troublesome situation. There are certain implementations when we talk of web scraping. There are certain websites that will provide you with RSS feeds.

Just grab the piece use it and get the credit. However there are sites that do not allow such kind of cooperation (as we will call this!) and thus there is no other way to extract reliable data. Another way is to hold the “ctrl” plus “c” keys and wait for a while for the data to be copied to your computer.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-copyright/

Monday, 8 September 2014

Scraping webdata from a website that loads data in a streaming fashion

I'm trying to scrape some data off of the FEC.gov website using python for a project of mine. Normally I use python

mechanize and beautifulsoup to do the scraping.

I've been able to figure out most of the issues but can't seem to get around a problem. It seems like the data is

streamed into the table and mechanize.Browser() just stops listening.

So here's the issue: If you visit http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A ... you get the first 500

contributors whose last name starts with A and have given money to candidate P80003338 ... however, if you use

browser.open() at that url all you get is the first ~5 rows.

I'm guessing its because mechanize isn't letting the page fully load before the .read() is executed. I tried putting a

time.sleep(10) between the .open() and .read() but that didn't make much difference.

And I checked, there's no javascript or AJAX in the website (or at least none are visible when you use the 'view-

source'). SO I don't think its a javascript issue.

Any thoughts or suggestions? I could use selenium or something similar but that's something that I'm trying to avoid.

-Will

2 Answers

Why not use an html parser like lxml with xpath expressions.

I tried

>>> import lxml.html as lh
>>> data = lh.parse('http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A')
>>> name = data.xpath('/html/body/table[2]/tr[5]/td[1]/a/text()')
>>> name
[' AABY, TRYGVE']
>>> name = data.xpath('//table[2]/*/td[1]/a/text()')
>>> len(name)
500
>>> name[499]
' AHMED, ASHFAQ'
>>>

Similarly, you can create xpath expression of your choice to work with.

Source: http://stackoverflow.com/questions/9435512/scraping-webdata-from-a-website-that-loads-data-in-a-streaming-

fashion

How can I circumvent page view limits when scraping web data using Python?

I am using Python to scrape US postal code population data from http:/www.city-data.com, through this directory: http://www.city-data.com/zipDir.html. The specific pages I am trying to scrape are individual postal code pages with URLs like this: http://www.city-data.com/zips/01001.html. All of the individual zip code pages I need to access have this same URL Format, so my script simply does the following for postal_code in range:

    Creates URL given postal code
    Tries to get response from URL
    If (2), Check the HTTP of that URL
    If HTTP is 200, retrieves the HTML and scrapes the data into a list
    If HTTP is not 200, pass and count error (not a valid postal code/URL)
    If no response from URL because of error, pass that postal code and count error
    At end of script, print counter variables and timestamp

The problem is that I run the script and it works fine for ~500 postal codes, then suddenly stops working and returns repeated timeout errors. My suspicion is that the site's server is limiting the page views coming from my IP address, preventing me from completing the amount of scraping that I need to do (all 100,000 potential postal codes).

My question is as follows: Is there a way to confuse the site's server, for example using a proxy of some kind, so that it will not limit my page views and I can scrape all of the data I need?

Thanks for the help! Here is the code:

##POSTAL CODE POPULATION SCRAPER##

import requests

import re

import datetime

def zip_population_scrape():

    """
    This script will scrape population data for postal codes in range
    from city-data.com.
    """
    postal_code_data = [['zip','population']] #list for storing scraped data

    #Counters for keeping track:
    total_scraped = 0
    total_invalid = 0
    errors = 0

    for postal_code in range(1001,5000):

        #This if statement is necessary because the postal code can't start
        #with 0 in order for the for statement to interate successfully
        if postal_code <10000:
            postal_code_string = str(0)+str(postal_code)
        else:
            postal_code_string = str(postal_code)

        #all postal code URLs have the same format on this site
        url = 'http://www.city-data.com/zips/' + postal_code_string + '.html'

        #try to get current URL
        try:
            response = requests.get(url, timeout = 5)
            http = response.status_code

            #print current for logging purposes
            print url +" - HTTP: " + str(http)

            #if valid webpage:
            if http == 200:

                #save html as text
                html = response.text

                #extra print statement for status updates
                print "HTML ready"

                #try to find two substrings in HTML text
                #add the substring in between them to list w/ postal code
                try:

                    found = re.search('population in 2011:</b> (.*)<br>', html).group(1)

                    #add to # scraped counter
                    total_scraped +=1

                    postal_code_data.append([postal_code_string,found])

                    #print statement for logging
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
                #if substrings not found, try searching for others
                #and doing the same as above
                except AttributeError:
                    found = re.search('population in 2010:</b> (.*)<br>', html).group(1)

                    total_scraped +=1

                    postal_code_data.append([postal_code_string,found])
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."

            #if http =404, zip is not valid. Add to counter and print log
            elif http == 404:
                total_invalid +=1

                print postal_code_string + ": Not a valid zip code. " + str(total_invalid) + " total invalid zips."

            #other http codes: add to error counter and print log
            else:
                errors +=1

                print postal_code_string + ": HTTP Code Error. " + str(errors) + " total errors."

        #if get url fails by connnection error, add to error count & pass
        except requests.exceptions.ConnectionError:
            errors +=1
            print postal_code_string + ": Connection Error. " + str(errors) + " total errors."
            pass

        #if get url fails by timeout error, add to error count & pass
        except requests.exceptions.Timeout:
            errors +=1
            print postal_code_string + ": Timeout Error. " + str(errors) + " total errors."
            pass

    #print final log/counter data, along with timestamp finished
    now= datetime.datetime.now()
    print now.strftime("%Y-%m-%d %H:%M")
    print str(total_scraped) + " total zips scraped."
    print str(total_invalid) + " total unavailable zips."
    print str(errors) + " total errors."

Source: http://stackoverflow.com/questions/25452798/how-can-i-circumvent-page-view-limits-when-scraping-web-data-using-python

Sunday, 7 September 2014

Web data scraping (online news comments) with Scrapy (Python)

Since you seem like the try-first ask-question later type (that's a very good thing), I won't give you an answer, but a

(very detailed) guide on how to find the answer.

The thing is, unless you are a yahoo developer, you probably don't have access to the source code you're trying to

scrape. That is to say, you don't know exactly how the site is built and how your requests to it as a user are being

processed on the server-side. You can, however, investigate the client-side and try to emulate it. I like using Chrome

Developer Tools for this, but you can use others such as FF firebug.

So first off we need to figure out what's going on. So the way it works, is you click on the 'show comments' it loads

the first ten, then you need to keep clicking for the next ten comments each time. Notice, however, that all this

clicking isn't taking you to a different link, but lively fetches the comments, which is a very neat UI but for our

case requires a bit more work. I can tell two things right away:

    They're using javascript to load the comments (because I'm staying on the same page).
    They load them dynamically with AJAX calls each time you click (meaning instead of loading the comments with the

page and just showing them to you, with each click it does another request to the database).

Now let's right-click and inspect element on that button. It's actually just a simple span with text:

<span>View Comments (2077)</span>

By looking at that we still don't know how that's generated or what it does when clicked. Fine. Now, keeping the

devtools window open, let's click on it. This opened up the first ten. But in fact, a request was being made for us to

fetch them. A request that chrome devtools recorded. We look in the network tab of the devtools and see a lot of

confusing data. Wait, here's one that makes sense:

http://news.yahoo.com/_xhr/contentcomments/get_comments/?content_id=42f7f6e0-7bae-33d3-aa1d-

3dfc7fb5cdfc&_device=full&count=10&sortBy=highestRated&isNext=true&offset=20&pageNumber=2&_media.modules.content_commen

ts.switches._enable_view_others=1&_media.modules.content_comments.switches._enable_mutecommenter=1&enable_collapsed_com

ment=1

See? _xhr and then get_comments. That makes a lot of sense. Going to that link in the browser gave me a JSON object

(looks like a python dictionary) containing all the ten comments which that request fetched. Now that's the request you

need to emulate, because that's the one that gives you what you want. First let's translate this to some normal reqest

that a human can read:

go to this url: http://news.yahoo.com/_xhr/contentcomments/get_comments/
include these parameters: {'_device': 'full',
          '_media.modules.content_comments.switches._enable_mutecommenter': '1',
          '_media.modules.content_comments.switches._enable_view_others': '1',
          'content_id': '42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc',
          'count': '10',
          'enable_collapsed_comment': '1',
          'isNext': 'true',
          'offset': '20',
          'pageNumber': '2',
          'sortBy': 'highestRated'}

Now it's just a matter of trial-and-error. However, a few things to note here:

    Obviously the count is what decides how many comments you're getting. I tried changing it to 100 to see what

happens and got a bad request. And it was nice enough to tell me why - "Offset should be multiple of total rows". So

now we understand how to use offset

    The content_id is probably something that identifies the article you are reading. Meaning you need to fetch that

from the original page somehow. Try digging around a little, you'll find it.

    Also, you obviously don't want to fetch 10 comments at a time, so it's probably a good idea to find a way to fetch

the number of total comments somehow (either find out how the page gets it, or just fetch it from within the article

itself)

    Using the devtools you have access to all client-side scripts. So by digging you can find that that link to

/get_comments/ is kept within a javascript object named YUI. You can then try to understand how it is making the

request, and try to emulate that (though you can probably figure it out yourself)

    You might need to overcome some security measures. For example, you might need a session-key from the original

article before you can access the comments. This is used to prevent direct access to some parts of the sites. I won't

trouble you with the details, because it doesn't seem like a problem in this case, but you do need to be aware of it in

case it shows up.

    Finally, you'll have to parse the JSON object (python has excellent built-in tools for that) and then parse the

html comments you are getting (for which you might want to check out BeautifulSoup).

As you can see, this will require some work, but despite all I've written, it's not an extremely complicated task

either.

So don't panic.

It's just a matter of digging and digging until you find gold (also, having some basic WEB knowledge doesn't hurt).

Then, if you face a roadblock and really can't go any further, come back here to SO, and ask again. Someone will help

you.

Source: http://stackoverflow.com/questions/20218855/web-data-scraping-online-news-comments-with-scrapy-python

Saturday, 6 September 2014

A good web data extraction/screen scraper program?

I need to capture product data from a site on a regular basis and wondered if any one knows of a good software program? I've trialed Mozenda but its a monthly subscription and pricey in the long term. Obviously something thats free would be best but I don't mind paying either. Just need a decent program thats reliable and doesn't require much programming knowledge.

You can try ScraperWiki.com if you know python.

I've experimented with Screen-Scraper and found it easy to use. The application comes in multiple versions: basic (which is free), professional, and enterprise. Also, multiple platforms are supported.

Hire a programmer to do it so that there is only a one off cost. I often see similar projects on freelancing websites like Elance and oDesk.

I really like iMacros. You can give it a test drive to see if it meets your needs with the totally free Firefox extension (there's also IE versions), but there are also more full featured application and "server" versions that have more features and ability to do thing in an unattended manner.

Here are some other alternatives to consider:

    License the data from the provider. Call em up and ask 'em.

    Use Amazon Mechanical Turk to get humans to copy and paste and format it for ya. They are cheap.

    For automation, it depends on how complicated the HTML is and how often it changes. You could use Excel's Web Data Import if it's really simple.

You can use irobot from IRobotSoft, which is totally free, and provides more functionalityies than other paid software. Watch demos here http://irobotsoft.com/help/ for how simple it is.

Questions on their forum were answered very quickly.

Source: http://stackoverflow.com/questions/2334164/a-good-web-data-extraction-screen-scraper-program

Friday, 5 September 2014

How to login to website and extract data using PHP [closed]

I have installed the tiny tiny rss on to my computer (Windows) and also have Xampp installed (localhost).

I want to be able to use PHP to extract data from the Tiny tiny RSS webpage.

I have tried this it which just opens the front page:

<?php
$homepage = file_get_contents('my install tiny tiny rss url');
echo $homepage;
?>

But how do I login and extract the data.

You can use cURL to send post data and headers. To login you need to replicate the exact data exchange between the client and the server.

SOurce: http://stackoverflow.com/questions/20611918/how-to-login-to-website-and-extract-data-using-php

Is it ok to scrape data from Google results?

I'd like to fetch results from Google using curl to detect potential duplicate content. Is there a high risk of being banned by Google?

Google will eventually block your IP when you exceed a certain amount of requests.

Google disallows automated access in their TOS, so if you accept their terms you would break them.

That said, I know of no lawsuit from Google against a scraper. Even Microsoft scraped Google, they powered their search engine Bing with it. They got caught in 2011 red handed :)

There are two options to scrape Google results:

1) Use their API

    You can issue around 40 requests per hour You are limited to what they give you, it's not really useful if you want to track ranking positions or what a real user would see. That's something you are not allowed to gather.

    If you want a higher amount of API requests you need to pay.
    60 requests per hour cost 2000 USD per year, more queries require a custom deal.

2) Scrape the normal result pages

    Here comes the tricky part. It is possible to scrape the normal result pages. Google does not allow it.
    If you scrape at a rate higher than 15 keyword requests per hour you risk detection, higher than 20/h will get you blocked from my experience.
    By using multiple IPs you can up the rate, so with 100 IP addresses you can scrape up to 2000 requests per hour. (50k a day)
    There is an open source search engine scraper written in PHP at http://scraping.compunect.com It allows to reliable scrape Google, parses the results properly and manages IP addresses, delays, etc. So if you can use PHP it's a nice kickstart, otherwise the code will still be useful to learn how it is done.

Source: http://stackoverflow.com/questions/22657548/is-it-ok-to-scrape-data-from-google-results

Thursday, 4 September 2014

Data Scraping from PDF and Excel

I am doing a little data scraping, There are 3 types of file from which i am scraping data.

1- HTML
2- PDF
3- Excel(xls)

For HTML i am comfortable, i am using HTML Agility for that.

For PDF and excel i need suggestions from anyone.

Concerning Excel. If you are in a MS environment you can either do Office Automation or use OLEDB. In a Java

environment look at Apache POI.

EDIT: Concerning PDF in Java try Apache PDFBox . Can also work in .NET using IKVM

I can recommend Cogniview's PDF2XL, a reasonably inexpensive commercial product, to extract data from tables in PDF

files into Excel. We have used it with great success.

HTML Agility is a library. Its good to use. But then, why do you need separate tools for different data extraction

purposes? Use Automation Anywhere to extract data from any source. As far as I know, it would work for all the three

sources you have specified. Google it.

Source: http://stackoverflow.com/questions/3147803/data-scraping-from-pdf-and-excel

Wednesday, 3 September 2014

Excel VBA Data Mining Real-Time Data from a Web Page that Refreshes Data

I want to capture real-time data that updates into a table on a webpage; I prefer capturing it into excel using VBA, but I will write it in .NET C# or VB if I that is easier.

the data updates about 1 or 2 seconds, and I want to just grab the latest data quotes and log it into my spreadsheet; the table names are the same, only the data refreshes, and it does so automatically on the web page.

I've done a lot of Excel VBA and I know how to download a URL to a file--this is NOT what I want; I want to gain access to my webpage that is active and grab the data updates after I've logged into my site and selected a webpage that I like.

Is there a simple way to access this data on the webpage from Excel or .Net? Because it refreshes no more than once every 1 or 2 seconds, it is easy to just keep checking it for updates, and I can compare the latest data to see if it actually refreshed.

In Excel 2003, use Data/Import External Data/New Web Query
Browse to your page and select the table you want to import.
After that you can either do a manual Refresh, or use a timer procedure to do something like:

Source: http://stackoverflow.com/questions/9855794/excel-vba-data-mining-real-time-data-from-a-web-page-that-refreshes-data

Tuesday, 2 September 2014

Need to pull data from a website…web query? macro?

I have a list of every DOT # (Dept. of Trans.) in the country. I want to find out insurance effective date for each one of these companies. If you go to http://li-public.fmcsa.dot.gov --> "continue" --> then from the dropdown select "carrier search" and hit "go" it'll take you to a search form (that is the only way to get to this screen).

From there, you can input a DOT # X (use 61222 as an example) and it'll bring you to another screen. Click "view report in HTML" and then down on the bottom you'll see "Active/Pending Insurance". I want to pull the "effective date" from that page and stick it in the spreadsheet next to the DOT # X that I already know.

Of the thousands of DOT #'s in my list, not all will have filings on this website, if that makes a difference.

Can this be done with a Macro or Excel Web Query? I know I probably sound like a total novice, but I'd appreciate any help I could get.

Can you do it? Frankly even if you could you'd lock up the spreadsheet while it's doing that processing. And in the end, how would you handle an error half-way through?

I'd not do this in a client-facing application. This sounds more like something to do in server-side app that can do the processing and gather the information in a more controlled environment. Then you Excel spreadsheet could query that app and get the information in one fell swoop. Error handling is much simpler and you don't end up sitting there staring at Excel why it works its way through thousands of web sites. It was not built to do that elegantly.

What do you write the web service I'm describing in? Well it depends on your preference. Me, I'd write it in Ruby on Rails since it can easily handle the scraping aspect of the task and can report the data out easily as well. But it really falls back to whatever you're most comfortable coding in.

Source: http://stackoverflow.com/questions/15286429/need-to-pull-data-from-a-website-web-query-macro

How to extract data from web 2.0 graphs using a scraper

I have recently come across a web page containing a graph object that displays the (x, y) values on the object as the

mouse is rolled across it. Is there any way to automate the extraction of this data?

How is the graph data loaded? If embedded in the page source then you can extract it with xpath or regex. Else use

Firebug to see how it is loaded.

You will need a solution that works inside the web browser, so the AJAX/Javascript is properly rendered.

I have used iMacros with good success for web scraping in the past. There are free/open-source and "PRO" paid editions

(comparison table here).

Another option is always to custom code something with the Microsoft webbrowser control.

Source: http://stackoverflow.com/questions/3980774/how-to-extract-data-from-web-2-0-graphs-using-a-scraper

Anyone knows an online tool that can scrape a page and create a REST API for the scraped data?

I'm looking for a SaaS solution that is able to login to a platform, scrape data (reports) and then allow accessing the

data through an API. I have some reporting platforms that provide web reporting and email reporting but with no API.

Online reporting doesn't help and email reporting, although can be automated and scraped, isn't so reliable.

If you are willing to do the scraping through your own connection, have a look at Import IO. They have a desktop

application that you use to teach the system how to scrape a page, and then you run the crawler from that application -

and you can run it for as long as you like, as far as I can tell.

You may then upload your data to the Import cloud, from where it is available via an API on the import.io servers.

Useful data can be made public to donate it "to the commons" if you wish.

I did some more digging, found iMacros as a possible solution. Its Windows based, which is a drawback in my case, but

it does allow automation of the scraping and afterwards interaction via common web scripting languages like PHP and

ASP.net.

If you are familiar with jQuery, I think you can use node.js and Cheerio module, then you can create a simple

application to do auto scraping. Actually I have already built a site to do on line web scraping based on the above

mentioned tech, the site is www.datafiddle.net, you can take a look at it.

Source: http://stackoverflow.com/questions/19646028/anyone-knows-an-online-tool-that-can-scrape-a-page-and-create-a-

rest-api-for-the

Wednesday, 27 August 2014

Extract data from Web Scraping C#

I am MVC ASP.NET developer.

I have received the contents from any url, i.e. http, https etc. using WebRequest class.

I have received all the content of that particular url. (for now I took http://google.com)

My next step is to extract buttons, header, footer, colors, text etc.

Here is my code for now:

public ActionResult GetContent(UrlModel model) //model having a string URL
which is entered in a text box and method hits using submit button.
{
    //WebRequest request = WebRequest.Create(model.URL);

    WebRequest request = WebRequest.Create(model.URL);

    request.Credentials = CredentialCache.DefaultCredentials;

    WebResponse response = request.GetResponse();

    Stream dataStream = response.GetResponseStream();

    StreamReader reader = new StreamReader(dataStream);

    string responseFromServer = reader.ReadToEnd();
    ViewBag.Response = responseFromServer;

    reader.Close();
    response.Close();
    return View();
}

Can someone help me with writing the code ?

Also do suggest me with some techniques of data extraction in C#.

Source: http://stackoverflow.com/questions/21901162/extract-data-from-web-scraping-c-sharp

Scrapy, scraping price data from StubHub

I've been having a difficult time with this one.

I want to scrape all the prices listed for this Bruno Mars concert at the Hollywood Bowl so I can get the average price.

http://www.stubhub.com/bruno-mars-tickets/bruno-mars-hollywood-hollywood-bowl-31-5-2014-4449604/

I've located the prices in the HTML and the xpath is pretty straightforward but I cannot get any values to return.

I think it has something to do with the content being generated via javascript or ajax but I can't figure out how to send the correct request to get the code to work.

Here's what I have:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector

from deeptix.items import DeeptixItem

class TicketSpider(BaseSpider):
    name = "deeptix"
    allowed_domains = ["stubhub.com"]
    start_urls = ["http://www.stubhub.com/bruno-mars-tickets/bruno-mars-hollywood-hollywood-bowl-31-5-2014-4449604/"]

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//div[contains(@class, "q_cont")]')
    items = []
    for site in sites:
        item = DeeptixItem()
        item['price'] = site.xpath('span[contains(@class, "q")]/text()').extract()
        items.append(item)
    return items

Any help would be greatly appreciated I've been struggling with this one for quite some time now. Thank you in advance!

Source: http://stackoverflow.com/questions/22770917/scrapy-scraping-price-data-from-stubhub

Tuesday, 26 August 2014

How do you scrape AJAX pages?

Overview:

All screen scraping first requires manual review of the page you want to extract resources from. When dealing with AJAX you usually just need to analyze a bit more than just simply the HTML.

When dealing with AJAX this just means that the value you want is not in the initial HTML document that you requested, but that javascript will be exectued which asks the server for the extra information you want.

You can therefore usually simply analyze the javascript and see which request the javascript makes and just call this URL instead from the start.

Example:

Take this as an example, assume the page you want to scrape from has the following script:

<script type="text/javascript">
function ajaxFunction()
{
var xmlHttp;
try
{
// Firefox, Opera 8.0+, Safari
xmlHttp=new XMLHttpRequest();
}
catch (e)
{
// Internet Explorer
try
    {
    xmlHttp=new ActiveXObject("Msxml2.XMLHTTP");
    }
catch (e)
    {
    try
      {
      xmlHttp=new ActiveXObject("Microsoft.XMLHTTP");
      }
    catch (e)
      {
      alert("Your browser does not support AJAX!");
      return false;
      }
    }
}
xmlHttp.onreadystatechange=function()
    {
    if(xmlHttp.readyState==4)
      {
      document.myForm.time.value=xmlHttp.responseText;
      }
    }
xmlHttp.open("GET","time.asp",true);
xmlHttp.send(null);
}
</script>

Then all you need to do is instead do an HTTP request to time.asp of the same server instead. Example from w3schools.

Sporce: http://stackoverflow.com/questions/260540/how-do-you-scrape-ajax-pages

using Perl to scrape a website

I am interested in writing a perl script that goes to the following link and extracts the number 1975: https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219

That website is the amount of white men born in the year 1923 who live in San Diego County, California in 1940. I am trying to do this in a loop structure to generalize over multiple counties and birth years.

In the file, locations.txt, I put the list of counties, such as San Diego County.

The current code runs, but instead of the # 1975, it displays unknown. The number 1975 should be in $val\n.

I would very much appreciate any help!

#!/usr/bin/perl

use strict;

use LWP::Simple;

open(L, "locations26.txt");

my $url = 'https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3A%22California%22%20%2Bevent_place_level_2%3A%22%LOCATION%%22%20%2Bbirth_year%3A%YEAR%-%YEAR%~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219';

open(O, ">out26.txt");
my $oldh = select(O);
$| = 1;
select($oldh);
while (my $location = <L>) {
     chomp($location);
     $location =~ s/ /+/g;
      foreach my $year (1923..1923) {
                 my $u = $url;
                 $u =~ s/%LOCATION%/$location/;
                 $u =~ s/%YEAR%/$year/;
                 #print "$u\n";
                 my $content = get($u);
                 my $val = 'unknown';
                 if ($content =~ / of .strong.([0-9,]+)..strong. /) {
                         $val = $1;
                 }
                 $val =~ s/,//g;
                 $location =~ s/\+/ /g;
                 print "'$location',$year,$val\n";
                 print O "'$location',$year,$val\n";
         }
     }

Update: API is not a viable solution. I have been in contact with the site developer. The API does not apply to that part of the webpage. Hence, any solution pertaining to JSON will not be applicbale.

Source: http://stackoverflow.com/questions/14654288/using-perl-to-scrape-a-website

Monday, 25 August 2014

Data Scraping using php

Here is my code

    $ip=$_SERVER['REMOTE_ADDR'];

    $url=file_get_contents("http://whatismyipaddress.com/ip/$ip");

    preg_match_all('/<th>(.*?)<\/th><td>(.*?)<\/td>/s',$url,$output,PREG_SET_ORDER);

    $isp=$output[1][2];

    $city=$output[9][2];

    $state=$output[8][2];

    $zipcode=$output[12][2];

    $country=$output[7][2];

    ?>
    <body>
    <table align="center">
    <tr><td>ISP :</td><td><?php echo $isp;?></td></tr>
    <tr><td>City :</td><td><?php echo $city;?></td></tr>
    <tr><td>State :</td><td><?php echo $state;?></td></tr>
    <tr><td>Zipcode :</td><td><?php echo $zipcode;?></td></tr>
    <tr><td>Country :</td><td><?php echo $country;?></td></tr>
    </table>
    </body>

How do I find out the ISP provider of a person viewing a PHP page?

Is it possible to use PHP to track or reveal it?

Error: http://i.imgur.com/LGWI8.png

Curl Scrapping

<?php
$curl_handle=curl_init();
curl_setopt( $curl_handle, CURLOPT_FOLLOWLOCATION, true );
$url='http://www.whatismyipaddress.com/ip/132.123.23.23';
curl_setopt($curl_handle, CURLOPT_URL,$url);
curl_setopt($curl_handle, CURLOPT_HTTPHEADER, Array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15") );
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Your application name');
$query = curl_exec($curl_handle);

curl_close($curl_handle);
preg_match_all('/<th>(.*?)<\/th><td>(.*?)<\/td>/s',$url,$output,PREG_SET_ORDER);
echo $query;
$isp=$output[1][2];

$city=$output[9][2];

$state=$output[8][2];

$zipcode=$output[12][2];

$country=$output[7][2];
?>
<body>
<table align="center">
<tr><td>ISP :</td><td><?php echo $isp;?></td></tr>
<tr><td>City :</td><td><?php echo $city;?></td></tr>
<tr><td>State :</td><td><?php echo $state;?></td></tr>
<tr><td>Zipcode :</td><td><?php echo $zipcode;?></td></tr>
<tr><td>Country :</td><td><?php echo $country;?></td></tr>
</table>
</body>

Error: http://i.imgur.com/FJIq6.png

What's is wrong with my code here? Any alternative code , that i can use here.

I am not able to scrape that data as described here. http://i.imgur.com/FJIq6.png

P.S. Please post full code. It would be easier for me to understand.

Source: http://stackoverflow.com/questions/10461088/data-scraping-using-php

PDF scraping using R

I have been using the XML package successfully for extracting HTML tables but want to extend to PDF's. From previous questions it does not appear that there is a simple R solution but wondered if there had been any recent developments

Failing that, is there some way in Python (in which I am a complete Novice) to obtain and manipulate pdfs so that I could finish the job off with the R XML package

Extracting text from PDFs is hard, and nearly always requires lots of care.

I'd start with the command line tools such as pdftotext and see what they spit out. The problem is that PDFs can store the text in any order, can use awkward font encodings, and can do things like use ligature characters (the joined up 'ff' and 'ij' that you see in proper typesetting) to throw you.

pdftotext is installable on any Linux system

Source: http://stackoverflow.com/questions/7918718/pdf-scraping-using-r

Sunday, 24 August 2014

Php Scraping data from a website

I am very new to programming and need a little help with getting data from a website and passing it into my PHP script.

The website is http://www.birthdatabase.com/.

I would like to plug in a name (First and Last) and retrieve the result. I know you can query the site by passing the name in the URL, but I am having problems scraping the results.

http://www.birthdatabase.com/cgi-bin/query.pl?textfield=FIRST&textfield2=LAST&age=&affid=

I am using the file_get_contents($URL) function to get the page but need help after that. Specifically, I would like to scrape only the results from a certain state if there are multiple results for that name.

You need the awesome simple_html_dom class.

With this class you can query the webpage's DOM in a similar way to jQuery.

First include the class in your page, then get the page content with this snippet:

$html = file_get_html('http://www.birthdatabase.com/cgi-bin/query.pl?textfield=' . $first . '&textfield2=' . $last . '&age=&affid=');

Then you can use CSS selections to scrape your data (something like this):

$n = 0;
foreach($html->find('table tbody tr td div font b table tbody') as $element) {
@$row[$n]['tr'] = $element->find('tr')->text;
$n++;
}

// output your data
print_r($row);

Source: http://stackoverflow.com/questions/15601584/php-scraping-data-from-a-website