Getting Started with Data Toolbar for Internet Explorer

Version 3.4.7367, 2020-03-04

Install the Toolbar and Start the Wizard

Download the setup file and install it using the default settings. Restart Internet Explorer (32-bit or 64-bit) and navigate to the web site you want to extract data from. Make sure that the Internet Explorer security level is set to medium-high or lower. For IE 11 on Windows 8.1, turn Enhanced Protected Mode off. For best results, use the latest version of Internet Explorer available for your operating system.

In this walkthrough we will use a product catalog (a list of Canon cameras) from the www.bestbuy.com web site. To start the DataTool wizard, click on the DataTool button in the toolbar area of Internet Explorer.

Add Columns

When the wizard is open, moving your mouse pointer over the web page automatically highlights page elements that can be marked as data fields. With the Add Column radio-button selected, clicking on a data field or an image will automatically create a new column. In column selection mode Internet Explorer navigation is controlled by the wizard so clicking on a hyperlink does not open a new page. Use right-click for element selection to avoid pop-up windows or page updates.

Choose any record as a sample and, using this record, point to the data you want to collect from all of the records on the web site. As you select new fields, additional columns are created automatically.

Test your column selection by pressing the Get Data button.

  • If the wizard has not identified items correctly, add more sample fields to improve item recognition.
  • If only one record has been extracted, make sure that all sample fields belong to the same item.
  • Sometimes column layout changes from one item to another. The same text field can be represented using different formatting options or using different HTML elements. Selecting such an element as a sample will cause missing data. In this case, instead of selecting a text element directly, select its parent container, which can be a table cell (TD) or a DIV element.
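
The last point can be illustrated with a small sketch. It is not the toolbar's own logic; the markup, class names and prices below are made up, and Python's standard XML parser stands in for the browser's document tree. The price is wrapped in a B element in one row and in a SPAN element in the other, so sampling the inner element misses a row while sampling the parent TD cell does not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup: the price sits in <b> in the first row
# and in <span> in the second row.
markup = """
<table>
  <tr><td class="name">Camera A</td><td class="price"><b>$299.99</b></td></tr>
  <tr><td class="name">Camera B</td><td class="price"><span>$349.99</span></td></tr>
</table>
""".strip()

table = ET.fromstring(markup)

# Sampling the inner <b> element only finds the rows that use <b>.
direct = [b.text for b in table.findall("./tr/td/b")]

# Sampling the parent cell and reading all of its text finds every row.
via_parent = [
    "".join(td.itertext()).strip()
    for td in table.findall("./tr/td[@class='price']")
]

print(direct)      # ['$299.99']
print(via_parent)  # ['$299.99', '$349.99']
```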

Add Data and Images from Details Page

Click on the Add Details radio-button to add a high resolution image or a detailed description from a Details page associated with the current item. The browser will automatically open that page using the first link found in the column list. When navigation is complete, click on the fields you want to add. To return to the master page press either the Add Columns or the Set Next Element button.

Sometimes a Details page contains all the information about an item that you need. You still need to add a link column from the primary list so that the program knows how to navigate from one details page to another. You can easily delete the extra column from the final output file.

On some web sites, Details pages show a varying amount of detail for different items. For example, in a business directory some companies may publish less information about themselves than others. That changes the position of web elements in the document tree and causes data extraction errors. The problem can usually be resolved by linking a data field to its prompt.
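
To see why linking a field to its prompt helps, consider the following sketch. It is only an illustration under assumed data, not the toolbar's implementation: the two directory entries, the prompt names and the helper functions by_position and by_prompt are all hypothetical.

```python
# Two hypothetical directory entries as (prompt, value) pairs.
# The second company publishes fewer details than the first.
full_entry = [
    ("Name", "Acme Ltd"),
    ("Phone", "555-0100"),
    ("Fax", "555-0101"),
    ("Web", "acme.example"),
]
short_entry = [
    ("Name", "Bolt Co"),
    ("Web", "bolt.example"),
]


def by_position(entry, index):
    """Pick a value by its position in the entry (breaks when fields are missing)."""
    return entry[index][1] if index < len(entry) else None


def by_prompt(entry, prompt):
    """Pick a value by the prompt (label) shown next to it."""
    for label, value in entry:
        if label == prompt:
            return value
    return None


# Position 1 is the phone number in the full entry but the web address in the short one.
print(by_position(full_entry, 1), by_position(short_entry, 1))

# Looking up by prompt returns the right field, or None when it is absent.
print(by_prompt(full_entry, "Phone"), by_prompt(short_entry, "Phone"))
print(by_prompt(full_entry, "Web"), by_prompt(short_entry, "Web"))
```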

Select Crawling Rule (Next Page element)

Where a web site features a NEXT page option, the Data Toolbar will automatically collect data from all available pages. Once you have finished selecting the data fields, select the "Set Next element" radio-button, then place your mouse on the Next button on the web page and click. You will then see the Next Element added to the column list. Make sure that the click has not caused a web page update.

If a web site does not have a dedicated "Next element" but has numeric page links 1 2 3..., select number 2 as a crawling rule. The program will automatically increment it to go through the whole range.
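
The numeric crawling rule can be pictured roughly as follows. This is a guess at the behavior described above, not the program's actual code; the function names and the list of link labels are hypothetical.

```python
def next_page_label(current_label):
    """Return the label of the page link to follow after the current one."""
    return str(int(current_label) + 1)


def crawl_order(sample_label, available_labels):
    """List page links in the order they would be followed, starting from the sample."""
    order = []
    label = sample_label
    # Keep incrementing until no link with that number exists on the page.
    while label in available_labels:
        order.append(label)
        label = next_page_label(label)
    return order


# Numeric page links found on a hypothetical results page.
links = ["1", "2", "3", "4"]

# The user is on page 1 and selects link "2" as the crawling rule.
visited = crawl_order("2", links)
print(visited)  # ['2', '3', '4']
```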

Editing and Removing a Column

If you have selected a data field you are not happy with, click on the red button on the far right hand side. This will remove the field you have selected. In the same way you can reset the Next page element. The default column names assigned by the program can be edited. Just click on a cell containing the name and type a new name. Press Clear to clear the column list.

Use the Up and Down buttons in the top left corner of the data grid to change the column order.

Extract Web Data

Once you have selected the data fields and set the Next Element, click on the Get Data button. The program will start collecting web page data showing you the number of processed pages and extracted data rows. At any time you can interrupt data scraping by clicking either the Show Data or the Edit Tags button.

Review Data

After all pages are processed, the wizard goes into Review Data mode. You can review the collected information before saving it on your computer. The search box can be used to filter data. Checking the Show Complete Text checkbox wraps the text and adjusts the cell's height to fit the text without trimming.

If, instead of the multiple records that you see on the web page, the program collects just one, press the Edit Tags button to go back and check that all of the columns that you selected belong to the same record.

If you are satisfied with the collected information press Continue to go to the Save Data screen.

Save Data

The Save Data screen presents two options: Saving Data and Adding More Data Rows. Pressing the Continue button on the Save Data screen defaults to Save and Exit. The program can save data as a CSV, XML or HTML table. These formats can be easily imported into an Excel or Google spreadsheet. If you have collected images as well, select the desired location for the downloaded images on your computer. Selecting Web location will keep references to the original image locations on the Web. Checking the Open File checkbox opens the generated data file as soon as it is created.

The Free edition limits program output to 100 records. There are no limitations in the standard edition.

Use Data

The picture on the right shows a CSV file generated by the Data Toolbar opened in Excel.
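
A saved CSV file can also be processed by a script. The sketch below uses Python's standard csv module. The column names (Name, Price) and the rows are hypothetical and stand in for a file produced by the toolbar; an in-memory string is used so the example runs without a real file.

```python
import csv
import io

# Stand-in for a file saved by the toolbar; column names and rows are hypothetical.
csv_text = (
    "Name,Price\n"
    "Canon PowerShot,$299.99\n"
    'Canon EOS Rebel,"$1,049.99"\n'
)


def load_rows(handle):
    """Read all records from an open CSV handle as dictionaries keyed by column name."""
    return list(csv.DictReader(handle))


def price_to_float(text):
    """Convert a price such as '$1,049.99' to a number."""
    return float(text.replace("$", "").replace(",", ""))


rows = load_rows(io.StringIO(csv_text))
total = sum(price_to_float(row["Price"]) for row in rows)

for row in rows:
    print(row["Name"], price_to_float(row["Price"]))
print("Total:", round(total, 2))
```

To read a real file, pass open(path, newline="") to load_rows instead of the in-memory string; the file's encoding is not stated in this guide, so it may need to be specified explicitly.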

For web sites that may not offer a Next button, you can continue to extract data using the Add More Data Rows option. Once selected, press Continue. Next, navigate to the web page from which you wish to collect data and press Get Data. You can repeat this process as often as you like, adding data to the same CSV file before saving. When Edit Columns before adding rows is selected, the standard edit display is shown, allowing you to make any changes required.

Advanced Column Selection and Editing

To get access to advanced column editing options click on the icon in the first column of the data grid. Advanced options include:

  • Easy navigation between child and parent HTML elements.
    To select a parent element (container) press the top button on the far right side of the form.
  • Changing the default capture type. For example, capturing a URL instead of text.
    The following four capture types are available: Text (Inner Text), Image (Inner Image), Link (Associated Link), HTML (Outer HTML).
  • Viewing and selecting inner images of an element. To see all available inner images of an item, select its container element first.
    That will open an image selection panel. Click on the image on the panel to select it.
  • Filtering content using regular expressions. Enter a regular expression into the "Find Match" text box. This option can be used to extract information from raw HTML code.

Useful regular expressions:

To extract a numeric value (e.g. a price) use either the [0-9.,]+ or the [$][0-9.,]+ expression.
To extract the text between two strings use the start-string(.*?)end-string expression.
To extract an email address use the (([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})) expression.
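
The expressions can be tried outside the wizard before typing them into the "Find Match" box. The sketch below uses Python's re module purely for illustration; the toolbar's own regular expression engine may differ in details. The sample text is made up, "Price: " and " - " are assumed start and end strings, and the digit class is written as 0-9 so that zeros in a price are matched.

```python
import re

# Sample text is made up for illustration; it is not real catalog data.
sample = "Canon EOS Rebel - Price: $1,049.99 - Contact: sales@example.com"

# Numeric value (a price), without and with the leading dollar sign.
number = re.search(r"[0-9.,]+", sample).group(0)
price = re.search(r"[$][0-9.,]+", sample).group(0)

# Text between two strings: the start string is "Price: ", the end string is " - ".
between = re.search(r"Price: (.*?) - ", sample).group(1)

# Email address, using the expression from the list above.
email_pattern = r"(([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5}))"
email = re.search(email_pattern, sample).group(0)

print(number)   # 1,049.99
print(price)    # $1,049.99
print(between)  # $1,049.99
print(email)    # sales@example.com
```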

Partial Web Page Updates

Sometimes a web site updates just a part of the web page instead of navigating to a new URL in response to a user action. Usually that happens when a user clicks on a "Next Page" element. Partial updates reduce screen flickering and

Project Settings

Data Toolbar associates the column list with the web site for which it has been created. The column list is saved and loaded automatically when you close or open the wizard. Besides the column list, there are some advanced program options that can be associated with the web site. Select Options to manage advanced project settings. The Options screen allows you to change download rules for "Details" and "Next" web pages, and to export or import a project as a text file. Do not change the project settings unless you need to resolve a problem.

Expected site response can be set either to "New page" (default) or "Partial Update". Partial updates are used by web designers to eliminate flickering caused by full page updates. Partial updates do not generate a normal event flow and are processed based on timer events.
Decrease the default value of Delay after page complete event to 0.5 seconds to improve program performance. Keep it at 2.5 seconds or increase it for pages that use asynchronous JavaScript (AJAX).
Use the "Open details page in a hidden window" option to eliminate a page reload when going back from the details page to the master page.

The Web browser tab allows you to run the wizard in "Explorer" or "Standalone" mode. Standalone mode may improve web scraping performance by not showing downloaded content in Internet Explorer and by running the extraction task as a separate process.

Project settings can be explicitly exported into an XML file. This can be useful for sites that require multiple data scraping schemes.

Save vertical space for browsing

Data Toolbar does not need much space. Right-click anywhere in the toolbar area of Internet Explorer and make sure that the Lock the Toolbars menu item is Off. Then drag the Data Toolbar to put it on the same line as the Menu Bar or another toolbar that you have. In the picture below, the Menu Bar, the Data Toolbar and the Google toolbar share the same horizontal bar.

At any time you can hide the Data Toolbar completely using the close toolbar button [x]. If you disable the toolbar, you should also disable the Data Toolbar Helper. To bring the toolbar back, right-click anywhere in the toolbar area (IE8) or Home button area (IE9) of Internet Explorer and enable the toolbar and its helper.