Automatic reading/parsing of web page?

2006-08-03 10:49:28, Category: Programming & Design
I am looking for good links or advice on parsing (reading) HTML files automatically. I will be building my system in Visual Basic, so please do not offer advice using other languages. I expect to split the HTML tags using a delimiter, maybe "/>" as an end tag. I want to be able to parse each web page and run tests on the text that a web site user actually reads; counting repeated words is just one of many tasks I have to perform. If you know how to read a web page automatically, please give me some advice, because this task is essential for me.

Answers

  1. thegooddeal2000

    On 2006-08-03 11:23:57


    You're on the right track. You can open the file as a text file and iterate through it character by character. When you encounter a "<", stop reading; when you encounter a ">", start reading again. I'm assuming you have some basic VB knowledge, so I will let you dissect this code and interpret for yourself what I would do:

        Dim html As String = My.Computer.FileSystem.ReadAllText("C:\test.htm")
        Dim text As String = ""
        Dim reading As Boolean = True
        ' Copy only the characters that fall outside <...> tags.
        For Each c As Char In html
            If c = "<"c Then
                reading = False
            ElseIf c = ">"c Then
                text += " "   ' a tag boundary also separates words
                reading = True
            ElseIf reading Then
                text += c
            End If
        Next

    Then you can split out the individual words with the text.Split(" "c) method. Worth a shot! If you want to pull the HTML directly from the web, try using the browser control to get the page, then use the above method.
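
    The split-and-count step mentioned above could be sketched roughly as follows, assuming VB 2005 or later and that the tag-stripped text is already in a string. CountWords and the module name are hypothetical names chosen for illustration, and the sample sentence is made up.

    ```vbnet
    Imports System
    Imports System.Collections.Generic
    Imports Microsoft.VisualBasic

    Module WordCountSketch
        ' Hypothetical helper: counts how often each word occurs in
        ' text that has already had its HTML tags removed.
        Function CountWords(ByVal text As String) As Dictionary(Of String, Integer)
            ' Ignore case so that "The" and "the" count as the same word.
            Dim counts As New Dictionary(Of String, Integer)(StringComparer.OrdinalIgnoreCase)
            Dim separators() As Char = {" "c, ControlChars.Tab, ControlChars.Cr, ControlChars.Lf}
            ' RemoveEmptyEntries drops the empty strings produced by runs of spaces.
            For Each word As String In text.Split(separators, StringSplitOptions.RemoveEmptyEntries)
                Dim n As Integer = 0
                counts.TryGetValue(word, n)   ' n stays 0 if the word is new
                counts(word) = n + 1
            Next
            Return counts
        End Function

        Sub Main()
            Dim counts As Dictionary(Of String, Integer) = CountWords("the cat saw the other cat")
            ' Report only the words that repeat.
            For Each pair As KeyValuePair(Of String, Integer) In counts
                If pair.Value > 1 Then Console.WriteLine(pair.Key & ": " & pair.Value.ToString())
            Next
        End Sub
    End Module
    ```

    This sketch treats punctuation as part of a word, so "cat," and "cat" would be counted separately; adding punctuation characters to the separators array is one way to handle that.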
  2. TruthIsGod

    On 2006-08-03 10:57:33


    I once checked and found that a module for regular expression support is available for VB. That should work.
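
    As a rough illustration of the regular expression route, and assuming VB.NET (where regular expressions live in the System.Text.RegularExpressions namespace), tag stripping might look like the sketch below. StripTags and the module name are hypothetical, and the pattern is a deliberately simple one, not a full HTML parser.

    ```vbnet
    Imports System
    Imports System.Text.RegularExpressions

    Module RegexStripSketch
        ' Hypothetical helper used only for illustration.
        Function StripTags(ByVal html As String) As String
            ' Replace everything from "<" up to the next ">" with a space.
            Dim noTags As String = Regex.Replace(html, "<[^>]*>", " ")
            ' Collapse the runs of whitespace left behind into single spaces.
            Return Regex.Replace(noTags, "\s+", " ").Trim()
        End Function

        Sub Main()
            ' Prints: Hello world
            Console.WriteLine(StripTags("<p>Hello <b>world</b></p>"))
        End Sub
    End Module
    ```

    A pattern this simple leaves the contents of script and style blocks in place and does not decode entities such as &amp;, so real pages would probably need extra handling.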