Tuesday, 2 June 2015

Nokogiri parsing: a missing element misaligns the results

I have a plain HTML document with no CSS, and I need to export some of its content to an Excel sheet. I tried Nokogiri, but it works on the basis of CSS selectors.

Has anybody tried this?

<html>
 <head></head>
  <body>
    ***NOTE***
   <br>
      Items 
   <br>
   <br>
      Invoice Number : [78945824] PO Number : [4587958]
   <br>
      Tracking no : 12543
   <br>
   <br>
      Items 
   <br>
   <br>
      Invoice Number : [79546828] PO Number : [4567892]
   <br>
      Tracking no : 
   <br>
   <br>
      Items 
   <br>
   <br>
      Invoice Number : [78976824] PO Number : [897569]
   <br>
      Tracking no : 12543
   <br>
   </body>
   </html>

I am able to retrieve the PO number and the tracking number:

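require 'rubygems'
require 'nokogiri'
require 'open-uri'

PAGE_URL = "a.html"

# Parse the document and take the text content of the body.
page = Nokogiri::HTML(open(PAGE_URL))
data = page.css("body").text

# Capture the PO number from each "Invoice Number : [...] PO Number : [...]" line.
po_numbers = data.scan(/Invoice Number : \[\d+\] PO Number : \[(\d+)\]/).flatten
# Join the text of the selected nodes and split it on whitespace.
tracking_numbers = page.css("a").text.split

[["PO Number", "Tracking Number"]].concat(po_numbers.zip(tracking_numbers))
puts po_numbers
puts tracking_numbers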


=> po_numbers = ["4587958", "4567892", "4587958"]
=> tracking_numbers = ["12543", "12356"]

When we zip those together, we get:

=> po_numbers.zip(tracking_numbers)
=> [["4587958", "12543"], ["4567892", "12356"], ["4587958", nil]]
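
The misalignment comes from how the two arrays are built: `split` yields no token for the empty "Tracking no :" entry, and `zip` pairs elements purely by position, padding the shorter array with `nil` at the end. A minimal sketch using the values quoted above:

```ruby
po_numbers = ["4587958", "4567892", "4587958"]
# The empty tracking entry left no token behind, so only two values remain.
tracking_numbers = ["12543", "12356"]

# zip pairs by index; the missing third tracking number becomes nil.
pairs = po_numbers.zip(tracking_numbers)
p pairs
```

The second tracking number therefore slides up next to the second PO number, even though it belongs to the third.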

What we want is:

=> [["4587958", "12543"], ["4567892", nil], ["4587958", "12356"]]
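
One possible way to keep each pair aligned is to capture the PO number and its tracking number with a single pattern, so a missing tracking number stays attached to its own record. The sketch below is only an illustration under some assumptions: it works on the body text of the sample HTML from the question (inlined here as a heredoc so the example runs without Nokogiri; in the real script it would come from `page.css("body").text`), it assumes every "PO Number" line is followed by a "Tracking no :" line, and `record_pattern` and `rows` are names introduced just for this example.

```ruby
# Body text of the sample document; in the real script this would be
# page.css("body").text.
data = <<~TEXT
  ***NOTE***
  Items
  Invoice Number : [78945824] PO Number : [4587958]
  Tracking no : 12543
  Items
  Invoice Number : [79546828] PO Number : [4567892]
  Tracking no :
  Items
  Invoice Number : [78976824] PO Number : [897569]
  Tracking no : 12543
TEXT

# One match per record: the PO number, then an optional tracking number.
# [ \t]* skips spaces and tabs but not newlines, so an empty
# "Tracking no :" cannot pick up digits from a later line.
record_pattern = /PO Number : \[(\d+)\]\s*Tracking no :[ \t]*(\d*)/

# scan returns one [po, tracking] pair per record; an empty capture
# means the tracking number was missing, so store nil instead.
rows = data.scan(record_pattern).map do |po, tracking|
  [po, tracking.empty? ? nil : tracking]
end

p rows
# → [["4587958", "12543"], ["4567892", nil], ["897569", "12543"]]
```

Because both values come from the same match, no `zip` step is needed, and the header row can still be prepended with `[["PO Number", "Tracking Number"]].concat(rows)`.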
