regex - Can't Process Any String from a Txt file
This is weird. I can't process any string from a text file. The only thing I can do is print it out.
Here is the code:
    import scala.io.Source
    import scala.util.matching.Regex

    val pattern = new Regex("</document>")
    val file = Source.fromFile(fileLocale)(io.Codec("iso8859-1"))
    for (line <- file.getLines()) {
      // line match {
      //   case "</document>" => { println("found it!!!!!"); return } // break out
      //   case _ => println(line) // save lines to a file
      // }
      println(line.indexOf("public"))
    }
At first I was using the regex, calling pattern.findFirstIn() and matching on the result. I found that it couldn't produce anything. I thought my regex was off because I don't know regex well (I'm trying to match the closing tag </document> in the txt file, find the first closing tag, exit out of the loop/function, and save what I have read to a different file).
Please don't tell me to use Jsoup. I'm dealing with a text file that is 23 MB big, and it crashed my browser (written in C++) and my plain text editor.
I'm preprocessing the text file to reduce it to a more desirable size, and then I'll use Jsoup to parse the HTML DOM tree.
However, since I can't use the regex, I thought I'd try plain string matching with case "</document>". Well, that did not work. Then I tried line.indexOf("</document>"). That wasn't working either. I wondered if the problem was the / symbol, so I tried to find public, which is in the text file. Still, I can't find it. The result is -1.
The only operation I can do, apparently, is print out the line as it is. What's going on!?
This is a test file I made from the original 23-megabyte file:
<sec-document>0001000180-14-000019.txt : 20140221 <sec-header>0001000180-14-000019.hdr.sgml : 20140221
20140221171951 accession number: 0001000180-14-000019 conformed submission type: 10-k public document count: 17 conformed period of report: 20131229 filed of date: 20140221 date of change: 20140221
filer: company data: company conformed name: sandisk corp central index key: 0001000180 standard industrial classification: computer
storage devices [3572] irs number: 770191793 state of incorporation: de fiscal year end: 1229
filing values: form type: 10-k sec act: 1934 act sec file number: 000-26734 film number: 14634715 business address: street 1: 951 sandisk drive city: milpitas state: ca zip: 95035 business
phone: 408-801-1000
mail address: street 1: 951 sandisk drive city: milpitas state: ca zip: 95035 </sec-header> <document> <type>10-k
1 sndk201310-k.htm form 10-k fy13 sndk 2013 10-k 10-k 2 second part form 10-k fy13
</document> <document> <type>10-k <sequence>2 <filename>third part <description>form 10-k fy13 <text> <!doctype html public "-//w3c//dtd
html 4.01 transitional//en" "http://www.w3.org/tr/html4/loose.dtd">
I'd do the following:
    val lines = file.getLines().takeWhile(!_.contains("</document>"))
This will collect lines until the first one that contains </document>, and it returns an Iterator[String] that you can read only once. Or, if you prefer a list:
    val lines = file.getLines().takeWhile(!_.contains("</document>")).toList
But if memory usage is a problem, you're better off using the Iterator, which reads the file on demand and doesn't need to allocate memory for all of it.
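To tie this back to the goal of saving the leading part to a different file, here is a minimal sketch of the Iterator approach. The object name Preprocess, the helpers linesBefore and copyUntil, and their parameters are hypothetical names chosen for illustration; they are not from the question. It assumes the file really is ISO-8859-1, as in the question's code.

```scala
import java.io.PrintWriter
import scala.io.{Codec, Source}

object Preprocess {
  // Hypothetical helper: lazily yields lines up to, but not including,
  // the first line that contains `marker` as a substring.
  def linesBefore(lines: Iterator[String], marker: String): Iterator[String] =
    lines.takeWhile(line => !line.contains(marker))

  // Hypothetical helper: copies the leading part of `inPath` into `outPath`
  // and returns the number of lines written. Lines are streamed one at a
  // time, so the whole input is never held in memory.
  def copyUntil(inPath: String, outPath: String, marker: String): Int = {
    val in = Source.fromFile(inPath)(Codec.ISO8859)
    val out = new PrintWriter(outPath, "ISO-8859-1")
    try {
      var count = 0
      for (line <- linesBefore(in.getLines(), marker)) {
        out.println(line)
        count += 1
      }
      count
    } finally {
      out.close()
      in.close()
    }
  }
}
```

A call might look like Preprocess.copyUntil("filing.txt", "first-part.txt", "</document>"), where both file names are placeholders. Note that contains is a case-sensitive substring test, so the marker has to use the same casing as the file. A pattern such as case "</document>" is different: it matches only when the whole line equals that string, so it will not fire on a line that has other text around the tag.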