regex - Can't process any string from a txt file


This is weird. I can't process any string from a text file. The only thing I can do is print it out.

Here is the code:

    import scala.io.Source
    import scala.util.matching.Regex

    val pattern = new Regex("</document>")
    val file = Source.fromFile(filelocale)(io.Codec("iso8859-1"))
    for (line <- file.getLines()) {
    //  line match {
    //    case "</document>" => {println("found it!!!!!"); return} // break out
    //    case _ => println(line)   // save lines to file
    //  }
      println(line.indexOf("public"))
    }

At first I was using a regex: I called pattern.findFirstIn() and pattern matched on the result. I found that it couldn't produce anything. I thought my regex was off, because I don't know regex well. (I'm trying to match the closing tag </document> in a txt file: find the first closing tag, exit out of the loop/function, and save what I have read to a different file.)
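For reference, here is a minimal sketch of how findFirstIn behaves on in-memory strings. The object name, the helper firstClosingTag, and the sample lines are made up for illustration; they are not from the original code. Note that / has no special meaning in a Scala (Java) regex, so the pattern needs no escaping.

```scala
import scala.util.matching.Regex

object FindFirstInSketch {
  // "/" is an ordinary character in java.util.regex, so no escaping is needed.
  val pattern: Regex = new Regex("</document>")

  // Hypothetical helper: returns Some(matchedText) if the tag occurs
  // anywhere in the line, None otherwise.
  def firstClosingTag(line: String): Option[String] =
    pattern.findFirstIn(line)

  def main(args: Array[String]): Unit = {
    // Sample lines invented for this sketch.
    val sampleLines = List(
      "<document> <type>10-k",
      "</document> <document> <type>10-k"
    )
    for (line <- sampleLines) {
      firstClosingTag(line) match {
        case Some(tag) => println(s"found $tag in: $line")
        case None      => println(s"no match in: $line")
      }
    }
  }
}
```

If this behaves as expected on literal strings but not on lines read from the file, the pattern itself is probably fine and the difference lies in what the lines actually contain.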

Please don't tell me to use Jsoup. I'm dealing with a text file that is 23 MB big, and it crashed both my browser (written in C++) and my plain text editor.

I'm preprocessing the text file to reduce it to a more desirable size; then I'll use Jsoup to parse the HTML into a DOM tree.

However, since I can't use regex, I thought I would do plain string matching with case "</document>". Well, that did not work. Then I tried line.indexOf("</document>"). That wasn't working either. I wondered if the problem was the / symbol, so I tried to find public, which is in the text file. Still, I can't find it. All the results are -1.

The only operation I can do, apparently, is print out the line as it is. What's going on!?
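The approaches described above differ in what they accept, which a small sketch can make concrete. A case "</document>" pattern only matches when the whole line equals that string, while contains and indexOf look for a substring. All three are case-sensitive, so a difference in letter case between the search string and the file would also give -1; whether that applies to the real file is only an assumption here. The object and method names below are invented for this example.

```scala
object LineMatchingSketch {
  // Matches only if the entire line is exactly "</document>".
  def isExactly(line: String): Boolean = line match {
    case "</document>" => true
    case _             => false
  }

  // Substring checks: succeed if the tag appears anywhere in the line.
  def hasTag(line: String): Boolean = line.contains("</document>")
  def tagIndex(line: String): Int = line.indexOf("</document>")

  def main(args: Array[String]): Unit = {
    // A line shaped like the one in the test file, where the tag is
    // followed by more text on the same line.
    val line = "</document> <document> <type>10-k"
    println(s"exact match: ${isExactly(line)}")
    println(s"contains:    ${hasTag(line)}")
    println(s"indexOf:     ${tagIndex(line)}")

    // Assumed scenario: the file uses a different letter case.
    println(s"indexOf on upper-case line: ${tagIndex("</DOCUMENT>")}")
  }
}
```

If the case of the file's content is uncertain, comparing against line.toLowerCase is one way to rule that possibility out.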


This is a test file I made from the original 23-megabyte file:

    <sec-document>0001000180-14-000019.txt : 20140221
    <sec-header>0001000180-14-000019.hdr.sgml : 20140221
    20140221171951
    accession number: 0001000180-14-000019
    conformed submission type: 10-k
    public document count: 17
    conformed period of report: 20131229
    filed of date: 20140221
    date of change: 20140221

    filer:

        company data:
            company conformed name: sandisk corp
            central index key: 0001000180
            standard industrial classification: computer storage devices [3572]
            irs number: 770191793
            state of incorporation: de
            fiscal year end: 1229

        filing values:
            form type: 10-k
            sec act: 1934 act
            sec file number: 000-26734
            film number: 14634715

        business address:
            street 1: 951 sandisk drive
            city: milpitas
            state: ca
            zip: 95035
            business phone: 408-801-1000

        mail address:
            street 1: 951 sandisk drive
            city: milpitas
            state: ca
            zip: 95035
    </sec-header>
    <document>
    <type>10-k
    1 sndk201310-k.htm form 10-k fy13 sndk 2013 10-k 10-k 2 second part form 10-k fy13
    </document>
    <document>
    <type>10-k
    <sequence>2
    <filename>third part
    <description>form 10-k fy13
    <text>
    <!doctype html public "-//w3c//dtd html 4.01 transitional//en" "http://www.w3.org/tr/html4/loose.dtd">

I'd do the following:

    val lines = file.getLines().takeWhile(!_.contains("</document>"))

This collects lines until the first one that contains </document> (that line itself is not included) and returns an Iterator[String], which can be read only once. Or, if you prefer a list:

    val lines = file.getLines().takeWhile(!_.contains("</document>")).toList

But if memory usage is a problem, you're better off using the iterator: it reads the file on demand and doesn't need to hold all of it in memory.
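Putting the iterator approach together with the goal of saving the leading part to a different file, a minimal sketch could look like this. It is only an illustration under assumptions: the object name, the helper copyUntil, the temporary files, and the sample content are all made up, and ISO-8859-1 is used for both reading and writing simply because the question's code uses that codec.

```scala
import java.io.{File, PrintWriter}
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import scala.io.{Codec, Source}

object TrimAtFirstClosingTag {
  // Hypothetical helper: copies lines from `in` to `out` until the first
  // line containing `marker` (that line is not copied) and returns the
  // number of lines written. Lines are streamed, not loaded all at once.
  def copyUntil(in: File, out: File, marker: String): Int = {
    val source = Source.fromFile(in)(Codec.ISO8859)
    val writer = new PrintWriter(out, "ISO-8859-1")
    try {
      var count = 0
      source.getLines().takeWhile(!_.contains(marker)).foreach { line =>
        writer.println(line)
        count += 1
      }
      count
    } finally {
      writer.close()
      source.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Temporary files with invented sample content, for demonstration only.
    val in = File.createTempFile("filing", ".txt")
    val out = File.createTempFile("trimmed", ".txt")
    val sample = "<document>\n<type>10-k\n</document>\n<document>\n"
    Files.write(in.toPath, sample.getBytes(StandardCharsets.ISO_8859_1))

    val written = copyUntil(in, out, "</document>")
    println(s"wrote $written lines to $out")
  }
}
```

Because takeWhile stops before the matching line, the closing tag itself is not written to the output; if it should be kept, it has to be appended separately.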

