Regexp matches many in Rubular but not in production.

● ARCHIVED · READ-ONLY

Started by tyler.kendrick Nov 28, 2014, 04:12 AM 8 posts

tyler.kendrick

#1 Nov 28, 2014, 04:12 AM Source

* Edit: Neglected to put [Ace] in the header.

On Rubular, the following regular expression produces 7 matches, as expected:

/^<(\w+)\s*(\w+=".*")*\s*(?:(?:\/\s*>)|(?:>(.*)<\s*\/\s*\1\s*>))$/m

However, the following line only returns the last match in a note box.

/^<(\w+)\s*(\w+=".*")*\s*(?:(?:\/\s*>)|(?:>(.*)<\s*\/\s*\1\s*>))$/m.match(note) { |m|
  name = m[1]
  attributes = @options[:parse_attr].call(m[2])
  innerText = @options[:parse_text].call(m[3])
  msgbox_p("#{name} found with attributes: #{attributes.join(',')}. innerText=#{innerText}")
}

The note contains the following text:

This be text

<tag/>

<tag2></tag2>

<tag3 value="text" />

<tag4>innerText</tag4>

<tag5 value="text" value2="text">

inertia

</tag5>

<alert>Message from a1</alert>

<actor_tag>inner text</actor_tag>

The text is the same on Rubular and in the note section this was called from.

Any ideas why the engine appears to parse differently from the note section? Is there some oddness with line-breaks that I'm neglecting?

FenixFyreX

#2 Nov 28, 2014, 05:42 AM Source

Use the method String#scan, it'll return all of the matches, like so:

Code:

string = <<HDOC
<tag/>
<tag2></tag2>
<tag3 value="text" />
<tag4>innerText</tag4>
<tag5 value="text" value2="text">
inertia
</tag5>
<alert>Message from a1</alert>
<actor_tag>inner text</actor_tag>
HDOC
string.scan(/^<(\w+)\s*(\w+=".*")*\s*(?:(?:\/\s*>)|(?:>(.*)<\s*\/\s*\1\s*>))$/m)
# => [['tag', nil, nil], ['tag2', nil, ''], ['tag3', 'value="text"', nil]] # and so on, so forth

tyler.kendrick

#3 Nov 28, 2014, 09:27 PM Source

Still doesn't address the issue.

Put the following call-script on an event:

actor_id = 1
actor = $data_actors[actor_id]
regexp = /^<(\w+)\s*(\w+=".*")*\s*(?:(?:\/\s*>)|(?:>(.*)<\s*\/\s*\1\s*>))$/m
note = actor.note
matches = note.scan(regexp) { |x|
  msgbox_p("matched: " + x.inspect)
}

Then put the following text in the actor's note section:

This be text

<tag/>

<tag2></tag2>

<tag3 value="text" />

<tag4>innerText</tag4>

<tag5 value="text" value2="text">

inertia

</tag5>

<alert>Message from a1</alert>

<actor_tag>inner text</actor_tag>

The Problem: Only one match (the last tag) is found.

The same regular expression matches many tags in Rubular, but only matches the last tag in a note section.

tyler.kendrick

#4 Nov 28, 2014, 09:45 PM Source

I made a silly discovery. Yes, the behavior is different between rubular's regexp engine and RMVXA's. However, this seems to be because of the way the note section is parsed.

I believe this is because when the note is parsed, the line endings are converted to escape characters - meaning that "$" will prevent a match, and make the last tag valid (assuming it is not followed by a line-break or any other character).

Simply removing the "$" character from the regexp will allow RMVXA's engine to parse the note text uninterrupted.

Tsukihime

#5 Nov 28, 2014, 10:33 PM Source

No, line endings are preserved as you would expect in a Windows environment: \r\n
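
A minimal sketch of why CRLF endings would produce the reported behavior, assuming the pattern from the first post is reconstructed with `(?:(?:` where the archived text shows `(??:`, and that the note box joins lines with `\r\n` as stated above. In Ruby, `$` matches only immediately before `\n` or at the very end of the string, so the `\r` that follows each closing `>` blocks the match on every line except the last. The variable names are illustrative.

```ruby
# Pattern from the first post; the (?:(?: reconstruction is an assumption.
tag_re = /^<(\w+)\s*(\w+=".*")*\s*(?:(?:\/\s*>)|(?:>(.*)<\s*\/\s*\1\s*>))$/m

lines = [
  'This be text',
  '<tag/>',
  '<tag2></tag2>',
  '<tag3 value="text" />',
  '<tag4>innerText</tag4>',
  '<tag5 value="text" value2="text">',
  'inertia',
  '</tag5>',
  '<alert>Message from a1</alert>',
  '<actor_tag>inner text</actor_tag>'
]

crlf_note = lines.join("\r\n") # what the note box is said to contain
lf_note   = lines.join("\n")   # what a paste into Rubular presumably contains

# Keep only the tag name (first capture group) of every match.
crlf_names = crlf_note.scan(tag_re).map { |name, _attrs, _text| name }
lf_names   = lf_note.scan(tag_re).map { |name, _attrs, _text| name }

puts "CRLF: #{crlf_names.inspect}"
puts "LF:   #{lf_names.inspect}"
```

With CRLF only the final tag survives, because it is the only one followed by the end of the string rather than by `\r`.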

cremnophobia

#6 Nov 28, 2014, 11:01 PM Source

I have to say I'd consider this a bug. I don't expect CRLF as the newline. The source code of scripts also uses it. Ace does the right thing by using UTF-8 where it matters, even though Windows uses UTF-16LE (and the legacy ANSI code pages). Why not also use only LF? That is far easier and faster than converting strings from/to UTF-8, and is just as sane.

At least the String#encode and the newline transcoders work in RGSS3.
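
One way to use those transcoders is to normalize the note to LF before matching. The sketch below assumes the `universal_newline` option of `String#encode` behaves in RGSS3 as it does in standard Ruby; `normalize_newlines` is a hypothetical helper name, and the pattern is a simplified stand-in for illustration, not the one from the thread.

```ruby
# Hypothetical helper: convert CRLF (and lone CR) to LF using the
# newline transcoder built into String#encode.
def normalize_newlines(text)
  text.encode(text.encoding, universal_newline: true)
end

# Simplified pattern for illustration only: a self-closing tag that
# occupies a whole line.
self_closing = /^<(\w+)\s*\/>$/

crlf_note = "This be text\r\n<tag/>\r\n<other/>"
lf_note   = normalize_newlines(crlf_note)

raw_names   = crlf_note.scan(self_closing).flatten # "\r" blocks "$" on all but the last line
clean_names = lf_note.scan(self_closing).flatten   # every tag line now ends right before "\n"

puts "raw:   #{raw_names.inspect}"
puts "clean: #{clean_names.inspect}"
```

Normalizing once at the point where the note is read keeps the patterns themselves free of any line-ending handling.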

Zeriab

#7 Nov 29, 2014, 09:31 AM Source

With CR+LF being the Windows platform newline, I would rather say that not expecting it as the newline in a Windows program is a user error ;)

More generally, I would say to expect CR?LF as a possible newline match. The truth is more complicated, but we can reasonably expect to encounter only LF and CR+LF as newlines.
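
The CR?LF idea can be sketched by allowing an optional `\r` in front of the end-of-line anchor. This assumes the pattern from the first post is reconstructed with `(?:(?:` where the archived text shows `(??:`; the shortened note and the variable names are illustrative.

```ruby
# Pattern from the first post (reconstruction is an assumption), with an
# optional \r allowed before the end-of-line anchor.
tag_re = /^<(\w+)\s*(\w+=".*")*\s*(?:(?:\/\s*>)|(?:>(.*)<\s*\/\s*\1\s*>))\r?$/m

lines = ['<tag/>', '<tag3 value="text" />', '<tag4>innerText</tag4>']

# Collect the tag name (first capture group) of each match.
crlf_names = lines.join("\r\n").scan(tag_re).map(&:first)
lf_names   = lines.join("\n").scan(tag_re).map(&:first)

puts "CRLF: #{crlf_names.inspect}"
puts "LF:   #{lf_names.inspect}"
```

Both inputs yield the same tag names, so the pattern no longer depends on which of the two line-ending styles the text uses.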

@tyler:

With the tags being numbers, requiring start and end tags seems like a rather unnecessary condition. Do you really need to enforce such a constraint?

*hugs*

- Zeriab

FenixFyreX

#8 Nov 29, 2014, 10:13 PM Source

Further testing revealed the mistake in my response above; my apologies. I would recommend replacing $ with [\r\n]+; that works for me with the example you provided, run from an event parsing actor 1's notebox.

[\r\n]+ matches both *nix and Windows line ending styles, so it's a general go-to anyways.
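
One caveat may be worth checking, sketched here under the assumption that the note's last line has no trailing line break: `[\r\n]+` needs at least one line-break character, so a tag on the very last line would be skipped. Adding `\z` as an alternative is one possible adjustment; it is not something proposed in the thread. `build_re` is a hypothetical helper, and the pattern is the one from the first post with `(?:(?:` reconstructed where the archived text shows `(??:`.

```ruby
# Hypothetical helper that builds the pattern from the first post
# (reconstruction is an assumption) with a configurable line ending.
build_re = lambda do |line_end|
  /^<(\w+)\s*(\w+=".*")*\s*(?:(?:\/\s*>)|(?:>(.*)<\s*\/\s*\1\s*>))#{line_end}/m
end

strict  = build_re.call('[\r\n]+')        # as recommended in the post above
lenient = build_re.call('(?:[\r\n]+|\z)') # also accepts the end of the string

lines = ['<tag/>', '<alert>Message from a1</alert>', '<actor_tag>inner text</actor_tag>']
note  = lines.join("\r\n")                # no line break after the last tag

strict_names  = note.scan(strict).map(&:first)
padded_names  = (note + "\r\n").scan(strict).map(&:first)
lenient_names = note.scan(lenient).map(&:first)

puts "strict:  #{strict_names.inspect}"
puts "padded:  #{padded_names.inspect}"
puts "lenient: #{lenient_names.inspect}"
```

If the note box always ends with a line break, the strict form is enough; the lenient form only matters when the final tag is the last thing in the text.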