Filename | /Users/timbo/perl5/perlbrew/perls/perl-5.18.2/lib/site_perl/5.18.2/PPI/Tokenizer.pm |
Statements | Executed 3487328 statements in 3.64s |
Calls | P | F | Exclusive Time | Inclusive Time | Subroutine |
---|---|---|---|---|---|
149609 | 1 | 1 | 1.30s | 4.58s | _process_next_char | PPI::Tokenizer::
26904 | 2 | 1 | 857ms | 6.13s | _process_next_line | PPI::Tokenizer::
94513 | 1 | 1 | 602ms | 6.79s | get_token | PPI::Tokenizer::
56533 | 14 | 7 | 428ms | 681ms | _new_token | PPI::Tokenizer::
20542 | 6 | 4 | 261ms | 305ms | _previous_significant_tokens | PPI::Tokenizer::
94513 | 29 | 16 | 218ms | 218ms | _finalize_token | PPI::Tokenizer::
27281 | 3 | 2 | 186ms | 246ms | _fill_line | PPI::Tokenizer::
144 | 1 | 1 | 162ms | 162ms | CORE:subst (opcode) | PPI::Tokenizer::
144 | 1 | 1 | 118ms | 503ms | new | PPI::Tokenizer::
27287 | 3 | 2 | 60.1ms | 60.1ms | _get_line | PPI::Tokenizer::
1866 | 1 | 1 | 16.1ms | 34.6ms | _opcontext | PPI::Tokenizer::
15534 | 1 | 1 | 4.15ms | 4.15ms | CORE:match (opcode) | PPI::Tokenizer::
144 | 1 | 1 | 1.34ms | 1.76ms | _clean_eof | PPI::Tokenizer::
52 | 2 | 1 | 488µs | 589µs | _last_significant_token | PPI::Tokenizer::
1 | 1 | 1 | 135µs | 224µs | BEGIN@88 | PPI::Tokenizer::
1 | 1 | 1 | 12µs | 23µs | BEGIN@81 | PPI::Tokenizer::
1 | 1 | 1 | 7µs | 35µs | BEGIN@82 | PPI::Tokenizer::
1 | 1 | 1 | 6µs | 23µs | BEGIN@90 | PPI::Tokenizer::
1 | 1 | 1 | 3µs | 3µs | BEGIN@83 | PPI::Tokenizer::
1 | 1 | 1 | 3µs | 3µs | BEGIN@84 | PPI::Tokenizer::
1 | 1 | 1 | 3µs | 3µs | BEGIN@85 | PPI::Tokenizer::
1 | 1 | 1 | 3µs | 3µs | BEGIN@87 | PPI::Tokenizer::
1 | 1 | 1 | 3µs | 3µs | BEGIN@86 | PPI::Tokenizer::
1 | 1 | 1 | 3µs | 3µs | BEGIN@91 | PPI::Tokenizer::
0 | 0 | 0 | 0s | 0s | __ANON__[:211] | PPI::Tokenizer::
0 | 0 | 0 | 0s | 0s | _char | PPI::Tokenizer::
0 | 0 | 0 | 0s | 0s | _last_token | PPI::Tokenizer::
0 | 0 | 0 | 0s | 0s | all_tokens | PPI::Tokenizer::
0 | 0 | 0 | 0s | 0s | decrement_cursor | PPI::Tokenizer::
0 | 0 | 0 | 0s | 0s | increment_cursor | PPI::Tokenizer::
Line | Statements | Time on line | Calls | Time in subs | Code |
---|---|---|---|---|---|
1 | package PPI::Tokenizer; | ||||
2 | |||||
3 | =pod | ||||
4 | |||||
5 | =head1 NAME | ||||
6 | |||||
7 | PPI::Tokenizer - The Perl Document Tokenizer | ||||
8 | |||||
9 | =head1 SYNOPSIS | ||||
10 | |||||
11 | # Create a tokenizer for a file, array or string | ||||
12 | $Tokenizer = PPI::Tokenizer->new( 'filename.pl' ); | ||||
13 | $Tokenizer = PPI::Tokenizer->new( \@lines ); | ||||
14 | $Tokenizer = PPI::Tokenizer->new( \$source ); | ||||
15 | |||||
16 | # Return all the tokens for the document | ||||
17 | my $tokens = $Tokenizer->all_tokens; | ||||
18 | |||||
19 | # Or we can use it as an iterator | ||||
20 | while ( my $Token = $Tokenizer->get_token ) { | ||||
21 | print "Found token '$Token'\n"; | ||||
22 | } | ||||
23 | |||||
24 | # If we REALLY need to manually nudge the cursor, you | ||||
25 | # can do that too (the lexer needs this ability to do rollbacks) | ||||
26 | $is_incremented = $Tokenizer->increment_cursor; | ||||
27 | $is_decremented = $Tokenizer->decrement_cursor; | ||||
28 | |||||
29 | =head1 DESCRIPTION | ||||
30 | |||||
31 | PPI::Tokenizer is the class that provides Tokenizer objects for use in | ||||
32 | breaking strings of Perl source code into Tokens. | ||||
33 | |||||
34 | By the time you are reading this, you probably need to know a little | ||||
35 | about the difference between how perl parses Perl "code" and how PPI | ||||
36 | parses Perl "documents". | ||||
37 | |||||
38 | "perl" itself (the interpreter) uses a heavily modified lex specification | ||||
39 | to specify its parsing logic, maintains several types of state as it | ||||
40 | goes, and incrementally tokenizes, lexes AND EXECUTES at the same time. | ||||
41 | |||||
42 | In fact, it is provably impossible to use perl's parsing method without | ||||
43 | simultaneously executing code. A formal mathematical proof has been | ||||
44 | published demonstrating the method. | ||||
45 | |||||
46 | This is where the truism "Only perl can parse Perl" comes from. | ||||
47 | |||||
48 | PPI uses a completely different approach by abandoning the (impossible) | ||||
49 | ability to parse Perl the same way that the interpreter does, and instead | ||||
50 | parsing the source as a document, using a document structure independently | ||||
51 | derived from the Perl documentation and approximating the perl interpreter | ||||
52 | interpretation as closely as possible. | ||||
53 | |||||
54 | It was touch and go for a long time whether we could get it close enough, | ||||
55 | but in the end it turned out that it could be done. | ||||
56 | |||||
57 | In this approach, the tokenizer C<PPI::Tokenizer> is implemented separately | ||||
58 | from the lexer L<PPI::Lexer>. | ||||
59 | |||||
60 | The job of C<PPI::Tokenizer> is to take pure source as a string and break it | ||||
61 | up into a stream/set of tokens; it contains most of the "black magic" used | ||||
62 | in PPI. By comparison, the lexer implements a relatively straightforward | ||||
63 | tree structure, and has an implementation that is uncomplicated (compared | ||||
64 | to the insanity in the tokenizer at least). | ||||
65 | |||||
66 | The Tokenizer uses an immense amount of heuristics, guessing and cruft, | ||||
67 | supported by a very B<VERY> flexible internal API, but fortunately it was | ||||
68 | possible to largely encapsulate the black magic, so there is not a lot that | ||||
69 | gets exposed to people using the C<PPI::Tokenizer> itself. | ||||
70 | |||||
71 | =head1 METHODS | ||||
72 | |||||
73 | Despite the incredible complexity, the Tokenizer itself only exposes a | ||||
74 | relatively small number of methods, with most of the complexity implemented | ||||
75 | in private methods. | ||||
76 | |||||
77 | =cut | ||||
78 | |||||
79 | # Make sure everything we need is loaded so | ||||
80 | # we don't have to go and load all of PPI. | ||||
81 | 2 | 21µs | 2 | 34µs | use strict; # spent 23µs (12+11) within PPI::Tokenizer::BEGIN@81 which was called:
# once (12µs+11µs) by PPI::BEGIN@28 at line 81 # spent 23µs making 1 call to PPI::Tokenizer::BEGIN@81
# spent 11µs making 1 call to strict::import |
82 | 2 | 19µs | 2 | 63µs | # spent 35µs (7+28) within PPI::Tokenizer::BEGIN@82 which was called:
# once (7µs+28µs) by PPI::BEGIN@28 at line 82 # spent 35µs making 1 call to PPI::Tokenizer::BEGIN@82
# spent 28µs making 1 call to Exporter::import |
83 | 2 | 18µs | 1 | 3µs | # spent 3µs within PPI::Tokenizer::BEGIN@83 which was called:
# once (3µs+0s) by PPI::BEGIN@28 at line 83 # spent 3µs making 1 call to PPI::Tokenizer::BEGIN@83 |
84 | 2 | 15µs | 1 | 3µs | # spent 3µs within PPI::Tokenizer::BEGIN@84 which was called:
# once (3µs+0s) by PPI::BEGIN@28 at line 84 # spent 3µs making 1 call to PPI::Tokenizer::BEGIN@84 |
85 | 2 | 14µs | 1 | 3µs | # spent 3µs within PPI::Tokenizer::BEGIN@85 which was called:
# once (3µs+0s) by PPI::BEGIN@28 at line 85 # spent 3µs making 1 call to PPI::Tokenizer::BEGIN@85 |
86 | 2 | 20µs | 1 | 3µs | # spent 3µs within PPI::Tokenizer::BEGIN@86 which was called:
# once (3µs+0s) by PPI::BEGIN@28 at line 86 # spent 3µs making 1 call to PPI::Tokenizer::BEGIN@86 |
87 | 2 | 15µs | 1 | 3µs | # spent 3µs within PPI::Tokenizer::BEGIN@87 which was called:
# once (3µs+0s) by PPI::BEGIN@28 at line 87 # spent 3µs making 1 call to PPI::Tokenizer::BEGIN@87 |
88 | 2 | 79µs | 1 | 224µs | # spent 224µs (135+89) within PPI::Tokenizer::BEGIN@88 which was called:
# once (135µs+89µs) by PPI::BEGIN@28 at line 88 # spent 224µs making 1 call to PPI::Tokenizer::BEGIN@88 |
89 | |||||
90 | 2 | 22µs | 2 | 39µs | use vars qw{$VERSION}; # spent 23µs (6+16) within PPI::Tokenizer::BEGIN@90 which was called:
# once (6µs+16µs) by PPI::BEGIN@28 at line 90 # spent 23µs making 1 call to PPI::Tokenizer::BEGIN@90
# spent 16µs making 1 call to vars::import |
91 | BEGIN { # spent 3µs within PPI::Tokenizer::BEGIN@91 which was called:
# once (3µs+0s) by PPI::BEGIN@28 at line 93 | ||||
92 | 1 | 4µs | $VERSION = '1.215'; | ||
93 | 1 | 1.57ms | 1 | 3µs | } # spent 3µs making 1 call to PPI::Tokenizer::BEGIN@91 |
94 | |||||
- - | |||||
99 | ##################################################################### | ||||
100 | # Creation and Initialization | ||||
101 | |||||
102 | =pod | ||||
103 | |||||
104 | =head2 new $file | \@lines | \$source | ||||
105 | |||||
106 | The main C<new> constructor creates a new Tokenizer object. These | ||||
107 | objects have no configuration parameters, and can only be used once, | ||||
108 | to tokenize a single perl source file. | ||||
109 | |||||
110 | It takes as argument either the name of a file containing the source, | ||||
111 | a reference to a SCALAR containing source code, or a reference to an | ||||
112 | ARRAY containing newline-terminated lines of source code. | ||||
113 | |||||
114 | Returns a new C<PPI::Tokenizer> object on success, or throws a | ||||
115 | L<PPI::Exception> exception on error. | ||||
116 | |||||
117 | =cut | ||||
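A minimal sketch of the constructor forms described above (the filename is hypothetical, and errors are trapped with eval since C<new> throws L<PPI::Exception> objects):

```perl
use PPI::Tokenizer;

# From source already in memory
my $source    = 'my $x = 1;';
my $Tokenizer = eval { PPI::Tokenizer->new( \$source ) };

# From a file on disk (hypothetical path), or from an array of lines
my $FromFile  = eval { PPI::Tokenizer->new( 'script.pl' ) };
my $FromLines = eval { PPI::Tokenizer->new( [ "my \$y = 2;\n" ] ) };

warn "Construction failed: $@" if $@;
```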
118 | |||||
119 | sub new { # spent 503ms (118ms+384ms) within PPI::Tokenizer::new which was called 144 times, avg 3.49ms/call:
# 144 times (118ms+384ms) by PPI::Lexer::lex_file at line 159 of PPI/Lexer.pm, avg 3.49ms/call | ||||
120 | 144 | 119µs | my $class = ref($_[0]) || $_[0]; | ||
121 | |||||
122 | # Create the empty tokenizer struct | ||||
123 | 144 | 1.61ms | my $self = bless { | ||
124 | # Source code | ||||
125 | source => undef, | ||||
126 | source_bytes => undef, | ||||
127 | |||||
128 | # Line buffer | ||||
129 | line => undef, | ||||
130 | line_length => undef, | ||||
131 | line_cursor => undef, | ||||
132 | line_count => 0, | ||||
133 | |||||
134 | # Parse state | ||||
135 | token => undef, | ||||
136 | class => 'PPI::Token::BOM', | ||||
137 | zone => 'PPI::Token::Whitespace', | ||||
138 | |||||
139 | # Output token buffer | ||||
140 | tokens => [], | ||||
141 | token_cursor => 0, | ||||
142 | token_eof => 0, | ||||
143 | |||||
144 | # Perl 6 blocks | ||||
145 | perl6 => [], | ||||
146 | }, $class; | ||||
147 | |||||
148 | 144 | 208µs | if ( ! defined $_[1] ) { | ||
149 | # We weren't given anything | ||||
150 | PPI::Exception->throw("No source provided to Tokenizer"); | ||||
151 | |||||
152 | } elsif ( ! ref $_[1] ) { | ||||
153 | 144 | 566µs | 144 | 187ms | my $source = PPI::Util::_slurp($_[1]); # spent 187ms making 144 calls to PPI::Util::_slurp, avg 1.30ms/call |
154 | 144 | 1.20ms | if ( ref $source ) { | ||
155 | # Content returned by reference | ||||
156 | $self->{source} = $$source; | ||||
157 | } else { | ||||
158 | # Errors returned as a string | ||||
159 | return( $source ); | ||||
160 | } | ||||
161 | |||||
162 | } elsif ( _SCALAR0($_[1]) ) { | ||||
163 | $self->{source} = ${$_[1]}; | ||||
164 | |||||
165 | } elsif ( _ARRAY0($_[1]) ) { | ||||
166 | $self->{source} = join '', map { "$_\n" } @{$_[1]}; | ||||
167 | |||||
168 | } else { | ||||
169 | # We don't support whatever this is | ||||
170 | PPI::Exception->throw(ref($_[1]) . " is not supported as a source provider"); | ||||
171 | } | ||||
172 | |||||
173 | # We can't handle a null string | ||||
174 | 144 | 289µs | $self->{source_bytes} = length $self->{source}; | ||
175 | 144 | 3.62ms | if ( $self->{source_bytes} > 1048576 ) { | ||
176 | # Dammit! It's ALWAYS the "Perl" modules larger than a | ||||
177 | # meg that seems to blow up the Tokenizer/Lexer. | ||||
178 | # Nobody actually writes real programs larger than a meg | ||||
179 | # Perl::Tidy (the largest) is only 800k. | ||||
180 | # It is always these idiots with massive Data::Dumper | ||||
181 | # structs or huge RecDescent parser. | ||||
182 | PPI::Exception::ParserRejection->throw("File is too large"); | ||||
183 | |||||
184 | } elsif ( $self->{source_bytes} ) { | ||||
185 | # Split on local newlines | ||||
186 | 144 | 163ms | 144 | 162ms | $self->{source} =~ s/(?:\015{1,2}\012|\015|\012)/\n/g; # spent 162ms making 144 calls to PPI::Tokenizer::CORE:subst, avg 1.12ms/call |
187 | 144 | 107ms | $self->{source} = [ split /(?<=\n)/, $self->{source} ]; | ||
188 | |||||
189 | } else { | ||||
190 | $self->{source} = [ ]; | ||||
191 | } | ||||
192 | |||||
193 | ### EVIL | ||||
194 | # I'm explaining this earlier than I should so you can understand | ||||
195 | # why I'm about to do something that looks very strange. There's | ||||
196 | # a problem with the Tokenizer, in that tokens tend to change | ||||
197 | # classes as each letter is added, but they don't get allocated | ||||
198 | # their definite final class until the "end" of the token, the | ||||
199 | # detection of which occurs in about a hundred different places, | ||||
200 | # all through various crufty code (that triples the speed). | ||||
201 | # | ||||
202 | # However, in general, this does not apply to tokens in which a | ||||
203 | # whitespace character is valid, such as comments, whitespace and | ||||
204 | # big strings. | ||||
205 | # | ||||
206 | # So what we do is add a space to the end of the source. This | ||||
207 | # triggers normal "end of token" functionality for all cases. Then, | ||||
208 | # once the tokenizer hits end of file, it examines the last token to | ||||
209 | # manually either remove the ' ' token, or chop it off the end of | ||||
210 | # a longer one in which the space would be valid. | ||||
211 | 15678 | 34.2ms | 15678 | 39.0ms | if ( List::MoreUtils::any { /^__(?:DATA|END)__\s*$/ } @{$self->{source}} ) { # spent 34.9ms making 144 calls to List::MoreUtils::any, avg 242µs/call
# spent 4.15ms making 15534 calls to PPI::Tokenizer::CORE:match, avg 267ns/call |
212 | $self->{source_eof_chop} = ''; | ||||
213 | } elsif ( ! defined $self->{source}->[0] ) { | ||||
214 | $self->{source_eof_chop} = ''; | ||||
215 | } elsif ( $self->{source}->[-1] =~ /\s$/ ) { | ||||
216 | $self->{source_eof_chop} = ''; | ||||
217 | } else { | ||||
218 | $self->{source_eof_chop} = 1; | ||||
219 | $self->{source}->[-1] .= ' '; | ||||
220 | } | ||||
221 | |||||
222 | 144 | 765µs | $self; | ||
223 | } | ||||
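The substitution at line 186 of the listing above folds DOS (\015\012), old Mac (\015) and Unix (\012) line endings into a single logical "\n" before the keep-the-newline split at line 187; a standalone sketch of the same two steps:

```perl
my $source = "unix line\012dos line\015\012mac line\015last line";

# Normalize all three newline conventions to "\n" (same regex as line 186)
$source =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;

# Split while keeping the newline on the end of each line (as at line 187)
my @lines = split /(?<=\n)/, $source;
# @lines now holds four entries, the first three ending in "\n"
```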
224 | |||||
- - | |||||
229 | ##################################################################### | ||||
230 | # Main Public Methods | ||||
231 | |||||
232 | =pod | ||||
233 | |||||
234 | =head2 get_token | ||||
235 | |||||
236 | When using the PPI::Tokenizer object as an iterator, the C<get_token> | ||||
237 | method is the primary method that is used. It increments the cursor | ||||
238 | and returns the next Token in the output array. | ||||
239 | |||||
240 | The actual parsing of the file is done only as-needed, and a line at | ||||
241 | a time. When C<get_token> hits the end of the token array, it will | ||||
242 | cause the parser to pull in the next line and parse it, continuing | ||||
243 | as needed until there are more tokens on the output array that | ||||
244 | get_token can then return. | ||||
245 | |||||
246 | This means that a number of Tokenizer objects can be created, and | ||||
247 | won't consume significant CPU until you actually begin to pull tokens | ||||
248 | from them. | ||||
249 | |||||
250 | Returns a L<PPI::Token> object on success, C<0> if the Tokenizer has | ||||
251 | reached the end of the file, or C<undef> on error. | ||||
252 | |||||
253 | =cut | ||||
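Based on the return values documented above, an iteration sketch that distinguishes a clean EOF (C<0>) from an error (C<undef>):

```perl
use PPI::Tokenizer;

my $source    = 'print "hello";';
my $Tokenizer = PPI::Tokenizer->new( \$source );

my $Token;
while ( $Token = $Tokenizer->get_token ) {
    printf "%-28s %s\n", ref($Token), "'$Token'";
}
die "Tokenizer error" unless defined $Token;   # undef => error, 0 => clean EOF
```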
254 | |||||
255 | sub get_token { # spent 6.79s (602ms+6.19s) within PPI::Tokenizer::get_token which was called 94513 times, avg 72µs/call:
# 94513 times (602ms+6.19s) by PPI::Lexer::_get_token at line 1413 of PPI/Lexer.pm, avg 72µs/call | ||||
256 | 94513 | 17.5ms | my $self = shift; | ||
257 | |||||
258 | # Shortcut for EOF | ||||
259 | 94513 | 15.6ms | if ( $self->{token_eof} | ||
260 | and $self->{token_cursor} > scalar @{$self->{tokens}} | ||||
261 | ) { | ||||
262 | return 0; | ||||
263 | } | ||||
264 | |||||
265 | # Return the next token if we can | ||||
266 | 94513 | 298ms | 82384 | 48.3ms | if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) { # spent 48.3ms making 82384 calls to PPI::Util::TRUE, avg 587ns/call |
267 | 82384 | 11.9ms | $self->{token_cursor}++; | ||
268 | 82384 | 244ms | return $token; | ||
269 | } | ||||
270 | |||||
271 | 12129 | 268µs | my $line_rv; | ||
272 | |||||
273 | # Catch exceptions and return undef, so that we | ||||
274 | # can start to convert code to exception-based code. | ||||
275 | 12129 | 4.52ms | my $rv = eval { | ||
276 | # No token, we need to get some more | ||||
277 | 12129 | 14.1ms | 12129 | 4.32s | while ( $line_rv = $self->_process_next_line ) { # spent 4.32s making 12129 calls to PPI::Tokenizer::_process_next_line, avg 356µs/call |
278 | # If there is something in the buffer, return it | ||||
279 | # The defined() prevents a ton of calls to PPI::Util::TRUE | ||||
280 | 26616 | 31.1ms | 14775 | 1.81s | if ( defined( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) ) { # spent 1.81s making 14775 calls to PPI::Tokenizer::_process_next_line, avg 123µs/call |
281 | 11841 | 1.48ms | $self->{token_cursor}++; | ||
282 | 11841 | 5.62ms | return $token; | ||
283 | } | ||||
284 | } | ||||
285 | 288 | 56µs | return undef; | ||
286 | }; | ||||
287 | 12129 | 80.8ms | 11841 | 8.35ms | if ( $@ ) { # spent 8.35ms making 11841 calls to PPI::Util::TRUE, avg 705ns/call |
288 | if ( _INSTANCE($@, 'PPI::Exception') ) { | ||||
289 | $@->throw; | ||||
290 | } else { | ||||
291 | my $errstr = $@; | ||||
292 | $errstr =~ s/^(.*) at line .+$/$1/; | ||||
293 | PPI::Exception->throw( $errstr ); | ||||
294 | } | ||||
295 | } elsif ( $rv ) { | ||||
296 | return $rv; | ||||
297 | } | ||||
298 | |||||
299 | 288 | 63µs | if ( defined $line_rv ) { | ||
300 | # End of file, but we can still return things from the buffer | ||||
301 | 288 | 181µs | if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) { | ||
302 | $self->{token_cursor}++; | ||||
303 | return $token; | ||||
304 | } | ||||
305 | |||||
306 | # Set our token end of file flag | ||||
307 | 288 | 82µs | $self->{token_eof} = 1; | ||
308 | 288 | 489µs | return 0; | ||
309 | } | ||||
310 | |||||
311 | # Error, pass it up to our caller | ||||
312 | undef; | ||||
313 | } | ||||
314 | |||||
315 | =pod | ||||
316 | |||||
317 | =head2 all_tokens | ||||
318 | |||||
319 | When not being used as an iterator, the C<all_tokens> method tells | ||||
320 | the Tokenizer to parse the entire file and return all of the tokens | ||||
321 | in a single ARRAY reference. | ||||
322 | |||||
323 | It should be noted that C<all_tokens> does B<NOT> interfere with the | ||||
324 | use of the Tokenizer object as an iterator (does not modify the token | ||||
325 | cursor) and use of the two different mechanisms can be mixed safely. | ||||
326 | |||||
327 | Returns a reference to an ARRAY of L<PPI::Token> objects on success | ||||
328 | or throws an exception on error. | ||||
329 | |||||
330 | =cut | ||||
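Since C<all_tokens> does not move the token cursor, it can be freely interleaved with the iterator interface; a small sketch:

```perl
use PPI::Tokenizer;

my $source    = 'my $x = 1;';
my $Tokenizer = PPI::Tokenizer->new( \$source );

my $first  = $Tokenizer->get_token;    # advances the cursor by one
my $tokens = $Tokenizer->all_tokens;   # parses to EOF, cursor untouched
my $second = $Tokenizer->get_token;    # resumes right after $first

printf "%d tokens in total\n", scalar @$tokens;
```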
331 | |||||
332 | sub all_tokens { | ||||
333 | my $self = shift; | ||||
334 | |||||
335 | # Catch exceptions and return undef, so that we | ||||
336 | # can start to convert code to exception-based code. | ||||
337 | eval { | ||||
338 | # Process lines until we get EOF | ||||
339 | unless ( $self->{token_eof} ) { | ||||
340 | my $rv; | ||||
341 | while ( $rv = $self->_process_next_line ) {} | ||||
342 | unless ( defined $rv ) { | ||||
343 | PPI::Exception->throw("Error while processing source"); | ||||
344 | } | ||||
345 | |||||
346 | # Clean up the end of the tokenizer | ||||
347 | $self->_clean_eof; | ||||
348 | } | ||||
349 | }; | ||||
350 | if ( $@ ) { | ||||
351 | my $errstr = $@; | ||||
352 | $errstr =~ s/^(.*) at line .+$/$1/; | ||||
353 | PPI::Exception->throw( $errstr ); | ||||
354 | } | ||||
355 | |||||
356 | # End of file, return a copy of the token array. | ||||
357 | return [ @{$self->{tokens}} ]; | ||||
358 | } | ||||
359 | |||||
360 | =pod | ||||
361 | |||||
362 | =head2 increment_cursor | ||||
363 | |||||
364 | Although exposed as a public method, C<increment_cursor> is implemented | ||||
365 | for expert use only, when writing lexers or other components that work | ||||
366 | directly on token streams. | ||||
367 | |||||
368 | It manually increments the token cursor forward through the file, in effect | ||||
369 | "skipping" the next token. | ||||
370 | |||||
371 | Returns true if the cursor is incremented, C<0> if already at the end of | ||||
372 | the file, or C<undef> on error. | ||||
373 | |||||
374 | =cut | ||||
375 | |||||
376 | sub increment_cursor { | ||||
377 | # Do this via the get_token method, which makes sure there | ||||
378 | # is actually a token there to move to. | ||||
379 | $_[0]->get_token and 1; | ||||
380 | } | ||||
381 | |||||
382 | =pod | ||||
383 | |||||
384 | =head2 decrement_cursor | ||||
385 | |||||
386 | Although exposed as a public method, C<decrement_cursor> is implemented | ||||
387 | for expert use only, when writing lexers or other components that work | ||||
388 | directly on token streams. | ||||
389 | |||||
390 | It manually decrements the token cursor backwards through the file, in | ||||
391 | effect "rolling back" the token stream. And indeed that is what it is | ||||
392 | primarily intended for, when the component that is consuming the token | ||||
393 | stream needs to implement some sort of "roll back" feature in its use | ||||
394 | of the token stream. | ||||
395 | |||||
396 | Returns true if the cursor is decremented, C<0> if already at the | ||||
397 | beginning of the file, or C<undef> on error. | ||||
398 | |||||
399 | =cut | ||||
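A rollback sketch in the spirit of the lexer usage described above; C<wanted()> is a hypothetical predicate standing in for whatever test the consuming code applies:

```perl
use PPI::Tokenizer;

my $source    = 'my $x = 1;';
my $Tokenizer = PPI::Tokenizer->new( \$source );

# Peek at the next token, then push it back onto the stream
my $Token = $Tokenizer->get_token;
if ( $Token and not wanted($Token) ) {     # wanted() is hypothetical
    $Tokenizer->decrement_cursor;          # the same token is returned next time
}

sub wanted { ref $_[0] eq 'PPI::Token::Word' }
```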
400 | |||||
401 | sub decrement_cursor { | ||||
402 | my $self = shift; | ||||
403 | |||||
404 | # Check for the beginning of the file | ||||
405 | return 0 unless $self->{token_cursor}; | ||||
406 | |||||
407 | # Decrement the token cursor | ||||
408 | $self->{token_eof} = 0; | ||||
409 | --$self->{token_cursor}; | ||||
410 | } | ||||
411 | |||||
- - | |||||
416 | ##################################################################### | ||||
417 | # Working With Source | ||||
418 | |||||
419 | # Fetches the next line from the input line buffer | ||||
420 | # Returns undef at EOF. | ||||
421 | sub _get_line { # spent 60.1ms within PPI::Tokenizer::_get_line which was called 27287 times, avg 2µs/call:
# 27281 times (60.1ms+0s) by PPI::Tokenizer::_fill_line at line 443, avg 2µs/call
# 5 times (10µs+0s) by PPI::Token::HereDoc::__TOKENIZER__on_char at line 222 of PPI/Token/HereDoc.pm, avg 2µs/call
# once (3µs+0s) by PPI::Token::HereDoc::__TOKENIZER__on_char at line 211 of PPI/Token/HereDoc.pm | ||||
422 | 27287 | 3.41ms | my $self = shift; | ||
423 | 27287 | 6.10ms | return undef unless $self->{source}; # EOF hit previously | ||
424 | |||||
425 | # Pull off the next line | ||||
426 | 27143 | 15.3ms | my $line = shift @{$self->{source}}; | ||
427 | |||||
428 | # Flag EOF if we hit it | ||||
429 | 27143 | 3.09ms | $self->{source} = undef unless defined $line; | ||
430 | |||||
431 | # Return the line (or EOF flag) | ||||
432 | 27143 | 113ms | return $line; # string or undef | ||
433 | } | ||||
434 | |||||
435 | # Fetches the next line, ready to process | ||||
436 | # Returns 1 on success | ||||
437 | # Returns 0 on EOF | ||||
438 | sub _fill_line { # spent 246ms (186ms+60.1ms) within PPI::Tokenizer::_fill_line which was called 27281 times, avg 9µs/call:
# 26904 times (184ms+59.2ms) by PPI::Tokenizer::_process_next_line at line 490, avg 9µs/call
# 372 times (1.89ms+884µs) by PPI::Token::_QuoteEngine::_scan_for_brace_character at line 183 of PPI/Token/_QuoteEngine.pm, avg 7µs/call
# 5 times (38µs+16µs) by PPI::Token::_QuoteEngine::_scan_for_unescaped_character at line 137 of PPI/Token/_QuoteEngine.pm, avg 11µs/call | ||||
439 | 27281 | 3.17ms | my $self = shift; | ||
440 | 27281 | 3.02ms | my $inscan = shift; | ||
441 | |||||
442 | # Get the next line | ||||
443 | 27281 | 27.1ms | 27281 | 60.1ms | my $line = $self->_get_line; # spent 60.1ms making 27281 calls to PPI::Tokenizer::_get_line, avg 2µs/call |
444 | 27281 | 2.96ms | unless ( defined $line ) { | ||
445 | # End of file | ||||
446 | 288 | 32µs | unless ( $inscan ) { | ||
447 | 288 | 199µs | delete $self->{line}; | ||
448 | 288 | 52µs | delete $self->{line_cursor}; | ||
449 | 288 | 46µs | delete $self->{line_length}; | ||
450 | 288 | 529µs | return 0; | ||
451 | } | ||||
452 | |||||
453 | # In the scan version, just set the cursor to the end | ||||
454 | # of the line, and the rest should just cascade out. | ||||
455 | $self->{line_cursor} = $self->{line_length}; | ||||
456 | return 0; | ||||
457 | } | ||||
458 | |||||
459 | # Populate the appropriate variables | ||||
460 | 26993 | 6.62ms | $self->{line} = $line; | ||
461 | 26993 | 4.61ms | $self->{line_cursor} = -1; | ||
462 | 26993 | 6.80ms | $self->{line_length} = length $line; | ||
463 | 26993 | 3.62ms | $self->{line_count}++; | ||
464 | |||||
465 | 26993 | 68.3ms | 1; | ||
466 | } | ||||
467 | |||||
468 | # Get the current character | ||||
469 | sub _char { | ||||
470 | my $self = shift; | ||||
471 | substr( $self->{line}, $self->{line_cursor}, 1 ); | ||||
472 | } | ||||
473 | |||||
- - | |||||
478 | #################################################################### | ||||
479 | # Per line processing methods | ||||
480 | |||||
481 | # Processes the next line | ||||
482 | # Returns 1 on success completion | ||||
483 | # Returns 0 if EOF | ||||
484 | # Returns undef on error | ||||
485 | sub _process_next_line { | ||||
486 | 26904 | 3.78ms | my $self = shift; | ||
487 | |||||
488 | # Fill the line buffer | ||||
489 | 26904 | 903µs | my $rv; | ||
490 | 26904 | 23.3ms | 26904 | 243ms | unless ( $rv = $self->_fill_line ) { # spent 243ms making 26904 calls to PPI::Tokenizer::_fill_line, avg 9µs/call |
491 | 288 | 38µs | return undef unless defined $rv; | ||
492 | |||||
493 | # End of file, finalize last token | ||||
494 | 288 | 275µs | 288 | 397µs | $self->_finalize_token; # spent 397µs making 288 calls to PPI::Tokenizer::_finalize_token, avg 1µs/call |
495 | 288 | 450µs | return 0; | ||
496 | } | ||||
497 | |||||
498 | # Run the __TOKENIZER__on_line_start | ||||
499 | 26616 | 39.3ms | 26616 | 354ms | $rv = $self->{class}->__TOKENIZER__on_line_start( $self ); # spent 269ms making 14943 calls to PPI::Token::Whitespace::__TOKENIZER__on_line_start, avg 18µs/call
# spent 65.6ms making 9695 calls to PPI::Token::Pod::__TOKENIZER__on_line_start, avg 7µs/call
# spent 14.1ms making 1834 calls to PPI::Token::End::__TOKENIZER__on_line_start, avg 8µs/call
# spent 4.66ms making 144 calls to PPI::Token::BOM::__TOKENIZER__on_line_start, avg 32µs/call |
500 | 26616 | 3.26ms | unless ( $rv ) { | ||
501 | # If there are no more source lines, then clean up | ||||
502 | 16923 | 9.78ms | 144 | 1.76ms | if ( ref $self->{source} eq 'ARRAY' and ! @{$self->{source}} ) { # spent 1.76ms making 144 calls to PPI::Tokenizer::_clean_eof, avg 12µs/call |
503 | $self->_clean_eof; | ||||
504 | } | ||||
505 | |||||
506 | # Defined but false means next line | ||||
507 | 16923 | 66.4ms | return 1 if defined $rv; | ||
508 | PPI::Exception->throw("Error at line $self->{line_count}"); | ||||
509 | } | ||||
510 | |||||
511 | # If we can't deal with the entire line, process char by char | ||||
512 | 9693 | 203ms | 149609 | 4.58s | while ( $rv = $self->_process_next_char ) {} # spent 4.58s making 149609 calls to PPI::Tokenizer::_process_next_char, avg 31µs/call |
513 | 9693 | 1.15ms | unless ( defined $rv ) { | ||
514 | PPI::Exception->throw("Error at line $self->{line_count}, character $self->{line_cursor}"); | ||||
515 | } | ||||
516 | |||||
517 | # Trigger any action that needs to happen at the end of a line | ||||
518 | 9693 | 13.4ms | 9693 | 94.6ms | $self->{class}->__TOKENIZER__on_line_end( $self ); # spent 94.4ms making 9549 calls to PPI::Token::Whitespace::__TOKENIZER__on_line_end, avg 10µs/call
# spent 224µs making 144 calls to PPI::Token::__TOKENIZER__on_line_end, avg 2µs/call |
519 | |||||
520 | # If there are no more source lines, then clean up | ||||
521 | 9693 | 7.24ms | unless ( ref($self->{source}) eq 'ARRAY' and @{$self->{source}} ) { | ||
522 | return $self->_clean_eof; | ||||
523 | } | ||||
524 | |||||
525 | 9693 | 37.6ms | return 1; | ||
526 | } | ||||
527 | |||||
- - | |||||
532 | ##################################################################### | ||||
533 | # Per-character processing methods | ||||
534 | |||||
535 | # Process on a per-character basis. | ||||
536 | # Note that due to the high number of times this gets | ||||
537 | # called, it has been fairly heavily in-lined, so the code | ||||
538 | # might look a bit ugly and duplicated. | ||||
539 | sub _process_next_char { # spent 4.58s (1.30s+3.28s) within PPI::Tokenizer::_process_next_char which was called 149609 times, avg 31µs/call:
# 149609 times (1.30s+3.28s) by PPI::Tokenizer::_process_next_line at line 512, avg 31µs/call | ||||
540 | 149609 | 24.1ms | my $self = shift; | ||
541 | |||||
542 | ### FIXME - This checks for a screwed up condition that triggers | ||||
543 | ### several warnings, amongst other things. | ||||
544 | 149609 | 48.5ms | if ( ! defined $self->{line_cursor} or ! defined $self->{line_length} ) { | ||
545 | # $DB::single = 1; | ||||
546 | return undef; | ||||
547 | } | ||||
548 | |||||
549 | # Increment the counter and check for end of line | ||||
550 | 149609 | 57.7ms | return 0 if ++$self->{line_cursor} >= $self->{line_length}; | ||
551 | |||||
552 | # Pass control to the token class | ||||
553 | 139916 | 1.69ms | my $result; | ||
554 | 139916 | 221ms | 139916 | 2.94s | unless ( $result = $self->{class}->__TOKENIZER__on_char( $self ) ) { # spent 1.87s making 106218 calls to PPI::Token::Whitespace::__TOKENIZER__on_char, avg 18µs/call
# spent 362ms making 7754 calls to PPI::Token::Symbol::__TOKENIZER__on_char, avg 47µs/call
# spent 299ms making 10634 calls to PPI::Token::Operator::__TOKENIZER__on_char, avg 28µs/call
# spent 201ms making 8180 calls to PPI::Token::Unknown::__TOKENIZER__on_char, avg 25µs/call
# spent 90.9ms making 1688 calls to PPI::Token::_QuoteEngine::__TOKENIZER__on_char, avg 54µs/call
# spent 69.1ms making 3157 calls to PPI::Token::Structure::__TOKENIZER__on_char, avg 22µs/call
# spent 38.4ms making 1170 calls to PPI::Token::Number::__TOKENIZER__on_char, avg 33µs/call
# spent 13.3ms making 1018 calls to PPI::Token::Number::Float::__TOKENIZER__on_char, avg 13µs/call
# spent 1.61ms making 34 calls to PPI::Token::Magic::__TOKENIZER__on_char, avg 47µs/call
# spent 654µs making 61 calls to PPI::Token::Cast::__TOKENIZER__on_char, avg 11µs/call
# spent 69µs making 2 calls to PPI::Token::DashedWord::__TOKENIZER__on_char, avg 34µs/call |
555 | # undef is error. 0 is "Did stuff ourself, you don't have to do anything" | ||||
556 | return defined $result ? 1 : undef; | ||||
557 | } | ||||
558 | |||||
559 | # We will need the value of the current character | ||||
560 | 123420 | 54.3ms | my $char = substr( $self->{line}, $self->{line_cursor}, 1 ); | ||
561 | 123420 | 15.8ms | if ( $result eq '1' ) { | ||
562 | # If __TOKENIZER__on_char returns 1, it is signaling that it thinks that | ||||
563 | # the character is part of it. | ||||
564 | |||||
565 | # Add the character | ||||
566 | 12474 | 6.66ms | if ( defined $self->{token} ) { | ||
567 | $self->{token}->{content} .= $char; | ||||
568 | } else { | ||||
569 | defined($self->{token} = $self->{class}->new($char)) or return undef; | ||||
570 | } | ||||
571 | |||||
572 | 12474 | 37.1ms | return 1; | ||
573 | } | ||||
574 | |||||
575 | # We have been provided with the name of a class | ||||
576 | 110946 | 85.8ms | 21222 | 254ms | if ( $self->{class} ne "PPI::Token::$result" ) { # spent 254ms making 21222 calls to PPI::Tokenizer::_new_token, avg 12µs/call |
577 | # New class | ||||
578 | $self->_new_token( $result, $char ); | ||||
579 | } elsif ( defined $self->{token} ) { | ||||
580 | # Same class as current | ||||
581 | $self->{token}->{content} .= $char; | ||||
582 | } else { | ||||
583 | # Same class, but no current | ||||
584 | 37692 | 61.1ms | 37692 | 85.7ms | defined($self->{token} = $self->{class}->new($char)) or return undef; # spent 85.7ms making 37692 calls to PPI::Token::new, avg 2µs/call |
585 | } | ||||
586 | |||||
587 | 110946 | 352ms | 1; | ||
588 | } | ||||
589 | |||||
- - | |||||
594 | ##################################################################### | ||||
595 | # Altering Tokens in Tokenizer | ||||
596 | |||||
597 | # Finish the end of a token. | ||||
598 | # Returns the resulting parse class as a convenience. | ||||
599 | # spent 218ms within PPI::Tokenizer::_finalize_token which was called 94513 times, avg 2µs/call:
# 31193 times (67.2ms+0s) by PPI::Tokenizer::_new_token at line 620, avg 2µs/call
# 14291 times (35.5ms+0s) by PPI::Token::Word::__TOKENIZER__commit at line 539 of PPI/Token/Word.pm, avg 2µs/call
# 13365 times (29.4ms+0s) by PPI::Token::Structure::__TOKENIZER__commit at line 76 of PPI/Token/Structure.pm, avg 2µs/call
# 9549 times (20.9ms+0s) by PPI::Token::Whitespace::__TOKENIZER__on_line_end at line 417 of PPI/Token/Whitespace.pm, avg 2µs/call
# 7437 times (16.8ms+0s) by PPI::Token::Operator::__TOKENIZER__on_char at line 112 of PPI/Token/Operator.pm, avg 2µs/call
# 7245 times (21.2ms+0s) by PPI::Token::Symbol::__TOKENIZER__on_char at line 216 of PPI/Token/Symbol.pm, avg 3µs/call
# 3157 times (6.88ms+0s) by PPI::Token::Structure::__TOKENIZER__on_char at line 70 of PPI/Token/Structure.pm, avg 2µs/call
# 2743 times (7.54ms+0s) by PPI::Token::_QuoteEngine::__TOKENIZER__on_char at line 58 of PPI/Token/_QuoteEngine.pm, avg 3µs/call
# 1668 times (3.76ms+0s) by PPI::Token::Whitespace::__TOKENIZER__on_line_start at line 165 of PPI/Token/Whitespace.pm, avg 2µs/call
# 1252 times (2.71ms+0s) by PPI::Token::Whitespace::__TOKENIZER__on_char at line 213 of PPI/Token/Whitespace.pm, avg 2µs/call
# 832 times (2.14ms+0s) by PPI::Token::Number::__TOKENIZER__on_char at line 125 of PPI/Token/Number.pm, avg 3µs/call
# 509 times (1.33ms+0s) by PPI::Token::Symbol::__TOKENIZER__on_char at line 174 of PPI/Token/Symbol.pm, avg 3µs/call
# 288 times (397µs+0s) by PPI::Tokenizer::_process_next_line at line 494, avg 1µs/call
# 148 times (513µs+0s) by PPI::Token::Number::Float::__TOKENIZER__on_char at line 108 of PPI/Token/Number/Float.pm, avg 3µs/call
# 146 times (415µs+0s) by PPI::Token::Pod::__TOKENIZER__on_line_start at line 148 of PPI/Token/Pod.pm, avg 3µs/call
# 144 times (335µs+0s) by PPI::Tokenizer::_clean_eof at line 635, avg 2µs/call
# 144 times (308µs+0s) by PPI::Token::Word::__TOKENIZER__commit at line 458 of PPI/Token/Word.pm, avg 2µs/call
# 144 times (299µs+0s) by PPI::Token::Word::__TOKENIZER__commit at line 441 of PPI/Token/Word.pm, avg 2µs/call
# 85 times (215µs+0s) by PPI::Token::Unknown::__TOKENIZER__on_char at line 179 of PPI/Token/Unknown.pm, avg 3µs/call
# 61 times (125µs+0s) by PPI::Token::Cast::__TOKENIZER__on_char at line 51 of PPI/Token/Cast.pm, avg 2µs/call
# 51 times (105µs+0s) by PPI::Token::Whitespace::__TOKENIZER__on_char at line 261 of PPI/Token/Whitespace.pm, avg 2µs/call
# 30 times (105µs+0s) by PPI::Token::Magic::__TOKENIZER__on_char at line 228 of PPI/Token/Magic.pm, avg 4µs/call
# 22 times (54µs+0s) by PPI::Token::Unknown::__TOKENIZER__on_char at line 216 of PPI/Token/Unknown.pm, avg 2µs/call
# 3 times (8µs+0s) by PPI::Token::ArrayIndex::__TOKENIZER__on_char at line 56 of PPI/Token/ArrayIndex.pm, avg 3µs/call
# 2 times (5µs+0s) by PPI::Token::DashedWord::__TOKENIZER__on_char at line 95 of PPI/Token/DashedWord.pm, avg 2µs/call
# once (2µs+0s) by PPI::Token::Magic::__TOKENIZER__on_char at line 170 of PPI/Token/Magic.pm
# once (2µs+0s) by PPI::Token::Unknown::__TOKENIZER__on_char at line 150 of PPI/Token/Unknown.pm
# once (2µs+0s) by PPI::Token::HereDoc::__TOKENIZER__on_char at line 218 of PPI/Token/HereDoc.pm
# once (2µs+0s) by PPI::Token::Whitespace::__TOKENIZER__on_char at line 316 of PPI/Token/Whitespace.pm | ||||
600 | 94513 | 16.2ms | my $self = shift; | ||
601 | 94513 | 16.8ms | return $self->{class} unless defined $self->{token}; | ||
602 | |||||
603 | # Add the token to the token buffer | ||||
604 | 94225 | 34.9ms | push @{ $self->{tokens} }, $self->{token}; | ||
605 | 94225 | 16.6ms | $self->{token} = undef; | ||
606 | |||||
607 | # Return the parse class to that of the zone we are in | ||||
608 | 94225 | 297ms | $self->{class} = $self->{zone}; | ||
609 | } | ||||
610 | |||||
611 | # Creates a new token and sets it in the tokenizer | ||||
611 | # The defined() in here prevents a ton of calls to PPI::Util::TRUE | ||||
613 | # spent 681ms (428+253) within PPI::Tokenizer::_new_token which was called 56533 times, avg 12µs/call:
# 21222 times (159ms+94.4ms) by PPI::Tokenizer::_process_next_char at line 576, avg 12µs/call
# 14291 times (103ms+63.5ms) by PPI::Token::Word::__TOKENIZER__commit at line 533 of PPI/Token/Word.pm, avg 12µs/call
# 13365 times (103ms+47.8ms) by PPI::Token::Structure::__TOKENIZER__commit at line 75 of PPI/Token/Structure.pm, avg 11µs/call
# 3724 times (24.9ms+10.0ms) by PPI::Token::Whitespace::__TOKENIZER__on_line_start at line 159 of PPI/Token/Whitespace.pm, avg 9µs/call
# 1668 times (19.7ms+6.52ms) by PPI::Token::Whitespace::__TOKENIZER__on_line_start at line 164 of PPI/Token/Whitespace.pm, avg 16µs/call
# 1055 times (10.0ms+26.4ms) by PPI::Token::Word::__TOKENIZER__commit at line 497 of PPI/Token/Word.pm, avg 35µs/call
# 288 times (1.53ms+796µs) by PPI::Token::End::__TOKENIZER__on_line_start at line 84 of PPI/Token/End.pm, avg 8µs/call
# 242 times (1.71ms+1.08ms) by PPI::Token::Comment::__TOKENIZER__commit at line 93 of PPI/Token/Comment.pm, avg 12µs/call
# 242 times (1.62ms+1.01ms) by PPI::Token::Comment::__TOKENIZER__commit at line 94 of PPI/Token/Comment.pm, avg 11µs/call
# 144 times (1.30ms+760µs) by PPI::Token::Word::__TOKENIZER__commit at line 440 of PPI/Token/Word.pm, avg 14µs/call
# 144 times (1.29ms+646µs) by PPI::Token::End::__TOKENIZER__on_line_start at line 70 of PPI/Token/End.pm, avg 13µs/call
# 144 times (703µs+318µs) by PPI::Token::Word::__TOKENIZER__commit at line 454 of PPI/Token/Word.pm, avg 7µs/call
# 2 times (15µs+9µs) by PPI::Token::Whitespace::__TOKENIZER__on_line_start at line 170 of PPI/Token/Whitespace.pm, avg 12µs/call
# 2 times (14µs+8µs) by PPI::Token::Number::Float::__TOKENIZER__on_char at line 93 of PPI/Token/Number/Float.pm, avg 11µs/call | ||||
614 | 56533 | 9.70ms | my $self = shift; | ||
615 | # throw PPI::Exception() unless @_; | ||||
616 | 56533 | 31.6ms | my $class = substr( $_[0], 0, 12 ) eq 'PPI::Token::' | ||
617 | ? shift : 'PPI::Token::' . shift; | ||||
618 | |||||
619 | # Finalize any existing token | ||||
620 | 56533 | 38.5ms | 31193 | 67.2ms | $self->_finalize_token if defined $self->{token}; # spent 67.2ms making 31193 calls to PPI::Tokenizer::_finalize_token, avg 2µs/call |
621 | |||||
622 | # Create the new token and update the parse class | ||||
623 | 56533 | 96.6ms | 56533 | 186ms | defined($self->{token} = $class->new($_[0])) or PPI::Exception->throw; # spent 138ms making 53790 calls to PPI::Token::new, avg 3µs/call
# spent 24.2ms making 1061 calls to PPI::Token::_QuoteEngine::Full::new, avg 23µs/call
# spent 23.6ms making 1682 calls to PPI::Token::_QuoteEngine::Simple::new, avg 14µs/call |
624 | 56533 | 11.2ms | $self->{class} = $class; | ||
625 | |||||
626 | 56533 | 165ms | 1; | ||
627 | } | ||||
628 | |||||
629 | # At the end of the file, we need to clean up the results of the erroneous | ||||
630 | # space that we inserted at the beginning of the process. | ||||
631 | # spent 1.76ms (1.34+424µs) within PPI::Tokenizer::_clean_eof which was called 144 times, avg 12µs/call:
# 144 times (1.34ms+424µs) by PPI::Tokenizer::_process_next_line at line 502, avg 12µs/call | ||||
632 | 144 | 47µs | my $self = shift; | ||
633 | |||||
634 | # Finish any partially completed token | ||||
635 | 144 | 645µs | 288 | 424µs | $self->_finalize_token if $self->{token}; # spent 335µs making 144 calls to PPI::Tokenizer::_finalize_token, avg 2µs/call
# spent 89µs making 144 calls to PPI::Util::TRUE, avg 618ns/call |
636 | |||||
637 | # Find the last token, and if it has no content, kill it. | ||||
638 | # There appears to be some evidence that such "null tokens" are | ||||
639 | # somehow getting created accidentally. | ||||
640 | 144 | 132µs | my $last_token = $self->{tokens}->[ -1 ]; | ||
641 | 144 | 91µs | unless ( length $last_token->{content} ) { | ||
642 | pop @{$self->{tokens}}; | ||||
643 | } | ||||
644 | |||||
645 | # Now, if the last character of the last token is a space we added, | ||||
646 | # chop it off, deleting the token if there's nothing else left. | ||||
647 | 144 | 80µs | if ( $self->{source_eof_chop} ) { | ||
648 | $last_token = $self->{tokens}->[ -1 ]; | ||||
649 | $last_token->{content} =~ s/ $//; | ||||
650 | unless ( length $last_token->{content} ) { | ||||
651 | # Popping token | ||||
652 | pop @{$self->{tokens}}; | ||||
653 | } | ||||
654 | |||||
655 | # The hack involving adding an extra space is now reversed, and | ||||
656 | # now nobody will ever know. The perfect crime! | ||||
657 | $self->{source_eof_chop} = ''; | ||||
658 | } | ||||
659 | |||||
660 | 144 | 331µs | 1; | ||
661 | } | ||||
662 | |||||
- - | |||||
667 | ##################################################################### | ||||
668 | # Utility Methods | ||||
669 | |||||
670 | # Context | ||||
671 | sub _last_token { | ||||
672 | $_[0]->{tokens}->[-1]; | ||||
673 | } | ||||
674 | |||||
675 | # spent 589µs (488+101) within PPI::Tokenizer::_last_significant_token which was called 52 times, avg 11µs/call:
# 51 times (479µs+99µs) by PPI::Token::Whitespace::__TOKENIZER__on_char at line 265 of PPI/Token/Whitespace.pm, avg 11µs/call
# once (10µs+2µs) by PPI::Token::Whitespace::__TOKENIZER__on_char at line 321 of PPI/Token/Whitespace.pm | ||||
676 | 52 | 19µs | my $self = shift; | ||
677 | 52 | 41µs | my $cursor = $#{ $self->{tokens} }; | ||
678 | 52 | 20µs | while ( $cursor >= 0 ) { | ||
679 | 104 | 45µs | my $token = $self->{tokens}->[$cursor--]; | ||
680 | 104 | 266µs | 104 | 101µs | return $token if $token->significant; # spent 54µs making 52 calls to PPI::Token::Whitespace::significant, avg 1µs/call
# spent 46µs making 52 calls to PPI::Element::significant, avg 894ns/call |
681 | } | ||||
682 | |||||
683 | # Nothing... | ||||
684 | PPI::Token::Whitespace->null; | ||||
685 | } | ||||
686 | |||||
687 | # Get an array ref of previous significant tokens. | ||||
688 | # Like _last_significant_token except it gets more than just one token | ||||
689 | # Returns array ref on success. | ||||
690 | # Pads with null whitespace tokens when there are not enough | ||||
691 | # spent 305ms (261+43.9) within PPI::Tokenizer::_previous_significant_tokens which was called 20542 times, avg 15µs/call:
# 15490 times (172ms+28.5ms) by PPI::Token::Word::__TOKENIZER__commit at line 430 of PPI/Token/Word.pm, avg 13µs/call
# 3157 times (72.8ms+13.4ms) by PPI::Token::Whitespace::__TOKENIZER__on_char at line 222 of PPI/Token/Whitespace.pm, avg 27µs/call
# 1866 times (16.1ms+1.91ms) by PPI::Tokenizer::_opcontext at line 741, avg 10µs/call
# 25 times (469µs+119µs) by PPI::Token::Unknown::__TOKENIZER__is_an_attribute at line 305 of PPI/Token/Unknown.pm, avg 24µs/call
# 2 times (17µs+3µs) by PPI::Token::Unknown::__TOKENIZER__on_char at line 57 of PPI/Token/Unknown.pm, avg 10µs/call
# 2 times (11µs+2µs) by PPI::Token::Whitespace::__TOKENIZER__on_char at line 384 of PPI/Token/Whitespace.pm, avg 6µs/call | ||||
692 | 20542 | 4.29ms | my $self = shift; | ||
693 | 20542 | 2.60ms | my $count = shift || 1; | ||
694 | 20542 | 8.90ms | my $cursor = $#{ $self->{tokens} }; | ||
695 | |||||
696 | 20542 | 1.91ms | my ($token, @tokens); | ||
697 | 20542 | 4.68ms | while ( $cursor >= 0 ) { | ||
698 | 42181 | 14.9ms | $token = $self->{tokens}->[$cursor--]; | ||
699 | 42181 | 53.6ms | 42181 | 40.9ms | if ( $token->significant ) { # spent 25.1ms making 26762 calls to PPI::Element::significant, avg 940ns/call
# spent 13.8ms making 13592 calls to PPI::Token::Whitespace::significant, avg 1µs/call
# spent 1.88ms making 1824 calls to PPI::Token::Comment::significant, avg 1µs/call
# spent 3µs making 3 calls to PPI::Token::Pod::significant, avg 1µs/call |
700 | 26762 | 10.4ms | push @tokens, $token; | ||
701 | 26762 | 107ms | return \@tokens if scalar @tokens >= $count; | ||
702 | } | ||||
703 | } | ||||
704 | |||||
705 | # Pad with empties | ||||
706 | 144 | 424µs | foreach ( 1 .. ($count - scalar @tokens) ) { | ||
707 | 144 | 703µs | 144 | 3.03ms | push @tokens, PPI::Token::Whitespace->null; # spent 3.03ms making 144 calls to PPI::Token::Whitespace::null, avg 21µs/call |
708 | } | ||||
709 | |||||
710 | 144 | 466µs | \@tokens; | ||
711 | } | ||||
712 | |||||
713 | 1 | 7µs | my %OBVIOUS_CLASS = ( | ||
714 | 'PPI::Token::Symbol' => 'operator', | ||||
715 | 'PPI::Token::Magic' => 'operator', | ||||
716 | 'PPI::Token::Number' => 'operator', | ||||
717 | 'PPI::Token::ArrayIndex' => 'operator', | ||||
718 | 'PPI::Token::Quote::Double' => 'operator', | ||||
719 | 'PPI::Token::Quote::Interpolate' => 'operator', | ||||
720 | 'PPI::Token::Quote::Literal' => 'operator', | ||||
721 | 'PPI::Token::Quote::Single' => 'operator', | ||||
722 | 'PPI::Token::QuoteLike::Backtick' => 'operator', | ||||
723 | 'PPI::Token::QuoteLike::Command' => 'operator', | ||||
724 | 'PPI::Token::QuoteLike::Readline' => 'operator', | ||||
725 | 'PPI::Token::QuoteLike::Regexp' => 'operator', | ||||
726 | 'PPI::Token::QuoteLike::Words' => 'operator', | ||||
727 | ); | ||||
728 | |||||
729 | 1 | 2µs | my %OBVIOUS_CONTENT = ( | ||
730 | '(' => 'operand', | ||||
731 | '{' => 'operand', | ||||
732 | '[' => 'operand', | ||||
733 | ';' => 'operand', | ||||
734 | '}' => 'operator', | ||||
735 | ); | ||||
736 | |||||
737 | # Try to determine operator/operand context, if possible. | ||||
738 | # Returns "operator", "operand", or "" if unknown. | ||||
739 | # spent 34.6ms (16.1+18.5) within PPI::Tokenizer::_opcontext which was called 1866 times, avg 19µs/call:
# 1866 times (16.1ms+18.5ms) by PPI::Token::Whitespace::__TOKENIZER__on_char at line 397 of PPI/Token/Whitespace.pm, avg 19µs/call | ||||
740 | 1866 | 419µs | my $self = shift; | ||
741 | 1866 | 2.31ms | 1866 | 18.0ms | my $tokens = $self->_previous_significant_tokens(1); # spent 18.0ms making 1866 calls to PPI::Tokenizer::_previous_significant_tokens, avg 10µs/call |
742 | 1866 | 635µs | my $p0 = $tokens->[0]; | ||
743 | 1866 | 905µs | my $c0 = ref $p0; | ||
744 | |||||
745 | # Map the obvious cases | ||||
746 | 1866 | 5.32ms | return $OBVIOUS_CLASS{$c0} if defined $OBVIOUS_CLASS{$c0}; | ||
747 | 133 | 334µs | 153 | 247µs | return $OBVIOUS_CONTENT{$p0} if defined $OBVIOUS_CONTENT{$p0}; # spent 247µs making 153 calls to PPI::Token::content, avg 2µs/call |
748 | |||||
749 | # Most of the time after an operator, we are an operand | ||||
750 | 113 | 485µs | 113 | 168µs | return 'operand' if $p0->isa('PPI::Token::Operator'); # spent 168µs making 113 calls to UNIVERSAL::isa, avg 1µs/call |
751 | |||||
752 | # If there's NOTHING, it's operand | ||||
753 | 107 | 149µs | 107 | 140µs | return 'operand' if $p0->content eq ''; # spent 140µs making 107 calls to PPI::Token::content, avg 1µs/call |
754 | |||||
755 | # Otherwise, we don't know | ||||
756 | 107 | 283µs | return '' | ||
757 | } | ||||
758 | |||||
759 | 1 | 6µs | 1; | ||
760 | |||||
761 | =pod | ||||
762 | |||||
763 | =head1 NOTES | ||||
764 | |||||
765 | =head2 How the Tokenizer Works | ||||
766 | |||||
767 | Understanding the Tokenizer is not for the faint-hearted. It is by far | ||||
768 | the most complex and twisty piece of perl I've ever written that is actually | ||||
769 | still built properly and isn't a terrible spaghetti-like mess. In fact, you | ||||
770 | probably want to skip this section. | ||||
771 | |||||
772 | But if you really want to understand, well then here goes. | ||||
773 | |||||
774 | =head2 Source Input and Clean Up | ||||
775 | |||||
776 | The Tokenizer starts by taking source in a variety of forms, sucking it | ||||
777 | all in and merging it into one big string, and doing our own internal line | ||||
778 | split, using a "universal line separator" which allows the Tokenizer to | ||||
779 | take source for any platform (and even supports a few known types of | ||||
780 | broken newlines caused by mixed mac/pc/*nix editor screw ups). | ||||
781 | |||||
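The "universal line separator" idea can be sketched as a split that accepts any of the common newline conventions in a single pass. This is an illustrative Python model, not PPI's actual Perl code; the function name is hypothetical.

```python
import re

def split_universal(source: str) -> list[str]:
    # Split on \r\n, bare \r, or bare \n, keeping each separator
    # attached to its line, as a tokenizer's internal line split would.
    # The final empty match (at end-of-string via \Z) is dropped.
    return re.findall(r'[^\r\n]*(?:\r\n|\r|\n|\Z)', source)[:-1]
```

Mixed and broken newlines simply fall out of the alternation order: `\r\n` is tried before bare `\r`, so a DOS line is never split in two.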
782 | The resulting array of lines is used to feed the tokenizer, and is also | ||||
783 | accessed directly by the heredoc-logic to do the line-oriented part of | ||||
784 | here-doc support. | ||||
785 | |||||
786 | =head2 Doing Things the Old Fashioned Way | ||||
787 | |||||
788 | Due to the complexity of perl, and after 2 previously aborted parser | ||||
789 | attempts, in the end the tokenizer was fashioned around a line-buffered | ||||
790 | character-by-character method. | ||||
791 | |||||
792 | That is, the Tokenizer pulls and holds a line at a time into a line buffer, | ||||
793 | and then iterates a cursor along it. At each cursor position, a method is | ||||
794 | called in whatever token class we are currently in, which will examine the | ||||
795 | character at the current position, and handle it. | ||||
796 | |||||
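The cursor-plus-handler loop above can be modelled in a few lines. This is a deliberately tiny Python sketch with only two made-up token classes ("word" and "whitespace"); PPI's real dispatch goes through per-class `__TOKENIZER__on_char` methods and is far richer.

```python
def tokenize_line(line):
    tokens = []
    current = ""           # the token being built ("current token")
    current_class = None   # the "current class" state variable

    def classify(ch):
        # Stand-in for the per-class character handlers.
        return "whitespace" if ch.isspace() else "word"

    for cursor in range(len(line)):
        ch = line[cursor]
        cls = classify(ch)
        if cls == current_class:
            current += ch                             # char extends the token
        else:
            if current:
                tokens.append((current_class, current))  # finalize old token
            current, current_class = ch, cls             # start a new one
    if current:
        tokens.append((current_class, current))
    return tokens
```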
797 | As the handler methods in the various token classes are called, they | ||||
798 | build up an output token array for the source code. | ||||
799 | |||||
800 | Various parts of the Tokenizer use look-ahead, arbitrary-distance | ||||
801 | look-behind (although currently the maximum is three significant tokens), | ||||
802 | or both, and various other heuristic guesses. | ||||
803 | |||||
804 | I've been told it is officially termed a I<"backtracking parser | ||||
805 | with infinite lookaheads">. | ||||
806 | |||||
807 | =head2 State Variables | ||||
808 | |||||
809 | Aside from the current line and the character cursor, the Tokenizer | ||||
810 | maintains a number of different state variables. | ||||
811 | |||||
812 | =over | ||||
813 | |||||
814 | =item Current Class | ||||
815 | |||||
816 | The Tokenizer maintains the current token class at all times. Much of the | ||||
817 | time is just going to be the "Whitespace" class, which is what the base of | ||||
818 | a document is. As the tokenizer executes the various character handlers, | ||||
819 | the class changes a lot as it moves along. In fact, in some instances, | ||||
820 | the character handler may not handle the character directly itself, but | ||||
821 | rather change the "current class" and then hand off to the character | ||||
822 | handler for the new class. | ||||
823 | |||||
824 | Because of this, and some other things I'll deal with later, the number of | ||||
825 | times the character handlers are called does not in fact have a direct | ||||
826 | relationship to the number of actual characters in the document. | ||||
827 | |||||
828 | =item Current Zone | ||||
829 | |||||
830 | Rather than create a class stack to allow for infinitely nested layers of | ||||
831 | classes, the Tokenizer recognises just a single layer. | ||||
832 | |||||
833 | To put it a different way, in various parts of the file, the Tokenizer will | ||||
834 | recognise different "base" or "substrate" classes. When a Token such as a | ||||
835 | comment or a number is finalised by the tokenizer, it "falls back" to the | ||||
836 | base state. | ||||
837 | |||||
838 | This allows proper tokenization of special areas such as __DATA__ | ||||
839 | and __END__ blocks, which also contain things like comments and POD, | ||||
840 | without allowing the creation of any significant Tokens inside these areas. | ||||
841 | |||||
842 | For the main part of a document we use L<PPI::Token::Whitespace> for this, | ||||
843 | with the idea being that code is "floating in a sea of whitespace". | ||||
844 | |||||
845 | =item Current Token | ||||
846 | |||||
847 | The final main state variable is the "current token". This is the Token | ||||
848 | that is currently being built by the Tokenizer. For certain types, it | ||||
849 | can be manipulated and morphed and change class quite a bit while being | ||||
850 | assembled, as the Tokenizer's understanding of the token content changes. | ||||
851 | |||||
852 | When the Tokenizer is confident that it has seen the end of the Token, it | ||||
853 | will be "finalized", which adds it to the output token array and resets | ||||
854 | the current class to that of the zone that we are currently in. | ||||
855 | |||||
856 | I should also note at this point that the "current token" variable is | ||||
857 | optional. The Tokenizer is capable of knowing what class it is currently | ||||
858 | set to, without actually having accumulated any characters in the Token. | ||||
859 | |||||
860 | =back | ||||
861 | |||||
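The interaction of the three state variables can be summarised in a toy model: finalizing pushes the current token (if any) onto the output array and resets the current class to the zone's class, mirroring what `_finalize_token` does in the code above. The class below is a hypothetical illustration, not PPI's implementation.

```python
class ToyTokenizer:
    def __init__(self):
        self.zone = "whitespace"   # substrate class for this region
        self.cls = self.zone       # current token class
        self.token = None          # current token content (optional!)
        self.tokens = []           # output token array

    def finalize_token(self):
        # Push the finished token, then fall back to the zone's class.
        if self.token is not None:
            self.tokens.append((self.cls, self.token))
            self.token = None
        self.cls = self.zone
        return self.cls            # returned as a convenience, as in PPI
```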
862 | =head2 Making It Faster | ||||
863 | |||||
864 | As I'm sure you can imagine, calling several different methods for each | ||||
865 | character and running regexes and other complex heuristics made the first | ||||
866 | fully working version of the tokenizer extremely slow. | ||||
867 | |||||
868 | During testing, I created a metric to measure parsing speed called | ||||
869 | LPGC, or "lines per gigacycle". A gigacycle is simply a billion CPU | ||||
870 | cycles on a typical single-core CPU, and so a Tokenizer running at | ||||
871 | "1000 lines per gigacycle" should generate around 1200 lines of tokenized | ||||
872 | code per second when running on a 1200 MHz processor. | ||||
873 | |||||
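The arithmetic behind the metric: a 1200 MHz CPU completes 1.2 gigacycles per second, so throughput in lines per second is LPGC times gigacycles per second. A one-line helper makes the numbers in this section concrete:

```python
def lines_per_second(lpgc, cpu_mhz):
    # cpu_mhz / 1000 = gigacycles completed per second
    return lpgc * (cpu_mhz / 1000.0)
```

At 1000 LPGC on a 1200 MHz processor this gives 1200 lines/second; the first working version's 350 LPGC gives only 420.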
874 | The first working version of the tokenizer ran at only 350 LPGC, so to | ||||
875 | tokenize a typical large module such as L<ExtUtils::MakeMaker> took | ||||
876 | 10-15 seconds. This sluggishness made it impractical for many uses. | ||||
877 | |||||
878 | So in the current parser, there are multiple layers of optimisation | ||||
879 | very carefully built into the basic design. This has brought the tokenizer | ||||
880 | up to a more reasonable 1000 LPGC, at the expense of making the code | ||||
881 | quite a bit twistier. | ||||
882 | |||||
883 | =head2 Making It Faster - Whole Line Classification | ||||
884 | |||||
885 | The first step in the optimisation process was to add a new handler to | ||||
886 | enable several of the more basic classes (whitespace, comments) to be | ||||
887 | parsed a line at a time. At the start of each line, a | ||||
888 | special optional handler (only supported by a few classes) is called to | ||||
889 | check and see if the entire line can be parsed in one go. | ||||
890 | |||||
891 | This is used mainly to handle things like POD, comments, empty lines, | ||||
892 | and a few other minor special cases. | ||||
893 | |||||
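A line-level fast path of this kind can be sketched as a short list of whole-line patterns tried before any per-character work. The patterns and names here are illustrative (blank lines and `#` comments), not the actual set of classes PPI supports:

```python
import re

# Whole-line classifiers, tried at the start of each line.
WHOLE_LINE = [
    ("comment",    re.compile(r'^\s*#')),   # entire line is a comment
    ("whitespace", re.compile(r'^\s*$')),   # entire line is blank
]

def classify_whole_line(line):
    for name, pattern in WHOLE_LINE:
        if pattern.match(line):
            return name    # consume the whole line in one go
    return None            # fall back to char-by-char parsing
```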
894 | =head2 Making It Faster - Inlining | ||||
895 | |||||
896 | The second stage of the optimisation involved inlining a small | ||||
897 | number of critical methods that were repeated an extremely high number | ||||
898 | of times. Profiling suggested that there were about 1,000,000 individual | ||||
899 | method calls per gigacycle, and by cutting these by two thirds a significant | ||||
900 | speed improvement was gained, on the order of about 50%. | ||||
901 | |||||
902 | You may notice that many methods in the C<PPI::Tokenizer> code look | ||||
903 | very nested and written longhand. This is primarily due to this inlining. | ||||
904 | |||||
905 | At around this time, some statistics code that existed in the early | ||||
906 | versions of the parser was also removed, as it was determined that | ||||
907 | it was consuming around 15% of the CPU for the entire parser, while | ||||
908 | making the core more complicated. | ||||
909 | |||||
910 | A judgment call was made that with the difficulties likely to be | ||||
911 | encountered with future planned enhancements, and given the relatively | ||||
912 | high cost involved, the statistics features would be removed from the | ||||
913 | Tokenizer. | ||||
914 | |||||
915 | =head2 Making It Faster - Quote Engine | ||||
916 | |||||
917 | Once inlining had reached diminishing returns, it became obvious from | ||||
918 | the profiling results that a huge amount of time was being spent | ||||
919 | stepping a char at a time through long, simple and "syntactically boring" | ||||
920 | code such as comments and strings. | ||||
921 | |||||
922 | The existing regex engine was expanded to also encompass quotes and | ||||
923 | other quote-like things, and a special abstract base class was added | ||||
924 | that provided a number of specialised parsing methods that would "scan | ||||
925 | ahead", scanning forward to find the end of a string, and updating | ||||
926 | the cursor to leave it in a valid position for the next call. | ||||
927 | |||||
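The essence of such a scan-ahead method is: given a cursor just past an opening delimiter, jump straight to the matching close (honouring backslash escapes) and report where the cursor should resume. A minimal Python sketch of the idea, with a hypothetical function name:

```python
def scan_ahead_string(text, cursor, quote="'"):
    # cursor points just past the opening quote. Returns the index just
    # past the closing quote, leaving the cursor valid for the next
    # handler call, or None if the string is unterminated.
    i = cursor
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2            # skip the escaped character
        elif ch == quote:
            return i + 1
        else:
            i += 1
    return None
```

Unterminated strings (the `None` case) are where a real tokenizer must pull the next line into the buffer and continue scanning.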
928 | This is also the point at which the number of character handler calls began | ||||
929 | to greatly differ from the number of characters. But it has been done | ||||
930 | in a way that allows the parser to retain the power of the original | ||||
931 | version at the critical points, while skipping through the "boring bits" | ||||
932 | as needed for additional speed. | ||||
933 | |||||
934 | The addition of this feature allowed the tokenizer to exceed 1000 LPGC | ||||
935 | for the first time. | ||||
936 | |||||
937 | =head2 Making It Faster - The "Complete" Mechanism | ||||
938 | |||||
939 | As it became evident that great speed increases were available by using | ||||
940 | this "skipping ahead" mechanism, a new handler method was added that | ||||
941 | explicitly handles the parsing of an entire token, where the structure | ||||
942 | of the token is relatively simple. Tokens such as symbols fit this case, | ||||
943 | as once we are past the initial sigil and word char, we know that we | ||||
944 | can skip ahead and "complete" the rest of the token much more easily. | ||||
945 | |||||
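The symbol case can be sketched as a single regex match: once the sigil and first word character are seen, one pattern consumes the remainder of the identifier, including `::` (and the archaic `'`) package separators. This is an illustrative Python model of a "complete" handler, not PPI's actual regex:

```python
import re

# Rest of a symbol after the first word char: more word chars,
# optionally continued by ::-separated (or '-separated) segments.
SYMBOL_REST = re.compile(r"\w*(?:(?:::|')\w+)*")

def complete_symbol(text, cursor):
    # cursor sits on the first word char after the sigil.
    m = SYMBOL_REST.match(text, cursor)
    return text[cursor:m.end()], m.end()   # token rest + new cursor
```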
946 | A number of these have been added for most or possibly all of the common | ||||
947 | cases, with most of these "complete" handlers implemented using regular | ||||
948 | expressions. | ||||
949 | |||||
950 | In fact, so many have been added that at this point, you could arguably | ||||
951 | reclassify the tokenizer as a "hybrid regex, char-by-char heuristic | ||||
952 | tokenizer". More tokens are now consumed in "complete" methods in a | ||||
953 | typical program than are handled by the normal char-by-char methods. | ||||
954 | |||||
955 | Many of these complete-handlers were implemented during the writing | ||||
956 | of the Lexer, and this has allowed the full parser to maintain around | ||||
957 | 1000 LPGC despite the increasing weight of the Lexer. | ||||
958 | |||||
959 | =head2 Making It Faster - Porting To C (In Progress) | ||||
960 | |||||
961 | While it would be extraordinarily difficult to port all of the Tokenizer | ||||
962 | to C, work has started on a L<PPI::XS> "accelerator" package which acts as | ||||
963 | a separate and automatically-detected add-on to the main PPI package. | ||||
964 | |||||
965 | L<PPI::XS> implements faster versions of a variety of functions scattered | ||||
966 | over the entire PPI codebase, from the Tokenizer Core, Quote Engine, and | ||||
967 | various other places, and implements them identically in XS/C. | ||||
968 | |||||
969 | In particular, the skip-ahead methods from the Quote Engine would appear | ||||
970 | to be extremely amenable to being done in C, and a number of other | ||||
971 | functions could be cherry-picked one at a time and implemented in C. | ||||
972 | |||||
973 | Each method is heavily tested to ensure that the functionality is | ||||
974 | identical, and a versioning mechanism is included to ensure that if a | ||||
975 | function gets out of sync, L<PPI::XS> will degrade gracefully and just | ||||
976 | not replace that single method. | ||||
977 | |||||
978 | =head1 TO DO | ||||
979 | |||||
980 | - Add an option to reset or seek the token stream... | ||||
981 | |||||
982 | - Implement more Tokenizer functions in L<PPI::XS> | ||||
983 | |||||
984 | =head1 SUPPORT | ||||
985 | |||||
986 | See the L<support section|PPI/SUPPORT> in the main module. | ||||
987 | |||||
988 | =head1 AUTHOR | ||||
989 | |||||
990 | Adam Kennedy E<lt>adamk@cpan.orgE<gt> | ||||
991 | |||||
992 | =head1 COPYRIGHT | ||||
993 | |||||
994 | Copyright 2001 - 2011 Adam Kennedy. | ||||
995 | |||||
996 | This program is free software; you can redistribute | ||||
997 | it and/or modify it under the same terms as Perl itself. | ||||
998 | |||||
999 | The full text of the license can be found in the | ||||
1000 | LICENSE file included with this module. | ||||
1001 | |||||
1002 | =cut | ||||
# spent 4.15ms within PPI::Tokenizer::CORE:match which was called 15534 times, avg 267ns/call:
# 15534 times (4.15ms+0s) by List::MoreUtils::any at line 211, avg 267ns/call | |||||
# spent 162ms within PPI::Tokenizer::CORE:subst which was called 144 times, avg 1.12ms/call:
# 144 times (162ms+0s) by PPI::Tokenizer::new at line 186, avg 1.12ms/call |