Filename | /Users/timbo/perl5/perlbrew/perls/perl-5.18.2/lib/site_perl/5.18.2/PPI/Tokenizer.pm |
Statements | Executed 3487328 statements in 3.64s |
Calls | P | F | Exclusive Time | Inclusive Time | Subroutine |
---|---|---|---|---|---|
149609 | 1 | 1 | 1.30s | 4.58s | _process_next_char | PPI::Tokenizer::
26904 | 2 | 1 | 857ms | 6.13s | _process_next_line | PPI::Tokenizer::
94513 | 1 | 1 | 602ms | 6.79s | get_token | PPI::Tokenizer::
56533 | 14 | 7 | 428ms | 681ms | _new_token | PPI::Tokenizer::
20542 | 6 | 4 | 261ms | 305ms | _previous_significant_tokens | PPI::Tokenizer::
94513 | 29 | 16 | 218ms | 218ms | _finalize_token | PPI::Tokenizer::
27281 | 3 | 2 | 186ms | 246ms | _fill_line | PPI::Tokenizer::
144 | 1 | 1 | 162ms | 162ms | CORE:subst (opcode) | PPI::Tokenizer::
144 | 1 | 1 | 118ms | 503ms | new | PPI::Tokenizer::
27287 | 3 | 2 | 60.1ms | 60.1ms | _get_line | PPI::Tokenizer::
1866 | 1 | 1 | 16.1ms | 34.6ms | _opcontext | PPI::Tokenizer::
15534 | 1 | 1 | 4.15ms | 4.15ms | CORE:match (opcode) | PPI::Tokenizer::
144 | 1 | 1 | 1.34ms | 1.76ms | _clean_eof | PPI::Tokenizer::
52 | 2 | 1 | 488µs | 589µs | _last_significant_token | PPI::Tokenizer::
1 | 1 | 1 | 135µs | 224µs | BEGIN@88 | PPI::Tokenizer::
1 | 1 | 1 | 12µs | 23µs | BEGIN@81 | PPI::Tokenizer::
1 | 1 | 1 | 7µs | 35µs | BEGIN@82 | PPI::Tokenizer::
1 | 1 | 1 | 6µs | 23µs | BEGIN@90 | PPI::Tokenizer::
1 | 1 | 1 | 3µs | 3µs | BEGIN@83 | PPI::Tokenizer::
1 | 1 | 1 | 3µs | 3µs | BEGIN@84 | PPI::Tokenizer::
1 | 1 | 1 | 3µs | 3µs | BEGIN@85 | PPI::Tokenizer::
1 | 1 | 1 | 3µs | 3µs | BEGIN@87 | PPI::Tokenizer::
1 | 1 | 1 | 3µs | 3µs | BEGIN@86 | PPI::Tokenizer::
1 | 1 | 1 | 3µs | 3µs | BEGIN@91 | PPI::Tokenizer::
0 | 0 | 0 | 0s | 0s | __ANON__[:211] | PPI::Tokenizer::
0 | 0 | 0 | 0s | 0s | _char | PPI::Tokenizer::
0 | 0 | 0 | 0s | 0s | _last_token | PPI::Tokenizer::
0 | 0 | 0 | 0s | 0s | all_tokens | PPI::Tokenizer::
0 | 0 | 0 | 0s | 0s | decrement_cursor | PPI::Tokenizer::
0 | 0 | 0 | 0s | 0s | increment_cursor | PPI::Tokenizer::
Line | Statements | Time on line | Calls | Time in subs | Code |
---|---|---|---|---|---|
1 | package PPI::Tokenizer; | ||||
2 | |||||
3 | =pod | ||||
4 | |||||
5 | =head1 NAME | ||||
6 | |||||
7 | PPI::Tokenizer - The Perl Document Tokenizer | ||||
8 | |||||
9 | =head1 SYNOPSIS | ||||
10 | |||||
11 | # Create a tokenizer for a file, array or string | ||||
12 | $Tokenizer = PPI::Tokenizer->new( 'filename.pl' ); | ||||
13 | $Tokenizer = PPI::Tokenizer->new( \@lines ); | ||||
14 | $Tokenizer = PPI::Tokenizer->new( \$source ); | ||||
15 | |||||
16 | # Return all the tokens for the document | ||||
17 | my $tokens = $Tokenizer->all_tokens; | ||||
18 | |||||
19 | # Or we can use it as an iterator | ||||
20 | while ( my $Token = $Tokenizer->get_token ) { | ||||
21 | print "Found token '$Token'\n"; | ||||
22 | } | ||||
23 | |||||
24 | # If we REALLY need to manually nudge the cursor, you | ||||
25 | # can do that too (the lexer needs this ability to do rollbacks) | ||||
26 | $is_incremented = $Tokenizer->increment_cursor; | ||||
27 | $is_decremented = $Tokenizer->decrement_cursor; | ||||
28 | |||||
29 | =head1 DESCRIPTION | ||||
30 | |||||
31 | PPI::Tokenizer is the class that provides Tokenizer objects for use in | ||||
32 | breaking strings of Perl source code into Tokens. | ||||
33 | |||||
34 | By the time you are reading this, you probably need to know a little | ||||
35 | about the difference between how perl parses Perl "code" and how PPI | ||||
36 | parses Perl "documents". | ||||
37 | |||||
38 | "perl" itself (the interpreter) uses a heavily modified lex specification | ||||
39 | to specify its parsing logic, maintains several types of state as it | ||||
40 | goes, and incrementally tokenizes, lexes AND EXECUTES at the same time. | ||||
41 | |||||
42 | In fact, it is provably impossible to use perl's parsing method without | ||||
43 | simultaneously executing code. A formal mathematical proof has been | ||||
44 | published demonstrating the method. | ||||
45 | |||||
46 | This is where the truism "Only perl can parse Perl" comes from. | ||||
47 | |||||
48 | PPI uses a completely different approach by abandoning the (impossible) | ||||
49 | ability to parse Perl the same way that the interpreter does, and instead | ||||
50 | parsing the source as a document, using a document structure independently | ||||
51 | derived from the Perl documentation and approximating the perl interpreter | ||||
52 | interpretation as closely as possible. | ||||
53 | |||||
54 | It was touch and go for a long time whether we could get it close enough, | ||||
55 | but in the end it turned out that it could be done. | ||||
56 | |||||
57 | In this approach, the tokenizer C<PPI::Tokenizer> is implemented separately | ||||
58 | from the lexer L<PPI::Lexer>. | ||||
59 | |||||
60 | The job of C<PPI::Tokenizer> is to take pure source as a string and break it | ||||
61 | up into a stream/set of tokens; it contains most of the "black magic" used | ||||
62 | in PPI. By comparison, the lexer implements a relatively straightforward | ||||
63 | tree structure, and has an implementation that is uncomplicated (compared | ||||
64 | to the insanity in the tokenizer at least). | ||||
65 | |||||
66 | The Tokenizer uses an immense amount of heuristics, guessing and cruft, | ||||
67 | supported by a very B<VERY> flexible internal API, but fortunately it was | ||||
68 | possible to largely encapsulate the black magic, so there is not a lot that | ||||
69 | gets exposed to people using the C<PPI::Tokenizer> itself. | ||||
70 | |||||
71 | =head1 METHODS | ||||
72 | |||||
73 | Despite the incredible complexity, the Tokenizer itself only exposes a | ||||
74 | relatively small number of methods, with most of the complexity implemented | ||||
75 | in private methods. | ||||
76 | |||||
77 | =cut | ||||
78 | |||||
79 | # Make sure everything we need is loaded so | ||||
80 | # we don't have to go and load all of PPI. | ||||
81 | 2 | 21µs | 2 | 34µs | use strict; # spent 23µs (12+11) within PPI::Tokenizer::BEGIN@81 which was called:
# once (12µs+11µs) by PPI::BEGIN@28 at line 81 # spent 23µs making 1 call to PPI::Tokenizer::BEGIN@81
# spent 11µs making 1 call to strict::import |
82 | 2 | 19µs | 2 | 63µs | # spent 35µs (7+28) within PPI::Tokenizer::BEGIN@82 which was called:
# once (7µs+28µs) by PPI::BEGIN@28 at line 82 # spent 35µs making 1 call to PPI::Tokenizer::BEGIN@82
# spent 28µs making 1 call to Exporter::import |
83 | 2 | 18µs | 1 | 3µs | # spent 3µs within PPI::Tokenizer::BEGIN@83 which was called:
# once (3µs+0s) by PPI::BEGIN@28 at line 83 # spent 3µs making 1 call to PPI::Tokenizer::BEGIN@83 |
84 | 2 | 15µs | 1 | 3µs | # spent 3µs within PPI::Tokenizer::BEGIN@84 which was called:
# once (3µs+0s) by PPI::BEGIN@28 at line 84 # spent 3µs making 1 call to PPI::Tokenizer::BEGIN@84 |
85 | 2 | 14µs | 1 | 3µs | # spent 3µs within PPI::Tokenizer::BEGIN@85 which was called:
# once (3µs+0s) by PPI::BEGIN@28 at line 85 # spent 3µs making 1 call to PPI::Tokenizer::BEGIN@85 |
86 | 2 | 20µs | 1 | 3µs | # spent 3µs within PPI::Tokenizer::BEGIN@86 which was called:
# once (3µs+0s) by PPI::BEGIN@28 at line 86 # spent 3µs making 1 call to PPI::Tokenizer::BEGIN@86 |
87 | 2 | 15µs | 1 | 3µs | # spent 3µs within PPI::Tokenizer::BEGIN@87 which was called:
# once (3µs+0s) by PPI::BEGIN@28 at line 87 # spent 3µs making 1 call to PPI::Tokenizer::BEGIN@87 |
88 | 2 | 79µs | 1 | 224µs | # spent 224µs (135+89) within PPI::Tokenizer::BEGIN@88 which was called:
# once (135µs+89µs) by PPI::BEGIN@28 at line 88 # spent 224µs making 1 call to PPI::Tokenizer::BEGIN@88 |
89 | |||||
90 | 2 | 22µs | 2 | 39µs | use vars qw{$VERSION}; # spent 23µs (6+16) within PPI::Tokenizer::BEGIN@90 which was called:
# once (6µs+16µs) by PPI::BEGIN@28 at line 90 # spent 23µs making 1 call to PPI::Tokenizer::BEGIN@90
# spent 16µs making 1 call to vars::import |
91 | BEGIN { # spent 3µs within PPI::Tokenizer::BEGIN@91 which was called:
# once (3µs+0s) by PPI::BEGIN@28 at line 93 | ||||
92 | 1 | 4µs | $VERSION = '1.215'; | ||
93 | 1 | 1.57ms | 1 | 3µs | } # spent 3µs making 1 call to PPI::Tokenizer::BEGIN@91 |
94 | |||||
- - | |||||
99 | ##################################################################### | ||||
100 | # Creation and Initialization | ||||
101 | |||||
102 | =pod | ||||
103 | |||||
104 | =head2 new $file | \@lines | \$source | ||||
105 | |||||
106 | The main C<new> constructor creates a new Tokenizer object. These | ||||
107 | objects have no configuration parameters, and can only be used once, | ||||
108 | to tokenize a single perl source file. | ||||
109 | |||||
110 | It takes as argument either the name of a file containing the source, | ||||
111 | a reference to a SCALAR containing source code, or a reference to an | ||||
112 | ARRAY containing newline-terminated lines of source code. | ||||
113 | |||||
114 | Returns a new C<PPI::Tokenizer> object on success, or throws a | ||||
115 | L<PPI::Exception> exception on error. | ||||
116 | |||||
117 | =cut | ||||
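A minimal sketch of the constructor forms described above (the filename is hypothetical, and errors are trapped with eval since C<new> throws L<PPI::Exception> objects):

```perl
use PPI::Tokenizer;

# From source already in memory
my $source    = 'my $x = 1;';
my $Tokenizer = eval { PPI::Tokenizer->new( \$source ) };

# From a file on disk (hypothetical path), or from an array of lines
my $FromFile  = eval { PPI::Tokenizer->new( 'script.pl' ) };
my $FromLines = eval { PPI::Tokenizer->new( [ "my \$y = 2;\n" ] ) };

warn "Construction failed: $@" if $@;
```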
118 | |||||
119 | sub new { # spent 503ms (118ms+384ms) within PPI::Tokenizer::new which was called 144 times, avg 3.49ms/call:
# 144 times (118ms+384ms) by PPI::Lexer::lex_file at line 159 of PPI/Lexer.pm, avg 3.49ms/call | ||||
120 | 144 | 119µs | my $class = ref($_[0]) || $_[0]; | ||
121 | |||||
122 | # Create the empty tokenizer struct | ||||
123 | 144 | 1.61ms | my $self = bless { | ||
124 | # Source code | ||||
125 | source => undef, | ||||
126 | source_bytes => undef, | ||||
127 | |||||
128 | # Line buffer | ||||
129 | line => undef, | ||||
130 | line_length => undef, | ||||
131 | line_cursor => undef, | ||||
132 | line_count => 0, | ||||
133 | |||||
134 | # Parse state | ||||
135 | token => undef, | ||||
136 | class => 'PPI::Token::BOM', | ||||
137 | zone => 'PPI::Token::Whitespace', | ||||
138 | |||||
139 | # Output token buffer | ||||
140 | tokens => [], | ||||
141 | token_cursor => 0, | ||||
142 | token_eof => 0, | ||||
143 | |||||
144 | # Perl 6 blocks | ||||
145 | perl6 => [], | ||||
146 | }, $class; | ||||
147 | |||||
148 | 144 | 208µs | if ( ! defined $_[1] ) { | ||
149 | # We weren't given anything | ||||
150 | PPI::Exception->throw("No source provided to Tokenizer"); | ||||
151 | |||||
152 | } elsif ( ! ref $_[1] ) { | ||||
153 | 144 | 566µs | 144 | 187ms | my $source = PPI::Util::_slurp($_[1]); # spent 187ms making 144 calls to PPI::Util::_slurp, avg 1.30ms/call |
154 | 144 | 1.20ms | if ( ref $source ) { | ||
155 | # Content returned by reference | ||||
156 | $self->{source} = $$source; | ||||
157 | } else { | ||||
158 | # Errors returned as a string | ||||
159 | return( $source ); | ||||
160 | } | ||||
161 | |||||
162 | } elsif ( _SCALAR0($_[1]) ) { | ||||
163 | $self->{source} = ${$_[1]}; | ||||
164 | |||||
165 | } elsif ( _ARRAY0($_[1]) ) { | ||||
166 | $self->{source} = join '', map { "$_\n" } @{$_[1]}; | ||||
167 | |||||
168 | } else { | ||||
169 | # We don't support whatever this is | ||||
170 | PPI::Exception->throw(ref($_[1]) . " is not supported as a source provider"); | ||||
171 | } | ||||
172 | |||||
173 | # We can't handle a null string | ||||
174 | 144 | 289µs | $self->{source_bytes} = length $self->{source}; | ||
175 | 144 | 3.62ms | if ( $self->{source_bytes} > 1048576 ) { | ||
176 | # Dammit! It's ALWAYS the "Perl" modules larger than a | ||||
177 | # meg that seems to blow up the Tokenizer/Lexer. | ||||
178 | # Nobody actually writes real programs larger than a meg | ||||
179 | # Perl::Tidy (the largest) is only 800k. | ||||
180 | # It is always these idiots with massive Data::Dumper | ||||
181 | # structs or huge RecDescent parser. | ||||
182 | PPI::Exception::ParserRejection->throw("File is too large"); | ||||
183 | |||||
184 | } elsif ( $self->{source_bytes} ) { | ||||
185 | # Split on local newlines | ||||
186 | 144 | 163ms | 144 | 162ms | $self->{source} =~ s/(?:\015{1,2}\012|\015|\012)/\n/g; # spent 162ms making 144 calls to PPI::Tokenizer::CORE:subst, avg 1.12ms/call |
187 | 144 | 107ms | $self->{source} = [ split /(?<=\n)/, $self->{source} ]; | ||
188 | |||||
189 | } else { | ||||
190 | $self->{source} = [ ]; | ||||
191 | } | ||||
192 | |||||
193 | ### EVIL | ||||
194 | # I'm explaining this earlier than I should so you can understand | ||||
195 | # why I'm about to do something that looks very strange. There's | ||||
196 | # a problem with the Tokenizer, in that tokens tend to change | ||||
197 | # classes as each letter is added, but they don't get allocated | ||||
198 | # their definite final class until the "end" of the token, the | ||||
199 | # detection of which occurs in about a hundred different places, | ||||
200 | # all through various crufty code (that triples the speed). | ||||
201 | # | ||||
202 | # However, in general, this does not apply to tokens in which a | ||||
203 | # whitespace character is valid, such as comments, whitespace and | ||||
204 | # big strings. | ||||
205 | # | ||||
206 | # So what we do is add a space to the end of the source. This | ||||
207 | # triggers normal "end of token" functionality for all cases. Then, | ||||
208 | # once the tokenizer hits end of file, it examines the last token to | ||||
209 | # manually either remove the ' ' token, or chop it off the end of | ||||
210 | # a longer one in which the space would be valid. | ||||
211 | 15678 | 34.2ms | 15678 | 39.0ms | if ( List::MoreUtils::any { /^__(?:DATA|END)__\s*$/ } @{$self->{source}} ) { # spent 34.9ms making 144 calls to List::MoreUtils::any, avg 242µs/call
# spent 4.15ms making 15534 calls to PPI::Tokenizer::CORE:match, avg 267ns/call |
212 | $self->{source_eof_chop} = ''; | ||||
213 | } elsif ( ! defined $self->{source}->[0] ) { | ||||
214 | $self->{source_eof_chop} = ''; | ||||
215 | } elsif ( $self->{source}->[-1] =~ /\s$/ ) { | ||||
216 | $self->{source_eof_chop} = ''; | ||||
217 | } else { | ||||
218 | $self->{source_eof_chop} = 1; | ||||
219 | $self->{source}->[-1] .= ' '; | ||||
220 | } | ||||
221 | |||||
222 | 144 | 765µs | $self; | ||
223 | } | ||||
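The substitution at line 186 of the listing above folds DOS (\015\012), old Mac (\015) and Unix (\012) line endings into a single logical "\n" before the keep-the-newline split at line 187; a standalone sketch of the same two steps:

```perl
my $source = "unix line\012dos line\015\012mac line\015last line";

# Normalize all three newline conventions to "\n" (same regex as line 186)
$source =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;

# Split while keeping the newline on the end of each line (as at line 187)
my @lines = split /(?<=\n)/, $source;
# @lines now holds four entries, the first three ending in "\n"
```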
224 | |||||
- - | |||||
229 | ##################################################################### | ||||
230 | # Main Public Methods | ||||
231 | |||||
232 | =pod | ||||
233 | |||||
234 | =head2 get_token | ||||
235 | |||||
236 | When using the PPI::Tokenizer object as an iterator, the C<get_token> | ||||
237 | method is the primary method that is used. It increments the cursor | ||||
238 | and returns the next Token in the output array. | ||||
239 | |||||
240 | The actual parsing of the file is done only as-needed, and a line at | ||||
241 | a time. When C<get_token> hits the end of the token array, it will | ||||
242 | cause the parser to pull in the next line and parse it, continuing | ||||
243 | as needed until there are more tokens on the output array that | ||||
244 | get_token can then return. | ||||
245 | |||||
246 | This means that a number of Tokenizer objects can be created, and | ||||
247 | won't consume significant CPU until you actually begin to pull tokens | ||||
248 | from them. | ||||
249 | |||||
250 | Returns a L<PPI::Token> object on success, C<0> if the Tokenizer has | ||||
251 | reached the end of the file, or C<undef> on error. | ||||
252 | |||||
253 | =cut | ||||
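Based on the return values documented above, an iteration sketch that distinguishes a clean EOF (C<0>) from an error (C<undef>):

```perl
use PPI::Tokenizer;

my $source    = 'print "hello";';
my $Tokenizer = PPI::Tokenizer->new( \$source );

my $Token;
while ( $Token = $Tokenizer->get_token ) {
    printf "%-28s %s\n", ref($Token), "'$Token'";
}
die "Tokenizer error" unless defined $Token;   # undef => error, 0 => clean EOF
```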
254 | |||||
255 | sub get_token { # spent 6.79s (602ms+6.19s) within PPI::Tokenizer::get_token which was called 94513 times, avg 72µs/call:
# 94513 times (602ms+6.19s) by PPI::Lexer::_get_token at line 1413 of PPI/Lexer.pm, avg 72µs/call | ||||
256 | 94513 | 17.5ms | my $self = shift; | ||
257 | |||||
258 | # Shortcut for EOF | ||||
259 | 94513 | 15.6ms | if ( $self->{token_eof} | ||
260 | and $self->{token_cursor} > scalar @{$self->{tokens}} | ||||
261 | ) { | ||||
262 | return 0; | ||||
263 | } | ||||
264 | |||||
265 | # Return the next token if we can | ||||
266 | 94513 | 298ms | 82384 | 48.3ms | if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) { # spent 48.3ms making 82384 calls to PPI::Util::TRUE, avg 587ns/call |
267 | 82384 | 11.9ms | $self->{token_cursor}++; | ||
268 | 82384 | 244ms | return $token; | ||
269 | } | ||||
270 | |||||
271 | 12129 | 268µs | my $line_rv; | ||
272 | |||||
273 | # Catch exceptions and return undef, so that we | ||||
274 | # can start to convert code to exception-based code. | ||||
275 | 12129 | 4.52ms | my $rv = eval { | ||
276 | # No token, we need to get some more | ||||
277 | 12129 | 14.1ms | 12129 | 4.32s | while ( $line_rv = $self->_process_next_line ) { # spent 4.32s making 12129 calls to PPI::Tokenizer::_process_next_line, avg 356µs/call |
278 | # If there is something in the buffer, return it | ||||
279 | # The defined() prevents a ton of calls to PPI::Util::TRUE | ||||
280 | 26616 | 31.1ms | 14775 | 1.81s | if ( defined( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) ) { # spent 1.81s making 14775 calls to PPI::Tokenizer::_process_next_line, avg 123µs/call |
281 | 11841 | 1.48ms | $self->{token_cursor}++; | ||
282 | 11841 | 5.62ms | return $token; | ||
283 | } | ||||
284 | } | ||||
285 | 288 | 56µs | return undef; | ||
286 | }; | ||||
287 | 12129 | 80.8ms | 11841 | 8.35ms | if ( $@ ) { # spent 8.35ms making 11841 calls to PPI::Util::TRUE, avg 705ns/call |
288 | if ( _INSTANCE($@, 'PPI::Exception') ) { | ||||
289 | $@->throw; | ||||
290 | } else { | ||||
291 | my $errstr = $@; | ||||
292 | $errstr =~ s/^(.*) at line .+$/$1/; | ||||
293 | PPI::Exception->throw( $errstr ); | ||||
294 | } | ||||
295 | } elsif ( $rv ) { | ||||
296 | return $rv; | ||||
297 | } | ||||
298 | |||||
299 | 288 | 63µs | if ( defined $line_rv ) { | ||
300 | # End of file, but we can still return things from the buffer | ||||
301 | 288 | 181µs | if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) { | ||
302 | $self->{token_cursor}++; | ||||
303 | return $token; | ||||
304 | } | ||||
305 | |||||
306 | # Set our token end of file flag | ||||
307 | 288 | 82µs | $self->{token_eof} = 1; | ||
308 | 288 | 489µs | return 0; | ||
309 | } | ||||
310 | |||||
311 | # Error, pass it up to our caller | ||||
312 | undef; | ||||
313 | } | ||||
314 | |||||
315 | =pod | ||||
316 | |||||
317 | =head2 all_tokens | ||||
318 | |||||
319 | When not being used as an iterator, the C<all_tokens> method tells | ||||
320 | the Tokenizer to parse the entire file and return all of the tokens | ||||
321 | in a single ARRAY reference. | ||||
322 | |||||
323 | It should be noted that C<all_tokens> does B<NOT> interfere with the | ||||
324 | use of the Tokenizer object as an iterator (does not modify the token | ||||
325 | cursor) and use of the two different mechanisms can be mixed safely. | ||||
326 | |||||
327 | Returns a reference to an ARRAY of L<PPI::Token> objects on success | ||||
328 | or throws an exception on error. | ||||
329 | |||||
330 | =cut | ||||
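Since C<all_tokens> does not move the token cursor, it can be freely interleaved with the iterator interface; a small sketch:

```perl
use PPI::Tokenizer;

my $source    = 'my $x = 1;';
my $Tokenizer = PPI::Tokenizer->new( \$source );

my $first  = $Tokenizer->get_token;    # advances the cursor by one
my $tokens = $Tokenizer->all_tokens;   # parses to EOF, cursor untouched
my $second = $Tokenizer->get_token;    # resumes right after $first

printf "%d tokens in total\n", scalar @$tokens;
```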
331 | |||||
332 | sub all_tokens { | ||||
333 | my $self = shift; | ||||
334 | |||||
335 | # Catch exceptions and return undef, so that we | ||||
336 | # can start to convert code to exception-based code. | ||||
337 | eval { | ||||
338 | # Process lines until we get EOF | ||||
339 | unless ( $self->{token_eof} ) { | ||||
340 | my $rv; | ||||
341 | while ( $rv = $self->_process_next_line ) {} | ||||
342 | unless ( defined $rv ) { | ||||
343 | PPI::Exception->throw("Error while processing source"); | ||||
344 | } | ||||
345 | |||||
346 | # Clean up the end of the tokenizer | ||||
347 | $self->_clean_eof; | ||||
348 | } | ||||
349 | }; | ||||
350 | if ( $@ ) { | ||||
351 | my $errstr = $@; | ||||
352 | $errstr =~ s/^(.*) at line .+$/$1/; | ||||
353 | PPI::Exception->throw( $errstr ); | ||||
354 | } | ||||
355 | |||||
356 | # End of file, return a copy of the token array. | ||||
357 | return [ @{$self->{tokens}} ]; | ||||
358 | } | ||||
359 | |||||
360 | =pod | ||||
361 | |||||
362 | =head2 increment_cursor | ||||
363 | |||||
364 | Although exposed as a public method, C<increment_cursor> is implemented | ||||
365 | for expert use only, when writing lexers or other components that work | ||||
366 | directly on token streams. | ||||
367 | |||||
368 | It manually increments the token cursor forward through the file, in effect | ||||
369 | "skipping" the next token. | ||||
370 | |||||
371 | Returns true if the cursor is incremented, C<0> if already at the end of | ||||
372 | the file, or C<undef> on error. | ||||
373 | |||||
374 | =cut | ||||
375 | |||||
376 | sub increment_cursor { | ||||
377 | # Do this via the get_token method, which makes sure there | ||||
378 | # is actually a token there to move to. | ||||
379 | $_[0]->get_token and 1; | ||||
380 | } | ||||
381 | |||||
382 | =pod | ||||
383 | |||||
384 | =head2 decrement_cursor | ||||
385 | |||||
386 | Although exposed as a public method, C<decrement_cursor> is implemented | ||||
387 | for expert use only, when writing lexers or other components that work | ||||
388 | directly on token streams. | ||||
389 | |||||
390 | It manually decrements the token cursor backwards through the file, in | ||||
391 | effect "rolling back" the token stream. And indeed that is what it is | ||||
392 | primarily intended for, when the component that is consuming the token | ||||
393 | stream needs to implement some sort of "roll back" feature in its use | ||||
394 | of the token stream. | ||||
395 | |||||
396 | Returns true if the cursor is decremented, C<0> if already at the | ||||
397 | beginning of the file, or C<undef> on error. | ||||
398 | |||||
399 | =cut | ||||
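A rollback sketch in the spirit of the lexer usage described above; C<wanted()> is a hypothetical predicate standing in for whatever test the consuming code applies:

```perl
use PPI::Tokenizer;

my $source    = 'my $x = 1;';
my $Tokenizer = PPI::Tokenizer->new( \$source );

# Peek at the next token, then push it back onto the stream
my $Token = $Tokenizer->get_token;
if ( $Token and not wanted($Token) ) {     # wanted() is hypothetical
    $Tokenizer->decrement_cursor;          # the same token is returned next time
}

sub wanted { ref $_[0] eq 'PPI::Token::Word' }
```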
400 | |||||
401 | sub decrement_cursor { | ||||
402 | my $self = shift; | ||||
403 | |||||
404 | # Check for the beginning of the file | ||||
405 | return 0 unless $self->{token_cursor}; | ||||
406 | |||||
407 | # Decrement the token cursor | ||||
408 | $self->{token_eof} = 0; | ||||
409 | --$self->{token_cursor}; | ||||
410 | } | ||||
411 | |||||
- - | |||||
416 | ##################################################################### | ||||
417 | # Working With Source | ||||
418 | |||||
419 | # Fetches the next line from the input line buffer | ||||
420 | # Returns undef at EOF. | ||||
421 | sub _get_line { # spent 60.1ms within PPI::Tokenizer::_get_line which was called 27287 times, avg 2µs/call:
# 27281 times (60.1ms+0s) by PPI::Tokenizer::_fill_line at line 443, avg 2µs/call
# 5 times (10µs+0s) by PPI::Token::HereDoc::__TOKENIZER__on_char at line 222 of PPI/Token/HereDoc.pm, avg 2µs/call
# once (3µs+0s) by PPI::Token::HereDoc::__TOKENIZER__on_char at line 211 of PPI/Token/HereDoc.pm | ||||
422 | 27287 | 3.41ms | my $self = shift; | ||
423 | 27287 | 6.10ms | return undef unless $self->{source}; # EOF hit previously | ||
424 | |||||
425 | # Pull off the next line | ||||
426 | 27143 | 15.3ms | my $line = shift @{$self->{source}}; | ||
427 | |||||
428 | # Flag EOF if we hit it | ||||
429 | 27143 | 3.09ms | $self->{source} = undef unless defined $line; | ||
430 | |||||
431 | # Return the line (or EOF flag) | ||||
432 | 27143 | 113ms | return $line; # string or undef | ||
433 | } | ||||
434 | |||||
435 | # Fetches the next line, ready to process | ||||
436 | # Returns 1 on success | ||||
437 | # Returns 0 on EOF | ||||
438 | sub _fill_line { # spent 246ms (186ms+60.1ms) within PPI::Tokenizer::_fill_line which was called 27281 times, avg 9µs/call:
# 26904 times (184ms+59.2ms) by PPI::Tokenizer::_process_next_line at line 490, avg 9µs/call
# 372 times (1.89ms+884µs) by PPI::Token::_QuoteEngine::_scan_for_brace_character at line 183 of PPI/Token/_QuoteEngine.pm, avg 7µs/call
# 5 times (38µs+16µs) by PPI::Token::_QuoteEngine::_scan_for_unescaped_character at line 137 of PPI/Token/_QuoteEngine.pm, avg 11µs/call | ||||
439 | 27281 | 3.17ms | my $self = shift; | ||
440 | 27281 | 3.02ms | my $inscan = shift; | ||
441 | |||||
442 | # Get the next line | ||||
443 | 27281 | 27.1ms | 27281 | 60.1ms | my $line = $self->_get_line; # spent 60.1ms making 27281 calls to PPI::Tokenizer::_get_line, avg 2µs/call |
444 | 27281 | 2.96ms | unless ( defined $line ) { | ||
445 | # End of file | ||||
446 | 288 | 32µs | unless ( $inscan ) { | ||
447 | 288 | 199µs | delete $self->{line}; | ||
448 | 288 | 52µs | delete $self->{line_cursor}; | ||
449 | 288 | 46µs | delete $self->{line_length}; | ||
450 | 288 | 529µs | return 0; | ||
451 | } | ||||
452 | |||||
453 | # In the scan version, just set the cursor to the end | ||||
454 | # of the line, and the rest should just cascade out. | ||||
455 | $self->{line_cursor} = $self->{line_length}; | ||||
456 | return 0; | ||||
457 | } | ||||
458 | |||||
459 | # Populate the appropriate variables | ||||
460 | 26993 | 6.62ms | $self->{line} = $line; | ||
461 | 26993 | 4.61ms | $self->{line_cursor} = -1; | ||
462 | 26993 | 6.80ms | $self->{line_length} = length $line; | ||
463 | 26993 | 3.62ms | $self->{line_count}++; | ||
464 | |||||
465 | 26993 | 68.3ms | 1; | ||
466 | } | ||||
467 | |||||
468 | # Get the current character | ||||
469 | sub _char { | ||||
470 | my $self = shift; | ||||
471 | substr( $self->{line}, $self->{line_cursor}, 1 ); | ||||
472 | } | ||||
473 | |||||
- - | |||||
478 | #################################################################### | ||||
479 | # Per line processing methods | ||||
480 | |||||
481 | # Processes the next line | ||||
482 | # Returns 1 on success completion | ||||
483 | # Returns 0 if EOF | ||||
484 | # Returns undef on error | ||||
485 | sub _process_next_line { | ||||
486 | 26904 | 3.78ms | my $self = shift; | ||
487 | |||||
488 | # Fill the line buffer | ||||
489 | 26904 | 903µs | my $rv; | ||
490 | 26904 | 23.3ms | 26904 | 243ms | unless ( $rv = $self->_fill_line ) { # spent 243ms making 26904 calls to PPI::Tokenizer::_fill_line, avg 9µs/call |
491 | 288 | 38µs | return undef unless defined $rv; | ||
492 | |||||
493 | # End of file, finalize last token | ||||
494 | 288 | 275µs | 288 | 397µs | $self->_finalize_token; # spent 397µs making 288 calls to PPI::Tokenizer::_finalize_token, avg 1µs/call |
495 | 288 | 450µs | return 0; | ||
496 | } | ||||
497 | |||||
498 | # Run the __TOKENIZER__on_line_start | ||||
499 | 26616 | 39.3ms | 26616 | 354ms | $rv = $self->{class}->__TOKENIZER__on_line_start( $self ); # spent 269ms making 14943 calls to PPI::Token::Whitespace::__TOKENIZER__on_line_start, avg 18µs/call
# spent 65.6ms making 9695 calls to PPI::Token::Pod::__TOKENIZER__on_line_start, avg 7µs/call
# spent 14.1ms making 1834 calls to PPI::Token::End::__TOKENIZER__on_line_start, avg 8µs/call
# spent 4.66ms making 144 calls to PPI::Token::BOM::__TOKENIZER__on_line_start, avg 32µs/call |
500 | 26616 | 3.26ms | unless ( $rv ) { | ||
501 | # If there are no more source lines, then clean up | ||||
502 | 16923 | 9.78ms | 144 | 1.76ms | if ( ref $self->{source} eq 'ARRAY' and ! @{$self->{source}} ) { # spent 1.76ms making 144 calls to PPI::Tokenizer::_clean_eof, avg 12µs/call |
503 | $self->_clean_eof; | ||||
504 | } | ||||
505 | |||||
506 | # Defined but false means next line | ||||
507 | 16923 | 66.4ms | return 1 if defined $rv; | ||
508 | PPI::Exception->throw("Error at line $self->{line_count}"); | ||||
509 | } | ||||
510 | |||||
511 | # If we can't deal with the entire line, process char by char | ||||
512 | 9693 | 203ms | 149609 | 4.58s | while ( $rv = $self->_process_next_char ) {} # spent 4.58s making 149609 calls to PPI::Tokenizer::_process_next_char, avg 31µs/call |
513 | 9693 | 1.15ms | unless ( defined $rv ) { | ||
514 | PPI::Exception->throw("Error at line $self->{line_count}, character $self->{line_cursor}"); | ||||
515 | } | ||||
516 | |||||
517 | # Trigger any action that needs to happen at the end of a line | ||||
518 | 9693 | 13.4ms | 9693 | 94.6ms | $self->{class}->__TOKENIZER__on_line_end( $self ); # spent 94.4ms making 9549 calls to PPI::Token::Whitespace::__TOKENIZER__on_line_end, avg 10µs/call
# spent 224µs making 144 calls to PPI::Token::__TOKENIZER__on_line_end, avg 2µs/call |
519 | |||||
520 | # If there are no more source lines, then clean up | ||||
521 | 9693 | 7.24ms | unless ( ref($self->{source}) eq 'ARRAY' and @{$self->{source}} ) { | ||
522 | return $self->_clean_eof; | ||||
523 | } | ||||
524 | |||||
525 | 9693 | 37.6ms | return 1; | ||
526 | } | ||||
527 | |||||
- - | |||||
532 | ##################################################################### | ||||
533 | # Per-character processing methods | ||||
534 | |||||
535 | # Process on a per-character basis. | ||||
536 | # Note that due to the high number of times this gets | ||||
537 | # called, it has been fairly heavily in-lined, so the code | ||||
538 | # might look a bit ugly and duplicated. | ||||
539 | sub _process_next_char { # spent 4.58s (1.30s+3.28s) within PPI::Tokenizer::_process_next_char which was called 149609 times, avg 31µs/call:
# 149609 times (1.30s+3.28s) by PPI::Tokenizer::_process_next_line at line 512, avg 31µs/call | ||||
540 | 149609 | 24.1ms | my $self = shift; | ||
541 | |||||
542 | ### FIXME - This checks for a screwed up condition that triggers | ||||
543 | ### several warnings, amongst other things. | ||||
544 | 149609 | 48.5ms | if ( ! defined $self->{line_cursor} or ! defined $self->{line_length} ) { | ||
545 | # $DB::single = 1; | ||||
546 | return undef; | ||||
547 | } | ||||
548 | |||||
549 | # Increment the counter and check for end of line | ||||
550 | 149609 | 57.7ms | return 0 if ++$self->{line_cursor} >= $self->{line_length}; | ||
551 | |||||
552 | # Pass control to the token class | ||||
553 | 139916 | 1.69ms | my $result; | ||
554 | 139916 | 221ms | 139916 | 2.94s | unless ( $result = $self->{class}->__TOKENIZER__on_char( $self ) ) { # spent 1.87s making 106218 calls to PPI::Token::Whitespace::__TOKENIZER__on_char, avg 18µs/call
# spent 362ms making 7754 calls to PPI::Token::Symbol::__TOKENIZER__on_char, avg 47µs/call
# spent 299ms making 10634 calls to PPI::Token::Operator::__TOKENIZER__on_char, avg 28µs/call
# spent 201ms making 8180 calls to PPI::Token::Unknown::__TOKENIZER__on_char, avg 25µs/call
# spent 90.9ms making 1688 calls to PPI::Token::_QuoteEngine::__TOKENIZER__on_char, avg 54µs/call
# spent 69.1ms making 3157 calls to PPI::Token::Structure::__TOKENIZER__on_char, avg 22µs/call
# spent 38.4ms making 1170 calls to PPI::Token::Number::__TOKENIZER__on_char, avg 33µs/call
# spent 13.3ms making 1018 calls to PPI::Token::Number::Float::__TOKENIZER__on_char, avg 13µs/call
# spent 1.61ms making 34 calls to PPI::Token::Magic::__TOKENIZER__on_char, avg 47µs/call
# spent 654µs making 61 calls to PPI::Token::Cast::__TOKENIZER__on_char, avg 11µs/call
# spent 69µs making 2 calls to PPI::Token::DashedWord::__TOKENIZER__on_char, avg 34µs/call |
555 | # undef is error. 0 is "Did stuff ourself, you don't have to do anything" | ||||
556 | return defined $result ? 1 : undef; | ||||
557 | } | ||||
558 | |||||
559 | # We will need the value of the current character | ||||
560 | 123420 | 54.3ms | my $char = substr( $self->{line}, $self->{line_cursor}, 1 ); | ||
561 | 123420 | 15.8ms | if ( $result eq '1' ) { | ||
562 | # If __TOKENIZER__on_char returns 1, it is signaling that it thinks that | ||||
563 | # the character is part of it. | ||||
564 | |||||
565 | # Add the character | ||||
566 | 12474 | 6.66ms | if ( defined $self->{token} ) { | ||
567 | $self->{token}->{content} .= $char; | ||||
568 | } else { | ||||
569 | defined($self->{token} = $self->{class}->new($char)) or return undef; | ||||
570 | } | ||||
571 | |||||
572 | 12474 | 37.1ms | return 1; | ||
573 | } | ||||
574 | |||||
575 | # We have been provided with the name of a class | ||||
576 | 110946 | 85.8ms | 21222 | 254ms | if ( $self->{class} ne "PPI::Token::$result" ) { # spent 254ms making 21222 calls to PPI::Tokenizer::_new_token, avg 12µs/call |
577 | # New class | ||||
578 | $self->_new_token( $result, $char ); | ||||
579 | } elsif ( defined $self->{token} ) { | ||||
580 | # Same class as current | ||||
581 | $self->{token}->{content} .= $char; | ||||
582 | } else { | ||||
583 | # Same class, but no current | ||||
584 | 37692 | 61.1ms | 37692 | 85.7ms | defined($self->{token} = $self->{class}->new($char)) or return undef; # spent 85.7ms making 37692 calls to PPI::Token::new, avg 2µs/call |
585 | } | ||||
586 | |||||
587 | 110946 | 352ms | 1; | ||
588 | } | ||||
589 | |||||
- - | |||||
594 | ##################################################################### | ||||
595 | # Altering Tokens in Tokenizer | ||||
596 | |||||
597 | # Finish the end of a token. | ||||
598 | # Returns the resulting parse class as a convenience. | ||||
599 | # spent 218ms within PPI::Tokenizer::_finalize_token which was called 94513 times, avg 2µs/call:
# 31193 times (67.2ms+0s) by PPI::Tokenizer::_new_token at line 620, avg 2µs/call
# 14291 times (35.5ms+0s) by PPI::Token::Word::__TOKENIZER__commit at line 539 of PPI/Token/Word.pm, avg 2µs/call
# 13365 times (29.4ms+0s) by PPI::Token::Structure::__TOKENIZER__commit at line 76 of PPI/Token/Structure.pm, avg 2µs/call
# 9549 times (20.9ms+0s) by PPI::Token::Whitespace::__TOKENIZER__on_line_end at line 417 of PPI/Token/Whitespace.pm, avg 2µs/call
# 7437 times (16.8ms+0s) by PPI::Token::Operator::__TOKENIZER__on_char at line 112 of PPI/Token/Operator.pm, avg 2µs/call
# 7245 times (21.2ms+0s) by PPI::Token::Symbol::__TOKENIZER__on_char at line 216 of PPI/Token/Symbol.pm, avg 3µs/call
# 3157 times (6.88ms+0s) by PPI::Token::Structure::__TOKENIZER__on_char at line 70 of PPI/Token/Structure.pm, avg 2µs/call
# 2743 times (7.54ms+0s) by PPI::Token::_QuoteEngine::__TOKENIZER__on_char at line 58 of PPI/Token/_QuoteEngine.pm, avg 3µs/call
# 1668 times (3.76ms+0s) by PPI::Token::Whitespace::__TOKENIZER__on_line_start at line 165 of PPI/Token/Whitespace.pm, avg 2µs/call
# 1252 times (2.71ms+0s) by PPI::Token::Whitespace::__TOKENIZER__on_char at line 213 of PPI/Token/Whitespace.pm, avg 2µs/call
# 832 times (2.14ms+0s) by PPI::Token::Number::__TOKENIZER__on_char at line 125 of PPI/Token/Number.pm, avg 3µs/call
# 509 times (1.33ms+0s) by PPI::Token::Symbol::__TOKENIZER__on_char at line 174 of PPI/Token/Symbol.pm, avg 3µs/call
# 288 times (397µs+0s) by PPI::Tokenizer::_process_next_line at line 494, avg 1µs/call
# 148 times (513µs+0s) by PPI::Token::Number::Float::__TOKENIZER__on_char at line 108 of PPI/Token/Number/Float.pm, avg 3µs/call
# 146 times (415µs+0s) by PPI::Token::Pod::__TOKENIZER__on_line_start at line 148 of PPI/Token/Pod.pm, avg 3µs/call
# 144 times (335µs+0s) by PPI::Tokenizer::_clean_eof at line 635, avg 2µs/call
# 144 times (308µs+0s) by PPI::Token::Word::__TOKENIZER__commit at line 458 of PPI/Token/Word.pm, avg 2µs/call
# 144 times (299µs+0s) by PPI::Token::Word::__TOKENIZER__commit at line 441 of PPI/Token/Word.pm, avg 2µs/call
# 85 times (215µs+0s) by PPI::Token::Unknown::__TOKENIZER__on_char at line 179 of PPI/Token/Unknown.pm, avg 3µs/call
# 61 times (125µs+0s) by PPI::Token::Cast::__TOKENIZER__on_char at line 51 of PPI/Token/Cast.pm, avg 2µs/call
# 51 times (105µs+0s) by PPI::Token::Whitespace::__TOKENIZER__on_char at line 261 of PPI/Token/Whitespace.pm, avg 2µs/call
# 30 times (105µs+0s) by PPI::Token::Magic::__TOKENIZER__on_char at line 228 of PPI/Token/Magic.pm, avg 4µs/call
# 22 times (54µs+0s) by PPI::Token::Unknown::__TOKENIZER__on_char at line 216 of PPI/Token/Unknown.pm, avg 2µs/call
# 3 times (8µs+0s) by PPI::Token::ArrayIndex::__TOKENIZER__on_char at line 56 of PPI/Token/ArrayIndex.pm, avg 3µs/call
# 2 times (5µs+0s) by PPI::Token::DashedWord::__TOKENIZER__on_char at line 95 of PPI/Token/DashedWord.pm, avg 2µs/call
# once (2µs+0s) by PPI::Token::Magic::__TOKENIZER__on_char at line 170 of PPI/Token/Magic.pm
# once (2µs+0s) by PPI::Token::Unknown::__TOKENIZER__on_char at line 150 of PPI/Token/Unknown.pm
# once (2µs+0s) by PPI::Token::HereDoc::__TOKENIZER__on_char at line 218 of PPI/Token/HereDoc.pm
# once (2µs+0s) by PPI::Token::Whitespace::__TOKENIZER__on_char at line 316 of PPI/Token/Whitespace.pm | ||||
600 | 94513 | 16.2ms | my $self = shift; | ||
601 | 94513 | 16.8ms | return $self->{class} unless defined $self->{token}; | ||
602 | |||||
603 | # Add the token to the token buffer | ||||
604 | 94225 | 34.9ms | push @{ $self->{tokens} }, $self->{token}; | ||
605 | 94225 | 16.6ms | $self->{token} = undef; | ||
606 | |||||
607 | # Return the parse class to that of the zone we are in | ||||
608 | 94225 | 297ms | $self->{class} = $self->{zone}; | ||
609 | } | ||||
610 | |||||
611 | # Creates a new token and sets it in the tokenizer | ||||
611 | # The defined() in here prevents a ton of calls to PPI::Util::TRUE | ||||
613 | # spent 681ms (428+253) within PPI::Tokenizer::_new_token which was called 56533 times, avg 12µs/call:
# 21222 times (159ms+94.4ms) by PPI::Tokenizer::_process_next_char at line 576, avg 12µs/call
# 14291 times (103ms+63.5ms) by PPI::Token::Word::__TOKENIZER__commit at line 533 of PPI/Token/Word.pm, avg 12µs/call
# 13365 times (103ms+47.8ms) by PPI::Token::Structure::__TOKENIZER__commit at line 75 of PPI/Token/Structure.pm, avg 11µs/call
# 3724 times (24.9ms+10.0ms) by PPI::Token::Whitespace::__TOKENIZER__on_line_start at line 159 of PPI/Token/Whitespace.pm, avg 9µs/call
# 1668 times (19.7ms+6.52ms) by PPI::Token::Whitespace::__TOKENIZER__on_line_start at line 164 of PPI/Token/Whitespace.pm, avg 16µs/call
# 1055 times (10.0ms+26.4ms) by PPI::Token::Word::__TOKENIZER__commit at line 497 of PPI/Token/Word.pm, avg 35µs/call
# 288 times (1.53ms+796µs) by PPI::Token::End::__TOKENIZER__on_line_start at line 84 of PPI/Token/End.pm, avg 8µs/call
# 242 times (1.71ms+1.08ms) by PPI::Token::Comment::__TOKENIZER__commit at line 93 of PPI/Token/Comment.pm, avg 12µs/call
# 242 times (1.62ms+1.01ms) by PPI::Token::Comment::__TOKENIZER__commit at line 94 of PPI/Token/Comment.pm, avg 11µs/call
# 144 times (1.30ms+760µs) by PPI::Token::Word::__TOKENIZER__commit at line 440 of PPI/Token/Word.pm, avg 14µs/call
# 144 times (1.29ms+646µs) by PPI::Token::End::__TOKENIZER__on_line_start at line 70 of PPI/Token/End.pm, avg 13µs/call
# 144 times (703µs+318µs) by PPI::Token::Word::__TOKENIZER__commit at line 454 of PPI/Token/Word.pm, avg 7µs/call
# 2 times (15µs+9µs) by PPI::Token::Whitespace::__TOKENIZER__on_line_start at line 170 of PPI/Token/Whitespace.pm, avg 12µs/call
# 2 times (14µs+8µs) by PPI::Token::Number::Float::__TOKENIZER__on_char at line 93 of PPI/Token/Number/Float.pm, avg 11µs/call | ||||
614 | 56533 | 9.70ms | my $self = shift; | ||
615 | # throw PPI::Exception() unless @_; | ||||
616 | 56533 | 31.6ms | my $class = substr( $_[0], 0, 12 ) eq 'PPI::Token::' | ||
617 | ? shift : 'PPI::Token::' . shift; | ||||
618 | |||||
619 | # Finalize any existing token | ||||
620 | 56533 | 38.5ms | 31193 | 67.2ms | $self->_finalize_token if defined $self->{token}; # spent 67.2ms making 31193 calls to PPI::Tokenizer::_finalize_token, avg 2µs/call |
621 | |||||
622 | # Create the new token and update the parse class | ||||
623 | 56533 | 96.6ms | 56533 | 186ms | defined($self->{token} = $class->new($_[0])) or PPI::Exception->throw; # spent 138ms making 53790 calls to PPI::Token::new, avg 3µs/call
# spent 24.2ms making 1061 calls to PPI::Token::_QuoteEngine::Full::new, avg 23µs/call
# spent 23.6ms making 1682 calls to PPI::Token::_QuoteEngine::Simple::new, avg 14µs/call |
624 | 56533 | 11.2ms | $self->{class} = $class; | ||
625 | |||||
626 | 56533 | 165ms | 1; | ||
627 | } | ||||
628 | |||||
629 | # At the end of the file, we need to clean up the results of the erroneous | ||||
630 | # space that we inserted at the beginning of the process. | ||||
631 | # spent 1.76ms (1.34+424µs) within PPI::Tokenizer::_clean_eof which was called 144 times, avg 12µs/call:
# 144 times (1.34ms+424µs) by PPI::Tokenizer::_process_next_line at line 502, avg 12µs/call | ||||
632 | 144 | 47µs | my $self = shift; | ||
633 | |||||
634 | # Finish any partially completed token | ||||
635 | 144 | 645µs | 288 | 424µs | $self->_finalize_token if $self->{token}; # spent 335µs making 144 calls to PPI::Tokenizer::_finalize_token, avg 2µs/call
# spent 89µs making 144 calls to PPI::Util::TRUE, avg 618ns/call |
636 | |||||
637 | # Find the last token, and if it has no content, kill it. | ||||
638 | # There appears to be some evidence that such "null tokens" are | ||||
639 | # somehow getting created accidentally. | ||||
640 | 144 | 132µs | my $last_token = $self->{tokens}->[ -1 ]; | ||
641 | 144 | 91µs | unless ( length $last_token->{content} ) { | ||
642 | pop @{$self->{tokens}}; | ||||
643 | } | ||||
644 | |||||
645 | # Now, if the last character of the last token is a space we added, | ||||
646 | # chop it off, deleting the token if there's nothing else left. | ||||
647 | 144 | 80µs | if ( $self->{source_eof_chop} ) { | ||
648 | $last_token = $self->{tokens}->[ -1 ]; | ||||
649 | $last_token->{content} =~ s/ $//; | ||||
650 | unless ( length $last_token->{content} ) { | ||||
651 | # Popping token | ||||
652 | pop @{$self->{tokens}}; | ||||
653 | } | ||||
654 | |||||
655 | # The hack involving adding an extra space is now reversed, and | ||||
656 | # now nobody will ever know. The perfect crime! | ||||
657 | $self->{source_eof_chop} = ''; | ||||
658 | } | ||||
659 | |||||
660 | 144 | 331µs | 1; | ||
661 | } | ||||
662 | |||||
- - | |||||
667 | ##################################################################### | ||||
668 | # Utility Methods | ||||
669 | |||||
670 | # Context | ||||
671 | sub _last_token { | ||||
672 | $_[0]->{tokens}->[-1]; | ||||
673 | } | ||||
674 | |||||
675 | # spent 589µs (488+101) within PPI::Tokenizer::_last_significant_token which was called 52 times, avg 11µs/call:
# 51 times (479µs+99µs) by PPI::Token::Whitespace::__TOKENIZER__on_char at line 265 of PPI/Token/Whitespace.pm, avg 11µs/call
# once (10µs+2µs) by PPI::Token::Whitespace::__TOKENIZER__on_char at line 321 of PPI/Token/Whitespace.pm | ||||
676 | 52 | 19µs | my $self = shift; | ||
677 | 52 | 41µs | my $cursor = $#{ $self->{tokens} }; | ||
678 | 52 | 20µs | while ( $cursor >= 0 ) { | ||
679 | 104 | 45µs | my $token = $self->{tokens}->[$cursor--]; | ||
680 | 104 | 266µs | 104 | 101µs | return $token if $token->significant; # spent 54µs making 52 calls to PPI::Token::Whitespace::significant, avg 1µs/call
# spent 46µs making 52 calls to PPI::Element::significant, avg 894ns/call |
681 | } | ||||
682 | |||||
683 | # Nothing... | ||||
684 | PPI::Token::Whitespace->null; | ||||
685 | } | ||||
686 | |||||
687 | # Get an array ref of previous significant tokens. | ||||
688 | # Like _last_significant_token except it gets more than just one token | ||||
689 | # Returns array ref on success. | ||||
690 | # Pads with null whitespace tokens when there are not enough | ||||
691 | # spent 305ms (261+43.9) within PPI::Tokenizer::_previous_significant_tokens which was called 20542 times, avg 15µs/call:
# 15490 times (172ms+28.5ms) by PPI::Token::Word::__TOKENIZER__commit at line 430 of PPI/Token/Word.pm, avg 13µs/call
# 3157 times (72.8ms+13.4ms) by PPI::Token::Whitespace::__TOKENIZER__on_char at line 222 of PPI/Token/Whitespace.pm, avg 27µs/call
# 1866 times (16.1ms+1.91ms) by PPI::Tokenizer::_opcontext at line 741, avg 10µs/call
# 25 times (469µs+119µs) by PPI::Token::Unknown::__TOKENIZER__is_an_attribute at line 305 of PPI/Token/Unknown.pm, avg 24µs/call
# 2 times (17µs+3µs) by PPI::Token::Unknown::__TOKENIZER__on_char at line 57 of PPI/Token/Unknown.pm, avg 10µs/call
# 2 times (11µs+2µs) by PPI::Token::Whitespace::__TOKENIZER__on_char at line 384 of PPI/Token/Whitespace.pm, avg 6µs/call | ||||
692 | 20542 | 4.29ms | my $self = shift; | ||
693 | 20542 | 2.60ms | my $count = shift || 1; | ||
694 | 20542 | 8.90ms | my $cursor = $#{ $self->{tokens} }; | ||
695 | |||||
696 | 20542 | 1.91ms | my ($token, @tokens); | ||
697 | 20542 | 4.68ms | while ( $cursor >= 0 ) { | ||
698 | 42181 | 14.9ms | $token = $self->{tokens}->[$cursor--]; | ||
699 | 42181 | 53.6ms | 42181 | 40.9ms | if ( $token->significant ) { # spent 25.1ms making 26762 calls to PPI::Element::significant, avg 940ns/call
# spent 13.8ms making 13592 calls to PPI::Token::Whitespace::significant, avg 1µs/call
# spent 1.88ms making 1824 calls to PPI::Token::Comment::significant, avg 1µs/call
# spent 3µs making 3 calls to PPI::Token::Pod::significant, avg 1µs/call |
700 | 26762 | 10.4ms | push @tokens, $token; | ||
701 | 26762 | 107ms | return \@tokens if scalar @tokens >= $count; | ||
702 | } | ||||
703 | } | ||||
704 | |||||
705 | # Pad with empties | ||||
706 | 144 | 424µs | foreach ( 1 .. ($count - scalar @tokens) ) { | ||
707 | 144 | 703µs | 144 | 3.03ms | push @tokens, PPI::Token::Whitespace->null; # spent 3.03ms making 144 calls to PPI::Token::Whitespace::null, avg 21µs/call |
708 | } | ||||
709 | |||||
710 | 144 | 466µs | \@tokens; | ||
711 | } | ||||
712 | |||||
713 | 1 | 7µs | my %OBVIOUS_CLASS = ( | ||
714 | 'PPI::Token::Symbol' => 'operator', | ||||
715 | 'PPI::Token::Magic' => 'operator', | ||||
716 | 'PPI::Token::Number' => 'operator', | ||||
717 | 'PPI::Token::ArrayIndex' => 'operator', | ||||
718 | 'PPI::Token::Quote::Double' => 'operator', | ||||
719 | 'PPI::Token::Quote::Interpolate' => 'operator', | ||||
720 | 'PPI::Token::Quote::Literal' => 'operator', | ||||
721 | 'PPI::Token::Quote::Single' => 'operator', | ||||
722 | 'PPI::Token::QuoteLike::Backtick' => 'operator', | ||||
723 | 'PPI::Token::QuoteLike::Command' => 'operator', | ||||
724 | 'PPI::Token::QuoteLike::Readline' => 'operator', | ||||
725 | 'PPI::Token::QuoteLike::Regexp' => 'operator', | ||||
726 | 'PPI::Token::QuoteLike::Words' => 'operator', | ||||
727 | ); | ||||
728 | |||||
729 | 1 | 2µs | my %OBVIOUS_CONTENT = ( | ||
730 | '(' => 'operand', | ||||
731 | '{' => 'operand', | ||||
732 | '[' => 'operand', | ||||
733 | ';' => 'operand', | ||||
734 | '}' => 'operator', | ||||
735 | ); | ||||
736 | |||||
737 | # Try to determine operator/operand context, if possible. | ||||
738 | # Returns "operator", "operand", or "" if unknown. | ||||
739 | # spent 34.6ms (16.1+18.5) within PPI::Tokenizer::_opcontext which was called 1866 times, avg 19µs/call:
# 1866 times (16.1ms+18.5ms) by PPI::Token::Whitespace::__TOKENIZER__on_char at line 397 of PPI/Token/Whitespace.pm, avg 19µs/call | ||||
740 | 1866 | 419µs | my $self = shift; | ||
741 | 1866 | 2.31ms | 1866 | 18.0ms | my $tokens = $self->_previous_significant_tokens(1); # spent 18.0ms making 1866 calls to PPI::Tokenizer::_previous_significant_tokens, avg 10µs/call |
742 | 1866 | 635µs | my $p0 = $tokens->[0]; | ||
743 | 1866 | 905µs | my $c0 = ref $p0; | ||
744 | |||||
745 | # Map the obvious cases | ||||
746 | 1866 | 5.32ms | return $OBVIOUS_CLASS{$c0} if defined $OBVIOUS_CLASS{$c0}; | ||
747 | 133 | 334µs | 153 | 247µs | return $OBVIOUS_CONTENT{$p0} if defined $OBVIOUS_CONTENT{$p0}; # spent 247µs making 153 calls to PPI::Token::content, avg 2µs/call |
748 | |||||
749 | # Most of the time after an operator, we are an operand | ||||
750 | 113 | 485µs | 113 | 168µs | return 'operand' if $p0->isa('PPI::Token::Operator'); # spent 168µs making 113 calls to UNIVERSAL::isa, avg 1µs/call |
751 | |||||
752 | # If there's NOTHING, it's operand | ||||
753 | 107 | 149µs | 107 | 140µs | return 'operand' if $p0->content eq ''; # spent 140µs making 107 calls to PPI::Token::content, avg 1µs/call |
754 | |||||
755 | # Otherwise, we don't know | ||||
756 | 107 | 283µs | return '' | ||
757 | } | ||||
758 | |||||
759 | 1 | 6µs | 1; | ||
760 | |||||
761 | =pod | ||||
762 | |||||
763 | =head1 NOTES | ||||
764 | |||||
765 | =head2 How the Tokenizer Works | ||||
766 | |||||
767 | Understanding the Tokenizer is not for the faint-hearted. It is by far | ||||
768 | the most complex and twisty piece of perl I've ever written that is actually | ||||
769 | still built properly and isn't a terrible spaghetti-like mess. In fact, you | ||||
770 | probably want to skip this section. | ||||
771 | |||||
772 | But if you really want to understand, well then here goes. | ||||
773 | |||||
774 | =head2 Source Input and Clean Up | ||||
775 | |||||
776 | The Tokenizer starts by taking source in a variety of forms, sucking it | ||||
777 | all in and merging it into one big string, and doing our own internal line | ||||
778 | split, using a "universal line separator" which allows the Tokenizer to | ||||
779 | take source for any platform (and even supports a few known types of | ||||
780 | broken newlines caused by mixed mac/pc/*nix editor screw ups). | ||||
781 | |||||
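The "universal line separator" idea can be sketched as a split that accepts any of the common newline conventions in a single pass. This is an illustrative Python model, not PPI's actual Perl code; the function name is hypothetical.

```python
import re

def split_universal(source: str) -> list[str]:
    # Split on \r\n, bare \r, or bare \n, keeping each separator
    # attached to its line, as a tokenizer's internal line split would.
    # The final empty match (at end-of-string via \Z) is dropped.
    return re.findall(r'[^\r\n]*(?:\r\n|\r|\n|\Z)', source)[:-1]
```

Mixed and broken newlines simply fall out of the alternation order: `\r\n` is tried before bare `\r`, so a DOS line is never split in two.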
782 | The resulting array of lines is used to feed the tokenizer, and is also | ||||
783 | accessed directly by the heredoc-logic to do the line-oriented part of | ||||
784 | here-doc support. | ||||
785 | |||||
786 | =head2 Doing Things the Old Fashioned Way | ||||
787 | |||||
788 | Due to the complexity of perl, and after 2 previously aborted parser | ||||
789 | attempts, in the end the tokenizer was fashioned around a line-buffered | ||||
790 | character-by-character method. | ||||
791 | |||||
792 | That is, the Tokenizer pulls and holds a line at a time into a line buffer, | ||||
793 | and then iterates a cursor along it. At each cursor position, a method is | ||||
794 | called in whatever token class we are currently in, which will examine the | ||||
795 | character at the current position, and handle it. | ||||
796 | |||||
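The cursor-plus-handler loop above can be modelled in a few lines. This is a deliberately tiny Python sketch with only two made-up token classes ("word" and "whitespace"); PPI's real dispatch goes through per-class `__TOKENIZER__on_char` methods and is far richer.

```python
def tokenize_line(line):
    tokens = []
    current = ""           # the token being built ("current token")
    current_class = None   # the "current class" state variable

    def classify(ch):
        # Stand-in for the per-class character handlers.
        return "whitespace" if ch.isspace() else "word"

    for cursor in range(len(line)):
        ch = line[cursor]
        cls = classify(ch)
        if cls == current_class:
            current += ch                             # char extends the token
        else:
            if current:
                tokens.append((current_class, current))  # finalize old token
            current, current_class = ch, cls             # start a new one
    if current:
        tokens.append((current_class, current))
    return tokens
```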
797 | As the handler methods in the various token classes are called, they | ||||
798 | build up an output token array for the source code. | ||||
799 | |||||
800 | Various parts of the Tokenizer use look-ahead, arbitrary-distance | ||||
801 | look-behind (although currently the maximum is three significant tokens), | ||||
802 | or both, and various other heuristic guesses. | ||||
803 | |||||
804 | I've been told it is officially termed a I<"backtracking parser | ||||
805 | with infinite lookaheads">. | ||||
806 | |||||
807 | =head2 State Variables | ||||
808 | |||||
809 | Aside from the current line and the character cursor, the Tokenizer | ||||
810 | maintains a number of different state variables. | ||||
811 | |||||
812 | =over | ||||
813 | |||||
814 | =item Current Class | ||||
815 | |||||
816 | The Tokenizer maintains the current token class at all times. Much of the | ||||
817 | time is just going to be the "Whitespace" class, which is what the base of | ||||
818 | a document is. As the tokenizer executes the various character handlers, | ||||
819 | the class changes a lot as it moves along. In fact, in some instances, | ||||
820 | the character handler may not handle the character directly itself, but | ||||
821 | rather change the "current class" and then hand off to the character | ||||
822 | handler for the new class. | ||||
823 | |||||
824 | Because of this, and some other things I'll deal with later, the number of | ||||
825 | times the character handlers are called does not in fact have a direct | ||||
826 | relationship to the number of actual characters in the document. | ||||
827 | |||||
828 | =item Current Zone | ||||
829 | |||||
830 | Rather than create a class stack to allow for infinitely nested layers of | ||||
831 | classes, the Tokenizer recognises just a single layer. | ||||
832 | |||||
833 | To put it a different way, in various parts of the file, the Tokenizer will | ||||
834 | recognise different "base" or "substrate" classes. When a Token such as a | ||||
835 | comment or a number is finalised by the tokenizer, it "falls back" to the | ||||
836 | base state. | ||||
837 | |||||
838 | This allows proper tokenization of special areas such as __DATA__ | ||||
839 | and __END__ blocks, which also contain things like comments and POD, | ||||
840 | without allowing the creation of any significant Tokens inside these areas. | ||||
841 | |||||
842 | For the main part of a document we use L<PPI::Token::Whitespace> for this, | ||||
843 | with the idea being that code is "floating in a sea of whitespace". | ||||
844 | |||||
845 | =item Current Token | ||||
846 | |||||
847 | The final main state variable is the "current token". This is the Token | ||||
848 | that is currently being built by the Tokenizer. For certain types, it | ||||
849 | can be manipulated and morphed and change class quite a bit while being | ||||
850 | assembled, as the Tokenizer's understanding of the token content changes. | ||||
851 | |||||
852 | When the Tokenizer is confident that it has seen the end of the Token, it | ||||
853 | will be "finalized", which adds it to the output token array and resets | ||||
854 | the current class to that of the zone that we are currently in. | ||||
855 | |||||
856 | I should also note at this point that the "current token" variable is | ||||
857 | optional. The Tokenizer is capable of knowing what class it is currently | ||||
858 | set to, without actually having accumulated any characters in the Token. | ||||
859 | |||||
860 | =back | ||||
861 | |||||
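The interaction of the three state variables can be summarised in a toy model: finalizing pushes the current token (if any) onto the output array and resets the current class to the zone's class, mirroring what `_finalize_token` does in the code above. The class below is a hypothetical illustration, not PPI's implementation.

```python
class ToyTokenizer:
    def __init__(self):
        self.zone = "whitespace"   # substrate class for this region
        self.cls = self.zone       # current token class
        self.token = None          # current token content (optional!)
        self.tokens = []           # output token array

    def finalize_token(self):
        # Push the finished token, then fall back to the zone's class.
        if self.token is not None:
            self.tokens.append((self.cls, self.token))
            self.token = None
        self.cls = self.zone
        return self.cls            # returned as a convenience, as in PPI
```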
862 | =head2 Making It Faster | ||||
863 | |||||
864 | As I'm sure you can imagine, calling several different methods for each | ||||
865 | character and running regexes and other complex heuristics made the first | ||||
866 | fully working version of the tokenizer extremely slow. | ||||
867 | |||||
868 | During testing, I created a metric to measure parsing speed called | ||||
869 | LPGC, or "lines per gigacycle". A gigacycle is simply a billion CPU | ||||
870 | cycles on a typical single-core CPU, and so a Tokenizer running at | ||||
871 | "1000 lines per gigacycle" should generate around 1200 lines of tokenized | ||||
872 | code per second when running on a 1200 MHz processor. | ||||
873 | |||||
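The arithmetic behind the metric: a 1200 MHz CPU completes 1.2 gigacycles per second, so throughput in lines per second is LPGC times gigacycles per second. A one-line helper makes the numbers in this section concrete:

```python
def lines_per_second(lpgc, cpu_mhz):
    # cpu_mhz / 1000 = gigacycles completed per second
    return lpgc * (cpu_mhz / 1000.0)
```

At 1000 LPGC on a 1200 MHz processor this gives 1200 lines/second; the first working version's 350 LPGC gives only 420.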
874 | The first working version of the tokenizer ran at only 350 LPGC, so to | ||||
875 | tokenize a typical large module such as L<ExtUtils::MakeMaker> took | ||||
876 | 10-15 seconds. This sluggishness made it impractical for many uses. | ||||
877 | |||||
878 | So in the current parser, there are multiple layers of optimisation | ||||
879 | very carefully built into the basic design. This has brought the tokenizer | ||||
880 | up to a more reasonable 1000 LPGC, at the expense of making the code | ||||
881 | quite a bit twistier. | ||||
882 | |||||
883 | =head2 Making It Faster - Whole Line Classification | ||||
884 | |||||
885 | The first step in the optimisation process was to add a new handler to | ||||
886 | enable several of the more basic classes (whitespace, comments) to be | ||||
887 | parsed a line at a time. At the start of each line, a | ||||
888 | special optional handler (only supported by a few classes) is called to | ||||
889 | check and see if the entire line can be parsed in one go. | ||||
890 | |||||
891 | This is used mainly to handle things like POD, comments, empty lines, | ||||
892 | and a few other minor special cases. | ||||
893 | |||||
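A line-level fast path of this kind can be sketched as a short list of whole-line patterns tried before any per-character work. The patterns and names here are illustrative (blank lines and `#` comments), not the actual set of classes PPI supports:

```python
import re

# Whole-line classifiers, tried at the start of each line.
WHOLE_LINE = [
    ("comment",    re.compile(r'^\s*#')),   # entire line is a comment
    ("whitespace", re.compile(r'^\s*$')),   # entire line is blank
]

def classify_whole_line(line):
    for name, pattern in WHOLE_LINE:
        if pattern.match(line):
            return name    # consume the whole line in one go
    return None            # fall back to char-by-char parsing
```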
894 | =head2 Making It Faster - Inlining | ||||
895 | |||||
896 | The second stage of the optimisation involved inlining a small | ||||
897 | number of critical methods that were repeated an extremely high number | ||||
898 | of times. Profiling suggested that there were about 1,000,000 individual | ||||
899 | method calls per gigacycle, and by cutting these by two thirds a significant | ||||
900 | speed improvement was gained, on the order of about 50%. | ||||
901 | |||||
902 | You may notice that many methods in the C<PPI::Tokenizer> code look | ||||
903 | very nested and written longhand. This is primarily due to this inlining. | ||||
904 | |||||
905 | At around this time, some statistics code that existed in the early | ||||
906 | versions of the parser was also removed, as it was determined that | ||||
907 | it was consuming around 15% of the CPU for the entire parser, while | ||||
908 | making the core more complicated. | ||||
909 | |||||
910 | A judgment call was made that with the difficulties likely to be | ||||
911 | encountered with future planned enhancements, and given the relatively | ||||
912 | high cost involved, the statistics features would be removed from the | ||||
913 | Tokenizer. | ||||
914 | |||||
915 | =head2 Making It Faster - Quote Engine | ||||
916 | |||||
917 | Once inlining had reached diminishing returns, it became obvious from | ||||
918 | the profiling results that a huge amount of time was being spent | ||||
919 | stepping a char at a time through long, simple and "syntactically boring" | ||||
920 | code such as comments and strings. | ||||
921 | |||||
922 | The existing regex engine was expanded to also encompass quotes and | ||||
923 | other quote-like things, and a special abstract base class was added | ||||
924 | that provided a number of specialised parsing methods that would "scan | ||||
925 | ahead", scanning forward to find the end of a string, and updating | ||||
926 | the cursor to leave it in a valid position for the next call. | ||||
927 | |||||
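The essence of such a scan-ahead method is: given a cursor just past an opening delimiter, jump straight to the matching close (honouring backslash escapes) and report where the cursor should resume. A minimal Python sketch of the idea, with a hypothetical function name:

```python
def scan_ahead_string(text, cursor, quote="'"):
    # cursor points just past the opening quote. Returns the index just
    # past the closing quote, leaving the cursor valid for the next
    # handler call, or None if the string is unterminated.
    i = cursor
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2            # skip the escaped character
        elif ch == quote:
            return i + 1
        else:
            i += 1
    return None
```

Unterminated strings (the `None` case) are where a real tokenizer must pull the next line into the buffer and continue scanning.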
928 | This is also the point at which the number of character handler calls began | ||||
929 | to greatly differ from the number of characters. But it has been done | ||||
930 | in a way that allows the parser to retain the power of the original | ||||
931 | version at the critical points, while skipping through the "boring bits" | ||||
932 | as needed for additional speed. | ||||
933 | |||||
934 | The addition of this feature allowed the tokenizer to exceed 1000 LPGC | ||||
935 | for the first time. | ||||
936 | |||||
937 | =head2 Making It Faster - The "Complete" Mechanism | ||||
938 | |||||
939 | As it became evident that great speed increases were available by using | ||||
940 | this "skipping ahead" mechanism, a new handler method was added that | ||||
941 | explicitly handles the parsing of an entire token, where the structure | ||||
942 | of the token is relatively simple. Tokens such as symbols fit this case, | ||||
943 | as once we are past the initial sigil and word char, we know that we | ||||
944 | can skip ahead and "complete" the rest of the token much more easily. | ||||
945 | |||||
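The symbol case can be sketched as a single regex match: once the sigil and first word character are seen, one pattern consumes the remainder of the identifier, including `::` (and the archaic `'`) package separators. This is an illustrative Python model of a "complete" handler, not PPI's actual regex:

```python
import re

# Rest of a symbol after the first word char: more word chars,
# optionally continued by ::-separated (or '-separated) segments.
SYMBOL_REST = re.compile(r"\w*(?:(?:::|')\w+)*")

def complete_symbol(text, cursor):
    # cursor sits on the first word char after the sigil.
    m = SYMBOL_REST.match(text, cursor)
    return text[cursor:m.end()], m.end()   # token rest + new cursor
```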
946 | A number of these have been added for most or possibly all of the common | ||||
947 | cases, with most of these "complete" handlers implemented using regular | ||||
948 | expressions. | ||||
949 | |||||
950 | In fact, so many have been added that at this point, you could arguably | ||||
951 | reclassify the tokenizer as a "hybrid regex, char-by-char heuristic | ||||
952 | tokenizer". More tokens are now consumed in "complete" methods in a | ||||
953 | typical program than are handled by the normal char-by-char methods. | ||||
954 | |||||
955 | Many of these complete-handlers were implemented during the writing | ||||
956 | of the Lexer, and this has allowed the full parser to maintain around | ||||
957 | 1000 LPGC despite the increasing weight of the Lexer. | ||||
958 | |||||
959 | =head2 Making It Faster - Porting To C (In Progress) | ||||
960 | |||||
961 | While it would be extraordinarily difficult to port all of the Tokenizer | ||||
962 | to C, work has started on a L<PPI::XS> "accelerator" package which acts as | ||||
963 | a separate and automatically-detected add-on to the main PPI package. | ||||
964 | |||||
965 | L<PPI::XS> implements faster versions of a variety of functions scattered | ||||
966 | over the entire PPI codebase, from the Tokenizer Core, Quote Engine, and | ||||
967 | various other places, and implements them identically in XS/C. | ||||
968 | |||||
969 | In particular, the skip-ahead methods from the Quote Engine would appear | ||||
970 | to be extremely amenable to being done in C, and a number of other | ||||
971 | functions could be cherry-picked one at a time and implemented in C. | ||||
972 | |||||
973 | Each method is heavily tested to ensure that the functionality is | ||||
974 | identical, and a versioning mechanism is included to ensure that if a | ||||
975 | function gets out of sync, L<PPI::XS> will degrade gracefully and just | ||||
976 | not replace that single method. | ||||
977 | |||||
978 | =head1 TO DO | ||||
979 | |||||
980 | - Add an option to reset or seek the token stream... | ||||
981 | |||||
982 | - Implement more Tokenizer functions in L<PPI::XS> | ||||
983 | |||||
984 | =head1 SUPPORT | ||||
985 | |||||
986 | See the L<support section|PPI/SUPPORT> in the main module. | ||||
987 | |||||
988 | =head1 AUTHOR | ||||
989 | |||||
990 | Adam Kennedy E<lt>adamk@cpan.orgE<gt> | ||||
991 | |||||
992 | =head1 COPYRIGHT | ||||
993 | |||||
994 | Copyright 2001 - 2011 Adam Kennedy. | ||||
995 | |||||
996 | This program is free software; you can redistribute | ||||
997 | it and/or modify it under the same terms as Perl itself. | ||||
998 | |||||
999 | The full text of the license can be found in the | ||||
1000 | LICENSE file included with this module. | ||||
1001 | |||||
1002 | =cut | ||||
# spent 4.15ms within PPI::Tokenizer::CORE:match which was called 15534 times, avg 267ns/call:
# 15534 times (4.15ms+0s) by List::MoreUtils::any at line 211, avg 267ns/call | |||||
# spent 162ms within PPI::Tokenizer::CORE:subst which was called 144 times, avg 1.12ms/call:
# 144 times (162ms+0s) by PPI::Tokenizer::new at line 186, avg 1.12ms/call |