This is nice, well-formatted code. Let's see what we can improve:
Only use one statement per line:
my $examine = shift;
my $detect = 0;
I assume this code is from the body of a subroutine. I'd recommend not using shift
to unpack arguments, but rather list assignment. This at least looks like a formal parameter list:
sub foo {
my ($examine) = @_;
...
}
Otherwise, you can also use syntax extensions like signatures
: sub ($examine) { ... }
.
Where does @DB
come from? Quite likely, a reference to that array should be passed as one parameter.
The Perl community has no single unifying coding style, but there are a few conventions that are shared by a large part of that community. For example, variables should use $snake_case
, and not $CamelCase
. But whatever you choose, just be consistent. Either all variables start with an uppercase letter, or none.
I'd rather use the shorter for
instead of foreach
. The two keywords are absolutely interchangeable.
Builtin operators like keys
are commonly used without parens around the arguments: keys %$ProteinDB
instead of keys(%$ProteinDB)
Don't introduce unnecessary variables like $Set
or @Pointer
.
For example,
my $Set = $ProteinDB->{'ID'};
if($Set =~ /$examine/) {
becomes
if($ProteinDB->{'ID'} =~ /$examine/) {
and this
my @Pointer = keys(%$ProteinDB);
foreach my $key (@Pointer) {
becomes
for my $key (keys %$ProteinDB) {
There is often no need to concatenate strings when you can directly interpolate:
print "$key " . $ProteinDB->{$key} . "\n";
becomes
print "$key $ProteinDB->{$key}\n";
Starting with Perl 5.10, you can use the say
feature. The say
function behaves exactly like print
except that it always appends a newline.
A construct like for my $item (@list) { if (condition($item)) { ... } }
can be simplified using the grep
builtin:
for my $item (grep { condition($_) } @list) {
....
}
Alternatively, you can quickly jump to the next loop iteration if the condition fails, which avoids a level of indentation:
for my $item (@list) {
next if not condition($item);
...
}
Your $detect
variable just has a boolean value. In that case, only test whether it's truthy or falsey, not whether it's equal to 0
:
if (not $detect) {
The wording of your error message isn't exactly clear.
Wrapping all that together, I'd rewrite your code to this:
use feature 'say';
sub foo {
my ($id_regex, $db) = @_;
my $detected = 0;
for my $protein (@$db) {
next if not $protein->{ID} =~ /$id_regex/;
$detected = 1;
for my $key (keys %$protein) {
next if $key eq 'ID' or $key eq 'SQ';
say "$key $protein->{$key}";
say "";
}
say "SQ $protein->{SQ}";
say "//";
}
if (not $detected) {
say "Error: No entry for ID =~ /$id_regex/ could be found!";
}
}
Notice the strategically placed empty lines to emphasize what belongs together, and what is unrelated.
Further observations, not all of which will be applicable to your case:
You are testing that $protein->{ID}
matches some regex. For example, when you have the IDs 1
, 42
, and 123
, then asking for 1
would return both 1
and 123
. The way you've written it you could also ask for matches of arbitrary regexes, e.g. foo(qr/\A1.*1\z/, \@DB)
(any ID that starts and ends with a 1
). Is this what you intendted? Or do you only want those proteins where the ID is actually equal? Then use the eq
string comparision instead of a regex match.
If that is the case, you could also index your proteins in a hash. That takes as much time as looping through all elements, but once you are done a lookup is very fast. Do this if you make multiple lookups:
my %proteins_by_id;
for my $protein (@DB) {
$proteins_by_id{$protein->{ID}} = $protein;
}
or abbreviated:
my %proteins_by_id;
@proteins_by_id{ map { $_->{ID} } @DB } = @DB;
which uses a hash slice @hash{@keys}
, and the map
functions which is like a foreach
but returns a list of results.
Then: my $detected = exists $proteins_by_id{$expected_id}
This of course only works if each ID is unique, otherwise we need a hash of arrays:
my %proteins_by_id;
for my $protein (@DB) {
push @{ $proteins_by_id{$protein->{ID}} }, $protein;
}
When looping over all keys
, they will be in an unspecified order. If you want the output to stay the same between runs, you must sort
them:
for my $key (sort keys %$protein) {
When looping over all keys
, you skip the SQ
entry but then print it out anyway. Is this really necessary? Or do you just want to make sure that it's printed last? If there is no real need, remove this.
You simply print
your error. But unless this is some simple one-off script, it is usually better to return
a value indicating whether your subroutine has completed successfully. Even better: die
with an error. This will abort normal control flow of your script. The error will be passed upwards through the call stack. If it isn't handled, it will terminate the script. Exceptions are often caught via eval { ... }
, but the Try::Tiny
module is better:
try {
... # something that can `die`
} catch {
warn "I caught: $_"; # do something with the error
}; # this ';' is not optional
It is considered a bad practice to print out stuff from a subroutine that actually does something different (like searching for matching entries). If it has to be, return a string that the caller can print out. This is important in larger projects where you need to write tests that check that your program behaves correctly – it's easier to check a return value than to capture everything you print.
Assuming I've read your mind correctly, then this code would be better:
use strict;
use warnings;
use feature 'say';
use Try::Tiny;
# index the @DB by ID:
my %proteins_by_id;
for my $protein (@DB) {
push @{ $proteins_by_id{$protein->{ID}} }, $protein;
}
for my $id (1234, 5678) {
# print out the report or the error message
say try { report_matching_id($id, \%proteins_by_id) } catch { "Error: $_" };
}
sub report_matching_id {
my ($id, $protein_db) = @_;
my $matching_proteins = $protein_db->{$id};
# return early if we can see that there are no matches
# (either the entry doesn't exist, or the array is empty)
if (not $matching_proteins or not @$matching_proteins) {
die "No matching proteins for ID=$id found";
}
my $output = "";
for my $protein (@$matching_proteins) {
# dump each %$protein
for my $key (sort keys %$protein) {
next if $key eq 'ID';
$output .= "$key $protein->{$key}\n\n";
}
$output .= "//\n\n";
}
return $output;
}