Draft: remove all content for review (DO NOT MERGE!)
This branch and merge request is for reviewing your code and code structure. It is only for the purpose of reviewing and should never be merged! Note that any changes made to the repo by you after creation of this merge request are not considered for this review. Please also note that we will not review code style here (we will use automatic tools for that in a future session), but rather we will keep feedback high level at this point. Please address any issues raised (if you haven't already addressed them on your own in the meantime).
Next to the in-line and general code comments we will give during the review sessions (done by me and Mihaela), we keep track of the status of the more formal requirements (i.e., not the actual code) with respect to repository setup, packaging and documentation in the following checklist. Please complete any pieces that remain after the "List checked by reviewers" was set to "Yes" by us (indicating that we are done checking your repo against the checklist). To address these, please check the relevant notebooks and homeworks for more info.
List checked by reviewers
-
Yes -
No
Version control
-
Repo configured correctly: repo public; default branch protected against pushes; fast-forward merges; encourage squash commits; delete source branch by default -
License available in file LICENSE
without any file extension -
.gitignore
available and includes common Python artifacts; no such artifacts (e.g., from building/packaging) are under version control -
Files organized as expected, with at least one and not more than three directories, all in lower case; one directory containing all the tool/app code and named after the app (required), one directory tests
containing all test-related content (required if tests or test files are available), and one directory namedimg
orimages
, containing the screenshots from exercise 1 (fine if omitted or deleted at this point); all other files (LICENSE
,README.md
,.gitignore
,setup.py
, etc.) in repository root directory
Packaging
-
setup.py
available (usedpyproject.toml
as alternative) -
CLI executable available -
CLI arguments available -
Tool can be successfully installed with pip install .
-
CLI executable can be successfully executed with -h
option
Documentation
-
README.md
has at least a synopsis, usage and installation instructions and contact information (can use zavolab-biozentrum@unibas.ch if you don't want to put your own); other sections, as outlined in the course materials, welcome -
Google-style docstrings available for all modules, classes, functions, methods -
Type hints provided at least for all functions & methods
Merge request reports
Activity
marked the checklist item Files organized as expected, with at least one and not more than three directories, all in lower case; one directory containing all the tool/app code and named after the app (required), one directory
tests
containing all test-related content (required if tests or test files are available), and one directory namedimg
orimages
, containing the screenshots from exercise 1 (fine if omitted or deleted at this point); all other files (LICENSE
,README.md
,.gitignore
,setup.py
, etc.) in repository root directory as completed- tsg/__init__.py deleted 100644 → 0
1 """Transcript structure generator package.""" 2 3 4 __version__ = '0.0.0' This is a great way towards keeping a definite version in a single place, and a good reminder to check whether we did cover this pattern in the materials (which, if we didn't, we probably should).
Now, recently I ran into some vexing issues with keeping the
__version__
string in the main package module (__init__.py
of the root package). The reason is that it is typically a place where convencience imports are defined, where a convenience import is defined as a sort of import shortcut for entry points into a package. For example, say, a classTranscriptStructureGenerator
is the main entry point into your package and it is defined in moduletranscript_structure_generator.py
in the packagetsg
. Then, people wanting to make use of that entry point by importing the class, would need to do:from tsg.transcript_structure_generator import TranscriptStructureGenerator
Worse even, if the class was actually defined in a module of some subpackage, say ...
transcript_structure_generator
. Then, the import would become:from tsg.transcript_structure_generator.transcript_structure_generator import TranscriptStructureGenerator
I guess the pattern is clear...
Now, what one can do is to import the class with the above call in an easily accessible location, e.g., in
tsg/__init__.py
. Then users could, in their turn, import the class with simply:from tsg import TranscriptStructureGenerator
Much easier!
Back to the problem with
__version__
being defined in the same module as the convenience imports: If you want to make use of your single source of truth for your version and actually importing__version__
in another module, this will then import the entire entry point, and possibly the entire app! This can have unintended consequeces, especially when, e.g., importing the entry point actually creates some web service or daemon or something of that kind (which is probably a bad pattern anyway, but maybe sometimes desired). It can also potentially lead to circular imports, which are nasty to debug/fix.Running into one of the latter, I have recently moved to a pattern where I store
__version__.py
in a moduleversion.py
in the package root, i.e., here it would be intsg/version.py
. By having a dedicated module for the version that does not contain anything else, importing__version__
anywhere in the package becomes safe.On that topic though: It is actually not safe to import
__version__
outside of the package, e.g., in yourtests/
directory or insetup.py
(another issue I ran into). This is because the actual package may not actually be available (yet) at the time the import is called. It makes sense: When you usesetup.py
to install a package for the first time and one of the first lines of that file is an instruction to import from a package that the file is designed to install further down, then this is bound to cause problems.A way to keep a single source of truth in
package/version.py
while not relying on the package to be available can be found here: https://github.com/elixir-cloud-aai/foca/blob/f476cf129b5acdfb91d0dada1cfe2371a6495e72/setup.py#L8In fact, this PR addresses both of the mentioned issues (detaching convenience imports from version and using
__version__
without importing it) are addressed in this PR for the package mentioned above: https://github.com/elixir-cloud-aai/foca/pull/152/filesSee also this for a more comprehensive list of ways how you can single source your version: https://packaging.python.org/en/latest/guides/single-sourcing-package-version/
Not sure how to do it with
pyproject.toml
, but you could try if theattr:
directive works (gray box in technique 1).By the way, I have just seen that your tool version in
pyproject.toml
andtsg/__init__.py
already diverged. Guess it illustrates that single sourcing the version is rather important.
- tsg/__main__.py deleted 100644 → 0
- tsg/cli.py deleted 100644 → 0
1 import argparse 2 import logging 3 from pathlib import Path 4 5 from tsg.main import sample_transcripts 6 7 8 def setup_logging(loglevel: str = None) -> None: 9 """Set up logging. Loglevel can be one of ["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"]. 10 11 Args: 12 loglevel (str, optional): Level of log output. Defaults to None. Best not to provide type information in two places because of the risk of divergence in the future. Try to keep a single source of info in all places, most definitely in documentation (which anyway runs a risk of becoming outdated when people forget to update it with their code). Adding type hints is preferred over embedding type info in docstrings. Autodoc generation tools like Sphinx know how to parse the type info from type hints, so no info is lost.
- tsg/cli.py deleted 100644 → 0
5 from tsg.main import sample_transcripts 6 7 8 def setup_logging(loglevel: str = None) -> None: 9 """Set up logging. Loglevel can be one of ["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"]. 10 11 Args: 12 loglevel (str, optional): Level of log output. Defaults to None. 13 14 Raises: 15 ValueError: If string that is not a log level is passed, raise error. 16 """ 17 if loglevel: 18 numeric_level = getattr(logging, loglevel.upper()) 19 if not isinstance(numeric_level, int): 20 raise ValueError("Invalid log level: %s" % loglevel) - Comment on lines -18 to -20
Actually, an error (
AttributeError
) would likely already be raised when callinggetattr
on an attribute that doesn't exist on thelogging
object. So likely your code to raise an error for invalid log levels would never be raised.This is also a good opportunity to follow Python's "It's better to ask forgiveness than permission" trope, i.e., catch the
AttributeError
rather than checking and raising an error manually.By the way, perhaps a better way to define the available options is via an
enum
. This gives one the option to type hint to the name of the defined enum, which then automatically lists or at least refers to the available options, dynamically, i.e., no need to update when more options are added or options are removed. You could also rewrite your check by checking enum membership, which - while not following the forgiveness/permission pattern - has the advantage of being very direct and explicit.
- tsg/cli.py deleted 100644 → 0
20 raise ValueError("Invalid log level: %s" % loglevel) 21 else: 22 numeric_level = getattr(logging, "INFO") 23 24 logging.basicConfig( 25 format='[%(asctime)s: %(levelname)s] %(message)s (module "%(module)s")', 26 level=numeric_level, 27 ) 28 29 30 def build_arg_parser() -> argparse.ArgumentParser: 31 parser = argparse.ArgumentParser() 32 parser.add_argument("--transcripts", type=str) 33 parser.add_argument("--annotation", type=str) 34 parser.add_argument("--prob_inclusion", type=float) 35 parser.add_argument("--log", type=str) - Comment on lines -32 to -35
Please consider the comments about flag/option names and command-line interface design in read-sequencer!24 and at the very least:
- add informative
help
messages for each option - provide defaults where reasonable (
--prob-inclusion
and--log
, I guess)
- add informative
- tsg/cli.py deleted 100644 → 0
30 def build_arg_parser() -> argparse.ArgumentParser: 31 parser = argparse.ArgumentParser() 32 parser.add_argument("--transcripts", type=str) 33 parser.add_argument("--annotation", type=str) 34 parser.add_argument("--prob_inclusion", type=float) 35 parser.add_argument("--log", type=str) 36 37 return parser 38 39 40 def get_args() -> argparse.Namespace: 41 parser = build_arg_parser() 42 43 args = parser.parse_args() 44 45 return args - Comment on lines -40 to -45
I'm all for encapsulating code, but here you basically define an extra function just for a single line of code. Given that there are hardly any use cases for calling
build_arg_parser()
outside of a situation where you also want to actually parse these arguments, I would probably merge the two. Otherwise you need to pay the maintenance overhead of having extra functions (documentation, type annotations, writing separate unit tests!).Of course up to you, it's a totally legit pattern as is, and it certainly nicely separates concerns (definition vs action).