# MLflow API notes

## Tracking store

There are currently three implementations of tracking store, with the following
interfaces:

* `FileStore(root_directory, artifact_root_uri)`
* `SqlAlchemyStore(db_uri, default_artifact_root)`
* `RestStore(get_host_creds)`

In each case, the first argument identifies where tracking information is
stored. The second argument to `FileStore` and `SqlAlchemyStore` sets the root
directory within which artifacts will be stored. This default can be
overridden when creating an experiment.

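For concreteness, here is a rough sketch of how each store gets built today
(module paths assumed from the current layout; the paths, URIs, and credential
values are illustrative only):

```python
import os

from mlflow.store.file_store import FileStore
from mlflow.store.sqlalchemy_store import SqlAlchemyStore
from mlflow.store.rest_store import RestStore
from mlflow.utils import rest_utils

# File and sqlalchemy stores take the default artifact root directly.
file_store = FileStore("/tmp/mlruns", "/tmp/mlruns")
sql_store = SqlAlchemyStore("sqlite:///mlruns.db", "/tmp/artifacts")

# The REST store takes no artifact root; the server decides the default.
def get_host_creds():
    return rest_utils.MlflowHostCreds(
        host="https://my-tracking-server:5000",
        token=os.environ.get("MLFLOW_TRACKING_TOKEN"),
    )

rest_store = RestStore(get_host_creds)
```
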
The principal inconsistency in the current interface is that default artifact
root URIs are determined on construction of the file and sqlalchemy stores, but
by the server in the REST store case.

In addition, a custom root artifact URI passed to
`mlflow.tracking.utils._get_store()` is ignored for the file store case (the
`store_uri` is passed as both `store_uri` and `artifact_root_uri` to
`FileStore`).

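Condensed, the behaviour described above looks roughly like this (not a
verbatim excerpt; the helper names are assumed from `mlflow.tracking.utils`):

```python
from mlflow.store.file_store import FileStore
from mlflow.tracking.utils import get_tracking_uri, _is_local_uri

def _get_store(store_uri=None, artifact_uri=None):
    store_uri = store_uri or get_tracking_uri()
    if _is_local_uri(store_uri):
        # artifact_uri never reaches FileStore here: the tracking URI
        # doubles as the artifact root, so a custom root is dropped.
        return FileStore(store_uri, store_uri)
    ...
```
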
Proposals:

1. The default root artifact URI is made configurable when building a REST
   store.
2. The default root artifact URI is removed as an option when constructing a
   file or sqlalchemy store and is either:
   a. Configured in the actual file / sqlalchemy stores (stored in a config
      file in the file store case and in a config table in the sqlalchemy
      store case). This would mirror the behaviour of the REST store, which
      determines the default location on the server side.
   b. Read from an environment variable instead of being passed in through
      code (see the sketch after this list).
3. Leave the current interface as-is.

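As an illustration of option 2b, the stores could resolve the default root
themselves; the environment variable name below is hypothetical:

```python
import os

from mlflow.store.abstract_store import AbstractStore

class FileStore(AbstractStore):
    def __init__(self, root_directory):
        super(FileStore, self).__init__()
        self.root_directory = root_directory
        # Hypothetical variable; fall back to the tracking directory
        # when no default artifact root has been configured.
        self.artifact_root_uri = os.environ.get(
            "MLFLOW_DEFAULT_ARTIFACT_ROOT", root_directory
        )
```
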
## Artifact repository

The API for building artifact repositories makes sense, except for the need to
pass through an associated tracking store. This is only used by the DBFS
artifact repository, which uses the tracking store's `get_host_creds` attribute
(only present on a `RestStore`) to avoid loading the host credentials multiple
times.

If the motivation for accessing the host credentials through the tracking store
is to avoid reloading them repeatedly, that implicit caching should live
elsewhere in the code base. We propose that each artifact repository take only
a URI as an argument, and then be responsible for loading any extra credentials
it needs to access that URI.

This is implicitly already the case with artifact repositories like the S3
artifact repository. It relies on boto, which loads AWS credentials from the
caller's environment as appropriate.

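The S3 case in miniature: boto resolves credentials from the caller's
environment, so nothing credential-related is passed through the constructor
(the bucket and keys below are made up):

```python
import boto3

# boto3 looks up AWS credentials itself (env vars, ~/.aws/credentials,
# instance profiles, ...), so the repository only needs the URI.
s3 = boto3.client("s3")
s3.upload_file("model.pkl", "my-bucket", "mlruns/0/run_id/artifacts/model.pkl")
```
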
Proposal:

* Remove `store` as an argument from
  `mlflow.store.artifact_repo.ArtifactRepository.from_artifact_uri()`
* Remove `get_host_creds` as an argument from
  `mlflow.store.dbfs_artifact_repo.DbfsArtifactRepository()`
* Have the `DbfsArtifactRepository` call
  `mlflow.utils.databricks_utils.get_databricks_host_creds` to get host
  credentials instead of passing them in at construction time, optionally
  adding caching around that call if it turns out to be expensive (sketched
  below)

This would result in a simpler, consistent interface for constructing
`ArtifactRepository` instances: only the URI would be needed.

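A sketch of the simplified DBFS repository under this proposal (only the
credential-related parts are shown):

```python
from mlflow.store.artifact_repo import ArtifactRepository
from mlflow.utils.databricks_utils import get_databricks_host_creds

class DbfsArtifactRepository(ArtifactRepository):
    """DBFS-backed artifact repository, built from a dbfs:/ URI alone."""

    def __init__(self, artifact_uri):
        super(DbfsArtifactRepository, self).__init__(artifact_uri)

    def _get_host_creds(self):
        # Resolved on demand rather than injected at construction time;
        # a memoised wrapper could be added if this proves expensive.
        return get_databricks_host_creds()
```
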
## Back to tracking stores

The above proposed change to the artifact repository interface has the benefit
of giving them a consistent construction API (only a URI needs to be passed).
The same simplification could be brought to the tracking store by having the
`RestStore` take a tracking URI instead of a `get_host_creds` function.

The difference in logic required between Databricks and non-Databricks REST
stores can be provided by two slightly different implementations:

```python
import os
import urllib.parse
from abc import abstractmethod

from mlflow.store.abstract_store import AbstractStore
from mlflow.utils import rest_utils
from mlflow.utils.databricks_utils import get_databricks_host_creds
# Env var names currently live alongside _get_rest_store:
from mlflow.tracking.utils import (
    _TRACKING_USERNAME_ENV_VAR,
    _TRACKING_PASSWORD_ENV_VAR,
    _TRACKING_TOKEN_ENV_VAR,
    _TRACKING_INSECURE_TLS_ENV_VAR,
)


class AbstractRestStore(AbstractStore):

    def __init__(self, store_uri):
        super(AbstractRestStore, self).__init__()
        self.store_uri = store_uri

    @abstractmethod
    def _get_host_creds(self):
        pass


class RestStore(AbstractRestStore):

    def _get_host_creds(self):
        # Currently in mlflow.tracking.utils._get_rest_store
        return rest_utils.MlflowHostCreds(
            host=self.store_uri,
            username=os.environ.get(_TRACKING_USERNAME_ENV_VAR),
            password=os.environ.get(_TRACKING_PASSWORD_ENV_VAR),
            token=os.environ.get(_TRACKING_TOKEN_ENV_VAR),
            ignore_tls_verification=os.environ.get(_TRACKING_INSECURE_TLS_ENV_VAR) == 'true',
        )


class DatabricksRestStore(AbstractRestStore):

    def _get_host_creds(self):
        # Get the Databricks profile specified by the tracking URI
        parsed_uri = urllib.parse.urlparse(self.store_uri)
        profile = parsed_uri.netloc
        return get_databricks_host_creds(profile)
```
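
Selecting between the two would then just be a matter of inspecting the
tracking URI, roughly along these lines (this dispatch helper is hypothetical):

```python
def _get_rest_store(store_uri):
    # Databricks tracking URIs look like databricks://<profile>
    if store_uri.startswith("databricks"):
        return DatabricksRestStore(store_uri)
    return RestStore(store_uri)
```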