Patchwork [05,of,10] chgserver: add a context manager to learn what to include for confighash

login
register
mail settings
Submitter Jun Wu
Date June 30, 2016, 4:59 p.m.
Message ID <50d862642eb658dc28ca.1467305940@x1c>
Download mbox | patch
Permalink /patch/15680/
State Changes Requested
Delegated to: Yuya Nishihara
Headers show

Comments

Jun Wu - June 30, 2016, 4:59 p.m.
# HG changeset patch
# User Jun Wu <quark@fb.com>
# Date 1467292128 -3600
#      Thu Jun 30 14:08:48 2016 +0100
# Node ID 50d862642eb658dc28ca878a4ded89ab7303e3a9
# Parent  c02da7bd56e98cf4e905550859031079ed9beb18
# Available At https://bitbucket.org/quark-zju/hg-draft
#              hg pull https://bitbucket.org/quark-zju/hg-draft -r 50d862642eb6
chgserver: add a context manager to learn what to include for confighash

A pitfall about chg is developers need to be aware the ui object passed to
ui/extsetup can have wrong config and should not be depended on. This confuses
people, and causes inconvenience.

Instead of adding another set APIs letting extensions telling chg what need to
be hashed and fixing all problematic extensions one by one, I think it is
better just letting chg collecting these pieces of information automatically.

This patch adds a context manager wrapping common getters and adjusts
what to hash for confighash accordingly. It only includes configs currently
because supporting environment variables is while possible, not elegant
since we have to wrap stdlib.

Note this is not a solution for all cases, namely the following cases still
need extra work:

1. "if ui.config('x'): ui.setconfig('x', 'y')" in uisetup.
   This will overwrite config 'x' and chg will be unable to hash the value
   user provides. The pattern is mainly used in the evolve extension. It
   could be solved if we have clear layers about what are set by extensions
   and what are set by users.

2. Patterns like "self.ui = ui" in uisetup.
   This will keep a reference to a stale ui object. We probably want to
   add devel-warn for the case. It may also be reasonable to add a similar
   check to reposetup.

3. Interacting with filesystem or network in uisetup.
   In general, extensions doing these should be aware of race conditions.

We can also add a top-level extension interface like "chguisetup" called
per chg request. But that should not be a decision made lightly.
Yuya Nishihara - July 3, 2016, 4:44 a.m.
On Thu, 30 Jun 2016 17:59:00 +0100, Jun Wu wrote:
> # HG changeset patch
> # User Jun Wu <quark@fb.com>
> # Date 1467292128 -3600
> #      Thu Jun 30 14:08:48 2016 +0100
> # Node ID 50d862642eb658dc28ca878a4ded89ab7303e3a9
> # Parent  c02da7bd56e98cf4e905550859031079ed9beb18
> # Available At https://bitbucket.org/quark-zju/hg-draft
> #              hg pull https://bitbucket.org/quark-zju/hg-draft -r 50d862642eb6
> chgserver: add a context manager to learn what to include for confighash
> 
> A pitfall about chg is developers need to be aware the ui object passed to
> ui/extsetup can have wrong config and should not be depended on. This confuses
> people, and causes inconvenience.
> 
> Instead of adding another set APIs letting extensions telling chg what need to
> be hashed and fixing all problematic extensions one by one, I think it is
> better just letting chg collecting these pieces of information automatically.

Can the problems you're facing are solved if _configsections are the whitelist?

We can easily build a set of known-good config keys from our codebase. Assuming
config values don't vary amongst repositories and don't change often, it would
be acceptable to spawn new server if unknown config changed. We can't do that
for environment variables, but I can't think of a good reason why they have to
use os.environ in uisetup().

> Note this is not a solution for all cases, namely the following cases still
> need extra work:
> 
> 1. "if ui.config('x'): ui.setconfig('x', 'y')" in uisetup.
>    This will overwrite config 'x' and chg will be unable to hash the value
>    user provides. The pattern is mainly used in the evolve extension. It
>    could be solved if we have clear layers about what are set by extensions
>    and what are set by users.
> 
> 2. Patterns like "self.ui = ui" in uisetup.
>    This will keep a reference to a stale ui object. We probably want to
>    add devel-warn for the case. It may also be reasonable to add a similar
>    check to reposetup.
> 
> 3. Interacting with filesystem or network in uisetup.
>    In general, extensions doing these should be aware of race conditions.

(1) and (2) are generally wrong because the ui passed to uisetup() and
extsetup() are "lui", a temporary copy of ui.
Jun Wu - July 4, 2016, 7:44 p.m.
Excerpts from Yuya Nishihara's message of 2016-07-03 13:44:32 +0900:
> Can the problems you're facing are solved if _configsections are the whitelist?
> 
> We can easily build a set of known-good config keys from our codebase. Assuming
> config values don't vary amongst repositories and don't change often, it would
> be acceptable to spawn new server if unknown config changed. We can't do that
> for environment variables, but I can't think of a good reason why they have to
> use os.environ in uisetup().

I don't quite get your question. So, I draft some questions:

1. Is a static whitelist enough?

  No. The issue is 3rd-party extensions.
  
  We can say this is a limitation of chg and document it somewhere but
  people just don't read docs and yell at you whenever they see hg works
  while chg does not.

2. Other solutions?

  I did have other attempts, like a chg-compat checker, devel-warn or
  thinking about an API to let extensions telling us what they want to hash.

  A chg-compat checker is actually hard to implement correctly since
  extensions depending on other extensions are hard to check.

  devel-warn will pollute extensions.py.

  An API for extensions work but why do we let extensions telling us things
  while we can know ourselves with a little effort?

3. Is "_configitems" necessary since we have "_configsections"?

  I think so. Some extension accesses "ui.debugflag" and wraps core
  functions with a version with a logger. Hashing the whole ui section
  looks undesirable.

Therefore I think the "dynamically learn" approach is the way to go.
Developers don't need to know about chg and aren't forced to rewrite their
code (sometimes it's hard) using other patterns. Developers with chg in mind
can write code to reduce the count of chgservers.

I will send V2 after the unwrapfunction feature so it would be more
confident.
 
> > Note this is not a solution for all cases, namely the following cases still
> > need extra work:
> > 
> > 1. "if ui.config('x'): ui.setconfig('x', 'y')" in uisetup.
> >    This will overwrite config 'x' and chg will be unable to hash the value
> >    user provides. The pattern is mainly used in the evolve extension. It
> >    could be solved if we have clear layers about what are set by extensions
> >    and what are set by users.
> > 
> > 2. Patterns like "self.ui = ui" in uisetup.
> >    This will keep a reference to a stale ui object. We probably want to
> >    add devel-warn for the case. It may also be reasonable to add a similar
> >    check to reposetup.
> > 
> > 3. Interacting with filesystem or network in uisetup.
> >    In general, extensions doing these should be aware of race conditions.
> 
> (1) and (2) are generally wrong because the ui passed to uisetup() and
> extsetup() are "lui", a temporary copy of ui.

Why are they wrong?

For (1), you can have an extension:

    # a.py
    def uisetup(ui):
        # extdiff is in chgserver._configsections
        ui.setconfig('extdiff', 'x', 'x')

And run chg:

    chg --config extensions.s=$PWD/a.py --config extdiff.x=1 version
    chg --config extensions.s=$PWD/a.py --config extdiff.x=2 version

The second command will not start a new server.

For (2), if you save ui from uisetup to somewhere (say, a global variable),
and then use it in reposetup, it can have wrong configs for all things
outside chg confighash. It's not the case for short-lived hg process.

I like the idea of having a disposable ui object for uisetup, dropping all
effects extensions doing on it. But that sounds like a big breaking change.
Yuya Nishihara - July 5, 2016, 1:46 p.m.
On Mon, 4 Jul 2016 20:44:25 +0100, Jun Wu wrote:
> Excerpts from Yuya Nishihara's message of 2016-07-03 13:44:32 +0900:
> > Can the problems you're facing are solved if _configsections are the whitelist?
> > 
> > We can easily build a set of known-good config keys from our codebase. Assuming
> > config values don't vary amongst repositories and don't change often, it would
> > be acceptable to spawn new server if unknown config changed. We can't do that
> > for environment variables, but I can't think of a good reason why they have to
> > use os.environ in uisetup().  
> 
> I don't quite get your question. So, I draft some questions:
> 
> 1. Is a static whitelist enough?
> 
>   No. The issue is 3rd-party extensions.
>   
>   We can say this is a limitation of chg and document it somewhere but
>   people just don't read docs and yell at you whenever they see hg works
>   while chg does not.

As long as third-party extensions use their own config sections, e.g.
"remotefilelog.blah-blah", all of them are included in the hash if the list is
a whitelist of safe sections. If they read e.g. ui.username in uisetup() and
we have to work around that, a static list wouldn't help.

> 2. Other solutions?
> 
>   I did have other attempts, like a chg-compat checker, devel-warn or
>   thinking about an API to let extensions telling us what they want to hash.
> 
>   A chg-compat checker is actually hard to implement correctly since
>   extensions depending on other extensions are hard to check.
> 
>   devel-warn will pollute extensions.py.
> 
>   An API for extensions work but why do we let extensions telling us things
>   while we can know ourselves with a little effort?
>
> 3. Is "_configitems" necessary since we have "_configsections"?
> 
>   I think so. Some extension accesses "ui.debugflag" and wraps core
>   functions with a version with a logger. Hashing the whole ui section
>   looks undesirable.

Huh? Do they want to trust only [ui]debug ignoring --debug option?

> Therefore I think the "dynamically learn" approach is the way to go.
> Developers don't need to know about chg and aren't forced to rewrite their
> code (sometimes it's hard) using other patterns. Developers with chg in mind
> can write code to reduce the count of chgservers.

I prefer the static approach if it can save 90% of the problems with a little
penalty of excessive server redirection. Dynamically updating the list would
make debugging harder and would need more tests.

> I will send V2 after the unwrapfunction feature so it would be more
> confident.
>  
> > > Note this is not a solution for all cases, namely the following cases still
> > > need extra work:
> > > 
> > > 1. "if ui.config('x'): ui.setconfig('x', 'y')" in uisetup.
> > >    This will overwrite config 'x' and chg will be unable to hash the value
> > >    user provides. The pattern is mainly used in the evolve extension. It
> > >    could be solved if we have clear layers about what are set by extensions
> > >    and what are set by users.
> > > 
> > > 2. Patterns like "self.ui = ui" in uisetup.
> > >    This will keep a reference to a stale ui object. We probably want to
> > >    add devel-warn for the case. It may also be reasonable to add a similar
> > >    check to reposetup.
> > > 
> > > 3. Interacting with filesystem or network in uisetup.
> > >    In general, extensions doing these should be aware of race conditions.  
> > 
> > (1) and (2) are generally wrong because the ui passed to uisetup() and
> > extsetup() are "lui", a temporary copy of ui.  
> 
> Why are they wrong?
> 
> For (1), you can have an extension:
> 
>     # a.py
>     def uisetup(ui):
>         # extdiff is in chgserver._configsections
>         ui.setconfig('extdiff', 'x', 'x')
>
> And run chg:
> 
>     chg --config extensions.s=$PWD/a.py --config extdiff.x=1 version
>     chg --config extensions.s=$PWD/a.py --config extdiff.x=2 version
> 
> The second command will not start a new server.

IIRC, config values set in uisetup() aren't copied back to the original ui
and repo.ui, which is why I said generally wrong.

> For (2), if you save ui from uisetup to somewhere (say, a global variable),
> and then use it in reposetup, it can have wrong configs for all things
> outside chg confighash. It's not the case for short-lived hg process.

The ui passed to uisetup() isn't kept updated, which could have a different
value from repo.ui. IMHO, keeping ui globally is a bad idea.
Jun Wu - July 5, 2016, 8:10 p.m.
Excerpts from Yuya Nishihara's message of 2016-07-05 22:46:04 +0900:
> > I don't quite get your question. So, I draft some questions:
> > 
> > 1. Is a static whitelist enough?
> > 
> >   No. The issue is 3rd-party extensions.
> >   
> >   We can say this is a limitation of chg and document it somewhere but
> >   people just don't read docs and yell at you whenever they see hg works
> >   while chg does not.
> 
> As long as third-party extensions use their own config sections, e.g.
> "remotefilelog.blah-blah", all of them are included in the hash if the list is
> a whitelist of safe sections. If they read e.g. ui.username in uisetup() and
> we have to work around that, a static list wouldn't help.

I finally got your idea. It's a "blacklist" about what *not* to hash. I
object because it leads to unnecessary server processes. Given the fact
each process needs 100MB+ memory and probably does not share many pages in
the kernel (not forked from an Python ancestor), I'd treat process number
seriously.
 
> > 2. Other solutions?
> > 
> >   I did have other attempts, like a chg-compat checker, devel-warn or
> >   thinking about an API to let extensions telling us what they want to hash.
> > 
> >   A chg-compat checker is actually hard to implement correctly since
> >   extensions depending on other extensions are hard to check.
> > 
> >   devel-warn will pollute extensions.py.
> > 
> >   An API for extensions work but why do we let extensions telling us things
> >   while we can know ourselves with a little effort?
> >
> > 3. Is "_configitems" necessary since we have "_configsections"?
> > 
> >   I think so. Some extension accesses "ui.debugflag" and wraps core
> >   functions with a version with a logger. Hashing the whole ui section
> >   looks undesirable.
> 
> Huh? Do they want to trust only [ui]debug ignoring --debug option?

--debug is just a sugar setting the config item "ui.debug", which I think is
the source of truth:

    if options['verbose'] or options['debug'] or options['quiet']:
        for opt in ('verbose', 'debug', 'quiet'):
            val = str(bool(options[opt]))
            for ui_ in uis:
                ui_.setconfig('ui', opt, val, '--' + opt)

Currently there is no way to get sys.argv without wrapping
dispatch.runcommand, but the argv stuff is another topic.
 
> > Therefore I think the "dynamically learn" approach is the way to go.
> > Developers don't need to know about chg and aren't forced to rewrite their
> > code (sometimes it's hard) using other patterns. Developers with chg in mind
> > can write code to reduce the count of chgservers.
> 
> I prefer the static approach if it can save 90% of the problems with a little
> penalty of excessive server redirection. Dynamically updating the list would
> make debugging harder and would need more tests.

I understand your worry. I admit I didn't think too carefully about this
approach (but now I'm clear). The issue is: if the servers disagree about
what to hash, they can generate different hashes. It's especially a problem
if one server changes its idea about what to hash.

I think we should freeze the list at chgserver startup to make sure a single
server has consistent view about what to hash all the time.

For multiple servers, the following (bad) situation may happen:

  1. server A tell client to redirect to hash1
  2. client does not find hash1, starts a new server
  3. the new server generates hash2, while there is already a server running
     at hash2, therefore the starting of the new server is a waste.

To solve it, it seems we have to store the "what to hash" information on
disk, which is more complicated than I thought.

Therefore, what I think the next steps are:

  1. Add new configs to chgserver deciding what to hash:
  
        [chgserver]
        confighash.sections = section1, section2
        confighash.items = section.item, section.item

    This would also serve as the "api" for other extensions to hash their
    configs as they just need to use setconfig.

    This is a one-time config, chgserver reads it during startup and forget
    about it afterwards.

  2. (Optionally) add the auto-learn logic. I probably won't actively
     work on this, but put it here as it's a possible step.

What do you think? 1 is pretty easy and straight-forward and solves a lot
of issues. I may want to follow up with the chg-compat checker test then.

> > I will send V2 after the unwrapfunction feature so it would be more
> > confident.
> >  
> > > > Note this is not a solution for all cases, namely the following cases still
> > > > need extra work:
> > > > 
> > > > 1. "if ui.config('x'): ui.setconfig('x', 'y')" in uisetup.
> > > >    This will overwrite config 'x' and chg will be unable to hash the value
> > > >    user provides. The pattern is mainly used in the evolve extension. It
> > > >    could be solved if we have clear layers about what are set by extensions
> > > >    and what are set by users.
> > > > 
> > > > 2. Patterns like "self.ui = ui" in uisetup.
> > > >    This will keep a reference to a stale ui object. We probably want to
> > > >    add devel-warn for the case. It may also be reasonable to add a similar
> > > >    check to reposetup.
> > > > 
> > > > 3. Interacting with filesystem or network in uisetup.
> > > >    In general, extensions doing these should be aware of race conditions.  
> > > 
> > > (1) and (2) are generally wrong because the ui passed to uisetup() and
> > > extsetup() are "lui", a temporary copy of ui.  
> > 
> > Why are they wrong?
> > 
> > For (1), you can have an extension:
> > 
> >     # a.py
> >     def uisetup(ui):
> >         # extdiff is in chgserver._configsections
> >         ui.setconfig('extdiff', 'x', 'x')
> >
> > And run chg:
> > 
> >     chg --config extensions.s=$PWD/a.py --config extdiff.x=1 version
> >     chg --config extensions.s=$PWD/a.py --config extdiff.x=2 version
> > 
> > The second command will not start a new server.
> 
> IIRC, config values set in uisetup() aren't copied back to the original ui
> and repo.ui, which is why I said generally wrong.

That's not true :( Otherwise the "# stolen from
tortoisehg.util.copydynamicconfig()" hack will be unnecessary.

> > For (2), if you save ui from uisetup to somewhere (say, a global variable),
> > and then use it in reposetup, it can have wrong configs for all things
> > outside chg confighash. It's not the case for short-lived hg process.
> 
> The ui passed to uisetup() isn't kept updated, which could have a different
> value from repo.ui. IMHO, keeping ui globally is a bad idea.

Yes. It's definitely a bad pattern (which is why I want to devel-warn
refcount change). But the code is unfortunately so tightly integrated with
that pattern and is not easy to change (I did have attempts to rewrite but
given up finally).
Yuya Nishihara - July 6, 2016, 1:15 p.m.
On Tue, 5 Jul 2016 21:10:43 +0100, Jun Wu wrote:
> Excerpts from Yuya Nishihara's message of 2016-07-05 22:46:04 +0900:
> > > I don't quite get your question. So, I draft some questions:
> > > 
> > > 1. Is a static whitelist enough?
> > > 
> > >   No. The issue is 3rd-party extensions.
> > >   
> > >   We can say this is a limitation of chg and document it somewhere but
> > >   people just don't read docs and yell at you whenever they see hg works
> > >   while chg does not.  
> > 
> > As long as third-party extensions use their own config sections, e.g.
> > "remotefilelog.blah-blah", all of them are included in the hash if the list is
> > a whitelist of safe sections. If they read e.g. ui.username in uisetup() and
> > we have to work around that, a static list wouldn't help.  
> 
> I finally got your idea. It's a "blacklist" about what *not* to hash. I
> object because it leads to unnecessary server processes. Given the fact
> each process needs 100MB+ memory and probably does not share many pages in
> the kernel (not forked from an Python ancestor), I'd treat process number
> seriously.

What you say black is what I see white. Anyway, do you have many variants of
configurations in production? Most of my repository's hgrc files only set
"ui", "paths", and "email", so the not-to-hash list would work for me.

Also, I guess 100MB+ is the VIRT value. I think it is the size of the address
space.

> > Huh? Do they want to trust only [ui]debug ignoring --debug option?  
> 
> --debug is just a sugar setting the config item "ui.debug", which I think is
> the source of truth:
> 
>     if options['verbose'] or options['debug'] or options['quiet']:
>         for opt in ('verbose', 'debug', 'quiet'):
>             val = str(bool(options[opt]))
>             for ui_ in uis:
>                 ui_.setconfig('ui', opt, val, '--' + opt)

Extensions are loaded before the ui options are set.

> > > Therefore I think the "dynamically learn" approach is the way to go.
> > > Developers don't need to know about chg and aren't forced to rewrite their
> > > code (sometimes it's hard) using other patterns. Developers with chg in mind
> > > can write code to reduce the count of chgservers.  
> > 
> > I prefer the static approach if it can save 90% of the problems with a little
> > penalty of excessive server redirection. Dynamically updating the list would
> > make debugging harder and would need more tests.  
> 
> I understand your worry. I admit I didn't think too carefully about this
> approach (but now I'm clear). The issue is: if the servers disagree about
> what to hash, they can generate different hashes. It's especially a problem
> if one server changes its idea about what to hash.
> 
> I think we should freeze the list at chgserver startup to make sure a single
> server has consistent view about what to hash all the time.
> 
> For multiple servers, the following (bad) situation may happen:
> 
>   1. server A tell client to redirect to hash1
>   2. client does not find hash1, starts a new server
>   3. the new server generates hash2, while there is already a server running
>      at hash2, therefore the starting of the new server is a waste.
> 
> To solve it, it seems we have to store the "what to hash" information on
> disk, which is more complicated than I thought.
> 
> Therefore, what I think the next steps are:
> 
>   1. Add new configs to chgserver deciding what to hash:
>   
>         [chgserver]
>         confighash.sections = section1, section2
>         confighash.items = section.item, section.item
> 
>     This would also serve as the "api" for other extensions to hash their
>     configs as they just need to use setconfig.
> 
>     This is a one-time config, chgserver reads it during startup and forget
>     about it afterwards.
> 
>   2. (Optionally) add the auto-learn logic. I probably won't actively
>      work on this, but put it here as it's a possible step.
> 
> What do you think? 1 is pretty easy and straight-forward and solves a lot
> of issues. I may want to follow up with the chg-compat checker test then.

I don't think (1) needs to be a config section, which may be updated by
repository's hgrc. Instead, extensions can update the list by ui/extsetup()

  chgserver = extensions.find('chgserver')
  chgserver._configsections.add(...)

This will be simpler if chgserver gets out of hgext.

> > > > > Note this is not a solution for all cases, namely the following cases still
> > > > > need extra work:
> > > > > 
> > > > > 1. "if ui.config('x'): ui.setconfig('x', 'y')" in uisetup.
> > > > >    This will overwrite config 'x' and chg will be unable to hash the value
> > > > >    user provides. The pattern is mainly used in the evolve extension. It
> > > > >    could be solved if we have clear layers about what are set by extensions
> > > > >    and what are set by users.
> > > > > 
> > > > > 2. Patterns like "self.ui = ui" in uisetup.
> > > > >    This will keep a reference to a stale ui object. We probably want to
> > > > >    add devel-warn for the case. It may also be reasonable to add a similar
> > > > >    check to reposetup.
> > > > > 
> > > > > 3. Interacting with filesystem or network in uisetup.
> > > > >    In general, extensions doing these should be aware of race conditions.    
> > > > 
> > > > (1) and (2) are generally wrong because the ui passed to uisetup() and
> > > > extsetup() are "lui", a temporary copy of ui.    
> > > 
> > > Why are they wrong?
> > > 
> > > For (1), you can have an extension:
> > > 
> > >     # a.py
> > >     def uisetup(ui):
> > >         # extdiff is in chgserver._configsections
> > >         ui.setconfig('extdiff', 'x', 'x')
> > >
> > > And run chg:
> > > 
> > >     chg --config extensions.s=$PWD/a.py --config extdiff.x=1 version
> > >     chg --config extensions.s=$PWD/a.py --config extdiff.x=2 version
> > > 
> > > The second command will not start a new server.  
> > 
> > IIRC, config values set in uisetup() aren't copied back to the original ui
> > and repo.ui, which is why I said generally wrong.  
> 
> That's not true :( Otherwise the "# stolen from
> tortoisehg.util.copydynamicconfig()" hack will be unnecessary.

The hack should be necessary because config values are updated by reading
.hg/hgrc.

Try "hg config foo.bar" with/without a repository to see if ui.setconfig()
survives.

  def uisetup(ui):
      ui.setconfig('foo', 'bar', 'baz')
Jun Wu - July 6, 2016, 2:14 p.m.
Excerpts from Yuya Nishihara's message of 2016-07-06 22:15:23 +0900:
> > I finally got your idea. It's a "blacklist" about what *not* to hash. I
> > object because it leads to unnecessary server processes. Given the fact
> > each process needs 100MB+ memory and probably does not share many pages in
> > the kernel (not forked from an Python ancestor), I'd treat process number
> > seriously.
> 
> What you say black is what I see white. Anyway, do you have many variants of
> configurations in production? Most of my repository's hgrc files only set
> "ui", "paths", and "email", so the not-to-hash list would work for me.
> 
> Also, I guess 100MB+ is the VIRT value. I think it is the size of the address
> space.

In production the config files are managed by Chef and can be complex, the
Chef code has a bunch of sections like:

   if node.in_alpha_tier?
     hgrc['newfeatrue1']['newsetting'] = somevalue
     ...
   elsif node.in_beta_tier?
     ...
   end

For repo hgrc, it has "%include /etc/mercurial/reporc/reponame.rc". This is
to store non-common configs. For example, some repos need hgsubversion and
some don't.

About whitelist vs blacklist, I think it's less important. I prefer the
current situation.
 
> > > Huh? Do they want to trust only [ui]debug ignoring --debug option?  
> > 
> > --debug is just a sugar setting the config item "ui.debug", which I think is
> > the source of truth:
> > 
> >     if options['verbose'] or options['debug'] or options['quiet']:
> >         for opt in ('verbose', 'debug', 'quiet'):
> >             val = str(bool(options[opt]))
> >             for ui_ in uis:
> >                 ui_.setconfig('ui', opt, val, '--' + opt)
> 
> Extensions are loaded before the ui options are set.

You are right. ui.debugflag is set before extensions are loaded, which I
think is not ideal. I will make sure ui.config('ui', 'debug') is the source
of truth while doing the ui/config refactoring. But that's another topic.

> > I understand your worry. I admit I didn't think too carefully about this
> > approach (but now I'm clear). The issue is: if the servers disagree about
> > what to hash, they can generate different hashes. It's especially a problem
> > if one server changes its idea about what to hash.
> > 
> > I think we should freeze the list at chgserver startup to make sure a single
> > server has consistent view about what to hash all the time.
> > 
> > For multiple servers, the following (bad) situation may happen:
> > 
> >   1. server A tell client to redirect to hash1
> >   2. client does not find hash1, starts a new server
> >   3. the new server generates hash2, while there is already a server running
> >      at hash2, therefore the starting of the new server is a waste.
> > 
> > To solve it, it seems we have to store the "what to hash" information on
> > disk, which is more complicated than I thought.
> > 
> > Therefore, what I think the next steps are:
> > 
> >   1. Add new configs to chgserver deciding what to hash:
> >   
> >         [chgserver]
> >         confighash.sections = section1, section2
> >         confighash.items = section.item, section.item
> > 
> >     This would also serve as the "api" for other extensions to hash their
> >     configs as they just need to use setconfig.
> > 
> >     This is a one-time config, chgserver reads it during startup and forget
> >     about it afterwards.
> > 
> >   2. (Optionally) add the auto-learn logic. I probably won't actively
> >      work on this, but put it here as it's a possible step.
> > 
> > What do you think? 1 is pretty easy and straight-forward and solves a lot
> > of issues. I may want to follow up with the chg-compat checker test then.
> 
> I don't think (1) needs to be a config section, which may be updated by
> repository's hgrc. Instead, extensions can update the list by ui/extsetup()
>
>   chgserver = extensions.find('chgserver')
>   chgserver._configsections.add(...)
> 
> This will be simpler if chgserver gets out of hgext.

This is the old "API" discussion that makes extensions explicitly aware of
chg. The main downside I see is, once the API is introduced, it's hard to
remove. Besides, exposing this private variable would also cause trouble if
other extensions modify it in places other than ui/extsetup.

In general, I try to keep extensions "clean" without having code to
explicitly work with chg, thus the auto learn approach proposals.

Thinking about that, I reconsider the auto-learn approach, and it's likely
the cleanest we can get. Because we hash [extensions] already, it's hard
to have inconsistent hashes situation unless the extensions use
outside-world conditions to get config items.

Therefore I proabably want the auto-learn feature in, after the unwrap
series. What do you think?

> > > IIRC, config values set in uisetup() aren't copied back to the original ui
> > > and repo.ui, which is why I said generally wrong.  
> > 
> > That's not true :( Otherwise the "# stolen from
> > tortoisehg.util.copydynamicconfig()" hack will be unnecessary.
> 
> The hack should be necessary because config values are updated by reading
> .hg/hgrc.
> 
> Try "hg config foo.bar" with/without a repository to see if ui.setconfig()
> survives.
> 
>   def uisetup(ui):
>       ui.setconfig('foo', 'bar', 'baz')

I see. I will address all these ugliness with the ui/config refactoring.
Yuya Nishihara - July 6, 2016, 3:05 p.m.
On Wed, 6 Jul 2016 15:14:59 +0100, Jun Wu wrote:
> Excerpts from Yuya Nishihara's message of 2016-07-06 22:15:23 +0900:
> > > I finally got your idea. It's a "blacklist" about what *not* to hash. I
> > > object because it leads to unnecessary server processes. Given the fact
> > > each process needs 100MB+ memory and probably does not share many pages in
> > > the kernel (not forked from an Python ancestor), I'd treat process number
> > > seriously.  
> > 
> > What you say black is what I see white. Anyway, do you have many variants of
> > configurations in production? Most of my repository's hgrc files only set
> > "ui", "paths", and "email", so the not-to-hash list would work for me.
> > 
> > Also, I guess 100MB+ is the VIRT value. I think it is the size of the address
> > space.  
> 
> In production the config files are managed by Chef and can be complex, the
> Chef code has a bunch of sections like:
> 
>    if node.in_alpha_tier?
>      hgrc['newfeatrue1']['newsetting'] = somevalue
>      ...
>    elsif node.in_beta_tier?
>      ...
>    end
> 
> For repo hgrc, it has "%include /etc/mercurial/reporc/reponame.rc". This is
> to store non-common configs. For example, some repos need hgsubversion and
> some don't.

Do they have significant differences in non-core configs other than
"extensions" section?

> About whitelist vs blacklist, I think it's less important. I prefer the
> current situation.

It defines the default behavior of chg, whether or not chg is permissive for
third-party extensions which process their config values in undesired way.
IMHO it's important property of chg.

> >   chgserver = extensions.find('chgserver')
> >   chgserver._configsections.add(...)
> > 
> > This will be simpler if chgserver gets out of hgext.  
> 
> This is the old "API" discussion that makes extensions explicitly aware of
> chg. The main downside I see is, once the API is introduced, it's hard to
> remove. Besides, exposing this private variable would also cause trouble if
> other extensions modify it in places other than ui/extsetup.
> 
> In general, I try to keep extensions "clean" without having code to
> explicitly work with chg, thus the auto learn approach proposals.
> 
> Thinking about that, I reconsider the auto-learn approach, and it's likely
> the cleanest we can get. Because we hash [extensions] already, it's hard
> to have inconsistent hashes situation unless the extensions use
> outside-world conditions to get config items.

Another possible problem is that extensions in repo/hgrc can be loaded
either while booting the server or after the server is fully booted. uisetup()
may behave differently depending on ui or extensions state. I don't know
if it can be a real problem, though.

I understand you greatly invest on improving chg, but the situation where
every repo has different config of third-party extensions seems very specific
to FB. That made me feel your auto-learn function is too complicated.
Jun Wu - July 6, 2016, 3:33 p.m.
Excerpts from Yuya Nishihara's message of 2016-07-07 00:05:06 +0900:
> Do they have significant differences in non-core configs other than
> "extensions" section?

They are mainly about new experimental extension features.
Most of them work fine with chg. The one that does not work well is
fastmanifest:

  https://bitbucket.org/facebook/hg-experimental/src/be7a54df7b36f7e9ac93ae05b9ff7621e32c089c/fastmanifest/__init__.py?at=default&fileviewer=file-view-default#__init__.py-163

It basically keeps a ui object in all kinds of its objects, and down in the
stack the ui object is used to do debugflag test or so. As said, I believed
this is not a good pattern and tried to fix them but the codebase is so
large and I gave up after a few hours.

The developer wants to use --debug to toggle debug output so they should be
able to tell chg to hash ui.debugflag, but not other ui config items.

> > About whitelist vs blacklist, I think it's less important. I prefer the
> > current situation.
> 
> It defines the default behavior of chg, whether or not chg is permissive for
> third-party extensions which process their config values in undesired way.
> IMHO it's important property of chg.

But as I said, my main concern about listing what *not* to hash is
unnecessary processes. The list will also be unfriendly for extensions to
change (if they need to) since all extensions must agree together, only that
can a section be added, which is practically impossible.

Internally, one of our testing infra's senior engineers feels annoyed when
he runs a few hg commands and get 4 or 5 chg processes. I think it's very
important to optimize the number of server processes. (Therefore the
blackbox filename series to replace the --config extensions.blackbox=!
pattern).
Yuya Nishihara - July 7, 2016, 12:49 p.m.
On Wed, 6 Jul 2016 16:33:30 +0100, Jun Wu wrote:
> Excerpts from Yuya Nishihara's message of 2016-07-07 00:05:06 +0900:
> > Do they have significant differences in non-core configs other than
> > "extensions" section?  
> 
> They are mainly about new experimental extension features.
> Most of them work fine with chg. The one that does not work well is
> fastmanifest:
> 
>   https://bitbucket.org/facebook/hg-experimental/src/be7a54df7b36f7e9ac93ae05b9ff7621e32c089c/fastmanifest/__init__.py?at=default&fileviewer=file-view-default#__init__.py-163

For this particular example, wrapping manifest.rev can be moved to reposetup().

 1. extend repo.__class__ to hook repo.manifest
 2. extend manifest.__class__ to hook manifest.rev

> It basically keeps a ui object in all kinds of its objects, and down in the
> stack the ui object is used to do debugflag test or so. As said, I believed
> this is not a good pattern and tried to fix them but the codebase is so
> large and I gave up after a few hours.
> 
> The developer wants to use --debug to toggle debug output so they should be
> able to tell chg to hash ui.debugflag, but not other ui config items.

Hmm, but you can't track the access to ui.debugflag without an ugly hack. Only
ui.config*() calls are recorded.

> > > About whitelist vs blacklist, I think it's less important. I prefer the
> > > current situation.  
> > 
> > It defines the default behavior of chg, whether or not chg is permissive for
> > third-party extensions which process their config values in undesired way.
> > IMHO it's important property of chg.  
> 
> But as I said, my main concern about listing what *not* to hash is
> unnecessary processes. The list will also be unfriendly for extensions to
> change (if they need to) since all extensions must agree together, only that
> can a section be added, which is practically impossible.

As long as the section is specific to the extension, the extension can add
its own section to the white/black list.

> Internally, one of our testing infra's senior engineers feels annoyed when
> he runs a few hg commands and get 4 or 5 chg processes. I think it's very
> important to optimize the number of server processes. (Therefore the
> blackbox filename series to replace the --config extensions.blackbox=!
> pattern).

It sounds like you're urged to reduce the number of chg processes, though
I don't know why it's such important. In that case, I agree the not-to-hash
list isn't acceptable. Also, I don't think your auto-learner can be the
permanent solution since we'll have to fix chg-unfriendly extensions anyway.

Instead, I think you can make the auto-learner extension and enable it as
a temporary way around. The patch 3, [(section, name)] list, seems good to
include in the core.
Jun Wu - July 7, 2016, 1:24 p.m.
Excerpts from Yuya Nishihara's message of 2016-07-07 21:49:01 +0900:
> For this particular example, wrapping manifest.rev can be moved to reposetup().
> 
>  1. extend repo.__class__ to hook repo.manifest
>  2. extend manifest.__class__ to hook manifest.rev
 
The problem will become detecting the unnecessary hg serve's "reposetup" and
unwrap if the config changes. But this discussion is less important here.

> > It basically keeps a ui object in all kinds of its objects, and down in the
> > stack the ui object is used to do debugflag test or so. As said, I believed
> > this is not a good pattern and tried to fix them but the codebase is so
> > large and I gave up after a few hours.
> > 
> > The developer wants to use --debug to toggle debug output so they should be
> > able to tell chg to hash ui.debugflag, but not other ui config items.
> 
> Hmm, but you can't track the access to ui.debugflag without an ugly hack. Only
> ui.config*() calls are recorded.

That would be covered by my ui refactoring plan. I will make config the
source of truth, without sacrificing performance much.

> > > > About whitelist vs blacklist, I think it's less important. I prefer the
> > > > current situation.  
> > > 
> > > It defines the default behavior of chg, whether or not chg is permissive for
> > > third-party extensions which process their config values in undesired way.
> > > IMHO it's important property of chg.  
> > 
> > But as I said, my main concern about listing what *not* to hash is
> > unnecessary processes. The list will also be unfriendly for extensions to
> > change (if they need to) since all extensions must agree together, only that
> > can a section be added, which is practically impossible.
> 
> As long as the section is specific to the extension, the extension can add
> its own section to the white/black list.

It's assuming an extension won't read other extensions' config sections.
However, our recommended practice is to reuse existing config sections (I'm
sure I read it in the wiki but unable to find it now).

> It sounds like you're urged to reduce the number of chg processes, though
> I don't know why it's such important. In that case, I agree the not-to-hash
> list isn't acceptable. Also, I don't think your auto-learner can be the
> permanent solution since we'll have to fix chg-unfriendly extensions anyway.

Haha, not that "urged". Our internal users have all kinds of setups and
requirements and I try to satisfy as many reasonable requirements as I can.

Yes, I agree the leaner approach can solve most issues but does not handle
100% cases. Having some chg-specific API seems inevitable.

I think it's time to discuss about the formal confighash API, as changing
private variables is hacky. I will start a new thread.
 
> Instead, I think you can make the auto-learner extension and enable it as
> a temporary way around. The patch 3, [(section, name)] list, seems good to
> include in the core.

That's a good idea. I'll do it.

Patch

diff --git a/hgext/chgserver.py b/hgext/chgserver.py
--- a/hgext/chgserver.py
+++ b/hgext/chgserver.py
@@ -64,6 +64,7 @@  from mercurial import (
     error,
     extensions,
     osutil,
+    ui as uimod,
     util,
 )
 
@@ -728,3 +729,37 @@  def _unwrapfuncs(wraplist):
         else:
             raise RuntimeError('%s cannot be restored' % name)
     wraplist[:] = []
+
+class _learnconfighash(object):
+    """learn what configs are accessed and add them to confighash"""
+    def __init__(self, ui, extname):
+        self._ui = ui
+        self._extname = extname
+        self._wraplist = []
+
+    def _uiconfigitems(self, orig, uiself, section, *args, **kwargs):
+        if section not in _configsections:
+            _log('add config section %s used by %s to confighash'
+                 % (section, self._extname))
+            _configsections.append(section)
+        return orig(uiself, section, *args, **kwargs)
+
+    def _uiconfig(self, orig, uiself, section, name, *args, **kwargs):
+        if section not in _configsections:
+            pair = (section, name)
+            if pair not in _configitems:
+                _log('add config item %s.%s used by %s to confighash'
+                     % (section, name, self._extname))
+                _configitems.append(pair)
+        return orig(uiself, section, name, *args, **kwargs)
+
+    def _wrapfunc(self, obj, name, newfunc):
+        _wrapfunc(obj, name, newfunc, self._wraplist)
+
+    def __enter__(self):
+        self._wrapfunc(uimod.ui, 'configitems', self._uiconfigitems)
+        self._wrapfunc(uimod.ui, 'config', self._uiconfig)
+
+    def __exit__(self, exc_type, exc_value, traceback):
+        _unwrapfuncs(self._wraplist)
+        self._ui = None