
CompareExpressions



This function utilises SymPy to provide a maths-aware evaluation of a learner's response.

Architecture overview

The execution of the evaluation function follows this pattern:

  • Determine context
  • Parse response and answer data
  • Parse criteria
  • Store input parameters and parsed responses in a key-value store that allows adding new fields but not editing existing ones
  • Execute the feedback generation procedure provided by the context to generate written feedback and tags
  • Serialise generated feedback and tags in a suitably formatted dictionary
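
A very rough sketch of this flow in Python is shown below. The helper arguments determine_context and run_feedback_procedures, and all call signatures, are assumptions made for illustration; they do not necessarily match the repository's internals.

    from typing import Callable

    # Rough sketch of the evaluation flow; names and signatures are illustrative only.
    def evaluate(
        response: str,
        answer: str,
        params: dict,
        determine_context: Callable[[dict], dict],               # hypothetical: picks symbolic or physical_quantity
        run_feedback_procedures: Callable[[dict, dict], tuple],  # hypothetical: runs the criteria procedures
    ) -> dict:
        context = determine_context(params)
        preprocess = context["expression_preprocess"]
        parse = context["expression_parse"]

        # In the actual code this store allows adding new fields but not editing existing ones.
        store = {
            "response": parse(preprocess(response, params), params),
            "answer": parse(preprocess(answer, params), params),
            "params": params,
        }

        is_correct, feedback, tags = run_feedback_procedures(context, store)
        return {"is_correct": is_correct, "feedback": feedback, "tags": tags}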

Evaluation function

The main evaluation function is found in evaluation.py and has the following signature:

evaluation_function(response: str, answer: str, params: dict, include_test_data: bool = False) -> dict

Input

This is the function that should be called to evaluate a response expression.

  • response is the response expression submitted by the learner
  • answer is a reference expression provided by the task author
  • params is a dictionary of optional parameters; for available parameters and their intended use, see the user documentation
  • include_test_data is a boolean that controls whether some extra data useful for testing or debugging is returned

Output

The function returns a result dictionary with the following fields:

  • is_correct is a boolean value that indicates whether the response is considered correct or not
  • feedback is a string, intended to be shown to the learner, that describes what the evaluation function found when evaluating the response
  • tags is a list of strings that encode what the evaluation function found out about the response; the tags are more consistent across similar tasks than the free-text feedback string

The returned dictionary will be referred to as the result in this documentation.
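
For illustration, a call might look roughly as follows. The parameter name strict_syntax and the values shown in the comments are hypothetical placeholders, not guaranteed behaviour.

    from evaluation import evaluation_function

    # Hypothetical invocation; the parameter name and the values in the comments
    # are illustrative placeholders, not guaranteed outputs.
    result = evaluation_function(
        response="2*x + 1",              # learner's response
        answer="1 + 2*x",                # task author's reference expression
        params={"strict_syntax": False}, # hypothetical parameter, see the user documentation
    )
    print(result["is_correct"])  # e.g. True
    print(result["feedback"])    # human-readable feedback string
    print(result["tags"])        # list of tag strings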

Overview

The overall flow of the evaluation procedure can be described as follows:

  1. The function uses the parameters given in params to determine the context of the evaluation. What context means will be discussed in more detail in section TODO: Add section name here.
  2. After the context is determined, the response, answer and criteria (either supplied via params or from the context) are analysed and the necessary information is stored for future use in a dictionary with frozen values, i.e. a dictionary where new items can be added but existing items cannot be changed (see the sketch after this list).
  3. The feedback generating procedure supplied by the context is used to generate feedback based on the contents of the frozen value dictionary.
  4. If all criteria are found to be satisfied the response is considered correct, i.e. the is_correct field in the result is set to true and the feedback string and list of tags generated by the feedback generation procedure are added to their respective fields.
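
The "frozen values" behaviour described in step 2 could be realised with a small mapping wrapper such as the sketch below; this is an illustration of the concept, not the repository's actual implementation.

    # Sketch of a mapping where new keys can be added but existing keys cannot be changed.
    class FrozenValueDict(dict):
        def __setitem__(self, key, value):
            if key in self:
                raise KeyError(f"Key '{key}' already exists and cannot be modified.")
            super().__setitem__(key, value)

    store = FrozenValueDict()
    store["response"] = "2*x + 1"    # adding a new field is allowed
    try:
        store["response"] = "x"      # editing an existing field is rejected
    except KeyError as error:
        print(error)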

TODO Describe what further information is supplied when include_test_data is set to true.

Context

The context is a data structure that contains at least the following seven pieces of information:

  • default_parameters: a dictionary where the keys are parameter names and the values are the default values that the evaluation function will use unless another value is provided together with the response. The required fields are context-dependent; currently all contexts use the default parameters found in utility\expression_utilities.py, and the physical_quantity context adds a few extra fields, see the default parameters defined in context\physical_quantity.py.
  • expression_parse: a function that parses expressions (i.e. the response and answer inputs) into the form used by the feedback generation procedure.
  • expression_preprocess: a function that performs string manipulations to ensure that correctly written input expressions follow the conventions expected by expression_parse.
  • expression_preview: a function that generates a string that can be turned into a human-readable representation of how the evaluation function interpreted the response.
  • feedback_procedure_generator: a function that generates, for each criterion, a function that can be used to evaluate whether the criterion is satisfied or not. The output from this function should be a list of tags that the feedback string generator can use to produce human-readable feedback.
  • feedback_string_generator: a function that takes tags and outputs human-readable feedback strings.
  • generate_criteria_parser: a function that generates a parser that can be used to turn the criteria (given in string form) into a form that the feedback generation procedure can use to determine whether they are satisfied or not.

The context can also contain other fields if necessary.

Remark: The current implementation uses a dictionary rather than a dedicated class for ease of iteration during the initial development phase.
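
Under the remark above, the shape of a context could be sketched as follows; the lambda stubs only illustrate which keys are expected and are not the implementations found in context\symbolic.py or context\physical_quantity.py.

    # Sketch of the shape of a context dictionary; all values are placeholder stubs.
    example_context = {
        "default_parameters": {"strict_syntax": True},               # hypothetical example parameter
        "expression_preprocess": lambda expr, params: expr.strip(),  # string clean-up before parsing
        "expression_parse": lambda expr, params: expr,               # would return the parsed expression
        "expression_preview": lambda expr, params: expr,             # human-readable interpretation
        "generate_criteria_parser": lambda params: None,             # would return a criteria parser
        "feedback_procedure_generator": lambda criterion: (lambda store: []),  # one procedure per criterion
        "feedback_string_generator": lambda tag: f"Feedback for tag '{tag}'.",
    }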

There are currently two different contexts:

  • symbolic: Handles comparisons of various symbolic expressions. Defined in context\symbolic.py.
  • physical_quantity: Handles comparisons of expressions involving units. Defined in context\physical_quantity.py.

Remark: Handwritten expressions are sent as LaTeX, which in some cases requires extra preprocessing before the right context can be determined. It should be considered whether a new context, perhaps called handwritten, should be created for this purpose.

TODO Describe currently available contexts in detail

symbolic - Comparison of symbolic expressions

Remark: The symbolic context should probably be split into several smaller contexts; the following subdivision is suggested:

  • numerical: Comparison of expressions that can be evaluated to numerical values (e.g. expressions that are already numerical values or expressions only containing constants). Focuses on identifying if numerical values are greater than, less than, proportional to the expected answer or similar.
  • symbolic: Comparison of symbolic expressions that cannot be reduced to numerical values.
  • equality: Comparison of mathematical equalities (with the extra complexities that come with equivalence of equalities compared to equality of expressions).
  • inequality: Same as equality except for mathematical inequalities (which will require different choices when it comes to what can be considered equivalence). It might be appropriate to combine equality and inequality into one context (called statements or similar).
  • collection: Comparison of collections (e.g. sets, lists or intervals of the number line). Likely to consist mostly of code for handling comparison of individual elements using the other contexts, and configuring what counts as equivalence between different collections.

symbolic Criteria commands and grammar

Criteria

The criteria commands use the following productions:

    START -> BOOL
    BOOL -> EQUAL
    BOOL -> ORDER
    BOOL -> EQUAL where EQUAL
    BOOL -> EQUAL where EQUAL_LIST
    BOOL -> RESERVED written as OTHER
    BOOL -> RESERVED written as RESERVED
    BOOL -> RESERVED contains OTHER
    BOOL -> RESERVED contains RESERVED
    EQUAL_LIST -> EQUAL;EQUAL
    EQUAL_LIST -> EQUAL_LIST;EQUAL
    EQUAL -> OTHER = OTHER
    EQUAL -> RESERVED = OTHER
    EQUAL -> OTHER = RESERVED
    EQUAL -> RESERVED = RESERVED
    EQUAL -> OTHER ORDER OTHER
    EQUAL -> RESERVED ORDER OTHER
    EQUAL -> OTHER ORDER RESERVED
    EQUAL -> RESERVED ORDER RESERVED
    OTHER -> RESERVED OTHER
    OTHER -> OTHER RESERVED
    OTHER -> OTHER OTHER
along with the following base tokens:

  • START: Formal token used to indicate the start of an expression (in practice: any expression that can be reduced to a single START is a parseable criterion).
  • END: Formal token that indicates the end of a tokenized string.
  • NULL: Formal token that denotes a token without meaning, should not appear when an expression is tokenized.
  • BOOL: Expression that can be reduced to either True or False.
  • EQUAL: Token that denotes symbolic equality between the mathematical expressions.
  • EQUALITY: Token that denotes the equality operator =.
  • EQUAL_LIST: Token that denotes a list of equalities.
  • RESERVED: Token that denotes a reserved name for an expression. Reserved names include response and answer.
  • ORDER: Token that denotes an order operator. Order operators include >, <, >= and <=.
  • WHERE: Token that denotes the separation of a criterion and a list of equalities that describe substitutions that should be done before the criterion is checked.
  • WRITTEN_AS: Token that denotes that syntactical comparison should be done.
  • CONTAINS: Token that denotes that a mathematical expression is dependent on a symbol or subexpression.
  • SEPARATOR: Token that denotes which symbol is used to separate the list of equalities used by WHERE.
  • OTHER: Token that denotes any substring that will be passed on for more context specific parsing (e.g. explicit mathematical expressions for symbolic comparisons).
Examples of commonly used criteria

TODO Add examples
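
As a starting point, the following strings are well-formed under the grammar above; they are constructed from the productions for illustration rather than drawn from actual task configurations.

    response = answer
    response = 2*x + 1
    response contains x
    response written as answer
    response = answer where x = 0; y = 1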

physical_quantity - Comparison of expressions that involve units

physical_quantity Criteria commands and grammar

The criteria commands use the following productions:

    START -> BOOL
    BOOL -> EQUAL
    BOOL -> ORDER
    BOOL -> EQUAL where EQUAL
    BOOL -> EQUAL where EQUAL_LIST
    BOOL -> RESERVED written as OTHER
    BOOL -> RESERVED written as RESERVED
    BOOL -> RESERVED contains OTHER
    BOOL -> RESERVED contains RESERVED
    EQUAL_LIST -> EQUAL;EQUAL
    EQUAL_LIST -> EQUAL_LIST;EQUAL
    EQUAL -> OTHER = OTHER
    EQUAL -> RESERVED = OTHER
    EQUAL -> OTHER = RESERVED
    EQUAL -> RESERVED = RESERVED
    EQUAL -> OTHER ORDER OTHER
    EQUAL -> RESERVED ORDER OTHER
    EQUAL -> OTHER ORDER RESERVED
    EQUAL -> RESERVED ORDER RESERVED
    OTHER -> RESERVED OTHER
    OTHER -> OTHER RESERVED
    OTHER -> OTHER OTHER
along with the following base tokens:

  • START: Formal token used to indicate the start of an expression (in practice: any expression that can be reduced to a single START is a parseable criterion).
  • END: Formal token that indicates the end of a tokenized string.
  • NULL: Formal token that denotes a token without meaning, should not appear when an expression is tokenized.
  • BOOL: Expression that can be reduced to either True or False.
  • QUANTITY: Token that denotes a physical quantity, which can be given as a value together with units, only a value (i.e. a dimensionless quantity) or only units.
  • DIMENSION: Token that denotes an expression only containing physical dimensions.
  • START_DELIMITER: Token that denotes a list of equalities.
  • INPUT: Token that denotes any substring that will be passed on for more context specific parsing (e.g. explicit mathematical expressions for symbolic comparisons).
  • matches: Token for an operator that checks if two quantities match, i.e. whether their values are equal (up to the chosen tolerance) when both are rewritten using the same units.
  • dimension: Token for an expression only involving dimensions (i.e. no values or units).
  • =: Token for an operator that checks equality, i.e. compares whether value and units are identical separately.
  • <=: Token for an operator that checks if a quantity's value is less than or equal to another quantity's value (after both quantities are rewritten in the same units).
  • >=: Token for an operator that checks if a quantity's value is greater than or equal to another quantity's value (after both quantities are rewritten in the same units).
  • <: Token for an operator that checks if a quantity's value is less than another quantity's value (after both quantities are rewritten in the same units).
  • >: Token for an operator that checks if a quantity's value is greater than another quantity's value (after both quantities are rewritten in the same units).
Examples of commonly used criteria

TODO Add examples
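
As a starting point, the following strings use the operators listed above; they are constructed for illustration rather than drawn from actual task configurations.

    response = answer
    response matches answer
    response <= answer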

Code shared between different contexts

Expression parsing

TODO Describe shared code for expression preprocessing and parsing

TODO Describe shared code for expression parsing parameters

Other shared code

TODO Describe shared default parameters

Feedback and tag generation

  • Generate feedback procedures from the criteria; each procedure returns a boolean that indicates whether the corresponding criterion is satisfied or not, a string intended to be shown to the student, and a list of tags indicating what was found when checking the criterion
  • For each criterion, run the corresponding procedure and store the result, the feedback string and the list of tags
  • If all criteria are found to be true, then the response is considered correct
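
A minimal sketch of this loop is shown below, assuming each generated procedure returns a (satisfied, feedback, tags) triple as described above; the procedure signature is an assumption, not the repository's actual interface.

    # Minimal sketch of the feedback generation loop; the (satisfied, feedback, tags)
    # return shape of each procedure is assumed from the description above.
    def generate_feedback(procedures, store):
        results, feedback_strings, all_tags = [], [], []
        for procedure in procedures:      # one procedure per criterion
            satisfied, feedback, tags = procedure(store)
            results.append(satisfied)
            if feedback:
                feedback_strings.append(feedback)
            all_tags.extend(tags)
        is_correct = all(results)         # correct only if every criterion is satisfied
        return is_correct, feedback_strings, all_tags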

Tag conventions

The feedback procedures consist of a series of function calls, the specifics of which are determined by the particular criterion, that each return a list of strings (called tags). Each tag then indicates what further function calls must be performed to continue the evaluation, as well as what feedback string (if any) should be generated. When there are no remaining function calls the feedback procedure is completed. The tags are formatted as the criterion name followed by an underscore and the outcome of the corresponding function call. For tags that are not connected to a specific criterion (e.g. tags that indicate an issue with expression parsing) the criterion name and underscore are omitted.

Returning final results

The function returns a result dictionary with the following fields:

  • is_correct is a boolean value that is set to True if all criteria are satisfied
  • feedback is a string that is created by joining all strings generated by the feedback procedures, with a line break between each string
  • tags is a list of strings that is generated by joining all lists of tags generated by the feedback procedures and removing duplicates
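
Assuming the lists collected by a loop like the one sketched earlier, the result could be assembled roughly as follows; this is a sketch, not the actual serialisation code.

    # Sketch of assembling the result dictionary from collected feedback strings and tags.
    def assemble_result(is_correct, feedback_strings, all_tags):
        return {
            "is_correct": is_correct,
            "feedback": "\n".join(feedback_strings),    # line break between each feedback string
            "tags": list(dict.fromkeys(all_tags)),      # remove duplicates, preserve order
        }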

Preview function

When the evaluation function preview is called the code in preview.py will be executed. Since different contexts interpret responses in different ways they also have their own preview functions. The context-specific preview functions can be found in preview_implementations.

Remark: Since it is likely that there will be significant overlap between the response preview and the response evaluation (e.g. code for parsing and interpreting the response), it is good practice if they can share as much code as possible to ensure consistency. For this reason it might be better to move the preview functions fully inside the context (either by making a preview subfolder in the context folder, or by moving the implementation of the preview function inside the context files themselves). In this case the preview.py and evaluation.py could also share the same code for determining the right context to use.

Tests

There are two main groups of tests: evaluation tests and preview tests. The evaluation tests can be run by calling evaluation_tests.py.