pysoccer.algorithms.playerank.models package

Submodules

pysoccer.algorithms.playerank.models.Clusterer module

class pysoccer.algorithms.playerank.models.Clusterer.Clusterer(*args: Any, **kwargs: Any)

Bases: sklearn.base., sklearn.base.

Performance clustering

Attributes:

cluster_centers_ : array, [n_clusters, n_features]

Coordinates of cluster centers

n_clusters_ : int

number of clusters found by the algorithm

labels_

Labels of each point

k_range : tuple

minimum and maximum number of clusters to try

verbose : boolean

whether or not to show details of the execution

random_state : int, RandomState instance or None, optional, default: None

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

sample_size

None

kmeans : scikit-learn KMeans object

None

Parameters
  • k_range – tuple (pair) the minimum and the maximum $k$ to try when choosing the best value of $k$ (the one having the best silhouette score)

  • border_threshold – float the threshold used to select borderline points: the maximum silhouette value a point can have and still be considered borderline.

  • verbose – boolean verbosity mode. default: False

  • random_state – int, RandomState instance or None, optional, default: None. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

  • sample_size – int the number of samples (rows) to use when computing the silhouette score (the function silhouette_score is computationally expensive and raises a MemoryError when the number of samples is too high). default: 10000

  • max_rows – int the maximum number of samples (rows) to consider for the clustering task (the function silhouette_samples is computationally expensive and raises a MemoryError when the input matrix has too many rows). default: 40000
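The selection of $k$ described by k_range and sample_size can be sketched as follows. This is a minimal illustration built on scikit-learn's own KMeans and silhouette_score, not the Clusterer implementation; the helper name choose_k, the synthetic data, and the assumption that both ends of k_range are inclusive are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def choose_k(X, k_range=(2, 6), sample_size=10000, random_state=42):
    """Return (best_k, scores): the k in k_range with the highest silhouette score.

    Hypothetical helper; k_range is assumed to be inclusive on both ends.
    """
    scores = {}
    for k in range(k_range[0], k_range[1] + 1):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(X)
        # Subsample only when the data set is larger than sample_size.
        size = sample_size if sample_size < len(X) else None
        scores[k] = silhouette_score(X, labels, sample_size=size,
                                     random_state=random_state)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Three well-separated synthetic blobs (illustrative data only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

best_k, scores = choose_k(X)
print(best_k, scores)
```

Capping the number of rows passed to the silhouette computation is what keeps memory bounded, since the score needs pairwise distances between all sampled rows.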

fit(player_ids, match_ids, dataframe, y=None, kind='single', filename='clusters')

Compute performance clustering.

Parameters
  • X – array-like or sparse matrix, shape=(n_samples, n_features) Training instances to cluster.

  • kind – str 'single': single cluster; 'multi': multi cluster

  • y – ignored

get_clusters_matrix(kind='single')

predict(X, y=None)

Predict the closest cluster each sample in X belongs to. In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters

X – {array-like, sparse matrix}, shape = [n_samples, n_features] New data to predict.

Returns

multi_labels: array, shape [n_samples,] Index of the cluster each sample belongs to.

pysoccer.algorithms.playerank.models.Clusterer.scalable_silhouette_samples(X, labels, metric='euclidean', n_jobs=1, **kwds)

Compute the Silhouette Coefficient for each sample. The Silhouette Coefficient is a measure of how well samples are clustered with samples that are similar to themselves. Clustering models with a high Silhouette Coefficient are said to be dense, where samples in the same cluster are similar to each other, and well separated, where samples in different clusters are not very similar to each other. The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is $(b - a) / \max(a, b)$. This function returns the Silhouette Coefficient for each sample. The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters.

Parameters
  • X – array [n_samples_a, n_features] Feature array.

  • labels – array, shape = [n_samples] label values for each sample

  • metric – string, or callable The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by metrics.pairwise.pairwise_distances. If X is the distance array itself, use “precomputed” as the metric.

  • **kwds – optional keyword parameters Any further parameters are passed directly to the distance function. If using a scipy.spatial.distance metric, the parameters are still metric dependent. See the scipy docs for usage examples.

Returns

silhouette : array, shape = [n_samples] Silhouette Coefficient for each sample.

References: Peter J. Rousseeuw (1987). “Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis”. Journal of Computational and Applied Mathematics 20: 53-65. doi:10.1016/0377-0427(87)90125-7. http://en.wikipedia.org/wiki/Silhouette_(clustering)
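The per-sample formula $(b - a) / \max(a, b)$ can be checked by hand on a tiny input. The function below is a naive NumPy sketch for illustration, not the scalable implementation documented here; the name silhouette_samples_naive is hypothetical, it supports only the Euclidean metric, and it assumes at least two clusters.

```python
import numpy as np


def silhouette_samples_naive(X, labels):
    """Per-sample silhouette (b - a) / max(a, b) with Euclidean distances."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise distance matrix: O(n^2) memory, fine only for small inputs.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # singleton cluster: the score is left at 0
        # a: mean distance to the other members of the sample's own cluster
        # (dist[i, i] is 0, so dividing by count - 1 excludes the sample itself).
        a = dist[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores


# Two tight 1-D clusters: {0, 1} and {10, 11}.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
s = silhouette_samples_naive(X, labels)
print(s)
```

For the first point, a = 1 and b = (10 + 11) / 2 = 10.5, so its coefficient is 9.5 / 10.5; for the second, a = 1 and b = 9.5, giving 8.5 / 9.5.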

pysoccer.algorithms.playerank.models.Clusterer.scalable_silhouette_score(X, labels, metric='euclidean', sample_size=None, random_state=None, n_jobs=1, **kwds)

Compute the mean Silhouette Coefficient of all samples. The Silhouette Coefficient is computed using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is $(b - a) / \max(a, b)$. To clarify, b is the mean distance between a sample and the nearest cluster that the sample is not a part of. This function returns the mean Silhouette Coefficient over all samples. To obtain the values for each sample, it uses silhouette_samples. The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.

Parameters
  • X – array [n_samples_a, n_features] the Feature array.

  • labels – array, shape = [n_samples] label values for each sample

  • metric – string, or callable The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by metrics.pairwise.pairwise_distances. If X is the distance array itself, use “precomputed” as the metric.

  • sample_size – int or None The size of the sample to use when computing the Silhouette Coefficient. If sample_size is None, no sampling is used.

  • random_state – integer or numpy.RandomState, optional The generator used to randomly select the subset of samples when sample_size is not None. If an integer is given, it fixes the seed. Defaults to the global numpy random number generator.

  • **kwds – optional keyword parameters Any further parameters are passed directly to the distance function. If using a scipy.spatial.distance metric, the parameters are still metric dependent. See the scipy docs for usage examples.

Returns

silhouette: float the Mean Silhouette Coefficient for all samples.

References: Peter J. Rousseeuw (1987). “Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis”. Journal of Computational and Applied Mathematics 20: 53-65. doi:10.1016/0377-0427(87)90125-7. http://en.wikipedia.org/wiki/Silhouette_(clustering)
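The effect of sample_size and random_state can be illustrated with scikit-learn's own silhouette_score, which exposes the same two parameters. The sketch below uses that function as a stand-in and does not call scalable_silhouette_score itself; the data is synthetic.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two well-separated synthetic clusters of 500 points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(500, 2)),
    rng.normal(8.0, 1.0, size=(500, 2)),
])
labels = np.repeat([0, 1], 500)

# Mean coefficient over all 1000 samples.
full = silhouette_score(X, labels)

# Mean coefficient over a random subset of 200 samples; fixing random_state
# makes the subset, and therefore the score, reproducible.
sampled_a = silhouette_score(X, labels, sample_size=200, random_state=0)
sampled_b = silhouette_score(X, labels, sample_size=200, random_state=0)

print(full, sampled_a, sampled_b)
```

The sampled score is an estimate of the full score, so it trades a small amount of accuracy for a much smaller pairwise distance computation.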

pysoccer.algorithms.playerank.models.Rater module

class pysoccer.algorithms.playerank.models.Rater.Rater(alpha_goal=0.0)

Bases: object

Performance rating

Attributes:

ratings_ : numpy array

the ratings of the performances

Parameters

alpha_goal – float importance of the goal in the evaluation of performance, in the range [0, 1] default=0.0

get_rating(weighted_sum, goals)

predict(dataframe, goal_feature, score_feature, filename='ratings')

Compute the rating of each performance in the dataframe.

Parameters
  • dataframe – dataframe of playerank scores

  • goal_feature – name of the dataframe column containing the goals scored

  • score_feature – name of the dataframe column containing the playerank score

Returns

ratings_: numpy array
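One plausible reading of alpha_goal is a convex combination of the playerank score and the goals scored. The sketch below encodes that reading only; the blending formula is an assumption and is not confirmed by this page, and the standalone function and sample values are hypothetical.

```python
import numpy as np


def get_rating(weighted_sum, goals, alpha_goal=0.0):
    """Blend score and goals; alpha_goal is expected to lie in [0, 1].

    Assumed formula: alpha_goal weights the goals and (1 - alpha_goal)
    weights the playerank score. This is an illustration, not the library code.
    """
    return (1.0 - alpha_goal) * weighted_sum + alpha_goal * goals


scores = np.array([0.2, 0.5, 0.1])  # hypothetical playerank scores
goals = np.array([0, 1, 2])         # hypothetical goals scored

ratings_default = get_rating(scores, goals)               # alpha_goal = 0.0
ratings_half = get_rating(scores, goals, alpha_goal=0.5)

print(ratings_default)
print(ratings_half)
```

Under this assumption the default alpha_goal=0.0 ignores goals entirely, and alpha_goal=1.0 would rate a performance by goals alone.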

pysoccer.algorithms.playerank.models.Weighter module

class pysoccer.algorithms.playerank.models.Weighter.Weighter(*args: Any, **kwargs: Any)

Bases: sklearn.base.

Automatic weighting of performance features

Attributes:

feature_names_ : array, [n_features]

names of the features

label_type_ : str

the label type associated to the game outcome. options: w-dl (victory vs draw or defeat), wd-l (victory or draw vs defeat), w-d-l (victory, draw, defeat)

clf_ : LinearSVC object

the object of the trained classifier

weights_ : array, [n_features]

weights of the features computed by the classifier

random_state_ : int, RandomState instance or None, optional, default: None

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Parameters
  • label_type – str the label type associated to the game outcome. options: w-dl (victory vs draw or defeat), wd-l (victory or draw vs defeat), w-d-l (victory, draw, defeat)

  • random_state – int, RandomState instance or None, optional, default: None. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
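The three label_type options differ only in how a match outcome is grouped into classes. A minimal sketch of that grouping follows; the function name encode_outcome, the single-letter outcome codes, and the integer class values are all illustrative assumptions, since this page does not state how the labels are encoded.

```python
def encode_outcome(outcome, label_type="w-dl"):
    """Map an outcome ('w', 'd' or 'l') to a class label.

    The integer codes are illustrative assumptions, not the library's encoding.
    """
    if label_type == "w-dl":
        # victory vs. draw or defeat
        return 1 if outcome == "w" else 0
    if label_type == "wd-l":
        # victory or draw vs. defeat
        return 0 if outcome == "l" else 1
    if label_type == "w-d-l":
        # victory, draw and defeat as three separate classes
        return {"w": 1, "d": 0, "l": -1}[outcome]
    raise ValueError(f"unknown label_type: {label_type!r}")


outcomes = ["w", "d", "l"]
for label_type in ("w-dl", "wd-l", "w-d-l"):
    print(label_type, [encode_outcome(o, label_type) for o in outcomes])
```

The first two options yield a binary classification problem, while w-d-l yields a three-class problem.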

fit(dataframe, target, scaled=False, var_threshold=0.001, filename='weights.json')

Compute weights of features.

Parameters
  • dataframe – pandas DataFrame a dataframe containing the feature values and the target values

  • target – str a string indicating the name of the target variable in the dataframe

  • scaled – boolean True if X must be normalized, False otherwise (optional)

  • filename – str the name of the file to be saved (the json file containing the feature weights) default: “weights.json”
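The idea of deriving feature weights from a trained LinearSVC can be sketched as below. This is a guess at the general approach under stated assumptions, not the Weighter implementation: the helper fit_weights is hypothetical, var_threshold is assumed to drop low-variance columns via VarianceThreshold, scaled is assumed to apply StandardScaler, the weights are assumed to be the classifier coefficients normalized by their absolute sum, and the target is assumed to be binary.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def fit_weights(dataframe, target, scaled=False, var_threshold=0.001,
                random_state=42):
    """Return {feature_name: weight} from a linear SVM (binary target assumed)."""
    features = dataframe.drop(columns=[target])
    y = dataframe[target].to_numpy()

    # Assumption: features with variance below var_threshold are discarded.
    selector = VarianceThreshold(threshold=var_threshold)
    X = selector.fit_transform(features.to_numpy(dtype=float))
    names = features.columns[selector.get_support()].tolist()

    if scaled:
        X = StandardScaler().fit_transform(X)

    clf = LinearSVC(random_state=random_state, max_iter=10000).fit(X, y)
    coef = clf.coef_[0]  # a binary problem has a single row of coefficients
    # Assumption: normalize so that the absolute weights sum to 1.
    weights = coef / np.abs(coef).sum()
    return dict(zip(names, weights))


# Synthetic data: the outcome depends on "passes" only.
rng = np.random.default_rng(0)
n = 200
passes = rng.normal(size=n)
df = pd.DataFrame({
    "passes": passes,
    "fouls": rng.normal(size=n),
    "constant": np.ones(n),  # zero variance, so it is filtered out
    "win": (passes > 0).astype(int),
})

weights = fit_weights(df, target="win")
print(weights)
```

A linear model is convenient here because each coefficient maps directly to one feature, so the weights stay interpretable.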

get_feature_names()

get_weights()

Module contents