How do I load categorical data from a numpy array into an Indicator or Embedding column?



Using TensorFlow 1.8.0, we run into an error whenever we attempt to build a categorical column. Here is a full example demonstrating the problem. It runs as-is (using only numeric columns). Uncommenting the indicator column definition and data produces a stack trace ending in tensorflow.python.framework.errors_impl.InternalError: Unable to get element as bytes.



import tensorflow as tf
import numpy as np

def feature_numeric(key):
    return tf.feature_column.numeric_column(key=key, default_value=0)

def feature_indicator(key, vocabulary):
    return tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            key=key, vocabulary_list=vocabulary))


labels = ['Label1','Label2','Label3']

model = tf.estimator.DNNClassifier(
    feature_columns=[
        feature_numeric("number"),
        # feature_indicator("indicator", ["A","B","C"]),
    ],
    hidden_units=[64, 16, 8],
    model_dir='./models',
    n_classes=len(labels),
    label_vocabulary=labels)

def train(inputs, training):
    model.train(
        input_fn=tf.estimator.inputs.numpy_input_fn(
            x=inputs,
            y=training,
            shuffle=True
        ), steps=1)

inputs = {
    "number": np.array([1,2,3,4,5]),
    # "indicator": np.array([
    #     ["A"],
    #     ["B"],
    #     ["C"],
    #     ["A", "A"],
    #     ["A", "B", "C"],
    # ]),
}

training = np.array(['Label1','Label2','Label3','Label2','Label1'])

train(inputs, training)



Attempts to use an embedding fare no better. Using only numeric inputs, we can successfully scale to thousands of input nodes, and in fact we have temporarily expanded our categorical features in the preprocessor to simulate indicators.



The documentation for categorical_column_*() and indicator_column() is awash in references to features we're pretty sure we're not using (proto inputs, whatever bytes_list is), but maybe we're wrong about that?




2 Answers



The issue here is related to the ragged shape of the "indicator" input array (some elements are of length 1, one is length 2, one is length 3). If you pad your input lists with some non-vocabulary string (I used "Z" for example since your vocabulary is "A", "B", "C"), you'll get the expected results:


inputs = {
    "number": np.array([1,2,3,4,5]),
    "indicator": np.array([
        ["A", "Z", "Z"],
        ["B", "Z", "Z"],
        ["C", "Z", "Z"],
        ["A", "A", "Z"],
        ["A", "B", "C"]
    ])
}
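If the rows arrive with varying lengths, the padding can be done programmatically rather than by hand. The following is a minimal sketch in plain Python; the helper name pad_rows and its parameters are hypothetical (not part of TensorFlow or NumPy), and it assumes the pad value is a string that never appears in the vocabulary.

```python
def pad_rows(rows, pad_value="Z", width=None):
    """Pad each list in `rows` with `pad_value` so all rows share one length.

    `pad_value` is assumed to be a string outside the vocabulary.
    `width` defaults to the length of the longest row.
    """
    if width is None:
        width = max(len(row) for row in rows)
    padded = []
    for row in rows:
        if len(row) > width:
            raise ValueError("row %r is longer than width %d" % (row, width))
        # Copy the row, then append as many pad values as are missing.
        padded.append(list(row) + [pad_value] * (width - len(row)))
    return padded


# The ragged rows from the question.
ragged = [["A"], ["B"], ["C"], ["A", "A"], ["A", "B", "C"]]
padded = pad_rows(ragged)
```

Wrapping the result as np.array(padded) should give the same rectangular string array as the hand-padded version above.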



You can verify that this works by printing the resulting tensor:


dense = tf.feature_column.input_layer(
    inputs,
    [
        feature_numeric("number"),
        feature_indicator("indicator", ["A","B","C"]),
    ])

with tf.train.MonitoredTrainingSession() as sess:
    print(dense)
    print(sess.run(dense))
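As a rough cross-check for the indicator part of that tensor, here is a pure-Python approximation of what I understand the vocabulary lookup plus indicator column to do with default settings (no out-of-vocabulary buckets): each string is mapped to its vocabulary index, anything outside the vocabulary is dropped, and the remaining hits are counted per vocabulary entry. The function name multi_hot is hypothetical, and this is a sketch of the behavior under those assumptions, not TensorFlow's implementation.

```python
def multi_hot(rows, vocabulary):
    """Count vocabulary hits per row; strings outside the vocabulary are ignored."""
    index = {token: i for i, token in enumerate(vocabulary)}
    result = []
    for row in rows:
        counts = [0.0] * len(vocabulary)
        for token in row:
            i = index.get(token, -1)  # -1 stands in for "not in vocabulary"
            if i >= 0:
                counts[i] += 1.0
        result.append(counts)
    return result


# The padded rows from this answer; "Z" is deliberately not in the vocabulary.
padded_rows = [
    ["A", "Z", "Z"],
    ["B", "Z", "Z"],
    ["C", "Z", "Z"],
    ["A", "A", "Z"],
    ["A", "B", "C"],
]
encoded = multi_hot(padded_rows, ["A", "B", "C"])
for row in encoded:
    print(row)
```

Under these assumptions the "Z" padding contributes nothing, and a repeated value such as ["A", "A", "Z"] counts twice in the "A" position rather than once.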





Confirmed that this works! Of course it produces the fun new problem of requiring a fixed array size for input data of completely arbitrary length (but still a fixed number of actual nodes). But that's a different question. Accepted!
– Theodore Lief Gannon
2 days ago



From what I can tell, the difficulty is that you are trying to make an indicator column from an array of arrays.



I collapsed your indicator array to


"indicator": np.array([
"A",
"B",
"C",
"AA",
"ABC",
])



... and the thing ran.



More, I can't find any example where the vocabulary array is anything but a flat array of strings.





It's a nested array because there are five samples; each of them only gets the flat array. If that's still what you meant, I have a counterexample in the Tensorflow docs: tensorflow.org/api_docs/python/tf/feature_column/…
– Theodore Lief Gannon
2 days ago





I believe this is effectively a one-hot; "AA" and "ABC" would wind up in the out-of-vocabulary bucket. I've confirmed that it works, though, which is interesting since we observed a failure on similar data. I'll have to revisit that.
– Theodore Lief Gannon
2 days ago





I saw your first example. Note that the parameters are a naked string, and then a list of strings. No array of arrays.
– Hack Saw
2 days ago





Can't usefully comment on the second point. I'm rooting for you. :)
– Hack Saw
2 days ago





